The Legal Battle Unfolds
In late December 2023, The New York Times filed a lawsuit against OpenAI, the developer of ChatGPT, alleging extensive copyright infringement. The suit, the latest in a series of intellectual property (IP) claims against the AI pioneer, accuses OpenAI of using a broad range of protected material to develop products such as ChatGPT without authorization or compensation.
Microsoft, a major investor in OpenAI, is also named in the lawsuit. The complaint notes that Microsoft has invested $13 billion in a key OpenAI entity, entitling it to 75% of that entity's profits until the investment is repaid, after which Microsoft will hold a 49% stake in the unit.
The Core of the Dispute
The New York Times has long licensed its content under negotiated agreements, including with major tech platforms. In April 2023, before filing suit, the Times approached Microsoft and OpenAI to raise its IP concerns and explore an amicable resolution involving commercial terms and technological safeguards for a mutually beneficial exchange. Those discussions did not produce an agreement.
The lawsuit alleges that the defendants, OpenAI and Microsoft, copied a massive volume of the Times' copyrighted content without any license or compensation, and that these works were repeatedly copied and ingested to train OpenAI's GPT models.
The Technicalities and Accusations
The lawsuit explains in some detail how large language models like those developed by OpenAI function, highlighting the central role of datasets such as Common Crawl in training them. Notably, the Times' website was a prominently represented source in Common Crawl's dataset as of 2019, which the suit cites as evidence of its substantial use in training.
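To make the training mechanism concrete, below is a minimal, purely illustrative sketch of how crawled web text is typically turned into next-token-prediction examples. The sample documents, the whitespace tokenizer, and the make_training_examples helper are hypothetical stand-ins; real pipelines operate on crawl archives such as Common Crawl at vastly larger scale and use learned subword tokenizers.

```python
# Illustrative sketch only: how crawled web text can be converted into
# next-token-prediction training examples. Documents and tokenizer are
# stand-ins for a real crawl archive and a learned subword tokenizer.

# Hypothetical crawled documents (in practice, billions of web pages).
crawled_documents = [
    "Breaking news: markets rallied today after the announcement.",
    "The committee voted to approve the proposal on Tuesday.",
]

def tokenize(text: str) -> list[str]:
    # Stand-in for a subword tokenizer such as byte-pair encoding.
    return text.lower().split()

def make_training_examples(documents: list[str], context_size: int = 4):
    """Slide a fixed-size window over each document: the model sees
    `context_size` tokens and learns to predict the token that follows."""
    examples = []
    for doc in documents:
        tokens = tokenize(doc)
        for i in range(len(tokens) - context_size):
            context = tokens[i : i + context_size]
            target = tokens[i + context_size]
            examples.append((context, target))
    return examples

if __name__ == "__main__":
    for context, target in make_training_examples(crawled_documents)[:3]:
        print(f"context={context!r} -> next token={target!r}")
```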
A critical aspect of the lawsuit is OpenAI's alleged acknowledgment that it weighted higher-quality datasets, such as the Times' content, so that they were sampled more frequently during training. The suit also cites examples in which GPT-4, a model developed by OpenAI, purportedly reproduced Times content verbatim in response to user prompts.
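The sampling claim can be pictured with a small sketch: if some corpora are assigned higher weights, their documents are drawn more often when training batches are assembled. The source names, weights, and sample_training_batch helper below are illustrative assumptions, not figures from the complaint or from OpenAI.

```python
import random

# Hypothetical per-source sampling weights: higher-quality corpora are drawn
# more often during training. All names and numbers here are illustrative.
source_weights = {
    "curated_news": 3.0,   # edited journalism, sampled more frequently
    "general_web": 1.0,    # raw web crawl, sampled at the baseline rate
    "forums": 0.5,         # noisier text, down-weighted
}

documents_by_source = {
    "curated_news": ["news doc 1", "news doc 2"],
    "general_web": ["web doc 1", "web doc 2", "web doc 3"],
    "forums": ["forum doc 1"],
}

def sample_training_batch(batch_size: int) -> list[str]:
    """Pick a source in proportion to its weight, then draw a document from it."""
    sources = list(source_weights)
    weights = [source_weights[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        source = random.choices(sources, weights=weights, k=1)[0]
        batch.append(random.choice(documents_by_source[source]))
    return batch

print(sample_training_batch(5))
```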
The lawsuit also argues that the infringement extends beyond the training data behind ChatGPT. It covers the use of the Times' content in applications built on the GPT models, such as Bing Chat and Browse with Bing for ChatGPT, which have displayed extensive excerpts or close paraphrases of Times articles.
The Broader Implications
One of the lawsuit's significant concerns is the reputational damage allegedly inflicted on the New York Times by OpenAI's models, which are accused of attributing misinformation to the Times, including non-existent articles and false statements, a phenomenon referred to in AI terminology as 'hallucination'.
Ultimately, the New York Times seeks legal redress for various claims, including vicarious and contributory copyright infringement, violations of the Digital Millennium Copyright Act (DMCA) concerning the removal of copyright-management information, common law unfair competition by misappropriation, and trademark dilution. This lawsuit against OpenAI and Microsoft is part of a broader legal landscape involving key AI players and their use of copyrighted material.
OpenAI's Stance on the Matter
In response to the lawsuit, OpenAI has argued that the case lacks merit. The company maintains that training AI models on publicly available web data, including news articles from outlets like the New York Times, constitutes fair use, and that licensing or paying for such content is therefore not a prerequisite for developing systems like GPT-4 and DALL-E 3.
OpenAI has also addressed 'regurgitation', in which AI models reproduce training data verbatim or nearly so. The company argues that such occurrences are less likely when a piece of content appears in only a single source, and it emphasizes that users are responsible for not deliberately prompting its models to regurgitate content, which violates OpenAI's terms of use. OpenAI also points out that the examples cited by the New York Times come from older articles that have been widely reproduced on third-party websites, and suggests the Times may have manipulated its prompts to elicit them.
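For illustration, one rough way to check for regurgitation is to measure the longest verbatim span shared between a model's output and a source article; the sketch below does this with Python's standard-library difflib. The example texts and the 50-character threshold are assumptions chosen for demonstration, not a method described by either party in the lawsuit.

```python
from difflib import SequenceMatcher

# Illustrative only: flag potential regurgitation by finding the longest
# verbatim character span shared between a source article and a model output.
source_article = (
    "The council approved the measure after a lengthy debate, citing concerns "
    "about infrastructure costs and long-term maintenance obligations."
)
model_output = (
    "According to reports, the council approved the measure after a lengthy "
    "debate, citing concerns about infrastructure costs."
)

def longest_verbatim_overlap(a: str, b: str) -> str:
    """Return the longest substring that appears verbatim in both texts."""
    matcher = SequenceMatcher(None, a, b)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]

overlap = longest_verbatim_overlap(source_article, model_output)
print(f"Longest shared span ({len(overlap)} characters): {overlap!r}")
if len(overlap) > 50:  # arbitrary illustrative threshold
    print("Potential verbatim regurgitation flagged.")
```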
An Intensifying Copyright Debate
The copyright debate surrounding generative AI is intensifying. Critics such as Gary Marcus and visual effects artist Reid Southen have demonstrated instances in which AI systems, including DALL-E 3, reproduce copyrighted material even without explicit prompting. These demonstrations challenge OpenAI's claims and lend credibility to the concerns raised by the New York Times and other copyright holders.
Alternative Approaches
Some news organizations have opted for a different route, entering licensing agreements with AI vendors instead of pursuing legal action. The Associated Press and German publisher Axel Springer, for instance, have struck deals with OpenAI, though the financial scope of those agreements is relatively modest compared to OpenAI's revenue.
In contrast, discussions between the New York Times and OpenAI for a potential high-value partnership involving the real-time display of the Times' brand in ChatGPT broke down, leading to the current legal impasse.
Conclusion
As the legal dispute between OpenAI and the New York Times unfolds, it raises fundamental questions about copyright, fair use, and the ethical boundaries of AI development. The outcome of this case could have far-reaching implications for the AI industry and the protection of intellectual property in the digital age.