A U.S. federal district court recently ruled that Anthropic, an AI startup backed by Amazon, did not infringe copyright when it used copyrighted books to train its large language model, Claude. The training was deemed to be highly “transformative” and thus qualified as “fair use.”
The lawsuit was filed in August 2024 by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who alleged that Anthropic had unlawfully “stolen” pirated copies of their books to build a multibillion-dollar AI business. According to the complaint, the plaintiffs’ works were included in a dataset of pirated books used to train Claude: “Anthropic downloaded known pirated versions of the plaintiffs’ works, copied them, and fed them into its model.” The lawsuit sought unspecified damages and a permanent injunction against further use of the authors’ works.
The judge concluded that the training phase constituted fair use, emphasizing that the model did not replicate the creative expression or style of the original works but rather “learned,” like a human reader, to generate new, independent content. However, Anthropic’s internal “central library,” which stored over 7 million pirated books, was found to constitute infringement. The company’s later purchase of legitimate copies did not absolve it of its earlier infringing conduct. A trial on potential damages is scheduled for December 2025.
Meanwhile, in May, the U.S. Copyright Office released “Copyright and Artificial Intelligence Report (Part 3: Generative AI Training)”, which provides a systematic legal framework and policy recommendations regarding data use in AI training. This third part of the report focuses on copyright implications throughout the generative AI training process—from data collection and processing to model training and output—directly addressing the most controversial technical questions in current litigation[1].
1. The key to “transformative use” lies in the ultimate purpose
The report emphasizes that transformative use should be evaluated in light of the model’s purpose and deployment: if the model is trained to generate content that differs from the works it was trained on, the use is more likely to qualify as fair use; if it is trained to imitate the style or expression of the originals, the transformative quality diminishes.
2. AI learning ≠ human learning
The report challenges the analogy between AI training and human reading: unlike a human reader, an AI system can copy entire works perfectly and extract expressive patterns at scale. This stands in tension with the court’s ruling, which reasoned that Claude’s training, like human learning, produced new content without reproducing the core creative elements of the originals.
3. "Intermediate copying" must be assessed separately
The report recommends evaluating different phases—such as training copies, fine-tuning, retrieval-augmented generation (RAG), and outputs—individually under the four fair use factors. For example, the input stage may be fair use, while outputs that closely mirror original works may be infringing.
4. Pirated data affects fair use evaluation
The report underscores that acquiring training data through illicit means constitutes infringement—even if the pirated copies are later replaced with legitimately purchased ones—and weighs against a finding of fair use. This directly echoes the court’s ruling on Anthropic’s “central library.” The report also warns that “publicly available” does not mean “legally authorized,” noting that some AI developers have relied on pirated sources such as Books3.
5. Market impact is a critical factor
The fourth fair use factor—impact on the market—is considered paramount. If the AI’s output substitutes for the original work or competes in the same market, it can severely harm copyright holders and weigh against fair use.
6. Call for licensing mechanisms
The report notes a lack of scalable licensing schemes for AI training. While options like collective licensing and compulsory licensing exist, they are complex and underdeveloped. The report suggests building scalable solutions to provide AI developers with lawful paths to training data while ensuring fair compensation for creators.
Conclusion: If AI training uses copyrighted materials in a highly transformative and non-substitutive way, it may qualify as fair use. However, unauthorized data use or outputs that compete with the original work undermine that defense. The court ruling offers a legal precedent, while the Copyright Office report provides a theoretical and policy roadmap for future governance.
Reference:
[1] U.S. Copyright Office, “Copyright and Artificial Intelligence Report (Part 3: Generative AI Training),” May 2025. https://www.copyright.gov/policy/artificial-intelligence/