Anthropic "destructively" scans millions of books to train AI models
Occurred: 2021-2025
Page published: February 2026
Anthropic bought and "destructively" scanned millions of copyrighted books to train its AI models, raising legal and ethical concerns over copyright, author rights, and corporate transparency and accountability.
In a project internally referred to as "Project Panama," Anthropic purchased large quantities of physical books and then cut them apart to scan every page into digital form for use as training data for its Claude AI models. The physical volumes were then discarded, effectively destroying the books.
Documents from a copyright lawsuit indicate millions of books were processed this way. The approach has been described as an industrial-scale application of a scanning method that destroys the originals after digitising them.
The resulting training process fueled ongoing legal action from authors and creators, who argue they were not compensated or asked for consent despite their works being used in this manner.
Anthropic co-founder Ben Mann reportedly personally approved the download of millions of books from "shadow libraries" such as LibGen, The Pile, and YouTube Subtitles to kickstart training.
The root cause is the industry-wide hunger for high-quality, "long-form" data, which is essential for teaching AI models logic and nuance.
While Anthropic markets itself as an "AI safety" company, it quietly prioritised rapid scaling over copyright compliance and sidelined corporate accountability in favour of the "move fast" mentality prevalent in the AI arms race.
It is hardly alone. AI companies rarely disclose the specific datasets used for training, and the "black box" nature of these models makes it difficult for authors to prove their specific work was used until independent researchers or legal discovery processes unearth the truth.
For authors and content creators, this incident highlights significant copyright and economic concerns. AI companies may use large volumes of copyrighted work without acknowledgement, permission or compensation, potentially undermining creators' control and compensation for their intellectual property.
For society, the episode raises serious questions about Anthropic's ethics, as well as the ethics of the AI industry more broadly. Destroying physical books en masse raises cultural and archival concerns beyond copyright, including the stewardship of human knowledge.
For policymakers, it underlines the real need for clear fair use definitions in the age of AI. Regulators are now tasked with deciding whether "training" constitutes a transformative use of data or is simply high-tech copyright infringement. This will likely lead to new mandates for dataset transparency and mandatory licensing frameworks.
Claude
Developer: Anthropic
Country: Multiple
Sector: Multiple
Purpose: Train AI models
Technology: Generative AI
Issue: Accountability; Appropriation; Transparency
Bartz et al. v. Anthropic PBC
https://www.washingtonpost.com/technology/2026/01/27/anthropic-ai-scan-destroy-books/
https://futurism.com/future-society/anthropic-destroying-books
https://www.techbrew.com/stories/2026/01/28/anthropic-ai-books-lawsuit
https://www.chosun.com/english/industry-en/2026/01/28/CGY3JHGJXFHMZPKGWXS3W5X2SQ/
AIAAIC Repository ID: AIAAIC2196