OpenAI deleted training datasets believed to contain copyrighted books

Occurred: May 2024

Unsealed documents reveal OpenAI deleted two large datasets used to train its AI models.

Documents unsealed during an ongoing class-action lawsuit between The Authors Guild and OpenAI reveal that the latter deleted two datasets used to train its GPT-2 and GPT-3 large language models. The datasets were named 'Books1' and 'Books2' in the documents. A 2020 technical document by OpenAI had described them as 'corpora of books from the Internet', and they formed 16 percent of the training data used to create GPT-3.

The two datasets are also believed to have contained over 100,000 published, copyrighted books, which, if true, would support the Guild’s case. The documents also revealed that the two researchers who created the datasets are no longer employed by OpenAI.

For months, the Guild has been seeking further information from OpenAI about the datasets. OpenAI initially resisted, citing confidentiality concerns, before revealing that it had deleted all copies of the data. In a statement, OpenAI said: “The models powering ChatGPT and our API today were not developed using these datasets.”

The Authors Guild filed the class-action lawsuit over concerns that OpenAI trained its large language models on published, copyrighted books without consent from, or payment to, their authors. The practice of training AI models on copyrighted works is widespread across the industry, and OpenAI and other AI companies face a series of lawsuits over it.

Operator: OpenAI

Developer: OpenAI

Country: USA

Sector: Media/entertainment/sports/arts

Purpose: Generate text

Technology: Chatbot; NLP/text analysis; Neural network; Deep learning; Machine learning; Reinforcement learning

Issue: Copyright; Ethics/values

Transparency: Governance
