BookCorpus dataset accused of copyright abuse and bias

Occurred: May 2021

BookCorpus, a dataset used to train dozens of influential language models including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort, was accused of copyright violations, bias, and misleading marketing.

In a working paper, researchers Jack Bandy and Nicholas Vincent set out several concerns about BookCorpus and called for stronger standards for the documentation of datasets.

Notably, they found that BookCorpus violated the copyright restrictions of over 200 books that had been taken from Smashwords, a website that describes itself as “the world’s largest distributor of indie ebooks.”

The researchers also discovered that the size of the BookCorpus dataset had been inaccurately described, with several books reproduced multiple times, and that the sample was skewed toward certain genres and religions.

Despite these concerns, BookCorpus remains widely available on data sharing websites such as Hugging Face.

In September 2016, The Guardian noted that the developers of BookCorpus had failed to contact or seek consent from the authors whose books were included in the dataset.

Operator: Alphabet/Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada; USA
Sector: Media/entertainment/sports/arts; Research/academia; Technology
Purpose: Train language models
Technology: Database/dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion; Ethics/values
Transparency: Governance; Privacy; Marketing

Page info
Type: Issue
Published: June 2024