BookCorpus dataset accused of copyright abuse and bias
Occurred: May 2021
BookCorpus, a dataset used to train dozens of influential language models including Google’s BERT, OpenAI’s GPT and Amazon’s Bort, was accused of copyright violations, bias and misleading marketing.
In a working paper, researchers Jack Bandy and Nicholas Vincent set out several concerns about BookCorpus, and called for stronger standards for the documentation of datasets.
Notably, they found that BookCorpus violated the copyright restrictions of over 200 books which had been taken from Smashwords, a website that describes itself as “the world’s largest distributor of indie ebooks.”
The researchers also discovered that the size of the BookCorpus dataset had been inaccurately described, with several books reproduced multiple times, and that the sample was skewed towards certain genres and religious forms.
Despite these concerns, BookCorpus remains widely available on data sharing websites such as Hugging Face.
➖ September 2016. The Guardian noted that the developers of BookCorpus failed to make contact with or seek consent from authors whose books were included in the dataset.
Operator: Alphabet/Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada; USA
Sector: Media/entertainment/sports/arts; Research/academia; Technology
Purpose: Train language models
Technology: Database/dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion; Ethics/values; Transparency
Bandy J. (2021). BookCorpus datacard/datasheet
Bandy J., Vincent N. (2021). Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Page info
Type: Issue
Published: June 2024