BookCorpus large language dataset
BookCorpus is a dataset that draws on 11,000+ free books by unpublished authors, representing 16 different genres and culled from Smashwords.com, a site that describes itself as 'the world’s largest distributor of indie ebooks'.
Compiled in 2014 by a group of University of Toronto and MIT researchers and funded by Google and Samsung, BookCorpus has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.
BookCorpus was withdrawn following a critical review of the dataset and its governance by Northwestern University researchers Jack Bandy and Nicholas Vincent in May 2021.
In a working paper, Bandy and Vincent set out several concerns about BookCorpus and called for stronger standards for the documentation of datasets.
Notably, they found that BookCorpus violates copyright restrictions for over 200 books that explicitly state that they 'may not be reproduced, copied and distributed for commercial or non-commercial purposes.'
The Guardian had previously noted that the developers of BookCorpus failed to make contact with or seek consent from authors whose books were included.
Dataset bias, misleading marketing
The researchers also discovered that the size of the BookCorpus dataset had been inaccurately described, with several books reproduced multiple times, and that the sample is skewed toward certain genres and religious forms.
Despite these concerns, BookCorpus remains widely available on data sharing websites.
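The duplication and genre skew described above are the kind of problem a simple audit script can surface. The following Python sketch is illustrative only and is not the method used by Bandy and Vincent. It assumes each book is available as a record with hypothetical 'title', 'genre' and 'text' fields, and it catches only copies that are identical after case and whitespace normalisation.

```python
import hashlib
from collections import Counter


def audit_books(books):
    """Count unique books, list duplicates, and tally genres.

    `books` is an iterable of dicts with the hypothetical keys
    'title', 'genre' and 'text' (an assumed layout, not BookCorpus's own).
    """
    seen = {}            # content hash -> title of the first copy seen
    duplicates = []      # (duplicate title, title of the first copy)
    genre_counts = Counter()

    for book in books:
        # Normalise case and whitespace so trivially different copies match.
        normalized = " ".join(book["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()

        if digest in seen:
            duplicates.append((book["title"], seen[digest]))
        else:
            seen[digest] = book["title"]
            # Count genres over unique books only, so copies do not add to the skew.
            genre_counts[book["genre"]] += 1

    return {
        "unique_books": len(seen),
        "duplicates": duplicates,
        "genre_counts": dict(genre_counts),
    }
```

Comparing 'unique_books' with the advertised size of a dataset, and inspecting 'genre_counts', gives a first indication of whether its documentation matches its contents. Near-duplicates, such as different editions of the same book, would need a fuzzier comparison than exact hashing.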
Operator: Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Sector: Technology; Research/academia
Purpose: Train language models
Technology: Dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion
Transparency: Privacy; Marketing
BookCorpus dataset (Hugging Face)
Bandy J. (2021). BookCorpus datacard/datasheet
Investigations, assessments, audits
Bandy J., Towards Data Science (2021). Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning