BookCorpus large language dataset

Released: 2014
Occurred: May 2021

BookCorpus is a dataset that draws on 7,000+ free, self-published books spanning 16 different genres culled from Smashwords.com, a site that describes itself as 'the world’s largest distributor of indie ebooks'.

Compiled in 2014 by a group of University of Toronto and MIT researchers and funded by Google and Samsung, BookCorpus has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.

BookCorpus was withdrawn following a critical review of the dataset and its governance by Northwestern University researchers Jack Bandy and Nicholas Vincent in May 2021.

Copyright

In a working paper, Bandy and Vincent set out several concerns about BookCorpus and called for stronger standards for the documentation of datasets.

Notably, they found that BookCorpus violates copyright restrictions for over 200 books whose copyright notices explicitly state that they 'may not be reproduced, copied and distributed for commercial or non-commercial purposes.'

The Guardian had previously noted that the developers of BookCorpus failed to make contact with or seek consent from authors whose books were included.

Bias, misleading marketing

The researchers also discovered that the size of the BookCorpus dataset had been inaccurately described, with several books reproduced multiple times, and that the sample is skewed towards certain genres and religious viewpoints.

Despite these concerns, BookCorpus remains widely available on data sharing websites.

Operator: Google; Amazon; OpenAI; Samsung
Developer:
Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada
Sector:
Technology; Research/academia
Purpose: Train language models
Technology: Dataset; NLP/text analysis; Deep learning
Issue:
Copyright; Bias/discrimination - race, religion; Transparency; Privacy; Marketing