BookCorpus dataset

Page created: May 2021
Updated: May 2022

BookCorpus is a dataset that draws on 7,000+ free books by unpublished authors, representing 16 different genres, culled from Smashwords.com, a site that describes itself as 'the world’s largest distributor of indie ebooks'.

Compiled in 2014 by a group of University of Toronto and MIT researchers and funded by Google and Samsung, BookCorpus has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.

BookCorpus was withdrawn following a critical review of the dataset and its governance by Northwestern University researchers Jack Bandy and Nicholas Vincent in May 2021.

Copyright, marketing, bias

In a working paper, Bandy and Vincent set out several concerns about BookCorpus and called for stronger standards for the documentation of datasets.

Notably, they found that BookCorpus violates copyright restrictions for over 200 books that explicitly state that they 'may not be reproduced, copied and distributed for commercial or non-commercial purposes.'

The Guardian had previously noted that the developers of BookCorpus failed to make contact with or seek consent from authors whose books were included.

The researchers also discovered that the size of the BookCorpus dataset had been inaccurately described, with several books reproduced multiple times, and that the sample is skewed towards certain genres and religions.

Despite these concerns, BookCorpus remains widely available on data sharing websites.

Operator: Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada
Sector: Research/academia; Technology
Purpose: Train language models
Technology: Dataset; NLP/text analysis; Neural network
Issue: Copyright; Bias/discrimination - race, religion
Opacity: Privacy; Marketing