BookCorpus large language dataset

BookCorpus is a dataset that draws on more than 11,000 free books by unpublished authors, representing 16 different genres and culled from Smashwords.com, a site that describes itself as 'the world’s largest distributor of indie ebooks'.

Compiled in 2014 by a group of University of Toronto and MIT researchers and funded by Google and Samsung, BookCorpus has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.

Dataset 🤖

Dataset databank 🔢

Operator: Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada
Sector: Technology; Research/academia
Purpose: Train language models
Technology: Dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion
Transparency: Privacy; Marketing

Risks and harms 🛑

BookCorpus was withdrawn following a critical review of the dataset and its governance by Northwestern University researchers Jack Bandy and Nicholas Vincent in May 2021.

Copyright abuse

In a working paper, Bandy and Vincent set out several concerns about BookCorpus and called for stronger standards for the documentation of datasets.

Notably, they found that BookCorpus violates the copyright restrictions of over 200 books, which explicitly state that they 'may not be reproduced, copied and distributed for commercial or non-commercial purposes.'

The Guardian had previously noted that the developers of BookCorpus failed to make contact with or seek consent from authors whose books were included.

Dataset bias, misleading marketing

The researchers also discovered that the size of the BookCorpus dataset had been inaccurately described, with several books reproduced multiple times, and that the sample is skewed towards certain genres and religious forms.

Despite these concerns, BookCorpus remains widely available on data sharing websites.

Dataset documentation 🤖

Investigations, assessments, audits 🧐

Page info
Type: Data
Published: May 2021
Last updated: September 2023