BookCorpus large language dataset

BookCorpus (also known as the Toronto Book Corpus) is a dataset that draws on 11,000+ free books by unpublished authors, representing 16 different genres and culled from Smashwords.com, a site that describes itself as 'the world’s largest distributor of indie ebooks'.

Compiled in 2014 by a group of University of Toronto and MIT researchers and funded by Google and Samsung, BookCorpus has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.

Documents 📃

Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Operator: Alphabet/Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada; USA
Sector: Media/entertainment/sports/arts; Research/academia; Technology
Purpose: Train language models
Technology: Database/dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion; Ethics/values
Transparency: Privacy; Marketing

Risks and harms 🛑

BookCorpus has been accused of illegally and unethically violating copyright and of undermining authors' ability to control the distribution of their creative works, which is essential to their livelihood. It has also been accused of racial and religious bias, misleading marketing, and poor transparency.

Transparency and accountability 🙈

Here are some key transparency limitations of the BookCorpus large language dataset:

Source selection. Limited information is available about how books were selected for inclusion in the corpus, potentially introducing selection bias.
Copyright and permissions. It is unclear whether proper permissions were obtained for all books included, raising legal and ethical concerns.
Data cleaning process. Details about any preprocessing or cleaning steps applied to the text are not fully disclosed.
Demographic representation. There is limited transparency regarding the diversity of authors and perspectives represented in the corpus.
Language variety. While the corpus is primarily English, the extent of dialect variation or inclusion of non-English text is not clearly specified.
Versioning and updates. Clear information about different versions of the dataset and any updates made over time is limited.
Quality control. Details about measures taken to ensure the quality and accuracy of the digitized text are not fully disclosed.
Metadata availability. The extent and accuracy of metadata (e.g., author information, publication details) accompanying the text are unclear.
Duplicate content. Information about how potential duplicates or near-duplicates are handled within the corpus is lacking.
Ethical review. There is no information regarding any ethical review process conducted during the dataset's creation.
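
To make the duplicate-content point concrete, the sketch below shows one way a third party might audit a local copy of a book corpus for exact duplicates. It is an illustration only, not the method used by the BookCorpus authors: the `normalize` and `find_exact_duplicates` helpers, the file names, and the toy texts are all hypothetical, and hashing normalized text catches only exact copies, not near-duplicates.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def find_exact_duplicates(books: dict[str, str]) -> list[list[str]]:
    """Group book IDs whose normalized text has the same SHA-256 digest."""
    groups = defaultdict(list)
    for book_id, text in books.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(book_id)
    # Keep only digests shared by more than one book.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical toy corpus standing in for real book files.
books = {
    "a.txt": "Once upon a time.",
    "b.txt": "once  upon a\ntime.",
    "c.txt": "A different story.",
}
print(find_exact_duplicates(books))  # [['a.txt', 'b.txt']]
```

Detecting near-duplicates (for example, re-uploaded editions with small changes) would need a fuzzier technique such as shingling with MinHash, which this sketch does not attempt.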

Page info
Type: Data
Published: May 2021
Last updated: June 2024