BookCorpus large language dataset


BookCorpus (also known as the Toronto Book Corpus) is a dataset that draws on 11,000+ free books by unpublished authors, representing 16 different genres, culled from Smashwords.com, a site that describes itself as 'the world's largest distributor of indie ebooks'.

Compiled in 2014 by a group of University of Toronto and MIT researchers and funded by Google and Samsung, BookCorpus has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.

Documents 📃

Operator: Alphabet/Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada; USA
Sector: Media/entertainment/sports/arts; Research/academia; Technology
Purpose: Train language models
Technology: Database/dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion; Ethics/values
Transparency: Privacy; Marketing

Risks and harms 🛑

BookCorpus has been accused of violating copyright illegally and unethically, undermining authors' ability to control the distribution of their creative works, which is essential to their livelihood. It has also been accused of exhibiting racial and religious bias, of misleading marketing, and of poor transparency.

Transparency and accountability 🙈

Here are some key transparency limitations of the BookCorpus large language dataset:

Investigations, assessments, audits 🧐

Page info
Type: Data
Published: May 2021
Last updated: June 2024