BookCorpus large language dataset

BookCorpus (also known as Toronto Book Corpus) is a dataset that draws on 11,000+ free books by unpublished authors, representing 16 different genres.

The books were culled from Smashwords.com, a site that describes itself as 'the world's largest distributor of indie ebooks'.

BookCorpus was compiled in 2014 by a group of University of Toronto and MIT researchers and was funded by Google and Samsung.

The dataset has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.

BookCorpus

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.

Source: Wikipedia

Dataset info 🔢

Operator: Alphabet/Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada; USA
Sector: Media/entertainment/sports/arts; Research/academia; Technology
Purpose: Train language models
Technology: Database/dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion; Ethics/values
Transparency: Privacy; Marketing

Risks and harms 🛑