BookCorpus large language dataset
BookCorpus (also known as the Toronto Book Corpus) is a dataset that draws on 11,000+ free books by unpublished authors, representing 16 different genres.
The books were culled from Smashwords.com, a site that describes itself as 'the world's largest distributor of indie ebooks'.
BookCorpus was compiled in 2014 by a group of University of Toronto and MIT researchers and was funded by Google and Samsung.
The dataset has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.
BookCorpus
BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.
Source: Wikipedia
Dataset
BookCorpus
BookCorpus dataset (Hugging Face)
Documents
Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Dataset info
Operator: Alphabet/Google; Amazon; OpenAI; Samsung
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Country: Canada; USA
Sector: Media/entertainment/sports/arts; Research/academia; Technology
Purpose: Train language models
Technology: Database/dataset; NLP/text analysis; Deep learning
Issue: Copyright; Bias/discrimination - race, religion; Ethics/values
Transparency: Privacy; Marketing