BookCorpus (also known as Toronto Book Corpus) is a dataset that draws on 11,000+ free books by unpublished authors, representing 16 different genres.
The books were culled from Smashwords.com, a site that describes itself as 'the world's largest distributor of indie ebooks'.
BookCorpus was compiled in 2014 by a group of University of Toronto and MIT researchers and was funded by Google and Samsung.
The dataset has been used to train influential large language models such as Google's BERT, Amazon's Bort, and OpenAI's GPT.
Text mining - Text analytics
Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.
Wikipedia: Text mining
Released: 2014
Availability: Available
Developer: Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Type: Database/dataset
Purpose: Train language models
Technique: NLP/text analysis; Deep learning
Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Key transparency limitations of the BookCorpus dataset include:
Source selection. Limited information is available about how books were selected for inclusion in the corpus, potentially introducing selection bias.
Copyright and permissions. It is unclear whether proper permissions were obtained for all books included, raising legal and ethical concerns.
Data cleaning process. Details about any preprocessing or cleaning steps applied to the text are not fully disclosed.
Demographic representation. There is limited transparency regarding the diversity of authors and perspectives represented in the corpus.
Language variety. While primarily English, the extent of dialect variation or inclusion of non-English text is not clearly specified.
Versioning and updates. Clear information about different versions of the dataset and any updates made over time is limited.
Quality control. Details about measures taken to ensure the quality and accuracy of the digitized text are not fully transparent.
Metadata availability. The extent and accuracy of metadata (e.g., author information, publication details) accompanying the text is unclear.
Duplicate content. Information about how potential duplicates or near-duplicates are handled within the corpus is lacking.
Ethical review. There is no information regarding any ethical review process conducted during the dataset's creation.
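The duplicate-content limitation above is one that a dataset user can partly check independently. The following is a minimal sketch only, assuming the corpus has been loaded into a dictionary that maps a book identifier to its full text. The function names and the toy sample are hypothetical and do not describe any step taken by the BookCorpus authors. It finds exact duplicates after light normalization; near-duplicates would need a fuzzier method.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not hide an otherwise identical text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_exact_duplicates(books: dict[str, str]) -> list[list[str]]:
    """Return groups of book IDs whose normalized text is identical."""
    groups = defaultdict(list)
    for book_id, text in books.items():
        # Hash the normalized text so full texts need not be compared pairwise.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(book_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical toy sample, not real BookCorpus content.
sample = {
    "book_a": "It was a dark night.\nThe end.",
    "book_b": "It was a  dark night. The END.",
    "book_c": "A different story.",
}
print(find_exact_duplicates(sample))  # [['book_a', 'book_b']]
```

Hashing keeps the check linear in the number of books, at the cost of missing texts that differ by even a single word.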
BookCorpus has been accused of violating copyright illegally and unethically and of undermining authors' control over the distribution of their creative works, which is essential to their livelihood. It has also been accused of racial and religious bias, misleading marketing, and poor transparency.
Bandy J. (2021). BookCorpus datacard/datasheet
Bandy J., Vincent N. (2021). Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Bandy J., Towards Data Science (2021). Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning
Page info
Type: Data
Published: May 2021
Last updated: October 2024