BookCorpus dataset bias, copyright abuse

May 2021

BookCorpus, a large dataset of books culled from the web in 2014, has been used to train influential language models such as Google's BERT, OpenAI's GPT and Amazon's Bort. Northwestern University researchers Jack Bandy and Nicholas Vincent found that it directly violates copyright restrictions for hundreds of books, misstates its own size, skews genre representation, and potentially skews religious representation.