The Pile - dataset
Page published: July 2024 | Last updated: February 2026
The Pile is an 886 GB, open-source dataset of English-language text created to help train large language models (LLMs).
Developed by EleutherAI and publicly released in December 2020, The Pile comprises 22 smaller datasets, 14 of them newly created, including Books3, BookCorpus, and YouTube Subtitles.
Initially developed to train EleutherAI's GPT-Neo models, The Pile has since been used to train many other models, including Microsoft and Nvidia's Megatron-Turing Natural Language Generation, Meta AI's Open Pre-trained Transformers, LLaMA, and Galactica, Stanford University's BioMedLM 2.7B, the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, Yandex's YaLM 100B, and Apple's OpenELM.
Released: 2020
Purpose: Train large language models
Type: Database/dataset
Technique: Generative AI; Large language model; Machine learning
The Pile has been criticised for multiple transparency and accountability limitations:
Inadequate documentation. The dataset has not been thoroughly documented by its creators, making it difficult for researchers to fully understand and address potential issues.
Lack of clear filtering mechanisms. The Pile does not appear to have implemented extensive processes for filtering content deemed toxic or private, leaving much of this responsibility to individual researchers using the dataset.
Privacy consent. Several datasets included in The Pile were collected without the explicit consent of the individuals whose data is included. This is particularly concerning for datasets like the Enron emails, where individuals had no opportunity to consent to their inclusion.
Copyright compliance. Some components of The Pile may contain copyrighted material that was collected or distributed in breach of copyright law or terms of service agreements. This raises legal and ethical questions about the use of such data.
Limited accountability. There appears to be a lack of established mechanisms for reporting and remedying faults or biases discovered in the dataset.
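Because The Pile ships without extensive toxicity or privacy filtering, that responsibility falls to individual researchers. As a minimal sketch of what such downstream filtering might look like (the patterns, placeholder tokens, and function names here are illustrative assumptions, not part of The Pile or any EleutherAI tooling), a researcher could redact obvious personal identifiers such as email addresses and phone numbers before training:

```python
import re

# Illustrative PII patterns (assumptions, not exhaustive): an email address
# and a North-American-style phone number. Real pipelines use far more
# sophisticated detectors (e.g. named-entity recognition).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched email/phone spans with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def filter_records(records: list[str]) -> list[str]:
    """Apply redaction to every raw text record in a corpus shard."""
    return [redact_pii(rec) for rec in records]
```

Regex-based redaction of this kind catches only surface-level identifiers; the research listed below (e.g. Subramani et al.) shows that detecting personal information in web-scale corpora reliably is a much harder problem.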
The Pile dataset has been accused of copyright and privacy abuse, and of enabling the creation and deployment of biased, unsafe and unethical AI models.
Anthropic "destructively" scans millions of books to train AI models
Books3 dataset shut down after legal notice from Danish anti-piracy group
Mike Huckabee books used to train language models without consent
OpenAI deleted training datasets believed to contain copyrighted books
17 authors sue OpenAI for 'systematic mass-scale copyright infringement'
Ada Lovelace Institute. Allocating accountability in AI supply chains
Subramani N., et al. Detecting Personal Information in Training Corpora: an Analysis
Kim S., et al. ProPILE: Probing Privacy Leakage in Large Language Models
AIAAIC Repository ID: AIAAIC1596