The Pile dataset

The Pile is an 886 GB, open-source dataset of English-language text created to help train large language models (LLMs).

Developed by EleutherAI and publicly released in December 2020, The Pile consists of 22 smaller datasets, 14 of them newly created, including Books3, BookCorpus, and YouTube Subtitles.
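The Pile was distributed as zstandard-compressed JSON Lines shards, with each record tagged with the component dataset it came from. The sketch below is a minimal illustration rather than an official tool: it tallies documents per component in one locally downloaded shard. The shard file name 00.jsonl.zst and the meta/pile_set_name record layout are assumptions based on the dataset's published format.

```python
# Minimal sketch: count documents per Pile component in one shard.
# Assumes a locally downloaded shard (file name is illustrative) whose
# lines are JSON records like {"text": "...", "meta": {"pile_set_name": "..."}}.
import io
import json
from collections import Counter

import zstandard  # pip install zstandard

counts = Counter()
with open("00.jsonl.zst", "rb") as fh:  # hypothetical local shard path
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        counts[record["meta"]["pile_set_name"]] += 1

for name, num_docs in counts.most_common():
    print(f"{name}: {num_docs}")
```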

Initially developed to train EleutherAI's GPT-Neo models, The Pile has been used to train many other models, including Microsoft and Nvidia's Megatron-Turing Natural Language Generation, Meta AI's Open Pre-trained Transformers, LLaMA, and Galactica, Stanford University's BioMedLM 2.7B, the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, Yandex's YaLM 100B, and Apple's OpenELM.

Large language model

A large language model (LLM) is a type of computational model designed for tasks related to natural language processing, including language generation.

Source: Wikipedia 🔗

Dataset 🤖

Derivatives, applications 🈸

Transparency and accountability 🙈

The Pile is seen to suffer from multiple transparency and accountability limitations.

Risks and harms 🛑

The Pile dataset has been accused of copyright infringement and privacy violations, and of enabling the development and deployment of biased, unsafe, and unethical AI models.

Research, advocacy 🧮