C4 large language model dataset
C4 ('Colossal Clean Crawled Corpus') is a public dataset created by Google as a smaller, cleaner version of the Common Crawl web scrape.
C4 was used to train Google's LaMDA and Meta's LLaMA large language models.
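C4 was built by applying heuristic filters to Common Crawl. As a rough illustration only, the sketch below shows the kind of line- and page-level rules described in Raffel et al. (2020), such as keeping only lines that end in terminal punctuation and discarding very short or placeholder-filled pages. The function name and thresholds are illustrative assumptions, not the actual C4 pipeline or its exact parameters.

```python
import re

# Simplified sketch of C4-style cleaning heuristics (illustrative only):
# keep lines ending in terminal punctuation, drop pages containing
# placeholder boilerplate, and drop pages that end up too short.
TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text: str, min_sentences: int = 3):
    """Return the cleaned page text, or None if the page is discarded."""
    if "lorem ipsum" in text.lower():
        return None  # placeholder text signals low-quality boilerplate
    lines = [ln.strip() for ln in text.splitlines()]
    # Keep only lines that end in terminal punctuation (drops menus, footers)
    kept = [ln for ln in lines if ln.endswith(TERMINAL_PUNCT)]
    # Rough sentence count over the retained lines
    n_sentences = sum(len(re.findall(r"[.!?]", ln)) for ln in kept)
    if n_sentences < min_sentences:
        return None
    return "\n".join(kept)
```

Applied to a page mixing prose with navigation links, a filter like this keeps the sentences and drops the link text; as the Washington Post investigation showed, heuristics of this kind still let large volumes of problematic content through.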
Dataset 🤖
Allen Institute for AI C4 dataset
Dataset databank 🔢
Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Technology; Research/academia
Purpose: Train large language models
Technology: Dataset
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing
Risks and harms 🛑
An April 2023 Washington Post/Allen Institute for AI investigation discovered that C4 contained racist, pornographic, and copyright-protected web content.
The discovery raised questions about the safety and security of the dataset and the machine learning systems trained on it, the privacy of web users, copyright abuse, bias, and the accuracy of its creators' marketing claims.
Safety
C4 was found to include content from Reddit, notorious message board 4chan, white supremacist site Stormfront, and harassment forum Kiwi Farms, effectively baking large volumes of offensive content into the data. The finding raised concerns about the safety of C4 and the systems trained on it.
Mis/disinformation
The investigation found that Russian state-controlled news website RT and US far-right outlet Breitbart were among the sites included in C4. Both are known for their highly partisan coverage and for creating and amplifying false stories.
Privacy
The investigation also found that C4 included content from sites such as flvoters.com, raising concerns about the privacy of US voters in particular.
Copyright
According to the Washington Post, the copyright symbol appeared over 200 million times in the C4 dataset. Copyright has become a major issue for generative AI systems. The Post said its analysis 'suggests more legal challenges may be on the way.'
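A tally like the Post's amounts to counting occurrences of the © character across every document in the corpus. The sketch below shows the idea on a toy in-memory sample; the documents and the helper name are hypothetical stand-ins, and at the Post's scale this would run over the full C4 corpus rather than a Python list.

```python
# Minimal, illustrative tally of a symbol across a batch of documents.
# The sample texts are hypothetical stand-ins, not actual C4 records.
def count_symbol(docs, symbol="\u00a9"):
    """Count total occurrences of `symbol` (default ©) across all documents."""
    return sum(doc.count(symbol) for doc in docs)

sample = [
    "© 2021 Example Media. All rights reserved.",
    "A recipe post with no copyright notice.",
    "Footer: © ExampleCorp | © Partner Site",
]

total = count_symbol(sample)  # 3 occurrences in this toy sample
```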
Bias/discrimination
The C4 dataset is heavily skewed toward religious websites reflecting a Western perspective. Of the top 20 religious sites, 14 were Christian, two were Jewish, and one was Muslim.
Transparency 🙈
Google claimed it heavily filtered the Common Crawl data before building C4. But the Post's discovery of large volumes of clearly offensive and unsafe content suggested its data cleansing processes are flawed, its marketing was misleading, or both.
OpenAI refused to reveal any information about how its GPT-4 large language model was trained.
Research, advocacy 🧮
Google (2023). Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models
Dodge J., Sap M., Marasović A., Agnew W., Ilharco G., Groeneveld D., Mitchell M., Gardner M. (2021). Documenting the English Colossal Clean Crawled Corpus
Raffel C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (pdf)
Investigations, assessments, audits 🧐
Washington Post (2023). Inside the secret list of websites that make AI like ChatGPT sound smart
News, commentary, analysis 🗞️
Page info
Type: Data
Published: June 2023