C4 large language model dataset

C4 ('Colossal Clean Crawled Corpus') is a public dataset created by Google as a smaller, cleaner version of the Common Crawl dataset.

C4 was used to train Google's LaMDA and Meta's LLaMA large language models.

C4 dataset

Dataset databank

Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Technology; Research/academia
Purpose: Train large language models
Technology: Dataset
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing

Risks and harms

An April 2023 investigation by The Washington Post and the Allen Institute for AI discovered that C4 contained racist, pornographic, and copyright-protected web content.

The discovery raised questions about the safety and security of the dataset and the machine learning systems trained on it, the privacy of web users, abuse of copyright, bias, and the veracity of its creators' marketing claims.

Safety

C4 was found to include content from Reddit, the notorious message board 4chan, the white supremacist site Stormfront, and the far-right site Kiwi Farms, effectively baking large volumes of offensive content of every conceivable kind into the data. The finding raised concerns about the safety of C4 and the systems trained on it.

Mis/disinformation

The investigation found that RT, a Russian government news website, and Breitbart, a US far-right political news site, were amongst the sites used to build C4. Both are known for their highly skewed political views and their tendency to create and amplify false stories.

Privacy

The investigation also found that C4 included content from sites such as flvoters.com, raising particular concerns about the privacy of US voters.

Copyright

According to the Washington Post, the copyright symbol appeared over 200 million times in the C4 dataset. Copyright has become a major issue for generative AI systems. The Post said its 'analysis suggests more legal challenges may be on the way.'
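A corpus-level tally of this kind is a simple streaming count over document text. The sketch below is an illustration, not the Post's actual methodology; the function name and the in-memory list of documents are assumptions for the example (C4 itself is distributed as compressed JSON-lines records with a 'text' field, which a real pass would stream instead).

```python
def count_copyright_symbols(documents):
    """Count occurrences of the © (U+00A9) symbol across an iterable of
    document texts. A toy version of the corpus-wide tally the Post reported."""
    return sum(doc.count("\u00a9") for doc in documents)


# Example over a tiny in-memory corpus:
corpus = ["© 2023 Example Corp. All rights reserved.", "no symbol here", "©©"]
total = count_copyright_symbols(corpus)  # 3 occurrences in this sample
```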

Bias/discrimination

The religious websites in the C4 dataset are heavily skewed towards a western, Christian perspective. Of the top 20 religious sites, 14 were Christian, two were Jewish, and one was Muslim.

Transparency

Google claimed it heavily filtered Common Crawl data before creating C4. But the Post's discovery of large volumes of clearly offensive and unsafe content suggested its data cleansing processes are flawed, or its marketing was misleading, or both.

OpenAI refused to reveal any information about how its GPT-4 large language model was trained.

Investigations, assessments, audits