C4 dataset contains unsafe, copyright-protected web content

Occurred: April 2023

A dataset used to train Google and Meta large language models was found to contain racist, pornographic, and copyright-protected web content, raising safety, privacy, and other concerns.

A joint Washington Post/Allen Institute for AI investigation discovered that C4 drew content from Reddit, notorious message board 4chan, white supremacy site Stormfront, and far-right site Kiwi Farms, effectively baking large volumes of offensive content into the data.

The investigation also found that Russian government news website RT and US far-right news and opinion site Breitbart were amongst the sites used to build C4. Both are known for their highly skewed political views and tendency to create and amplify false stories. The investigation also found that C4 included content from sites such as flvoters.com, raising concerns about the privacy of US voters in particular.

According to the Washington Post, the copyright symbol appeared over 200 million times in the C4 dataset. Copyright has become a major issue for generative AI systems, and the Post said its 'analysis suggests more legal challenges may be on the way.'

The investigation raised questions about the safety and security of the dataset and the machine learning systems trained on it, the privacy of web users, copyright abuse, bias, and the veracity of its creators' marketing claims.

System 🤖

Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Research/academia; Technology
Purpose: Train large language models
Technology: Dataset/database
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing

Investigations, assessments, audits 🧐

Page info
Type: Issue
Published: June 2024