C4 large language model dataset
Released: April 2023
C4 was found to have drawn content from Reddit, the notorious message board 4chan, the white supremacist site Stormfront, and the far-right site Kiwi Farms, effectively baking large volumes of offensive content of every conceivable kind into the data. The finding raised concerns about the safety of C4 and of the systems trained on it.
The investigation found that the Russian government news website RT and the hard-right US political outlet Breitbart were amongst the sites used to build C4. Both are known for their highly skewed political coverage and their tendency to create and amplify false stories.
The investigation also found that C4 included content from voter registration sites such as flvoters.com, raising concerns about the privacy of US voters in particular.
According to the Washington Post, the copyright symbol appeared over 200 million times in the C4 dataset. Copyright has become a major issue for generative AI systems. The Post said its 'analysis suggests more legal challenges may be on the way.'
The religious content in C4 is heavily skewed toward a western perspective: of the top 20 religious sites, 14 were Christian, two were Jewish, and one was Muslim.
Google claimed it heavily filtered the data before including it in C4. But the Post's discovery of high volumes of clearly offensive and unsafe content suggests its data cleansing processes are flawed, its marketing was misleading, or both.
OpenAI has refused to reveal any information about how its GPT-4 large language model was trained.
Operator: Alphabet/Google; Meta/Facebook
Sector: Technology; Research/academia
Purpose: Train large language models
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing
Investigations, assessments, audits
Washington Post (2023). Inside the secret list of websites that make AI like ChatGPT sound smart
News, commentary, analysis
Published: June 2023