C4 large language model dataset
C4 ('Colossal Clean Crawled Corpus') is a public dataset created by Google as a smaller, cleaner version of the Common Crawl web-scrape dataset.
C4 was used to train Google's T5 and LaMDA, and Meta's LLaMA large language models.
Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Research/academia; Technology
Purpose: Train large language models
Technology: Dataset/database
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing
Risks and harms 🛑
The C4 dataset has been accused of providing unsafe, biased content that violates copyright and privacy.
Transparency and accountability 🙈
Here are key transparency limitations of the C4 (Colossal Clean Crawled Corpus) large language model dataset:
Web crawling methodology. Details about the specific web crawling techniques and criteria used to collect the data are not transparent.
Content filtering process. While some filtering steps are mentioned, the exact algorithms and thresholds used to clean and filter the data are not clear. Google claimed to have heavily filtered Common Crawl data when creating C4. But the Washington Post's discovery of high volumes of clearly offensive and unsafe content in the dataset suggested its data cleansing processes were flawed, or its marketing was misleading, or both.
Language identification. The process and accuracy of language detection used to focus on English-language content may have limitations that are not fully disclosed.
Demographic representation. There's limited information on the demographic diversity of content creators or perspectives represented in the dataset.
Temporal coverage. The time frame of data collection and how well different time periods are represented in the corpus are not clearly specified.
Domain distribution. The distribution of web domains and content types in the dataset may not be fully documented, potentially leading to biases.
Deduplication process. While deduplication is mentioned, the specific methods used and their effectiveness are not transparent.
Privacy considerations. Details about steps taken to protect individual privacy or remove personal information are limited.
Content quality assessment. The criteria used to assess the quality of included content and filter out low-quality text are not fully disclosed.
Bias mitigation efforts. Information about any steps taken to identify or mitigate potential biases in the dataset is lacking.
Versioning and updates. Clear documentation about different versions of the dataset and any updates made over time is limited.
Ethical considerations. There is limited transparency regarding the ethical review process or considerations taken into account during dataset creation.
Licensing and usage restrictions. Full details of licensing terms and any usage restrictions are not readily available or clearly communicated.
Incidents and issues 🔥
Research, advocacy 🧮
Google (2023). Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models
Subramani N., et al (2023). Detecting Personal Information in Training Corpora: an Analysis (pdf)
Dodge J., Sap M., Marasović A., Agnew W., Ilharco G., Groeneveld D., Mitchell M., Gardner M. (2021). Documenting the English Colossal Clean Crawled Corpus
Raffel C. et al (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (pdf)
Investigations, assessments, audits 🧐
Washington Post (2023). Inside the secret list of websites that make AI like ChatGPT sound smart
Page info
Type: Data
Published: June 2023
Last updated: June 2024