C4 large language model dataset
C4 ('Colossal Clean Crawled Corpus') is a public dataset created by Google as a smaller, cleaner version of the Common Crawl web-scrape dataset.
C4 was used to train Google's T5 and LaMDA, and Meta's LLaMA large language models.
Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Research/academia; Technology
Purpose: Train large language models
Technology: Dataset/database
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing
Risks and harms 🛑
The C4 dataset has been accused of providing unsafe, biased content that violates copyright and privacy.
Transparency and accountability 🙈
Here are key transparency limitations of the C4 (Colossal Clean Crawled Corpus) large language model dataset:
Web crawling methodology. Details about the specific web crawling techniques and criteria used to collect the data are not transparent.
Content filtering process. While some filtering steps are mentioned, the exact algorithms and thresholds used to clean and filter the data are not clear. Google claimed to have heavily filtered Common Crawl data when creating C4. But the Washington Post's discovery of high volumes of clearly offensive and unsafe content in the dataset suggested its data cleansing processes were flawed, or its marketing was misleading, or both.
Language identification. The process and accuracy of language detection used to focus on English-language content may have limitations that are not fully disclosed.
Demographic representation. There's limited information on the demographic diversity of content creators or perspectives represented in the dataset.
Temporal coverage. The time frame of data collection and how well different time periods are represented in the corpus are not clearly specified.
Domain distribution. The distribution of web domains and content types in the dataset may not be fully documented, potentially leading to biases.
Deduplication process. While deduplication is mentioned, the specific methods used and their effectiveness are not transparent.
Privacy considerations. Details about steps taken to protect individual privacy or remove personal information are limited.
Content quality assessment. The criteria used to assess the quality of included content and filter out low-quality text are not fully disclosed.
Bias mitigation efforts. Information about any steps taken to identify or mitigate potential biases in the dataset is lacking.
Versioning and updates. Clear documentation about different versions of the dataset and any updates made over time is limited.
Ethical considerations. There is limited transparency regarding the ethical review process or considerations taken into account during dataset creation.
Licensing and usage restrictions. Full details of licensing terms and any usage restrictions are not readily available or clearly communicated.
Incidents and issues 🔥
Research, advocacy 🧮
Google (2023). Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models
Subramani N., et al (2023). Detecting Personal Information in Training Corpora: an Analysis (pdf)
Dodge J., Sap M., Marasović A., Agnew W., Ilharco G., Groeneveld D., Mitchell M., Gardner M. (2021). Documenting the English Colossal Clean Crawled Corpus
Raffel C. et al (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (pdf)
Investigations, assessments, audits 🧐
Washington Post (2023). Inside the secret list of websites that make AI like ChatGPT sound smart
Page info
Type: Data
Published: June 2023
Last updated: June 2024