C4 is a public dataset of approximately 750 GB of English-language text, developed by Google as a smaller, cleaner version of the Common Crawl dataset.
C4 was created by taking a single month's scrape of Common Crawl and removing duplicate, placeholder, nonsensical, and non-English content.
C4 was used to train Google's open-source T5 model and its LaMDA large language models, as well as Meta's LLaMA large language models.
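The cleaning steps described above can be sketched as heuristic line filters. This is an illustrative approximation, not Google's actual pipeline; the `clean_page` helper, its thresholds, and the exact rules are assumptions based on the kinds of heuristics reported for C4.

```python
from typing import Optional

# Illustrative sketch of C4-style cleaning heuristics; the function name,
# thresholds, and rules are assumptions, not Google's actual pipeline.
TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_page(text: str, seen_lines: set) -> Optional[str]:
    """Apply heuristic line filters; return cleaned text, or None to drop the page."""
    if "lorem ipsum" in text.lower():
        return None                      # drop placeholder pages
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < 3:        # drop very short lines (assumed threshold)
            continue
        if not line.endswith(TERMINAL_PUNCT):
            continue                     # keep only sentence-like lines
        if line in seen_lines:           # crude exact-match deduplication
            continue
        seen_lines.add(line)
        kept.append(line)
    if len(kept) < 3:                    # drop pages with too little prose (assumed)
        return None
    return "\n".join(kept)
```

Passing one shared `seen_lines` set across pages mimics corpus-wide line deduplication; the real pipeline's scope and matching policy are exactly the details the limitations below flag as undisclosed.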
Text mining - Text analytics
Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.
Wikipedia: Text mining
Dataset
C4 dataset (Allen Institute for AI)
Released: 2020
Availability: Available
Developer: Google; Meta
Country: USA
Purpose: Train language models
Type: Database/dataset
Technique: NLP/text analysis; Machine learning
Raffel C., Shazeer N., Roberts A., Lee K., Narang S., Matena M., Zhou Y., Li W., Liu P.J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Key transparency limitations of the C4 (Colossal Clean Crawled Corpus) dataset include:
Web crawling methodology. Details about the specific web crawling techniques and criteria used to collect the data are not transparent.
Content filtering process. While some filtering steps are mentioned, the exact algorithms and thresholds used to clean and filter the data are not clear. Google claimed it heavily filtered the Common Crawl data used to build C4, but the Washington Post's discovery of high volumes of clearly offensive and unsafe content suggested that its data cleansing processes were flawed, its marketing was misleading, or both.
Language identification. The process and accuracy of language detection used to focus on English-language content may have limitations that are not fully disclosed.
Demographic representation. There's limited information on the demographic diversity of content creators or perspectives represented in the dataset.
Temporal coverage. The time frame of data collection and how well different time periods are represented in the corpus are not clearly specified.
Domain distribution. The distribution of web domains and content types in the dataset may not be fully documented, potentially leading to biases.
Deduplication process. While deduplication is mentioned, the specific methods used and their effectiveness are not transparent.
Privacy considerations. Details about steps taken to protect individual privacy or remove personal information are limited.
Content quality assessment. The criteria used to assess the quality of included content and filter out low-quality text are not fully disclosed.
Bias mitigation efforts. Information about any steps taken to identify or mitigate potential biases in the dataset is lacking.
Versioning and updates. Clear documentation of different versions of the dataset and any updates made over time is limited.
Ethical considerations. There is limited transparency regarding the ethical review process or considerations taken into account during dataset creation.
Licensing and usage restrictions. Full details of licensing terms and any usage restrictions are not readily available or clearly communicated.
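The deduplication concern above is consequential: whether matching is exact or normalized changes which documents survive, so an undisclosed choice materially changes the corpus. A minimal sketch, assuming hypothetical `dedupe` and `normalize` helpers (not C4's actual method):

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before hashing (one possible policy)."""
    return " ".join(text.lower().split())

def dedupe(docs: list, normalized: bool = False) -> list:
    """Keep the first occurrence of each document, keyed by a content hash."""
    seen, kept = set(), []
    for doc in docs:
        key_text = normalize(doc) if normalized else doc
        key = hashlib.sha1(key_text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello World.", "hello   world.", "Hello World."]
# Exact matching keeps two documents; normalized matching keeps only one.
```

Without documentation of which policy was applied, downstream users cannot reproduce or audit the corpus's effective contents.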
The C4 dataset has been found to include a considerable quantity of pornographic and offensive content, including over 72,000 instances of the word "swastika", as well as content sourced from sites associated with anti-trans perspectives, white supremacists, far-right conspiracy theories such as QAnon, and misinformation.
It has also been found to disproportionately exclude content about minority individuals, such as non-sexual and non-offensive content about LGBT+ people, and content associated with Black and Hispanic authors.
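Audits such as the Washington Post's and Dodge et al.'s typically start from simple corpus statistics, for example counting occurrences of flagged terms across documents. A hedged illustration (the `term_counts` helper is hypothetical, not the auditors' actual tooling):

```python
import re
from collections import Counter

def term_counts(documents, terms):
    """Count case-insensitive whole-word occurrences of each audit term."""
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
                for t in terms}
    counts = Counter()
    for doc in documents:
        for term, pattern in patterns.items():
            counts[term] += len(pattern.findall(doc))
    return counts

docs = ["An offensive symbol appeared.",
        "The symbol recurred; the symbol spread."]
# term_counts(docs, ["symbol"]) counts 3 occurrences of "symbol"
```

Counts like these only surface what is present; measuring what was disproportionately excluded, as the finding above describes, requires comparing against the unfiltered source as well.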
Google (2023). Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models
Subramani N., et al (2023). Detecting Personal Information in Training Corpora: an Analysis (pdf)
Dodge J., Sap M., Marasović A., Agnew W., Ilharco G., Groeneveld D., Mitchell M., Gardner M. (2021). Documenting the English Colossal Clean Crawled Corpus
Washington Post (2023). Inside the secret list of websites that make AI like ChatGPT sound smart
Page info
Type: Data
Published: June 2023
Last updated: October 2024