C4 large language model dataset


C4 ('Colossal Clean Crawled Corpus') is a public dataset of approximately 750GB of English-language text developed by Google as a smaller, cleaner version of the Common Crawl dataset.

C4 was created by taking a single month's scrape of Common Crawl and removing duplicate, placeholder, nonsensical and non-English language content.
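The kind of filtering described above can be illustrated with a minimal sketch. It is not Google's actual pipeline: the function names, the thresholds, and the ASCII-ratio check (a crude stand-in for a real language detector) are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Set

# Illustrative assumptions, not values confirmed by the source.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3
MIN_ASCII_RATIO = 0.9


def looks_english(text: str) -> bool:
    """Crude proxy for language detection: share of ASCII characters.

    A real pipeline would use a trained language-identification model.
    """
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= MIN_ASCII_RATIO


def clean_page(text: str, seen_lines: Set[str]) -> Optional[str]:
    """Return the cleaned page text, or None if the page is dropped."""
    # Placeholder content: drop pages containing filler text.
    if "lorem ipsum" in text.lower():
        return None
    # Non-English content: drop pages that fail the crude language check.
    if not looks_english(text):
        return None

    kept = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Nonsensical fragments (menus, buttons): keep only sentence-like lines.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        # Duplicates: keep only the first occurrence of each line in the corpus.
        if line in seen_lines:
            continue
        seen_lines.add(line)
        kept.append(line)

    return "\n".join(kept) if kept else None


def clean_corpus(pages: Iterable[str]) -> List[str]:
    """Apply clean_page to every page, sharing one duplicate-tracking set."""
    seen: Set[str] = set()
    cleaned = []
    for page in pages:
        result = clean_page(page, seen)
        if result is not None:
            cleaned.append(result)
    return cleaned
```

Heuristics of this kind are blunt: a rule that drops whole pages on a single matched phrase, or on a rough language check, will also discard some legitimate content.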

C4 has been used to train Google's open-source T5 model and its LaMDA model, as well as Meta's LLaMA large language models.

Text mining - Text analytics

Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.

Wikipedia: Text mining

Dataset

Documents

Transparency and accountability

Here are key transparency limitations of the C4 (Colossal Clean Crawled Corpus) large language model dataset:

Risks and harms

The C4 dataset has been found to include a considerable quantity of pornographic and offensive content, including over 72,000 instances of the word "swastika", as well as content sourced from sites associated with anti-trans views, white supremacy, far-right conspiracy theories such as QAnon, and misinformation.

It has also been found to disproportionately exclude content about minority individuals, such as non-sexual and non-offensive content about LGBT+ people, and content associated with Black and Hispanic authors.

Incidents and issues

Research, advocacy

Investigations, assessments, audits

News, commentary, analysis

Page info
Type: Data
Published: June 2023
Last updated: October 2024