C4 ('Colossal Clean Crawled Corpus') is a public dataset created by Google as a smaller, cleaner version of the Common Crawl dataset.

C4 was used to train Google's T5 and LaMDA, and Meta's LLaMA large language models.

Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Research/academia; Technology
Purpose: Train large language models
Technology: Dataset/database
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing

The C4 dataset has been accused of containing unsafe and biased content, as well as material that violates copyright and privacy.

Here are key transparency limitations of the C4 (Colossal Clean Crawled Corpus) large language model dataset:

Type: Data
Published: June 2023
Last updated: June 2024