C4 large language model dataset

C4 ('Colossal Clean Crawled Corpus') is a public dataset created by Google as a smaller, cleaner version of the Common Crawl dataset.

C4 was used to train Google's T5 and LaMDA, and Meta's LLaMA large language models.

Dataset 🤖

Operator: Alphabet/Google; Meta/Facebook
Developer: Alphabet/Google
Country: USA
Sector: Research/academia; Technology
Purpose: Train large language models
Technology: Dataset/database
Issue: Bias/discrimination - religion; Copyright; Mis/disinformation; Privacy; Safety
Transparency: Governance; Black box; Marketing

Risks and harms 🛑

The C4 dataset has been accused of containing unsafe and biased content, as well as material that violates copyright and privacy.

Transparency and accountability 🙈

Here are key transparency limitations of the C4 (Colossal Clean Crawled Corpus) large language model dataset:

Research, advocacy 🧮

Investigations, assessments, audits 🧐

Page info
Type: Data
Published: June 2023
Last updated: June 2024