Common Crawl

The Common Crawl dataset is a free, open archive of web crawl data that can be accessed, analysed and used by researchers, data scientists and developers.

Common Crawl comprises over 250 billion web pages, along with metadata and text extracts, spanning 17 years. The corpus is gathered by crawlers (software robots) that browse the Internet on a monthly basis, capturing web pages in over 40 languages.

The pages are stored and made publicly accessible in standardised formats.
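
Each crawl is published as WARC files (raw responses) together with WAT (metadata) and WET (extracted text) derivatives, and an index maps URLs to byte ranges inside those files. As a rough illustration of how the public data can be accessed, here is a minimal Python sketch (assuming the `requests` library) that queries the CDX index API at index.commoncrawl.org for a URL and downloads a single WARC record via an HTTP range request; the crawl ID is only an example and the helper names are ours, not part of any official client.

```python
# Minimal sketch: look up a page in a Common Crawl index and fetch its WARC record.
# The crawl ID (CC-MAIN-2024-33) is an example; any crawl listed at
# https://index.commoncrawl.org/ can be substituted.
import gzip
import io
import json

import requests

INDEX_API = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"  # example crawl
DATA_HOST = "https://data.commoncrawl.org"


def lookup(url: str) -> dict:
    """Query the CDX index API and return the first capture record for a URL."""
    resp = requests.get(INDEX_API, params={"url": url, "output": "json"}, timeout=30)
    resp.raise_for_status()
    # The API returns one JSON object per line; take the first capture.
    return json.loads(resp.text.splitlines()[0])


def fetch_warc_record(record: dict) -> bytes:
    """Download the bytes of one WARC record using an HTTP range request."""
    offset = int(record["offset"])
    length = int(record["length"])
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(f"{DATA_HOST}/{record['filename']}", headers=headers, timeout=60)
    resp.raise_for_status()
    # Each WARC record is individually gzip-compressed.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    rec = lookup("commoncrawl.org")
    warc_bytes = fetch_warc_record(rec)
    print(warc_bytes[:400].decode("utf-8", errors="replace"))
```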

Common Crawl has been used to train GPT-3 and many other large language models.

Open data

Open data is data that is openly accessible, exploitable, editable and shareable by anyone for any purpose, and is released under an open licence.

Source: Wikipedia 🔗

Dataset 🤖

Reviews 🗣️

Transparency and accountability 🙈

Common Crawl is seen as having significant transparency and accountability limitations, including:

Risks and harms 🛑

Common Crawl is seen to present multiple risks and harms associated with its use, particularly in the context of training generative AI models, including:

Page info
Type: Dataset
Published: October 2024