Common Crawl
The Common Crawl dataset is a free, open archive of web crawl data that can be accessed, analysed and used by researchers, data scientists and developers.
The corpus comprises more than 250 billion web pages, plus metadata and text extracts, collected over 17 years. Common Crawl's crawler, a software robot, browses the internet monthly and captures web pages in over 40 languages.
The pages are stored and made publicly accessible in standardised formats.
Common Crawl has been used to train GPT-3 and many other large language models.
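The standardised formats make the archive directly scriptable: each monthly crawl is published as gzipped WARC files alongside a queryable CDX index. Below is a minimal Python sketch of looking up a page in the index and fetching its archived record; the crawl label CC-MAIN-2024-10 is an example only, and current labels are listed at index.commoncrawl.org.

```python
import gzip
import io
import json

import requests

# Query the Common Crawl CDX index for captures of a URL.
# CC-MAIN-2024-10 is an example crawl label; current labels are
# listed at https://index.commoncrawl.org/.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

resp = requests.get(INDEX, params={"url": "commoncrawl.org", "output": "json"})
resp.raise_for_status()
record = json.loads(resp.text.splitlines()[0])  # first capture of the URL

# Each index entry points into a gzipped WARC file; fetch only the
# byte range holding this record rather than the multi-gigabyte archive.
start = int(record["offset"])
end = start + int(record["length"]) - 1
warc = requests.get(
    "https://data.commoncrawl.org/" + record["filename"],
    headers={"Range": f"bytes={start}-{end}"},
)
warc.raise_for_status()

# The range is a self-contained gzip member holding one WARC record:
# WARC headers, then the HTTP response, then the page content.
with gzip.open(io.BytesIO(warc.content), "rt", errors="replace") as f:
    print(f.read(600))
```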
Open data
Open data is data that anyone can openly access, exploit, edit and share for any purpose. Open data is released under an open licence.
Source: Wikipedia 🔗
Website 🔗
Status: Active
Released: 2008
Developer: Common Crawl Foundation
Purpose: Provide open data
Type: Database/dataset
Technique: Machine learning
Common Crawl is seen to suffer from significant transparency and accountability limitations, including:
Data collection and processing. There is insufficient transparency about the methods used for data collection, the filtering processes, and the criteria for determining "high-quality" content. This opacity can lead to misunderstandings about the dataset's reliability.
Common Crawl is seen to present multiple risks and harms, particularly in the context of training generative AI models, including:
Bias/discrimination. The crawling methodology prioritises frequently linked pages, often leading to underrepresentation of digitally marginalised communities. Most content is in English, which limits the dataset's global applicability and inclusivity.
Simplistic filtering. Filtered versions of Common Crawl used for AI training often rely on basic algorithms that fail to adequately remove harmful content, which can result in biased outputs that reinforce stereotypes and misrepresent diverse perspectives (see the sketch after this list).
Copyright. Common Crawl gathers data from various websites, including copyrighted material, without explicit permission. This raises significant legal concerns regarding intellectual property rights.
Safety. The dataset may inadvertently include illegal or socially unacceptable material, such as hate speech or explicit content. This poses risks for researchers and organisations that may unintentionally propagate such content.
Research validity. The uncurated nature of Common Crawl data may compromise the validity of research findings derived from it. Researchers may draw conclusions based on datasets that do not accurately represent the broader web or societal contexts.
Potential for harmful outputs. When used in AI training, the lack of careful curation can lead to models that generate harmful or misleading outputs, impacting users and society at large.
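To make the "simplistic filtering" point concrete, here is a minimal sketch of the word-blocklist filtering that C4-style pipelines apply to Common Crawl text. BLOCKLIST is a tiny hypothetical stand-in for the real lists, which run to hundreds of terms.

```python
# Minimal sketch of C4-style word-blocklist filtering; BLOCKLIST is a
# tiny hypothetical stand-in for real lists with hundreds of terms.
BLOCKLIST = {"sex", "violence"}

def passes_filter(document: str) -> bool:
    """Keep a document only if none of its words appear on the blocklist."""
    tokens = {token.strip(".,!?\"'").lower() for token in document.split()}
    return tokens.isdisjoint(BLOCKLIST)

# Over-removal: benign health or educational content is dropped...
print(passes_filter("A clinic guide to safe sex education."))  # False
# ...while hostile text that avoids the listed words sails through.
print(passes_filter("People like that deserve whatever happens to them."))  # True
```

Real pipelines use far longer lists, but the failure mode is the same: the filter keys on surface vocabulary rather than meaning, so it simultaneously removes legitimate content from marginalised communities and misses implicitly harmful text.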
Incidents and issues
July 2024. Images of Australian children are used to train AI
June 2024. LAION-5B links to photos of identifiable Brazilian children
May 2024. OpenAI deleted training datasets believed to contain copyrighted books
February 2024. Three news publishers sue OpenAI for copyright infringement
December 2023. Child sex abuse images discovered on LAION-5B, LAION-400M datasets
April 2023. LAION trains on Robert Kneschke's photos without consent
April 2023. C4 dataset includes unsafe, copyright-protected web content
September 2022. Artist's private medical image found in LAION dataset
February 2022. Database of 16,000+ artists used to train Midjourney
January 2021. GPT-3 associates Muslims with violence
Research, advocacy
Mozilla Foundation. How Common Crawl’s Data Infrastructure Shaped the Battle Royale over Generative AI
Mozilla Foundation. Training data for the price of a sandwich
Baack S. A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl
Luccioni S., Viviano J. What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus
Page info
Type: Dataset
Published: October 2024