LAION-400M image-text pairing dataset
LAION-400M is a large, open dataset of 400 million image-text pairs.
Developed by German non-profit collective LAION and released in 2021, LAION-400M was used to train Imagen, Lensa, Stable Diffusion, and other text-to-image models.
The dataset's successor, LAION-5B, comprises over 5 billion image-text pairs.
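LAION distributes the dataset as metadata shards, image URLs paired with their scraped alt-text captions and filtering scores, rather than the images themselves. Below is a minimal sketch of inspecting one such shard; the shard path and the exact column names (URL, TEXT, similarity, NSFW) are assumptions based on the published release format rather than a verified reading of it.

```python
# Minimal sketch: inspect one LAION-400M metadata shard with pandas.
# The file path and column names (URL, TEXT, similarity, NSFW) are
# assumptions based on the published release format, not verified here.
import pandas as pd

shard_path = "laion400m-meta/part-00000.parquet"  # hypothetical local copy

df = pd.read_parquet(shard_path)

# Each row pairs an image URL with its scraped alt-text caption,
# plus the CLIP similarity score and NSFW tag used during filtering.
print(df.columns.tolist())
print(df[["URL", "TEXT", "similarity", "NSFW"]].head())

# Example of a post-hoc filter a downstream user might apply:
high_confidence = df[(df["similarity"] > 0.35) & (df["NSFW"] == "UNLIKELY")]
print(f"kept {len(high_confidence)} of {len(df)} rows")
```

Downstream users typically fetch the underlying images themselves (for example with the img2dataset tool), which is one reason the sourcing and consent questions raised below are difficult to audit.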
Operator: Alphabet/Google; Prisma Labs; Stability AI
Developer: LAION
Country: Germany
Sector: Technology
Purpose: Train text-to-image and other multimodal models
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Bias/discrimination - race, ethnicity; Copyright; Ethics/values; Privacy; Safety
Transparency: Governance
Risks and harms
The LAION-400M dataset is accused of violating privacy, enabling the generation of offensive, hateful, explicit, and derogatory content, and perpetuating biases due to unfiltered, large-scale web-scraped data.
Transparency and accountability
The LAION-400M dataset is regarded as having several significant transparency limitations.
Source data. While LAION-400M contains image-text pairs scraped from the internet, there is limited transparency about the specific websites and sources used.
Scraping methodology. Details of the exact web scraping techniques and filtering processes are not fully disclosed, beyond the published CLIP-similarity filtering step (illustrated in the sketch after this list).
Consent and copyright. It's unclear if proper consent was obtained for all images and text, or how copyright issues were addressed.
Demographic representation. There's limited information on the demographic distribution and potential biases in the dataset.
Content moderation. The extent and methods of content moderation to remove inappropriate or offensive material are not fully detailed.
Quality assessment. Criteria for assessing the quality and relevance of image-text pairs are not thoroughly explained.
Versioning and updates. Information about dataset versions, updates, or changes over time may be limited.
Downstream impact. There's a lack of comprehensive analysis on how using this dataset impacts AI models trained on it.
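LAION's published methodology describes filtering scraped pairs by CLIP image-text similarity, dropping pairs whose cosine similarity falls below roughly 0.3. The sketch below illustrates that kind of check using the Hugging Face transformers CLIP implementation; the model checkpoint, file name, caption, and threshold are illustrative assumptions rather than a reconstruction of LAION's actual pipeline.

```python
# Illustrative sketch of a CLIP-similarity filter of the kind the LAION-400M
# paper describes; not LAION's actual code. Model, threshold, and sample
# inputs are assumptions chosen for the example.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

THRESHOLD = 0.3  # roughly the cut-off reported for LAION-400M
image = Image.open("scraped_image.jpg")   # hypothetical scraped image
caption = "a dog playing in the snow"     # its scraped alt-text
keep = clip_similarity(image, caption) >= THRESHOLD
print("pair kept" if keep else "pair dropped")
```

A similarity cut-off of this kind removes mismatched captions, but it does not by itself screen out the offensive, private, or unlawful material documented in the critiques above and the research below.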
Research, advocacy
Birhane A., Luccioni A.S., et al. (2023). Into the LAIONs Den: Investigating Hate in Multimodal Datasets
Thiel D., Hancock J. (2023). Identifying and Eliminating CSAM in Generative ML Training Data and Models
Birhane A., Prabhu V.U., Kahembwe E. (2021). Multimodal datasets: misogyny, pornography, and malignant stereotypes
Page info
Type: Data
Published: October 2021
Last updated: June 2024