LAION-5B image-text pairing dataset
LAION-5B is a large, open dataset of 5.85 billion image-text pairs designed to train AI models.
Developed by German non-profit collective LAION (Large-scale Artificial Intelligence Open Network) and funded in part by Stability AI, LAION-5B was built from the Common Crawl dataset and draws on images hosted on Getty Images, Flickr, Pinterest, and elsewhere.
The dataset has been used to train Google Imagen, Stable Diffusion, Midjourney, and hundreds of other AI image models.
LAION-5B was released in March 2022. It was preceded by LAION-400M.
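The pairing process can be illustrated schematically: Common Crawl pages yield candidate (image URL, alt-text) pairs, which LAION kept only when a CLIP model judged the image and caption sufficiently similar. The sketch below is illustrative only; the field names follow LAION's published metadata schema (URL, TEXT, dimensions, similarity score), but the records are toy values and the 0.28 cut-off is the threshold reportedly used for the English subset, not a value verified here.

```python
# Illustrative sketch of LAION-5B-style metadata records and the
# CLIP-similarity filter applied during dataset construction.
# Records and threshold are toy/assumed values, not real dataset entries.

SIMILARITY_THRESHOLD = 0.28  # reportedly the cut-off for the English subset

records = [
    {"URL": "https://example.com/cat.jpg", "TEXT": "a cat on a sofa",
     "WIDTH": 640, "HEIGHT": 480, "similarity": 0.31},
    {"URL": "https://example.com/ad.png", "TEXT": "click here now",
     "WIDTH": 300, "HEIGHT": 250, "similarity": 0.12},
]

def keep_pair(record, threshold=SIMILARITY_THRESHOLD):
    """Keep only pairs whose image and caption CLIP embeddings are similar enough."""
    return record["similarity"] >= threshold

filtered = [r for r in records if keep_pair(r)]
# Only the well-matched caption survives the filter.
```

Note that the dataset stores image URLs and metadata rather than the images themselves, which is one reason provenance and consent (see below) are hard to audit.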
Developer 🧑🏼💻
LAION
Documents 📃
Schuhmann, C. et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models
LAION (2022). LAION-5B: A new era of open large-scale multi-modal datasets
Operator: LAION
Developer: LAION
Country: Brazil; Germany; USA
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Copyright; Employment; Ethics/values; Mis/disinformation; Privacy; Safety; Security
Transparency: Governance; Complaints/appeals
Risks and harms 🛑
Critics accuse the LAION-5B dataset of violating copyright and privacy; perpetuating bias and discrimination; enabling misinformation, disinformation, and other harmful content, including material depicting the exploitation and abuse of children; and contributing to the degradation of creativity and the loss of jobs.
Transparency and accountability 🙈
Whilst nominally an 'open' dataset, LAION-5B fails to provide:
Clear documentation of where specific images originated, making it difficult to trace and verify the provenance of individual images.
Detailed documentation about the processes used to collect, filter, and curate the images in the dataset, hindering the ability to critically assess the dataset’s quality and the ethical considerations involved in its creation.
Comprehensive metadata about the images, such as context, usage rights, or the conditions under which the images were created and published, limiting the understanding of the dataset's composition and potential biases.
Information on whether or not explicit consent was obtained from the individuals depicted or the content creators, raising transparency issues regarding the permissions and rights associated with the images.
Information about the selection criteria and the inherent biases of the dataset's sources, resulting in users of LAION-5B not having sufficient information to understand these biases and their potential impacts on AI models.
Information about the methods and criteria used for cleaning, annotating, and pre-processing the images, making it challenging to assess the dataset's reliability and integrity.
A visual investigation of LAION-5B by digital think tank Open Future concluded that more dataset transparency is required to understand the structural configuration of today’s generative AI systems.
Incidents and issues 🔥
Legal, regulatory 👩🏼⚖️
Robert Kneschke v. LAION
Research, advocacy 🧮
Knowing Machines (2024). Models all the way down
Carlini N. et al (2023). Poisoning Web-Scale Training Datasets is Practical
Thiel D., Hancock J. (2023). Identifying and Eliminating CSAM in Generative ML Training Data and Models
Page info
Type: Data
Published: November 2023
Last updated: June 2024