LAION-5B image-text pairing dataset
LAION-5B is a large, open dataset of 5.85 billion image and text pairings developed by German non-profit collective LAION (Large-scale Artificial Intelligence Open Network).
Funded in part by Stability AI and released in March 2022, LAION-5B was built from the Common Crawl dataset and has been used to train Google Imagen, Stable Diffusion, Midjourney, and hundreds of other AI image models.
Its predecessor was LAION-400M.
Dataset databank 🔢
Operator: LAION
Developer: LAION
Country: Germany
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Copyright; Employment; Ethics/values; Privacy; Safety
Transparency: Governance; Complaints/appeals
Copyright violations
LAION-5B was built using bots that crawled billions of websites, including large repositories of artwork at Getty Images, Flickr, Pinterest, and more, and collected millions of copyrighted images without permission.
German stock photographer Robert Kneschke discovered that his photos had been included in LAION-5B, raising further questions about copyright protection against AI datasets and systems, and about the practices and ethics of the dataset's eponymous developer.
LAION’s use of scraped web data has led to it being associated with 'data laundering' and named in legal disputes. For instance, artists sued Stability AI and Midjourney for using copyrighted works in AI model development. However, LAION’s approach of supplying web links to images, rather than hosting the images directly, may insulate it from copyright claims.
Privacy loss
AI artist 'Lapine' found that private medical photographs meant to be available only to her doctor had been included in LAION-5B. The dataset is supposed to use only publicly available images on the web.
Child sex abuse images
Stanford University researchers discovered thousands of child sex abuse images in LAION-5B, prompting its developers to take the dataset down until it was considered safe to republish.
Loss of creativity, jobs
LAION-5B's association with Stable Diffusion, Midjourney and other image generators has led to it being seen as involved in the 'theft' of art from artists, and as contributing to the potential or actual degradation of creativity and loss of jobs.
System documents 📚
Schuhmann C. et al (2022). LAION-5B: An open large-scale dataset for training next generation image-text models
LAION (2022). LAION-5B: A new era of open large-scale multi-modal datasets
Research, advocacy 🧮
Knowing Machines (2024). Models all the way down
Carlini N. et al (2023). Poisoning Web-Scale Training Datasets is Practical
Page info
Type: Data
Published: November 2023
Last updated: March 2024