LAION-5B
LAION-5B is a large, open dataset of 5.85 billion image and text pairings designed to train AI models.
Developed by German non-profit collective LAION (Large-scale Artificial Intelligence Open Network) and funded in part by Stability AI, LAION-5B was built from the Common Crawl dataset and artwork at Getty Images, Flickr, Pinterest and elsewhere.
The dataset has been used to train Google Imagen, Stable Diffusion, Midjourney and hundreds of other AI image models.
LAION-5B was released in March 2022; in August 2024 it was superseded by Re-LAION-5B, claimed by LAION to be "the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM."
Multimodal learning
Multimodal learning, in the context of machine learning, is a type of deep learning using multiple modalities of data, such as text, audio, or images.
Wikipedia: Multimodal learning
Website
Data
Re-LAION-5B
Released: 2022
Availability: Available
Developer: LAION
Purpose: Train AI image models
Type: Database/dataset
Technique: Generative AI; Machine learning
Schuhmann C. et al (2022). LAION-5B: An open large-scale dataset for training next generation image-text models
LAION (2022). LAION-5B: A new era of open large-scale multi-modal datasets
Considered significant for its scale and relative openness, LAION-5B is seen to suffer from several notable transparency and accountability limitations:
Data provenance. There is a lack of clear documentation of where specific images originated, which hinders independent verification and auditing and makes it difficult for researchers to assess the ethical implications of using the dataset[2][4]. The obfuscation of information surrounding dataset composition limits transparency and prevents the establishment of regulatory guidelines necessary for responsible AI development.
Data collection. Detailed documentation about the processes used to collect, filter and curate the images in the dataset is lacking, hindering the ability to critically assess the dataset's quality and the ethical considerations involved in its creation. Furthermore, the absence of a "human in the loop" during data collection means that there is minimal accountability for content included in the dataset. This reliance on automated systems has resulted in a dataset that may perpetuate biases and include harmful material without proper checks or balances.
Data integrity. Information about the methods and criteria used for cleaning, annotating, and pre-processing the images is lacking, making it challenging to assess the dataset's reliability and integrity.
Privacy. Information on whether or not explicit consent was obtained from the individuals depicted or the content creators is not provided, raising transparency issues regarding the permissions and rights associated with the images.
Critics accuse the LAION-5B dataset of violating copyright and privacy, perpetuating bias and discrimination, enabling misinformation and disinformation and other harmful content, including the exploitation and abuse of children.
More broadly, it is seen to have contributed to the degradation of creativity and the loss of jobs.
July 2024. Images of Australian children are used to train AI
June 2024. LAION-5B links to photos of identifiable Brazilian children
December 2023. Child sex abuse images discovered on LAION-5B, LAION-400M datasets
April 2023. LAION trains Robert Kneschke photos without consent
September 2022. Artist's private medical image trains LAION dataset
February 2022. Database of 16,000+ artists used to train Midjourney
Robert Kneschke v. LAION
Knowing Machines (2024). Models all the way down
Carlini N. et al (2023). Poisoning Web-Scale Training Datasets is Practical
Thiel D., Hancock J. (2023). Identifying and Eliminating CSAM in Generative ML Training Data and Models
https://hbr.org/2023/04/generative-ai-has-an-intellectual-property-problem
https://www.techpolicy.press/laion5b-stable-diffusion-and-the-original-sin-of-generative-ai/
https://openfuture.eu/note/seeing-like-an-algorithm-a-closer-look-at-laion-5b/
https://www.newyorker.com/culture/infinite-scroll/is-ai-art-stealing-from-artists
Page info
Type: Data
Published: November 2023
Last updated: October 2024