LAION-400M is a large, open dataset of 400 million image and text pairings.
Developed by the German non-profit collective LAION and launched in 2021, LAION-400M was used to train Imagen, Lensa, Stable Diffusion and other text-to-image models.
The dataset's successor, LAION-5B, comprises more than 5 billion pairings.
Multimodal learning
Multimodal learning, in the context of machine learning, is a type of deep learning using multiple modalities of data, such as text, audio, or images.
Wikipedia: Multimodal learning
The LAION-400M dataset is considered to have several important transparency limitations:
Source data. LAION provides little information on the websites and sources used to create its dataset.
Scraping methodology. Details about LAION's web scraping techniques and filtering processes are not fully disclosed.
Copyright and privacy. It is unclear if proper consent was obtained for all images and text, or how copyright issues were addressed.
Demographic representation. There is limited information on the demographic distribution of the dataset and its potential biases.
Content moderation. The extent and methods used to remove inappropriate or offensive material are not fully detailed.
Quality assessment. Criteria for assessing the quality and relevance of image-text pairs are not thoroughly explained.
Downstream impact. There is little analysis of how using this dataset affects the models trained on it.
The LAION-400M dataset is accused of violating privacy, enabling the generation of offensive, hateful, explicit and derogatory content, and perpetuating biases, because it consists of unfiltered, large-scale web-scraped data.
Birhane A., Luccioni A.S. et al. (2023). Into the LAION's Den: Investigating Hate in Multimodal Datasets
Thiel D., Hancock J. (2023). Identifying and Eliminating CSAM in Generative ML Training Data and Models
Birhane A., Prabhu V.U., Kahembwe E. (2021). Multimodal datasets: misogyny, pornography, and malignant stereotypes
Page info
Type: Data
Published: October 2021
Last updated: October 2024