LAION-5B image-text pairing dataset
LAION-5B is a large, open dataset of 5.85 billion image-text pairs designed to train AI models.
Developed by German non-profit collective LAION (Large-scale Artificial Intelligence Open Network) and funded in part by Stability AI, LAION-5B was built from the Common Crawl dataset and draws on images hosted on Getty Images, Flickr, Pinterest, and elsewhere.
The dataset has been used to train Google Imagen, Stable Diffusion, Midjourney, and hundreds of other AI image models.
LAION-5B was released in March 2022. It was preceded by LAION-400M.
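The pairing process can be illustrated schematically: Common Crawl pages yield candidate (image URL, alt-text) pairs, which LAION kept only when a CLIP model judged the image and caption sufficiently similar. The sketch below is illustrative only; the field names follow LAION's published metadata schema (URL, TEXT, dimensions, similarity score), but the records are toy values and the 0.28 cut-off is the threshold reportedly used for the English subset, not a value verified here.

```python
# Illustrative sketch of LAION-5B-style metadata records and the
# CLIP-similarity filter applied during dataset construction.
# Records and threshold are toy/assumed values, not real dataset entries.

SIMILARITY_THRESHOLD = 0.28  # reportedly the cut-off for the English subset

records = [
    {"URL": "https://example.com/cat.jpg", "TEXT": "a cat on a sofa",
     "WIDTH": 640, "HEIGHT": 480, "similarity": 0.31},
    {"URL": "https://example.com/ad.png", "TEXT": "click here now",
     "WIDTH": 300, "HEIGHT": 250, "similarity": 0.12},
]

def keep_pair(record, threshold=SIMILARITY_THRESHOLD):
    """Keep only pairs whose image and caption CLIP embeddings are similar enough."""
    return record["similarity"] >= threshold

filtered = [r for r in records if keep_pair(r)]
# Only the well-matched caption survives the filter.
```

Note that the dataset stores image URLs and metadata rather than the images themselves, which is one reason provenance and consent (see below) are hard to audit.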
Developer 🧑🏼💻
LAION
Documents 📃
Schuhmann, C. et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models
LAION (2022). LAION-5B: A new era of open large-scale multi-modal datasets
Operator: LAION
Developer: LAION
Country: Brazil; Germany; USA
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Copyright; Employment; Ethics/values; Mis/disinformation; Privacy; Safety; Security
Transparency: Governance; Complaints/appeals
Risks and harms 🛑
Critics accuse the LAION-5B dataset of violating copyright and privacy; perpetuating bias and discrimination; enabling misinformation, disinformation, and other harmful content, including material depicting the exploitation and abuse of children; and contributing to the degradation of creativity and the loss of jobs.
Transparency and accountability 🙈
Whilst nominally an 'open' dataset, LAION-5B fails to provide:
Clear documentation of where specific images originated, making it difficult to trace and verify the provenance of individual images.
Detailed documentation about the processes used to collect, filter, and curate the images in the dataset, hindering the ability to critically assess the dataset’s quality and the ethical considerations involved in its creation.
Comprehensive metadata about the images, such as context, usage rights, or the conditions under which the images were created and published, limiting the understanding of the dataset's composition and potential biases.
Information on whether or not explicit consent was obtained from the individuals depicted or the content creators, raising transparency issues regarding the permissions and rights associated with the images.
Information about the selection criteria and the inherent biases of the dataset's sources, resulting in users of LAION-5B not having sufficient information to understand these biases and their potential impacts on AI models.
Information about the methods and criteria used for cleaning, annotating, and pre-processing the images, making it challenging to assess the dataset's reliability and integrity.
A visual investigation of LAION-5B by digital think tank Open Future concluded that more dataset transparency is required to understand the structural configuration of today’s generative AI systems.
Incidents and issues 🔥
Legal, regulatory 👩🏼⚖️
Robert Kneschke v. LAION
Research, advocacy 🧮
Knowing Machines (2024). Models all the way down
Carlini N. et al (2023). Poisoning Web-Scale Training Datasets is Practical
Thiel D., Hancock J. (2023). Identifying and Eliminating CSAM in Generative ML Training Data and Models
Page info
Type: Data
Published: November 2023
Last updated: June 2024