LAION-5B dataset


LAION-5B is a large, open dataset of 5.85 billion image and text pairings designed to train AI models.

Developed by German non-profit collective LAION (Large-scale Artificial Intelligence Open Network) and funded in part by Stability AI, LAION-5B was built from the Common Crawl dataset and artwork at Getty Images, Flickr, Pinterest and elsewhere.

The dataset has been used to train Google Imagen, Stable Diffusion, Midjourney and hundreds of other AI image models.

LAION-5B was released in March 2022; in August 2024 it was superseded by Re-LAION-5B, claimed by LAION to be "the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM."

Multimodal learning

Multimodal learning, in the context of machine learning, is a type of deep learning using multiple modalities of data, such as text, audio, or images.

Wikipedia: Multimodal learning 🔗

Dataset 🤖

Search LAION-5B for your work 🔎

Transparency and accountability 🙈

Considered significant for its scale and relative openness, LAION-5B is nonetheless seen to suffer from several notable transparency and accountability limitations.

Risks and harms 🛑

Critics accuse the LAION-5B dataset of violating copyright and privacy, perpetuating bias and discrimination, and enabling misinformation, disinformation and other harmful content, including the exploitation and abuse of children.

More broadly, it is seen to have contributed to the degradation of creativity and the loss of jobs.

Legal, regulatory 👩🏼‍⚖️

Research, advocacy 🧮

Page info
Type: Data
Published: November 2023
Last updated: October 2024