LAION-5B image-text pairing dataset

LAION-5B is a large, open dataset of 5.85 billion image-text pairs designed for training AI models.

Developed by the German non-profit collective LAION (Large-scale Artificial Intelligence Open Network) and funded in part by Stability AI, LAION-5B was built from the Common Crawl dataset and from artwork on Getty Images, Flickr, Pinterest, and elsewhere.

The dataset has been used to train Google Imagen, Stable Diffusion, Midjourney, and hundreds of other AI image models. 

LAION-5B was released in March 2022. It was preceded by LAION-400M.

Operator: LAION
Developer: LAION
Country: Brazil; Germany; USA
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Copyright; Employment; Ethics/values; Mis/disinformation; Privacy; Safety; Security
Transparency: Governance; Complaints/appeals 

Risks and harms 🛑

Critics accuse LAION-5B of violating copyright and privacy; perpetuating bias and discrimination; enabling misinformation, disinformation, and other harmful content, including the exploitation and abuse of children; and contributing to the degradation of creativity and the loss of jobs.

Transparency and accountability 🙈

Whilst an 'open' dataset, LAION-5B fails to provide: 

A visual investigation of LAION-5B by digital think tank Open Future concluded that more dataset transparency is required to understand the structural configuration of today’s generative AI systems.

Legal, regulatory 👩🏼‍⚖️

Research, advocacy 🧮