LAION-400M image-text pairing dataset

LAION-400M is a large, openly available dataset of 400 million image-text pairs developed by the German non-profit collective LAION. The dataset's successor, LAION-5B, comprises 5 billion pairs.

Launched in 2020, LAION-400M was used to train Imagen, Lensa, Stable Diffusion, and other text-to-image models.

Dataset databank

Operator: Alphabet/Google; Prisma Labs; Stability AI
Developer: LAION
Country: Germany
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Ethics; Safety; Bias/discrimination - race, ethnicity; Privacy; Copyright
Transparency: Governance

Safety

A 2021 audit of LAION-400M by researchers at University College Dublin, the University of Edinburgh and UnifyID found that it contained 'troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.'

Politico noted: 'When the researchers typed the word “Korean,” LAION-400M didn’t bring up images of BTS or bulgogi, but naked Korean women. Searching the word “Indian” brought up pictures of South Asian women being raped. “Best president” brought up images of Donald Trump.'

Privacy

The researchers argue that LAION-400M and other large-scale datasets are rarely managed in the interests of the individuals and organisations whose data is collected and used. For instance, these datasets often contain personal data collected without consent, and their maintainers sometimes make it purposely difficult for individuals to have their data removed.

Copyright

Because content collected for large-scale datasets is often second or third hand and is unlikely to have been properly cleaned, the researchers recommend that dataset creators be much more careful about licensing.

'The rights of the data subject remain unaddressed here,' the researchers argue. 'It is reckless and dangerous to underplay the harms inherent in such large scale datasets and encourage their use in industrial and commercial settings. The responsibility of the licence scheme under which the dataset is provided falls solely on the dataset creator.'

Page info
Type: Data
Published: October 2021
Last updated: August 2023