LAION-400M dataset

Page created: October 2021
Updated: May 2022

LAION-400M is an open dataset of 400 million image-text pairs drawn from web content collected between 2014 and 2020. It was created by a group of researchers led by Christoph Schuhmann at the University of Vienna.

Launched in 2021, LAION-400M is touted as the world's largest openly available dataset of image-text pairs.

NSFT (not safe for training) content

A 2021 audit of LAION-400M by Abeba Birhane, Vinay Uday Prabhu and Emmanuel Kahembwe at University College Dublin, University of Edinburgh and UnifyID found that it contains 'troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.'

As Politico notes: 'When the researchers typed the word “Korean,” the model didn’t bring up images of BTS or bulgogi, but naked Korean women. Searching the word “Indian” brought up pictures of South Asian women being raped. “Best president” brought up images of Donald Trump.'

Privacy, copyright

The researchers go on to argue that LAION-400M and other large-scale datasets are rarely managed in the interests of the individuals and organisations whose data is being collected and used.

For instance, such datasets regularly contain personal data collected without consent, and their creators sometimes make it purposely difficult for individuals to have their data removed.

Dataset creators also need to be more careful about licensing, given that content collected for their projects is often second- or third-hand and may not have been properly cleaned.

'The rights of the data subject remain unaddressed here,' the researchers argue. 'It is reckless and dangerous to underplay the harms inherent in such large scale datasets and encourage their use in industrial and commercial settings. The responsibility of the licence scheme under which the dataset is provided falls solely on the dataset creator.'

Developer: LAION; Jenia Jitsev; Richard Vencu; Christoph Schuhmann
Sector: Research/academia
Purpose: Improve large language training models
Technology: Dataset; Machine learning
Issue: Offensive/inappropriate content; Bias/discrimination - race, ethnicity; Privacy; Copyright
Opacity: Governance