LAION-400M dataset features racist, derogatory, pornographic content

Occurred: December 2022

The LAION-400M dataset contained 'troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content,' according to researchers.

An audit (pdf) by University College Dublin, University of Edinburgh and UnifyID researchers found that 'When the researchers typed the word “Korean,” LAION-400M didn’t bring up images of BTS or bulgogi, but naked Korean women. Searching the word “Indian” brought up pictures of South Asian women being raped. “Best president” brought up images of Donald Trump.'

The researchers argued that LAION-400M and other large-scale datasets are rarely managed in the interests of the individuals and organisations whose data is collected and used, and that they are unlikely to have been properly cleaned.

'The rights of the data subject remain unaddressed here,' the researchers argue (pdf). 'It is reckless and dangerous to underplay the harms inherent in such large scale datasets and encourage their use in industrial and commercial settings. The responsibility of the licence scheme under which the dataset is provided falls solely on the dataset creator.'

Operator: Alphabet/Google; Prisma Labs; Stability AI
Developer: LAION
Country: Germany
Sector: Technology
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Bias/discrimination - race, ethnicity; Copyright; Ethics/values; Privacy; Safety
Transparency: Governance

Research, advocacy