LAION image-text pair datasets
LAION is a German non-profit collective led by Christoph Schuhmann at the University of Vienna that has created a series of open-source datasets. These datasets have been used to train Stable Diffusion, Imagen and other text-to-image systems and models.
Released in 2021 and comprising 400 million image and text pairings, LAION-400M was touted as the world's largest openly available dataset of image and text pairings. Its successor LAION-5B, comprising over 5 billion pairings, was released in March 2022.
A 2021 audit of LAION-400M by Abeba Birhane, Vinay Uday Prabhu and Emmanuel Kahembwe at University College Dublin, University of Edinburgh and UnifyID discovered that it contains 'troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.'
As Politico notes: 'When the researchers typed the word "Korean," the model didn't bring up images of BTS or bulgogi, but naked Korean women. Searching the word "Indian" brought up pictures of South Asian women being raped. "Best president" brought up images of Donald Trump.'
The researchers argue that LAION-400M and other large-scale datasets are rarely managed in the interests of the individuals and organisations whose data is being collected and used. For instance, these systems often contain personal data collected without consent, and sometimes make it purposely difficult for individuals to remove their data.
In September 2022, AI artist 'Lapine' discovered that private medical photographs meant only to be available to her doctor had been used to build the image-text dataset LAION-5B. The dataset is supposed to contain only publicly available images from the web.
Furthermore, given that content collected for such large-scale datasets is often second- or third-hand and unlikely to have been properly cleaned, the researchers recommend that dataset creators be much more careful about licensing.
'The rights of the data subject remain unaddressed here,' the researchers argue. 'It is reckless and dangerous to underplay the harms inherent in such large scale datasets and encourage their use in industrial and commercial settings. The responsibility of the licence scheme under which the dataset is provided falls solely on the dataset creator.'
Operator: Stability AI
Developer: LAION; Jenia Jitsev; Richard Vencu; Christoph Schuhmann
Sector: Technology; Research/academia
Purpose: Train large machine learning models
Technology: Dataset; Machine learning
Issue: Safety; Bias/discrimination - race, ethnicity; Privacy; Copyright
Research, audits, investigations, inquiries, litigation
Birhane A., Prabhu V.U., Kahembwe E. (2021). Multimodal datasets: misogyny, pornography, and malignant stereotypes
News, commentary, analysis
Published: October 2021
Last updated: December 2022