Child sexual abuse images discovered in LAION-5B dataset

Occurred: December 2023

Researchers discovered thousands of suspected child sexual abuse images in the open source AI image dataset LAION-5B.

Using a combination of perceptual and cryptographic hash-based detection and image analysis, the Stanford Internet Observatory, working with the Canadian Centre for Child Protection and its Project Arachnid Shield API, found more than 3,200 images of suspected child sexual abuse material (CSAM) in the LAION-5B dataset. They also found 'nearest neighbor' matches within the dataset, where related images of victims were clustered together.
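To illustrate the two detection approaches mentioned above, the sketch below shows how cryptographic hashing (which flags only byte-identical files) differs from perceptual hashing (which also flags resized or re-encoded copies via a Hamming-distance threshold). This is a minimal, hypothetical example using Python's hashlib and the imagehash library; the reference hash sets, threshold value, and function names are placeholders, not the actual Stanford or Shield API pipeline, whose hash lists are held server-side and never distributed.

```python
import hashlib

import imagehash  # perceptual hashing library
from PIL import Image

# Hypothetical reference sets (placeholders). Real services such as the
# Shield API hold these lists themselves and expose only a query API.
KNOWN_MD5_HASHES = {"d41d8cd98f00b204e9800998ecf8427e"}
KNOWN_PHASHES = [imagehash.hex_to_hash("0" * 16)]

# Assumed threshold: max Hamming distance still counted as a near match.
PHASH_DISTANCE_THRESHOLD = 8


def cryptographic_match(path: str) -> bool:
    """Exact-file match: MD5 digest compared against a known-hash list."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return digest in KNOWN_MD5_HASHES


def perceptual_match(path: str) -> bool:
    """Fuzzy match: perceptual hash survives resizing and re-encoding."""
    phash = imagehash.phash(Image.open(path))
    # Subtracting two ImageHash objects yields their Hamming distance.
    return any(phash - known <= PHASH_DISTANCE_THRESHOLD
               for known in KNOWN_PHASHES)


def flag_image(path: str) -> bool:
    """Flag an image if either detection method matches."""
    return cryptographic_match(path) or perceptual_match(path)
```

The design point is that the two methods are complementary: a cryptographic hash changes completely if a single byte changes, so it cannot catch altered copies, while a perceptual hash tolerates such transformations at the cost of occasional false positives, which is why confirmed matches are typically verified through a service like the Shield API rather than acted on directly.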

LAION responded by releasing a statement saying it 'has a zero-tolerance policy for illegal content, and in an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them.' However, public chats from LAION leadership in the organisation's Discord server show that they were aware as early as 2021 of the possibility of CSAM being scraped into their datasets.

The incident raised questions about the governance of LAION and the effectiveness of its technical guardrails. It also highlighted broader concerns about the ethics of developing and publishing open source datasets without adequate oversight, specifically on the AI community platform Hugging Face; the impact on systems trained using LAION-5B, notably Stable Diffusion; and the potential harm to real victims of child sexual abuse.

Databank

Operator: David Thiel, Jeffrey Hancock
Developer: LAION
Country: Global
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Safety
Transparency: Governance