Child sex abuse images discovered in LAION-5B dataset
Occurred: December 2023
Researchers discovered thousands of child sexual abuse images in the open-source AI image dataset LAION-5B.
Using a combination of perceptual and cryptographic hash-based detection and image analysis, the Stanford Internet Observatory, working with the Canadian Centre for Child Protection and its Project Arachnid Shield API, found more than 3,200 images of suspected child sexual abuse material (CSAM) in the LAION-5B dataset. They also found 'nearest neighbor' matches within the dataset, where related images of victims were clustered together.
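The two detection techniques named above work differently: cryptographic hashes flag only exact byte-for-byte copies of known material, while perceptual hashes flag visually similar near-duplicates. A minimal illustrative sketch of both ideas, using stdlib Python only, with placeholder hash values and a hypothetical distance threshold (this is not the researchers' actual pipeline):

```python
import hashlib

# Placeholder "known-bad" list; real systems query services such as
# Project Arachnid's Shield API rather than a local set.
KNOWN_BAD_SHA256 = {
    # SHA-256 of the empty byte string, used here purely as a stand-in.
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def sha256_match(data: bytes) -> bool:
    """Exact-match detection: flag content whose cryptographic hash
    appears on a known-material list. Any single-bit change defeats it."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256

def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits between two fixed-length perceptual hashes."""
    return bin(h1 ^ h2).count("1")

def perceptual_match(h1: int, h2: int, threshold: int = 8) -> bool:
    """Near-duplicate detection: perceptual hashes of visually similar
    images differ in few bits, so a small Hamming distance is a match.
    The threshold of 8 is an illustrative assumption, not a standard."""
    return hamming_distance(h1, h2) <= threshold

print(sha256_match(b""))                 # True: matches the placeholder entry
print(perceptual_match(0xFA3C, 0xFA3D))  # True: distance 1, near-duplicate
```

Perceptual matching is also what makes the 'nearest neighbor' clustering mentioned above possible: images of the same victim land close together in hash space even when the files themselves differ.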
LAION responded by releasing a statement saying it 'has a zero-tolerance policy for illegal content, and in an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them.' However, public chats from LAION leadership in the organisation's Discord server show they were aware as early as 2021 of the possibility of CSAM being scraped into their datasets.
The incident raised questions about the governance of LAION and the effectiveness of its technical guardrails. It also highlighted broader concerns about the ethics of developing and publishing open-source datasets without adequate oversight, notably at the AI community Hugging Face; the impact on systems trained using LAION-5B, notably Stable Diffusion; and the potential impact on real victims of child sexual abuse.
Databank
Operator: David Thiel, Jeffrey Hancock
Developer: LAION
Country: Global
Sector: Multiple
Purpose: Pair text and images
Technology: Database/dataset; Neural network; Deep learning; Machine learning
Issue: Safety
Transparency: Governance
System
Research, advocacy
Thiel D., Hancock J. (2023). Identifying and Eliminating CSAM in Generative ML Training Data and Models
News, commentary, analysis
Page info
Type: Incident
Published: December 2023