Occurred: November 2023
Page published: November 2023
High-profile text-to-image models can be manipulated into generating violent, nude and sexually explicit images using 'sneaky' prompts, according to researchers, raising questions about their safety and security.
Researchers at Johns Hopkins University used a new jailbreaking method dubbed 'SneakyPrompt' to feed text-to-image generation models indirect, or 'sneaky', prompts, causing them to generate violent and sexually explicit imagery that platform safeguards were intended to block.
The process used reinforcement learning (RL) to search for written prompts that read as gibberish to humans but that the image models interpret as hidden requests for disturbing images, allowing them to slip past the platforms' safety filters.
For example, the researchers replaced the term 'naked', which is banned by OpenAI, with the term 'grponypui', resulting in the generation of explicit imagery.
When the nonsense word 'sumowtawgha' was given to DALL-E 2, it created realistic pictures of naked people, and when 'crystaljailswamew' was entered into DALL-E, it returned a picture of a murder scene.
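The attack pattern described above can be pictured as a query-based search against a black-box filter. The sketch below is a heavily simplified, hypothetical illustration, not the SneakyPrompt implementation: filter_blocks is a toy keyword filter standing in for a real platform safeguard, and random candidate tokens stand in for the reinforcement-learning policy the researchers actually used.

```python
# Minimal sketch of a black-box filter-bypass search (hypothetical, for illustration only).
import random
import string

BLOCKLIST = {"naked", "blood"}  # toy stand-in for a platform's banned terms


def filter_blocks(prompt: str) -> bool:
    """Toy keyword filter: block the prompt if any banned term appears."""
    return any(term in prompt.lower() for term in BLOCKLIST)


def random_token(length: int = 10) -> str:
    """Nonsense candidate token; a stand-in for the RL policy's proposal."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def search_bypass(prompt: str, banned_word: str, max_queries: int = 20) -> str | None:
    """Repeatedly query the filter with the banned word swapped for a candidate
    token until one passes. A real attack would also score candidates on how
    closely the resulting image matches the blocked intent; that feedback is
    what the researchers' RL agent learns from. (The article notes filters
    being bypassed within roughly 20 queries.)"""
    for _ in range(max_queries):
        candidate = prompt.replace(banned_word, random_token())
        if not filter_blocks(candidate):
            return candidate  # passes the filter; an attacker would now query the image model
    return None


if __name__ == "__main__":
    print(search_bypass("a photo of a naked person", "naked"))
```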
The technique raised concerns about the adequacy of safety measures and the potential misuse of Stable Diffusion, DALL-E, Midjourney, and other text-to-image systems.
The vulnerability stems from how AI safety filters are typically designed: most function as 'black boxes' or simple keyword classifiers that sit 'on top' of the generative model.
Transparency & accountability limitations: AI developers often do not disclose the specific architecture of their safety filters. This 'security through obscurity' is intended to stop attackers gaming the system, but it also lets automated tools like SneakyPrompt treat the filter as a puzzle to be solved: by observing which prompts are blocked and which are allowed, the RL agent can learn the 'shape' of the filter and find its gaps.
Keyword vs. semantic understanding: Filters often prioritise blocking specific 'trigger' words (e.g. 'naked', 'blood'). SneakyPrompt exploits this by using semantically similar but nonsensical tokens that the filter doesn't recognise as harmful but that the model's internal embeddings still associate with the prohibited concept, as illustrated in the sketch below.
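The gap between surface keyword matching and embedding-level meaning can be shown with a toy example. The vectors below are made-up two-dimensional stand-ins, not real model embeddings; the point is only that a nonsense token can share no characters with a banned word while sitting close to it in the space the generator actually uses.

```python
# Hedged illustration of the keyword vs. embedding gap (toy vectors, not real embeddings).
import math

BLOCKLIST = {"naked"}

# Hypothetical 2-D 'embeddings': the adversarial token from the study sits
# close to the banned concept even though it shares no characters with it.
TOY_EMBEDDINGS = {
    "naked": (0.95, 0.10),
    "grponypui": (0.90, 0.15),
    "landscape": (0.05, 0.99),
}


def keyword_filter_blocks(prompt: str) -> bool:
    """Surface-level check: only exact banned strings are caught."""
    return any(term in prompt.lower() for term in BLOCKLIST)


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


for token in ("naked", "grponypui", "landscape"):
    blocked = keyword_filter_blocks(f"a photo of a {token} person")
    similarity = cosine(TOY_EMBEDDINGS[token], TOY_EMBEDDINGS["naked"])
    print(f"{token:12s} blocked={blocked!s:5s} similarity_to_banned={similarity:.2f}")
```

In real systems the embeddings are learned by the generative model itself, so a filter that only inspects the surface text has little chance of catching every token that lands near a prohibited concept.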
The findings have implications for individual safety and broader societal trust:
For users/misusers: The ease of bypassing filters significantly lowers the barrier for creating non-consensual intimate imagery (NCII) or 'deepfakes' of real people, including public figures. This creates immediate risks for harassment, extortion, and reputational damage.
For society: It demonstrates that current AI safety measures are 'cat and mouse' games rather than robust solutions. If filters can be bypassed with 20 simple queries, the current regulatory reliance on 'safety layers' may be misplaced.
Developer: Midjourney; OpenAI; Stability AI
Country: Global
Sector: Media/entertainment/sports/arts
Purpose: Generate images
Technology: Text-to-image; Diffusion model; Neural network; Deep learning; Machine learning
Issue: Safety; Security
AIAAIC Repository ID: AIAAIC1203