Chatbot guardrails bypassed using lengthy character suffixes
Occurred: July 2023
Bard, ChatGPT, and Claude safety rules can be bypassed in 'virtually unlimited ways', researchers have discovered.
Using jailbreaks developed against open-source models, researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI demonstrated that automated adversarial attacks, which append specially crafted character sequences to the end of user queries, can overcome safety rules and provoke chatbots into producing harmful content, misinformation, or hate speech.
Furthermore, the researchers said they could develop a 'virtually unlimited' number of similar attacks given the automated nature of the jailbreaks.
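The mechanics of such an attack can be illustrated with a toy sketch. The real method (Zou et al., 2023) optimises a suffix using greedy coordinate gradient descent over an open-source model's token gradients; the version below is only a hypothetical stand-in that uses random search against an arbitrary scoring function, to show the overall shape of suffix optimisation: mutate one suffix position at a time and keep changes that lower the loss.

```python
import random

def toy_loss(prompt: str) -> float:
    # Hypothetical stand-in for the target model's loss on a forced
    # "compliant" reply; the actual attack computes this from the
    # gradients of an open-source LLM.
    return sum(ord(c) for c in prompt) % 97

def greedy_suffix_search(query: str, vocab: list, suffix_len: int = 8,
                         iters: int = 200, seed: int = 0) -> str:
    """Randomly mutate one suffix position at a time, keeping improvements."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = toy_loss(query + " " + "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)       # pick a position to mutate
        trial = suffix[:]
        trial[pos] = rng.choice(vocab)        # try a replacement character
        loss = toy_loss(query + " " + "".join(trial))
        if loss < best:                       # keep only improvements
            best, suffix = loss, trial
    return "".join(suffix)

vocab = list("!#$%&()*+,-./:;?@[]^_{}|~")
adversarial_suffix = greedy_suffix_search("example query", vocab)
full_prompt = "example query" + " " + adversarial_suffix
```

Because the search is fully automated, re-running it with different seeds or vocabularies yields ever more suffixes, which is what makes the supply of such attacks 'virtually unlimited'.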
Databank
Operator: Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson
Developer: Anthropic; Alphabet/Google; Microsoft; OpenAI
Country: USA
Sector: Technology
Purpose: Generate text
Technology: Chatbot; NLP/text analysis; Neural network; Deep learning; Machine learning
Issue: Mis/disinformation; Safety; Security
Transparency: Governance
System
Research, advocacy
Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models
News, commentary, analysis
https://www.businessinsider.com/ai-researchers-jailbreak-bard-chatgpt-safety-rules-2023-7
https://www.nytimes.com/2023/07/27/business/ai-chatgpt-safety-research.html
https://www.zdnet.com/article/vulnerabilities-in-chatgpt-and-other-chatbots/
https://www.theregister.com/2023/07/27/llm_automated_attacks/
Page info
Type: Issue
Published: November 2023