Google hate speech detection tricked by typos
Occurred: September 2018
Google's AI-powered hate speech detection system Perspective was discovered to have significant vulnerabilities that could be easily exploited.
Researchers from Aalto University and the University of Padua showed that the system could be tricked by simple modifications to text, making it less effective at identifying and flagging potentially harmful content.
Perspective assigns a toxicity score to text, categorising it as rude, disrespectful, or unreasonable enough to make someone leave a conversation.
However, the researchers found the system could be fooled by inserting typos, adding spaces between words, or including unrelated words in the original sentence.
Specifically, the system struggles to interpret expletives in context: changing "I love you" to "I fucking love you" dramatically increased the toxicity score, from 0.02 to 0.77. Conversely, "leetspeak" (replacing letters with visually similar numbers) could effectively trick the system while preserving the message's readability and emotional impact.
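The character-level tricks described above can be illustrated with a short sketch. This is not the researchers' code or the Perspective API; it is a hypothetical example of the three perturbation types (leetspeak substitution, inserted spaces, and simple typos), each of which changes a string's surface form while keeping it readable to humans:

```python
# Illustrative sketch of the perturbation types described in the research.
# None of this calls Perspective; it only shows how the input text is altered.

# Common letter-to-digit substitutions used in leetspeak.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def leetspeak(text: str) -> str:
    """Replace letters with look-alike digits, e.g. 'idiot' -> '1d10t'."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)

def insert_spaces(text: str) -> str:
    """Split each word into spaced characters, e.g. 'idiot' -> 'i d i o t'."""
    return " ".join(" ".join(word) for word in text.split())

def add_typo(word: str) -> str:
    """Duplicate a middle character to create a simple typo."""
    if len(word) < 3:
        return word
    mid = len(word) // 2
    return word[:mid] + word[mid] + word[mid:]

print(leetspeak("idiot"))      # 1d10t
print(insert_spaces("idiot"))  # i d i o t
print(add_typo("idiot"))       # idiiot
```

A classifier trained on clean text may assign a much lower toxicity score to the perturbed strings, even though a human reader recovers the original word effortlessly.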
This research highlighted the unreliability of AI-based hate speech detection systems and the ease with which they can be bypassed, allowing toxic content to evade moderation.