Google hate speech detection tricked by typos
Occurred: September 2018
Google's AI-powered hate speech detection system Perspective was discovered to have significant vulnerabilities that could be easily exploited.
Researchers from Aalto University and the University of Padua showed that the system could be tricked by simple modifications to text, making it less effective at identifying and flagging potentially harmful content.
Perspective assigns a toxicity score to text, categorising it as rude, disrespectful, or unreasonable enough to make someone leave a conversation.
However, the researchers found the system could be fooled by inserting typos, adding spaces between words, or including unrelated words in the original sentence.
Specifically, the system struggles to interpret expletives in context: changing "I love you" to "I fucking love you" dramatically increased the toxicity score, from 0.02 to 0.77. Conversely, "leetspeak" (replacing letters with visually similar numbers) could effectively trick the system while preserving the message's readability and emotional impact.
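The character-level tricks described above can be illustrated with a short sketch. This is not the researchers' code or the Perspective API; it is a hypothetical example of the three perturbation types (leetspeak substitution, inserted spaces, and simple typos), each of which changes a string's surface form while keeping it readable to humans:

```python
# Illustrative sketch of the perturbation types described in the research.
# None of this calls Perspective; it only shows how the input text is altered.

# Common letter-to-digit substitutions used in leetspeak.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def leetspeak(text: str) -> str:
    """Replace letters with look-alike digits, e.g. 'idiot' -> '1d10t'."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)

def insert_spaces(text: str) -> str:
    """Split each word into spaced characters, e.g. 'idiot' -> 'i d i o t'."""
    return " ".join(" ".join(word) for word in text.split())

def add_typo(word: str) -> str:
    """Duplicate a middle character to create a simple typo."""
    if len(word) < 3:
        return word
    mid = len(word) // 2
    return word[:mid] + word[mid] + word[mid:]

print(leetspeak("idiot"))      # 1d10t
print(insert_spaces("idiot"))  # i d i o t
print(add_typo("idiot"))       # idiiot
```

A classifier trained on clean text may assign a much lower toxicity score to the perturbed strings, even though a human reader recovers the original word effortlessly.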
This research highlighted the unreliability of AI-based hate speech detection systems and the ease with which they can be bypassed, allowing toxic content to evade moderation.