Google GoEmotions dataset mis-labelling
Occurred: July 2022
GoEmotions is a 'fine-grained' dataset for training AI applications, such as chatbots, content moderation tools, and customer support systems, to recognise emotional sentiment in text.
Google released GoEmotions in October 2021 and describes it as 'a human-annotated dataset of 58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories.'
A July 2022 study by Surge AI of 1,000 randomly sampled labeled comments from GoEmotions found that 30% had been mislabeled, suggesting that the human-applied labels had not been properly verified.
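To give a sense of how much uncertainty a 1,000-comment sample carries, the sketch below computes a 95% Wilson score interval for the reported error rate. It assumes that '30%' corresponds to exactly 300 mislabeled comments out of 1,000 and that the sample was drawn uniformly at random; neither detail is confirmed here, and the function and variable names are illustrative only, not part of Surge AI's analysis.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 30% of the 1,000 sampled comments means 300 mislabeled comments.
sample_size = 1000
mislabeled = 300

low, high = wilson_interval(mislabeled, sample_size)
print(f"Estimated mislabel rate: {mislabeled / sample_size:.1%}")
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 27% to 33%, so the finding of widespread mislabeling does not hinge on sampling noise, although it says nothing about how 'mislabeled' was judged.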
Surge highlighted two problems with Google's methodology: missing context and the difficulty that US-centric English poses for non-American speakers. It concluded that the Indian labelers to whom Google outsourced the work were likely given 'no additional metadata' about each Reddit comment, stripping each comment of the context and meaning it would have for different types of users, not least those in the US.
TNW's Tristan Greene observed that 'any AI model trained on this dataset will produce erroneous outputs', 'causing demonstrable harm to other humans', and called for the dataset to be deleted.
Greene also criticised Google for ignoring the privacy of the Reddit community: 'It is entirely unethical to train an AI on human-created content without the expressed individual consent of the humans who created it.'
'When I post on Reddit', he says, 'I do so in the good faith that my discourse is intended for other humans. Google doesn’t compensate me for my data so it shouldn’t use it, even if the terms of service allow for it.'
News, commentary, analysis
Published: December 2022