GoEmotions dataset criticised for mislabelled data, use of user content without consent

Occurred: September 2022

Thirty percent of the labels applied to Google's GoEmotions dataset were incorrect, suggesting the data had not been properly verified by humans, according to an analysis by US-based data labelling company Surge AI.

Based on a random sample of 1,000 entries, Surge AI concluded that Google's outsourced labellers in India were likely given 'no additional metadata' about each Reddit comment, stripping the comments of the context and meaning they carry for different groups of users, not least those in the US.
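An audit of this kind is straightforward to reproduce. The sketch below is a minimal illustration, assuming the Hugging Face Hub copy of the dataset (identifier `go_emotions`) and a stubbed-out `review` function standing in for human judgement; it draws a 1,000-entry random sample and estimates the mislabel rate from reviewers' verdicts.

```python
# Minimal sketch of a label audit in the style of Surge AI's: draw a
# random sample of 1,000 labelled comments for manual re-review, then
# estimate the mislabel rate from the reviewers' verdicts.
import random

from datasets import load_dataset

ds = load_dataset("go_emotions", split="train")
label_names = ds.features["labels"].feature.names  # emotion id -> name

random.seed(0)
indices = random.sample(range(len(ds)), k=1_000)

def review(text: str, labels: list[str]) -> bool:
    """Stand-in for a human reviewer's verdict (True = labels look right).
    A real audit would show the comment and its labels to an annotator."""
    return True  # stub: accepts every label

verdicts = []
for i in indices:
    row = ds[i]
    names = [label_names[j] for j in row["labels"]]
    verdicts.append(review(row["text"], names))

mislabel_rate = verdicts.count(False) / len(verdicts)
print(f"Estimated mislabel rate: {mislabel_rate:.0%}")  # Surge AI put it near 30%
```

A sample of this size is enough for the headline claim: at 1,000 entries, an observed 30 percent error rate carries a margin of error of roughly ±3 percentage points at 95 percent confidence.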

Google was also criticised for ignoring the privacy of the Reddit users whose comments it collected. 'It is entirely unethical to train an AI on human-created content without the expressed individual consent of the humans who created it,' argued TNW's Tristan Greene. 'When I post on Reddit, I do so in the good faith that my discourse is intended for other humans. Google doesn't compensate me for my data, so it shouldn't use it, even if the terms of service allow for it,' he added.

The fracas prompted critics to argue that AI models trained on the dataset would produce erroneous outputs, 'causing demonstrable harm to other humans', and to call for the dataset to be deleted. The episode also highlighted the difficulty of labelling data accurately, notably when the work is outsourced to people whose first language is not English and whose cultural context differs from that of the data's authors.

Operator: Alphabet/Google

Developer: Alphabet/Google

Country: USA

Sector: Research/academia; Technology

Purpose: Classify emotions

Technology: Database/dataset

Issue: Accuracy/reliability; Cheating/plagiarism; Ethics/values

Transparency: Privacy
