Google GoEmotions dataset


GoEmotions is a 'fine-grained' emotion dataset for training AI applications, such as chatbots, content moderation tools, and customer support systems, to recognise emotions expressed in text.

Google, which released GoEmotions in October 2021, describes it as 'a human-annotated dataset of 58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories.'

Dataset 🤖

GoEmotions dataset (Kaggle)
GoEmotions dataset (Papers with Code)

Documents 📃

GoEmotions: A Dataset of Fine-Grained Emotions research study (pdf)
Google (2021). GoEmotions: A Dataset for Fine-Grained Emotion Classification

Operator: Alphabet/Google
Developer: Alphabet/Google

Country: USA

Sector: Research/academia; Technology

Purpose: Classify emotions

Technology: Database/dataset
Issue: Accuracy/reliability; Cheating/plagiarism; Ethics/values

Transparency: Privacy

Risks and harms 🛑

GoEmotions has been criticised for its high rate of mislabelled data and for violating the privacy of Reddit users by using their content without consent.

Transparency and accountability 🙈

The GoEmotions dataset has several transparency limitations.

Data collection process. There is limited information on how specific Reddit comments were selected for inclusion in the dataset.
Annotator demographics. The dataset lacks detailed information about the demographics and backgrounds of the annotators, which could influence emotion labeling.
Annotation guidelines. While some information is provided, the full set of detailed guidelines given to annotators is not publicly available.
Inter-annotator agreement. While some metrics are provided, there is limited insight into the specific areas where annotators disagreed.
Data cleaning process. The exact methods used to clean and preprocess the Reddit comments are not fully detailed.
Excluded data. There is limited information on what types of comments or content were excluded from the dataset and why.
Consent and privacy. It is unclear whether or how consent was obtained from the original comment authors, or how privacy concerns were addressed.
Potential biases. While some biases are acknowledged, there may be insufficient detail on potential biases in the data selection or annotation process.
Version control. Information about how the dataset might be updated or versioned over time is limited.
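
To illustrate the inter-annotator agreement point above, the following minimal Python sketch shows the kind of breakdown that would give more insight than a single headline metric. It computes Cohen's kappa for two annotators and lists the label pairs they disagreed on. The labels, the annotator lists, and the function names are hypothetical and are not taken from GoEmotions; the dataset allows multiple labels per comment, so the metrics in its documentation may be computed differently.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreement_pairs(labels_a, labels_b):
    """Count each (annotator A label, annotator B label) pair that differs."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Hypothetical single-label annotations for six comments (illustrative only).
annotator_1 = ["joy", "anger", "neutral", "joy", "sadness", "neutral"]
annotator_2 = ["joy", "annoyance", "neutral", "amusement", "sadness", "neutral"]

kappa = cohen_kappa(annotator_1, annotator_2)
pairs = disagreement_pairs(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
for (label_a, label_b), count in pairs.items():
    print(f"{label_a} vs {label_b}: {count}")
```

A per-pair breakdown like this shows which emotion categories annotators tend to confuse, which is the detail the page describes as limited.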

Incidents and issues 🔥

GoEmotions dataset criticised for mislabelling content, violating privacy

Research, advocacy 🧮

SurgeAI: 30% of Google's Emotions Dataset is Mislabeled

Page info
Type: Data
Published: December 2022
Last updated: June 2024