IBM Diversity in Faces (DiF) dataset

IBM's Diversity in Faces (DiF) is a dataset of annotations for one million publicly available facial images, intended to make artificial intelligence fairer and more equitable across genders and skin colours, and to accelerate efforts to create fairer and more accurate face recognition systems.

Released in January 2019, IBM's dataset was based on Yahoo!'s YFCC100M dataset, which comprises approximately 100 million photos from the photo-sharing website Flickr made available under various Creative Commons licenses.

IBM said DiF was meant to be an academic/research resource, was not publicly available for download or sale, and could not be used for commercial purposes.

Dataset 🤖

Operator: Alphabet/Google; Amazon; IBM; Microsoft
Developer: IBM
Country: USA
Sector: Technology; Research/academia
Purpose: Train & develop AI models
Technology: Dataset; Facial recognition; Computer vision
Issue: Privacy; Copyright; Ethics
Transparency: Governance; Privacy

Risks and harms 🛑

A March 2019 NBC News investigation found that IBM had been using its Diversity in Faces dataset to train its own AI products, including Watson Visual Recognition, without the consent of the people in the photos. Not only did IBM ignore its own terms of use for the dataset, it also failed to provide attribution links or public credit for any of the images.

In January 2020, IBM was sued in a class action seeking damages of USD 5,000 for each intentional violation of the Illinois Biometric Information Privacy Act, or USD 1,000 for each negligent violation, on behalf of all Illinois citizens whose biometric data was used in the DiF dataset.

In June 2021, Amazon and Microsoft teamed up to defend themselves against lawsuits accusing them of using DiF to train their own facial recognition products and of failing to obtain the permission of the people whose photographs were used in the dataset.

Transparency 🙉

Per the BBC, while IBM said that people whose photos had been included in the dataset could technically opt out via the company's generic research privacy policy, nobody was informed that their data had been used.

In addition, image owners found it difficult to have their images removed from Diversity in Faces, and impossible to delete them from copies that had already been distributed to researchers.

In June 2020, IBM announced it would no longer develop or sell facial recognition technologies to law enforcement authorities. 

Legal, regulatory 👩🏼‍⚖️

Investigations, assessments, audits 🧐

Research, advocacy 🧮

News, commentary, analysis 🗞️

Page info
Type: Dataset
Published: December 2022