IBM Diversity in Faces (DiF) dataset

IBM's Diversity in Faces (DiF) is a dataset of annotations of one million publicly available facial images. It was intended to make artificial intelligence fairer and more equitable across genders and skin colours, and to accelerate efforts to create fairer and more accurate face recognition systems.

Released in January 2019, IBM's dataset was based on Yahoo!'s YFCC100M dataset, which provides approximately 100 million photos from the photo-sharing website Flickr, available under various Creative Commons licenses.

IBM said DiF was meant to be an academic/research resource, was not publicly available for download or sale, and could not be used for commercial purposes.

Dataset 🤖

Withdrawn

Documents 📃

Diversity in Faces research study (pdf)

Operator: Alphabet/Google; Amazon; IBM; Microsoft
Developer: IBM
Country: USA
Sector: Technology; Research/academia
Purpose: Train & develop AI models
Technology: Database/dataset; Facial recognition; Computer vision
Issue: Copyright; Ethics/values; Privacy
Transparency: Governance; Privacy

Risks and harms 🛑

IBM's Diversity in Faces dataset raised significant ethical concerns regarding privacy, consent, and its potential misuse for surveillance, discrimination, and other purposes. It was also criticised for inadequate transparency.

Transparency and accountability 🙉

IBM was seen to have been opaque in a number of ways about its Diversity in Faces dataset.

Limited disclosure to subjects. Many individuals whose images were included were not directly informed or asked for consent. Although the dataset used Flickr photos released under Creative Commons licenses, people appearing in them were unlikely to have realised their images could be used for facial recognition research.
Unclear data handling practices. IBM provided limited public information on how it stored, processed, and secured the biometric data. Furthermore, it did not make clear its policies on long-term retention and potential sharing of the dataset.
Opaque annotation process. The methods used to annotate facial attributes (e.g. skin tone, facial structure) were not fully detailed, resulting in questions about potential biases in the annotation process.
Restricted access for scrutiny. IBM limited access to the dataset, making independent auditing and verification challenging. This restricted the ability of third-party researchers to assess potential biases and other issues.
Ambiguous usage guidelines. IBM failed to make clear how it expected third parties to use the dataset, which could lead to misuse or applications beyond the intended scope.
Unclear impact on AI systems. The downstream effects of using the DiF dataset to train AI systems were not fully explored or disclosed, making it difficult to assess potential biases or issues in resulting facial recognition technologies.
Limited accountability mechanisms. IBM said people whose photos had been included could technically opt out of the dataset through the company's generic research privacy policy. However, nobody was informed that their data had been used, there was a lack of clear processes for individuals to determine whether their images were included or to request removal, and images could not be deleted from copies that had already been provided to third parties.

Incidents and issues 🔥

IBM dataset uses millions of online photos without consent to train AI systems

Legal, regulatory 👩🏼‍⚖️

Vance v International Business Machines Corporation (pdf)

Investigations, assessments, audits 🧐

Harvey, A., LaPlace, J. (2019). Exposing.ai

Research, advocacy 🧮

Raji I.D., Gebru T., Mitchell M., Buolamwini J., Lee J., Denton E. (2020). Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing
Crawford K., Paglen T. (2021). Excavating AI: the politics of images in machine learning training sets

Page info
Type: Data
Published: December 2022
Last published: June 2024