Page published: December 2022 | Last published: October 2024
Diversity in Faces (DiF) is a dataset of annotations of one million publicly available facial images. It was intended to make artificial intelligence fairer across genders and skin colours and to accelerate efforts towards creating fairer and more accurate facial recognition systems.
Released in January 2019, IBM's dataset was based on Yahoo!'s YFCC100M dataset, which provides approximately 100 million photos from photo-sharing website Flickr available under various Creative Commons licenses.
IBM said DiF was meant to be an academic/research resource, was not publicly available for download or sale, and could not be used for commercial purposes.
Facial recognition system
A facial recognition system is a technology potentially capable of matching a human face from a digital image or a video frame against a database of faces.
Source: Wikipedia
Released: 2019
Developer: IBM
Purpose: Democratise AI
Type: Database/dataset
Technique: Computer vision; Facial recognition
Michele Merler, Nalini Ratha, Rogerio S. Feris, John R. Smith. Diversity in Faces
IBM was seen to have been opaque in a number of ways about its Diversity in Faces dataset.
Limited disclosure to subjects. Many individuals whose images were included were not directly informed or asked for consent. And while the dataset used photos from Flickr under Creative Commons licenses, people appearing in the photos were unlikely to have realised their images could be used for facial recognition research.
Unclear data handling practices. IBM provided limited public information on how it stored, processed, and secured the biometric data. Furthermore, it did not make clear its policies regarding long-term retention and potential sharing of the dataset.
Opaque annotation process. The methods used to annotate facial attributes (e.g. skin tone, facial structure) were not fully detailed, resulting in questions about potential biases in the annotation process.
Restricted access for scrutiny. IBM limited access to the dataset, making independent auditing and verification challenging. This restricted the ability of third-party researchers to assess potential biases and other issues.
Ambiguous usage guidelines. IBM failed to make it clear how it expected the dataset to be used by third parties, potentially leading to misuse or applications beyond the intended scope.
Unclear impact on AI systems. The downstream effects of using the DiF dataset to train AI systems were not fully explored or disclosed, making it difficult to assess potential biases or issues in resulting facial recognition technologies.
Limited accountability mechanisms. IBM said people whose photos had been included could technically opt out of the dataset through the company's generic research privacy policy. However, nobody was informed that their data had been used, there was a lack of clear processes for individuals to determine if their images were included or to request removal, and it was impossible to delete images from copies that had already been provided to third parties.
IBM's Diversity in Faces dataset raised significant ethical concerns regarding privacy, consent, and its potential misuse for surveillance, discrimination, and other purposes. It was also criticised for inadequate transparency.
Harvey, A., LaPlace, J. (2019). Exposing.ai
Raji I.D., Gebru T., Mitchell M., Buolamwini J., Lee J., Denton E. (2020). Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing
Crawford K., Paglen T. (2021). Excavating AI: the politics of images in machine learning training sets
AIAAIC Repository ID: AIAAIC0317