80 Million Tiny Images dataset
Released: November 2008
80 Million Tiny Images is an image dataset that was used to train machine learning systems to identify people and objects in an environment.
Created in November 2008, the dataset contains over 79 million 32×32 pixel colour images, scaled down from images collected from search engine queries, and a set of 75,062 non-abstract nouns derived from WordNet.
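To give a sense of scale, the figures above can be turned into a rough estimate of the raw size of the image data. This is only a back-of-the-envelope sketch: it assumes uncompressed storage with three colour channels at 8 bits each, and it uses 79 million as a lower bound on the image count, none of which is stated by the dataset's creators here. The function names are illustrative.

```python
# Rough size estimate for a collection of 32x32 colour images.
# Assumptions (not from the source): 3 channels (RGB), 1 byte per channel,
# no compression, and exactly 79 million images as a lower bound.
WIDTH, HEIGHT, CHANNELS = 32, 32, 3
IMAGE_COUNT_LOWER_BOUND = 79_000_000


def bytes_per_image(width=WIDTH, height=HEIGHT, channels=CHANNELS):
    """Raw bytes needed for one uncompressed image."""
    return width * height * channels


def raw_size_bytes(image_count=IMAGE_COUNT_LOWER_BOUND):
    """Raw bytes needed for the whole collection under the assumptions above."""
    return image_count * bytes_per_image()


if __name__ == "__main__":
    print(bytes_per_image())   # 3072
    print(raw_size_bytes())    # 242688000000
```

Under these assumptions each image takes 3,072 bytes, so the images alone would occupy at least roughly 243 GB before any labels or metadata are counted.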
In June 2020, researchers Vinay Uday Prabhu (UnifyID) and Abeba Birhane (University College Dublin) reported that large-scale image datasets like 80 Million Tiny Images were associating offensive labels with real pictures.
According to the research, the dataset labelled Black and Asian people with racist slurs, labelled women holding children as whores, and included pornographic images.
They also found that WordNet, from which 80 Million Tiny Images copied content, contains derogatory terms, resulting in images and labels that confirm and reinforce stereotypes and biases, albeit inadvertently.
The creators of 80 Million Tiny Images apologised and took the dataset offline in June 2020. They urged researchers to refrain from using it in the future and to delete any downloaded copies.
How many copies were downloaded, how they were used, and whether the creators' plea was followed remain unclear.