80 Million Tiny Images - dataset
Page published: December 2022 | Last updated: November 2024
Created in November 2008 by MIT professors Bill Freeman and Antonio Torralba, and NYU professor Rob Fergus, 80 Million Tiny Images is an image dataset intended to help train machine learning systems to identify people and objects.
The dataset contains over 79 million 32×32 pixel colour images, scaled down from images collected from search engine queries.
It is organised into categories drawn from a set of 75,062 non-abstract nouns derived from WordNet.
Website 🔗
Country: USA
Purpose: Identify and classify objects, people
Type: Database/dataset
Technique: Computer vision; Object recognition
The 80 Million Tiny Images dataset is seen to have several notable transparency and accountability limitations:
Insufficient documentation. The dataset lacks comprehensive documentation about its contents, collection methods, and potential biases.
Lack of consent. Images were collected without clear consent from the individuals depicted or the copyright holders.
Unclear data sourcing. The exact sources of the images and the criteria for inclusion are not well-defined or disclosed.
Limited metadata. There is insufficient information about the context, origin, or potential biases of individual images.
Difficulty in auditing. Due to its massive size, comprehensive manual review and auditing of the dataset is impractical.
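To give a sense of scale for the auditing point above, the following is a rough back-of-envelope sketch, not a measured figure. The image count comes from the dataset description; the review speed (one second per image) and the working hours per year (2,000) are purely hypothetical assumptions chosen for illustration.

```python
# Back-of-envelope estimate of the effort needed to manually review the dataset.
# ASSUMPTIONS (not from the source): seconds_per_image and hours_per_year
# are hypothetical values picked only to illustrate the order of magnitude.

IMAGE_COUNT = 79_000_000  # "over 79 million" images, per the dataset description


def review_years(image_count: int, seconds_per_image: float, hours_per_year: float) -> float:
    """Return the person-years needed to look at every image exactly once."""
    total_hours = image_count * seconds_per_image / 3600
    return total_hours / hours_per_year


if __name__ == "__main__":
    years = review_years(IMAGE_COUNT, seconds_per_image=1.0, hours_per_year=2000.0)
    print(f"{years:.1f} person-years")  # prints "11.0 person-years"
```

Even under these optimistic assumptions, a single pass over the images would take roughly 11 person-years, before any time is spent checking labels or recording findings.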
Torralba A., Fergus R., Freeman W.T. Statement
Torralba A., Fergus R., Freeman W.T. (2008). 80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition (pdf)
80 Million Tiny Images is seen to suffer from a number of important transparency and accountability limitations, not least that images were quietly scraped from Google Images without the acknowledgement or consent of the people depicted or the copyright holders.
MIT removed the dataset from the internet in July 2020 after researchers Abeba Birhane and Vinay Prabhu discovered that it contained racist and misogynistic slurs, which were causing models trained on it to exhibit racial and sexual bias.
MIT may have withdrawn the dataset and warned against its use, but it is not hard to find online, and the internet is littered with derivative subsets, many of which were produced before MIT was forced into defensive action.
Prabhu V.U., Birhane A. (2020). Large Image Datasets: A Pyrrhic Win for Computer Vision?
Krizhevsky A. (2009). Learning Multiple Layers of Features from Tiny Images (pdf)
AIAAIC Repository ID: AIAAIC0468