80 Million Tiny Images

Page published: December 2022 | Last updated: November 2024


Created in November 2008 by MIT professors Bill Freeman and Antonio Torralba, and NYU professor Rob Fergus, 80 Million Tiny Images is an image dataset intended to help train machine learning systems to identify people and objects.

The dataset contains over 79 million 32×32 pixel colour images, scaled down from images collected from search engine queries.
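To make the scale concrete, here is a minimal sketch of shrinking a colour image to 32×32 pixels and estimating raw storage. The box-averaging approach and the names `downscale` and `raw_bytes` are illustrative assumptions, not the method the dataset's authors are known to have used.

```python
# Illustrative sketch only: the resizing method actually used to build the
# dataset is not described on this page; box averaging is an assumption.
TINY = 32  # target width and height in pixels


def downscale(image, size=TINY):
    """Shrink an RGB image (a list of rows of (r, g, b) tuples) to size x size."""
    height, width = len(image), len(image[0])
    out = []
    for ty in range(size):
        # Source rows covered by this output row (always at least one row).
        y0 = ty * height // size
        y1 = max(y0 + 1, (ty + 1) * height // size)
        row = []
        for tx in range(size):
            # Source columns covered by this output pixel.
            x0 = tx * width // size
            x1 = max(x0 + 1, (tx + 1) * width // size)
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # Average each colour channel over the block (integer division).
            row.append(tuple(sum(p[c] for p in block) // len(block) for c in range(3)))
        out.append(row)
    return out


def raw_bytes(num_images, size=TINY, channels=3):
    """Uncompressed storage, assuming one byte per colour channel."""
    return num_images * size * size * channels
```

Under these assumptions each image takes 3,072 bytes, so 79 million images come to roughly 243 GB uncompressed. Because the dataset holds over 79 million images, this is a lower bound.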

The images are arranged in roughly 75,000 categories, each labelled with one of 75,062 non-abstract nouns derived from WordNet.

Computer vision

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions.

Source: Wikipedia 🔗

Dataset 🤖

Website 🔗
Released: 2008
Developer: Antonio Torralba; Bill Freeman; Rob Fergus
Country: USA
Purpose: Identify & classify objects, people
Type: Database/dataset
Technique: Computer vision; Object recognition

Documents 📃

Torralba A., Fergus R., Freeman W.T. Statement
Torralba A., Fergus R., Freeman W.T. 80 million tiny images: a large dataset for non-parametric object and scene recognition (pdf)

Derivatives, applications 🈸

Transparency, accountability 🙈

The 80 Million Tiny Images dataset is seen to have several notable transparency and accountability limitations:

Insufficient documentation. The dataset lacks comprehensive documentation about its contents, collection methods, and potential biases.
Lack of consent. Images were collected without clear consent from the individuals depicted or the copyright holders.
Unclear data sourcing. The exact sources of the images and the criteria for inclusion are not well-defined or disclosed.
Limited metadata. There is insufficient information about the context, origin, or potential biases of individual images.
Difficulty in auditing. Due to its massive size, comprehensive manual review and auditing of the dataset is impractical.

Risks, harms 🛑

A central concern with 80 Million Tiny Images is that its images were quietly scraped from Google Images without users' knowledge or consent.

MIT removed the dataset from the internet in July 2020 after researchers Abeba Birhane and Vinay Prabhu discovered that it contained racist and misogynistic slurs, which were causing models trained on it to exhibit racial and sexual bias.

Although MIT withdrew the dataset and warned against its use, copies are not hard to find online, and the internet is littered with derivative subsets, many of which were produced before MIT was forced to act.

Incidents, issues 🔥

July 2020. Tiny Images dataset teaches AI systems to use racist slurs

Research, advocacy 🧮

Prabhu V.U., Birhane A. (2020). Large Image Datasets: A Pyrrhic Win for Computer Vision?
Krizhevsky A. (2009). Learning Multiple Layers of Features from Tiny Images (pdf)