MegaFace facial recognition dataset

Page published: December 2022 | Last updated: October 2024


MegaFace is a facial recognition training dataset consisting of 4,753,320 faces of 672,057 identities from 3,311,471 photos downloaded from 48,383 Flickr users' photo albums.
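To give a sense of scale, the figures above can be turned into a few simple averages. This is only an illustrative calculation on the four counts quoted in the paragraph; the variable names are made up for the example, and averages say nothing about how faces are actually distributed across individuals or accounts.

```python
# Counts quoted in the dataset description above.
faces = 4_753_320
identities = 672_057
photos = 3_311_471
flickr_users = 48_383

# Simple averages derived from those counts.
faces_per_identity = faces / identities
photos_per_user = photos / flickr_users
faces_per_photo = faces / photos

print(f"{faces_per_identity:.2f} faces per identity")
print(f"{photos_per_user:.2f} photos per Flickr user")
print(f"{faces_per_photo:.2f} faces per photo")
```

On these figures, that works out to roughly 7 faces per identity, about 68 photos per Flickr account, and around 1.4 faces per photo.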

Created in 2015 by researchers at the University of Washington, the project was expanded in 2016 in the form of the MegaFace Challenge, in which facial recognition teams were encouraged to download the database and see how their algorithms performed when they had to distinguish between a million possible matches.
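The kind of test described above, picking the right face out of a large pool of possible matches, can be sketched as follows. This is a minimal illustration and not the MegaFace Challenge's actual protocol or code: it assumes each face has already been converted into a numeric vector (an "embedding"), uses cosine similarity as the matching score, and substitutes random vectors for real face data. The names `best_match`, `DIM` and `NUM_DISTRACTORS` are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(probe, gallery):
    """Return the index of the gallery vector most similar to the probe."""
    return max(range(len(gallery)),
               key=lambda i: cosine_similarity(probe, gallery[i]))

rng = random.Random(0)
DIM = 64                 # hypothetical embedding size
NUM_DISTRACTORS = 2_000  # small stand-in for a million distractors

def random_vector():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Synthetic stand-ins for face embeddings; no real face data is used.
gallery = [random_vector() for _ in range(NUM_DISTRACTORS)]
enrolled_face = random_vector()
true_index = 1234
gallery.insert(true_index, enrolled_face)

# A second "photo" of the same person: the enrolled vector plus small noise.
probe = [x + 0.1 * rng.gauss(0.0, 1.0) for x in enrolled_face]

predicted = best_match(probe, gallery)
print(predicted == true_index)
```

In this toy setup the probe is almost identical to the enrolled vector, so the correct entry is easy to find. With real photos, the task gets harder as the number of unrelated faces in the gallery grows, which is why a very large pool of faces is attractive as a benchmark.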

Like IBM's Diversity in Faces dataset, MegaFace was based on Yahoo!'s YFCC100M dataset, which provides approximately 100 million photos from the photo-sharing website Flickr under various Creative Commons licenses.

Partly due to its size, MegaFace became one of the most important benchmarks for commercial face recognition vendors. The only public dataset with a comparable number of images was Microsoft's MS-Celeb-1M dataset, which was withdrawn after a Financial Times/Exposing.ai investigation.

The MegaFace Challenge and dataset were discontinued in June 2020.

MegaFace was funded by Samsung, a Google Faculty Research Award, and the National Science Foundation/Intel.

Facial recognition system

A facial recognition system is a technology potentially capable of matching a human face from a digital image or a video frame against a database of faces.

Source: Wikipedia 🔗

Dataset 🤖

Website 🔗
Data 🔗
Released: 2015
Developer: University of Washington
Sector: Technology; Research/academia
Purpose: Train facial recognition systems
Type: Database/dataset
Technique: Computer vision; Facial recognition; Machine learning

Documents 📃

Ira Kemelmacher-Shlizerman, Steve Seitz, Daniel Miller, Evan Brossard. The MegaFace Benchmark: 1 Million Faces for Recognition at Scale
MF2 research paper
UW News article

Derivatives, applications 🈸

DiveFace dataset
MegaAge dataset (Papers with Code)
TinyFace dataset (Papers with Code)

Transparency, accountability 🙈

MegaFace has been criticised for poor transparency and accountability.

Lack of informed consent. Users whose images were included in the dataset were not notified or asked for permission. The photos were originally uploaded to Flickr, often years earlier, without users knowing they would later be used for facial recognition training.
Unclear provenance. The dataset was created from Yahoo's Flickr collection, which was then further processed by University of Washington researchers. This chain of data handling was not transparent to the individuals whose images were included.
Insufficient safeguards. While the original Flickr collection used links to photos rather than distributing them directly, this safeguard was flawed. A security vulnerability allowed access to photos even after users made them private on Flickr.
Redistribution issues. Some researchers who accessed the database downloaded and redistributed the images, further removing them from the original context and user control.
Limited user recourse. Most people have no way of knowing if their images are in the dataset, and limited means to have them removed if they are.
Opaque usage. The dataset has been used by numerous companies and researchers, but the full extent of its usage and the resulting applications are not clear to the public or the individuals included in the dataset.
Lack of age consideration. The dataset includes images of children, used without parental consent or knowledge.
Geographical and legal oversight. The dataset's creation and distribution did not account for varying privacy laws in different jurisdictions, such as Illinois' biometric privacy law.

Risks, harms 🛑

Because it uses 3.3 million images scraped from Flickr without user consent, the MegaFace dataset is seen to pose significant risks and harms, including privacy violations, potential misuse in surveillance and military applications, and the perpetuation of biases in facial recognition technologies.

The dataset has been downloaded and used by thousands of organisations and individuals across the world, and has been used to create multiple derivative datasets, many of which continue to exist. Together with the misuse of people's data, this has raised questions about liability.

Incidents, issues 🔥

October 2019. MegaFace facial recognition dataset raises privacy, liability concerns

Investigations, assessments, audits 👁️

Harvey, A., LaPlace, J. (2019). Exposing.ai
University of Washington. FOIA request release results