MegaFace facial recognition dataset
MegaFace is a facial recognition training dataset consisting of 4,753,320 faces of 672,057 identities from 3,311,471 photos downloaded from 48,383 Flickr users' photo albums.
Created in 2015 by researchers at the University of Washington, the project was expanded in 2016 in the form of the MegaFace Challenge, in which facial recognition teams were encouraged to download the database and see how their algorithms performed when they had to distinguish between a million possible matches.
Like IBM's Diversity in Faces dataset, MegaFace was based on Yahoo!'s YFCC100M dataset, which provides approximately 100 million photos from the photo-sharing website Flickr under various Creative Commons licenses.
Partly due to its size, MegaFace became one of the most important benchmarks for commercial face recognition vendors. The only public dataset with a comparable number of images was Microsoft's MS-Celeb-1M dataset, which was withdrawn after a Financial Times/Exposing.ai investigation.
The MegaFace Challenge and dataset were discontinued in June 2020.
MegaFace was funded by Samsung, a Google Faculty Research Award, and the National Science Foundation/Intel.
Dataset 🤖
MegaFace dataset (Papers with Code)
Documents 📃
Derivatives, applications 🈸
MegaAge dataset (Papers with Code)
TinyFace dataset (Papers with Code)
Dataset info 🔢
Operator: Alibaba; Alphabet/Google; Amazon; Bytedance; EUROPOL; Huawei; In-Q-Tel; IntelliVision; Megvii; Mitsubishi Electric; Northrop Grumman; Ntechlab; Philips; Samsung; SenseTime; Sogou; Tencent; Vision Semantics
Developer: University of Washington
Country: USA
Sector: Technology; Research/academia
Purpose: Improve research quality
Technology: Database/dataset; Facial recognition; Computer vision
Issue: Copyright; Dual/multi-use; Privacy; Surveillance; Liability
Transparency: Privacy; Marketing
Risks and harms 🛑
Because it was built from 3.3 million images scraped from Flickr without user consent, the MegaFace dataset is seen to pose significant risks and harms, including privacy violations, potential misuse in surveillance and military applications, and the perpetuation of biases in facial recognition technologies.
The dataset was downloaded and used by thousands of organisations and individuals across the world, and served as the basis for multiple derivative datasets, many of which continue to exist.
Transparency and accountability 🙈
MegaFace has been criticised for poor transparency and accountability.
Lack of informed consent. Users whose images were included in the dataset were not notified or asked for permission. The photos were originally uploaded to Flickr, often years earlier, without users knowing they would later be used for facial recognition training.
Unclear provenance. The dataset was created from Yahoo's Flickr collection, which was then further processed by University of Washington researchers. This chain of data handling was not transparent to the individuals whose images were included.
Insufficient safeguards. While the original Flickr collection used links to photos rather than distributing them directly, this safeguard was flawed. A security vulnerability allowed access to photos even after users made them private on Flickr.
Redistribution issues. Some researchers who accessed the database downloaded and redistributed the images, further removing them from the original context and user control.
Limited user recourse. Most people have no way of knowing if their images are in the dataset, and limited means to have them removed if they are.