Microsoft Celeb (MS-Celeb-1M) facial recognition dataset

MS-Celeb-1M (or Microsoft Celeb) is a dataset developed by Microsoft Research to accelerate research into facial recognition technologies.

Created and published in 2016, MS-Celeb-1M consisted of approximately 10 million facial images of 100,000 celebrities, journalists, artists, musicians, activists, policy makers, writers, and academics. Microsoft also provided a 'target list' of an additional 900,000 names whose images were to be collected.

According to Microsoft, the dataset was created for 'non-commercial research purpose only' and would be applicable to image captioning and news video analysis.

MS-Celeb-1M was reckoned to be the largest public dataset of its kind. Microsoft terminated the project in mid-2019, shortly after the publication of researcher Adam Harvey's Exposing.ai project and a Financial Times investigation into facial recognition data sharing.

Dataset 🤖

Documents 📃

Research study
Guo Y., Zhang L. (2017). One-shot Face Recognition by Promoting Underrepresented Classes

Operator: Alibaba; École Polytechnique Fédérale de Lausanne; Hitachi; Huawei; IBM; IDIAP Research Institute; Megvii; Microsoft; National University of Defense Technology (NUDT); Nvidia; Panasonic; SenseTime; Universidad Autónoma de Madrid; University of Leicester; Multiple
Developer: Microsoft
Country: USA
Sector: Technology; Research/academia
Purpose: Train facial recognition systems
Technology: Database/dataset; Facial recognition; Computer vision
Issue: Privacy; Copyright; Dual/multi-use
Transparency: Privacy

Risks and harms 🛑

Microsoft Celeb raised significant privacy and ethical concerns because it collected images of people at scale without their explicit consent, potentially enabling unauthorised surveillance, identity theft, and the misuse of personal data.

Transparency and accountability 🙈

Transparency limitations associated with the Microsoft Celeb (MS-Celeb-1M) dataset include:

Unclear consent processes for individuals included in the dataset
Lack of transparency about data collection methods and sources
Limited information on how "celebrity" status was determined
Insufficient disclosure of potential biases in the dataset
Inadequate documentation of data quality and accuracy
Lack of clear policies on data usage, sharing, and restrictions
Limited transparency on the dataset's impact on privacy and civil liberties
Insufficient information on data retention and deletion processes
Unclear mechanisms for individuals to request removal from the dataset
Limited public reporting on how the dataset has been used by researchers and companies

Related 🌐

Page info
Type: Data
Published: April 2022
Last updated: June 2024
