Microsoft Celeb (MS-Celeb-1M) facial recognition dataset
MS-Celeb-1M (or Microsoft Celeb) is a dataset developed by Microsoft Research to accelerate research into facial recognition technologies.
Created and published in 2016, MS-Celeb-1M consisted of approximately 10 million facial images of 100,000 celebrities, journalists, artists, musicians, activists, policy makers, writers, and academics. Microsoft also provided a 'target list' of an additional 900,000 names whose images were to be collected.
According to Microsoft, the dataset was created for 'non-commercial research purpose only' and would be applicable to image captioning and news video analysis.
The dataset was reckoned to be the largest public dataset of its kind. Microsoft terminated the project in mid-2019, shortly after the publication of researcher Adam Harvey's Exposing.ai project and a Financial Times investigation into facial recognition data sharing.
Documents 📃
Guo Y., Zhang L. (2017). One-shot Face Recognition by Promoting Underrepresented Classes (pdf)
Dataset info 🔢
Operator: Alibaba; École Polytechnique Fédérale de Lausanne; Hitachi; Huawei; IBM; IDIAP Research Institute; Megvii; Microsoft; National University of Defense Technology (NUDT); Nvidia; Panasonic; SenseTime; Universidad Autónoma de Madrid; University of Leicester; Multiple
Developer: Microsoft
Country: USA
Sector: Technology; Research/academia
Purpose: Train facial recognition systems
Technology: Database/dataset; Facial recognition; Computer vision
Issue: Privacy; Copyright; Dual/multi-use
Transparency: Privacy
Risks and harms 🛑
Microsoft Celeb raised significant privacy and ethical concerns because it collected celebrity images at large scale without explicit consent, potentially enabling unauthorised surveillance, identity theft, and the misuse of personal data.
Transparency and accountability 🙈
Transparency limitations associated with the Microsoft Celeb (MS-Celeb-1M) dataset include:
Unclear consent processes for individuals included in the dataset
Lack of transparency about data collection methods and sources
Limited information on how "celebrity" status was determined
Insufficient disclosure of potential biases in the dataset
Inadequate documentation of data quality and accuracy
Lack of clear policies on data usage, sharing, and restrictions
Limited transparency on the dataset's impact on privacy and civil liberties
Insufficient information on data retention and deletion processes