Microsoft Celeb (MS-Celeb-1M) facial recognition dataset

MS-Celeb-1M (or Microsoft Celeb) is a dataset developed by Microsoft Research to accelerate research into facial recognition technologies. 

Created and published in 2016, MS-Celeb-1M consisted of approximately 10 million facial images of 100,000 celebrities, journalists, artists, musicians, activists, policy makers, writers, and academics. Microsoft also provided a 'target list' of an additional 900,000 names whose images were to be collected.

According to Microsoft, the dataset was created for 'non-commercial research purpose only' and would be applicable to image captioning and news video analysis. 

Reckoned to be the largest public dataset of its kind, MS-Celeb-1M was terminated by Microsoft in mid-2019, shortly after the publication of researcher Adam Harvey's Exposing.ai project and a Financial Times investigation into facial recognition data sharing.

Dataset info 🔢

Operator: Alibaba; École Polytechnique Fédérale de Lausanne; Hitachi; Huawei; IBM; IDIAP Research Institute; Megvii; Microsoft; National University of Defense Technology (NUDT); Nvidia; Panasonic; SenseTime; Universidad Autónoma de Madrid; University of Leicester; Multiple
Developer: Microsoft
Country: USA
Sector: Technology; Research/academia
Purpose: Train facial recognition systems
Technology: Database/dataset; Facial recognition; Computer vision
Issue: Privacy; Copyright; Dual/multi-use
Transparency: Privacy

Risks and harms 🛑

Microsoft Celeb is seen to have posed significant privacy and ethical concerns because its images were collected at large scale without the explicit consent of the people depicted, potentially enabling unauthorised surveillance, identity theft, and the misuse of personal data.

Transparency and accountability 🙈

Transparency limitations associated with the Microsoft Celeb (MS-Celeb-1M) dataset include: