Microsoft Celeb (MS-Celeb-1M) facial recognition dataset

Published: April 2022

MS-Celeb-1M (or Microsoft Celeb) is a dataset developed by Microsoft Research to accelerate research into facial recognition technologies.

Created and published in 2016, MS-Celeb-1M consisted of approximately 10 million facial images of 100,000 celebrities, journalists, artists, musicians, activists, policy makers, writers, and academics. Microsoft also provided a 'target list' of an additional 900,000 names whose images were to be collected.

According to Microsoft, the dataset was created for 'non-commercial research purpose only' and would be applicable to image captioning and news video analysis.

MS-Celeb-1M was reckoned to be the largest public dataset of its kind. Microsoft terminated the project in mid-2019, shortly after the publication of researcher Adam Harvey's project and a Financial Times investigation into facial recognition data sharing.

Privacy and copyright misuse

Microsoft collected photographs for MS-Celeb-1M by automatically scraping them from search engines. This was done without informing or gaining the consent of those affected, and without regard to the copyright licenses attached to the images.

The company also played fast and loose with the definition of public interest. 'Celebrities' whose data was collected include US blogger Cory Doctorow, journalist Glenn Greenwald, author and academic Shoshana Zuboff, and former US FTC commissioner Julie Brill, who arguably should not be classified as public figures.

Commercial and surveillance uses

Although the dataset was restricted to academic use, research paper citations reveal that MS-Celeb-1M has been used hundreds of times across the world by companies such as IBM, Panasonic, Hitachi, and Nvidia for a wide variety of commercial purposes.

Furthermore, it transpired that Microsoft used MS-Celeb-1M to train its own facial recognition systems, as had Chinese technology firms Huawei, SenseTime, and Megvii, whose products are allegedly used to detect and surveil Uyghurs, and to track foreign journalists.

Ongoing data availability

Having taken down the dataset, Microsoft told the FT that 'the site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed.'

But the dataset remains widely available online, with several versions on GitHub and Academic Torrents.

Operator: Alibaba; École Polytechnique Fédérale de Lausanne; Hitachi; Huawei; IBM; IDIAP Research Institute; Megvii; Microsoft; National University of Defense Technology (NUDT); Nvidia; Panasonic; SenseTime; Universidad Autónoma de Madrid; University of Leicester; Multiple
Sector: Technology; Research/academia
Purpose: Train facial recognition systems
Technology: Dataset; Facial recognition; Computer vision
Issue: Privacy; Copyright; Dual/multi-use
Transparency: Privacy
