Microsoft Celeb (MS-Celeb-1M) facial recognition dataset
MS-Celeb-1M (or Microsoft Celeb) is a dataset developed by Microsoft Research to accelerate research into facial recognition technologies.
Created and published in 2016, MS-Celeb-1M consisted of approximately 10 million facial images of 100,000 celebrities, journalists, artists, musicians, activists, policy makers, writers, and academics. Microsoft also provided a 'target list' of an additional 900,000 names whose images were to be collected.
According to Microsoft, the dataset was created for 'non-commercial research purpose only' and would be applicable to image captioning and news video analysis.
The dataset was reckoned to be the largest public dataset of its kind. Microsoft terminated the project in mid-2019, shortly after the publication of researcher Adam Harvey's Exposing.ai project and a Financial Times investigation into facial recognition data sharing.
Privacy and copyright misuse
Microsoft collected photographs for MS-Celeb-1M by automatically scraping them from search engines. This was done without informing or gaining the consent of those affected, and without regard to the images' copyright licenses.
The company also played fast and loose with the definition of public interest. 'Celebrities' whose data was collected include US blogger Cory Doctorow, journalist Glenn Greenwald, author and academic Shoshana Zuboff, and former US FTC commissioner Julie Brill, who arguably should not be classified as public figures.
Commercial and surveillance uses
Although the dataset was restricted to academic use, research paper citations reveal that MS-Celeb-1M has been used hundreds of times across the world by companies such as IBM, Panasonic, Hitachi, and Nvidia for a wide variety of commercial purposes.
Furthermore, it transpired that Microsoft used MS-Celeb-1M to train its own facial recognition systems, as had Chinese technology firms Huawei, SenseTime, and Megvii, whose products are allegedly used to detect and surveil Uyghurs, and to track foreign journalists.
Ongoing data availability
Microsoft quietly took down the dataset in June 2019, telling the FT that 'the site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed.'
But the dataset remains widely available online, with several versions on GitHub and Academic Torrents.
Operator: Alibaba; École Polytechnique Fédérale de Lausanne; Hitachi; Huawei; IBM; IDIAP Research Institute; Megvii; Microsoft; National University of Defense Technology (NUDT); Nvidia; Panasonic; SenseTime; Universidad Autónoma de Madrid; University of Leicester; Multiple
Sector: Technology; Research/academia
Purpose: Train facial recognition systems
Technology: Dataset; Facial recognition; Computer vision
Issue: Privacy; Copyright; Dual/multi-use
Guo Y., Zhang L. (2017). One-shot Face Recognition by Promoting Underrepresented Classes
Harvey A., LaPlace J. (2019). Exposing.ai
Peng K., Mathur A., Narayanan A. (2021). Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers
Investigations, assessments, audits
Murgia M., Financial Times (2019). Who’s using your face? The ugly truth about facial recognition
News, commentary, analysis
Published: April 2022