IARPA Janus Benchmark-C (IJB-C) dataset
IARPA Janus Benchmark-C (IJB-C) is a database of YouTube video still-frames and Flickr and Wikimedia photos used for facial recognition benchmarking.
IJB-C was compiled in 2017 by US government subcontractor Noblis and contains 21,294 images of 3,531 people 'with diverse occupations' and of varying levels of fame.
The dataset averages six pictures and three videos per person, and is available on application to computer vision and facial recognition researchers.
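The per-person averages quoted above can be computed directly from subject-level metadata. The sketch below is illustrative only: the record layout and field names are assumptions, not the dataset's actual schema (IJB-C distributes its own protocol files).

```python
from collections import Counter

# Hypothetical metadata records: (subject_id, media_type).
# The real IJB-C metadata uses its own protocol CSV files;
# this flat list is a stand-in for illustration.
records = [
    ("s1", "image"), ("s1", "image"), ("s1", "video"),
    ("s2", "image"), ("s2", "video"), ("s2", "video"),
]

def per_subject_averages(records):
    """Return (average images per subject, average videos per subject)."""
    images = Counter()
    videos = Counter()
    subjects = set()
    for subject, media in records:
        subjects.add(subject)
        if media == "image":
            images[subject] += 1
        elif media == "video":
            videos[subject] += 1
    n = len(subjects)
    return (sum(images.values()) / n, sum(videos.values()) / n)

print(per_subject_averages(records))  # (1.5, 1.5)
```

With the full metadata in place of the toy records, the same calculation would yield the roughly six images and three videos per person reported for the dataset.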
Dataset
IJB-C
Documents
Dataset info
Operator: SenseTime; NEC; National University of Defense Technology (NUDT)
Developer: Noblis; IARPA
Country: USA
Sector: Media/entertainment/sports/arts; Politics
Purpose: Enable facial recognition research
Technology: Database/dataset; Facial recognition; Computer vision; Neural network; Machine learning
Issue: Dual/multi-use; Privacy; Surveillance
Transparency: Governance; Privacy
Risks and harms
The IARPA Janus Benchmark-C (IJB-C) dataset has been criticised for using images of political activists, civil rights advocates, and journalists without their consent, and for its potential misuse for military and security purposes.
Transparency and accountability
The IARPA Janus Benchmark-C (IJB-C) dataset is seen to suffer from a number of transparency limitations:
Limited access to raw data. The original dataset is over 200GB in size, making it difficult for many researchers to access and analyse the full dataset.
Unclear selection criteria. The reasons for selecting specific individuals for inclusion in the dataset are not clearly explained. The only stated criterion is that source material must include "well-labeled, person-centric data".
Potential consent issues. Many individuals included in the dataset, such as activists and journalists, were likely unaware their images were being used for facial recognition research. For example, digital rights activist Jillian York's images were included without her knowledge or consent.
Violation of platform policies. The dataset includes thousands of faces from over 11,000 YouTube videos, which violates YouTube's terms of service regarding the use of data for facial recognition.
Lack of diversity representation. While the dataset aims to improve representation of the global population, it is unclear how effectively it captures diversity across different demographics.
Limited annotation transparency. Although the dataset includes expanded annotations for covariate analysis, the full extent and accuracy of these annotations are not clearly detailed.
Unclear data processing methods. The exact methods used for processing and curating the images and videos in the dataset are not fully transparent.