IBM dataset uses millions of online photos without consent to train AI systems
Occurred: March 2019
IBM scraped more than a million photographs from Flickr to create a dataset for training its AI products, without the consent of the people in the photos, according to an NBC News investigation.
The technology company used a publicly available database of pictures, known as YFCC100M, which Flickr's then-parent company Yahoo had collected for research purposes, to build its own Diversity in Faces (DiF) dataset, with the intention of training facial recognition systems, including its own Watson Visual Recognition, to be less biased.
Analysis of DiF showed that its top image tags included “party,” “family,” “wedding,” and “friends,” indicating that the dataset captured personal and private moments. And while many of the images were licensed with attribution requirements, IBM failed to provide attribution links or public credit for any of them. Intended as an academic resource, the dataset was not publicly available; users had to be granted permission before they could access it.
The people in the pictures were not told that IBM would use their faces to determine gender, race, or other identifiable attributes, such as eye colour, hair colour, or whether someone was wearing glasses. It also transpired that IBM had ignored its own terms of use for the dataset. Later research by Exposing AI showed DiF had been downloaded hundreds of times across the world, for reasons that remain unclear.
The findings prompted concerns that the DiF dataset could enable surveillance and profiling, with minorities seen as particularly vulnerable to being targeted, and sparked accusations of unethical conduct by IBM.
➕ January 2020. IBM was sued in a class action seeking damages of USD 5,000 for each intentional violation of the Illinois Biometric Information Privacy Act, or USD 1,000 for each negligent violation, on behalf of all Illinois citizens whose biometric data was used in the DiF dataset.
➕ June 2020. IBM said it would no longer provide facial recognition technology to US police departments for mass surveillance and racial profiling.
➕ June 2021. Amazon and Microsoft teamed up to defend themselves against lawsuits accusing them of using DiF to train their own facial recognition products and of failing to obtain the permission of the people whose photographs were used in the dataset.
Operator: Amazon; Microsoft
Developer: Alphabet/Google; Amazon; IBM; Microsoft
Country: USA
Sector: Research/academia; Technology
Purpose: Train and develop AI models
Technology: Database/dataset; Facial recognition; Computer vision
Issue: Bias/discrimination; Ethics/values; Privacy; Surveillance; Transparency
Harvey, A., LaPlace, J. (2019). Exposing.ai
https://www.theregister.com/2020/01/27/ibms_facial_recognition_software_gets_it_in_trouble_again/
https://www.cnet.com/news/ibm-stirs-controversy-by-sharing-photos-for-ai-facial-recognition/
https://www.cnbc.com/2019/01/29/ibm-releases-diverse-dataset-to-fight-facial-recognition-bias.html
https://mashable.com/article/ibm-flickr-images-training-facial-recognition-system
Page info
Type: Incident
Published: June 2024