Labeled Faces in the Wild (LFW) dataset

Labeled Faces in the Wild (LFW) is an open source dataset for researchers, intended to establish a public benchmark for facial verification.

According to Papers with Code, 'Facial verification is the task of comparing a candidate face to another, and verifying whether it is a match. It is a one-to-one mapping: you have to check if this person is the correct one.'
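
As a rough illustration of the one-to-one comparison described above, the following sketch compares two face embeddings by cosine similarity and accepts the pair as a match when the score clears a threshold. The embedding values, the function names and the 0.6 threshold are assumptions made for illustration; they are not taken from LFW or from any particular face recognition system.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_same_person(candidate, reference, threshold=0.6):
    """One-to-one check: does the candidate embedding match the reference?"""
    # The threshold is a hypothetical value; real systems tune it on held-out pairs.
    return cosine_similarity(candidate, reference) >= threshold


# Toy embeddings (made-up values; real ones would come from a face model).
reference = [0.9, 0.1, 0.3]
candidate_close = [0.8, 0.2, 0.35]
candidate_far = [-0.2, 0.9, -0.4]

print(is_same_person(candidate_close, reference))
print(is_same_person(candidate_far, reference))
```

Verification only ever answers a yes/no question about one pair of faces, which is why the sketch takes exactly two embeddings rather than searching a gallery of many people.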

Created by the University of Massachusetts, Amherst, and publicly released in 2007, LFW comprises over 13,000 facial images with different poses and expressions, under different lighting conditions. Each face is labeled with the name of the person, with 1,680 people having two or more distinct photos in the set.
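
Per-person figures of the kind quoted above can be derived from the image labels alone. A minimal sketch, assuming the labels are available as a flat list of names, one per image; the `labels` list and the function name below are made-up examples, not the real LFW labels or tooling:

```python
from collections import Counter


def identities_with_multiple_images(labels, minimum=2):
    """Map each name that appears at least `minimum` times to its image count."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if n >= minimum}


# Hypothetical sample: one label per image.
labels = ["Person A", "Person B", "Person A", "Person C", "Person A", "Person B"]

multi = identities_with_multiple_images(labels)
print(len(labels), "images,", len(set(labels)), "people,", len(multi), "with 2+ images")
```

Only people with two or more images can contribute a genuine matching pair to a verification benchmark, so this count is a useful first check on any labeled face dataset.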

Dataset 🤖

Documents 📃

Operator:
Developer: University of Massachusetts, Amherst

Country: USA

Sector: Research/academia; Technology

Purpose: Train facial recognition systems

Technology: Dataset; Computer vision; Deep learning; Facial recognition; Facial detection; Facial analysis; Machine learning; Neural network; Pattern recognition

Issue: Bias/discrimination - race, ethnicity, gender; Ethics; Privacy

Transparency: Governance; Privacy

Risks and harms 🛑


LFW has been found to be heavily skewed towards a narrow demographic, predominantly white men. It also contains 'a significant number of duplicate or nearly-duplicate images and mislabeled images.'


The researchers later acknowledged the dataset's limitations on their website. 'Many groups are not well represented in LFW,' it states. 'For example, there are very few children, no babies, very few people over the age of 80, and a relatively small proportion of women. In addition, many ethnicities have very minor representation or none at all.'

Despite these shortcomings, LFW has become the most widely used facial recognition benchmark globally, according to the Financial Times. Tel Aviv University researcher Tomer Friedlander told The Register it is 'a widely used dataset in the academic literature for evaluating face recognition methods.'

LFW has also gained some notoriety among civil rights and privacy groups for being the first dataset for which 'wild' images were scraped from the internet. According to MIT Technology Review, it 'opened the floodgates to data collection through web search. Researchers began downloading images directly from Google, Flickr, and Yahoo without concern for consent.'

Research, advocacy 🧮

Investigations, assessments, audits 🧐