Speech2Face facial reconstructions

Released: May 2019

Speech2Face is an algorithm developed by a group of researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Google AI that generates images of what someone might look like from snippets of audio recordings of their voice.

According to its associated research paper, the MIT researchers used a dataset of millions of clips from YouTube and elsewhere to create a neural network-based model that learns vocal attributes associated with facial features from the videos.


Reaction to Speech2Face has been mixed, with some commentators praising it for creating rough likenesses of people with little information, whilst others focused on the fact that the images created by Speech2Face often only bear a general resemblance to the speaker, and that the system is prone to creating images of the wrong gender and ethnicity. 

Others highlighted that the system tends to identify people with high voices as female and people with low voices as male, and people speaking Asian languages as Asian, reflecting strong gender, racial and country-of-origin biases in its data and results. Some went further, accusing it of 'ethnic profiling at scale' and being little more than 'awful transphobic shit'.

Speech2Face also kicked off a debate about the nature of data privacy, with some questioning whether this kind of technology could be used to identify individuals, despite the team claiming their method 'cannot recover the true identity of a person from their voice.'

An individual included in the dataset told Slate that he didn’t remember signing a waiver for the YouTube video he was featured in, which was fed through the algorithm. Speech2Face also prompted a conversation about the need for ethics review boards at conferences and funding agencies.


The MIT team urges caution on the project's GitHub page, acknowledging that the technology raises questions about discrimination and privacy. They said the training data used was a collection of educational videos from YouTube which may not represent the world population.

'Although this is a purely academic investigation, we feel that it is important to explicitly discuss in the paper a set of ethical considerations due to the potential sensitivity of facial information,' they wrote, recommending that 'any further investigation or practical use of this technology will be carefully tested to ensure that the training data is representative of the intended user population.'

Operator: Alphabet/Google
Developer: MIT; Alphabet/Google

Country: USA

Sector: Research/academia

Purpose: Reconstruct facial image

Technology: Neural network
Issue: Accuracy/reliability; Bias/discrimination - race, gender, LGBTQ; Privacy

Transparency: Privacy