We've all heard about racial bias in artificial intelligence via the media, whether it's found in recidivism software or object detection that mislabels African American people as gorillas. As media attention has increased, people have grown more aware that the implicit bias occurring in people can affect the AI systems we build.
Earlier this week, I was honored to give a talk on Racial Bias in Facial Recognition at PyCascades, a new regional Python conference. Last week I wrote a blog post on learning facial recognition through OpenFace where I went into deeper detail about both facial recognition and the OpenFace architecture, so if you want to give that a read before checking out this talk, I highly encourage it.
Using OpenFace as an example face recognition model, this talk will cover the basics of facial recognition and why it’s important to have diverse datasets when building out a model.
We’ll explore racial bias in datasets using real world examples, cover a use case for developing an OpenFace model for a celebrity look-a-like app, and show how it can fail with a homogeneous dataset.
My hope is that you’ll walk away understanding that even with the best intentions, anyone can fall victim to bias when gathering data and training models.
And I hope you gain an understanding of the very real impacts that racial bias has in facial recognition software particularly, but also keep in mind that it’s not exclusive to that domain.
First I’d like to talk about the link between implicit and racial bias in humans and how it can lead to racial bias in AI systems.
Implicit bias can affect the way we behave. This infographic refers to a field study done by Bertrand and Mullainathan (2004) showing the likelihood of getting through the hiring pipeline based on the whiteness of your name. The researchers created fake resumes, some low quality and some high quality, and submitted them to various job ads. They found that how "white" the name sounded mattered more than the quality of the resume. In other words, the gap in interview callbacks between white-sounding and black-sounding names was bigger than the gap between high quality and low quality resumes.
While this is just one example of racial bias in the hiring process, this kind of implicit bias in human behavior directly leads to us exposing those biases to the world through the artificial intelligence systems that we build.
Above is an example of ProPublica's approximation model output based on the risk assessment software called COMPAS, which was created by the company Northpointe. The COMPAS software predicts the likelihood of a person reoffending using factors such as poverty, joblessness and other variables (although race isn’t included in their assessment directly), and it is used by judges to gauge sentence harshness and length.
According to ProPublica's analysis of the COMPAS algorithm, which was based on their approximation model, the software produces almost twice as many false positives for black offenders as for white offenders.
An example of this is in the slide, where ProPublica's approximation model scored Brisha Borden, who had 4 juvenile misdemeanors, as high risk, while Vernon Prater was rated low risk for reoffending although he had a lengthy record including armed robberies.
Two years later Borden had no other offenses while Prater had one for grand theft auto.
This leads to the question of whether software is “teaching” judges that people of color are more likely to reoffend and possibly confirming currently held biases of judges that black people are more likely to reoffend than white people.
So you can see how important it is to keep in mind implicit bias when building out AI models because they can have incredibly far reaching impacts on people's lives.
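To make the false positive comparison from the COMPAS discussion concrete, here is a minimal sketch of how a false positive rate can be computed separately for each group. The function name, the record layout, and the toy records are all made up for illustration; this is not ProPublica's data or their methodology, just the basic arithmetic: people flagged as high risk who did not reoffend, divided by everyone in that group who did not reoffend.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Return {group: false positive rate}.

    records: iterable of (group, predicted_high_risk, reoffended) tuples.
    A false positive is someone predicted high risk who did not reoffend.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)  # people who did not reoffend
    for group, predicted_high_risk, reoffended in records:
        if not reoffended:
            negatives[group] += 1
            if predicted_high_risk:
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Made-up toy records for illustration only (NOT real COMPAS data).
toy_records = [
    ("group_a", True, False),
    ("group_a", True, False),
    ("group_a", False, False),
    ("group_a", False, False),
    ("group_a", True, True),
    ("group_b", True, False),
    ("group_b", False, False),
    ("group_b", False, False),
    ("group_b", False, False),
    ("group_b", True, True),
]

rates = false_positive_rate_by_group(toy_records)
print(rates)  # {'group_a': 0.5, 'group_b': 0.25}
```

A gap like the one in this toy data stays hidden if you only look at a single overall error number, which is why the breakdown by group matters.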
We've lightly covered some areas where implicit racial bias can occur in humans, through the research analyzing resume names, along with the potential racial bias occurring in recidivism software. But what are some real world applications for face recognition and other biometric software?
Biometric software is used for obvious cases like using your fingerprint or face recognition to unlock your phone.
In particular, facial recognition is used in animation to give real-life movement to characters on screen and it’s even used for personalized marketing campaigns.
What might be less obvious is how facial recognition is being used in security and police applications for identifying people in real time with surveillance cameras in public places.
According to perpetuallineup.org, a Georgetown University Law site, the city of LA uses 16 cameras placed in undisclosed locations in public areas in order to identify citizens who have warrants for their arrest.
Furthermore, according to the same site, police in Florida and Southern California have mobile devices with facial recognition software that they can use to match images of people, taken when they are pulled over while driving or stopped while walking around, against mugshots in their databases.
Note: In Florida’s case they not only have access to mugshots, but also DMV images of non-offending citizens.
However, there is a severe lack of regulation in how police departments use these images. For instance, some departments don’t need reasonable suspicion to search these images, and most departments aren’t audited for misuse. Other cities, like Seattle, do have more stringent measures in place, as they only have mugshots in their database and also need to have a reasonable suspicion to search.
Along with the lack of regulation, it's important to recognize that there aren’t any public reports on the accuracy of the models used by police departments. But again, according to perpetuallineup.org, we know from an FBI co-authored study that there is evidence that facial recognition systems aren’t as accurate when used on African Americans versus Caucasians.
So we've seen how real world applications of facial recognition and other software using statistical models lead to real world problems.
In order to understand how these models fail, we need to understand how they are built, learn about the data they are trained on and note other potential areas that might result in model failure.
At a high level, facial recognition software detects one or more faces in an image, separates each face from the background of the image, normalizes the position of the face, and passes it through a neural net for feature discovery. When it’s ready for classification, the result is used to compare a face in one image to faces in a database to see if there’s a match.
In the slide, we have two actors from the urban fantasy show True Blood. The actors are correctly classified as themselves by a facial recognition model called OpenFace.
OpenFace is an open source deep learning facial recognition model. It’s based on the paper FaceNet: A Unified Embedding for Face Recognition and Clustering by some folks at Google, is implemented using Python and Torch, and can be run on CPU or GPU environments.
While OpenFace is only a couple of years old, it’s been widely adopted because it’s open source and offers a level of accuracy similar to facial recognition models found in private state-of-the-art systems such as Google’s FaceNet or Facebook’s DeepFace.
What’s particularly nice about OpenFace is that development of the model focused on real-time face recognition on mobile devices, so you can train a model with high accuracy with very little data on the fly.
In fact, the classifier for the actors in this slide was trained on 10 images each, and it correctly classifies new images the model hasn’t seen before.
The training is done offline using Torch, and it’s done only once by the folks at OpenFace using 500k labeled images that come from two open source data sources. This results in the model being able to generalize better on public images than models such as Facebook's DeepFace, which are trained only on Facebook user image data and wouldn't do as well on real world data.
Those images are then passed into a neural net for feature extraction using Google’s FaceNet Inception model. Running the 500,000 images through FaceNet produces 128 facial features, which are embeddings in a Euclidean space that represent a generic face.
Then FaceNet’s triplet loss function is used during training to pull embeddings of the same person closer together and push embeddings of different people further apart. This also enables the clustering of similar images, which gives you faster model classification.
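As a rough illustration of the triplet loss idea, here is a minimal NumPy sketch for a single triplet, following the formula from the FaceNet paper: the squared distance between an anchor and a positive (same person) should be smaller than the squared distance between the anchor and a negative (different person) by at least a margin. The function name and the tiny 2-dimensional vectors are assumptions made for readability; real embeddings are 128-dimensional, and in training the loss is computed over many triplets. The 0.2 default margin is the value reported in the FaceNet paper.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one (anchor, positive, negative) triplet.

    loss = max(||a - p||^2 - ||a - n||^2 + margin, 0)
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    positive_distance = np.sum((anchor - positive) ** 2)
    negative_distance = np.sum((anchor - negative) ** 2)
    return float(max(positive_distance - negative_distance + margin, 0.0))


# Toy 2-d vectors standing in for 128-d embeddings (illustrative only).
well_separated = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_separated = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(well_separated)   # zero loss: the positive is already much closer
print(badly_separated)  # positive loss: the negative is closer than the positive
```

When the loss is zero the triplet is already "solved", so only triplets that violate the margin push the network to change its embeddings.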
Now that the training is done in Torch, you have a trained model to use on new images. Those images are first sent through a library called dlib for face detection, producing 68 facial landmarks. Then OpenFace uses OpenCV’s affine transformation to normalize each face’s position so all faces are pointed forward, and the faces are cropped to 96x96 pixels for input to the trained neural net.
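To show what the affine normalization step is doing, here is a minimal NumPy sketch that estimates a 2x3 affine matrix from three landmark points and maps them onto fixed template positions in a 96x96 crop. This is not OpenFace's code: as described above, OpenFace relies on OpenCV for this step. The landmark coordinates, the template coordinates, and the function names below are all made up for illustration.

```python
import numpy as np


def estimate_affine(source_points, target_points):
    """Estimate the 2x3 affine matrix mapping source_points onto target_points."""
    source = np.asarray(source_points, dtype=float)
    target = np.asarray(target_points, dtype=float)
    # Homogeneous coordinates: each row becomes [x, y, 1].
    source_h = np.hstack([source, np.ones((len(source), 1))])
    # Solve source_h @ X = target in the least-squares sense; X is 3x2.
    solution, *_ = np.linalg.lstsq(source_h, target, rcond=None)
    return solution.T  # 2x3 affine matrix


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to an array of (x, y) points."""
    pts = np.asarray(points, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return pts_h @ matrix.T


# Hypothetical landmarks (two outer eye corners and a nose point) found in
# a tilted face, and hypothetical template positions in a 96x96 crop.
detected = [(30.0, 40.0), (70.0, 38.0), (50.0, 60.0)]
template = [(24.0, 30.0), (72.0, 30.0), (48.0, 56.0)]

affine_matrix = estimate_affine(detected, template)
aligned = apply_affine(affine_matrix, detected)
```

With three non-collinear points the affine transform is determined exactly, so the detected landmarks land on the template positions; the same matrix would then be used to warp every pixel of the image.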
Then you use the trained neural net by inputting the normalized, cropped images in a single forward pass, which gives you 128 facial embeddings. This low-dimensional data makes for faster classification.
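Once faces are represented as embeddings, comparing two faces comes down to measuring the distance between two vectors. Here is a minimal sketch of that comparison, assuming a hypothetical in-memory database of name-to-embedding pairs and a made-up distance threshold. The 4-dimensional vectors stand in for real 128-dimensional embeddings, and none of the names below come from OpenFace.

```python
import numpy as np


def closest_match(query_embedding, database, threshold):
    """Return (name, squared_distance) of the nearest database entry.

    If even the nearest entry is further away than threshold,
    the name is None, meaning "no match in the database".
    """
    query = np.asarray(query_embedding, dtype=float)
    best_name, best_distance = None, float("inf")
    for name, embedding in database.items():
        distance = float(np.sum((query - np.asarray(embedding, dtype=float)) ** 2))
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance > threshold:
        return None, best_distance
    return best_name, best_distance


# Toy 4-d vectors standing in for 128-d embeddings (illustrative only).
toy_database = {
    "person_a": [0.1, 0.9, 0.0, 0.2],
    "person_b": [0.8, 0.1, 0.5, 0.0],
}

match_name, match_distance = closest_match([0.1, 0.8, 0.1, 0.2], toy_database, threshold=0.5)
unknown_name, _ = closest_match([5.0, 5.0, 5.0, 5.0], toy_database, threshold=0.5)
```

Here the first query lands close to person_a, while the second is far from everyone and comes back as no match. The threshold is a design choice: set it too loose and strangers get matched, too tight and real matches get missed.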
Again, for a more detailed look at face recognition and OpenFace, check out my blog post from last week.
So here is our use case! We’re going to think through a Celebrity Look-a-Like app to see what True Blood characters we most look like. The reason I chose True Blood is that the show had a fairly diverse cast, and diversity is the crux of our racial bias issue in AI: having datasets that don’t correctly represent real world conditions.
First, let’s talk about the training data. I chose 10 images for each actress that had faces in various lighting, quality, and position (although note that most of these differences and issues are taken care of by dlib and OpenCV).
The next step in our pipeline is running the images through the trained neural net to get the 128 facial embeddings, the low-dimensional face representation used for classification.
And then comes the classification portion of using OpenFace. In this use case of building a celebrity look-a-like app you would probably want to use a clustering algorithm instead, but we already had an OpenFace classification model implemented on Algorithmia using a support vector classification model.
So for this example, I simply used the highest confidence score from my trained model's prediction.
So for my first implementation I trained on 10 images for each of the top 5 main actresses from True Blood.
Although the show's whole cast over time is diverse, the top 5 actresses are Caucasian with one exception, Rutina Wesley, an African American actress.
When I tested this model on an image of Joanna Gaines, it classified her as Carrie Preston. Subjectively, I didn't think it was a great match; their facial landmarks are quite different from each other, especially their face and eye shape.
Something to note is that the preprocessing step in dlib converts the images to greyscale and produces 68 landmarks that are fed into the trained neural net, so the neural net doesn’t see skin color, only facial features. When we pass our image through the trained neural net, we get 128 facial embeddings used by the SVM classifier.
Even though skin color isn’t in our feature space, members of certain ethnic groups or races share more similar features with each other than they do with those outside their group. For instance, a study called Ethnicity Identification from Face Images by Xiaoguang Lu and Anil K. Jain found that, out of a sample of Chinese, African American, and North American Caucasian faces, Chinese faces were the widest and had the widest intercanthal space, which is the area between the eyes.
Not surprisingly, a PBS article on AI bias called Ghosts in the Machine cited research showing that models built in Asia (and presumably trained on more Asian faces than on Caucasian faces) have higher accuracy classifying Asian populations versus Caucasian ones, and vice versa. This highlights the need for diversity in our training data to reduce the likelihood of sampling bias.
Since I wasn’t satisfied with the results of the trained model using the 5 main actresses from True Blood, I changed up the diversity of the training set.
To keep the sample size the same I removed 3 of the main actresses from my original dataset, but kept Carrie Preston (who Gaines was originally classified as) and Kristin Bauer, who is another Caucasian actress.
I then added 3 actresses that were more ethnically diverse, but not main characters. Our newly trained model predicted that Joanna Gaines was Vedette Lim.
Subjectively it seems their facial features are more similar, such as face shape, nose and eyes.
And I happen to know that they are both half Asian although Joanna Gaines is half Korean while Vedette Lim is half Chinese.
While the classification confidence scores are pretty low, remember that we’re creatively using this model as a clustering algorithm. Also, if you compare the confidence score from Joanna Gaines’s first match to the one from her second match, you will notice that the first score is higher. So you have to remember to disregard any previous predictions you got from your first model, because now you’ve trained on a different dataset. You can’t compare confidence scores from one trained model to another; you can only compare the predictions within a single trained model.
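Here is a small sketch of why confidence scores from two differently trained models can't be compared. It is not the SVM used in the talk: as a stand-in, it turns distances to made-up class centroids into confidences with a softmax, which is enough to show the effect. The labels, centroids, and function names are all hypothetical.

```python
import math


def confidences(query, centroids):
    """Softmax over negative squared distances to each class centroid.

    This is a simple stand-in for a classifier's confidence scores:
    the scores always sum to 1 across the labels the model knows about.
    """
    scores = {
        label: -sum((q - c) ** 2 for q, c in zip(query, centroid))
        for label, centroid in centroids.items()
    }
    largest = max(scores.values())  # subtract the max for numerical stability
    exponentials = {label: math.exp(s - largest) for label, s in scores.items()}
    total = sum(exponentials.values())
    return {label: e / total for label, e in exponentials.items()}


def top_prediction(confidence_by_label):
    """Return the (label, confidence) pair with the highest confidence."""
    label = max(confidence_by_label, key=confidence_by_label.get)
    return label, confidence_by_label[label]


query = (0.0, 0.0)

# First model: the nearest class is fairly far away, but the other is much further.
first_model = {"actress_a": (1.0, 0.0), "actress_b": (3.0, 0.0)}
# Second model: retrained with a class that is genuinely closer to the query.
second_model = {"actress_a": (1.0, 0.0), "actress_c": (0.5, 0.0)}

first_label, first_confidence = top_prediction(confidences(query, first_model))
second_label, second_confidence = top_prediction(confidences(query, second_model))
```

In this toy setup the second model finds a closer match (actress_c) yet reports a lower top confidence than the first model did, because each score only says how a label compares to the other labels in that same model.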
So you can see that when the distribution of your training data is dissimilar to your testing or real world data distribution, then your model won’t give you as accurate results as when training on a more diverse dataset that better represents real world conditions.
Now that we’ve seen how a model can give you less than stellar results, let’s talk about how to create more accurate models.
I hope it’s obvious now that diverse training sets give you more accurate models, but getting diverse training sets isn’t as easy or intuitive as you might think.
Many researchers building out these models, like the folks at Carnegie Mellon that produced OpenFace, only have access to public open source images because it is both time consuming and potentially cost prohibitive to create labeled datasets themselves. So they have to use what’s out there, which might not be as diverse as they would wish.
Also, if you are extracting your own data, think through how your data extraction methods can potentially lead to homogeneous datasets.
Suppose you really were to build the celebrity look-a-like app on, for instance, the top 2,000 celebrities, and you got permission to scrape their images from a site like IMDb, where they are sorted by popularity. The majority of them would likely be white. If you only took from the top, you wouldn’t have the same diversity as if you took images from the top, middle, and bottom thirds. So you have to think a bit harder when you gather your data.
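The idea of taking images from the top, middle, and bottom thirds of a popularity-sorted list can be sketched like this. The function name and the list of ranks are hypothetical, and sampling across popularity does not by itself guarantee a diverse dataset; you would still need to check who actually ends up in the sample.

```python
import random


def sample_across_thirds(ranked_items, per_third, seed=0):
    """Draw per_third random items from each third of a ranked list."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    count = len(ranked_items)
    first_cut, second_cut = count // 3, 2 * count // 3
    thirds = [
        ranked_items[:first_cut],
        ranked_items[first_cut:second_cut],
        ranked_items[second_cut:],
    ]
    sample = []
    for third in thirds:
        sample.extend(rng.sample(third, min(per_third, len(third))))
    return sample


# Popularity ranks 1..30 standing in for a scraped, sorted list (illustrative).
popularity_ranks = list(range(1, 31))
picked = sample_across_thirds(popularity_ranks, per_third=2)
```

Compared with simply slicing off the first six entries, this draws two entries from each third of the ranking, so less popular entries are represented too.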
Also, think about how others are gathering their data. For instance, OpenFace uses dlib for face detection, which was trained on Labeled Faces in the Wild images. Those images were in turn collected using Viola-Jones face detection, which relies on a Haar cascade classifier (no longer state-of-the-art), and they were scraped from the internet almost 10 years ago.
This raises questions about whether or not there have been changes in the diversity demographics over the past 10 years in available public image data, but also note that there is no ethnic distribution data for LFW.
Lastly, the OpenFace model is tested on Labeled Faces in the Wild, the same dataset that was used to train dlib for face detection. Again, this doesn't mean that any of these issues result in a less accurate model for face detection and face recognition on people of color, but they are worth noting.
Now on to our testing data, where it is equally important to account for real world conditions such as diversity.
Splitting your data into 80/20 or 90/10 training and test sets isn’t good enough if you have doubts about the diversity of your data.
Test your model specifically on traditionally underrepresented people, or it might perform poorly in real world conditions. Our goal is to create the most accurate model we can.
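One simple way to act on this is to report accuracy per group instead of a single overall number. Here is a minimal sketch; the function name, the group labels, and the toy results are made up for illustration and assume you have some group annotation for each test example.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Return {group: accuracy} for (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted_label in examples:
        total[group] += 1
        correct[group] += int(true_label == predicted_label)
    return {group: correct[group] / total[group] for group in total}


# Made-up test results: 8 examples from group_a, only 2 from group_b.
toy_results = (
    [("group_a", "x", "x")] * 7
    + [("group_a", "x", "y")]
    + [("group_b", "x", "x")]
    + [("group_b", "x", "y")]
)

overall_accuracy = sum(true == predicted for _, true, predicted in toy_results) / len(toy_results)
per_group_accuracy = accuracy_by_group(toy_results)
print(overall_accuracy)    # 0.8
print(per_group_accuracy)  # {'group_a': 0.875, 'group_b': 0.5}
```

The overall number looks respectable, but it hides the fact that the model is right only half the time on the smaller group, which is exactly what a plain random split can miss.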
Another question to ask yourself is: who is using your model?
You likely don’t know, so assume everyone. Yes, it’s expensive and time consuming to gather diverse training data, and a lot of us only have access to public datasets, but you don’t initially have to. First you need to make sure you are testing with diverse data, and that will take you to the next decision you need to make.
If your model is failing on your more diverse test set, then you need to decide whether or not your model is “good enough”. I highly recommend that you carefully evaluate the consequences of your model failing on a portion of the population and decide if you want to risk alienating an entire group of people, or incurring bad press, by putting out a model that performs poorly. If you are labeling animals and a fox gets mislabeled as a cat 30% of the time, then unless your user is the Department of Natural Resources or Fish and Wildlife, your users likely won't care too much.
However, a mislabeling can be as erroneous and insensitive as in Google's case, where they mislabeled African Americans as gorillas. Note that their model failure, as public as it was, still has not been fixed. According to a Wired article, Google's initial fix, which is still in production two years later, has been to remove certain labels such as the term “Gorillas” rather than fix the actual model.
And last, a way to mitigate poor model performance on racially diverse real world conditions is to ask yourself if your own team is diverse. Often these issues with racial bias are due to homogeneous teams, and this makes it all the more important to increase diversity in tech. Note that, according to the PBS article Ghosts in the Machine, Ms. Buolamwini, a black PhD student at MIT, had to put on a white mask to enable facial recognition algorithms to track her face, while in another project her white classmates had no trouble with a facial detection algorithm that failed on her face. This is an example of why it's important to have diversity on your teams: when testing these models, research teams and developers will often test on their own images, or at the very least a diverse team might catch an issue with the training data sooner than a homogeneous one. Diversity increases the unique perspectives that help us build smarter applications and models.