Facial recognition has become an increasingly ubiquitous part of our lives.
Today, smartphones use facial recognition for access control, while animated movies such as Avatar use it to bring realistic movement and expression to life. Police surveillance cameras use face recognition software to identify people who have warrants out for their arrest, and these models are also being used in retail stores for targeted marketing campaigns. And of course we’ve all used celebrity look-a-like apps and Facebook’s auto tagger that classifies us, our friends, and our family.
Face recognition can be used in many different applications, but not all facial recognition libraries are equal in accuracy and performance and most state-of-the-art systems are proprietary black boxes.
OpenFace is an open source library that rivals the performance and accuracy of proprietary models. This project was created with mobile performance in mind, so let’s look at some of the internals that make this library fast and accurate and think through some use cases where you might want to implement it in your project.
High-level Architectural Overview
OpenFace is a deep learning facial recognition model developed by Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. It’s based on the paper “FaceNet: A Unified Embedding for Face Recognition and Clustering” by Florian Schroff, Dmitry Kalenichenko, and James Philbin at Google, and it is implemented using Python and Torch so it can run on CPUs or GPUs.
While OpenFace is only a couple of years old, it’s been widely adopted because it offers levels of accuracy similar to facial recognition models found in private state-of-the-art systems such as Google’s FaceNet or Facebook’s DeepFace.
What’s particularly nice about OpenFace, besides it being open source, is that development of the model focused on real-time face recognition on mobile devices, so you can train a model with high accuracy with very little data on the fly.
From a high-level perspective, OpenFace uses Torch, a scientific computing framework, to do training offline, meaning it’s only done once by OpenFace and the user doesn’t have to get their hands dirty training on hundreds of thousands of images themselves. Those images are fed into a neural net for feature extraction using Google’s FaceNet architecture. FaceNet relies on a triplet loss function to measure how well the neural net separates one person’s face from another’s, and because the resulting embeddings are measured on a hypersphere, they can also be used to cluster faces.
This trained neural net is later used in the Python implementation after new images are run through dlib’s face detection model. Once the faces are normalized by OpenCV’s affine transformation so all faces are oriented the same way, they are sent through the trained neural net in a single forward pass. This results in a 128-dimensional facial embedding for each face, which can be used for classification and matching or even in a clustering algorithm for similarity detection.
During the training portion of the OpenFace pipeline, 500k images are passed through the neural net. These images are from two public datasets: CASIA-WebFace, which comprises 10,575 individuals with a total of 494,414 images, and FaceScrub, which is made up of 530 individuals, all public figures, with a total of 106,863 images.
The point of training the neural net on all these images ahead of time is that it wouldn’t be possible on mobile, or in any other real-time scenario, to train on 500,000 images to retrieve the needed facial embeddings. Now remember, this portion of the pipeline is only done once: OpenFace trains on these images to learn a generic mapping from a face to a 128-dimensional embedding, and that mapping is later used in the Python training-on-the-fly part of the pipeline. Then, instead of matching an image in high-dimensional space, you’re only using low-dimensional data, which helps make this model fast.
As mentioned before, OpenFace uses Google’s FaceNet architecture for feature extraction and uses a triplet loss function to test how accurately the neural net classifies a face. It does this by training on three different images at a time: a known face image called the anchor, another image of that same person called the positive, and an image of a different person called the negative.
If you want to learn more about triplet loss, check out Andrew Ng’s Convolutional Neural Network Coursera video.
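To make the anchor, positive, and negative idea concrete, here is a minimal sketch of a triplet loss in plain Python. It is not OpenFace’s Torch training code: the helper names, the toy 3-dimensional vectors, and the margin of 0.2 are all assumptions chosen for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: it drops to zero once the negative is at
    least `margin` farther from the anchor than the positive is."""
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 3-dimensional "embeddings" (hypothetical values; real ones are 128-d).
anchor = l2_normalize([1.0, 0.0, 0.0])
positive = l2_normalize([0.9, 0.1, 0.0])   # same person, slightly different
negative = l2_normalize([0.0, 1.0, 0.0])   # different person

easy = triplet_loss(anchor, positive, negative)  # correctly ordered triplet
hard = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
print(easy, hard)
```

The correctly ordered triplet contributes no loss, while the swapped one is penalized heavily, which is the signal that pushes embeddings of the same person together during training.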
The cool thing about using triplet embeddings is that the embeddings are measured on a unit hypersphere where Euclidean distance is used to determine which images are closer together and which ones are further apart. The negative image embeddings end up further from the positive and anchor embeddings, while those two end up closer to each other. This is important because it allows for clustering algorithms to be used for similarity detection. You might want to use a clustering algorithm if you wanted to detect family members on a genealogy site for example, or on social media for possible marketing campaigns (I’m thinking Groupon here).
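As a rough illustration of how distance on the hypersphere enables grouping, the sketch below puts embeddings into the same group when they fall within a distance threshold. The greedy strategy, the function names, the 2-dimensional toy vectors, and the threshold of 0.5 are assumptions for the example; a real project would more likely reach for an established clustering algorithm.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_by_distance(embeddings, threshold):
    """Greedy grouping: place each embedding in the first group whose
    first member is closer than `threshold`, else start a new group."""
    groups = []  # each group is a list of indices into `embeddings`
    for idx, emb in enumerate(embeddings):
        for group in groups:
            if euclidean(embeddings[group[0]], emb) < threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Hypothetical, roughly unit-length 2-d embeddings for two people.
faces = [
    [1.0, 0.0],     # person A
    [0.0, 1.0],     # person B
    [0.98, 0.199],  # person A again
    [0.199, 0.98],  # person B again
]
groups = group_by_distance(faces, threshold=0.5)
print(groups)  # [[0, 2], [1, 3]]
```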
Isolate Face from Background
Now that we’ve covered how OpenFace uses Torch to train on hundreds of thousands of images from public datasets to get low-dimensional face embeddings, we can check out their use of the popular face detection library dlib and see why you’d want to use it versus OpenCV’s face detection library.
One of the first steps in facial recognition software is to isolate the actual face from the background of the image along with isolating each face from others found in the image. Face detection algorithms also must be able to deal with bad and inconsistent lighting and various facial positions such as tilted or rotated faces. Luckily dlib along with OpenCV handles all these issues. Dlib takes care of finding the fiducial points on the face while OpenCV handles the normalization of the facial position.
It’s important to note that while using OpenFace you can implement face detection with either dlib, which uses a combination of HOG (Histogram of Oriented Gradients) and a Support Vector Machine, or OpenCV’s Haar cascade classifier. Both are trained on positive and negative images (meaning there are images that have faces and ones that don’t), but they are very different in implementation, speed, and accuracy.
There are several benefits to using the HOG classifier. First, the training is done using a sliding sub-window on the image, so no subsampling or parameter manipulation is required as it is with the Haar classifier used in OpenCV. This makes dlib’s HOG and SVM face detection easier to use and faster to train. It also means that less data is required. On top of that, HOG has higher accuracy for face detection than OpenCV’s Haar cascade classifier. Kind of makes using dlib’s HOG + SVM a no-brainer for face detection!
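For intuition about what a HOG feature and a sliding sub-window look like, here is a heavily simplified sketch in plain Python. It is not dlib’s implementation: the function names are made up for this example, and real HOG descriptors add steps such as block normalization and vote interpolation that are left out here.

```python
import math

def hog_cell_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations (0 to 180 degrees) for
    one cell, with each pixel's vote weighted by its gradient magnitude."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Skip the border so every pixel has a neighbor on each side.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal change
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical change
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

def sliding_windows(width, height, window, step):
    """Yield the top-left corner of every window position in the image."""
    for top in range(0, height - window + 1, step):
        for left in range(0, width - window + 1, step):
            yield left, top

# A 4x4 cell whose brightness rises from left to right (a vertical edge).
cell = [[0, 10, 20, 30] for _ in range(4)]
hist = hog_cell_histogram(cell)
positions = list(sliding_windows(width=8, height=8, window=4, step=4))
print(hist)
print(positions)
```

In a full detector, a histogram like this would be computed for every cell in each window, and the concatenated features would be scored by the SVM to decide whether that window contains a face.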
Along with finding each face in an image, part of the process in facial recognition is preprocessing the images to handle problems such as inconsistent and bad lighting, converting images to grayscale for faster training, and normalization of facial position.
While some facial recognition models can handle these issues by training on massive datasets, OpenFace relies on dlib’s facial landmarks together with OpenCV’s 2D affine transformation, which rotates the face and makes the position of the eyes, nose, and mouth consistent for each face. dlib locates 68 facial landmarks, and the positions of those points are compared to the corresponding points in an average face image. Then the image is rotated and transformed based on those points to normalize the face for comparison and cropped to 96x96 pixels for input to the trained neural net.
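An affine transformation is fully determined by three pairs of matching points, which is the problem OpenCV’s cv2.getAffineTransform solves. The sketch below solves the same small system by hand in plain Python so the idea is visible without any dependencies. The landmark coordinates, the template positions inside a 96x96 crop, and the choice of the two eyes and the nose as reference points are hypothetical values for illustration.

```python
def affine_from_three_points(src, dst):
    """Solve for the 2x3 affine matrix that maps three source points onto
    three destination points, using Cramer's rule."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear")
    rows = []
    for k in range(2):  # k = 0 solves the x' row, k = 1 the y' row
        u1, u2, u3 = dst[0][k], dst[1][k], dst[2][k]
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        c = (x1 * (y2 * u3 - y3 * u2) - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        rows.append((a, b, c))
    return rows

def apply_affine(matrix, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return tuple(a * x + b * y + c for a, b, c in matrix)

# Hypothetical landmarks on a tilted face: left eye, right eye, nose.
detected = [(30.0, 40.0), (70.0, 50.0), (48.0, 75.0)]
# Hypothetical positions of the same landmarks in a 96x96 "average face".
template = [(24.0, 30.0), (72.0, 30.0), (48.0, 60.0)]

matrix = affine_from_three_points(detected, template)
aligned = [apply_affine(matrix, p) for p in detected]
print(aligned)
```

Applying the same matrix to every pixel, which is what a warp function does, moves the detected eyes and nose onto the template positions so every face reaches the neural net in a consistent pose.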
So, after we isolate the face from the background and preprocess it using dlib and OpenCV, we can pass the image into the neural net that was trained in the Torch portion of the pipeline. In this step, there is a single forward pass through the neural net to get a 128-dimensional embedding (128 facial features) that is used in prediction. These low-dimensional facial embeddings are then used in classification or clustering algorithms.
For classification in tests, OpenFace uses a linear support vector machine, which is commonly used in the real world to match image features. The most impressive thing about OpenFace is that at this point it takes only a few milliseconds to classify images.
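One plausible reason classification is so quick is that, once a linear model is trained, predicting amounts to a dot product per person. The sketch below shows only that prediction step with one score per class. The names, the 4-dimensional toy embedding, and the hand-picked weights are hypothetical, and a real setup would learn the weights with an SVM library rather than write them by hand.

```python
def decision_scores(embedding, weights, biases):
    """One score per person: the linear model's w . x + b."""
    return {
        name: sum(w * x for w, x in zip(weights[name], embedding)) + biases[name]
        for name in weights
    }

def predict(embedding, weights, biases):
    """Return the name whose linear score is highest."""
    scores = decision_scores(embedding, weights, biases)
    return max(scores, key=scores.get)

# Hypothetical, hand-picked weights for a 4-dimensional toy embedding.
weights = {
    "alice": [1.0, -1.0, 0.0, 0.0],
    "bob": [-1.0, 1.0, 0.0, 0.0],
}
biases = {"alice": 0.0, "bob": 0.0}

who = predict([0.8, 0.1, 0.3, 0.5], weights, biases)
print(who)  # alice
```

With a 128-dimensional embedding, each score is just 128 multiplications and additions, so even a long list of known people can be checked quickly.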
Now that we’ve gone through OpenFace’s architecture at a high level, we can cover some fun use case ideas for the open source library. As previously mentioned, facial recognition is used as a form of access control and identification. One idea that we explored a couple of years ago is to use it for identification and customization of your experience when entering our office (https://blog.algorithmia.com/hey-zuck-we-built-your-facial-recognition-ai/). Wow, that was a long time ago in startup world!
But you could also think about creating a mobile app that identifies VIP folks entering a club or party. Bouncers wouldn’t have to remember everyone’s face or rely on a list of names for letting people enter. It would also be easy to add new faces to the training data on the fly, and the model would be trained by the time the individual went back outside for a breath of fresh air and wanted to enter the club again.
Along those lines, you could add facial identification to a meetup or meeting where there is temporary access to a floor or office. Security or front desk personnel could easily update or remove images from the dataset on their phones.
Where to Find OpenFace for Implementation
Hopefully, we’ve made the case for checking out OpenFace. You can implement the model for facial recognition from OpenFace on GitHub yourself, or check out the hosted OpenFace model on Algorithmia where you can add, train, remove and predict images using our SVM implementation for classification. If you want to learn how to use our facial recognition algorithm, check out our recipe for making a celebrity classifier.
Also, stay tuned for slides from my talk next week at PyCascades, a regional Python conference in Vancouver, B.C., where I’ll be speaking on Racial Bias in Facial Recognition Software. I’ll be covering a use case for building a celebrity look-a-like app and talking about model failure due to racial bias in that and other use cases.