You’ve probably seen a lot of recent articles about NIST’s tests of facial recognition algorithms against people wearing masks. The headlines of these articles sometimes don’t capture the entire truth, however.
Normally I do not speak about NIST results because when past NIST results were released, I was employed by a company that participated in these results, and other people within the company – not me – were in charge of any official company communications. Presently I am not employed by a NIST test participant, so I have the liberty to speak more freely. (At least for now. If I get hired by a NIST test participant next week, I will need to abide by that company’s social media rules.)
Man, this sounds really sexy: “25 year biometric veteran tells THE TRUTH about NIST tests!” But before you get too excited about my speaking freely, I’m NOT going to spill any earth-shattering revelations, and I’m obviously NOT going to violate any confidentiality agreements with past employers.
But there are a few things that I want to clarify about what this NIST test IS, and what it IS NOT. Because the headlines – and in some cases even the articles themselves – distort the true story.
First, NIST did not actually conduct tests with people wearing masks. Wait a minute, you say. NIST tested masks without actually using masks?
NIST has been very up-front about this, even if this teeny little fact didn’t make it into the press headlines, or in some cases into the press stories. But if you missed it, here’s what NIST says about its tests.
Our initial approach has been to apply masks to faces digitally (i.e., using software to apply a synthetic mask). This allowed us to leverage large datasets that we already have.
This is certainly a valid approach, since the need to test for people wearing face masks only became critical in the spring of this year. Rather than go through the trouble of creating entirely new datasets (and validating that the images of person X wearing a mask and person X not wearing a mask don't have any other differences), NIST simply took its existing datasets (which have been used for years) and did some image-jiggling to run some tests as soon as possible.
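As a rough illustration of what "applying masks digitally" can look like, here is a minimal Python sketch that paints an opaque region over the lower part of a face image array. Everything here (the function name, the landmark coordinates, the mask color) is my own invention for illustration; NIST's actual synthetic-mask software is far more sophisticated, varying mask shape, color, and coverage.

```python
import numpy as np

def apply_synthetic_mask(image, nose_row, chin_row, left_col, right_col,
                         color=(80, 120, 200)):
    """Paint an opaque rectangle over the lower face region.

    image: H x W x 3 uint8 array; the row/column bounds are assumed to
    come from face landmark detection. Returns a masked copy; the
    original image is left untouched.
    """
    masked = image.copy()
    masked[nose_row:chin_row, left_col:right_col] = color
    return masked

# Placeholder "face": a flat gray 112x112 image standing in for a real photo.
face = np.full((112, 112, 3), 180, dtype=np.uint8)
masked_face = apply_synthetic_mask(face, nose_row=60, chin_row=100,
                                   left_col=20, right_col=92)
```

The appeal of this approach is exactly what NIST states: the same unmasked source images can be reused, so the only variable introduced is the synthetic mask itself.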
Second, the first round of NIST testing was conducted with algorithms that had NOT been optimized for masked faces. Perhaps you saw some of the headlines in July that pretty much said that facial recognition was a failure because it had problems with masks. Again, let’s see what NIST itself said.
We tested algorithms that were already submitted to FRVT 1:1 prior to mid-March 2020. This report is intended to support end-users to understand how a pre-pandemic algorithm might be affected by the arrival of substantial number of subjects wearing face masks. The next report will document accuracy values for more recent algorithms, some developed with capabilities for recognition of masked faces.
So the tested algorithms were all developed before COVID-19 caused massive changes in our behavior.
Now there’s always benefit in hindsight, but before mid-March 2020, who could have predicted that within mere weeks, large portions of the world population would suddenly be obscuring their faces? Good facial recognition algorithms are designed to work with stable portions of the face, which usually include the nose and mouth. So a sudden change that removes nose and mouth detail (or, alternatively, a sudden event such as solar flares that prompts people to wear sunglasses, removing eye detail) is a significant factor.
Well, now the promised revised tests (“The next report”) that include newer algorithms have been released by NIST this week. Which brings me to my third point.
Third, the NIST face mask tests are an iterative process. NIST has been adopting this testing methodology lately, where it accepts new algorithms from vendors, tests them, releases the results, accepts newer algorithms, tests them, re-releases the results, and so forth.
So if you go to https://pages.nist.gov/frvt/html/frvt_facemask.html this afternoon, you will see a section entitled “Results” with a note that the results were last updated on August 25, 2020. The results are going to be continuously updated as NIST tests newer algorithms.
The Results section explains the tests in detail. (I’ll return to some points about this later.)
The accuracy table currently shows the top performing 1:1 algorithms evaluated on masked images. Results are tabulated on the VISABORDER dataset where the probe image is masked (and the enrollment image remains unmasked); the table also includes the baseline FNMR when both images are unmasked. The false match rate (FMR) is the proportion of impostor comparisons at or above a given threshold, and the false non-match rate (FNMR) is the proportion of mated comparisons below that threshold. FNMR values are stated at a fixed threshold calibrated to give FMR = 0.00001 on unmasked images.
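To make the FNMR/FMR relationship concrete, here is a small Python sketch that calibrates a threshold on impostor scores to hit a target FMR and then measures FNMR at that threshold. The score distributions are entirely synthetic, invented for illustration; a real evaluation would use similarity scores produced by an actual face recognition algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic similarity scores (higher = more similar), for illustration only.
mated_scores = rng.normal(0.8, 0.1, 10_000)        # same-person comparisons
impostor_scores = rng.normal(0.3, 0.1, 1_000_000)  # different-person comparisons

# Calibrate: choose the threshold at which the proportion of impostor
# comparisons scoring at or above it is (approximately) the target FMR.
target_fmr = 0.00001
threshold = np.quantile(impostor_scores, 1.0 - target_fmr)

fnmr = np.mean(mated_scores < threshold)     # mated comparisons below threshold
fmr = np.mean(impostor_scores >= threshold)  # impostor comparisons at/above it
print(f"threshold={threshold:.3f}  FNMR={fnmr:.4f}  FMR={fmr:.6f}")
```

This is also why the fixed-threshold detail matters when reading the table: the threshold is calibrated on unmasked images, so a masked probe is being measured against an operating point it was never tuned for.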
When I viewed the tests, 128 algorithms had been submitted. Newer algorithms are shown in blue, older ones in red. As NIST itself noted, SOME of these newer algorithms were “developed with capabilities for recognition of masked faces.” Some newer algorithms may have been submitted for other reasons.
Anyway, the test results show both a “NOT MASKED” value and a “MASKED PROBE” value, and you can sort the algorithm list based upon either of these factors. For example, if you click on the heading of the “MASKED PROBE” column, you can list the algorithms based upon the lowest false non-match rate, or FNMR, or you can list them based upon the highest FNMR. Low false non-match rates are good. As of today, you can see that the masked probe FNMR ranges from around 2% (either really good or pretty good, depending upon how you look at it) to 100% (not so good).
So, I guess we should just pick the algorithm with the lowest FNMR and go with it, right? Well, this brings me to my final point.
Fourth, results of NIST tests provide you with…results of NIST tests. While NIST takes care to design tests that are of value, the tests themselves do not necessarily reflect issues that will be encountered in the real world. Here are some things to consider:
- In the real world, you might not be performing 1:1 comparisons. There are times when you may be performing 1:n, or one-to-many, searches. Maybe you’re at the gate to a military base, and a person wearing a mask is approaching the gate and hasn’t provided any assertion of identity. Commonly, what you would do in this case is to search that face against a database of bad people to make sure that a known threat isn’t approaching the gate. This is NOT a 1:1 comparison, and 1:1 results do not necessarily indicate how an algorithm would perform in a 1:n situation.
- In the real world, masks may come in all shapes and sizes, and may be worn in a variety of ways. Over the last few months, I’m sure that all of you have seen people who took great care to place their masks over their mouths…but left their noses uncovered. The NIST test doesn’t account for these many variations of mask wearing, or for the multitude of different styles of masks that we see today. Even if NIST designs future tests to account for this, they will still only be approximations of what is happening in the real world.
- In the real world, the people that you are matching will probably not be people in the VISABORDER dataset. When NIST designs any type of biometric test, it designates a dataset to be used for the test. Usually, NIST will provide some level of detail that describes the characteristics of the dataset – the origins of the dataset, the number of images in the dataset, the conditions under which the biometric images were acquired, etc. But THAT dataset may not reflect YOUR dataset.
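The 1:1 versus 1:n distinction in the first bullet above can be sketched in a few lines of Python. Everything here is hypothetical: the names, the random 128-dimensional templates standing in for real face features, and the 0.9 cosine-similarity threshold are all my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_template():
    """Stand-in for a real feature extractor: a random unit vector."""
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# Hypothetical enrollment gallery of known identities.
gallery = {name: random_template() for name in ["alice", "bob", "carol"]}

# Simulate a fresh capture of "bob": his template plus a little noise.
probe = gallery["bob"] + rng.normal(scale=0.02, size=128)
probe /= np.linalg.norm(probe)

THRESHOLD = 0.9  # assumed operating threshold

def verify(probe, claimed_template, threshold=THRESHOLD):
    """1:1 verification: one comparison against a single claimed identity."""
    return float(probe @ claimed_template) >= threshold

def identify(probe, gallery, threshold=THRESHOLD):
    """1:n identification: search the probe against every enrolled
    template and return candidates above threshold, best match first."""
    scores = ((float(probe @ tmpl), name) for name, tmpl in gallery.items())
    return sorted((pair for pair in scores if pair[0] >= threshold),
                  reverse=True)
```

Note that a 1:n search makes n comparisons, so its error behavior compounds differently than a single 1:1 comparison – which is exactly why 1:1 test results don't transfer directly.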
Allow me to go off on a tangent for a moment about datasets. There have been a number of conversations regarding how facial recognition algorithms perform with different populations. (And yes, NIST is investigating that also.) The common argument is that an algorithm needs to be tested against a representative population. But what is a representative population? The general population of the United States? The population of the state of Hawaii? The population of the city of Oakland? The population of the European Union? Each of these populations, and any others that could be cited, differs in composition. And when you stop looking at general populations and look at criminal populations, those differences can be even more pronounced. So before we can assert that algorithms must be tested on a representative population, we have to define what a representative population is.
OK, back to differences between NIST tests and the real world.
- In the real world, you may not be using any of the algorithms that NIST tested. This may be for a couple of reasons. One reason is that vendors may tune their algorithms to meet certain test criteria. For example, a vendor may submit an algorithm that prioritizes accuracy but sacrifices speed; while such an algorithm will test well, its slow speed means that it can never be implemented in a production system. The vendor will use some other algorithm. In addition, some algorithms are designed in such a way that NIST cannot test them. Which brings up another point.
- In the real world, you may be using an algorithm that NIST never tested. To date, NIST has tested 128 algorithms. However, some vendors have chosen not to submit their algorithms to NIST for testing, for a variety of reasons. Vendors don’t have to submit their algorithms, and they can use non-NIST methods to assert accuracy. Perhaps the vendor has conducted its own test. Perhaps the vendor has used a different test. Or perhaps the customer of a vendor has performed its own test, and released the results.
In short, NIST tests are nice, as long as you remember what they show, and what they don’t show.
And one final note, which is somewhat off the topic of NIST tests, but I want to mention it regardless. ANY test that is conducted is by definition non-reflective of real-world conditions. So if you’re comparing Members of Congress to the Manson family, or trying to deduce the gender of drag queens, or conducting tests in such a way that would not be permissible for a forensic face examiner…who cares?