Classification, Individualization, and Albino Wallabies

Classification is not individualization.

A few minutes spent reading about Joy Buolamwini’s Gender Shades study illustrates the difference. It is clear that Buolamwini and her co-authors were trying to answer a very specific question:

How well do IBM, Microsoft, and Face++ AI services guess the gender of a face?

Note that they were NOT trying to answer a question about all facial recognition algorithms (although they subsequently tested others).

And also note that they were not trying to answer whether the three algorithms that they DID test could identify individuals (although the ACLU has set up tests for that).

Perhaps there are use cases in which you want to classify individuals by gender, or by race, or by some other factor. But in my experience in the industry, most of my customers didn’t really care about classification, because it didn’t help them do their jobs. Let me provide a couple of examples.

(A note for Californians reading the next paragraph in September 2020: it assumes a time in the distant past in which fans would actually attend sporting events in person. I know it’s hard to remember such times, but they did exist. And hopefully they will again – soon.)

Let’s say that a major football team wants to make sure that fans entering the stadium have actually paid for their tickets. While there are a variety of methods to do this, ranging from having the prospective attendee present an authentic ticket to having the attendee verify their identity biometrically, all of these methods are geared toward individualization of either the ticket or the person. It’s not good enough to merely CLASSIFY the prospective attendee – “this is a football fan; let him in.”

Or take another use case – using forensic evidence to find out who committed a crime. The mere fact that a video camera recorded a white male with glasses at the crime scene isn’t enough to make an arrest. In fact, under U.S. law, the mere fact that a video camera recorded John E. Bredehoft at the crime scene isn’t enough to make an arrest. However, the possible identification of an individual at a crime scene is enough to start investigating leads, including:

  • Confirmation by a trained forensic facial examiner that the algorithmic output truly reflected a conclusive match. (This was NOT performed in the ACLU test cited above. And frankly, Amazon shouldn’t have been suggesting ANY reliance on algorithmic results anyway, regardless of whether the threshold is set at 95 percent or 99 percent or 100 percent. A trained forensic facial examiner should ALWAYS review potential matches.)
  • Determination by an investigator whether the matched person may have been there for some other reason; perhaps the person is an innocent bystander or a victim.
  • Determination by an investigator that there is independent evidence linking the matched person to the crime. A face comparison result should ONLY be used as an investigative lead, NOT as the sole factor in making an arrest; there needs to be corroborating evidence.

Let’s get back to those words classification and individualization. The definitions of classification and individualization may vary depending upon the context, but it’s certainly true to say that classification identifies a CLASS (such as “male”) while individualization identifies an INDIVIDUAL (the “John E. Bredehoft” that is writing this particular post, and not any other John E. Bredehoft).

Sounds pretty straightforward.
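To make the contrast concrete, here is a minimal sketch in Python. It does not reflect any vendor’s actual API; the embeddings, the labels, and the 0.6 similarity threshold are all hypothetical. The point is simply that one function returns a class, while the other searches a gallery of enrolled individuals.

    import numpy as np

    def classify(face_embedding, class_weights):
        """Classification: map a face to a CLASS label such as 'male' or 'female'."""
        scores = {label: float(weights @ face_embedding)
                  for label, weights in class_weights.items()}
        return max(scores, key=scores.get)

    def individualize(face_embedding, gallery, threshold=0.6):
        """Individualization: search a gallery of enrolled INDIVIDUALS and
        report the best match only if it clears a similarity threshold."""
        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        best_id, best_score = None, -1.0
        for person_id, enrolled_embedding in gallery.items():
            score = cosine(face_embedding, enrolled_embedding)
            if score > best_score:
                best_id, best_score = person_id, score
        # Below the threshold, the honest answer is "no identification".
        return (best_id, best_score) if best_score >= threshold else (None, best_score)

The first function answers “what kind of face is this?”; the second answers “whose face is this, if it belongs to anyone we have enrolled?”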

Now while much of my career has been focused on individualization, I’ve also had to look at classification at times. Video analytics software often uses classification, because the use cases for video analytics often lend themselves to classification. For example, to answer the COVID-trendy question “is there overcrowding of persons in a particular area?” the algorithm needs to understand what a “person” is, and how a “person” is different from a “dog” or a “chair” or a “statue.”

So how does an algorithm determine the difference between a person and a statue? Well, usually the algorithm is trained on a set of data that includes both people and statues, and eventually the algorithm is trained enough to distinguish between the two.

Provided that the algorithm receives sufficient data to reliably make the distinction.
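For the curious, here is a minimal sketch of what that training step might look like, assuming a PyTorch/torchvision setup and a hypothetical folder layout (data/train/person, data/train/statue). The specific library doesn’t matter; what matters is that the model’s notion of “person” versus “statue” comes entirely from whatever images happen to be in those folders.

    # Minimal transfer-learning sketch; the paths, epoch count, and
    # hyperparameters are hypothetical.
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    # Expects data/train/person/*.jpg and data/train/statue/*.jpg
    train_set = datasets.ImageFolder("data/train", transform=transform)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    # Start from a pretrained backbone and replace the final layer with one
    # that outputs a score per class found in the training folders.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # train only the new head
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(5):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

If a category of input never appears in those folders, the model simply has no way to learn about it.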

Which brings me to the albino wallaby.

[Image: an albino wallaby. Photo by Dean (leu) from West Midlands, England / CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)]

Last month, Mosalam Ebrahimi wrote “A 5-Minute Tour of Algorithmic Bias Mitigation,” and he began with an example of an algorithm that classified animals. If you click through to Ebrahimi’s article, you will see that the algorithm was trained on different types of animals, including wallabies, gorillas, and cats. But it appears that the data used to train the algorithm did not include albino wallabies or albino gorillas. An albino wallaby, for example, was classified by the algorithm as a white cat – presumably because the training data set lacked sufficient albino wallabies (and albino gorillas). This is what Ebrahimi refers to as a “benign” example of algorithmic bias, although I guess it’s not benign if you’re an albino wallaby who signed up for a dating service.

For these and more serious examples of algorithmic bias, Ebrahimi notes that programmatic solutions can be applied to the problem:

A class of algorithms attempts to determine when the model should be less confident about input data that is not similar to the training dataset. For instance, if the animal image classification model at the beginning of this post could tell it has never seen an albino gorilla, it would not mistake it for another animal. Hence, this approach can be used to mitigate bias in ML algorithms.
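Ebrahimi doesn’t prescribe a specific method, but one simple member of that class is to let the classifier abstain when its confidence is low (thresholding the maximum softmax probability is a common baseline). Here is a minimal sketch, again in Python/PyTorch; the label list and the 0.9 threshold are hypothetical, not anything from his article.

    import torch
    import torch.nn.functional as F

    LABELS = ["wallaby", "gorilla", "cat"]  # hypothetical training classes

    def classify_with_abstention(model, image_tensor, threshold=0.9):
        """Return a label only when the classifier is confident; otherwise abstain."""
        model.eval()
        with torch.no_grad():
            logits = model(image_tensor.unsqueeze(0))   # shape: (1, num_classes)
            probs = F.softmax(logits, dim=1).squeeze(0)
        confidence, index = probs.max(dim=0)
        if confidence.item() < threshold:
            # The input looks unlike anything in the training data, so don't
            # force a guess; report "unknown" instead of, say, "white cat".
            return "unknown", confidence.item()
        return LABELS[index.item()], confidence.item()

A plain softmax threshold is a blunt instrument (a model can be confidently wrong), which is part of why the mitigation itself needs scrutiny.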

And yes, the correction algorithms themselves have to be rigorously tested.

Rigorous testing is what the National Institute of Standards and Technology (NIST) is all about. NIST, in addition to testing facial recognition with (simulated) masks, has also tested facial recognition when different demographic variables are present. And for me, the four most important words in the summary of that study were the words “Different algorithms perform differently.”

In other words, the claim that “all facial recognition algorithms are equally racist and sexist” is…an example of poor classification without sufficient individualization.
