Tuesday, July 26, 2016

The False Positives / False Negatives Trade-off

I’ll try to illustrate the false-positive/false-negative trade-off in this post with a basic example. Suppose you have a test that tells you with what probability you have a disease. Such a test is some composite of your medical history, blood and tissue tests, genetic screening, etc. All this information goes into an algorithm, and out comes a single number: “The chance that you have Disease X is 20.6%.” The test is well calibrated: of those people who are told they have a 20.6% chance of having the disease, 20.6% have it. (And likewise for 1, 2, 3, …100%.) Given that you have this test, you now have to decide how to act on it. Basically, you have to set a threshold somewhere: “Everyone with a probability of Y% or above should be treated as though they have the disease; everyone below that should be treated as though they are healthy.” Even though the test is very well calibrated, any policy of how to act on the test results will invariably worry some healthy people (false positives) and leave some diseased people untreated (false negatives). All you can do is set the value of Y; your task is essentially to manage the number of false positives and false negatives as wisely as you can.
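
To make “well calibrated” concrete, here’s a minimal Python sketch with made-up data (not any real screening algorithm): each person gets a predicted risk, their true status is drawn with exactly that probability, and we then check that the observed disease rate in each risk bucket tracks the predicted risk.

```python
import random

random.seed(0)

# Hypothetical population: each person gets a predicted risk, and their true
# disease status is drawn with exactly that probability (calibrated by construction).
people = []
for _ in range(100_000):
    risk = random.random()
    people.append((risk, random.random() < risk))

# Check calibration in 10%-wide buckets: the observed disease rate should
# track the bucket's predicted risk.
for i in range(10):
    lo, hi = i / 10, (i + 1) / 10
    bucket = [diseased for risk, diseased in people if lo <= risk < hi]
    rate = sum(bucket) / len(bucket)
    print(f"predicted risk {lo:.1f}-{hi:.1f}: observed disease rate {rate:.2f}")
```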

You might respond, “Well, come up with a better test! Come up with something that neatly segregates the population into ‘100% healthy’ and ‘100% diseased’!” That would be wonderful, but pretend for the sake of argument that we are using all possible information, and the algorithm that gives you your “diseased probability” is optimized to be as well calibrated as possible. Often in life, we have to make do with limited information in situations where total omniscience would solve our problem. Maybe a better test for this disease will come along some day, but for the time being, we’ll have to make the best we can of this algorithm.

I’ve made up some data. I created a random set of 1001 people whose “diseased probability” ranges from 0 to 1 in increments of 0.001. I then randomly assigned each of these people to be “diseased” or “healthy” based on their probability. The disease probability is known (it’s the output of an algorithm we’ve created), but the actual disease status of each person is not. I need to set a rule such as “treat everyone with a disease probability of Y% or greater as diseased, and treat everyone else as healthy.” So here’s what happens when you vary Y from 0 to 100%:

[Chart: false negatives (vertical axis) vs. false positives (horizontal axis) as the threshold Y varies from 0 to 100%.]

On the vertical axis we have false negatives. These are the sick people that our selection criterion is missing. On the horizontal axis we have false positives: these are the healthy people we are misidentifying as diseased. There is fundamentally an inverse relationship between the two types of errors. Following the curve down and to the right, you are decreasing the probability threshold (“Y” above) to identify someone as “diseased.” Following the curve up and to the left, you are increasing the threshold.
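
Here’s a rough sketch, in Python, of the kind of simulation described above (my own reconstruction, not the code behind the chart): 1001 people with risks from 0 to 1 in steps of 0.001, a true status drawn at each risk, and a sweep over the threshold Y that counts both kinds of mistakes.

```python
import random

random.seed(1)

# 1001 people with disease probabilities 0.000, 0.001, ..., 1.000;
# each is assigned a true status at random according to that probability.
probs = [i / 1000 for i in range(1001)]
diseased = [random.random() < p for p in probs]

# Policy: treat everyone with probability >= Y as diseased. Count the healthy
# people we flag (false positives) and the sick people we miss (false negatives).
for y in [i / 10 for i in range(11)]:
    false_pos = sum(1 for p, d in zip(probs, diseased) if p >= y and not d)
    false_neg = sum(1 for p, d in zip(probs, diseased) if p < y and d)
    print(f"Y = {y:.1f}: {false_pos:4d} false positives, {false_neg:4d} false negatives")
```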

The problem arises because uncertainty is inherent in all decision-making. How *should* you treat someone who has a 50% chance of having a disease? Or a 40, 30, or 10% chance? What general policy should apply to this very heterogeneous population, with a full range of disease probabilities? Surely it depends on the relative cost of treating a healthy person versus the cost of ignoring a sick one. Some discussions single-mindedly focus on one kind of error and ignore the possibility of the other kind. This is foolish because both kinds of errors are costly. A mature discussion would acknowledge the trade-off and specify where on the curve the optimal point lies.
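
One way to make “specify where on the curve” concrete: attach a hypothetical cost to each kind of mistake and pick the threshold that minimizes the total. The cost numbers below are purely illustrative; the point is only that some such weighting is implicit in any choice of Y.

```python
import random

random.seed(1)
probs = [i / 1000 for i in range(1001)]
diseased = [random.random() < p for p in probs]

# Hypothetical per-mistake costs; real ones would come from the policy debate itself.
COST_FALSE_POSITIVE = 1.0   # treating a healthy person
COST_FALSE_NEGATIVE = 5.0   # leaving a sick person untreated

best = None
for i in range(101):
    y = i / 100
    fp = sum(1 for p, d in zip(probs, diseased) if p >= y and not d)
    fn = sum(1 for p, d in zip(probs, diseased) if p < y and d)
    cost = COST_FALSE_POSITIVE * fp + COST_FALSE_NEGATIVE * fn
    if best is None or cost < best[0]:
        best = (cost, y, fp, fn)

cost, y, fp, fn = best
print(f"lowest-cost threshold: Y = {y:.2f} ({fp} false positives, {fn} false negatives)")
```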

Discussions of public policy often miss this point. Suppose you are deciding what the threshold of “reasonable suspicion” is for a police officer to search a car, a home, or a pedestrian’s pockets. Suppose you are deciding which chronic pain sufferers should get strong prescription opioids and which ones shouldn’t, based perhaps on the severity of their condition or the chance that they are “abusers”. Suppose you are setting the standard for when a police officer can use deadly force. Suppose you are deciding which parolees are likely to re-offend and which ones are likely to stay clean, for the sake of sending them back to prison or keeping them on parole. Suppose we have to dole out society’s limited welfare resources to the needy. What’s the threshold for “needy” and based on what criteria? How many guilty men are you willing to leave unpunished to ensure that an innocent man isn’t imprisoned? Whether you have a fuzzy, judgment-based criterion or a hard-coded algorithmic criterion, all these decisions suffer from the false-positive/false-negative trade-off.  With given resources and given information, “make fewer mistakes” isn’t an option. You have to choose which kind of mistakes you are more comfortable with.

Take opioid prescriptions, for example. Maybe you think relieving a single chronic pain sufferer is worth risking a thousand opioid addictions; in that case, you’d choose a permissive standard. The “false positives” (giving opioids to someone who doesn’t need them) are relatively costless compared to the “false negatives” (denying opioids to a real pain sufferer).

For another example that’s in recent news, take police shootings. Perhaps you want police officers to be able to defend themselves from an obvious threat, but you also think that police officers assume certain risks by taking the job. There are cases where innocent people are understandably confused about who exactly is bursting into their home in the middle of the night. These can lead to tense stand-offs where an innocent homeowner doesn’t know who’s threatening him and the police don’t know who’s shooting at them. It makes sense that the police should assume *some* portion of that risk, given that these are sometimes completely innocent people (often *not* the target of the raid at all) whom the police are charged with protecting. If you force the police to be more restrained in their use of force, more cops get shot at, but there is also more opportunity for the innocent homeowner to realize what’s happening and disarm voluntarily. Less restraint, and you get the opposite.

The task here is to set the standard for the use of deadly force, and to set it such that you are comfortable with the proportions of both kinds of mistakes. You want to set it high enough that police don’t instantly gun down innocent people at the first sign of a plausible threat, but low enough that people are still willing to endure the risk of becoming police officers. This is a hard problem, and I think many people (probably including my younger self) blow through it without much thought. Of course, “Stop doing no-knock raids” would put an end to most of these volatile encounters, and that’s a perfectly sensible policy solution. Even so, these volatile situations *could* arise during legitimate, routine police work. “Make fewer mistakes” isn’t a policy lever. *Given* the institutions we have, you can only shift the proportion of false positives versus false negatives. Given that both kinds of errors are inevitable, the task, once again, is to figure out which kinds of mistakes you’re more comfortable with (and by how much).

All this assumes that we have a policy lever that controls when police fire their weapons during tense encounters. Maybe this sounds implausible, but a liability rule or a standardized protocol might do the trick. In the recent discussion of police shootings, much of the commentary proceeds as if all police shootings neatly separate into “plainly 100% justified” and “plainly 100% unnecessary.” Surely no mortal perceives this world with such god-like clarity. Surely borderline cases exist, and the border is probably much broader than a line.


My task here is not to adjudicate any of the policy questions mentioned in the above paragraphs. If it sounds like I’m venturing too strong an opinion, I apologize for the distraction. My much more modest goal with this post is to articulate a trade-off. It’s a trade-off that is crucial to many policy discussions, but often gets completely glossed over. 
