I’ll try to illustrate the false-positive/false-negative
trade-off in this post with a basic example. Suppose you have a test that tells
you with what probability you have a disease. Such a test is some composite of
your medical history, blood and tissue tests, genetic screening, etc. All this
information goes into an algorithm, and out comes a single number: “The chance
that you have Disease X is 20.6%.” The test is well-calibrated. Of those people
who are told they have a 20.6% chance of having the disease, 20.6% have it.
(And likewise for 1, 2, 3, …100%.) Given that you have this test, you now have
to decide how to act on it. Basically, you have to set an arbitrary threshold:
“Everyone with a probability above Y% should be treated as though they have the
disease; everyone with a probability below Y% should be treated as though they
are healthy.” Even though the test is very well calibrated, any policy of how
to act on the test results will invariably worry some healthy people (false
positives) and leave some diseased people untreated (false negatives). All you
can do is set the value of Y; your task is essentially to manage the number of
false positives and false negatives as wisely as you can.
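To make “well-calibrated” concrete, here’s a small Python sketch (purely illustrative numbers, not real medical data): it draws predicted probabilities, draws true disease status to match them, and then checks that the people told “x%” really are sick about x% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: 100,000 people with predicted disease probabilities,
# and true status drawn so that the predictions are perfectly calibrated.
p = rng.uniform(0, 1, size=100_000)
sick = rng.random(p.shape) < p

# Calibration check: among people told "your chance is roughly x%",
# about x% should actually turn out to be sick.
bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    group = (p >= lo) & (p < hi)
    print(f"told {lo:.0%}-{hi:.0%}: observed rate {sick[group].mean():.1%}")
```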
You might respond, “Well, come up with a better test! Come
up with something that neatly segregates the population into ‘100% healthy’ and
‘100% diseased’!” That would be wonderful, but pretend for the sake of argument
that we are using all possible information, and the algorithm that gives you
your “diseased probability” is optimized to be as well-calibrated as possible.
Often in life, we have to make do with limited information in situations when
total omniscience would solve our problem. Maybe a better test for this disease
will come along some day, but for the time being, we’ll have to make the best
we can of this algorithm.
I’ve made up some data. I created a random set of 1001
people whose “diseased probability” ranges from 0 to 1 in increments of 0.001.
I then randomly assigned each of these people to be “diseased” or “healthy” based
on their probability. The disease probability is known (it’s the output of an
algorithm we’ve created), but the actual disease status of the person is not. I
need to set a rule such as “treat everyone with a disease probability of Y% or
greater as diseased, and treat everyone else as healthy.” So here’s what happens when
you vary Y from 0 to 100%:
On the vertical axis we have false negatives. These are the
sick people that our selection criterion is missing. On the horizontal axis we
have false positives: these are the healthy people we are misidentifying as
diseased. There is fundamentally an inverse relationship between the two types
of errors. Following the curve down and to the right, you are decreasing the
probability threshold (“Y” above) to identify someone as “diseased.” Following
the curve up and to the left, you are increasing the threshold.
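For anyone who wants to reproduce this kind of curve, here’s a rough Python sketch of the simulation I described (the random draws, and therefore the exact counts, won’t match the plot above):

```python
import numpy as np

rng = np.random.default_rng(42)

# 1001 people with disease probabilities 0.000, 0.001, ..., 1.000,
# and true status drawn from each person's own probability.
p = np.linspace(0, 1, 1001)
diseased = rng.random(p.shape) < p

# Sweep the treatment threshold Y and count both kinds of mistakes.
for y in np.linspace(0, 1, 11):
    treated = p >= y
    false_pos = int(np.sum(treated & ~diseased))   # healthy people we'd treat
    false_neg = int(np.sum(~treated & diseased))   # sick people we'd miss
    print(f"Y = {y:.1f}: false positives = {false_pos:3d}, false negatives = {false_neg:3d}")
```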
The problem arises because uncertainty is inherent in all
decision-making. How *should* you treat someone who has a 50% chance of having
a disease? Or how about a 40 or 30 or 10% chance? What general policy should
apply to this very heterogeneous population, with a full range of disease
probabilities? Surely it depends on the relative cost of treating a healthy person
vs the cost of ignoring a sick one. Some discussions single-mindedly focus on
one kind of error and ignore the possibility of the other kind. This is foolish
because both kinds of errors are costly. A mature discussion would acknowledge
the trade-off and specify where on the curve is the optimal place to be.
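One way to make “where on the curve” precise is a simple expected-cost calculation: assign a cost to each kind of error, and a well-calibrated probability tells you the break-even threshold directly. A sketch, with made-up cost numbers:

```python
def break_even_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Treat anyone whose disease probability exceeds this value.

    Treating a healthy person costs cost_false_positive; leaving a sick
    person untreated costs cost_false_negative. Treating is worth it when
    p * cost_false_negative > (1 - p) * cost_false_positive.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Made-up costs: missing a sick person is judged 4x as bad as worrying a healthy one.
print(break_even_threshold(1.0, 4.0))  # 0.2 -> treat everyone above a 20% probability
```

Change the cost ratio and the optimal spot on the curve moves with it; the trade-off itself never disappears.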
Discussions of public policy often miss this point. Suppose
you are deciding what the threshold of “reasonable suspicion” is for a police
officer to search a car, a home, or a pedestrian’s pockets. Suppose you are
deciding which chronic pain sufferers should get strong prescription opioids
and which ones shouldn’t, based perhaps on the severity of their condition or
the chance that they are “abusers”. Suppose you are setting the standard for
when a police officer can use deadly force. Suppose you are deciding which
parolees are likely to re-offend and which ones are likely to stay clean, for
the sake of sending them back to prison or keeping them on parole. Suppose we
have to dole out society’s limited welfare resources to the needy. What’s the
threshold for “needy” and based on what criteria? How many guilty men are you
willing to leave unpunished to ensure that an innocent man isn’t imprisoned? Whether
you have a fuzzy, judgment-based criterion or a hard-coded algorithmic
criterion, all these decisions suffer from the false-positive/false-negative
trade-off. With given resources and
given information, “make fewer mistakes” isn’t an option. You have to choose
which kind of mistakes you are more comfortable with.
Take opioid prescriptions, for example. Maybe you think that relieving a
single chronic pain sufferer is worth a thousand opioid addictions; in this
case you’d choose a permissive standard. The “false positives” (giving opioids
to someone who doesn’t need them) are relatively costless compared to the
“false negatives” (denying opioids to a real pain sufferer).
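Plugging that hypothetical 1,000-to-1 valuation into the same break-even arithmetic shows just how permissive the resulting standard would be:

```python
# Hypothetical costs only: treating one denied pain sufferer (false negative)
# as 1,000 times worse than one unnecessary prescription (false positive).
cost_fp = 1.0
cost_fn = 1000.0

# Prescribe whenever p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)
print(f"Prescribe above roughly a {threshold:.2%} chance of genuine need")  # ~0.10%
```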
For another example from recent news, take police shootings. Perhaps you want police officers to be able to defend themselves
from an obvious threat, but you also think that police officers assume certain
risks by taking the job. There are cases where innocent people are
understandably confused about who exactly is bursting into their home in the middle
of the night. These can lead to tense stand-offs where an innocent homeowner
doesn’t know who’s threatening him and the police don’t know who’s shooting at
them. It makes sense that the police should assume *some* portion of that risk,
given that sometimes these are completely innocent people (often *not* the
target of the raid at all) whom the police are charged with protecting. If you
force the police to be more restrained in their use of force, it means more
cops get shot at but it also means there is more opportunity for the innocent homeowner
to realize what’s happening and disarm voluntarily. Less restraint, and you get
the opposite. The task here is to set the standard for the use of deadly force,
and set it such that you are comfortable with the proportions of both kinds of
mistakes. You want to set it high enough that police don’t instantly gun down
innocent people at the first sign of a plausible threat, but low enough that
people are still willing to endure the risk of becoming police officers. This
is a hard problem, and I think many people (probably including my younger self)
blow through it without much thought. Of course, “Stop doing no-knock raids”
would put an end to most of these volatile encounters, and that’s a perfectly
sensible policy solution. Even so, these volatile situations *could* arise
during legitimate, routine police work. “Make fewer mistakes” isn’t a policy
lever. *Given* the institutions we have, you can only shift the proportion of
false positives vs false negatives. Given that both kinds of errors are
inevitable, the task, once again, is to figure out which kinds of mistakes you’re
more comfortable with (and by how much). All this assumes that we have a policy
lever that controls when police fire their weapons during tense encounters. Maybe
this sounds implausible, but a liability rule or standardized protocol might do
the trick. In the recent discussion of police shootings, much of the commentary
proceeds as if all police shootings neatly separate into “plainly 100%
justified” and “plainly 100% unnecessary.” Surely no mortal perceives this world
with such god-like clarity. Surely borderline cases exist, and the border is
probably much broader than a line.
My task here is not to adjudicate any of the policy
questions mentioned in the above paragraphs. If it sounds like I’m venturing
too strong an opinion, I apologize for the distraction. My much more modest
goal with this post is to articulate a trade-off. It’s a trade-off that is
crucial to many policy discussions, but often gets completely glossed over.