Tuesday, January 31, 2017

A Post About Fight Club and Sample Bias

Imagine if Jack from the movie Fight Club were to lecture you about auto safety.

In the movie Fight Club, Edward Norton's character (often called Jack) is a recall coordinator who investigates automobile crash scenes in order to estimate the auto manufacturer’s liability.*  Imagine such a person lecturing you about the dangers of driving. All of his most vivid experiences involve real-life car accidents, some of which involve significant carnage. He could probably go on for a very long time stringing together one anecdote after another. He could justify each rhetorical flourish with another example of a family being horribly maimed or killed. You might get skeptical and say something about how you can’t judge the risk of driving just by looking at examples of fatal or near-fatal car crashes. That’s a biased sample, to say the least. Obviously you’d want to start with the full sample of all drivers or all car-trips, and estimate the risk of a bad accident as a proportion of this greater total. But by the time you've managed to express this idea, he shuts you down with another vivid anecdote from just the other day in which a family was burned alive in their "safe" automobile. Driving is safe, indeed!

But of course you would be right to ignore his bluster and consult the actuarial tables to quantify the true risk. Generalizing from the most vivid available examples is a bad idea. If you know that someone has been gathering and assembling the worst 1% or the worst 0.01% of examples of something (or however far out in the right tail you want to go), that person is a biased source of risk information.
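The point about estimating risk from the full sample rather than from the accidents alone can be made concrete with a toy simulation. (The trip count and accident rate below are made-up numbers chosen purely for illustration, not real actuarial figures.)

```python
import random

random.seed(42)

# Simulate 100,000 "car trips": 1 = serious accident, 0 = uneventful trip.
# Assume, hypothetically, a true accident rate of 1 in 10,000 trips.
TRUE_RATE = 1 / 10_000
trips = [1 if random.random() < TRUE_RATE else 0 for _ in range(100_000)]

# The actuary's view: accidents as a proportion of ALL trips.
overall_rate = sum(trips) / len(trips)

# The crash investigator's view: his caseload consists only of accidents,
# so within his "sample" the accident rate is 100%.
accidents_only = [t for t in trips if t == 1]
investigator_rate = (sum(accidents_only) / len(accidents_only)
                     if accidents_only else 0.0)

print(f"overall rate, full sample: {overall_rate:.4%}")
print(f"rate within accidents-only sample: {investigator_rate:.0%}")
```

The investigator's sample is perfectly informative about what accidents look like and perfectly uninformative about how often they happen; the base rate lives entirely in the denominator he never sees.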

Switch the topic to drug legalization. Enter law enforcement and substance abuse treatment personnel and anyone else who deals with society's problems. "You naive drug law reformer, you simply do not know all the horrors I have seen," they might start. And they go on to regale you with example after example, not realizing that the vast majority of the relevant sample is hidden from their view. Of course, people with substance abuse problems are the most likely to attract the attention of law enforcement and medical personnel. If you want to know how to deal with specific bad outcomes, these would be good people to consult for their opinions. (Sometimes, sometimes not; I have seen some of these people show very bad judgment even in their supposed domain of expertise.) But if you are trying to determine optimal drug policy, you should have some sense of how the typical potential user responds to the various risks and hazards of drugs. Your sample will contain the millions of people who dabble for a while and never develop a habit, or who develop some kind of a "habit" that never becomes a problem. Beware the problem of sample bias. It is lurking everywhere. A skilled demagogue, or even someone who is honest but oblivious, can really mislead you if you aren't careful.

All this is simply to point out the issue of sample bias. Forget for a moment that many of those overdoses, blood-borne pathogens, and other problems are themselves products of prohibition. That fact will also be hidden from the view of someone who simply looks at the screw-ups and tries to extrapolate from there. The non-problem users in his (the cop's/E.R. medic's/social worker's) own world are hidden from his view; the non-problem users in the counterfactual world of rational drug policy are doubly hidden. These law enforcement or medical professionals could at least see some kind of survey evidence for the existence of the hidden population of non-problematic drug users. But they can't "see" how much better the problem users fare in a counterfactual world where clean needles are freely available, drugs are cheaper and thus don't require property crimes to support a habit, and drugs are of pharmaceutical grade and known purity (thus leading to fewer poisonings). "Seeing" in this way requires the disciplined use of logic and statistics, and somebody who is blinded by vivid anecdotes won't be able to do it. Since I've discussed these issues in other posts, I won't rehash them all here. I just wanted to point out that sample bias isn't the only problem contributing to these folks' misunderstanding of the issue.

Scott Alexander makes a similar point about sample bias in this excellent post. No, you can't just pile a bunch of horrific anecdotes on top of each other. You have to know something about the base rate. You have to know how large a sample someone is digging through to dredge up the bad outcomes.

* He describes his work in the following excerpt: “Take the number of vehicles in the field, (A), and multiply it by the probable rate of failure, (B), then multiply the result by the average out-of-court settlement, (C). A times B times C equals X. If X is less than the cost of a recall, we don't do one.”
At this point in the movie you're supposed to imagine that the manufacturer is a cynical monster, a hulking avatar of capitalist excess and greed. But actually the formula is about the right criterion for issuing a recall. It might be prudent to make the cutoff "X times 1.5" or "X times 3" or something more conservative than simply "X", but there is no way that the value of X is completely irrelevant to the recall decision. Anyway, that's an argument for another post.
