[If you find yourself getting angry at any of what I’ve written
below, please read to the end. I am not rendering a verdict on the racial
bias/fairness of the overall criminal justice system. I am making a pretty
narrowly focused critique of an article that recently appeared on Propublica
regarding the use of predictive modeling in criminal justice. My expertise, and
the topic of this commentary, is predictive modeling, not criminal justice.
Although I certainly have opinions about both.]
I read this Propublica article on the use of statistical modeling to predict the recidivism of criminals, and I found it to be
incredibly misleading.
To their credit, the authors make their methods and even
their data incredibly transparent and link to a sort of technical appendix.
The thrust of the article is that the statistical methods
used to predict recidivism by criminals are biased against black people, and
that attempts to validate such statistical methods either haven’t bothered to
check for such a bias or have overlooked it. I smelled a rat when the article
started making one-by-one comparisons. “Here’s a white guy with several prior
convictions and a very low risk-score. Here’s a black guy with only one prior
and a high risk-score. It turns out the white guy did indeed commit additional crimes, and
the black guy didn’t.” (Paraphrasing here, not a real quote.) This absolutely screams cherry-picking to me. I don’t
know exactly what else went into the risk scoring (a system called COMPAS by a
company named Northpointe), but surely it contains information other than the
individual’s prior criminal record. Surely other things (gender, age,
personality, questionnaire responses, etc.) are in their predictive model, so
it’s conceivable that someone with a criminal history but otherwise a
confluence of good factors will be given a low risk score. Likewise, someone
with only one prior conviction but a confluence of otherwise bad factors will get a high risk
score. After the fact, you can find instances of people who were rated
high-risk but didn’t recidivate, and people who were rated low-risk but did
recidivate. These one-off "failures" of the model are irrelevant. What matters is how well the predictive model does *in aggregate.*
I build predictive models for a living. I don’t know exactly
what kind of model COMPAS is. My best guess is that it’s some kind of logistic
regression or a decision tree model. (Some of these details are proprietary.)
The output of these models isn’t a “yes/no” answer. It’s a probability. As in,
“This person has a 30% chance of committing a crime in the next 2 years.” You
can’t measure the performance of a predictive model by looking at individual
cases and saying, “This guess was correct, this guess was incorrect….” You have
to do this kind of assessment in aggregate. To do this, one would typically
calculate everyone’s probability of recidivism, pick some kind of reasonable
grouping (like “0-5% probability is group 1, 5-10% is group 2, …, 95-100% is group 20”), then compare the model predictions to the
after-the-fact recidivism rates. If, for example, you identify a grouping of
individuals with a 35% probability of
recidivism, and 35% of those individuals recidivate, your model is pretty good.
You aren’t going to build a model with a sharp distinction like “This guy
*will* recidivate and this guy *won’t*.” Most likely you will get probabilities
spread out fairly evenly over some range. You could, for example, get an equal
number of individuals at every percentile (1%, 2%, 3%, and each integer % up to
100%). More often with models like these you get something that is distributed
across a narrower range. Perhaps everyone’s statistically predicted recidivism
rate is between, say, 30% and 60%, and the distribution is a bell-curve within
that range with a midpoint near 45%. You don’t typically get a bimodal
distribution, with people clustering around 1% and 99%. In other words, a
predictive model doesn’t make clear “yes/no” predictions.
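To make “assessment in aggregate” concrete, here is a minimal sketch of that kind of calibration check in Python. This is not Northpointe’s actual methodology, just the generic idea: bin everyone by predicted probability and compare each bin’s average prediction to its observed recidivism rate.

```python
import numpy as np

def calibration_table(predicted_probs, actual_outcomes, n_bins=20):
    """Compare predicted recidivism probabilities to observed rates, bin by bin.

    predicted_probs: model outputs in [0, 1]
    actual_outcomes: 0/1 indicators (1 = recidivated within the follow-up window)
    """
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    actual_outcomes = np.asarray(actual_outcomes, dtype=float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)   # 0-5%, 5-10%, ..., 95-100%
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = hi if hi < 1.0 else 1.01        # include exactly 100% in the last bin
        in_bin = (predicted_probs >= lo) & (predicted_probs < upper)
        if in_bin.any():
            rows.append({
                "bin": f"{lo:.0%}-{hi:.0%}",
                "n": int(in_bin.sum()),
                "mean_predicted": predicted_probs[in_bin].mean(),
                "observed_rate": actual_outcomes[in_bin].mean(),
            })
    return rows

# If a bin's mean prediction is ~35% and roughly 35% of that bin actually
# recidivated, the model is doing its job for that bin.
```

If the predicted and observed columns track each other across bins, the model is well calibrated; if predictions systematically overshoot for some identifiable subgroup, that would be evidence of bias in the sense I use later in this post.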
I don’t see any understanding of these predictive modeling
concepts in the piece above. Nothing in it indicates that the author is competent
to validate or judge a predictive model. In fact, when it says things like “Two years later, we know the computer algorithm got it exactly backward,” “Only 20 percent of the people predicted to commit violent crimes actually went on to do so,” and “the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways,” it betrays a real ignorance
about how predictive models work. To be fair, the page describing the analysis
is far more nuanced than the article itself. Something got lost in translating
a reasonable and competent statistical analysis into an article for the general public.
The bottom line for this piece is the “false positive/false
negative differences for blacks and whites” result. See the table at the bottom
of the main piece. If we look at black defendants who recidivated, 27.99% were
rated low risk (the “false negative” rate); this number is 47.72% for whites.
The authors interpret this as (my own paraphrase): the algorithm is more
lenient for whites, because it’s mislabeling too many of them as low-risk. If
we look at black defendants who did not recidivate, 44.85% were rated high-risk
(the “false positive” rate); this number is 23.45% for whites. The authors once
again interpret this as the algorithm being lenient toward whites, since it is
more likely to mislabel non-recidivating blacks as high-risk. At first this
really did seem troubling, but then I looked at the underlying numbers.
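For readers who want the arithmetic spelled out, this is how those “false negative” and “false positive” percentages come out of a two-by-two table of score (high/low) versus outcome (did/didn’t recidivate). The counts below are invented for illustration; ProPublica’s actual counts are in their appendix.

```python
# Hypothetical counts for one group of defendants (not ProPublica's real numbers).
high_recid    = 300   # scored high risk, did recidivate
high_no_recid = 200   # scored high risk, did not recidivate ("false positives")
low_recid     = 150   # scored low risk, did recidivate ("false negatives")
low_no_recid  = 350   # scored low risk, did not recidivate

# ProPublica's "false negative rate": of those who recidivated,
# what fraction had been rated low risk?
fnr = low_recid / (low_recid + high_recid)

# ProPublica's "false positive rate": of those who did NOT recidivate,
# what fraction had been rated high risk?
fpr = high_no_recid / (high_no_recid + low_no_recid)

print(f"false negative rate: {fnr:.1%}")   # 33.3% with these made-up counts
print(f"false positive rate: {fpr:.1%}")   # 36.4% with these made-up counts
```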
Black defendants were more likely than white defendants to
have a high score (58.8% vs 34.8%), but that alone does not imply an unfair
bias. Despite their higher scores, blacks had a higher recidivism rate than
whites for *both* high and low scoring populations. Blacks with a high score had a 63% recidivism
rate, while high-scoring whites had a 59.1% recidivism rate. The difference is
even bigger for the low scorers. Anyway, I’m willing to interpret these
differences as small and possibly statistically insignificant. But it’s pretty
misleading to say that the scoring is unfair to blacks. The scoring is clearly
discriminating between the high-recidivism and low-recidivism populations, and its
predictive performance is similar for whites and blacks. I think the “false
positive/false negative” result described in the above paragraph is just a
statistical artifact of the fact that black defendants, for whatever reason,
are more likely to recidivate (51.4% vs 39.4%, according to Propublica’s data).
It’s almost as if the authors looked at every conceivable ratio of numbers from
this “High/Low score vs Did/didn’t recidivate” table and focused on the only
thing that looked unfavorable to blacks.
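To see why different base rates alone can generate that pattern, here is a toy calculation (the numbers are invented, not Broward County data). Give two groups the exact same perfectly calibrated two-bucket score, with high scorers recidivating at 60% and low scorers at 30% in both groups, and let the only difference be what fraction of each group lands in the high bucket.

```python
def error_rates(frac_high, p_recid_high=0.60, p_recid_low=0.30):
    """ProPublica-style FNR/FPR for a perfectly calibrated two-bucket score."""
    base = frac_high * p_recid_high + (1 - frac_high) * p_recid_low   # overall recidivism rate
    fnr = (1 - frac_high) * p_recid_low / base          # recidivists who were rated low
    fpr = frac_high * (1 - p_recid_high) / (1 - base)   # non-recidivists who were rated high
    return base, fnr, fpr

# Group A: half score high.  Group B: only a fifth score high.
for name, frac_high in [("Group A", 0.5), ("Group B", 0.2)]:
    base, fnr, fpr = error_rates(frac_high)
    print(f"{name}: base rate {base:.0%}, false negative rate {fnr:.0%}, false positive rate {fpr:.0%}")

# Group A: base rate 45%, FNR 33%, FPR 36%
# Group B: base rate 36%, FNR 67%, FPR 12%
# Same score, same calibration; the higher-base-rate group gets more "false
# positives" and the lower-base-rate group gets more "false negatives".
```

That is the same qualitative disparity ProPublica reports, produced by nothing more than a difference in base rates.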
(It should be noted that this analysis relies on data from Broward County in Florida. We can't necessarily assume any of these results generalize to the rest of the country. Any use of "black" or "white" in this post refers specifically to populations in this not-necessarily-representative sample.)
It’s kind of baffling when you read the main piece and then see
how it’s contradicted by the technical appendix. In the technical details, you
can clearly see the survival curves are different for different risk classes. See the survival curves here, about 3/4 of the way down the page. In
a survival curve people drop out of the population, in this case because they
recidivate (in a more traditional application, because they literally die). The
high-risk category is clearly recidivating more than the medium risk category,
and the medium more than the low. If you compare the survival curves for the
same risk category across races (e.g. compare high-risk blacks to high-risk
whites), you can even see how blacks in the same risk category have a slightly
higher recidivism rate. Contra the main article, this score is doing exactly
what it’s supposed to be doing.
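For anyone who hasn’t worked with survival curves, here is a rough sketch of how you would build one per risk category. This is my own illustration, not ProPublica’s code, and it ignores the censoring corrections a real Kaplan-Meier or Cox analysis would apply.

```python
import numpy as np

def survival_curve(days_to_recidivism, horizon=730):
    """Fraction of a group with no new offense by each day from 0 to `horizon`.

    days_to_recidivism: day of re-offense for each person, or np.inf for people
    with no observed re-offense. Censoring is ignored here for simplicity; the
    shape of the curve is the same idea as the appendix's survival plots.
    """
    days = np.asarray(days_to_recidivism, dtype=float)
    return [(days > t).mean() for t in range(horizon + 1)]

# Computing this separately for the Low, Medium, and High score categories
# (and, within each category, for black and white defendants) should reproduce
# the ordering described above: the high-risk curve drops off fastest.
```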
Sorry if this all seems terribly pedantic and beside the
point. I even find myself saying, “Yes, yes, but there really *is* a serious
problem here, accurate scores or not.” I’m definitely sympathetic to the idea
that our criminal justice system is unfair. We police a lot of things that
shouldn’t be policed, and we fail to adequately police things that *should* be
policed. Clearance rates on real crimes, like theft, murder, and rape, are
pathetically low, while non-crimes like drug use and drug commerce are actively
pursued by law enforcement. If these scores are predicting someone’s tendency to
commit a drug offense, and such a prediction is used *against* that person,
then I will join the chorus condemning these scores as evil. However, it’s not
fair to condemn them as “racist” when they are accurately performing their
function of predicting recidivism rates and sorting high- and low-risk
individuals. I also don’t think it’s fair to condemn the entire enterprise of
predictive modeling because “everyone’s an individual.” The act of predicting
recidivism will be done *somehow* by *someone*. That somehow/someone could be a
computer performing statistical checks on its predictions, basing its
predictive model on enormous sample sizes, and updating itself based on new
information. Or it could be a weary human judge, perhaps hungering for his lunch
break, perhaps irritated by some matter in his personal life, perhaps carrying
with him a lifetime’s worth of prejudices, who in almost no one’s conception
performs any kind of prediction-checking or belief updating. Personally, I’ll
take the computer.
I think the main issue they pointed out here was that the algorithm was more likely to have an error in favor of a white defendant, and more likely to have an error not in favor of a black defendant. I think the error was like 15 percent. So if a black and a white defendant were each to have a 50 percent chance of reoffending, the algorithm would give the white defendant a 35 percent chance and the black defendant a 65 percent chance on average. That's bad. I also don't think the article claimed the algorithm was less accurate or more biased than a judge; they didn't attempt to measure that.
@ JH.
Thanks for your comment.
Did you read the technical write-up (the second link in my post above)? It explains how the analysis was done, and it was clear from this that the model wasn’t biased. A bias would mean that for some identifiable subgroup, the model overstates the chance of recidivism (and likewise understates it for those not in the subgroup). It doesn’t have to be race, either. Suppose the model accurately predicted that, say, younger offenders are more likely to recidivate than older offenders (or males vs females, or people with multiple priors vs people with 1 or no priors). Even if the model was unbiased (the predicted recidivism rate was approximately equal to the actual observed rate, as is the case for the dataset Propublica was using), you’d see the same kind of “bias” that Propublica found. The high-recidivism population will *always* have more “false positives” and the low-recidivism population will always have more “false negatives” as calculated in the Propublica piece. See my discussion in the paragraph above and below the data table. They could have written an equally compelling piece arguing that the model too easily lets females off the hook, but identifies too many non-recidivating males as high-risk. It’s not fair to call this a “bias,” though, because this pattern will emerge from a perfectly unbiased model. Indeed you’d have to bias your model to avoid this issue of disparities in false positives/false negatives.
Open up an Excel workbook and create some completely fictitious sample data (as in the data table above) to convince yourself of this. You'll see that the high-recidivism group has more false positives, etc. I had scratched out a blog post expanding on this point but never shared it. Maybe I’ll have to dust it off and consider posting it.
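For what it’s worth, here is the same exercise sketched in Python rather than Excel, with completely made-up numbers. The score is set equal to each person’s true probability, so it is unbiased by construction; the only difference between the two simulated groups is their base rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(n, base_rate_shift):
    """Fictitious defendants: true recidivism probabilities and simulated outcomes.

    The 'score' IS the true probability, so it is unbiased for every subgroup
    by construction. base_rate_shift just moves the whole distribution up or
    down to create groups with different overall recidivism rates.
    """
    true_prob = np.clip(rng.normal(0.45 + base_rate_shift, 0.10, n), 0.01, 0.99)
    recidivated = rng.random(n) < true_prob
    high_risk = true_prob >= 0.45            # same cutoff applied to both groups
    return recidivated, high_risk

for name, shift in [("higher-base-rate group", +0.05), ("lower-base-rate group", -0.05)]:
    recid, high = simulate_group(100_000, shift)
    fnr = (~high & recid).sum() / recid.sum()       # recidivists rated low risk
    fpr = (high & ~recid).sum() / (~recid).sum()    # non-recidivists rated high risk
    print(f"{name}: base rate {recid.mean():.1%}, FNR {fnr:.1%}, FPR {fpr:.1%}")

# Even with zero bias anywhere in the score, the higher-base-rate group ends up
# with more "false positives" and the lower-base-rate group with more "false
# negatives", which is the same pattern the article reports.
```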