Wednesday, May 25, 2016

Propublica’s “Machine Bias” Article is Incredibly Biased

[If you find yourself getting angry at any of what I’ve written below, please read to the end. I am not rendering a verdict on the racial bias/fairness of the overall criminal justice system. I am making a pretty narrowly focused critique of an article that recently appeared on Propublica regarding the use of predictive modeling in criminal justice. My expertise, and the topic of this commentary, is predictive modeling, not criminal justice. Although I certainly have opinions about both.]

I read this Propublica article on the use of statistical modeling to predict the recidivism of criminals, and I found it to be incredibly misleading.

To their credit, the authors make their methods and even their data incredibly transparent and link to a sort of technical appendix.

The thrust of the article is that the statistical methods used to predict recidivism by criminals are biased against black people, and that attempts to validate such statistical methods either haven’t bothered to check for such a bias or have overlooked it. I smelled a rat when the article started making one-by-one comparisons. “Here’s a white guy with several prior convictions and a very low risk-score. Here’s a black guy with only one prior and a high risk-score. It turns out the white guy did indeed commit additional crimes, and the black guy didn’t.” (Paraphrasing here, not a real quote.) This absolutely screams cherry-picking to me. I don’t know exactly what else went into the risk scoring (a system called COMPAS by a company named Northpointe), but surely it contains information other than the individual’s prior criminal record. Surely other things (gender, age, personality, questionnaire responses, etc.) are in their predictive model, so it’s conceivable that someone with a criminal history but otherwise a confluence of good factors will be given a low risk score. Likewise, someone with only one prior conviction but a confluence of otherwise bad factors will get a high risk score. After the fact, you can find instances of people who were rated high-risk but didn’t recidivate, and people who were rated low-risk but did recidivate. These one-off "failures" of the model are irrelevant. What matters is how well the predictive model does *in aggregate.*

I build predictive models for a living. I don’t know exactly what kind of model COMPAS is. My best guess is that it’s some kind of logistic regression or a decision tree model. (Some of these details are proprietary.) The output of these models isn’t a “yes/no” answer. It’s a probability. As in, “This person has a 30% chance of committing a crime in the next 2 years.” You can’t measure the performance of a predictive model by looking at individual cases and saying, “This guess was correct, this guess was incorrect…” You have to do this kind of assessment in aggregate. To do this, one would typically calculate everyone’s probability of recidivism, pick some kind of reasonable grouping (like “0-5% probability is group 1, 6-10% probability is group 2, …, 96-100% probability is group 20”), then compare the model predictions to the after-the-fact recidivism rates. If, for example, you identify a grouping of individuals with a 35% probability of recidivism, and 35% of those individuals recidivate, your model is pretty good. You aren’t going to build a model with a sharp distinction like “This guy *will* recidivate and this guy *won’t*.” Most likely you will get probabilities spread out fairly evenly over some range. You could, for example, get an equal number of individuals at every percentile (1%, 2%, 3%, and each integer % up to 100%). More often with models like these you get something that is distributed across a narrower range. Perhaps everyone’s statistically predicted recidivism rate is between, say, 30% and 60%, and the distribution is a bell-curve within that range with a midpoint near 45%. You don’t typically get a bimodal distribution, with people clustering around 1% and 99%. In other words, a predictive model doesn’t make clear “yes/no” predictions.
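The grouping check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and outcomes are completely fictitious (drawn from a bell-shaped distribution centered near 45%), the function name `calibration_table` is made up for this example, and nothing here reflects how COMPAS itself is built or validated.

```python
import random

def calibration_table(probs, outcomes, n_bins=20):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed outcome rate in each bin."""
    counts = [0] * n_bins
    pred_sums = [0.0] * n_bins
    event_sums = [0] * n_bins
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[b] += 1
        pred_sums[b] += p
        event_sums[b] += y
    rows = []
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # skip empty bins
        rows.append({
            "bin": (b / n_bins, (b + 1) / n_bins),
            "n": counts[b],
            "mean_predicted": pred_sums[b] / counts[b],
            "observed_rate": event_sums[b] / counts[b],
        })
    return rows

rng = random.Random(0)
# Fictitious predicted probabilities: bell-shaped, midpoint near 45%.
probs = [rng.betavariate(9, 11) for _ in range(200_000)]
# Fictitious outcomes drawn from those same probabilities,
# so this "model" is well calibrated by construction.
outcomes = [1 if rng.random() < p else 0 for p in probs]

table = calibration_table(probs, outcomes)
for row in table:
    low, high = row["bin"]
    print(f"{low:.2f}-{high:.2f}  n={row['n']:>6}  "
          f"predicted={row['mean_predicted']:.3f}  observed={row['observed_rate']:.3f}")
```

For a well-calibrated model the two right-hand columns should track each other closely in every group that has a reasonable number of people in it. Large gaps in well-populated groups would be the aggregate evidence of a bad model, which no single case can provide.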

I don’t see any understanding of these predictive modeling concepts in the piece above. Nothing in it indicates that the author is competent to validate or judge a predictive model. In fact, when it says things like “Two years later, we know the computer algorithm got it exactly backward.” and “Only 20 percent of the people predicted to commit violent crimes actually went on to do so.” and “the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.”, it betrays a real ignorance about how predictive models work. To be fair, the page describing the analysis is far more nuanced than the article itself. Something got lost in translating a reasonable and competent statistical analysis into an article for the general public.

The bottom line for this piece is the “false positive/false negative differences for blacks and whites” result. See the table at the bottom of the main piece. If we look at black defendants who recidivated, 27.99% were rated low risk (the “false negative” rate); this number is 47.72% for whites. The authors interpret this as (my own paraphrase): the algorithm is more lenient for whites, because it’s mislabeling too many of them as low-risk. If we look at black defendants who did not recidivate, 44.85% were rated high-risk (the “false positive” rate); this number is 23.45% for whites. The authors once again interpret this as the algorithm being lenient toward whites, since it is more likely to mislabel non-recidivating blacks as high-risk. At first this really did seem troubling, but then I looked at the underlying numbers.

Black defendants were more likely than white defendants to have a high score (58.8% vs 34.8%), but that alone does not imply an unfair bias. Despite their higher scores, blacks had a higher recidivism rate than whites for *both* high- and low-scoring populations. Blacks with a high score had a 63% recidivism rate, while high-scoring whites had a 59.1% recidivism rate. The difference is even bigger for the low scorers. Anyway, I’m willing to interpret these differences as small and possibly statistically insignificant. But it’s pretty misleading to say that the scoring is unfair to blacks. The scoring is clearly discriminating between the high-recidivism and low-recidivism populations, and its predictive performance is similar for whites and blacks. I think the “false positive/false negative” result described in the above paragraph is just a statistical artifact of the fact that black defendants, for whatever reason, are more likely to recidivate (51.4% vs 39.4%, according to Propublica’s data). It’s almost as if the authors looked at every conceivable ratio of numbers from this “High/Low score vs Did/didn’t recidivate” table and focused on the only thing that looked unfavorable to blacks.
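To see how tightly these figures are linked, here is a small Python sketch that applies Bayes' rule to the three numbers quoted above for each group: the share scored high, the recidivism rate among high scorers, and the overall recidivism rate. The function name `error_rates` is made up for this illustration and the inputs are the rounded percentages from this post, so the results should only be expected to land within rounding of the false positive and false negative rates quoted earlier.

```python
def error_rates(share_high, recid_given_high, base_rate):
    """Derive false positive / false negative rates from three summary numbers.

    share_high       -- fraction of the group given a high score
    recid_given_high -- recidivism rate among those scored high
    base_rate        -- overall recidivism rate for the group
    """
    # P(high | did not recidivate) = P(high) * P(no recid | high) / P(no recid)
    false_positive_rate = share_high * (1 - recid_given_high) / (1 - base_rate)
    # P(low | recidivated) = 1 - P(high) * P(recid | high) / P(recid)
    false_negative_rate = 1 - share_high * recid_given_high / base_rate
    return false_positive_rate, false_negative_rate

# Rounded figures quoted in this post (Propublica's Broward County data).
black_fpr, black_fnr = error_rates(0.588, 0.630, 0.514)
white_fpr, white_fnr = error_rates(0.348, 0.591, 0.394)

print(f"black: false positive {black_fpr:.1%}, false negative {black_fnr:.1%}")
print(f"white: false positive {white_fpr:.1%}, false negative {white_fnr:.1%}")
```

The high-score recidivism rates are close for the two groups (63% vs 59.1%), so in this arithmetic the gap in false positives and false negatives comes mostly from the other two inputs, the share scored high and the overall recidivism rate.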

(It should be noted that this analysis relies on data from Broward County in Florida. We can't necessarily assume any of these results generalize to the rest of the country. Any use of "black" or "white" in this post refers specifically to populations in this not-necessarily-representative sample.)  

It’s kind of baffling when you read the main piece and then see how it’s contradicted by the technical appendix. In the technical details, you can clearly see the survival curves are different for different risk classes. See the survival curves here, about 3/4 of the way down the page. In a survival curve people drop out of the population, in this case because they recidivate (in a more traditional application, because they literally die). The high-risk category is clearly recidivating more than the medium risk category, and the medium more than the low. If you compare the survival curves for the same risk category across races (e.g. compare high-risk blacks to high-risk whites), you can even see how blacks in the same risk category have a slightly higher recidivism rate. Contra the main article, this score is doing exactly what it’s supposed to be doing.

Sorry if this all seems terribly pedantic and beside the point. I even find myself saying, “Yes, yes, but there really *is* a serious problem here, accurate scores or not.” I’m definitely sympathetic to the idea that our criminal justice system is unfair. We police a lot of things that shouldn’t be policed, and we fail to adequately police things that *should* be policed. Clearance rates on real crimes, like theft, murder, and rape, are pathetically low, while non-crimes like drug use and drug commerce are actively pursued by law enforcement. If these scores are predicting someone’s tendency to commit a drug offense, and such a prediction is used *against* that person, then I will join the chorus condemning these scores as evil. However, it’s not fair to condemn them as “racist” when they are accurately performing their function of predicting recidivism rates and sorting high- and low-risk individuals. I also don’t think it’s fair to condemn the entire enterprise of predictive modeling because “everyone’s an individual.” The act of predicting recidivism will be done *somehow* by *someone*. That somehow/someone could be a computer performing statistical checks on its predictions, basing its predictive model on enormous sample sizes, and updating itself based on new information. Or it can be a weary human judge, perhaps hungering for his lunch break, perhaps irritated by some matter in his personal life, perhaps carrying with him a lifetime’s worth of prejudices, who in almost no one’s conception performs any kind of prediction-checking or belief updating. Personally, I’ll take the computer.


  1. I think the main issue they pointed out here was that the algorithm was more likely to have an error in favor of a white defendant, and more likely to have an error not in favor of a black defendant. I think the error was like 15 percent. So if a black and a white defendant were each to have a 50 percent chance of reoffending, the algorithm would give the white defendant a 35 percent chance and the black defendant a 65 percent chance on average. That's bad. I also don't think the article claimed the algorithm was less accurate or more biased than a judge; they didn't attempt to measure that.

  2. @ JH.
    Thanks for your comment.
    Did you read the technical write-up (the second link in my post above)? It explains how the analysis was done, and it was clear from this that the model wasn’t biased. A bias would mean that for some identifiable subgroup, the model overstates the chance of recidivism (and likewise understates it for those not in the subgroup). It doesn’t have to be race, either. Suppose the model accurately predicted that, say, younger offenders are more likely to recidivate than older offenders (or males vs females, or people with multiple priors vs people with 1 or no priors). Even if the model was unbiased (the predicted recidivism rate was approximately equal to the actual observed rate, as is the case for the dataset Propublica was using), you’d see the same kind of “bias” that Propublica found. The high-recidivism population will *always* have more “false positives” and the low-recidivism population will always have more “false negatives” as calculated in the Propublica piece. See my discussion in the paragraph above and below the data table. They could have written an equally compelling piece arguing that the model too easily lets females off the hook, but identifies too many non-recidivating males as high-risk. It’s not fair to call this a “bias,” though, because this pattern will emerge from a perfectly unbiased model. Indeed you’d have to bias your model to avoid this issue of disparities in false positives/false negatives.

    Open up an Excel workbook and create some completely fictitious sample data (as in the data table above) to convince yourself of this. You'll see that the high-recidivism group has more false positives, etc. I had scratched out a blog post expanding on this point but never shared it. Maybe I’ll have to dust it off and consider posting it.
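The same exercise can be sketched in Python instead of Excel. Everything in it is an assumption made for illustration: two fictitious groups whose true recidivism probabilities follow different bell-shaped distributions, a "model" that reports each person's true probability exactly (so it is unbiased by construction), and a 50% cutoff for calling someone high-risk. The name `simulate_group` and the distribution parameters are invented for this example.

```python
import random

def simulate_group(rng, alpha, beta, n=50_000, threshold=0.5):
    """Simulate one group scored by a perfectly calibrated model."""
    tp = fp = tn = fn = 0
    for _ in range(n):
        p = rng.betavariate(alpha, beta)  # true probability, also the model's score
        recidivated = rng.random() < p    # outcome drawn from that same probability
        rated_high = p > threshold
        if rated_high and recidivated:
            tp += 1
        elif rated_high:
            fp += 1   # rated high-risk, did not recidivate
        elif recidivated:
            fn += 1   # rated low-risk, did recidivate
        else:
            tn += 1
    return {
        "base_rate": (tp + fn) / n,
        "recid_given_high": tp / (tp + fp),
        "recid_given_low": fn / (fn + tn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

rng = random.Random(42)
group_a = simulate_group(rng, 4, 4)  # fictitious higher-recidivism group
group_b = simulate_group(rng, 3, 5)  # fictitious lower-recidivism group

for name, stats in (("A", group_a), ("B", group_b)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In this setup the higher-recidivism group should come out with noticeably more false positives and fewer false negatives, even though the score is exactly right for every individual in both groups.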