Saturday, January 14, 2017

Pet Peeves of a Data Analyst

[Edit: I wrote this a few days ago then reread just before sharing. It comes off as more sneering and angry than I would like, though I was mostly in a good mood when I wrote it. I'm sharing anyway, recognizing that it could probably be written in a way that's more understanding to the peeve-inducers. I tried, when I could, to be constructive rather than abusive.]

I work with large datasets for a living, doing all manner of statistical analysis. My work ranges from simple summary statistics to model building and feature engineering on big datasets. I thought I’d list a few pet peeves of mine.

1)      Fake Precision on Noisy Data. Someone always pipes up with “Did you adjust/control for X?” (where X happens to be a hobby horse of theirs). Sometimes this is welcome, but sometimes the dataset is too small, and therefore not credible enough, for refined adjustments to matter. Sometimes the “analysis” is basically fitting a line through some random dots. If you had infinite data, the dots would *not* be random, but would form a noticeable pattern of some sort (linear or otherwise). But you have to work with the data you actually have. If your data are too sparse to see the real pattern, any “adjustment” you make, even if there’s some good theoretical reason for making it, isn’t going to matter. It just means you’re fitting a line through *this* set of random dots rather than *that* set. I’ve also seen maddening arguments about whether some set of points should be fitted with a line or a curve or some other sort of grouping, when in reality the data points are too noisy to make any such determination.
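
To make the "random dots" point concrete, here is a minimal Python sketch. It is a toy simulation, not anything from a real analysis: the true slope, the noise level, and the sample sizes are all invented for illustration. It repeatedly draws a noisy sample around the same underlying line, fits a least-squares line each time, and measures how much the fitted slope jumps around from sample to sample.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def slope_spread(n_points, true_slope=0.5, noise_sd=5.0, n_trials=200, seed=0):
    """Standard deviation of the fitted slope across repeated noisy samples.

    All parameter values are made up for this illustration.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_trials):
        xs = [rng.uniform(0, 10) for _ in range(n_points)]
        ys = [true_slope * x + rng.gauss(0, noise_sd) for x in xs]
        slopes.append(ols_slope(xs, ys))
    return statistics.stdev(slopes)

small = slope_spread(n_points=15)     # a sparse dataset
large = slope_spread(n_points=2000)   # a much bigger one

print(f"slope spread with 15 points:   {small:.3f}")
print(f"slope spread with 2000 points: {large:.3f}")
```

Under these assumptions, the spread of the fitted slope on 15 points is about as big as the true slope itself, so a refinement that nudges the slope a little is lost in the noise. On 2,000 points the same refinement could plausibly be detected.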

2)      Last-Minute Demands on a Big Dataset. It’s often said that data modeling is 90% data gathering/cleaning and 10% model building. So it’s a huge headache when someone has a bright idea for a last-minute insertion. Sometimes this is the fault of the modelers, but usually it’s wishy-washy management deciding at the very last second that something they thought of just now is very, very important. “Wait, will we be including social media history in our analysis of auto accident frequency? I didn’t see it in the list of variables. Let’s add it!” The data modeling people sigh at these kinds of requests, because it usually means a few days of additional data gathering and a delay in a (perhaps already determined) modeling schedule. Data modelers should keep lines of communication open and set some kind of “no further adjustments” date so that this doesn’t happen. But it probably will anyway.

3)      “Well, I did this at my former company…” Sometimes a person will move from a large company with millions of customers and a huge dataset to a small or medium-sized company. They often bring with them unrealistic expectations. With larger datasets it is easier to see very weak patterns that might be invisible on a smaller dataset. The pattern may be real, but you won’t see it in your data. At my company, you have to assemble several years' worth of customer data, in countrywide aggregates, to see real patterns. A much larger company might have 20% of the adult population as its customers, and can thus see real patterns in a single quarter’s worth of data in a single state or region. If a manager gets used to the latter, they will have unreasonable expectations when they move to a smaller company. Sometimes this is a matter not of data size but of expertise. If your previous company had three data science PhDs, a team of actuaries, and a bunch of SQL experts, but your new company has a bunch of Excel jockeys who barely know how to run an Excel regression, you won’t be able to implement all your awesome ideas at the new company. At the very least, it may take some time to develop the appropriate skill set.
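
A rough back-of-the-envelope calculation shows why the same weak pattern is obvious at a giant company and invisible at a small one. The sketch below uses a standard two-proportion z-score. The accident rates (5.0% versus 5.5%) and the book sizes are hypothetical numbers chosen only to illustrate the scaling.

```python
import math

def z_score(rate_a, rate_b, n_per_group):
    """Approximate z-score for the difference between two observed rates,
    assuming independent groups of equal size."""
    se = math.sqrt(rate_a * (1 - rate_a) / n_per_group
                   + rate_b * (1 - rate_b) / n_per_group)
    return (rate_b - rate_a) / se

# Hypothetical: the same real effect (5.0% vs 5.5% accident frequency)
# seen through books of business of very different sizes.
z_small = z_score(0.050, 0.055, n_per_group=2_000)
z_big = z_score(0.050, 0.055, n_per_group=200_000)

print(f"z-score with   2,000 customers per group: {z_small:.2f}")
print(f"z-score with 200,000 customers per group: {z_big:.2f}")
```

With these made-up rates, the small book gives a z-score well under 2, which is indistinguishable from noise, while the large book gives one well above 5. The effect is identical in both cases. Only the amount of data changed.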

4)      Ignoring the modeling results and expecting there to be no consequences. Data models sometimes give us surprising results. That’s why we do them. If we knew all the patterns ahead of time, there would be no need for the modeling. But I sometimes get requests such as: “I’m going to ignore a piece of your modeling results. Re-run your model so that I’m still right.” For example, a model might tell you that people above age 60 are at increased risk of an auto accident, but a manager objects that this is their target market. They may ask you to re-run the model without including age as a predictive variable, hoping that the new model will offset their (possibly unwise) business decision. (This is a completely made up example that is vaguely similar to something that might actually happen.) There may be some rare instances where this is appropriate, but managers need to understand when they are throwing away predictive power with their business decisions. If you throw a predictive variable out of your model, you are throwing away predictive power and you can’t get it back. Worse, the model will try to find the influence of that variable elsewhere, so you may end up with a model that is altogether weaker. To take an example, suppose I have a multivariable regression including driver age and other variables on auto accident frequency; if I throw driver age out of my model, the other variables that correlate with driver age will adjust to try to pick up the lost signal. It’s probably better to keep driver age in the model and consider it a “control variable” that won’t count against the driver (because you won't charge the customer based on their age or something). At any rate, it’s delusional to think that you can throw away predictive information and somehow totally offset this decision.
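
The "lost signal" effect can be sketched with a toy simulation. Everything here is invented: `age` is a standardized driver age that truly drives accident frequency, and `years_licensed` is a hypothetical second variable that is correlated with age but has no effect of its own in this toy world. The code fits an ordinary least-squares regression with and without age, using only the Python standard library.

```python
import math
import random
import statistics

def center(v):
    m = statistics.fmean(v)
    return [x - m for x in v]

def fit_one(x, y):
    """OLS slope of y on a single predictor."""
    x, y = center(x), center(y)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def fit_two(x1, x2, y):
    """OLS slopes of y on two predictors via the 2x2 normal equations."""
    x1, x2, y = center(x1), center(x2), center(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def rmse(y, xs, coefs):
    """Root-mean-square residual of a centered linear fit."""
    y = center(y)
    xs = [center(x) for x in xs]
    resid = [yi - sum(c * x[i] for c, x in zip(coefs, xs))
             for i, yi in enumerate(y)]
    return math.sqrt(sum(r * r for r in resid) / len(resid))

rng = random.Random(42)
n = 20_000
age = [rng.gauss(0, 1) for _ in range(n)]
# Correlated with age, but with no direct effect on frequency (by construction).
years_licensed = [0.7 * a + rng.gauss(0, math.sqrt(0.51)) for a in age]
freq = [1.0 * a + rng.gauss(0, 1) for a in age]

b_age, b_other = fit_two(age, years_licensed, freq)
b_other_alone = fit_one(years_licensed, freq)
rmse_full = rmse(freq, [age, years_licensed], [b_age, b_other])
rmse_reduced = rmse(freq, [years_licensed], [b_other_alone])

print(f"with age:    age={b_age:.2f}, years_licensed={b_other:.2f}, rmse={rmse_full:.2f}")
print(f"without age: years_licensed={b_other_alone:.2f}, rmse={rmse_reduced:.2f}")
```

In this setup, the coefficient on `years_licensed` sits near zero while age is in the model, then jumps to a large value once age is removed, because it is standing in for the missing variable. The residual error also goes up, which is the predictive power that was thrown away.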

5)      Hobby horsing (again). As in, “Hey, I did X once and it was important and really made a big difference. Did you do X?” This overlaps with 3), obviously. Just because your brilliant insight saved the day once doesn’t mean it’s going to matter every time. It’s extremely annoying when people try to shoehorn their awesome idea into every single project.

6)      Finding a bullshit reason to discredit something. Often someone will object to an analysis the conclusion of which they dislike. A much more toxic dynamic is when someone dislikes a specific *person* and looks for stupid reasons to discredit their work. This is incredibly demoralizing to the data people and managers need to be very aware of when they are doing it. To guard against this, a data modeler should anticipate such objections and be ready to answer them, even going so far as to prepare for a specific person known to be an incredulous hard-ass.

7)      Someone gets mad at you for finding something inconvenient. I understand when someone responds incredulously to incredible results. Sometimes the data modeler really did goof. But sometimes the incredulity persists after all the objections are answered. “Yes, I adjusted for this. Yes, I controlled for that. Yes, I filtered for those.” After all this, perhaps you *still* conclude that your target market doesn’t deserve that huge discount you’ve been giving them, or your giant marketing initiative didn’t work. Accept the results and move on. I once had a boss who simply could not accept inconvenient results. He would come up with bullshit adjustments or filters or something hoping to get the results he wanted. It felt like I was being punished for bringing him bad news. Don’t be that guy.

8)      The regulatory state. I work in the insurance business as a research actuary. I am often in charge of crafting responses to regulators, sometimes filling out standard filing forms. Every single state (except Wyoming) requires that you file your rate plan with the state department of insurance (DOI), and every state DOI reserves the right to object to any filing. There is often a painful back-and-forth where regulators ask annoying questions and the insurance company tries to answer them. Sometimes the objections are based on the violation of a specific statute, and sometimes the reasons for objecting are far more capricious. They often have no statutory authority for their objections, or authority that comes from a lame catch-all (such as a vague law saying that rates must be “actuarially sound”, “not unfairly discriminatory”, etc.). This is a major pain for data modelers. I’m not making a libertarian stand here; if the state outlaws race-based discrimination and asks for reasonable proof that a model is not engaging in any such discrimination, I don’t object to that. My major gripe is that many of these departments are decades behind the latest modeling methods. Their standardized form questions often betray their ignorance.* A standard filing form from one state (see footnote below for additional detail) was littered with questions that looked like they were copied from a standard textbook on traditional linear models (very old school) but which are irrelevant to generalized linear models (glms, very commonly used in my industry). It’s like the actuaries at that department went to a predictive modeling seminar for a day or two and came back thinking they were experts on the topic. Then they copied some wording from the session handouts and turned it into an official state document. The non-standard questions we get from state DOIs are no better. 
Almost every filing is met with an “objection letter,” in which a DOI employee asks questions specific to the filing. One such question was (and I swear to you I am paraphrasing only slightly here): “What is a multivariate model?” Anyone remotely knowledgeable would have known that the term “multivariate” was a reference to glms, which almost every insurance company uses. These non-practitioners (I nearly said non-experts, but that would be a woeful understatement) have pathetically little knowledge of what they are actually regulating. I suspect this is the same in other industries. The latest, most cutting-edge methods must not only be justified but must also comply with decades-old language written for a different purpose. Such cutting-edge tech must be explained to laymen who have the final decision. Once again, I’m not here to critique government regulation in general. I just think it’s not too much to ask that government employees understand the thing they are regulating. And if a government agency cannot afford to keep such expertise on retainer, they need to relax or repeal those regulations. This single factor is a huge barrier to innovation. If we can’t use a model unless it can be explained to an ignorant non-practitioner, that severely restricts what we can do. Another problem is that regulators sometimes demand a very specific statistical test when in practice something might be more of a judgment call. Data modeling is a process that requires a great deal of human judgment. What variables should I include? How should I group things? What kind of curve should I fit? Regulators often demand an unreasonable sort of rigor in the model-building process where all of this judgment is stripped away and replaced with an unalterable decision tree. Such a process will often lead to models with nonsensical results that any reasonable person can spot and fix, but the regulators interpret any insertion of human judgment as an attempt to be devious.

This list is by no means exhaustive. Obviously it could be longer, or shorter, but these were the peeves that occurred to me on one evening. 

* A questionnaire on generalized linear models (glms) in one state asks about tests for “homoscedasticity”, which means that the variance does not vary with the expected mean. (When this assumption fails, the residuals on a residuals plot spread out more on one side of the graph, rather than having a roughly constant standard deviation across the range.) This is a topic in traditional linear models, but the power of a glm is that you can relax the “constant variance” assumption. (You can make your error term gamma, Poisson, Tweedie, etc., rather than the traditional Gaussian, which assumes a constant variance.) Apologies if these technical details are confusing to the uninitiated, but it is really basic stuff as far as glms go. And every company is using these now.
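
For anyone who wants to see why a constant-variance test is the wrong question for count data, here is a small sketch using only the Python standard library. It simulates Poisson claim counts at a few hypothetical mean frequencies (the values 0.5, 2.0, and 8.0 are arbitrary) and compares the sample variance to the sample mean. The sampler is Knuth's simple multiplication method, which is fine for small means like these.

```python
import math
import random
import statistics

def poisson_draw(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(1)
results = {}
for mean in (0.5, 2.0, 8.0):   # hypothetical expected claim counts
    draws = [poisson_draw(rng, mean) for _ in range(20_000)]
    results[mean] = (statistics.fmean(draws), statistics.variance(draws))

for mean, (m, v) in results.items():
    print(f"expected mean {mean}: sample mean {m:.2f}, sample variance {v:.2f}")
```

For a Poisson variable the variance equals the mean, so the spread grows as the expected count grows. Non-constant variance is built into the model by design, and is not a defect for a test to catch.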
