[Edit: I wrote this a few days ago then reread just before sharing. It comes off as more sneering and angry than I would like, though I was mostly in a good mood when I wrote it. I'm sharing anyway, recognizing that it could probably be written in a way that's more understanding to the peeve-inducers. I tried, when I could, to be constructive rather than abusive.]
I work with large datasets for a living, doing all manner of statistical analysis. My work ranges from simple summary statistics to model building and feature engineering on big datasets. I thought I’d list a few pet peeves of mine.
1) Fake Precision on Noisy Data. Someone always pipes in with “Did
you adjust/control for X?” (X being, as it happens, a hobby horse of theirs.) Sometimes
this is welcome, but sometimes the dataset is too small and therefore not
credible enough for refined adjustments to matter. Sometimes the “analysis” is
basically fitting a line through some random dots. If you had infinite data,
the dots would *not* be random, but would be a noticeable pattern of some sort
(linear or otherwise). But you have to work with the data you actually have. If your
data are too sparse to see the real pattern, any “adjustment” you make, even if
there’s some good theoretical reason for making it, isn’t going to matter. It
just means you’re fitting a line through *this* set of random dots rather than *that*
set. I’ve also seen maddening arguments about whether some set of points should
be fitted with a line or a curve or some other sort of grouping, when in
reality the data points are too noisy to make any such determination.
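To make this concrete, here is a toy sketch (Python, with completely invented numbers): when the sample is small and noisy, the fitted slope swings wildly from one random sample to the next, so arguing over refinements to the fit is beside the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# A weak but real relationship buried in a lot of noise (all numbers invented).
true_slope, true_intercept, noise_sd = 0.5, 10.0, 25.0

def fitted_slope(n_points):
    """Draw one noisy sample and return the ordinary-least-squares slope."""
    x = rng.uniform(0, 10, size=n_points)
    y = true_intercept + true_slope * x + rng.normal(0, noise_sd, size=n_points)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return round(slope, 2)

# With 20 points the estimated slope is all over the place; with 20,000
# points it settles near the true value of 0.5.
print([fitted_slope(20) for _ in range(5)])
print([fitted_slope(20_000) for _ in range(5)])
```

With the small samples, any “adjustment” mostly just reshuffles which random dots the line goes through.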
2) Last Minute Demands on a Big Dataset. It’s often
said that data modeling is 90% data gathering/cleaning and 10% model building.
So it’s a huge headache when someone has a bright idea for a last minute insertion.
Sometimes this is the fault of the modelers, but usually it’s wishy-washy
management deciding at the very last second that something they just thought of
just now is very, very important. “Wait, will we be including social media
history in our analysis of auto accident frequency? I didn’t see it in the list
of variables. Let’s add it!” The data modeling people sigh at these kinds of
requests, because it usually means a few days of additional data gathering and
a delay in a (perhaps already determined) modeling schedule. Data modelers should
keep lines of communication open and set some kind of “no further adjustments”
date so that this doesn’t happen. But it probably will anyway.
3) “Well, I did this at my former company…”
Sometimes a person will move from a large company with millions of customers
and a huge dataset to a small or medium-sized company. They often bring with
them unrealistic expectations. With larger datasets it is easier to see very
weak patterns that might be invisible on a smaller dataset. The pattern may be
real, but you won’t see it in your data. At my company, you have to assemble
several years' worth of customer data, in countrywide aggregates, to see real
patterns. A much larger company might have 20% of the adult population as its
customers, and can thus see real patterns in a single quarter’s worth of data
in a single state or region. If a manager gets used to the latter, they will
have unreasonable expectations when they move to a smaller company. Sometimes
this might be a matter not of data size but of expertise. If your previous
company had three data science PhDs, a team of actuaries, and a bunch of SQL
experts, but your new company has a bunch of Excel jockeys who barely know how
to run an Excel regression, you won’t be able to implement all your awesome
ideas at the new company. At the very least, it may take some time to develop
the appropriate skill set.
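To illustrate the data-size point above with a toy simulation (Python, invented rates): a genuinely real but weak difference between two customer segments shows up clearly with millions of records and is essentially invisible in a sample the size a smaller company might have.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two customer segments whose true accident rates differ by a small,
# real amount (rates invented for illustration).
rate_a, rate_b = 0.050, 0.052

def p_value(n_per_group):
    """Simulate one claim indicator per customer and test for a difference."""
    claims_a = rng.binomial(1, rate_a, size=n_per_group)
    claims_b = rng.binomial(1, rate_b, size=n_per_group)
    _, p = stats.ttest_ind(claims_a, claims_b)
    return p

print(p_value(10_000))     # small book of business: usually not significant
print(p_value(5_000_000))  # huge book of business: the weak pattern is obvious
```

The pattern is equally real in both cases; only the ability to see it differs.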
4) Ignoring the modeling results and expecting
there to be no consequences. Data models sometimes give us surprising results.
That’s why we do them. If we knew all the patterns ahead of time, there would be
no need for the modeling. But I sometimes get requests such as: “I’m going to
ignore a piece of your modeling results. Re-run your model so that I’m still
right.” For example, a model might tell you that people above age 60 are at
increased risk of an auto accident, but a manager objects that this is their
target market. They may ask you to re-run the model without including age as a
predictive variable, hoping that the new model will offset their
(possibly unwise) business decision. (This is a completely made up example that is vaguely similar to something that might actually happen.) There may be some rare instances where
this is appropriate, but managers need to understand when they are throwing
away predictive power with their business decisions. If you throw a predictive
variable out of your model, you are throwing away predictive power and you can’t
get it back. Worse, the model will try to find the influence of that variable
elsewhere, so you may end up with a model that is altogether weaker. To take an
example, suppose I have a multivariable regression including driver age and
other variables on auto accident frequency; if I throw driver age out of my model,
the other variables that correlate with driver age will adjust to try to pick
up the lost signal. It’s probably better to keep driver age in the model and consider
it a “control variable” that won’t count against the driver (i.e., you keep age in the model but don’t actually vary the customer’s price based on it). At any rate, it’s
delusional to think that you can throw away predictive information and somehow
totally offset this decision.
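Here is a minimal sketch of that made-up example, with invented data and coefficients, assuming the statsmodels library: fit a toy Poisson GLM of accident counts on driver age and annual mileage (which are correlated), then refit without age. The mileage coefficient shifts to soak up part of the age signal, and the overall fit degrades.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50_000

# Invented drivers: age and annual mileage (in thousands) are correlated,
# and in this toy world both genuinely affect accident frequency.
age = rng.uniform(18, 80, size=n)
mileage = 15 + 0.1 * (80 - age) + rng.normal(0, 3, size=n)  # younger -> more miles
log_rate = -3.0 + 0.010 * (age - 40) + 0.05 * (mileage - 15)
accidents = rng.poisson(np.exp(log_rate))

X_full = sm.add_constant(np.column_stack([age, mileage]))
X_drop = sm.add_constant(mileage)

full = sm.GLM(accidents, X_full, family=sm.families.Poisson()).fit()
dropped = sm.GLM(accidents, X_drop, family=sm.families.Poisson()).fit()

# With age removed, the mileage coefficient moves to absorb part of the
# age signal, and the deviance (lack of fit) gets worse.
print("with age:   ", full.params, round(full.deviance, 1))
print("without age:", dropped.params, round(dropped.deviance, 1))
```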
5) Hobby horsing (again). As in, “Hey, I did X once
and it was important and really made a big difference. Did you do X?” This
overlaps with 3), obviously. Just because your brilliant insight saved the day
once doesn’t mean it’s going to matter every time. It’s extremely annoying when
people try to shoehorn their awesome idea into every single project.
6) Finding a bullshit reason to discredit
something. Often someone will object to an analysis whose conclusion they dislike.
A much more toxic dynamic is when someone dislikes a specific *person*
and looks for stupid reasons to discredit their work. This is incredibly
demoralizing to the data people, and managers need to be very aware of when they
are doing it. To guard against this, a data modeler should anticipate such
objections and be ready to answer them, even going so far as to prepare for a
specific person known to be an incredulous hard-ass.
7) Someone gets mad at you for finding something inconvenient.
I understand when someone responds incredulously to incredible results.
Sometimes the data modeler really did goof. But sometimes the incredulity
persists after all the objections are answered. “Yes, I adjusted for this. Yes,
I controlled for that. Yes, I filtered for those.” After all this, perhaps you *still*
conclude that your target market doesn’t deserve that huge discount you’ve been
giving them, or your giant marketing initiative didn’t work. Accept the results
and move on. I once had a boss who simply could not accept inconvenient
results. He would come up with bullshit adjustments or filters or something
hoping to get the results he wanted. It felt like I was being punished for
bringing him bad news. Don’t be that guy.
8) The regulatory state. I work in the insurance
business as a research actuary. I am often in charge of crafting responses to
regulators, sometimes filling out standard filing forms. Every single state
(except Wyoming) requires that you file your rate plan with the state
department of insurance (DOI), and every state DOI reserves the right to object
to any filing. There is often a painful back-and-forth where regulators ask
annoying questions and the insurance company tries to answer them. Sometimes
the objections are based on the violation of a specific statute, and sometimes
the reasons for objecting are far more capricious. They often have no statutory
authority for their objections, or authority that comes from a lame catch-all (such
as a vague law saying that rates must be “actuarially sound”, “not unfairly discriminatory”, etc.). This is a major pain for data modelers. I’m not making
a libertarian stand here; if the state outlaws race-based discrimination and
asks for reasonable proof that a model is not engaging in any such
discrimination, I don’t object to that. My major gripe is that many of these
departments are decades behind the latest modeling methods. Their standardized
form questions often betray their ignorance.* A standard filing form from one
state (see footnote below for additional detail) was littered with questions
that looked like they were copied from a standard textbook on traditional linear models (very old school) but
which are irrelevant to generalized linear models (glms, very commonly used in my industry). It’s like the actuaries at that department went
to a predictive modeling seminar for a day or two and came back thinking they
were experts on the topic. Then they copied some wording from the session
handouts and turned it into an official state document. The non-standard
questions we get from state DOIs are no better. Almost every filing is met with
an “objection letter,” in which a DOI employee asks questions specific to the
filing. One such question was (and I swear to you I am paraphrasing only
slightly here): “What is a multivariate model?” Anyone remotely knowledgeable would have known that the term "multivariate" was a reference to glms, which almost every insurance company uses. These non-practitioners (I
nearly said non-experts, but that would be a woeful understatement) have
pathetically little knowledge of what they are actually regulating. I suspect
this is the same in other industries. The latest, most cutting-edge methods
must be justified under decades-old language written for a different purpose,
and then explained to laymen who have the final say. Once again I’m
not here to critique government regulation in general. I just think it’s not too
much to ask that government employees understand the thing they are
regulating. And if a government agency cannot afford to keep such expertise on
retainer, they need to relax or repeal those regulations. This single factor is a huge barrier to
innovation. If we can’t use a model unless it can be explained to an ignorant
non-practitioner, that severely restricts what we can do. Another problem is that
regulators sometimes demand a very specific statistical test when in practice something might be more of a judgment call. Data modeling
is a process that requires a great deal of human judgment. What variables
should I include? How should I group things? What kind of curve should I fit? Regulators
often demand an unreasonable sort of rigor in the model-building process where all
of this judgment is stripped away and replaced with an unalterable decision tree. Such a process will often lead to models with nonsensical results that any reasonable person can spot and fix, but the regulators interpret any insertion of human judgment as an attempt to be devious.
This list is by no means exhaustive. Obviously it could be longer, or shorter, but these were the peeves that occurred to me on one evening.
* A questionnaire on generalized linear models (glms) in one
state asks about tests for “homoscedasticity”, which means that the variance
does not vary with the expected mean. (When this assumption is violated, the residuals
on a residuals plot spread out more on one side of the graph rather than having a
roughly constant standard deviation across the range.) This is a topic in traditional linear
models, but the power of a glm is that you can relax the “constant variance”
assumption. (You can make your error term gamma, Poisson, Tweedie, etc., rather
than the traditional Gaussian that results in a constant expected variance.) Apologies
if these technical details are confusing to the uninitiated, but it is really
basic stuff as far as glms go. And every company is using these now.
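For anyone who wants to see what “relaxing the constant-variance assumption” looks like in code, here is a tiny sketch using the statsmodels library on invented claim-severity data, where the spread of the data grows with the mean. A gamma GLM with a log link handles this directly; there is no homoscedasticity assumption to test in the first place.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10_000

# Invented severity data: claim size grows with a rating variable, and the
# spread grows with the mean, which is exactly where the old homoscedasticity
# check from traditional linear models misses the point.
x = rng.uniform(0, 1, size=n)
mean = np.exp(6 + 1.5 * x)                         # log link: log(mean) = 6 + 1.5x
severity = rng.gamma(shape=2.0, scale=mean / 2.0)  # variance scales with mean^2

X = sm.add_constant(x)
gamma_glm = sm.GLM(severity, X,
                   family=sm.families.Gamma(link=sm.families.links.Log()))
print(gamma_glm.fit().params)  # recovers roughly [6, 1.5]
```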