Sabermetric Research: Academic rigor

Tuesday, April 10, 2012

Academic rigor

At the SABR Analytics Conference last month, a group of academics, led by Patrick Kilgo and Hillary Superak, presented some comments on the differences between academic sabermetric studies and "amateur" studies. The abstract and audio of their presentation are here (scroll down to "Friday"). Also, they have kindly allowed me to post their slides, which are in .pdf format here.

I'm not going to comment on the presentation much right now ... I'm just going to go off on one of the differences they spoke about, from page 11 of their slides:

-- Classical sabermetrics often uses all of the data -- a census.

-- [Academic sabermetrics] is built for drawing inferences on populations, based on the assumption of a random sample.

That difference hadn't occurred to me before. But, yeah, they're right. You don't often see an academic paper that doesn't include some kind of formal statistical test.

That's true even when better methods are available. I've written about this before, about how academics like to derive linear weights by regression, when, as it turns out, you can get much more accurate results from a method that uses only logic and simple arithmetic.

So, why do they do this? The reason, I think, is that academics are operating under the wrong incentives.

If you're an academic, you need to get published in a recognized academic journal. Usually, that's the way to keep your job, and get promoted, and eventually get tenure. With few exceptions, nobody cares how brilliant your blog is, or how much you know about baseball in your head. It's your list of publications that's important.

So, you need to do your study in such a way that it can get published.

In a perfect world, if your paper is correct, whether you get published would depend only on the value of what you discover. But, ha! That's not going to happen. For one thing, when you write about baseball, nobody in academia knows the value of what you've discovered. Sabermetrics is not an academic discipline. No college has a sabermetrics department, or a sabermetrics professor, or even a minor in sabermetrics. Academia, really, has no idea of the state of the science.

So, what do they judge your paper on? Well, there are unwritten criteria. But one thing that I'm pretty sure about, is that your methodology must use college-level math and statistics. The more advanced, the better. Regression is OK. Logit regression is even better. Corrections for heteroskedasticity are good, as are methods to make standard errors more robust.

This is sometimes defended under the rubric of "rigor". But, often, the simpler methods are just as "rigorous" -- in the normal English sense of being thorough -- as the more complicated methods. Indeed, I'd argue that computing linear weights by regression is *less* rigorous than doing it by arithmetic. The regression is much less granular. It uses innings or games as its unit of data, instead of PA. Deliberately choosing to ignore at least 3/4 of the available information hardly qualifies as "rigor", no matter how advanced the math.

Academics say they want "rigor," but what they really mean is "advanced methodology".

A few months ago, I attended a sabermetrics presentation by an academic author. He had a fairly straightforward method, and joked that he had to call his model "parsimonious," because if he used the word "simple," they'd be reluctant to publish it. We all laughed, but later on he told me he was serious. (And I believe him.)

If you want to know how many cars are in the parking lot today, April 10, you can do a census -- just count them. You'll get the right answer, exactly. But you can't get published. That's not Ph.D. level scholarship. Any eight-year-old can count cars and get the right answer.

So you have to do something more complicated. You start by counting the number of parking spots. Then, you take a random sample of spots, and see if there's a car parked in each one. That gives you a sample mean, and you can calculate the variance binomially, and get a confidence interval.
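To make the contrast concrete, here's a minimal sketch in Python of the two approaches. Everything in it is made up for illustration: the lot size, the 60 percent occupancy rate, the sample size of 100, and the function names `census_count` and `sample_estimate`. The interval uses the plain normal approximation to the binomial and ignores the finite population correction.

```python
import math
import random


def census_count(lot):
    """Count occupied spots directly: the exact answer, no error bars."""
    return sum(lot)


def sample_estimate(lot, n, rng, z=1.96):
    """Estimate the number of cars from a simple random sample of n spots.

    Returns (estimate, low, high), where the interval comes from the
    normal approximation to the binomial (no finite population correction).
    """
    total = len(lot)
    sampled = rng.sample(lot, n)               # n spots, without replacement
    p_hat = sum(sampled) / n                   # sample proportion occupied
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # binomial standard error
    low = max(0.0, p_hat - z * se)
    high = min(1.0, p_hat + z * se)
    return total * p_hat, total * low, total * high


# Hypothetical lot: 1,000 spots, each occupied with probability 0.6.
rng = random.Random(2012)
lot = [1 if rng.random() < 0.6 else 0 for _ in range(1000)]

exact = census_count(lot)
estimate, low, high = sample_estimate(lot, n=100, rng=rng)
print(f"census: {exact} cars")
print(f"sample of 100 spots: about {estimate:.0f} cars "
      f"(95% interval {low:.0f} to {high:.0f})")
```

The census line prints one number with no uncertainty attached. The sample line prints a range that is usually tens of cars wide, which is the price of not counting everything.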

But again, that's just too simple, a t-test based on binomial. You still won't get published. So, maybe you do this: you hang out in the parking lot for a few weeks, and take a detailed survey of parking patterns. (Actually, you get one of your grad students to do it.) Then, you run regressions based on all kinds of factors. What kind of sales were the stores having? What was the time of day? What was the price of gas? What day of the week was it? How close was it to a major holiday? How long did it take to find a parking spot?

So, now you're talking! You do a big regression on all this stuff, and you come up with a bunch of coefficients. That also gives you a chance to do those extra fancy regressiony tests. Then, finally, you plug in all the independent variables for today, April 10, and, voila! You have an estimate and a standard error.

Plus, this gives you a chance to discuss all the coefficients in your model. You may notice that the coefficient for "hour 6", which is 12pm to 1pm, is positive and significant at p=.002. You hypothesize that's because people like to shop at lunch time. You cite government statistics, and other sociological studies, that have also found support for the "meridiem emptor" hypothesis. See, that's evidence that your model is good!

And, everyone's happy. Sure, you did a lot more work than you had to, just to get a less precise estimate of the answer. But, at least, what you did was scholarly, and therefore publishable!

It seems to me that in academia, it isn't that important to get the right answer, at least in a field of knowledge that's not studied academically, like baseball. All journals seem to care about is that your methodology isn't too elementary, that you followed all the rules, and that your tone is suitably scholarly.

"Real" fields, like chemistry, are different. There, you have to get the right answer, and make the right assumptions, or your fellow Ph.D. chemists will correct you in a hurry, and you'll lose face. But, in sabermetrics, academics seem to care very little if their conclusions or assumptions about baseball are right or wrong. They care only that the regression appears to find something interesting. If it did, and their method is correct, they're happy. They did their job.

Sure, it could turn out that their conclusion is just an artifact of something about baseball that they didn't realize. But so what? They got published. Also, who can say they're wrong? Just low-status sabermetricians working out of their parents' basement. But the numbers in an academic paper, on the other hand ... those are rigorous!

And if the paper shows something that's absurd, so much the better. Because, nobody can credibly claim to know it's absurd -- it's what the numbers show, and it's been peer reviewed! Even better if the claim is not so implausible that it can't be rationalized. In that case, the author can claim to have scientifically overturned the amateurs' conventional wisdom!

The academic definition of "rigor" is very selective. You have to be rigorous about using a precise methodology, but you don't have to be rigorous about whether your assumptions lead to the right answer.

-----

Just a few days ago, after I finished my first draft of this post, I picked up an article from an academic journal that deals with baseball player salaries. It's full of regressions, and attention to methodological detail. At one point, the authors say, "... because [a certain] variable is potentially endogenous in the salary equation, we conduct the Hausman (1978) specification test ..."

I looked up the Hausman specification test. It seems like a perfectly fine test, and it's great that they used it. When you're looking for a small effect, every little improvement helps. Using that test definitely contributed to the paper's rigor, and I'm sure the journal editors were pleased.

But, after all that effort, how did their study choose to measure player productivity? By slugging percentage.

Sometimes, academia seems like a doctor so obsessed with perfecting his surgical techniques that he doesn't even care that he's removing the wrong organ.

Labels: academics, baseball, Kilgo, SABR, Superak

17 Comments:

At Tuesday, April 10, 2012 10:51:00 AM, David said...: First minor point, the link to the slides seems to be broken.

That aside, I'm an academic economist and generally amenable to the argument that we can learn a lot from specialists in the field of study (whether it's baseball or healthcare or whatever). While I am generally amenable, I'd make a couple points.

First, using a statistical model when you have a census of data is still legitimate because the data are not fixed; they're the result of a random data generating process. That’s why I don’t believe that someone’s talent level on a given day is perfect if they go 5 for 5 on that day. There’s noise in the process. Sabermetrics usually does a great job of incorporating this insight, so I don't see why you're arguing against statistical tests. We all need some stats to say anything.

That said, the best academic work (in my opinion) uses only the minimum sophistication necessary for the task. And a lot of the academic work I’ve read about baseball does a lot of substituting machinery for actual insight. You’re right that there are very low rewards to writing good academic papers about baseball, and the result is papers with a lot of relatively speaking fancy but poorly implemented analysis (e.g. the Hausman test is pretty much useless in all situations, in my opinion).

Second point and I’ll stop writing because this is way too long: some tasks require a lot of sophistication. Take something as simple as the DH penalty. A lot of the sabermetric community buys that you can just compare means for a player when he’s DH’ing vs not. But there’s an abundance of variables related to both batting outcomes and whether someone DH’s (injury, opponent, etc.). You need some sort of better research design.
At Tuesday, April 10, 2012 11:02:00 AM, Phil Birnbaum said...: Link fixed, thanks!
At Tuesday, April 10, 2012 12:31:00 PM, David said...: Thanks for linking! From the slides, it looks like I mostly agree with them. I'd be interested to hear your more general thoughts on the presentation. I like reading your blog because I find it to generally be a good compromise between the two approaches, so that's why I'm a bit confused about your skepticism about statistical sophistication.
At Tuesday, April 10, 2012 12:58:00 PM, BMMillsy said...: I find the general premise of this presentation to be pretty myopic.

Sure, a census tells us what happened before. But ultimately sabermetricians are making point estimates of, say, the value of a double, with error. Getting the average value of a double over the history of baseball is not the same as knowing exactly how much every person in the United States earned in 1960. In this vein, a census doesn't give us precise cause-effect estimates given the error around these point estimates without rigorous manipulation (and oftentimes even that does not come close to giving us the true cause-effect relationship).

Why does someone want to do any of that, or use all of the data available? To evaluate the value of something in the future.

In that case, you don't actually have all the data. You have all the data from time point 0,1,...T, but you do NOT have data from T+1, T+2, etc. That makes it a sample...albeit a very nice big one. This is why someone like Tango will emphasize a technique like weighting and regressing to the mean even if you have a player's entire career's worth of data.

The large majority of sabermetric work is inferential in that it wants to predict or understand future performance (or make associations between two events) based on past performance. Do you not think this to be the case?

The presenters get at this later on, but I think completely mis-characterize the dichotomy of the two areas with a myopic view of sabermetrics. They essentially characterize sabermetricians purely as historians, and I doubt that many that do the sort of work I see around the internet would claim this is the case.
At Tuesday, April 10, 2012 5:36:00 PM, mettle said...: I think the same point can be made in the opposite direction: SABR bloggers et al have the perverse incentive to make their stuff as easy to understand as possible, so they forgo even the most rudimentary statistical methods in favor of straight-out comparisons, or "buckets" instead of regression.

And to defend the use of complexity in analysis, reiterating some of what the other commenters have noted, sometimes it's the problem that *requires* a complex solution. The thing about SABR right now, is that there is so much low-hanging fruit that a t-test is all you need (or should need, but rarely see, unfortunately). Once it matures, however, I think you'll find that you need more and more complex methods because the differences are that much more subtle.

And as far as regressions go, since that seems to be a favorite hobby horse for the non-academic crowd: the results you highlight are a great example of the *known* problem with multiple simultaneous regression. Every stats person worth their salt knows it exists. So, this isn't an example of complex worse vs. simple better, it's an example of known to be wrong vs. less wrong.
At Wednesday, April 11, 2012 5:53:00 PM, Phil Birnbaum said...: I'm not disputing that sometimes you need to use a complex method to get your results (although, to be honest, I haven't seen too many of these).

What I'm saying is, academics can't get published with simple methods, so they often have to make things more complex than they otherwise would be.
At Wednesday, April 11, 2012 7:01:00 PM, David said...: Phil, I think it comes down to exactly your parenthetical statement which I (and I think most academics) would disagree with. I would argue that for almost any problem of interest, comparison of means is not enough because of rampant selection bias. Take the "pinch hitting" penalty. In "The Book", Tango and company measure the pinch hitting penalty by comparing at bats with a pinch hitter versus those without. They acknowledge and refute some potential selection bias due to inning-out-baserunner situation. I’m assuming they’re also only looking at players that ever pinch hit. This is good. But there's an endless list of factors that could affect the decision to pinch hit and batting outcomes: injury status of batter, quality of pitcher, score, etc. And some of these (like a batter who didn’t start because of a nagging minor injury) are never going to be observable. I would argue that almost any causal effect of interest has these issues, and to get at these issues, you need a better research design which often involves more than comparing means.

That said, I really enjoy your blog. I’m interested to know how, in a world where there’s always going to be some people who “love baseball, like math” and others that “like baseball, love math” how do things get better?
At Wednesday, April 11, 2012 8:58:00 PM, Patrick Kilgo said...: In response to Millsy's comments ... I think you are putting words in our mouth. We fully support the approach of making future predictions based on any available data, census or otherwise. Only a fool would argue at this point that census-driven baseball analysis doesn't inform future events. The main goal of our presentation was not to minimize the contributions of sabermetricians, I don't know where you got that from. It was more of a rebuke to out-of-control methodologists.

The goal of our presentation was to get people to stop and think about the kinds of data they have. Formal inferential statistics (hypothesis testing, p-values, etc) were built for SAMPLES and even if you subscribe to the "infinite sample" or "random draws" theory (which I mostly don't), you have to admit that the interpretation of the census results under a formal inferential framework is suspect at best. In preparation for this talk, I asked several of the statistical theorists on our faculty about this idea of mixing baseball-esque census data with frequentist statistical inference and all of them expressed discomfort with the idea and stated (as we presented) that this is very much a gray area. But it happens all the time in baseball research, hence our call for a re-consideration of the matter. Thanks for reading the presentation.
At Wednesday, April 11, 2012 10:33:00 PM, Phil Birnbaum said...: Hi, David,

I certainly don't mean to imply that a more complex study can't improve on a simpler one. But I think your example is not one where the method is too simple, but where it fails to control for factors that you (quite reasonably) believe should be controlled for.

(Also, I believe that "The Book" study DID control for pitcher, but I'd have to go back and check.)

So, yeah, if the only way to control for those complex factors is to have a more sophisticated research design, then, sure.

That's not really my point, though. I'm not saying that complex methods should never be used. Rather, I'm saying that academics seem to prefer the complex methods even when a simpler method is just as good.
At Thursday, April 12, 2012 11:59:00 AM, BMMillsy said...: Patrick,

I had not meant to put words in your mouth, I was simply stating that the premise seemed to sell short the sabermetrician. This comes from someone who on the following spectrum:

Saber<----------------->Academic

sits much closer to the right than the left.

I admire the goal of the presentation and think it does just what you had hoped in terms of enlightening the use of data. I wasn't implying that you had a goal of belittling sabermetrics in any way.

I mentioned in my comment that you got to the differences later on, so I apologize if the comment came off as negative.

I think the most interesting part of the issues you bring up are the ones we both mention regarding samples and census data (likely because I'm toward the right of the above spectrum). I'd be curious to hear philosophies on this in general and in the baseball context.
At Thursday, April 12, 2012 12:22:00 PM, EvanZ said...: Where's the machine learning guy who is supposed to come in and say, screw your inference, just give me the best predictor, because that's what all anyone (making decisions) actually cares about anyway.

;)
At Thursday, April 12, 2012 1:00:00 PM, David said...: @Phil: My point is more that there are things we would like to control but will never be able to because the data will never exist. I doubt we will ever (at least soon) get good data on injuries that don't result in DL trips. And I suspect these have a big impact on batting outcomes while being correlated with lots of lineup decisions.

@Patrick: I'm an economist rather than a statistician so I'm interested in your thoughts. What makes you suspicious of an infinite sample/random draws approach? Of course, it's really important to account for uncertainty if I want to do anything beyond saying, "here's what happened in this population." As Phil posted a couple weeks ago, some people have massive splits in favor of even vs. odd days of the month. That's just chance. How do we account for that if we take a "this is a census and here's what happened" approach? I suspect you've thought about this a lot more than me and I wonder what your alternative to a "sample of random draws from an infinite population" approach is. Am I missing it in the presentation?
At Friday, April 13, 2012 8:51:00 AM, Patrick Kilgo said...: Thanks David,

In Slides 41-43 we touch on this idea of using census-level data to supply uncertainty estimates to future predictions. I think it's fine to do (and better than doing nothing) and we have done it in our own studies (though whether we did it correctly is just as uncertain!). I will be perfectly honest with you ... I'm not quite sure how best to "gel" sampling theory with census-derived parameter uncertainty calculation (I hesitate to write "estimation" because it really is not estimation when you have a census). In my chats with our point estimation guys at Emory it is evident that they don't know either. My gut says that pretending the census calculations have uncertainty because the data might have turned out different IF we could re-play the games is an insufficient assumption to make that doesn't mirror the reality of how sampling works. In formal statistical inference, the emphasis is completely on OBSERVED sample data. To say that there is an UNOBSERVED census of the data that could have been taken under different circumstances and then to try to apply sample-based inferential theory to this unobserved census feels wrong to me. That's my problem with the random draws theory - it's just counter to how statistics is practiced under our normal framework. I wish I had a solution but I really don't. We just wanted to point out in our presentation that the studies being performed in common baseball research often venture into this gray area and it's not clear that this is correct. Thanks David, I would enjoy reading your thoughts on the matter.
At Friday, April 13, 2012 12:21:00 PM, David said...: I'm an economist, not a statistical theorist, so this is somewhat over my head. But I tend to be just fine with assuming that there's an infinite population from which we're taking random draws. This probably reflects my economics training which tends to view the world as having some underlying model involving uncertainty that defines a data generating process and the data we observe are just draws from that model.

I think it's natural for this framework to be used in baseball. In reality, I would tend to think that the "census" of all outcomes in plate appearances in a single year is not really a census at all. It represents the outcomes that occurred because of a combination of player talent, weather, manager choices about the lineup, etc. We really only have a sample of all different lineup combinations that could have been chosen. There really could have been a different sample in which the wind was blowing from left to right and Moises Alou caught the ball and the Cubs go to the World Series (yes I'm a Cubs fan). Given all that randomness, we really should think of our data as random rather than fixed. It doesn't sound to me like you're opposed to this in general. But I'm not statistical theorist enough to think of a better approximation than this: there is an infinite population and we have a sample from it. If there's a better way to account for this, I'm open to it, but taking draws from a data generating process is the best I've found.

Another lens to think about it: if we just want to know "what happened" historically, then all we need is a point estimate. I think we're all really interested in "what would happen if we could run history 1000 times allowing the conditions to vary." For that, classical statistics seems like the right tool.

That said, I don't think most people (myself included) usually think through this before typing "regress y x" and interpreting the standard errors. I appreciate the opportunity to think about it a bit.
At Saturday, April 14, 2012 9:53:00 AM, Charlie Pavitt said...: I'm both an academic (social science) and an amateur researcher, and while I basically agree with Phil when it comes to publishing statistical baseball research it depends on the outlet - and I'm going to name journal names here. BRJ has had a very rocky relationship with statistical work because it depends on the editor. Some past editors would not publish anything statistically sophisticated because they did not understand it, others published poorly done work with sophisticated stats because (I would guess) they were impressed by the stats and assumed they were done well. To their great credit two past editors (Charlton and Frankovitch) went to outside reviewers for help, and the journal was much the better for it. I hope that continues in the future. As for the academic side, economists have their standards and I'm fine with that, so you need some rigor to get published in the Journal of Sports Economics. Same with psychologists and sociologists. But taking this to an extreme, the Journal of Quantitative Analysis in Sports under its former editor published a lot of very sophisticated statistical pieces that were substantively trite and made no contribution to our knowledge of baseball. I trust Jim Albert won't allow that to happen.
At Saturday, April 14, 2012 12:49:00 PM, Phil Birnbaum said...: Patrick,

Isn't a lot of sabermetrics predicated on the idea that the entire census is a random sample from an ideal distribution?

I mean, suppose a player hits .330 over 500 AB. When we estimate his talent, we have to assume that the .330 is just a binomial sample, don't we?
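As a rough illustration of the binomial view in the .330 example, here's a small Python sketch. It assumes every at-bat is an independent trial with the same hit probability, which is a simplification, and the helper name `binomial_se` is made up for this example. It shows only the sampling spread of the observed average, not a full talent estimate.

```python
import math


def binomial_se(avg, at_bats):
    """Standard error of a batting average under a simple binomial model."""
    return math.sqrt(avg * (1 - avg) / at_bats)


avg, at_bats = 0.330, 500
se = binomial_se(avg, at_bats)

# Approximate 95% range, using the normal approximation to the binomial.
low, high = avg - 1.96 * se, avg + 1.96 * se
print(f"standard error: {se:.4f}")
print(f"approximate 95% range: {low:.3f} to {high:.3f}")
```

Under those assumptions the standard error works out to roughly .021, or about 21 points of batting average, even though all 500 at-bats were observed.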

And, clinical trials ... if you test a new drug in a double-blind test on, say, 100 patients, you get a census of all 100 patients. Doesn't the analysis assume that the 100 patients are a random sample, even though, at this time, there really are only 100 patients in the entire world?
At Monday, April 23, 2012 8:17:00 PM, Anonymous said...: This is basically the story of David Berri... The "peers" who reviewed Wages of Wins knew nothing about basketball and therefore were blind to the handful of glaring flaws that cripple his metric. Amateurs (and another academic economist, Dan Rosenbaum) quickly recognized the problems endemic to WoW's research because they're the actual experts in the field. They were the "peers" he should have been seeking approval from in the first place, but like you say, that is not how academia works.
