## Wednesday, February 15, 2012

### Absence of evidence vs. evidence of absence

People tell me that Albert Pujols is a better hitter than John Buck. So I did a study. I watched all their at-bats in August, 2011. I observed that Pujols hit .298, and Buck hit .254.

Yes, Pujols' batting average was better than Buck's, but the difference wasn't statistically significant. In fact, it wasn't even close: it was less than 1 standard deviation!
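A quick back-of-the-envelope check of that "less than 1 standard deviation" claim, using the binomial approximation. The at-bat counts are my assumption (roughly 100 apiece for the month); the post doesn't give them.

```python
import math

# Hypothetical at-bat counts for one month (an assumption, not from the post)
pujols_ab, pujols_avg = 100, 0.298
buck_ab, buck_avg = 100, 0.254

# Binomial standard deviation of each player's batting average
sd_pujols = math.sqrt(pujols_avg * (1 - pujols_avg) / pujols_ab)
sd_buck = math.sqrt(buck_avg * (1 - buck_avg) / buck_ab)

# SD of the difference between two independent averages
sd_diff = math.sqrt(sd_pujols ** 2 + sd_buck ** 2)
diff = pujols_avg - buck_avg

print(f"difference = {diff:.3f}, SD of difference = {sd_diff:.3f}")
print(f"z = {diff / sd_diff:.2f}")  # well under 1 standard deviation
```

A 44-point gap against an SD of about 63 points works out to a z of roughly 0.7.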

So, clearly, August's performance shows no evidence that Pujols and Buck are different in ability.

Does that sound wrong? It's right, I think, at least as I understand how things work in the usual statistical studies. If you fail to reject the null hypothesis, you are entitled to use the words "no evidence."

Which is a little weird, because, of course, it *is* evidence, although perhaps *weak* evidence. I suppose they could have chosen to say "not enough" evidence, or "insufficient" evidence, but that carries with it an implication that the null hypothesis is correct. If I say, "the study found no evidence that whites are smarter than blacks," that sounds fine. But if I say, "the study found insufficient evidence that whites are smarter than blacks," that sounds racist.

The problem is, if you don't know what "no evidence" really means, you might get the wrong impression. You might have 25 different studies testing whether Pujols is better than Buck, each of them using a different month. They all fail to reject the hypothesis that the two are equal, and they all say they found "no evidence". (That's not unlikely: to be significant at .01 for a single month, you'd have to find Pujols outhitting Buck by about 200 points.)

And you think, hey, "25 studies all failed to find any evidence. That, in itself, is pretty good evidence that there's nothing there."

But, the truth is, they all found a little bit of evidence, not *no* evidence. If you multiply *no* evidence by 25, you still have *no* evidence. But if you multiply a little bit of evidence by 25, now you have *enough* evidence.
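Here's what "multiplying a little bit of evidence by 25" looks like in practice: pooling the 25 months is the same as running one study with 25 times the at-bats, which shrinks the SD of the difference by a factor of 5. The per-month AB count and the shared average used for the variance are my assumptions.

```python
import math

# Assumptions: 100 AB per player per month, a true 44-point gap,
# and a rough common average of .276 for the variance estimate
ab_per_month, diff, p_avg = 100, 0.044, 0.276

# SD of the difference in averages for ONE month
sd_one = math.sqrt(2 * p_avg * (1 - p_avg) / ab_per_month)

# SD of the difference after pooling all 25 months (25x the at-bats)
sd_pooled = math.sqrt(2 * p_avg * (1 - p_avg) / (25 * ab_per_month))

print(f"one month:  z = {diff / sd_one:.2f}")     # nowhere near significant
print(f"25 months:  z = {diff / sd_pooled:.2f}")  # comfortably past 2
```

Each month alone gives a z around 0.7; the 25 months together give a z around 3.5.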

------

There's an old saying, "absence of evidence is not evidence of absence." The idea is, just because I look around my office and don't see any proctologists or asteroids, it doesn't mean proctologists or asteroids don't exist. I may just not be looking in the right place, or looking hard enough. Similarly, if I look at only one month of Pujols/Buck, and I don't see a difference, it doesn't mean the difference isn't there. It might just mean that I'm not looking hard enough.

This is the point Bill James was making in his "Underestimating the Fog." We looked for clutch hitting, and we didn't find it. And so we concluded that it didn't exist. But ... maybe we just need to look harder, or in different places.

What Bill was asking is: we have the absence of evidence, but do we have the evidence of absence?

------

Specifically, what *would* constitute evidence of absence? The technically-correct answer: nothing. In normal statistical inference, there's actually no evidence that can support absence.

Suppose I do a study of clutch hitting, and I find it's not significantly different from zero. But ... my parameter estimate is NOT zero. It's something else, maybe (and I'm making this up), .003. And maybe the SD is .004.

If I think clutch hitting is zero, and you think it's .003, we can both point to this study as confirming our hypotheses. I say, "look, it's not statistically significantly different from zero." And you say, "yeah, but it's not statistically significantly different from .003 either. Moreover, the estimate actually IS .003! So the evidence supports .003 at least as much as zero."

That leaves me speechless (unless I want to make a Bayesian argument, which let's assume I don't). After all, it's my own fault. I didn't have enough data. My study was incapable of noticing a difference between .000 and .003.

So I go back to the drawing board, and use a lot more data. And, this time, I come up with an estimate of .001, with an SD of .002.

And we have the same conversation! I say, "look, it's not different from zero." And you say, "it's not different from .001, either. I still think clutch hitting exists at .001."

So I go and try again. And, every time, I don't have an infinite amount of data, so, every time, my point estimate is something other than zero. And every time, you point to it and say, "See? Your study is completely consistent with my hypothesis that clutch hitting exists. It's only a matter of how much."
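The stalemate is easy to see in the confidence intervals. Using the made-up numbers from the two studies above, a rough 95% interval (estimate ± 2 SD) contains zero *and* the nonzero point estimate every time:

```python
# Rough 95% interval as estimate +/- 2 SD (normal approximation)
def ci95(estimate, sd):
    return estimate - 2 * sd, estimate + 2 * sd

# The made-up numbers from the two studies above
for est, sd in [(0.003, 0.004), (0.001, 0.002)]:
    lo, hi = ci95(est, sd)
    print(f"estimate {est:.3f}: 95% CI ({lo:+.3f}, {hi:+.3f})")
    # Each interval contains zero AND the nonzero point estimate,
    # so both sides can claim the study is consistent with their view.
```

No matter how much data you add, the interval shrinks but never collapses to a single point, so it always leaves room for a nonzero value.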

------

What's the way out of this? The way out of this is to realize that you can't use statistics to prove a point estimate. The question, "does clutch hitting exist?" is the same as the question "is clutch hitting exactly zero?". And, no statistical technique can ever give you an exact number. There will always be a standard error, and a confidence interval, so it will always be possible that the answer is not zero.

You can never "prove" a hypothesis about a single point. You can only "disprove" it. So, you can never use statistical techniques to demonstrate that something does not exist.

What we should be talking about is not existence, but size. We can't find evidence of absence, but we can certainly find evidence of smallness. When an announcer argues for the importance of being able to step up when the game is on the line, we can't say, "we studied it and there's no such thing". But we *can* say, "we studied it, and even under the most optimistic assumptions, the best clutch hitter in the league is only going to hit maybe .020 better in the clutch ... and there's no way to tell who he is."
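To make "evidence of smallness" concrete: the same confidence interval that fails to rule out zero *does* put a ceiling on how big the effect can plausibly be. Using the made-up clutch numbers from earlier:

```python
# The made-up clutch estimate from earlier in the post (.003, SD .004)
estimate, sd = 0.003, 0.004
upper = estimate + 2 * sd  # approximate 95% upper bound

# We can't say "clutch hitting doesn't exist," but we CAN say:
print(f"whatever clutch talent is, it's very likely below {upper:.3f}")
```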

Or, the short form -- "we studied it, and the differences between players are so small that they're not worth worrying about."

------

But aren't there issues where it's important to actually be able to disprove a hypothesis? Take, for instance, ESP. Some people believe they can do better than chance at guessing which card is drawn from an ESP deck.

If we do a study, and the subject guesses exactly what you'd expect by chance, you'd think that would qualify as a failure to find ESP. But when you calculate the confidence interval, centered on zero, you might have to say, "our experiment suggests that if ESP exists, its maximum level is one extra correct guess in 10,000."

And, of course, the subject will hold it up, and triumphantly say, "look, the scientists say that I might have a small amount of ESP!!"

What's the solution there? It's to be common-sense Bayesian. It's to say, "going into the study, we have a great deal of 'evidence of absence' for ESP -- not from statistical tests, but from the world's scientific knowledge and history. If you want to challenge that, you need an equal amount of evidence."

That makes sense for ESP, but not for clutch hitting. Don't we actually *know* that clutch hitting talent must exist, even at a very small level? Every human being is different in how they respond to pressure. Some batters may try to zone out, trying to forget about the situation and hit from instinct. Some may decide to concentrate more. Some may decide to watch the pitcher's face between pitches, instead of adjusting their batting glove.

Any of those things will necessarily change the results a tiny bit, in one direction or the other. Maybe concentration makes things worse, maybe it makes it better. Maybe it's even different for different hitters.

But we *know* something has to be different. It would be much, much too coincidental if every batter did something different, but the overall effect is exactly .0000000.

Clutch hitting talent *must* exist, although it might be very, very small.

So why are we so fixated on zero? It doesn't make sense. We know, by logical argument, that clutch hitting can't be exactly zero. We also know, by logical argument, that even if it *were* exactly zero, it's impossible to have enough evidence of that.

When we say "clutch hitting doesn't exist," we're using it as a short form for, "clutch hitting is so small that, for all intents and purposes, it might as well not exist."

------

When the effect is small, like clutch hitting, it's not a big deal. But when the effect might be big, it's a serious issue.

A lot of formal studies -- not just clutch hitting or baseball -- will find they can't reject the null hypothesis. They usually say, "we found no evidence," and then they go on to assume that this also means that what they're looking for doesn't exist.

They'll do a study on, I don't know, whether an announcer is right that playing a day game after a night game affects you as a hitter. And they'll get an estimate that says that batters are 40 points of OPS worse the day after. But it's not statistically significant. And they say, "See? Baseball guys don't know what they're talking about. There's no evidence of an effect!"

But that's wrong. Because, unlike clutch hitting, the confidence interval does NOT show an effect that "for all intents and purposes, might as well not exist." The confidence interval is compatible with a *large* effect, of at least 80 points. (That is, since 2 SD is enough to drop from 40 points to zero on one side, it's also enough to rise from 40 points to 80 points on the other side.)
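Running the arithmetic from that parenthetical: "not significant" means the 40-point estimate is within 2 SDs of zero, so the SD must be at least 20 points, and the interval reaches all the way up to 80.

```python
# Point estimate: batters 40 points of OPS worse in a day game after a night game
estimate = 40

# "Not statistically significant" means the estimate is less than
# 2 SDs from zero -- so the SD must be at least 20 points
sd = 20
lo, hi = estimate - 2 * sd, estimate + 2 * sd
print(f"95% CI: {lo} to {hi} OPS points")
# The interval runs from 0 all the way up to 80 -- compatible with a LARGE effect
```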

So it's not that there's evidence of absence. There's just absence of evidence.

And that's because of the way they did their study. It was just too small to find any evidence -- just like my office is too small to find any asteroids.


At Thursday, February 16, 2012 2:33:00 AM,  mkt said...

Nice article. The one thing that I would add is that in some cases -- especially ones similar to your example of 25 monthly studies of Pujols vs Buck -- we can do more than simply throw up our hands and say that we've done 25 studies and haven't found anything. We can pool the studies; this is exactly what meta-analysis does.

At Thursday, February 16, 2012 3:21:00 AM,  Anonymous said...

Yes, meta-analyses are wonderful things. One of the major problems is that "unsuccessful" studies tend not to be published, defeating the purpose of combining the results, or at least rendering them unusable.

So when one does a meta-analysis one has to be careful that it is the type of study where there is not much (almost none, really) publication bias...

MGL

At Friday, February 17, 2012 11:44:00 AM,  Anonymous said...

The other thing you would do with the Pujols/Buck study is a quick power analysis. After one month, Pujols and Buck have best estimates of .298 and .254 as their averages -- how many ABs do we need for a properly sized study? With 80% power and a 5% cutoff for a two-tailed hypothesis, you'd need ~1664 ABs each, or 2-3 full seasons, which sounds about right. There's no point in doing 25 underpowered studies, after all. Don't blame "Stats" if people aren't doing them right.
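A quick check of the commenter's figure, using the standard normal-approximation formula for comparing two proportions (the commenter's ~1664 likely came from a slightly different variant, e.g. one with a continuity correction, so the numbers won't match exactly):

```python
import math

# Normal-approximation sample size for detecting a difference
# between two proportions (a textbook formula)
p1, p2 = 0.298, 0.254
z_alpha = 1.96    # two-tailed, alpha = 0.05
z_beta = 0.8416   # 80% power

n = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
     / (p1 - p2) ** 2)
print(f"about {math.ceil(n)} ABs per player")  # ~1617, same ballpark as ~1664
```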

This sort of thing happens in medical research - you start with a trial run, small sample, just looking to establish a trend in the right direction and some estimates to base a larger study on.

Oh, and clutch hitting would reasonably be blanked out, in aggregate, by unclutch hitting, clutch pitching, and unclutch pitching. Why give the hitters the credit/blame? Maybe the pitcher choked, or focused better. Adrenaline runs on both sides of the game here; it's not hitters vs. a ball on a tee. It's possible that there is no overall effect at all, but also that there are wildly divergent particular matchups (bad clutch pitcher vs. good clutch hitter).

At Saturday, February 18, 2012 8:14:00 AM,  Patrick Kilgo said...

I can tell you that at Emory we are very careful to say "not enough evidence to reject the null" rather than "no evidence."
A p-value of 0.07 must constitute some evidence under some circumstances, right?

Formal hypothesis testing is designed to give us the tools to make a decision, plain and simple. However, in making a decision with hypothesis testing you can do everything right mathematically and still make a wrong conclusion. Wrong conclusions come from several sources: insufficient sample size, unadjusted bias, random chance and several others. It is the job of the analyst to evaluate these items prior to performing the test.

If you choose to do the test you are technically bound by its decision. This is why, when a p-value of 0.07 is found after testing, I am opposed to language like "there is a trend towards significance" or the finding is "marginally significant." If you thought enough of the testing environment to do the test in the first place then you have to live with the resulting decision, or so formal theory would tell us. This, of course, gets trampled on all the time.

But we are also careful to say that repeated testing of the same null by the same person undermines the theory of random sampling. Technically, we are allowed to ask one specific question (or test one specific null) once in our lives. Once you have performed a test you have "spent your alpha" and are disqualified from ever asking the question again. This also gets trounced on by investigators (especially lab scientists).

Formal hypothesis testing to me is just a burden of proof upon a continuum or a threshold of evidence that must be surmounted before declaring differences. It is the job of the reader to evaluate the "practical significance" of the observed finding.

The more I think about formal inference in baseball studies, the more I think it is misapplied (embarrassingly, even by me in the past).

Hope you are well Phil!

At Saturday, February 18, 2012 9:11:00 AM,  Phil Birnbaum said...

Hi, Patrick,

I was only vaguely aware of the "spent your alpha" idea ... I didn't realize it was a formal (or semi-formal) convention! Is there a central repository where these things are listed or taught?

Same for what you say about "burden of proof". I guess if something is true, it shouldn't be too hard to find statistical significance. It's like needing to get a high-school diploma to get into college. It's possible that you don't actually need it to succeed, but it's just a formal hoop you have to jump through.

At Saturday, February 18, 2012 10:39:00 AM,  Phil Birnbaum said...

BTW, here's a thingy from a certain academic study (that I reviewed here in the past).

"The coefficient on this variable is not significantly different from zero, and the estimated coefficients on the other variables are essentially unchanged. This is important because it tells us that [this variable] is not capturing some other unobserved measure of [what we're looking for]. In other words, there is nothing in our data to suggest that [the variable is meaningful]."

It's not "no evidence", exactly, but "there is nothing in our data" is close.

This variable was 1.7 SD from zero, in the expected direction, with a very strong prior that it should not be zero. And, the point estimate, if taken at face value, was very significant in a real-world sense.

I wouldn't call that "nothing in our data."