Absence of evidence vs. evidence of absence
People tell me that Albert Pujols is a better hitter than John Buck. So I did a study. I watched all their at-bats in August, 2011. I observed that Pujols hit .298, and Buck hit .254.
Yes, Pujols' batting average was better than Buck's, but the difference wasn't statistically significant. In fact, it wasn't even close: it was less than 1 standard deviation!
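Here's a quick back-of-the-envelope check of that claim. The monthly at-bat totals below are made-up assumptions (roughly a month's worth of plate appearances each), not the actual August 2011 numbers, but they show how a 44-point gap over one month lands under 1 standard deviation:

```python
from math import sqrt

# Hypothetical one-month at-bat counts (assumptions, not the real
# August 2011 totals) and the batting averages from the text.
pujols_ab, pujols_avg = 104, 0.298
buck_ab, buck_avg = 98, 0.254

# Pooled standard error of the difference between two proportions.
p_pool = (pujols_avg * pujols_ab + buck_avg * buck_ab) / (pujols_ab + buck_ab)
se = sqrt(p_pool * (1 - p_pool) * (1 / pujols_ab + 1 / buck_ab))

z = (pujols_avg - buck_avg) / se
print(f"difference = {pujols_avg - buck_avg:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

With about a hundred at-bats each, the standard error of the difference comes out around 63 points of batting average, so a 44-point observed gap is only about 0.7 SD from zero.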
So, clearly, August's performance shows no evidence that Pujols and Buck are different in ability.
Does that sound wrong? It's right, I think, at least as I understand how things work in the usual statistical studies. If you fail to reject the null hypothesis, you are entitled to use the words "no evidence."
Which is a little weird, because, of course, it *is* evidence, although perhaps *weak* evidence. I suppose they could have chosen to say "not enough" evidence, or "insufficient" evidence, but that carries with it an implication that the null hypothesis is correct. If I say, "the study found no evidence that whites are smarter than blacks," that sounds fine. But if I say, "the study found insufficient evidence that whites are smarter than blacks," that sounds racist.
The problem is, if you don't know what "no evidence" really means, you might get the wrong impression. You might have 25 different studies testing whether Pujols is better than Buck, each of them using a different month. They all fail to reject the hypothesis that they're equal, and they all say they found "no evidence". (That's not unlikely: to be significant at .01 for a single month, you'd have to find Pujols outhitting Buck by about 200 points.)
And you think, hey, "25 studies all failed to find any evidence. That, in itself, is pretty good evidence that there's nothing there."
But, the truth is, they all found a little bit of evidence, not *no* evidence. If you multiply *no* evidence by 25, you still have *no* evidence. But if you multiply a little bit of evidence by 25, now you have *enough* evidence.
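That "multiplying evidence" step can be made concrete with Stouffer's method for combining independent studies: the combined z-score is the sum of the individual z-scores divided by the square root of the number of studies. The per-study z of 0.7 below is an assumption, chosen to match the 25-study scenario above:

```python
from math import sqrt

# Suppose each of 25 independent one-month studies finds a small,
# non-significant effect of about z = 0.7 (an assumed value).
monthly_z = [0.7] * 25

# Stouffer's method: combined z = sum(z_i) / sqrt(k).
combined_z = sum(monthly_z) / sqrt(len(monthly_z))
print(f"each study: z = 0.7 (not significant); combined: z = {combined_z:.1f}")
```

Twenty-five studies at z = 0.7 combine to z = 3.5, which is highly significant, even though not one of the individual studies came anywhere close on its own.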
There's an old saying, "absence of evidence is not evidence of absence." The idea is, just because I look around my office and don't see any proctologists or asteroids, it doesn't mean proctologists or asteroids don't exist. I may just not be looking in the right place, or looking hard enough. Similarly, if I look at only one month of Pujols/Buck, and I don't see a difference, it doesn't mean the difference isn't there. It might just mean that I'm not looking hard enough.
This is the point Bill James was making in his "Underestimating the Fog." We looked for clutch hitting, and we didn't find it. And so we concluded that it didn't exist. But ... maybe we just need to look harder, or in different places.
What Bill was asking is: we have the absence of evidence, but do we have the evidence of absence?
Specifically, what *would* constitute evidence of absence? The technically correct answer: nothing. In normal statistical inference, there's actually no evidence that can support absence.
Suppose I do a study of clutch hitting, and I find it's not significantly different from zero. But ... my parameter estimate is NOT zero. It's something else, maybe (and I'm making this up), .003. And maybe the SD is .004.
If I think clutch hitting is zero, and you think it's .003, we can both point to this study as confirming our hypotheses. I say, "look, it's not statistically significantly different from zero." And you say, "yeah, but it's not statistically significantly different from .003 either. Moreover, the estimate actually IS .003! So the evidence supports .003 at least as much as zero."
That leaves me speechless (unless I want to make a Bayesian argument, which let's assume I don't). After all, it's my own fault. I didn't have enough data. My study was incapable of noticing a difference between .000 and .003.
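Using the (made-up) numbers from the study above, the arithmetic of that standoff looks like this, with a rough 95% confidence interval of plus-or-minus 2 SD:

```python
# The made-up estimate and standard deviation from the text.
estimate, sd = 0.003, 0.004

# A rough 95% confidence interval: estimate +/- 2 SD.
lo, hi = estimate - 2 * sd, estimate + 2 * sd
print(f"95% CI: ({lo:.3f}, {hi:.3f})")

# Both hypotheses sit comfortably inside the interval:
# "clutch hitting is exactly zero" survives, and so does
# "clutch hitting is .003".
assert lo < 0.000 < hi
assert lo < 0.003 < hi
```

The interval runs from -.005 to .011, so the data can't distinguish "zero" from ".003" at all. Every shrinking of the interval just shrinks the disputed territory; it never eliminates it.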
So I go back to the drawing board, and use a lot more data. And, this time, I come up with an estimate of .001, with an SD of .002.
And we have the same conversation! I say, "look, it's not different from zero." And you say, "it's not different from .001, either. I still think clutch hitting exists at .001."
So I go and try again. And, every time, I don't have an infinite amount of data, so, every time, my point estimate is something other than zero. And every time, you point to it and say, "See? Your study is completely consistent with my hypothesis that clutch hitting exists. It's only a matter of how much."
What's the way out of this? The way out of this is to realize that you can't use statistics to prove a point estimate. The question, "does clutch hitting exist?" is the same as the question "is clutch hitting exactly zero?". And, no statistical technique can ever give you an exact number. There will always be a standard error, and a confidence interval, so it will always be possible that the answer is not zero.
You can never "prove" a hypothesis about a single point. You can only "disprove" it. So, you can never use statistical techniques to demonstrate that something does not exist.
What we should be talking about is not existence, but size. We can't find evidence of absence, but we can certainly find evidence of smallness. When an announcer argues for the importance of being able to step up when the game is on the line, we can't say, "we studied it and there's no such thing". But we *can* say, "we studied it, and even under the most optimistic assumptions, the best clutch hitter in the league is only going to hit maybe .020 better in the clutch ... and there's no way to tell who he is."
Or, the short form -- "we studied it, and the differences between players are so small that they're not worth worrying about."
But aren't there issues where it's important to actually be able to disprove a hypothesis? Take, for instance, ESP. Some people believe they can do better than chance at guessing which card is drawn from an ESP deck.
If we do a study, and the subject guesses exactly what you'd expect by chance, you'd think that would qualify as a failure to find ESP. But when you calculate the confidence interval, centered on zero, you might have to say, "our experiment suggests that if ESP exists, its maximum level is one extra correct guess in 10,000."
And, of course, the subject will hold it up, and triumphantly say, "look, the scientists say that I might have a small amount of ESP!!"
What's the solution there? It's to be common-sense Bayesian. It's to say, "going into the study, we have a great deal of 'evidence of absence' that ESP doesn't exist -- not from statistical tests, but from the world's scientific knowledge and history. If you want to challenge that, you need an equal amount of evidence."
That makes sense for ESP, but not for clutch hitting. Don't we actually *know* that clutch hitting talent must exist, even at a very small level? Every human being is different in how they respond to pressure. Some batters may try to zone out, trying to forget about the situation and hit from instinct. Some may decide to concentrate more. Some may decide to watch the pitcher's face between pitches, instead of adjusting their batting glove.
Any of those things will necessarily change the results a tiny bit, in one direction or the other. Maybe concentration makes things worse, maybe it makes it better. Maybe it's even different for different hitters.
But we *know* something has to be different. It would be much, much too coincidental if every batter did something different, but the overall effect is exactly .0000000.
Clutch hitting talent *must* exist, although it might be very, very small.
So why are we so fixated on zero? It doesn't make sense. We know, by logical argument, that clutch hitting can't be exactly zero. We also know, by logical argument, that even if it *were* exactly zero, it's impossible to have enough evidence of that.
When we say "clutch hitting doesn't exist," we're using it as a short form for, "clutch hitting is so small that, for all intents and purposes, it might as well not exist."
When the effect is small, like clutch hitting, it's not a big deal. But when the effect might be big, it's a serious issue.
A lot of formal studies -- not just clutch hitting or baseball -- will find they can't reject the null hypothesis. They usually say, "we found no evidence," and then go on to assume that this means the thing they're looking for doesn't exist.
They'll do a study on, I don't know, whether an announcer is right that playing a day game after a night game affects you as a hitter. And they'll get an estimate that says that batters are 40 points of OPS worse the day after. But it's not statistically significant. And they say, "See? Baseball guys don't know what they're talking about. There's no evidence of an effect!"
But that's wrong. Because, unlike clutch hitting, the confidence interval does NOT show an effect that "for all intents and purposes, might as well not exist." The confidence interval is compatible with a *large* effect, of at least 80 points. (That is, since 2 SD is enough to drop from 40 points to zero on one side, it's also enough to rise from 40 points to 80 points on the other side.)
So it's not that there's evidence of absence. There's just absence of evidence.
And that's because of the way they did their study. It was just too small to find any evidence -- just like my office is too small to find any asteroids.