Why it's hard to estimate small effects
Here's a great 2009 paper (.pdf) by Andrew Gelman and David Weakliem (whom I'll call "G/W"), on the difficulty of finding small effects in a research study.
I'm translating to baseball to start.
Let's suppose you have someone who claims to be a clutch hitter. He's a .300 hitter, but, with runners on base, he claims to be a bit better.
So, you say, show us! You watch his 2012 season, and see how well he hits in the clutch. You decide in advance that if it's statistically significantly different from .300, that will be evidence he's a clutch hitter.
Will that work? No, it won't.
Over 100 AB, the standard deviation of batting average is about 46 points. To find statistical significance, you want 2 SD. That means to convince you, the player would have to hit .392 in the clutch.
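Here's the arithmetic behind those numbers, as a quick sketch using the usual binomial formula:

```python
import math

# Standard deviation of a .300 hitter's batting average over 100 AB,
# from the binomial formula sqrt(p * (1 - p) / n)
p, n = 0.300, 100
sd = math.sqrt(p * (1 - p) / n)     # about .046, i.e. 46 points

# Two SDs above .300 -- the clutch average needed for significance
threshold = p + 2 * sd              # about .392
print(round(sd, 3), round(threshold, 3))
```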
The problem is, he's not a .392 hitter! He, himself, is only claiming to be a little bit better than .300. So, in your study, the only evidence you're willing to accept is evidence that you *know* can't be taken at face value.
Let's say the batter actually does meet your requirement. In fact, let's suppose he exceeds it, and hits .420. What can you conclude?
Well, suppose you didn't know in advance that you were looking for a small effect. Suppose you were just doing a "normal" paper. You'd say, "look, he beat his expectation by 2.6 SD, which is statistically significant. Therefore, we conclude he's a clutch hitter." And then you write a "conclusions" section with all the implications of having a .420 clutch hitter in your lineup.
But, in this case, that would be wrong, because you KNOW he's not a .420 clutch hitter, even though that's what he hit and you found statistical significance. He's .310 at best, maybe .320, if you stretch it. You KNOW that the .420 was mostly due to luck.
Still ... even if you can't conclude that the guy is truly a .420 clutch hitter, you SHOULD be able to at least conclude that he's better than .300, right? Because you did get that statistical significance.
Well ... not really, I don't think. Because, the same evidence that purports to show he's not a .300 hitter ALSO shows he's not a .320 hitter! That is, .420 is also more than 2 standard deviations from .320, which is the best he possibly could be.
What you CAN do, perhaps, is compare the two discrepancies. .420 is 2.6 SDs from .300, but only 2.2 SDs from .320. That does appear to make .320 more likely than .300. In fact, the probability of a .320 hitter going 42-for-100 is almost three times as high as the probability of the .300 hitter going 42-for-100.
But, first, even taking that at face value, those are only about 3-to-1 odds. Second, that ignores the fact that there are a lot more .300 hitters than .320 hitters, which you have to take into account.
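For the record, that likelihood comparison can be worked out exactly from the binomial distribution:

```python
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k hits in n at-bats for a true-talent p hitter
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# How much more likely is 42-for-100 for a .320 hitter than a .300 hitter?
ratio = binom_pmf(42, 100, 0.320) / binom_pmf(42, 100, 0.300)
print(round(ratio, 1))   # about 2.8 to 1
```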
So, all things considered, you should know in advance that you won't be able to conclude much from this study. The sample size is too small.
That's Gelman and Weakliem's point: if you're looking for a very small effect, and you don't have much data, you're ALWAYS going to have this problem. If you're looking for the difference between .300 and .320, that's a difference of 20 points. If the standard error of your experiment is a lot more than 20 points ... how are you ever going to prove anything? Your instrument is just too blunt.
In our example, the standard error is 46 points. To find statistical significance, you'd have to observe an effect of at least 92 points! And so, if you're pretty sure clutch hitting talent is less than 92 points, why do the experiment at all?
But what if you don't know if clutch hitting talent is less than 92 points? Well, fine. But you're still never going to find an effect less than 92 points. And so, your experiment is biased, in a way: it's set up to only find effects of 92 points or more.
That means that if the effect is small, no matter how many scientists you have independently searching for it, none of them will ever find it at its true size. Worse, just by luck, some of them will "find" a LARGE effect.
No matter what happens, the experiment will either be wrong too high, or wrong too low. It is impossible for it to be accurate for a small effect. The only way to find a small effect is to increase the sample size. But even then, that doesn't eliminate the problem: it just reduces it. No matter what your experiment, and how big your sample size, if the effect you're looking for is smaller than 2 SDs, you'll never find it.
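A quick simulation (my own illustration, not from the paper) makes the bias concrete. Give a hitter true .320 clutch talent, run many 100-AB seasons, and see what a significance-hunting researcher would report:

```python
import random

random.seed(0)

def clutch_hits(p=0.320, ab=100):
    # One season of 100 clutch AB for a true .320 clutch hitter
    return sum(random.random() < p for _ in range(ab))

trials = 10_000
seasons = [clutch_hits() for _ in range(trials)]

# "Significant" seasons: .392+ means 40 or more hits in 100 AB
found = [h for h in seasons if h >= 40]

power = len(found) / trials                    # how often the effect is "found"
avg_reported = sum(found) / len(found) / 100   # average reported clutch average
print(round(power, 3), round(avg_reported, 3))
```

Significance comes up only a few percent of the time, and when it does, the average "significant" season is well over .400 -- an effect more than 100 points above .300, when the true effect is only 20.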
That's G/W's criticism. It's a good one.
G/W's example, of course, is not about clutch hitting. It's about a previously-published paper, which found that good-looking people are more likely to produce female offspring than male offspring. That study found an 8 percentage point difference between the nicest-looking parents and the worst-looking parents -- 52 percent girls vs. 44 percent girls.
And what G/W are saying is, that 8 point difference is HUGE. How do they know? Well, it's huge as compared to a wide range of other results in the field. Based on the history of studies on birth sex bias, two or three points is about the limit. Eight points, on the other hand, is unheard of.
Therefore, they argue, this study suffers from the "can't find the real effect" problem. The standard error of the study was over 4 points. How can you find an effect of less than 3 points, if your standard error is 4 points? Any reasonable confidence interval will cover so much of the plausible territory, that you can't really conclude anything at all.
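To see how a standard error of about 4 points arises, here's the textbook formula for the SE of a difference between two proportions, with hypothetical group sizes (illustration only; the actual study's group sizes differ):

```python
import math

# Hypothetical: about 300 parents in each attractiveness group
n_attractive, n_plain = 300, 300
p_attractive, p_plain = 0.52, 0.44   # the reported proportions of girls

# SE of the difference between two independent proportions
se_diff = math.sqrt(p_attractive * (1 - p_attractive) / n_attractive
                    + p_plain * (1 - p_plain) / n_plain)
print(round(se_diff, 3))   # about 0.041, i.e. 4 points
```

With an SE that size, a 95 percent confidence interval around the 8-point estimate runs from roughly 0 to 16 points -- covering everything from no effect at all to an effect several times larger than anything ever seen in the field.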
Gelman and Weakliem don't say so explicitly, but this is a Bayesian argument. In order to make it, you have to argue that the plausible effect is small, compared to the standard error. How do you know the plausible effect is small? Because of your subject matter expertise. In Bayesian terms, you know, from your prior, that the effect is most likely in the 0-3 range, so any study that can only find an 8-point difference must be biased.
Every study has its own limits of how the standard error compares to the expected "small" effect. You need to know what "small" is. If a clutch hitting study was only accurate to within .0000001 points of batting average ... well, that would be just fine, because we know, from prior experience, that a clutch effect of .0000002 is perfectly plausible. On the other hand, if it's only accurate to within .046, that's too big -- because a clutch effect of .092 is much too large to be plausible.
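Here's what that informal Bayesian reasoning looks like made formal, with a made-up prior that says clutch talent is probably small (the talent values and weights are my own illustration, not anyone's real prior):

```python
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k hits in n at-bats for a true-talent p hitter
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Made-up prior: clutch talent is .300 plus at most a small bump
talents = [0.300, 0.310, 0.320]
prior   = [0.60,  0.30,  0.10]

# Observe the hitter going 42-for-100 in the clutch
likelihood = [binom_pmf(42, 100, p) for p in talents]
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
posterior = [u / sum(unnorm) for u in unnorm]

post_mean = sum(p * w for p, w in zip(talents, posterior))
print([round(w, 2) for w in posterior], round(post_mean, 3))
```

Even after watching the hitter go 42-for-100, the posterior mean is only around .308 -- the prior won't let a noisy 100-AB sample drag the estimate anywhere near .420.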
It's our prior that tells us that. As I've argued, interpreting the conclusions of your study is an informal Bayesian process. G/W's paper is one example of how that kind of argument works.
Hat tip: Alex Tabarrok at Marginal Revolution