Five percent significance is often too weak
In academic research, why is the standard significance level set to 5 percent?
I don't know, but I'm guessing a consensus emerged that 0.05 is a reasonable threshold beyond which it's OK to assume any effect you found is real. The idea might be that, when searching for an effect that doesn't actually exist, academia figures that one false positive out of 20 is something they can live with -- but no more than that.
The problem is that the "1 out of 20" isn't generally true if you're looking at more than one variable.
Suppose, for instance, that I want to see whether hitters do better on some days of the week than others. So I set up a regression with six dummy variables for days of the week (the seventh day is the default case, and doesn't get a variable), and I find that batters do best on Monday, compared to the other days. Moreover, the effect is statistically significant, at p<0.05.
Should I accept the result? Maybe not. Because, instead of one variable, I had six, and I didn't know in advance what I was looking for. So I had six chances for a false positive, not just one.
Obviously, the more variables I have, the better the chance that at least one of them will appear statistically significant just by luck. Imagine if I had 100 independent variables in my regression, none of them measuring anything real. If I stick to the .05 criterion, I should expect five of them to show up as significant just by chance -- five false positives, on average.
In the "day of the week" case, I didn't have one hundred variables -- just six. What are the chances that at least one of six variables would be significant at 5 percent? The answer: about .26. So I thought I was using a .05 confidence level, but I was really using a .26 confidence level.
If I want to keep a 5 percent probability of having *any* false positive, I can't use the threshold of 0.05 any more. I have to use 0.01. Actually, a little less than that: 0.0085. Intuitively, that's because six variables, multiplied by 0.0085 each, works out to about .05. (That's not the precise way to calculate it, but it's close enough.)
But suppose I don't do that, and I stick with my 0.05. Then, if I have a regression with six meaningless variables, there's still a 26.5 percent chance I'll have something positive to report. If I have ten variables, that goes up to 40 percent! Ten variables is actually not uncommon in academic studies.
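The arithmetic in the last few paragraphs is easy to check. Here's a quick Python sketch (the 6- and 10-variable figures are the ones quoted above):

```python
# Chance of at least one false positive among k independent variables,
# each tested at a per-variable significance level of alpha.
def familywise_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(round(familywise_rate(1), 3))   # 0.05  -- one variable
print(round(familywise_rate(6), 3))   # 0.265 -- six day-of-week dummies
print(round(familywise_rate(10), 3))  # 0.401 -- ten variables
```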
And it's actually even higher than that in practice. For one thing, if you do a regression, and you get close to significance for variable X, you can try adding or removing a variable or two, and see if that pushes X over the threshold. You see that you're at 0.06, and you think, "wait, maybe I didn't control for home/road!" And you think, "yeah, I really should have controlled for home/road." And you're right about that, it's a good idea. So you add a new variable for home/road, and, suddenly, the 0.06 moves to 0.049, and you're in. But: if you'd got the 0.049 in the first place, you probably never would have thought of adding home/road. So, really, you're giving yourself two (or more) kicks at the can.
As an aside ... there's another thing you can do to try to get a significant result, which, in my opinion, is cheating, a bit. Here's how it works. When you choose which six of the seven days of the week to include as variables, you leave out the one that's most extreme. That inflates all your coefficient estimates, and, with them, your significance levels.
For instance, suppose that, compared to the average, the observed effects of days of the week are something like this (numbers made up):
Sunday .... 2 points higher (than average)
Monday .... 10 points higher
Tuesday ... 3 points lower
Wednesday . 9 points higher
Thursday .. 11 points lower
Friday .... 7 points lower
Saturday .. 0 points difference
If you omit Sunday, your coefficients wind up being the difference from Sunday. So you get:
Monday .... +8 (relative to Sunday)
Tuesday ... -5
Wednesday . +7
Thursday .. -13
Friday .... -9
Saturday .. -2
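To make the recoding concrete, here's a few lines of Python using the made-up numbers above. The point is that, in a dummy-variable regression, each coefficient is just the raw day effect minus the reference day's effect:

```python
# Made-up day-of-week effects (relative to the overall average),
# taken from the table above.
effects = {"Sun": 2, "Mon": 10, "Tue": -3, "Wed": 9,
           "Thu": -11, "Fri": -7, "Sat": 0}

# Dropping a reference day re-expresses every coefficient as a
# difference from that day: coef[d] = effects[d] - effects[ref].
def recode(effects, ref):
    return {d: v - effects[ref] for d, v in effects.items() if d != ref}

vs_sunday = recode(effects, "Sun")
print(vs_sunday)  # Mon +8, Tue -5, Wed +7, Thu -13, Fri -9, Sat -2

vs_monday = recode(effects, "Mon")
print(vs_monday)  # Thursday becomes the most extreme contrast of all
```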
The most extreme day is Thursday, which comes out to -13. Maybe the SD of that estimate is only 8 points, and so Thursday isn't significant (1.6 SDs).
But, now, you come up with the bright idea of using Monday as your reference instead. Now, your coefficient estimates are:
Sunday .... -8 (relative to Monday)
Tuesday ... -13
Wednesday . -1
Thursday .. -21
Friday .... -17
Saturday .. -10
Thursday is now -21, which is more than 2.5 standard deviations below zero -- clearly significant! But that's only because of the way you structured your study. Effectively, you arranged it so that you wound up looking at Monday versus Thursday, the most extreme of all the possible pairwise comparisons. There are 21 such comparisons among the seven days, so, at a 5 percent chance of a false positive each, you'd expect about 1.05 false positives overall -- which means a pretty good chance that the most extreme comparison winds up looking significant.
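A quick simulation makes the inflation vivid. Under the null -- no real day-of-week effect at all -- draw seven day estimates and ask how often the most extreme of the 21 pairwise gaps clears the usual 1.96-SD bar. This is a sketch with each day's estimate standardized to an SD of 1, so the SD of a difference between two days is the square root of 2:

```python
import math
import random

random.seed(0)

# Simulate the null: seven day-of-week estimates, each with SD 1,
# and no real day-of-week effect at all.
trials = 20_000
cutoff = 1.96 * math.sqrt(2)  # SD of a difference of two estimates is sqrt(2)
hits = 0
for _ in range(trials):
    days = [random.gauss(0, 1) for _ in range(7)]
    # Most extreme of the 21 pairwise contrasts:
    extreme = max(abs(a - b) for i, a in enumerate(days) for b in days[i + 1:])
    if extreme > cutoff:
        hits += 1
print(hits / trials)  # well above 0.05
```

The most extreme contrast "reaches significance" far more than 5 percent of the time, even though every day effect is pure noise.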
As I said, I think this is cheating. I think you should normalize all your days of the week to the overall average (as in the top table above), to avoid this kind of issue.
Anyway, the point is that the more variables you have in your study, and the more different sets of variables you looked at before publishing, the higher your chances of finding at least one variable with the required significance. So, when a study is published, the weight you give the evidence of significance should be inversely proportional to the number of variables looked at.
It should also depend on how many rejiggerings the author did before finally settling on the regression that got published. As for that, there's no way to tell from the published study alone.
Again, if I wanted to cheat, I could try this. I run a huge regression with 200 different nonsense variables. I take the one that came out the most significant -- it'll probably be less than 0.01 -- and run a regression on that one alone. It'll probably still come in as significant, even without the other 199 variables. (If it doesn't, I'll just take the next most significant and try that one instead.)
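You can see why this screening trick works without fitting a single regression. Under the null, each nonsense variable's p-value is (roughly) uniform between 0 and 1, so the best of 200 draws is almost always tiny. A sketch:

```python
import random

random.seed(1)

# Under the null, every nonsense variable's p-value is uniform on (0, 1).
# Screen 200 of them, keep the best, and see how often it beats 0.01.
trials = 10_000
best_below_01 = sum(
    min(random.random() for _ in range(200)) < 0.01
    for _ in range(trials)
)
print(best_below_01 / trials)  # close to 1 - 0.99**200, about 0.87
```

So roughly 87 percent of the time, pure noise hands you a variable "significant" at better than 0.01 to build a paper around.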
Then, I write a paper, or a blog post, suggesting that that particular variable should be highly predictive, based on some theory I make up on the spot. I might say something like this:
"Thursdays are getaway days, which means the team is in a bit of a hurry. Psychological theory suggests that human cognitive skills are reduced with perceived time pressure from work authorities. That means batters should concentrate less. Therefore, you'd expect batting to be worse on Thursdays."
I'll also add a paragraph about why that doesn't apply to pitchers, and why batters should hit well on Mondays.
A month or two later, I publish a working paper that looks only at Thursdays and Mondays, and finds the exact effect I predicted!
I hasten to add that I don't think this kind of conscious cheating actually goes on. I'm just trying to make the point that an 0.05 is not an 0.05 is not an 0.05. The weight you give to that 0.05 has to take into account as much of the context as you can figure out.
And so, I'd like to see authors use a more stringent threshold for significance the more variables they use. They could choose a level so that the chance of finding significance is not 0.05 per variable, but 0.05 for the paper as a whole. That means that with six new variables, you use 0.0085 as your new significance level. Heck, round it up to 0.01 if you want.
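The adjusted threshold is simple to compute. For k variables, solving 1 - (1 - t)^k = 0.05 for t gives the per-variable cutoff (the Šidák correction; the cruder 0.05/k, the Bonferroni correction, gives nearly the same number):

```python
def per_variable_cutoff(k, alpha=0.05):
    # Per-variable threshold t such that the whole-paper false positive
    # probability, 1 - (1 - t)**k, equals alpha (the Sidak correction).
    return 1 - (1 - alpha) ** (1 / k)

print(round(per_variable_cutoff(6), 4))   # 0.0085 -- the figure above
print(round(per_variable_cutoff(10), 4))  # 0.0051
```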
Let 0.01 be the new 0.05. That would be appropriate for most studies, I think.
One last point here, and this is something that really bugs me. You'll read a paper, and the author will find variable X significant at 0.05, and he'll go ahead and write his conclusion as if he's 100 percent certain that the effect he found is real.
But, as I've just argued, there's a pretty good chance it's not. And, even if you disagree with what I've written here, still, there's always *some* chance the finding is spurious.
So why don't authors mention that? They could just throw it in casually, in a sentence or two, like this:
"... of course, it could be that we're just looking at a false positive, and there's no day-to-day effect at all. Remember, we'd see this kind of result about 1 in 20 times even if there were no effect at all."
Why won't the authors even consider a false positive? Is it just the style and standards of how academic papers work, that when you find a 5 percent significance level, you have to pretend it removes all doubt? Maybe it's just me. But when I see such confident conclusions based on one 0.05 cherry-picked from a list of fifteen variables, I can't help feeling like I'm being sold a bill of goods.
UPDATE: Commenter Mettle (fifth comment -- don't know how to link to it directly) points out that much of what I'm saying here isn't new (and referred me to this very relevant Wikipedia page).
However, the question remains: if this is a well-established issue, why don't authors routinely address it in their papers?