Sabermetric Research: Five percent significance is often too weak

In academic research, why is the standard significance level set to 5 percent?

I don't know, but I'm guessing a consensus emerged that 0.05 is a reasonable threshold beyond which it's OK to assume any effect you found is real. The idea might be that, when searching for an effect that doesn't actually exist, academia figures that one false positive out of 20 is something they can live with -- but no more than that.

The problem is that the "1 out of 20" isn't generally true if you're looking at more than one variable.

Suppose, for instance, that I want to see whether hitters do better on one specific day of the week more than others. So I set up a regression with six variables for days of the week (the seventh is the default case, and doesn't get a variable), and I figure out that batters do the best on Monday, compared to the other days. And, moreover, the effect is statistically significant, at p<0.05.

Should I accept the result? Maybe not. Because, instead of one variable, I had six, and I didn't know in advance what I was looking for. So I had six chances for a false positive, not just one.

Obviously, if I have enough variables, the chances increase that at least one of them will appear statistically significant just by luck. Imagine if I had 100 independent variables in my regression, none of them with any real effect. If I stick to the .05 criterion, I should expect five of them to show up as significant just by chance, on average.

In the "day of the week" case, I didn't have one hundred variables -- just six. What are the chances that at least one of six variables would be significant at 5 percent? The answer, treating the six tests as independent: 1 - 0.95^6, or about .26. So I thought I was using a .05 significance level, but I was really using a .26 significance level.
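The .26 figure can be checked directly. Here is a minimal sketch in Python, assuming the tests are independent (dummy variables in one regression are not exactly independent, so treat it as an approximation). The function name is made up for illustration.

```python
def chance_of_any_false_positive(num_variables, alpha=0.05):
    """Chance that at least one of num_variables independent null tests is significant."""
    # Each test stays non-significant with probability (1 - alpha), so all of
    # them stay non-significant with that probability raised to the count.
    return 1.0 - (1.0 - alpha) ** num_variables


print(round(chance_of_any_false_positive(6), 3))  # six day-of-week dummies: 0.265
```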

If I want to keep a 5 percent probability of having *any* false positive, I can't use the threshold of 0.05 any more. I have to use 0.01. Actually, a little less than that: 0.0085. Intuitively, that's because six variables, multiplied by 0.0085 each, works out to about .05. (That's not the precise way to calculate it, but it's close enough.)
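Both versions of that calculation fit in a few lines. This is a sketch, assuming independent tests: the exact cutoff is usually called the Sidak correction, and the "divide by the number of variables" shortcut is the Bonferroni correction. The function names are mine.

```python
def exact_threshold(num_variables, overall_alpha=0.05):
    """Per-variable cutoff that keeps the chance of ANY false positive at overall_alpha."""
    # Solve 1 - (1 - t) ** n = overall_alpha for t (assumes independent tests).
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / num_variables)


def shortcut_threshold(num_variables, overall_alpha=0.05):
    """The close-enough version: split the overall alpha evenly across variables."""
    return overall_alpha / num_variables


print(round(exact_threshold(6), 4))     # 0.0085
print(round(shortcut_threshold(6), 4))  # 0.0083
```

The shortcut is always slightly stricter than the exact figure, so using it errs on the safe side.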

But suppose I don't do that, and I stick with my 0.05. Then, if I have a regression with six meaningless variables, there's still a 26.5 percent chance I'll have something positive to report. If I have ten variables, that goes up to 40 percent! Ten variables is actually not uncommon in academic studies.
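Those percentages can also be checked by brute force. The sketch below does not run real regressions. It assumes each meaningless variable produces an independent p-value that is uniform between 0 and 1, which is what a well-behaved test gives when there is no effect, and counts how often at least one falls under 0.05. The function name, trial count and seed are arbitrary choices for illustration.

```python
import random


def simulate_any_significant(num_variables, alpha=0.05, trials=100_000, seed=2011):
    """Fraction of simulated studies in which at least one null variable is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One simulated study: draw a null p-value for each meaningless variable.
        if any(rng.random() < alpha for _ in range(num_variables)):
            hits += 1
    return hits / trials


print(simulate_any_significant(6))   # should land near 0.265
print(simulate_any_significant(10))  # should land near 0.40
```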

-----

And it's actually even higher than that in practice. For one thing, if you do a regression, and you get close to significance for variable X, you can try adding or removing a variable or two, and see if that pushes X over the threshold. You see that you're at 0.06, and you think, "wait, maybe I didn't control for home/road!" And you think, "yeah, I really should have controlled for home/road." And you're right about that, it's a good idea. So you add a new variable for home/road, and, suddenly, the 0.06 moves to 0.049, and you're in. But: if you'd got the 0.049 in the first place, you probably never would have thought of adding home/road. So, really, you're giving yourself two (or more) kicks at the can.

------

As an aside .... there's another thing you can do to try to get a significant result, which, in my opinion, is cheating, a bit. Here's how it works. When you choose which six of the days of the week to include as variables, you leave out the one that's most extreme. That bumps all your estimates, and your significance levels.

For instance, suppose that, compared to the average, the observed effects of days of the week are something like this (numbers made up):

Sunday .... 2 points higher (than average)
Monday .... 10 points higher
Tuesday ... 3 points lower
Wednesday . 9 points higher
Thursday .. 11 points lower
Friday .... 7 points lower
Saturday .. 0 points difference

If you omit Sunday, your coefficients wind up being the difference from Sunday. So you get:

Monday .... +8 (relative to Sunday)
Tuesday ... -5
Wednesday . +7
Thursday .. -13
Friday .... -9
Saturday .. -2

The most extreme day is Thursday, which comes out to -13. Maybe the SD of that estimate is only 8 points, and so Thursday isn't significant (1.6 SDs).

But, now, you come up with the bright idea of using Monday as your reference instead. Now, your coefficient estimates are:

Sunday .... -8 (relative to Monday)
Tuesday ... -13
Wednesday . -1
Thursday .. -21
Friday .... -17
Saturday .. -10

Thursday is now -21, which is more than 2.5 standard deviations below zero -- clearly significant! But that's because of the way you structured your study. Effectively, you arranged it so that you wound up looking at Monday-Thursday, the most extreme of all the possible comparisons. There are 21 such comparisons, so, at a 5 percent chance of a false positive on each, you'd expect 1.05 false positives overall, which means a pretty good chance that the most extreme comparison will wind up looking significant.

As I said, I think this is cheating. I think you should normalize all your days of the week to the overall average (as in the top table above), to avoid this kind of issue.
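To see how much the choice of reference day matters, here is a small Python sketch that recomputes the coefficients from the made-up effects in the first table. It also counts the possible day-versus-day comparisons and picks out the widest one. The helper name is hypothetical, and the numbers are the invented ones from above, not real data.

```python
from itertools import combinations

# Made-up effects relative to the overall average, from the first table.
effects_vs_average = {
    "Sunday": 2, "Monday": 10, "Tuesday": -3, "Wednesday": 9,
    "Thursday": -11, "Friday": -7, "Saturday": 0,
}


def coefficients_relative_to(reference_day, effects):
    """Coefficients a regression would report if reference_day were the omitted dummy."""
    base = effects[reference_day]
    return {day: value - base for day, value in effects.items() if day != reference_day}


print(coefficients_relative_to("Sunday", effects_vs_average))
print(coefficients_relative_to("Monday", effects_vs_average))

# Every pair of days is a comparison you could have ended up highlighting.
pairs = list(combinations(effects_vs_average, 2))
largest_gap = max(pairs, key=lambda p: abs(effects_vs_average[p[0]] - effects_vs_average[p[1]]))
print(len(pairs), largest_gap)
```

Picking the reference day after seeing the data amounts to choosing the widest of those pairs on purpose.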

------

Anyway, the point is that the more variables you have in your study, and the more different sets of variables you looked at before publishing, the higher your chances of finding at least one variable with the required significance. So, when a study is published, the weight you give the evidence of significance should be inversely proportional to the number of variables looked at.

It should also depend on how many rejiggerings the author did before finally settling on the regression that got published. Unfortunately, there's no way to tell that from just the published study.

Again, if I wanted to cheat, I could try this. I run a huge regression with 200 different nonsense variables. I take the one that came out the most significant -- it'll probably be less than 0.01 -- and run a regression on that one alone. It'll probably still come in as significant, even without the other 199 variables. (If it doesn't, I'll just take the next most significant and try that one instead.)
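How likely is "probably"? Here is a rough Python sketch, assuming the 200 nonsense variables give independent p-values that are uniform under the null. The function names are invented for illustration.

```python
def chance_smallest_p_below(cutoff, num_variables):
    """Chance that the smallest of num_variables independent null p-values is under cutoff."""
    return 1.0 - (1.0 - cutoff) ** num_variables


def median_smallest_p(num_variables):
    """Median of the smallest p-value among num_variables independent null tests."""
    # Solve 1 - (1 - m) ** n = 0.5 for m.
    return 1.0 - 0.5 ** (1.0 / num_variables)


print(round(chance_smallest_p_below(0.01, 200), 2))  # 0.87
print(round(median_smallest_p(200), 4))              # 0.0035
```

So under these assumptions, the best of 200 nonsense variables beats 0.01 about seven times out of eight.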

Then, I write a paper, or a blog post, suggesting that that particular variable should be highly predictive, based on some theory I make up on the spot. I might say something like this:

"Thursdays are getaway days, which means the team is in a bit of a hurry. Psychological theory suggests that human cognitive skills are reduced with perceived time pressure from work authorities. That means batters should concentrate less. Therefore, you'd expect batting to be worse on Thursdays."

I'll also add a paragraph about why that doesn't apply to pitchers, and why batters should hit well on Mondays.

A month or two later, I publish a working paper that looks only at Thursdays and Mondays, and finds the exact effect I predicted!

I hasten to add that I don't think this kind of conscious cheating actually goes on. I'm just trying to make the point that an 0.05 is not an 0.05 is not an 0.05. The weight you give to that 0.05 has to take into account as much of the context as you can figure out.

------

And so, I'd like to see authors use a more stringent threshold for significance the more variables they use. They could choose a level so that the chance of finding significance is not 0.05 per variable, but 0.05 for the paper as a whole. That means that with six new variables, you use 0.0085 as your new significance level. Heck, round it up to 0.01 if you want.

Let 0.01 be the new 0.05. That would be appropriate for most studies, I think.

------

One last point here, and this is something that really bugs me. You'll read a paper, and the author will find variable X significant at 0.05, and he'll go ahead and write his conclusion as if he's 100 percent certain that the effect he found is real.

But, as I've just argued, there's a pretty good chance it's not. And, even if you disagree with what I've written here, still, there's always *some* chance the finding is spurious.

So why don't authors mention that? They could just throw it in casually, in a sentence or two, like this:

"... of course, it could be that we're just looking at a false positive, and there's no day-to-day effect at all. Remember, we'd see this kind of result about 1 in 20 times even if there were no effect at all."

Why won't the authors even consider a false positive? Is it just the style and standards of how academic papers work, that when you find a 5 percent significance level, you have to pretend it removes all doubt? Maybe it's just me. But when I see such confident conclusions based on one 0.05 cherry-picked from a list of fifteen variables, I can't help feeling like I'm being sold a bill of goods.

------

UPDATE: Commenter Mettle (fifth comment -- don't know how to link to it directly) points out that much of what I'm saying here isn't new (and referred me to this very relevant Wikipedia page).

However, the question remains: if this is a well-established issue, why don't authors routinely address it in their papers?

Labels: false positives, regression, statistics

14 Comments:

At Monday, June 20, 2011 11:34:00 PM, Andrew Trembley said...: This is all true, but don't forget that any confidence level is a tradeoff: you've got a small chance of a false positive (seeing a pattern when you shouldn't), but a pretty large chance of a false negative (not seeing a pattern when there is one). All the more reason why we should just publish P-values instead: let readers and not publishers decide what's convincing.
At Monday, June 20, 2011 11:42:00 PM, Phil Birnbaum said...: Sure, I'm all for that ... show your result, show your significance level, and make an argument for what it means.
At Tuesday, June 21, 2011 1:51:00 AM, Phil Birnbaum said...: Beautiful, thanks!
At Tuesday, June 21, 2011 11:48:00 AM, mettle said...: I can't tell if you aren't aware that this is a well-established issue and has been for decades (wiki "multiple comparisons" or "Bonferroni correction") or if there's some other specific reason you're not acknowledging all the previous work in this area or using the existing terminology so people can continue on with their own investigations.

Either way, it's a well-known problem with many well-known solutions (as you may already know).
At Tuesday, June 21, 2011 11:55:00 AM, Phil Birnbaum said...: Nope, wasn't aware. I wrote about it because I've seen some of these things done (on a less obvious scale) in some of the papers I've been reading.

Thanks for the links, will check 'em out.
At Tuesday, June 21, 2011 12:49:00 PM, Phil Birnbaum said...: Mettle: Thanks! I guess I partly reinvented the wheel, here. I've updated the post with the Wikipedia link.
At Tuesday, June 21, 2011 2:09:00 PM, mettle said...: I've found it actually ends up being a pretty tricky (and interesting!) issue because ideally you have to have all your planned comparisons worked out ahead of time, but that sort of ignores the real-life process many people go through of poking around for findings via trial and error. Then you sort of have to go back and do a posthoc pre-justification of what you were originally looking for.

I know you will find it consistently used in neuroscience and psychology articles. My guess as to why it's not used in baseball is because I think a lot of the people don't really have real stats training -- they learned it on the job. That's why you see lots of other sketchy methods like "buckets" instead of regressions or not using non-parametric stats where needed.
At Tuesday, June 21, 2011 3:21:00 PM, bradluen said...: "if this is a well-established issue, why don't authors routinely address it in their papers?"

Because then it might not get published :)

Really it's the referees who should insist on such a disclaimer. Why don't they? Maybe because then they would be obliged to include disclaimers in their own manuscript... but less conspiratorially, I think in a number of fields, this isn't seen as much a problem because those fields ultimately rely on replication anyway. Even if your P-value is 0.0001, your finding isn't confirmed until someone in another lab reproduces it.

"All the more reason why we should just publish P-values instead"

And confidence intervals! (Though that opens up the corresponding can of worms -- why do we use 95% CIs?)
At Tuesday, June 21, 2011 6:24:00 PM, Alex said...: Hey Phil - your post evoked enough thoughts that I made a post instead of commenting here. But as far as switching to confidence intervals go - they typically (if not always) involve both the error of the test (which also dictates the p value) and the choice of an alpha value. So confidence intervals will have all the problems of both; if you have a multiple regression with multicollinearity there will be issues with the errors and a need to correct for multiple tests.
At Wednesday, June 22, 2011 9:41:00 AM, BMMillsy said...: As Mettle points out, this is not a new problem. There are all sorts of corrections and such that people attempt--though some aren't necessary. If the reader understands what the p-value means, then he or she can make their own assessment as to the validity of the researcher's claims. I agree with Andrew in comment #1.

I think an earlier commenter points out the real issue: "because you won't get published". The real issue is that publishers are biased in what they choose to print, which means we only see positive results (i.e. p < .05). However, we seldom see those other studies or replications where p > .05. If we have knowledge of all of them, we can make a more accurate assessment of the single study...but that doesn't seem to be how things work.

You may enjoy this little comic:

http://iterativepath.wordpress.com/2011/04/07/significance-of-random-flukes/
At Thursday, June 23, 2011 5:44:00 AM, Mike said...: Phil, I'm confused by this:

"I set up a regression with six variables for days of the week (the seventh is the default case, and doesn't get a variable)"

Are you saying you use 6 if the seventh day is the baseline day? So if you used the daily average, you'd have 7 variables? Or are you saying the 7th variable is the "b" in y=mx+b? Or neither? :-) Thanks!
At Thursday, June 23, 2011 9:26:00 AM, Phil Birnbaum said...: Mike,

I used six dummy variables: "Is today Monday?" "Is today Tuesday?" and so on.

You can't use all 7. If you do, the regression will crap out because there's an infinite number of solutions. For instance, you could add X to all the day coefficients, and subtract X from the constant, and the equation would still work.

Google "dummy variables multicollinearity", there are probably better explanations out there than mine.
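The "add X to every day coefficient and subtract X from the constant" point can be shown without running a regression at all. In this minimal Python sketch, the constant, the coefficients and the shift are all made-up numbers, and `predict` is a hypothetical helper standing in for the fitted equation.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def predict(constant, day_coefficients, day):
    """Fitted value for one day when ALL seven days get their own dummy variable."""
    return constant + day_coefficients[day]


# Made-up numbers: one constant plus a coefficient for each of the seven days.
constant = 260
coefficients = dict(zip(DAYS, [2, 10, -3, 9, -11, -7, 0]))

# Shift every day coefficient up by X and the constant down by X.
shift = 5
shifted_constant = constant - shift
shifted_coefficients = {day: value + shift for day, value in coefficients.items()}

# Both sets of numbers give identical predictions for every day, so the data
# cannot tell them apart: there is no single best-fitting answer.
same_predictions = all(
    predict(constant, coefficients, day)
    == predict(shifted_constant, shifted_coefficients, day)
    for day in DAYS
)
print(same_predictions)  # True
```

Dropping one day's dummy pins that day's coefficient at zero, which removes the freedom to shift.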
At Sunday, June 26, 2011 11:49:00 AM, Anonymous said...: OT. If you haven't seen it, an interesting review of SCORECASTING by an Englishman, at:

http://www.lrb.co.uk/v33/n13/david-runciman/swing-for-the-fences
At Sunday, June 26, 2011 12:10:00 PM, Phil Birnbaum said...: Thanks, that was a very interesting review! Appreciate the link, I hadn't seen it.


Monday, June 20, 2011
