## Monday, November 21, 2011

### "Statisticians can prove almost anything"

Sometimes, when you look for statistical significance, you'll find it even if the effect isn't real -- in other words, a false positive. With a 5% significance level, that happens about one time in 20 when there's no real effect.

However, experimenters don't do just one analysis one time. They'll try a bunch of different variables, and a bunch of different datasets. If they try enough things, they have a much better than 5% chance of coming up with a positive. How much better? Well, there's no real way to tell, since the tests aren't independent (adding one independent variable to a regression isn't really a whole new regression). But, intuitively: if, by coincidence, your first experiment winds up at (say) p=0.15, it seems like it should be possible to get it down to 0.05 if you try a few things.

That's exactly what Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn did in a new academic paper (reported on in today's National Post). They wanted to prove the hypothesis that listening to children's music makes you older. (Not makes you *feel* older, but actually makes your date of birth earlier.) Obviously, that hypothesis is false.

Still, the authors managed to find statistical significance. It turned out that subjects who were randomly selected to listen to "When I'm Sixty-Four" had an average (adjusted) age of 20.1 years, but those who listened to the children's song "Kalimba" had an adjusted age of 21.5 years. That was significant at p=.04.

How? Well, they gave the subjects three songs to listen to, but only put two in the regression. They asked the subjects 12 questions, but used only one in the regression. And, they kept testing subjects 10 at a time until they got significance, then stopped.

In other words, they tried a large number of permutations, but only reported the one that led to statistical significance.
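The "keep testing until you get significance, then stop" step is enough, all by itself, to push the false positive rate past 5%. Here's a rough simulation sketch of that one trick. It is not the authors' procedure: I'm assuming two groups with no real difference, normally distributed scores with a known standard deviation of 1, batches of 10 subjects per group, and a cap of 100 per group. The function names are made up for illustration.

```python
import random
from statistics import NormalDist, mean


def z_test_p(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means (z-test, sigma assumed known)."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def peeking_false_positive_rate(batch=10, max_n=100, trials=2000,
                                alpha=0.05, seed=1):
    """Share of no-effect experiments that reach p < alpha at some peek.

    Subjects are added `batch` per group at a time, and the experiment
    stops the moment the test comes up significant.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a, b = [], []
        while len(a) < max_n:
            # Both groups come from the same distribution: no real effect.
            a += [rng.gauss(0, 1) for _ in range(batch)]
            b += [rng.gauss(0, 1) for _ in range(batch)]
            if z_test_p(a, b) < alpha:
                hits += 1
                break
    return hits / trials


if __name__ == "__main__":
    # One look only (sample size fixed in advance) versus up to ten looks.
    print("single look:", peeking_false_positive_rate(max_n=10))
    print("ten looks:  ", peeking_false_positive_rate(max_n=100))
```

With a single look, the rate should land near the nominal 5%. With up to ten looks, it should come out well above that, even though nothing else about the analysis changed.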

One thing I found interesting was that one variable -- father's age -- made the biggest difference, dropping the p-value from .33 to .04. That makes sense, because father's age is very much related to subject's age. If your father is 40, you're unlikely to be 35. You could actually make a case that father's age *helps* the logic, not hurts it, even though it was arbitrarily selected because it gave the desired result.

-----

In this case, all the permutations meant that statistical significance was extremely likely. Suppose that, before any regressions, the two groups had about the same age. Then, you start adjusting for things, one at a time. What you're looking for is a significant difference in that one respect. The chances of that are 5%. But, the things the researchers adjusted for are independent: how much they would enjoy eating at a diner, their political orientation, which of four Canadian quarterbacks they believed won an award ... and so on. With ten independent thingies, the chance at least one would be significant is about 0.4.

Add to that the possibility of continuing the experiment until significance was found, and the possibility of combining factors, and you're well over 0.5.

Plus, if the researchers hadn't found significance, they would have kept adjusting the experiment until they did!

-----

The authors make recommendations for how to avoid this problem. They say that researchers should be forced to decide, in advance, when to stop collecting data. And they should be forced to list all variables and all conditions, allowing the referees and the readers to see all the "failed" options.

These are all good things. Another thing that I might add is: researchers should have to repeat the *exact same study* with a second dataset. If the original finding was a product of manipulation, it'll have only about a 5% chance of standing up to an exact replication. This might create more false negatives, but I think it'd be worth it.
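To put rough numbers on that trade-off, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration: a 0.4 chance that a manipulated first study comes up significant, a 5% level for the replication, and 80% power to detect a real effect in each (honestly run) study.

```python
def replication_outcomes(p_hacked=0.4, alpha=0.05, power=0.8):
    """Chances that a finding survives an exact replication.

    Returns (false_effect_survives, real_effect_survives):
    - a false effect must get through the manipulated first study
      AND pass a clean replication at level alpha;
    - a real effect must be detected in both studies, each with
      probability `power`.
    """
    false_effect_survives = p_hacked * alpha
    real_effect_survives = power * power
    return false_effect_survives, real_effect_survives


if __name__ == "__main__":
    false_rate, real_rate = replication_outcomes()
    print(round(false_rate, 3), round(real_rate, 3))  # 0.02 0.64
```

Under those assumptions, false effects get through about 2% of the time instead of 40%, at the cost of real effects getting through 64% of the time instead of 80%. That's the "more false negatives" part.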

-----

One point I'd add is that this study reinforces my point, last post, that the interpretation of the study is just as important as the regression. For one thing, looking at all the "failed" iterations of the study is necessary to decide how to describe the conclusions. But, mostly, this study shows an extreme example of how you have to use insight to figure out what's going on.

Even if this study wasn't manipulated, the conclusion "listening to children's music makes you older" would be ludicrous. But, the regression doesn't tell you that. Only an intelligent analysis of the problem tells you that.

In this case, it's obvious, and you don't need much insight. In other cases, it's more subtle.

-----

Finally, let me take exception to the headline of the National Post article: "Statisticians can prove almost anything, a new study finds." Boo!

First of all, the Post makes the same mistake I argued against last post. The statistics don't prove anything; the statistics *plus the argument* make the case. Saying "statistics prove a hypothesis" is like saying "subtraction proves socialism works" or "the hammer built the birdhouse."

Second, a psychologist who uses statistics should not be described as a statistician, any more than an insurance salesman should be described as an actuary.

Third, any statistician would tell you, in seconds, that if you allow yourself to try multiple attempts, the .05 goes out the window. It's the sciences that have chosen to ignore that fact.

The true moral of the story, I'd argue, is that the traditional academic standard is wrong -- the standard that once you find statistical significance, you're entitled to conclude your effect is real.

-----

P.S. If Uri Simonsohn's name looks familiar, it might be because he was one of the authors of the "batters hit .463 when gunning for .300" study.
