Wednesday, July 09, 2014

"The Cult of Statistical Significance"

"The Cult of Statistical Significance" is a critique of social science's overemphasis on confidence levels and its convention that only statistically-significant results are worthy of acceptance. It's by two academic economists, Stephen Ziliak and Deirdre McCloskey, and my impression is that it made a little bit of a splash when it was released in 2008.

I've had the book for a while now, and I've been meaning to write a review. But, I haven't finished reading it, yet; I started a couple of times, and only got about halfway through. It's a difficult read for me ... it's got a flowery style, and it jumps around a bit too much for my brain, which isn't good at multi-tasking. But a couple of weeks ago, someone on Twitter pointed me to this .pdf -- a short paper by the same authors, summarizing their arguments. 


Ziliak and McCloskey's thesis is that scientists are too fixated on significance levels, and not enough on the actual size of the effect. To illustrate that, they use an example of two weight-loss pills:

"The first pill, called "Oomph," will shed from Mom an average of 20 pounds. Fantastic! But Oomph is very uncertain in its effects—at [a standard error of] plus or minus 10 pounds. ... Could be ten pounds Mom loses; could be thrice that.

"The other pill you found, pill "Precision," will take 5 pounds off Mom on average but it is very precise—at plus or minus 0.5 pounds. Precision is the same as Oomph in price and side effects but Precision is much more certain in its effects. Great! ...

"Fine. Now which pill do you choose—Oomph or Precision? Which pill is best for Mom, whose goal is to lose weight?"

Ziliak and McCloskey -- I'll call them "ZM" for short -- argue that "Oomph" is the more effectual pill, and therefore the best choice. But, because its effect is not statistically significant from zero*, scientists would recommend "Precision". Therefore, the blind insistence on statistical significance costs Mom, and society, a high price in lost health and happiness.

(*In their example, the effect actually *is* statistically significant, at 2 SDs, but the authors modify the example later so it isn't.)

But: that isn't what happens in real life. In actual research, scientists would *observe* 20 pounds plus or minus 10, and try to infer the true effect as best they can. But here, the authors proceed as if we *already know* the true effect on Mom is 20 +/- 10. But if we did already know that, then, *of course* we wouldn't need significance testing!

Why do the authors wind up having their inference going the wrong way?  I think it's a consequence of failing to notice the elephant in the room, the fact that's the biggest reason significance testing becomes necessary. That elephant is: most pills don't work. 

What I suspect is that when the authors see an estimate of 20, plus or minus 10, they think that must be a reasonable, unbiased estimate of the actual effect. They don't consider that most true values are zero, therefore, most observed effects are just random noise, and that the "20 pounds" estimate is likely spurious.

That's the key to the entire issue of why we have to look at statistical significance -- to set a high-enough bar that we don't wind up inundated with false positives.

At best, the authors are setting up an example in which they already assume the answer, then castigating statistical significance for getting it wrong. And, sure, insisting on p <  .05 will indeed cause false negatives like this one. But ZM fail to set off the false negatives against the inevitable false positives that would result without looking at significance, without realizing we need to find evidence of existence.


In fairness, Ziliak and McCloskey don't say explicitly that they're rejecting the idea that most pills are useless. They might not actually even believe it. They might just be making statistical assumptions that necessarily assume it's true. Specifically:

-- In their example, they assume that, because the "Oomph" study found a mean of 20 pounds and SD of 10 pounds, that's what Mom should expect in real life. But that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- They also seem to assume the implication of that, that when you come up with a 95% confidence interval for the size of the effect, there is actually a 95% probability that the effect lies in that range. Again, that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- And, I think they assume that if a result comes out with a p-value of .75, it implies a 75% chance that the true effect is greater than zero. Same thing: that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

I can't read minds, and I probably shouldn't assume that's what ZM were actually thinking. But that one single assumption would easily justify their entire line of argument -- if only it were true. 

And it certainly *seems* justifiable, to assume that every effect size is equally likely. You can almost hear the argument being made: "Why assume that the drug is most likely useless?  Isn't that an assumption without a basis, an unscientific prejudice?  We should keep a completely open mind, and just let the data speak."  

It sounds right, but it's not. "All effects are equally likely" is just as strong a prejudice as "Zero is most likely."  It just *seems* more open-minded because (a) it doesn't have to be said explicitly, (b) it keeps everything equal, which seems less arbitrary, and (c) "don't be prejudiced" seems like a strong precedent, being such an important ethical rule for human relationships.

If you still think "most pills don't work" is an unacceptable assumption ... imagine that instead of "Oomph" being a pill, it was a magic incantation. Are you equally unwilling to accept the prejudice "most incantations don't work"?

If it is indeed true that most pills (and incantations) are useless, ignoring the fact might make you less prejudiced, but it will also make you more wrong. 


And "more wrong" is something that ZM want to avoid, not tolerate. That's why they're so critical of the .05 rule -- it causes "a loss of jobs, justice, profit, and even life."  Reasonably, they say we should evaluate the results not just on significance, but on the expected economic or social gain or loss. When a drug appears to have an effect on cancer that would save 1,000 lives a year ... why throw it away because there's too much noise?  Noise doesn't cost lives, while the pill saves them!

Except that ... if you're looking to properly evaluate economic gain -- costs and benefits -- you have to consider the prior. 

Suppose that 99 out of 100 experimental pills don't work. Then, when you get a p-value of .05, there's only about a 17 percent chance that the pill has a real effect. Do you want to approve cancer pills when you know five-sixths of them don't do anything?

(Why 5/6?  Of the 99 worthless drugs, about 5 of them will show significance just randomly. So you accept 5 spurious effects for each real effect.)

And that 17 percent is when you *do* have p=.05 significance. If you lower your significance threshold, it gets worse. When you have p=.20, say, you get 20 false positives for every real one.

Doing the cost-benefit analysis for Mom's diet pill ... if there's only a 1 in 6 chance that the effect is real, her expectation is a loss of 3.3 pounds, not 20. In that case, she is indeed better off taking "Precision" than "Oomph".


If you don't read the article or book, here's the one sentence summary: Scientists are too concerned with significance, and not enough with real-life effects. Or, as Ziliak and McCloskey put it, 

"Precision is Nice but Oomph is the Bomb."

The "oomph" -- the size of the coefficient -- is the scientific discovery that tells you something about the real world. The "precision" -- the significance level -- tells you only about your evidence and your experiment.

I agree with the authors on this point, except for one thing. Precision is not merely "nice". It's *necessary*. 

If you have a family of eight and shop at Costco and need a new vehicle, "Tires are Nice but Cargo Space is the Bomb." That's true -- but the "Bomb" is useless without the "Nice".

Even if you're only concerned with real-world effects, you still need to consider p-values in a world where most  hypotheses are false. As critical as I have been about the way significance is used in practice, it's still something that's essential to consider, in some way, in order to filter out false positives, where you mistakenly approve treatments that are no better than sugar pills. 

None of that ever figures into the authors' arguments. Failing to note the false positives -- the word "false" doesn't appear anywhere in their essay, never mind "false positive" -- the authors can't figure out why everyone cares about significance so much. The only conclusion they can think of is that scientists must worship precision for its own sake. They write, 

"[The] signal to noise ratio of pill Oomph is 2-to-1, and of pill Precision 10-to-1. Precision, we find, gives a much clearer signal—five times clearer.

"All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way. "Well," say our significance testing colleagues, "the pill with the highest signal to noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision.” 

"But Precision—precision commonly defined as a large t-statistic or small p-value on a coefficient—is obviously the wrong choice. Wrong for Mother's weight-loss plan and wrong for the many other victims of the sizeless scientist. The sizeless scientist decides whether something is important or not—he decides "whether there exists an effect," as he puts it—by looking not at the something's oomph but at how precisely it is estimated. Mom wants to lose weight, not gain precision."

Really?  I have much, much less experience with academic studies than the authors, but ... I don't recall ever having seen papers boast about how precise their estimates are, except as evidence that effects are significant and real. I've never seen anything like, "My estimates are 7 SDs from zero, while yours are only 4.5 SDs, so my study wins!  Even though yours shows cigarettes cause millions of cancer deaths, and mine shows that eating breakfast makes you marginally happier."

Does that really happen?


Having said that, I agree emphatically with the part of ZM's argument that says scientists need to pay more attention to oomph. I've seen many papers that spend many, many words arguing that an effect exists, but then hardly any examining how big it is or what it means. Ziliak and McCloskey refer to these significance-obsessed authors as "sizeless scientists." 

(I love the ZM terminology: "cult," "oomph," "sizeless".) 

Indeed, sometimes studies find an effect size that's so totally out of whack that it's almost impossible -- but they don't even notice, so focused are they on significance levels.

I wish I could recall an example ... well, I can make one up, just to give you the flavor of how I vaguely remember the outrageousness. It's like, someone finds a statistically-significant relationship between baseball career length and lifespan, and trumpets how he has statistical significance at the 3 percent level ... but doesn't realize that his coefficient estimates a Hall-of-Famer's lifespan at 180 years. 

If it were up to me, every paper would have to show the actual "oomph" of its findings in real-world terms. If you find a link between early-childhood education and future salary, how many days of preschool does it take to add, say, a dollar an hour?  If you find a link between exercising and living longer, how many marathons does it take to add a month to your life?  If fast food is linked with childhood obesity, how many pounds does a kid gain from each Happy Meal?  

And we certainly do also need less talk of precision. My view is that you should spend maybe one paragraph confirming that you have statistical significance. Then, shut up about it and talk about the real world. 

If you're publishing in the Journal of Costcological Science, you want to be talking about cargo space, and what the findings mean for those who benefit from Costcology. How many fewer trips to Costco will you make per year?  Is it now more efficient to get your friends to buy you gift cards instead of purchasing a membership? Are there safety advantages to little Joey no longer having to make the trip home with an eleven-pound jar of Nutella between his legs?

You don't want to be going on and on about, how, yes, the new vehicle does indeed have four working tires!  And, look, I used four different chemical tests to make sure they're actually made out of rubber!  And did I mention that when I redo the regression but express the cargo space in metric, the car still tests positive for tires?  It did!  See, tires are robust with respect to the system of mensuration!

For me, one sentence is enough: "The tire treads are significant, more than 2 mm from zero."  


So I agree that you don't need to talk much about the tires. The authors, though, seem to be arguing that the tires themselves don't really matter. They think drivers must just have some kind of weird rubber fetish. Because, if the vehicle has enough cargo space, who cares if the tires are slashed?

You need both. Significance to make sure you're not just looking at randomness, and oomph to tell you what the science actually means.

Labels: , , ,


At Wednesday, July 09, 2014 9:02:00 PM, Anonymous Alex said...

The ZM comparison (oomph does nothing while precision is better) likely wouldn't happen if directly compared in an actual piece of research because someone would have to make a bar graph, and they would see that oomph clearly leads to more weight loss most of the time. But I could very well see this conclusion being reached if oomph and precision were tested in different papers, say against different control groups.

I'll give you a mostly-equivalent example that I do see fairly often in research articles. Condition A leads to some change, say 10 points with an SD of 6 (something that doesn't come out significantly different from 0 with the sample size). But Condition B leads to a 12 point change with an SD of 4 (something that isn't significantly different from 0). It is certainly not uncommon for papers to claim that something is happening in Condition B and nothing is happening in Condition A.

Not only do they fall prey to the cult, but they fail to run the proper test, which is to directly compare the two (which would never come out as different). And were they to, they would then have to handwave about why B is significant while A is not while they're both actually the same. This kind of thing happens more often when you have to do something fancy to produce your estimates for A and B in the first place; I see it often in neuroimaging work.

At Thursday, July 10, 2014 12:08:00 PM, Anonymous Anonymous said...

Ziliak's first name is Stephen.

At Thursday, July 10, 2014 1:10:00 PM, Blogger Phil Birnbaum said...

Oops, thanks! Will fix as soon as I'm at my computer.

At Friday, July 11, 2014 10:20:00 PM, Blogger Don Coffin said...

In an earlier piece (I think a book; might have been an essay reprinted) (which I can't cite, because I don't have it, or remember its title, or where it was published), McCloskey made a similar argument, but in the context of a treatment which might (or might not) be lifesaving. In that example, the mean (estimated) effect of the treatment was positive, but the confidence interval around that mean estimated effect was large enough that there was a 15% to 20% chance that a patient was more, rather than less, likely to die *because of the treatment.*

McCloskey has been toning down the argument against using statistical significance for a while now, and this is another example of that.

Having said that, I agree that statistical significance is somewhat over-emphasized. An effect can be statistically significant, but practically meaningless--because it is very small.

Suppose we very precisely estimate the effect of (e.g.) mega doses vitamin C (say, 2000 mg per day--Linus Pauling's recommendation) on the likelihood that people will experience a bout with the flu. And that the effect is a 0.1 percentage point decrease in the likelihood of disease--but the statistical significance is extraordinarily high (e.g., the standard error of the estimate is 0.001). We might conclude that, yes vitamin C is effective--but the effect is so small that we don't care.

On the other hand, a less precisely estimated effect (say, a 50% reduction in the incidence of a disease, with a 30% standard error of estimate) might not be (conventionally) statistically significant--but the effect is practically meaningful. (Unfortunately, here, we have the relatively large chance that the treatment does make you more likely to get the disease...)

What I like are (a) properly specified nulls; (b) statistically significant effects; and (c) practically significant effects. I would guess that a fairly large percentage of statistical studies fall down on one or more of those.


Post a Comment

<< Home