Friday, February 20, 2015

Replacing "statistically significant"

In his recent book, "How Not To Be Wrong," mathematician Jordan Ellenberg writes about how the word "significant" means something completely different in statistics than it does in real life:

"In common language it means something like 'important' or 'meaningful.' But the significance test scientists use doesn't measure importance ... [it's used] merely to make a judgment that the effect is not zero. But the effect could still be very small -- so small that the drug isn't effective in any sense that an ordinary non-mathematical Anglophone would call significant. ...

"If only we could go back in time to the dawn of statistical nomenclature and declare ... 'statistically noticeable' or 'statistically detectable' instead of 'statistically significant!'"

I absolutely agree.

In fact, in my view, the problem is even more serious the other way, when there is *no* statistical significance. Researchers will say, "we found no statistically-significant effect," which basically means, "we don't have enough evidence to say either way." But readers will take that as meaning, "we find at best a very small effect." That's not necessarily the case. Studies often find values that would be very significant in the real world, but reject them because the confidence interval is wide enough to include zero. 
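
To make that concrete, here's a minimal sketch in Python. The numbers (an estimated effect of 5.0 with a standard error of 3.0) are hypothetical, chosen only for illustration, and the normal approximation is an assumption rather than anything taken from a real study.

```python
from statistics import NormalDist

# Hypothetical numbers, for illustration only (not from any real study).
estimate = 5.0    # estimated effect, in whatever units matter in the real world
std_error = 3.0   # standard error of that estimate

# Assumes the estimate is approximately normally distributed.
z_crit = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval
ci_low = estimate - z_crit * std_error
ci_high = estimate + z_crit * std_error

# Two-sided p-value for the null hypothesis that the true effect is zero.
z = estimate / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

includes_zero = ci_low < 0 < ci_high
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), p = {p_value:.3f}")
print("not statistically significant" if includes_zero else "statistically significant")
```

In this made-up case the interval includes zero, so the result gets reported as "not significant." But the same interval also includes values more than twice the size of the estimate. The data can't rule out "no effect," and it can't rule out "huge effect," either.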


Tom Tango will often challenge readers to put aside "inertial reasoning" and consider how we would redesign baseball rules if we were starting from scratch. In that tradition, how would we redo the language of statistical significance?

I actually spent a fair bit of time on this a year or so ago. I went to a bunch of online thesauruses, and wrote down every adjective that had some kind of overlap with "significant." Looking at my list ... I notice I actually didn't include Ellenberg's suggestions, "noticeable" or "detectable." Those are very good candidates. I'll add those now, along with a few of their synonyms.

OK, done. Here's my list of possible candidates:

convincing, decisive, unambiguous, probable cause, suspicious, definite, definitive, adequate, upholdable, qualifying, sufficing, signalling, salient, sufficient, defensible, sustainable, marked, rigorous, determinate, permissible, accreditable, attestable, credentialed, credence-ive, credible, threshold, reliable, presumptive, persuasive, confident, ratifiable, legal, licit, sanctionable, admittable, acknowledgeable, endorsable, affirmative, affirmable, warrantable, conclusive, valid, assertable, clear, ordainable, non-spurious, dependable, veritable, creditable, avowable, vouchable, substantive, noticeable, detectable, perceivable, discernable, observable, appreciable, ascertainable, perceptible

You can probably divide these into classes, based on shades of meaning:

1. Words that mean "enough to be persuasive." Some of those are overkill, some are underkill. "Unambiguous," for instance, would be an obvious oversell; you can have a low p-value and still be pretty ambiguous. On the other hand, "defensible" might be a bit too weak. Maybe "definite" is the best of those, suggesting precision but not necessarily absolute truth.

2. Words that mean "big enough to be observed." Those are the ones that Ellenberg suggested, "noticeable" and "detectable." Those seem fine when you actually find significance, but not so much when you don't. "We find no relationship that is statistically detectable" does seem to imply that there's nothing there, rather than that you just don't have enough data in your sample.

3. Words that mean "enough evidence." That's exactly what we want, except I can't think of any that work. The ones in the list aren't quite right. "Probable cause" is roughly the idea we're going for, but it's awkward and sounds too Bayesian. "Suspicious" has the wrong flavor. "Credential" has a nice ring to it -- as an adjective, not a noun, meaning "to have credence." You could say, for instance, "We didn't have enough evidence to get a credential estimate."  Still a bit awkward, though. "Determinate" is pretty good, but maybe a bit overconfident.

Am I missing some? I tried to think, what's the word we use when we say an accused was acquitted because there wasn't enough evidence? "Insufficient" is the only one I can think of. Everything else is a phrase -- "within a reasonable doubt," or "not meeting the burden of proof."

4. Words that mean "passing an objective level," as in meeting a threshold. Actually, "threshold" as an adjective would be awkward, but workable -- "the coefficient was not statistically threshold." There's also "adequate," and "qualifying," and "sufficient," and "sufficing."

5. Finally, there are words that mean "legal," in the sense of, "now the peer reviewers will permit us to treat the effect as legitimate." Those are words like "sanctionable," "admittable," "acknowledgeable," "permissible," "ratifiable," and so on. My favorite of these is "affirmable." You could write, "The coefficient had a p-value of .06, which falls short of statistical affirmability." The reader now gets the idea that the problem isn't that the effect is small -- but, rather, that there's something else going on that doesn't allow the researcher to "affirm" it as a real effect.

What we'd like is a word that has a flavor matching all these shades of meaning, without giving the wrong idea about any of them. 

So, here's what I think is the best candidate, which I left off the list until now: "dispositive."


"Dispositive" is a legal term that means "sufficient on its own to decide the answer." If a fact is dispositive, it's enough to "dispose" of the question.

Here's a perfect example:

"Whether he blew a .08 or higher on the breathalyzer is dispositive as to whether he will be found guilty of DUI."

It's almost exact, isn't it? .08 for a conviction, .05 for statistical significance.

I think "dispositive" really captures how statistical significance is used in practice -- as an arbitrary standard, a "bright line" between Yes and No. We don't allow authors to argue that their study is so awesome that p=.07 should really be allowed to be considered significant, any more than we allow defendants to argue that they should be acquitted at a blood alcohol level of .09 because they're especially good drivers.

Moreover, the word works right out of the box in its normal English definition. Unlike "significant," the statistical version of "dispositive" has the same meaning as the usual one. If you say to a non-statistician, "the evidence was not statistically dispositive," he'll get the right idea -- that an effect was maybe found, but there's not quite enough there for a decision to be made about whether it's real or not. In effect, the question is not yet decided. 

That's the same as in law. "Not dispositive" means the evidence or argument is a valid one, but it's not enough on its own to decide the case. With further evidence or argument, either side could still win. That's exactly right for statistical studies. A "non-significant" p-value is certainly relevant, but it's not dispositive evidence of presence, and it's not dispositive evidence of absence. 

Another nice feature is that the word still kind of works when you use it to describe the effect or the estimate, rather than the evidence: 

"The coefficient was not statistically dispositive."

It's not a wonderful way to put it, but it's reasonable. Most of the other candidate words don't work well both ways at all -- some are well-suited only to describing the evidence, others only to describing the estimates. These don't really make sense:

"The evidence was not statistically detectable."  
"The effect was not statistically reliable."
"The coefficient was not statistically accreditable."

Another advantage of "dispositive" is that unlike "significant," you can leave out the word "statistical" without ambiguity:

"The evidence was not dispositive."
"The coefficient was not dispositively different from zero."

Those read fine, don't they? I bet they'd almost always read fine. I'd bet that if you were to pick up a random study, and do a global replace of "statistically significant" with "dispositive," the paper wouldn't suffer at all. (It might even be improved, if the change highlighted cases where "significant" was used in ways it shouldn't have been.)


When I'm finally made Global Despotic Emperor of Academic Standards, the change of terminology will be my first official decree.

Unless someone has a better suggestion. 


Friday, February 06, 2015

Rating battery life on a 100-point scale

I've written before about how Consumer Reports (CR) uses a 100-point system for its product ratings. In their latest issue, they use that same system to rate AA batteries, and I suspect the ratings turned out so misleading that CR wound up fooling its own editorial staff!


CR rated 13 brands of alkaline batteries, and two brands of lithium batteries.  In the alkaline category, Kirkland Signature (Costco's house brand) was rated "best buy." It was the third-best alkaline, and, at 27 cents a battery, the least expensive by far. Most of the others were between $0.75 and $1.00 (although they would have been cheaper if CR had priced them in a bulk pack, like the Kirkland).

The two lithium batteries rated the highest of all, but they cost more than $2 each.

Now, suppose I'm not near a Costco, and need batteries. My choice is between the high-rated Duracell alkaline, at $1.20, and the Energizer Ultimate Lithium, at $2.50. Which should I buy?

There's no way to tell from the article. Why? Because all we have is that 100-point scale. That doesn't help much. Why doesn't CR just tell us how long each battery lasted, so we can do our own cost/benefit calculation?

It's not quite that simple, you could argue. Batteries perform differently in "high drain" and "low drain" applications. CR tested both -- it used a flashlight for its low-drain test, and a toy for its high-drain test. Then it combined the two, somehow, to get the rating. But, couldn't they have combined them in such a way that the ratings are roughly proportional to how long the batteries last? 


I found a 2012 "battery showdown", from BitBox, that gives you actual data. Here's their graph of how much power you get from different brands of battery at high-drain (before the voltage drops below 0.8V).  The lithiums are the two at the top, the alkalines are the large cluster in the middle, and the cheap carbon-zinc batteries (which CR didn't test) are the poor performers at the bottom.

Looking at their chart of numbers ... the Energizer Ultimate Lithium, it appears, lasts around 3.1 times as long as the Costco alkaline in high-drain applications. At low-drain, the lithium lasts 1.7 times as long.

That's consistent with what I had previously understood -- that lithium batteries are by far the best, but shine more in high-drain applications than low-drain applications. 
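
With those ratios, the cost/benefit calculation from earlier can at least be roughed out. This is only a sketch: it uses the two prices quoted above and the BitBox ratios, and it assumes the Duracell lasts about as long as the Costco alkaline, which neither source actually says.

```python
# Prices quoted in the post.
price_alkaline = 1.20   # Duracell alkaline
price_lithium = 2.50    # Energizer Ultimate Lithium

# BitBox runtime ratios, lithium relative to the Costco alkaline.
# Assumption: the Duracell lasts about as long as the Costco alkaline.
runtime_ratio = {"high-drain": 3.1, "low-drain": 1.7}

# The lithium is the better buy only if it outlasts the alkaline
# by more than the ratio of the two prices.
break_even = price_lithium / price_alkaline

def cost_per_alkaline_life(ratio):
    """Lithium cost to match the runtime of one alkaline battery."""
    return price_lithium / ratio

for use, ratio in runtime_ratio.items():
    lithium_cost = cost_per_alkaline_life(ratio)
    better = "lithium" if ratio > break_even else "alkaline"
    print(f"{use}: lithium costs ${lithium_cost:.2f} per alkaline-lifetime "
          f"vs ${price_alkaline:.2f}; better buy: {better}")
```

Under that assumption, the break-even point is a bit over 2x. The lithium clears it in high-drain devices and falls short in low-drain ones, which is the kind of answer a 100-point scale can't give you.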

Strangely, the CR chart might lead you to expect exactly the opposite! CR rated the lithium batteries "excellent" (their maximum rating) in both applications. That "tied" eight of the 13 alkalines in the high-drain test, but only one in the low-drain test. Based on those ratings, a reader would be forgiven for concluding that lithium batteries give you more leverage in low-drain uses. (In fairness, the text of the article does give the correct advice, although it doesn't explain why the chart seems to imply otherwise.)


Anyway, combining the two factors, 3.1 and 1.7, we might choose to conclude that the lithiums last maybe two-and-a-half times as long as the alkalines.

But CR's ratings give no clue that the difference is that large. All they tell us is that the lithium grades a 96/100, and the Costco alkaline grades an 84/100. In other words: CR gives the lithium 14% more points for 150% more performance. 

Which, I guess, has to be the case, given the rating system. If you give the lithium a perfect score of 100, you'd have to give the alkalines 40 or less. And they can't do that, since, to CR, 40/100 can only mean "poor."
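
The arithmetic behind those figures, as a quick sketch in Python. The 2.5x ratio is my rough combined estimate from above, and the "proportional" scale is hypothetical, not anything CR uses.

```python
lithium_score = 96    # CR rating, Energizer Ultimate Lithium
alkaline_score = 84   # CR rating, Costco (Kirkland) alkaline
runtime_ratio = 2.5   # rough combined estimate, lithium vs. alkaline

extra_points = lithium_score / alkaline_score - 1   # fraction more points
extra_runtime = runtime_ratio - 1                   # fraction more runtime

# Hypothetical alternative: a scale proportional to runtime,
# with the lithium pinned at a perfect 100.
proportional_alkaline_score = 100 / runtime_ratio

print(f"{extra_points:.0%} more points for {extra_runtime:.0%} more performance")
print(f"proportional alkaline score: {proportional_alkaline_score:.0f}/100")
```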


The article goes on to say, 

"The top-scoring [91] alkaline battery -- Duracell Quantum -- was not significantly different from the high-scoring [94 and 96] lithium models ..."

That, I believe, is just plain false. A quick Google search of bloggers who tested the Quantums suggests that, at best, they're a bit better than other alkalines, but nowhere near as good as lithiums. So, CR winds up telling us that lasting twice as long does not make a battery "significantly different."

What happened? 

Well, it might have been a misapplication of the normal criteria for "significantly different."  In their longer ratings articles, CR includes a disclaimer in their ratings: "differences of fewer than X points aren't meaningful."  

For the batteries ... sure, lower in the rankings, five points isn't significant. I fully agree that the Rayovac at 78/100 isn't significantly different from the CVS at 82/100. But it's absolutely not true that the Quantum at 91/100 is anywhere near as good as the lithium at 94/100. The rating system might work in the middle, but it fails at the top.

That's how, I think, CR wound up fooling itself. The writers looked at the ratings, and thought, "hey, it's only three points!"
