Tuesday, June 05, 2012

Privileging the null hypothesis

The fallacy of "privileging the hypothesis" occurs when you concentrate on one particular possibility, without having any logically correct reason for preferring it to all the others.

The general idea is: if you want to put forth a hypothesis for consideration, it's incumbent on you to explain why that particular hypothesis is worthy of attention, and not any other ones.  The "Less Wrong" website gives an example:

"Privileging the hypothesis" is the fallacy committed by a creationist who points out a purported flaw in standard evolutionary theory, and brings forward the Trinity to fill the gap - rather than a billion other deities or a trillion other naturalistic hypotheses. Actually, without evidence already in hand that points to the Trinity specifically, one cannot justify raising that particular hypothesis to explicit attention rather than a trillion others.

I think this is right.  And, therefore, I think that many statistical studies -- the traditional, frequentist, academic kind -- are committing this fallacy on a regular basis.

Specifically: most of these studies test whether a certain variable is significantly different from zero.  If it isn't, they assume that zero is the correct value.

That's privileging the "zero" hypothesis the same way the example privileges the "Trinity" hypothesis, isn't it?  The study comes up with a confidence interval, which includes zero, but, by definition, includes an uncountable infinity of other values.  And then it says, "since zero is possible, that's what we're going to assume."

That's not the right thing to do -- unless you can explain why the "zero" hypothesis is particularly worthy.

Oftentimes, it obviously IS worthy.  Carl Sagan wrote that "extraordinary claims require extraordinary evidence."  If a non-zero coefficient is an extraordinary claim, then, of course, there's no problem.

For instance, suppose a subject in an ESP study gets 3 percent more correct guesses than you'd expect by chance.  That's not significantly different from zero percent.  In that case, you're absolutely justified in assuming the real effect is zero.

"ESP exists" is an extraordinary claim.  A p-value of, say, .25 is not extraordinary evidence.  So, you're not "privileging the hypothesis," in the sense of giving it undeserved consideration.  You do have a logically correct reason for preferring it.

------

But ... that's not always the case.  Suppose you want to test whether scouts can effectively evaluate NHL draft prospects.  So you find 50 scouts, randomly choose two prospects, and ask each scout which one is more likely to succeed in the NHL.  If scouting were random, you'd expect 25 out of the 50 scouts to be correct.  Suppose it turns out that 27 are correct, which, again, isn't statistically significant.
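To put a number on "isn't statistically significant," here's a minimal Python sketch of an exact two-sided binomial test for 27 correct out of 50. The function names are mine, and the two-sided rule used here (add up every outcome that is no more likely than the observed one) is one common convention, not the only one.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k_obs, n, p_null):
    """Exact two-sided p-value: total probability of every outcome
    that is no more likely than the observed one under the null."""
    observed = binom_pmf(k_obs, n, p_null)
    # Small relative tolerance so exact ties survive float rounding.
    cutoff = observed * (1 + 1e-9)
    return sum(binom_pmf(k, n, p_null) for k in range(n + 1)
               if binom_pmf(k, n, p_null) <= cutoff)

# 27 of 50 scouts correct, null hypothesis "scouts are coin flips"
p_value = two_sided_p_value(27, 50, 0.5)
print(f"p-value for 27 of 50 under the 50 percent null: {p_value:.2f}")
```

The p-value comes out far above the usual .05 cutoff, so the study fails to reject "no better than chance."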

Should you now conclude that scouts' picks are no better than random chance -- that scouts are worthless?

I don't think you should.

Because, why not start with a different null hypothesis, one that says that scouts are always 54.3 percent right?  If you do that, you'll again fail to find statistical significance.  Then, just like in the "zero" case, you say, "there's no evidence that 54.3 percent is wrong, so we will assume it's right."
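You can check this by running the same kind of test against a whole range of candidate nulls, not just 50 percent. The sketch below is only an illustration: the grid of candidates (30.0 to 80.0 percent, in steps of a tenth of a point), the .05 cutoff, and the function names are my assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k_obs, n, p_null):
    """Exact two-sided p-value: total probability of every outcome
    that is no more likely than the observed one under the null."""
    cutoff = binom_pmf(k_obs, n, p_null) * (1 + 1e-9)
    return sum(binom_pmf(k, n, p_null) for k in range(n + 1)
               if binom_pmf(k, n, p_null) <= cutoff)

n, correct, alpha = 50, 27, 0.05

# Candidate nulls in tenths of a percent (300 means 30.0 percent).
survivors = [m for m in range(300, 801)
             if two_sided_p_value(correct, n, m / 1000) > alpha]

print(f"nulls not rejected: roughly {min(survivors) / 10:.1f} "
      f"to {max(survivors) / 10:.1f} percent")
print("50.0 percent survives:", 500 in survivors)
print("54.3 percent survives:", 543 in survivors)
```

Both 50.0 and 54.3 percent survive, along with everything in a wide band around them. The test itself gives no reason to single out one of them.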

That second one sounds silly, doesn't it?  It's obvious that a null hypothesis "scouts are good for exactly 4.3 percent" is arbitrary.  But, "Scouts are no good at all" seems ... better, somehow.

Why should we favor one over the other?  Specifically: why do we judge that this null hypothesis is good, but that other null hypothesis is bad?

It's not just the number zero.  Because, obviously, we can easily set up this study so that the null hypothesis is 50 (percent), or 25 (out of fifty), and we'll still think that's a better hypothesis than 54.3 percent.

Also, you can set up any hypothesis you want, to make the null zero.  Suppose I want to "prove" that third basemen live exactly 6.3 percent longer than second basemen.  I say, "John Smith believes third basemen live 6.3 percent longer.  So I built that into the model, and added another parameter for how much John is off by.  And, look, the other parameter isn't significantly different from zero.  So, while others might suggest that the other parameter should be negative 6.3 percent, there's no proof of that.  So we should assume that it's zero, and therefore that third basemen live 6.3 percent longer than second basemen."

That should make us very uncomfortable.

So if it's not the *number* zero, is it, perhaps, the hypothesis of zero *effect*?  That is, the hypothesis that a certain variable doesn't matter, regardless of whether we represent that with a zero or not.

I don't think that's it either.  Suppose I find a bunch of random people, and use regression to predict the amount of money in their pocket based on the number of nickels, dimes, and quarters they're carrying.  And the estimate for nickels works out to 4 cents, but with an SD of 5 cents, so it's not statistically significantly different from zero.

In this case, nobody would assume the zero is true.  Nobody would say, "nickels do not appear to influence the amount of money someone is carrying."

It would be obvious that, in this case, the null hypothesis of "zero effect" isn't appropriate.
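Here's the nickel arithmetic as a small Python sketch. It treats the "SD of 5 cents" as the standard error of the estimate and uses a normal approximation; both are my assumptions, as are the variable names.

```python
from math import erf, sqrt

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic, normal approximation."""
    upper_tail = 0.5 * (1 - erf(abs(z) / sqrt(2)))
    return 2 * upper_tail

estimate_cents = 4.0    # regression estimate for one nickel
std_error_cents = 5.0   # the "SD of 5 cents," read as a standard error

p_values = {}
for null_cents in (0.0, 5.0):   # "nickels are worthless" vs. face value
    z = (estimate_cents - null_cents) / std_error_cents
    p_values[null_cents] = two_sided_p_from_z(z)
    print(f"null = {null_cents:.0f} cents: z = {z:+.1f}, "
          f"p = {p_values[null_cents]:.2f}")
```

Neither null is rejected, but the estimate sits much closer to five cents than to zero. Nothing in the test tells you to prefer zero; only what you already know about nickels does.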

-----

So what is it?  Well, as I've argued before, it's common sense.  The null hypothesis has to be the one that human experience thinks is very much more likely to be true.  And that's often zero.

If you're testing a possible cancer drug, chances are it's not going to work; even after research, there are hundreds of useless compounds for every useful one.  So, the chance that this one will work is small, and zero is reasonable.

If people had ESP, we'd see a lot of quadrillionaires in the world, so common sense suggests a high probability that ESP is zero.

But what about scouting?  Why does it seem OK to use the null hypothesis of zero?

Perhaps it's because zero just seems like it should be more likely than any other single value.  It might still be a longshot, that scouts don't know what they're doing -- maybe you consider it a 5 percent chance.  That's still higher than the chance that scouts are worth exactly 2.156833 percentage points.  Zero is the value that's both more likely, and less arbitrary.
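One rough way to see what that 5 percent does is to combine it with the 27-of-50 result. The sketch below puts 5 percent of the prior on "scouts are exactly coin flips" and spreads the other 95 percent evenly over success rates between 50 and 100 percent. That even spread is purely my assumption for illustration, and the answer is sensitive to it.

```python
from math import comb

def likelihood(p, correct=27, n=50):
    """Probability of 27 correct out of 50 if the true rate is p."""
    return comb(n, correct) * p**correct * (1 - p)**(n - correct)

prior_zero = 0.05                 # the "5 percent chance" from the text
slab_low, slab_high = 0.5, 1.0    # assumed range if scouts have value

# Average likelihood over the assumed range (midpoint rule).
steps = 10_000
width = (slab_high - slab_low) / steps
avg_likelihood_value = sum(
    likelihood(slab_low + (i + 0.5) * width) for i in range(steps)
) / steps

# Bayes' rule with two competing stories: "zero" vs. "some value."
weight_zero = prior_zero * likelihood(0.5)
weight_value = (1 - prior_zero) * avg_likelihood_value
posterior_zero = weight_zero / (weight_zero + weight_value)

print(f"probability scouts add nothing, after the data: {posterior_zero:.2f}")
```

Under these assumptions, the probability of "zero" rises from its starting 5 percent but stays well short of even odds. A non-significant result doesn't turn a longshot into the default.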

But ... still, it depends on what you think is common sense, which is based on your experience.  If you're an economist who's just finished reading reams of studies that show nobody can beat the stock market, you might think it reasonable that maybe scouts can't evaluate players very well either.

On the other hand, if you're an NHL executive, you feel, from experience, that you absolutely *know* that scouts can often see through the statistics and tell who's better.  To you, the null hypothesis that scouts are worth zero will seem as absurd as the null hypothesis that nickels are worth zero.

What happens, then, when a study comes up with a confidence interval of, say, (-5 percent, 30 percent)?  Well, if the null hypothesis were zero, the researcher might say, "scouts do not appear to have any effect."  And the GM will say, "That's silly.  You should have used the null hypothesis of around 10 percent, which all us GMs believe from experience and common sense.  Your confidence interval actually fails to reject our null hypothesis too."
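The GM's complaint can be checked directly from the interval. This sketch assumes (-5, 30) is a symmetric 95 percent interval built from a normal approximation, which the example doesn't actually specify; the variable names are mine.

```python
ci_low, ci_high = -5.0, 30.0   # the interval from the example, in percent
z_95 = 1.96                    # assumes a symmetric 95 percent normal interval

# Back out the point estimate and standard error the interval implies.
point_estimate = (ci_low + ci_high) / 2
std_error = (ci_high - ci_low) / (2 * z_95)

distance_in_se = {}
for label, null in (("zero", 0.0), ("ten", 10.0)):
    distance_in_se[label] = abs(point_estimate - null) / std_error
    inside = ci_low <= null <= ci_high
    print(f"null = {null:>4.1f} percent: inside interval = {inside}, "
          f"{distance_in_se[label]:.2f} standard errors from the estimate")
```

Both nulls sit inside the interval, and the GM's 10 percent is closer to the point estimate than the researcher's zero is.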

Which is another reason I say: you have to make an argument.  You can't just formulaically decide to use zero, and ignore common sense.  Rather, you have to argue that zero is an appropriate default hypothesis -- that, in your study, zero has *earned* its status.

But ... for scouting, I don't think you can do that.  There are strong, plausible reasons to assume that scouts have value.  If you ignore that, and insist on a null hypothesis of zero, you're begging the question.  You're committing the fallacy of privileging your null hypothesis.

At Tuesday, June 05, 2012 1:57:00 PM,  mettle said...

You're arguing against a straw man.
A set of data can only reject a null hypothesis or fail to reject it.
People who suggest otherwise are bad at science.

At Tuesday, June 05, 2012 2:02:00 PM,  Phil Birnbaum said...

Fair enough. But many studies that fail to reject the null hypothesis immediately go on to assume -- implicitly or explicitly -- that it's true.

At Tuesday, June 05, 2012 2:29:00 PM,  Andy said...

Seems like the Bayesian approach doesn't suffer from this problem. You still need a prior but it's far less problematic to pick 0 since you won't ignore weak effects.

At Tuesday, June 05, 2012 2:59:00 PM,  mettle said...

If you do know of any, please point them out, because I'd certainly be inclined to write a firmly-worded letter to the editor.
I mean, this is the old absolutely-no-control vs. little-or-possibly-no-control debate wrt BABIP... No one who actually does the research says, "none at all", it's just the faulty repeaters and reporters who haven't actually touched any data.

At Tuesday, June 05, 2012 4:15:00 PM,  Alex said...

I think a lot of this really depends on how you come into an experiment/analysis. In my work, I might have a question like 'does condition X help someone remember better?'. Depending on the literature, I might expect it to help or not, but I rarely if ever have a prediction as to exactly how much it will help. But I have to test something, so functionally I say, 'I predict condition X has no effect on memory': I set the null to 0. You put the burden on the data to demonstrate that something has an effect.

If you don't get an effect (or get 0 in the confidence interval), it would be bad form to say that there's *really* no effect. But you would reasonably be able to say that the effect is small at best. And if an effect is small, how interesting/important could it be?

Aside from a question of importance, we have to return to the question of who gets to set the prior. Why is the GM opinion that scouts are at 55% better than the economist's that they're at 50%?

At Tuesday, June 05, 2012 5:27:00 PM,  Phil Birnbaum said...

Mettle,

I mention a couple of "it's not significant so we'll treat it as zero" studies here.

At Tuesday, June 05, 2012 5:30:00 PM,  Phil Birnbaum said...

Alex,

If you expect it to help, but it comes out not significantly different from zero, and you say "if it exists it's probably small," that's fine! (Assuming it's small for all plausible values.)

What I'm objecting to is when there is reason to expect it to help, and you get approximately that magnitude, but it's not significant and you say, "there appears to be no relationship."

At Tuesday, June 05, 2012 6:32:00 PM,  mettle said...

You are so very correct for calling them out. That's just bad broken awful science. You cannot conclude anything from a non-significant result.

I like your dollars and cents example - another possibility is that the study was done wrong, too.

At Tuesday, June 05, 2012 8:34:00 PM,  Don Coffin said...

This is why it's useful to start with a theory (or a plausible story, if you will) of what the effect of something is. In economics, for example, there's a theory about international trade which says that a particular coefficient should be 1. So the hypothesis in testing that theory is that what you find is not significantly different from 1.

The question is, really, "What is your plausible story (theory) and what value for an effect does it suggest?"

The null hypothesis is *not* zero effect; it's what we're looking for a difference from. And only some theory of the thing we are investigating can give us that.

Suppose we have a theory that says that scouting is better than chance. So when asking scouts to assess two prospects, "chance" is 50% picking each prospect, and "better than chance" is that more than half will correctly identify the better prospect.

But that is a fairly weak theory, isn't it? We might, in fact, find that 52% of the scouts do correctly identify the better prospect, and that 2% differential might in fact be *statistically* significant. But is it far enough from no effect that you're willing to pay for it? That is, is it significant in a decision-making sense? Maybe, when you factor in the expense of scouting, you conclude that scouting has to be 20% better than chance to be worthwhile. THEN, your null hypothesis uses 60% success for scouts as the null (and, by the way, you use a one-tailed test, not a two-tailed test, because you only care about positive differences)--scouts have to do better than 60% to be "worth it."

Or take regression studies of the effect of stolen base attempts on scoring. I have yet to see one in which the coefficient of SBA is (statistically) significantly different from zero. But I also have yet to see a study in which the confidence interval does not include some part of the positive range. So would we take the fact that the coefficient of SBA is not significantly different from +0.02 (i.e., each SBA adds an amount to runs scored that is not significantly different from +0.020) as evidence that a lot of SBA is a good idea?

Get the null hypothesis right--and the null is not, again, necessarily a zero effect--then worry about the statistics.

At Wednesday, June 06, 2012 11:59:00 PM,  Chris Phillips said...

Re: Third basemen living 6% longer than second basemen. The null hypothesis is an industry standard, a commonly known, or widely recognized amount, so Null Hyp: μ = Second basemen average lifespan. The alternate hypothesis is what those conducting the experiment secretly hope to prove. Alt Hyp: μ > Second basemen average lifespan. If the data in this one-tailed test is such that we are compelled to reject the null hypothesis, we conclude the alternate hypothesis has been proven. A confidence interval can then be constructed, hopefully with a large enough sample size that will pin down a range that is satisfyingly small.

Re: Scouts having value (or not). My recommendation would be to pick multiple pairs of players to avoid the possibility that scouts may be wrong on 1 player and to lessen the possibility that injuries will play a major factor. I would also recommend having as great of a number of scouts’ picks as possible. Then wait 10 or 15 years to determine which player in each pair did in fact have the better career. Of course, by then, the population of scouts will in all likelihood be dramatically different from the one sampled. And probably by 2027, the Birnbaumian Institute of Sabermetrics will have long since decided this issue. Hey, it could happen.