Sabermetric Research: When a null hypothesis makes no sense

In criminal court, you're "innocent until proven guilty." In statistical studies, it's "null hypothesis until proven significant."

The null hypothesis, generally, is the position that what you're looking for isn't actually there. If you're trying to prove that early-childhood education leads to success in adulthood, the default position is "we're going to assume it doesn't until evidence proves otherwise."

Why do we make "no" the null? It's because, most times, there really IS nothing there. Pick a random thing and a random life outcome: shirts, marriage. Is there a relationship between shirt color and how happy a marriage you'll have? Probably not. So "not" becomes the null hypothesis.

Carl Sagan famously said, "extraordinary claims require extraordinary evidence." And, in a world where most things are unrelated, "my drug shrinks tumors" is indeed an extraordinary claim.

The null hypothesis is the one that's the LEAST extraordinary -- the one that's most likely, in some common-sense way. "Randomness caused it until proven otherwise," not "Fairies caused it until proven otherwise."

In studies, authors usually gloss over that, and just use the convention that the null is always "zero". They'll say, "the difference between the treatment and control groups is not statistically-significantly different from zero, so we do not reject the hypothesis that the drug is of no benefit."

-------

But, "zero" isn't always the least extraordinary claim.

I believe that teams with a 1-0 lead in hockey games get overconfident and wind up doing worse than expected. So, I convince Gary Bettman to randomly pick a treatment group of teams, and give them a free goal to start the first game of their season. At the end of the year, I compare goal differential between the treatment and control groups.

Which should my null hypothesis be?

-- The treatment has an effect of 0
-- The treatment has an effect of +1

Obviously, it's the second one. The first one, even though it includes the typical "zero," is, nonetheless, an extraordinary claim: that you give one group a one-goal advantage, but by the end of the year, that advantage has disappeared. Instead of saying "innocent until proven guilty," you're saying, "one goal guilty unless proven otherwise." But that's hidden, because you use the word "zero" instead of "one goal guilty."

If you use 0 instead of +1, you're effectively making your hypothesis the default, by stealth.

(In this case, the null should be +1 ... in real life, the researcher would probably keep the same null, but also transform the model to put the conventional "0" back in. Instead of Y = b(treatment dummy), they'll use the model Y = (b+1)(treatment dummy), so that b=0 now means "no effect other than the obvious extra goal".)
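
Here's a rough sketch of how much the choice matters. The numbers are made up (an observed difference of 0.4 goals, with a standard error of 0.5), and the helper function is just for illustration. It runs the same estimate against a null of 0 and a null of +1, and then against the transformed model where b=0 means "nothing beyond the free goal."

```python
from statistics import NormalDist

def two_sided_p(estimate, null_value, std_err):
    """Two-sided p-value for a z-test of `estimate` against `null_value`."""
    z = (estimate - null_value) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers, not taken from any real experiment.
observed_diff = 0.4   # treatment minus control, in goals
std_err = 0.5         # standard error of that difference

# Conventional null: the free goal has vanished by the end of the year.
p_null_zero = two_sided_p(observed_diff, 0.0, std_err)

# Least-extraordinary null: the free goal is still there, nothing else changed.
p_null_plus_one = two_sided_p(observed_diff, 1.0, std_err)

# Transformed model Y = (b+1)(treatment dummy): test b = effect - 1 against 0.
p_reparam = two_sided_p(observed_diff - 1.0, 0.0, std_err)

print(f"p against null of 0:  {p_null_zero:.3f}")
print(f"p against null of +1: {p_null_plus_one:.3f}")
print(f"p for b=0 in the transformed model: {p_reparam:.3f}")
```

With these invented numbers, neither null gets rejected at the usual 5% level. So whichever one you labeled "the null" is the one left standing, which is the stealth part. And the transformed model gives the same p-value as testing directly against +1, since it's the same test with the zero moved.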

What that shows is: it's not enough that you use "0". You have to make an argument about whether your zero is an appropriate null hypothesis for your model. If you choose the right model, and the peer reviewers don't notice, you can "privilege your hypothesis" by making "zero" represent anything you like.

But that's actually not my main point here.

------

A while ago, I saw a blog post where an economist ran a regression to predict wins from salaries, for a certain sport. The coefficient was not statistically-significantly different from zero, so the author declared that we can't reject the null hypothesis that team payroll is unrelated to team performance.

But, in this case, "salary has an effect of zero" is not a reasonable null hypothesis. Why? Because we have strong, common-sense knowledge that salary DOES sometimes have an effect.

That knowledge is: we all know that better free-agent players get paid higher salaries. If you don't believe that's the case -- if you don't believe that LeBron James will earn more than a bench player next season -- you are excused from this argument. But, the economist who did that regression certainly believes it.

In light of that, "zero" is no longer the likeliest, least extraordinary, possibility, so it doesn't qualify as a null.

That doesn't mean it can't still be the right answer. It could indeed turn out that the relationship between salary and wins is truly 0.00000. For that to happen, it would have to be that other factors exactly cancel out the LeBron factor.

Suppose every million dollars more you spend on LeBron James gives you 0.33446 extra wins, on average (from a God's-eye view). In that case, if you use "zero" in your null hypothesis, it's exactly equivalent to this alternative:

"For every million dollars less you spend on LeBron, you just happen to get exactly 0.33446 extra wins from other players."

Well, that's completely arbitrary! Why would 0.33446 be more likely than 0.33445, or any other number? There's no reason to believe that 0.33446 is "least extraordinary." And so there's no reason to believe that the original "zero" is least extraordinary.

Moreover, if you use a null hypothesis of zero, you're contradicting yourself, because you're insisting on two contradictory things:

(1) Players who sign for a lot more money, like LeBron, are generally much better players.

(2) We do not reject the assumption that the amount of money a team pays is irrelevant to how good it is.

You can believe either one of these, but not both.

-----

It used to be conventional wisdom that women over 40 should get a mammogram every year. The annual scan, it was assumed, would help discover cancer earlier, and lead to better outcomes.

Recent studies, though, dispute that conclusion. They say that there is no evidence that there's any difference in cancer survival or diagnosis rates for women who get the procedure and women who don't.

Well, same problem: the null of "no difference" is an arbitrary one. It's the same argument as in the salary case:

Imagine two identical women, with the same cancer. One of them gets a mammogram, the cancer is discovered, and she starts treatment. Another one doesn't get the mammogram, and the cancer isn't discovered until later.

Obviously, the diagnosis MUST make a difference in the expected outcomes for these two patients. Nobody believes that whether you get treatment early or late makes NO difference, right? Otherwise, doctors would just shrug and ignore the mammogram.

But, the null hypothesis of "zero difference" suggests that, when you add in all the other women, the expected overall survival rates should be *exactly the same*.

That's an extraordinary claim. Sure it's *possible* that the years of life lost by the undiagnosed cancer are exactly offset by the years lost from the unnecessary treatment from false positives after a mammogram. Like, for instance, the 34 cancer patients who didn't get the mammogram each lose 8.443 years off their lives, and the 45 false-positives each lose 6.379 years, and if you work it out, it comes to exactly zero.

"We can't reject that there is no difference" is exactly as arbitrary as "We can't reject that the difference is the cosine of 1.2345".

Unless, of course, you have an argument about how zero is a special case. If you DID want to argue that cancer treatment is completely useless, then, certainly, your zero null would be appropriate.

------

"Zero" works well as a null hypothesis when it's most plausible that there's nothing there at all, when it's quite possible that there isn't any trace of a relationship. It's inappropriate otherwise: when there's SOME evidence of SOME real relationship, SOME of the time.

In other words, zero works when it's a synonym for "there's no relationship at all." It doesn't work when it's a synonym for, "the relationship is so small that it might as well be zero."

The null hypothesis works as a defense against the placebo effect. It does not work as a defense against actual effects that happen to be close to zero.

But, isn't it almost the same thing? Isn't it just splitting hairs?

No, not at all. It's an important distinction.

There are two different questions you may want a study to answer. First: is there actually a relationship there? And, second, if there is a relationship there, how big is it?

The traditional approach is: if you don't get statistical significance, you're considered to have not proved it's really there -- and, therefore, you're not allowed to talk about how big it might be. You have to stop dead.

But, in the case of the mammogram studies, you shouldn't have to prove it's really there. Under any reasonable assumptions a researcher might have about mammograms and cancer, there MUST be an effect. Whether the observed size is bigger or smaller than twice the SD -- which is the criterion for "existence" -- is completely irrelevant. You already know that an effect must exist.

If you demand statistical proof of existence when you already know it's there, you're choosing to ignore perfectly good information, and you're misleading yourself.

That's what happened in the Oregon Medicaid study. It found that Medicaid coverage was associated with "clinically significant" improvements in reducing hypertension. But they ignored those improvements, because there wasn't enough data to constitute sufficient evidence -- evidence that there actually is a relationship between having free doctor visits and having your hypertension treated.

But that's silly. We KNOW that people behave differently when they have Medicaid than when they don't. That's why they want it, so they can see doctors more and pay less. There MUST be actual differences in the two groups. We just don't know how large.

But, because the authors of the study chose to pretend that existence was in doubt, they threw away perfectly good evidence. Imprecise evidence, certainly -- the confidence interval was very wide. But imprecision was not the problem. If the point estimate had been just an SD higher than it was, they would have accepted it at face value, imprecision be damned.
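
To see why imprecision can't be the real objection, here's a small sketch with invented numbers (they are NOT the Oregon study's figures): a point estimate of 1.2 with a standard error of 1.0, and then the same estimate moved up by exactly one SD. The `summarize` helper is hypothetical, and it uses the ordinary normal-approximation 95% interval.

```python
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # about 1.96, the usual "twice the SD" cutoff

def summarize(estimate, std_err):
    """Normal-approximation 95% interval, plus whether it excludes zero."""
    low = estimate - Z_95 * std_err
    high = estimate + Z_95 * std_err
    return {
        "estimate": estimate,
        "ci": (low, high),
        "width": high - low,
        "significant": not (low <= 0.0 <= high),
    }

# Hypothetical numbers, for illustration only.
observed = summarize(1.2, 1.0)        # what the study found
one_sd_higher = summarize(2.2, 1.0)   # same study, estimate one SD higher

for label, result in (("observed", observed), ("one SD higher", one_sd_higher)):
    low, high = result["ci"]
    print(f"{label}: estimate={result['estimate']:.1f}, "
          f"CI=({low:.2f}, {high:.2f}), significant={result['significant']}")
```

The two intervals are exactly the same width. The only thing that changes is which side of the significance line the estimate falls on, and that's the thing that decides whether the result gets discussed or thrown away.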

-------

One last analogy:

The FDA has 100 untested pills that drugmakers say treat cancer. The FDA doesn't know anything about them. However, God knows, and He tells you.

It turns out 96 of the 100 compounds don't work at all -- they have no effect on cancer whatsoever, no more than sugar pills. The other four do something. They may help cancer, or they may hurt it, and all to different degrees. (Some of the four may even have an effect size of zero -- but if that's the case, they actually do something to the cancer, but the good things they do are balanced out by the bad things.)

You test one of the pills. The result is clinically significant, but only 0.6 SD from zero, not nearly strong enough to be statistically significant. It's reasonable for you to say, "well, I'm not even going to look at the magnitude of the effect, because, it's likely that it's just random noise from one of the 96 sugar pills."

You test another one of the pills, and get the same result. But this time, God pops His head into the lab and says, "By the way, that drug is one of the four that actually do something!"

This time, the size of the effect matters, doesn't it? You'd look like a fool to refuse to consider the evidence, given that you now know the pill is doing *something* to the cancer.
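
You can put rough numbers on the two situations. This is only a sketch, and it needs an assumption the story doesn't supply: that the real pills' true effects, measured in SDs, are spread normally around zero with a spread of `tau` (I've picked 2, arbitrarily). The function name and parameters are mine. Given the observed 0.6 SD result, it works out the chance the pill is a real one, and the effect you should expect on average.

```python
from statistics import NormalDist

def expected_effect(z_obs, prior_real, tau):
    """Return (probability the pill is real, expected true effect in SDs).

    Assumes: sugar pills have a true effect of exactly 0; real pills have
    true effects drawn from Normal(0, tau); the observed result is the
    true effect plus Normal(0, 1) noise. `tau` is a made-up assumption.
    """
    like_sugar = NormalDist(0.0, 1.0).pdf(z_obs)
    like_real = NormalDist(0.0, (1.0 + tau**2) ** 0.5).pdf(z_obs)
    post_real = prior_real * like_real / (
        prior_real * like_real + (1.0 - prior_real) * like_sugar
    )
    # If the pill is real, the best guess shrinks the observation toward 0.
    shrunk = z_obs * tau**2 / (1.0 + tau**2)
    return post_real, post_real * shrunk

# First pill: all you know is that 4 of the 100 pills do something.
first_post, first_effect = expected_effect(0.6, prior_real=0.04, tau=2.0)

# Second pill: God has told you it's one of the four.
second_post, second_effect = expected_effect(0.6, prior_real=1.0, tau=2.0)

print(f"first pill:  P(real)={first_post:.3f}, expected effect={first_effect:.3f} SD")
print(f"second pill: P(real)={second_post:.3f}, expected effect={second_effect:.3f} SD")
```

Under these assumptions, the expected effect for the first pill comes out very close to zero, so ignoring the magnitude costs you almost nothing. For the second pill, the same 0.6 SD observation translates into an expected effect that's most of the observed size. Same data, different prior knowledge, very different conclusion.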

Well, that's what happens in real life. God has been telling us -- through common sense and observation -- that better players cost more money, that patients on Medicaid get more attention from doctors, and that patients with a positive mammogram get treated earlier.

Clinging to a null hypothesis of "no real effect," when we know that null hypothesis is false, makes no sense at all.

Labels: medicine, null hypothesis, significance, statistics

3 Comments:

At Thursday, July 03, 2014 4:20:00 PM, Alex said...: I agree with you in general Phil, but I think this leads to the same issue I have with Bayesian tests - you have to pick a prior. In this case, you have to pick a non-zero null. What correlation would you expect to see between salary and wins? You have to pick a number that isn't 0, and then you have to justify it to everyone (on top of justifying all the other things in an analysis that you always have to justify).

The whole argument can be avoided by just talking in effect sizes (which a correlation already is!), but that viewpoint is slow to make ground.

At Thursday, July 03, 2014 4:31:00 PM, Phil Birnbaum said...: Agreed, a null makes no sense in these cases. That's probably part of the issue, that it's hard to justify why you don't have a null, and what the results mean. But you know how things work with refereed journals much better than I do.

At Saturday, July 12, 2014 9:43:00 AM, Eric C. said...: The deal is, from how I see it, that you choose a null of no effect because whatever common sense may seem to tell you, the point of research is to test our assumptions, not assume them.

In the Oregon medical study: does access to affordable medical care mean that cases of hypertension are diagnosed and treated more frequently, improving outcomes? Seems reasonable. But maybe diagnosis and drug treatment, without patient education, mean patients think they can eat whatever they want as long as they take their medicine, so there is in fact no improvement. Hunh. I guess that could be. Or maybe treatment of hypertension via statins (the main drugs available) isn't actually that effective, and patients in this population are much less willing, or able, to comply with recommendations about diet and exercise than patients in some other population. Well, that could be. The point is that we *don't* know, and the reason we do the study is to tell whether we have evidence that two things are associated (Medicaid and better blood pressure outcomes) and to tell how sure we are that our evidence is due to some real result, rather than due to chance.

Here's a test: if you're willing to accept a result that is not statistically significant (at 95% confidence, or whatever), would you be willing to accept that result if it had gone the other way? If not, why bother with the study -- I exaggerate here, but in that case, you're acting as if you already know the answer.

If you find that more mammograms are *not* statistically significantly associated with better breast cancer outcomes at 95% confidence, you aren't then constrained to stop talking about the size of the effect. In fact, you usually talk about a confidence interval: "We are 95% sure that the effect is within this range"[*], and if that range includes zero, then you also have to say that you cannot be 95% confident that the effect is not in fact zero, but you are also saying that "Our best point estimate of the effect is x%, but don't read too much into that because. . . ."

Back to your main point: But, in this case, "salary has an effect of zero" is not a reasonable null hypothesis. Why? Because we have strong, common-sense knowledge that salary DOES sometimes have an effect. That's where I disagree. You may think that and I may think that, but the point of statistics is to help us not fool ourselves.

[*] I'm glossing over what a confidence interval really is, but this comment is much too long already.


Wednesday, July 02, 2014
