Wednesday, June 18, 2014

Absence of evidence: the Oregon Medicaid study

In 2008, the State of Oregon created a randomized, controlled experiment to study the effect of Medicaid on health. For the randomization, they held a lottery to choose which 10,000 of the 90,000 applicants would receive coverage. Over the following years, researchers were able to compare the covered and uncovered groups, to check for differences in subsequent health outcomes, and to determine whether the covered individuals had more conditions diagnosed and treated.

Last month brought the publication of another study. In 2006, the state of Massachusetts had instituted health care reform, and this new study compared covered individuals pre- and post-reform, within MA and compared to other states.

The two states' results appeared to differ. In Oregon, the studies found improvement in some health measures, but no statistically-significant change in most others. On the other hand, the Massachusetts study found large benefits across the board.

Why the differences? It turns out the Oregon study had a much smaller dataset, so it was unable to find statistical significance in most of the outcomes it studied. Austin Frakt, of "The Incidental Economist," massaged the results in the Oregon study to make them comparable to the Massachusetts study. Here's his diagram comparing the two confidence intervals:




The OR study's interval is ten times as wide as the MA study's! So, obviously, there's no way it would have been able to find statistical significance for the size of the effect the MA study found.
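To see why the width matters so much, here's a rough sketch of the arithmetic. The numbers are made up for illustration (they are NOT the actual Oregon or Massachusetts estimates), and the function name is just something I chose: take the same point estimate, give one study a standard error ten times the other's, and check whether each 95% interval excludes zero.

```python
from statistics import NormalDist


def ci_and_significance(estimate, std_error, level=0.95):
    """Return (low, high, significant) for a normal-approximation interval."""
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low = estimate - z * std_error
    high = estimate + z * std_error
    # "Statistically significant" here just means the interval excludes zero.
    significant = not (low <= 0 <= high)
    return low, high, significant


# Hypothetical values, chosen only so one standard error is 10x the other.
effect = 1.0
narrow_se = 0.25   # stand-in for the study with the narrow interval
wide_se = 2.5      # stand-in for the study with the wide interval

print(ci_and_significance(effect, narrow_se))
print(ci_and_significance(effect, wide_se))
```

Same estimated effect both times. The narrow interval stays well clear of zero; the wide one straddles it easily. Nothing about the effect changed, only the amount of evidence.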

On the surface, it looks like the two studies had radically different findings: MA found large benefits in cases where OR failed to find any benefit.  But that's not right. What really happened is: MA found benefits in cases where OR really didn't have enough data to decide one way or the other.

-----

The Oregon study is another case of "absence of evidence is not evidence of absence." But, I think, what really causes this kind of misinterpretation is the conventional language used to describe non-significant results.  

In one of the Oregon studies, the authors say this:


"We found no significant effect of Medicaid coverage on the prevalence or diagnosis of hypertension."

Well, that's not true. For one thing, the authors say "significant" instead of "statistically significant." Those are different -- crucially different. "Not significant" means, "has little effect." "Not statistically significant" means, "the sample size was too small to provide evidence of what the effect might be."

When the reader sees "no significant improvements," the natural inference is that the authors had solid evidence, and concluded that any improvements were negligible. That's almost the opposite of the truth, which is that there was insufficient evidence either way.

In fact, it's even more "not true," because the estimate of hypertension WAS "significant" in the real-life sense:


"The 95% confidence intervals for many of the estimates of effects on individual physical health measures were wide enough to include changes that would be considered clinically significant — such as a 7.16-percentage-point reduction in the prevalence of hypertension."

So, at the very least, the authors should have put "statistically" in front of "significant," like this:


"We found no statistically-significant effect of Medicaid coverage on the prevalence or diagnosis of hypertension."

Better! Now, the sentence is no longer false. But now it's meaningless.

It's meaningless because, as I wrote before, statistical significance is never a property of a real-world effect. It's a property of the *data*, a property of the *evidence* of the effect.

Saying "Medicaid had no statistically-significant effect on patient mortality" is like saying, "OJ Simpson had no statistically-significant effect on Nicole Brown Simpson's mortality." It uses an adjective that should apply only to the evidence, and improperly applies it to the claim itself.

Let's add the word "evidence," so the sentence makes sense:


"We found no statistically-significant evidence of Medicaid coverage's effect on the prevalence or diagnosis of hypertension."

We're getting there. Now, the sentence is meaningful. But, in my opinion, it's misleading. To my ear, it implies that, specifically, you found no evidence that an effect exists. But, you also didn't find any evidence that an effect *doesn't* exist -- especially relevant in this case, where the point estimate was "clinically significant."

So, change it to this:


"We found no statistically-significant evidence of whether or not Medicaid coverage affects the prevalence or diagnosis of hypertension."

Better again. But it's still misleading, in a different way. It's phrased in such a way that it implies that it's an important fact that they found no statistically-significant evidence.  

Because, why else say it at all? These researchers aren't the only ones with no evidence. I didn't find any evidence either. In fact, of the 7 billion people on earth, NONE of them, as far as I know, found any statistically-significant evidence for what happened in Oregon. And none of us are mentioning that in print.

The difference is: these researchers *looked* for evidence. But, does that matter enough?

Mary is murdered.  The police glance around the murder scene. They call Mary's husband John, and ask him if he did it. He says no. The police shrug. They don't do DNA testing, they don't take fingerprints, they don't look for witnesses, and they just go back to the station and write a report. And then they say, "We found no good evidence that John did it."

That's the same kind of "true but misleading." When you say you didn't find evidence, that implies you searched.  Doesn't it? Otherwise, you'd say directly, "We didn't search," or "We haven't found any evidence yet."

Not only does it imply that you searched ... I think it implies that you searched *thoroughly*.  That's because of the formal phrasing: not just, "we couldn't find any evidence," but, rather, "We found NO evidence." In English, saying it that way, with the added dramatic emphasis ... well, it's almost an aggressive, pre-emptive rebuttal.

Ask your kid, "Did you feed Fido like I asked you to?" If he replies, "I couldn't find him," you'll say, "well, you didn't look hard enough ... go look again!"

But if he replies, "I found no trace of him," you immediately think ... "OMG, did he run away, did he get hit by a car?" If it turns out Fido is safely in his doggie bed in the den, and your kid didn't even leave the bedroom ... well, it's literally true that he found "no trace" of Fido in his room, but that doesn't make him any less a brat.

In real life, "we found no evidence for X" carries the implication, "we looked hard enough that you should interpret the absence of evidence of X as evidence of absence of X." In the Oregon study, the implication is obviously not true. The researchers weren't able to look hard enough.  Not that they weren't willing -- just that they weren't able, with only 10,000 people in the dataset they were given.

In that light, instead of "no evidence of a significant effect," the study should have said something like,


"The Oregon study didn't contain enough statistically-significant evidence to tell us whether or not Medicaid coverage affects the prevalence or diagnosis of hypertension."

If the authors had done that, there would have been no confusion. Of course, people would wonder why Oregon bothered to do the experiment at all, if they could have realized in advance there wouldn't be enough data to reach a conclusion.

------

My feeling is that for most studies, the authors DO want to imply "evidence of absence" when they find a result that's not statistically significant.  I suspect the phrasing has evolved in order to evoke that conclusion, without having to say it explicitly. 

And, often, "evidence of absence" is the whole point. Naturopaths will say, "our homeopathic medicines can cure cancer," and scientists will do a study, and say, "look, the treatment group didn't have any statistically-significant difference from the control group." What they really mean -- and often say -- is, "that's enough evidence of absence to show that your stupid pseudo-medicine is useless, you greedy quacks."

And, truth be told, I don't actually object to that. Sometimes absence of evidence IS evidence of absence. Actually, from a Bayesian standpoint, it's ALWAYS evidence of absence, albeit perhaps to a tiny, tiny degree.  Do unicorns exist? Well, there isn't one in this room, so that tips the probability of "no" a tiny bit higher. But, add up all the billions of times that nobody has seen a unicorn, and the probability of no unicorns is pretty close to 1.  

You don't need to do any math ... it's just common sense. Do I have a spare can opener in the house? If I open the kitchen drawer, and it isn't there ... that's good evidence that I don't have one, because that's probably where I would have put it. On the other hand, if I open the fridge and it's not there, that's weak evidence at best.
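If you did want to do the math, here's a minimal sketch of what the can opener looks like as a Bayesian update. Every number is an assumption I invented for illustration: a 50/50 prior that I own a spare, a 90% chance I'd see it in the drawer if I owned one, and a 1% chance I'd see it in the fridge. The function name is hypothetical too.

```python
def posterior_have(prior_have, p_seen_if_have):
    """P(I own a spare | I looked in one spot and didn't see it)."""
    # If I own one, I fail to see it here with probability 1 - p_seen_if_have.
    miss_and_have = prior_have * (1 - p_seen_if_have)
    # If I don't own one, I fail to see it with certainty.
    miss_and_not = (1 - prior_have) * 1.0
    return miss_and_have / (miss_and_have + miss_and_not)


prior = 0.5  # assumed prior that a spare can opener exists at all

after_drawer = posterior_have(prior, p_seen_if_have=0.90)
after_fridge = posterior_have(prior, p_seen_if_have=0.01)

print(f"after checking the drawer: {after_drawer:.3f}")
print(f"after checking the fridge: {after_fridge:.3f}")
```

With those assumptions, an empty drawer drops the probability from 50% to about 9%. An empty fridge barely moves it. Both are "absence of evidence," but only one is much evidence of absence.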

We do that all the time. We use our brains, and all the common-sense prior knowledge we have. In this case, our critical prior assumption is that spare can openers are much more likely to be found in the drawer than in the fridge.

If you want to go from "absence of evidence" to "evidence of absence," you have to be Bayesian. You have to use "priors," like your knowledge of where the can opener is more likely to be. And if you want to be intellectually honest, you have to use ALL your priors, even those that work against your favored conclusion. If you only look in the fridge, and the toilet, and the clothes closet, and you tell your wife, "I checked three rooms and couldn't find it," ... well, you're being a dick. You're telling her the literal truth, but hoping to mislead her into reaching an incorrect conclusion.

If you want to claim "evidence of absence," you have to show that if the effect *was* there, you would have found it. In other words, you have to convince us that you really *did* look everywhere for the can opener.  

One way to do that is to formally look at the statistical "power" of your test. But, there's an easier way: just look at your confidence interval.  If it's narrow around zero, that's good "evidence of absence". If it's wide, that's weak "evidence of absence."  
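For what it's worth, the formal "power" calculation isn't much work either. Here's a sketch for a simple two-sided z-test, assuming the estimate is roughly normal. The effect size and standard errors are hypothetical, not taken from either study, and the function name is my own.

```python
from statistics import NormalDist


def power_two_sided(true_effect, std_error, alpha=0.05):
    """Chance a two-sided z-test rejects zero when the effect is real."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # How many standard errors the true effect sits away from zero.
    shift = true_effect / std_error
    # Probability the z-statistic lands in either rejection region.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# Hypothetical true effect, tested with a precise study and an imprecise one.
effect = 1.0
print(f"precise study (SE 0.25): power = {power_two_sided(effect, 0.25):.2f}")
print(f"imprecise study (SE 2.5): power = {power_two_sided(effect, 2.5):.2f}")
```

In this made-up example, the precise study would almost always find the effect, and the imprecise study would almost always miss it. A "not statistically significant" from the second study tells you close to nothing about whether the effect is there.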

In the Oregon study, the confidence interval for hypertension is obviously quite wide. Since the point estimate is "clinically significant," the edge of the confidence interval -- the point estimate plus 2 standard errors -- must be *really* clinically significant.

The thing is, the convention for academic studies is that even if the estimate isn't statistically significant, you don't treat it differently in high-power studies versus low-power studies. The standard phrasing is the same either way: "There was no statistically-significant effect."  

And that's misleading.

Especially when, as in the Oregon case, the study is so underpowered that even your "clinically significant" result is far from statistical significance. Declaring "the effect was not statistically significant," for a study that weak, is as misleading as saying "The missing can opener could not be found in the kitchen," when all you did was look in the fridge.

If you want to argue for evidence of absence, you have to be Bayesian. You have to acknowledge that your conclusions about absence are subjective. You have to make an explicit argument about your prior assumptions, and how they lead to your conclusion of evidence of absence.  

If you don't want to do that, fine. But then, your conclusion should clearly and directly acknowledge your ignorance. "We found no evidence of a significant effect" doesn't do that: it's a "nudge nudge, wink wink" way to imply "evidence of absence" under the table.

If you don't have statistical significance, here's what you -- and the Oregon study -- should say instead:


"We have no clue.  Our study doesn't have enough data to tell."



2 Comments:

At Wednesday, June 18, 2014 6:41:00 PM, Blogger ed said...

Great post. I've been using the Oregon experiment as an example in the statistics classes I teach.

It is interesting to look at WHY the Oregon experiment generated such wide confidence intervals, despite what might seem like a fairly large sample. It was a randomized experiment, which in principle should make the results much more credible than something like the MA study. But the problem is that the actual difference in Medicaid coverage between lottery winners and losers was not that great: for example, about 20% of the "losers" were on Medicaid at some point during the study, vs only a little over 40% of the "winners." So when the authors used IV to account for this fact, the precision of the estimated effects of Medicaid coverage was pretty low. (You need to get the technical appendix from the authors to really look at this stuff.)

 
At Thursday, June 19, 2014 9:19:00 PM, Blogger doc said...

I prefer: "We found that enrollment in Medicaid is associated with [effect]. However, our sample size was not large enough for us to conclude that this measured effect is, in fact, real."

 
