## Tuesday, June 19, 2012

### Significance testing and contradictory conclusions

A bank is robbed.  There are witnesses and video cameras.  There are two suspects -- identical twin brothers.  The police investigate, and are unable to determine which brother is the criminal.

The police call a press conference.  They are familiar with the standards of statistical inference, where you don't have evidence until p < .05.  And, of course, the police have only p = .5 -- a fifty/fifty chance.

"There is no evidence that Al did the crime," they say, accordingly.  "And, also, there is no evidence that Bob did the crime."

But if the newspaper says, "police have no evidence pointing to who did it," that seems wrong.  There is strong evidence that *one* of the brothers did it.

------

"Bad Science," by Ben Goldacre, is an excellent debunking of some bad research and media reporting, especially in the field of medicine.  A lot of the book talks about the kinds of things that we sabermetricians are concerned about -- bad logic, misunderstandings by reporters, untested conventional wisdom, and so forth.

I just discovered that Goldacre has a blog, and I found this post on significance, which brings up the twins issue.

Let me paraphrase his example.  You have two types of cells -- normal, and mutant.  You give them both a treatment.  In the normal cells, firing increased by 15 percent, but that's not statistically significant.  In the mutant cells, firing increased by 30 percent, and that *is* significant.  (I'm going to call these cases the "first treatment" and the "second treatment," even though it's the cells that changed, and not really the treatment.)

So, what do you conclude?

Under the normal standards of statistical inference, you can say the following:

1.  There is no evidence that the first treatment works.
2.  There is significant evidence the second treatment works.

Seeing that, a lay reader would obviously conclude that the researcher found a difference between the two treatments.

Goldacre objects to this conclusion.  Because, after all, the difference between the two treatments was only 15 percentage points.  That's probably not statistically significant, since that same 15-point effect was judged insignificant when it appeared in the normal cells.  So, given that insignificance, why should we be claiming there's a difference?
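Goldacre's post doesn't give standard errors, but a quick sketch shows how the arithmetic can shake out this way.  Everything numeric here is an assumption for illustration: suppose each effect estimate carries a standard error of about 9 percentage points.

```python
import math

# Hypothetical numbers, not from Goldacre's example: assume each
# treatment's effect estimate has a standard error of 9 points.
SE = 9.0

def is_significant(effect, se, label):
    """Two-sided z-test at the 5% level (threshold |z| > 1.96)."""
    z = effect / se
    sig = abs(z) > 1.96
    print(f"{label}: z = {z:.2f}, significant = {sig}")
    return sig

is_significant(15.0, SE, "normal cells (+15)")   # z = 1.67, not significant
is_significant(30.0, SE, "mutant cells (+30)")   # z = 3.33, significant

# The difference between two independent estimates has a LARGER
# standard error than either one alone:
se_diff = math.sqrt(SE**2 + SE**2)               # about 12.7
is_significant(30.0 - 15.0, se_diff, "difference (+15)")  # z = 1.18, not significant
```

So with the same 15-point gap, "mutant vs. zero" clears the bar while "mutant vs. normal" does not, because the difference of two noisy estimates is noisier than either estimate.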

What Goldacre would have us do, I think, is to say that we don't have evidence that there's a difference.  So we'd say this:

1.  There is no evidence that the first treatment works.
2.  There is significant evidence the second treatment works.
3.  There is no evidence that the treatments are different.

And now we have the twin situation.  Even though there's no evidence behind #1, and no evidence behind #3, there *is* evidence that one of the two claims must be true.  Either the first treatment has an effect, or the two treatments are different.  They can't both be false.  At least one of the twins must be guilty.

You have to be especially careful, more careful than usual, that you don't assume that absence of evidence is evidence of absence.  Otherwise, if you insist that both coefficients should be treated as zero, you're claiming a logical impossibility.

----

Even if you just assume that one of the two coefficients -- the first treatment's effect, or the difference between the treatments -- is zero ... well, how do you know which one?  If you choose the first, you assume the effect is zero, when the observation was 15 percent.  If you choose the second, you assume the effect is 30 percent, when the observation was 15 percent.

And it gets worse.  Imagine that instead of 15 percent for the first treatment, the result was 27 percent, which was significant.  Now, you can say there is evidence for the first treatment, and there is also evidence for the second treatment.  And, you can also say that there is no evidence that the two treatments are different.

That's all good so far.  But, now, you head to your regression, and you start computing estimates.  And, what do you do?  You probably use 27 percent for the first treatment, and 30 percent for the second treatment.  But you just finished saying there's no evidence they're different!  Shouldn't you be using 27 for both, or 30 for both, or 28.5 for both?  Shouldn't it be a problem that you assume one thing on one page, and the opposite on the very next page?
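One consistent way out is to test the difference directly, and, if you then decide to treat the coefficients as equal, use an inverse-variance pooled estimate for both.  A sketch, with made-up standard errors (the post gives only the point estimates of 27 and 30):

```python
import math

# Hypothetical standard errors, assumed for illustration.
b1, se1 = 27.0, 9.0   # first treatment
b2, se2 = 30.0, 9.0   # second treatment

# Test whether the two coefficients differ (a two-sample z-test):
z = (b2 - b1) / math.sqrt(se1**2 + se2**2)
print(f"z for the difference: {z:.2f}")   # 0.24 -- nowhere near 1.96

# If you then treat them as equal, the inverse-variance weighted
# average is the natural single number to use for both:
w1, w2 = 1 / se1**2, 1 / se2**2
pooled = (w1 * b1 + w2 * b2) / (w1 + w2)
print(f"pooled estimate: {pooled:.1f}")   # 28.5 when the SEs are equal
```

That at least makes page one and page two of the analysis agree with each other: either you found a difference and use 27 and 30, or you didn't and use the pooled 28.5 for both.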

If you're going to say, "there's no evidence for a difference but we're going to assume it anyway," why is that better than saying (in the previous example) "there's no evidence that the first treatment works, but we're going to assume it anyway?"


At Tuesday, June 19, 2012 10:30:00 PM,  Alex said...

This gets back to your other post - you get in trouble when you equate 'no evidence' with 'not statistically significant'.

I had students run an experiment for class. They had people fill out a survey rating their mood, and overall people are pretty average, neither happy nor sad. Then half the group watched a funny video meant to cheer them up and half watched a sad video. After the video everyone fills out the survey again.

It turned out that neither of the groups moved significantly away from their pre-video average. So the videos "didn't work". But if you tested the happy group against the sad group, they were different. So the videos "did work". How would you describe the results?

At Tuesday, June 19, 2012 10:50:00 PM,  Phil Birnbaum said...

How would I, personally, describe the results?

My rule is: if you find something noteworthy, but it's not statistically significant, your experiment is too small. This is a case of that. So that would be my first observation.

But, I'd still say it looks like both videos worked. Even though the individual videos are not significant, they both went in the expected direction, and, combined, the difference was significant.

At Wednesday, June 20, 2012 8:49:00 AM,  Millsy said...

Phil,

I really like the two treatments example. In some of my work, I want to test the equality of two coefficients in a regression. One is statistically significant and different from zero, the other is *sometimes* statistically significantly different from zero.

This is a pretty straightforward test (it's pretty close to a t-test), but I fear that those not particularly familiar with basic statistics have nowhere to find this information.

I have a qualm with your comment, as I think using the rule, "if you find something noteworthy, but it's not statistically significant, your experiment is too small," gets us into the habit of increasing sample size until we find what we are looking for (or, if our result is significant, simply assuming we had a large enough sample size for sufficient power of our test).

This happens all the time in science. In a lot of cases, you're right (especially when you have some prior idea of what *should* be happening).

At Wednesday, June 20, 2012 11:22:00 AM,  Alex said...

I think it's interesting that you immediately assumed the experiment was too small, given that I didn't say how many people were in the experiment. If it so happened that they tested 1000 people, is it worth claiming that the videos changed people's mood if you need to run that many people to tell?

As it turns out, it probably was a sample size issue. We didn't run a lot of people since it was for a class. That being said, I think there are still interesting issues. The tests for change from baseline are probably more powerful because they were within-subject tests; the test for happy versus sad is between-subject. That might make you lean toward the null finding being more accurate.

All in all, I think the best you could really say is that the videos were somewhat effective, or probably effective.

At Wednesday, June 20, 2012 11:48:00 AM,  Phil Birnbaum said...

Alex,

The experiment is too small because, apparently, you got a result that was significant to you in the psychological sense, but not in the statistical sense.

If you tested 1000 people, then, yes, the actual result would have been a very small change in mood. If you're still interested in whether that's real, you need more data.

But, if the mood changed by 0.00001 mood units (p=.12), then it's so close to zero that who cares? In that case, even if you had a big enough sample to have that be statistically significant, the 0.00001 still wouldn't do much for you.