### Don't always blindly insist on statistical significance

Suppose you run a regression, and it turns out that the input you're investigating turns out to appear to have a real-life relationship to the output. But it also turns out that the despite being significant in the real-life sense, the relationship is not statistically significant. What do you do?

David Berri argues (scroll down to the second half of his post) that once you realize the variable is statistically insignificant, you stop dead:

We do not say (and this point should be emphasized) the “coefficient is insignificant” and then proceed to tell additional stories about the link between these two variables.

One of my co-authors puts it this way to her students.

“When I teach econometrics I tell my students that a sentence that begins by stating a coefficient is statistically insignificant ends with a period.” She tells her students that she never wants to see “The coefficient was insignificant, but…”

Well, I don't think that's always right. I explained why in a post two weeks ago, called "Low statistical significance doesn't necessarily mean no effect." My argument was that, if you already have some reason to believe there is a correlation between your input and your output, the result of your regression can help confirm your belief, even if it doesn't rise to statistical significance.

Here's an example with real data. I took all 30 major league teams for 2007, and I ran a regression to see if there was a relationship between the team's triples and its runs scored. It turned out that there was no statistically-significant relationship: the p-value was 0.23, far above the 0.05 that's normally regarded as the threshold.

Berri would now say that we should stop. As he writes,

"Even though we have questions, at this point it would be inappropriate to talk about the coefficient we have estimated ... as being anything else than statistically insignificant."

And maybe that would be the case if we didn't know anything about baseball. But, as baseball fans, we know that triples are good things, and we know that a triple does help teams score runs. That's why we cheer our team's players when they hit them. There is strong reason to believe there's a connection between triples and runs.

So I don't think it's inappropriate at all to look at our coefficient. It turns out that the coefficient is 1.88. On average, every additional triple a team hit was associated with an increase of 1.88 runs scored.

Of course, there's a large variance associated with that 1.88 estimate -- as you'd expect, since it wasn't statistically significant from zero. The standard deviation of the estimate was 1.53. That means a 95% confidence interval is approximately (-1.18, 4.94). Not only is the 1.88 not significantly different from zero, it's also not significantly different from -1, or from almost +5!

But why can't we say that? Why shouldn't we write that we found a coefficient of 1.88 with a standard deviation of 1.53? Why can't we discuss these numbers and the size of the real effect, if any?

Berri and his co-author would argue that it's because we have no good evidence that the effect is different from zero. But what makes zero special? We also have no good evidence that the effect is different from 1.88, or 4.1, or -0.6. Why is it necessary to proceed as if the "real" value of the coefficient is zero, when zero is just one special case?

As I argued before, zero is considered special because, most of the time, there's no reason to believe there's any connection between the input and the output. Do you think rubbing chocolate on your leg can cure cancer? Do you think red cars go faster than black cars just by virtue of their color? Do you think standing on your head makes you smarter?

In all three of these examples, I'd recommend following Berri's advice, because there's overwhelming logic that says the relationship "should" be zero. There's no scientific reason that red makes cars go faster. If you took a thousand similarly absurd hypotheses, you'd expect at least 999 of them to be zero. So if you get something positive but not statistically significant, the odds are overwhelming that the non-zero point estimate got that way just because of random luck.

But, for triples vs. runs, that's not the case. Our prior expectation should be that the result will turn out positive. How positive? Well, suppose we had never studied the issue, or read Bill James or Pete Palmer. Then, we might naively figure, the average triple scores a runner and a half on base, and there's a 70% chance of scoring the batter eventually. That's 2.2 runs. Maybe half the runners on base would score eventually even without the triple, so subtract off .75, to give us that the triple is worth 1.45 runs. (I know these numbers are wrong, but they're reasonable for what I might have guessed pre-Bill James.)

If our best estimate going in was that a triple should be worth 1.45 runs, and the regression gave us something close to that (and not statistically significantly different), then why should we be using zero as a basis for our decision for whether to consider this valid evidence?

Rather than end the discussion with a period, as Berri's colleague would have us do, I would suggest we do this:

-- give the regression's estimate of 1.88, along with the standard error of 1.53 and the confidence interval (-1.18, 4.94).

-- state that the estimate of 1.88 is significant in the baseball sense.

-- admit that it's not significantly different from zero.

-- BUT: argue that there's reason to think that the 1.88 is in the neighborhood of what theory predicts.

If I were writing a paper, that's exactly what I'd say. And I'd also admit that the confidence interval is huge, and we really should repeat this analysis with more years' worth of data, to reduce the standard error. But I'd argue that, even without statistical significance, the results actually SUPPORT the hypothesis that triples are associated with runs scored.

You've got to use common sense. If you got these results for a relationship between rubbing chocolate on your leg and cancer, it would be perfectly appropriate to assume that the relationship is zero. But if you get these results for a relationship between height and weight, zero is not a good option.

And, in any case: if you get results that are significant in the real world, but not statistically significant, it's a sign that your dataset is too small. Just get some more data, and run your regression again.

------

Here's another example of how you have to contort your logic if you want to blindly assume that statistical insignificance equals no effect.

I'm going to run the same regression, on the 2007 MLB teams, but I'm going to use doubles instead of triples. This time, the results are indeed statistically significant:

-- p=.0012 (signficant at 99.88%)

-- each double is associated with an additional 1.50 runs scored

-- the standard error is 0.417, so a 95% confidence interval is (0.67, 2.33)

Everyone would agree that there is a connection between hitting doubles and scoring runs.

But now, Berri and his colleague are in a strange situation. They have to argue that:

-- there is a connection between doubles and runs, but

-- there is NO connection between triples and runs!

If that's your position, and you have traditional beliefs about how doubles lead to more runs (by scoring baserunners and putting the batter on second base), those two statements are mutually contradictory. It's obvious to any baseball fan that, on the margin, a triple will lead to at least as many runs scoring as a double. It's just not possible that a double is worth 1.5 runs, but the act of stretching it into a triple makes it worth 0.0 runs instead. But if you follow Berri's rule, that's what you have to do! Your paper can't even argue against it, because "the coefficient was insignificant, but ..." is not allowed!

Now, in fairness, it's not logically impossible for doubles to be worth 1.5 runs in a regression but triples 0.0 runs. Maybe doubles are worth only 0.1 runs in current run value, but they come in at 1.5 because they're associated with power-hitting teams. Triples, on the other hand, might be associated with fast singles-hitting teams who are always below average.

In the absence of other evidence, that would be a valid possibility. But, unlike the chocolate-cures-cancer case, I don't think it's a very likely possibility. If you do think it's likely, then you still have to make the argument using other evidence. You can't just fall back on the "not significantly different from zero."

Using zero as your baseline for significance is not a law in the field of statistical analysis. It's a consequence of how things work in your actual field of study, an implementation of Carl Sagan's rule that "extraordinary claims require extraordinary evidence." For silly cancer cures, for red cars going faster than black cars, saying there's a non-zero effect is an extraordinary claim. And so you need statistical significance. (Indeed, silly cancer cures are so unlikely that you could argue that 95% significance is not enough, because that would allow too many false cures (2.5%) to get through.)

But for triples being worth about the same as doubles ... well, that's not extraordinary. Actually, it's the reverse that's extraordinary. Triples being worth zero while doubles are worth 1.5 runs? Are you kidding? I'd argue that if you want to say triples are worth less than doubles, the burden is reversed. It's not enough to show that the confidence interval includes zero. You have to show that the confidence interval does NOT include anything higher than the value of the double.

According to David Berri, the rule of thumb in econometrics is, "if you don't have signficance, ignore any effect you found." But that rule of thumb has certain hidden assumptions. One of those assumptions is that on your prior beliefs, the effect is likely to be zero. That's true for a lot of things in econometrics -- but not for doubles creating runs.

-----

This doubles/triples comparison is one I just made up. But there's a real life example, one I talked about a couple of years ago.

In that one, Cade Massey and Richard Thaler did a study (.pdf) of the NFL draft. As you would expect, they found that the earlier the draft pick, the more likely the player was to make an NFL roster. Earlier choices were also more likely to play more games, and more likely to make the Pro-Bowl. Draft choice was statistically significant for all three factors.

Then, the authors attempted to predict salary. Again as you'd expect, the more games you played, and the more you were selected to the Pro Bowl, the higher your salary. And, again, all these were statistically significant.

Finally, the authors held all these constant, and looked at whether draft position influenced salary over and above these factors. It did, but this factor did not reach statistical significance. Higher picks earned more money, but by somewhere between 1 and 2 SDs.

From the lack of significance, the authors wrote:

" ... we find that draft order is not a significant explanatory variable after controlling for [certain aspects of] prior performance."

I disagree. Because for that to be true, you have to argue that

-- higher draft choices are more likely to make the team

-- higher draft choices are more likely to play more games

-- higher draft choices are more likely to make the Pro-Bowl

but that

-- higher draft choices are NOT more likely to be better players in other ways than that.

That makes no sense. You have two offensive linemen on two different teams -- good enough to play every game for five years, but not good enough for the Pro Bowl. One was drafted in the first round; one was drafted in the third round. What Massey and Thaler are saying is that, despite the fact that the first round guy makes, on average, more money than the third round guy, that's likely to be random coincidence. That flies in the face of the evidence. Not statistically significant evidence, but good evidence nonetheless -- a coefficient that goes in the right direction, is signficant in the football sense, and is actually not that far below the 2 SD cutoff.

That isn't logical. You've shown, with statistical significance, that higher picks perform better than lower picks in terms of playing time and stardom. The obvious explanation, which you accept, is that the higher picks are just better players. So why would you conclude that higher picks are exactly the same quality as lower picks in the aspects of the game that you chose not to measure, when the data don't actually show that?

In this case, it's not only acceptable, but required, to say "the coefficient was insignificant, but ..."

Labels: Berri, draft, statistics