Sabermetric Research: Don't always blindly insist on statistical significance

Tuesday, May 19, 2009

Don't always blindly insist on statistical significance

Suppose you run a regression, and the input you're investigating appears to have a real-life relationship to the output. But it also turns out that, despite being significant in the real-life sense, the relationship is not statistically significant. What do you do?

David Berri argues (scroll down to the second half of his post) that once you realize the variable is statistically insignificant, you stop dead:

We do not say (and this point should be emphasized) the “coefficient is insignificant” and then proceed to tell additional stories about the link between these two variables.

One of my co-authors puts it this way to her students.

“When I teach econometrics I tell my students that a sentence that begins by stating a coefficient is statistically insignificant ends with a period.” She tells her students that she never wants to see “The coefficient was insignificant, but…”

Well, I don't think that's always right. I explained why in a post two weeks ago, called "Low statistical significance doesn't necessarily mean no effect." My argument was that, if you already have some reason to believe there is a correlation between your input and your output, the result of your regression can help confirm your belief, even if it doesn't rise to statistical significance.

Here's an example with real data. I took all 30 major league teams for 2007, and I ran a regression to see if there was a relationship between the team's triples and its runs scored. It turned out that there was no statistically-significant relationship: the p-value was 0.23, far above the 0.05 that's normally regarded as the threshold.

Berri would now say that we should stop. As he writes,

"Even though we have questions, at this point it would be inappropriate to talk about the coefficient we have estimated ... as being anything else than statistically insignificant."

And maybe that would be the case if we didn't know anything about baseball. But, as baseball fans, we know that triples are good things, and we know that a triple does help teams score runs. That's why we cheer our team's players when they hit them. There is strong reason to believe there's a connection between triples and runs.

So I don't think it's inappropriate at all to look at our coefficient. It turns out that the coefficient is 1.88. On average, every additional triple a team hit was associated with an increase of 1.88 runs scored.

Of course, there's a large variance associated with that 1.88 estimate -- as you'd expect, since it wasn't statistically significantly different from zero. The standard error of the estimate was 1.53. That means a 95% confidence interval is approximately (-1.18, 4.94). Not only is the 1.88 not significantly different from zero, it's also not significantly different from -1, or from almost +5!
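As a quick check on that interval, here is a minimal sketch that rebuilds it from the two numbers quoted above. It assumes the rough "estimate plus or minus 2 standard errors" rule, which is what reproduces the (-1.18, 4.94) figure; the variable and function names are just for illustration.

```python
# Figures quoted in the post: slope of runs on triples, 2007 MLB teams.
estimate = 1.88
std_error = 1.53

# Rough 95% interval: estimate plus or minus 2 standard errors (an approximation).
half_width = 2 * std_error
low = estimate - half_width
high = estimate + half_width
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")


def contains(value):
    """Return True if the candidate value lies inside the interval."""
    return low <= value <= high


# Zero is inside the interval, but so are -1 and values close to +5.
for candidate in (0.0, -1.0, 4.9):
    print(candidate, contains(candidate))
```

The point of the loop is only that zero has no special status here: it is one of many values the interval fails to rule out.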

But why can't we say that? Why shouldn't we write that we found a coefficient of 1.88 with a standard error of 1.53? Why can't we discuss these numbers and the size of the real effect, if any?

Berri and his co-author would argue that it's because we have no good evidence that the effect is different from zero. But what makes zero special? We also have no good evidence that the effect is different from 1.88, or 4.1, or -0.6. Why is it necessary to proceed as if the "real" value of the coefficient is zero, when zero is just one special case?
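To make the "what makes zero special?" question concrete, here is a small sketch that tests the same estimate against several candidate null values, not just zero. It assumes a rough cutoff of 2 standard errors for "significantly different"; the function name `t_stat` is made up for this example.

```python
estimate = 1.88
std_error = 1.53
CUTOFF = 2.0  # rough 95% threshold; an approximation, not an exact critical value


def t_stat(null_value):
    """Distance between the estimate and a hypothesized value, in standard errors."""
    return (estimate - null_value) / std_error


# The candidate values mentioned in the paragraph above.
for null_value in (0.0, 1.88, 4.1, -0.6):
    t = t_stat(null_value)
    verdict = "cannot reject" if abs(t) < CUTOFF else "reject"
    print(f"null = {null_value:5.2f}   t = {t:6.2f}   {verdict}")
```

None of the four candidates is rejected, which is all the paragraph claims: the data do not single out zero.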

As I argued before, zero is considered special because, most of the time, there's no reason to believe there's any connection between the input and the output. Do you think rubbing chocolate on your leg can cure cancer? Do you think red cars go faster than black cars just by virtue of their color? Do you think standing on your head makes you smarter?

In all three of these examples, I'd recommend following Berri's advice, because there's overwhelming logic that says the relationship "should" be zero. There's no scientific reason that red makes cars go faster. If you took a thousand similarly absurd hypotheses, you'd expect at least 999 of them to be zero. So if you get something positive but not statistically significant, the odds are overwhelming that the non-zero point estimate got that way just because of random luck.

But, for triples vs. runs, that's not the case. Our prior expectation should be that the result will turn out positive. How positive? Well, suppose we had never studied the issue, or read Bill James or Pete Palmer. Then, we might naively figure, the average triple scores a runner and a half on base, and there's a 70% chance of scoring the batter eventually. That's 2.2 runs. Maybe half the runners on base would score eventually even without the triple, so subtract off .75, to give us that the triple is worth 1.45 runs. (I know these numbers are wrong, but they're reasonable for what I might have guessed pre-Bill James.)

If our best estimate going in was that a triple should be worth 1.45 runs, and the regression gave us something close to that (and not statistically significantly different), then why should we be using zero as a basis for our decision for whether to consider this valid evidence?
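One way to formalize "prior guess plus new evidence" is a simple normal-normal update, sketched below. This is an illustration under loud assumptions, not something the regression itself produced: the prior standard deviation of 0.5 runs is invented for the example, and both the prior and the regression estimate are treated as normal.

```python
import math

# Naive pre-Bill-James guess from the paragraph above:
# 1.5 runners driven in, plus a 70% chance the batter scores,
# minus 0.75 runners who would have scored anyway.
prior_mean = 1.5 + 0.7 - 0.75   # 1.45 runs per triple
prior_sd = 0.5                  # ASSUMPTION: how unsure we were; not from the post

# Regression result quoted in the post.
estimate = 1.88
std_error = 1.53

# Normal-normal update: precisions (1 / variance) add, means are precision-weighted.
prior_precision = 1 / prior_sd ** 2
data_precision = 1 / std_error ** 2
post_precision = prior_precision + data_precision

post_mean = (prior_mean * prior_precision + estimate * data_precision) / post_precision
post_sd = math.sqrt(1 / post_precision)

print(f"prior:     {prior_mean:.2f} +/- {prior_sd:.2f}")
print(f"data:      {estimate:.2f} +/- {std_error:.2f}")
print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}")
```

Under these assumptions the noisy regression nudges the estimate slightly upward and tightens it a little. A different prior standard deviation would change the numbers, but not the direction of the update.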

Rather than end the discussion with a period, as Berri's colleague would have us do, I would suggest we do this:

-- give the regression's estimate of 1.88, along with the standard error of 1.53 and the confidence interval (-1.18, 4.94).
-- state that the estimate of 1.88 is significant in the baseball sense.
-- admit that it's not significantly different from zero.
-- BUT: argue that there's reason to think that the 1.88 is in the neighborhood of what theory predicts.

If I were writing a paper, that's exactly what I'd say. And I'd also admit that the confidence interval is huge, and we really should repeat this analysis with more years' worth of data, to reduce the standard error. But I'd argue that, even without statistical significance, the results actually SUPPORT the hypothesis that triples are associated with runs scored.

You've got to use common sense. If you got these results for a relationship between rubbing chocolate on your leg and cancer, it would be perfectly appropriate to assume that the relationship is zero. But if you get these results for a relationship between height and weight, zero is not a good option.

And, in any case: if you get results that are significant in the real world, but not statistically significant, it's a sign that your dataset is too small. Just get some more data, and run your regression again.

------

Here's another example of how you have to contort your logic if you want to blindly assume that statistical insignificance equals no effect.

I'm going to run the same regression, on the 2007 MLB teams, but I'm going to use doubles instead of triples. This time, the results are indeed statistically significant:

-- p=.0012 (significant at 99.88%)
-- each double is associated with an additional 1.50 runs scored
-- the standard error is 0.417, so a 95% confidence interval is (0.67, 2.33)
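For readers who want to see where those p-values come from, here is a sketch that recovers them from the coefficients and standard errors alone. It assumes a simple one-variable regression on 30 teams, so 28 degrees of freedom, and it integrates the Student t density numerically with Simpson's rule so that nothing beyond the standard library is needed. The helper names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t| (Simpson's rule)."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


DF = 30 - 2  # ASSUMPTION: 30 teams, one predictor plus an intercept

# Coefficients and standard errors quoted in the post.
t_doubles = 1.50 / 0.417
t_triples = 1.88 / 1.53

p_doubles = two_sided_p(t_doubles, DF)
p_triples = two_sided_p(t_triples, DF)

print(f"doubles: t = {t_doubles:.2f}, p = {p_doubles:.4f}")
print(f"triples: t = {t_triples:.2f}, p = {p_triples:.2f}")
```

The two events have coefficients of similar size; what differs is the standard error, which is what drives one p-value below the threshold and leaves the other above it.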

Everyone would agree that there is a connection between hitting doubles and scoring runs.

But now, Berri and his colleague are in a strange situation. They have to argue that:

-- there is a connection between doubles and runs, but
-- there is NO connection between triples and runs!

If that's your position, and you have traditional beliefs about how doubles lead to more runs (by scoring baserunners and putting the batter on second base), those two statements are mutually contradictory. It's obvious to any baseball fan that, on the margin, a triple will lead to at least as many runs scoring as a double. It's just not possible that a double is worth 1.5 runs, but the act of stretching it into a triple makes it worth 0.0 runs instead. But if you follow Berri's rule, that's what you have to do! Your paper can't even argue against it, because "the coefficient was insignificant, but ..." is not allowed!

Now, in fairness, it's not logically impossible for doubles to be worth 1.5 runs in a regression but triples 0.0 runs. Maybe doubles are worth only 0.1 runs in current run value, but they come in at 1.5 because they're associated with power-hitting teams. Triples, on the other hand, might be associated with fast singles-hitting teams who are always below average.

In the absence of other evidence, that would be a valid possibility. But, unlike the chocolate-cures-cancer case, I don't think it's a very likely possibility. If you do think it's likely, then you still have to make the argument using other evidence. You can't just fall back on the "not significantly different from zero."

Using zero as your baseline for significance is not a law in the field of statistical analysis. It's a consequence of how things work in your actual field of study, an implementation of Carl Sagan's rule that "extraordinary claims require extraordinary evidence." For silly cancer cures, for red cars going faster than black cars, saying there's a non-zero effect is an extraordinary claim. And so you need statistical significance. (Indeed, silly cancer cures are so unlikely that you could argue that 95% significance is not enough, because that would allow too many false cures (2.5%) to get through.)

But for triples being worth about the same as doubles ... well, that's not extraordinary. Actually, it's the reverse that's extraordinary. Triples being worth zero while doubles are worth 1.5 runs? Are you kidding? I'd argue that if you want to say triples are worth less than doubles, the burden is reversed. It's not enough to show that the confidence interval includes zero. You have to show that the confidence interval does NOT include anything higher than the value of the double.
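The reversed burden can be written down as a check. The sketch below uses the numbers from the two regressions and the same rough 2-standard-error interval; it treats the doubles coefficient as a fixed benchmark and ignores its own uncertainty, which is a simplification.

```python
# Numbers quoted in the post.
triples_est, triples_se = 1.88, 1.53
doubles_est = 1.50

# Rough 95% interval for the triples coefficient (plus or minus 2 standard errors).
lower = triples_est - 2 * triples_se
upper = triples_est + 2 * triples_se

# The usual question: can we rule out zero?
includes_zero = lower <= 0.0 <= upper

# The reversed burden: to claim triples are worth LESS than doubles,
# the whole interval would have to sit below the doubles value.
shows_triples_worth_less = upper < doubles_est

print(f"triples interval: ({lower:.2f}, {upper:.2f})")
print("interval includes zero:", includes_zero)
print("data show triples worth less than doubles:", shows_triples_worth_less)
```

The interval includes zero, but it also extends far above 1.50, so these data give no support to the claim that a triple is worth less than a double.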

According to David Berri, the rule of thumb in econometrics is, "if you don't have significance, ignore any effect you found." But that rule of thumb has certain hidden assumptions. One of those assumptions is that on your prior beliefs, the effect is likely to be zero. That's true for a lot of things in econometrics -- but not for doubles creating runs.

-----

This doubles/triples comparison is one I just made up. But there's a real life example, one I talked about a couple of years ago.

In that one, Cade Massey and Richard Thaler did a study (.pdf) of the NFL draft. As you would expect, they found that the earlier the draft pick, the more likely the player was to make an NFL roster. Earlier choices were also more likely to play more games, and more likely to make the Pro Bowl. Draft choice was statistically significant for all three factors.

Then, the authors attempted to predict salary. Again as you'd expect, the more games you played, and the more you were selected to the Pro Bowl, the higher your salary. And, again, all these were statistically significant.

Finally, the authors held all these constant, and looked at whether draft position influenced salary over and above these factors. It did, but this factor did not reach statistical significance. Higher picks earned more money, but the effect was only somewhere between 1 and 2 SDs from zero.

From the lack of significance, the authors wrote:

" ... we find that draft order is not a significant explanatory variable after controlling for [certain aspects of] prior performance."

I disagree. Because for that to be true, you have to argue that

-- higher draft choices are more likely to make the team
-- higher draft choices are more likely to play more games
-- higher draft choices are more likely to make the Pro Bowl

but that

-- higher draft choices are NOT more likely to be better players in other ways than that.

That makes no sense. You have two offensive linemen on two different teams -- good enough to play every game for five years, but not good enough for the Pro Bowl. One was drafted in the first round; one was drafted in the third round. What Massey and Thaler are saying is that, despite the fact that the first round guy makes, on average, more money than the third round guy, that's likely to be random coincidence. That flies in the face of the evidence. Not statistically significant evidence, but good evidence nonetheless -- a coefficient that goes in the right direction, is significant in the football sense, and is actually not that far below the 2 SD cutoff.

That isn't logical. You've shown, with statistical significance, that higher picks perform better than lower picks in terms of playing time and stardom. The obvious explanation, which you accept, is that the higher picks are just better players. So why would you conclude that higher picks are exactly the same quality as lower picks in the aspects of the game that you chose not to measure, when the data don't actually show that?

In this case, it's not only acceptable, but required, to say "the coefficient was insignificant, but ..."


19 Comments:

At Tuesday, May 19, 2009 5:35:00 PM, Tom Spradlin said...: Phil, I have only a passing interest in baseball statistics. I am a retired biostatistician working in health research. I wish more people in my field would take your post to heart.

In the case of the triples, why should we pretend the coefficient is zero? The data are more consistent with a value of 1, or even 2, than with zero. With which value are the data most consistent? 1.88. Call it that.
At Tuesday, May 19, 2009 10:22:00 PM, G Wolf said...: You've contradicted yourself here when you wrote "Now, in fairness, it's not logically impossible for doubles to be worth 1.5 runs in a regression but triples 0.0 runs. Maybe doubles are worth only 0.1 runs in current run value, but they come in at 1.5 because they're associated with power-hitting teams. Triples, on the other hand, might be associated with fast singles-hitting teams who are always below average."

That's sort of the point when you get a coefficient that's not statistically significant, but you know it shouldn't be the case: it means you need to look further.

Also, if your a priori estimate is that a triple is worth 1.45 runs, you shouldn't be testing it against 0 in the first place. You should be testing it versus 1.45, if you want a meaningful test.
At Tuesday, May 19, 2009 10:44:00 PM, Phil Birnbaum said...: G Wolf: my point was that while the scenario I suggested was logically possible, it would be so obviously unlikely that it would never be considered. Even by a strict application of the "if it's not significantly different from zero, ignore it" rule, a researcher wouldn't put forth that scenario as realistic.
At Wednesday, May 20, 2009 8:57:00 AM, Tom Spradlin said...: With respect to G Wolf's comment on testing the coefficient against 1.45:

Why test it against anything? Just say that it is almost certainly in the range of 1.88 plus or minus a couple of standard deviations. The test doesn't add anything.
At Wednesday, May 20, 2009 8:28:00 PM, Ted said...: Phil, you selectively quote Berri out of context -- omitting parts of his post which say many of the things you say -- and then spend several screens destroying a strawman of your own creation.

The rules of the game in doing (classical) statistics are, roughly:
(1) Formulate a hypothesis.
(2) Determine the dataset you'll use to test that hypothesis.
(3) Determine the threshold at which you are willing to conclude that the data do not support the hypothesis.
(4) Look at the data and see whether or not the threshold is met.

Them's the rules, as they are (or at least should be) taught in any intro to statistics class. If you don't follow those rules, then you can (often) get any conclusion you want. In the context of these rules, then, either your data met the threshold for rejecting your null hypothesis, or it didn't.

Let's look at your triples study strawman. The question you asked -- maybe it wasn't what you *thought* you were asking, but it's the one you asked of the data -- is whether, in one particular season, you could reject the hypothesis that teams which hit more triples did not score more runs. The answer you found is that you cannot.

Now, Berri's point is that at this point, you can't start the "ohbuts." Yes, we all intuitively know that other things being equal, more triples is a good thing (unless it's at the expense of HRs). But that's not the null hypothesis you chose, nor is it the model you estimated. Also, you asked the question of one season's data. If you had more data, why did you not include it in your estimation? That violates step (2) -- you can't say "oh, but I have this other information I didn't use..." Use all the information you have at your disposal in designing the test in the first place.

And, after all that, you omit Berri's key paragraph which explains this. Everyone should read Berri's original post, and look at the paragraph starting with "It's important to highlight my wording." In that paragraph, he already covers the points of Phil's which are valid.

But, basically, what good econometric practice says is that, once you *use all the information you have*, if you can't statistically determine that an effect is different from zero, then you should conclude you do not have evidence that it is different from zero. That's eminently sensible.
At Wednesday, May 20, 2009 9:53:00 PM, Phil Birnbaum said...: Hi, Ted,

I don't think I agree with you here.

First, I've re-read Berri's paragraph you highlighted several times, and I'm not sure what you're saying. Berri says,

"From this we can conclude that the empirical evidence suggests no relationship between market size and voting points ... It’s important to highlight my wording. The empirical evidence “suggests” a story."

The story Berri is suggesting is that there is no relationship.

He does go on to say that you can do another study -- but he seems to continue to assert that you can't add "but" to the statement that the results were not significant. What he DOES say is that it's legitimate to ask questions of the form, "if you did the study differently, might you get different results?" I agree with him that those are legitimate questions, and I have no dispute with him there. Our difference is that he seems to think all your "but"s have to do with redoing the study, and I think the "but"s can reference the (insignificant) results of the current study.

I don't agree with you on the triples experiment, because, for one thing, you don't know what my hypothesis is! Who said it was zero?

Indeed, my point is that -- and I hope I'm using the terminology right -- that you should be intuitively using a Bayesian approach. Before you did the experiment, you had an intuitive prior for what a triple might be worth. The experiment gives you more information. Now you have a posterior based on the evidence that gave you the prior, plus this new evidence.

Why isn't it legitimate to say, "geez, you know, we can't reject the zero hypothesis on this evidence alone, but the estimate of the value of the triple sure is consistent with what we thought going in. I bet it's not really zero. We should do another study."

As I read Berri's post, he wouldn't allow me to say that.

As for zero being the only acceptable null hypothesis: let's say my regression had triples as the dependent variable, and (runs - 1.8 * triples) as the independent variable. And my null hypothesis is that there's no correlation.

And the results are decidedly non-significant, the slope being close to zero.

So now I conclude that there is no evidence that triples are worth anything other than 1.8 runs.

Two experiments, two different null hypotheses, two cases where I can't reject the null hypothesis. But the conclusions are contradictory!

My argument is that if the two experiments give you the same relationship between triples and runs, they CANNOT lead you to different conclusions. Which means, you can't just reject the null hypothesis and stop dead. You have to argue for why that particular null hypothesis is appropriate. And if it isn't, a "yes but" is perfectly acceptable.
At Wednesday, May 20, 2009 9:55:00 PM, Phil Birnbaum said...: BTW, if anyone else believes that I've unfairly selectively omitted parts of Berri's argument, let me know. I will clarify my original post if necessary.
At Thursday, May 21, 2009 3:45:00 PM, Ted said...: I don't think I agree with you here.

"First, I've re-read Berri's paragraph you highlighted several times... The story Berri is suggesting is that there is no relationship."

To be pedantic, Berri is saying that the evidence suggests there is no relationship. But then he goes on to note that this isn't the end of the story. He is drawing a distinction between reporting the results of one test in one study -- which either does or does not reject a null hypothesis -- and the greater process of knowledge generation using statistical techniques. It is sloppy to conflate the two, which is his point, and is what I believe you're doing, when you write:

"I agree with him that those are legitimate questions, and I have no dispute with him there. Our difference is that he seems to think all your "but"s have to do with redoing the study, and I think the "but"s can reference the (insignificant) results of the current study."

"I don't agree with you on the triples experiment, because, for one thing, you don't know what my hypothesis is! Who said it was zero?"

You did. You wrote about finding a coefficient that was not statistically different from zero, which means it was your null hypothesis, whether or not you knew it was.

"Indeed, my point is that -- and I hope I'm using the terminology right -- that you should be intuitively using a Bayesian approach. Before you did the experiment, you had an inutitive prior for what a triple might be worth. The experiment gives you more information. Now you have a posterior based on the evidence that gave you the prior, plus this new evidence."

If you have prior knowledge and want to incorporate that into your analysis, then you should use a Bayesian approach, yes. I personally love Bayesian statistics, so I'd support that. But the point is, you should do that as part of the formal approach, and not bolt on ad-hoc judgments ex-post on a classical test.

"Why isn't it legitimate to say, "geez, you know, we can't reject the zero hypothesis on this evidence alone, but the estimate of the value of the triple sure is consistent with what we thought going in. I bet it's not really zero. We should do another study." As I read Berri's post, he wouldn't allow me to say that."

He wouldn't allow you to say that using the approach you chose, no. The problem is you chose the wrong statistical specification and test in the first place.

"Two experiments, two different null hypotheses, two cases where I can't reject the null hypothesis. But the conclusions are contradictory!"

No they're not contradictory. Remember that in classical statistics, you can either reject or not reject a null hypothesis, but you can never "accept" a null hypothesis. There are two possibilities, neither of which you can reject based on your data and test. That's not a contradiction.

"My argument is that if the two experiments give you the same relationship between triples and runs, they CANNOT lead you to different conclusions. Which means, you can't just reject the null hypothesis and stop dead. You have to argue for why that particular null hypothesis is appropriate. And if it isn't, a "yes but" is perfectly acceptable."

You meant "can't just not reject" I think. And, yes, it is the statistician's choice what null hypothesis to test. The point Berri is making is that the choice and justification have to happen before the tests are run, and not after.
At Thursday, May 21, 2009 5:26:00 PM, Phil Birnbaum said...: Ted,

Okay, maybe I understand what you're trying to say. Let me try and see if I've got it.

You (and Berri) seem to be saying that, by the rules and standards of this kind of research, when you put forth a hypothesis and find that you cannot reject it at the 95% level, the rules of the game force you to admit that there is insufficient evidence to reject the null. You are not permitted to add a "but maybe the null is false anyway." If you believe that the null is false, you have to find more evidence by doing another study. For instance, if you believe that you just need more data to prove a link between triples and runs, you have to go and get the more data and try again, not just sit there and come up with reasons why your 75% significance level is good enough.

Did I get that right? I don't really have a problem with that standard; it's a reasonable rule that eliminates a lot of arguments and sets a bright line for what is publishable and what isn't, and what results can be accepted and what results can't. Otherwise, you wind up with an argument over every 75% result, with the author of the study giving his opinion on why he thinks that's good enough, and journals would spend all their time subjectively arguing about whether or not every case is really a false negative.

I see it like a rule that prohibits running a red light at 3am when it's obvious there's no other traffic within a mile. You have to convict -- otherwise, if safety is a defense, you'd wind up with every ticketed driver arguing in court why his running the light was perfectly safe, and you'd never be able to convict anyone.

If that's what you're getting at, that academic convention requires it, I think that's perfectly reasonable.

So suppose I run my triples experiment, and find only 75% significance. By that convention, I am forced to conclude that there's no evidence of a connection between triples and runs scored, and I'm not allowed to speculate otherwise.

However: now you come along. Your null hypothesis is different. You may not even have a specific null hypothesis, or maybe you have strong reasons to think triples are worth "about" a run and a half. You look at my data, you see that the point estimate of a triple is really close to what you thought (although the standard error is large). Are *you* allowed to cite my experiment, informally, to argue that triples are worth more than zero? After all, my null hypothesis wasn't yours. You think I'm nuts for even suggesting that triples MIGHT be worth zero.

And if you can write about it on your blog, why can't I? I mean, can't I say, informally, "look, I know I found the result, and I know it's not significant, and I would never try to make a formal academic argument that I have enough evidence by academic standards. But, informally, to you laymen, isn't it obvious that the point estimate of 1.8 makes more sense than zero?"

Not being allowed to *formally* say "yes, but" is a reasonable metarule to provide for orderly progress in the research field (as is the 95% requirement, which is also arbitrary). It's like how newspapers have to act as if a defendant is innocent if he's found not guilty ... but if there is video evidence of him committing the crime, but it was judged inadmissible in court, we certainly can conclude that the guy did it. Maybe Berri may not feel comfortable saying that, if he tried the case. But the rest of us can!

If I understand this right and it's a case of academic convention, then Berri's argument goes too far. He says, "we can conclude that the [he means 'this'] empirical evidence suggests no relationship between market size and voting points." To which I say, no, you can only conclude that research convention says that there's not enough empirical evidence to formally reject the hypothesis of no relationship. What the evidence "suggests" depends on your prior, and you can only defend it on that basis.
At Thursday, May 21, 2009 5:29:00 PM, Phil Birnbaum said...: Ted and I previously argued:

---

Me: "Two experiments, two different null hypotheses, two cases where I can't reject the null hypothesis. But the conclusions are contradictory!"

Ted: "No they're not contradictory. Remember that in classical statistics, you can either reject or not reject a null hypothesis, but you can never "accept" a null hypothesis. There are two possibilities, neither of which you can reject based on your data and test. That's not a contradiction."

--------

Ted, you're right about this. I retract my "contradictory" argument.
At Monday, May 25, 2009 4:41:00 PM, Ted said...: (truncated a few quotes to meet the length max)
"You (and Berri) seem to be saying that... why your 75% significance level is good enough."

Well, I wouldn't object to someone saying that "maybe the null is false anyway," because that would always be true, no matter what the significance level of the test. Again, one can only "not reject" the null, so there's always the possibility the null is in fact false even when the test fails to reject it. The objection would be against selectively making this statement some of the time and not others; and, since you can always make the statement, you should either always remind the reader of this possibility, or never. Given that it's always true, it seems like it's better that it just goes without saying.

"If that's what you're getting at, that academic convention requires it, I think that's perfectly reasonable."

The issue isn't about academia; it's about the basic logic of what a statistical test is and what it does and does not do. A statistical test, basically, tells you the probability that the observations you see could have occurred by random chance under the assumption the null hypothesis holds. The objective of a statistical test is to give a yes-or-no answer to the question of whether there exists evidence that the null hypothesis is false. If you have a yes-or-no question, a line has to be drawn somewhere.

"So suppose I run my triples experiment, and find only 75% significance."

{pedantry on}It is not correct to say something like "a variable is 75% significant." What you mean is that you found that there is only a 25% chance that an effect at least as large as you observed could have occurred by random chance under the assumption that teams who hit more triples score no more or fewer runs.{pedantry off}

"By that convention, I am forced to conclude that there's no evidence of a connection between triples and runs scored, and I'm not allowed to speculate otherwise."

Again, not exactly. What Berri and his colleague said is that your test on your dataset does not allow you to conclude there is such a connection. You may continue to speculate otherwise, but you should not use the observed significance level of the given test to motivate that speculation.

"If I understand this right and it's a case of academic convention, then Berri's argument goes too far. He says, "we can conclude that the [he means 'this'] empirical evidence suggests no relationship between market size and voting points." To which I say, no, you can only conclude that research convention says that there's not enough empirical evidence to formally reject the hypothesis of no relationship. What the evidence "suggests" depends on your prior, and you can only defend it on that basis."

My whole point is that in your original post you didn't understand this correctly, and that you left out the part of Berri's post where he says basically what you say. One should critique a statistical test not on the results of the test but on the design. So, back to your triples example, valid critiques would be your sample size, and the fact that you apparently didn't control for other offensive events (if I understood your description correctly). Those are critiques of the design of your statistical experiment, and those critiques would have equal force irrespective of whatever your findings were. The proscription is against using the output of a statistical experiment either to move the goalposts as to what you consider a significant finding, or to redesign your experiment to get the finding you want. Both of those are circular, in that you'll be able to get any result you want.

And, again, this has nothing to do with the academy; it's understanding what a statistical test is, and following the practices implied by the assumptions underlying statistical testing methodology.
At Monday, May 25, 2009 5:23:00 PM, Phil Birnbaum said...: Hi, Ted,

>The objective of a statistical test is to give a yes-or-no answer to the question of whether there exists evidence that the null hypothesis is false. If you have a yes-or-no question, a line has to be drawn somewhere.

Okay, that's where I disagree. There is nothing in the logic of a statistical test that the answer has to be yes or no. That is a research convention, created by humans. Evidence, by nature, is not yes/no. A significance level of 96% is not that much stronger evidence than a significance level of 94%. There is nothing mathematically or logically special about 95%; researchers have decided to create a bright line there because *convention* requires that line to be drawn. But evidence, especially statistical evidence from a regression, comes in infinite shades of grey. God does not give yes/no answers.

>"your test on your dataset does not allow you to conclude there is such a connection."

Your words "does not allow" refer to a *convention* of the research community. Suppose an alien civilization has drawn their bright line at 80% instead of 95%. In that case, the aliens could conclude there is a connection, but the humans couldn't. But data is data, and evidence is evidence. Both conventions are arbitrary. If you choose to be convinced at 96%, but insist that you can't talk about the relationship at 94%, the difference is a property of you, not of the data, and not of the mathematics or logic of the statistical test.

>"One should critique a statistical test not on the results of the test but on the design."

Agreed! I am not critiquing Berri's (or Brook's) test. It's a straightforward regression, perfectly reasonable under the circumstances.

What I am critiquing is his specific assertion that you are not allowed to consider the evidence, for whatever it's worth, if the significance level is less than 95%. You take the evidence the test gave you, and you add it to the pile, and you do an evaluation of the evidence in the pile. Even if that one test was *all* your evidence, you can still consider it. Consider two tests: one (mine) gives a significance level of .77 and a coefficient of 1.88. The second gives a significance level of .50 and a coefficient near zero. These are different pieces of evidence! One supports the idea that triples are related to runs (although not as much as we'd like), while the other is extremely nonsupportive of a connection. My objection is that we should not treat these the same -- because they're not.

I would choose to say, "the variable proved non-significant at the .05 level, BUT the value of the coefficient was significant in a baseball sense, at 1.88 runs per triple. It's also reasonable in the baseball sense, not too far off from our intuition. Furthermore, the significance level was .23, which ain't all that great, but the sample size was small. Common sense does indeed suggest a strong connection. And, finally, unlike with quack cancer "cures," there is no reason to believe that the actual correlation is especially likely to be zero. So, in that light, there's a hint that something might be going on here, so maybe we should run another test."

Berri would say that I have to stop before the "BUT". Since everything after the BUT is true, I disagree with Berri that we shouldn't say it.
At Monday, May 25, 2009 5:29:00 PM, Phil Birnbaum said...: Also, by both your standards and mine, Berri overstates his case. He writes,

"From this [lack of statistical significance] we can conclude that the empirical evidence suggests no relationship between market size and voting points ... It’s important to highlight my wording. The empirical evidence "suggests" a story [of no relationship]."

Even by your standards, this is not correct! By your standards, the empirical evidence fails to find a relationship. This does NOT, on its own, suggest "no relationship." It could simply be that the sample size is too small. The single finding of significance level in the regression can, on its own, NEVER suggest "no relationship". Suggesting that there is no relationship requires an argument from logic. Otherwise, I could do a regression on smoking vs. cancer with a sample size of four people, be almost certain to come up short of p=0.05, and conclude that "the evidence suggests no relationship between smoking and cancer." That would NOT be true. The evidence is so weak that it suggests hardly anything!

From a standpoint of logic, if you want to argue for "no relationship," you have to *at least* show (a) a non-significant p-value, and (b) an analysis of the power of the test to show that if there *were* a relationship, you might have found it, and (c) an argument that your finding, added to the weight of evidence of no relationship, outweighs the evidence (if any) of a real relationship.

Berri is saying, "people think there's a relationship. I did a formal regression and found less than 0.05 significance. Therefore, people are wrong."

What he SHOULD be saying, according to you, is "people think there's a relationship. I did a formal regression and found less than 0.05 significance. Therefore, I can't confirm that there's a relationship."

I know it's not my main point, but it looks to me that this is one of those cases of conflating "absence of evidence" with "evidence of absence."
At Tuesday, May 26, 2009 9:08:00 AM, Grotus' Acorn said...: I agree with you when you say "Absence of evidence is not evidence of absence."

But statistical significance is not unimportant. It tells us if the question that we are asking has been answered. And that's not simply a research standard - every statistical test for which a p-value may be calculated is a question. So, if we are going to ask questions and abide by the answers we receive, yes: we must insist on statistical significance. And if we are basing our conclusions purely on these statistical analyses, we have to end our sentence with a period as do Berri's students.

The extremely low significance of that question of triples/runs means that we can't statistically answer that question. The question "Did getting a triple in 2007 lead to getting runs?" is not answerable.

But the difference between Berri's econometrics class and sabermetric research is that you're trying to assign a numerical value to a relationship you already know exists, and they are trying to determine if a relationship exists at all (and only then asking what it is.) You already know that triples lead to runs - it's wildly improbable that they don't - and you're trying to find out how many runs. So for you, statistical insignificance only tells you that you need more triples to quantify the effect. Berri's students need to be more rigorous.

And so should we. If a relationship is not likely to be very strong (triples vs. runs) or very weak (third letter of manager's middle name vs. fly balls caught by centerfielders on Wednesdays) then the very first test we should go to is statistical significance. It is the best way to determine if a relationship exists at all.

If no relationship exists at all, the regression is only an illusion.
At Tuesday, May 26, 2009 12:29:00 PM, Bryan said...: But now, Berri and his colleague are in a strange situation. They have to argue that:

-- there is a connection between doubles and runs, but
-- there is NO connection between triples and runs!

If Berri and his colleague attempt to make that argument, they are fools. In fact, the correct argument to make is as follows:

-- there is a connection between doubles and runs, but
-- there is not enough data to determine the relationship between triples and runs.

This scenario makes quite a bit of sense, in fact. Doubles are much more common than triples (about four times more so), so looking at doubles vs. triples over a similar time span would give much more doubles data. In other words, performing a regression on 2007 data for both doubles and triples is not a comparable analysis in terms of the amount of data to be analyzed. The much larger amount of data analyzed in the doubles regression allows for much further dilution of outliers and results in a reduction of variance. To get the same amount of data to analyze (and thus reduce your variance) you would need to analyze four years' worth of triples data.
At Thursday, June 11, 2009 8:05:00 AM, James said...: Do power calculations explain a lot of this discussion?

Given the size of the sample in the original study, one should be able to calculate what size of effect triples would have to have to be highly likely to give you significance at the (arbitrary) cut-off of p=0.05.

I think with only one year's data triples would have to be very influential to be significant, possibly needing a coefficient of 2.0 (a guess). The authors should say they found no significant effect of triples, but add that the study was large enough to be 80 or 90% likely to detect an effect of size X.

In other words, saying you looked and didn't see any significance is fine, provided you say how hard you looked and what size of effect would have been detected.
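As a rough illustration of the kind of power calculation described here, the sketch below simulates one season of 30 teams many times and counts how often a simple regression of runs on triples reaches p < 0.05 for a given true effect. Every number in it is an assumption chosen for illustration, not an estimate from the 2007 data: the average of 30 triples per team, the spread of 8.5 triples between teams, the baseline of 750 runs, and the 70 runs of unexplained noise are all hypothetical placeholders. The critical value 2.048 is the two-sided 5% cut-off for a t statistic with 28 degrees of freedom.

```python
import random
import statistics

N_TEAMS = 30
T_CRIT_28 = 2.048  # two-sided 5% critical value, Student's t with 30 - 2 = 28 df


def slope_t_stat(x, y):
    """t statistic for the slope in a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = (sse / (n - 2) / sxx) ** 0.5
    return slope / std_err


def estimated_power(true_slope, n_sims=2000, seed=0):
    """Share of simulated seasons in which the slope is significant at 0.05."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        # Hypothetical league: triples ~ N(30, 8.5), noise in runs ~ N(0, 70).
        triples = [rng.gauss(30, 8.5) for _ in range(N_TEAMS)]
        runs = [750 + true_slope * t + rng.gauss(0, 70) for t in triples]
        if abs(slope_t_stat(triples, runs)) > T_CRIT_28:
            significant += 1
    return significant / n_sims


for effect in (0.0, 2.0, 5.0):
    print(f"true runs per triple = {effect}: power ~ {estimated_power(effect):.2f}")
```

Under these made-up settings, a true effect of zero should come out significant only about 5% of the time (the false positive rate), and the power climbs as the true runs-per-triple value grows. Reporting the effect size at which power reaches 80 or 90% is the "how hard you looked" statement suggested above.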

Another way of looking at this is:
If you toss a coin four times and it comes up heads each time, that is not statistically significantly different from random. But if the coin was actually biased and comes up heads 90% of the time, then you accepted the null hypothesis when you shouldn't have, i.e. a false negative result. Most stats courses focus exclusively on false positives by stressing p-values etc. and neglect power calculations to avoid false negatives.
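The coin example can be worked out exactly. The sketch below assumes a one-sided exact binomial test of "the coin is fair" against "the coin favours heads" at the 0.05 level; the function names are just illustrative. It computes the p-value for four heads in four tosses, and then the power of the test (the chance of rejecting "fair") when the coin really lands heads 90% of the time.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def one_sided_p_value(heads, n):
    """P(at least `heads` heads in n tosses) if the coin is fair."""
    return sum(binom_pmf(k, n, 0.5) for k in range(heads, n + 1))


def power(n, p_heads, alpha=0.05):
    """Chance the test rejects 'fair coin' when the true P(heads) is p_heads."""
    return sum(
        binom_pmf(k, n, p_heads)
        for k in range(n + 1)
        if one_sided_p_value(k, n) <= alpha
    )


print(one_sided_p_value(4, 4))      # 0.0625: four heads in four tosses misses 0.05
print(power(4, 0.9))                # 0: with four tosses the test can never reject
print(round(power(10, 0.9), 4))     # 0.7361: ten tosses catch the biased coin most of the time
```

With only four tosses, even the most extreme possible outcome has a p-value of 0.0625, so the test has zero power no matter how biased the coin is. A non-significant result from that experiment says nothing about whether the coin is fair.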

Ideally, studies should be designed to detect a difference that makes sense in the real world, and this is where sabermetrics comes in, in my opinion. Rather than blame the science of statistics, we should blame the authors for not stating the power of the study.
At Tuesday, June 23, 2009 2:01:00 AM, Rob Kremer said...: I couldn't find an email for you on your blog, so decided to just post this in a comment. Sorry for it being off topic:

I’m just a baseball Dad, but I try to follow some of the SABR stuff.

Have you ever heard of anybody tracking some statistical measure of a hitter's ability to survive a 2-strike count and not make an out?

For instance, it could be a simple ratio - the percentage of time a hitter gets two strikes and doesn't make an out.

I wonder what the major league average is? Is there a large variance between different types of hitters?

Are some hitters freakishly able to get two strikes and not make an out? Is this statistically independent from just measuring the strikeout ratio? That is, does it tell us anything that strikeout ratio doesn't?

I’ve heard “He’s a good two strike hitter,” but I wonder if it is true – are some players actually consistently better than others with two strikes? It seems that this would be easier to answer than the murky question of whether there are clutch hitters, since there are no definitional problems.

I've never heard reference to anything like this statistic, and was wondering if you were aware of anyone who has analyzed it to see if it is related to creating runs in a way that other batting statistics don’t reveal.

Just wondering. Would appreciate any input you could give me.
At Wednesday, July 08, 2009 10:11:00 AM, Anonymous said...: I'm not a statistician by any means, but I would actually think theres a good chance that Triples would have a low or even negative correlation to runs scored. Triples hitters often tend to be small speedy guys who may lack other skils which add to a to a team's runs scored. (Emilio Bonafacio or Michael Bourn types). So even though triples are good in and of themselves, teams that hit a lot of tripls might be doing so at the cost of fewer Home Runs and OBP, which would hurt their overall runs scored for the year.

It reminds me of a fantasy football contest that Rotoworld.com ran last year. You were allowed to pick a roster based on a $250 salary restriction, where players all had different salaries. Last year at one point, Courtney Taylor was the 5th most highly correlated Wide Receiver to successful fantasy teams, despite the fact that he had done absolutely nothing to contribute to the teams who drafted him. The reason was that he was a good "sleeper" pick and that people who selected him tended to be knowledgeable fantasy football players. This means that even though Courtney Taylor is "bad", it doesn't mean teams with Courtney Taylor are bad. In the same way, even though triples are "good", it might not mean that teams with a lot of triples are good.

My point is that it seems as if the whole reason that you should strictly adhere to the concept of statistical significance is that there are many counter-intuitive findings out there, and if you allow the bias you had previous to running the study (teams with a lot of triples score more runs) to say that your findings are "significant enough", that could lead you to some incorrect conclusions.

Here is the link to the Courtney Taylor article: http://www.pro-football-reference.com/blog/?p=682
At Wednesday, July 08, 2009 2:15:00 PM, Phil Birnbaum said...: Anonymous,

You're right: the correlation between triples and runs could be negative, for the reasons you give. And you're right that correlation does not imply causation.

But my broader point remains: just because you don't have statistical significance doesn't mean you should assume the correlation is zero. You should be able to argue about what the coefficient means, even while you recognize that it's not statistically significant and the discussion may need to remain tentative until you have more data.
