Friday, March 19, 2010

Statistical significance is only one piece of evidence

Joe Blow hits .300 in the clutch, and .230 non-clutch. Someone does a standard statistical test on the difference, and finds a significance level of .05. That means that, if the player were actually exactly the same in the clutch as in the non-clutch, there would have been only a 5% chance of him having a 70-point-or-more difference just by luck.

That 5% is something of a standard in research: at .05 or below, you conclude that the observation is unlikely enough -- "statistically significant" enough -- to justify a conclusion that you're seeing something real. So, you "reject" the idea that the differences between his clutch and non-clutch performances were caused by luck. You conclude that there is something going on that leads to Joe doing better in the clutch.
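To make the mechanics concrete, here's a minimal sketch of the kind of test being described -- a two-proportion z-test. The at-bat counts are hypothetical assumptions (the post doesn't give Joe's actual sample sizes); they're chosen so that a 70-point gap lands right around the .05 threshold:

```python
from math import sqrt, erf

def two_prop_z(p1, n1, p2, n2):
    """Two-sided p-value for a difference in proportions,
    assuming the true difference is zero (the null hypothesis)."""
    # Pooled proportion under the null of no true difference
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Convert the z-score to a two-sided p-value via the normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical at-bat counts: with ~300 at-bats in each split,
# a .300 vs .230 gap comes out right around p = .05
p_value = two_prop_z(0.300, 300, 0.230, 300)
print(round(p_value, 3))
```

If the real sample sizes were bigger, the same 70-point gap would come out more significant; smaller, less. The test only speaks to how surprising the sample is under the null.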

The above is my paraphrase of something Tango posted yesterday. He cites the above, and then makes an important caveat: the fact that you reject the null hypothesis -- that you're rejecting the idea that there's no difference between clutch and non-clutch -- does NOT mean that you should conclude that the actual difference is 70 points. All you can conclude, Tango argues, is that the difference isn't zero. For all you know, it might be 30 points, or 10 points, or even 1 point. You are NOT entitled to assume that it's 70 points, just because that's what the actual sample showed.

He's absolutely right.

Then, later in the comments, he says (and I'm paraphrasing again) that the actual difference should be taken to be greater than zero, but less than 70. Well, I agree with him on this baseball point, that greater than 0 and less than 70 is correct. But the point I want to make here is that you *can't* conclude that *from just the statistical test*. The test itself doesn't let you say that. You have to use common sense and logic, and make an argument.

First, and as commenter Sunny Mehta points out, the standard "frequentist" statistical method described here does NOT allow you to say anything at all about what the actual difference might be. It says that if you make the assumption that the parameter is 0 (no difference between clutch and non-clutch), the observation will only happen 5% of the time. If you choose to cite the rarity of the 5%, and reject the hypothesis, the statistical method doesn't say anything about how to adjust your guess as to what the real parameter is. All you can say is "non-zero."

If you want to do better than that, and argue about what the difference actually is, you have to use "Bayesian" methods. You can do that formally, or you can do that informally. Informally is easier: it basically means, "now that you've done the test and got a confidence interval, make a common sense argument about what's really going on, based on what you know about the real world."

That's what Tango is doing when he rejects the 70 points. At the risk of trying to read his mind, what he's saying is: "first, the test only lets you reject the zero. It doesn't tell you the answer is 70 points, or, for that matter, anything else in particular. And second, from what I know about baseball, 70 points is ridiculous."

The "from what I know about baseball" part is critical. Because, let's suppose instead of clutch and non-clutch, you compared Mark Belanger to Tony Gwynn. And let's suppose you got the same 70-point difference, and the same significance level. I'm guessing that you'd no longer argue that the "real" difference is between 0 and 70 points. You'd argue that it's probably MORE than 70 points. Why? Because your baseball knowledge tells you that Mark Belanger normally hits .220, and Tony Gwynn normally hits .330, and if the difference you observed was only 70 points, that's probably a bit too low. There's still some randomness going on -- it could be that Belanger learned something over the off-season, or Gwynn is injured -- but it's more likely that the talent difference is higher, and it was just luck that made them hit only .070 apart.

For clutch, 70 points is ridiculously high. For Belanger/Gwynn, 70 points is somewhat low. But the statistical test might wind up exactly the same. It's only because of your pre-existing baseball knowledge -- your "prior," as it's called in Bayesian -- that you argue the two cases differently.
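One way to see how the prior changes the answer is a simple normal-normal Bayesian update. Every number below is an illustrative assumption, not an estimate from real data: the observed gap is 70 points with a guessed standard error, the clutch prior is centered at zero (true clutch differences, if any, are tiny), and the Belanger/Gwynn prior is centered at the 110-point gap in their usual averages:

```python
def posterior_mean(obs, obs_se, prior_mean, prior_sd):
    """Normal-normal Bayesian update: the posterior mean is a
    precision-weighted average of the prior mean and the observation."""
    w_obs = 1 / obs_se ** 2
    w_prior = 1 / prior_sd ** 2
    return (obs * w_obs + prior_mean * w_prior) / (w_obs + w_prior)

obs, obs_se = 0.070, 0.036  # 70-point observed gap, assumed standard error

# Clutch: prior centered at 0 with sd ~10 points -> estimate shrinks way down
clutch = posterior_mean(obs, obs_se, 0.000, 0.010)

# Belanger/Gwynn: prior centered at 110 points with sd ~20 points
# -> estimate gets pulled ABOVE the observed 70 points
talent = posterior_mean(obs, obs_se, 0.110, 0.020)

print(round(clutch, 3), round(talent, 3))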

When would your best guess be that the difference is *exactly* 70 points? When you know nothing about the subject, and 70 points seems as good a guess as anything else. Suppose there are two tribes with two systems of measurement. One measures in "blorps," and one measures in "glimps". You ask the tribes, indpendently, to estimate certain identical distances. You do the calculations, and it turns out that a "blorp" seems to be about .230 miles, and a "glimp" seems to be about .300 miles. The difference is significant at .05, so you conclude that a blorp is not equal to a glimp. What's your best estimate of the difference between them? .070. You have no "prior" reason to think it should be more, or less.

And there are even times when you can use common sense to call BS, and believe that the difference is zero DESPITE the statistical significance. Suppose Player X hits .230 when the first decimal place of the Dow Jones average is even, and he hits .300 when it's odd. Even though the difference is statistically significant at p=.05, you're not going to actually believe there's a connection, are you? It would take a lot more than that ... a significance level of 5% is only one time in 20, and it's a lot less likely than 1 in 20 that there's actually some connection between the digit and the player's hitting. I mean, if you get enough researchers to come up with some pseudo-random split, one of them is likely going to come up with something significant. You have to use common sense to accept the "nothing going on" hypothesis in practical terms, even though you formally "reject" it in frequentist statistical terms.
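The "enough researchers" point is just multiple-comparisons arithmetic: if each pseudo-random split has a 5% chance of looking significant purely by luck, and the splits are (roughly) independent, the chance that at least one of them does grows quickly:

```python
# Chance that at least one of k independent tests, each with a true null,
# comes up "significant" at the .05 level purely by luck
for k in (1, 20, 100):
    p_any = 1 - 0.95 ** k
    print(k, round(p_any, 2))
```

With 20 splits, the chance of at least one spurious "significant" result is already about 64%; with 100, it's near certainty.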

So: four different cases, four different conclusions, even when the identical statistical test shows a 70-point difference at p=.05:

-- Clutch hitter: real-life difference likely much less than 70 points, but greater than 0
-- Gwynn/Belanger: real-life difference likely more than 70 points
-- Glimps/Blorps: real-life difference likely around 70 points
-- Odd/Even digits: real-life difference likely 0 points

A good way to think of it: the result of the statistical test adds to the pile of evidence on whatever issue it's testing. If you have no evidence, take the 70 points (or whatever) at face value. If you DO have evidence, use the 70 point difference to add to the pile, and it will move your conclusion one way or the other.

As Tango says, you are not entitled to assume that the difference is really 70 points just because the 70 is statistically significant. You have to make an argument. And, sure, your argument can be, "I don't know anything about baseball, so 70 points is the most likely difference." But it's perfectly legitimate for Tango to turn around and say, "I DO know something about baseball, and 70 points makes no sense."

Unfortunately, a lot of researchers don't understand that. Or, they believe that the fact that they're not using explicit formal Bayesian methods means that they're allowed to assume the 70 points without further comment. Tango writes, "[that] is pretty much how I see conclusions being made."

And I agree with him that that's just not right.

Your one statistical test is just one piece of evidence. It does not entitle you to ignore any other evidence, and it does not allow you to make conclusions about baseball without making a logical argument about what your finding means, in the light of other evidence that might be available.



At Friday, March 19, 2010 9:59:00 AM, Anonymous lincolndude said...

Phil, thanks for laying this out, very helpful. However, I'm still left with a question (echoing what Sunny Mehta commented on Tangotiger's post):

If you find only a 5% chance, given the null hypothesis, of getting that 70-point gap, is it legitimate to conclude that there is a 95% chance that the null hypothesis is false?

At Friday, March 19, 2010 10:54:00 AM, Blogger Unknown said...

hmm, your statement about not being able to estimate the difference between two samples in a frequentist framework is not quite true. it's called estimating the difference between two sample means (or proportions) and is done all the time in frequentist statistics. this isn't to say that people shouldn't go over to the bayesian world (see, e.g., the "Odds Are, It's Wrong" article), but if you haven't drunk that kool-aid just yet, you can still estimate differences (and not be wrong to do it!).

to wit, using the original motivating example, if in a series of 175 at bats, joe blow hits 0.300 in the clutch and in a series of 300 at bats, joe blow hits 0.230 in non-clutch, one formula for estimating the clutch "boost" yields an estimated average effect of 0.07 with a 95 percent confidence interval of [-0.01, 0.15].

in this case the confidence interval contains zero, so we would fail to reject the null hypothesis of no difference. note this is the same inference you would make if you just did a simple difference-in-proportions test (you'll always get the same answer regardless of approach). that, of course, does not mean that the two quantities are the same -- i.e., we're not accepting the null that no difference exists. indeed, there is a decent amount of evidence to suggest that joe blow is better in the clutch.
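As a check on the arithmetic, the interval above can be reproduced with the standard unpooled (Wald) formula for a difference in proportions, using the at-bat counts from this comment:

```python
from math import sqrt

# Unpooled (Wald) 95% confidence interval for a difference in proportions
p1, n1 = 0.300, 175   # clutch
p2, n2 = 0.230, 300   # non-clutch

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(round(lo, 2), round(hi, 2))  # roughly [-0.01, 0.15], matching the comment
```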

At Friday, March 19, 2010 11:39:00 AM, Blogger Sunny Mehta said...

is it legitimate to conclude that there is a 95% chance that the null hypothesis is false?

Absolutely not! This was my point in the comment to Tango. Frequentist methods do NOT comment on the probability of the hypothesis. The only probabilistic statement they make is about the sample.

Even with Ryan's comment above about confidence intervals, that does not tell us anything about the probability of the hypothesis being true. It just tells us that if you took an infinite number of samples and constructed these "intervals" each time, 95 percent of them would contain the true parameter. But the parameter is either in the interval or it's not. There's no probability of it being in because it's not a probabilistic proposition, it's a 0 or 1 proposition. (Incidentally, I think people - even people who have stats backgrounds - often confuse the meaning of confidence intervals, partly because it's a confusing concept.)

Frequentist methods consider a parameter to be an unknown but fixed quantity. I.e., you're not guessing at the probability of it; you're just assuming a value to be true and then looking at its hypothetical sample distribution.

Bayesian methods, on the other hand, consider a parameter to be a random variable with its own probability distribution. In essence they're doing the opposite. They are taking the sample and using it to guess different probabilities of the hypothesis being true. Again, they are looking at the probability of the true parameter, given a particular sample. Frequentists look at the probability of getting a particular sample, given a hypothetical parameter.

That might seem similar because the language of those sentences is so close, but they're NOT. P(A|B) is not the same as P(B|A), and in fact they can often be very different.

Saying "95 percent of terrorists are Muslim" is not remotely the same as saying "95 percent of Muslims are terrorists", but our brains often have a hard time deciphering that because the language is so close in both sentences.
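A quick Bayes-rule sketch makes the distinction numeric. All the inputs here are made-up assumptions -- a prior on the null and a likelihood under the alternative, neither of which the frequentist test supplies; the point is only that P(data | null) = .05 does not translate into P(null | data) = .05:

```python
# Hypothetical numbers: a p-value is P(data | null), not P(null | data)
p_data_given_null = 0.05
p_data_given_alt = 0.50   # assumed likelihood under the alternative
p_null = 0.90             # assumed prior: real clutch skill is rare

# Bayes' rule: P(null | data) = P(data | null) * P(null) / P(data)
p_data = p_data_given_null * p_null + p_data_given_alt * (1 - p_null)
p_null_given_data = p_data_given_null * p_null / p_data
print(round(p_null_given_data, 2))  # ~0.47 -- nowhere near 0.05
```

Change the prior or the assumed likelihood and the posterior moves all over the place; the p-value alone pins none of it down.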

At Friday, March 19, 2010 11:54:00 AM, Blogger Sunny Mehta said...

Also, Phil, I don't know if you've seen this:

It's dense, but Vic knows his shit. (You can ignore my comments in the comments thread there - I didn't understand wtf was going on.)

At Friday, March 19, 2010 2:06:00 PM, Blogger Phil Birnbaum said...

Ryan's right ... often, studies DO estimate a confidence interval for a parameter, even though I argued that you can't.

What I should have said is that you can't do that *unless you make further assumptions*. The assumption that most researchers make is that any value of the parameter in the vicinity of the confidence interval is roughly equally likely to be true based on any previous evidence on the subject.

For instance, let's suppose that you find that an extra $10,000 of income results in an average 1.2 extra months of life, with a 95% confidence interval of (1.0, 1.4). Being able to state that confidence interval implies that you believe there's no other evidence that makes, say, 0.9 significantly more likely to be the right answer than, say, 1.3.

For clutch hitting, that doesn't hold. If you use the test Ryan describes, and come up with a confidence interval of, say, (0.010, 0.130), you're assuming that all those values (and others nearby) are equally likely according to previous evidence and logic. They are not: other evidence suggests the number is very small.

I'll write more on this when I get a chance.

At Monday, March 22, 2010 7:17:00 PM, Blogger G Wolf said...

the actual difference should be taken to be greater than zero, but less than 70

I disagree with the notion that it should be less than 70. If 70 is your point estimate, think of the error bars (or confidence intervals, depending on the specifics) around that point estimate. They extend in both directions, above and below 70.

At Tuesday, March 23, 2010 5:12:00 PM, Blogger pobguy said...

I agree with G. Wolf's comment. If no other information is available other than the batting averages in clutch vs. non-clutch, then 70 is the best estimate of the difference, with an error bar (i.e., a confidence interval) that extends above and below 70 (just as Ryan said). I haven't thought hard enough about it yet to know whether the 95% confidence interval is symmetric about 70 or not.

But, for the record, I also agree with Phil's general thesis that we know more about baseball than just the batting averages, so that quite likely the "real" answer is less than 70.

At Saturday, March 27, 2010 1:02:00 PM, Anonymous Anonymous said...


As much as I agree with your point about confusing rejecting H0 and accepting Ha, I do agree with some of the commenters here. Typically, a p-value is reported in hypothesis testing, and there is no need to specify the significance level. As you know, the p-value is a likelihood.

At Saturday, March 27, 2010 1:13:00 PM, Blogger Phil Birnbaum said...


I'm not sure what you mean. I agree that the p-value (you mean like p<.05, right?) is always legitimate. I'm arguing that it's the point estimate and confidence interval that always need an argument to support relying on them in real life.

Am I missing your point?

