Statistical significance is only one piece of evidence
Joe Blow hits .300 in the clutch, and .230 non-clutch. Someone does a standard statistical test on the difference, and finds a significance level of .05. That means that, if the player were actually exactly the same in the clutch as the non-clutch, there would have been only a 5% chance of him having a 70-point-or-more difference just by luck.
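To make that 5% concrete, here's a rough sketch of one such "standard statistical test": a one-sided, two-proportion z-test. The at-bat totals are my own invention (the example doesn't give any), chosen so the result lands near .05, and the function name is just for illustration.

```python
import math

def one_sided_p_value(hits_a, at_bats_a, hits_b, at_bats_b):
    """Chance of a gap at least this large if both true averages are equal."""
    avg_a = hits_a / at_bats_a
    avg_b = hits_b / at_bats_b
    # Under the null hypothesis, pool everything into one batting average.
    pooled = (hits_a + hits_b) / (at_bats_a + at_bats_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / at_bats_a + 1 / at_bats_b))
    z = (avg_a - avg_b) / std_err
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical totals: 36-for-120 in the clutch (.300), 138-for-600 otherwise (.230).
p = one_sided_p_value(36, 120, 138, 600)
print(f"p = {p:.3f}")
```

With those made-up totals, the p-value comes out at about .05. Note what the function returns: the probability of the data given "no difference," and nothing about how big the real difference is.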
That 5% is something of a standard in research: at .05 or below, you conclude that the observation is unlikely enough -- "statistically significant" enough -- to justify a conclusion that you're seeing something real. So, you "reject" the idea that the differences between his clutch and non-clutch performances were caused by luck. You conclude that there is something going on that leads to Joe doing better in the clutch.
The above is my paraphrase of something Tango posted yesterday. He cites the above, and then makes an important caveat: the fact that you reject the null hypothesis -- that you're rejecting the idea that there's no difference between clutch and non-clutch -- does NOT mean that you should conclude that the actual difference is 70 points. All you can conclude, Tango argues, is that the difference isn't zero. For all you know, it might be 30 points, or 10 points, or even 1 point. You are NOT entitled to assume that it's 70 points, just because that's what the actual sample showed.
He's absolutely right.
Then, later in the comments, he says (and I'm paraphrasing again) that the actual difference should be taken to be greater than zero, but less than 70. Well, I agree with him on this baseball point, that greater than 0 and less than 70 is correct. But the point I want to make here is that you *can't* conclude that *from just the statistical test*. The test itself doesn't let you say that. You have to use common sense and logic, and make an argument.
First, and as commenter Sunny Mehta points out, the standard "frequentist" statistical method described here does NOT allow you to say anything at all about what the actual difference might be. It says that if you make the assumption that the parameter is 0 (no difference between clutch and non-clutch), the observation will only happen 5% of the time. If you choose to cite the rarity of the 5%, and reject the hypothesis, the statistical method doesn't say anything about how to adjust your guess as to what the real parameter is. All you can say is "non-zero."
If you want to do better than that, and argue about what the difference actually is, you have to use "Bayesian" methods. You can do that formally, or you can do that informally. Informally is easier: it basically means, "now that you've done the test and got a confidence interval, make a common sense argument about what's really going on, based on what you know about the real world."
That's what Tango is doing when he rejects the 70 points. At the risk of trying to read his mind, what he's saying is: "first, the test only lets you reject the zero. It doesn't tell you the answer is 70 points, or, for that matter, anything else in particular. And second, from what I know about baseball, 70 points is ridiculous."
The "from what I know about baseball" part is critical. Because, let's suppose instead of clutch and non-clutch, you compared Mark Belanger to Tony Gwynn. And let's suppose you got the same 70 point difference, and the same significance level. I'm guessing that you'd no longer argue that the "real" difference is between 0 and 70 points. You'd argue that it's probably MORE than 70 points. Why? Because your baseball knowledge tells you that Mark Belanger normally hits .220, and Tony Gwynn normally hits .330, and if the difference you observed was only 70 points, that's probably a bit too low. There's still some randomness going on -- it could be that Belanger learned something over the off-season, and Gwynn is injured -- but it's more likely that the talent difference is higher, and it was just luck that made them hit only .070 apart.
For clutch, 70 points is ridiculously high. For Belanger/Gwynn, 70 points is somewhat low. But the statistical test might wind up exactly the same. It's only because of your pre-existing baseball knowledge -- your "prior," as it's called in Bayesian statistics -- that you argue the two cases differently.
When would your best guess be that the difference is *exactly* 70 points? When you know nothing about the subject, and 70 points seems as good a guess as anything else. Suppose there are two tribes with two systems of measurement. One measures in "blorps," and one measures in "glimps." You ask the tribes, independently, to estimate certain identical distances. You do the calculations, and it turns out that a "blorp" seems to be about .230 miles, and a "glimp" seems to be about .300 miles. The difference is significant at .05, so you conclude that a blorp is not equal to a glimp. What's your best estimate of the difference between them? .070. You have no "prior" reason to think it should be more, or less.
And there are even times where you can use common sense to call BS, and believe that the difference is zero DESPITE the statistical significance. Suppose Player X hits .230 when the first decimal place of the Dow Jones average is even, and he hits .300 when it's odd. Even though the difference is statistically significant at p=.05, you're not going to actually believe there's a connection, are you? It would take a lot more than that ... a significance level of 5% is only one time in 20. It's a lot less likely than 1 in 20 that there's actually some connection between the digit and the player's hitting, isn't there? I mean, if you get enough researchers to come up with some pseudo-random split, one of them is likely going to come up with something significant. You have to use common sense to accept the "nothing going on" hypothesis in practical terms, even though you formally "reject" it in frequentist statistical terms.
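The "enough researchers" point is easy to put a number on. Here's a small sketch, assuming each researcher's split is independent of the others and that nothing real is going on in any of them (both are my simplifying assumptions, and the function name is made up):

```python
def chance_of_false_alarm(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes up "significant" at level alpha purely by luck."""
    return 1 - (1 - alpha) ** num_tests

for n in (1, 20, 100):
    print(f"{n} splits: {chance_of_false_alarm(n):.0%} chance of a 'finding'")
```

With 20 meaningless splits, the chance that somebody finds something "significant" is already well over half.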
So four different cases, four different conclusions, even when the identical statistical test shows a 70 point difference at p=.05:
-- Clutch hitter: real-life difference likely much less than 70 points, but greater than 0
-- Gwynn/Belanger: real-life difference likely more than 70 points
-- Glimps/Blorps: real-life difference likely around 70 points
-- Odd/Even digits: real-life difference likely 0 points
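If you wanted to do the Bayesian part formally rather than informally, the four cases above could be sketched like this. It's the textbook normal-prior, normal-observation update, where the best guess is a weighted average of what you believed before and what you observed. Every number here is an assumption of mine for illustration: the standard error of 43 points is roughly what a 70-point gap at p=.05 implies, and the four priors are just my guesses at what "common sense" might look like in each case, with everything counted in thousandths.

```python
def posterior_mean(prior_mean, prior_sd, observed, observed_se):
    """Normal prior plus normal observation: a precision-weighted average."""
    prior_weight = 1 / prior_sd ** 2
    data_weight = 1 / observed_se ** 2
    total = prior_mean * prior_weight + observed * data_weight
    return total / (prior_weight + data_weight)

OBSERVED = 70.0     # observed difference, in points
OBSERVED_SE = 43.0  # assumed standard error, about what p = .05 implies

# Hypothetical priors as (mean, standard deviation), in points.
priors = {
    "clutch": (0.0, 10.0),             # real clutch talent is small
    "gwynn_belanger": (110.0, 20.0),   # career averages are far apart
    "glimps_blorps": (0.0, 1000.0),    # we know almost nothing
    "odd_even_digits": (0.0, 1.0),     # almost certainly no effect
}

estimates = {
    name: posterior_mean(mean, sd, OBSERVED, OBSERVED_SE)
    for name, (mean, sd) in priors.items()
}
for name, estimate in estimates.items():
    print(f"{name}: {estimate:.1f} points")
```

Same observation, same standard error, four different answers: the clutch estimate lands between 0 and 70, Gwynn/Belanger lands above 70, glimps/blorps stays near 70, and the Dow Jones split stays near zero. Change the priors and the numbers change, which is exactly why you have to argue for them.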
A good way to think of it: the result of the statistical test adds to the pile of evidence on whatever issue it's testing. If you have no evidence, take the 70 points (or whatever) at face value. If you DO have evidence, use the 70 point difference to add to the pile, and it will move your conclusion one way or the other.
As Tango says, you are not entitled to assume that the difference is really 70 points just because the 70 is statistically significant. You have to make an argument. And, sure, your argument can be, "I don't know anything about baseball, so 70 points is the most likely difference." But it's perfectly legitimate for Tango to turn around and say, "I DO know something about baseball, and 70 points makes no sense."
Unfortunately, a lot of researchers don't understand that. Or, they believe that the fact that they're not using explicit formal Bayesian methods means that they're allowed to assume the 70 points without further comment. Tango writes, "[that] is pretty much how I see conclusions being made."
And I agree with him that that's just not right.
Your one statistical test is just one piece of evidence. It does not entitle you to ignore any other evidence, and it does not allow you to make conclusions about baseball without making a logical argument about what your finding means, in the light of other evidence that might be available.