Regression to the likely
In the previous post, I gave an example of a statistical test on clutch hitting. It went like this:
"Joe Blow hits .300 in the clutch, and .230 non-clutch. Someone does a standard statistical test on the difference, and finds a significance level of .05. That means that, if the player were actually exactly the same in the clutch as the non-clutch, there would have been only a 5% chance of him having a 70-point-or-more difference just by luck."
Typically, when you get a statistically significant result like this, you use the observed effect as your estimate of the real effect -- in this case, 70 points. Previously, Tango had argued that you shouldn't do that. All you've shown by "statistical significance" is that the result is significantly different from zero. It could be 40 points, it could be 20 points, it could be anything non-zero. You shouldn't just assume it's 70.
I agreed. The point is that you have to take this result, and combine it with everything else you know about clutch hitting, before making a "baseball" estimate of what that observed 70-point difference really tells you.
To clarify what I meant in the previous post, let me give you an example that makes it more obvious. Suppose that you decide to study how good a hitter Albert Pujols is. You randomly pick 8 of his at-bats, and it turns out that in those AB, he went 7-for-8. And suppose your null hypothesis is that Pujols is average, just a .270 hitter.
If you were to do a traditional binomial test, you would find that Pujols' observed .875 batting average is high enough that you would easily reject the null hypothesis that he's .270.
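To put a number on "easily reject," here is a minimal sketch of the one-sided binomial calculation: the chance that a true .270 hitter gets 7 or more hits in 8 AB. The function name `binomial_tail` is just illustrative, and the sketch assumes every at-bat is an independent trial with the same hit probability, which is the usual simplification.

```python
from math import comb

def binomial_tail(hits, at_bats, true_avg):
    """P(at least `hits` hits in `at_bats` AB) for a hitter whose true average is `true_avg`."""
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg)**(at_bats - k)
        for k in range(hits, at_bats + 1)
    )

# Null hypothesis: Pujols is a .270 hitter. Observed: 7-for-8.
p_value = binomial_tail(7, 8, 0.270)
print(f"{p_value:.5f}")
```

The result is roughly 0.0006, far below the usual .05 threshold, so the test rejects ".270 hitter" without any trouble.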
But even though the sample showed .875, would anyone seriously argue that the evidence shows that Albert Pujols is an .875 hitter? That only makes sense if you're willing to ignore everything that you know about baseball, and if you're also willing to ignore everything that *everybody else* knows about baseball -- that there's no such thing as an .875 hitter.
There is nothing wrong with the statistical calculations and statistical test that came up with the .875 estimate. It's just that a naive statistical test doesn't know anything about baseball. And if you want to make an argument about baseball, you have to use baseball knowledge. The fact that you did a statistical test, and it came up significantly different from conventional wisdom, does NOT mean that conventional wisdom is wrong. When you have piles and piles of evidence that says that Pujols is not an .875 hitter, and one statistical test that estimates he is, based only on 8 AB, then if you consider only the 8 AB and ignore everything else, you're not being a sabermetrician -- you're doing a first-year STAT 101 class assignment.
Anyway, I think the ".875 Pujols" example makes the point clearer than the ".070 clutch" example, because it's more obvious that a .875 hitter is absurd than that a .070 clutch talent is absurd. Less so to Tango, of course, who has studied the clutch issue, and probably reacted the same way to ".070 clutch" as I did to ".875 batting average."
I think the example also makes the arguments of some of the other post's commenters more understandable. A couple of them were arguing that, OK, if you know something about baseball, you might discount the .070. But *in the absence of any other information*, you can take the .070 as the best estimate. To which I say, yes, that's true (subject to serious caveats I'll get to in a minute), but not all that relevant. Because, if you rephrase it as, "in the absence of any other information about Albert Pujols, you can take .875 as the best estimate of his batting average talent" ... well, that's still true, but completely irrelevant to any study that's trying to learn something about baseball.
So, we ran the Joe Blow statistical test, and we found that the .070 was statistically significant at .05. And we're also willing to say, "in the absence of any other information, we can take the .070 as the best unbiased estimate of Joe Blow's clutch ability."
Now, I'm going to create a simulated teammate for Joe -- call him David Doe. I'm going to play God, and I'm going to create David Doe to be a clutch hitter. I'm going to go to my computer, and ask for a random number between 0 and 1, and divide it by 10 to get David's random clutch talent. I'm not going to tell you that true random clutch talent, because I'm God and you're the sabermetrician, so it's up to you to figure that out.
However, I'm going to simulate a bunch of non-clutch AB for David, and a bunch of clutch AB. I'll tell you the results and let you do the statistical test. Actually, I'll even do the test for you and tell the result and the significance level.
OK, let me run my randomizer ... done. And, hey, it turns out, coincidentally, the random numbers came up exactly the same for David as for Joe -- he hit exactly 70 points better in the clutch. And, coincidentally, the significance level is the same, .05.
As we said back when we were talking about Joe, in the absence of any other information than that given, we can say the best estimate for Joe's clutch ability is .070. Similarly, in the absence of any other information than that given, what's the best estimate for David's clutch ability?
The answer is not .070.
For David, the estimate of .070 is biased too high. Why? Because we know something in the David case that we didn't know in the Joe case: that because of the way I randomly chose David's clutch ability, it can never be more than .100.
David's true clutch ability might be between .040 and .070, and he just got lucky. It might be between .070 and .100, and he just got unlucky. Those two possibilities are symmetrical, so if we consider only those possibilities, .070 *is* our best estimate.
But there is another possibility: that David's true clutch ability is between .000 and .040, and he got even more lucky. That's not balanced by anything on the "unlucky" side, because there is no way David was actually a .100 to .140 hitter who got unlucky -- that case is impossible.
So David is more likely to have been lucky than unlucky, and the best estimate of his clutch ability is less than .070.
A way to make this more obvious is to give the standard confidence interval around .070. For both Joe and David, it might be the interval (.005, .135). For Joe, that makes sense. For David, it becomes clear that it doesn't make sense: everything above .100 is impossible, and so you know David's confidence interval is wrong.
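The David Doe argument can be checked numerically. The following is a rough sketch, not an exact reconstruction of the simulation: it assumes the luck in the observed difference is approximately normal, and it backs a standard error out of the hypothetical (.005, .135) interval by dividing its half-width by 1.96. The constant names are illustrative.

```python
from math import exp

OBSERVED = 0.070          # David's observed clutch difference
STD_ERR = 0.065 / 1.96    # assumption: implied by the (.005, .135) interval
MAX_TALENT = 0.100        # God's rule: talent is uniform between .000 and .100

def likelihood(talent):
    """Relative chance of observing .070 given a true talent (normal approximation)."""
    z = (OBSERVED - talent) / STD_ERR
    return exp(-0.5 * z * z)

# Grid over every talent God could have picked. The uniform prior is a constant,
# so the posterior weight of each grid point is just its likelihood.
steps = 10_000
grid = [MAX_TALENT * (i + 0.5) / steps for i in range(steps)]
weights = [likelihood(t) for t in grid]
total = sum(weights)

posterior_mean = sum(t * w for t, w in zip(grid, weights)) / total
prob_lucky = sum(w for t, w in zip(grid, weights) if t < OBSERVED) / total

print(f"best estimate of David's talent: {posterior_mean:.3f}")
print(f"chance his talent is below .070: {prob_lucky:.2f}")
```

Under these assumptions the best estimate comes out around .061 rather than .070, and the chance that David was lucky is about 60 percent rather than 50.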
This looks like a trick, but it's not, not really. It's just a special case of the principle that the estimate and confidence interval are not correct if some possible values of the parameter were less likely to be true than others. "Impossible" is just a special case of "less likely".
The real God does these things too. He's made it nearly impossible for anyone to be an .875 hitter. And, evidence shows, he's made it very unlikely for anyone to be a .070 clutch hitter. If you ignore those facts, you'll come up with implausible predictions. If you really believe that a 95% confidence interval around Joe Blow is centered at .070 clutch, I'll be happy to bet you on Joe's performance next year. You should be willing to give me 19:1 odds that Joe will hit within his confidence interval. I'll be very happy to take 10:1. If you're not willing to take my bet, then you don't really believe in your results, do you?
When we said that you can trust the .070 estimate "in the absence of any other information," that phrase, "in the absence of any other information," is a bit of a fuzzy way of saying what the true condition is. There's a mathematical Bayesian way of phrasing the condition, but I'll just use a rough approximation:
-- You can accept the point estimate and the confidence interval if, before the fact, you could say that every value was equally likely to be true.
That's not always the case. In my simulation, it was explicitly not the case, since I told you in advance that clutch couldn't be less than .000 or higher than .100.
It's not the case for clutch, either. Even if you didn't know anything about clutch hitting, it would be obvious, wouldn't it, that a clutch talent of .900 was impossible? And so, technically, not every value is equally likely to be true -- .070 is plausible, .900 is not. So, technically speaking, citing the .070 is invalid. It is NOT accurate to give a confidence interval for your parameter unless you are willing to assume everything is equally likely, from minus infinity to plus infinity.
That's a technicality, of course ... .900 is so far away from the .070 that the study showed, that you don't lose much accuracy from the fact that it's impossible. We don't actually have to go to infinity -- we can actually just say (and, again, this is a paraphrase of the math),
-- You can accept the point estimate and the confidence interval for all practical purposes if, before the fact, you could say that every value within a reasonable distance of the estimate was roughly equally likely to be true.
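For readers who want the condition in symbols, here is a sketch of the Bayesian statement being paraphrased. The notation is an assumption of this sketch, not something from the studies discussed: $\theta$ is the true clutch talent, $x$ is the observed difference, and $p(\theta)$ is the before-the-fact (prior) distribution of talent.

```latex
p(\theta \mid x) \;=\; \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}
\qquad\Longrightarrow\qquad
p(\theta \mid x) \;\propto\; p(x \mid \theta)
\quad \text{when } p(\theta) \text{ is roughly constant wherever } p(x \mid \theta) \text{ is non-negligible.}
```

In that flat-prior case, the classical point estimate and interval line up with what you should believe about $\theta$. When the prior is not flat, it pulls the answer toward the values it favors.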
In my opinion, this is NOT strictly the same as saying "in the absence of any other information." It's an explicit assumption that has to be made, one that just happens to be reasonable in the absence of other information. Normally, it's just ignored or taken for granted. But that doesn't mean it's not really lurking there.
So, in the case of Joe Blow, is every value within a reasonable distance of the confidence interval equally likely to be true? I don't think so. Our hypothetical confidence interval is (.005, .135). That is within a reasonable distance of .000, and, you'd think, values around .000 are more likely to be true than any other value.
Why are values around .000 more likely to be true than other values? You could argue that from the evidence of previous studies. But you don't need to. You can argue it on first principles.
Most human attributes are normally distributed, or, at least, distributed in a "normal-like" way where there are more people near average than far from average. Assuming that we have already normalized Joe's clutch stats to the league average clutch stats (as most studies do), the league average is .000, and so we should have expected that Joe's clutch talent would be more likely found around .000 than .070. Therefore, the condition "every value within a reasonable distance is roughly equally likely to be true" does not hold.
(Notice that you don't need to know whether clutch hitting exists or how big it is -- or even know anything about baseball -- to know that you can't take the point estimate for it at face value! All you need to know is that the distribution is higher in the middle. I think that's kind of cool, even though it's really just the same argument as regression to the mean.)
And so, you can't just take the .070 as your estimate without further argument.
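To see how much a "higher in the middle" distribution can move the estimate, here is a minimal sketch that combines a normal prior centered at .000 with a normal observation. Both numeric inputs are assumptions: the standard error is backed out of the hypothetical (.005, .135) interval, and the .020 spread of true clutch talent is a made-up value for illustration, not a finding from any study.

```python
def regress_to_likely(observed, std_err, prior_mean, prior_sd):
    """Posterior mean and SD for a normal prior combined with a normal observation."""
    prior_precision = 1 / prior_sd**2
    data_precision = 1 / std_err**2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + observed * data_precision)
    return post_mean, post_var**0.5

STD_ERR = 0.065 / 1.96   # assumption: implied by the (.005, .135) interval
TALENT_SD = 0.020        # hypothetical spread of true clutch talent

estimate, sd = regress_to_likely(0.070, STD_ERR, prior_mean=0.000, prior_sd=TALENT_SD)
print(f"regressed estimate: {estimate:.3f}")
```

With these made-up inputs the estimate drops from .070 to about .019. A tighter assumed spread of talent pulls it closer to .000, and a very wide spread leaves it near .070, which is the flat "everything equally likely" case.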
If you disagree with me, you can still try the "without further information" argument. You might reply, "well, Phil, I agree with you, but you have to admit that *in the absence of any other information*, the .070 is a good unbiased estimate."
To which my first reaction is,
"if you're not willing to even consider that human tendencies are clustered near average, if you're willing to go that far to ignore that other information, then you're not studying baseball, you're just doing mathematical exercises."
And then I say,
"It's not even technically true that the .070 is correct 'in the absence of additional information.' That's just a fuzzy way of phrasing it. What is *really* true is that the .070 is correct only if you are willing to assume that all values were, a priori, equally likely. That's an assumption that you have to make, even though you avoid making it, and even though you may not even realize you're making it. And your assumption is false. It's not just a case of ignoring information -- it's a case of ignoring information *and then assuming the opposite*."
I mentioned "regression to the mean," which, in sabermetrics, is the idea that when you try to estimate a player's or team's talent, you have to take the performance and "regress it" (move it closer) to the mean. If you find ten players who hit .350, you'll find that next season they may only hit .310 or so. And if you find a bunch of teams who go 30-51 in the first half of the season, you'll find they do better in the second half.
This happens because there are more players with talent below .350 than above, so that when a player does hit .350, he's more likely to have been a lower-than-.350 player who got lucky than a higher-than-.350 player who got unlucky.
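A small simulation shows the same thing. This is a sketch under stated assumptions, not real data: true talent is taken to be normal around .270 with a spread of .025, every player gets 500 AB, and binomial luck is approximated with a normal distribution. All of those numbers are hypothetical.

```python
import random

random.seed(1)

LEAGUE_AVG = 0.270   # assumption: average true talent
TALENT_SD = 0.025    # assumption: spread of true talent across players
AT_BATS = 500        # assumption: AB per player per season

def season(talent):
    """One season's batting average: true talent plus luck (normal approximation)."""
    luck_sd = (talent * (1 - talent) / AT_BATS) ** 0.5
    return random.gauss(talent, luck_sd)

# Simulate many players and keep the ones who hit .350 or better.
hot = []
for _ in range(200_000):
    talent = random.gauss(LEAGUE_AVG, TALENT_SD)
    avg = season(talent)
    if avg >= 0.350:
        hot.append((talent, avg))

mean_observed = sum(avg for _, avg in hot) / len(hot)
mean_talent = sum(t for t, _ in hot) / len(hot)
mean_next = sum(season(t) for t, _ in hot) / len(hot)

print(f"{len(hot)} players hit .350 or better")
print(f"they averaged {mean_observed:.3f} on true talent of {mean_talent:.3f}")
print(f"the following season they averaged {mean_next:.3f}")
```

The selected group's true talent, and its average the following season, both come out well below what it hit in the season that got it selected. The size of the drop depends entirely on the assumed spread of talent and the number of AB.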
Regression to the mean is actually a special case of what I'm describing here. You might call this Bayesian stuff "regression to the likely." It's likely the .350 hitter is actually a lower-than-.350 player, so that's the way you regress.
"Regression to the likely" is just another way of saying, "take all the other evidence and arguments into account," because it's all those other things that made you come up with what you thought was likely in the first place.
If you accept a .070 clutch number at face value, you are implicitly saying "when I regressed to the likely, I made no change to my belief in the .070, because there was nothing more likely to regress to." If that's what you think, you have to argue it. You can't just ignore or deny that you're implying it.
It may sound like I'm arguing that everyone who has ever created a confidence interval in an academic study is wrong. But, not really -- much of the time, in academic studies, the hidden assumption is true.
Suppose you do a study that tries to figure out how much it costs in free agent salary to get an extra win. And you wind up with an estimate of $5,500,000, with a standard deviation of $240,000. Is there any reason to believe that any number within a reasonable distance of $5.5MM is likelier than any other? I can't think of one. I mean, it was easy to think of a reason why more players should be average clutch than extreme clutch, but, before doing this study, was there any reason to think that wins should be more likely to cost $4,851,334 than, say, $6,033,762? I can't think of any.
Much of the time things are like this: a study will try to estimate some parameter for which there was no strong logical reason to expect a certain result over any other result. And, in those cases, the confidence interval is just fine.
But other times it's not. Clutch is one of them.
If you take the study's observed point estimate at face value -- and, as Tango observes, most studies do -- you're making that hidden assumption that (in the case of clutch) .070 is just as likely as .000. You're making that assumption whether you realize it or not; and the fact that the assumption is hidden does not mean you're entitled to assume it's true. In the clutch case, it seems obvious that it's not. And so, the .070 estimate, and accompanying confidence interval, are simply not correct in any predictive baseball sense.