I am writing this entry to put in my two cents on an essay by Bill James called “Underestimating the Fog,” published a couple of years back in Baseball Research Journal 33 (2005). It got a lot of response at the time, most notably from Jim Albert and Phil Birnbaum in By the Numbers, Volume 15, Number 1 (see page 3 for Jim, page 7 for Phil). Phil’s response was specifically relevant to one of the claims Bill made in that essay: that previous analysts were premature in concluding that clutch hitters as such don’t exist. Bill believed the existence of clutch hitters to be an open question; Phil attempted to provide evidence that we have good reason to believe there is no such thing. I wish to make some comments on the validity of Bill’s claim. Although I will use the clutch hitting case as an example, my comments are also relevant to analogous issues such as “hot and cold streaks aren’t real” (which Bill also claimed to be an open question in his essay).
I write this from the standpoint of the traditional logic of hypothesis testing (apologies to those readers already familiar with this way of thinking). In this tradition, one proposes a “research hypothesis” (e.g., clutch hitting exists) and a corresponding “null hypothesis” voicing its opposite (e.g., clutch hitting does not exist). One then compares one’s data to these hypotheses and finds support for whichever hypothesis the data more closely reflects. The problem is that no matter what the data look like, one can never be sure that one’s finding is accurate, because, due to the laws of chance, fluky data are always a possibility. It could be that in the real world there is no such thing as clutch hitting but that it appears to exist in one’s data, leading to an incorrect conclusion in favor of the research hypothesis. This is like flipping a fair coin 20 times, getting 17 heads (which I have seen occur twice in a course exercise I have used), and concluding that one’s coin is biased; statisticians call this “Type I error,” although it would better have been called Type I bad luck. It could also be that in the real world there is such a thing as clutch hitting but it does not show up in one’s data, leading to an incorrect conclusion in favor of the null hypothesis. This is like flipping a biased coin 20 times and, despite the bias, getting 10 heads; this is known as Type II error, although again it is more bad luck than mistake.
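To make the coin analogy concrete, here is a minimal sketch of both kinds of bad luck, written in Python with scipy; it is my own illustration, not anything from Bill’s or Phil’s work:

```python
from scipy.stats import binom

# Type I "bad luck": a fair coin (p = 0.5) comes up heads 17 or more
# times in 20 flips, tempting us to declare it biased when it is not.
p_type1 = binom.sf(16, n=20, p=0.5)  # P(X >= 17) = 1 - P(X <= 16)
print(f"P(17+ heads from a fair coin): {p_type1:.5f}")  # about 0.0013

# Type II "bad luck": a genuinely biased coin (here, p = 0.6) still
# produces a perfectly ordinary-looking 10 heads out of 20.
p_type2_example = binom.pmf(10, n=20, p=0.6)
print(f"P(exactly 10 heads from a p = 0.6 coin): {p_type2_example:.3f}")  # about 0.117
```

Rare as the first outcome is, run the classroom exercise enough times and it will eventually show up, which is exactly the point.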
On any single occasion, one can never be sure whether one’s findings are accurate or not. However, in a given research situation, we can use the laws of probability to estimate the odds that Type I or Type II error would occur, and use these estimates as the basis for our decision concerning which hypothesis should gain support. If our data appear to support the research hypothesis and the calculated probability of Type I error is very small (below one’s “significance level,” traditionally less than five percent in the social and behavioral sciences), then we claim support for that research hypothesis, although we know there is some chance that claim is wrong. If our data appear to support the null hypothesis and the calculated probability of Type II error is very small, then we can analogously claim support for the null hypothesis, with the corresponding proviso.
The point I wish to make concerns the second of these possibilities. Although there have been two recent cases in which researchers claim to have found an impact, both of which I will comment on later, research as a general rule has supported the null hypothesis that clutch hitting as a skill does not exist. However, our trust in this conclusion has to be tempered by the chance of Type II error. Now, rather than Type II error, the odds of erroneously supporting the null hypothesis, we tend to think about the issue in terms of its complement: statistical power, the odds of correctly rejecting the null hypothesis. In this case, that would be the odds of finding evidence for clutch hitting in a given data set assuming it really exists. Statistical power (and Type II error) is determined by three different factors. First, the significance level, which, as noted earlier, is traditionally .05 in the social and behavioral sciences. If one becomes more lenient and makes it, say, .10, then in so doing one increases one’s statistical power. The problem with doing so is that one increases the odds of Type I error; as a consequence, we don’t normally muck around with the significance level. Second, one’s sample size; the more data we can work with, the more likely random fluctuations in the data are to cancel one another out, making it more likely that we will find something that is there. Third, the strength of the effect itself, which is called “effect size.”
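Here is a minimal sketch of how the three factors trade off, using the standard Fisher z approximation for the power of a correlation test; the numbers plugged in at the bottom are purely illustrative, not taken from any of the studies discussed here:

```python
from math import atanh, sqrt
from scipy.stats import norm

def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test that a true correlation r
    is nonzero, given sample size n, via the Fisher z transformation."""
    z_crit = norm.ppf(1 - alpha / 2)      # cutoff set by the significance level
    z_effect = atanh(r) * sqrt(n - 3)     # true effect in standard-error units
    return norm.sf(z_crit - z_effect)     # odds the test detects the effect

# Each factor in action (illustrative values only):
print(correlation_power(r=0.08, n=500))              # weak effect, modest sample
print(correlation_power(r=0.08, n=500, alpha=0.10))  # looser alpha: more power, more Type I risk
print(correlation_power(r=0.08, n=5000))             # larger sample: more power
print(correlation_power(r=0.30, n=500))              # larger effect: more power
```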
Let us turn now to clutch hitting. To be clear, the issue is not whether there is such a thing as a clutch hit; of course there is. The issue is whether certain players have some special talent such that they are consistently better in the clutch than in “normal” game situations, whereas other players do not. If clutch hitting as an inherent ability were a significant factor in performance, it would be easier to find than if its impact were weak. Given the difficulty people have had in finding a clutch hitting effect, if it actually exists, it must have a very small effect size. Given that we don’t mess with the significance level, we would need to increase our sample size as much as feasible if we want to increase our power and the resulting odds of finding an effect if it exists. Having said this: if we run a study and find no evidence for clutch hitting, we will be confident in that conclusion to the extent that statistical power is high and, saying the same thing a different way, Type II error is low. We can calculate that power based on our sample size, our significance level, and the effect size of clutch hitting. We know our significance level and we know our sample size, but we do not know our effect size. We can only guesstimate it. We can never prove that clutch hitting does not exist as an inherent talent, no matter what our data look like, because our guesstimate of its effect may be too large. Herein lies the fog that Bill alluded to.
Now, let’s get to what Phil Birnbaum did. Bill James had claimed that the Dick Cramer method for studying the clutch hitting issue, looking at year-to-year correlations in clutch performance (see the 1977 Baseball Research Journal), is invalid, because one cannot conclude anything from random data. In his article, using the Elias definition of a “late inning pressure situation” to distinguish clutch from non-clutch plate appearances, Phil computed the statistical power of this type of test assuming various effect sizes. This is a good way to deal with the fog issue: show the possibilities and then let the reader decide. For example, based on his data, if the correlation in clutch performance from season to season were a trifling .08, he would have found it 97.7% of the time; as he did not find it, if such an effect exists, it must be even weaker than that. Assuming the validity of his data set and analysis, what cannot be denied from his work is that if clutch hitting does exist as a talent, its effect size must be extremely small. The moral of our story: Bill James was right about the following: we will never be in the position to definitively say that an effect does not exist. This is because, no matter how large our sample, our analysis will assume some effect size, and the effect could always be smaller than our assumption. But Bill did not understand that we can estimate the odds of detecting an effect given different conceivable effect sizes. And at some point, we can also place ourselves in the position of concluding that if such-and-such an effect does exist, it is too small to have any appreciable impact on the game.
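If Phil’s general approach appeals to you, it is easy to mimic in the same spirit: compute power across a range of candidate effect sizes and lay the table out for the reader. The function below is the same Fisher z sketch as above, and the sample size is a made-up placeholder, not the size of Phil’s actual data set:

```python
from math import atanh, sqrt
from scipy.stats import norm

def correlation_power(r, n, alpha=0.05):
    """Power of a two-sided correlation test (Fisher z approximation)."""
    return norm.sf(norm.ppf(1 - alpha / 2) - atanh(r) * sqrt(n - 3))

N_PAIRS = 2500  # placeholder sample size, NOT Phil's actual player count

for r in (0.02, 0.04, 0.06, 0.08, 0.10):
    print(f"if the true season-to-season correlation were {r:.2f}, "
          f"power = {correlation_power(r, N_PAIRS):.3f}")
```

A table like this lets each reader pick the smallest effect size they would consider meaningful and see for themselves how likely the study was to have caught it.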
Let me now turn to the two recent claimed demonstrations of a clutch hitting effect. One is by Nate Silver in the Baseball Prospectus folks’ Baseball Between the Numbers. I like Nate’s general method; rather than making a sharp distinction between clutch and non-clutch situations, Nate used the concept of situational leverage (the likelihood that the outcome of a given game situation will determine which team wins the game) to estimate the degree of clutch-ness in a batter’s plate appearances, which could then be compared to the player’s overall performance to see if he tended to do better or worse as situational leverage increased. Turning a dichotomous clutch/non-clutch distinction into a graduated degree-of-clutch-ness scale (in technical terms, moving from ordinal to interval measurement) improves the subtlety of one’s measure, which in effect is another way to improve one’s statistical power. Now, apologies for getting a bit technical: using a sample of 292 players, Nate found a correlation of .33 in their “situational hitting” (his term, which is probably better than “clutch” given his method) between the first and last halves of their careers, which is significant at better than .001; in other words, the Type I error rate is less than one in a thousand. Finally, some evidence that clutch hitting might indeed exist as a skill. Nonetheless, Nate is very careful to downplay the effect, guesstimating that it may account for 2 percent of the impact of offense on wins.
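As a sanity check on that significance claim, the standard t-test for a correlation coefficient can reproduce it from just the two numbers Nate reports; the calculation below is mine, not his:

```python
from math import sqrt
from scipy.stats import t

r, n = 0.33, 292  # Nate's reported correlation and sample size

# Standard t-test for a correlation coefficient:
# t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
t_stat = r * sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * t.sf(t_stat, df=n - 2)  # two-sided p-value

print(f"t = {t_stat:.2f}, p = {p_value:.1e}")  # well below .001
```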
Tom Tango, Mitchel Lichtman, and Andrew Dolphin also devote a chapter to the issue in The Book. Unfortunately, they use the dichotomous clutch-versus-non-clutch distinction despite the fact that most of the analyses in their book rely on situational leverage. They claim to have found an effect but do not provide any relevant data for the rest of us to examine (other than lists of players with good and bad clutch hitting indices, which do not count as evidence that the effect is non-random), so I cannot judge the validity of their conclusion one way or another. This chapter is one of the few weak points of an otherwise impressive body of work.
-- Charlie Pavitt
Labels: baseball, clutch