### Charlie Pavitt on recent clutch hitting studies

I am writing this entry in order to put my two cents in on an essay written by Bill James and published in Baseball Research Journal 33 (2005) a couple years back called “Underestimating the Fog.” It got a lot of response at the time, most notably from Jim Albert and Phil Birnbaum in *By the Numbers,* Volume 15 Number 1 (see page 3 for Jim, page 7 for Phil). Phil’s response was specifically relevant to one of the claims Bill made in that essay: that the conclusions made by previous analysts that clutch hitters as such don’t exist were premature. Bill believed the existence of clutch hitters to be an open question; Phil attempted to provide evidence that we have good reason to believe that there is no such thing. I wish to make some comments on the validity of Bill’s claim. Although I will use the clutch hitting case as an example, my comments are also relevant to analogous issues such as “hot and cold streaks aren’t real” (which Bill also claimed to be an open question in his essay).

I write this from the standpoint of the traditional logic of hypothesis testing (apologies to those readers already familiar with this way of thinking). In this tradition, one proposes a “research hypothesis” (e.g., clutch hitting exists) and a corresponding “null hypothesis” voicing its opposite (e.g., clutch hitting does not exist). One then compares one’s data to these hypotheses, and finds support for whichever hypothesis the data more closely reflects. The problem is that no matter what the data look like, one can never be sure that one’s finding is accurate, because, due to the laws of chance, fluky data is always a possibility. It could be that in the real world there is no such thing as clutch hitting but that it appears to exist in one’s data, leading to an incorrect conclusion in favor of the research hypothesis. This is like flipping a fair coin 20 times, getting 17 heads (which I have seen occur twice in a course exercise I have used), and concluding that one’s coin is biased; statisticians call this “Type I error,” although it would have been better called Type I bad luck. It could also be that in the real world there is such a thing as clutch hitting but it does not show up in one’s data, leading to an incorrect conclusion in favor of the null hypothesis. This is like flipping a biased coin 20 times and, despite the bias, getting 10 heads; this is known as “Type II error,” although again it is more bad luck than mistake.
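To make the coin example concrete, here is a short, self-contained Python sketch (mine, not Charlie's) that computes the exact binomial probability of such a fluke:

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """Probability of k or more successes in n independent Bernoulli(p) trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance a fair coin shows 17 or more heads in 20 flips.
p_17_plus = tail_prob(20, 17)
print(f"P(17+ heads in 20 fair flips) = {p_17_plus:.5f}")  # about 0.00129
```

Seventeen or more heads from a fair coin comes up only about once in 800 runs of 20 flips, so a significance test would quite reasonably, but wrongly, declare the coin biased: that is Type I bad luck.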

On any single occasion, one can never be sure whether one’s findings are accurate or not. However, in a given research situation, we can use the laws of probability to estimate the odds that Type I or Type II error would occur, and use these estimates as the basis for our decision concerning which hypothesis should gain support. If our data appear to support the research hypothesis and the calculated probability of Type I error is very small (below one’s “significance level,” traditionally five percent in the social and behavioral sciences), then we claim support for that research hypothesis, although we know there is some chance that claim is wrong. If our data appear to support the null hypothesis and the calculated probability of Type II error is very small, then we can analogously claim support for the null hypothesis with the corresponding proviso.

The point I wish to make concerns the second of these possibilities. Although there have been two recent cases in which researchers claim to have found an impact, both of which I will comment on later, research as a general rule has supported the null hypothesis that clutch hitting as a skill does not exist. However, our trust in this conclusion has to be tempered by the chance of Type II error. Now, rather than Type II error (the odds of supporting the null hypothesis in error), we tend to think about the issue in terms of its complement: statistical power, the odds of correctly rejecting the null hypothesis. In this case, that would be the odds of finding evidence for clutch hitting in a given data set assuming it really exists. Statistical power (and Type II error) is determined by three different factors. First, significance level, which as noted earlier is traditionally .05 in the social and behavioral sciences. If one becomes more lenient and makes it, say, .10, then in so doing one increases one’s statistical power. The problem with doing so is that one increases the odds of Type I error; as a consequence, we don’t normally muck around with significance level. Second, one’s sample size; the more data we can work with, the more random fluctuations in the data are likely to cancel one another out, making it more likely that we find an effect that is really there. Third, the strength of the effect itself, which is called the “effect size”; the stronger the effect, the easier it is to detect.
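As an illustration of those three levers, here is a rough Python sketch of my own, using a generic one-sided z-test approximation (not anything from the studies discussed), showing how power moves with significance level, sample size, and effect size:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def power_one_sided_z(effect, n, alpha=0.05):
    """Approximate power of a one-sided z-test for a standardized effect
    of size `effect` (in SDs of a single observation) with n observations
    at significance level alpha."""
    z_crit = norm.inv_cdf(1 - alpha)
    return norm.cdf(effect * sqrt(n) - z_crit)

print(power_one_sided_z(0.05, 400))        # small effect, modest sample: low power
print(power_one_sided_z(0.05, 4000))       # same effect, 10x the data: much better
print(power_one_sided_z(0.05, 400, 0.10))  # looser alpha buys some power
print(power_one_sided_z(0.20, 400))        # a bigger effect is far easier to see
```

The effect sizes and sample counts here are arbitrary; the point is only the direction each lever moves power.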

Let us turn now to clutch hitting. To be clear, the issue is not whether there is such a thing as a clutch hit; of course there is. The issue is whether there are certain players who have some special talent such that they are consistently better in the clutch than in “normal” game situations, whereas other players do not have this talent. If clutch hitting as an inherent ability were a significant factor in performance, it would be easier to find than if its impact were weak. Given the difficulty people have had in finding a clutch hitting effect, if it actually exists, it must have a very small effect size. Given that we don’t mess with significance level, we would need to increase our sample size as much as feasible if we want to increase our power and the resulting odds of finding an effect if it exists. Having said this: if we run a study and find no evidence for clutch hitting, we will be confident in that conclusion to the extent that statistical power is high and, saying the same thing a different way, Type II error is low. We can calculate that power based on our sample size, our significance level, and the effect size of clutch hitting. We know our significance level, we know our sample size, but we do not know our effect size; we can only guestimate it. We can never prove that clutch hitting does not exist as an inherent talent, no matter what our data look like, because our guestimate of its effect may be too large. Herein lies the fog that Bill alluded to.

Now, let’s get to what Phil Birnbaum did. Bill James had claimed that the Dick Cramer method for studying the clutch hitting issue, looking at year-to-year correlations in clutch performance (see the 1977 Baseball Research Journal), is invalid, because random-looking data cannot by itself prove that no effect exists. In his article, using the Elias definition of a “late inning pressure situation” to distinguish clutch from non-clutch plate appearances, Phil computed the statistical power of this type of test assuming various effect sizes. This is a good way to deal with the fog issue: show the possibilities and then let the reader decide. For example, based on his data, if the correlation in clutch performance from season to season was a trifling .08, he would have found it 97.7% of the time; as he did not find it, if such an effect exists, it must be even weaker than that. Assuming the validity of his data set and analysis, what cannot be denied from his work is that if clutch hitting does exist as a talent, its effect size must be extremely small. The moral of our story: Bill James was right about the following: we will never be in the position to definitively say that an effect does not exist, because, no matter how large our sample, our analysis will assume some effect size, and the effect could always be smaller than our assumption. But Bill did not take into account that we can estimate the odds of detecting an effect under different conceivable effect sizes. And at some point, we can also place ourselves in the position of concluding that if such-and-such an effect does exist, it is too small to have any appreciable impact on the game.
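For readers who want to see the shape of such a power calculation, here is a sketch using the standard Fisher z approximation for correlation tests; the sample sizes below are made up for illustration and are not Phil's actual figures:

```python
from math import atanh, sqrt
from statistics import NormalDist

norm = NormalDist()

def power_corr_test(r, n, alpha=0.05):
    """Approximate power to detect a true correlation r with n paired
    observations (two-sided test), via the Fisher z transformation."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    mean = atanh(r) * sqrt(n - 3)        # expected Fisher z, scaled
    return 1 - norm.cdf(z_crit - mean)   # ignores the negligible opposite tail

# How often would a true season-to-season correlation of .08 be detected?
for n in (500, 1000, 2500):
    print(n, round(power_corr_test(0.08, n), 3))
```

With a few hundred season-pairs the test would usually miss a .08 correlation; with a couple of thousand, power climbs near the 97.7% figure Phil reported (his exact data set and method may of course differ from this approximation).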

Let me now turn to the two recent claimed demonstrations of a clutch hitting effect. One is by Nate Silver in the Baseball Prospectus folks’ *Baseball Between the Numbers*. I like Nate’s general method; rather than making a clear distinction between clutch and non-clutch situations, Nate used the concept of situational leverage (the likelihood that the outcome of a given game situation will determine which team wins the game) to estimate the degree of clutch-ness in batters’ plate appearances, which could then be compared to each player’s overall performance to see if he tended to do better or worse as situational leverage increased. Turning a dichotomous clutch/non-clutch distinction into a gradated degree-of-clutch-ness scale (in technical terms, from ordinal measurement into interval measurement) improves the subtlety of one’s measure, which in effect is another way to improve one’s statistical power. Now, apologies for getting a bit technical; using a sample of 292 players, Nate found a correlation of .33 in their “situational hitting” (his term, which is probably better than “clutch” given his method) between the first and last half of their careers, which is significant at better than .001; in other words, the Type I error rate is less than one in a thousand. Finally, some evidence that clutch hitting might indeed exist as a skill. Nonetheless, Nate is very careful to downplay the effect, guestimating that it may account for 2 percent of the impact of offense on wins.
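Nate's significance claim is easy to sanity-check with the standard Fisher z approximation (a sketch of mine, not Nate's actual computation):

```python
from math import atanh, sqrt
from statistics import NormalDist

norm = NormalDist()

def corr_p_value(r, n):
    """Two-sided p-value for an observed correlation r with n pairs,
    via the Fisher z approximation."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - norm.cdf(abs(z)))

# r = .33 across 292 players, first half of career vs. last half
p = corr_p_value(0.33, 292)
print(f"p = {p:.2e}")
```

The p-value comes out orders of magnitude below .001, consistent with Nate's "better than .001" claim.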

Tom Tango, Mitchel Lichtman, and Andrew Dolphin also devote a chapter to the issue in *The Book*. They unfortunately use the dichotomous clutch versus non-clutch distinction despite the fact that most of the analyses in their book rely on situational leverage, and they claim to have found an effect but do not provide any relevant data for the rest of us to examine (other than lists of players with good and bad clutch hitting indices, which does not rate as evidence that the effect is non-random). So I cannot judge the validity of their conclusion one way or another. This chapter is one of the few weak points of an otherwise impressive body of work.

-- Charlie Pavitt

## 9 Comments:

I find all of these studies of clutch hitting to be frustrating for the simple reason that they do such a poor job of defining "clutch".

The trouble is that clutch hitting, as understood by the naive fan, is the ability to get a hit when the game is on the line. That is an exceedingly difficult concept to get a handle on in operational terms. For example, one could study players' performances in late-inning pressure situations, but that just measures performance in late-inning pressure situations. There are many such situations that the naive fan would not term clutch. For instance, if one's team is mathematically eliminated from the pennant race and a player gets a hit in such a situation, I doubt that there are many fans who would term it a clutch hit. In fact, lots of them would grouse about the player only getting such a hit after it no longer mattered.

A similar criticism applies to the consistency of situational hitting throughout a player's career. It doesn't measure clutch as understood by the naive fan.

The upshot of my criticism is that sabermetricians should abandon the term "clutch" altogether when studying hitting, and pitching as well. If one is studying hitting in late-inning pressure situations, one should present one's results as results about performance in such situations, not as results about clutch hitting.

Charlie,

This is Andy's article which predates his work on clutch in The Book:

http://www.dolphinsim.com/ratings/notes/clutch.html

You may find it more technically to your liking.

***

As well, it's important to note that a correlation of .999 is achievable when your sample size is infinity if there is the slightest of possible real causation.

And that a correlation of .001 is achievable between two variables if they have an almost perfect cause/effect relationship, if the sample size is exceedingly small.

Therefore, reporting a correlation without reporting the sample size is really meaningless.

In Nate's study, which was very good, I'd guess the sample size was, what, probably 3000 PA in each pool for each player? To get a correlation of r=.50, you'd then simply need 6000 PA for each player in each pool. That is, you'd regress a player's performance 50% toward the mean, if you have 6000 PA. If you had 600 PA, then your correlation would be r = .09.
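Tango's arithmetic here follows the standard regression-to-the-mean model, in which the expected split-half correlation is r = PA / (PA + k), with k the "regression constant" implied by his r = .50 at 6000 PA. A quick Python check (my sketch of that model, not code from the study):

```python
def split_half_r(pa, k):
    """Expected correlation between two independent samples of `pa` PA each,
    for a skill whose regression constant is k PA: r = pa / (pa + k)."""
    return pa / (pa + k)

# r = .50 at 6000 PA implies a regression constant of k = 6000 PA.
k = 6000
print(round(split_half_r(6000, k), 2))  # 0.5
print(round(split_half_r(600, k), 2))   # 0.09
```

With one tenth the plate appearances, the same underlying skill produces a correlation of only .09, exactly Tango's point that r is a function of sample size as much as of the skill itself.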

The key is not to determine the correlation coefficient, but rather how much your sample can explain. As Nate showed in his book, and Andy showed in his, the "true clutch" skill that we can infer based on a few hundred PA is not much.

*As well, it's important to note that a correlation of .999 is achievable when your sample size is infinity if there is the slightest of possible real causation.*

I'm not too sure about that... Given a perfect regression relationship (error equally normally distributed over the interval), you could get a standard deviation around the regression of 5. Sum of squared errors: r^2 = (infinity*5^2)/(infinity*s^2), where s is the standard deviation for a simple average of the Y's... So the best you can ever do is: r^2 = (standard deviation around regression)^2/(standard deviation around the mean)^2.

Or r = standard deviation of regression/standard deviation of mean. This is why a low r doesn't necessarily mean a bad regression or good regression.

I'm not 100% sure the above is true, I'll test it out later...

What I know will happen as sample size goes to infinity is that the standard error on the coefficients will go to 0.

Let's say that Bonds is a true .400 OBP. Pujols is a true .399 OBP. Hafner is a true .398 OBP. Ortiz is a true .397 OBP. Manny is a true .396 OBP. Beltran is a true .395 OBP.

If you give them each 600 PA, they'll put up numbers all over the place, such that Bonds will win the OBP race say 18% of the time and Beltran will win 14% of the time.

Give them each 6000 PA for a career, and maybe Bonds will win 30% of the time.

Give them each 6 million PA, and Bonds will win, who knows, 70% of the time.

Give them each an infinite number of PA, and guess what.... Bonds' observed OBP will be .400. Pujols .399, etc, etc.

Repeat the test, and Bonds' observed OBP will be .400. Pujols .399, etc, etc.

If there is the slightest bit of non-randomness in what you are observing, then an infinite number of observations will give you a correlation approaching 1.00.
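Tango's thought experiment above is easy to simulate; here is a sketch (the true OBPs are the illustrative numbers from his comment, the trial count and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# True talent levels from the thought experiment: Bonds .400 ... Beltran .395
TRUE_OBP = [0.400, 0.399, 0.398, 0.397, 0.396, 0.395]

def best_talent_win_rate(pa, trials=4000):
    """Fraction of simulated samples of `pa` PA each in which the player
    with the best true OBP (index 0) also posts the best observed OBP.
    (Ties go to the lowest index -- close enough for a sketch.)"""
    times_on_base = rng.binomial(pa, TRUE_OBP, size=(trials, len(TRUE_OBP)))
    return float(np.mean(times_on_base.argmax(axis=1) == 0))

print("   600 PA:", best_talent_win_rate(600))
print("  6000 PA:", best_talent_win_rate(6000))
print("600000 PA:", best_talent_win_rate(600_000))
```

At 600 PA the best true hitter wins the observed OBP race not much more often than chance (about 1 in 6); as PA grow without bound, the win rate climbs toward 1, which is the point about infinite samples making the observed converge to the true.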

1. My equation was wrong; it's r^2 = 1 - 5^2/s^2...

2.

RE: Correlation of 1.0... For this regression it's Clutch.year vs. Clutch.(year+1), so the standard error in each infinite set would be zero; therefore, if there was a relationship it would be r = 1.0 = sqrt(1 - 0^2/s^2).

3. However, if a hitter plays against a larger percentage (of his infinite set) of clutch pitchers in one year than in another, then his clutch hitting will be reduced by the clutch pitching. So even an infinite sample would have "scheduling error": retiring players and marginal players will produce, in effect, random variations on an infinite sample. Unless there is a perfect way to scale out these issues, there are problems.

Of course, the two could cancel each other out exactly: all players are about equal in clutch situations, and therefore it appears no players are clutch, which of course is wrong. However, the situation cannot be detected numerically unless you test against players you know are not clutch (make them believe the situation is not important).

I still don't get your point. Say that Bonds has a +.003 OBP clutch skill, and Pujols is +.002 OBP clutch skill... Beltran is -.003 OBP clutch skill.

Given an infinite number of PA to each player, in of course the same context (i.e., against the same pitchers, parks, etc), Bonds will come out with an observed +.003 OBP clutch skill, plus or minus .000000000000001.

Now, you can throw in some variation, like changing pitchers, parks, etc. That won't matter, as long as our players get those infinite PA against the same context (and of course that the context itself doesn't give Beltran more of an advantage for some reason... that is, that Bonds' +.003 in OBP clutch skill is not a manifestation of his true clutch skill plus something in the environment that he can take more advantage of).

In short, I see no reason why an infinite observed sample won't make the observed = true. And if that's the case, then the correlation between two such observations will give us r=1

Tangotiger - thanx much for the link. And i'm glad to see my post of such interest to many people. Hope to contribute again on another topic soon.

charlie

My job is as a clinical psychologist who just happens to have an interest in stats. In clinical work, we have a concept called "clinical significance." It deals with situations in which the study results are statistically significant, but in which the effect size is so small that it's essentially irrelevant. In that case, as a clinician, I would say, "Yes, it's there, but so what?"

The recent studies of clutch hitting that you reference (Silver and Tango) all appear to be good illustrations of this concept. Clutch ability does seem to exist (or alternately, a fortunate Type I error has occurred for the sentimental traditionalists), and in the grand scheme of things it has very little effect on performance.

One piece of methodology has always seemed missing from "clutch" analyses: if batters become "better" (or worse) in pressure/high-leverage situations, presumably for psychological reasons such as being less prone to anxiety, wouldn't the pitcher's "clutchness" also have to be examined and factored into the equation?

Response to Pizza Cutter - Both of those works discuss that very issue - that the effects they found are exceedingly small and of very little practical significance. Further, the Tango et al. book considers the issue of clutch pitching. charlie
