Friday, December 04, 2015

A new "hot hand" study finds a plausible effect

There's a recent baseball study (main page, .pdf) that claims to find a significant "hot hand" effect. Not just statistically significant, but fairly large:

"Strikingly, we find recent performance is highly significant in predicting performance ... Furthermore these effects are of a significant magnitude: for instance, ... a batter who is “hot” in home runs is 15-25% more likely (0.5-0.75 percentage points or about one half a standard deviation) more likely to hit a home run in his next at bat."

Translating that more concretely into baseball terms: imagine a batter who normally hits 20 HR in a season. The authors are saying that when he's on a hot streak of home runs, he actually hits like a 23 or 25 home run talent. 

That's a strong effect. I don't think even home field advantage is that big, is it?

In any case, after reading the paper ... well, I think the study's conclusions are seriously exaggerated, because part of what the authors consider a "hot hand" effect doesn't have anything to do with streakiness at all.


The study took all player seasons from 2000 to 2011, subject to an AB minimum. Then, the authors tried to predict every single AB for every player in that span. 

To get an estimate for that AB, the authors considered:

(a) the player's performance in the preceding 25 AB; and
(b) the player's performance in the rest of the season, except that they excluded a "window" of 50 AB before the 25, and 50 AB after the AB being predicted.

To make this go easier, I'm going to call the 25 AB sample the "streak AB" (since it measures how streaky the player was). I'm going to call the two 50 AB exclusions the "window AB". And, I'm going to call the rest of the season the "talent AB," since that's what's being used as a proxy for the player's ability.

Just to do an example: Suppose a player had 501 AB one year, and the authors are trying to predict AB number 201. They'd divide up the season like this:

1. the first 125 AB (part of the "talent AB")
2. the next 50 AB (part of the "window AB")
3. the next 25 AB (the "streak AB")
4. the single "current AB" being predicted
5. the next 50 AB (part of the "window AB")
6. the next 250 AB (part of the "talent AB").

They run a regression to predict (4), based on two variables:

B1 -- the player's ability, which is how he did in (1) and (6) combined
B2 -- the player's performance in (3), the 25 "streak AB" that show how "hot" or "cold" he was, going into the current AB.

Well, not just those two -- they also include the obvious control variables, like season, park, opposing pitcher's talent, platoon, and home field advantage. 

(Why did they choose to exclude the "windows" (2) and (5)? They say that because the windows occur so close to the actual streak, they might themselves be subject to streakiness, and that would bias the results.)
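
To make the bookkeeping concrete, here's a rough sketch of that split in Python. It's mine, not the authors': the function name `partition_season` is made up, and clamping the windows at the start and end of the season is my own assumption, since nothing above says how the paper treats AB near the edges of a season.

```python
def partition_season(n_ab, target, streak_len=25, window_len=50):
    """Split AB numbers 1..n_ab around the AB being predicted (`target`).

    Hypothetical helper: edge handling (clamping to the season) is assumed.
    """
    # The "streak AB": the 25 AB immediately before the target AB.
    streak_start = max(1, target - streak_len)
    streak = list(range(streak_start, target))

    # The two excluded "window AB": 50 before the streak, 50 after the target.
    pre_window = list(range(max(1, streak_start - window_len), streak_start))
    post_window = list(range(target + 1, min(n_ab, target + window_len) + 1))

    # The "talent AB": everything else in the season.
    excluded = set(streak) | set(pre_window) | set(post_window) | {target}
    talent = [ab for ab in range(1, n_ab + 1) if ab not in excluded]

    return {
        "talent": talent,
        "pre_window": pre_window,
        "streak": streak,
        "post_window": post_window,
    }


parts = partition_season(n_ab=501, target=201)
for name, abs_in_part in parts.items():
    print(name, len(abs_in_part))
```

For the 501 AB season in the example, with AB number 201 as the target, that gives 125 + 250 = 375 talent AB, a 25 AB streak, and two 50 AB windows.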

What did the study find? That the estimate of "B2" was large and significant. Holding the performance in the 375 AB "ability" sample constant, the better the player did in the immediately preceding 25 "streak" AB, the better he did in the current AB.

In other words, a hot player continues to be hot!


But there's a problem with that conclusion, which you might have figured out already. The methodology isn't actually controlling for talent properly.

Suppose you have two players, identical in the "talent" estimate -- in 375 AB each, they both hit exactly 21 home runs.

And suppose that in the streak AB, they were different. In the 25 "streak" AB, player A didn't hit any home runs. But player B hit four additional homers.

In that case, do you really expect them to hit identically in the 26th AB? No, you don't. And not because of streakiness -- but, rather, because player B has demonstrated himself to be a better home run hitter than player A, by a margin of 25 to 21. 

In other words, the regression coefficient confounds two factors -- streakiness, and additional evidence of the players' relative talent.
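
You can check this without any real data. Here's a back-of-the-envelope sketch (my own, not the paper's method) that works out what the regression coefficients would be in a world with NO streakiness at all: every player has a fixed HR probability per AB, and the only thing that varies is luck. The function name and the three talent levels are invented for illustration, and the control variables are ignored.

```python
def no_streak_coefficients(talents, talent_ab=375, streak_ab=25):
    """Population regression coefficients (b_talent, b_streak) for predicting
    the current AB when each player's HR probability never changes.

    `talents` is a made-up list of true HR-per-AB probabilities.
    """
    n = len(talents)
    mean_p = sum(talents) / n
    var_p = sum((p - mean_p) ** 2 for p in talents) / n  # spread of true talent
    luck = sum(p * (1 - p) for p in talents) / n         # average per-AB luck variance

    var_t = var_p + luck / talent_ab  # variance of the observed "talent AB" rate
    var_s = var_p + luck / streak_ab  # variance of the observed "streak AB" rate
    cov = var_p                       # the samples share only the true talent

    # Solve the 2x2 normal equations:
    #   var_t * b_talent + cov   * b_streak = cov
    #   cov   * b_talent + var_s * b_streak = cov
    det = var_t * var_s - cov * cov
    b_talent = cov * (var_s - cov) / det
    b_streak = cov * (var_t - cov) / det
    return b_talent, b_streak


b_talent, b_streak = no_streak_coefficients([0.02, 0.04, 0.06])
print(b_talent, b_streak, b_streak / b_talent)
```

Under those assumptions, the "streak" coefficient comes out positive even though nobody is ever hot or cold. It's simply the streak sample carrying 25 AB worth of extra information about talent, which is why the ratio of the two coefficients works out to 25/375.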


Here's an example that might make the point a bit clearer.

(a)  in their first 10 AB -- the "talent" AB -- Mario Mendoza and Babe Ruth both fail to hit a HR.
(b)  in their next 100 AB -- the "streak" AB -- Mendoza hits no HR, but the Babe hits 11.
(c)  in the 100 AB after that -- the "current" AB -- Mendoza again hits no HR, but the Babe hits 10.

Is that evidence of a hot hand? By the authors' logic, yes, it is. They would say:

1. The two players were identical in talent, from the control sample of (a). 
2. In (b), Ruth was hot, while Mendoza was cold.
3. In (c), Ruth outhit Mendoza. Therefore, it must have been the hot hand in (b) that caused the difference in (c)!

But, of course, the point is ... (b) is not just evidence of which player was hot. It's also evidence of which player was *better*. 


Now, the authors did actually understand this was an issue. 

In a previous version of their paper, they hadn't. In 2014, when Tango posted a link on his site, it took only two-and-a-half hours for commenter Kincaid to point out the problem (comment 6).  (There was a follow-up discussion too.)

The authors took note, and now realize that their estimates of streakiness are confounded by the fact that they're not truly controlling for established performance. 

The easiest way for them to correct the problem would have been just to include the 25 AB in the talent variable. In the "Player A vs. Player B" example, instead of populating the regression with "21/0" and "21/4", they could easily have populated it with "21/0" and "25/4". 

Which they did, except -- only in one regression, and only in an appendix that's for the web version only.

For the published article, they decided to leave the regression the way it was, but, afterwards, try to break down the coefficient to figure out how much of the effect was streakiness, and how much was talent. Actually, the portion I'm calling "talent" they decided to call "learning," on the grounds that it's caused by performance in the "streak AB" allowing us to "learn" more about the player's long-term ability. 

Fine, except: they still chose to define "hot hand" as the SUM of "streakiness" and "learning," on the grounds that ... well, here's how they explain it:

"The association of a hot hand with predictability introduces an issue in interpretation, that is also present but generally unacknowledged in other papers in the area. In particular, predictability may derive from short-term changes in ability, or from learning about longer-term ability. ... We use the term “hot hand” synonymously with short-term predictability, which encompasses both streakiness and learning."

To paraphrase, what they're saying is something like:

"The whole point of "hot hand" studies is to see how well we can predict future performance. So the "hot hand" effect SHOULD include "learning," because the important thing is that the performance after the "hot hand" is higher, and, for predictive purposes, we shouldn't care what caused it to be higher."

I think that's nuts. 

The "learning" only exists in this study because the authors deliberately chose to leave some of the known data out of the talent estimate.

They looked at a 25/4 player (25 home runs of which 4 were during the "streak"), and a 21/0 player (21 HR, 0 during the streak), and said, "hey, let's deliberately UNLEARN about the performance during the streak time, and treat them as identical 21-HR players. Then, we'll RE-LEARN that the 25/4 guy was actually better, and treat that as part of the hot hand effect."


So, that's why the authors' estimates of the actual "hot hand" effect (as normally understood outside of this paper) are way too high. They answered the wrong question. They answered,

"If a guy hitting .250 has a hot streak and raises his average to .260, how much better will he be than a guy hitting .250 who has a cold streak and lowers his average to .240?"

They really should have answered,

"If a guy hitting .240 has a hot streak and raises his average to .250, how much better will he be than a guy hitting .260 who has a cold streak and lowers his average to .250?"


But, as I mentioned, the authors DID try to decompose their estimates into "streakiness" and "learning," so they actually did provide good evidence to help answer the real question.

How did they decompose it? They realized that if streakiness didn't exist at all, each individual "streak AB" should have the same weight as each individual "talent AB". It turned out that the individual "streak AB" were actually more predictive, so the difference must be due to streakiness.

For HR, they found the coefficient for the "streak AB" home run rate was .0749. If a "streak AB" were exactly as important as a "talent AB", the coefficient would have been .0437. The difference, .0312, can maybe be attributed to streakiness.

In that case, the "hot hand" effect -- as the authors define it, as the sum of the two parts -- is 58% learning, and 42% streakiness.
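
Here's that arithmetic spelled out, as a quick sketch. The two coefficients are the ones quoted above; the function and variable names are just mine.

```python
def decompose(streak_coef, equal_weight_coef):
    """Split the "streak AB" coefficient into learning and streakiness.

    equal_weight_coef is what the coefficient would be if a streak AB
    counted exactly as much as a talent AB (i.e. no streakiness).
    """
    streakiness = streak_coef - equal_weight_coef
    learning_share = equal_weight_coef / streak_coef
    streakiness_share = streakiness / streak_coef
    return streakiness, learning_share, streakiness_share


streakiness, learning_share, streakiness_share = decompose(0.0749, 0.0437)
print(f"streakiness part of coefficient: {streakiness:.4f}")
print(f"learning share:    {learning_share:.0%}")
print(f"streakiness share: {streakiness_share:.0%}")
```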


They didn't have to do all that, actually, since they DID run a regression where the Streak AB were included in the Talent AB. That's Table A20 in the paper (page 59 of the .pdf), and we can read off the streakiness coefficient directly. It's .0271, which is still statistically significant.

What does that mean for prediction?

It means that to predict future performance, based on the HR rate during the streak, only 2.71 percent of the "hotness" is real. You have to regress 97.29 percent to the mean. 

Suppose a player hit home runs at a rate of 20 HR per 500 AB, including the streak. During the streak, he hit 4 HR in 25 AB, which is a rate of 80 HR per 500 AB. What should we expect in the AB that immediately follows the streak?

Well, during the streak, the player hit at a rate 60 HR / 500 AB higher than normal. 60 times 2.71 percent equals 1.6. So, in the AB following the streak, we'd expect him to hit at a rate of 21.6 HR, instead of just 20.
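
If you want to plug in your own numbers, the calculation is just "overall rate, plus the coefficient times how far the streak was above the overall rate." A minimal sketch, with a function name I made up, using the .0271 coefficient from Table A20:

```python
def expected_rate_after_streak(overall_rate, streak_rate, coef):
    """Expected rate in the AB right after the streak.

    Rates are per 500 AB. Only `coef` of the gap between the streak
    rate and the overall rate is kept; the rest regresses to the mean.
    """
    return overall_rate + coef * (streak_rate - overall_rate)


overall = 20                 # HR per 500 AB, including the streak
streak = 4 * 500 / 25        # 4 HR in 25 AB is a rate of 80 HR per 500 AB
hr_after = expected_rate_after_streak(overall, streak, coef=0.0271)
print(f"{hr_after:.1f} HR per 500 AB")
```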


In addition to HR, the authors looked at streaks for hits, strikeouts, and walks. I'll do a similar calculation for those, again from Table A20.

Batting Average

Suppose a player hits .270 overall (except for the one AB we're predicting), but has a hot streak where he hits .420. What should we expect immediately after the streak?

The coefficient is .0053. 150 points above average, times .0053, is ... less than one point. The .270 hitter becomes maybe a .271 hitter.


Strikeouts

Suppose a player normally strikes out 100 times per 500 AB, but struck out at double that rate during the streak (which is 10 K in those 25 AB). What should we expect?

The coefficient is .0279. 100 rate points above average, times .0279, is 2.79. So, we should expect the batter's K rate to be 102.79 per 500 AB, instead of just 100. 


Walks

Suppose a player normally walks 80 times per 500 PA, but had a streak where he walked twice as often. What's the expectation after the streak?

The coefficient here is larger, .0674. So, instead of walking at a rate of 80 per 500 PA, we should expect a walk rate of 85.4. Well, that's a decent effect. Not huge, but something.

(The authors used PA instead of AB as the basis for the walk regression, for obvious reasons.)


It's particularly frustrating that the paper is so misleading, because there actually IS an indication of some sort of streakiness. 

Of course, for practical purposes, the size of the effect means it's not that important in baseball terms. Even if you quadruple your HR rate over a 25 AB streak, you get less than a 10 percent increase in HR expectation in your next single AB. At best, if you double your walk rate over a hot streak, your walk expectation goes up about 7 percent.
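
A rough check on those percentages, using the Table A20 coefficients quoted earlier. If the streak rate is some multiple of the player's usual rate, the relative bump in the next AB works out to the coefficient times (multiple minus 1). The function names here are mine, and this is only a sketch of the arithmetic, not anything from the paper.

```python
def relative_lift(coef, multiple):
    """Fractional increase in next-AB expectation when the streak rate
    is `multiple` times the player's usual rate."""
    return coef * (multiple - 1)


def multiple_needed(coef, lift):
    """How many times the usual rate the streak must be to get `lift`."""
    return 1 + lift / coef


hr_coef = 0.0271   # HR coefficient
bb_coef = 0.0674   # walk coefficient

print(f"HR, quadrupled rate: {relative_lift(hr_coef, 4):.1%}")
print(f"BB, doubled rate:    {relative_lift(bb_coef, 2):.1%}")
print(f"HR multiple for a 10% lift: {multiple_needed(hr_coef, 0.10):.2f}")
```

So quadrupling the HR rate gets you a lift of about 8 percent, and a full 10 percent would take a streak somewhat hotter than that.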

But it's still a significant finding in terms of theory, perhaps the best evidence I've ever seen that there's at least *some* effect. It's unfortunate that the paper chooses to inflate the conclusions by redefining "hot hand" to mean something it's not.

(P.S.  MGL has an essay on this study in the 2016 Hardball Times. My book arrived last week, but I haven't read it yet. Discussion here.)
