Here's a good "hot hand" debate between Guy Molyneux and Joshua Miller, over at Andrew Gelman's blog.
A bit of background, if you like, before you go there.
-----
In 1985, Thomas Gilovich, Robert Vallone, and Amos Tversky published a study refuting the "hot hand" hypothesis, which is the assumption that after a player has recently performed exceptionally well, he is likely to be "hot" and continue to perform exceptionally well.
The Gilovich [et al] study showed three results:
1. NBA players were actually *worse* after recent field goal successes than after recent failures;
2. NBA players showed no significant correlation between their first free throw and second free throw; and
3. In an experiment set up by Gilovich, which involved long series of repeated shots by college basketball players, there was no significant improvement after a series of hits.
-----
Then, in 2015-2016, Joshua Miller and Adam Sanjurjo found a flaw in Gilovich's reasoning. 
The most intuitive way to describe the flaw is this:
Gilovich assumed that if a player shot (say) 50 percent over the full sequence of 100 shots, you'd expect him to shoot 50 percent after a hit, and 50 percent after a miss.
But this is clearly incorrect. If a player hit 50 out of 100, then, if he made his (or her) first shot, what's left is 49 out of 99. You wouldn't expect 50%, then, but only about 49.5%. And, similarly, you'd expect 50.5% after a miss.
By assuming 50%, the Gilovich study set the benchmark too high, and would call a player cold or neutral when he was actually neutral or hot.
(That's a special case of the flaw Miller and Sanjurjo found, which applies only to the "after one hit" case. For what happens after a streak of two or more consecutive hits, it's more complicated. Coincidentally, the flaw is actually identical to one that Steven Landsburg posted for a similar problem, which I wrote about back in 2010. See my post here, or check out the Miller paper linked to above.)
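The "49 out of 99" arithmetic can be checked by brute force on a smaller case. The sketch below is an illustration only, not code from either paper: it assumes a scaled-down sequence of 10 shots with exactly 5 hits (enumerating 100 shots is infeasible), and the function name `pooled_hit_after_hit` is made up for this example. It lists every possible arrangement and pools all the shots that follow a hit.

```python
from fractions import Fraction
from itertools import combinations


def pooled_hit_after_hit(n_shots, n_hits):
    """Share of shots following a hit that are also hits, pooled over
    every arrangement of n_hits hits among n_shots shots."""
    successes = 0
    opportunities = 0
    for hit_positions in combinations(range(n_shots), n_hits):
        hits = set(hit_positions)
        # The last shot has no follower, so stop one short of the end.
        for i in range(n_shots - 1):
            if i in hits:
                opportunities += 1
                if i + 1 in hits:
                    successes += 1
    return Fraction(successes, opportunities)


# Scaled-down analogue of "50 out of 100": 5 hits in 10 shots.
print(pooled_hit_after_hit(10, 5))   # 4/9, i.e. (5 - 1) / (10 - 1)
print(float(Fraction(49, 99)))       # the 100-shot figure, about 0.495
```

The pooled rate comes out to (hits - 1) / (shots - 1), which is the same pattern as 49/99 for the 100-shot case.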
------
The Miller [and Sanjurjo] paper corrected the flaw, and found that in Gilovich's experiment, there was indeed a hot hand, and a large one. In the Gilovich paper, shooters and observers were allowed to bet on whether the next shot would be made. The hit rate was actually seven percentage points higher when they decided to bet high, compared to when they decided to bet low (for example, 60 percent compared to 53 percent).
That suggests that the true hot hand effect must be higher than that -- because, if seven percentage points was what the participants observed in advance, who knows what they didn't observe? Maybe they only started betting when a streak got long, so they missed out on the part of the "hot hand" effect at the beginning of the streak.
However, there was no evidence of a hot hand in the other two parts of the Gilovich paper. In one part, players seem to hit field goals *worse* after a hit than after a miss -- but, corrected for the flaw, it seems (to my eye) that the effect is around zero. And, the "second free throw after the first" doesn't feature the flaw, so the results stand.
------
In addition, in a separate paper, Miller and Sanjurjo analyzed the results of the NBA's three-point contest, and found a hot hand there, too. I wrote about that in two posts in 2015. 
-------
From that, Miller argues that the hot hand *does* exist, and we now have evidence for it, and we need to take it seriously, and it's not a cognitive error to believe the hot hand represents something real, rather than just random occurrences in random sequences. 
Moreover, he argues that teams and players might actually benefit from taking a "hot hand" into account when formulating strategy -- not in any specific way, but, rather, that, in theory, there could be a benefit to be found somewhere.
He also uses an "absence of evidence is not evidence of absence"-type argument, pointing out that if all you have is binary data, of hits and misses, there could be a substantial hot hand effect in real life, but one that you'd be unable to find in the data unless you had a very large sample. I consider that argument a parallel to Bill James' "Underestimating the Fog" argument for clutch hitting -- that the methods we're using are too weak to find it even if it were there.
------
And that's where Guy comes in. 
Here's that link again. Be sure to check the comments ... most of the real debate resides there, where Miller and Guy engage each other's arguments directly.
I recommend that the interested reader begin after the first set of comments between Guy and me, and maybe come back to it later. Out of context those comments will be a turn-off; even in context they may still be a turn-off (I regrettably became frustrated by the discussion).
I think it is useful to summarize Guy's four points here, in his own words, and then decide for yourself if Guy has marshaled the evidence to back up his points.
1. "I do think you [Andrew Gelman] underestimate the power of the evidence against a meaningful hot hand effect in sports."
2. "I believe the balance of evidence should create a strong presumption that the hot hand is at most a small factor in competitive sports"
3. "people’s belief in the hot hand is reasonably considered a kind of cognitive error"
4. "to me that those conducting such studies [Controlled Shooting, Three Point Contest] have ranged far from the original topic of a hot hand in competitive sports — indeed, I’m not sure it is even in sight."
For additional context, Guy and I discussed similar issues in the comment section of Andrew Gelman's 2015 blog post. There is actually more detail there on some of the issues.
Phil-
I have never read James; that "Underestimating the Fog" article is awesome. I need to read more James.
This quote from James:
Why? Because whenever you do a study, if your study completely fails, you will get random data. Therefore, when you get random data, all you may conclude is that your study has failed. Cramer's study may have failed to identify clutch hitters because clutch hitters don't exist — as he concluded — or it may have failed to identify clutch hitters because the method doesn't work — as I now believe. We don't know. All we can say is that the study has failed.
I would add that he doesn't have to "believe" that the method doesn't work. He can know that it doesn't with a simple simulation.
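A simulation along those lines might look like the following sketch. Everything about the model is an assumption made for illustration, not something taken from James or from the hot hand papers: a player switches between a hypothetical "hot" state (60% hit rate) and "normal" state (45%), stays in the current state with probability 0.9, and the detection method is a simple one-sided two-proportion z-test comparing the hit rate after a hit with the hit rate after a miss.

```python
import math
import random


def simulate_shots(n, rng, p_hot=0.60, p_normal=0.45, stay=0.90):
    """Simulate n shots from a player who drifts between two states.
    All parameter values are illustrative assumptions."""
    hot = rng.random() < 0.5
    shots = []
    for _ in range(n):
        p = p_hot if hot else p_normal
        shots.append(rng.random() < p)
        if rng.random() > stay:      # occasionally switch state
            hot = not hot
    return shots


def detects_hot_hand(shots, z_crit=1.645):
    """One-sided z-test: is the hit rate after a hit higher than after a miss?"""
    after_hit = [b for a, b in zip(shots, shots[1:]) if a]
    after_miss = [b for a, b in zip(shots, shots[1:]) if not a]
    if not after_hit or not after_miss:
        return False
    p1 = sum(after_hit) / len(after_hit)
    p2 = sum(after_miss) / len(after_miss)
    pooled = (sum(after_hit) + sum(after_miss)) / (len(after_hit) + len(after_miss))
    se = math.sqrt(pooled * (1 - pooled) * (1 / len(after_hit) + 1 / len(after_miss)))
    if se == 0:
        return False
    return (p1 - p2) / se > z_crit


def detection_rate(n_shots=100, n_players=2000, seed=0):
    """Fraction of simulated hot-hand players that the test flags."""
    rng = random.Random(seed)
    flagged = sum(detects_hot_hand(simulate_shots(n_shots, rng))
                  for _ in range(n_players))
    return flagged / n_players


print(detection_rate())
```

Every simulated player here has a real hot hand by construction, so the printed number is the share of true hot hands the method catches. Under these assumed parameters the correlation between consecutive shots is small, so one would expect that share to be low; the exact figure depends on the parameters and the seed.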
This quote, from James:
My opinion is that, at this point, no one has made a compelling argument either in favor of or against the hot-hand phenomenon. The methods that are used to prove that a hot hitter is not really hot, in my opinion, would reach this conclusion whether hot hitters in fact existed or whether they did not.
Stated another way, the hot-hand opponents are arguing — or seem to me to be arguing — that the absence of proof is proof. The absence of clear proof that hot hands exist is proof that they don't. I am arguing that it is not. The argument against hot streaks is based on the assumption that this analysis would detect hot streaks if they existed, rather than on the proven fact. Whether hot streaks exist or do not I do not know — but I think the assumption is false.
Well, it can actually be proven as fact that the previous analyses of in-game data would not detect hot streaks even if they existed (for a range of hot hand models). So, at least for basketball, Bill James has firmer ground than he lets on.
The question follows: should we return to pre-1985 beliefs, whatever those were, or simply move back to agnosticism? Should we allow controlled shooting and Three Point shooting to inform our current beliefs? Expert practitioners (players and coaches) have more granular measures than we do, but they have less sophisticated analyses and we know they can get things wrong. Should we allow their beliefs to inform ours? Quoting from a comment I saw from Udi Wieder on Gödel's Lost Letter: "Elite athletes are exceptionally attuned to their bodies and their performance. If they believe the phenomena exists I believe it too."
With regard to your mention of my argument "that teams and players might actually benefit from taking a 'hot hand' into account when formulating strategy," I would turn the question around a bit: Do you think the evidence is compelling enough to convince Phil Jackson, Red Auerbach, or even a stats-savvy coach that they are wrong? Is it compelling enough to justify recommending that, in those relatively rare moments when a coach or player thinks it is important to adjust, offensively or defensively, to what they perceive to be the hot hand, they should ignore their intuition entirely?
By the way, the flaw we detected is not identical to Landsburg's problem. Landsburg's problem is related to optional stopping on sequence length. In his example you are sampling from the negative binomial distribution. The bias here is known, and was investigated formally at least as early as 1945 by J.B.S. Haldane. Of course, like many other results in mathematics, there is a way to relate this to the flaw that we found, but it is not identical.
Also, the sampling-without-replacement explanation (kind of) works for streaks of length one, but as you mention it gets more complicated after that. We have a user-friendly explanation in our recent general interest write-up.
Hi, Josh,
I would argue that the *flaw* is identical to Landsburg's problem -- that the average is obtained by equally weighting strings, rather than equally weighting the test cases.
Agreed that the *question* is not identical, and the strings are obtained by different rules, but the reason the answer doesn't work out to 1/2 (or whatever) is due to the equal-weighting flaw.
I'll change the post to reflect that better.
thanks.
To take it a bit further, while I agree that there is an (implicit) equal weighting of sequences, this is not the source of the bias. Further, I think this way of connecting it to the Landsburg problem is a bit misleading, because one might think that if you take two sequences and weight by the # of flips (trials) the bias would go away, but instead it would just be smaller (and you shouldn't do this with b-ball data anyway, because of aggregation bias). This also happens to be true in the Landsburg problem that you discussed in the post you linked above; if you take two countries with the same stop-when-I-have-a-son rule, Country A and Country B, then weighting the average of the two countries by the # of babies doesn't eliminate the bias, it only lessens the bias. Anyway, the bias in any single country will be slight, because, presumably, there will be many families.
Let me explain why the equal-weighting intuition from the Landsburg problem is not enough. If the bias were identical, then the reasoning used in the Landsburg problem would immediately indicate the direction of the bias here, but it doesn't. The equal weighting of sequences with an uneven # of observations per sequence is not a problem, per se. If you decide to inspect a random # of observations from a sequence and select those observations without peeking at the data, there is no bias. There is an extra step: you need to show why you expect the sequences with more observations to have higher values for prop(H|H), and then show it for prop(H|HH), etc. Remember, the sequence length is fixed here, unlike in the Landsburg problem. To summarize, the bias has two components: (1) unweighted average, (2) non-random selection of observations.
Anyway, I see your point that *one component* of the bias is just as in the Landsburg problem, and I agree. When formulating their benchmark for the prop(H|H) from a single sequence, Gilovich, Vallone and Tversky (implicitly) used the weighted average across the 2^n-2 hypothetical sequences that they could face, when they should have instead used the unweighted average across these sequences.
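The difference between the weighted and unweighted averages can be seen by enumerating every sequence of fair coin flips for a small n. This is a minimal sketch written for illustration (the function names are invented here, and it covers only the one-hit case prop(H|H)): it computes prop(H|H) within each sequence, skips the 2 sequences where it is undefined, and then compares the equal-weight average across sequences with the average that pools all flips.

```python
from fractions import Fraction
from itertools import product


def hit_after_hit_counts(seq):
    """Return (hits that follow a hit, flips that follow a hit) for one sequence."""
    pairs = list(zip(seq, seq[1:]))
    opportunities = sum(1 for a, _ in pairs if a)
    successes = sum(1 for a, b in pairs if a and b)
    return successes, opportunities


def averages(n):
    """Unweighted and weighted averages of prop(H|H) over all length-n sequences."""
    per_sequence = []
    total_successes = 0
    total_opportunities = 0
    for seq in product((True, False), repeat=n):
        s, o = hit_after_hit_counts(seq)
        if o == 0:
            continue  # no hit among the first n-1 flips: prop(H|H) undefined
        per_sequence.append(Fraction(s, o))
        total_successes += s
        total_opportunities += o
    unweighted = sum(per_sequence, Fraction(0)) / len(per_sequence)
    weighted = Fraction(total_successes, total_opportunities)
    return unweighted, weighted, len(per_sequence)


for n in (3, 4):
    unweighted, weighted, count = averages(n)
    print(n, count, unweighted, weighted)
# 3 6 5/12 1/2
# 4 14 17/42 1/2
```

The count column matches the 2^n-2 sequences mentioned above, the weighted average is exactly 1/2, and the unweighted average, which is what a single observed sequence should be benchmarked against, falls below 1/2.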
"If a player hit 50 out of 100, then, if he made his (or her) first shot, what's left is 49 out of 99. You wouldn't expect 50%, then, but only about 49.5%. And, similarly, you'd expect 50.5% after a miss."
I don't agree with this logic. For this to be true, there would need to be a memory property of the random trials. It assumes that after 100 shots, the player will have made and missed exactly 50 times each. This clearly doesn't happen. As a ridiculous example, suppose the player made the first 50 shots in a row. They wouldn't need to bother shooting the other 50 because they would automatically all be misses.
While that example is ridiculous, it might illustrate something else that could be going on. Perhaps there is a "learning" property. With every shot the player makes, their inherent skill improves by a tiny bit, making them slightly more likely to make the next shot. When a player makes a shot, they try to repeat exactly what they did on the next shot. If they make the next shot, they do the same, and soon that motion becomes part of their muscle memory. On the other hand, when a player misses a shot, they will clearly not try to repeat that same motion, but instead try to tweak or correct their motion just a bit - perhaps trying something less than ideal making them slightly less likely to make the next shot.
But in the scenario they aren't really random trials. They're random populations of a specific composition. Then those populations are sampled without replacement.
True, Zach, the scenario discussed is sampling without replacement. My argument is that this is an inappropriate model for something like shooting free throws or testing for a hot hand in any sporting endeavor. It assumes that the player will make a pre-determined number of his next x shots. In other words, it would mean that we could predict the future.
Suppose this were true. (I'm stealing this logic from Phil in a post on a different topic...) Once the player has shot 99 free throws, we would know with certainty whether he will make or miss the 100th shot, because we know that after 100 shots he will be 50-50. If we know what his 100th shot will be, we can use the same logic to determine what his 99th shot must be. And his 98th, and his 97th. In fact, we would know the precise result of every one of his next 100 shots, making them not random trials at all, but predetermined events.