Saturday, September 29, 2012

Hot Hand III

After my last two posts on the hot hand, I had an e-mail conversation with Dan Stone, the author of the original study.  He pointed out, correctly, that I had used his most conservative example, which is why I found such a low hot hand expectation based on outcomes.  He has a point ... in fairness, I should have shown results for his other models.  I'll do that now.

Let me start by recapping Dan's conservative model. 


Suppose that the probability of a player making his next shot (I'll assume it's a basketball free throw) is correlated with the probability he had of making his previous shot (regardless of whether or not he actually made that previous shot).  Specifically, suppose that to figure out the probability for the current shot, you follow this two-step process:

1.  Take the probability for his previous shot, and regress it 90% to the mean of 75%.  That is, "keep" 10% of the probability, and regress 90% of it.  (Dan calls the 10% figure "rho".)

2.  Take that new probability, and figure the max you can add/subtract while keeping the probability between 0% and 100%.  Call that number X -- so, for instance, if the new probability is 77%, X is 23%.  Choose a random number, uniformly, from -X to +X.  Multiply it by 25%, and add it on.  (Dan calls the 25% figure "alpha").

So, for instance, suppose your previous probability was 65%.  Regress that 90% to the mean, bringing you to 74%.  Then, choose a random number between -26% and +26%.  Suppose you pick -12%.  Multiply that by the alpha of 25%, giving -3%.  Add the -3% to the 74%.  That gives you 71%, which is your new probability for the current shot.
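If it helps to see the two steps in code, here's a minimal sketch of the update rule (the function and parameter names are mine, not Dan's):

```python
import random

def next_probability(prev_p, mean=0.75, rho=0.10, alpha=0.25):
    """One step of the model: regress the previous probability most of the
    way back to the mean (keeping rho of the deviation), then add noise."""
    # Step 1: keep rho of the deviation from the mean.
    regressed = mean + rho * (prev_p - mean)
    # Step 2: X is the most we can add/subtract without leaving [0, 1].
    x = min(regressed, 1.0 - regressed)
    # Add alpha times a uniform draw from -X to +X.
    return regressed + alpha * random.uniform(-x, x)

# The worked example: 65% regresses to 74%, then gets noise of at
# most 25% of 26% in either direction.
print(next_probability(0.65))
```

With alpha set to zero, next_probability(0.65) returns exactly the 74 percent from the worked example.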


I simulated that model over 100,000 shots, and got

-- after a hit, your chance of a hit is 75.049 pct.
-- after a miss, the chance of a hit is 74.875 pct.

My conclusion was, there's not much of a hot hand there, if knowing the previous outcome gains you so little in predictive value.


But, what if we try Dan's less conservative models?  We'll get a bigger difference.  I'll vary "rho" up from 10%, and "alpha" up from 25%, and show you some of the new results.

Alpha still 25%, but Rho now 50% (meaning 50% regression to the mean):

-- after a hit, your chance of a hit is 75.03 pct.
-- after a miss, the chance of a hit is 74.81 pct.

We've improved to a difference of .22 percentage points, instead of .17. 

Now, let's up Rho to .9, which means we only regress 10% to the mean:

-- after a hit, your chance of a hit is 75.20 pct.
-- after a miss, the chance of a hit is 74.27 pct.

Now, the difference is quite large, relatively speaking: almost a whole percentage point (.93).


Dan also included a model where alpha is higher, at 0.5.  That means we add half the random number, instead of just one-quarter.  That should make the hot hand effect even more prominent, because we'll have more extreme values, which means a miss is more likely to be the result of a low probability.

So, here's Alpha at 50%, and Rho also at 50%:

-- after a hit, your chance of a hit is 75.23 pct.
-- after a miss, the chance of a hit is 74.27 pct.

And, here's Alpha at 50%, and Rho at 90% (meaning, only 10% regression to the mean):

-- after a hit, your chance of a hit is 75.93 pct.
-- after a miss, the chance of a hit is 72.06 pct.
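Here's roughly how I ran these simulations -- a sketch, not Dan's actual code; the function names and the random seed are mine:

```python
import random

def simulate(n_shots, mean=0.75, rho=0.10, alpha=0.25, seed=1):
    """Simulate the model and tabulate the chance of a hit after
    a hit vs. after a miss."""
    rng = random.Random(seed)
    p, prev_hit = mean, None
    counts = {True: [0, 0], False: [0, 0]}  # previous outcome -> [hits, attempts]
    for _ in range(n_shots):
        hit = rng.random() < p
        if prev_hit is not None:
            counts[prev_hit][1] += 1
            counts[prev_hit][0] += hit
        prev_hit = hit
        # Update the probability: regress toward the mean, add bounded noise.
        regressed = mean + rho * (p - mean)
        x = min(regressed, 1.0 - regressed)
        p = regressed + alpha * rng.uniform(-x, x)
    return {k: h / max(n, 1) for k, (h, n) in counts.items()}

for rho, alpha in [(0.10, 0.25), (0.50, 0.25), (0.90, 0.25), (0.50, 0.50), (0.90, 0.50)]:
    r = simulate(100_000, rho=rho, alpha=alpha)
    print(f"rho={rho:.2f} alpha={alpha:.2f}: "
          f"after hit {r[True]:.2%}, after miss {r[False]:.2%}")
```

Your exact numbers will differ from mine with a different seed, but the pattern -- a bigger hit/miss gap as rho and alpha rise -- holds.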

As I said, and again in fairness, this is a much larger effect than the one I reported on.  However ... well, to me, it's still a pretty small "hot hand" effect in outcome terms.  Even knowing that probabilities vary a lot -- sometimes Kobe Bryant steps to the line having only a 50% chance of making his shot -- the difference between a hit and a miss is still only four percentage points.  The best a fan can do is to say, "hey, the guy missed his last shot, so he's probably cold -- he's a 72 percent shooter instead of a 76 percent shooter."


However, Dan points out that we might be able to do better than that by looking at more than just the outcome.  We can actually try to estimate what the probability was, based on how good a shot it was.  So, on average, the difference may be 72 percent to 76, but, if Kobe really made a bad miss, as opposed to a close call, we can perhaps be more certain that Kobe had a low probability to start, so his current probability should be low too -- lower than the 72% we'd guess if all we knew was that he missed.

That makes sense, if you can indeed make an inference about the probability based on the qualitative characteristics of the actual performance.  You probably can, a bit, but I'm skeptical that it's enough to significantly improve predictions.

And, I'm also skeptical that Kobe's probability can vary so much from shot to shot; normally, most researchers assume a constant probability, not a varying one.

But, Dan may prove me wrong, on both counts, in a subsequent paper.  I'll keep an open mind.


Saturday, September 15, 2012

More "hot hand" thoughts

There are two ways to define (or measure) a "hot hand": 

(A) you can check if a hit is more likely than normal to be followed by a hit, and if a miss is more likely than normal to be followed by a miss.

(B) you can check if a *high-probability opportunity* is more likely than normal to be followed by another high-probability opportunity, and if a low-probability opportunity is more likely to be followed by a low-probability opportunity.

The measure in Type A is usually what fans mean.  After a team wins, say, six consecutive games, are they more likely than average to win the next one?

The measure in Type B is a bit more obscure.  Type B is saying something like, "are there streaks where a .500 team is really a .550 team, regardless of whether they win or lose during that streak?"


Now, you can always convert Type B to Type A.  That is, suppose your model says you get 100 consecutive chances at 40%, and then 100 more at 60%.  That's a hot hand in terms of probabilities, in terms of Type B.  Indeed, it's a *very* hot hand: with only one exception, a 40 is always followed by a 40, and a 60 is always followed by a 60.

But, in terms of Type A, a hit being followed by a hit, not so much.  If you work it out, you get

-- After a hit, the chance of a hit is 52%.
-- After a miss, the chance of a hit is 48%.

That is: if we assume a Type B "hot hand" that's absolutely huge -- 20 percentage points -- the corresponding Type A effect is much smaller, at only 4 percentage points.
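You can verify those numbers exactly, without simulating, by weighting each consecutive pair of shots by the probability of the first one.  This is my own arithmetic, not from the paper:

```python
def conditional_rates(probs):
    """Exact P(hit | previous hit) and P(hit | previous miss) for a
    known sequence of shot probabilities."""
    hit_w = hit_num = miss_w = miss_num = 0.0
    for prev, cur in zip(probs, probs[1:]):
        hit_w += prev               # chance the previous shot was a hit
        hit_num += prev * cur       # ... and this one is a hit too
        miss_w += 1 - prev
        miss_num += (1 - prev) * cur
    return hit_num / hit_w, miss_num / miss_w

# 100 chances at 40%, then 100 more at 60%:
after_hit, after_miss = conditional_rates([0.40] * 100 + [0.60] * 100)
print(f"after a hit: {after_hit:.1%}, after a miss: {after_miss:.1%}")
# → after a hit: 52.0%, after a miss: 48.0%
```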

With lesser effects, it's even worse.  Suppose that the Toronto Blue Jays show Type B streakiness in probabilities.  If the weather is warm, they play .510 ball.  If the weather is cold, they play .490 ball.  That's a streaky pattern because temperature is streaky -- warmer in the summer and colder in spring and fall.

If you didn't know about this weather thing, would you be able to observe any streakiness?  No way.  After a win, the Jays are a .5002 team.  After a loss, they're a .4998 team.  You wouldn't notice that even after 1,000,000 games.  Seriously.  We're talking two games every 10,000.

Even if Type B is reasonably large, Type A is likely still small.


That's what we saw in the previous post.  In his paper, Dan Stone showed a model where the "Type B" effect was 10 percent of the previous shot's difference from the mean.  Then, he added a bunch of randomness.  And, what happened?  The equivalent "Type A" effect was almost zero:

-- after a hit, the chance of a hit is 75.049 pct.
-- after a miss, the chance of a hit is 74.875 pct.

If the academic standard for this kind of research is a "Type B" model, I wish authors would also show us the "Type A" effect, so we can really see what we're dealing with, as fans.  In this case, we're dealing with one extra hit every 575 attempts, which is barely a "hot hand" effect at all.


Of course, all this partially depends on what you're looking for.  If you just want to discover an effect -- any effect -- then it doesn't matter how big it is.  But, in a way, proving "existence" is unnecessary.  We *already know* there's a "hot hand" effect -- because of home field advantage.  

Players do better at home than on the road.  Therefore, a hit is more likely to be a home hit than a road hit, which means a hit is more likely after a hit (probably home) than after a miss (probably road).

Now, that effect is very, very small.  If you assume a .750 foul shooter is .751 at home, but .749 on the road, then the probability of a hit, given that the previous attempt was successful, is .750001333.  Barely there at all -- but it's there, and we know it's there!

So, if you're looking just to prove the *existence* of an effect, there's no need.  I just did it.

And, of course, there certainly must be other effects, of some magnitude.  I'm sure if a player's wife is mean to him one morning, he'll play differently that day.  Maybe better, probably worse.  It would be silly to insist that doesn't happen, that a .750 shooter is still a .7500000000 shooter no matter what's preoccupying him that day. 

But ... the effect might be very, very small.  If you want to prove it's there, good luck.  As I said, if he's .751 on good days and .749 on bad days, the Type A effect is one shot in 750,000.

Even if he's .760 on good days, but .740 on bad days -- which would be pretty significant, in my book -- he'd be .750133 after a success.  That's one shot in 7,519.
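Both of those numbers fall out of a simple identity: when consecutive shots share the same probability, the chance of a hit after a hit works out to E(p²)/E(p).  A quick check -- my code, assuming the two situations come up equally often:

```python
def hit_after_hit(p_values):
    """P(hit | previous hit) when consecutive shots share the same
    probability, drawn equally often from p_values: E[p^2] / E[p]."""
    mean_p = sum(p_values) / len(p_values)
    mean_p2 = sum(p * p for p in p_values) / len(p_values)
    return mean_p2 / mean_p

print(hit_after_hit([0.751, 0.749]))  # home/road split: ~0.750001333
print(hit_after_hit([0.760, 0.740]))  # good day/bad day: ~0.750133
```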


So it's not about existence -- it's about size.  We have evidence that *some* streakiness exists, albeit in very, very small amounts.  What the skeptics have to show is that it exists in some *significant* amount, in the real-world, "Type A" sense. 

And, even that's probably not really enough.  Because, "hot hand" implies more than just correlation.  Suppose a baseball team trades for a bunch of superstars in August.  That team will definitely show a "hot hand" effect.  If you check when it won three straight, it was more likely after the trade than before.  And, therefore, the chance of it winning the fourth game is higher than its average for the season.

But is that really a "hot hand," in the sense that people mean?  I doubt it.  I think they're talking about psychology, about how the three consecutive wins convinced the team it could compete, and that confidence and momentum extends into the fourth game and lifts them higher.  I think fans won't call it a "hot hand" if it's just caused by extra superstars, something they know is already there. 


Thursday, September 13, 2012

Is the evidence against the "hot hand" flawed?

The "hot hand" effect is the purported tendency of players (or teams) to follow success with success, and failure with failure.  For instance, when a shooter makes a few free throws in a row, most basketball fans would expect that he's on a roll, and should continue to shoot better than his overall average, at least until his hot hand cools.

Generally, researchers have concluded that the evidence shows no sign of such an effect.  Alan Reifman is perhaps the busiest researcher on the topic; he has a blog devoted to the hot hand, as well as a recent book (which I still mean to read and review).

However, recently, an academic paper by Dan Stone questioned the conclusion that the hot hand is largely a myth.  Stone argues that even if an effect exists, it would be very hard to find in the data.  His argument is along the lines of, "absence of evidence is not evidence of absence." 

At "Wages of Wins," Jeremy Bretton said he wasn't convinced -- that Stone's critique was "more a quibble" than a conclusive rebuttal.  Stone replied otherwise, and linked to his study, which I read.


First, and obviously, it's nearly impossible to prove that NO hot hand effect exists.  If the effect is small enough, we'll never see it.

For instance, suppose that players shoot at a 75.002 percent success rate after making a free throw, but only 75.001 percent after missing. That effect would be so small that we'd never be able to disprove it.  Even after a million shots, the SD of the success rate would be 0.04 percentage points -- still 40 times higher than the effect we're looking to find! 
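That SD comes straight from the binomial formula, sqrt(p(1-p)/n):

```python
import math

p, n = 0.75, 1_000_000
sd = math.sqrt(p * (1 - p) / n)  # SD of the observed success rate
print(f"SD of the success rate: {sd * 100:.3f} percentage points")
# Compare to the effect we're hunting: 0.001 percentage points.
```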

So, for the question to be meaningful, it has to be a bit more specific: is the evidence enough to disprove a *reasonably large* hot hand effect?

That depends on a lot of things.  It depends what you mean by "reasonably large."  It depends what assumptions you make about how hot hands work.  It depends if you're looking at an individual player, or a team.

What Stone did, in his paper, is make some of those assumptions explicit.  He found, under his assumptions, that it would be difficult to prove a hot hand effect if it did exist.  He therefore concludes that we shouldn't be so quick to deny that it happens.

I don't disagree with his math, but I'm not sure the assumptions translate to the real world. 


Normally, we talk about the hot hand in terms of what happens after a success, vs. what happens after a failure.  Stone did something a bit different: he talked about what happens after a high-probability shot, vs. what happens after a low-probability shot. 

That's a huge difference.

First, it requires that a player has a different probability of making his shot every time.  That's an unusual assumption for this kind of study, to assume that, when Kobe Bryant gets fouled, sometimes he has a 90 percent chance of making the shot, and sometimes he has only a 65 percent chance of making it.  Usually, the assumption is that his probability is constant.  And that seems, to me, to be much more reasonable. 

Second, it adds a lot of randomness to the model.  Suppose that Kobe hits 86 percent after a hit, but only 82 percent after a miss.  Under the "standard" assumption that Kobe's probability is the same every shot, we only have to check whether the two are different based on the same mean.  But, under Stone's assumption, we also have to allow for the possibility that the "miss" shots were just harder, that Kobe just had a lower probability, randomly.  The more randomness you add, the harder it is to find a real effect.

For instance, if Kobe hits 86 percent in 1000 shots after a hit, and 82 percent in 210 shots after a miss, that's statistically significant if you assume he's otherwise constant at 85 percent.  But it's NOT statistically significant if you assume his probability varies randomly, because maybe the missed shots were "harder".

And that's why Stone finds that it's hard to find a "hot hand," even if it exists -- because he posits a model with so much randomness.


If you want to know exactly what Stone did, here it is, in small font.  I'm going to use his most plausible model as an example.

Suppose the mean probability of success is 75 percent.  But, suppose that rate varies randomly (I'll explain how in a minute).

Now, suppose there *is* a hot hand effect -- not in terms of consecutive successes, but in terms of *probabilities* of consecutive successes.  Specifically, you take the probability of the previous shot, and regress it 90 percent to the mean, and that gives you the expected probability of the next shot. 

However: after you regress to the mean, you add lots of randomness.  So, after a 71 percent shot, you regress to 74 percent -- but then you randomize around 74 percent.  So your next shot could be 70 percent, or 80 percent, or something else, although, on average, it'll be 74 percent.

The specifics:  Start by figuring out the maximum randomness you can add without going over 100 percent or below 0 percent.  (So, if the current probability is 80 percent, the maximum is 20 percent.)  Then, you take 1/4 of that in either direction, and choose a random number within that range, and add it to the probability.  (So, if you're at 80 percent, 1/4 of 20 is 5.  You take a random number between minus 5 percent, and plus 5 percent, and add it to the 80.)  That becomes your new probability.

So, on the whole, you have a better-than-average chance after a high-probability shot, and a lower-than-average chance after a low-probability shot.  Therefore, there is indeed a "hot hand" effect.  However, because of the randomization afterwards, it doesn't always come through. 

And so, Stone goes on to show that even though there is an effect, there is virtually NO chance of finding it with statistical significance, by just looking at whether shots were made or missed. 

It's not that you *sometimes* see an effect -- it's that you'll almost *never* see an effect.  Stone ran a series of simulations of 1,000 shots each, and found a 5 percent significance level almost exactly 5 percent of the time.  That is: the effect is indistinguishable from no effect.

Stone's conclusion (paraphrased): "How can you say there's no hot hand effect, when I've shown you a scenario with a real hot hand effect, that's impossible to pick up?  We should admit that our tests aren't adequate to find an effect, instead of claiming that we've debunked the possibility."


To which I say: Stone is correct, but the "hot hand" effect he assumed is very, very small by "classical" standards.  That's hidden, mostly, by the fact that he bases his assumptions on probabilities, rather than successes.

What would be a reasonable "hot hand" effect that's significant in the basketball sense?  Maybe, 1 percentage point?  That is, after a success, you're 0.25 point better than average, but after a failure, you're 0.75 points worse than average? 

Even that, you're not going to be able to find in 1,000 repetitions.  The SD of success rate in 1,000 shots is 1.4 percentage points.  You'd need over 7,000 shots to get the SD down to 0.5 percentage points, which would get the actual effect to 2 SD. Even then, you'd only have a 50/50 chance of finding significance.
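The arithmetic behind those numbers, using the same binomial SD:

```python
import math

p = 0.75
sd_1000 = math.sqrt(p * (1 - p) / 1000)      # SD of success rate in 1,000 shots
n_for_half_point = p * (1 - p) / 0.005 ** 2  # shots needed for an SD of 0.5 points

print(f"SD in 1,000 shots: {sd_1000 * 100:.1f} percentage points")
print(f"shots needed for 0.5-point SD: {n_for_half_point:.0f}")
```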

And Stone's effect is much, much smaller than that.  Because he manipulated the probabilities *based on probabilities*, instead of based on successes, it makes the effect tiny. 

I ran Stone's simulation myself, for 100,000 throws.  And the results were:

-- after a success, the chance of another success is 75.049 percent.
-- after a miss, the chance of success is 74.875 percent.

That is: instead of a 1 percentage point difference, which is (in my view) around the lower limit of significance in the basketball sense, Stone's model leads to a difference of only 0.174 percentage points -- about a sixth as large. 

That means, in order to find the effect, you need around 250,000 shots, just to have a 50/50 chance of getting significance!  And that's why Stone's simulation didn't find any effect -- he used only 1,000 shots, for a very, very small effect.


Now, I think Stone could defend his conclusion, by saying something like,

"Yes, a large hot hand effect leads to a very small observational effect.  That's my entire point.  There could be a decently-sized hot hand effect, but we'd never see it because it would barely show up in the game data."

And I'd agree with that.  But, it all depends on how you define a "hot hand".  Most of us define "hot hand" as an effect big enough to materially change our real-life performance expectations.  We are looking for a *reasonably sized* effect.  This doesn't qualify.

What Stone is telling us is this: if the "hot hand" effect is only big enough to produce one more basket after every 575 successes, as compared to after 575 failures, we'll never see it in the data.

Which, I think, we knew all along.


Tuesday, September 04, 2012


(Note: technical statistics post.)

In baseball, a run is worth about a tenth of a win.  So, if I did a regression to predict wins from runs scored (RS), I'd probably get an equation something like

Wins = 0.1*RS + 11

Bill James' "Runs Created" (RC) is a statistic that's a reasonably unbiased estimate of runs scored.  So, if I did a regression to predict wins from Runs Created, I'd get roughly the same thing:

Wins = 0.1*RC + 11

(Actually, the coefficient might be a little lower, because of random prediction error, but never mind.)

Now, what if I try to do a regression to predict wins from *both* RS and RC?  What would happen?

When I asked myself this question, my first reaction was that RS would still be worth 0.1, and RC would be worth close to zero, and not significant.  Because, who needs RC when you have RS?  So, maybe it would be something like this:

Wins = 0.1*RS + 0.007*RC + 11

And then it occurred to me: how would the regression "know" that RS is more important than RC?  Might it not keep the 0.1 coefficient on the RC, instead of the RS?  Like this:

Wins = 0.02*RS + 0.1*RC + 11

Or, why wouldn't it just give half credit to each?

Wins = 0.05*RS + 0.05*RC + 11

And, when I thought about it, I realized that any combination of coefficients that adds up to 0.1 is possible, like

Wins = 0.03*RS + 0.07*RC + 11
Wins = 5.4*RS - 5.3*RC + 11

And so on.  So, I wondered, which one is correct?  And why?

The answer, it turns out, is that you can't really predict what will happen.  The coefficients are very heavily dependent on the data, and the random error in the data. 

That's because RC and RS are very highly correlated.  It's a known rule of thumb that when you have predictor variables that are highly correlated, the coefficient estimates are unstable and unpredictable.  That's called "multicollinearity".
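Here's a quick way to see that instability with fake data.  Everything here -- the league size, the noise levels -- is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

for season in range(3):
    rs = rng.normal(700, 50, size=24)            # runs scored, 24 teams
    rc = rs + rng.normal(0, 15, size=24)         # RC: unbiased but noisy, highly correlated
    wins = 0.1 * rs + 11 + rng.normal(0, 4, 24)  # wins = 0.1*RS + 11, plus luck
    A = np.column_stack([rs, rc, np.ones_like(rs)])
    b_rs, b_rc, intercept = np.linalg.lstsq(A, wins, rcond=None)[0]
    print(f"season {season}: Wins = {b_rs:.3f}*RS + {b_rc:+.3f}*RC + {intercept:.1f}")
```

The two coefficients bounce around from simulated season to season, but their sum stays close to 0.1.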

Here's the example for real.  I ran the RS/RC regression for three different MLB seasons, 1973, 1974, and 1975.  Here are the three equations I got:

1973 Wins = 0.200*RS - 0.099*RC + 11.6  
1974 Wins = 0.153*RS - 0.001*RC - 20.3
1975 Wins = 0.044*RS + 0.100*RC - 15

The coefficients jump around a lot, and the standard errors are large.  For the 1975 regression, none of the coefficients are statistically significant.  (But if I take out RC and just use RS, I get significance at 6 SD.)

As I said, this is established knowledge.  But I still wanted to understand, intuitively, why it happens.  This is how I explained it to myself.


Suppose I run a regression where I expect Y to be equal to X -- maybe I'm trying to predict the percentage of red balls drawn from an urn with replacement, based on the percentage of red balls actually in the urn.  I create a dataset that looks like this:

 Y X
56 51
72 70
44 45
67 82
90 93

Maybe the dataset has 1,000 rows.  I run the regression, and I get the best fit equation,

Y = 1*X + 0

Well, I don't get exactly that, because of random variation, but close.

Now, maybe I had two people independently counting the contents of the urns, and I expect the counts to be slightly different.  I can try to reconcile the differences, but I figure, hey, why not let the regression do it?  So I just add a second independent variable, for the second person's count.

As it turns out, there were no errors, and the counts are exactly the same.  So the dataset looks like:

 Y X1 X2
56 51 51
72 70 70
44 45 45
67 82 82
90 93 93

When I run the regression, the software tells me it can't do it: the regression matrix is "singular," because my independent variables are perfectly correlated (which, obviously, they are).

What that means, in English, is that there are an infinite number of regression equations that work.  For instance, I can just use X1:

Y = 1*X1 + 0

Or I can just use X2:

Y = 1*X2 + 0

Or I can use half of each:

Y = 0.5*X1 + 0.5*X2

In fact, I can use any combination in which the coefficients add to 1:

Y = 0.1*X1 + 0.9*X2
Y = 0.7*X1 + 0.3*X2
Y = 3.5*X1 - 2.5*X2
Y = -99*X1 + 100*X2

and so on.  Because I have perfect "multicollinearity," I have infinite solutions.
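In matrix terms, "singular" just means the columns of the design matrix aren't independent.  A quick check, using my own toy numbers from the table above:

```python
import numpy as np

x1 = np.array([51.0, 70.0, 45.0, 82.0, 93.0])
X = np.column_stack([x1, x1, np.ones_like(x1)])  # X2 identical to X1

# Three columns, but only two independent directions:
print(np.linalg.matrix_rank(X))  # → 2
```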

Now, let's make one small change -- suppose there was one difference in the count, in the first row:

 Y X1 X2
56 51 50
72 70 70
44 45 45
67 82 82
90 93 93

Now, what happens?  Well, X1 and X2 are no longer absolutely perfectly correlated, so the regression can come up with an answer.  This is it:

Y = 6*X1 - 5*X2

Why that one?  Because it meets the criterion that, for the other 999 lines (where X1 and X2 are equal), the coefficients have to add to 1.  And, it fits the first line exactly, with zero error (since 6 * 51 - 5 * 50 equals 56).

But the random error could have happened at any line.  What if it had been the second line?  And, what if it was the same error, just off by one, like this:

 Y X1 X2
56 51 51
72 70 69
44 45 45
67 82 82
90 93 93

You'd expect the regression equation to be roughly the same, right?  It's still the same dataset in 998 lines out of 1000, and the other two lines just changed by 1.  But it's not.  The coefficients have shrunk:

Y = 3*X1 - 2*X2
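You can reproduce both of those fits with a least-squares solver.  To keep the answer exact, I'll idealize the data slightly: 999 rows where Y, X1, and X2 are all equal, plus the one discrepant row, so there's no other noise in the way:

```python
import numpy as np

def fit_coefs(odd_row):
    """999 rows with Y = X1 = X2 exactly, plus one discrepant row.
    Returns the least-squares coefficients (X1, X2, intercept)."""
    y = np.arange(1.0, 1000.0)
    x1, x2 = y.copy(), y.copy()
    oy, ox1, ox2 = odd_row
    y, x1, x2 = np.append(y, oy), np.append(x1, ox1), np.append(x2, ox2)
    A = np.column_stack([x1, x2, np.ones_like(x1)])
    return np.linalg.lstsq(A, y, rcond=None)[0]

print(fit_coefs((56, 51, 50)))  # → roughly [ 6. -5.  0.]
print(fit_coefs((72, 70, 69)))  # → roughly [ 3. -2.  0.]
```

A one-unit change in where the discrepancy sits flips the fitted coefficients from (6, -5) to (3, -2), just as described.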

The question is: why did the coefficients vary so much? 

The answer, as I see it, is that what's really important is not X2 -- it's the *difference* between X1 and X2.  And we don't have a variable for that.  So, when the estimate of (X1 - X2) changes, it has to drag both X1 and X2 along with it. 

That is, what we really want is:

Y = X1 + 2(X1 - X2) for this case, and
Y = X1 + 5(X1 - X2) for the previous case.

That's a much easier way to understand it, and it keeps the X1 coefficient constant, at 1.00.  Except ... we don't have a variable for X1 - X2.  So the regression does the expansion, and gets

Y = 3*X1 - 2*X2 for this case, and
Y = 6*X1 - 5*X2 for the other case.

It's the same equation, just arranged differently, and therefore harder to interpret.

The obvious way around this is to use (X1-X2) as the second variable.  Call it X3:

 Y X1 X3
56 51 0
72 70 1
44 45 0
67 82 0
90 93 0

Now, the independent variables (X1 and X3) are no longer correlated.  And, so, we get the "correct" coefficient for X1:

Y = X1 + 2*X3  for this case, and
Y = X1 + 5*X3  for the previous case.

What happens is, this time, you get the wide confidence interval for X3, but a narrow one for X1, which is really how you want to look at it.
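Same idealized data as in the earlier sketch, but regressing on X1 and X3 = (X1 - X2) instead.  Now the X1 coefficient sits at 1.00, and all the instability lands on X3:

```python
import numpy as np

def fit_with_difference(odd_row):
    """999 rows with Y = X1 and X3 = 0, plus one discrepant row.
    Returns the least-squares coefficients (X1, X3, intercept)."""
    y = np.arange(1.0, 1000.0)
    x1 = y.copy()
    x3 = np.zeros_like(y)          # X1 and X2 agree on these rows
    oy, ox1, ox2 = odd_row
    y = np.append(y, oy)
    x1 = np.append(x1, ox1)
    x3 = np.append(x3, ox1 - ox2)  # the lone disagreement
    A = np.column_stack([x1, x3, np.ones_like(x1)])
    return np.linalg.lstsq(A, y, rcond=None)[0]

print(fit_with_difference((72, 70, 69)))  # → roughly [1. 2. 0.]
print(fit_with_difference((56, 51, 50)))  # → roughly [1. 5. 0.]
```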


If that's not intuitive enough, try this:

Every line except one has X1 and X2 the same.  In those lines, the regression gets no extra information from X2 beyond what it got from X1.  Therefore, for X2, the regression is *completely* dependent on that one line in which they're different.  In other words, the regression will assume that the difference between X1 and X2 is *completely* responsible for the error in that line. 

If the error in that line is 5 (for instance, Y=60 and X1=55), and the difference between X1 and X2 is 1, the regression will see that you need 5 differences to fill the gap.  Therefore, you need 5(X1-X2).  Therefore, the coefficient of X2 has to be -5 (and the coefficient of X1 has to be 6).

If the error in that line were 1 (Y=77, X1=76), the regression will see that the coefficient of X2 has to be -1.

If the error in that line were 10 (Y=62, X1=52), the regression will see that the coefficient of X2 has to be -10.

That's why the coefficient of X2 varies so much -- it has to vary with the size of the error on the row where it's different from X1.  And that error can be large -- choosing 100 balls out of an urn with 50 red balls, the SD of the number of red balls drawn is 5.  That means the SD of (Y-X1) is 5.  Therefore, the SD of the coefficient of X2 is 5. 

And, since the coefficients of X1 and X2 have to add to 1, that means the coefficient of X1 has to vary just as much as X2's, with an SD of about 5.


More formally: in the example that I've been using, where only one line has X2 lower than X1 by 1, you can do some algebra and figure that the coefficient of X1 has to be equal to

1 + (Y - X1)

where Y and X1 are the values for that line.  The SD of 1 + (Y - X1), for any given line, is 5.  When the "real" value of the coefficient is 1, but it has an SD of 5 ... that's why the coefficient jumps around a lot, and that's why it's often not statistically significant.


Do any of these explanations make sense?  I'm finding this harder to explain smoothly than other stuff, but I hope you get the idea.