The Hamermesh umpire/race study revisited -- part VI
This is Part 6 of the discussion of the Hamermesh study. The previous five parts can be found here.
A few days ago, I thought I was done with the Hamermesh paper, but I found something else that might substantially impact the results.
In almost all its regressions, the paper controls for the score of the game. That makes sense; you'd expect that, with an 8-run lead, a team would tend to throw more strikes than when the game is tied. That's because,
"... if a pitcher is ahead in the game, he typically pitches more aggressively and is more likely to throw a pitch in the strike zone. ... The reason is that having a lead effectively reduces the pitcher's risk aversion. Relative to throwing a pitch likely to result in a walk, throwing a "hittable" pitch is risky – it increases the probabilities of both a very poor outcome for the pitcher (such as a home run) and a very good one (a fly out)."
So far, I agree. But the problem is that the study doesn't use indicator variables for separate scores – it just uses a single variable for the size of the lead. That assumes the effect is linear in the lead. For instance, it forces the assumption that the effect of a 10-run lead is exactly twice as big as the effect of a 5-run lead. More importantly, it assumes that a pitcher who throws X extra strikes with a 6-run lead will throw X strikes *fewer* when he is *behind* by six runs.
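To see concretely what the single linear term forces, here's a toy sketch (made-up numbers, not the study's data): if strike percentage really rises with the *absolute* size of the lead, a linear-in-lead regression fits a slope of roughly zero and mirrors its predictions around the intercept.

```python
import numpy as np

# Hypothetical U-shaped relationship: strike rate rises with the
# absolute size of the lead (numbers invented for illustration).
leads = np.arange(-10, 11)
true_strike_pct = 28.0 + 0.04 * leads ** 2

# A single linear "lead" term, as in the study's regressions.
X = np.column_stack([np.ones_like(leads), leads])
beta, *_ = np.linalg.lstsq(X, true_strike_pct, rcond=None)
intercept, slope = beta

# Because the data are symmetric, the fitted slope is ~0: the model
# concludes the score barely matters, even though it plainly does.
# And any nonzero slope would be forced to be antisymmetric:
pred_plus6 = intercept + slope * 6    # prediction at a 6-run lead
pred_minus6 = intercept + slope * -6  # prediction at a 6-run deficit
```

Indicator (dummy) variables for each score, by contrast, would let every lead have its own effect.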
That doesn't seem to make sense, does it? You'd think that when a pitcher is (say) 10 runs behind, he'll throw lots of strikes. For one thing, his team isn't going to win the game anyway, which means there's no risk, and no risk aversion. For another thing, he probably doesn't want to wear out his arm by throwing too many pitches, so he's not going to try to pick the corners quite as much.
You'd think that more strikes happen when the score is extreme either way, not when it's only in the positive direction. Wouldn't you?
And I think that's what happens. I checked all pitches from the 1991 to 1996 seasons (thanks, as always, to Retrosheet). I limited the sample to relief pitchers, to eliminate part of the bias where pitchers who are five runs behind throw fewer strikes simply because they're not very good (which is why they gave up so many runs in the first place). That's necessary because the Hamermesh study controlled for pitcher quality, and I have no way of doing that. It doesn't fix the problem entirely – even relief pitchers might be responsible for part of the score – but it's better than nothing.
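The tabulation itself is simple once the pitches are extracted. Here's a minimal sketch, assuming the Retrosheet events have already been parsed and filtered down to called pitches thrown by relievers; the `(lead, is_called_strike)` pair format is my own invention, not a real Retrosheet field.

```python
from collections import defaultdict

def strike_pct_by_lead(pitches):
    """Called-strike percentage, grouped by the pitcher's lead.

    `pitches` is assumed to be an iterable of (lead, is_called_strike)
    pairs, already filtered to called pitches (swinging strikes and
    balls in play excluded) thrown by relief pitchers. That filtering,
    and the pair format, are hypothetical -- real Retrosheet event
    files need parsing first.
    """
    strikes = defaultdict(int)
    total = defaultdict(int)
    for lead, is_strike in pitches:
        total[lead] += 1
        strikes[lead] += int(is_strike)
    return {lead: 100.0 * strikes[lead] / total[lead]
            for lead in sorted(total)}

# Tiny made-up sample: 2 called strikes in 5 pitches at a tie,
# 2 in 3 with a 3-run lead.
sample = [(0, True), (0, False), (0, False), (0, True), (0, False),
          (3, True), (3, True), (3, False)]
```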
So here's what I found for called strike percentages based on score. Plus means the pitcher is ahead; minus means he's behind:
As I guessed, the strike percentage does increase along with the lead, but it also increases with the *deficit*. In any case, it's obviously not linear.
Here's a regression I did on lead vs. strikes. The trend is positive – more runs do lead to more strikes – but if you look at the graph, it's obviously not a very good fit. Sorry about the quality.
It looks like a quadratic curve might be a better fit. To check, I ran another regression, this time including a "lead squared" term. The squared term was statistically significant, and it made the fit much better:
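Here's a sketch of that check on made-up, noiseless data (coefficients chosen only to produce the U-shape described above): adding the squared term lets the fit capture the curvature a straight line can't.

```python
import numpy as np

# Hypothetical strike percentages: rise with both the lead and the
# deficit, plus a small linear tilt (made-up coefficients).
leads = np.arange(-8, 9)
strike_pct = 28.0 + 0.08 * leads + 0.05 * leads ** 2

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Degree-1 fit (lead only) vs. degree-2 fit (lead and lead squared).
lin = np.polyval(np.polyfit(leads, strike_pct, 1), leads)
quad = np.polyval(np.polyfit(leads, strike_pct, 2), leads)

# The squared term soaks up the U-shape the linear model can't see.
```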
And keep in mind there is some "quality leakage" here too. Remember that this is relievers only. When ahead by 1 to 3 runs, teams are more likely to bring in their stopper, so those pitchers are better than normal. When behind by 4 to 6 runs, teams are more likely to bring in a mop-up man, so those pitchers are worse than normal. If you normalize for pitcher quality by lifting the left end of the curve and lowering the right end, you get a nice U-shape. I'm betting that's what would happen if you added indicator variables for the individual pitchers, as the real study did.
(Technical note: if you use weighted least squares instead of ordinary least squares, weighting by the number of times that lead occurred, you get much the same result.)
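Mechanically, the weighted version just counts common scores more heavily than rare ones. A sketch, with made-up frequencies (ties common, blowouts rare):

```python
import numpy as np

leads = np.arange(-6, 7)
strike_pct = 28.0 + 0.05 * leads ** 2   # hypothetical U-shape
counts = 1000 - 60 * np.abs(leads)      # hypothetical pitches per lead

# Weighted least squares = ordinary least squares after scaling each
# row by the square root of its weight.
X = np.column_stack([np.ones_like(leads), leads, leads ** 2])
w = np.sqrt(counts)
beta, *_ = np.linalg.lstsq(X * w[:, None], strike_pct * w, rcond=None)
intercept, b_lead, b_lead_sq = beta
```

Here the data are noiseless, so the weights don't change the fitted coefficients – the quadratic shape survives the reweighting.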
Now, what you'll notice is that the discrepancies from the top curve (the bottom one too, but the top is what the Hamermesh study used) can get pretty big. Look at the tied games – "0" on the horizontal axis – and you'll notice that strikes in tie games are a lot scarcer than the study thinks they are. By design, the Hamermesh model has tie-game strikes occurring at the overall average. But tie-game strikes happen a lot less often than average. In this sample, the overall mean was 29.35% strikes, but the tie-game rate was 27.93%. That's a difference of 1.42 percentage points.
That gap is very, very large in the context of this study. It's about the same as the difference between white pitchers and black pitchers, and about 1.7 times as large as the biggest "racial bias" coefficient found in the study (0.84%).
And consider games with the pitcher ahead by 2-3 runs. According to the study's linear adjustment, they should have been 0.36 percentage points above average in strikes. But in real life, they were over 2.00 percentage points higher! Again, that's off by about a point and a half. (The true error is probably smaller, since my sample overrepresents ace relievers, but the point remains.)
Remembering that the "black/black" sample consisted of only 11 games total, is it possible that those 11 games happened to be games in which those pitchers had the lead, which made black pitchers appear to have gotten more strikes from black umpires than they deserved? Or, could the black/hispanic sample, which was barely over 5 full games, have had a lot of close games, which caused those hispanic pitchers to look like they didn't get enough calls from those same black umpires? It seems possible, doesn't it?
The fact is that the linear score adjustment used in the study is wildly inaccurate. That would add an element of randomness selectively to those pitchers who appeared in certain situations. That would increase the standard error of all the estimates in the paper. I can't prove it mathematically, but I think the increase in variance would be enough that some of the statistically-significant findings would become non-significant.
The broader point, as commenters to previous posts have touched on, is this: if you're looking for a very, very small effect in the data, you need to be sure that your model, your assumptions, and your corrections are sufficiently accurate that, if you *do* find an effect, you can assume it's real and not an artifact of the method used. And, after considering this problem with the score adjustment, I don't think this study is able to give the reader confidence that the assumptions are sufficiently precise.
Put another way: a completely unadjusted, naive reading of the data (Table 2) shows almost no bias whatsoever. Controlling for a whole bunch of other variables suddenly *does* show bias. Is the bias that remains truly what's left after properly adjusting for the randomness surrounding it? Or did improper adjustments, incorrect assumptions, and incomplete calculations *cause* the appearance of bias?
At this point, I think the adjustments – especially this one – aren't clean enough that we can be confident that what's left is real.
I think I'm done talking about this paper now ... unless I think of something else to say. Either way, in a few days, I'll come back and summarize everything properly.