The Hamermesh study on umpires and race
Yesterday, an article in Time Magazine discussed a new study about biased umpiring. Here’s the study. It’s called “Strike Three: Umpires’ Demand for Discrimination,” by Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh.
The paper is quite similar to the Price/Wolfers paper on basketball referees (which I earlier reviewed in three parts). However, the data are much less convincing, and there are seeming conflicts in the results that I don’t understand.
The authors divide umpires and pitchers into four racial (or ethnic) groups: white, black, Hispanic, and Asian. (Hispanic is defined as any player born in one of several Spanish-speaking countries, regardless of skin color.) For three full seasons of play-by-play data, they counted balls and called strikes for each combination of pitcher and umpire. They conclude that pitchers have an advantage when facing an umpire of their own group. White umpires seem to favor white pitchers (over black, Hispanic, and Asian), black umpires favor black pitchers, and so on. That is, umpires discriminate in favor of their own kind.
To show what they did, I’ll simplify things by ignoring the Hispanic and Asian groups (there are no Asian umpires anyway), and just show the data for white and black. Here’s the authors’ summary of the results (taken from their Table 2):
| | White Pitcher | Black Pitcher | Difference |
|---|---|---|---|
| White Umpire | .3206 (741,729) | .3061 (25,108) | .0145 |
| Black Umpire | .3193 (46,825) | .3076 (1,765) | .0117 |
The numbers in the table are the proportion of called pitches that were strikes; the number of pitches follows in parentheses.
You can see that white pitchers got more strike calls than black pitchers, regardless of who the umpire was; the overall numbers were 32.05% strikes for the white pitchers, and 30.62% for the black pitchers. We can reasonably conclude that white pitchers are actually more skilled in this regard. However, the size of the black pitchers’ disadvantage depends on the umpire. White umpires gave black pitchers a .0145 disadvantage relative to white pitchers, while black umpires cut that disadvantage to .0117.
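Here is a quick sketch of that arithmetic in Python, using only the rates and pitch counts from the table. The variable and function names are mine, and because the table's rates are rounded to four decimals, the pooled figures are approximate.

```python
# Called-strike rates and pitch counts from the 2x2 table,
# keyed by (umpire group, pitcher group).
cells = {
    ("white", "white"): (0.3206, 741_729),
    ("white", "black"): (0.3061, 25_108),
    ("black", "white"): (0.3193, 46_825),
    ("black", "black"): (0.3076, 1_765),
}

def pooled_rate(pitcher):
    """Pitch-weighted strike rate for one pitcher group, over both umpire groups."""
    strikes = sum(rate * n for (_, p), (rate, n) in cells.items() if p == pitcher)
    pitches = sum(n for (_, p), (_, n) in cells.items() if p == pitcher)
    return strikes / pitches

def gap(umpire):
    """White-minus-black strike rate as called by one umpire group."""
    return cells[(umpire, "white")][0] - cells[(umpire, "black")][0]

white_rate = pooled_rate("white")
black_rate = pooled_rate("black")
print(f"white pitchers: {white_rate:.4f}, black pitchers: {black_rate:.4f}")
print(f"gap with white umpires: {gap('white'):.4f}")
print(f"gap with black umpires: {gap('black'):.4f}")
```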
This indeed seems to show that umpires favor their own group. But what would the chart look like if there were no bias at all? There are many ways to equalize the two groups of umpires. The easiest is to subtract .0028 from the Black/Black cell, in order to widen the .0117 to .0145. The chart would then look like this:
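| | White Pitcher | Black Pitcher | Difference |
|---|---|---|---|
| White Umpire | .3206 (741,729) | .3061 (25,108) | .0145 |
| Black Umpire | .3193 (46,825) | .3048 (1,765) | .0145 |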
What’s the real difference between the two cells? In the real chart, black umpires called 543 strikes out of 1,765 pitches. To get the bottom chart, black umpires would have had to call only 538 strikes out of 1,765 pitches. The difference: five pitches.
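The five-pitch figure can be checked in a few lines. This is just the arithmetic from the paragraph above; the variable names are mine, and strike counts are rounded to whole pitches.

```python
pitches = 1_765                       # black umpire / black pitcher cell
actual_rate = 0.3076                  # from the real chart
unbiased_rate = actual_rate - 0.0028  # widens the .0117 gap to .0145

actual_strikes = round(actual_rate * pitches)
unbiased_strikes = round(unbiased_rate * pitches)
pitch_difference = actual_strikes - unbiased_strikes

print(actual_strikes, unbiased_strikes, pitch_difference)
```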
In more than 7,000 ballgames over three seasons, the two groups of umpires are five pitches away from showing absolutely no racial bias. Obviously, that’s not statistically significant.
If you repeat this analysis, this time comparing white and Hispanic pitchers and umpires, the difference is 35 pitches out of 7,323. That’s a somewhat higher proportion, about half a percent, but still not statistically significant (the SD, using the binomial distribution, is 39).
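For anyone who wants to check the binomial SD, here is a minimal sketch. The strike probability for that cell isn't quoted here, so I'm assuming roughly 0.30, which is about the overall called-strike rate; the function name is mine.

```python
import math

def binomial_sd(n, p):
    """Standard deviation of the number of strikes in n called pitches."""
    return math.sqrt(n * p * (1 - p))

P_STRIKE = 0.30  # assumption: roughly the overall called-strike rate

# 35 pitches out of 7,323
sd_large = binomial_sd(7_323, P_STRIKE)
z_large = 35 / sd_large

# Black/black cell from the 2x2 table: 5 pitches out of 1,765
sd_small = binomial_sd(1_765, P_STRIKE)
z_small = 5 / sd_small

print(f"SD = {sd_large:.1f} pitches, difference = {z_large:.2f} SDs")
print(f"SD = {sd_small:.1f} pitches, difference = {z_small:.2f} SDs")
```

Both differences come out at less than one SD, well short of the roughly two SDs usually needed for significance.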
Finally, if you compare Hispanic and black, the difference is less than two pitches.
Why are the discrepancies so small over such a large sample of data? It’s because of the small samples in the Hispanic/Hispanic and black/black cells. There are about three times as many white pitchers as Hispanic, and 30 times as many white pitchers as black. There are also few black umpires, and even fewer Hispanic umpires. The result: there are, literally, 420 times as many white/white datapoints as black/black datapoints.
Given all this, I’m not sure how the authors manage to come up with statistically significant (and baseball significant) findings of bias. I’ll return to that in a bit.
When the umpire and pitcher are of the same race, the study calls it “UPM” for “umpire/pitcher match”. In the black/white pairings, there are 743,494 UPM pitches, but only 71,933 non-UPM pitches. More than 99% of UPM pitches involve a white pitcher, while only 65% of non-UPM pitches do. And so, since white pitchers have higher strike frequencies than black pitchers, the UPMs also have higher strike frequencies than non-UPMs: working from the 2x2 table above, about .3206 for UPM pitches versus about .3147 for non-UPM pitches.
This has nothing to do with racial bias – it’s just a consequence of how the numbers work out. White pitchers throw more strikes than black pitchers, and the UPM group is dominated by the white/white group. So even if umpires were unbiased – or even moderately biased in favor of blacks – we’d still see this effect.
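To see the composition effect on its own, here is a small sketch that uses the actual pitch counts but makes every umpire perfectly unbiased: each pitcher group gets its pooled strike rate (32.05% and 30.62%) no matter who is behind the plate. That unbiased world is a hypothetical of mine, not something from the study.

```python
# Pitch counts from the 2x2 table, keyed by (umpire group, pitcher group).
counts = {
    ("white", "white"): 741_729,
    ("white", "black"): 25_108,
    ("black", "white"): 46_825,
    ("black", "black"): 1_765,
}

def upm_rates(rate_by_pitcher):
    """Return (UPM rate, non-UPM rate) when the strike rate depends only on
    the pitcher's group, i.e. when umpires show no bias at all."""
    strikes = {True: 0.0, False: 0.0}
    pitches = {True: 0, False: 0}
    for (umpire, pitcher), n in counts.items():
        match = umpire == pitcher
        strikes[match] += rate_by_pitcher[pitcher] * n
        pitches[match] += n
    return strikes[True] / pitches[True], strikes[False] / pitches[False]

# Hypothetical: both umpire groups call each pitcher group at its pooled rate.
upm, non_upm = upm_rates({"white": 0.3205, "black": 0.3062})
print(f"UPM: {upm:.4f}, non-UPM: {non_upm:.4f}, gap: {upm - non_upm:.4f}")
```

Even with no bias built in, the UPM group comes out about half a percentage point ahead, simply because it is almost entirely white pitchers.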
I mention this because it completely explains two of the columns in the study’s Table 3. In columns “1d” and “2d,” the authors run a regression that includes UPM. The table shows, not surprisingly, that the coefficient for UPM is hugely significant. Again, that’s just because the white pitchers throw more strikes than the black pitchers.
That problem goes away if you include a race variable in the regression, such as pitcher race. In that case, you wind up adjusting for the fact that white pitchers are strikier than black, and the coefficient for UPM starts to make more sense.
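A rough way to see what that adjustment does with the 2x2 numbers is to compare match against non-match separately within each pitcher group. This is a simple stratification by pitcher race, not the authors' regression, and the names are mine.

```python
# Called-strike rates from the 2x2 table, keyed by (umpire group, pitcher group).
rates = {
    ("white", "white"): 0.3206,
    ("white", "black"): 0.3061,
    ("black", "white"): 0.3193,
    ("black", "black"): 0.3076,
}

# Same-race umpire minus other-race umpire, holding the pitcher group fixed.
white_pitcher_effect = rates[("white", "white")] - rates[("black", "white")]
black_pitcher_effect = rates[("black", "black")] - rates[("white", "black")]

print(f"white pitchers: {white_pitcher_effect:+.4f}")
print(f"black pitchers: {black_pitcher_effect:+.4f}")
```

Within each pitcher group the match effect is only about .0013 to .0015, much smaller than the raw UPM versus non-UPM gap.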
That’s what the study does in the other columns of Table 3 – notably, column “3d,” which, I think, is the major result of the paper. There, the authors run a regression that includes UPM, but also a bunch of control variables. They have an indicator variable for each of the 12 possible counts (from 0-0 to 3-2), and an indicator variable for each of the 900 or so pitchers in the study. The inclusion of the pitcher variables means that when a white umpire calls a white pitcher, the encounter is adjusted to the proportion of strikes that pitcher is expected to throw. So the fact that whites throw more strikes than blacks should be factored out.
What’s the result? Statistical significance at about the 2.5% level (almost exactly 2 standard errors). The estimate of the UPM factor is 0.00341. Since the overall probability of a strike is somewhere around 0.30, the increase is about 1%. That’s fairly significant in a baseball sense. 1% is about half a pitch a start, perhaps (remember, swinging strikes aren't included). My research shows that changing a ball into a strike (or vice-versa) is worth about .14 of a run; a study in Baseball Prospectus estimated it at .10 runs. Either way, taking a hit of .05-.07 on your ERA, just by facing the “wrong” race umpire, is fairly significant.
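The back-of-the-envelope numbers above, spelled out. The half-pitch-per-start figure is taken as given from the text rather than derived, and the variable names are mine.

```python
upm_coefficient = 0.00341   # estimated UPM effect on strike probability
baseline_strike_prob = 0.30 # rough overall called-strike probability

relative_increase = upm_coefficient / baseline_strike_prob

pitches_per_start = 0.5     # "about half a pitch a start, perhaps"
run_values = {"Baseball Prospectus": 0.10, "my estimate": 0.14}
runs_per_start = {src: pitches_per_start * v for src, v in run_values.items()}

print(f"relative increase: {relative_increase:.1%}")
for source, runs in runs_per_start.items():
    print(f"{source}: {runs:.2f} runs per start")
```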
But, to be honest, I don’t understand where this number could have come from, based on the raw data. As I showed in the 2x2 table above, the discrepancy between black and white is 5 pitches in 1,765, which is 1/3 of 1%. For white/Hispanic, it’s about half a percent. For Hispanic/black, it’s about an eighth of a percent. None of these is statistically significant.
So, how is it that Table 3 can combine all these and get something higher than even the largest of the 2x2 discrepancies, and so much more statistically significant? The only thing I can think of is the control variables for the count and specific pitcher. Even though the raw data don’t show much discrimination, it could be that when you look closer, the white umpires, by random chance, faced better black pitchers than average, so they should have called even more strikes than average (but didn’t). Still, that seems unlikely, given the size of the sample.
Anyone have an idea what’s going on? Is there something wrong with my 2x2 analysis?
In Table 4, the regressions include a variable for whether or not the game was pitched in a QuesTec park. In those parks, the umpires are “graded” against an electronic observation on whether or not a pitch should have been called a strike. The idea is that when umpires are being observed, they should discriminate less, because they have an incentive to be accurate instead of biased. That’s where the title of the study, "Umpires’ Demand for Discrimination," comes from. The implication is that umpires (perhaps unconsciously) like to discriminate, but will “buy” less discrimination when the price goes up (QuesTec).
The results show there's very much an effect. The UPM factor was much smaller when QuesTec was in operation. Indeed, the umps appear to have *overcompensated.* Unmonitored, umpires were 2% more likely to call a strike on a same-race pitcher. Under QuesTec, they were two-thirds of a percent *less* likely (although that number is not significantly different from zero).
While I’m convinced that umpires call pitches differently under QuesTec, I’m still uncomfortable about the discrimination estimates.
The authors then proceed to Table 5, where they add control variables for attendance and “terminal count” (any count with three balls or two strikes, where this pitch is more likely to make the at-bat "terminate").
I don’t think we can learn anything from the attendance analysis, because the sample for these new variables is not random. Parks with low attendance probably have worse pitchers. The study did control for that, but what it didn’t control for was that parks with low attendance have worse pitchers *playing at home*. This would tend to cause the worse pitchers to play slightly better, and the better pitchers (who are on the road) to play slightly worse.
For what it’s worth, the authors found that high attendance almost completely cancels out the racial bias (a small amount is left). But I’m not sure whether that would still hold if you re-ran the study, correcting the pitcher effects for home/road. Normally I would think it's not that important, but this study does look for very small effects.
As for the “terminal count” study, different pitchers will have different types of terminal counts. Good pitchers will have a lot of 0-2s, and (semi-deliberately) throw a lot of balls. Bad pitchers would have a lot of 3-0s, and (semi-deliberately) throw a lot of fastball strikes. I think that would render the results of the study unreliable.
But, again, for what it’s worth, on “terminal counts,” the bias is completely reversed: umpires judge own-race pitchers more harshly. The authors suggest it's because the ump knows his judgement on this pitch will be watched more closely.
Table 6 attempts to relate the race bias to team winning percentage -- how many wins is the bias worth? (The authors seem unaware of the previous work on the run value of balls and strikes.) I’m not sure how to interpret the “probit” coefficients in the table (are they just percentage points?), but the authors say that the probability of winning increases by "over 4.2 percentage points for a home team if its starting pitcher matches the umpire’s race/ethnicity." That seems way too big: it would improve the chance of winning from .540 to .582. But if two pitches per game are affected, that’s .28 runs, which is almost .030 in winning percentage (at 10 runs per win).
Still, two pitches per game is a lot. Overall, how many pitches in a typical game are borderline? Most calls are pretty clear. Suppose there are, say, 20 pitches a game that can go either way. Suppose 10 go to the pitcher and 10 go to the hitter. Now, suppose there are two calls that go the "wrong" way because of race. That brings the 10 up to 12. That means that umpires are so biased that 17% of close pitches are decided by race!
My gut says that's pretty unlikely.
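Here is that arithmetic as a short sketch. The two-pitches-per-game and 20-borderline-pitches figures are the suppositions from the paragraphs above, not measured values, and the variable names are mine.

```python
pitches_affected = 2     # supposed calls per game swung by race
run_value = 0.14         # runs per ball turned into a strike
runs_per_win = 10

runs_gained = pitches_affected * run_value
win_pct_gain = runs_gained / runs_per_win

claimed_gain = 0.042     # the authors' "over 4.2 percentage points"
home_win_pct = 0.540
boosted_win_pct = home_win_pct + claimed_gain

# Of 20 borderline pitches, 10 go the pitcher's way before any bias;
# two race-driven calls bring that to 12.
calls_for_pitcher = 10 + pitches_affected
share_decided_by_race = pitches_affected / calls_for_pitcher

print(f"runs: {runs_gained:.2f}, win pct gain: {win_pct_gain:.3f}")
print(f"claimed: {home_win_pct:.3f} -> {boosted_win_pct:.3f}")
print(f"share of those calls decided by race: {share_decided_by_race:.0%}")
```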
A couple more quick comments:
-- In a section on robustness checks, the authors find that the race of the batter doesn’t seem to matter – only that of the pitcher. They hypothesize that the umpire is reluctant to show bias against the batter because he stands so close. The batter can confront the umpire, while the pitcher can’t. Doesn’t seem too unreasonable to me. Also, there is “weak evidence” that the more-experienced umps show less bias than the less-experienced ones; and, there is *no* observed bias among the crew chiefs.
-- To their credit, the authors suggest explanations for all the observed effects that don’t involve racial bias, although those only appear in their FAQ. For instance, they suggest that perhaps Hispanic umpires and pitchers have distinctive “styles,” and so both will implicitly understand a certain kind of pitch to be a strike. Other umpires may not call it a strike, and other pitchers won’t throw it, and so the Hispanic/Hispanic strike count is inflated.
-- I think all the standard errors given in the study are too small, which makes the significance levels look stronger than they should. That’s because the analysis the authors do assumes that the umpires are seeing a random sample of pitches. They aren’t – they see pitchers in bunches. If umpire X has 30 games a year behind the plate, he’s going to see fewer than half the starters in the league. So it’s kind of a cluster sample instead of a random sample. I don’t know how much of a difference this makes, whether the significance is overstated a little or a lot, but I wish the study had tried to estimate the effect.
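To give a feel for how much clustering could matter, here is a sketch using the standard survey-sampling design effect, 1 + (m - 1) * rho, where m is the cluster size and rho is the within-cluster correlation. This is a textbook approximation I'm swapping in, not anything the authors computed, and the cluster size and correlation below are made-up illustrative values.

```python
import math

def design_effect(cluster_size, icc):
    """Variance inflation from cluster sampling, assuming equal-sized
    clusters and within-cluster correlation icc."""
    return 1 + (cluster_size - 1) * icc

def adjusted_z(z, cluster_size, icc):
    """Shrink a z-score computed as if the sample were simple random."""
    return z / math.sqrt(design_effect(cluster_size, icc))

# Hypothetical values: 100 called pitches per umpire/pitcher game,
# and a within-game correlation of 0.01.
deff = design_effect(100, 0.01)
z_after = adjusted_z(2.0, 100, 0.01)
print(f"design effect: {deff:.2f}, z of 2.0 becomes {z_after:.2f}")
```

Under those made-up inputs, a result sitting at 2 standard errors would drop to about 1.4, but the real answer depends entirely on the actual correlation, which I don't know.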
My bottom line is that I’m a bit confused by this study. The raw numbers seem as close to non-biased as you can get, but the regressions seem to show a significant effect. Until the discrepancy is explained to me, I have to go with the 2x2 data, because that’s what I understand. And that says no discrimination.
One explanation might be that the specific technique the authors used, despite using “pitcher fixed effects,” didn’t fully adjust for the fact that white pitchers throw more called strikes than black pitchers. If that’s the case, it would explain most of the positives. It wouldn’t explain why QuesTec completely reversed them into negatives, though. At any rate, these guys are professionals, and I can't think of anything wrong that might have caused this to happen.
And almost all the results go the expected way -- the umpires improve under QuesTec, they appear to improve when the pitch is more important, they improve when the crowds are large, and the experienced umpires show less bias. Again, this would be a big coincidence if there weren't real bias going on.
So am I missing something? Can anyone help resolve the contradiction?