The Hamermesh umpire/race study revisited -- part III
This is part 3 of a discussion of the Hamermesh study. There have been several previous posts:
Old: first, second, third, fourth
Recent: part 1, part 1(addendum), part 2
Here's a summary of my analysis of the Hamermesh paper so far:
-- I agree with the study that there is no evidence of bias in the overall data (Table 2 in the paper).
-- I believe the study is flawed because the model forces the assumption that umpires of every race are equally biased in favor of their own race.
-- Because of that flaw, I believe the coefficients found by the study's regression are not meaningful, although the significance levels are.
-- Doing my own regressions, I found an insignificant same-race preference for black and hispanic umpires, but a slight insignificant *opposite-race* preference for white umpires.
-- I argue that the results from the raw data can be explained by as few as 100 pitches out of a million.
-- After controlling for pitcher, umpire, batter, and count, the authors of the original study found an effect about five times as large as without those controls. That effect is also not statistically significant.
Now, let's turn to the paper's next test, the one that *does* turn out to be significant.
The authors note that about a third of major-league ballparks have the "QuesTec" system installed. QuesTec is a hardware and software package that measures whether or not a given pitch is in the strike zone. After the game, the umpire's calls are compared to the QuesTec results. The authors write,
" ... QuesTec is the primary mechanism to gauge umpires' performance. In particular, if more than 10 percent of an umpire's calls differ from QuesTec's records, his performance is considered substandard, and that may influence his promotion to "crew chief," assignment to post-season games, or even retention in MLB."
They note that if an umpire discriminates by race, whether consciously or unconsciously, he faces a higher "price" of doing so when QuesTec is in operation. He has a much stronger incentive to abandon his same-race preference when the cost is high, and, therefore, we should see much less evidence of discrimination when QuesTec is in effect.
And we do.
In the previous post, we saw that, after adjusting for batter, pitcher, umpire, and count, the coefficient of "UPM" (designating an umpire/pitcher race match) was 0.27 percentage points more called strikes. With QuesTec running, the effect is almost reversed: UPM comes out to –0.35, which in fact favors other races. However, when umpires do not have to face an objective review, same-race favoritism appears to skyrocket: it jumps to 0.63.
I'll write those numbers out again:
+0.63 without QuesTec
-0.35 with QuesTec
The –0.35 figure is not statistically significant (about 1 SD), but the +0.63 figure definitely is, at +2.2 SDs (p=.015). The difference between the two figures, 0.98%, is exactly 2 SDs from zero, and so is also significant.
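As a rough check on those figures, here is a minimal sketch that converts an SD count into a p-value. It assumes a one-tailed normal approximation, which is my assumption and not something the paper states; the function name is made up for illustration, and the paper's own p-values come out of its regression and may rest on slightly different SDs.

```python
import math

def one_tailed_p(z):
    """Upper-tail probability of a standard normal at z (normal approximation)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# UPM coefficients quoted above, in percentage points.
upm_without_questec = 0.63
upm_with_questec = -0.35
difference = upm_without_questec - upm_with_questec

print(f"difference: {difference:.2f} points")
# If that difference is 2 SDs from zero, one SD is half of it.
print(f"implied SD of the difference: {difference / 2:.2f} points")
for z in (1.0, 2.0, 2.2):
    print(f"{z:.1f} SD -> one-tailed p = {one_tailed_p(z):.3f}")
```

Under this approximation a 2.2 SD result comes out at roughly p=.014, close to the .015 quoted.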
It makes sense to me that umpires will be calling pitches differently with QuesTec and without. It is well-accepted that every umpire has his own strike zone. If 93 umpires have 93 strike zones, but have a strong motivation to all use the same strike zone when QuesTec is running, there should be some significant changes in pitch calling.
It's unfortunate, though, that the authors chose not to provide any other data on the QuesTec breakdown. They don't even tell us whether umpires call more strikes under QuesTec, or fewer. That's a question that's of great interest to sabermetricians and baseball fans. It turns out (thanks to a Mitchel Lichtman study as part of a discussion of this paper) that there's no difference: there are 31.9% strikes in both QuesTec and non-QuesTec parks.
Another omission is that the paper doesn't show us what races of pitcher or umpire were affected the most. If they had provided us with the QuesTec equivalent of their Table 2, with strike percentages for each of the nine cells, we could figure this out.
Actually, in Figure 1, the authors do provide a bit of that data in graphic form. White umpires call 32.1% strikes on white pitchers in non-QuesTec parks, but only 32.0% in QuesTec parks – a difference of 0.1 percentage points. Minority umpires (the authors group black and hispanic together here) calling white pitchers go from 31.8% to 32.2% -- suggesting a much higher race preference, about four times as high.
For minority pitchers, there's a huge difference in how minority umps call strikes when the pitcher is of the same race. They go from 32.2% strikes without QuesTec, all the way down to 30.6% strikes when QuesTec is watching them. That's a difference of 1.6%. On the other hand, when the umpire is not of the same race as the minority pitcher (which mostly means white umpires), the difference is only 0.1%.
White ump, white pitcher: -0.1% under QuesTec
Minority ump, same-race pitcher: -1.6% under QuesTec
White ump, minority pitcher: +0.1% under QuesTec (approx.)
Minority ump, white pitcher: +0.4% under QuesTec
So if the QuesTec effect really does come from race bias, it looks like minority umpires' bias is between 4 and 16 times as high as white umpires' bias! This is something the paper never mentions; the authors never noticed that the empirical results strongly contradict their assumption about equal bias for every race.
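Here is the arithmetic behind the "4 to 16 times" range, as a small sketch. The strike percentages are the ones read off Figure 1 above; the white-ump/minority-pitcher change is only given approximately in the text, so it is entered directly, and the variable names are mine.

```python
# Called-strike percentages from Figure 1: (without QuesTec, with QuesTec).
figure1 = {
    "white ump, white pitcher": (32.1, 32.0),
    "minority ump, same-race pitcher": (32.2, 30.6),
    "minority ump, white pitcher": (31.8, 32.2),
}

# Change under QuesTec, in percentage points, rounded like the text.
change = {cell: round(with_q - without_q, 1)
          for cell, (without_q, with_q) in figure1.items()}
# Only the approximate change is quoted for this cell.
change["white ump, minority pitcher"] = 0.1

# Compare minority umps to white umps, holding the pitcher group fixed.
ratio_white_pitchers = (abs(change["minority ump, white pitcher"])
                        / abs(change["white ump, white pitcher"]))
ratio_minority_pitchers = (abs(change["minority ump, same-race pitcher"])
                           / abs(change["white ump, minority pitcher"]))

for cell, delta in change.items():
    print(f"{cell}: {delta:+.1f}")
print(f"ratio for white pitchers: {ratio_white_pitchers:.0f}x")
print(f"ratio for minority pitchers: {ratio_minority_pitchers:.0f}x")
```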
However: I don't think any of these results are statistically significant. I think even the 1.6% figure is only about 1 SD (because of the small number of minority-on-same-minority pitches).
While none of the above four findings is statistically significant by itself, taken together, they combine to create the significant UPM coefficient the authors found for non-QuesTec parks.
That coefficient, again, is 0.63 percentage points. As I said in the last post, that coefficient means only that IF the races have identical biases, that bias is likely to be 0.63. But if the biases aren't identical, how big would they have to be?
To test, I started with an unbiased matrix that's close to the real Table 2:
Pitcher ------ White Hspnc Black
White Umpire-- 32.06 31.45 30.67
Hspnc Umpire-- 32.01 31.41 30.63
Black Umpire-- 31.85 31.24 30.46
As I mentioned, the authors didn’t provide a "Table 2" for non-QuesTec parks only. They did, however, say how many pitches there were, and it came out to 62.89% of the total. So I assumed that every cell had 62.89% as many pitches as the original. That's not exactly correct, as I found out when I added the study's 0.63 to each of the diagonal cells: I wound up getting a significance level of .04, instead of .015. That might be the result of different numbers of pitches, or rounding errors. To get down to p=.016 I had to use 0.75 as the UPM, instead of 0.63.
Anyway, there are lots of unequal combinations of W/W, H/H, and B/B bias that would give us the same UPM coefficient of 0.75, and significance level of .016. For instance, if we add 1.2 percentage points to the W/W cell, and leave the H/H and B/B cells alone, we get UPM=0.76, and p=.015. Here are some other combinations that work. I had to get these by trial and error:
-- W/W bias of 1.2 points (UPM=0.76, p=.015)
-- H/H bias of 2.6 points (UPM=0.76, p=.014)
-- B/B bias of 9.5 points (UPM=0.75, p=.015)
As it turns out, you can take any weighted combination of those three biases, as long as the weights add up to 100%, and get roughly the same result. So if we go one-third/one-third/one-third, we get W/W bias of 0.4, H/H bias of 0.87, and B/B bias of 3.2:
-- W/W 0.4, H/H 0.87, B/B 3.2 (UPM=0.76, p=.015)
We can also go 1/2 H/H and 1/2 B/B:
-- W/W 0.0, H/H 1.3, B/B 4.7 (UPM=0.75, p=.016)
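The mixing step can be sketched in a few lines. This only reproduces the weighting arithmetic: it takes the three single-cell biases found by trial and error above as given, and it does not re-run the regression, so the UPM and p-values are not recomputed. The function name is hypothetical.

```python
# Single-cell biases (percentage points) that each give a UPM of about 0.75,
# taken from the trial-and-error results listed above.
SINGLE_CELL_BIAS = {"W/W": 1.2, "H/H": 2.6, "B/B": 9.5}

def mix(weights):
    """Scale each single-cell bias by its weight; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must add up to 100%")
    return {cell: weights[cell] * bias
            for cell, bias in SINGLE_CELL_BIAS.items()}

thirds = mix({"W/W": 1 / 3, "H/H": 1 / 3, "B/B": 1 / 3})
half_h_half_b = mix({"W/W": 0.0, "H/H": 0.5, "B/B": 0.5})

for label, biases in (("thirds", thirds), ("half H/H, half B/B", half_h_half_b)):
    print(label, {cell: round(b, 2) for cell, b in biases.items()})
```

The half-and-half B/B figure comes out at 4.75 before rounding, which is the 4.7 used in the list.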
Let's take that last case and see what happens. That means that the bias in each cell is:
Pitcher ------ White Hspnc Black
White Umpire-- +0.00 +0.00 +0.00
Hspnc Umpire-- +0.00 +1.30 +0.00
Black Umpire-- +0.00 +0.00 +4.70
If that were the case, it would explain the QuesTec findings completely. How many pitches does it take to create this effect? This many:
Pitcher ------ Wht Hsp Blk
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +60 +00
Black Umpire-- +00 +00 +52
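For anyone who wants to check the pitch counts, here is a minimal sketch of the conversion from a bias in percentage points to a number of extra called strikes. The cell sizes are assumptions pieced together from figures quoted later in this post: 1,110 pitches in the B/B cell, and roughly 29 games of about 160 called pitches each for the H/H cell. The paper's actual cell counts may differ a little.

```python
CALLED_PITCHES_PER_GAME = 160  # approximate figure used later in the post

# Bias in percentage points and an (assumed) number of called pitches per cell.
cells = {
    "H/H": {"bias_points": 1.3, "pitches": 29 * CALLED_PITCHES_PER_GAME},
    "B/B": {"bias_points": 4.7, "pitches": 1110},
}

# Extra called strikes = bias (as a fraction) times pitches in the cell.
extra_strikes = {name: c["bias_points"] / 100 * c["pitches"]
                 for name, c in cells.items()}

for name, extra in extra_strikes.items():
    print(f"{name}: about {extra:.0f} extra called strikes")
print(f"total: about {sum(extra_strikes.values()):.0f}")
```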
So: it is possible that the entire effect the study found is due to only 112 pitches, out of 700,000. Now, that's a little unfair, because, even though it's a small number of pitches, it's particularly unlikely for it to happen exactly this way. (I think to get the most likely scenario, you'd want to find the maximum-likelihood diagonal that still creates a UPM of 0.75. I don't know how to do that. But that doesn't matter to my argument here.)
And the B/B bias is pretty high, at almost 5 percentage points. Since there are only 30% strikes in the first place, 5 percentage points out of 30 is a 17% increase in called strikes. You'd think that would be hard for an umpire to get away with.
But the point is that there's no way of knowing how many pitches are affected. And it's at least *possible* that the entire effect is only 112 pitches – which is probably a surprise.
Now, let me argue that the significance level may not be all it's cracked up to be. That would be because the significance calculation made one assumption that isn't true in the smaller cells: the assumption that the errors in each cell are independent of each other.
Let me try to explain what I'm getting at here.
The regression in the study corrected for batter, pitcher, umpire, count, score, home team, and QuesTec. But we can imagine there are some other things that would affect strike percentage.
For instance: maybe runners on base and number of outs. With one out and a runner on third, the pitcher is willing to risk a walk, in order to avoid giving the batter something good to hit. If black pitchers facing black umpires had fewer such situations, that could account for part of the effect.
Also, there's the day/night factor. It's likely that called strikes vary between day and night games. Suppose the ball is harder to see at night. Then there might be more swinging strikes. That means more 0-2 counts, which means more called balls. Or, it could be the other way around: bad visibility might mean more called strikes, as batters swing less.
In any case, suppose there's a significant difference between day and night. It could turn out that H/H and B/B combinations just happened to take place during the time of day when conditions were more "called-strikey."
I'm sure you can think of more factors that weren't controlled for. Maybe there's a platoon interaction: more called strikes when the pitcher has the platoon advantage.
These factors could be fairly large. And it could turn out that the same-race combinations just happened to have better circumstances.
Now, that's kind of grasping at straws, this "you didn't control for everything" criticism. In any study, there's always something you don't control for, and that, in and of itself, doesn't make the results of the study invalid. In fact, the significance *takes into account* that there are uncontrolled variables: the significance level is defined as the probability that the uncontrolled variables (including luck) could cause that kind of result.
But: that's only true if the errors are independent.
Suppose that the league bats .250. You pull 480 random AB from the league. What's the chance those random AB hit .300? Well, you expect 120 hits. You need 144. That's about 2.5 standard deviations away, so the chance is less than 1 in 100.
But: instead of pulling 480 random AB from the league, what if you first select a random player, then select 480 random AB from only that player's batting line? Now, the chances are much higher: roughly, the chance of having pulled a .300 hitter. Maybe they're only, say, 1 in 15.
In the first case, the chance of drawing an AB from a certain player was independent of the last one. But in the second case, it's not. And so the normal binomial probability calculation doesn't apply.
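A minimal sketch of that comparison, taking a .300 average over 480 AB to mean 144 hits. The first number is an exact binomial tail for independently drawn AB; the second is just the 1-in-15 guess from above for drawing a .300 hitter, not something computed from data. The helper name is mine.

```python
import math

def binom_upper_tail(n, p, k_min):
    """Exact P(X >= k_min) for X ~ Binomial(n, p), with terms built in log space."""
    total = 0.0
    for k in range(k_min, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total

n_ab = 480
league_avg = 0.250
target_avg = 0.300

expected_hits = n_ab * league_avg                       # 120 hits
needed_hits = round(n_ab * target_avg)                  # 144 hits
sd_hits = math.sqrt(n_ab * league_avg * (1 - league_avg))
z = (needed_hits - expected_hits) / sd_hits

# Case 1: every AB drawn independently from the whole league.
p_random_ab = binom_upper_tail(n_ab, league_avg, needed_hits)
# Case 2: all 480 AB come from one randomly chosen player (rough guess).
p_one_player = 1 / 15

print(f"SDs above expectation: {z:.2f}")
print(f"independent AB: about 1 in {1 / p_random_ab:.0f}")
print(f"single player:  about 1 in {1 / p_one_player:.0f}")
```

The point is the gap between the two cases, not the exact figures: the same .300 outcome is far more likely once the draws are clustered within one player.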
Now, let's talk about the B/B cell. Are its pitches completely random in terms of day/night, platoon advantage, and runners on base? They are not. And that's because the umpire/pitcher combinations aren't dotted throughout the season – they're concentrated into only a few specific games.
There are about 160 called pitches per game (80 per team). The B/B cell has only 1,110 pitches. 1,110 divided by 160 is about 7 games' worth of pitches. That's it, just seven games. If the pitches were independent, the chance of getting, say, 80% of those pitches being during day games would be infinitesimal. But it could easily happen that 6 of the 7 games would be day games. And because of that, the regression underestimates the probability that there could be a lot of day games, and therefore underestimates the amount of luck, and therefore overstates the significance of the findings.
(This argument is a little oversimplified because of relief pitchers – the 7 games is probably 11 seven-inning games worth of starters, and maybe 48 innings of relievers. But it's still easy to see how the B/B cell could be daylight-intensive.)
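To make the day-game example concrete, here is a rough sketch that compares the two ways of counting. It assumes one-third of games are day games, which is a number I made up for illustration, and it ignores the relief-pitcher wrinkle. Probabilities are reported as powers of ten because the independent-pitch case is far too small to print otherwise.

```python
import math

def log_binom_pmf(n, p, k):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log10_upper_tail(n, p, k_min):
    """log10 of P(X >= k_min), using log-sum-exp to avoid underflow."""
    logs = [log_binom_pmf(n, p, k) for k in range(k_min, n + 1)]
    biggest = max(logs)
    log_sum = biggest + math.log(sum(math.exp(x - biggest) for x in logs))
    return log_sum / math.log(10)

DAY_SHARE = 1 / 3   # hypothetical share of day games, for illustration only
PITCHES = 1110      # called pitches in the B/B cell
GAMES = 7           # roughly 1,110 / 160

# Treating every pitch as an independent draw: at least 80% during day games.
log10_p_independent = log10_upper_tail(PITCHES, DAY_SHARE, PITCHES * 8 // 10)
# Treating each game as the unit: at least 6 of the 7 games are day games.
log10_p_clustered = log10_upper_tail(GAMES, DAY_SHARE, 6)

print(f"independent pitches: about 10^{log10_p_independent:.0f}")
print(f"seven whole games:   about 1 in {1 / 10 ** log10_p_clustered:.0f}")
```

Under these assumptions the game-level chance is small but entirely ordinary, while the pitch-level calculation treats the same outcome as essentially impossible.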
Because of this concentration of pitches into small numbers of games – the B/B cell contains only 7 games, the B/H cell only 3 games, and the H/H cell only 29 games – the significance levels are overstated. Overstated by how much? I don't know. But the UPM coefficient was barely significant – 2.2 SD instead of just 2. If you figured out how to calculate the significance levels correctly – using cluster sampling methods that I don't know how to use – it seems to me very possible that the UPM would slip below significance.
Having said that, I'm not really that thrilled with the idea that you should draw such a fine line at 5% significance, where one side of the line is meaningful and the other side is not. But a difference between p=.015 (which the study found) and (for instance) p=.08 (after adjusting for clusters) is a difference between a 1 in 67 chance that the results would obtain by luck, and a 1 in 12 chance that the results would obtain by luck.
And, in any case, just think about the B/B cell, which had 52 called strikes too many in seven full games. Does that seem like something that could be caused by a few too many day games? I'm not sure. We are talking about one extra strike every two innings, which seems like a lot. How about one strike every four innings? That would reduce the cell from +52 to +26, which would still be enough to make the results no longer statistically significant.
My gut says no, that's still too much, but my gut doesn't really know. To find out, you can find the SD of called strikes in full games, and compare it to the SD of one game's worth of called strikes, taken randomly from multiple games, and do some calculations.
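Here is what that comparison might look like, sketched with simulated data and not real games. Everything in it is hypothetical: the game-to-game swing in strike rate (an SD of 3 percentage points) is a number I picked to show the mechanics, and the 31.9% base rate is just the overall figure quoted earlier. With real data you would replace the simulated lists with actual per-game called-strike counts.

```python
import random
import statistics

random.seed(0)

PITCHES_PER_GAME = 160
BASE_RATE = 0.319        # overall called-strike share quoted earlier
GAME_EFFECT_SD = 0.03    # hypothetical game-to-game swing in that rate
N_GAMES = 2000

def count_strikes(rate, n_pitches):
    """Number of called strikes in n_pitches at a fixed strike rate."""
    return sum(random.random() < rate for _ in range(n_pitches))

# Full games: every pitch in a game shares that game's conditions.
full_games = [count_strikes(random.gauss(BASE_RATE, GAME_EFFECT_SD), PITCHES_PER_GAME)
              for _ in range(N_GAMES)]

# One game's worth of pitches drawn from many games: each pitch
# gets its own, independently drawn conditions.
mixed_games = [sum(random.random() < random.gauss(BASE_RATE, GAME_EFFECT_SD)
                   for _ in range(PITCHES_PER_GAME))
               for _ in range(N_GAMES)]

sd_full = statistics.stdev(full_games)
sd_mixed = statistics.stdev(mixed_games)
sd_ratio = sd_full / sd_mixed

print(f"SD of called strikes, full games:  {sd_full:.2f}")
print(f"SD of called strikes, mixed games: {sd_mixed:.2f}")
print(f"ratio: {sd_ratio:.2f}")
```

A ratio above 1 is the amount by which a pitch-level calculation would understate the luck in a cell made up of a handful of whole games.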
Anyway, I'd argue that the significance level is definitely overstated due to the clustering of pitches within games. I just don't know by how much.
Now, suppose that, even after adjusting for day/night clustering and platoon clustering and runners-on-base clustering, we find there's still significance. That would be evidence of bias. We still wouldn't know how much bias, or among which races – as we saw above, there is an infinity of possible combinations – but we might conclude there's at least some bias.
But could it be anything other than racial bias?
The authors acknowledge that it could. Actually, I don't think they discuss this in the paper, but in a FAQ posted on Dr. Hamermesh's website, he says,
"[Perhaps] different race/ethnicity groups espouse differen[t] styles of play. Suppose for example, that youth baseball coaching is different in Latin America than elsewhere, and that Hispanic pitchers consequently develop pitching “styles” that differ from those of Black, Asian, or White pitchers. If Hispanic umpires and pitchers both espouse similar styles that differ from other races/ethnicities, then what appears as discrimination may simply reflect these stylistic differences."
I think it's quite possible that something like this is going on – but it might have nothing to do with race or ethnicity. Every umpire is different, so it's possible that the three hispanic umpires, or the five black umpires, might have different "styles" just by chance.
For instance, suppose some umpires, just by virtue of their own quirks, have trouble calling curve balls that just miss the low inside corner – they usually call them strikes. And suppose, that, by chance, three of the five black umpires (60%) have that quirk, but only (say) 20% of other umpires.
Also, pitchers are different, too. And there are only 27 black pitchers (as compared to 669 white pitchers). So it's also possible there are lots of black curveball pitchers, as compared to white curveball pitchers, just by chance.
This combination of black curveball pitchers and umpires who have oversized strike zones for curveballs could easily create the appearance of racial bias, when it's just random style interaction.
And, again, the regression algorithm doesn't consider the probability that this will happen – it assumes, by necessity, that curveball quirks are spread evenly among all umpires, and that curveballs are spread evenly among all pitchers. So, again, the regressions in the study overstate the significance of any discrepancies.
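To put a rough number on the umpire half of that example: if the curveball quirk shows up in 20% of umpires (the made-up rate from above) and each of the five black umpires is treated as an independent draw, the chance that three or more of them have it is a simple binomial tail. This is a sketch of the hypothetical, not an estimate from any data.

```python
import math

def prob_at_least(n, p, k_min):
    """P(at least k_min of n independent umpires have a trait with rate p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

QUIRK_RATE = 0.20     # hypothetical share of umpires with the quirk
BLACK_UMPIRES = 5

p_three_plus = prob_at_least(BLACK_UMPIRES, QUIRK_RATE, 3)
print(f"chance that 3+ of {BLACK_UMPIRES} share the quirk: {p_three_plus:.3f}")
```

That works out to a bit under 6%, which is not negligible for a single quirk, and there could be many such quirks.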
So, to summarize the findings so far:
-- the non-QuesTec finding of racial bias is significant at 2.2 SD (p=0.015);
-- but because of clustering of possible uncontrolled variables (like day/night), the actual number is certainly less than 2.2 SD;
-- and because of clustering of uncontrolled umpire and pitcher styles (like curve ball characteristics), the actual number is lower still;
-- I don't know how much to reduce the 2.2 SD by. I suspect not too much, but it might be enough to go below 2.0 and make the results "officially" not significant;
-- there is some suggestion that the differences are larger among minority pitchers than white pitchers;
-- and so probably only a small number of overall pitches are affected, many fewer than the 0.63% suggested by the regression coefficient.
Let my official wimpy conclusion be that for the non-QuesTec games, the data is suggestive, but probably not quite significant, and more research is called for.
And the one obvious question that hasn't been asked yet: if the effect is small, as it seems to be, couldn't it be just one or two umpires causing it? Why not just look at individual umpires and see if there's an outlier? It could be as few as, say, 25 pitches that move the result from significant to random. Since it's obvious that, even within the same ethnic group, different human beings have different propensities to conscious or unconscious racial bias, why does the study insist on looking only at global "same race" effects? Not only wouldn't they break that down into W/W, H/H, and B/B, but they also didn't break it down by umpire. That's kind of puzzling to me.
Some commenters, here and elsewhere, have suggested it's worth looking at Angel Hernandez, because he has a reputation for being an outlier in other aspects of his umpiring career. I don't want to get sued or anything, so let me emphasize that there is absolutely no data to support this theory, and I have no idea whether it's true or not. But if it does turn out that he, or any other umpire, is an outlier who shows evidence of bias, and removing him is enough to make the rest of the umpires look unbiased ... well, won't a lot of work have been wasted on this intricate study for nothing? Why not just actually take a look at the data the way a fan would before you start doing complicated regressions with hard-to-interpret results?
Part IV will talk about some of the study's other tests, which (thank God) are a lot simpler than the regression we've been talking about in these last three posts.