The Hamermesh umpire/race study revisited -- part VII
This is Part 7 of the discussion of the Hamermesh study. The previous six parts can be found here.
In comments to posts here and elsewhere, Tom Tango asked whether the "same-race" effect could be caused by one umpire – specifically, Angel Hernandez, a controversial umpire with a poor reputation. (Here's a fan page.)
So I checked. I started by trying to reproduce the "low attendance" finding, because that was where the original study found the largest effect.
(Technical note: To save time, I only approximated what the Hamermesh authors did. I didn't correct for count, score, home/road pitcher, batter, or umpire. And I selected games by actual attendance (less than 30,000) instead of the study's 70% of capacity. For umpires, I considered only Hernandez and Alfonso Marquez as hispanic, and CB Bucknor, Laz Diaz, Chuck Meriwether, and Kerwin Danley as black. The original study included one additional hispanic umpire and one additional black umpire, but I don’t know which ones those are. Also, for hispanic pitchers, I used only those born in one of the countries listed in the study; and, for black pitchers, I used only those in this list.
But I got similar results both in relative number of pitches seen, and in the effect of those pitches. So I figure it's close enough.)
Here's what I got for the equivalent of the "Table 2" matrix in the original study (which the authors didn't actually publish for this sub-sample):
Pitcher ------ White Hspnc Black
White Umpire-- 31.88 31.27 31.27
Hspnc Umpire-- 31.41 32.47 28.29
Black Umpire-- 31.22 31.21 32.52
All Umpires -- 31.83 31.30 31.32
It does seem obvious that there's a same-race effect here: all three of the numbers on the diagonal are the highest in their row and column. The "UPM" coefficient was 0.76, which is a bit over two standard errors, and so significant at p=.04. (The original study had 2.7 standard errors, which is more strongly significant. I think that's because they (properly) adjusted for count. But the differences won't affect most of the discussion here.)
As I mentioned before, one of the things the authors didn't do was show us how the individual umpires varied. Here are the two hispanic umpires and four black umpires:
Umpire ------- White Hspnc Black
Marquez ------ 30.86 31.24 27.78
Angel -------- 31.91 33.76 28.49
Danley ------- 31.86 29.66 26.42
M'wthr ------- 30.29 31.65 33.98
Diaz --------- 31.84 32.11 39.55
Bucknor ------ 31.11 31.03 31.25
Out of these six umpires, five of them (all but Danley) called more strikes for own-race pitchers than for white pitchers. But remember that we can only find umpire bias relative to other umpires. So this could also be evidence of *white* umpires being biased, in the other direction. Maybe hispanic and black pitchers are actually better than white, but the white umpires keep the minorities down.
In any case, these are based on small sample sizes. Typically, the above umpires saw about 1500 hispanic pitches and 110 black pitches each (although Bucknor saw 352 black pitches, strangely enough). So, taken individually, none of the umps showed statistically significant differences. I'll give you the Z-scores for the differences for same-race compared to white:
By the usual standard of 2 standard deviations, none of these six umpires shows individual significance. Diaz is the closest, at +1.94.
Having said that, I should admit that these z-scores are probably underestimates. As I wrote previously, ball and strike calls are somewhat mean-reverting, because (for instance) after a 3-0 ball, a pitcher is likely to throw a strike, and after an 0-2 strike, the next pitch is likely to be a ball. To get more accurate significance tests, you'd want to adjust for count, like the Hamermesh authors did.
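For readers who want to see roughly how an unadjusted z-score of this kind can be computed, here is a minimal sketch of a pooled two-proportion test. It is not necessarily the exact calculation behind the numbers in this post, and it does not adjust for count. The Diaz black-pitcher figures (39.55% on the 134 pitches mentioned later in the post, which works out to 53 called strikes) come from the text; the white-pitcher sample size of 5,000 is a made-up placeholder, since the post doesn't give it.

```python
import math

def two_proportion_z(strikes_a, pitches_a, strikes_b, pitches_b):
    """Z-score for the difference in called-strike rates (group A minus group B),
    using the pooled strike rate for the standard error."""
    rate_a = strikes_a / pitches_a
    rate_b = strikes_b / pitches_b
    pooled = (strikes_a + strikes_b) / (pitches_a + pitches_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pitches_a + 1 / pitches_b))
    return (rate_a - rate_b) / se

# Diaz with black pitchers: 53 strikes on 134 called pitches (39.55%).
# Diaz with white pitchers: 31.84% is from the table; the 5000-pitch
# sample size is HYPOTHETICAL, chosen only to make the example run.
z_diaz = two_proportion_z(53, 134, 1592, 5000)
print(round(z_diaz, 2))
```

With those assumed counts the result lands a little under 2, in the same neighborhood as the figure quoted for Diaz. Almost all of the uncertainty comes from the 134-pitch sample, so the answer is not very sensitive to the white-pitcher count.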
Here's how the black umpires called hispanic pitchers, and how the hispanic umpires called the black pitchers:
Nothing close to significance here, either.
In fact, if you look at *all* the umpires, not just the minorities, you find only three statistically-significant white/other comparisons: one umpire in black/white (in favor of white), and two umpires in hispanic/white (one each way). You'd expect about four or five significant results in each group (5% of 93 umpires), not one or two.
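As a rough check on the "four or five" figure, here is a short sketch of the arithmetic. It assumes each of the 93 umpires is an independent test with exactly a 5% chance of a false positive, which is a simplification (many of the umpires have very few pitches in some cells).

```python
from math import comb

n_umpires = 93   # umpires in the sample, from the post
alpha = 0.05     # chance of a "significant" result by luck alone

# Expected number of significant umpires if nobody is biased.
expected = n_umpires * alpha
print(expected)

def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# How unusual is it to see so few significant results?
prob_at_most_one = binom_cdf(1, n_umpires, alpha)
prob_at_most_two = binom_cdf(2, n_umpires, alpha)
print(prob_at_most_one, prob_at_most_two)
```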
Again, I think that's because I didn’t control for count. If I had, Diaz would probably have come out as significant, and maybe Angel Hernandez too.
Sticking to these numbers, though, none of the umpires is significant by himself. But if we take all six together, we do get significance at the .04 level: 2.01 standard errors from expected.
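To see how 2.01 standard errors maps to "the .04 level," here is a small sketch that converts a z-score to a two-sided p-value under a normal approximation. The function name is just for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

print(f"{two_sided_p(2.01):.3f}")  # 0.044
```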
The question now is, how should we interpret these results?
Let's start by asking, as Tango did, if Angel Hernandez might be responsible for the finding of significance. If Angel wasn't included, would the results still be significant?
To check, I replaced Angel, in the hispanic group, with Gary Darling, who was almost completely average in the hispanic/white comparison. And, yes, the results became non-significant:
With Angel ----- UPM = 0.76, p=0.04
With Darling --- UPM = 0.41, p=0.28
This is in my study. Going back to the original Hamermesh study, they found a UPM of 0.84. Extrapolating, we can assume that replacing Angel there would have reduced it to 0.45, which would be about 1.5 standard errors away. So replacing Angel with Darling would have eliminated significance even in the original paper.
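The extrapolation can be written out explicitly. This sketch assumes the original study's coefficient would shrink by the same proportion as mine did, and that its standard error would stay the same after the swap; both are assumptions for the sake of a back-of-the-envelope estimate, not something the original paper reports.

```python
# Figures from this post.
my_upm_with_angel = 0.76
my_upm_with_darling = 0.41
orig_upm = 0.84   # Hamermesh study's UPM coefficient
orig_z = 2.7      # standard errors from zero in the original study

# ASSUMPTION: the original coefficient shrinks proportionally.
shrink = my_upm_with_darling / my_upm_with_angel
orig_upm_without_angel = orig_upm * shrink

# ASSUMPTION: the original standard error is unchanged by the swap.
orig_se = orig_upm / orig_z
z_without_angel = orig_upm_without_angel / orig_se

print(round(orig_upm_without_angel, 2))  # 0.45
print(round(z_without_angel, 2))
```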
So it's true that without Angel, there would be no effect. But, in any finding of statistical significance, in any field, you can always find outlying datapoints to remove and eliminate the effect. That doesn’t mean there was no effect in the first place. And it doesn't mean that you can say the outliers "caused" the effect.
For instance, suppose there's a theory that russian roulette causes death. I check six participants, and one is dead. You could argue, "well, if you eliminate Bob from the study, russian roulette is risk free." But that wouldn't be a fair argument.
However, there is a slightly different argument in this case that *might* have some merit. You might say, "look, the finding of significance rejects unbiasedness. So it really only tells us that *at least one* umpire may be biased. Maybe the only umpire that's biased is Angel, and the others are innocent. After all, if you eliminate Angel from the study, the significance disappears."
That's a better argument. But: why Angel? Diaz had a higher z-score, and if you neutralize Diaz, you probably also lose the significance. And probably even Marquez. When you're so close to non-significance – in this case, 2.01 standard errors, when 2.00 is the threshold – neutralizing any of the same-race-positive umpires will drop you below significance.
Where this argument makes the most sense is when one umpire is so clearly out of normal range, so obvious an outlier, that he almost demands to be a special case. But that's not happening here. Of the 93 umpires (of which about 70 are full-time enough to be considered), Angel Hernandez is only the 6th most favorable to hispanic pitchers (the other five are white). Why single him out?
Laz Diaz, on the other hand, is the most favorable of all 93 umpires towards black pitchers. By a lot. He's at 1.93 SD, while the second place ump is only at 1.51. But, still, that's based on only 134 pitches, and I don't think it would be fair, on the basis of this evidence, to suggest that he's biased. Besides, *someone* has to rank highest out of 93. And there are five black umpires. So, it's 1 in 20. Is that really that much of a coincidence?
It is certainly *possible* that any one of the moderate-to-high Z-score guys (Diaz, Meriwether, Angel, Marquez) is the only biased umpire. But this data isn't enough to tell. What you actually could do for the black umpires (because there are so few black/black pitches) is just use QuesTec (or trained observers) to call the pitches, and see if they're accurate or not. That way, you get right to the point. Did Laz Diaz lead MLB in apparent bias simply because the pitchers legitimately threw more strikes? If you have access to video archives, you can just look at the pitches themselves.
If they find that Laz Diaz did indeed make the right call as often as expected, that means he saw more black strikes just out of random chance. You'd be able to eliminate his datapoint, and your finding of significance will fade away.
But if you find that he made the wrong call a bit too often with black pitchers on the mound, you've confirmed the statistical evidence with some observational evidence.
Here's one last way to look at it, which might be more intuitive. There were 93 umpires in the study. If you sort them by the difference between their strike calls against whites and hispanics, here's what you get.
The two Xs are the hispanic umpires, and they're over to the "favor hispanics" side of the line. (The hyphens each represent two non-hispanic umpires.)
Here's the same for the black umpires with the black pitchers:
Again, more black umpires on the "favors blacks" side of the line.
Very roughly speaking, the significance level of .04 means that if you were to scatter six Xs randomly onto these lines, the chance they would land that far to one side (or farther) would be 1 in 25. At the same time, you can see that if you take one of the two leftmost Xs (Diaz and Hernandez) and move it to the middle, it doesn't look that significant anymore. And, if you move *both* of them to the middle, suddenly it looks really, really close to perfectly unbiased.
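The "scatter the Xs randomly" idea can be sketched as a simple randomization test. Everything below is illustrative: the 93 scores are random placeholder numbers, not the real umpire data, and the function name and the one-sided comparison of means are my own choices for the sketch.

```python
import random

def randomization_p(values, marked, trials=20000, seed=0):
    """Share of random placements of len(marked) Xs whose average score
    is at least as far to the favorable side as the marked umpires' average."""
    rng = random.Random(seed)
    k = len(marked)
    observed = sum(values[i] for i in marked) / k
    hits = 0
    for _ in range(trials):
        sample = rng.sample(values, k)
        if sum(sample) / k >= observed:
            hits += 1
    return hits / trials

# PLACEHOLDER data: 93 made-up umpire scores, NOT the real ones.
data_rng = random.Random(1)
fake_scores = [data_rng.gauss(0, 1) for _ in range(93)]
ranked = sorted(range(93), key=lambda i: fake_scores[i], reverse=True)

# Two Xs at the far favorable end look very unlikely by chance...
p_extreme = randomization_p(fake_scores, ranked[:2])
# ...but move both Xs to the middle and nothing looks unusual.
p_middle = randomization_p(fake_scores, ranked[45:47])
print(p_extreme, p_middle)
```

The point of the sketch is only the mechanism: with so few Xs, shifting one or two of them moves the result a long way.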
Let me repeat that. If you take either Laz Diaz or Angel Hernandez out of the sample of umpires, and replace him with an average ump, every statistically significant effect in the original Hamermesh study becomes statistically insignificant. And if you replace *both* of those two umpires, the effect not only becomes insignificant, but almost completely disappears.
Make of that what you will.