Sabermetric Research: The Hamermesh umpire/race study revisited -- part V

(This is a continuation of an analysis of the Hamermesh "racial discrimination among umpires" paper. The other parts are the posts immediately preceding this one. If you're new to this, I'd recommend you go back and read the other parts, in order, or this won't make much sense to you.)

-----

In section V of the paper, the authors turn from counting pitches to examining other measures of game performance. If a same-race umpire increases the number of called strikes the pitcher receives, shouldn't that also lead to improved results in the win column, or in ERA?

The authors investigate starting pitchers "for the roughly 14,000 starting pitchers in the roughly 7,000 games in the three seasons in [the] sample."

The authors first look at white starters. They compare their performance when the umpire is white, to their performance when the umpire is a minority. The results, as read from Figure 4 of the paper:

-- 2.5% higher [Bill James] game scores [with a white umpire]
-- 1.0% more wins
-- 0.1% fewer strikeouts
-- 4.5% fewer home runs
-- 2.2% fewer hits
-- 5.2% fewer runs
-- 4.1% fewer walks
-- 5.8% lower ERA

The results are much better for a white pitcher when he faces a white umpire. And when a minority pitcher faces a same-race umpire, the results are also favorable:

-- 0.5% higher game scores [with a same-race umpire]
-- 11% (!) more wins
-- 7% more strikeouts
-- 14% (!!) fewer home runs
-- 4% fewer hits
-- 4.7% fewer runs
-- 1% fewer walks
-- 1% lower ERA

For both sets of pitchers, all the numbers go in the expected direction, except white/white strikeouts (which show a very small effect the other way). From this, the study concludes:

"For virtually every measure of pitcher performance, the impact of having a matched umpire benefits the pitcher ... many indirect outcomes, such as the number of home runs allowed by the pitcher, are also affected, suggesting that the umpire's behavior may alter the strategies of pitchers and batters."

Looking quickly at the raw data, you might agree with the authors. But if you look a bit more closely, you can see that's not necessarily so.

1. White Pitchers

First, although the authors talk about "14,000 starting pitchers" in the full sample, the numbers that make up the comparisons are much smaller. There are indeed a lot of starts (9,335 by my estimate) where a white pitcher faced a white umpire, but only about 899 starts where a white pitcher faced a non-white umpire. And so the standard error of the differences is pretty large.

Suppose the SD of earned runs per game is 3. Then the variance is 9. Assuming a starter pitches seven innings, the variance of earned runs per start becomes 7. So the variance over 9,335 games is 7/9335. That means the SD is .027. Converting that back to 9 innings gives .035.

Repeating for 899 starts gives an SD of .113.

Then, the standard error of the difference between the two sets of umpires, is the square root of the sums of squares, which is .118.

The observed difference in ERA was 5.8% fewer earned runs than expected, which is about a quarter of a run. 0.25 divided by 0.118 is a bit over 2 standard deviations, which is just over the threshold of significance.
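The back-of-envelope calculation above can be reproduced in a few lines. This is only a sketch of the shortcut described in the text: the function name `era_standard_error` is made up for illustration, and the SD of 3 runs and the seven-inning start are the post's own assumptions, not measured values.

```python
import math

def era_standard_error(starts, sd_runs_per_9=3.0, innings_per_start=7):
    """Standard error of ERA over `starts` starts, using the post's shortcut."""
    variance_per_9 = sd_runs_per_9 ** 2                           # 9
    variance_per_start = variance_per_9 * innings_per_start / 9   # 7 for a 7-inning start
    se_per_start = math.sqrt(variance_per_start / starts)         # SE of mean runs per start
    return se_per_start * 9 / innings_per_start                   # rescale to runs per 9 innings

se_white_ump = era_standard_error(9335)      # white pitcher, white umpire
se_minority_ump = era_standard_error(899)    # white pitcher, non-white umpire

# SE of the difference: square root of the sum of the squared SEs.
se_difference = math.hypot(se_white_ump, se_minority_ump)

observed_drop = 0.25                         # "about a quarter of a run", from the post
z = observed_drop / se_difference

print(f"SE, white umpire:    {se_white_ump:.3f}")
print(f"SE, minority umpire: {se_minority_ump:.3f}")
print(f"SE of difference:    {se_difference:.3f}")
print(f"z:                   {z:.2f}")
```

Almost all of the uncertainty comes from the 899-start group, so adding more white/white starts would barely tighten the comparison.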

However, there are several reasons that we should still suspect non-significance.

Reason 1: remember that white umpires call more strikes than minority umpires, regardless of who the pitcher is. The raw difference was about 0.2 percentage points. The authors don't tell us what it is after adjusting for batter, count, pitcher, catcher, inning and score, but, as I argued earlier, it should be higher after adjusting for count. Suppose it's 0.3 percentage points. That's about one pitch every four games. If turning a ball into a strike is worth about 0.12 runs (I have it at 0.14, while another study has it at 0.1), that should reduce ERA by about .03. That's enough to push the observed result into statistical non-significance (although just barely).
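As a rough check on Reason 1, the adjustment can be worked through numerically. The figure of 80 called pitches per start is an assumption of this sketch (the post does not give one); it is chosen because it roughly reproduces the "one pitch every four games" figure. The 0.25-run difference and the 0.118 standard error are taken from the calculation above.

```python
# ASSUMPTION: called pitches (balls plus called strikes) per start; not stated in the post.
called_pitches_per_start = 80

extra_strike_rate = 0.003        # 0.3 percentage points, the post's guess
runs_per_flipped_call = 0.12     # post's midpoint between 0.10 and 0.14

extra_strikes_per_start = called_pitches_per_start * extra_strike_rate
era_shift = extra_strikes_per_start * runs_per_flipped_call

observed_drop = 0.25             # runs, from the post
se_difference = 0.118            # from the post's standard-error calculation

adjusted_z = (observed_drop - era_shift) / se_difference

print(f"extra strikes per start: {extra_strikes_per_start:.2f}")
print(f"ERA shift from umpire tendency: {era_shift:.3f}")
print(f"z after adjustment: {adjusted_z:.2f}")
```

With these inputs the adjusted ratio falls a little below the conventional 1.96 cutoff, which matches the "just barely" in the text. A different pitch-count assumption would move it slightly.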

Reason 2: the observed difference makes no sense in light of the probable number of pitches affected. Overall, before any adjustments, the UPM coefficient was 0.27 percentage points, which could mean 0.27 points for each of the three races. Now, to be the most conservative, suppose the entire effect is white/white bias. That might mean a W/W coefficient of 0.30 or so (there are so many more W/W than others that the coefficient wouldn't rise much).

Take that 0.30, add another 0.30 for the fact that white umpires call more strikes, and you get 0.6 pitches per game. That's about .07 runs per game, which is less than a 2% drop in ERA. The observed drop in ERA was about 5.8%.

The huge difference between the effect in pitches, and the effect in runs, suggests that you're looking at a substantial amount of luck.
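The mismatch in Reason 2 can be made concrete with a short sketch. The baseline ERA is not stated in the post; here it is backed out from the post's own numbers (a 5.8% drop being "about a quarter of a run"), so treat it as an implied figure and not a measured one.

```python
extra_strikes_per_game = 0.30 + 0.30   # W/W coefficient plus white-umpire tendency, per the post
runs_per_flipped_call = 0.12           # from Reason 1

observed_drop_runs = 0.25              # "about a quarter of a run"
observed_drop_pct = 0.058              # 5.8% lower ERA, from Figure 4

# ASSUMPTION: baseline ERA implied by the two figures above (roughly 4.3).
baseline_era = observed_drop_runs / observed_drop_pct

expected_drop_runs = extra_strikes_per_game * runs_per_flipped_call
expected_drop_pct = expected_drop_runs / baseline_era

print(f"expected drop: {expected_drop_runs:.3f} runs ({expected_drop_pct:.1%})")
print(f"observed drop: {observed_drop_runs:.3f} runs ({observed_drop_pct:.1%})")
print(f"observed / expected: {observed_drop_pct / expected_drop_pct:.1f}x")
```

Under these inputs the observed ERA drop is more than three times what the affected pitches could account for.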

Reason 3: Given that the authors found that most of the racial bias seems to occur in situations of lesser importance, you would think that the cost of a single biased strike call would be *less* than 0.12. That makes the large ERA drop even harder to explain.

Reason 4: The observations of the separate statistics aren't consistent with each other either. A 5.2% reduction in runs should lead to a 5.2% increase in wins. But the increase in wins was only 1%. Again, that suggests luck.
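The post does not say how it converts runs to wins. One common rule of thumb, assumed here and not taken from the post or the paper, is the Pythagorean expectation with an exponent of 2. The sketch below shows that, for an average team, a 5.2% cut in runs allowed translates to roughly a 5% gain in wins under that assumption.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expectation; ASSUMPTION: exponent 2, the classic rule of thumb."""
    rs = runs_scored ** exponent
    return rs / (rs + runs_allowed ** exponent)

# Average team: scores and allows the same number of runs (normalized to 1.0).
baseline = pythagorean_win_pct(1.0, 1.0)
improved = pythagorean_win_pct(1.0, 1.0 - 0.052)   # 5.2% fewer runs allowed

relative_gain = improved / baseline - 1

print(f"baseline win pct: {baseline:.3f}")
print(f"improved win pct: {improved:.3f}")
print(f"relative gain in wins: {relative_gain:.1%}")
```

Starter wins are not the same thing as team wins, so this is only a consistency check, but it supports the point that a 1% gain in wins looks too small next to a 5.2% cut in runs.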

Also, the basic stat most affected was home runs (4.5% decrease). If the improved performance was the result of more called strikes, wouldn't you expect the largest effect to show up in strikeouts and walks? But Ks were actually *lower* than expected, not higher. True, BBs were down about 4.1%, but it's hard to understand why extra called strikes would cut walks without also raising strikeouts. Again, that suggests luck at work.

The study's authors seem to believe that the decrease in home runs is indeed due to umpire bias. They argue that pitchers, (subconsciously?) aware of the discrimination they face, may alter their strategies to (for instance) throw more hittable pitches. However, ignoring the bias (and not changing strategy) would cost them at most a 2% increase in ERA. The theory that the threat of a 2% increase makes pitchers alter their behavior to cause a 5% increase seems kind of implausible.

2. Minority Pitchers

Because there are so few non-white pitchers and umpires, there were only about 114 times in the sample where a minority starting pitcher faced a same-race umpire. So all the results in Figure 4 are decidedly non-significant.

For instance, there was an 11% increase in wins. Assuming that starters in that 114-start sample would have gone 40-40, they actually went 44-36. This, I think, is about 1 standard deviation. You do have to adjust for the fact that minority umpires call fewer strikes than white umpires, but that would be another 1% effect, which is less than another half win.
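The "about 1 standard deviation" estimate can be checked with a simple binomial model. This sketch assumes each of the 80 decisions is an independent coin flip at .500, which is a simplification and not something the post states.

```python
import math

decisions = 80          # 40-40 expected record, from the post
actual_wins = 44        # 44-36 actual record, from the post
p_win = 0.5             # ASSUMPTION: each decision is a fair coin flip

expected_wins = decisions * p_win
sd_wins = math.sqrt(decisions * p_win * (1 - p_win))   # binomial standard deviation

z = (actual_wins - expected_wins) / sd_wins

print(f"SD of wins over {decisions} decisions: {sd_wins:.2f}")
print(f"z: {z:.2f}")
```

Four extra wins is a bit under one standard deviation, which is well within what chance alone would produce.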

And, again, there's a mismatch among the stats. Wins increase 11%, but runs decrease only 5%.

However, one thing in favor of the study's hypothesis, in this case, is that, because the same-minority-race sample is so small, it is possible (as we saw) that if all the bias is concentrated in the B/B and H/H cases, there could be a LOT of bias there – certainly enough to cause these results. But there's not enough data here to distinguish luck from bias. And so you certainly shouldn't conclude anything just based on the idea that it's *possible* that bias is the cause.

So when the authors write ...

"... [the] bias is strong enough to affect pitchers' measured performance and games' outcomes."

... I don't think they're correct. I bet that had the authors done significance tests on this data, and considered the sabermetric inconsistencies among the categories, they wouldn't have come to this conclusion.

------

Next, the study examines the effect of race matches on home-team results. When neither or both teams have a starter/umpire race match, the home team wins 53.8% of games. When only the home team has a match, it wins 55.6% of games. But when only the visiting team has a match, the home team again wins 53.8% of games:

53.8% when neither or both teams have a same-race pitcher
53.8% when only the visiting team has a same-race pitcher
55.6% when only the home team has a same-race pitcher

The authors write,

"These differences ... suggest that there is an asymmetry in the impact of racial/ethnic matching: Matches are more important between the umpire and the home team's pitcher than between the umpire and the visiting team's pitcher."

In my opinion, the differences suggest no such thing, because the sample size is so small. The difference is 1.8% out of two groups of about 1350 games each – 24 games total out of 2700. That's less than one standard error.
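The "less than one standard error" claim can be verified with the usual formula for the difference between two independent proportions. The group sizes of 1,350 are the post's approximation; the variable names are just for illustration.

```python
import math

n_home_match = 1350        # games where only the home starter matches the umpire (approx.)
n_visitor_match = 1350     # games where only the visiting starter matches (approx.)
p_home_match = 0.556       # home win pct, home-only match
p_visitor_match = 0.538    # home win pct, visitor-only match

difference = p_home_match - p_visitor_match

# Standard error of the difference between two independent proportions.
se_difference = math.sqrt(
    p_home_match * (1 - p_home_match) / n_home_match
    + p_visitor_match * (1 - p_visitor_match) / n_visitor_match
)

z = difference / se_difference
extra_home_wins = difference * n_home_match

print(f"difference: {difference:.3f} (about {extra_home_wins:.0f} games)")
print(f"standard error: {se_difference:.4f}")
print(f"z: {z:.2f}")
```

The gap between the two groups is a little under one standard error, so it is easily explained by sampling noise.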

The authors break the numbers down by race of umpire, but with only 11 B/B and 36 H/H games, respectively, the results are certainly not meaningful.

-----

So, I think the authors' attempt to establish game-level differences by race fails: first, because of lack of statistical significance, and, second, because of the implausible relationships among the affected statistics.

-----


7 Comments:

At Thursday, April 17, 2008 2:37:00 PM, Anonymous said...: Phil:
What strikes me repeatedly, going through your analyses, is how very tiny these ostensible race differences are. Even those differences that inch past the threshold of statistical significance -- and you've shown that many don't really do that -- have virtually no baseball reference. To me, the lesson for researchers is that if you see results like these in the original Table 2 Matrix:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.80 30.77
Black Umpire-- 31.93 30.87 30.76

then DON'T do the study! If you can't see bias without elaborate statistical controls, there's a very good chance there isn't much bias to find.

Going back to your post 3, I think you were perhaps too quick to accept the conclusion of attendance-related (or Questec-related) bias. The fact that these sub-samples are not randomly selected is very important. In the high attendance games, for example, you're looking at a subset of generally good teams. Not only are there park effects to consider, but any number of other possible hidden biases. Perhaps the Hispanic pitchers on these teams are disproportionately right-handed, giving them a more frequent platoon advantage. Or maybe they just happen to be GB pitchers, and the 3 Hispanic umps tend to call low strikes. Who knows? But since so much of the alleged bias seems rooted in the Hispanic/Hispanic matchups, any further reductions of your sample -- especially in non-random ways -- really cast serious doubt on the meaning of any differences that do appear.

I think it's certainly possible that Questec causes subtle changes in how umps call strikes. And that change -- calling the high strike, let's say -- will advantage some pitchers and disadvantage others. And given small samples, that could easily appear to have a racial dimension. But it's more likely just a change in the ump's zone -- applied to pitchers of all races.
At Thursday, April 17, 2008 4:32:00 PM, Phil Birnbaum said...: Hi, Guy,

Generally, I agree with you.

"The fact that these sub-samples are not randomly selected is very important."

I agree with you, to the extent that the non-random selection affects the significance level. I'm surprised that the authors didn't correct for that, or even acknowledge it. I recall a sampling course I took many years ago where they gave us different formulas depending on whether your sample was random (pick 1000 people randomly from the US), cluster (pick 5 states randomly, then 200 from each of those five states), or stratified (pick 20 people from each state). The authors' sample is definitely cluster: they are likely to have many samples from the same umpire/pitcher pair, which makes the significance levels much lower.

I do want to stay away from criticisms like "you didn't control for everything, and therefore I reject your conclusions." That one isn't valid.

But yours are valid. And I wish I could figure out what the "correct" significance levels should be. But I don't have that expertise.

"But it's more likely just a change in the ump's zone -- applied to pitchers of all races."

But you still need an explanation of why same-race is affected more than different-race. For instance, as you say, maybe white umpires are more likely to call the high strike, and white pitchers are more likely to throw it. The authors mention some of that in their FAQ (but, unfortunately, not the paper). And I agree with you in that I think it's likely that that's what's going on. But "I think it's likely" isn't good enough, so the bottom line is that we have to admit there might be something there.

Now, if I was arguing in court, the evidence of racism is absolutely not convincing beyond a reasonable doubt. Not at all. But the fact is that the authors found something that looks to be somewhat significant.

So there are two issues:

1. Is there any statistically-significant evidence consistent with racial bias? Yes, just barely.

2. Is that enough to conclude that there is racial bias? Absolutely not.
At Monday, April 21, 2008 12:12:00 PM, Don Coffin said...: I said this before, and I'll say it again.

This is a magnificent series of posts. It raises real issues in a way that illuminates them. It points out both the strengths and weaknesses of the original study. I think, Phil, that you have added a lot to our understanding of the issue.

And what I think we know, at this point, is that there might be an effect, but it's small, probably small enough that, while it may be statistically significant, it's not practically significant.
At Monday, April 21, 2008 8:31:00 PM, Anonymous said...: Yes, Phil, very fine work. I hope you'll post your final SABR presentation here.

And now that you've done all this hard thinking about Hamermesh, maybe you should revisit the Price/Wolfers paper on NBA ref discrimination. As I recall, it's a stronger analysis overall, but has some of the same problems.

I also notice that Wolfers has followed it up with a study on how ref's racial bias interacts with betting spreads, concluding that bettors could make money by betting on teams whose racial composition best matches the ref quad for that game: http://bpp.wharton.upenn.edu/jwolfers/Papers/NBABetting.pdf. Seems right up your alley....
At Monday, April 21, 2008 9:01:00 PM, Phil Birnbaum said...: Thanks, Doc, appreciate the support. And thanks, Guy.

The hard part now is trying to fit all this into 20 minutes, while keeping it interesting. Gotta work on that.

Thanks for the link to the new Wolfers paper; I wasn't aware of it. Will take a look!
At Friday, April 25, 2008 11:32:00 AM, Tangotiger said...: Willie Randolph was on the radio yesterday openly talking about Angel Hernandez. I've never heard a manager publicly say anything bad about an umpire the day after an incident. At least, not in the recent past. I was shocked that Willie would do so.

Can you confirm that Angel is not influencing anything being reported here, that what we see here is not hispanic-bias, but Angel-bias?
At Friday, April 25, 2008 3:56:00 PM, Phil Birnbaum said...: I'll try to work in a study before the SABR convention in June ... I might be able to run some Hernandez numbers sooner, but I don't have a breakdown of pitchers by race.

But I'll see what I can do.


Thursday, April 17, 2008
