Sabermetric Research: The Hamermesh study on umpires and race

Tuesday, August 14, 2007

The Hamermesh study on umpires and race

Yesterday, an article in Time Magazine discussed a new study about biased umpiring. Here’s the study. It’s called “Strike Three: Umpires’ Demand for Discrimination,” by Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh.

The paper is quite similar to the Price/Wolfers paper on basketball referees (which I earlier reviewed in three parts). However, the data are much less convincing, and there are seeming conflicts in the results that I don’t understand.

The authors divide umpires and pitchers into four racial (or ethnic) groups: white, black, Hispanic, and Asian. (Hispanic is defined as any player born in one of several Spanish-speaking countries, regardless of skin color.) For three full seasons of play-by-play data, they counted balls and called strikes for each combination of pitcher and umpire. They conclude that pitchers have an advantage when facing an umpire of their same group. White umpires seem to favor white pitchers (over black, Hispanic and Asian), black umpires favor black pitchers, and so on. That is, umpires discriminate in favor of their own kind.

---

To show what they did, I’ll simplify things by ignoring the Hispanic and Asian groups (there are no Asian umpires anyway), and just show the data for white and black. Here’s the authors’ summary of the results (taken from their Table 2):

| | White Pitcher | Black Pitcher | Difference |
|---|---|---|---|
| White Umpire | .3206 (741,729) | .3061 (25,108) | .0145 |
| Black Umpire | .3193 (46,825) | .3076 (1,765) | .0117 |

The numbers in the table are the proportion of called pitches that were strikes; the number of called pitches follows in parentheses.

You can see that white pitchers got more strike calls than black pitchers, regardless of who the umpire was; the overall numbers were 32.05% strikes for the white pitchers, and 30.62% for the black pitchers. We can reasonably conclude that white pitchers are actually more skilled in this regard. However, the size of the black pitchers’ disadvantage depends on the umpire. White umpires gave black pitchers a .0145 disadvantage relative to white pitchers, while black umpires cut that disadvantage to .0117.

This indeed seems to show that umpires favor their own group. But what would the chart look like if there were no bias at all? There are many ways to equalize the two groups of umpires. The easiest is to subtract .0028 from the Black/Black cell, in order to widen the .0117 to .0145. The chart would then look like this:

| | White Pitcher | Black Pitcher | Difference |
|---|---|---|---|
| White Umpire | .3206 | .3061 | .0145 |
| Black Umpire | .3193 | ~~.3076~~ .3048 | ~~.0117~~ .0145 |

What’s the real difference between the two cells? In the real chart, black umpires called 543 strikes out of 1,765 pitches. To get the bottom chart, black umpires would have had to call only 538 strikes out of 1,765 pitches. The difference: five pitches.

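Over more than 7,000 ballgames across three seasons, the two groups of umpires are five pitches away from showing absolutely no racial bias. That’s nowhere near statistically significant: the binomial SD on 1,765 called pitches is about 19, so five pitches is roughly a quarter of a standard deviation.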
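The five-pitch figure can be checked with a few lines of Python. This is only a minimal sketch built from the rates and counts in the table above; the function names are made up for illustration, and the binomial SD assumes each called pitch is an independent trial, which is a simplification.

```python
import math

# Called-strike rates from the 2x2 table above (the paper's Table 2).
WHITE_UMP_WHITE_P = 0.3206
WHITE_UMP_BLACK_P = 0.3061
BLACK_UMP_WHITE_P = 0.3193
BLACK_UMP_BLACK_P = 0.3076
BLACK_BLACK_PITCHES = 1765  # called pitches in the black/black cell


def pitches_to_remove_gap():
    """Strike calls that would have to change in the black/black cell
    for black umpires to show the same gap as white umpires."""
    white_ump_gap = WHITE_UMP_WHITE_P - WHITE_UMP_BLACK_P  # .0145
    unbiased_rate = BLACK_UMP_WHITE_P - white_ump_gap      # .3048
    actual_strikes = BLACK_UMP_BLACK_P * BLACK_BLACK_PITCHES
    unbiased_strikes = unbiased_rate * BLACK_BLACK_PITCHES
    return actual_strikes - unbiased_strikes


def binomial_sd(n, p):
    """Standard deviation of a binomial count with n trials, probability p."""
    return math.sqrt(n * p * (1 - p))


if __name__ == "__main__":
    diff = pitches_to_remove_gap()
    sd = binomial_sd(BLACK_BLACK_PITCHES, BLACK_UMP_BLACK_P)
    print(f"pitches needed: {diff:.1f}")
    print(f"binomial SD:    {sd:.1f}")
    print(f"in SD units:    {diff / sd:.2f}")
```

The same two functions would work for the other pairings, given the corresponding cells from the paper's table.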

If you repeat this analysis, this time comparing white and Hispanic pitchers and umpires, the difference is 35 pitches out of 7,323. That’s a somewhat higher proportion, about half a percent, but still not statistically significant (the SD, using the binomial distribution, is 39).

Finally, if you compare Hispanic and black, the difference is less than two pitches.

Why are the discrepancies so small over such a large sample of data? It’s because of the small samples in the Hispanic/Hispanic and black/black cells. There are about three times as many white pitchers as Hispanic, and 30 times as many white pitchers as black. There are also few black umpires, and even fewer Hispanic umpires. The result: there are, literally, 420 times as many white/white datapoints as black/black datapoints.

Given all this, I’m not sure how the authors manage to come up with statistically significant (and baseball significant) findings of bias. I’ll return to that in a bit.

---

When the umpire and pitcher are of the same race, the study calls it “UPM” for “umpire/pitcher match”. In the black/white pairings, there are 743,494 UPM pitches, but only 71,933 non-UPM pitches. More than 99% of UPM pitches involve a white pitcher, while only 65% of non-UPM pitches do. And so, since white pitchers have higher strike frequencies than black pitchers, the UPMs also have higher strike frequencies than non-UPMs:

.3206 UPMs
.3147 non-UPMs

This has nothing to do with racial bias – it’s just a consequence of how the numbers work out. White pitchers throw more strikes than black pitchers, and the UPM group is dominated by the white/white group. So even if umpires were unbiased – or even moderately biased in favor of blacks – we’d still see this effect.
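To see that this is a composition effect rather than bias, here is a rough Python sketch that pools the four cells of the 2x2 table into UPM and non-UPM groups. It then repeats the pooling on the hypothetical "no bias" table from earlier in the post, where the black/black rate is lowered to .3048. The dictionary layout and function name are my own choices for illustration.

```python
# (umpire, pitcher) -> (called-strike rate, called pitches), from the 2x2 table.
ACTUAL = {
    ("white", "white"): (0.3206, 741729),
    ("white", "black"): (0.3061, 25108),
    ("black", "white"): (0.3193, 46825),
    ("black", "black"): (0.3076, 1765),
}

# Hypothetical no-bias table: black/black rate lowered to .3048 so that
# both groups of umpires show the same .0145 gap.
NO_BIAS = dict(ACTUAL)
NO_BIAS[("black", "black")] = (0.3048, 1765)


def pooled_rates(table):
    """Return (UPM strike rate, non-UPM strike rate) for a 2x2 table."""
    strikes = {True: 0.0, False: 0.0}
    pitches = {True: 0, False: 0}
    for (umpire, pitcher), (rate, n) in table.items():
        match = umpire == pitcher
        strikes[match] += rate * n
        pitches[match] += n
    return strikes[True] / pitches[True], strikes[False] / pitches[False]


if __name__ == "__main__":
    for name, table in (("actual", ACTUAL), ("no bias", NO_BIAS)):
        upm, non_upm = pooled_rates(table)
        print(f"{name:8s} UPM={upm:.4f} non-UPM={non_upm:.4f} gap={upm - non_upm:.4f}")
```

Even with the bias removed by construction, the pooled UPM rate stays well above the non-UPM rate, because the UPM group is almost entirely white pitchers.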

I mention this because it completely explains two of the columns in the study’s Table 3. In columns “1d” and “2d,” the authors run a regression that includes UPM. The table shows, not surprisingly, that the coefficient for UPM is hugely significant. Again, that’s just because the white pitchers throw more strikes than the black pitchers.

That problem goes away if you include a race variable in the regression, such as pitcher race. In that case, you wind up adjusting for the fact that white pitchers are strikier than black, and the coefficient for UPM starts to make more sense.

That’s what the study does in the other columns of Table 3 – notably, column “3d,” which, I think, is the major result of the paper. There, the authors run a regression that includes UPM, but also a bunch of control variables. They have an indicator variable for each of the 12 possible counts (from 0-0 to 3-2), and an indicator variable for each of the 900 or so pitchers in the study. The inclusion of the pitcher variables means that when a white umpire calls a white pitcher, the encounter is adjusted to the proportion of strikes that pitcher is expected to throw. So the fact that whites throw more strikes than blacks should be factored out.

What’s the result? Statistical significance at about the 2.5% level (almost exactly 2 standard errors). The estimate of the UPM factor is 0.00341. Since the overall probability of a strike is somewhere around 0.30, the increase is about 1%. That’s fairly significant in a baseball sense. 1% is about half a pitch a start, perhaps (remember, swinging strikes aren't included). My research shows that changing a ball into a strike (or vice-versa) is worth about .14 of a run; a study in Baseball Prospectus estimated it at .10 runs. Either way, taking a hit of .05-.07 on your ERA, just by facing the “wrong” race umpire, is fairly significant.
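The arithmetic in that paragraph can be sketched in Python. The 0.00341 coefficient, the 0.30 baseline, and the .10 and .14 run values are the ones quoted above. The number of called pitches per start is left as a parameter because it is an assumption, and the answer is quite sensitive to it (the 70 used below is the figure a commenter suggests further down). The function names are hypothetical.

```python
UPM_EFFECT = 0.00341        # regression estimate quoted above
BASE_STRIKE_RATE = 0.30     # rough overall called-strike probability
RUN_VALUES = (0.10, 0.14)   # runs per ball-to-strike flip (two estimates)


def relative_increase(effect=UPM_EFFECT, base=BASE_STRIKE_RATE):
    """UPM effect as a fraction of the baseline strike probability."""
    return effect / base


def extra_strikes_per_start(called_pitches, effect=UPM_EFFECT):
    """Expected balls turned into strikes, given called pitches per start."""
    return called_pitches * effect


def run_cost(extra_strikes, run_values=RUN_VALUES):
    """Run impact per start under each run-value estimate."""
    return tuple(extra_strikes * value for value in run_values)


if __name__ == "__main__":
    print(f"relative increase: {relative_increase():.2%}")
    # Half a pitch per start is the figure used in the paragraph above.
    print("half a pitch per start:", run_cost(0.5))
    # 70 called pitches per start is an assumption, not a measured value.
    flipped = extra_strikes_per_start(70)
    print(f"70 called pitches: {flipped:.2f} flips, runs {run_cost(flipped)}")
```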

But, to be honest, I don’t understand where this number could have come from, based on the raw data. As I showed in the 2x2 table above, the discrepancy between black and white is 5 pitches in 1,765, which is about 1/3 of 1%. For white/Hispanic, it’s about half a percent. For Hispanic/black, it’s about an eighth of a percent. None of these is statistically significant.

So, how is it that Table 3 can combine all these and get something higher than even the largest of the 2x2 discrepancies, and so much more statistically significant? The only thing I can think of is the control variables for the count and specific pitcher. Even though the raw data don’t show much discrimination, it could be that when you look closer, the white umpires, by random chance, faced better black pitchers than average, so they should have called even more strikes than average (but didn’t). Still, that seems unlikely, given the size of the sample.

Anyone have an idea what’s going on? Is there something wrong with my 2x2 analysis?

---

In Table 4, the regressions include a variable for whether or not the game was pitched in a QuesTec park. In those parks, the umpires are “graded” against an electronic observation on whether or not a pitch should have been called a strike. The idea is that when umpires are being observed, they should discriminate less, because they have an incentive to be accurate instead of biased. That’s where the title of the study, "Umpires’ Demand for Discrimination," comes from. The implication is that umpires (perhaps unconsciously) like to discriminate, but will “buy” less discrimination when the price goes up (QuesTec).

The results show there's very much an effect. The UPM factor was much smaller when QuesTec was in operation. Indeed, the umps appear to have *overcompensated.* Unmonitored, umpires were 2% more likely to call a strike for a same-race pitcher. Under QuesTec, they were two-thirds of a percent *less* likely (although that number is not significantly different from zero).

While I’m convinced that umpires call pitches differently under QuesTec, I’m still uncomfortable about the discrimination estimates.

---

The authors then proceed to Table 5, where they add control variables for attendance and “terminal count” (any count with three balls or two strikes, where this pitch is more likely to make the at-bat "terminate").

I don’t think we can learn anything from the attendance analysis, because the sample for these new variables is not random. Parks with low attendance probably have worse pitchers. The study did control for that, but what it didn’t control for was that parks with low attendance have worse pitchers *playing at home*. This would tend to cause the worse pitchers to play slightly better, and the better pitchers (who are on the road) to play slightly worse.

For what it’s worth, the authors found that high attendance almost completely cancels out the racial bias (a small amount is left). But I’m not sure whether that would still hold if you re-ran the study, correcting the pitcher effects for home/road. Normally I would think it's not that important, but this study does look for very small effects.

As for the “terminal count” study, different pitchers will have different types of terminal counts. Good pitchers will have a lot of 0-2s, and (semi-deliberately) throw a lot of balls. Bad pitchers would have a lot of 3-0s, and (semi-deliberately) throw a lot of fastball strikes. I think that would render the results of the study unreliable.

But, again, for what it’s worth, on “terminal counts,” the bias is completely reversed: umpires judge own-race pitchers more harshly. The authors suggest it's because the ump knows his judgement on this pitch will be watched more closely.

---

Table 6 attempts to relate the race bias to team winning percentage -- how many wins is the bias worth? (The authors seem unaware of the previous work on the run value of balls and strikes.) I’m not sure how to interpret the “probit” coefficients in the table (are they just percentage points?), but the authors say that the probability of winning increases by "over 4.2 percentage points for a home team if its starting pitcher matches the umpire’s race/ethnicity." That seems way too big: it would improve the chance of winning from .540 to .582. But if two pitches per game are affected, that’s .28 runs, which is almost .030 in winning percentage (at 10 runs per win).

Still, two pitches per game is a lot. Overall, how many pitches in a typical game are borderline? Most calls are pretty clear. Suppose there are, say, 20 pitches a game that can go either way. Suppose 10 go to the pitcher and 10 go to the hitter. Now, suppose there are two calls that go the "wrong" way because of race. That brings the 10 up to 12. That means that umpires are so biased that 17% of close pitches are decided by race!

My gut says that's pretty unlikely.
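For anyone who wants to play with these numbers, here is a small Python sketch of the conversions used above. The .14 runs per pitch, the 10 runs per win, and the 20 borderline pitches split 10/10 all come from the text; the function names are made up, and the whole thing is back-of-the-envelope rather than a model of anything.

```python
RUNS_PER_PITCH = 0.14   # run value of turning a ball into a strike
RUNS_PER_WIN = 10       # rule-of-thumb conversion used above


def win_pct_gain(pitches_flipped, runs_per_pitch=RUNS_PER_PITCH,
                 runs_per_win=RUNS_PER_WIN):
    """Winning-percentage gain from a number of flipped calls per game."""
    return pitches_flipped * runs_per_pitch / runs_per_win


def pitches_needed(gain, runs_per_pitch=RUNS_PER_PITCH,
                   runs_per_win=RUNS_PER_WIN):
    """Flipped calls per game needed to produce a given win-pct gain."""
    return gain * runs_per_win / runs_per_pitch


def share_decided_by_race(borderline=20, flipped=2):
    """Share of the pitcher's favorable close calls that come from the flips,
    assuming borderline pitches would otherwise split evenly."""
    favorable = borderline / 2 + flipped
    return flipped / favorable


if __name__ == "__main__":
    print(f"2 pitches per game -> {win_pct_gain(2):.3f} in winning percentage")
    print(f"4.2 points needs   -> {pitches_needed(0.042):.1f} pitches per game")
    print(f"share of close calls: {share_decided_by_race():.0%}")
```

By this arithmetic, the authors' 4.2 percentage points would take about three flipped calls per game at .14 runs each, and more than four at the .10 estimate.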

---

A couple more quick comments:

-- In a section on robustness checks, the authors find that the race of the batter doesn’t seem to matter – only that of the pitcher. They hypothesize that the umpire is reluctant to show bias against the batter because he stands so close. The batter can confront the umpire, while the pitcher can’t. Doesn’t seem too unreasonable to me. Also, there is “weak evidence” that the more-experienced umps show less bias than the less-experienced ones; and, there is *no* observed bias among the crew chiefs.

-- To their credit, the authors suggest explanations for all the observed effects that don’t involve racial bias, although those only appear in their FAQ. For instance, they suggest that perhaps Hispanic umpires and pitchers have distinctive “styles,” and so both will implicitly understand a certain kind of pitch to be a strike. Other umpires may not call it a strike, and other pitchers won’t throw it, and so the Hispanic/Hispanic strike count is inflated.

-- I think all the standard errors given in the study are too small, which means the significance is overstated. That’s because the analysis the authors do assumes that the umpires are seeing a random sample of pitches. They aren’t – they see pitchers in bunches. If umpire X has 30 games a year behind the plate, he’s going to see fewer than half the starters in the league. So it’s kind of a cluster sample instead of a random sample. Again, I don’t know how much of a difference this makes, whether the standard errors are a little too small or a lot too small, but I wish the study had tried to estimate the effect.

---

My bottom line is that I’m a bit confused by this study. The raw numbers seem as close to non-biased as you can get, but the regressions seem to show a significant effect. Until the discrepancy is explained to me, I have to go with the 2x2 data, because that’s what I understand. And that says no discrimination.

One explanation might be that the specific technique the authors used, despite using “pitcher fixed effects,” didn’t fully adjust for the fact that white pitchers throw more called strikes than black pitchers. If that’s the case, it would explain most of the positives. It wouldn’t explain why QuesTec completely reversed them into negatives, though. At any rate, these guys are professionals, and I can't think of anything wrong that might have caused this to happen.

And almost all the results go the expected way -- the umpires improve under QuesTec, they appear to improve when the pitch is more important, they improve when the crowds are large, and the experienced umpires show less bias. Again, this would be a big coincidence if there weren't real bias going on.

So am I missing something? Can anyone help resolve the contradiction?

Labels: baseball, economics, race, umpires

11 Comments:

At Tuesday, August 14, 2007 1:09:00 PM, Anonymous said...: Phil:
The column headers in your first table are reversed.

Like you, I'm inclined to trust the 2x2 data in the authors' Table 2. At least, if there isn't evidence of bias at that level, the heavy burden is on them to show it exists after controlling for the appropriate factors. And the unadjusted data seems to say that the racial bias, if it exists at all, is limited to Asian and Hisp pitchers. White and Black pitchers get the same called strike rate by umpires of all races. Asian pitchers get lower rate from both Black and Hisp umps (or, less plausibly, a positive bias from white umps). And Hispanic pitchers may get a slight positive bias from Hisp umps, and slight negative bias from Hisp umps.

As you rightly point out, the overall difference between same-race and mixed-race matchups of 0.6% can be easily explained by the racial makeup of the two samples: 98.8% of same-race pitches are thrown by white pitchers, vs. 18.7% for the mixed-race group (also, the umps are more likely to be white -- 98.8 vs. 76.7). As it happens, white pitchers and white umps both yield a higher strike % overall.

What I think is misleading is this statement by the authors, quoted in Time: "The highest percentage of called strikes occurs when both umpire and pitcher are White, while the lowest percentage is when a White umpire is judging a Black pitcher." This is almost entirely a function of white pitchers throwing more strikes in general. More relevant, surely, is that the strike% for black pitchers (and white pitchers) is virtually identical whether the ump is white or black.
At Tuesday, August 14, 2007 1:18:00 PM, Phil Birnbaum said...: Thanks, Guy, column headers now fixed.

I don't really see racial bias in Hispanic or Asian either. Given how few pitches there are in the smallest cell of any 2x2 matrix, nothing is statistically significant.

Yes, that quote from Time is definitely misleading. Factually correct, though. Maybe the reporter took it out of context?
At Tuesday, August 14, 2007 1:39:00 PM, Anonymous said...: Phil: why do you find the Questec findings plausible? Aren't the sample issues you raise even more troubling when you further divide the sample into Questec and non-Questec parks? For example, the number of same-race pitches for minority pitchers in Questec parks must be only around 500-600. Also, I don't think the study controls for park, which could be a significant confounding factor in the Questec analysis. For example, it's easy to imagine that Hispanic pitchers in Questec parks are in higher-K parks than when they are in non-Questec parks, or vice-versa.
At Tuesday, August 14, 2007 2:31:00 PM, Phil Birnbaum said...: What I find plausible is the idea that umpires call pitches differently in Questec parks. I don't mean the race conclusions. As for parks, yes, you're absolutely right, I should have thought about the park effects.
At Tuesday, August 14, 2007 3:30:00 PM, Anonymous said...: "But, to be honest, I don’t understand where this number could have come from, based on the raw data. As I showed in the 2x2 table above, the discrepancy between black and white is 5 pitches in 1,765, which is 1/3 of 1%. For white/hispanic, it’s about half a percent. For hispanic/black, it’s about an eighth of a percent. None of these is statistically significant."

The significance comes from aggregating the data into same-race and mixed-race categories. The mixed-race group consists of over 381K pitches, so the same .003 (approx) difference found in various raceXrace matchups becomes significant. The question is whether their fixed effects variables really do manage to control for race of the pitcher. My guess is no. One way the authors could easily check is to replace the "UPM" variable with pitcher=white in their final regression, and see if pitcher race is significant. If so, their model isn't controlling for pitcher race. On top of that, one could argue that the same-race and mixed-race categories artificially conflate several very different dynamics.

Also, I can't see how their final model controls for race of umpire (though this is a much less important factor).
At Tuesday, August 14, 2007 3:41:00 PM, Phil Birnbaum said...: Right, they're aggregating, but, as you say, if they control for pitchers properly, that .003 should get washed out. Shouldn't it? Let me think about that a bit.

Excellent idea about running a test of that last regression with the "white" indicator. That would tell us for sure. It really seems like what they did should work, but that test would set my mind at rest.
At Tuesday, August 14, 2007 3:42:00 PM, Phil Birnbaum said...: Race of umpire is controlled for by having the pitcher and the UPM, no? If it's Greg Maddux and the UPM is true, it's a white ump.
At Tuesday, August 14, 2007 3:57:00 PM, Anonymous said...: On umpire, I don't think that does it. If you're going to say same-race has X effect overall, you need to control for fact that 98% of these matchups include a white ump (higher than other matchups). But in practice, white umps only call slightly more strikes, so it isn't a big deal.

* *

I think I disagree with your estimate of the magnitude of impact here. The authors say that same-race increases the strike% by .0034. If an average game has about 140 pitches, about 70 will be called. So that means .24 pitches per game that get converted from ball to strike. And that's comparing 0% to 100% same race rates; in fact, we're talking about a hypothetical increase of about 85% for black and Hisp pitchers, so let's call it .20 pitches per game.

So when the authors say this impacts "less than one pitch per game," that seems a bit misleading to me. And the idea that batters can detect this minuscule advantage when pitchers face other-race umps, such that they also hit more HRs or gain other advantages, is farfetched indeed.

Turning this into runs, it's .02 runs per game, or half a run per season for a fulltime starter. Might be worth one win to a white starter over an entire career.
At Tuesday, August 14, 2007 7:19:00 PM, Phil Birnbaum said...: Guy: right on both counts. My estimate was wrong because I figured 1% of all pitches, not just 1% of called strikes like I should have. At about 21 called strikes a game, your estimate is right on.
At Saturday, August 18, 2007 10:09:00 PM, Anonymous said...: See the results of my similar study in posts 24 and 25 at:

http://www.insidethebook.com/ee/index.php/site/comments/a_fascinating_study_worthy_of_some_discussion_i_think/#comments
