Sabermetric Research: The Hamermesh umpire/race study revisited -- part IX

This is the ninth post in this series, on the Hamermesh racial-bias study. The previous posts are here.

There's not going to be anything new in this post – I'm just going to recap the various issues I've already talked about, with fewer numbers. You can consider this a condensed version of the other eight posts.

------

I'll start by summarizing the study one more time.

The Hamermesh study analyzed three seasons' worth of pitch calls. It attempted to predict whether a pitch would be called a ball or a strike based on a whole bunch of variables: who the umpire was, who the pitcher was, the score of the game, the inning, the count, whether the home team was batting, and so forth.

But it added one extra variable: whether the race of the pitcher (black, white, or hispanic) matched the race of the umpire. That was called the "UPM" variable, for "umpire/pitcher match." If umpires had no racial bias, the UPM variable would come out close to zero, meaning knowing the races wouldn't help you predict whether the pitch was a ball or a strike. But if the UPM came out significant and positive, that would mean that umpires were biased -- that all else being equal, pitches were more likely to be strikes when the umpire was of the same race as the pitcher.

It turned out that, when the authors looked at *all* the data, the UPM variable was not significant; there was only weak evidence of racial bias. However, when the data were split, the UPM coefficient *was* significant, at 2.17 standard deviations, in parks in which the QuesTec system was not installed. Umpires appear to have called more strikes for pitchers of their own race in parks in which their calls were not second-guessed by QuesTec.

An even stronger result was found when selecting only games in parks where attendance was sparse. In those games, the UPM coefficient was significant at 2.71 standard deviations. The authors interpreted this to mean that when fewer people were there scrutinizing the calls, umpires felt freer to indulge their taste for same-race discrimination.

In this latter case, the UPM coefficient was 0.0084, meaning that the percentage of strikes called for same-race pitchers increased by 0.84 of a percentage point. That would suggest that about 1 pitch in 119 is influenced by race.
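The "1 pitch in 119" figure is just the reciprocal of the coefficient. A quick check, using the 0.0084 from the study (the variable names are mine):

```python
# UPM coefficient reported for the low-attendance games (from the study).
upm_coefficient = 0.0084

# A 0.84-percentage-point increase means roughly one extra strike
# for every (1 / 0.0084) pitches.
pitches_per_affected_call = 1 / upm_coefficient

print(f"{upm_coefficient * 100:.2f} percentage points")
print(f"about 1 pitch in {round(pitches_per_affected_call)}")
```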

That's the study.

------

I do not agree with the authors that the results show the widespread existence of same-race bias. I have two separate sets of arguments. First, there are statistical reasons to suggest that the significance levels might be inflated. And, second, the model the authors chose has embedded assumptions which I don't think are valid.

------

1. The Significance Arguments

In their calculations, the authors calculated standard errors as if they were analyzing a random sample of pitches. But the sample is not random. Major League Baseball does not assign a random umpire for each pitch. They do, roughly, assign a random umpire for each *game*, but that means that a given umpire will see a given pitcher for many consecutive pitches.

If the sample of pitches were large enough, this wouldn't be a big issue – umpires would still see close to a random sample of pitchers. But there are very few black pitchers, and only 7 of the 90 umpires are of minority race (2 hispanic, 5 black). This means that some of the samples are very small. For instance, there were only about 900 pitches called by black umpires on black pitchers in low-attendance situations. That situation is very influential in the results, but it's only about 11 games' worth. It seems reasonable to assume that these umpires saw only a very few starting pitchers.

What difference does that make? It means that the pitches are not randomly distributed among all other conditions, because they're clustered into only a very few games. That means that if the study didn't control for everything correctly, the errors will not necessarily cancel out, because they're not independent for each pitch.

For instance, the authors didn't control for whether it was a day game or a night game. Suppose (and I'm making this up) that strikes are much more prevalent in night games than day games because of reduced visibility. If the sample were very large, it wouldn't matter, because if there were 1000 starts or so, the day/night situation would cancel out. But suppose there were only 12 black/black starting pitcher games. If, just by chance, 8 of those 12 happened to be night games, that might explain all those extra strikes.

8 out of 12 is 67%. The chance of 67% of 12 random games being night games is reasonably high. But the chance of 67% of 900 random *pitches* being in night games is practically zero. And it's the latter that the study incorrectly assumes. (Throughout the paper, the standard errors are based on the normal approximation to the binomial, which assumes independence.)
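To put rough numbers on that contrast, here's a small sketch. It assumes, purely for illustration, that half of all games are night games (I don't have the real figure). It compares the chance of at least two-thirds night observations among 12 independent games with the same chance among 900 independent pitches.

```python
from fractions import Fraction
from math import comb


def binomial_upper_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p), computed exactly with fractions."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(k_min, n + 1))


# ASSUMPTION: half of all games are night games. Illustrative only.
p_night = Fraction(1, 2)

# 8 or more night games out of 12 independent games.
p_games = float(binomial_upper_tail(12, 8, p_night))

# 600 or more night pitches out of 900 *independent* pitches,
# which is what pitch-level standard errors implicitly assume.
p_pitches = float(binomial_upper_tail(900, 600, p_night))

print(f"12 games:    {p_games:.3f}")
print(f"900 pitches: {p_pitches:.1e}")
```

Under that 50/50 assumption, the 12-game case comes out at about 19%, while the 900-pitch case is vanishingly small. Same proportion, wildly different odds, and the only thing that changed is what you treat as the independent unit.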

(I emphasize that this is NOT an argument of the form, "you didn't control for day/night, and day/night might be important, so your conclusions aren't right." That argument wouldn't hold much weight. In any research study, there's always some possible explanation, some factor the study didn't consider. But, if that factor is random among observations, the significance level takes it into account. The argument "you didn't control for X" might suggest that X is a *cause* of the statistical significance, but it is not an argument that the statistical significance is overstated.

So my argument is not "you didn’t control for day/night." My argument is, "the observations are not sufficiently independent for your significance calculation to be accurate enough." The day/night illustration is just to show *why* independence matters.)

Now, I don't have any evidence that day/night is an issue. But there's one thing that IS an issue, and that's how the study corrected for score. The study assumed that the bigger the lead, the more strikes get thrown, and that every extra run by which you lead (or trail) causes the same positive increase in strikes. But that's not true. Yes, there are more strikes with a five-run lead, but there are also more strikes with a five-run deficit, as mop-up men are willing to challenge the batter in those situations. So when the pitcher's team is way behind, the study gets it exactly backwards: it predicts very few strikes, instead of very many strikes.

Again, if the sample were big enough, all that would likely cancel out – all three races would have the same level of error. But, again, the black/black subsample has only a few games. What if one of those games was a 6-inning relief appearance by a (black) pitcher down by 10 runs? The model expects almost no strikes, the pitcher throws many, and it looks like racial bias. That isn't necessarily that likely, but it's MUCH more likely than the model gives it credit for. And so the significance level is overstated.

So we have at least one known problem, of the score adjustment. And there were many adjustments that weren't made: day/night, runners on base, wind speed, days of rest ... and you can probably think of more. All these won't be independent, and there's going to be some clustering. So if any of those other factors influence balls and strikes – which they probably do – the significance levels will be even wronger.

How wrong? I don't know. It could be that if you did all the calculations, they'd be only slightly off. It could be that they're way off. But they're definitely too high.

Note that this argument only applies to the significance levels, and not the actual bias estimates. Even with all the corrections, the 0.0084 would remain. The question is only whether it would still be statistically significant.

------

2. The Model

As I mentioned earlier, the study included only one race-related UPM variable, for whether or not the umpire and pitcher were of the same race. Because the variable came out significantly different from zero, the study concluded that its value represents the effect of umpires being biased in favor of pitchers of their own race.

However, the choice of a single value for UPM is based on two hidden assumptions. The first one:

-- Umpire bias is the same for all races.

That is: the study assumes that a white umpire is exactly as likely to favor a white pitcher as a black umpire is to favor a black pitcher, and exactly as likely as a hispanic umpire is to favor a hispanic pitcher.

Why is this a hidden assumption? Because there is only one UPM variable that applies to all races. But it's not hard to think of an example where you'd need to measure bias for each race separately.

Suppose umpires were generally unbiased, except that, for some reason, black umpires had a grudge against black pitchers, and absolutely refused to call strikes for them (if you like, you can suppose that fact is well-known and documented). If that were the case, the analysis done in this study would NOT pick that up. It would find that there is indeed racial discrimination against same-race pitchers, but it would be forced to assume that it's equally distributed among the three races of umpires.

That's a contrived example, of course. But in the real world, things are different. In real life, is it necessarily true that all races of umpires would have exactly the same level of bias?

It seems very unlikely to me. Historically, discrimination has gone mostly one way, mostly whites discriminating against minorities, mostly men discriminating against women. There are probably a fair number of white men who wouldn't want to work for a black or female boss. Are there as many black women who wouldn't want to work for a white or male boss? I doubt it.

Why, then, assume that significant bias must exist for all races? And, why assume, as the study did, that not only does it exist for all races, but that the effects are *exactly the same* regardless of which race you're looking at?

If you remove the assumption, you wind up with a much weaker result. There still turns out to be a statistically significant bias, but you no longer know where it is. Take another hypothetical example: there are white and black pitchers and umpires, and three of the four combinations result in 50% strikes. However, the fourth case is off -- white umpires call 60% strikes for white pitchers.

Who's discriminating? You can't tell. It could be whites favoring whites. But it could be that white pitchers are just better pitchers, and it's the black umpires discriminating against whites, calling only 50% strikes when they should be calling 60%. If you open up the possibility that one set of umpires might be more biased than the other – an assumption which seems completely reasonable to me – you can find that there's discrimination, but not what's causing it.

Also, you can't even tell how many pitches are affected. If the white/white case had 350,000 pitches and the black umpire/white pitcher case had only 45,000 pitches, you could have as many as 35,000 pitches affected (if it's the white umpires discriminating), as few as 4,500 pitches (if it's the black umpires) or something in the middle (both sets of umpires discriminate, to varying extents).
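Here's the arithmetic behind those bounds, using the hypothetical 60%/50% strike rates and pitch counts from the example (none of these are real data, and the variable names are mine):

```python
# Hypothetical numbers from the example in the text, not real data.
rate_white_ump_white_pitcher = 0.60
rate_everywhere_else = 0.50
gap = rate_white_ump_white_pitcher - rate_everywhere_else

pitches_white_ump_white_pitcher = 350_000
pitches_black_ump_white_pitcher = 45_000

# Scenario A: white umpires inflate strikes for white pitchers.
affected_if_white_umps_biased = round(gap * pitches_white_ump_white_pitcher)

# Scenario B: white pitchers really earn 60%, and black umpires
# shortchange them by calling only 50%.
affected_if_black_umps_biased = round(gap * pitches_black_ump_white_pitcher)

print(affected_if_white_umps_biased)
print(affected_if_black_umps_biased)
```

The same table of strike rates supports either story, and the two stories differ by a factor of nearly eight in the number of pitches affected.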

And maybe it's not white umpires favoring white pitchers, or black umpires disfavoring white pitchers. Maybe it's black umpires favoring black pitchers. Maybe black umpires generally have a smaller strike zone, but they enlarge it for black pitchers. Since there are so few black pitchers, maybe in this case only 800 pitches are affected.

The point is that without the different-races-have-identical-biases assumption, all the conclusions fall apart. The only thing you *can* conclude, if you get a statistically-significant UPM coefficient, is that there is bias *somewhere*. You can reject the hypothesis that bias is zero everywhere, but that's it.

The other hidden assumption is:

-- All umpires have identical same-race bias.

The study treats all pitches the same, again with the same UPM coefficient, and assumes the errors are independent. This assumes that any umpire/pitcher matchup shows the same bias as any other umpire/pitcher matchup – in other words, that all umpires are biased by the same amount.

To me, that makes no sense. In everyday life, attitudes towards race vary widely. There are white people who are anti-black, there are people who are race-neutral, and there are people who favor affirmative action. Why would umpires be any different? Admittedly, we are talking about unconscious bias, and not political beliefs, but, still, wouldn't you expect that different personalities would exhibit different quantities of miscalled pitches?

Put another way: remember when MLB did its clumsy investigations of umpires' personal lives, asking neighbors if the ump was, among other things, a KKK member? Well, suppose they had found one umpire who, indeed *was* a KKK member. Would you immediately assume that *all* white umpires now must be KKK members? That would be silly, but that's what is implied by the idea that all umpires are identical.

I argue that umpires are human, and different humans must exhibit different degrees of conscious and unconscious racial bias.

Once you admit the possibility that umpires are different, it no longer follows that bias must be widespread among umpires or races of umpires. It becomes possible that the entire observed effect could be caused by one umpire! Of course, it's not necessarily true that it's one umpire – maybe it's several umpires, or many. Or, maybe it is indeed all umpires – even though they have different levels of bias, they might all have *some*.

How can we tell? What we can do is, for all 90 umpires, see how much they favor their own race over others. Compare them to the MLB average, so that the mean umpire becomes zero relative to the league. Look at the distribution of those 90 umpire scores.

Now, suppose there is no bias at all. In that case, the individual umpires' scores should be approximately normally distributed, with the spread predicted by the binomial distribution, based on the number of pitches.

What if only one or two umpires are biased? In that case, we probably can't tell that apart from the case where no umpire is biased – it's only a couple of datapoints out of 90. Unless the offending umps are really, really, really biased, like 3 or 4 standard deviations, they'll just fit in with the distribution.

What if half the umpires are biased? Then we should get something that's more spread out than the normal distribution – perhaps even a "two hump" curve, with the biased umps in one hump, and the unbiased ones in the other. (The two humps would probably overlap).

What if all the umpires are (differently) biased? Again we should get a curve more spread out than the normal distribution. Instead of only 5% of umpires outside 2 standard errors, we should get a lot more.
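Those scenarios can be sketched with a simulation. This is not the calculation from part 8. It's a toy version in which every number is made up for illustration: 500 same-race and 500 other-race pitches per umpire, a 32% base called-strike rate, and a bias of 3 percentage points. Each umpire gets a z-score for the difference between his same-race and other-race strike rates.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_umpire_z_scores(n_umpires, bias_for, pitches=500,
                             base_rate=0.32, seed=0):
    """Return one z-score per simulated umpire.

    bias_for(rng) gives that umpire's true same-race boost in strike
    probability. All default parameter values are hypothetical.
    """
    rng = random.Random(seed)
    z_scores = []
    for _ in range(n_umpires):
        p_same = min(max(base_rate + bias_for(rng), 0.0), 1.0)
        # Count called strikes in each group, one Bernoulli draw per pitch.
        same = sum(rng.random() < p_same for _ in range(pitches))
        other = sum(rng.random() < base_rate for _ in range(pitches))
        # Standard error of the difference in rates, using the pooled rate.
        pooled = (same + other) / (2 * pitches)
        std_err = sqrt(pooled * (1 - pooled) * 2 / pitches)
        z_scores.append((same - other) / pitches / std_err)
    return z_scores


scenarios = {
    "no bias": lambda rng: 0.0,
    "identical bias": lambda rng: 0.03,
    "varying bias": lambda rng: rng.gauss(0.0, 0.03),
}

results = {}
for name, bias_for in scenarios.items():
    z = simulate_umpire_z_scores(90, bias_for)
    results[name] = (mean(z), pstdev(z))
    avg, spread = results[name]
    print(f"{name:15s} mean z = {avg:+.2f}   SD of z = {spread:.2f}")
```

Only the varying-bias scenario should widen the spread noticeably. A bias that is identical for every umpire shifts the mean but leaves the SD near 1, so once you center on the league average it looks just like no bias at all.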

So we should be able to estimate the extent of bias by looking at the distribution of individual umpires. I checked, using a dataset similar to the one in the study (details are in part 8).

What I found was that the result looked almost perfectly normal. (You would have expected the SD of the Z-scores to be exactly 1. It was 1.02.)

This means one of the following:

-- no biased umps
-- 1 or 2 biased umps
-- many biased umps with *exactly the same bias*.

As I said, I don't find the third option credible, and the statistical significance, if we accept it, contradicts the first option. So I think the evidence suggests that only a very few umpires are biased, but at least one.

However: the white/white sample is so large that one or two biased white umpires wouldn't be enough to create statistical significance. So, if we must assume bias, we should assume we're dealing with a very small number of biased *minority* umpires. Maybe even just one.

And as it turns out (part 7), if you take out one of the two hispanic umpires (who, it turns out, both called more strikes for hispanic pitchers), the statistical significance disappears.
If you take out a certain black umpire, who had the highest increase in strike calls for black pitchers out of all 90 umps, the statistical significance again disappears. This doesn't necessarily mean that any or all of those umps are biased. It *does* mean that the possibility explains the data just as well as the assumption of universal unconscious racial bias.

------

Based on all that, here's where the study and I disagree.

Study: there is statistically significant evidence of bias.
Me: there *may* be statistically significant evidence of bias, but you can't tell for sure because some of the critical observations aren't independent.

Study: the findings are the result of all umpires being biased.
Me: the findings are more likely the result of one or two minority umpires being biased.

Study: many pitches are affected by this bias.
Me: there is no way to tell how many pitches are affected, but, if the effect is caused by one or two minority umpires favoring their own race, the number of pitches would be small.

Study: overall, minority pitchers are disadvantaged by the bias.
Me: That would be true if all umpires were biased, because most umpires are white. But if only one or two minority umpires are biased, then minority pitchers would be the *beneficiaries* of the bias.

Study: the data show that bias exists.
Me: I don't think the data show that bias exists beyond a reasonable doubt. For instance, suppose the results found are significant at a 5% level. And suppose your ex-ante belief is that umpire racism is rare, and the chance at least one minority umpire is biased is only about 5%. Then you have roughly equal probabilities of luck and bias.
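That last claim is Bayes' rule at work. A minimal sketch, assuming (my simplification) that a genuinely biased umpire would produce a significant result every time:

```python
def posterior_bias(prior_bias, p_result_if_bias, p_result_if_no_bias):
    """P(bias | significant result) by Bayes' rule."""
    evidence_bias = prior_bias * p_result_if_bias
    evidence_luck = (1 - prior_bias) * p_result_if_no_bias
    return evidence_bias / (evidence_bias + evidence_luck)


prior = 0.05   # ex-ante chance that at least one minority umpire is biased
alpha = 0.05   # chance of a significant result by luck alone
power = 1.0    # ASSUMPTION: bias, if present, always shows up as significant

p_bias = posterior_bias(prior, power, alpha)
print(f"P(bias | significant result) = {p_bias:.2f}")
```

That works out to a shade over 50%. If bias doesn't always produce a significant result (power below about 0.95 with these numbers), luck becomes the more likely explanation.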

I do not believe the evidence in this study is strong enough to lead to a conclusion of bias beyond a reasonable doubt. But it is strong enough to suggest further investigation, specifically of those minority umpires who landed on the "same-race" side. My unscientific gut feeling is that if you did look closely -- perhaps by watching game tapes and such -- most of the effect would disappear and the umps would be cleared.

But that's just my gut. Your gut may vary. I will keep an open mind to new evidence.

Labels: baseball, Hamermesh update, race

3 Comments:

At Tuesday, June 10, 2008 5:37:00 PM, Anonymous said...: Isn't there also a possibility that there is one (or a very small number of) pitcher that every ump despises, and if he's non-white, even though all umps call him equally (but worse than all other pitchers) we'd see evidence of bias under the Hamermesh study's assumptions?
At Tuesday, June 10, 2008 9:22:00 PM, Phil Birnbaum said...: Hmmm ... I think that would affect every umpire equally, so it wouldn't affect the race interactions. That is, there's no statistical difference between a minority pitcher getting fewer strikes because he's legitimately worse, versus him getting fewer strikes because everyone hates him.
At Sunday, July 10, 2011 7:44:00 PM, Parrish H said...: Great site. Anyway, onto the discussion.

1. Hasn't Implicit Race Bias ALREADY been shown to exist in other broader and more comprehensive social (see wiki) as well as sports (NBA) studies? So, as to your doubt that there could be "many biased umps with *exactly the same bias*", and you "don't find the...option credible", isn't this exactly what such human implicit race bias would expect to find? That is, roughly the same amount of unconscious bias across all umpires, and all umpires falling within a normal expected range of variance?

2. Isn't pulling 2 or 3 extreme umpire cases OUT of this study just as questionable as making sure you include them IN when interpreting this or any significance level?

You could be right on #1, it's just that I wouldn't be so sure that such a scenario exists.

On #2, it seems anyone could prove or disprove a study with this technique. So, how does one avoid this pitfall for a fully qualified and/or academic study? It seems that these included outliers are exactly what ends up proving all such studies in the first place.


Friday, June 06, 2008
