Wednesday, April 09, 2008

The Hamermesh umpire/race study revisited -- part II

This is the latest in a series of posts about the Hamermesh umpire/race study.

August, 2007 posts:
first, second, third, fourth.
Recent posts:
part 1, part 1(addendum)

This is part 2. What follows is a little complicated; I hope I got it right. Please let me know.


In the previous post, I found that a regression on the Table 2 data from the Hamermesh study showed no significance for the "UPM" parameter, the one that represents same-race bias on the part of umpires.

That regression was still simpler than the ones in the study. In column 8 of Table 3, the authors do pretty much the same regression I did, but they control for a whole bunch of other variables:

-- the count
-- the pitcher
-- the umpire

When I say "the count," I mean that the authors included a dummy variable for every possible count (except one – you need to omit one, any one, to get the regression to work properly and avoid a singular matrix). That's 11 variables. They added a variable for each umpire, which is 92 variables (since there were 93 umpires). And they also included a variable for each pitcher (except one), which is about 900 more variables.
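Here's a rough sketch of that dummy-variable bookkeeping. The helper name `dummy_columns` is made up for this illustration, and the umpires are just numbered 0 to 92; it only counts columns and is not the study's code.

```python
def dummy_columns(levels):
    """Return the dummy columns for a categorical variable, dropping one
    level (here, the first) as the baseline so the design isn't singular."""
    return list(levels)[1:]

# Every possible ball-strike count: 4 ball values x 3 strike values = 12.
counts = [f"{balls}-{strikes}" for balls in range(4) for strikes in range(3)]
count_dummies = dummy_columns(counts)

# 93 umpires, labeled by number for illustration only.
umpire_dummies = dummy_columns(range(93))

print(len(counts), len(count_dummies))  # 12 counts, 11 dummy variables
print(len(umpire_dummies))              # 92 dummy variables
```

Which level gets dropped doesn't matter; the remaining coefficients are simply measured relative to it.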

Why include all these variables? Because whatever effect you get could be just luck: certain pitchers might have wound up facing certain umpires more than usual. Maybe the white umpire with the big strike zone happened to face Jake Peavy a lot, and that's what inflated the W/W cell. By controlling for the individual pitcher, and the individual umpire, you help eliminate some of the random luck from the results.

A more important issue than controlling for pitchers and umpires is controlling for count. It's good that the authors did that, because using just the raw data might understate the differences among umpires and pitchers.

There's probably a bit of a negative correlation between one pitch and the next. After a strike, a given pitcher is more likely to throw a ball (perhaps wasting an 0-2 pitch); and, after a ball, he's more likely to throw a strike (such as on 3-0, when the batter is probably taking). So there is an underlying force – the count – that pulls pitchers' strike percentages towards each other. Pitcher A might have 5% more strikes than pitcher B on every count, but have only 3% more strikes overall because of all of B's extra 3-0 strikes. The 5% figure is the accurate indicator of the relative output of the pitchers – or of the umpires that called them.
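To see how the count can shrink a real 5-point gap down to 3, here is a small worked example. All the pitch and strike totals are hypothetical, picked only so the arithmetic comes out clean, and two counts stand in for all twelve.

```python
# HYPOTHETICAL pitch and strike totals, chosen only to illustrate the effect.
# Two counts stand in for all twelve: an ordinary count and 3-0.
# Format: count -> (pitches, strikes)
pitcher_a = {"ordinary": (900, 315), "3-0": (100, 55)}
pitcher_b = {"ordinary": (800, 240), "3-0": (200, 100)}

def pct(pitches, strikes):
    """Strike percentage for a given number of pitches."""
    return 100 * strikes / pitches

def overall_pct(pitcher):
    """Strike percentage over all counts combined."""
    total_pitches = sum(p for p, _ in pitcher.values())
    total_strikes = sum(s for _, s in pitcher.values())
    return pct(total_pitches, total_strikes)

per_count_gap = {c: pct(*pitcher_a[c]) - pct(*pitcher_b[c]) for c in pitcher_a}
overall_gap = overall_pct(pitcher_a) - overall_pct(pitcher_b)

print(per_count_gap)  # A is 5 points ahead of B on each count
print(overall_gap)    # but only 3 points ahead overall
```

Pitcher B gets to 3-0 twice as often, and those easy strikes pull his overall percentage up towards A's.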

So anyway, having added those 1,000 variables to the regression, the study found more of a UPM [umpire/pitcher matching in race] effect:

.05% -- my regression
.28% -- the study's regression, after controlling for pitcher, umpire, and count

With the extra variables, the racial bias coefficient was over five times as high as without – but it was still not statistically significant, at about p=0.12 (about 1.2 standard deviations above zero). The authors also ran one more regression, adding another 1,400 variables to control for the batter, and got almost the same result (.27%).
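As a quick check on how 1.2 standard deviations lines up with p=0.12, here is a minimal sketch. It assumes a one-sided test against a normal distribution, which is my assumption; the text above doesn't say which test was used.

```python
import math

def upper_tail_p(z):
    """One-sided p-value: probability that a standard normal exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(upper_tail_p(1.2), 3))  # 0.115, i.e. about p=0.12
```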

The authors write:

"... the point estimates imply that a given called pitch is approximately 0.27 percentage points more likely to be called a strike if the umpire and pitcher match race/ethnicity."

And this is where I disagree. I believe that the significance level is accurate, but that the actual coefficients are not.

There's nothing wrong with the way the authors read the results of the regression output – everything is fine statistically – but I disagree with the model that produced the regression. Specifically, the addition of the UPM parameter carries a hidden assumption: that same-race bias must be the same regardless of the race involved. The UPM parameter of .27 means that each of the three cells on the diagonal is 0.27 percentage points higher than it would otherwise be. So white umpires called 0.27 percentage points "too many" strikes; hispanic umpires called 0.27 percentage points too many strikes; and black umpires called 0.27 percentage points too many strikes. The assumption forces the regression to choose equal estimates for all three races. That equal estimate came out to 0.27.

I don't think forcing this equality is appropriate. It's certainly *possible* that the races have the same bias, but is it so certain that you can embed it in the model that way? I'd argue that it isn't even likely. In almost every aspect of life that has real, proven bigotry, it almost always goes one way. Whites used to lynch blacks; did blacks ever lynch whites? Are there gangs of gay men who roam public parks looking for smooching heterosexuals to beat up?

Even where it's obvious that two groups mutually dislike each other, does it really follow that one group will be *exactly* as biased as the other? Is a Republican boss exactly as unlikely to hire a Democrat as a Democrat boss is to hire a Republican? Even if they're equal today, what about tomorrow? When George W. Bush does something controversial overnight, don't you think Democrats will get a lot more pissed off than Republicans, and the relative bias will wind up a little bit more extreme than yesterday?

If you agree that it's reasonable that the races would have different levels of bias towards each other – and even if you don’t – you have to qualify the results of the study. Instead of saying

-- "The best estimate of racial bias is 0.27% of pitches."

You need to say

-- "IF racial bias is the same across all races, THEN the best estimate of racial bias is 0.27% of pitches."

Since we don't know that bias is the same across the races (and I think we have reason to believe that it's not), we can't just assume that the 0.27% is the right number.

But, suppose that even without that assumption, we'd get a similar result. Then my objection would just be a technicality. But I don't think that's the case. Let me show how wildly different the numbers are if the bias isn't equally spread among the races.

Suppose the unbiased frequency of strikes should be this:

Pitcher ------ White Hspnc Black
White Umpire-- 32.00 33.00 34.00
Hspnc Umpire-- 35.00 36.00 37.00
Black Umpire-- 38.00 39.00 40.00

Now, suppose that same-race umpires are biased equally, as the model demands. Say, by 1 percentage point each. We'll add 1.00 to each cell of the main diagonal. That gives us:

Pitcher ------ White Hspnc Black
White Umpire-- 33.00 33.00 34.00
Hspnc Umpire-- 35.00 37.00 37.00
Black Umpire-- 38.00 39.00 41.00

In this case, the model works. If we do a regression now (keeping the same number of pitches for each cell as in the original study), we get exactly the results we expect, with a UPM coefficient of exactly 1%. The regression works perfectly because the actual bias matches the model. And the 1% means that, all else being equal, a pitcher will receive, on average, one extra percentage point of strikes when the umpire matches his race.

But, now, what if the bias isn't equal? What if it's all concentrated in the white umpires? That is, instead of a 1% same-race bias on each umpire, we have a 3% bias in the W/W case, and 0% in the other two. The observations look like this:

Pitcher ------ White Hspnc Black
White Umpire-- 35.00 33.00 34.00
Hspnc Umpire-- 35.00 36.00 37.00
Black Umpire-- 38.00 39.00 40.00

Now, all else being equal, how many percentage points more strikes will a pitcher get if the umpire matches his race? Well, the only time that happens is in the white/white case. So 3% more strikes out of 741,729 is 22,251 more strikes. There are 750,817 pitches in the same-race diagonal. 22,251 divided by 750,817 equals 2.964%.

So we'd expect the UPM coefficient to be about 2.964%. When we do the regression, it turns out to be 1.883%. That's way off, and it's way off because the reality doesn't match the model.
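For anyone who wants to try this kind of experiment, here is a minimal sketch of the regression on the nine cell averages, weighted by pitches per cell (which gives the same coefficients as a pitch-by-pitch regression). Big caveat: only the three diagonal pitch counts come from the study data quoted in this post; the six off-diagonal counts below are HYPOTHETICAL placeholders, so the W/W-only result will not reproduce the 1.883% figure. The function names are made up for this example.

```python
from fractions import Fraction

# Pitches per (umpire, pitcher) cell, races in the order W, H, B.
# Diagonal counts are from the study data quoted in this post;
# off-diagonal counts are HYPOTHETICAL placeholders.
COUNTS = [
    [741729, 250000, 25000],
    [24000, 7323, 1000],
    [46000, 14000, 1765],
]

def design_row(u, p):
    """Intercept, umpire H/B dummies, pitcher H/B dummies, UPM."""
    return [1, int(u == 1), int(u == 2), int(p == 1), int(p == 2), int(u == p)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination on exact Fractions."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

def upm_coefficient(rates):
    """Weighted least squares on cell strike rates; returns the UPM term."""
    k = 6
    xtwx = [[Fraction(0)] * k for _ in range(k)]
    xtwy = [Fraction(0)] * k
    for u in range(3):
        for p in range(3):
            x = design_row(u, p)
            w = COUNTS[u][p]
            for i in range(k):
                xtwy[i] += w * x[i] * Fraction(rates[u][p])
                for j in range(k):
                    xtwx[i][j] += w * x[i] * x[j]
    return solve(xtwx, xtwy)[5]

BASE = [[32, 33, 34], [35, 36, 37], [38, 39, 40]]     # no bias
EQUAL = [[33, 33, 34], [35, 37, 37], [38, 39, 41]]    # 1 point in every same-race cell
WW_ONLY = [[35, 33, 34], [35, 36, 37], [38, 39, 40]]  # 3 points, W/W only

print(upm_coefficient(BASE))            # 0: no bias, none found
print(upm_coefficient(EQUAL))           # 1: reality matches the model
print(float(upm_coefficient(WW_ONLY)))  # depends on the placeholder counts
```

The first two results are exact no matter what counts you plug in, because the data fit the model perfectly. The third is the one that moves around with the cell sizes.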

It's even worse if we try the other same-race cases. Suppose all 3% goes in the H/H cell, and the W/W and B/B umpires are unbiased:

Pitcher ------ White Hspnc Black
White Umpire-- 32.00 33.00 34.00
Hspnc Umpire-- 35.00 39.00 37.00
Black Umpire-- 38.00 39.00 40.00

Now, the extra strikes amount to only 0.029% of same-race pitches, but the UPM coefficient is 0.882%.

Finally, suppose all the bias (3%) is in the B/B case:

Pitcher ------ White Hspnc Black
White Umpire-- 32.00 33.00 34.00
Hspnc Umpire-- 35.00 36.00 37.00
Black Umpire-- 38.00 39.00 43.00

Now, in reality, the extra strikes amount to only 0.007% of same-race pitches (about 1 in 14,000). But the UPM coefficient is 0.23% (1 in 427).
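The "what we'd expect" percentages in these three examples are simple arithmetic, sketched below. The diagonal total, the W/W count, and the B/B count are the figures quoted in this post (the B/B count is in the comments); the H/H count is just what's left over. The function name is made up for the example.

```python
# Pitch counts quoted in this post.
DIAGONAL = 750_817        # all same-race pitches
WW = 741_729              # white umpire, white pitcher
BB = 1_765                # black umpire, black pitcher
HH = DIAGONAL - WW - BB   # implied hispanic/hispanic count

def average_bias_pct(cell_pitches, bias_pct=3.0):
    """Extra strikes from a bias confined to one cell, expressed as a
    percentage of all same-race pitches."""
    extra_strikes = cell_pitches * bias_pct / 100
    return 100 * extra_strikes / DIAGONAL

for label, n in [("W/W", WW), ("H/H", HH), ("B/B", BB)]:
    print(label, n, round(average_bias_pct(n), 3))
# W/W 741729 2.964
# H/H 7323 0.029
# B/B 1765 0.007
```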

My conclusion, again, is that if real-life doesn't match the model's assumption that all umpires are biased equally, then the estimated coefficients are so unreliable as to be meaningless.


Now, I said the UPM coefficient is meaningless. But I am NOT arguing that the significance level is meaningless. In fact, I agree with the study's authors that the significance tests are valid.

Why? Because the significance test is checking for no bias by anyone. And if there's no bias by anyone, then real-life DOES fit the model: all races are biased equally. Equally at zero, but equally nonetheless.

That is: if there is no bias, then real life fits the model, and the UPM coefficient should not come out significantly different from zero. But if the UPM coefficient is significantly different from zero, we know we have bias. The actual equation probably doesn't match real life very well, but we still have evidence that there's *some kind of bias*. Just not necessarily the kind the regression thinks it found.

To put it more formally:

-- if there is no racial bias, then racial bias is the same across all races (at zero).
-- if racial bias is the same across all races, then the UPM coefficient is the best estimate of that bias.
-- so if there is zero racial bias, the UPM coefficient should come out near zero.
-- if we do NOT get a UPM coefficient close to zero, then there must be a non-zero bias – although then the UPM coefficient is not necessarily a good estimate of that bias.

Hope that makes sense. In summary, it seems to me that we can trust the significance level of the UPM estimate, but not the number itself.


What's funny about the study is that the authors didn't really need to make the assumption that the bias was all equal. There's more than enough data to estimate the three races' biases individually. All you have to do is get rid of the UPM variable, and replace it with three "umpire matches pitcher" variables, one for each race. That eliminates the assumption, and the results become applicable in cases where the real-world groups don't have equal bias.
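Here's a minimal sketch of what that replacement looks like, run on the made-up example from earlier where all 3 points of bias sit in the W/W cell. It's an unweighted fit on the nine cell averages; since this example fits the three-variable model exactly, weighting by pitches would give the same answer. This is an illustration on toy numbers, not the regression on the real Table 2 data, and the function names are invented for it.

```python
from fractions import Fraction

def design_row(u, p):
    """Intercept, umpire H/B dummies, pitcher H/B dummies, then one
    same-race dummy per race (W/W, H/H, B/B) in place of a single UPM."""
    same_race = [int(u == p == r) for r in range(3)]
    return [1, int(u == 1), int(u == 2), int(p == 1), int(p == 2)] + same_race

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination on exact Fractions."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

def fit(rates):
    """Least squares on the nine cell strike rates (normal equations)."""
    cells = [(u, p) for u in range(3) for p in range(3)]
    rows = [design_row(u, p) for u, p in cells]
    k = len(rows[0])
    xtx = [[Fraction(sum(r[i] * r[j] for r in rows)) for j in range(k)]
           for i in range(k)]
    xty = [Fraction(sum(r[i] * rates[u][p] for r, (u, p) in zip(rows, cells)))
           for i in range(k)]
    return solve(xtx, xty)

# Toy example: 3 points of bias in the white/white cell only.
ww_only = [[35, 33, 34], [35, 36, 37], [38, 39, 40]]
ww, hh, bb = fit(ww_only)[5:]
print(ww, hh, bb)  # 3 0 0
```

With one variable per race, the regression puts the bias where it actually is instead of smearing it across all three cells.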

If you take the study data, get rid of UPM as a variable, and add the three race-specific UPMs, the regression gives these coefficients (I'm using the original Table 2 data, as usual):

W/W bias estimate: -0.416%
H/H bias estimate: +0.880%
B/B bias estimate: +0.689%

The regression is telling us that, indeed, it looks like the three races of umpires have fairly different biases. The white umpires are actually slightly biased *against* their own race. The black and hispanic umpires, however, are on the same-race side, perhaps as expected.

The other regressions didn't find statistical significance, and this one doesn't either. The first two coefficients are about 1 SD from zero, and the third one only half an SD. And even the differences between any pair aren't statistically significant.

But at least these coefficients are meaningful. We can say that when a white pitcher is on the mound, a same-race umpire means he loses an estimated 0.416 percentage points of his strikes. A hispanic pitcher facing a hispanic umpire gains 0.880 percentage points, and a black pitcher facing a black umpire gains 0.689 percentage points.

These are all *relative* to when a pitcher faces a different-race umpire (with the same overall strike zone). It is impossible to ever know the absolute level of "real" racial preference (if in fact there is any), because any relative difference equal to these coefficients is equally possible. It could indeed be that hispanic umpires have +0.880% bias in favor of hispanics, and 0.00% bias against non-hispanics. Or it could be that they have +0.440% bias in favor of hispanics, and –0.440% bias against non-hispanics. Or any other combination.

To be more explicit, the regression came up with this matrix of relative own-race preferences:

Pitcher ------ White Hspnc Black
White Umpire-- -0.42 00.00 00.00
Hspnc Umpire-- 00.00 +0.88 00.00
Black Umpire-- 00.00 00.00 +0.69

But to get *absolute* preferences, we can add the same constant to every cell. Suppose we subtract 0.2 from every cell, because it seems to us that other-race umpires should have about that much bias. Then we get

Pitcher ------ White Hspnc Black
White Umpire-- -0.62 -0.20 -0.20
Hspnc Umpire-- -0.20 +0.68 -0.20
Black Umpire-- -0.20 -0.20 +0.49

Either of those matrices fits the data. There is no way, no matter how much data we have, to figure out what's the true level of bias: the top one, the bottom one, or any of the infinity of possibilities. That's because we don't know where the zero is – we don't know which cell, if any, calls balls and strikes perfectly. If we were to measure performance objectively, using QuesTec or something, and call that "zero bias," then we could find out where the zero mark is, and adjust accordingly. (Of course, in that case, we wouldn't need to do a regression – we could just measure every racial combination against QuesTec.)
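A tiny sketch of why the two matrices are indistinguishable: shifting every cell by the same amount leaves every own-race-versus-other-race difference untouched, and those differences are all the regression can see. Values are in hundredths of a percentage point to keep the arithmetic exact; the function names are made up for the example.

```python
# Relative own-race preferences from the regression, in hundredths of a
# percentage point (integers avoid floating-point noise).
relative = [
    [-42, 0, 0],
    [0, 88, 0],
    [0, 0, 69],
]

def shift(matrix, constant):
    """Add the same constant to every cell."""
    return [[cell + constant for cell in row] for row in matrix]

def own_minus_other(matrix):
    """For each umpire row: own-race cell minus each other-race cell."""
    return [
        [row[i] - row[j] for j in range(3) if j != i]
        for i, row in enumerate(matrix)
    ]

shifted = shift(relative, -20)  # subtract 0.2 from every cell
print(shifted)
print(own_minus_other(relative) == own_minus_other(shifted))  # True
```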


In the regression we just did, there was also a hidden assumption: that all different-race pairs had the same bias. If we didn't want to assume that all *same-race* pairs had the same bias, why would we want to assume that the *different-race* pairs had the same bias?

Unfortunately, we're stuck with a little bit of that. By adjusting for the tendencies of the individual pitchers and umpires – for instance, by assuming that white pitchers are legitimately better than hispanic pitchers, instead of considering that the result may be bias – we are simultaneously assuming that:

-- the bias in each column must sum to the same amount
-- the bias in each row must sum to the same amount

That means:

-- at most, we can estimate two of the three cells in each column, relative to the third
-- at most, we can estimate two of the three cells in each row, relative to the third

Which means that we have to cross out one row and one column. That leaves four cells, in a rectangle, that we can estimate. If we try to estimate more than that pattern of four cells, the regression won't work, because we get an infinity of possible answers. (Technical note: when we try to estimate more than four cells, we get collinearity among the independent variables.)

(Just to emphasize: this is NOT a matter of not having enough data; it's a matter of trying to estimate too many things at once. To take a simpler example: if an umpire calls 31% strikes against whites and 32% against blacks, it is simply impossible, using only statistics, to figure out if he's biased against whites or in favor of blacks; the data, no matter how many pitches you have, equally support both possibilities. Here, with three races instead of just two, it's a little more complicated, but no matter how much data you have, you can't do all of: (a) estimate an umpire effect, (b) estimate a pitcher effect, and (c) estimate nine cells. If you want (a) and (b), the best you can do for (c) is estimate four cells relative to the other five.)
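The collinearity can be checked directly by counting. Below is a minimal sketch that builds one design row per cell (intercept, two umpire dummies, two pitcher dummies, plus an indicator for each cell we want to estimate) and computes the rank. The helper names are made up for this illustration. If the rank is less than the number of columns, the regression has no unique answer.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix by Gaussian elimination on exact Fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return r

def design(cells):
    """One row per (umpire, pitcher) cell, races in the order W, H, B.
    Columns: intercept, umpire H/B, pitcher H/B, one indicator per cell
    in `cells`."""
    return [
        [1, int(u == 1), int(u == 2), int(p == 1), int(p == 2)]
        + [int((u, p) == cell) for cell in cells]
        for u in range(3) for p in range(3)
    ]

four = [(1, 1), (1, 2), (2, 1), (2, 2)]  # bottom-right rectangle
five = four + [(0, 0)]                   # try to add the W/W cell too

for cells in (four, five):
    x = design(cells)
    print(len(x[0]), "columns, rank", rank(x))
# 9 columns, rank 9   -> estimable
# 10 columns, rank 9  -> collinear, no unique answer
```

There are only nine cell averages to work with, so nine is the most parameters you can ever pin down, and five of them are already spoken for by the intercept and the umpire and pitcher effects.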

So which four cells should you estimate? That depends on what you're most interested in. Whichever four cells you pick, you have to hold the other five (one row and one column) constant, and assume that the bias is the same in all five of those cells. The previous regression, where we chose the three same-race cells, and assumed the other six were equally biased, was a reasonable choice.

Since our null hypothesis is that there's no bias, another reasonable choice might be to hold constant the row and column involving whites. Why? Because they have the most pitches. By looking for bias in the other four cells, we are investigating bias where there are the fewest pitches, and where the difference would be most likely to happen by chance. That would be the most conservative test of whether the bias is just random.

(As an example: suppose you want to compare two players. Joe hit .333 in 480 AB; Manny went 0-for-3. It's more informative, intuitively, to say "Manny is only one hit worse than Joe's average" than to say "Joe is 160 hits better than Manny's average.")

If we choose those bottom-right four cells, here are the results:

Pitcher ------ White Hspnc Black
White Umpire-- 00.00 00.00 00.00
Hspnc Umpire-- 00.00 +0.48 -0.31
Black Umpire-- 00.00 -0.47 -0.28

In terms of pitches:

Pitcher ------ Wht Hsp Blk
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +35 +03
Black Umpire-- +00 -65 +05

If you change the "-65" to "-74", you get *exactly* what we got just by quickly playing with the numbers back in August.

And since the results do appear to be non-significant, I think this particular regression is probably pretty close to what's actually going on.


Hope this all makes sense. Next will be Part III, where I'll talk about some of the other findings in the paper, including the statistically significant ones.



At Thursday, April 10, 2008 1:44:00 AM, Blogger Unknown said...


This is great stuff -- I'll need to spend a bit more time absorbing it before commenting.

One question though: Where did you get Hamermesh's data from and is it publicly available?


At Thursday, April 10, 2008 9:33:00 AM, Blogger Phil Birnbaum said...

So far, got everything from Table 2 of the study.

At Thursday, April 10, 2008 9:50:00 AM, Blogger Phil Birnbaum said...

I should clarify ... the regressions I've done have been by generating rows that are equivalent to Table 2. So for the B/B cell, which has 1765 pitches at 30.76% strikes, I created 1765 rows of which 543 are strikes.

Rows have: (a) whether the pitch is a strike, (b) two indicator variables for W and H umps, (c) two indicator variables for W and H pitchers, and (d) indicator variables for UPM variables as described.
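A minimal sketch of that row-generation step for the B/B cell follows. The function and column names are made up for illustration; only the 1765 pitches and 30.76% strike rate come from Table 2 as quoted above.

```python
def expand_cell(pitches, strike_pct, umpire, pitcher):
    """Turn one Table 2 cell into one row per pitch."""
    strikes = round(pitches * strike_pct / 100)
    template = {
        "ump_W": int(umpire == "W"),
        "ump_H": int(umpire == "H"),
        "pit_W": int(pitcher == "W"),
        "pit_H": int(pitcher == "H"),
        "UPM": int(umpire == pitcher),
    }
    # The first `strikes` rows are strikes, the rest are balls.
    return [dict(template, strike=int(i < strikes)) for i in range(pitches)]

rows = expand_cell(1765, 30.76, umpire="B", pitcher="B")
print(len(rows), sum(r["strike"] for r in rows))  # 1765 543
```

Black umpires and black pitchers are the omitted categories here, so all four race indicators are zero for this cell and only UPM is 1.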

At Thursday, April 10, 2008 12:20:00 PM, Blogger Don Coffin said...

Phil--This is quite a substantial piece of work, and it is pretty damned persuasive.

One problem with immensely large data sets is that there will almost always be some statistically significant relationships, even if they're not real. And even if they are real, they may be so small that they don't cross the "practically significant" threshold. I'm guessing the Hamermesh et al. study falls into one or the other of these categories (or both).

