A couple of days ago, I looked at the data from the Hamermesh (et al) study on racial bias among umpires. There, I gave an intuitive argument that there is no bias evident between whites and blacks. Specifically, I showed that if black umpires had called only five fewer strikes in favor of black pitchers, that would have wiped out all traces of differential treatment.
Here, I’m going to try to take that informal argument and turn it into a statistical test, kind of. Any statisticians reading this, tell me if I’ve done something wrong.
I’m going to start with the matrix of results from Table 2 of the study, except that I’m going to leave out the column of Asian pitchers, because they won’t affect the results.
Here’s the table:
Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.80 30.77
Black Umpire-- 31.93 30.87 30.76
(Apologies for the crappy-looking table, but I’ve had trouble with HTML tables in Blogger.)
The numbers are the percentages of non-swung-at pitches that are called strikes.
What I’m going to try to do is find a table where the values are close to the ones above, but where they show no bias. Then, I’m going to check for statistical significance between the two tables.
The most obvious non-biased table might be to make all the rows the same. I set every cell to the overall average of its column, which got me this:
Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.05 31.45 30.62
Hspnc Umpire-- 32.05 31.45 30.62
Black Umpire-- 32.05 31.45 30.62
If the data had come out like this, we would all agree that umpires aren’t biased at all.
Now, suppose umpires truly aren’t biased, and this is actually the way they call balls and strikes. That means the data came out the way they did only because of random chance. The question is: what is the probability that random chance could turn this beautiful unbiased table into the real table 2?
I’ll start by figuring out how many pitches would have to change to turn the second table into the first. Take the top-left cell, for instance. The difference between 32.06 and 32.05 is 0.01 of a percent. There were 741,729 called pitches in that cell. 0.01 percent of that is 74. So the difference is that in the first table, white umpires called 74 more strikes on white pitchers than expected.
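That arithmetic is easy to script. Here's a minimal sketch in Python; the function name `extra_strikes` is just something made up for illustration, and the only pitch count plugged in is the 741,729 quoted above.

```python
def extra_strikes(observed_pct, expected_pct, called_pitches):
    """Convert a gap in called-strike percentage into a number of pitches."""
    gap = (observed_pct - expected_pct) / 100.0  # percentage points -> fraction
    return round(gap * called_pitches)

# White umpire / white pitcher cell: 32.06% observed vs. 32.05% in the unbiased table
white_white = extra_strikes(32.06, 32.05, 741_729)
print(white_white)  # 74
```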
Repeating that calculation for every cell gives this result:
Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +74 +47 -3
Hspnc Umpire-- -34 +26 +1
Black Umpire-- -56 -81 +2
Now, what kind of bias was the study looking for? It was looking to see whether umpires call more strikes on pitchers of their own race than on pitchers of a different race.
The same-race case comprises the three cells on the diagonal; the different-race cases are the other six cells. Adding up the pitches gives:
Same race: 74 + 26 + 2 = +102
Different race: -34 - 56 + 47 - 81 - 3 + 1 = -126
The difference between the two is 228. That is, relative to the unbiased table, umpires called 228 more strikes in same-race matchups than in different-race matchups. Is 228 big enough to be significant? We’ll check, but, so far, at least we know that the bias goes in the right direction.
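Here's that tally as a short Python sketch. The matrix is just the pitch-discrepancy table above, with umpires as rows and pitchers as columns; the helper name is mine.

```python
# Pitch discrepancies (observed minus unbiased), rows = umpire race,
# columns = pitcher race, both in the order white, hispanic, black.
discrepancies = [
    [74, 47, -3],
    [-34, 26, 1],
    [-56, -81, 2],
]

def same_vs_different(matrix):
    """Return (same-race total, different-race total, difference)."""
    same = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    different = total - same
    return same, different, same - different

same, different, diff = same_vs_different(discrepancies)
print(same, different, diff)  # 102 -126 228
```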
How can we test this for significance? Well, under the null hypothesis, the number of strikes in any cell follows a binomial distribution, so the discrepancy from the expected count has a mean of zero. The variance is given by p(1-p)n, where p is the probability of a strike (say, 0.31) and n is the number of observations. For the white/white cell, the variance is 0.31 * 0.69 * 741,729, which equals 158656.
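In Python, the variance for a single cell is a one-liner. This is only a sketch: the 0.31 strike probability is the rough figure used above, not a fitted value, and the function name is arbitrary.

```python
def binomial_variance(n, p=0.31):
    """Variance of a binomial count: n trials, success probability p."""
    return p * (1 - p) * n

# White umpire / white pitcher cell, 741,729 called pitches
var_ww = binomial_variance(741_729)
print(round(var_ww))  # 158656
```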
Just for information, here are the variances I calculated for all 9 cells:
Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 158656 50680 5370
Hspnc Umpire-- 5260 1566 181
Black Umpire-- 10015 2969 378
Now, the calculation of interest is the diagonal cells minus the non-diagonal cells; we did that calculation a few paragraphs ago and got 228. If you number the cells 1-9 starting horizontally, the calculation was
cell 1 + cell 5 + cell 9 – cell 2 – cell 3 – cell 4 – cell 6 – cell 7 – cell 8.
These variables are all independent, so the variance of that expression is just the sum of the variances of the nine terms. Add up the nine variances, and you get 242,338. The standard deviation is the square root of that number, which is about 492.
Our result was a difference of 228 extra strikes. That’s less than one-half an SD away from zero, and so obviously not statistically significant.
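To make the test concrete, here's a rough Python sketch that goes from the nine variances to a z-score and a two-sided p-value, treating the statistic as approximately normal. One caveat: the nine variances as printed in the table sum to 235,075 rather than the 242,338 quoted, which gives an SD of about 485 instead of 492. I can't tell from the table alone where the gap comes from, but the answer is the same either way, so the sketch reports both.

```python
import math

# Variances as printed in the table (rows = umpires, columns = pitchers)
variances = [
    158656, 50680, 5370,
    5260, 1566, 181,
    10015, 2969, 378,
]

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

observed = 228                      # same-race minus different-race strikes
sd_from_table = math.sqrt(sum(variances))
sd_quoted = 492                     # SD quoted in the text

for label, sd in [("table", sd_from_table), ("quoted", sd_quoted)]:
    z = observed / sd
    print(f"{label}: SD = {sd:.0f}, z = {z:.2f}, p = {two_sided_p(z):.2f}")
```

A p-value well above 0.5 means a gap this size or bigger would show up more often than not by chance alone.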
By the way, this test checks for unbiased umpires, but it doesn't test that the "null hypothesis" unbiased matrix is actually a decent fit overall. To do that, you can just add up the nine pitch-discrepancy cells (that is, add the non-diagonals instead of subtracting). That result should also be normal with mean 0 and SD 492. The farther you get from zero, in either direction, the more likely the test matrix is a crappy fit.
In this case, the total is -24, which is close to zero. So we can conclude that not only is the measure of bias close enough to zero, but also that the "null hypothesis" matrix is a pretty good fit to the data as a whole.
---
But there are lots of other unbiased tables we could test for fit. The one we tried had all the rows the same, but that isn’t necessary. Here’s an alternative “null hypothesis” table:
Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.32 30.46
Black Umpire-- 31.93 31.34 30.48
This new matrix is as unbiased as the other one, but that may not be obvious when you look at it. It’s different in that, here, we don’t demand that all races of umpires be exactly the same. We just demand that they treat the various pitchers in the same way as the other umpires.
In this table, every race of umpire calls 0.59 percentage points fewer strikes for hispanic pitchers than for white pitchers. And every race of umpire calls 0.86 percentage points fewer strikes for black pitchers than for hispanic pitchers. Sure, white umpires appear to have a bigger strike zone than hispanic or black umpires, but they have that same bigger strike zone regardless of who the pitcher is.
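One way to build a table like this, sketched in Python: take the white umpires' row as the baseline and shift each other row by however much that umpire group differs from white umpires on white pitchers. The function and variable names are made up for illustration, and this is only one of many unbiased tables you could construct.

```python
# Observed strike percentages used as anchors: white umpires' full row, plus each
# umpire group's rate on white pitchers (first column of the observed table).
baseline_row = [32.06, 31.47, 30.61]  # white umpires vs. white/hispanic/black pitchers
first_column = {"white": 32.06, "hispanic": 31.91, "black": 31.93}

def additive_null_table(baseline, first_col):
    """Shift the baseline row by a constant per umpire group (no interaction)."""
    table = {}
    for umpire, rate_on_white in first_col.items():
        offset = rate_on_white - baseline[0]
        table[umpire] = [round(cell + offset, 2) for cell in baseline]
    return table

null_table = additive_null_table(baseline_row, first_column)
for umpire, row in null_table.items():
    print(umpire, row)
```

Because every row is the same shape, just moved up or down, no umpire group treats any pitcher group differently from how the other umpires do.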
If that were the model, would the observed results be statistically significant?
Let’s check. Here is the matrix of pitch discrepancies:
Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +35 +03
Black Umpire-- +00 -74 +05
Same race: 35 + 5 = +40
Different race: -74 + 3 = -71
So the “own race/different race” difference is 111. Still in the right direction, but even smaller than the previous model. And still nowhere near statistical significance: it’s less than a quarter of the standard deviation of 492.
Even without the formal significance test, we can see intuitively that the result is very close to non-bias. If only 117 pitches were called differently, we would have absolutely no bias at all! And 117 pitches over three seasons is a very small number.
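Here's the same bookkeeping for this second model as a Python sketch. It also shows one way to arrive at the 117: the absolute discrepancies in the table add up to that number. The SD of 492 is carried over from the first test.

```python
# Pitch discrepancies against the second null table
# (rows = umpires, columns = pitchers; order white, hispanic, black).
discrepancies = [
    [0, 0, 0],
    [0, 35, 3],
    [0, -74, 5],
]

same = sum(discrepancies[i][i] for i in range(3))
different = sum(sum(row) for row in discrepancies) - same
difference = same - different
pitches_to_flip = sum(abs(cell) for row in discrepancies for cell in row)

SD = 492  # standard deviation reused from the first model
print(same, different, difference)   # 40 -71 111
print(pitches_to_flip)               # 117
print(round(difference / SD, 2))     # 0.23
```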
---
We tested two different models of unbiasedness, and both of them were a good fit with the observed data. But there’s no reason to stop at two. We could build another, and another, and another. That might seem like cheating, but I don’t think it is. After all, when you do a regression, you find the best-fit straight line out of a literal infinity of possible lines. Finding the best-fit matrix (out of all the matrices that represent unbiased umpires) seems like the same idea.
If there actually *were* large amounts of between-race bias in the data, we wouldn’t be able to find *any* unbiased matrix that fit the data without a statistically significant discrepancy. It just turns out that the apparent bias in the real-life data was so small that neither “null hypothesis” matrix could be rejected.
Which of the two matrices is better? I suppose you could argue that the one whose result sits further from statistical significance is better. But this exercise won’t really help tell you what the *real* umpire characteristics are. Maybe all racial groups of umpires do call the game the same, and the first matrix matches real-life better. Or maybe the groups call the game differently, and the second matrix is more realistic. You can’t tell from the data; you have to use your judgement. It’s as if you do a regression and you find the equation comes out to y=4x+1. Are you sure real life is y=4x+1? No, not necessarily; it could really be y=3.99x+1.01; either is consistent with the data. 4x+1 is *more* consistent, but if you have really good reason to think 3.99x+1.01 is better, then, hey, go for it.
So all I think we can say here is that, from the evidence of this test, there is absolutely no reason to suspect overall umpire bias in favor of their own race.
That is, no umpire bias in the sense of “a random same-race encounter having a higher strike probability than a random mixed-race encounter,” because that’s all our test checked for. It’s possible there are other forms of bias, and you’ll need other tests. For instance, suppose the matrix looked like this:
Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +01 +06 +00
Hspnc Umpire-- -04 +99 -12
Black Umpire-- +06 +99 -01
If you run the above test on this data, you get something extremely insignificant. And that’s correct, because what you’re trying to prove – that a same-race encounter is more likely to lead to a strike – isn’t true. However, there is another manifestation of umpire bias shown in the data -- a same-race encounter involving a *hispanic* umpire will be biased, and a black/hispanic encounter will also be positively biased. And there are other tests, obviously, that will find an effect for that.
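To see how the diagonal-minus-the-rest statistic misses this pattern, here's a quick Python check on that hypothetical matrix.

```python
# Hypothetical discrepancy matrix from the example
# (rows = umpires, columns = pitchers; order white, hispanic, black).
hypothetical = [
    [1, 6, 0],
    [-4, 99, -12],
    [6, 99, -1],
]

same = sum(hypothetical[i][i] for i in range(3))
different = sum(sum(row) for row in hypothetical) - same
print(same - different)  # 4

# The overall statistic is tiny, yet two individual cells are large.
largest = max(abs(cell) for row in hypothetical for cell in row)
print(largest)  # 99
```

The two +99 cells land on opposite sides of the subtraction, so they cancel almost exactly.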
Also, even by our measure, our test may not be powerful enough to pick up an effect. Our measure of “same-race” bias was the overall difference in number of pitches. A different test that considers, say, the sum of squares of the number of pitches in each cell, might be more powerful and find an effect this test didn’t.
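As a rough illustration of what a sum-of-squares test might look like, here's a Python sketch that scales each squared discrepancy from the first model by its cell's variance, chi-square style. The scaling is my own assumption about how you'd make cells of very different sizes comparable, and I haven't worked out the right degrees of freedom, so this stops short of a p-value.

```python
# First-model discrepancies and variances, read row by row
# (rows = umpires, columns = pitchers; order white, hispanic, black).
discrepancies = [74, 47, -3, -34, 26, 1, -56, -81, 2]
variances = [158656, 50680, 5370, 5260, 1566, 181, 10015, 2969, 378]

def scaled_sum_of_squares(diffs, cell_variances):
    """Sum of squared discrepancies, each divided by its cell's variance."""
    return sum(d * d / v for d, v in zip(diffs, cell_variances))

stat = scaled_sum_of_squares(discrepancies, variances)
print(round(stat, 2))
```

Unlike the signed total, positive and negative cells can't cancel here, so a statistic like this reacts to any single cell that strays far from the null table.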
The Hamermesh authors did try to test roughly the same thing we did: the difference between same-race and different-race. In fairness, they did do a more sophisticated analysis, doing a regression and correcting for pitcher and count. But our own test found so little bias – only 117 pitches for complete unbiasedness! – that, even with their extra sophistication, I still wonder how their test was able to come up with statistical significance.