Sabermetric Research: An alternative significance test for the Hamermesh data

A couple of days ago, I looked at the data from the Hamermesh (et al) study on racial bias among umpires. There, I gave an intuitive argument that there is no bias evident between whites and blacks. Specifically, I showed that if black umpires had called only five fewer strikes in favor of black pitchers, that would have wiped out all traces of differential treatment.

Here, I’m going to try to take that informal argument and turn it into a statistical test, kind of. Any statisticians reading this, tell me if I’ve done something wrong.

I’m going to start with the matrix of results from Table 2 of the study, except that I’m going to leave out the column of Asian pitchers, because they won’t affect the results.

Here’s the table:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.80 30.77
Black Umpire-- 31.93 30.87 30.76

(Apologies for the crappy-looking table, but I’ve had trouble with HTML tables in Blogger.)

The numbers are the percentages of non-swung-at pitches that are called strikes.

What I’m going to try to do is find a table where the values are close to the ones above, but where they show no bias. Then, I’m going to check for statistical significance between the two tables.

The most obvious non-biased table might be to make all the rows the same. I set every cell to the overall average of its column, which got me this:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.05 31.45 30.62
Hspnc Umpire-- 32.05 31.45 30.62
Black Umpire-- 32.05 31.45 30.62

If the data had come out like this, we would all agree that umpires aren’t biased at all.

Now, suppose umpires truly aren’t biased, and this is actually the way they call balls and strikes. That means the data came out the way they did only because of random chance. The question is: what is the probability that random chance could turn this beautiful unbiased table into the real table 2?

I’ll start by figuring out how many pitches would have to change to turn the second table into the first. Take the top-left cell, for instance. The difference between 32.06 and 32.05 is 0.01 of a percent. There were 741,729 called pitches in that cell. 0.01 percent of that is 74. So the difference is that in the first table, white umpires called 74 more strikes on white pitchers than expected.

Repeating that calculation for every cell gives this result:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +74 +47 -3
Hspnc Umpire-- -34 +26 +1
Black Umpire-- -56 -81 +2
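
The per-cell arithmetic can be sketched in a few lines of Python. The function name `extra_strikes` is made up for illustration, and the only pitch count used is the 741,729 quoted above for the white/white cell.

```python
def extra_strikes(observed_pct, expected_pct, called_pitches):
    """Strikes above (+) or below (-) what the unbiased table predicts."""
    return (observed_pct - expected_pct) / 100 * called_pitches

# White umpire, white pitcher: 32.06% observed vs. 32.05% expected.
white_white = extra_strikes(32.06, 32.05, 741_729)
print(round(white_white))  # 74
```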

Now, what kind of bias was the study looking for? It was looking to see if same-race umpires call more strikes on same-race pitchers than on different-race pitchers.

The same-race case comprises the three cells on the diagonal; the different-race cases are the other six cells. Adding up the pitches gives:

Same race: 74 + 26 + 2 = +102
Different race: -34 - 56 + 47 - 81 - 3 + 1 = -126

The difference between the two is 228. Relative to the unbiased table, umpires called 228 more strikes in same-race matchups than in different-race matchups. Is 228 big enough to be significant? We’ll check, but, so far, at least we know that the bias goes in the right direction.
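
As a quick check of that bookkeeping, here is a minimal Python sketch that sums the diagonal and off-diagonal cells of the discrepancy table. The variable names are illustrative only.

```python
# Pitch discrepancies from the table above.
# Rows: white, Hispanic, black umpires. Columns: white, Hispanic, black pitchers.
discrepancies = [
    [+74, +47, -3],
    [-34, +26, +1],
    [-56, -81, +2],
]

# Same-race matchups sit on the diagonal; everything else is different-race.
same_race = sum(discrepancies[i][i] for i in range(3))
different_race = sum(
    discrepancies[i][j] for i in range(3) for j in range(3) if i != j
)
bias_measure = same_race - different_race

print(same_race, different_race, bias_measure)  # 102 -126 228
```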

How can we test this for significance? Well, the number of strikes in any cell follows a binomial distribution, so under the unbiased model the discrepancy in each cell has mean zero. The variance is given by p(1-p)n, where p is the probability of a strike (say, 0.31) and n is the number of observations. For the white/white cell, the variance is 0.31 * 0.69 * 741,729, which is about 158,656.

Just for information, here are the variances I calculated for all 9 cells:

Pitcher ------- White Hspnc Black
--------------------------------
White Umpire-- 158656 50680 5370
Hspnc Umpire-- __5260 _1566 _181
Black Umpire-- _10015 _2969 _378
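
Only the white/white pitch count is quoted in the post, but the variance table implies the rest, since n = variance / (p(1-p)). The sketch below back-derives approximate counts that way (so they are inferred from rounded variances, not taken from the study) and uses them to recompute the column averages and the pitch-discrepancy table from earlier.

```python
P_STRIKE = 0.31  # the post's round figure for the called-strike probability

# Variances as printed above (rows: white, Hispanic, black umpires).
variances = [
    [158656, 50680, 5370],
    [5260, 1566, 181],
    [10015, 2969, 378],
]
observed = [
    [32.06, 31.47, 30.61],
    [31.91, 31.80, 30.77],
    [31.93, 30.87, 30.76],
]
null_cols = [32.05, 31.45, 30.62]  # the all-rows-equal unbiased table

# Invert variance = p(1-p)n to get approximate called pitches per cell.
pitches = [[v / (P_STRIKE * (1 - P_STRIKE)) for v in row] for row in variances]

# Pitch-weighted average of each column; these should round to null_cols.
col_avgs = []
for j in range(3):
    total = sum(pitches[i][j] for i in range(3))
    weighted = sum(observed[i][j] * pitches[i][j] for i in range(3))
    col_avgs.append(weighted / total)

# Extra strikes per cell relative to the (rounded) unbiased table.
discrepancies = [
    [round((observed[i][j] - null_cols[j]) / 100 * pitches[i][j]) for j in range(3)]
    for i in range(3)
]

print(round(pitches[0][0]))
print([round(c, 2) for c in col_avgs])
for row in discrepancies:
    print(row)
```

The recovered white/white count lands within a pitch of 741,729, which suggests the inversion is a fair stand-in for the real counts at this level of rounding.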

Now, the calculation of interest is the diagonal cells minus the non-diagonal cells; we did that calculation a few paragraphs ago and got 228. If you number the cells 1-9 starting horizontally, the calculation was

cell 1 + cell 5 + cell 9 - cell 2 - cell 3 - cell 4 - cell 6 - cell 7 - cell 8.

These variables are all independent, so the variance of that expression is just the sum of the variances of the nine terms. Add up the nine variances, and you get 242,338. The standard deviation is the square root of that number, which is about 492.

Our result was a difference of 228 extra strikes. That’s less than one-half an SD away from zero, and so obviously not statistically significant.
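
The test itself is a one-line z-score, sketched below in Python with a normal approximation for the p-value (the helper names are illustrative). One caveat: summing the nine variances exactly as printed in the table gives 235,075 rather than the 242,338 quoted above. The source of the gap is not clear from the post, so the sketch computes the z-score both ways; either way the result is under half a standard deviation.

```python
import math

# Cell variances as printed in the table, read row by row.
variances = [158656, 50680, 5370, 5260, 1566, 181, 10015, 2969, 378]
bias_measure = 228  # same-race minus different-race extra strikes

def z_score(value, variance):
    """Distance from zero in standard deviations."""
    return value / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

z_post = z_score(bias_measure, 242_338)          # variance quoted in the post
z_table = z_score(bias_measure, sum(variances))  # variances as printed

print(round(z_post, 2), round(two_sided_p(z_post), 2))
print(sum(variances), round(z_table, 2))
```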

By the way, this test checks for unbiased umpires, but it doesn't test that the "null hypothesis" unbiased matrix is actually a decent fit overall. To do that, you can just add up the nine pitch-discrepancy cells (that is, add the non-diagonals instead of subtracting). That result should also be normal with mean 0 and SD 492. The farther you get from zero, in either direction, the more likely the test matrix is a crappy fit.

In this case, the total is -24, which is close to zero. So we can conclude that not only is the measure of bias close enough to zero, but also that the "null hypothesis" matrix is a pretty good fit to the data as a whole.

---

But there are lots of other unbiased tables we could test for fit. The one we tried had all the rows the same, but that isn’t necessary. Here’s an alternative “null hypothesis” table:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.32 30.46
Black Umpire-- 31.93 31.34 30.48

This new matrix is as unbiased as the other one, but that may not be obvious when you look at it. It’s different in that, here, we don’t demand that all races of umpires be exactly the same. We just demand that they treat the various pitchers in the same way as the other umpires.

In this table, every race of umpire calls 0.59 percentage points fewer strikes for hispanic pitchers than for white pitchers. And every race of umpire calls 0.86 percentage points fewer strikes for black pitchers than for hispanic pitchers. Sure, white umpires appear to have a bigger strike zone than hispanic or black umpires, but they have that same bigger strike zone regardless of who the pitcher is.
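
One way to see that this table is unbiased is to check that it has no umpire-by-pitcher interaction: every "double difference" against the first row and first column is zero. The Python sketch below does that, and for contrast applies the same check to one cell of the observed table. The `interaction` helper is a name invented for this example.

```python
null_table = [
    [32.06, 31.47, 30.61],  # white umpires
    [31.91, 31.32, 30.46],  # Hispanic umpires
    [31.93, 31.34, 30.48],  # black umpires
]
observed = [
    [32.06, 31.47, 30.61],
    [31.91, 31.80, 30.77],
    [31.93, 30.87, 30.76],
]

def interaction(table, i, j):
    """Double difference of cell (i, j) against the first row and column."""
    return table[i][j] - table[i][0] - table[0][j] + table[0][0]

# Zero everywhere (up to floating-point noise) means umpire race and
# pitcher race affect the strike rate separately, with no same-race effect.
max_null = max(abs(interaction(null_table, i, j)) for i in range(3) for j in range(3))
print(max_null < 1e-9)  # True

# The observed Hispanic/Hispanic cell sits about 0.48 points above that pattern.
print(round(interaction(observed, 1, 1), 2))  # 0.48
```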

If that were the model, would the observed results be statistically significant?

Let’s check. Here is the matrix of pitch discrepancies:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +35 +03
Black Umpire-- +00 -74 +05

Same race: 35 + 5 = +40
Different race: -74 + 3 = -71

So the “own race/different race” difference is 111. Still in the right direction, but even smaller than the previous model. And still a very low significance level, less than a quarter of the standard error of 492.

Even without the formal significance test, we can see intuitively that the result is very close to non-bias. If only 117 pitches were called differently, we would have absolutely no bias at all! And 117 pitches over three seasons is a very small number.
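
For the second model, the same arithmetic can be sketched as follows. It uses the discrepancy table and the variance of 242,338 exactly as quoted in the post; the variable names are illustrative.

```python
import math

# Pitch discrepancies under the second "null hypothesis" table.
discrepancies = [
    [0, 0, 0],
    [0, +35, +3],
    [0, -74, +5],
]

same_race = sum(discrepancies[i][i] for i in range(3))
different_race = sum(
    discrepancies[i][j] for i in range(3) for j in range(3) if i != j
)
bias_measure = same_race - different_race

# Total number of calls that would have to change to match the null table.
pitches_changed = sum(abs(d) for row in discrepancies for d in row)

z = bias_measure / math.sqrt(242_338)  # variance quoted earlier in the post

print(same_race, different_race, bias_measure, pitches_changed)  # 40 -71 111 117
print(z < 0.25)  # True
```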

---

We tested two different models of unbiasedness, and both of them were a good fit with the observed data. But there’s no reason to stop at two. We could build another, and another, and another. That might seem like cheating, but I don’t think it is. After all, when you do a regression, you find the best-fit straight line out of a literal infinity of possible lines. Finding the best-fit matrix (out of all the matrices that represent unbiased umpires) seems like the same idea.

If there actually *were* large amounts of between-race bias in the data, we wouldn’t be able to fit *any* unbiased matrix to the data with low significance levels. It just turns out that the apparent bias in the real-life data was so small that both “null hypothesis” matrices went unrejected with low significance levels.

Which of the two matrices is better? I suppose you could argue that the one with the smaller significance level is better. But this exercise won’t really help tell you what the *real* umpire characteristics are. Maybe all racial groups of umpires do call the game the same, and the first matrix matches real-life better. Or maybe the groups call the game differently, and the second matrix is more realistic. You can’t tell from the data; you have to use your judgement. It’s as if you do a regression and you find the equation comes out to y=4x+1. Are you sure real life is y=4x+1? No, not necessarily; it could really be y=3.99x+1.01; either is consistent with the data. 4x+1 is *more* consistent, but if you have really good reason to think 3.99x+1.01 is better, then, hey, go for it.

So all I think we can say here is that, from the evidence of this test, there is absolutely no reason to suspect overall umpire bias in favor of their own race.

That is, no umpire bias in the sense of “a random same-race encounter having a higher strike probability than a random mixed-race encounter,” because that’s all our test checked for. It’s possible there are other forms of bias, and you’ll need other tests. For instance, suppose the matrix looked like this:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +01 +06 +00
Hspnc Umpire-- -04 +99 -12
Black Umpire-- +06 +99 -01

If you run the above test on this data, you get something extremely insignificant. And that’s correct, because what you’re trying to prove – that a same-race encounter is more likely to lead to a strike – isn’t true. However, there is another manifestation of umpire bias shown in the data -- a same-race encounter involving a *hispanic* umpire will be biased, and a black/hispanic encounter will also be positively biased. And there are other tests, obviously, that will find an effect for that.
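
A short Python sketch makes the blind spot concrete. The overall measure comes out to just 4 pitches, even though two individual cells are large. The per-cell z-scores borrow the variance table from the real data purely for illustration; that pairing is an assumption of this sketch, since the matrix itself is hypothetical.

```python
import math

# The hypothetical discrepancy matrix from the text.
hypothetical = [
    [+1, +6, 0],
    [-4, +99, -12],
    [+6, +99, -1],
]
# Cell variances from the real data, reused here only for illustration.
variances = [
    [158656, 50680, 5370],
    [5260, 1566, 181],
    [10015, 2969, 378],
]

same_race = sum(hypothetical[i][i] for i in range(3))
different_race = sum(
    hypothetical[i][j] for i in range(3) for j in range(3) if i != j
)
bias_measure = same_race - different_race
overall_z = bias_measure / math.sqrt(242_338)  # variance quoted in the post

# Looking at each cell on its own tells a different story.
cell_z = [
    [hypothetical[i][j] / math.sqrt(variances[i][j]) for j in range(3)]
    for i in range(3)
]
largest_cell_z = max(abs(z) for row in cell_z for z in row)

print(bias_measure, round(overall_z, 3))
print(round(largest_cell_z, 2))
```

The two +99 cells nearly cancel in the same-race-minus-different-race sum because one is on the diagonal and the other is not, which is why the overall test sees almost nothing.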

Also, even by our measure, our test may not be powerful enough to pick up an effect. Our measure of “same-race” bias was the overall difference in number of pitches. A different test that considers, say, the sum of squares of the number of pitches in each cell, might be more powerful and find an effect this test didn’t.
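
One way to make the sum-of-squares idea concrete is to divide each squared discrepancy by its cell variance and add up the results, as in the Python sketch below for the first unbiased table. Under the assumptions made for this sketch (independent, roughly normal cells), the total behaves approximately like a chi-square statistic with at most 9 degrees of freedom, and somewhat fewer in practice because the column averages were fitted from the same data. So a total well below 9 would indicate nothing unusual.

```python
# Pitch discrepancies and cell variances for the all-rows-equal table,
# both as printed in the post.
discrepancies = [
    [+74, +47, -3],
    [-34, +26, +1],
    [-56, -81, +2],
]
variances = [
    [158656, 50680, 5370],
    [5260, 1566, 181],
    [10015, 2969, 378],
]

# Each term is a squared z-score for one cell.
terms = [
    [discrepancies[i][j] ** 2 / variances[i][j] for j in range(3)]
    for i in range(3)
]
statistic = sum(t for row in terms for t in row)

print(round(statistic, 2))
for row in terms:
    print([round(t, 3) for t in row])
```

Most of the total comes from a single cell, black umpires with Hispanic pitchers, which a signed sum like the same-race measure cannot show.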

The Hamermesh authors did try to test roughly the same thing we did: the difference between same-race and different-race. In fairness, they did do a more sophisticated analysis, doing a regression and correcting for pitcher and count. But our own test found so little bias – only 117 pitches for complete unbiasedness! – that, even with their extra sophistication, I still wonder how their test was able to come up with statistical significance.

Labels: baseball, economics, race, umpires

7 Comments:

At Thursday, August 16, 2007 1:31:00 PM, Anonymous said...: I appreciate your careful thought on this. I'm no statistician, but I wonder about whether using the overall total percentage of strikes as the unbiased number necessarily makes the differences you calculated small. After all, that unbiased percentage is awfully close to the White umpires' values. What if you supposed that the Black and Hispanic umpires were calling the game "right?" Then your unbiased strike % for White players would be more like 31.92. If you use that in your white-white cell then the difference is actually (by my calculation) 1039 pitches. Whether this is significant or not, I'm not sure, but it certainly is much more than you found. My guess is that the methods used in the Hamermesh study better account for the disparity in sample sizes than you were able to here. Anyway, my two cents. Thanks for the interesting and thought-provoking read!
At Thursday, August 16, 2007 1:54:00 PM, Phil Birnbaum said...: Jeff: you're absolutely right that if you start with the assumption that the black and hispanic (call them "minority") umpires are doing it right, then you indeed get a statistically significant result.

And so your conclusion is then, "IF the minority umpires called exactly the right number of strikes, then there is strong evidence that white umpires are biased." That statement -- note the capital IF -- is a true one.

But the point is not to look for effects under the assumption that the minority umpires aren't biased and the white umpires are, because that assumption might not be correct. The point is to see if you can find assumptions that are consistent with NO bias, and then see if those assumptions are reasonable.

For instance, suppose (white) pitcher X goes 300-for-1000 pitches (30% strikes) against white umpires, but 0-for-3 against black umpires. Is there evidence of bias? Well, if you assume the black umpires called the game right, then there's huge evidence of bias -- the white umpire should have called NO strikes, but he called 300 of them!

The significance test might be correctly done, but the assumption -- that the 0-for-3 is more reliable than the 300-for-1000 -- is ludicrous.

Similarly, assuming the black umpires are unbiased but the white umpires are not is questionable. And even if it weren't questionable, but actually reasonable, that's not enough, because you still can't assume that it's true.

The best you can do is say: "look, there are many reasonable scenarios X where it looks like umpires are biased. But there are many other reasonable scenarios Y where it looks like they're unbiased. Unless you can convince me that the Xs are so much more likely to be true than the Ys, you can't claim to have found evidence of bias."

Put another way: there are good prior subjective and objective reasons to expect non-bias, which is why non-bias is the null hypothesis. To disprove the null hypothesis, you have to show that NO reasonable scenario fits the data. Since we have shown that at least two reasonable scenarios fit the data, the null hypothesis stands.

I could probably explain this better than I have, but I hope it still makes sense.
At Monday, August 20, 2007 2:48:00 AM, Pizza Cutter said...: Phil, you start off in the neighborhood of a chi-square and then take some very strange turns. Your method is intuitively correct, although not what I would recommend in my stats classes. Looks like we have a small, possibly luck-driven effect statistically.
At Monday, August 20, 2007 6:47:00 AM, Phil Birnbaum said...: Ah, a Chi-Square test! I think I vaguely remember how those work; that's probably what I want.

It's the sum of all the ((actual minus expected)^2 / expected), right? I'll look it up.
At Monday, August 20, 2007 10:23:00 AM, Phil Birnbaum said...: Pizza, stats question: can you run that Chi-Squared test in ANY "expected" vs. "observed" situation? Or are there assumptions, such as identical distributions?

That is, I can check if something is Poisson by dividing the observations into groups, and comparing actual to expected, where "expected" is "expected under Poisson". But can I use that same test to compare actual to "expected" where "expected" is just my own invention?

From what I've seen, the test requires that the distribution of the random variable be normal or Poisson (which, in this baseball case, it is). But can the "expected" that you're comparing it to be arbitrary?
At Monday, August 20, 2007 10:30:00 AM, Phil Birnbaum said...: Actually, now that I think about it, I wonder if the Chi-Squared test isn't strong enough.

Chi-squared will tell us if the "expected" is a good match for the "actual". But it will do that by looking at all 9 cells. What we want is to compare the three diagonal cells to the other 6, and only something that treats the two groups of cells differently can do that.
At Monday, August 20, 2007 3:47:00 PM, Unknown said...: "Pizza, stats question: can you run that Chi-Squared test in ANY "expected" vs. "observed" situation? Or are there assumptions, such as identical distributions?"

Yes, absolutely you can.

Phil -- I think the Chi squared test is powerful enough because what you want to compare is whether, say:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.80 30.77
Black Umpire-- 31.93 30.87 30.76

Is similar to:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.05 31.45 30.62
Hspnc Umpire-- 32.05 31.45 30.62
Black Umpire-- 32.05 31.45 30.62

The only drawback of the Chi squared is that it won't tell you which direction the bias is in and how much bias there is. All it will tell you is whether it is there or not, which, right now, is all we really care about.

Thursday, August 16, 2007
