Sabermetric Research: Are umpires racially biased? A 2021 study (Part II)

(Part I is here.)

20 percent of drivers own diesel cars, and the other 80 percent own regular (gasoline) cars. Diesels are, on average, less reliable than regular cars. The average diesel costs $2,000 a year in service, while the average regular car only costs $1,000.

Researchers wonder if there's a way to reduce costs. Maybe diesels cost more partly because mechanics don't like them, or are unfamiliar with them? They create a regression that controls for the model, age, and mileage of the car, as well as driver age and habits. But they also include a variable for whether the mechanic owns the same type of car (diesel or gasoline) as the owner. They call that variable "UTM," or "user/technician match".

They run the regression, and the UTM coefficient turns out negative and significant. It turns out that when the mechanic owns the same type of car as the user, maintenance costs are more than 13 percent lower! The researchers conclude that finding a mechanic who owns the same kind of car as you will substantially reduce your maintenance costs.

But that's not correct. The mechanic makes no difference at all. That 13 percent from the regression is showing something completely different.

If you want to solve this as a puzzle, you can stop reading and try. There's enough information here to figure it out.

-------

The overall average maintenance cost, combining gasoline and diesel, is $1200. That's the sum of 80 percent of $1000, plus 20 percent of $2000.

So what's the average cost for only those cars that match the mechanic's car? My first thought was, it's the same $1200. Because, if the mechanic's car makes no difference, how can that number change?

But it does change. The reason is: when the user's car matches the mechanic's, it's much less likely to be a diesel. The gasoline owners are over-represented when it comes to matching: each gasoline owner has an 80% chance of being included in the "UTM" sample, while each diesel owner has only a 20% chance.

In the overall population, the ratio of gasoline to diesel is 4:1. But the ratio of "gasoline/gasoline" to "diesel/diesel" is 16:1. So instead of 20%, the proportion of "double diesels" in the "both cars match" population is only 1 in 17, or 5.9%.

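That means the average cost of UTM repairs is only $1059. That's 94.1 percent of $1000, plus 5.9 percent of $2000. That is $141 less than the overall $1200: a gap of 13.3 percent when measured against the $1059 UTM average, or 11.8 percent when measured against the overall $1200.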
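For anyone who wants to check that arithmetic, here is a minimal Python sketch. The variable names are illustrative. It assumes, as the example implicitly does, that mechanics own diesels and gasoline cars in the same 20/80 proportions as drivers and are paired with customers at random.

```python
from fractions import Fraction

# Figures from the example: 20% diesel at $2000/year, 80% gasoline at $1000/year.
share = {"gasoline": Fraction(4, 5), "diesel": Fraction(1, 5)}
cost = {"gasoline": 1000, "diesel": 2000}

# Overall average cost, weighting each car type by its population share.
overall_avg = sum(share[t] * cost[t] for t in share)

# Assumption: mechanics own cars in the same 80/20 proportions and are paired
# with customers at random, so P(match and type t) = share[t] ** 2.
match_weight = {t: share[t] ** 2 for t in share}
match_total = sum(match_weight.values())

# Share of diesels, and average cost, among matching pairs only.
diesel_share_in_utm = match_weight["diesel"] / match_total
utm_avg = sum(match_weight[t] * cost[t] for t in share) / match_total

print(float(overall_avg))        # 1200.0
print(diesel_share_in_utm)       # 1/17
print(round(float(utm_avg), 2))  # 1058.82
```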

Here's a chart that may make it clearer. It shows how the raw numbers of user/technician pairings break down, per 1000 cars:

Technician:       Gasoline   Diesel   Total
-------------------------------------------
User gasoline        640       160     800
User diesel          160        40     200
-------------------------------------------
Total                800       200    1000

The diagonal, gasoline/gasoline (640) and diesel/diesel (40), is where the user matches the mechanic. There are 680 cars on that diagonal, but only 40 (1 in 17) are diesel.

In short: the "UTM" coefficient is significant not because matching the mechanic selects better mechanics, but because it selectively samples for more reliable (gasoline) cars.
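The same point can be sketched from the chart's four cells. This is an illustration, not a fitted regression: it relies on the fact that, with UTM as the only predictor, the regression coefficient equals the difference between the matching and non-matching group means. The helper `avg_cost` and the dictionary names are made up for this sketch, and the non-matching average it prints is derived from the chart rather than quoted in the text.

```python
# Cars per 1000, keyed by (user's car type, mechanic's car type), as in the chart.
counts = {
    ("gasoline", "gasoline"): 640,
    ("gasoline", "diesel"): 160,
    ("diesel", "gasoline"): 160,
    ("diesel", "diesel"): 40,
}
# Cost depends only on the user's car; the mechanic has no effect by construction.
cost = {"gasoline": 1000, "diesel": 2000}


def avg_cost(cells):
    """Count-weighted average cost over a list of (user, mechanic) cells."""
    total = sum(counts[c] for c in cells)
    return sum(counts[c] * cost[c[0]] for c in cells) / total


match = [c for c in counts if c[0] == c[1]]
no_match = [c for c in counts if c[0] != c[1]]

# Pooled comparison: what a regression on UTM alone would pick up.
pooled_gap = avg_cost(match) - avg_cost(no_match)

# Comparison within each car type: what is left once car type is held fixed.
within_gap = {
    t: avg_cost([(t, t)]) - avg_cost([c for c in no_match if c[0] == t])
    for t in cost
}

print(round(avg_cost(match), 2), avg_cost(no_match))  # 1058.82 1500.0
print(round(pooled_gap, 2))                           # -441.18
print(within_gap)                                     # {'gasoline': 0.0, 'diesel': 0.0}
```

The pooled gap is large even though the gap within each car type is exactly zero, which is the selective-sampling effect in miniature.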

--------

In the umpire/race study I talked about last post, the author ran a regression like that: all the umpires and batters went into one regression, with a "UBM" variable marking whether the umpire's race matches the batter's race.

From last post, here's the table the author included. The numbers are umpire errors per 1000 outside-of-zone pitches (negative favors the batter).

Umpire:            Black   Hispanic   White
-------------------------------------------
Black batter        ---      -5.3     -0.3
Hispanic batter    +7.8      ---      +5.9
White batter       +5.6      -4.4      ---

I had adjusted that to equalize the baseline:

Umpire:            Black   Hispanic   White
-------------------------------------------
Black batter       -5.6      -0.9     -0.3
Hispanic batter    +2.2      +4.4     +5.9
White batter        ---      ---       ---

I think I'm able to estimate, from the original study, that the batter population was almost exactly in a 2:3:4 ratio -- 22 percent Black, 34 percent Hispanic, and 44 percent White. Using those numbers, I'm going to adjust the chart one more time, to show approximately what it would look like if the umpires were exactly alike (no bias) and each column added to zero.

Umpire:            Black   Hispanic   White
-------------------------------------------
Black batter       -2.2      -2.2     -2.2
Hispanic batter    +3.8      +3.8     +3.8
White batter       -1.7      -1.7     -1.7

I chose those numbers so the average UBM (average of diagonals in ratio 22:34:44) is zero, and also to closely fit the actual numbers the study found. That is: suppose you ran a regression using the author's data, but controlling for batter and umpire race. And suppose there was no racial bias. In that case, you'd get that table, which represents our null hypothesis of no racial bias.

If the null hypothesis is true, what will a regression spit out for UBM? If the batters were represented in their actual ratio, 22:34:44, you'd get zero:

Diagonal          Effect   Weight   Product
-------------------------------------------------
Black UBM          -2.2      22%     -0.5
Hispanic UBM       +3.8      34%     +1.3
White UBM          -1.7      44%     -0.8
-------------------------------------------------
Overall UBM                 100%      0.0 per 1000
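A quick way to check that weighted average is the sketch below. It uses the rounded effects from the no-bias table and the 22:34:44 batter shares. The dictionary names are illustrative.

```python
# Batter-race effects from the no-bias table (bad calls per 1000 pitches).
effect = {"Black": -2.2, "Hispanic": 3.8, "White": -1.7}
# Batter shares as estimated in the post.
batter_share = {"Black": 0.22, "Hispanic": 0.34, "White": 0.44}

# Each diagonal cell, weighted by how common that batter group is.
products = {race: effect[race] * batter_share[race] for race in effect}
overall_ubm = sum(products.values())

for race, p in products.items():
    print(f"{race:<9}{p:+.2f}")
print(f"Overall  {overall_ubm:+.2f}")
```

With these rounded inputs the total comes out at about +0.06 per 1000, which is zero to the precision the table uses.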

However: in the actual population in the MLB study, the diagonals do NOT appear in the 22:34:44 ratio. That's because the umpires were overwhelmingly White -- 88 percent White. There were only 5 percent Black umpires, and 7 percent Hispanic umpires. So the White batters matched their umpire much more often than the Hispanic or Black batters.

Using 5:7:88 for umpires, and 22:34:44 for batters, the relative frequency of each combination looks like this. Here's the breakdown per 1000 pitches:

Umpire:            Black   Hispanic   White   Total
---------------------------------------------------
Black batter         11       15       194     220
Hispanic batter      17       24       300     341
White batter         22       31       387     439
---------------------------------------------------
Umpire total         50       70       881    1000
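The cell counts can be reproduced from the margins alone. The sketch below assumes batters and umpires are paired independently of race, which is what multiplying the two sets of shares amounts to. It takes the umpire shares from the "Umpire total" row (50, 70 and 881 per 1000) rather than the rounded 5:7:88. The names are illustrative.

```python
# Margins read off the table: batters per 1000 pitches, umpire shares.
batter_per_1000 = {"Black": 220, "Hispanic": 341, "White": 439}
umpire_share = {"Black": 0.050, "Hispanic": 0.070, "White": 0.881}

# Assumption: pairing is independent of race, so each cell is simply
# (batter count) x (umpire share), rounded to a whole number of pitches.
table = {
    b: {u: round(n * s) for u, s in umpire_share.items()}
    for b, n in batter_per_1000.items()
}

for b, row in table.items():
    print(f"{b + ' batter':<16}" + "".join(f"{v:>6}" for v in row.values()))

# The UBM diagonal: batter and umpire of the same race.
diagonal = {race: table[race][race] for race in table}
print(diagonal, sum(diagonal.values()))
```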

Because there are so few minority umpires, there are only 24 Hispanic/Hispanic pairs out of 422 total matches on the UBM diagonal. That's only 5.7% Hispanic batters, rather than 34 percent:

Diagonal        Frequency   Percent
----------------------------------
Black UBM           11        2.6%
Hispanic UBM        24        5.7%
White UBM          387       91.7%
----------------------------------
Overall UBM        422        100%

If we calculate the observed average of the diagonal, with this 11/24/387 breakdown, we get this:

                  Effect   Weight   Product
--------------------------------------------------
Black UBM          -2.2     2.6%    -0.06 per 1000
Hispanic UBM       +3.8     5.7%    +0.22 per 1000
White UBM          -1.7    91.7%    -1.56 per 1000
--------------------------------------------------
Overall UBM                 100%    -1.40 per 1000
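Here is the same calculation as a short sketch, weighting by the diagonal counts (11, 24, 387) directly instead of by rounded percentages. The variable names are illustrative.

```python
# No-bias effects by batter race (bad calls per 1000 pitches).
effect = {"Black": -2.2, "Hispanic": 3.8, "White": -1.7}
# How often each same-race pairing occurs, per 1000 pitches.
diagonal = {"Black": 11, "Hispanic": 24, "White": 387}

matches = sum(diagonal.values())

# Average effect over UBM pitches, weighted by actual pairing frequency.
ubm_avg = sum(effect[r] * diagonal[r] for r in effect) / matches

print(matches)           # 422
print(f"{ubm_avg:.2f}")  # -1.40
```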

Hispanic batters receive more bad calls for reasons other than racial bias. By restricting the sample of Hispanic batters to only those who see a Hispanic umpire, we selectively sample fewer Hispanic batters in the UBM pool, and so we get fewer bad calls.

Under the null hypothesis of no bias, UBM plate appearances still see 1.40 fewer bad calls per 1000 pitches, because of selective sampling.

------

That 1.40 figure is compared to the overall average. The regression coefficient, however, compares it to the non-UBM case. What's the average of the non-UBM case?

Well, if a UBM happens 422 times out of 1000, and results in 1.40 fewer bad calls (per 1000 pitches) than average, and the average is zero, then the other 578 times out of 1000, there must have been 1.02 more bad calls than average.

                  Effect   Weight   Product
--------------------------------------------------
UBM                -1.40    42.2%   -0.59 per 1000
Non-UBM            +1.02    57.8%   +0.59 per 1000
--------------------------------------------------
Full sample                  100%   -0.00 per 1000

So the coefficient the regression produces -- UBM compared to non-UBM -- will be 2.42.
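That step can be sketched as follows. The signs are kept explicit, so the coefficient comes out negative (UBM pitches see fewer bad calls than non-UBM pitches). The variable names are illustrative.

```python
# From the tables: UBM pitches are 422 of every 1000 and average -1.40.
ubm_share = 422 / 1000
ubm_avg = -1.40

# The full-sample average is zero, so the non-UBM average must offset
# the UBM average exactly: share * ubm_avg + (1 - share) * non_ubm_avg = 0.
non_ubm_avg = -ubm_avg * ubm_share / (1 - ubm_share)

# The regression coefficient compares UBM to non-UBM, not to the overall mean.
coefficient = ubm_avg - non_ubm_avg

print(f"{non_ubm_avg:+.2f}")  # +1.02
print(f"{coefficient:+.2f}")  # -2.42
```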

What did the actual study find? 2.81.

That leaves only 0.39 as the estimate of potential umpire bias:

-2.81 Selective sampling plus possible bias
-2.42 Effect of selective sampling only
---------------------------------------------
-0.39 Revised estimate of possible bias

The study found 2.81 fewer bad calls (per 1000) when the umpire matched the batter, but 2.42 of that is selective sampling, leaving only 0.39 that could be umpire bias.

Is that 0.39 statistically significant? I doubt it. For what it's worth, the original estimate had an SD of 0.44. So adjusting for selective sampling, we're less than 1 SD from zero.
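As a rough check of that last comparison, the sketch below reuses the study's reported SD of 0.44 for the adjusted estimate. That reuse is an assumption carried over from the paragraph above, not a proper standard error for the revised figure.

```python
study_coefficient = -2.81  # UBM coefficient reported by the study
sampling_only = -2.42      # coefficient expected from selective sampling alone
reported_sd = 0.44         # SD of the study's original estimate

# What is left over once the selective-sampling effect is removed.
residual = study_coefficient - sampling_only

# Assumption: the original SD is a fair stand-in for the residual's SD.
sds_from_zero = abs(residual) / reported_sd

print(f"{residual:+.2f}")       # -0.39
print(f"{sds_from_zero:.2f}")   # 0.89
```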

--------

So, the conclusion: the study's finding of a 0.28% UBM effect cannot be attributed to umpire bias. It's mostly a natural mathematical artifact resulting from the fact that

(a) Hispanic batters see more incorrect calls for reasons other than bias,

(b) Hispanic umpires are rare, and

(c) the regression didn't include batter race or umpire race as controls.

Because of that, almost the entire effect the study attributes to racial bias is just selective sampling.

2 comments:

larry, Wednesday, September 08, 2021 8:35:00 AM
Maybe I’m missing something. Didn’t they put type of car (gas or diesel) into the regression model? If they did, wouldn’t the UTM show if the match explained the cost over and above the car type?

Phil Birnbaum, Wednesday, September 08, 2021 1:03:00 PM
Right, they didn't put batter race or umpire race into the regression model, which is why they got the results they did.

(In my illustration, I didn't put type of car in the regression model either.)

Tuesday, September 07, 2021
