## Tuesday, September 07, 2021

### Are umpires racially biased? A 2021 study (Part II)

(Part I is here.)

20 percent of drivers own diesel cars, and the other 80 percent own regular (gasoline) cars. Diesels are, on average, less reliable than regular cars. The average diesel costs \$2,000 a year in service, while the average regular car only costs \$1,000.

Researchers wonder if there's a way to reduce costs. Maybe diesels cost more partly because mechanics don't like them, or are unfamiliar with them? They create a regression that controls for the model, age, and mileage of the car, as well as driver age and habits. But they also include a variable for whether the mechanic owns the same type of car (diesel or gasoline) as the owner. They call that variable "UTM," or "user/technician match".

They run the regression, and the UTM coefficient turns out negative and significant. It turns out that when the mechanic owns the same type of car as the user, maintenance costs are more than 13 percent lower! The researchers conclude that finding a mechanic who owns the same kind of car as you will substantially reduce your maintenance costs.

But that's not correct. The mechanic makes no difference at all. That 13 percent from the regression is showing something completely different.

If you want to solve this as a puzzle, you can stop reading and try. There's enough information here to figure it out.

-------

The overall average maintenance cost, combining gasoline and diesel, is \$1200. That's the sum of 80 percent of \$1000, plus 20 percent of \$2000.

So what's the average cost for only those cars that match the mechanic's car? My first thought was, it's the same \$1200. Because, if the mechanic's car makes no difference, how can that number change?

But it does change. The reason is: when the user's car matches the mechanic's, it's much less likely to be a diesel. The gasoline owners are over-represented when it comes to matching: each has an 80% chance of being included in the "UTM" sample, while the diesel owner has only a 20% chance.

In the overall population, the ratio of gasoline to diesel is 4:1. But the ratio of "gasoline/gasoline" to "diesel/diesel" is 16:1. So instead of 20%, the proportion of "double diesels" in the "both cars match" population is only 1 in 17, or 5.9%.

That means the average cost of UTM repairs is only \$1059. That's 94.1 percent of \$1000, plus 5.9% of \$2000. That works out to 13.3 percent less than the overall \$1200.

Here's a chart that maybe makes it clearer. Here's how the raw numbers of UTM pairings break down, per 1000 population:

Technician      Gasoline    Diesel    Total
-------------------------------------------
User gasoline     640        160       800
User diesel       160         40       200
-------------------------------------------
Total             800        200      1000

The highlighted diagonal is where the user matches the mechanic. There are 680 cars on that diagonal, but only 40 (1 in 17) are diesel.

In short: the "UTM" coefficient is significant not because matching the mechanic selects better mechanics, but because it selectively samples for more reliable (gasoline) cars.

--------

In the umpire/race study I talked about last post, they had a regression like that, where they put all the umpires and batters together into one regression and looked at the "UBM" variable, where the umpire's race matches the batter's race.
From last post, here's the table the author included. The numbers are umpire errors per 1000 outside-of-zone pitches (negative favors the batter).

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8       ---     +5.9
White batter       +5.6      -4.4      ---

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---

I think I'm able to estimate, from the original study, that the batter population was almost exactly in the 2:3:4 range -- 22 percent Black, 34 percent Hispanic, and 44 percent White. Using those numbers, I'm going to adjust the chart one more time, to show approximately what it would look like if the umpires were exactly alike (no bias) and each column added to zero.

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -2.2     -2.2     -2.2
Hispanic batter    +3.8     +3.8     +3.8
White batter       -1.7     -1.7     -1.7

I chose those numbers so the average UBM (average of diagonals in ratio 22:34:44) is zero, and also to closely fit the actual numbers the study found. That is: suppose you ran a regression using the author's data, but controlling for batter and umpire race.  And suppose there was no racial bias. In that case, you'd get that table, which represents our null hypothesis of no racial bias.

If the null hypothesis is true, what will a regression spit out for UBM? If the batters were represented in their actual ratio, 22:34:44, you'd get zero:

Diagonal          Effect Weight    Product
-------------------------------------------------
Black UBM          -2.2    22%     -0.5
Hispanic UBM       +3.8    34%     +1.5
White UBM          -1.7    44%     -0.8
-------------------------------------------------
Overall UBM               100%     -0.0  per 1000

However: in the actual population in the MLB study, the diagonals do NOT appear in the 22:34:44 ratio. That's because the umpires were overwhelmingly White -- 88 percent White. There were only 5 percent Black umpires, and 7 percent Hispanic umpires. So the White batters matched their umpire much more often than the Hispanic or Black batters.

Using 5:7:88 for umpires, and 22:34:44 for batters, the relative frequency of each combination looks like this. Here's the breakdown per 1000 pitches:

Batter
Umpire             Black   Hispanic  White    Total
---------------------------------------------------
Black batter        11        15      194      220
Hispanic batter     17        24      300      341
White batter        22        31      387      439
---------------------------------------------------
Umpire total        50        70      881     1000

Because there are so few minority umpires, there are only 24 Hispanic/Hispanic pairs out of 422 total matches on the UBM diagonal.  That's only 5.7% Hispanic batters, rather than 34 percent:

Diagonal       Frequency  Percent
----------------------------------
Black UBM            11     2.6%
Hispanic UBM         24     5.7%
White UBM           387    91.7%
----------------------------------
Overall UBM         422     100%

If we calculate the observed average of the diagonal, with this 11/24/387 breakdown, we get this:

Effect  Weight      Product
--------------------------------------------------
Black UBM          -2.2    2.6%    -0.06 per 1000
Hispanic UBM       +3.8    5.7%    +0.22 per 1000
White UBM          -1.7   91.7%    -1.56 per 1000
--------------------------------------------------
Overall UBM                100%    -1.40 per 1000

Hispanic batters receive more bad calls for reasons other than racial bias. By restricting the sample of Hispanic batters to only those who see a Hispanic umpire, we selectively sample fewer Hispanic batters in the UBM pool, and so we get fewer bad calls.

Under the null hypothesis of no bias, UBM plate appearances still see 1.40 fewer bad calls per 100 pitches, because of selective sampling.

------

That 1.40 figure is compared to the overall average. The regression coefficient, however, compares it to the non-UBM case. What's the average of the non-UBM case?

Well, if a UBM happens 422 times out of 1000, and results in 1.40 pitches fewer than average, and the average is zero, then the other 578 times out of 1000, there must have been 1.02 pitches more than average.

Effect  Weight       Product
--------------------------------------------------
UBM                -1.40   42.2%   -0.59 per 1000
Non-UBM            +1.02   57.8%   +0.59 per 1000
--------------------------------------------------
Full sample                100%    -0.00 per 1000

So the coefficient the regression produces -- UBM compared to non-UBM -- will be 2.42.

What did the actual study find? 2.81.

That leaves only 0.39 as the estimate of potential umpire bias:

-2.81  Selective sampling plus possible bias
-2.42  Effect of selective sampling only
---------------------------------------------
-0.39  Revised estimate of possible bias

The study found 2.81 fewer bad calls (per 1000) when the umpire matched the pitcher, but 2.42 of that is selective sampling, leaving only 0.39 that could be umpire bias.

Is that 0.39 statistically significant? I doubt it. For what it's worth, the original estimate had an SD of 0.44. So adjusting for selective sampling, we're less than 1 SD from zero.

--------

So, the conclusion: the study's finding of a 0.28% UBM effect cannot be attributed to umpire bias. It's mostly a natural mathematical artifact resulting from the fact that

(a) Hispanic batters see more incorrect calls for reasons other than bias,

(b) Hispanic umpires are rare, and

(c) The regression didn't control for the race of batter and umpire separately.

Because of that, almost the entire effect the study attributes to racial bias is just selective sampling.