Tuesday, September 07, 2021

Are umpires racially biased? A 2021 study (Part II)

(Part I is here.)


20 percent of drivers own diesel cars, and the other 80 percent own regular (gasoline) cars. Diesels are, on average, less reliable than regular cars. The average diesel costs $2,000 a year in service, while the average regular car only costs $1,000. 

Researchers wonder if there's a way to reduce costs. Maybe diesels cost more partly because mechanics don't like them, or are unfamiliar with them? They create a regression that controls for the model, age, and mileage of the car, as well as driver age and habits. But they also include a variable for whether the mechanic owns the same type of car (diesel or gasoline) as the owner. They call that variable "UTM," or "user/technician match".

They run the regression, and the UTM coefficient turns out negative and significant. It turns out that when the mechanic owns the same type of car as the user, maintenance costs are about 12 percent lower! The researchers conclude that finding a mechanic who owns the same kind of car as you will substantially reduce your maintenance costs.

But that's not correct. The mechanic makes no difference at all. That 12 percent from the regression is showing something completely different.

If you want to solve this as a puzzle, you can stop reading and try. There's enough information here to figure it out. 

-------

The overall average maintenance cost, combining gasoline and diesel, is $1200. That's the sum of 80 percent of $1000, plus 20 percent of $2000.

So what's the average cost for only those cars that match the mechanic's car? My first thought was, it's the same $1200. Because, if the mechanic's car makes no difference, how can that number change?

But it does change. The reason is: when the user's car matches the mechanic's, it's much less likely to be a diesel. The gasoline owners are over-represented when it comes to matching: each gasoline owner has an 80% chance of being included in the "UTM" sample, while each diesel owner has only a 20% chance.

In the overall population, the ratio of gasoline to diesel is 4:1. But the ratio of "gasoline/gasoline" to "diesel/diesel" is 16:1. So instead of 20%, the proportion of "double diesels" in the "both cars match" population is only 1 in 17, or 5.9%.

That means the average cost of UTM repairs is only $1059 -- that's 94.1 percent of $1000, plus 5.9 percent of $2000. That works out to about 12 percent less than the overall $1200.

Here's a chart that maybe makes it clearer, breaking down the raw numbers of user/technician pairings per 1,000 cars (the columns are the mechanic's car, the rows are the user's):

Technician      Gasoline    Diesel    Total
-------------------------------------------
User gasoline     640        160       800
User diesel       160         40       200
-------------------------------------------
Total             800        200      1000 
 

The diagonal -- the 640 gasoline/gasoline pairings and the 40 diesel/diesel pairings -- is where the user matches the mechanic. There are 680 cars on that diagonal, but only 40 (1 in 17) are diesel.

In short: the "UTM" coefficient is significant not because matching the mechanic selects better mechanics, but because it selectively samples for more reliable (gasoline) cars.
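If you want to check the arithmetic, here's a short Python sketch that reproduces it, using the made-up numbers from the example (the variable names are just mine):

# Selective sampling in the owner/mechanic example (hypothetical numbers from above)
p_gas, p_diesel = 0.80, 0.20          # share of owners, and of mechanics
cost_gas, cost_diesel = 1000, 2000    # average annual maintenance cost

overall = p_gas * cost_gas + p_diesel * cost_diesel              # $1,200

# Among owner/mechanic matches, diesel's share shrinks from 20% to 4/68 (about 5.9%)
diesel_share_matched = p_diesel**2 / (p_gas**2 + p_diesel**2)
matched = (1 - diesel_share_matched) * cost_gas + diesel_share_matched * cost_diesel

print(overall, round(matched), round(100 * (overall - matched) / overall, 1))
# -> 1200.0 1059 11.8 : the matched group looks about 12 percent cheaper,
#    even though no mechanic treats any car differently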

--------

In the umpire/race study I talked about last post, they had a regression like that, where they put all the umpires and batters together into one regression and looked at the "UBM" variable, where the umpire's race matches the batter's race. 
From last post, here's the table the author included. The numbers are umpire errors per 1000 outside-of-zone pitches (negative favors the batter).

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8       ---     +5.9
White batter       +5.6      -4.4      ---

I had adjusted that to equalize the baseline:

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---

I think I'm able to estimate, from the original study, that the batter population was almost exactly in a 2:3:4 ratio -- 22 percent Black, 34 percent Hispanic, and 44 percent White. Using those numbers, I'm going to adjust the chart one more time, to show approximately what it would look like if the umpires were exactly alike (no bias) and each column averaged out to zero when weighted by those batter proportions. 

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -2.2     -2.2     -2.2
Hispanic batter    +3.8     +3.8     +3.8
White batter       -1.7     -1.7     -1.7

I chose those numbers so the average UBM (the average of the diagonal cells, weighted 22:34:44) comes out essentially zero, and also to closely fit the actual numbers the study found. That is: suppose you ran a regression using the author's data, but controlling for batter and umpire race, and suppose there was no racial bias. In that case, you'd get that table, which represents our null hypothesis of no racial bias.

If the null hypothesis is true, what will a regression spit out for UBM? If the batters were represented in their actual ratio, 22:34:44, you'd get essentially zero:

Diagonal          Effect Weight    Product
-------------------------------------------------
Black UBM          -2.2    22%    -0.48  
Hispanic UBM       +3.8    34%    +1.29  
White UBM          -1.7    44%    -0.75  
-------------------------------------------------
Overall UBM               100%    +0.06  per 1000
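As a quick check, here's that weighted average in a couple of lines of Python, with the effects and weights taken straight from the table:

# The no-bias UBM diagonal, weighted by the actual batter mix
effects    = {"Black": -2.2, "Hispanic": +3.8, "White": -1.7}   # per 1000 pitches
batter_mix = {"Black": 0.22, "Hispanic": 0.34, "White": 0.44}

ubm = sum(effects[r] * batter_mix[r] for r in effects)
print(round(ubm, 2))   # -> 0.06 per 1000, essentially zero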

However: in the actual population in the MLB study, the diagonals do NOT appear in the 22:34:44 ratio. That's because the umpires were overwhelmingly White -- 88 percent White. There were only 5 percent Black umpires, and 7 percent Hispanic umpires. So the White batters matched their umpire much more often than the Hispanic or Black batters.

Using 5:7:88 for umpires, and 22:34:44 for batters, here's the relative frequency of each combination, per 1,000 pitches:

                                             Batter
Umpire             Black   Hispanic  White    Total
---------------------------------------------------
Black batter        11        15      194      220
Hispanic batter     17        24      299      340
White batter        22        31      387      440
---------------------------------------------------
Umpire total        50        70      880     1000

Because there are so few minority umpires, there are only 24 Hispanic/Hispanic pairs out of 422 total matches on the UBM diagonal.  That's only 5.7% Hispanic batters, rather than 34 percent:

Diagonal       Frequency  Percent
----------------------------------
Black UBM            11     2.6% 
Hispanic UBM         24     5.7%
White UBM           387    91.7%     
----------------------------------
Overall UBM         422     100%

If we calculate the observed average of the diagonal, with this 11/24/387 breakdown, we get this:
                                      
                  Effect  Weight      Product
--------------------------------------------------
Black UBM          -2.2    2.6%    -0.06 per 1000
Hispanic UBM       +3.8    5.7%    +0.22 per 1000
White UBM          -1.7   91.7%    -1.56 per 1000 
--------------------------------------------------
Overall UBM                100%    -1.40 per 1000

Hispanic batters receive more bad calls for reasons other than racial bias. By restricting the sample of Hispanic batters to only those who see a Hispanic umpire, we selectively sample fewer Hispanic batters in the UBM pool, and so we get fewer bad calls. 

Under the null hypothesis of no bias, UBM plate appearances still see 1.40 fewer bad calls per 1,000 pitches, because of selective sampling.

------

That 1.40 figure is compared to the overall average. The regression coefficient, however, compares it to the non-UBM case. What's the average of the non-UBM case?

Well, if a UBM happens 422 times out of 1000, and results in 1.40 pitches fewer than average, and the average is zero, then the other 578 times out of 1000, there must have been 1.02 pitches more than average. 

                  Effect  Weight       Product
--------------------------------------------------
UBM                -1.40   42.2%   -0.59 per 1000
Non-UBM            +1.02   57.8%   +0.59 per 1000
--------------------------------------------------
Full sample                100%     0.00 per 1000

So the coefficient the regression produces -- UBM compared to non-UBM -- will be 2.42.
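Here's the whole chain of arithmetic in one short Python sketch -- the 5:7:88 umpire mix, the 22:34:44 batter mix, and the no-bias effects from above (treating the overall average as zero, as in the text):

# How much "UBM effect" selective sampling alone produces under the no-bias table
effects    = {"Black": -2.2, "Hispanic": +3.8, "White": -1.7}   # per 1000 pitches
batter_mix = {"Black": 0.22, "Hispanic": 0.34, "White": 0.44}
umpire_mix = {"Black": 0.05, "Hispanic": 0.07, "White": 0.88}

# Share of all pitches where the umpire's race matches the batter's, by race
match_share = {r: batter_mix[r] * umpire_mix[r] for r in effects}
ubm_share   = sum(match_share.values())                          # about 0.42

# Average effect on the UBM diagonal, weighted by how often each match occurs,
# and the non-UBM average implied by an overall average of zero
ubm_avg    = sum(effects[r] * match_share[r] for r in effects) / ubm_share
nonubm_avg = -ubm_avg * ubm_share / (1 - ubm_share)

print(round(ubm_share, 3), round(ubm_avg, 2), round(nonubm_avg, 2),
      round(nonubm_avg - ubm_avg, 2))
# -> 0.422 -1.4 1.02 2.43 : essentially the -1.40, +1.02, and 2.42 figures above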

What did the actual study find? 2.81. 

That leaves only 0.39 as the estimate of potential umpire bias:

-2.81  Selective sampling plus possible bias
-2.42  Effect of selective sampling only
---------------------------------------------
-0.39  Revised estimate of possible bias

The study found 2.81 fewer bad calls (per 1000) when the umpire matched the batter, but 2.42 of that is selective sampling, leaving only 0.39 that could be umpire bias.

Is that 0.39 statistically significant? I doubt it. For what it's worth, the original estimate had an SD of 0.44. So adjusting for selective sampling, we're less than 1 SD from zero.

--------

So, the conclusion: the study's finding of a 0.28% UBM effect cannot be attributed to umpire bias. It's mostly a natural mathematical artifact resulting from the fact that

(a) Hispanic batters see more incorrect calls for reasons other than bias, 

(b) Hispanic umpires are rare, and

(c) The regression didn't control for the race of batter and umpire separately.

Because of that, almost the entire effect the study attributes to racial bias is just selective sampling.














Monday, August 30, 2021

Are umpires racially biased? A 2021 study (Part I)

Are MLB umpires racially biased? There's a recent study that claims they are. The author, who wrote it as an undergrad thesis, mentioned it on Twitter, and when I checked a week or so later, there were lots of articles and links to it. (Here, for instance, is a Baseball Prospectus post reporting on it.  And here's a Yahoo! report.)

The study tried to figure out whether umpires make more bad calls against batters* of a race other than theirs (where there is no "umpire-batter match," or "UBM," as the literature calls it). It ran regressions on called pitches from 2008 to 2020, to figure out how best to predict the probability of the home-plate umpire calling a pitch incorrectly (based on MLB "Gameday" pitch location). The author controlled for many different factors, and found a statistically significant coefficient for UBM, concluding that the batter gains an advantage when the umpire is of the same race. It also argues that white umpires in particular "could be the driving force behind discrimination in MLB."  

I don't think any of that is right. I think the results point to something different, and benign. 

---------

Imagine a baseball league where some teams are made up of dentists, while the others are jockeys. The league didn't hire any umpires, so the players take turns, and promise to call pitches fairly.

They play a bunch of games, and it turns out that the umpires call more strikes against the dentists than against the jockeys. Nobody is surprised -- jockeys are short, and thus have small strike zones.

It's true that if you look at the Jockey umpires, the data shows they call a lot fewer strikes against batters of their own group than against batters of the other group. Their "UBM" coefficient is high and statistically significant.

Does that mean the jockey umps are "racist" against dentists? No, of course not. It's just that the dentists have bigger strike zones. 

It's the same, but in reverse, for the dentist umpires. They call more strikes against their fellow dentists -- again, not because of pro-jockey "reverse racism," but because of the different strike zones.

Later, teams of NBA players enter the league. These guys are tall, with huge strike zones, so they get a lot of called strikes, even from their own umpires.

Let's put some numbers on this: we'll say there are 10 teams of dentists, 1 team of jockeys, and 2 teams of NBA players. The jockeys are -10 in called strikes compared to average, and the NBA players are +10. That leaves the dentists at -1 (in order for the average to be zero).

Here's a chart that shows every umpire is completely fair and unbiased. 

Umpire             Jockey    NBA    Dentist
-------------------------------------------
Jockey batter:       -10     -10     -10
NBA batter           +10     +10     +10
Dentist batter        -1      -1      -1

The "UBM" cells are the ones on the diagonal, where the umpire matches the batter. If you look only at those cells, and don't think too much about what's going on, you could think the umpires are horribly biased. The Jockey batters get 10 fewer strikes than average from Jockey umpires!  That's awful!

But then when you look closer, you see the jockey row is *all* -10. That means all the umpires called the jockeys the same way (-10), so it's probably something about the jockey batters that made that happen. In this case, it's that they're short.

I think this is what's going on in the actual study. But it's harder to see, because the chart isn't set up with the raw numbers. The author ran different regressions for the three different umpire races, and set a different set of batters as the zero-level for each. Since they're calibrated to a different standard of player, the results make the umpires look very different.

If I had done here what the author did there, the chart above would have looked like this:

Umpire             Jockey    NBA   Dentist
------------------------------------------
Jockey batter:         0    -20      -9
NBA batter           +20      0     +11
Dentist batter        +9    -11       0

If you just look at this chart without realizing that you can't compare the columns to each other (because each is based on a different zero baseline), it's easy to think there's evidence of bias. You'd look at the chart and say, "Hey, it looks like Jockey umpires are racist against NBA batters and dentists. Also, dentist umpires are racist against NBA players but favor Jockeys somewhat. But, look!  NBA umpires actually *favor* other races!  That's probably because NBA umpires are new to the league, and are going out of their way to appear unbiased."  

That's a near-perfect analogue to the actual study.  This is the top half of Table 8, which measures "over-recognition" of pitchers, meaning balls incorrectly called as strikes (hurting the batter). I've multiplied everything by 1000, so the numbers are "wrong strike calls per 1000 called pitches outside the zone".

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8      ---      +5.9
White batter       +5.6      -4.4      ---

It's  very similar to my fake table above, where the dentists and Jockeys look biased, but the NBA players look "reverse biased". 

The study notes the chart and says,

"For White umpires, the results suggest that for pitches outside the zone, Hispanic batters ... face umpire discrimination. [But Hispanic umpires have a] "reverse-bias effect ... [which] holds for both Black and White batters... Lastly, the bias against non-Black batters by Black umpires is relatively consistent for both Hispanic and White batters."

And it rationalizes the apparent "reverse racism" from Hispanic umpires this way:

"This is perhaps attributable to the recent increase in MLB umpires from Hispanic countries, who could potentially fear the consequences of appearing biased towards Hispanic players."

But ... no. The apparent result is almost completely the result of setting a different zero level for each umpire/batter race -- in other words, by arbitrarily setting the diagonal to zero. That only works if the groups of batters are exactly the same. They're not. Just as Jockey batters have different characteristics than NBA player batters, it's likely that Hispanic batters don't have exactly the same characteristics as White and Black batters.

The author decided that White, Black, and Hispanic batters all should get exactly the same results from an unbiased umpire. If that assumption is false, the effect disappears. 

Instead, the study could have made a more conservative assumption: that unbiased umpires of any race should call *White* batters the same. (Or Black batters, or Hispanic batters. But White batters have the largest sample size, giving the best signal-to-noise ratio.)

That is, use a baseline where the bottom row is zero, rather than one where the diagonal is zero. To do that, take the original, set the bottom cells to zero, but keep the differences between any two rows in the same column:

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---
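If you want to check that re-baselining, here are a few lines of Python that take the study's numbers (with the diagonal at zero, as in the original table) and re-express each umpire column relative to the White-batter row:

# Wrong strike calls per 1000 outside-zone pitches; table[umpire][batter], diagonal = 0
table = {
    "Black":    {"Black":  0.0, "Hispanic": +7.8, "White": +5.6},
    "Hispanic": {"Black": -5.3, "Hispanic":  0.0, "White": -4.4},
    "White":    {"Black": -0.3, "Hispanic": +5.9, "White":  0.0},
}

# Subtract each column's White-batter value, so White batters become the baseline
rebased = {ump: {bat: round(v - col["White"], 1) for bat, v in col.items()}
           for ump, col in table.items()}

for ump, col in rebased.items():
    print(ump, col)
# Black    {'Black': -5.6, 'Hispanic': 2.2, 'White': 0.0}
# Hispanic {'Black': -0.9, 'Hispanic': 4.4, 'White': 0.0}
# White    {'Black': -0.3, 'Hispanic': 5.9, 'White': 0.0}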

Does this look like evidence of umpire bias? I don't think so. For any given race of batter, all three groups of umpires call about the same number of bad strikes. In fact, all three groups of umpires even have the same *order* among batter groups: Hispanic the most, White second, and Black third. (The raw odds of that happening are 1 in 36). 

The only anomaly is that maybe it looks like there's some evidence that Black umpires benefit Black batters by about 5 pitches per 1,000, but even that difference is not statistically significant. 

In other words: the entire effect in the study disappears when you remove the hidden assumption that Hispanic batters respond to pitches exactly the same way as White or Black batters. And the pattern of "discrimination" is *exactly* what you'd expect if the Hispanic batters respond to pitches in ways that result in more errors -- that is, it explains the anomaly that Hispanic umpires tend to look "reverse racist."

Also, I think the entire effect would disappear if the author had expanded his regression to include dummy variables for the race of the batter.  

------

If, like me, you find it perfectly plausible that Hispanic batters respond to pitches in ways that generate more umpire errors, you can skip this section. If not, I will try to convince you.

First, keep in mind that it's a very, very small difference we're talking about: maybe 4 pitches per 1,000, or 0.4 percent. Compare that to some of the other, much larger effects the study found:

 +8.9%   3-0 count on the batter
 -0.9%   two outs
 +2.8%   visiting team batting
 -3.3%   right-handed batter
 +0.5%   right-handed pitcher
+19.7%   bases loaded (!!!)
 +1.4%   pitcher 2 WAR vs. 0 WAR
 +0.9%   pitcher has two extra all-star appearances
 +4.0%   2019 vs. 2008
---------------------------------------------------
 +0.4%   batter is Hispanic
---------------------------------------------------

I wouldn't have expected most of those other effects to exist, but they do. And they're so large that they make this one, at only +0.4%, look unremarkable. 

Also: with so many large effects found in the study, there are probably other factors the author didn't consider that are just as large. Just to make something up ... since handedness of pitcher and batter are so important, suppose that platoon advantage (the interaction between pitcher and batter hand, which the study didn't include) is worth, say, 5%. And suppose Hispanic batters are likely to have the platoon advantage, say, 8% less than White batters. That would give you an 0.4% effect right there.

I don't have data specifically for Hispanic batters, but I do have data for country of birth. Not all non-USA players are Hispanic, but probably a large subset are, so I split them up that way. Here are the batting-handedness numbers for players from 1969 to 2016:

Born in USA:       61.7% RHB
Born outside USA:  67.1% RHB

That's about a five-percentage-point difference in handedness -- roughly 10 percent in relative terms. I don't know how that translates into platoon advantage, but it's got to be the same order of magnitude as what we'd need for 0.4%.

Here's another theory. They used to say, about prospects from the Dominican Republic, that they deliberately become free swingers because "you can't walk off the island."  

Suppose that, knowing a certain player is a free swinger, the pitcher aims a bit more outside the strike zone than usual, knowing the batter is likely to swing anyway. If the catcher sets a target outside, and the pitcher hits it perfectly, the umpire may be more likely to miscall it as a strike (at least according to many broadcasters I've heard).

Couldn't that explain why Hispanic players get very slightly more erroneous strike calls? 

In support of that hypothesis, here are K/W ratios for that same set of batters (total K divided by total BB):

Born in USA:       1.82 K per BB
Born outside USA:  2.05 K per BB 

Again, that seems around the correct order of magnitude.

I'm not saying these are the right explanations -- they might be right, or they might not. The "right answer" is probably several factors, perhaps going different directions, but adding up to 0.4%. 

But the point is: there do seem to be significant differences in hitting styles between Hispanic and non-Hispanic batters, certainly significant enough that an 0.4% difference in bad calls is quite plausible. Attributing the entire 0.4% to racist umpires (and assuming that all races of umpires would have to discriminate against Hispanics!) doesn't have any justification whatsoever -- at least not without additional evidence.

-------

Here's a TLDR summary, with a completely different analogy this time:

Eddie Gaedel's father calls fewer strikes on Eddie Gaedel than Aaron Judge's father calls on Aaron Judge. So Gaedel Sr. must be biased! 

--------

There's another part of the study -- actually, the main part -- that throws everything into one big regression and still comes out with a significant "UBM" effect, which again it believes is racial bias. I think that conclusion is also wrong, for reasons that aren't quite the same. 

That's Part II, which is now here.

----------


(*The author found a similar result for pitchers, who gained an advantage in more called strikes when they were the same race as the umpire, and a similar result for called balls as well as called strikes. In this post, I'll just talk about the batting side and the called strikes, but the issues are the same for all four combinations of batter/pitcher ball/strike.)



Monday, January 26, 2015

Are umpires biased in favor of star pitchers? Part II

Last post, I talked about the study (.pdf) that found umpires grant more favorable calls to All-Stars because the umps unconsciously defer to their "high status." I suggested alternative explanations that seemed more plausible than "status bias."

Here are a few more possibilities, based on the actual coefficient estimates from the regression itself.

(For this post, I'll mostly be talking about the "balls mistakenly called as strikes" coefficients, the ones in Table 3 of the paper.)

---

1. The coefficient for "right-handed batter" seems way too high: -0.532. That's so big, I wondered whether it was a typo, but apparently it's not.  

How big? Well, to suffer as few bad calls as his right-handed teammate, a left-handed batter would have to be facing a pitcher with 11 All-Star appearances.

The likely explanation seems to be: umpires don't call strikes by the PITCHf/x (rulebook) standard, and the differences are bigger for lefty batters than righties. Mike Fast wrote, in 2010,

"Many analysts have shown that the average strike zone called by umpires extends a couple of inches outside the rulebook zone to right-handed hitters and several more inches beyond that to left-handed hitters." 

That's consistent with the study's findings in a couple of ways. First, in the other regression, for "strikes mistakenly called as balls", the equivalent coefficient is less than a tenth the size, at -0.047. Which makes sense: if the umpires' strike zone is "too big", it will affect undeserved strikes more than undeserved balls. 

Second: the two coefficients go in the same direction. You wouldn't expect that, right? You'd expect that if lefty batters get more undeserved strikes, they'd also get fewer undeserved balls. But this coefficient is negative in both cases. That suggests something external and constant, like the PITCHf/x strike zone overestimating the real one.

And, of course, if the problem is umpires not matching the rulebook, the entire effect could just be that control pitchers are more often hitting the "illicit" part of the zone.  Which is plausible, since that's the part that's closest to the real zone.

---

2. The "All-Star" coefficient drops when it's interacted with control. Moreover, it drops further for pitchers with poor control than pitchers with good control. 

Perhaps, if there *is* a "status" effect, it's only for the very best pitchers, the ones with the best control. Otherwise, you have to believe that umpires are very sensitive to "status" differences between marginal pitchers' control rates. 

For instance, going into the 2009 season, say, J.C. Romero had a career 12.5% BB/PA rate, while Warner Madrigal's was 9.1%. According to the regression model, you'd expect umpires to credit Madrigal with 37% more undeserved strikes than Romero. Are umpires really that well calibrated?

Suppose I'm right, and all the differences in error rates really accrue to only the very best control pitchers. Since the model assumes the effect is linear all the way down the line, the regression will underestimate the best and worst control pitchers, and overestimate the average ones. (That's what happens when you fit a straight line to a curve; you can see an example in the pictures here.) 

Since the best control pitchers are underestimated, the regression tries to compensate by jiggling one of the other coefficients, something that correlates with only those pitchers with the very best control. The candidate it settles on: All-Star appearances. 

Which would explain why the All-Star coefficient is high, and why it's high mostly for pitchers with good control. 

---

3. The pitch's location, as you would expect, makes a big difference. The further outside the strike zone, the lower the chance that it will be mistakenly called a strike. 

The "decay rate" is huge. A pitch that's 0.1 feet outside the zone (1.2 inches) has only 43 percent the odds of being called a strike as one that's right on the border (0 feet).  A pitch 0.2 feet outside has only 18 percent the odds (43 percent squared).  And so on.*

(* Actually, the authors used a quadratic to estimate the effect -- which makes sense, since you'd expect the decay rate to increase. If the error rate at 0.1 feet is, say, 10 percent, you wouldn't expect the rate for 1 foot to be 1 percent. It would be much closer to zero. But the quadratic term isn't that big, it turns out, so I'll ignore it for simplicity. That just renders this argument more conservative.) 

The regression coefficient, per foot outside, was 8.292. The coefficient for a single All-Star appearance was 0.047. 

So an All-Star appearance is worth 1/176 of a foot -- which is a bit more than 1/15 of an inch.
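Here's that arithmetic in Python, using the two coefficients quoted above. (I'm taking the distance coefficient as negative in the log-odds, since more distance means lower odds, and ignoring the quadratic term, as noted.)

import math

per_foot_outside = -8.292   # log-odds per foot outside the zone, sign assumed
per_all_star     = 0.047    # log-odds per previous All-Star appearance

# Odds of a wrong strike call, relative to a pitch right on the border
print(round(math.exp(per_foot_outside * 0.1), 3))   # -> 0.436, the "43 percent"
print(round(math.exp(per_foot_outside * 0.2), 3))   # -> 0.19, roughly 43% squared

# How many inches of distance one All-Star appearance is "worth"
inches = 12 * per_all_star / abs(per_foot_outside)
print(round(inches, 3), round(1 / inches, 1))        # -> 0.068 14.7, a bit over 1/15 inch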

That's the main regression. For the one with the lower value for All-Star appearances, it's only an eighteenth of an inch. 

Isn't it more plausible to think that the good pitchers are deceptive enough to fool the umpire by 1/15 of an inch per pitch, rather than that the umpire is responding to their status? 

Or, isn't it more likely that the good pitchers are hitting the "extra" parts of the umpires' inflated strike zone, at an increased rate of one inch per 15 balls? 

---

4. The distance from the edge of the strike zone is, I assume, "as the crow flies." So, a high pitch down the middle of the plate is treated as the same distance as a high pitch that's just on the inside edge. 

But, you'd think that the "down the middle" pitch has a better chance of being mistakenly called a strike than the "almost outside" pitch. And isn't it also plausible that control pitchers will have a different ratio of the two types than those with poor control? 

Also, a pitch that's 1 inch high and 1 inch outside registers as the same distance as a pitch over the plate that's 1.4 inches high. Might umpires not be evaluating two-dimensional balls differently than one-dimensional balls?

And, of course: umpires might be calling low balls differently than high balls, and outside pitches differently from inside pitches. If pitchers with poor control throw to the inside part of the plate more than All-Stars (say), and the umpires seldom err on balls inside because of the batter's reaction, that alone could explain the results.

------ 

All these explanations may strike you as speculative. But, are they really more speculative than the "status bias" explanation? They're all based on exactly the same data, and the study's authors don't provide any additional evidence other than citations that status bias exists.

I'd say that there are several different possibilities, all consistent with the data:

1.  Good pitchers get the benefit of umpires' "status bias" in their favor.

2.  Good pitchers hit the catcher's glove better, and that's what biases the umpires.

3.  Good pitchers have more deceptive movement, and the umpire gets fooled just as the batter does.

4.  Different umpires have different strike zones, and good pitchers are better able to exploit the differences.

5.  The PITCHf/x (rulebook) strike zone significantly underestimates the zone umpires actually call. Since good pitchers are closer to the strike zone more often, they wind up with more umpire strikes that are PITCHf/x balls. The difference only has to be the equivalent of one-fifteenth of an inch per ball.

6.  Umpires are "deliberately" biased. They know that when they're not sure about a pitch, considering the identity of the pitcher gives them a better chance of getting the call right. So that's what they do.

7.  All-Star pitchers have a positive coefficient to compensate for real-life non-linearity in the linear regression model.

8.  Not all pitches the same distance from the strike zone are the same. Better pitchers might err mostly (say) high or outside, and worse pitchers high *and* outside.  If umpires are less likely to be fooled in two dimensions than one, that would explain the results.

------

To my gut, #1, unconscious status bias, is the least plausible of the eight. I'd be willing to bet on any of the remaining seven, that they all are contributing to the results to some extent (possibly negatively).  

But I'd bet on #5 being the biggest factor, at least if the differences between umpires and the rulebook really *are* as big as reported.  

As always, your gut may be more accurate than mine.  





Sunday, January 18, 2015

Are umpires biased in favor of star pitchers?

Are MLB umpires biased in favor of All-Star pitchers? An academic study, released last spring, says they are. Authored by business professors Braden King and Jerry Kim, it's called "Seeing Stars: Matthew Effects and Status Bias in Major League Baseball Umpiring."

"What Umpires Get Wrong" is the title of an Op-Ed piece in the New York Times where the authors summarize their study. Umps, they write, favor "higher status" pitchers when making ball/strike calls:


"Umpires tend to make errors in ways that favor players who have established themselves at the top of the game's status hierarchy."

But there's nothing special about umpires, the authors say. In deferring to pitchers with high status, umps are just exhibiting an inherent unconscious bias that affects everyone: 


" ... our findings are also suggestive of the way that people in any sort of evaluative role — not just umpires — are unconsciously biased by simple 'status characteristics.' Even constant monitoring and incentives can fail to train such biases out of us."

Well ... as sympathetic as I am to the authors' argument about status bias in regular life, I have to disagree that the study supports their conclusion in any meaningful way.

-----

The authors looked at PITCHf/x data for the 2008 and 2009 seasons, and found all instances where the umpire miscalled a ball or strike, based on the true, measured x/y coordinates of the pitch. After a large multiple regression, they found that umpire errors tend to be more favorable for "high status" pitchers -- defined as those with more All-Star appearances, and those who give up fewer walks per game. 

For instance, in one of their regressions, the log-odds of a favorable miscall -- the umpire calling a strike on a pitch that was actually out of the strike zone -- increased by 0.047 for every previous All-Star appearance by the pitcher. (It was a logit regression, but for small coefficients and low-probability events like these, the number itself is a close approximation of the relative change, in the odds and in the probability. So you can think of 0.047 as roughly a 5 percent increase.)

The pitcher's odds also increased 1.4 percent for each year of service, and another 2.5 percent for each percentage point improvement in career BB/PA.
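If you want to see why a small logit coefficient reads almost directly as a percentage change, here's the quick check. (I'm treating the 1.4 and 2.5 percent figures as raw coefficients of 0.014 and 0.025, which is my assumption about how they were reported.)

import math

# A small log-odds coefficient is approximately the percent change in the odds
for b in (0.047, 0.014, 0.025):
    print(b, round(math.exp(b) - 1, 3))
# 0.047 0.048   (about a 5 percent increase)
# 0.014 0.014
# 0.025 0.025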

For unfavorable miscalls -- balls called on pitches that should have been strikes -- the effects were smaller, but still in favor of the better pitchers.

I have some issues with the regression, but will get to those in a future post. For now ... well, it seems to me that even if you accept that these results are correct, couldn't there be other, much more plausible explanations than status bias?

1. Maybe umpires significantly base their decisions on how well the pitcher hits the target the catcher sets up. Good pitchers come close to the target, and the umpire thinks, "good control" and calls it a strike. Bad pitchers vary, and the catcher moves the glove, and the umpire thinks, "not what was intended," and calls it a ball.

The authors talk about this, but they consider it an attribute of catcher skill, or "pitch framing," which they adjust for in their regression. I always thought of pitch framing as the catcher's ability to make it appear that he's not moving the glove as much as he actually is. That's separate from the pitcher's ability to hit the target.

2. Every umpire has a different strike zone. If a particular ump is calling a strike on a low pitch that day, a control pitcher is more able to exploit that opportunity by hitting the spot. That shows up as an umpire error in the control pitcher's favor, but it's actually just a change in the definition of the strike zone, applied equally to both pitchers.

3. The study controlled for the pitch's distance from the strike zone, but there's more to pitching than location. Better pitchers probably have better movement on their pitches, making them more deceptive. Those might deceive the umpire as well as the batter. 

Perhaps umpires give deceptive pitches the benefit of the doubt -- when the pitch has unusual movement, and it's close, they tend to call it a strike, either way. That would explain why the good pitchers get favorable miscalls. It's not their status, or anything about their identity -- just the trajectory of the balls they throw. 

4. And what I think is the most important possibility: the umpires are Bayesian, trying to maximize their accuracy. 

Start with this. Suppose that umpires are completely unbiased based on status -- in fact, they don't even know who the pitcher is. In that case, would an All-Star have the same chance of a favorable or unfavorable call as a bad pitcher? Would the data show them as equal?

I don't think so. 

There are times when an umpire isn't really sure about whether a pitch is a ball or a strike, but has to make a quick judgment anyway. It's a given that "high-status" control pitchers throw more strikes overall; that's probably also true in those "umpire not sure" situations. 

Let's suppose a borderline pitch is a strike 60% of the time when it's from an All-Star, but only 40% of the time when it's from a mediocre pitcher.

If the umpire is completely unbiased, what should he do? Maybe call it a strike 50% of the time, since that's the overall rate. 

But then: the good pitcher will get only five strike calls when he deserves six, and the bad pitcher will get five strike calls when he only deserves four. The good pitcher suffers, and the bad pitcher benefits.

So, unbiased umpires benefit mediocre pitchers. Even if umpires were completely free of bias, the authors' methodology would nonetheless conclude that umpires are unfairly favoring low-status pitchers!

----

Of course, that's not what's happening, since in real life, it's the better pitchers who seem to be benefiting. (But, actually, that does lead to a fifth (perhaps implausible) possibility for what the authors observed: umpires are unbiased, but the *worse* pitchers throw more deceptive pitches for strikes.)

So, there's something else happening. And, it might just be the umpires trying to improve their accuracy.

Our hypothetical unbiased umpire will have miscalled 5 out of 10 pitches for each player. To reduce his miscall rate, he might change his strategy to a Bayesian one. 

Since he understands that the star pitcher has a 60% true strike rate in these difficult cases, he might call *all* strikes in those situations. And, since he knows the bad pitcher's strike rate is only 40%, he might call *all balls* on those pitches. 

That is: the umpire chooses the call most likely to be correct. 60% beats 40%.

With that strategy, the umpire's overall accuracy rate improves to 60%. Even if he has no desire, conscious or unconscious, to favor the ace for the specific reason of "high status", it looks like he does -- but that's just a side-effect of a deliberate attempt to increase overall accuracy.

In other words: it could be that umpires *consciously* take the history of the pitcher into account, because they believe it's more important to minimize the number of wrong calls than to spread them evenly among different skills of pitcher. 

That could just as plausibly be what the authors are observing.

How can the ump improve his accuracy without winding up advantaging or disadvantaging any particular "status" of pitcher? By calling strikes in exactly the proportion he expects from each. For the good pitcher, he calls strikes 60% of the time when he's in doubt. For the bad pitcher, he calls 40% strikes. 

That strategy increases his accuracy rate only marginally -- from 50 percent to 52 percent (60% squared plus 40% squared). But, now, at least, neither pitcher can claim he's being hurt by umpire bias. 
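Here's a little Python sketch of the three approaches on those borderline pitches, using the hypothetical 60/40 true strike rates from above:

# Expected accuracy under three umpire strategies for borderline pitches
true_strike = {"ace": 0.60, "mediocre": 0.40}

def accuracy(call_strike_rate, p_strike):
    # The umpire calls "strike" on a random fraction of these pitches
    return call_strike_rate * p_strike + (1 - call_strike_rate) * (1 - p_strike)

# 1. Ignore the pitcher: call strikes 50% of the time
print({p: round(accuracy(0.50, q), 2) for p, q in true_strike.items()})   # 0.5 for both

# 2. Always make the single most likely call for that pitcher
print({p: round(accuracy(1.0 if q > 0.5 else 0.0, q), 2)
       for p, q in true_strike.items()})                                  # 0.6 for both

# 3. Call strikes in the same proportion as each pitcher's true rate
print({p: round(accuracy(q, q), 2) for p, q in true_strike.items()})      # 0.52 for both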

But, even though the result is equitable, it's only because the umpire DOES have a "status bias." He's treating the two pitchers differently, on the basis of their historical performance. But King and Kim's study won't be able to tell there's a bias, because neither pitcher is hurt. The bias is at exactly the right level.

Is that what we should want umpires to do, bias just enough to balance the advantage with the disadvantage? That's a moral question, rather than an empirical one. 

Which are the most ethical instructions to give to the umpires? 

1. 

Make what you think is the correct call, on a "more likely than not" basis, *without* taking the pitcher's identity into account.

Advantages: No "status bias."  Every pitcher is treated the same.

Disadvantages: The good pitchers wind up being disadvantaged, and the bad pitchers advantaged. Also, overall accuracy suffers.

2. 

Make what you think is the correct call, on a "more likely than not" basis, but *do* take the pitcher's identity into account.

Advantages: Maximizes overall accuracy.

Disadvantages: The bad pitchers wind up being disadvantaged, and the good pitchers advantaged.

3. 

Make what you think is the most likely correct call, but adjust only slightly for the pitcher's identity, just enough that, overall, no type of pitcher is either advantaged or disadvantaged.

Advantages: No pitcher has an inherent advantage just because he's better or worse.

Disadvantages: Hard for an umpire to calibrate his brain to get it just right. Also, overall accuracy not as good as it could be. And, how do you explain this strategy to umpires and players and fans?


Which of the three is the right answer, morally? I don't know. Actually, I don't think there necessarily is one -- I think any of the three is fair, if understood by all parties, and applied consistently. Your opinion may vary, and I may be wrong. But, that's a side issue.

------

Getting back to the study: the fact that umpires make more favorable mistakes for good pitchers than bad pitchers is not, by any means, evidence that they are unconsciously biased against pitchers based on "status." It could just as easily be one of several other, more plausible reasons. 

So that's why I don't accept the study's conclusions. 

There's also another reason -- the regression itself. I'll talk about that next post.




(Hat tip: Charlie Pavitt)



Monday, August 08, 2011

Umpires' racial bias disappears for other years of data -- Part II

Hopefully this will be my last post on that Hamermesh umpire bias study ...

Two days before I was going to talk about the study at a presentation last week, I discovered that someone else was presenting a poster on it at the same conference.

Jeff Hamrick and John Rasp ran the data for 21 years of MLB, 1980-2010; the original Hamermesh study was for three years, 2004-2006.

Here's the chart they got. The numbers are, as usual, percentages of called pitches that were strikes:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 30.70 30.60 29.30
Hspnc Umpire-- 31.70 31.30 28.80
Black Umpire-- 30.80 30.30 28.70
--------------------------------

(Their numbers are actually only to one decimal place, not two, but I added the zero to make the numbers line up with the previous charts.)

There is no evidence of bias here ... the diagonal entries are not really any bigger than they "should" be. Still, after controlling for a whole bunch of stuff, including the identity of the pitcher, batter, and umpire, they got a significance level of 0.075. That falls short of the standard .05 threshold. Even if you accept the .075 as real, the amount of bias is very, very small.

As you may recall, the 3-year sample, which was significant at (I think) somewhere between .01 and .05, could have been caused by a mere 35 pitches per season. This study used seven times as much data, and got a weaker level of significance. The square root of 7 is about 2.6, so we divide 35 by 2.6 to get about 13 pitches per season. And, since the new study found only maybe 1.7 SDs instead of 2.5, we divide by another 1.5 to get maybe 9 pitches per season.

(That 9 pitches is the minimum, if maybe only one or two umpires are biased. It could be a lot more, if it's the white umpires who are biased. But there's no way to know for sure from the data.)
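Here's that back-of-the-envelope scaling in Python, using the rough figures above (35 pitches per season, seven times the data, 2.5 vs. 1.7 SDs):

import math

pitches_3yr  = 35                      # per season, from the original 3-year sample
sample_ratio = 7                       # 21 seasons of data vs. 3
sd_ratio     = 2.5 / 1.7               # rough size of the effect, old study vs. new

pitches_new = pitches_3yr / math.sqrt(sample_ratio) / sd_ratio
print(round(pitches_new))              # -> 9 pitches per season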

One thing we have to consider, though: the Hamermesh study found an effect only for low-attendance situations, and this new Hamrick/Rasp data is for all attendance situations. However, when Hamrick and Rasp included attendance in their regression, they got a significance level of 0.977, which shows the effect is almost completely random with regard to attendance.

So it's probably safe to conclude that, when you extend the Hamermesh study from 3 seasons to 21, the effect goes away.

Thanks to Jeff and John for making the data available.




Monday, August 01, 2011

Do umpires discriminate in favor of veterans?

At the SABR convention last month, some evidence was unveiled that suggests that umpires give more favorable ball/strike calls to veterans.

Pat Kilgo gave a presentation, "Do Umpires Give Favorable Treatment to Some Players?" He and his colleagues -- Hillary Superak, Lisa Elon, Mark Katz, Paul Weiss, Jeff Switchenko, Brian Schmotzer, and Lance Waller -- looked at called pitches from 2009-10. They compared the call to the PitchF/X data, and, from that, decided if it was a correct call, a "false strike," or a "false ball".

They then created a matrix classifying both the batter and pitcher by years of experience. There were 16 classifications for each, from "0-1 year experience" to "more than 15 years experience". So the matrix had 256 cells. Each cell contained the percentage of "false strikes" for that situation.

As it turned out, there were many, many more false strikes when the pitcher had a lot of experience but the batter did not. And there were many *fewer* false strikes when the situation was reversed, with an experienced batter and rookie-ish pitcher.

Pat was kind enough to give me permission to post his PowerPoint slides, which are here. If you turn to slide 16, Pat and his colleagues color coded the cells, from dark green (lots of false strikes) to beige (few false strikes). Most of the green are at the bottom-left; most of the beige are at the top-right. There is no doubt that the distribution of colors is statistically significant.




On slide 22, the authors repeat the analysis for "false balls". This time, the pitcher's experience is significant (veterans don't get cheated out of a strike very often), but the batter's is not.

To summarize the authors' slide 33:

-- Umpires absolutely favor veterans with respect to false strikes
-- Umpires most likely favor veteran pitchers with respect to false balls
-- No evidence of benefit to veteran hitters [on false balls]


There are a couple of possible criticisms to the study. One is that PitchF/X might not be the best way to classify missed calls (I believe Mike Fast made this argument, but I don't have the link handy).

Another -- and I think this was raised by a questioner at the original presentation -- is that not all pitches are created equal. If veteran pitchers tend to throw down the middle, instead of trying to paint the corners, that would reduce their number of false balls (since their strikes are more obvious). I suppose you could check that out by controlling for pitch location.

Still, it seems to me that there's a good chance that Pat and his colleagues have found a real effect. Part of the reason is that the "umpires favor veterans" theory doesn't come out of the blue -- a lot of observers have long believed it to be true. That's unlike the "umpires have racial bias" hypothesis, which was (and still is) generally doubted by players and sportswriters.

I look forward to hearing what everyone else thinks. Thanks again to Pat for permission to post and link.



Thursday, July 28, 2011

More fastballs = fewer called strikes

A couple of weeks ago, I noticed that, from 2004 to 2006, even though hispanic and black pitchers received a lower percentage of called strikes than white pitchers (called strikes as a percentage of called pitches), they were able to post above-average numbers.

The reason, it turned out, was that despite not getting as many called strikes, they got a lot more *swinging* strikes, and that more than compensated.

I wondered why that would happen, what was so special about those pitchers. Then, commenter GuyM e-mailed me a suggestion: it looked like the ten pitchers I highlighted were all fastball pitchers.

I went over to Fangraphs and looked them up ... and Guy was right. With the exception of Ray King, the other nine pitchers threw fastballs at or above the MLB-average rate.

So, I did a more formal test. For 2004, 2005, and 2006 (separately), I split the league into the usual nine pitcher/umpire combinations (white/hispanic/black), and figured out the average fastball percentage (FB%) for each group that year. (I didn't have breakdowns on a per-pitch basis, so I used the player's overall season rate for each cell.)

Here's 2005:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 62.01 61.87 67.86
Hspnc Umpire-- 61.74 64.91 70.89
Black Umpire-- 62.20 60.57 66.78
--------------------------------

There's a big bump in the H/H cell -- a lot more fastballs than you would expect. It would be hard to argue that that's racial bias, since the pitch thrown is a deliberate decision by the pitcher and catcher.

It just seems like, in 2005, the H/H pitchers happened to throw a lot of fastballs.

The situation was reversed in 2006:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 61.09 60.50 62.73
Hspnc Umpire-- 61.93 58.70 58.31
Black Umpire-- 60.80 61.72 61.53
--------------------------------

Suddenly, the H/H group is throwing many FEWER fastballs. Actually, it looks like fastballs were down across the board in 2006 -- I bet that was a change in how the stringers recorded pitches, rather than an actual change in what pitchers threw. In any case, even after adjusting for that, the H/H group is low.

So what's going on? Well, it's probably just different pitchers who make up that cell. It's somewhere around 1,000 pitches each year, which means the equivalent of maybe 20 hispanic pitchers starting against hispanic umpires. Just by chance, the 20 pitchers in 2005 were fastball pitchers, and the 20 pitchers in 2006 weren't.

Finally, here's 2004, just for completeness. It doesn't really show anything interesting.

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 61.81 61.86 66.52
Hspnc Umpire-- 61.66 61.88 64.75
Black Umpire-- 62.66 65.54 66.40
--------------------------------

So, as I was saying ... we want to try to figure out whether throwing more fastballs leads to fewer called strikes. To figure that out, I ran a regression to predict fastball percentage based on strike percentage, using all 27 cases in the above three tables. Since the overall FB% seems to vary from year to year, I added two dummy variables for the individual seasons.

The result: an r-squared of 0.4, and statistical significance. More important, the results of the regression equation: a relationship where, for every 1 percentage point more called strikes you get, you're likely to have thrown 1.67 percentage points fewer fastballs.

When I took out the bottom two cells in each of the "Black" columns (in which the sample sizes are very small, around 100 and 300 pitches each respectively), the result was even more significant (r-squared 0.53), and the relationship changed from 1.67 to 1.1.

So, we have a pretty good indication that more fastballs cause fewer called strikes. Technically, I shouldn't assume causation -- the data leave open the possibility that fewer called strikes cause more fastballs, or that some third variable causes both lots of fastballs and fewer called strikes. But neither of those seems very plausible.

-------

Here's a more intuitive way to see the relationship. Here's 2005 again, for fastballs:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 62.01 61.87 67.86
Hspnc Umpire-- 61.74 64.91 70.89
Black Umpire-- 62.20 60.57 66.78
--------------------------------

And here's 2005 for called strikes:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.15 31.20 31.74
Hspnc Umpire-- 31.55 31.04 24.19
Black Umpire-- 31.39 31.53 30.88
--------------------------------

If you compare the charts, you can see for yourself that the high FB% cells generally seem to be paired with low CS%.
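Here's a minimal sketch of that relationship as a regression, using only the nine 2005 cells above. (The regression described earlier stacked all 27 cells from the three seasons and added two season dummies; this is just the idea, and it ignores the very different pitch counts behind each cell.)

import numpy as np

# FB% and CS% for the nine 2005 umpire/pitcher cells, read off the two charts above
fb = np.array([62.01, 61.87, 67.86, 61.74, 64.91, 70.89, 62.20, 60.57, 66.78])
cs = np.array([32.15, 31.20, 31.74, 31.55, 31.04, 24.19, 31.39, 31.53, 30.88])

# Least-squares fit: FB% = a + b * CS%
X = np.column_stack([np.ones_like(cs), cs])
coef, *_ = np.linalg.lstsq(X, fb, rcond=None)
print(round(float(coef[1]), 2))   # -> about -1: more called strikes, fewer fastballs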

-------

Another important thing is that, now, we can't assume that when a pitcher gets few called strikes, his performance suffers. In fact, if the reason for fewer called strikes is more fastballs, it could be the other way around.

For instance, in the center cell in 2005, where the hispanic pitchers got only 31.04 percent called strikes, they gave up a very good 3.76 RC27 (like a 3.50 ERA). But in 2006, when they got 34.16% called strikes (which is very high), the batters facing them had an RC27 of 5.52. The more called strikes, the worse the performance. Very much opposite to the way you'd think.

That's when we look mostly *between* pitchers -- pitcher A, with more called strikes, is likely to be worse than pitcher B, with fewer called strikes. We don't know the relationship within the *same* pitcher. If pitcher A gets more called strikes in one start than another, is he likely to be worse in that start? We don't know.

So, when the Hamermesh study asserts that the H/H group benefits from the umpires having called more strikes in their favor, that's not necessarily true. It might be, but it also might not be. It's certainly true if the cause IS umpire bias, because that just changes the identical pitch from a ball to a strike. But if the cause is pitch selection, the relationship could be the exact opposite.

--------

Now, in my own little study, which was an attempt to reproduce the results of the original Hamermesh study, I did indeed find that the CS% in the "hispanic/hispanic" cell was very high. Now, we have an explanation other than umpire bias -- pitching style. It could just be that the overall H/H cell had fewer fastball pitchers than expected, and that caused the results.

But, while that would explain *my* results, it won't explain the original Hamermesh results. That's because the Hamermesh study controlled for the identity of the pitcher. So, if the center cell did indeed feature a lot of finesse pitchers, their study would have adjusted for that, even though mine didn't.

Still, we have a possible *weaker* explanation. Suppose that pitchers vary their fastball tendencies from year to year. One season, they might throw 65% fastballs, but, when they're a year or two older, their slider improves, and now they only throw 55% fastballs. The Hamermesh study adjusted for the identity of the player, but not for the individual player/season. So, if hispanic pitcher X threw 55% fastballs in the season where he faced the hispanic umpire, but 60% fastballs in the season where he faced the white umpire, that would bias the results and make it look like the umpire was biased.

Or, even more granular: if pitchers change their repertoire *from game to game*, that would also do it. For instance, suppose hispanic pitcher Y finds out his curve ball isn't working well one game, and relies more on his fastball. If that happened more in games where the umpire was white, then, again, that would make the hispanic umpire look biased in his favor.

It's important to keep in mind that this is a valid criticism only if pitch selection differences are clustered over games or seasons. If a pitcher randomly decides to throw a fastball this pitch, but a breaking ball next pitch, that's included in the significance levels of the original study. It's only when the fastballs are *clustered* within umpires, rather than random over pitches, that that's something that affects the significance levels.

-------

So where does this leave us? Well, we haven't really found any smoking gun evidence that explains what the Hamermesh study found, since that study did control for who the pitcher is (which means they effectively controlled for fastball percentage). However, we *do* have a potential explanation, which is non-random pitch selection.

Normally, I hate when a study is criticized on the grounds of "you didn't control for X". That's a lazy argument, and it's an argument that can be leveled at any study, because, no matter how thorough, there's always *something* that hasn't been controlled for. Also, there's often no reason to believe X is important to control for. And, even if it is, there's no reason to believe that it's non-randomly distributed among the other variables.

In order to be taken seriously when you say "you didn't control for X," you need to come up with (a) an argument that X is actually an important factor, important enough to change the results, and (b) that there is reason to believe X is distributed non-randomly.

That's what I'm trying to do here. First, (a) I think I have proven that pitch type does seriously and significantly affect called strike percentage. Second (b), it's plausible that pitch type may vary *by the conscious choice of the pitcher* over seasons, and perhaps even games.

If I knew for sure that (b) happened -- if we had data that showed that it was common that, for some games a pitcher chooses to throw 70% fastballs, and some games he chooses to throw only 50% fastballs -- that would be enough to prove that the Hamermesh study's confidence intervals were overstated. Since we don't, it's just a possibility.

We don't know *for sure* that pitch types tend to cluster together. But it's a reasonable thing to look at in a future study. Based on the little I've looked at it so far, I suspect that it's a small but important factor.

-------

P.S. Thanks to GuyM for his e-mail discussion, and to Fangraphs' David Appelman for assistance in getting the FB% data I needed.


Friday, July 22, 2011

Umpires' racial bias disappears for other years of data

The Hamermesh (et al) study on umpire racial bias looked at data from 2004 to 2006. When I tried reproducing their numbers, I got this chart (repeated here for the nth time):

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 31.88 31.27 31.27
Hspnc Umpire-- 31.41 32.47 28.29
Black Umpire-- 31.22 31.21 32.52
--------------------------------
All Umpires -- 31.83 31.30 31.32

There's some evidence of bias there; specifically, the two diagonal entries for hispanic umpires calling hispanic pitchers (32.47) and black umpires calling black pitchers (32.52) seem a lot higher than they "should" be compared to their row and column.

I decided to try looking at other years: specifically, 2002, 2003, 2007, and 2008 combined. The only problem with that is that I didn't have a list of minority umpires and black pitchers for those years, so I had to use the same list as in the 2004-06 sample. That means some minority pitchers and umpires may have been excluded from their proper group, and misclassified as "white". Still, there shouldn't be too many of those, and their numbers would be small.

(This problem doesn't exist for hispanic pitchers, because I used country of birth for those.)

So, here's the same chart as above for those other years:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 31.47 30.97 31.22
Hspnc Umpire-- 31.19 30.77 34.65
Black Umpire-- 30.90 30.07 32.55
--------------------------------
All Umpires -- 31.83 31.30 31.32

The "umpires seem to favor pitchers of their own race" effect seems almost completely gone here. For instance, compare hispanic to white pitchers. Against white umpires, the hispanic pitchers got 0.50 percent fewer strikes. Against hispanic umpires, the hispanic pithcers got 0.42 percent fewer strikes. There's barely any difference.

Comparing umpires ... white umpires called 0.28 percent more strikes for white pitchers than hispanic umpires did. White umpires also called 0.20 percent more strikes for hispanic pitchers than hispanic umpires did. Again, barely any difference.

The effect in the original sample was driven by the middle cell (hispanic/hispanic), which was more than a full percentage point higher than it was "supposed" to be. This doesn't happen in the new sample, where the middle cell seems to be within about 0.08 of where it should be.

The SD of that new middle cell (hispanic/hispanic) is 0.73 percent. The SD of the bottom middle cell (which appears to be very low) is 0.50 percent, so even that one isn't significant. And the two bottom cells in the right-hand column have very small sample sizes, so those can probably be ignored.
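Those SDs look like ordinary binomial standard errors, sqrt(p(1-p)/n). If so, the implied cell size works out to roughly 4,000 called pitches; that's my back-of-the-envelope inference, not a number from the study:

from math import sqrt

p = 0.31        # roughly the called-strike rate in the hispanic/hispanic cell
sd_pct = 0.73   # the quoted SD, in percentage points

n_implied = p * (1 - p) / (sd_pct / 100) ** 2
print(round(n_implied))   # about 4,000 called pitches in that cell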


Verdict: although 2004-2006 does show some evidence of bias, there is no such effect for 2002-3-7-8.


Saturday, July 16, 2011

Minority pitchers succeed with fewer called strikes

I'm scheduled to talk about umpires and racial bias in a couple of weeks at JSM in Miami. I was hoping not to have to repeat the same old things I've been talking about for the last few years, so I decided to see if there's anything new I could find. And I think I've got something, maybe. Well, I thought I had something, and it's interesting, but I now think it might be a false alarm with respect to umpires and race.

First, a quick review (and I promise it'll be quick). The Hamermesh study of racial bias (.pdf) was based on a chart that looked like this:


Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 31.88 31.27 31.27
Hspnc Umpire-- 31.41 32.47 28.29
Black Umpire-- 31.22 31.21 32.52
--------------------------------
All Umpires -- 31.83 31.30 31.32


The numbers are the percentage of called pitches (not swung at) that the umpire called a strike. This chart is based only on low-attendance games (less than 30,000 fans), where the study's authors found the strongest effect. It's my attempt to reproduce their results from Retrosheet data (the authors didn't provide the equivalent chart to what I have here).

If you look at the chart, you will see that, for any race of pitcher or umpire, the largest percentage of pitches are strikes exactly when the race of the umpire matches the race of the pitcher. The original study did a big regression and found that this result is indeed statistically significant. They concluded that umpires are biased in favor of pitchers of their own race (or biased against pitchers of a different race).

That's all I'm going to say about that; if you want to see my arguments, you can go here. Now, I'm going to take a different route.

--------

Let's start by ignoring umpires for now, and just looking at the pitchers. The bottom row of the chart shows the overall called strike percentage of the pitchers. Let me repeat them here for clarity:

31.83 percent strikes -- white pitchers
31.30 percent strikes -- hispanic pitchers
31.32 percent strikes -- black pitchers

It looks like there are real differences between the pitchers. Now, it's *possible* that the entire effect is actually caused by biased umpires, but nobody really believes that, including the authors of the original study. Different pitchers have different attributes, and it's probably just that the white pitchers happen to be the kind of pitchers who throw more called strikes than the minority pitchers.

Moreover, it would appear that the white pitchers happen to be *better* than the minority pitchers, since their strike percentage is higher. In fact, I think I may have said this a few times in the past, that the white pitchers were more successful.

I was wrong. Actually, it's the minority pitchers who performed better, *despite* the fact that their called pitches were less likely to be strikes.

Here are the opposition batting records for each of the three groups of pitchers, normalized to 600 PA:

------------AB--H--2B-3B-HR-BB-SO---avg-RC27
--------------------------------------------
White .... 543 147 30 3 17 51 099 0.271 5.02
Hispanic . 541 141 28 3 17 53 108 0.261 4.71
Black .... 546 145 28 3 14 48 106 0.266 4.57

The white pitchers performed the worst, striking out fewer batters and allowing more hits and runs. The last column of the batting record is "runs created per 27 outs."
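If you want to compute RC27 from a batting line like those above, here's a sketch using the basic Runs Created formula. It won't reproduce the table exactly -- it gives about 5.3 for the white pitchers' line instead of 5.02 -- presumably because the table was built with a different RC variant and a fuller outs count:

def rc27(ab, h, doubles, triples, hr, bb):
    singles = h - doubles - triples - hr
    tb = singles + 2 * doubles + 3 * triples + 4 * hr   # total bases
    rc = (h + bb) * tb / (ab + bb)                      # basic runs created
    outs = ab - h                                       # ignores GIDP, CS, SF, etc.
    return 27 * rc / outs

# White pitchers' opposition line, normalized to 600 PA:
print(round(rc27(543, 147, 30, 3, 17, 51), 2))   # about 5.3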

What's going on? How is it that the minority pitchers did so much better despite having fewer called strikes? My first reaction was this: perhaps the relationship between called strikes and performance is *negative*. That is, maybe having lots of called strikes means you're throwing lots of pitches right down the middle of the plate, and you're getting hammered. Logically possible, right?

But it doesn't seem to be true. I ran a regression of Component ERA vs. Called Strike Percentage for starting pitchers with 100 IP or more, and the relationship goes the way you'd think: the higher the called strikes, the lower the ERA and the more successful the pitcher. In fact, it's a pretty strong relationship: every 0.1 percentage point in called strike percentage (example: from 31.83 percent to 31.93 percent) lowers ERA by 0.11. That's almost exactly what you'd expect knowing that the difference between a ball and a strike is approximately .14 runs.

So how is it that those pitchers bucked the relationship, and had a better performance despite fewer called strikes?

I think I was able to find the answer: they compensated by having more pitches swung at. As it turns out, the benefit of extra pitches swung at is also positive: an increase of 0.1 percentage points in swing rate lowers ERA by 0.13 points.
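To make the setup concrete, here's a sketch of that kind of regression. The data below is synthetic, generated only so the example runs end to end; the column names and every constant are made up rather than taken from the actual pitcher-season file:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1350   # roughly the number of 100+ IP starter-seasons used later in this post

called = rng.normal(31.8, 1.5, n)   # called strike percentage
swung  = rng.normal(45.0, 2.0, n)   # percentage of pitches swung at
# fake component ERA built to match the slopes quoted above (purely illustrative)
era_c  = 98 - 1.1 * called - 1.3 * swung + rng.normal(0, 0.8, n)

X = sm.add_constant(pd.DataFrame({"called_strike_pct": called, "swing_pct": swung}))
fit = sm.OLS(era_c, X).fit()
print(fit.params.round(2))
# Expect about -1.1 on called_strike_pct and -1.3 on swing_pct, i.e. a
# 0.1-point increase in either rate lowers ERA by roughly 0.11 or 0.13.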

Here are the numbers for pitches swung at:

44.99 percent pitches swung at -- White
45.52 percent pitches swung at -- Hispanic
46.84 percent pitches swung at -- Black

These are large differences, more than comparable to the differences in called strike percentage.

(By the way, keep in mind that the denominators of the two measures are different. Pitches swung at is (swung at and missed + foul balls + put in play) divided by total pitches. Called strike percentage is (called strikes) / (called strikes + balls).)
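Here's a tiny worked example of those two denominators, with made-up pitch counts for one hypothetical pitcher:

swung_miss, fouls, in_play = 150, 280, 380   # pitches swung at
called_strikes, balls      = 300, 640        # pitches not swung at

total_pitches = swung_miss + fouls + in_play + called_strikes + balls

swing_pct = 100 * (swung_miss + fouls + in_play) / total_pitches
called_strike_pct = 100 * called_strikes / (called_strikes + balls)

print(round(swing_pct, 1), round(called_strike_pct, 1))   # 46.3 and 31.9

The second rate ignores swings entirely, so the two percentages don't have to add up to anything in particular.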

Here's the same 3x3 chart as earlier, but this time using swinging percentage:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 45.01 45.51 46.92
Hspnc Umpire-- 44.63 45.61 43.59
Black Umpire-- 44.77 46.05 46.82
--------------------------------
All Umpires -- 44.99 45.52 46.84

Just like in the original chart, the numbers are higher when the umpire's race matches the pitcher's race (with the exception of black pitchers facing white umpires).

Now, I suppose you could argue that these differences, also, could be attributed to umpire bias. It's possible that, knowing that more umpires are biased against them, minority pitchers have to throw down the middle to compensate. That results in batters swinging the bat more.

The problem with that theory is that the minority pitchers *improved* under this (alleged) injustice. If it's really racist bias, shouldn't they have gotten worse? Because, if the racism actually made them compensate in such a way that they got better, why wouldn't they compensate all the time, not just for umpires of the opposite race?

If you want to hold on to the hypothesis that it's umpire bias, you have to assume that the bias backfired, and that the pitchers, in their ignorance, didn't realize that there was a way to pitch better than they were already pitching. That seems farfetched.

---------

So, the minority pitchers have a *lower* percentage of called strikes, but a *higher* percentage of swinging strikes. When I saw that, I thought it might be normal: the more batters swing, the fewer strikes remain to be called by the umpire. But, again, that turns out not to be the case. There's a strong positive relationship between called strike percentage and swinging strike percentage, with a correlation coefficient of .23 (this is for 1,350 starting pitcher seasons of 100+ IP, 2000-2009).

Why, then, are the black and hispanic pitchers bucking the trend? The only thing I can think of is that even though the correlation between called strikes and swinging strikes is positive, maybe there are certain types of pitchers who go the opposite way. For instance, maybe there are three types of pitchers:

1. Pitchers who throw right down the middle. They get a lot of swings, and, when the batter doesn't swing, it's very likely to be a strike.

2. Pitchers with poor control. They don't get a lot of swings, and, when the batter doesn't swing, it's likely to be a ball.

3. Pitchers who normally throw right down the middle, but like to waste pitches frequently (or throw a certain type of pitch that sometimes goes awry). They get a lot of swings, but, when the batter doesn't swing, it's one of those waste pitches and likely to be a ball.

Types 1 and 2 would show a positive correlation between swings and called strikes. Type 3 would show a negative correlation. If there are a lot more types 1 and 2 than type 3, the overall correlation would be positive.
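Here's a toy simulation of that idea. Every proportion and rate in it is invented; the point is just that a minority of Type 3 pitchers can sit off the trend while the overall correlation stays positive:

import numpy as np

rng = np.random.default_rng(2)

def pitchers(n, swing_mean, called_mean):
    # each row is one pitcher-season: (swing %, called strike %)
    return np.column_stack([rng.normal(swing_mean, 1.0, n),
                            rng.normal(called_mean, 1.0, n)])

type1 = pitchers(600, 47, 33)   # down the middle: lots of swings, called pitches often strikes
type2 = pitchers(600, 43, 30)   # poor control: fewer swings, called pitches often balls
type3 = pitchers(150, 47, 30)   # wastes pitches: lots of swings, but called pitches often balls

all_seasons = np.vstack([type1, type2, type3])
print(round(float(np.corrcoef(all_seasons[:, 0], all_seasons[:, 1])[0, 1]), 2))
# Comes out clearly positive, even though the Type 3 pitchers, taken alone,
# sit below the overall trend.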

So, maybe black and hispanic pitchers are more likely to be Type 3. Any other explanations?

---------

BTW, my first reaction was that this all had to do with count. In "Scorecasting," the authors found that umpires were reluctant to call a third strike or a fourth ball on a close pitch. That would explain the observations perfectly, like this: The minority pitchers get more strikeouts. So they get more two-strike pitches. Therefore, they get more batters swinging on those pitches, and also fewer called strikes on those pitches. That's enough to give us the results we saw.

Alas, the beautiful theory doesn't hold up. I reran the tables, but looking only at 0-0 pitches. Again, (a) the minority pitchers had more swings, and (b) on the remaining pitches, the minority pitchers got fewer called strikes. Numbers available on request.

---------

So what is it that the minority pitchers have in common that gives them this unusual combination of low called strikes and high swinging strikes? I don't know, but I bet someone reading this can tell me.

For the ten black pitchers in the study, I looked at their tendencies from 2000 to 2009 (even though the study was only 2004 to 2006). The difference between their swinging strike percentage and their called strike percentage was 16.04, well above the average of 13.60. What is it about them, as a group, that would explain that?

Arthur Rhodes
CC Sabathia
Darren Oliver
Dontrelle Willis
Edwin Jackson
Ian Snell
Jerome Williams
LaTroy Hawkins
Ray King
Tom Gordon

I'd give you the hispanic pitchers -- I think there's about 30 of them -- but I don't have a list handy.

---------

In any case, and getting back to the issue of umpire bias ...

This is where the false alarm comes in. When I saw that a higher called strike percentage means different things for different pitchers, I thought we might have an explanation: rather than the umpires calling more unmerited strikes, maybe it was just those pitchers pursuing a different strategy. Maybe they were occasionally deciding to pitch how the average white pitcher does -- whatever that is -- and getting more called strikes, but without a change in performance.

Alas, that's not true. *Between* races of pitchers, increased called strike percentage didn't mean better performance. But *within* races of pitchers, it did.

Here's the original 3x3 chart, but with RC27 instead of called strike percentage:

Pitcher ------ Whte Hspn Blac
------------------------------
White Umpire-- 4.97 4.77 4.49
Hspnc Umpire-- 5.15 4.59 5.88
Black Umpire-- 5.47 4.20 5.39
-----------------------------
All Umpires -- 5.02 4.71 4.57

With the exception of the bottom-right cell and the bottom-center cell, the RC27 figures match the order of the called strike figures (see the very first chart of this post). It does seem like, as a characteristic of their style, black and hispanic pitchers successfully sacrifice called strikes in exchange for swinging strikes ... but when they *do* get those called strikes from certain umpires, they do even better.

So, pitchers *do* seem to benefit from extra called strikes, once you control for who the pitcher is. So we still have the same problem we had at the beginning.

------

That problem, still, appears to be that when the pitcher was hispanic, hispanic umpires called around 40 too many strikes out of 2,864 called pitches.
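For the arithmetic: my eyeball estimate of the "expected" rate for that cell, given its row and column, is somewhere around 31.0 or 31.1 percent -- that estimate is mine, not a number the study computed -- and that's where the 40 comes from:

called_pitches = 2864
observed_rate, expected_rate = 32.47, 31.1   # percent; the expected rate is a rough guess

extra_strikes = called_pitches * (observed_rate - expected_rate) / 100
print(round(extra_strikes))   # about 39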

40 pitches doesn't seem like a lot over three years ... but it's only over the equivalent of about 30 or 40 team-games (1,349 PA). I don't really see an argument for how those 40 pitches could have been miscalled. It can't be anything the original study controlled for ... like home/road, starter/reliever, score, identity of the pitcher, etc. It would have to be an interaction of some of those things. Like, for instance, pitcher A throws a lot of inside sliders, and umpire B likes to call those strikes, and B happened to randomly umpire a lot of A's games.

But I don't see how the numbers work out. It's still 40 pitches in 30 games. With three hispanic umpires and 30 hispanic pitchers, that's 90 possible combinations. Some are more likely than others -- we're only looking at pitchers in front of 30,000 or fewer fans, which concentrates them a bit among certain teams -- but still, 90 combinations over 30 games makes it unlikely that one or two pairs would dominate to the tune of 40 pitches.

So, I thought I had an explanation ... but, after all this, I don't think I do. I still suspect that the result is just random, and not racial bias or any other explanation, but ... that's just my opinion.

Still, I need to think some more. Now that we know that more called strikes does not *always* lead to improved performance, and that it depends on the pitcher ... can you see any arguments that I'm missing, for what else might be happening?

-------

UPDATE: OK, one more theory I thought of. Suppose pitcher style varies from game to game. Take, for instance, a hispanic pitcher. Some games, he pitches one way, and gets few called strikes and lots of swinging strikes. Other games, and independently of the umpire, he consciously decides to pitch differently, and he gets more called strikes and fewer swinging strikes.

In that case, pitches are no longer independent -- it's *games* that are independent. That means you have to use a statistical technique that accounts for the clustering, as in cluster sampling. The bottom line, there, is that the SD goes way up. The results stay the same, but the confidence interval widens and the statistical significance disappears.

So, if there's evidence that pitchers' expected percentages change on a game-by-game basis (that is, the *expectations* have to change due to pitcher behavior, not just the outcome of the game fluctuating because of random variation), that probably negates the statistical significance, which is the only reason to suspect umpire bias.
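To put a formula on it: the standard cluster-sampling adjustment multiplies the naive binomial SE by the square root of the "design effect," 1 + (m - 1) * rho, where m is the cluster size (called pitches per game) and rho is the within-game correlation of outcomes. The m and rho below are hypothetical; only the 2,864 called pitches and the ballpark strike rate come from the earlier numbers:

from math import sqrt

p, n = 0.32, 2864    # ballpark called-strike rate and number of called pitches
m, rho = 40, 0.02    # hypothetical cluster size and within-game (intraclass) correlation

se_naive = sqrt(p * (1 - p) / n)
se_clustered = se_naive * sqrt(1 + (m - 1) * rho)
print(round(100 * se_naive, 2), round(100 * se_clustered, 2))   # 0.87 vs 1.16 percentage points

Even a small within-game correlation widens the interval quite a bit, which is why game-to-game changes in pitching style, if they're real, would matter for the significance calculation.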

