Tuesday, September 07, 2021

Are umpires racially biased? A 2021 study (Part II)

(Part I is here.)


20 percent of drivers own diesel cars, and the other 80 percent own regular (gasoline) cars. Diesels are, on average, less reliable than regular cars. The average diesel costs $2,000 a year in service, while the average regular car only costs $1,000. 

Researchers wonder if there's a way to reduce costs. Maybe diesels cost more partly because mechanics don't like them, or are unfamiliar with them? They create a regression that controls for the model, age, and mileage of the car, as well as driver age and habits. But they also include a variable for whether the mechanic owns the same type of car (diesel or gasoline) as the owner. They call that variable "UTM," or "user/technician match".

They run the regression, and the UTM coefficient comes out negative and significant: when the mechanic owns the same type of car as the user, maintenance costs are more than 13 percent lower! The researchers conclude that finding a mechanic who owns the same kind of car as you will substantially reduce your maintenance costs.

But that's not correct. The mechanic makes no difference at all. That 13 percent from the regression is showing something completely different.

If you want to solve this as a puzzle, you can stop reading and try. There's enough information here to figure it out. 

-------

The overall average maintenance cost, combining gasoline and diesel, is $1200. That's the sum of 80 percent of $1000, plus 20 percent of $2000.

So what's the average cost for only those cars that match the mechanic's car? My first thought was, it's the same $1200. Because, if the mechanic's car makes no difference, how can that number change?

But it does change. The reason is: when the user's car matches the mechanic's, it's much less likely to be a diesel. The gasoline owners are over-represented when it comes to matching: each has an 80% chance of being included in the "UTM" sample, while each diesel owner has only a 20% chance.

In the overall population, the ratio of gasoline to diesel is 4:1. But the ratio of "gasoline/gasoline" to "diesel/diesel" is 16:1. So instead of 20%, the proportion of "double diesels" in the "both cars match" population is only 1 in 17, or 5.9%.

That means the average cost of UTM repairs is only $1059: that's 94.1 percent of $1000, plus 5.9 percent of $2000. That's $141 less than the overall $1200 -- a gap of 13.3 percent of the UTM average, which is the "more than 13 percent" the regression picked up.
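If you want to check the arithmetic, here's a minimal Python sketch of the calculation, using the ownership shares and costs from the setup above (and assuming mechanics own car types in the same 80/20 proportion):

shares = {"gas": 0.80, "diesel": 0.20}   # ownership shares
costs  = {"gas": 1000,  "diesel": 2000}  # average annual service cost

# Overall average cost: $1200
overall = sum(shares[t] * costs[t] for t in shares)

# P(user owns type t AND mechanic owns the same type) = shares[t] squared
match = {t: shares[t] ** 2 for t in shares}    # 0.64 gas, 0.04 diesel
utm_avg = sum(match[t] * costs[t] for t in match) / sum(match.values())

print(overall, round(utm_avg))   # 1200.0 1059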

Here's a chart that may make it clearer. It shows how the raw numbers of pairings break down, per 1000 cars:

                   Technician
               Gasoline   Diesel    Total
------------------------------------------
User gasoline     640       160      800
User diesel       160        40      200
------------------------------------------
Total             800       200     1000
 

The diagonal -- the 640 and the 40 -- is where the user matches the mechanic. There are 680 cars on that diagonal, but only 40 (1 in 17) are diesel.

In short: the "UTM" coefficient is significant not because matching the mechanic selects better mechanics, but because it selectively samples for more reliable (gasoline) cars.
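If you'd rather simulate than calculate, here's a quick Python sketch making the same point. The cost noise is made up, and the mechanic has no effect on cost whatsoever -- yet the "matching" cars still come out hundreds of dollars cheaper on average:

import random

random.seed(1)
totals = {"match": [0, 0], "nomatch": [0, 0]}   # [sum of costs, count]

for _ in range(200_000):
    user_diesel = random.random() < 0.20   # 20% of users own diesels
    mech_diesel = random.random() < 0.20   # mechanic's car, independent
    cost = (2000 if user_diesel else 1000) + random.gauss(0, 300)
    key = "match" if user_diesel == mech_diesel else "nomatch"
    totals[key][0] += cost
    totals[key][1] += 1

for key, (s, n) in totals.items():
    print(key, round(s / n))   # match ~1059, nomatch ~1500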

--------

In the umpire/race study I talked about last post, the author did something similar: he put all the umpires and batters together into one regression, and looked at a "UBM" variable indicating whether the umpire's race matched the batter's race.
From last post, here's the table the author included. The numbers are umpire errors per 1000 outside-of-zone pitches (negative favors the batter).

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8       ---     +5.9
White batter       +5.6      -4.4      ---

I had adjusted that to equalize the baseline:

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---

I think I'm able to estimate, from the original study, that the batter population was almost exactly in a 2:3:4 ratio -- 22 percent Black, 34 percent Hispanic, and 44 percent White. Using those numbers, I'm going to adjust the chart one more time, to show approximately what it would look like if the umpires were exactly alike (no bias) and each column averaged out to zero (weighting the batters 22:34:44).

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -2.2     -2.2     -2.2
Hispanic batter    +3.8     +3.8     +3.8
White batter       -1.7     -1.7     -1.7

I chose those numbers so the average UBM (average of diagonals in ratio 22:34:44) is zero, and also to closely fit the actual numbers the study found. That is: suppose you ran a regression using the author's data, but controlling for batter and umpire race.  And suppose there was no racial bias. In that case, you'd get that table, which represents our null hypothesis of no racial bias.

If the null hypothesis is true, what will a regression spit out for UBM? If the batters were represented in their actual ratio, 22:34:44, you'd get essentially zero:

Diagonal          Effect Weight    Product
-------------------------------------------------
Black UBM          -2.2    22%     -0.48  
Hispanic UBM       +3.8    34%     +1.29  
White UBM          -1.7    44%     -0.75  
-------------------------------------------------
Overall UBM               100%     +0.06  per 1000

However: in the actual population in the MLB study, the diagonals do NOT appear in the 22:34:44 ratio. That's because the umpires were overwhelmingly White -- 88 percent White. There were only 5 percent Black umpires, and 7 percent Hispanic umpires. So the White batters matched their umpire much more often than the Hispanic or Black batters.

Using 5:7:88 for umpires, and 22:34:44 for batters, here's the relative frequency of each combination, per 1000 pitches:

                                             Batter
Umpire             Black   Hispanic  White    Total
---------------------------------------------------
Black batter        11        15      194      220
Hispanic batter     17        24      299      340
White batter        22        31      387      440
---------------------------------------------------
Umpire total        50        70      880     1000

Because there are so few minority umpires, there are only 24 Hispanic/Hispanic pairs out of 422 total matches on the UBM diagonal.  That's only 5.7% Hispanic batters, rather than 34 percent:

Diagonal       Frequency  Percent
----------------------------------
Black UBM            11     2.6% 
Hispanic UBM         24     5.7%
White UBM           387    91.7%     
----------------------------------
Overall UBM         422     100%

If we calculate the observed average of the diagonal, with this 11/24/387 breakdown, we get this:
                                      
                  Effect  Weight      Product
--------------------------------------------------
Black UBM          -2.2    2.6%    -0.06 per 1000
Hispanic UBM       +3.8    5.7%    +0.22 per 1000
White UBM          -1.7   91.7%    -1.56 per 1000 
--------------------------------------------------
Overall UBM                100%    -1.40 per 1000

Hispanic batters receive more bad calls for reasons other than racial bias. Restricting the sample to plate appearances where the batter matches the umpire selectively removes Hispanic batters from the UBM pool, and so the pool sees fewer bad calls.

Under the null hypothesis of no bias, UBM plate appearances still see 1.40 fewer bad calls per 1000 pitches, because of selective sampling.

------

That 1.40 figure is compared to the overall average. The regression coefficient, however, compares it to the non-UBM case. What's the average of the non-UBM case?

Well, if a UBM happens 422 times out of 1000, and results in 1.40 pitches fewer than average, and the average is zero, then the other 578 times out of 1000, there must have been 1.02 pitches more than average. 

                  Effect  Weight       Product
--------------------------------------------------
UBM                -1.40   42.2%   -0.59 per 1000
Non-UBM            +1.02   57.8%   +0.59 per 1000
--------------------------------------------------
Full sample                100%    -0.00 per 1000

So the coefficient the regression produces -- UBM compared to non-UBM -- will be 2.42.
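Here's the whole chain of arithmetic in one short Python sketch, using the no-bias effect numbers from my tables and the 5:7:88 / 22:34:44 populations:

ump    = {"B": 0.05, "H": 0.07, "W": 0.88}   # umpire population
bat    = {"B": 0.22, "H": 0.34, "W": 0.44}   # batter population
effect = {"B": -2.2, "H": +3.8, "W": -1.7}   # per 1000, same for all umpires

diag  = {r: bat[r] * ump[r] for r in bat}    # UBM frequency by race
p_ubm = sum(diag.values())                                 # 0.422

ubm = sum(diag[r] * effect[r] for r in bat) / p_ubm        # about -1.40
non_ubm = -ubm * p_ubm / (1 - p_ubm)                       # about +1.02

print(round(ubm, 2), round(non_ubm, 2), round(ubm - non_ubm, 2))
# -1.4 1.02 -2.43 -- the "2.42" coefficient, give or take rounding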

What did the actual study find? 2.81. 

That leaves only 0.39 as the estimate of potential umpire bias:

-2.81  Selective sampling plus possible bias
-2.42  Effect of selective sampling only
---------------------------------------------
-0.39  Revised estimate of possible bias

The study found 2.81 fewer bad calls (per 1000) when the umpire matched the batter, but 2.42 of that is selective sampling, leaving only 0.39 that could be umpire bias.

Is that 0.39 statistically significant? I doubt it. For what it's worth, the original estimate had an SD of 0.44. So adjusting for selective sampling, we're less than 1 SD from zero.

--------

So, the conclusion: the study's finding of a 0.28% UBM effect cannot be attributed to umpire bias. It's mostly a natural mathematical artifact resulting from the fact that

(a) Hispanic batters see more incorrect calls for reasons other than bias, 

(b) Hispanic umpires are rare, and

(c) The regression didn't control for the race of batter and umpire separately.

Because of that, almost the entire effect the study attributes to racial bias is just selective sampling.














Monday, August 30, 2021

Are umpires racially biased? A 2021 study (Part I)

Are MLB umpires racially biased? A recent study claims they are. The author, who wrote it as an undergrad thesis, mentioned it on Twitter, and when I checked a week or so later, there were lots of articles and links to it. (Here, for instance, is a Baseball Prospectus post reporting on it.  And here's a Yahoo! report.)

The study tried to figure out whether umpires make more bad calls against batters* of a race other than theirs (where there is no "umpire-batter match," or "UBM," as the literature calls it). It ran regressions on called pitches from 2008 to 2020, to figure out how best to predict the probability of the home-plate umpire calling a pitch incorrectly (based on MLB "Gameday" pitch location). The author controlled for many different factors, and found a statistically significant coefficient for UBM, concluding that the batter gains an advantage when the umpire is of the same race. It also argues that white umpires in particular "could be the driving force behind discrimination in MLB."  

I don't think any of that is right. I think the results point to something different, and benign. 

---------

Imagine a baseball league where some teams are made up of dentists, while the others are jockeys. The league didn't hire any umpires, so the players take turns, and promise to call pitches fairly.

They play a bunch of games, and it turns out that the umpires call more strikes against the dentists than against the jockeys. Nobody is surprised -- jockeys are short, and thus have small strike zones.

It's true that if you look at the Jockey umpires, the data shows they call a lot fewer strikes against batters of their own group than against batters of the other group. Their "UBM" coefficient is high and statistically significant.

Does that mean the jockey umps are "racist" against dentists? No, of course not. It's just that the dentists have bigger strike zones. 

It's the same, but in reverse, for the dentist umpires. They call more strikes against their fellow dentists -- again, not because of pro-jockey "reverse racism," but because of the different strike zones.

Later, teams of NBA players enter the league. These guys are tall, with huge strike zones, so they get a lot of called strikes, even from their own umpires.

Let's put some numbers on this: we'll say there are 10 teams of dentists, 1 team of jockeys, and 2 teams of NBA players. The jockeys are -10 in called strikes compared to average, and the NBA players are +10. That leaves the dentists at -1 (in order for the average to be zero).

Here's a chart that shows every umpire is completely fair and unbiased. 

Umpire             Jockey    NBA    Dentist
-------------------------------------------
Jockey batter:       -10     -10     -10
NBA batter           +10     +10     +10
Dentist batter        -1      -1      -1

The "UBM" cells are the ones on the diagonal, where the umpire matches the batter. If you look only at those cells, and don't think too much about what's going on, you could think the umpires are horribly biased. The Jockey batters get 10 fewer strikes than average from Jockey umpires!  That's awful!

But then when you look closer, you see the jockey row is *all* -10. That means all the umpires called the jockeys the same way (-10), so it's probably something about the jockey batters that made that happen. In this case, it's that they're short.

I think this is what's going on in the actual study. But it's harder to see, because the chart isn't set up with the raw numbers. The author ran different regressions for the three different umpire races, and set a different set of batters as the zero-level for each. Since they're calibrated to a different standard of player, the results make the umpires look very different.

If I had done here what the author did there, the chart above would have looked like this:

Umpire             Jockey    NBA   Dentist
------------------------------------------
Jockey batter:         0    -20      -9
NBA batter           +20      0     +11
Dentist batter        +9    -11       0
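If it helps, here's a small Python sketch of that re-zeroing: it starts from the perfectly fair table and reproduces the misleading one above.

groups = ["Jockey", "NBA", "Dentist"]
value  = {"Jockey": -10, "NBA": +10, "Dentist": -1}   # same for every umpire

# fair[batter][umpire]: identical columns, because no umpire is biased
fair = {b: {u: value[b] for u in groups} for b in groups}

# Re-zero each umpire column on its own diagonal cell, as the study did
rebased = {b: {u: fair[b][u] - fair[u][u] for u in groups} for b in groups}

for b in groups:
    print(f"{b:8}", [rebased[b][u] for u in groups])
# Jockey   [0, -20, -9] ... exactly the "biased-looking" table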

If you just look at this chart without knowing you can't compare the columns to each other (because they're based on a different zero baseline), it's easy to think there's evidence of bias. You'd look at the chart and say, "Hey, it looks like Jockey umpires are racist against NBA batters and dentists. Also, dentist umpires are racist against NBA players but favor Jockeys somewhat. But, look!  NBA umpires actually *favor* other races!  That's probably because NBA umpires are new to the league, and are going out of their way to appear unbiased."  

That's a near-perfect analogue to the actual study.  This is the top half of Table 8, which measures "over-recognition" of pitchers, meaning balls incorrectly called as strikes (hurting the batter). I've multiplied everything by 1000, so the numbers are "wrong strike calls per 1000 called pitches outside the zone".

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8      ---      +5.9
White batter       +5.6      -4.4      ---

It's very similar to my fake table above, where the dentists and Jockeys look biased, but the NBA players look "reverse biased". 

The study notes the chart and says,

"For White umpires, the results suggest that for pitches outside the zone, Hispanic batters ... face umpire discrimination. [But Hispanic umpires have a] "reverse-bias effect ... [which] holds for both Black and White batters... Lastly, the bias against non-Black batters by Black umpires is relatively consistent for both Hispanic and White batters."

And it rationalizes the apparent "reverse racism" from Hispanic umpires this way:

"This is perhaps attributable to the recent increase in MLB umpires from Hispanic countries, who could potentially fear the consequences of appearing biased towards Hispanic players."

But ... no. The apparent result comes almost completely from setting a different zero level for each umpire/batter race -- in other words, from arbitrarily setting the diagonal to zero. That only works if the groups of batters are exactly the same. They're not. Just as Jockey batters have different characteristics than NBA player batters, it's likely that Hispanic batters don't have exactly the same characteristics as White and Black batters.

The author decided that White, Black, and Hispanic batters all should get exactly the same results from an unbiased umpire. If that assumption is false, the effect disappears. 

Instead, the study could have made a more conservative assumption: that unbiased umpires of any race should call *White* batters the same. (Or Black batters, or Hispanic batters. But White batters have the largest sample size, giving the best signal-to-noise ratio.)

That is, use a baseline where the bottom row is zero, rather than one where the diagonal is zero. To do that, take the original, set the bottom cells to zero, but keep the differences between any two rows in the same column:

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---
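The adjustment is one line of arithmetic. Here's a Python sketch that takes the study's original table and re-zeros it on the White-batter row:

umps = ["Black", "Hispanic", "White"]
table8 = {   # errors per 1000; the study's parameterization zeroes the diagonal
    "Black":    {"Black":  0.0, "Hispanic": -5.3, "White": -0.3},
    "Hispanic": {"Black": +7.8, "Hispanic":  0.0, "White": +5.9},
    "White":    {"Black": +5.6, "Hispanic": -4.4, "White":  0.0},
}

# Subtract the White-batter row from every row, column by column
rebased = {b: {u: round(table8[b][u] - table8["White"][u], 1) for u in umps}
           for b in table8}

for b in table8:
    print(f"{b:9}", [rebased[b][u] for u in umps])
# Black     [-5.6, -0.9, -0.3] -- matches the adjusted table above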

Does this look like evidence of umpire bias? I don't think so. For any given race of batter, all three groups of umpires call about the same number of bad strikes. In fact, all three groups of umpires even have the same *order* among batter groups: Hispanic the most, White second, and Black third. (The raw odds of that happening are 1 in 36.) 

The only anomaly is that Black umpires maybe look like they benefit Black batters, by about 5 pitches per 1,000 -- but even that difference is not statistically significant. 

In other words: the entire effect in the study disappears when you remove the hidden assumption that Hispanic batters respond to pitches exactly the same way as White or Black batters. And the pattern of "discrimination" is *exactly* what you'd expect if the Hispanic batters respond to pitches in ways that result in more errors -- that is, it explains the anomaly that Hispanic umpires tend to look "reverse racist."

Also, I think the entire effect would disappear if the author had expanded his regression to include dummy variables for the race of the batter.  

------

If, like me, you find it perfectly plausible that Hispanic batters respond to pitches in ways that generate more umpire errors, you can skip this section. If not, I will try to convince you.

First, keep in mind that it's a very, very small difference we're talking about: maybe 4 pitches per 1,000, or 0.4 percent. Compare that to some of the other, much larger effects the study found:

 +8.9%   3-0 count on the batter
 -0.9%   two outs
 +2.8%   visiting team batting
 -3.3%   right-handed batter
 +0.5%   right-handed pitcher
+19.7%   bases loaded (!!!)
 +1.4%   pitcher 2 WAR vs. 0 WAR
 +0.9%   pitcher has two extra all-star appearances
 +4.0%   2019 vs. 2008
---------------------------------------------------
 +0.4%   batter is Hispanic
---------------------------------------------------

I wouldn't have expected most of those other effects to exist, but they do. And they're so large that they make this one, at only +0.4%, look unremarkable. 

Also: with so many large effects found in the study, there are probably other factors the author didn't consider that are just as large. Just to make something up ... since handedness of pitcher and batter are so important, suppose that platoon advantage (the interaction between pitcher and batter hand, which the study didn't include) is worth, say, 5%. And suppose Hispanic batters are likely to have the platoon advantage, say, 8% less than White batters. That would give you an 0.4% effect right there.

I don't have data specifically for Hispanic batters, but I do have data for country of birth. Not all non-USA players are Hispanic, but probably a large subset are, so I split them up that way. Here are the batting-handedness numbers for players from 1969 to 2016:

Born in USA:       61.7% RHB
Born outside USA:  67.1% RHB

That's a difference of more than five percentage points in handedness. I don't know how that translates into platoon advantage, but it's got to be the same order of magnitude as what we'd need for 0.4%.

Here's another theory. They used to say, about prospects from the Dominican Republic, that they deliberately become free swingers because "you can't walk off the island."  

Suppose that, knowing a certain player is a free swinger, the pitcher aims a bit more outside the strike zone than usual, knowing the batter is likely to swing anyway. If the catcher sets a target outside, and the pitcher hits it perfectly, the umpire may be more likely to miscall it as a strike (at least according to many broadcasters I've heard).

Couldn't that explain why Hispanic players get very slightly more erroneous strike calls? 

In support of that hypothesis, here are K/W ratios for that same set of batters (total K divided by total BB):

Born in USA:       1.82 K per BB
Born outside USA:  2.05 K per BB 

Again, that seems around the correct order of magnitude.

I'm not saying these are the right explanations -- they might be right, or they might not. The "right answer" is probably several factors, perhaps going different directions, but adding up to 0.4%. 

But the point is: there do seem to be significant differences in hitting styles between Hispanic and non-Hispanic batters, certainly significant enough that an 0.4% difference in bad calls is quite plausible. Attributing the entire 0.4% to racist umpires (and assuming that all races of umpires would have to discriminate against Hispanics!) doesn't have any justification whatsoever -- at least not without additional evidence.

-------

Here's a TLDR summary, with a completely different analogy this time:

Eddie Gaedel's father calls fewer strikes on Eddie Gaedel than Aaron Judge's father calls on Aaron Judge. So Gaedel Sr. must be biased! 

--------

There's another part of the study -- actually, the main part -- that throws everything into one big regression and still comes out with a significant "UBM" effect, which again it believes is racial bias. I think that conclusion is also wrong, for reasons that aren't quite the same. 

That's Part II, which is now here.

----------


(*The author found a similar result for pitchers, who gained an advantage in more called strikes when they were the same race as the umpire, and a similar result for called balls as well as called strikes. In this post, I'll just talk about the batting side and the called strikes, but the issues are the same for all four combinations of batter/pitcher ball/strike.)



Thursday, May 03, 2018

NHL referees balance penalty calls between teams


That finding, from a tweet by Michael Lopez, shows that the next penalty in an NHL game is significantly less likely to go to the team that's had more penalties so far in the game.

That was a new finding to me. A few years ago, I found that the next penalty is more likely to go to the team that had the (one) most recent penalty -- but I hadn't realized that quantity matters, too.

(My previous research can be found here: part one, two, three.)

So, I dug out my old hockey database to see if I could extend Michael's results. All the findings here are based on the same data as my other study -- regular season NHL games from 1953-54 to 1984-85, as provided by the Hockey Summary Project as of the end of 2011.

-------

Quickly revisiting the old finding: referees do appear to call "make-up" penalties. The team that got the benefit of the most recent power play is almost 50 percent more likely to have the next call go against them. That team got the next penalty 59.7% of the time, versus only 40.3% for the previously penalized team.

39599/98167 .403 -- team last penalized
58568/98167 .597 -- other team

Now, let's look at total numbers of penalties instead. I've split the data into home and road teams, because road teams do get more penalties -- 52 percent vs. 48 percent overall.  (That difference is mitigated by the fact that referees balance out the calls. The first penalty of the game goes to the road team 54 percent of the time. The drop from 54 percent for the first call, down to 52 percent overall, is due to the referees balancing out the next call or calls.)

So far, nothing exciting. But here's something. It turns out that the *second* call of the game is much more likely than average to be a makeup call:

.703 -- visiting penalty after home penalty
.297 -- home penalty after home penalty

.653 -- home penalty after visiting penalty 
.347 -- visiting penalty after visiting penalty

Those numbers are huge. Overall, there are more than twice as many "make up" calls as "same team" calls.

In this case, quantity and recency are the same thing. Let's move on to the third penalty of the game, where they can be different.  From now on, I'll show the results in chart form:

.705 0-2 
.462 1-1
.243 2-0

Here's how to read the chart: when the home team has gone "0-2" in penalties -- that is, both previous penalties to the visiting team -- it gets 70.5% of the third penalties. When the previous two penalties were split, the home team gets 46.2%, similar to the overall average. When the home team got both previous penalties, though, it draws the third only 24.3% of the time (in other words, the visiting team drew 75.7%).
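(If you want to reproduce these charts from play-by-play data, the tabulation looks something like the following Python sketch. The input format -- each game as a string of "H"/"V" penalties in order, with same-time penalties already removed -- is hypothetical.)

from collections import defaultdict

def makeup_table(games, n):
    """P(home team takes penalty n), grouped by the prior home-visitor count."""
    counts = defaultdict(lambda: [0, 0])   # (home, visitor) -> [home next, total]
    for g in games:
        if len(g) < n:
            continue
        prior = g[:n - 1]
        key = (prior.count("H"), prior.count("V"))
        counts[key][0] += g[n - 1] == "H"
        counts[key][1] += 1
    return {k: (round(h / t, 3), t) for k, (h, t) in sorted(counts.items())}

# e.g. makeup_table(all_games, 3) should reproduce the chart above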

Here's the fourth penalty. I've added sample sizes, in parentheses.

.701 0-3 (755)
.559 1-2 (6951)
.373 2-1 (5845)
.261 3-0 (468)

It's a very smooth progression, from .701 down to .261, exactly what you would expect given that make-up calls are so common. 

Here's the fifth penalty:

.677 0-4 ( 195)
.619 1-3 (3244)
.465 2-2 (6950)
.351 3-1 (2306)
.316 4-0 ( 117)

That's the chart that corresponds to Michael Lopez's tweet, and if you scroll back up you'll see that these numbers are pretty close to his.

Sixth penalty:

.667 0-5 (  48)
.637 1-4 (1182)
.520 2-3 (4930)
.413 3-2 (4134)
.323 4-1 ( 773)
.226 5-0 (  31)

Again, the percentages drop every step ("monotonically," as they say in math).

Seventh penalty:

.692 0-6 (  13)
.585 1-5 ( 369)
.577 2-4 (2528)
.489 3-3 (4140)
.399 4-2 (1798)
.379 5-1 ( 219)
.200 6-0 (  13)

Eighth penalty:

.667 0-7 (   3)
.607 1-6 ( 122)
.588 2-5 ( 969)
.527 3-4 (2721)
.422 4-3 (2414)
.374 5-2 ( 652)
.412 6-1 (  68)
.000 7-0 (   1)

Almost a perfect pattern again -- only the small-sample 6-1 row is out of order. It breaks up a little bit more here, for the ninth penalty, but that's probably just small sample size.

.000 0-8 (   1)
.553 1-7 (  38)
.586 2-6 ( 348)
.566 3-5 (1358)
.484 4-4 (2063)
.392 5-3 (1037)
.340 6-2 ( 191)
.333 7-1 (  21)

(This is getting boring, so here's a technical note to break the monotony. I included all penalties, including misconducts. I omitted all cases where both teams took a penalty at the same time, even if one team took more penalties than the other. In fact, I treated those as if they never happened, so they don't break the string. This may cause the results to be incorrect in some cases: for instance, maybe Boston takes a minor, then there's a fight and Montreal gets a major and a minor while Boston gets only a major. Then, Montreal takes a minor. In that case, the study will treat the Montreal minor as a make-up call, when it's really not. I think this happens infrequently enough that the results are still valid.)

I'll give two more cases. Here's the twelfth penalty:

.692 2-9 ( 13)
.623 3-8 ( 61)
.532 4-7 (250)
.506 5-6 (478)
.488 6-5 (459)
.449 7-4 (198)
.457 8-3 ( 35)
.200 9-2 (  5)

Almost perfect.  But ... the pattern does seem to break down later on, at the 14th to 16th penalty (I stopped at 16), probably due to sample size issues. Here's the fourteenth, which I think is the most random-looking of the bunch. You could almost argue that it goes the "wrong way":

.000  2-11 (  1)
.375  3-10 (  8)
.333  4- 9 ( 27)
.516  5- 8 ( 95)
.438  6- 7 (169)
.480  7- 6 (148)
.465  8- 5 ( 71)
.577  9- 4 ( 26)
.600 10- 3 (  5)

Still, I don't think the overall conclusion -- that quantity is a factor in make-up calls -- is threatened.

------

OK, so now we know that quantity matters. But couldn't that mean that recency doesn't matter? We did find that the team with the most recent penalty was less likely to get the next one -- but that might just be because that team is also more likely to have a higher quantity at that point. After all, when a team takes three of the first four penalties, there's a 75 percent chance* it also took the most recent one. 

(* It's actually not 75 percent, because make-up calls make the sequence non-random. But the point remains.)

So, maybe the recency effect is just an illusion created by the quantity effect. Or vice versa.

So, here's what I did: I broke down every row in every table by who got the more recent call. It turns out: recency does matter.

Let's take that 3-for-4 example I just used:

.619 home team overall     (3244)
---------------------------------
.508 after VVVH            ( 486)
.639 after other sequences (2758)

From this, it looks like both aspects are at work. When the home team is "up 3-1" in penalty advantage, it gets only 51 percent of the next penalties if its own penalty was the most recent of the four. That's still more than the 46.1 percent it gets to start the game, or the 46.5 percent it would get if the count had been 2-2 instead of 3-1.

This seems to be true for most of the breakdowns -- maybe even all the ones with large enough sample sizes. I'll just arbitrarily pick one to show you ... the ninth penalty, home team 5-3.

.392 home team overall     (1037)
---------------------------------
.362 when most recent was H (743)
.469 when most recent was V (294)

Even better: here's the entire chart for the eighth penalty: overall vs. last penalty went to home team ("last H") vs. last penalty went to visiting team "last V". 

overall   last H    last V
----------------------------------
 .607      .750      .596      1-6 
 .588      .477      .609      2-5 
 .527      .446      .584      3-4 
 .422      .372      .518      4-3 
 .374      .357      .466      5-2 
 .412      .406      .500      6-1 

Clearly, both recency and quantity matter. Holding one constant, the other still follows the "make-up penalty" pattern. 

Can we figure out *how much* is recency and *how much* is quantity?  It's probably pretty easy to get a rough estimate with a regression. I'm about to leave for the weekend, but I'll look at that next week. Or you can download the results (spreadsheet here) and do it yourself.
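If you want to run that regression yourself, it would look something like this sketch. The data here is fabricated just to show the mechanics; with the real spreadsheet, x1 and x2 would come from the actual penalty sequences:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.integers(-3, 4, n)    # quantity: home minus visitor penalties so far
x2 = rng.integers(0, 2, n)     # recency: 1 if home took the most recent penalty
true_logit = -0.1 - 0.3 * x1 - 0.4 * x2             # made-up "true" effects
y = rng.random(n) < 1 / (1 + np.exp(-true_logit))   # 1 = home takes next penalty

model = LogisticRegression().fit(np.column_stack([x1, x2]), y)
print(model.coef_)   # separate estimates for the quantity and recency effects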





Monday, January 26, 2015

Are umpires biased in favor of star pitchers? Part II

Last post, I talked about the study (.pdf) that found umpires grant more favorable calls to All-Stars because the umps unconsciously defer to their "high status." I suggested alternative explanations that seemed more plausible than "status bias."

Here are a few more possibilities, based on the actual coefficient estimates from the regression itself.

(For this post, I'll mostly be talking about the "balls mistakenly called as strikes" coefficients, the ones in Table 3 of the paper.)

---

1. The coefficient for "right-handed batter" seems way too high: -0.532. That's so big, I wondered whether it was a typo, but apparently it's not.  

How big? Well, to suffer as few bad calls as his right-handed teammate, a left-handed batter would have to be facing a pitcher with 11 All-Star appearances.

The likely explanation seems to be: umpires don't call strikes by the PITCHf/x (rulebook) standard, and the differences are bigger for lefty batters than righties. Mike Fast wrote, in 2010,

"Many analysts have shown that the average strike zone called by umpires extends a couple of inches outside the rulebook zone to right-handed hitters and several more inches beyond that to left-handed hitters." 

That's consistent with the study's findings in a couple of ways. First, in the other regression, for "strikes mistakenly called as balls", the equivalent coefficient is less than a tenth the size, at -0.047. Which makes sense: if the umpires' strike zone is "too big", it will affect undeserved strikes more than undeserved balls. 

Second: the two coefficients go in the same direction. You wouldn't expect that, right? You'd expect that if lefty batters get more undeserved strikes, they'd also get fewer undeserved balls. But this coefficient is negative in both cases. That suggests something external and constant, like the PITCHf/x (rulebook) zone being smaller than the zone umpires actually call.

And, of course, if the problem is umpires not matching the rulebook, the entire effect could just be that control pitchers are more often hitting the "illicit" part of the zone.  Which is plausible, since that's the part that's closest to the real zone.

---

2. The "All-Star" coefficient drops when it's interacted with control. Moreover, it drops further for pitchers with poor control than pitchers with good control. 

Perhaps, if there *is* a "status" effect, it's only for the very best pitchers, the ones with the best control. Otherwise, you have to believe that umpires are very sensitive to "status" differences between marginal pitchers' control rates. 

For instance, going into the 2009 season, say, J.C. Romero had a career 12.5% BB/PA rate, while Warner Madrigal's was 9.1%. According to the regression model, you'd expect umpires to credit Madrigal with 37% more undeserved strikes than Romero. Are umpires really that well calibrated?

Suppose I'm right, and all the differences in error rates really accrue to only the very best control pitchers. Since the model assumes the effect is linear all the way down the line, the regression will underestimate the best and worst control pitchers, and overestimate the average ones. (That's what happens when you fit a straight line to a curve; you can see an example in the pictures here.) 

Since the best control pitchers are underestimated, the regression tries to compensate by jiggling one of the other coefficients, something that correlates with only those pitchers with the very best control. The candidate it settles on: All-Star appearances. 

Which would explain why the All-Star coefficient is high, and why it's high mostly for pitchers with good control. 

---

3. The pitch's location, as you would expect, makes a big difference. The further outside the strike zone, the lower the chance that it will be mistakenly called a strike. 

The "decay rate" is huge. A pitch that's 0.1 feet outside the zone (1.2 inches) has only 43 percent the odds of being called a strike as one that's right on the border (0 feet).  A pitch 0.2 feet outside has only 18 percent the odds (43 percent squared).  And so on.*

(* Actually, the authors used a quadratic to estimate the effect -- which makes sense, since you'd expect the decay rate to increase. If the error rate at 0.1 feet is, say, 10 percent, you wouldn't expect the rate for 1 foot to be 1 percent. It would be much closer to zero. But the quadratic term isn't that big, it turns out, so I'll ignore it for simplicity. That just renders this argument more conservative.) 

The regression coefficient, per foot outside, was 8.292. The coefficient for a single All-Star appearance was 0.047. 

So an All-Star appearance is worth 1/176 of a foot -- which is a bit more than 1/15 of an inch.

That's the main regression. For the one with the lower value for All-Star appearances, it's only an eighteenth of an inch. 
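(Here's the arithmetic for those two equivalences, in Python, using the study's coefficients:)

import math

per_foot = 8.292   # logit coefficient: distance outside the zone, per foot
per_asg  = 0.047   # logit coefficient: one All-Star appearance

print(math.exp(-0.1 * per_foot))   # 0.436 -- the "43 percent" odds at 0.1 feet
print(per_asg / per_foot)          # 0.0057 feet, about 1/176
print(per_asg / per_foot * 12)     # 0.068 inches, a bit more than 1/15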

Isn't it more plausible to think that the good pitchers are deceptive enough to fool the umpire by 1/15 of an inch per pitch, rather than that the umpire is responding to their status? 

Or, isn't it more likely that the good pitchers are hitting the "extra" parts of the umpires' inflated strike zone, at an increased rate of one inch per 15 balls? 

---

4. The distance from the edge of the strike zone is, I assume, "as the crow flies." So, a high pitch down the middle of the plate is treated as the same distance as a high pitch that's just on the inside edge. 

But, you'd think that the "down the middle" pitch has a better chance of being mistakenly called a strike than the "almost outside" pitch. And isn't it also plausible that control pitchers will have a different ratio of the two types than those with poor control? 

Also, a pitch that's 1 inch high and 1 inch outside registers as the same distance as a pitch over the plate that's 1.4 inches high. Might umpires not be evaluating two-dimensional balls differently than one-dimensional balls?

And, of course: umpires might be calling low balls differently than high balls, and outside pitches differently from inside pitches. If pitchers with poor control throw to the inside part of the plate more than All-Stars (say), and the umpires seldom err on balls inside because of the batter's reaction, that alone could explain the results.

------ 

All these explanations may strike you as speculative. But, are they really more speculative than the "status bias" explanation? They're all based on exactly the same data, and the study's authors don't provide any additional evidence other than citations that status bias exists.

I'd say that there are several different possibilities, all consistent with the data:

1.  Good pitchers get the benefit of umpires' "status bias" in their favor.

2.  Good pitchers hit the catcher's glove better, and that's what biases the umpires.

3.  Good pitchers have more deceptive movement, and the umpire gets fooled just as the batter does.

4.  Different umpires have different strike zones, and good pitchers are better able to exploit the differences.

5.  The zone umpires actually call is significantly bigger than the PITCHf/x (rulebook) zone. Since good pitchers are closer to the strike zone more often, they wind up with more umpire strikes that are PITCHf/x balls. The difference only has to be the equivalent of one-fifteenth of an inch per ball.

6.  Umpires are "deliberately" biased. They know that when they're not sure about a pitch, considering the identity of the pitcher gives them a better chance of getting the call right. So that's what they do.

7.  All-Star pitchers have a positive coefficient to compensate for real-life non-linearity in the linear regression model.

8.  Not all pitches the same distance from the strike zone are the same. Better pitchers might err mostly (say) high or outside, and worse pitchers high *and* outside.  If umpires are less likely to be fooled in two dimensions than one, that would explain the results.

------

To my gut, #1, unconscious status bias, is the least plausible of the eight. I'd be willing to bet on any of the remaining seven, that they all are contributing to the results to some extent (possibly negatively).  

But I'd bet on #5 being the biggest factor, at least if the differences between umpires and the rulebook really *are* as big as reported.  

As always, your gut may be more accurate than mine.  





Sunday, January 18, 2015

Are umpires biased in favor of star pitchers?

Are MLB umpires biased in favor of All-Star pitchers? An academic study, released last spring, says they are. Authored by business professors Braden King and Jerry Kim, it's called "Seeing Stars: Matthew Effects and Status Bias in Major League Baseball Umpiring."

"What Umpires Get Wrong" is the title of an Op-Ed piece in the New York Times where the authors summarize their study. Umps, they write, favor "higher status" pitchers when making ball/strike calls:


"Umpires tend to make errors in ways that favor players who have established themselves at the top of the game's status hierarchy."

But there's nothing special about umpires, the authors say. In deferring to pitchers with high status, umps are just exhibiting an inherent unconscious bias that affects everyone: 


" ... our findings are also suggestive of the way that people in any sort of evaluative role — not just umpires — are unconsciously biased by simple 'status characteristics.' Even constant monitoring and incentives can fail to train such biases out of us."

Well ... as sympathetic as I am to the authors' argument about status bias in regular life, I have to disagree that the study supports their conclusion in any meaningful way.

-----

The authors looked at PITCHf/x data for the 2008 and 2009 seasons, and found all instances where the umpire miscalled a ball or strike, based on the true, measured x/y coordinates of the pitch. After a large multiple regression, they found that umpire errors tend to be more favorable for "high status" pitchers -- defined as those with more All-Star appearances, and those who give up fewer walks per game. 

For instance, in one of their regressions, the odds of a favorable miscall -- the umpire calling a strike on a pitch that was actually out of the strike zone -- increased by 0.047 for every previous All-Star appearance by the pitcher. (It was a logit regression, but for low-probability events like these, the number itself is a close approximation of the geometric difference. So you can think of 0.047 as a 5 percent increase.)

The pitcher's odds also increased 1.4 percent for each year of service, and another 2.5 percent for each percentage point improvement in career BB/PA.
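(A quick check of that approximation: for logit coefficients this small, exp(b) - 1 is nearly b itself.)

import math
for b in (0.047, 0.014, 0.025):
    print(b, round(math.exp(b) - 1, 3))   # 0.048, 0.014, 0.025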

For unfavorable miscalls -- balls called on pitches that should have been strikes -- the effects were smaller, but still in favor of the better pitchers.

I have some issues with the regression, but will get to those in a future post. For now ... well, it seems to me that even if you accept that these results are correct, couldn't there be other, much more plausible explanations than status bias?

1. Maybe umpires significantly base their decisions on how well the pitcher hits the target the catcher sets up. Good pitchers come close to the target, and the umpire thinks, "good control" and calls it a strike. Bad pitchers vary, and the catcher moves the glove, and the umpire thinks, "not what was intended," and calls it a ball.

The authors talk about this, but they consider it an attribute of catcher skill, or "pitch framing," which they adjust for in their regression. I always thought of pitch framing as the catcher's ability to make it appear that he's not moving the glove as much as he actually is. That's separate from the pitcher's ability to hit the target.

2. Every umpire has a different strike zone. If a particular ump is calling a strike on a low pitch that day, a control pitcher is more able to exploit that opportunity by hitting the spot. That shows up as an umpire error in the control pitcher's favor, but it's actually just a change in the definition of the strike zone, applied equally to both pitchers.

3. The study controlled for the pitch's distance from the strike zone, but there's more to pitching than location. Better pitchers probably have better movement on their pitches, making them more deceptive. Those might deceive the umpire as well as the batter. 

Perhaps umpires give deceptive pitches the benefit of the doubt -- when the pitch has unusual movement, and it's close, they tend to call it a strike, either way. That would explain why the good pitchers get favorable miscalls. It's not their status, or anything about their identity -- just the trajectory of the balls they throw. 

4. And what I think is the most important possibility: the umpires are Bayesian, trying to maximize their accuracy. 

Start with this. Suppose that umpires are completely unbiased based on status -- in fact, they don't even know who the pitcher is. In that case, would an All-Star have the same chance of a favorable or unfavorable call as a bad pitcher? Would the data show them as equal?

I don't think so. 

There are times when an umpire isn't really sure about whether a pitch is a ball or a strike, but has to make a quick judgment anyway. It's a given that "high-status" control pitchers throw more strikes overall; that's probably also true in those "umpire not sure" situations. 

Let's suppose a borderline pitch is a strike 60% of the time when it's from an All-Star, but only 40% of the time when it's from a mediocre pitcher.

If the umpire is completely unbiased, what should he do? Maybe call it a strike 50% of the time, since that's the overall rate. 

But then: the good pitcher will get only five strike calls when he deserves six, and the bad pitcher will get five strike calls when he only deserves four. The good pitcher suffers, and the bad pitcher benefits.

So, unbiased umpires benefit mediocre pitchers. Even if umpires were completely free of bias, the authors' methodology would nonetheless conclude that umpires are unfairly favoring low-status pitchers!

----

Of course, that's not what's happening, since in real life, it's the better pitchers who seem to be benefiting. (But, actually, that does lead to a fifth (perhaps implausible) possibility for what the authors observed: umpires are unbiased, but the *worse* pitchers throw more deceptive pitches for strikes.)

So, there's something else happening. And, it might just be the umpires trying to improve their accuracy.

Our hypothetical unbiased umpire will have miscalled 5 out of 10 pitches for each player. To reduce his miscall rate, he might change his strategy to a Bayesian one. 

Since he understands that the star pitcher has a 60% true strike rate in these difficult cases, he might call *all* strikes in those situations. And, since he knows the bad pitcher's strike rate is only 40%, he might call *all balls* on those pitches. 

That is: the umpire chooses the call most likely to be correct. 60% beats 40%.

With that strategy, the umpire's overall accuracy rate improves to 60%. Even if he has no desire, conscious or unconscious, to favor the ace for the specific reason of "high status", it looks like he does -- but that's just a side-effect of a deliberate attempt to increase overall accuracy.

In other words: it could be that umpires *consciously* take the history of the pitcher into account, because they believe it's more important to minimize the number of wrong calls than to spread them evenly among different skills of pitcher. 

That could just as plausibly be what the authors are observing.

How can the ump improve his accuracy without winding up advantaging or disadvantaging any particular "status" of pitcher? By calling strikes in exactly the proportion he expects from each. For the good pitcher, he calls strikes 60% of the time when he's in doubt. For the bad pitcher, he calls 40% strikes. 

That strategy increases his accuracy rate only marginally -- from 50 percent to 52 percent (60% squared plus 40% squared). But, now, at least, neither pitcher can claim he's being hurt by umpire bias. 
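Here's a simulation sketch of the three strategies on those hypothetical borderline pitches -- 60 percent true strikes for the ace, 40 percent for the mediocre pitcher:

import random

random.seed(0)
TRUE_RATE = {"ace": 0.60, "mediocre": 0.40}  # true strike rate when in doubt
N = 100_000

strategies = {
    "coin flip (ignore pitcher)": lambda p: random.random() < 0.5,
    "always the majority call":   lambda p: TRUE_RATE[p] >= 0.5,
    "proportional (match rates)": lambda p: random.random() < TRUE_RATE[p],
}

for name, call_strike in strategies.items():
    for p, rate in TRUE_RATE.items():
        correct = strikes = 0
        for _ in range(N):
            truth = random.random() < rate   # was it really a strike?
            call = call_strike(p)            # what the umpire calls
            strikes += call
            correct += call == truth
        print(f"{name:27} {p:8} accuracy {correct/N:.2f}, strikes {strikes/N:.2f}")

# coin flip:     50% accuracy, both pitchers called 50% strikes (ace shortchanged)
# majority call: 60% accuracy, ace gets 100% strikes, mediocre 0%
# proportional:  52% accuracy, strike rates match the true 60/40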

But, even though the result is equitable, it's only because the umpire DOES have a "status bias." He's treating the two pitchers differently, on the basis of their historical performance. But King and Kim's study won't be able to tell there's a bias, because neither pitcher is hurt. The bias is at exactly the right level.

Is that what we should want umpires to do, bias just enough to balance the advantage with the disadvantage? That's a moral question, rather than an empirical one. 

Which are the most ethical instructions to give to the umpires? 

1. 

Make what you think is the correct call, on a "more likely than not" basis, *without* taking the pitcher's identity into account.

Advantages: No "status bias."  Every pitcher is treated the same.

Disadvantages: The good pitchers wind up being disadvantaged, and the bad pitchers advantaged. Also, overall accuracy suffers.

2. 

Make what you think is the correct call, on a "more likely than not" basis, but *do* take the pitcher's identity into account.

Advantages: Maximizes overall accuracy.

Disadvantages: The bad pitchers wind up being disadvantaged, and the good pitchers advantaged.

3. 

Make what you think is the most likely correct call, but adjust only slightly for the pitcher's identity, just enough that, overall, no type of pitcher is either advantaged or disadvantaged.

Advantages: No pitcher has an inherent advantage just because he's better or worse.

Disadvantages: Hard for an umpire to calibrate his brain to get it just right. Also, overall accuracy not as good as it could be. And, how do you explain this strategy to umpires and players and fans?


Which of the three is the right answer, morally? I don't know. Actually, I don't think there necessarily is one -- I think any of the three is fair, if understood by all parties, and applied consistently. Your opinion may vary, and I may be wrong. But, that's a side issue.

------

Getting back to the study: the fact that umpires make more favorable mistakes for good pitchers than bad pitchers is not, by any means, evidence that they are unconsciously biased against pitchers based on "status." It could just as easily be one of several other, more plausible reasons. 

So that's why I don't accept the study's conclusions. 

There's also another reason -- the regression itself. I'll talk about that next post.




(Hat tip: Charlie Pavitt)



Monday, January 30, 2012

Do NHL teams get a boost after killing a two-man advantage?

In an OHL game I was watching the other day, one of the teams had a two-man advantage and didn't score. The announcer was disappointed that the shorthanded team didn't seem to get a boost from having killed off the penalties, as conventional wisdom says they should.

Is conventional wisdom right? Now that I have access to a database of NHL games (thanks again to the Hockey Summary Project), I was able to check.

This study uses basically the same format as the study I did on fights a few weeks back. I found all games from 1967-68 to 1984-85 where one team killed off a two-man advantage (of any length). Then, I found a random control game, which matched the score differential and the relative quality of the home and road teams. When I was done, I had two pools, each consisting of 1,703 games.
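(In code, the matching step would look something like this sketch; the field names are hypothetical, standing in for whatever your database provides for score differential and relative team quality at the time of the kill:)

import random

def find_control(event, pool, quality_tol=0.05):
    # Candidate control games: same score differential, similar relative
    # team quality, and not the game the two-man kill came from.
    candidates = [g for g in pool
                  if g["score_diff"] == event["score_diff"]
                  and abs(g["rel_quality"] - event["rel_quality"]) <= quality_tol
                  and g["game_id"] != event["game_id"]]
    return random.choice(candidates) if candidates else None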

The teams that killed the penalties scored an average of 0.26 more goals than their opponents from that point to the end of the game (actually, to the 17:00 mark of the third period). On the other hand, the control teams scored only 0.12 more goals than their opponents.

That's statistically significant, at almost exactly 2 SDs.

I'll put that in chart form to make it easier to read, along with the SD. I use the term "killing teams" to mean the ones that actually killed off the two-man advantage.

Killing teams .... +0.26 goals (+/- 0.05)
Control teams .... +0.12 goals (+/- 0.05)
------------------------------------------
Difference ....... +0.14 goals (+/- 0.07)

At six goals per win, you'd have expected the extra goals to have resulted in around 40 extra wins. They actually resulted in the equivalent of 26 extra wins: 30 extra wins, minus 8 fewer ties (counting a tie as half a win):

Killing teams .... 836-604-263
Control teams .... 806-626-271
------------------------------------
Difference ....... +30 wins, -8 ties

So, should we conclude that killing off a two-man advantage causes a psychological boost? Well, not so fast. Because, after you take two consecutive penalties, the referee is very likely to try to even things up by giving future penalties to the other team.

The difference of +0.14 goals is almost exactly what you'd get from a single power play. So, if the result of surviving a two-man advantage is that you get one extra "free" power play in the remainder of the game, that would explain the results exactly.

As it turns out, it's not quite that high. It's only half that high. On average, the teams that survived being shorthanded two men got about half an extra power play in the remainder of the game:

Killing teams ... +.346 power plays rest of game
Control teams ... -.130 power plays rest of game
------------------------------------------------
Difference ...... +.476 power plays rest of game

That leaves about 0.07 goals per game as the unexplained difference. It's only 1 SD, which is no longer statistically significant. It's about the effect of half a power play. Or, with an average save percentage of .900, it works out to 7/10 of an additional shot on goal.

--------

We can also handle the penalty issue another way. We can insist that when we choose a control game for the real game, we make sure the control team was the one that took the last penalty. That way, we'd expect some of the referee "evening up" difference to disappear. Perhaps not all of it, because a two-man advantage isn't the same as a one-man advantage -- but at least part of it.

The additional restriction reduced the sample size to 1,662 games; for the remaining 41 games, I couldn't find a suitable control.

As it turns out, the goal difference stays about the same, even though the penalty difference is significantly reduced:

Killing teams ... +0.25 goals (+/- 0.05)
Control teams ... +0.08 goals (+/- 0.05)
----------------------------------------
Difference ...... +0.17 goals (+/- 0.07)

Killing teams ... +.340 power plays rest of game
Control teams ... +.032 power plays rest of game
------------------------------------------------
Difference ...... +.308 power plays rest of game

The difference of .308 power plays accounts for around .04 goals of the observed .17 difference. That leaves .13, which is a little less than 2 SD from zero. Not statistically significant, but close. (Technically, it's even less than that, because the control games aren't completely independent. Also, when I ran the study a second time, I got +0.10 goals instead of +0.08, which lowers the difference. So think of the 1.9 SD as probably a bit too high.)

Strangely, though, there wasn't as much difference in game results; only the equivalent of 13.5 wins:

Killing teams ... 815-591-256
Control teams ... 807-610-245
------------------------------
Difference: +8 wins, +11 ties

Again at six goals per win, you'd expect 47 wins, not 13.5. What happened?

Well, it turns out that the "killing" teams spent a lot of their goals winning blowouts. For instance, in games won by six goals or more, they were 81-34. The control group was only 73-51.

In those games, the difference was 12.5 wins. That normally "costs" 75 goals, but, for these games, the difference was really around 150 goals. So, that accounts for 75 of the 282 goal difference right there.

The "killing" group also "wasted" goals in the 3- and 4-goal games. That was offset by the opposite effect in five-goal games, but not by much.

------

If you recall, we found the same effect when we looked at fighting: teams that started a fight appeared to score more goals, but not necessarily win more games.

What connects the two studies is ... penalties. It could be that teams that get penalized a lot win a lot of blowouts. Not necessarily because of cause-and-effect, but because it just so happened that, between 1967 and 1984, certain teams just happened to be high in both categories.

Or, it could be coincidence. Or, it could be something else.

------

For my bottom line, I'd say: after killing off a two-man advantage, teams did appear to benefit by about 1/7 of a goal. Half of that can be traced to referees calling fewer penalties against them in the remainder of the game.

The other half is unknown. It's not statistically significant, so you have to give serious consideration to the idea that it's just coincidence ... but the teams *did* appear to benefit, by around 0.07 goals.

Historically, the average size of the "boost" in a team's play after a two-man kill has been small: the equivalent of less than a single shot on goal over the remainder of the game.


