## Saturday, March 18, 2023

### 1961 Yankees fielding is double-counted against Whitey Ford

Here are the 1961 pitching lines of Whitey Ford and Jack Kralick that Bill James wrote about back in 2019:

Pitcher   IP    W-L    H    K  BB    ERA
Ford     283  25- 4  242  209  92   3.21
Kralick  242  13-11  257  101  97   3.61

Kralick's season is decent, but clearly no match for Whitey, who has him beat in every category.

But, surprisingly, Baseball Reference has Kralick leading the American League with a WAR of 6.0. Whitey Ford, on the other hand, is 12th with only 3.7 WAR.

Pitcher   IP    W-L    H    K  BB    ERA    WAR
Ford     283  25- 4  242  209  92   3.21    3.7
Kralick  242  13-11  257  101  97   3.61    6.0

What's going on?

The answer, I think, is that the fielding and park adjustments WAR uses are overinflated. That's for a future post. This post is about how, while trying to figure out what happened, I think I found an issue with the adjustments that turns out to be randomly specific to Whitey Ford's 1961 Yankees.

-----

B-R uses Sean Smith's "Total Zone Rating" (TZR) to estimate fielding and calculate defensive WAR (dWAR). For seasons before 1989, TZR is based on Retrosheet data, which, for most games, includes information on the type and location of balls in play (BIP). Basing dWAR on TZR means that for the team as a whole, the defensive evaluation is roughly equivalent to what you'd see just looking at batting average on balls in play (BABIP).

There are other factors too -- baserunner advancement, catcher arm -- but it's mostly BABIP. To confirm that, I ran a regression to predict dWAR based on BABIP (compared to league) with a dummy variable for franchise (which Sean's website says TZ adjusts for to take outfield park variation into account). For the years 1960-73, the correlation was high (r-squared .77), with the coefficient of BABIP coming out very close to the win value of turning outs into hits. For the 14 Yankee seasons specifically, the correlation was over 0.9.

So it seems like most of dWAR up to 1988 is BABIP.

-----

In 1961, the Yankees allowed an opposition BABIP of .261, compared to the AL average .275. That's an advantage of .014, or "14 points".

Yankee pitchers allowed 4414 balls in play that year. So the extra .014 represents about 62 hits turned into outs. I use 0.8 as the run value of each of those outs, so that's 49.4 runs. Call it 50 for short.
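Here's that arithmetic as a quick Python sketch. The inputs are the figures quoted above; the variable names are just mine.

```python
# Figures quoted in the text for the 1961 Yankees.
balls_in_play = 4414
yankees_babip = 0.261
league_babip = 0.275
runs_per_hit_saved = 0.8  # the run value used in the text for turning a hit into an out

babip_edge = league_babip - yankees_babip
hits_saved = balls_in_play * babip_edge
runs_saved = hits_saved * runs_per_hit_saved

print(f"BABIP edge:            {babip_edge:.3f}")
print(f"Hits turned into outs: {hits_saved:.1f}")
print(f"Runs saved:            {runs_saved:.1f}")
```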

-----

As an aside: Baseball Reference has the 1961 Yanks at 72 runs, not 50. Why such a big difference? I'm not sure.

Maybe they use MLB instead of AL as their baseline. That would add 14 runs or so, because in 1961 the BABIP for both leagues combined was .279 instead of the AL-only .275.

It could also be a result of TZR not including popups and line drives in its evaluation (because presumably there's not much difference in fielding those). It also adds measures of outfielder arms (by looking at baserunner advancement), double play ability, and caught stealings for catchers. And there's that franchise adjustment. All those might contribute to the difference.

But there's another anomaly in the raw data. The MLB average for the 1961 season was almost +12 defensive runs per team. You'd think the average would have to be zero, by definition. It could be that the system uses an average based on a large number of seasons, and 1961 just happened to be a great year for fielding. But that doesn't seem like it could be the answer. For 1959-1967, the MLB total defensive runs saved is positive every one of those nine seasons. Then, it switches over: from 1968 to 1975, every season total is negative. It doesn't seem plausible to me that MLB spent nine years with good fielders, then the next eight with worse fielders.

Whatever it is ... over all 17 years, 1961 is the biggest outlier. The total fielding runs for 1961 is +214, which is +11.9 per team. The next highest, 1960, is only +146. None of the other positives break 100, and the highest negative is -71 in 1971.

Anyway, that still isn't my main point; it's just something that I noticed.

------

OK, now the interesting part.

As I said, the 1961 Yankees' opponents' BABIP was 14 points lower than the league. All things being equal, you'd expect the fielding to be equally good at home and on the road -- about the same 14 points either way.

But the Yankees BABIP advantage was much higher at home.

Overall, AL teams were 8 points better at home than on the road. But the Yankees were 37 points better:

BABIP   NYY     AL
------------------
home   .242   .271
road   .279   .279
------------------
diff   .037   .008

On the road, the Yankee fielders were the same as the AL average, holding opponents to a .279 BABIP. At home, though, they were 29 points better than the league.

So, in effect, all 50 runs the Yankees fielders saved via BABIP were saved at Yankee Stadium.

Why does that matter? Because those 50 runs are going to be *double counted* against Yankee pitchers and their WAR.

First, those 50 runs will be attributed to the skill of the Yankee fielders. Whitey Ford's WAR will drop, because it appears the fielders behind him were responsible for turning so many of his balls in play into outs.

Second, those same 50 runs are going to be used in calculating the Park Factor, which is based on actual runs scored (by both teams). With 50 fewer runs scored at Yankee Stadium because of BABIP, and no fewer runs scored on the road, the calculation will implicitly attribute those 50 runs to the park and the park factor will drop. Whitey Ford's WAR again will drop because he pitches in a park where it appears to be easier to prevent runs.

The adjustment for BABIP is made twice: the first time it's attributed to the fielders, and the second time it's attributed to the park. But it can't be both. At least not fully both -- it could be 50% fielding and 50% park, but the WAR method treats it as 100% fielding and 100% park.

Specifically, according to Baseball Reference, the dWAR calculation credits the Yankee defense with 0.43 runs per game behind Whitey Ford. Over Whitey's 283 innings, that's 13.5 runs. At 10 runs per win, that's 1.35 WAR. At 8.5 runs per win -- which is what B-R seems to be using for the 1961 American League -- it's 1.6 WAR.

Whitey is being adjusted, implicitly, by 3.2 WAR instead of 1.6. Turning that double-counting back into single-counting would raise Whitey from 3.7 to 5.3, which seems much more reasonable for his performance.
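For anyone who wants to check the numbers, here's a minimal sketch of the single-season version of that calculation. It only uses the figures quoted above (the 0.43 runs per game, the 283 innings, and my guess of 8.5 runs per win); the variable names are mine.

```python
runs_per_game_credit = 0.43  # defensive credit per nine innings behind Ford, per the text
ford_innings = 283
runs_per_win = 8.5           # the figure B-R seems to be using for the 1961 AL
ford_listed_war = 3.7

# Runs credited to the fielders over Ford's innings, then converted to wins.
fielding_runs = runs_per_game_credit * ford_innings / 9
fielding_war = fielding_runs / runs_per_win

# If the same runs are also removed through the park factor, the debit is doubled.
double_counted_war = 2 * fielding_war
single_counted_total = ford_listed_war + fielding_war

print(f"Fielding runs behind Ford:  {fielding_runs:.1f}")
print(f"Fielding adjustment (WAR):  {fielding_war:.1f}")
print(f"Doubled adjustment (WAR):   {double_counted_war:.1f}")
print(f"Ford with single counting:  {single_counted_total:.1f}")
```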

-----

Except ... not quite. My calculation assumed that park factor is based on a single season's runs. It's actually the average of three seasons -- in this case, 1960, 1961, and 1962.

So only a third of the Yankee defense is being double-counted in 1961. That means you'd only adjust Whitey for 0.5 WAR, not the full 1.6. That brings him only to 4.2.

However, the remaining 1.1 WAR or so is still double-counted: it's just that one third of the total is moving to 1960, and one third is moving to 1962.

That means that Whitey will be shorted 0.5 WAR in 1960, and again in 1962. If you fixed that, his 1960 would go from 2.0 to 2.5, and his 1962 from 5.1 to 5.6.

Over Whitey's career, though, the Yankees' overall home BABIP (compared to road) will indeed wind up double-counted towards his total WAR (with the exception of his first two and last two years, which will be "1.3-counted" or "1.6-counted").

------

I think this is something that will happen all the time, if my understanding is correct of how pitching WAR is calculated.

Any difference between home and road fielding will be counted as part of the park factor adjustment in addition to being counted as a fielding adjustment. If BABIP is better at home, the pitcher will be debited twice. If BABIP is worse at home, the pitcher will be credited twice.

BABIP is, like any other stat, subject to random variation. By my calculation, the SD of luck for home-minus-road BABIP is about 14 points, or 24 runs for a team-season. That's a lot. Whitey Ford pitched about 19.5 percent of his team's innings in 1961, so the SD of his random BABIP luck is about 5 runs. (The 1961 number looks like it was double that, or 2 SD, assuming no park effects. Whitey was double counted by about 10 runs by raw BABIP, and a little more than that by the dWAR calculation.)
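One way to land near those luck figures is a plain binomial model, sketched below. This is an assumption on my part about how such an SD could be estimated, not a description of the exact method: it treats every ball in play as an independent trial at the league BABIP, and splits the team's balls in play evenly between home and road.

```python
import math

league_babip = 0.275
team_bip = 4414
home_bip = team_bip / 2  # assumption: balls in play split evenly between home and road
runs_per_hit = 0.8
ford_share = 0.195       # Ford's share of the team's innings, per the text

# Binomial SD of BABIP over one half-season of balls in play.
sd_one_split = math.sqrt(league_babip * (1 - league_babip) / home_bip)

# Home minus road is a difference of two independent splits.
sd_home_minus_road = math.sqrt(2) * sd_one_split

# Convert the BABIP gap into runs at home, then scale to Ford's share.
sd_runs_team = sd_home_minus_road * home_bip * runs_per_hit
sd_runs_ford = sd_runs_team * ford_share

print(f"SD of home-minus-road BABIP: {sd_home_minus_road * 1000:.1f} points")
print(f"SD in runs, team-season:     {sd_runs_team:.1f}")
print(f"SD in runs, Ford's share:    {sd_runs_ford:.1f}")
```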

Now, dWAR does remove popups and line drives from consideration ... that will reduce the luck SD (I'm not sure how much) compared to raw BABIP. But even if the SD drops from 0.5 wins to 0.4, or something, that's still pretty big.

-------

We could just adjust dWAR for this (the BABIP home/road numbers are readily available on B-R). But I think the adjustments for fielding and park are exaggerated in other ways -- as I wrote about in previous sets of posts.

A different algorithm for adjusting pitcher WAR -- where we regress both fielding and park to the mean by significant amounts -- might reduce the double-count enough that we won't really need to make a correction for it. It will probably adjust both Whitey Ford and Jack Kralick enough that Whitey winds up on top, although I haven't checked that in detail yet.

I'll work on that for the next post.

## Wednesday, November 16, 2022

### Home field advantage is naturally higher in a hitter's park

The Rockies have always had a huge home-field advantage (HFA) at Coors. From 1993 to 2001, Colorado has played .545 at home, but only .395 on the road. That's the equivalent of the difference between going 89-73 and 64-98.

Why such a big difference? I have some ideas I'm working on, but the most obvious one -- although it's not that big, as we will see -- is that higher scoring naturally, mathematically, leads to a bigger HFA.

When teams play better at home than on the road -- for whatever reason -- the manifestation of "better" is in physical performance, not winning percentage as such. The translation from performance to winning percentage depends on the characteristics of the game.

In MLB, historically, the home team plays around .540. But if the commissioner decreed that now games were going to be 36 innings long instead of 9, the home advantage would roughly double, with the home team now winning at a .580 pace.

(Why? With the game four times as long, the SD of the score difference by luck would double. But the home team's run advantage would quadruple. So the run differential by talent would double compared to luck. Since the normal distribution is almost linear at such small differences (roughly, from 0.1 SD to 0.2 SD), HFA would approximately double.)

But it's not *always* that a higher score number increases HFA. If it was decided that all runs now count as 2 points, like in basketball, scoring would double, but, obviously, HFA would stay the same.

Roughly speaking, increased scoring increases the home advantage only if it also increases the "signal to noise ratio" of performance to luck. Increasing the length of the game does that; doubling all the scores does not.

In 2000, Coors Field increased scoring by about 40%. If that forty percent was obtained by increasing games from 9 innings to 13 innings, HFA would be around 20% higher. If the forty percent was obtained by making every run count as 1.4 runs, HFA would be 0% higher. In reality, the increase could be anywhere between 0% and 20%, or beyond.
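Here's a rough sketch of that signal-to-noise argument, using the normal approximation. The helper `home_win_pct` and its arguments are hypothetical names of mine; the only inputs from the text are the .540 baseline and the three scenarios (36-inning games, 13-inning games, and runs that count 1.4 each).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def home_win_pct(base_pct, signal_multiplier, noise_multiplier):
    """Rescale the home edge (measured in SDs of luck) and convert back to a win pct."""
    z = STD_NORMAL.inv_cdf(base_pct)
    return STD_NORMAL.cdf(z * signal_multiplier / noise_multiplier)

base = 0.540

# 36-inning games: the talent gap in runs is 4x, the SD of luck is sqrt(4) = 2x.
long_game = home_win_pct(base, 4, sqrt(4))

# 13-inning games: talent gap is 13/9 as big, luck SD is sqrt(13/9) as big.
thirteen = home_win_pct(base, 13 / 9, sqrt(13 / 9))
hfa_increase = (thirteen - 0.5) / (base - 0.5) - 1

# Every run counts as 1.4: signal and noise scale together, so nothing changes.
scaled_runs = home_win_pct(base, 1.4, 1.4)

print(f"36-inning games:      {long_game:.3f}")
print(f"13-inning games:      {thirteen:.3f}  (HFA up {hfa_increase:.0%})")
print(f"Runs worth 1.4 each:  {scaled_runs:.3f}")
```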

We probably have the tools available to get a pretty good estimate of the true increase.

------

Let's start with the overall average HFA. My subscription to Baseball Reference allowed me to obtain home and road batting records, all teams combined, for the 1980-2022 seasons:

Split       AB      H     2B    3B    HR     BB     SO
------------------------------------------------------
home   3209469 846723 161290 19928 95790 321178 612545
road   3363640 859813 163954 17203 96043 308047 668363

What's the run differential between those two batting lines? We can look at actual runs, or even the difference in run statistics like Runs Created or Extrapolated Runs. But, for better accuracy, I used Tom Tango's on-line Markov Calculator (the version modified by Bill Skelton, found here). It turns out the home batting line leads to 4.79 runs per nine innings, and the road batting line works out to 4.36 R/9.

Split       AB      H     2B    3B    HR     BB     SO   R/9
-------------------------------------------------------------
home   3209469 846723 161290 19928 95790 321178 612545  4.79
road   3363640 859813 163954 17203 96043 308047 668363  4.36
-------------------------------------------------------------
difference                                              0.43

That's a difference of 0.43 runs per game. Using the rule of thumb that 10 runs equals one win, a rough estimate is that the home team should have a win advantage of 0.043 wins per game, for a winning percentage of .543.

That's a pretty good estimate -- home teams actually went .539 in that span (51832-44409). But, we'll actually need to be more accurate than that, because the "10 runs per win" figure will change significantly for higher-scoring environments such as Coors.
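The rule-of-thumb comparison is easy to reproduce. This sketch uses only the numbers quoted above; the variable names are mine.

```python
home_r9 = 4.79
road_r9 = 4.36
rule_of_thumb_runs_per_win = 10.0
home_wins, home_losses = 51832, 44409

# Predicted home winning percentage from the run gap.
predicted = 0.5 + (home_r9 - road_r9) / rule_of_thumb_runs_per_win

# Actual home winning percentage over the same span.
actual = home_wins / (home_wins + home_losses)

print(f"Predicted: {predicted:.3f}")
print(f"Actual:    {actual:.3f}")
```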

So let's calculate an estimate of the actual runs per win for this scoring environment.

The Tango/Skelton Markov calculator includes a feature where, given the batting line, it will show the probability of a team scoring any particular number of runs in a nine-inning game. Here's part of that output:

Runs      home   road
----------------------
2 runs:  .1201  .1342
3 runs:  .1315  .1404
4 runs:  .1282  .1309

From this table, which actually extends from 0 to 30+ runs, we can calculate how many runs it would take for the road team to turn a loss into a win.

Case 1:  If the road team is tied after 9 innings, it has about a 50% chance of winning. With one additional run, it turns that into 100%. So an additional run in a tie game is worth half a win.

How often is the game tied? Well, the chance of a 2-2 tie is .1201*.1342, or about 1.6%. The chance of a 3-3 tie is .1315*.1404, or 1.8%. Adding up the 2-2 and the 3-3 and the 0-0 and the 1-1 and the 4-4 and the 5-5, and so on all the way down the line, the overall chance is 9.7%.

Case 2:  If the road team is down a run after 9 innings, it loses, which is a 0% chance of winning. With one additional run, it's tied, and turns that into a 50% chance. So, an additional run there is also worth half a win.

How often is the road team down a run? Well, the chance of a 3-2 result is .1315*.1342, or about 1.8%. The chance of 4-3 is .1282*.1404, another 1.8%. And so on.

The total: a 9.54% chance the road team winds up losing by one run.

What's the chance that the additional run will give the *home* team the extra half win? We can repeat the calculation, but instead of 3-2, we'll calculate 2-3. Instead of 4-3, we'll calculate 3-4. And so on.

The total: only 8.54%. It makes sense that it's smaller, because the better team is less likely to be behind by a run than ahead by a run.

We'll average the home and road numbers to get 9.04%.

So, we have:

9.7% chance of a tie
9.0% chance of behind one run
----------------------------------------------
18.7% chance that a run will create half a win

Converting that 18.7% chance to R/W:

0.187 half-wins per run
=   5.35 runs per half-win
=   10.7 runs per win

So, we'll use 10.7 runs per win for our calculation.
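The steps above can be sketched as a small function. It takes the two run distributions (the probability of each team scoring exactly 0, 1, 2, ... runs) and, like the multiplication in the text, treats the two teams' scores as independent. The function name is mine, and the toy distribution is made up purely to check the logic; the full 0-to-30+ table from the calculator isn't reproduced here.

```python
def runs_per_win(home_dist, road_dist):
    """home_dist[r] and road_dist[r] are the probabilities of scoring exactly r runs."""
    n = min(len(home_dist), len(road_dist))

    # Case 1: tie game.
    p_tie = sum(home_dist[k] * road_dist[k] for k in range(n))

    # Case 2: one team trails by exactly one run.
    p_road_down_one = sum(home_dist[k + 1] * road_dist[k] for k in range(n - 1))
    p_home_down_one = sum(road_dist[k + 1] * home_dist[k] for k in range(n - 1))

    # Each of those situations is where one extra run buys half a win.
    half_wins_per_run = p_tie + (p_road_down_one + p_home_down_one) / 2
    return 2 / half_wins_per_run

# Toy check: each team scores 0 or 1 run with equal probability.
toy = runs_per_win([0.5, 0.5], [0.5, 0.5])

# Same conversion, starting from the totals quoted in the text.
from_text = 2 / (0.097 + (0.0954 + 0.0854) / 2)

print(f"Toy example:     {toy:.2f} runs per win")
print(f"Text's totals:   {from_text:.1f} runs per win")
```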

(Why, by the way, do we get 10.7 runs per win instead of the rule of thumb that it should be 10.0 flat? I think it's because the Markov simulation always plays the bottom of the ninth, even when the home team is already up. It therefore includes a bunch of meaningless runs that don't occur in reality. When some of the run currency is randomly useless, it pushes the price of a win higher.

We'd expect that roughly 1/18 of all runs scored are in the bottom of the ninth with the home team having already won. If we discount those by multiplying 10.7 by 17/18, we get ... 10.1 runs per win. Bingo.)

We saw earlier that the home team had an advantage of 0.43 runs per game. Dividing that by 10.3 runs per win gives us

Predicted: HFA of .042 wins per game (.542)
Actual:    HFA of .039 wins per game (.539)

We're off a bit. The difference is about 2 SD. My guess is that the Markov calculation, which is necessarily simplified, is very slightly off, and we only notice because of the huge sample size of almost 100,000 actual games.
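A quick way to check the "about 2 SD" figure, assuming each game is an independent trial with a winning percentage near .500 (that binomial model is my assumption here):

```python
import math

home_wins, home_losses = 51832, 44409
games = home_wins + home_losses

predicted = 0.542
actual = home_wins / games

# Binomial SD of a winning percentage near .500 over this many games.
sd = math.sqrt(0.5 * 0.5 / games)
gap_in_sd = (predicted - actual) / sd

print(f"Games:      {games}")
print(f"SD of pct:  {sd:.4f}")
print(f"Gap:        {gap_in_sd:.1f} SD")
```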

-------

OK, now let's do the same thing, but this time for Coors Field only.

I could do the same thing I did for MLB as a whole: split the combined Coors batting line into home and road, and calculate those individually. The problem with that is ... well, if I do that, I'll be getting the Rockies' actual HFA at Coors, which is huge, because it includes all kinds of factors that we're not concerned with, like altitude acclimatization, tailoring of personnel to field, etc.

So, I'm going to try to convert the Coors line into an approximation of what the split would look like if it were similar to MLB as a whole.

Here's that 1980-2022 MLB split from above, except I've added the percentage difference between home and road (on a per-AB basis) below:

Split       AB       H      2B      3B      HR      BB      SO
--------------------------------------------------------------
home   3209469  846723  161290   19928   95790  321178  612545
road   3363640  859813  163954   17203   96043  308047  668363
--------------------------------------------------------------
diff             +3.2%   +3.5%  +21.4%   +4.5%   +9.3%   -3.9%

I'll try to create something similar for 2000 Coors. The overall batting line, for both teams, looked like this:

Park     AB     H   2B  3B   HR   BB   SO    R/9
------------------------------------------------
Coors  5843  1860  359  56  245  633  933   7.43

I split that line arbitrarily into Rockies vs. road team, in such a way as to keep roughly the same percentage differences as in MLB overall, while also keeping the R/9 at roughly 7.43. Here's what I came up with:

Split    AB       H      2B      3B      HR      BB      SO
-----------------------------------------------------------
home   5843    1884     362      66     249     672     936
road   5843    1826     350      54     238     615     974
-----------------------------------------------------------
diff          +3.2%   +3.4%  +22.2%   +4.6%   +9.3%   -3.9%

I ran those through Tango's calculator to get runs per 9 innings:

Split    AB     H   2B  3B   HR   BB   SO     R/9
-------------------------------------------------
home   5843  1884  362  66  249  672  936   7.783
road   5843  1826  350  54  238  615  974   7.071
-------------------------------------------------
avg                                         7.427
-------------------------------------------------
diff                                        +.712

Next, I ran the runs-per-game distribution calculation to get a runs-per-win estimate. (I won't go through the details here, but it's the same thing as before: calculate the probability of a tie, then a one-run home win, then a one-run road win, etc.)

The result: 14.37 runs per win.

As expected, that's significantly higher than the 10.7 we calculated for MLB overall. (Adjusting 14.37 for the superfluous bottom-of-the-ninth gives about 13.6, so, if you prefer, you can compare 13.6 Coors to 10.1 overall.)

The difference of .712 runs per game, divided by 14.37 runs per win, gives an HFA of

0.0495 wins per game

Which translates to a home winning percentage of .5495.

Comparing the two results:

.542 home field winning percentage normal
.549 home field winning percentage Coors
-----------------------------------------
.007 difference
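Here's the Coors side of that comparison as a short sketch, using the .712 run gap, the 14.37 runs per win, and the .542 MLB figure from above. The 81 home games per season is the only number I've added.

```python
coors_run_gap = 0.712
coors_runs_per_win = 14.37
mlb_home_pct = 0.542
home_games_per_season = 81  # half of a 162-game schedule

coors_home_pct = 0.5 + coors_run_gap / coors_runs_per_win
extra_hfa = coors_home_pct - mlb_home_pct
extra_wins = extra_hfa * home_games_per_season

# The same bottom-of-the-ninth discount applied earlier to the MLB figure.
adjusted_runs_per_win = coors_runs_per_win * 17 / 18

print(f"Coors home pct:            {coors_home_pct:.4f}")
print(f"Extra HFA vs. MLB:         {extra_hfa:.3f}")
print(f"Extra wins per home year:  {extra_wins:.2f}")
print(f"Adjusted runs per win:     {adjusted_runs_per_win:.1f}")
```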

The difference of .007 is worth only about half a win per home season. Sure, half a win is half a win, but I'm a little disappointed that's all we wind up with after all this work.

It's certainly not as much of an effect as I thought there would be before I started. Even if you deducted this inherent .007, it would barely make a dent in the Rockies' 150 percentage point difference between Coors and road. The Rockies would still be in first place on the FanGraphs chart by a sizeable margin -- 42 points instead of 49.

Looked at another way, an additional .007 would move an average team from the middle of the 29-year standings, to about halfway to the top. So maybe it's not that small after all.

Still, our conclusion has to be that the Rockies' huge HFA over the years is maybe 10 percent a mathematical inevitability of all those extra runs, and 90 percent other causes.

## Tuesday, September 07, 2021

### Are umpires racially biased? A 2021 study (Part II)

(Part I is here.)

20 percent of drivers own diesel cars, and the other 80 percent own regular (gasoline) cars. Diesels are, on average, less reliable than regular cars. The average diesel costs \$2,000 a year in service, while the average regular car only costs \$1,000.

Researchers wonder if there's a way to reduce costs. Maybe diesels cost more partly because mechanics don't like them, or are unfamiliar with them? They create a regression that controls for the model, age, and mileage of the car, as well as driver age and habits. But they also include a variable for whether the mechanic owns the same type of car (diesel or gasoline) as the owner. They call that variable "UTM," or "user/technician match".

They run the regression, and the UTM coefficient turns out negative and significant. It turns out that when the mechanic owns the same type of car as the user, maintenance costs are more than 13 percent lower! The researchers conclude that finding a mechanic who owns the same kind of car as you will substantially reduce your maintenance costs.

But that's not correct. The mechanic makes no difference at all. That 13 percent from the regression is showing something completely different.

If you want to solve this as a puzzle, you can stop reading and try. There's enough information here to figure it out.

-------

The overall average maintenance cost, combining gasoline and diesel, is \$1200. That's the sum of 80 percent of \$1000, plus 20 percent of \$2000.

So what's the average cost for only those cars that match the mechanic's car? My first thought was, it's the same \$1200. Because, if the mechanic's car makes no difference, how can that number change?

But it does change. The reason is: when the user's car matches the mechanic's, it's much less likely to be a diesel. The gasoline owners are over-represented when it comes to matching: each has an 80% chance of being included in the "UTM" sample, while the diesel owner has only a 20% chance.

In the overall population, the ratio of gasoline to diesel is 4:1. But the ratio of "gasoline/gasoline" to "diesel/diesel" is 16:1. So instead of 20%, the proportion of "double diesels" in the "both cars match" population is only 1 in 17, or 5.9%.

That means the average cost of UTM repairs is only \$1059. That's 94.1 percent of \$1000, plus 5.9% of \$2000. That works out to about 12 percent less than the overall \$1200 (put the other way, the overall average is 13.3 percent higher).

Here's a chart that maybe makes it clearer. Here's how the raw numbers of UTM pairings break down, per 1000 population:

Technician      Gasoline    Diesel    Total
-------------------------------------------
User gasoline     640        160       800
User diesel       160         40       200
-------------------------------------------
Total             800        200      1000

The diagonal (640 and 40) is where the user matches the mechanic. There are 680 cars on that diagonal, but only 40 (1 in 17) are diesel.

In short: the "UTM" coefficient is significant not because matching the mechanic selects better mechanics, but because it selectively samples for more reliable (gasoline) cars.
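If you'd rather see the puzzle worked out in code, here's a minimal sketch. It assumes, as the example does, that the mechanic's car type is independent of the owner's and follows the same 80/20 split; the names are mine.

```python
share = {"gasoline": 0.8, "diesel": 0.2}
cost = {"gasoline": 1000, "diesel": 2000}

# Overall average cost across all owners.
overall_avg = sum(share[c] * cost[c] for c in share)

# Probability that owner and mechanic both own type c (independent draws).
match_weight = {c: share[c] ** 2 for c in share}
match_total = sum(match_weight.values())

# Average cost among matched pairs only, and the diesel share of that pool.
matched_avg = sum(match_weight[c] * cost[c] for c in share) / match_total
diesel_share_of_matches = match_weight["diesel"] / match_total

print(f"Overall average:          ${overall_avg:.0f}")
print(f"Matched (UTM) average:    ${matched_avg:.0f}")
print(f"Diesel share of matches:  {diesel_share_of_matches:.1%}")
```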

--------

In the umpire/race study I talked about last post, the author ran a regression like that, putting all the umpires and batters together and looking at the "UBM" variable, which flags pitches where the umpire's race matches the batter's race.

From last post, here's the table the author included. The numbers are umpire errors per 1000 outside-of-zone pitches (negative favors the batter).

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8       ---     +5.9
White batter       +5.6      -4.4      ---

Here's the same table with each column rebased so that White batters are the zero level:

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---

I think I'm able to estimate, from the original study, that the batter population was almost exactly in a 2:3:4 ratio -- 22 percent Black, 34 percent Hispanic, and 44 percent White. Using those numbers, I'm going to adjust the chart one more time, to show approximately what it would look like if the umpires were exactly alike (no bias) and each column added to zero.

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -2.2     -2.2     -2.2
Hispanic batter    +3.8     +3.8     +3.8
White batter       -1.7     -1.7     -1.7

I chose those numbers so the average UBM (average of diagonals in ratio 22:34:44) is zero, and also to closely fit the actual numbers the study found. That is: suppose you ran a regression using the author's data, but controlling for batter and umpire race.  And suppose there was no racial bias. In that case, you'd get that table, which represents our null hypothesis of no racial bias.

If the null hypothesis is true, what will a regression spit out for UBM? If the batters were represented in their actual ratio, 22:34:44, you'd get zero:

Diagonal          Effect  Weight   Product
-------------------------------------------------
Black UBM          -2.2    22%     -0.5
Hispanic UBM       +3.8    34%     +1.3
White UBM          -1.7    44%     -0.8
-------------------------------------------------
Overall UBM               100%     -0.0  per 1000

However: in the actual population in the MLB study, the diagonals do NOT appear in the 22:34:44 ratio. That's because the umpires were overwhelmingly White -- 88 percent White. There were only 5 percent Black umpires, and 7 percent Hispanic umpires. So the White batters matched their umpire much more often than the Hispanic or Black batters.

Using 5:7:88 for umpires, and 22:34:44 for batters, the relative frequency of each combination looks like this. Here's the breakdown per 1000 pitches:

Umpire             Black   Hispanic  White    Total
---------------------------------------------------
Black batter        11        15      194      220
Hispanic batter     17        24      300      341
White batter        22        31      387      439
---------------------------------------------------
Umpire total        50        70      881     1000

Because there are so few minority umpires, there are only 24 Hispanic/Hispanic pairs out of 422 total matches on the UBM diagonal.  That's only 5.7% Hispanic batters, rather than 34 percent:

Diagonal       Frequency  Percent
----------------------------------
Black UBM            11     2.6%
Hispanic UBM         24     5.7%
White UBM           387    91.7%
----------------------------------
Overall UBM         422     100%

If we calculate the observed average of the diagonal, with this 11/24/387 breakdown, we get this:

Diagonal          Effect  Weight   Product
--------------------------------------------------
Black UBM          -2.2    2.6%    -0.06 per 1000
Hispanic UBM       +3.8    5.7%    +0.22 per 1000
White UBM          -1.7   91.7%    -1.56 per 1000
--------------------------------------------------
Overall UBM                100%    -1.40 per 1000

Hispanic batters receive more bad calls for reasons other than racial bias. By restricting the sample of Hispanic batters to only those who see a Hispanic umpire, we selectively sample fewer Hispanic batters in the UBM pool, and so we get fewer bad calls.

Under the null hypothesis of no bias, UBM plate appearances still see 1.40 fewer bad calls per 1000 pitches, because of selective sampling.
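Here's a short sketch of that selective-sampling calculation. The shares and per-race effects are the ones from the tables above; the only assumption beyond them is that umpire assignment is independent of batter race.

```python
batter_share = {"Black": 0.22, "Hispanic": 0.34, "White": 0.44}
umpire_share = {"Black": 0.05, "Hispanic": 0.07, "White": 0.88}

# No-bias effect per batter race (bad calls per 1000 pitches), same for every umpire.
batter_effect = {"Black": -2.2, "Hispanic": 3.8, "White": -1.7}

# Frequency of each same-race pairing, assuming umpires are assigned independently.
match_freq = {race: batter_share[race] * umpire_share[race] for race in batter_share}
match_total = sum(match_freq.values())

# Average effect over the UBM diagonal, weighted by how often each match occurs.
ubm_effect = sum(match_freq[r] * batter_effect[r] for r in match_freq) / match_total

for race in match_freq:
    print(f"{race:9s} share of matches: {match_freq[race] / match_total:.1%}")
print(f"Matches per 1000 pitches: {match_total * 1000:.0f}")
print(f"UBM effect per 1000:      {ubm_effect:+.2f}")
```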

------

That 1.40 figure is compared to the overall average. The regression coefficient, however, compares it to the non-UBM case. What's the average of the non-UBM case?

Well, if a UBM happens 422 times out of 1000, and results in 1.40 fewer bad calls than average, and the average is zero, then the other 578 times out of 1000, there must have been 1.02 more bad calls than average.

Sample            Effect  Weight   Product
--------------------------------------------------
UBM                -1.40   42.2%   -0.59 per 1000
Non-UBM            +1.02   57.8%   +0.59 per 1000
--------------------------------------------------
Full sample                100%    -0.00 per 1000

So the coefficient the regression produces -- UBM compared to non-UBM -- will be 2.42.

What did the actual study find? 2.81.

That leaves only 0.39 as the estimate of potential umpire bias:

-2.81  Selective sampling plus possible bias
-2.42  Effect of selective sampling only
---------------------------------------------
-0.39  Revised estimate of possible bias

The study found 2.81 fewer bad calls (per 1000) when the umpire matched the batter, but 2.42 of that is selective sampling, leaving only 0.39 that could be umpire bias.

Is that 0.39 statistically significant? I doubt it. For what it's worth, the original estimate had an SD of 0.44. So adjusting for selective sampling, we're less than 1 SD from zero.
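The chain from the -1.40 UBM figure to the leftover 0.39 can be laid out in a few lines. This is just a sketch of the arithmetic above, with the overall average taken as zero and variable names of my own.

```python
ubm_share = 0.422         # UBM pitches per pitch, from the frequency table
ubm_effect = -1.40        # bad calls per 1000 relative to average, no-bias case
study_coefficient = -2.81
study_sd = 0.44

# If the overall average is zero, the non-UBM pitches must offset the UBM ones.
non_ubm_effect = -ubm_effect * ubm_share / (1 - ubm_share)

# The regression coefficient compares UBM to non-UBM.
sampling_coefficient = ubm_effect - non_ubm_effect

# Whatever the study found beyond that is the most that bias could explain.
residual = study_coefficient - sampling_coefficient
residual_in_sd = abs(residual) / study_sd

print(f"Non-UBM effect:         {non_ubm_effect:+.2f}")
print(f"Sampling-only coeff.:   {sampling_coefficient:+.2f}")
print(f"Left over for bias:     {residual:+.2f}")
print(f"In SDs of the estimate: {residual_in_sd:.2f}")
```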

--------

So, the conclusion: the study's finding of a 0.28% UBM effect cannot be attributed to umpire bias. It's mostly a natural mathematical artifact resulting from the fact that

(a) Hispanic batters see more incorrect calls for reasons other than bias,

(b) Hispanic umpires are rare, and

(c) The regression didn't control for the race of batter and umpire separately.

Because of that, almost the entire effect the study attributes to racial bias is just selective sampling.


## Monday, August 30, 2021

### Are umpires racially biased? A 2021 study (Part I)

Are MLB umpires racially biased? There's a recent study that claims they are. The author, who wrote it as an undergrad thesis, mentioned it on Twitter, and when I checked a week or so later, there were lots of articles and links to it. (Here, for instance, is a Baseball Prospectus post reporting on it. And here's a Yahoo! report.)

The study tried to figure out whether umpires make more bad calls against batters* of a race other than theirs (where there is no "umpire-batter match," or "UBM," as the literature calls it). It ran regressions on called pitches from 2008 to 2020, to figure out how best to predict the probability of the home-plate umpire calling a pitch incorrectly (based on MLB "Gameday" pitch location). The author controlled for many different factors, and found a statistically significant coefficient for UBM, concluding that the batter gains an advantage when the umpire is of the same race. It also argues that white umpires in particular "could be the driving force behind discrimination in MLB."

I don't think any of that is right. I think the results point to something different, and benign.

---------

Imagine a baseball league where some teams are comprised of dentists, while the others are jockeys. The league didn't hire any umpires, so the players take turns, and promise to call pitches fairly.

They play a bunch of games, and it turns out that the umpires call more strikes against the dentists than against the jockeys. Nobody is surprised -- jockeys are short, and thus have small strike zones.

It's true that the data shows that if you look at the Jockey umpires, you'll see that they call a lot fewer strikes against batters of their own group than against batters of the other group. Their "UBM" coefficient is high and statistically significant.

Does that mean the jockey umps are "racist" against dentists? No, of course not. It's just that the dentists have bigger strike zones.

It's the same, but in reverse, for the dentist umpires. They call more strikes against their fellow dentists -- again, not because of pro-jockey "reverse racism," but because of the different strike zones.

Later, teams of NBA players enter the league. These guys are tall, with huge strike zones, so they get a lot of called strikes, even from their own umpires.

Let's put some numbers on this: we'll say there are 10 teams of dentists, 1 team of jockeys, and 2 teams of NBA players. The jockeys are -10 in called strikes compared to average, and the NBA players are +10. That leaves the dentists at -1 (in order for the average to be zero).

Here's a chart that shows every umpire is completely fair and unbiased.

Umpire             Jockey    NBA    Dentist
-------------------------------------------
Jockey batter       [-10]    -10     -10
NBA batter           +10    [+10]    +10
Dentist batter        -1      -1     [-1]

I've put the "UBM" cells, where the umpire matches the batter, in brackets. If you look only at those cells, and don't think too much about what's going on, you could think the umpires are horribly biased. The Jockey batters get 10 fewer strikes than average from Jockey umpires!  That's awful!

But then when you look closer, you see the horizontal row is *all* -10. That means all the umpires called the jockeys the same way (-10), so it's probably something about the jockey batters that made that happen. In this case, it's that they're short.

I think this is what's going on in the actual study. But it's harder to see, because the chart isn't set up with the raw numbers. The author ran different regressions for the three different umpire races, and set a different set of batters as the zero-level for each. Since they're calibrated to a different standard of player, the results make the umpires look very different.

If I had done here what the author did there, the chart above would have looked like this:

Umpire             Jockey    NBA   Dentist
------------------------------------------
Jockey batter:         0    -20      -9
NBA batter           +20      0     +11
Dentist batter        +9    -11       0
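The re-baselining is easy to reproduce. Here is a minimal Python sketch (mine, not the study's code; the names are made up for illustration) that starts from the unbiased chart, checks that the league average works out to zero given the 10/1/2 team split, and then subtracts each umpire group's own-batter value from its column so the diagonal becomes zero.

```python
# Called strikes relative to average for each batter group, from the first chart.
# Every umpire group calls a given batter group the same way (no bias at all).
GROUPS = ["Jockey", "NBA", "Dentist"]
RAW_BY_BATTER = {"Jockey": -10, "NBA": +10, "Dentist": -1}
TEAMS = {"Jockey": 1, "NBA": 2, "Dentist": 10}

# Sanity check: weighted by number of teams, the league total is zero.
league_total = sum(RAW_BY_BATTER[g] * TEAMS[g] for g in GROUPS)


def rebaseline_to_diagonal(raw_by_batter, groups):
    """Return table[batter][umpire], measured against the umpire's own group."""
    return {
        batter: {
            umpire: raw_by_batter[batter] - raw_by_batter[umpire]
            for umpire in groups
        }
        for batter in groups
    }


diagonal_table = rebaseline_to_diagonal(RAW_BY_BATTER, GROUPS)

print(f"{'Umpire':<16}" + "".join(f"{g:>9}" for g in GROUPS))
for batter in GROUPS:
    row = "".join(f"{diagonal_table[batter][u]:>+9d}" for u in GROUPS)
    print(f"{batter + ' batter':<16}{row}")
```

Every umpire group here was built to be identical, yet the re-baselined columns disagree with each other, purely because each one subtracts a different constant.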

If you just look at this chart without knowing you can't compare the columns to each other (because they're based on a different zero baseline), it's easy to think there's evidence of bias. You'd look at the chart and say, "Hey, it looks like Jockey umpires are racist against NBA batters and dentists. Also, dentist umpires are racist against NBA players but favor Jockeys somewhat. But, look!  NBA umpires actually *favor* other races!  That's probably because NBA umpires are new to the tournament, and are going out of their way to appear unbiased."

That's a near-perfect analogue to the actual study.  This is the top half of Table 8, which measures "over-recognition" of pitchers, meaning balls incorrectly called as strikes (hurting the batter). I've multiplied everything by 1000, so the numbers are "wrong strike calls per 1000 called pitches outside the zone".

Umpire             Black   Hispanic   White
-------------------------------------------
Black batter:       ---      -5.3     -0.3
Hispanic batter    +7.8      ---      +5.9
White batter       +5.6      -4.4      ---

It's very similar to my fake table above, where the dentists and Jockeys look biased, but the NBA players look "reverse biased".

The study notes the chart and says,

"For White umpires, the results suggest that for pitches outside the zone, Hispanic batters ... face umpire discrimination. [But Hispanic umpires have a] "reverse-bias effect ... [which] holds for both Black and White batters... Lastly, the bias against non-Black batters by Black umpires is relatively consistent for both Hispanic and White batters."

And it rationalizes the apparent "reverse racism" from Hispanic umpires this way:

"This is perhaps attributable to the recent increase in MLB umpires from Hispanic countries, who could potentially fear the consequences of appearing biased towards Hispanic players."

But ... no. The apparent result is almost completely the result of setting a different zero level for each umpire/batter race -- in other words, by arbitrarily setting the diagonal to zero. That only works if the groups of batters are exactly the same. They're not. Just as Jockey batters have different characteristics than NBA player batters, it's likely that Hispanic batters don't have exactly the same characteristics as White and Black batters.

The author decided that White, Black, and Hispanic batters all should get exactly the same results from an unbiased umpire. If that assumption is false, the effect disappears.

Instead, the study could have made a more conservative assumption: that unbiased umpires of any race should call *White* batters the same. (Or Black batters, or Hispanic batters. But White batters have the largest sample size, giving the best signal-to-noise ratio.)

That is, use a baseline where the bottom row is zero, rather than one where the diagonal is zero. To do that, take the original, set the bottom cells to zero, but keep the differences between any two rows in the same column:

Umpire             Black   Hispanic  White
------------------------------------------
Black batter:      -5.6     -0.9     -0.3
Hispanic batter    +2.2     +4.4     +5.9
White batter        ---      ---      ---

Does this look like evidence of umpire bias? I don't think so. For any given race of batter, all three groups of umpires call about the same number of bad strikes. In fact, all three groups of umpires even have the same *order* among batter groups: Hispanic the most, White second, and Black third. (The raw odds of that happening are 1 in 36).
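For anyone who wants to check the arithmetic, here's a rough Python sketch (my own illustration with made-up variable names, not anything from the study). It re-baselines the Table 8 numbers quoted above to the White-batter row, confirms that each umpire group ranks the batter groups in the same order, and counts the 1-in-36 figure by brute force. The 1-in-36 count assumes the three rankings are independent and that all six orderings are equally likely.

```python
from fractions import Fraction
from itertools import permutations, product

UMPIRES = ["Black", "Hispanic", "White"]

# table8[batter][umpire], wrong strike calls per 1000, diagonal set to zero.
table8 = {
    "Black":    {"Black": 0.0, "Hispanic": -5.3, "White": -0.3},
    "Hispanic": {"Black": 7.8, "Hispanic": 0.0,  "White": 5.9},
    "White":    {"Black": 5.6, "Hispanic": -4.4, "White": 0.0},
}


def rebaseline_to_row(table, baseline_batter):
    """Shift each umpire column so the baseline batter's row is zero."""
    base = table[baseline_batter]
    return {
        batter: {u: round(row[u] - base[u], 1) for u in UMPIRES}
        for batter, row in table.items()
    }


rebased = rebaseline_to_row(table8, "White")

# For each umpire group, rank batter groups from most to fewest bad strikes.
rankings = {
    u: tuple(sorted(rebased, key=lambda b: rebased[b][u], reverse=True))
    for u in UMPIRES
}
same_order = len(set(rankings.values())) == 1

# Chance that three independent random rankings of three groups all agree.
orders = list(permutations(rebased))
triples = list(product(orders, repeat=3))
agree = sum(1 for a, b, c in triples if a == b == c)
chance = Fraction(agree, len(triples))

print(rebased)
print(rankings["White"], same_order, chance)
```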

The only anomaly is that maybe it looks like there's some evidence that Black umpires benefit Black batters by about 5 pitches per 1,000, but even that difference is not statistically significant.

In other words: the entire effect in the study disappears when you remove the hidden assumption that Hispanic batters respond to pitches exactly the same way as White or Black batters. And the pattern of "discrimination" is *exactly* what you'd expect if the Hispanic batters respond to pitches in ways that result in more errors -- that is, it explains the anomaly that Hispanic umpires tend to look "reverse racist."

Also, I think the entire effect would disappear if the author had expanded his regression to include dummy variables for the race of the batter.

------

If, like me, you find it perfectly plausible that Hispanic batters respond to pitches in ways that generate more umpire errors, you can skip this section. If not, I will try to convince you.

First, keep in mind that it's a very, very small difference we're talking about: maybe 4 pitches per 1,000, or 0.4 percent. Compare that to some of the other, much larger effects the study found:

+8.9%   3-0 count on the batter
-0.9%   two outs
+2.8%   visiting team batting
-3.3%   right-handed batter
+0.5%   right-handed pitcher
+1.4%   pitcher 2 WAR vs. 0 WAR
+0.9%   pitcher has two extra all-star appearances
+4.0%   2019 vs. 2008
---------------------------------------------------
+0.4%   batter is Hispanic
---------------------------------------------------

I wouldn't have expected most of those other effects to exist, but they do. And they're so large that they make this one, at only +0.4%, look unremarkable.

Also: with so many large effects found in the study, there are probably other factors the author didn't consider that are just as large. Just to make something up ... since handedness of pitcher and batter are so important, suppose that platoon advantage (the interaction between pitcher and batter hand, which the study didn't include) is worth, say, 5%. And suppose Hispanic batters are likely to have the platoon advantage, say, 8% less than White batters. That would give you an 0.4% effect right there.

I don't have data specifically for Hispanic batters, but I do have data for country of birth. Not all non-USA players are Hispanic, but probably a large subset are, so I split them up that way. Here are batting-handedness stats for players from 1969 to 2016:

Born in USA:       61.7% RHB
Born outside USA:  67.1% RHB

That's a difference of about 5 percentage points in handedness (roughly 9% in relative terms). I don't know how that translates into platoon advantage, but it's got to be the same order of magnitude as what we'd need for 0.4%.

Here's another theory. They used to say, about prospects from the Dominican Republic, that they deliberately become free swingers because "you can't walk off the island."

Suppose that, knowing a certain player is a free swinger, the pitcher aims a bit more outside the strike zone than usual, knowing the batter is likely to swing anyway. If the catcher sets a target outside, and the pitcher hits it perfectly, the umpire may be more likely to miscall it as a strike (at least according to many broadcasters I've heard).

Couldn't that explain why Hispanic players get very slightly more erroneous strike calls?

In support of that hypothesis, here are K/BB ratios for that same set of batters (total K divided by total BB):

Born in USA:       1.82 K per BB
Born outside USA:  2.05 K per BB

Again, that seems around the correct order of magnitude.

I'm not saying these are the right explanations -- they might be right, or they might not. The "right answer" is probably several factors, perhaps going different directions, but adding up to 0.4%.

But the point is: there do seem to be significant differences in hitting styles between Hispanic and non-Hispanic batters, certainly significant enough that an 0.4% difference in bad calls is quite plausible. Attributing the entire 0.4% to racist umpires (and assuming that all races of umpires would have to discriminate against Hispanics!) doesn't have any justification whatsoever -- at least not without additional evidence.

-------

Here's a TLDR summary, with a completely different analogy this time:

Eddie Gaedel's father calls fewer strikes on Eddie Gaedel than Aaron Judge's father calls on Aaron Judge. So Gaedel Sr. must be biased!

--------

There's another part of the study -- actually, the main part -- that throws everything into one big regression and still comes out with a significant "UBM" effect, which again it believes is racial bias. I think that conclusion is also wrong, for reasons that aren't quite the same.

That's Part II, which is now here.

----------

(*The author found a similar result for pitchers, who gained an advantage in more called strikes when they were the same race as the umpire, and a similar result for called balls as well as called strikes. In this post, I'll just talk about the batting side and the called strikes, but the issues are the same for all four combinations of batter/pitcher ball/strike.)


## Monday, July 26, 2021

### DRS team fielding seems overinflated

In a previous post, I noticed that the DRS estimates of team fielding seemed much too high in many cases. In fact, the spread (standard deviation) of team DRS was almost three times as high as other methods (UZR and OAA).

For instance, here are the three competing systems for the 2016 Chicago Cubs:

UZR: +43 runs (range)
OAA: +29 runs
DRS: +96 runs (107 - 11 for catcher framing)

Since I wrote that, the DRS people (Baseball Info Solutions, or BIS) have issued significant corrections for the 2018 and 2019 seasons (and smaller corrections for 2017). It seems the MLB feeds were off in their timing; when the camera switched from showing the batter to showing the batted ball, they skipped a fraction of a second, which is a big deal when evaluating fielders.

The corrections are a big improvement -- most of the extreme figures have shrunk. For instance, the 2018 Phillies improve from -111 runs to -75 runs. It seems that Baseball Reference has not yet updated with the new figures ... Aaron Nola's numbers remain where they were when I wrote the previous post in January.

However, as far as I can tell, DRS numbers before 2017 remain unchanged, so the problem is still there.

------

In my previous posts, I found that team BABIP (batting average on balls in play, which is where the effects of fielding should be seen) had an SD of about 35 runs. (That's about 44 plays out of 3900, at an assumed value of 0.8 runs per play.)

Again from those posts, we should expect only about 42 percent of that variance to belong to the fielders -- the rest is the result of pitchers giving up easier balls in play (48 percent) and park effects (10 percent).

In other words, we should be seeing

23 runs fielders
24 runs pitchers
11 runs park
----------------
35 runs total
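The split works on variances, not on SDs, which is why the three components add up to more than 35. A quick Python sketch of the arithmetic (variable names are mine, and it assumes the three sources are independent of each other):

```python
import math

TOTAL_SD_RUNS = 35.0
VARIANCE_SHARE = {"fielders": 0.42, "pitchers": 0.48, "park": 0.10}

# Each component's SD is the total SD times the square root of its variance share.
component_sd = {
    name: TOTAL_SD_RUNS * math.sqrt(share)
    for name, share in VARIANCE_SHARE.items()
}

# Independent components recombine as the square root of the sum of squares.
recombined_sd = math.sqrt(sum(sd ** 2 for sd in component_sd.values()))

for name, sd in component_sd.items():
    print(f"{sd:4.0f} runs {name}")
print(f"{recombined_sd:4.0f} runs total")
```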

That means any metric that tries to quantify the performance of the fielders should come in with an SD of about 23 runs.

DRS, from 2003 to 2019, comes in at 41 runs (data courtesy Fangraphs). I didn't calculate the SD after subtracting off catcher, because Fangraphs doesn't provide it in their downloadable spreadsheet.

I did figure it out for 2018, where I typed in the numbers manually from the DRS website. That season, the SD without catcher was about 85 percent of the total SD. Using the same adjustment for other years would bring the multi-year observed SD from 41 down to 35.

That SD of 35 runs happens to be the same as the SD of BABIP runs. That means DRS is effectively attributing the *entire* team BABIP performance to the fielders, and none to the pitchers or park.

By comparison, the other two metrics are more reasonable:

OAA: 18 runs
UZR: 25 runs
DRS: 35 runs

(Note: Tango tells me I need to bump up official OAA by about 7 percent to account for missing plays, so I've done that. In previous posts, I used 20 percent, which is now wrong -- first, because the data has been improved since then, and, second, because I previously forgot about regression to the mean for the missing data. I should have used 14 percent instead of 20 then, I think.)

OAA and UZR are right around the theoretical 23 runs. DRS, on the other hand, is much higher. To get DRS down to 23 runs, you have to regress it to the mean by about a third. So the 2016 Cubs need to fall from +96 to +63.

To get DRS down to the OAA level of 18 runs, you have to regress by about half, from +96 to around +49.
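Here's how I'd sketch that shrinkage in Python. The function names are made up for illustration, and it assumes the simplest possible approach: scale every team toward zero by the same fraction, chosen so the SD lands on the target.

```python
def regression_fraction(observed_sd, target_sd):
    """Fraction to regress toward the mean so observed_sd shrinks to target_sd."""
    return 1.0 - target_sd / observed_sd


def regress_to_mean(value, fraction, mean=0.0):
    """Pull value toward mean by the given fraction."""
    return mean + (value - mean) * (1.0 - fraction)


DRS_SD = 35.0          # multi-year DRS SD, catcher removed
CUBS_2016_DRS = 96.0

for label, target_sd in [("theoretical", 23.0), ("OAA", 18.0)]:
    frac = regression_fraction(DRS_SD, target_sd)
    cubs = regress_to_mean(CUBS_2016_DRS, frac)
    print(f"{label}: regress {frac:.0%}, Cubs {CUBS_2016_DRS:+.0f} -> {cubs:+.0f}")
```

The two fractions come out to roughly a third and roughly a half, matching the figures above.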

-----

If DRS is overinflated, does that mean it's also less accurate in identifying the good and bad fielding teams? Apparently not! Despite its outsized values, in 2018, DRS predicted BABIP better than OAA did, in terms of correlations:

OAA: .58 correlation
DRS: .62 correlation

Correlations include a "built in" regression to the mean, which is why DRS could do well despite being exaggerated.

In 2019, though, DRS isn't nearly as accurate:

OAA: .48 correlation
DRS: .33 correlation

I guess you could do more years and figure out which metric is better, and by how much. You could include UZR in there too. I probably should have done that myself, but I didn't think of it earlier and I'm too lazy to go do it now.

-------

And, just for reference, here are the SDs for 2018 and 2019 specifically (DRS does not include catcher):

2018 DRS:      41
2018 OAA + 7%: 20
------------------------------------
regress DRS to mean 51% to match OAA

2019 DRS:      44
2019 OAA + 7%: 19
------------------------------------
regress DRS to mean 57% to match OAA

------

So I'm not sure what's going on with DRS. They seem to be double-counting somewhere in their algorithm, but I don't know how or where.

If you're using DRS, I would suggest you first regress to the mean by around a third if you want to match the theoretical SD of 23, and by around half if you want to match the OAA SD of 19. The correlations to BABIP suggest that DRS, once regressed, could be as accurate as OAA.


## Sunday, January 31, 2021

### Splitting defensive credit between pitchers and fielders (Part III)

(This is part 3.  Part 1 is here; part 2 is here.)

UPDATE, 2021-02-01: Thanks to Chone Smith in the comments, who pointed out an error.  I investigated and found an error in my code. I've updated this post -- specifically, the root mean error and the final equation. The description of how everything works remains the same.

------

Last post, we estimated that in 2018, Phillies fielders were 3 outs better than league average when Aaron Nola was on the mound. That estimate was based on the team's BAbip and Nola's own BAbip.

Our first step was to estimate the Phillies' overall fielding performance from their BAbip. We had to do that because BAbip is a combination of both pitching and fielding, and we had to guess how to split those up. To do that, we just used the overall ratio of fielding BAbip to overall BAbip, which was 47 percent. So we figured that the Phillies fielders were -24, which is 47 percent of their overall park-adjusted -52.

We can do better than that kind of estimate, because, at least for recent years, we have actual fielding data that can substitute for that estimate. Statcast tells us that the Phillies fielders were -39 outs above average (OAA) for the season*. That's 75 percent of BAbip, not 47 percent ... but still well within typical variation for teams.

(*The published estimate is -31, but I'm adding 25 percent (per Tango's suggestion) to account for games not included in the OAA estimate.)

So we can get much more accurate by starting with the true zone fielding number of -39, instead of the weaker estimate of -24.

-------

First, let's convert the -39 back to BAbip, by dividing it by 3903 BIP. That gives us ... almost exactly -10 points.

The SD of fielding talent is 6.1. The SD of fielding luck in 3903 BIP is 3.65. So it works out that luck is 2.6 of the 10 points, and talent is the remaining 7.3. (That's because the luck share of the variance is 3.65^2/(3.65^2+6.1^2) = 0.26, and 26 percent of 10 points is 2.6.)
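Spelled out as a short Python sketch (names are mine; the 6.1 and 3.65 are taken as given from the text rather than derived):

```python
SD_TALENT = 6.1    # SD of team fielding talent, in points of BAbip
SD_LUCK = 3.65     # SD of team fielding luck over 3903 BIP
TEAM_OAA = -39     # Phillies outs above average, after the 25 percent bump
TEAM_BIP = 3903

observed_points = TEAM_OAA / TEAM_BIP * 1000   # close to -10

# Share of an observed deviation that we expect to be luck.
luck_share = SD_LUCK ** 2 / (SD_LUCK ** 2 + SD_TALENT ** 2)

luck_points = observed_points * luck_share
talent_points = observed_points - luck_points

print(f"observed {observed_points:.1f}, luck {luck_points:.1f}, talent {talent_points:.1f}")
```

Run with unrounded intermediate values, the talent figure can land a tenth of a point away from the 7.3 used in the text; that's just rounding.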

We have no reason (yet) to believe Nola is any different from the rest of the team, so we'll start out with an estimate that he got team average fielding talent of -7.3, and team average fielding luck of -2.6.

Nola's BAbip was .254, in a league that was .295. That's an observed 41 point benefit. But, with fielders that averaged -0.0073 talent and -0.0026 luck, in a park that was +0.0025, that +41 becomes +48.5.

That's what we have to break down.

Here's Nola's SD breakdown, for his 519 BIP. We will no longer include fielding talent in the chart, because we're using the fixed team figure for Nola, which is estimated elsewhere and not subject to revision. But we keep a reduced SD for fielding luck relative to team, because that's different for every pitcher.

9.4 fielding luck
7.6 pitching talent
17.3 pitching luck
1.5 park
--------------------
21.2 total

Converting to percentages:

20% fielding luck
13% pitching talent
67% pitching luck
1% park
--------------------
100% total

Using the above percentages, the 48.5 becomes:

+ 9.5 points fielding luck
+ 6.3 points pitching talent
+32.5 points pitching luck
+ 0.2 points park
-------------------
+48.5 points

Adding back in the -7.3 points for observed Phillies talent, -2.6 for Phillies luck, and 2.5 points for the park, gives

-7.3 points fielding talent [0 - 7.3]
+6.9 points fielding luck   [+9.5 - 2.6]
+6.3 points pitching talent
+32.5 points pitching luck
+2.7 points park            [0.2 + 2.5]
-----------------------------------------
41   points

Stripping out the two fielding rows:

-7.3 points fielding talent
+6.9 points fielding luck
-----------------------------
-0.4 points fielding

The conclusion: instead of hurting him by 10 points, as the raw team BAbip might suggest, or helping him by 6 points, as we figured last post ... Nola's fielders only hurt him by 0.4 points. That's less than a fifth of a run. Basically, Nola got league-average fielding.
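Here's the whole Nola calculation as a rough Python sketch, so the steps can be followed in one place. The helper name is made up, and the inputs are the rounded figures from the text, so the result only matches to within a tenth of a point or so.

```python
def allocate_by_variance(total_points, sds):
    """Split a deviation among causes in proportion to their variances (SD squared)."""
    total_variance = sum(sd ** 2 for sd in sds.values())
    return {name: total_points * sd ** 2 / total_variance for name, sd in sds.items()}


# Nola's SDs for 519 BIP, in points of BAbip.
NOLA_SDS = {
    "fielding luck": 9.4,
    "pitching talent": 7.6,
    "pitching luck": 17.3,
    "park": 1.5,
}

TEAM_FIELDING_TALENT = -7.3   # Phillies fielders, estimated from team OAA
TEAM_FIELDING_LUCK = -2.6
PARK = 2.5

# Nola's deviation to explain, after removing the team-level effects (per the text).
to_explain = 48.5
split = allocate_by_variance(to_explain, NOLA_SDS)

# Add the team-level effects back in.
fielding_luck = split["fielding luck"] + TEAM_FIELDING_LUCK
fielding_total = TEAM_FIELDING_TALENT + fielding_luck
park_total = split["park"] + PARK

for name, points in split.items():
    print(f"{points:+6.1f} {name}")
print(f"{park_total:+6.1f} park, after adding the team park factor back")
print(f"{fielding_total:+6.1f} fielding behind Nola, net")
```

The net fielding number comes out slightly negative and very close to zero, which is the point: league-average fielding, give or take.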

--------

Like before, I ran this calculation for all the pitchers in my database. Here are the correlations to actual "gold standard" OAA behind the pitcher:

r=0.23 assume pitcher fielding BAbip = team BAbip
r=0.37 BAbip method from last post
r=0.48 assume pitcher OAA = team OAA
r=0.53 this method

And the root mean square error:

13.7 assume pitcher fielding BAbip = team BAbip
11.3 BAbip method from last post
10.2 assume pitcher OAA = team OAA
10.0 this method

-------

Like in the last post, here's a simple formula that comes very close to the result of all these manipulations of SDs:

F = 0.8*T + 0.2*P

Here, "F" is fielding behind the pitcher, which is what we're trying to figure out. "T" is team OAA/BAbip. "P" is player BAbip compared to league.

Unlike the last post, here the team *does* include the pitcher you're concerned with. We had to do it this way because presumably we don't have OAA data for the team without the pitcher. (If we did, we'd just subtract it from the team total and get the pitcher's number directly!)

It looks like 20% of a pitcher's discrepancy is attributable to his fielders. That number is for workloads similar to those in my sample -- around 175 IP. It does vary with playing time, but only slightly. At 320 IP, you can use 19% instead. At 40 IP, you can use 22%. Or, just use 20% for everyone, and you won't be too far wrong.
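As a sketch, the shortcut is a one-liner in Python. The function name and the sign convention are my own assumptions here: everything is in points of BAbip relative to league, with positive meaning fewer hits than average. Plugging in the rounded Nola figures from above (team OAA of -10 points, Nola's own BAbip at +41 points) gives a number close to zero, in line with the longer calculation.

```python
def fielding_behind_pitcher(team_points, pitcher_points,
                            team_weight=0.8, pitcher_weight=0.2):
    """Estimate fielding behind one pitcher, in points of BAbip relative to league.

    team_points:    team OAA per BIP (team includes this pitcher)
    pitcher_points: the pitcher's own BAbip relative to league
    """
    return team_weight * team_points + pitcher_weight * pitcher_points


nola = fielding_behind_pitcher(team_points=-10.0, pitcher_points=41.0)
print(f"Nola 2018: {nola:+.1f} points")
```

The weights are left as parameters so the pitcher weight can be nudged for unusual workloads, as described above.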

-------

Full disclosure: the real life numbers for 2017-19 are different. The theory is correct -- I wrote a simulation, and everything came out pretty much perfect. But on real data, not so perfect.

When I ran a linear regression to predict OAA from team OAA and player BAbip, it didn't come out to 20%. It came out to only about 11.5%. The 95% confidence interval only brings it up to 15% or 16%.

The same thing happened for the formula from the last post: instead of the predicted 26%, the actual regression came out to 17.5%.

For the record, these are the empirical regression equations, all numbers relative to league:

F = 0.23*(Team BAbip without pitcher) + 0.175*P
F = 0.92*(Team OAA/BIP including pitcher) + 0.115*P

Why so much lower than expected? I'm pretty sure it's random variation. The empirical estimate of 11.5% is very sensitive to small variations in the seasonal balance of variation in pitching and fielding luck vs. talent -- so sensitive that the difference between 11.5 points and 20 points is not statistically significant. Also, the actual number changes from year-to-year because of variation. So, I believe that the 20% number is correct as a long-term average, but for the seasons in the study, the actual number is probably somewhere between 11.5% and 20%.

I should probably explain that in a future post. But, for now, if you don't believe me, feel free to use the empirical numbers instead of my theoretical ones. Whether you use 11.5% or 20%, you'll still be much more accurate than using 100%, which is effectively what happens when you use the traditional method of assigning the overall team number equally to every pitcher.


## Monday, January 11, 2021

### Splitting defensive credit between pitchers and fielders (Part II)

(Part 1 is here.  This is Part 2.  If you want to skip the math and just want the formula, it's at the bottom of this post.)

------

When evaluating a pitcher, you want to account for how good his fielders were. The "traditional" way of doing that is, you scale the team fielding to the pitcher. Suppose a pitcher was +20 runs better than normal, and his team fielding was -5 runs for the season. If the pitcher pitched 10 percent of the team innings, you might figure the fielding cost him 0.5 runs, and adjust him from +20 to +20.5.

I have argued that this isn't right. Fielding performance varies from game to game, just like run support does. Pitchers with better ball-in-play numbers probably got better fielding during their starts than pitchers with worse ball-in-play numbers.

By analogy to run support: in 1972, Steve Carlton famously went 27-10 on a Phillies team that was 32-87 without him. Imagine how good he must have been to go 27-10 for a team that scored only 3.22 runs per game!

Except ... in the games Carlton started, the Phillies actually scored 3.76 runs per game. In games he didn't start, the Phillies scored only 3.03 runs per game.

The fielding version of Steve Carlton might be Aaron Nola in 2018. A couple of years ago, Tom Tango pointed out the problem using Nola as an example, so I'll follow his lead.

Nola went 17-6 for the Phillies with a 2.37 ERA, and gave up a batting average on balls in play (BAbip) of only .254, against a league average of .295 -- that, despite an estimate that his fielders were 0.60 runs per game worse than average. If you subtract 0.60 from Nola's stat line, you wind up with Nola's pitching equivalent to an ERA in the 1s. As a result, Baseball-Reference winds up assigning Nola a WAR of 10.2, tied with Mike Trout for best in MLB that year.

But ... could Nola really have been hurt that much by his fielders? A BAbip of .254 is already exceptionally low. An estimate of -0.60 runs per game implies his BAbip with average fielders would have been .220, which is almost unheard of.

(In fairness: the Phillies 0.60 DRS fielding estimate, which comes from Baseball Info Solutions, is much, much worse than estimates from other sources -- three times the UZR estimate, for instance. I suspect there's some kind of scaling bug in recent BIS ratings, because, roughly, if you divide DRS by 3, you get more realistic numbers, and standard deviations that now match the other measures. But I'll save that for a future post.)

So Nola was almost certainly hurt less by his fielders than his teammates were, the same way Steve Carlton was hurt less by his hitters than his teammates were. But, how much less?

Phrasing the question another way: Nola's BAbip (I will leave out the word "against") was .254, on a team that was .306, in a league that was .295. What's the best estimate of how his fielders did?

I think we can figure that out, extending the results in my previous post.

------

First, let's adjust for park. In the five years prior to 2018, the Phillies BAbip for both teams combined was .0127 ("12.7 points") better at Citizens Bank Park than in Phillies road games. Since only half of Phillies games were at home, that's 6.3 points of park factor. Since there's a lot of luck involved, I regressed 60 percent to the mean of zero (with a limit of 5 points of regression, to avoid ruining outliers like Coors Field), leaving the Phillies with 2.5 points of park factor.
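Here's that park adjustment as a minimal Python sketch. The function and parameter names are mine, and I'm reading "a limit of 5 points of regression" as a cap on how many points get regressed away; that reading is an assumption. The second call uses a made-up home/road gap, just to show the cap kicking in.

```python
import math


def regressed_park_factor(home_minus_road_points,
                          home_share=0.5,
                          regress_fraction=0.60,
                          max_regression_points=5.0):
    """Park factor in points of BAbip, after regressing toward zero.

    The amount regressed away is capped at max_regression_points, so
    extreme parks keep most of their effect.
    """
    raw = home_minus_road_points * home_share
    regression = min(abs(raw) * regress_fraction, max_regression_points)
    return math.copysign(abs(raw) - regression, raw)


phillies = regressed_park_factor(12.7)
extreme = regressed_park_factor(30.0)   # hypothetical extreme park, not real data
print(f"Phillies: {phillies:.1f} points, hypothetical extreme park: {extreme:.1f} points")
```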

Now, look at how the Phillies did with all the other pitchers. For non-Nolas, the team BAbip was .3141, against a league average of .2954. Take the difference, subtract the park factor, and the Phillies were 21 points worse than average.

How much of those 21 points came from below-average fielding talent? To figure that out, here's the SD breakdown from the previous post, but adjusted. I've bumped luck upwards for the lower number of PA, dropped park down to 1.5 since we have an actual estimate, and increased the SD of pitching because the Phillies had more high-inning guys than average:

6.1 points fielding talent
3.9 points fielding luck
5.6 points pitching talent
6.8 points pitching luck
1.5 points park
---------------------------
11.5 points total

Of the Phillies' 21 points in BAbip, what percentage is fielding talent? The answer: (6.1/11.5)^2, or 28 percent. That's 5.9 points.

So, we assume that the Phillies' fielding talent was 5.9 points of BAbip worse than average. With that number in hand, we'll leave the Phillies without Nola and move on to Nola himself.
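The same step as a short Python sketch (variable names are mine; the sign convention, negative meaning worse than league, is an assumption for the example):

```python
import math

# SDs for the Phillies without Nola, in points of BAbip.
TEAM_SDS = {
    "fielding talent": 6.1,
    "fielding luck": 3.9,
    "pitching talent": 5.6,
    "pitching luck": 6.8,
    "park": 1.5,
}

total_variance = sum(sd ** 2 for sd in TEAM_SDS.values())
total_sd = math.sqrt(total_variance)

# Fielding talent's share of the variance is its share of the observed deviation.
talent_share = TEAM_SDS["fielding talent"] ** 2 / total_variance

observed_points = -21.0   # park-adjusted, negative = worse than league
talent_points = observed_points * talent_share

print(f"total SD {total_sd:.1f}, talent share {talent_share:.0%}, "
      f"fielding talent {talent_points:.1f} points")
```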

-------

On the raw numbers, Nola was 41 points better than the league average. But, we estimated, his fielding was about 6 points worse, while his park helped him by 2.5 points, so he was really 44.5 points better.

For an individual pitcher with 700 BIP, here's the breakdown of SDs, again from the previous post:

6.1  fielding talent
7.6  fielding luck
7.6  pitching talent
15.5  pitching luck
3.5  park
---------------------
20.2  total

We have to adjust all of these for Nola.

First, fielding talent goes down to 5.2. Why? Because we estimated it from other data, and so we have less variance than if we just took the all-time average. (A simulation suggests that we multiply the 6.1 by, from the "team without Nola" case, (SD without fielding talent)/(SD with fielding talent).)

Fielding luck and pitching luck increase because Nola had only 519 BIP, not 700.

Finally, park goes to 1.5 for the same reason as before.

5.2 fielding talent
10.0 fielding luck
7.6 pitching talent
17.3 pitching luck
1.5 park
--------------------
22.1 total
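A rough Python check of two of those adjustments: the shrunken fielding-talent SD and the new total. The names are mine, and the luck SDs are simply taken from the chart above rather than derived.

```python
import math

# Team-without-Nola SDs, used to shrink the fielding talent SD.
TEAM_SDS = {
    "fielding talent": 6.1,
    "fielding luck": 3.9,
    "pitching talent": 5.6,
    "pitching luck": 6.8,
    "park": 1.5,
}

sd_with = math.sqrt(sum(sd ** 2 for sd in TEAM_SDS.values()))
sd_without = math.sqrt(
    sum(sd ** 2 for name, sd in TEAM_SDS.items() if name != "fielding talent")
)
shrunk_talent_sd = TEAM_SDS["fielding talent"] * sd_without / sd_with

# Nola's adjusted SDs, for 519 BIP.
NOLA_SDS = {
    "fielding talent": round(shrunk_talent_sd, 1),
    "fielding luck": 10.0,
    "pitching talent": 7.6,
    "pitching luck": 17.3,
    "park": 1.5,
}
nola_variance = sum(sd ** 2 for sd in NOLA_SDS.values())
nola_total_sd = math.sqrt(nola_variance)

# Combined fielding share (talent plus luck) of Nola's variance.
fielding_share = (
    NOLA_SDS["fielding talent"] ** 2 + NOLA_SDS["fielding luck"] ** 2
) / nola_variance

print(f"talent SD {shrunk_talent_sd:.1f}, total SD {nola_total_sd:.1f}, "
      f"fielding share {fielding_share:.0%}")
```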

Convert to percentages:

5.5% fielding talent
20.4% fielding luck
11.8% pitching talent
61.3% pitching luck
0.5% park
---------------------
100% total

Multiply by Nola's 44.5 points:

2.5 fielding talent
9.1 fielding luck
5.3 pitching talent
27.3 pitching luck
0.2 park
--------------------
44.5 total

Now we add in our previous estimates of fielding talent and park, to get back to Nola's raw total of 41 points:

-3.4 fielding talent [2.5-5.9]
9.1 fielding luck
5.3 pitching talent
27.3 pitching luck
2.7 park            [0.2+2.5]
------------------------------
41 total

Consolidate fielding and pitching:

5.6 fielding
32.6 pitching
2.7 park
-------------
41   total

Conclusion: The best estimate is that Nola's fielders actually *helped him* by 5.6 points of BAbip. That's about 3 extra outs in his 519 BIP. At 0.8 runs per out, that's 2.4 runs, in 212.1 IP, for about 0.24 WAR or 10 points of ERA.
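The unit conversions in that last paragraph, as a small Python sketch. The names are mine, and the 10 runs per win is my assumption (it's the usual rule of thumb, and it's what the 0.24 WAR figure implies).

```python
FIELDING_POINTS = 5.6      # points of BAbip, from the consolidation above
NOLA_BIP = 519
NOLA_IP = 212 + 1 / 3      # "212.1" in baseball notation is 212 and a third innings
RUNS_PER_OUT = 0.8
RUNS_PER_WIN = 10.0        # assumption: rough runs-per-win rule of thumb

outs = FIELDING_POINTS / 1000 * NOLA_BIP
runs = outs * RUNS_PER_OUT
era_points = runs * 9 / NOLA_IP * 100   # hundredths of a run of ERA
war = runs / RUNS_PER_WIN

print(f"{outs:.1f} outs, {runs:.1f} runs, {era_points:.0f} points of ERA, {war:.2f} WAR")
```

Without rounding to 3 outs first, the runs and WAR figures come out a touch lower than in the text, but the ERA effect is still about 10 points.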

Baseball-reference had him at 60 points of ERA; we have him at 10. Our estimate brings his WAR down from 10.3 to 9.1, or something like that. (Again, in fairness, most of that difference is the weirdly-high DRS estimate of 0.60. If DRS had him at a more reasonable .20, we'd have adjusted him from 9.4 to 9.1, or something.)

-------

Our estimate of +3 outs is ... just an estimate. It would be nice if we had real data instead. We wouldn't have to do all this fancy stuff if we had a reliable zone-based estimate specifically for Nola.

Actually, we do! Since 2017, Statcast has been analyzing batted balls and tabulating "outs above average" (OAA) for every pitcher. For Nola, in 2018, they have +2. Tom Tango told me Statcast doesn't have data for all games, so I should multiply the OAA estimate by 1.25.

That brings Statcast to +2.5. We estimated +3. Not bad!

But Nola is just one case. And we might be biased in the case of Nola. This method is based on a pitcher of average talent. Nola is well above average, so it's likely some of the difference we attributed to fielding is really due to Nola's own BAbip pitching tendencies. Maybe instead of +3, his fielders were really +1 or something.

So I figured I'd better test other players too.

I found all pitchers from 2017 to 2019 that had Statcast estimates, with at least 300 BIP for a single team. There were a few players whose names didn't quite match my Lahman database, so I just let those go instead of fixing them. That left 342 pitcher-seasons. I assume almost all of them were starters.

For each pitcher, I ran the same calculation as for Nola. For comparison, I also did the "traditional" estimate where I gave the pitcher the same fielding as the rest of the team. Here are the correlations to the "gold standard" OAA:

r=0.23 traditional
r=0.37 this method

Here are the approximate root-mean-square errors (lower is better):

13.7 points of BAbip traditional
11.3 points of BAbip this method

This method is meant to be especially relevant for a pitcher like Nola, whose own BAbip is very different from his team's. Here are the root-mean-squared errors for pitchers who, like Nola, had a BAbip at least 10 plays better than their team's:

9.3 points this method

And for pitchers at least 10 plays worse:

9.3 points this method

------

Now, the best part: there's an easy formula to get our estimates, so we don't have to use the messy sums-of-squares stuff we've been doing so far.

We found that the original estimate for team fielding talent was 28% of observed-BAbip-without-pitcher. And then, our estimate for additional fielding behind that pitcher was 26% of the difference between that pitcher and the team. In other words, if the team's non-Nola BAbip (relative to the league) is T, and Nola's is P,

Fielders = .28T + .26(P-.28T)

The coefficients vary by numbers of BIPs. But the .28 is pretty close for most teams. And, the .26 is pretty close for most single-season pitchers: luck is 25% fielding, and talent is about 30% fielding, so no matter your proportion of randomness-to-skill, you'll still wind up between 25% and 30%.

Expanding that out gives an easier version of the fielding adjustment, which I'll print bigger.

------

Suppose you have an average pitcher, and you want to know how much his fielders helped or hurt him in a given season. You can use this estimate:

F = .21T + .26P

Where:

T is his team's BAbip relative to league for the other pitchers on the team, and

P is the pitcher's BAbip relative to league, and

F is the estimated BAbip performance of the fielders, relative to league, when that pitcher was on the mound.
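Here's the formula as a small Python sketch, alongside the two-step version it was expanded from, so you can see they agree. The function names are mine. I'm assuming a sign convention where positive means better than league (fewer hits), and I'm feeding in park-adjusted numbers for Nola (-21 for the team, and 41 minus 2.5 for Nola himself), which is my reading of the derivation above.

```python
def fielders_two_step(t, p, team_share=0.28, pitcher_share=0.26):
    """Team fielding talent first, then a share of what's left of the pitcher's gap."""
    team_talent = team_share * t
    return team_talent + pitcher_share * (p - team_talent)


def fielders_shortcut(t, p):
    """Expanded version: F = .21T + .26P."""
    return 0.21 * t + 0.26 * p


# Nola 2018, in points of BAbip relative to league.
T = -21.0   # Phillies without Nola, park-adjusted
P = 38.5    # Nola, park-adjusted

two_step = fielders_two_step(T, P)
shortcut = fielders_shortcut(T, P)
print(f"two-step {two_step:+.1f}, shortcut {shortcut:+.1f}")
```

The tiny gap between the two comes from rounding .2072 to .21.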

-----

Next: Part III, splitting team OAA among pitchers.
