DRS team fielding seems overinflated
In a previous post, I noticed that the DRS estimates of team fielding seemed much too high in many cases. In fact, the spread (standard deviation) of team DRS was almost three times as high as other methods (UZR and OAA).
For instance, here are the three competing systems for the 2016 Chicago Cubs:
UZR: +43 runs (range)
OAA: +29 runs
DRS: +96 runs (107 - 11 for catcher framing)
Since I wrote that, the DRS people (Baseball Info Solutions, or BIS) have issued significant corrections for the 2018 and 2019 seasons (and smaller corrections for 2017). It seems the MLB feeds were off in their timing; when the camera switched from showing the batter to showing the batted ball, they skipped a fraction of a second, which is a big deal when evaluating fielders.
The corrections are a big improvement -- most of the extreme figures have shrunk. For instance, the 2018 Phillies improve from -111 runs to -75 runs. It seems that Baseball Reference has not yet updated with the new figures ... Aaron Nola's numbers remain where they were when I wrote the previous post in January.
However, as far as I can tell, DRS numbers before 2017 remain unchanged, so the problem is still there.
------
In my previous posts, I found that team BABIP (batting average on balls in play, which is where the effects of fielding should be seen) had an SD of about 35 runs. (That's about 44 plays out of 3900, at an assumed value of 0.8 runs per play.)
Again from those posts, we should expect only about 42 percent of that variance to belong to the fielders -- the rest is the result of pitchers giving up easier balls in play (48 percent) and park effects (10 percent). Because those percentages are shares of variance, the SDs below add as squares rather than as a straight sum.
In other words, we should be seeing
23 runs fielders
24 runs pitchers
11 runs park
----------------
35 runs total
That means any metric that tries to quantify the performance of the fielders should come in with an SD of about 23 runs.
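The breakdown above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this post; the function name `component_sds` is mine, and it assumes the 42/48/10 split is a split of variance.

```python
import math

# Figures quoted in the post
plays_sd = 44          # SD of team BABIP, in plays (out of about 3900)
runs_per_play = 0.8    # assumed run value of one play
total_sd = plays_sd * runs_per_play   # about 35 runs

# Assumption: these are shares of the variance, not of the SD
shares = {"fielders": 0.42, "pitchers": 0.48, "park": 0.10}

def component_sds(total_sd, shares):
    """Split a total SD into component SDs, given each component's variance share."""
    total_var = total_sd ** 2
    return {name: math.sqrt(share * total_var) for name, share in shares.items()}

sds = component_sds(total_sd, shares)
for name, sd in sds.items():
    print(f"{name}: {sd:.0f} runs")

# Recombine by summing squares to recover the total
recombined = math.sqrt(sum(sd ** 2 for sd in sds.values()))
print(f"total: {recombined:.0f} runs")
```

The rounded components come out to 23, 24, and 11 runs, and they recombine to the 35-run total only when squared first.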
DRS, from 2003 to 2019, comes in at 41 runs (data courtesy Fangraphs). I didn't calculate the SD after subtracting off catcher, because Fangraphs doesn't provide it in their downloadable spreadsheet.
I did figure it out for 2018, where I typed in the numbers manually from the DRS website. That season, the SD without catcher was about 85 percent of the total SD. Using the same adjustment for other years would bring the multi-year observed SD from 41 down to 35.
That SD of 35 runs happens to be the same as the SD of BABIP runs. That means DRS is effectively attributing the *entire* team BABIP performance to the fielders, and none to the pitchers or park.
By comparison, the other two metrics are more reasonable:
OAA: 18 runs
UZR: 25 runs
DRS: 35 runs
(Note: Tango tells me I need to bump up official OAA by about 7 percent to account for missing plays, so I've done that. In previous posts, I used 20 percent, which is now wrong -- first, because the data has been improved since then, and, second, because I previously forgot about regression to the mean for the missing data. I should have used 14 percent instead of 20 then, I think.)
OAA and UZR are right around the theoretical 23 runs. DRS, on the other hand, is much higher. To get DRS down to 23 runs, you have to regress it to the mean by about a third. So the 2016 Cubs need to fall from +96 to +63.
To get DRS down to the OAA level of 18 runs, you have to regress by about half, from +96 to around +49.
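Here is a rough sketch of that regression arithmetic, assuming the simplest possible approach: shrink every team toward a league mean of zero by the ratio of the target SD to the observed SD. The helper names are hypothetical, not anything BIS or Fangraphs publishes.

```python
def regression_fraction(observed_sd, target_sd):
    """Fraction of the distance to the mean that must be removed."""
    return 1.0 - target_sd / observed_sd

def regress_to_mean(value, observed_sd, target_sd, mean=0.0):
    """Shrink a value toward the mean so the metric's SD matches target_sd."""
    shrink = target_sd / observed_sd
    return mean + (value - mean) * shrink

drs_sd = 35      # DRS team SD without catcher, from the post
cubs_drs = 96    # 2016 Cubs, DRS without catcher framing

for label, target_sd in [("theoretical", 23), ("OAA", 18)]:
    frac = regression_fraction(drs_sd, target_sd)
    adjusted = regress_to_mean(cubs_drs, drs_sd, target_sd)
    print(f"{label}: regress {frac:.0%}, Cubs become {adjusted:+.0f}")
```

A uniform shrink like this keeps the ranking of teams unchanged; it only rescales the spread.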
-----
If DRS is overinflated, does that mean it's also less accurate in identifying the good and bad fielding teams? Apparently not! Despite its outsized values, DRS predicted BABIP better than OAA did in 2018, in terms of correlations:
OAA: .58 correlation
DRS: .62 correlation
Correlations include a "built in" regression to the mean (they don't change when you multiply a metric by a constant), which is why DRS could do well despite being exaggerated.
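To illustrate why scale doesn't matter to a correlation, here is a small sketch. The numbers are made up purely for demonstration; they are not real DRS, OAA, or BABIP figures.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical team values, invented for this example only
metric = [12, -5, 30, -18, 4, -22, 9, 15]
babip_runs = [20, -10, 15, -25, -3, -8, 12, 2]

# The same metric with every value inflated by 50 percent
inflated = [1.5 * m for m in metric]

r_original = pearson(metric, babip_runs)
r_inflated = pearson(inflated, babip_runs)
print(r_original, r_inflated)
```

The two correlations match (up to floating-point rounding), so a metric that is too spread out can still correlate just as well as a properly scaled one.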
In 2019, though, DRS isn't nearly as accurate:
OAA: .48 correlation
DRS: .33 correlation
I guess you could do more years and figure out which metric is better, and by how much. You could include UZR in there too. I probably should have done that myself, but I didn't think of it earlier and I'm too lazy to go do it now.
-------
And, just for reference, here are the SDs for 2018 and 2019 specifically (DRS does not include catcher):
2018 DRS: 41
2018 OAA + 7%: 20
------------------------------------
regress DRS to mean 51% to match OAA
2019 DRS: 44
2019 OAA + 7%: 19
------------------------------------
regress DRS to mean 57% to match OAA
------
So I'm not sure what's going on with DRS. They seem to be double-counting somewhere in their algorithm, but I don't know how or where.
If you're using DRS, I would suggest you first regress to the mean by around a third if you want to match the theoretical SD of 23, and by around half if you want to match the OAA SD of 19. The correlations to BABIP suggest that DRS, once regressed, could be as accurate as OAA.