Can fans evaluate fielding better than sabermetric statistics?
Team defenses differ in how well they turn batted balls into outs. How do you measure the various factors that influence the differences? The fielders obviously have a huge role, but do the pitchers and parks also have an influence?
Twelve years ago, in a group discussion, Erik Allen, Arvin Hsu, and Tom Tango broke down the variation in batting average on balls in play (BAbip). Their analysis was published in a summary called "Solving DIPS" (.pdf).
A couple of weeks ago, I independently repeated their analysis -- I had forgotten they had already done it -- and, reassuringly, got roughly the same result. In round numbers, it turns out that:
The SD of team BAbip fielding talent is roughly 30 runs over a season.
------
There are several competing systems for evaluating which players and teams are best in the field, and by how much. The Fangraphs stats pages list some of those stats, and let you compare.
I looked at those team stats for the 2014 season. Specifically, these three:
1. DRS, from The Fielding Bible -- specifically, the rPM column, runs above average from plays made. (That's the one we want, because it doesn't include outfielder/catcher arms, or double-play ability.)
2. The Fan Scouting Report (FSR), which is based on an annual fan survey run by Tom Tango.
3. Ultimate Zone Rating (UZR), a stat originally developed by Mitchel Lichtman, but which, as I understand it, is now public. I used the column "RngR," which is the range portion (again to leave out arms and other defensive skills).
All three stats are denominated in runs. Here are their team SDs for the 2014 season, rounded:
37 runs -- DRS (rPM column)
23 runs -- Fan Scouting Report (FSR)
29 runs -- UZR (RngR)
------------------------------------
30 runs -- team talent
The SD of DRS is much higher than the SD of team talent. Does that mean it's breaching the "speed of light" limit of forecasting, trying to (retrospectively) predict random luck as well as skill?
No, not necessarily, because DRS isn't actually trying to evaluate talent. It's trying to evaluate what actually happened on the field. That has a wider distribution than just talent, because there's luck involved.
A team with fielding talent of +30 runs might have actually saved +40 runs last year, just like a player with 30-home-run talent may have actually hit 40.
The thing is, though, that in the second case, we actually KNOW that the player hit 40 homers. For team fielding, we can only ESTIMATE that it saved 40 runs, because we don't have good enough data to know that the extra runs didn't just result from getting easier balls to field.
In defense, the luck of "made more good plays than average" is all mixed up with "had more easy balls to field than average." The defensive statistics I've seen try their best to figure out which is which, but they can't, at least not very well.
What they do, basically, is classify every ball in play according to how difficult it was, based on location and trajectory. I found this post from 2003, which shows some of the classifications for UZR. For instance, a "hard" ground ball to the "56" zone (a specific portion of the field between third and short) gets turned into an out 43.5 percent of the time, and becomes a hit the other 56.5 percent. 
If it turns out a team had 100 of those balls to field, and converted them to outs at 45 percent instead of 43.5 percent, that's 1.5 extra outs it gets credited for, which is maybe 1.2 runs saved.
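To make that arithmetic concrete, here is a minimal sketch of the per-zone credit. The function name and the 0.8 runs-per-out conversion are my assumptions (the 0.8 is just what "1.5 outs is maybe 1.2 runs" implies); this is not the actual UZR or DRS code.

```python
def zone_credit(balls, team_out_rate, baseline_out_rate, runs_per_out=0.8):
    """Credit a team for one zone/trajectory bucket.

    runs_per_out=0.8 is an assumed conversion, inferred from
    "1.5 extra outs ... maybe 1.2 runs saved" in the text.
    """
    # Extra outs relative to what the league-wide baseline would predict.
    extra_outs = balls * (team_out_rate - baseline_out_rate)
    # Convert outs to runs with the assumed flat value per out.
    runs_saved = extra_outs * runs_per_out
    return extra_outs, runs_saved


# 100 hard grounders to zone 56, converted at 45% against a 43.5% baseline.
extra_outs, runs_saved = zone_credit(100, 0.45, 0.435)
print(f"{extra_outs:.1f} extra outs, about {runs_saved:.1f} runs saved")
```

Everything in the calculation hinges on `baseline_out_rate` being the right expectation for the balls this particular team actually saw.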
The problem with that is: the 43.5 percent is a very imprecise estimate of what the baseline should be. Because, even in the "hard-hit balls to zone 56" category, the opportunities aren't all the same. 
Some of them are hit close to the fielder, and those might be turned into outs 95 percent of the time, even for an average or bad-fielding team. Some are hit with a trajectory and location that makes them only 8 percent. And, of course, each individual case depends on where the fielders are positioned, so the identical ball could be 80 percent in one case and 10 percent in another.
In a "Baseball Guts" thread at Tango's site, data from Sky Andrecheck and BIS suggested that only 20 percent of ground balls, and 10 percent of fly balls, are "in doubt", in the sense that if you were watching the game, you'd think it could have gone either way. In other words, at least 80% of balls in play are either "easy outs" or "sure hits."  ("In doubt" is my phrase, meaning BIPs in which it wasn't immediately at least 90 percent obvious to the observer whether it would be a hit or an out.)
That means that almost all the differences in talent and performance manifest themselves in just 10 to 20 percent of balls in play.
But, even the best fielding systems have few zones that are less than 20 percent or more than 80 percent. That means that there is still huge variation in difficulty *even accounting for zone*. 
So, when a team makes 40 extra plays over a season, it's a combination of:
(a) those 40 plays came from extra performance from the few "in doubt" balls;
(b) those 40 plays came from easier balls overall.
I think (b) is much more a factor than (a), and that you have to regress the +40 to the mean quite a bit to get a true estimate. 
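Here is a rough way to see why (b) can dominate. It's a toy model, not anything the fielding systems actually do: assume every ball in the 43.5 percent zone is really either a 95 percent ball or an 8 percent ball (the two figures from above), and that the fielders are exactly average. Then the random variation in plays made splits into "which kind of ball did you get" and "did you convert the ball you got."

```python
def variance_split(easy_rate=0.95, hard_rate=0.08, zone_rate=0.435):
    """Split per-ball outcome variance into ball-mix luck vs. conversion luck.

    Toy assumptions (mine, for illustration): only two kinds of ball
    exist in the zone, and the fielder is exactly league average.
    """
    # Share of easy balls needed for the blended out rate to equal zone_rate.
    easy_share = (zone_rate - hard_rate) / (easy_rate - hard_rate)
    hard_share = 1.0 - easy_share

    # Law of total variance for a single ball in play:
    #   total = E[p(1-p)]  (conversion luck, given the ball's difficulty)
    #         + Var(p)     (luck in which difficulty you happened to draw)
    conversion_var = (easy_share * easy_rate * (1 - easy_rate)
                      + hard_share * hard_rate * (1 - hard_rate))
    mix_var = easy_share * hard_share * (easy_rate - hard_rate) ** 2
    total_var = conversion_var + mix_var
    return easy_share, mix_var / total_var, total_var


easy_share, mix_fraction, total_var = variance_split()
print(f"easy balls: {easy_share:.1%} of the zone")
print(f"share of luck that is ball mix: {mix_fraction:.1%}")
```

Under those assumptions, roughly three-quarters of the random variation is just the mix of balls. That says nothing about the spread of talent; it only shows how much of the noise a zone-level baseline can't see.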
Maybe when the zones get good enough to show large differences between teams -- like, say, 20% for a bad fielder and 80% for a good fielder -- well, at that point, you have a system that might work. But, without that, doesn't it almost have to be the case that most of the difference is just from what kinds of balls you get?
Tango made a very relevant point, indirectly, in a recent post. He asked, "Is it possible that Manny Ramirez never made an above-average play in the outfield?"  The consensus answer, which sounds right to me, was ... it would be very rare to see Manny make a play that an average outfielder wouldn't have made. (Leaving positioning out of the argument for now.)
Suppose BIPs to a certain difficult zone get caught 30% of the time by an average fielder, and Manny catches them 20% of the time. Since ANY outfielder would catch a ball that Manny gets to ... well, that zone must really be at least TWO zones: a "very easy" zone with a 100% catch rate, and a "harder" zone with roughly a 10% catch rate for an average fielder, and a 0% catch rate for Manny.
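The two-zone split can be worked out directly. This is a sketch under the assumption stated above, that the weak fielder catches every "very easy" ball and none of the harder ones; the function name is mine.

```python
def split_zone(avg_rate, weak_rate):
    """Split one zone into a sure-out sub-zone and a harder sub-zone.

    Assumption (from the argument in the text): the weak fielder
    catches 100% of the easy balls and 0% of the harder ones, so his
    overall catch rate equals the easy sub-zone's share of the balls.
    """
    easy_share = weak_rate
    hard_share = 1.0 - easy_share
    # The average fielder also catches every easy ball; whatever is left
    # of his catch rate must come from the harder sub-zone.
    avg_hard_rate = (avg_rate - easy_share) / hard_share
    return easy_share, avg_hard_rate


easy_share, avg_hard_rate = split_zone(avg_rate=0.30, weak_rate=0.20)
print(f"easy sub-zone: {easy_share:.0%} of balls")
print(f"average fielder in harder sub-zone: {avg_hard_rate:.1%}")
```

With a 30% average and 20% for Manny, the harder sub-zone comes out to 12.5% for the average fielder, close to the round 10% figure in the text.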
In other words, if Manny makes 30% of the plays in that zone and a Gold Glove outfielder makes 25%, it's almost certain that Manny just got easier balls to catch. 
The only way to eliminate that kind of luck is to classify the zones in enough micro detail that you get close to 0% for the worst, or close to 100% for the best.
And that's not what's happening. Which means, there's no way to tell how many runs a defense saved.
------
And this brings us back to the point I made last month, about figuring out how to split observed runs allowed into observed pitching and observed fielding. There's really no way to do it, because you can't tell a good fielding play from an average one with the numbers currently available. 
Which means: the DRS and UZR numbers in the Fangraphs tables are actually just estimates -- not estimates of talent, but estimates of *what happened in the field*. 
There's nothing wrong with that, in principle. But I don't think it's generally realized that that's what those are: just estimates. They wind up in the same statistical summaries as pitching and hitting metrics, which themselves are reliable observations. 
At baseball-reference, for instance, you can see, on the hitting page, that Robinson Cano hit .302-28-118 (fact), which was worth 31 runs above average (close enough to be called fact).
On his fielding page, you can see that Cano had 323 putouts (fact) and 444 assists (fact), which, by Total Zone Rating, was worth 4 runs below average (uh-oh).
Unlike the other columns, the Total Zone column is an *estimate*. Maybe it really was -4 runs, but it could easily have been -10 runs, or -20 runs, or +6 runs. 
To the naked eye, the hitting and fielding numbers both look equally official and reliable, as accurate observations of what happened. But one is based on an observation of what happened, and the other is based on an estimate of what happened.
------
OK, that's a bit of an exaggeration, so let me backtrack and explain what I mean.
Cano had 28 home runs, and 444 assists. Those are "facts", in the sense that the error is zero, if the observations are recorded correctly.
Cano's offense was 31 runs above average. I'm saying that's accurate enough to be called a "fact."  But admittedly, it is, in fact, an estimate. Even if the Linear Weights formula (or whatever) is perfectly accurate, the "runs above average" number is after adjusting for park effects (which are imperfect estimates, albeit pretty good ones). Also, the +31 assumes Cano faced league-average pitching. That, again, is an estimate, but, again, it's a pretty strong one.
For defense, comparatively, the Total Zone figure of "-4" is a very, very weak estimate. It carries an implicit assumption that Cano's "relative difficulty of balls in play" was zero. That's much less reliable than the estimate that his "relative difficulty of pitchers faced" was zero. If you wanted, you could do the math, and show how much weaker the one estimate is than the other; the difference is huge.
But, here's a thought experiment to make it clear. Suppose Cano faces the worst pitcher in the league, and hits a home run. In that case, he's at worst 1.3 runs above average for that plate appearance, instead of our estimate of 1.4. It's a real difference in how we evaluate his performance, but a small one.
On the other hand, suppose Cano faces a grounder in a 50% zone, but one of the easy ones, that almost any fielder would get to. Then, he's maybe +0.01 hits above average, but we're estimating +0.5. That is a HUGE difference. 
It's also completely at odds with our observation of what happens on the field. After an easy ground ball, even the most casual fan would say he observed Cano saving his team 0 runs over what another player would do. But we write it down as +0.4 runs, which is ... well, it's so big, you have to call it *wrong*. We are not accurately recording what happened on the field.
So, if you take "what happened on the field" in broad, intuitive terms, the home run matches: "he did a good thing on the field and created over a run" both to the observer and the statistic. But for the ground ball, the statistic lies. It says Cano "did a good thing on the field and saved almost half a run," but the observer says Cano "made a routine play." 
The batting statistics match what a human would say happened. The fielding stats do not.
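The easy-grounder mismatch, in numbers. This is only a sketch: the 0.8 runs per hit saved is my assumption (it's what turns +0.5 hits into the +0.4 runs above), and the 99 percent "true" out probability is just a stand-in for "almost any fielder would get to it."

```python
RUNS_PER_HIT_SAVED = 0.8  # assumption: makes +0.5 hits equal +0.4 runs


def credit_for_out(assumed_out_prob):
    """Hits saved above average when the fielder makes the play.

    An average fielder was expected to record assumed_out_prob outs on
    this ball, so making the play is worth the remainder.
    """
    return 1.0 - assumed_out_prob


zone_credit_hits = credit_for_out(0.50)   # what the 50% zone baseline says
true_credit_hits = credit_for_out(0.99)   # hypothetical: a routine play

zone_credit_runs = zone_credit_hits * RUNS_PER_HIT_SAVED
overstatement_runs = (zone_credit_hits - true_credit_hits) * RUNS_PER_HIT_SAVED

print(f"recorded: +{zone_credit_hits:.2f} hits (+{zone_credit_runs:.2f} runs)")
print(f"observed: +{true_credit_hits:.2f} hits")
print(f"overstated by about {overstatement_runs:.2f} runs on one routine play")
```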
------
How much random error is in those fielding statistics? When UZR gives an SD of 29 runs, how much of that is luck, and how much is talent? If we knew, we could at least regress to the mean. But we don't. 
That's because we don't know the idealized actual SD of observed performance, adjusted for the difficulty of the balls in play. It must be somewhere between 47 runs (the SD of observed performance without adjusting for difficulty), and 30 runs (the SD of talent). But where in between?
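If we did know that SD, the regression would be simple. Here's a sketch, assuming talent and luck are independent so their variances add. Only the 30 and the 47 come from the numbers above; the values in between are hypothetical candidates, and the function name is mine.

```python
import math

TALENT_SD = 30.0  # SD of team fielding talent, in runs, from the text


def regression_view(observed_sd, talent_sd=TALENT_SD):
    """Return (luck_sd, fraction_to_keep) for a candidate observed SD.

    Assumes observed variance = talent variance + luck variance.
    fraction_to_keep is how much of an observed deviation from average
    survives after regressing to the mean.
    """
    talent_var = talent_sd ** 2
    observed_var = observed_sd ** 2
    if observed_var < talent_var:
        raise ValueError("observed SD cannot be below the talent SD")
    luck_sd = math.sqrt(observed_var - talent_var)
    fraction_to_keep = talent_var / observed_var
    return luck_sd, fraction_to_keep


# 30 and 47 bracket the range; 35 and 40 are hypothetical in-between cases.
for sd in (30, 35, 40, 47):
    luck_sd, keep = regression_view(sd)
    print(f"observed SD {sd}: luck SD {luck_sd:.1f}, keep {keep:.0%} of the deviation")
```

The point is how wide the range is: at one end you keep the whole observed number, at the other you keep well under half of it.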
In addition: how sure are we that the estimates are even unbiased, in the sense that they're independently just as likely to be too high as too low? If they're truly unbiased, that makes them much easier to live with -- at the very least, you know they'll get more accurate as you average over multiple seasons. But if they inappropriately adjust for park effects, or pitcher talent, you might find some teams being consistently overestimated or underestimated. And that could really screw up your evaluations, especially if you're using those fielding estimates to rejig pitching numbers. 
-------
For now, the estimates I like best are the ones from Tango's "Fan Scouting Report" (FSR). As I understand it, those are actually estimates of talent, rather than estimates of what happened on the field. 
Team FSR has an SD of 23 runs. That's very reasonable. It's even more conservative than it looks. That 23 includes all the "other than range" stuff -- throwing arm, double plays, and so on. So the range portion of FSR is probably a bit lower than 23.
We know the true SD of talent is closer to 30, but there's no way for subjective judgments to be that precise. For one thing, the humans that respond to Tango's survey aren't perfect evaluators of what they see on the field. Second, even if they *were* perfect, a portion of what they're observing is random luck anyway. You have to temper your conclusions for the amount of noise that must be there. 
It might be a little bit apples-to-oranges to compare FSR to the other estimates, because FSR has much more information to work with. The survey respondents don't just use the ball-in-play stats for a single year -- they consider the individual players' entire careers, ages and trajectories; the opinions of their peers and the press; their personal understanding of how fielding works; and anything else they deem relevant.
But, that's OK. If your goal is to try to estimate the influence of team fielding, you might as well just use the best estimate you've got. 
For my part, I think FSR is the one I trust the most. When it comes to evaluating fielding, I think sabermetrics is still way behind the best subjective evaluations.


