Friday, June 19, 2015

Can fans evaluate fielding better than sabermetric statistics?

Team defenses differ in how well they turn batted balls into outs. How do you measure the various factors that influence the differences? The fielders obviously have a huge role, but do the pitchers and parks also have an influence?

Twelve years ago, in a group discussion, Erik Allen, Arvin Hsu, and Tom Tango broke down the variation in batting average on balls in play (BAbip). Their analysis was published in a summary called "Solving DIPS" (.pdf).

A couple of weeks ago, I independently repeated their analysis -- I had forgotten they had already done it -- and, reassuringly, got roughly the same result. In round numbers, it turns out that:

The SD of team BAbip fielding talent is roughly 30 runs over a season.


There are several competing systems for evaluating which players and teams are best in the field, and by how much. The Fangraphs stats pages list some of those stats, and let you compare.

I looked at those team stats for the 2014 season. Specifically, these three:

1. DRS, from The Fielding Bible -- specifically, the rPM column, runs above average from plays made. (That's the one we want, because it doesn't include outfielder/catcher arms, or double-play ability.)

2. The Fan Scouting Report (FSR), which is based on an annual fan survey run by Tom Tango.

3. Ultimate Zone Rating (UZR), a stat originally developed by Mitchel Lichtman, but which, as I understand it, is now public. I used the column "RngR," which is the range portion (again to leave out arms and other defensive skills).

All three stats are denominated in runs. Here are their team SDs for the 2014 season, rounded:

37 runs -- DRS (rPM column)
23 runs -- Fan Scouting Report (FSR)
29 runs -- UZR (RngR)
30 runs -- team talent

The SD of DRS is much higher than the SD of team talent. Does that mean it's breaching the "speed of light" limit of forecasting, trying to (retrospectively) predict random luck as well as skill?

No, not necessarily. Because DRS isn't actually trying to evaluate talent.  It's trying to evaluate what actually happened on the field. That has a wider distribution than just talent, because there's luck involved.

A team with fielding talent of +30 runs might have actually saved +40 runs last year, just like a player with 30-home-run talent may have actually hit 40.

The thing is, though, that in the second case, we actually KNOW that the player hit 40 homers. For team fielding, we can only ESTIMATE that it saved 40 runs, because we don't have good enough data to know that the extra runs didn't just result from getting easier balls to field.

In defense, the luck of "made more good plays than average" is all mixed up with "had more easier balls to field than average."  The defensive statistics I've seen try their best to figure out which is which, but they can't, at least not very well.

What they do, basically, is classify every ball in play according to how difficult it was, based on location and trajectory. I found this post from 2003, which shows some of the classifications for UZR. For instance, a "hard" ground ball to the "56" zone (a specific portion of the field between third and short) gets turned into an out 43.5 percent of the time, and becomes a hit the other 56.5 percent. 

If it turns out a team had 100 of those balls to field, and converted them to outs at 45 percent instead of 43.5 percent, that's 1.5 extra outs it gets credited for, which is maybe 1.2 runs saved.
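Here's that arithmetic as a minimal Python sketch. The 0.8 runs-per-out conversion is an assumption, backed out of the "1.5 outs, maybe 1.2 runs" figure above, and the function name is purely illustrative; none of this is taken from the actual UZR or DRS code.

```python
def runs_credited(balls, actual_out_rate, baseline_out_rate, runs_per_out=0.8):
    """Runs a zone-based system would credit for beating the zone baseline.

    runs_per_out=0.8 is an assumed conversion, implied by "1.5 outs ~ 1.2 runs".
    """
    extra_outs = balls * (actual_out_rate - baseline_out_rate)
    return extra_outs, extra_outs * runs_per_out


# 100 hard grounders to zone 56, converted at 45% against a 43.5% baseline.
extra_outs, runs = runs_credited(100, 0.45, 0.435)
print(f"extra outs: {extra_outs:.1f}, runs saved: {runs:.1f}")
```

Everything in the result hangs on the baseline rate, which is exactly the number in question.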

The problem with that is: the 43.5 percent is a very imprecise estimate of what the baseline should be. Because, even in the "hard-hit balls to zone 56" category, the opportunities aren't all the same. 

Some of them are hit close to the fielder, and those might be turned into outs 95 percent of the time, even for an average or bad-fielding team. Some are hit with a trajectory and location that makes them only 8 percent. And, of course, each individual case depends on where the fielders are positioned, so the identical ball could be 80 percent in one case and 10 percent in another.

In a "Baseball Guts" thread at Tango's site, data from Sky Andrecheck and BIS suggested that only 20 percent of ground balls, and 10 percent of fly balls, are "in doubt", in the sense that if you were watching the game, you'd think it could have gone either way. In other words, at least 80% of balls in play are either "easy outs" or "sure hits."  ("In doubt" is my phrase, meaning BIPs in which it wasn't immediately at least 90 percent obvious to the observer whether it would be a hit or an out.)

That means that almost all the differences in talent and performance manifest themselves in just 10 to 20 percent of balls in play.

But, even the best fielding systems have few zones that are less than 20 percent or more than 80 percent. That means that there is still huge variation in difficulty *even accounting for zone*. 

So, when a team makes 40 extra plays over a season, it's a combination of:

(a) those 40 plays came from extra performance from the few "in doubt" balls;
(b) those 40 plays came from easier balls overall.

I think (b) is much more a factor than (a), and that you have to regress the +40 to the mean quite a bit to get a true estimate. 
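To get a feel for how big (b) could be, here's a toy calculation in Python. It assumes, purely for illustration, that the 43.5 percent zone is a mix of just two kinds of ball: the 95 percent "easy" ones and the 8 percent "hard" ones mentioned earlier. Real zones are obviously messier, so treat this as a sketch of the logic rather than an estimate.

```python
def split_zone_variance(zone_rate, easy_rate, hard_rate):
    """Split per-ball outcome variance in a zone into two parts.

    Assumes (hypothetically) the zone is a mix of just two ball types,
    "easy" and "hard", with the given out rates for an average fielder.
    """
    # Share of easy balls needed for the mix to average out to zone_rate.
    easy_share = (zone_rate - hard_rate) / (easy_rate - hard_rate)
    total = zone_rate * (1 - zone_rate)
    # Variance left over once you know which type of ball it was.
    within = (easy_share * easy_rate * (1 - easy_rate)
              + (1 - easy_share) * hard_rate * (1 - hard_rate))
    # Variance that comes purely from which type of ball was hit.
    between = total - within
    return easy_share, within / total, between / total


easy_share, within_frac, between_frac = split_zone_variance(0.435, 0.95, 0.08)
print(f"easy share of the zone: {easy_share:.2f}")
print(f"variance from the play itself: {within_frac:.0%}")
print(f"variance from ball difficulty: {between_frac:.0%}")
```

Under those made-up inputs, roughly three-quarters of the ball-to-ball variance within the zone comes from which kind of ball was hit, not from anything the fielder did.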

Maybe when the zones get good enough to show large differences between teams -- like, say, 20% for a bad fielder and 80% for a good fielder -- well, at that point, you have a system that might work. But, without that, doesn't it almost have to be the case that most of the difference is just from what kinds of balls you get?

Tango made a very relevant point, indirectly, in a recent post. He asked, "Is it possible that Manny Ramirez never made an above-average play in the outfield?"  The consensus answer, which sounds right to me, was ... it would be very rare to see Manny make a play that an average outfielder wouldn't have made. (Leaving positioning out of the argument for now.)

Suppose BIPs to a certain difficult zone get caught 30% of the time by an average fielder, and Manny catches them 20% of the time. Since ANY outfielder would catch a ball that Manny gets to ... well, that zone must really be at least TWO zones: a "very easy" zone with a 100% catch rate, and a "harder" zone with roughly a 10% catch rate for an average fielder, and a 0% catch rate for Manny.

In other words, if Manny makes 30% of the plays in that zone and a Gold Glove outfielder makes 25%, it's almost certain that Manny just got easier balls to catch. 
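The two-zone split in the Manny example can be worked out explicitly. This Python sketch uses only the assumptions of the thought experiment (everyone catches every "easy" ball, Manny catches none of the "harder" ones); the function name is hypothetical.

```python
def split_into_subzones(avg_rate, weak_rate):
    """Split one zone into an 'easy' and a 'harder' sub-zone.

    Assumptions (from the thought experiment, not from real data):
    every fielder catches 100% of easy balls, and the weak fielder
    catches 0% of the harder ones.
    """
    easy_share = weak_rate  # the weak fielder only ever catches easy balls
    hard_share = 1 - easy_share
    # Whatever the average fielder catches beyond the easy balls must
    # come from the harder sub-zone.
    avg_hard_rate = (avg_rate - easy_share) / hard_share
    return easy_share, avg_hard_rate


easy_share, avg_hard_rate = split_into_subzones(avg_rate=0.30, weak_rate=0.20)
print(f"easy balls: {easy_share:.0%} of the zone")
print(f"average fielder on harder balls: {avg_hard_rate:.1%}")
```

With a 30% average rate and a 20% rate for Manny, the harder sub-zone comes out near 12.5 percent for an average fielder, the same ballpark as the rough figure above.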

The only way to eliminate that kind of luck is to classify the zones in enough micro detail that you get close to 0% for the worst, or close to 100% for the best.

And that's not what's happening. Which means, there's no way to tell how many runs a defense saved.


And this brings us back to the point I made last month, about figuring out how to split observed runs allowed into observed pitching and observed fielding. There's really no way to do it, because you can't tell a good fielding play from an average one with the numbers currently available. 

Which means: the DRS and UZR numbers in the Fangraphs tables are actually just estimates -- not estimates of talent, but estimates of *what happened in the field*. 

There's nothing wrong with that, in principle: but, I don't think it's generally realized that that's what those are, just estimates. They wind up in the same statistical summaries as pitching and hitting metrics, which themselves are reliable observations. 

At baseball-reference, for instance, you can see, on the hitting page, that Robinson Cano hit .302-28-118 (fact), which was worth 31 runs above average (close enough to be called fact).

On his fielding page, you can see that Cano had 323 putouts (fact) and 444 assists (fact), which, by Total Zone Rating, was worth 4 runs below average (uh-oh).

Unlike the other columns, the Total Zone column is an *estimate*. Maybe it really was -4 runs, but it could easily have been -10 runs, or -20 runs, or +6 runs. 

To the naked eye, the hitting and fielding numbers both look equally official and reliable, as accurate observations of what happened. But one is based on an observation of what happened, and the other is based on an estimate of what happened.


OK, that's a bit of an exaggeration, so let me backtrack and explain what I mean.

Cano had 28 home runs, and 444 assists. Those are "facts", in the sense that the error is zero, if the observations are recorded correctly.

Cano's offense was 31 runs above average. I'm saying that's accurate enough to be called a "fact."  But admittedly, it is, in fact, an estimate. Even if the Linear Weights formula (or whatever) is perfectly accurate, the "runs above average" number is after adjusting for park effects (which are imperfect estimates, albeit pretty good ones). Also, the +31 assumes Cano faced league-average pitching. That, again, is an estimate, but, again, it's a pretty strong one.

For defense, comparatively, the Total Zone rating of "-4" is a very, very, weak estimate. It carries an implicit assumption that Cano's "relative difficulty of balls in play" was zero. That's much less reliable than the estimate that his "relative difficulty of pitchers faced" was zero. If you wanted, you could do the math, and show how much weaker the one estimate is than the other; the difference is huge.

But, here's a thought experiment to make it clear. Suppose Cano faces the worst pitcher in the league, and hits a home run. In that case, he's at worst 1.3 runs above average for that plate appearance, instead of our estimate of 1.4. It's a real difference in how we evaluate his performance, but a small one.

On the other hand, suppose Cano faces a grounder in a 50% zone, but one of the easy ones, that almost any fielder would get to. Then, he's maybe +0.01 hits above average, but we're estimating +0.5. That is a HUGE difference. 

It's also completely at odds with our observation of what happens on the field. After an easy ground ball, even the most casual fan would say he observed Cano saving his team 0 runs over what another player would do. But we write it down as +0.4 runs, which is ... well, it's so big, you have to call it *wrong*. We are not accurately recording what happened on the field.
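The size of that mismatch is easy to put in numbers. In this Python sketch, the 99 percent out rate for the easy grounder is an assumption chosen to match the "+0.01 hits" above, and the 0.8 runs-per-hit conversion is an assumption consistent with "+0.5 hits" being written down as about "+0.4 runs."

```python
def credited_hits(zone_out_rate, made_play=True):
    """Hits-above-average a zone system credits for one ball in play."""
    return (1 - zone_out_rate) if made_play else -zone_out_rate


RUNS_PER_HIT = 0.8  # assumed conversion: +0.5 hits ~ +0.4 runs

zone_credit = credited_hits(0.50)  # what the 50% zone baseline says
true_credit = credited_hits(0.99)  # assumed: this particular ball was a 99% out
overstatement_runs = (zone_credit - true_credit) * RUNS_PER_HIT

print(f"zone credit: {zone_credit:+.2f} hits, actual: {true_credit:+.2f} hits")
print(f"overstated by about {overstatement_runs:.2f} runs")
```

Compare that with the home run case, where the gap between the estimate and the truth was about a tenth of a run.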

So, if you take "what happened on the field" in broad, intuitive terms, the home run matches: "he did a good thing on the field and created over a run" both to the observer and the statistic. But for the ground ball, the statistic lies. It says Cano "did a good thing on the field and saved almost half a run," but the observer says Cano "made a routine play." 

The batting statistics match what a human would say happened. The fielding stats do not.


How much random error is in those fielding statistics? When UZR gives an SD of 29 runs, how much of that is luck, and how much is talent? If we knew, we could at least regress to the mean. But we don't. 

That's because we don't know the idealized actual SD of observed performance, adjusted for the difficulty of the balls in play. It must be somewhere between 47 runs (the SD of observed performance without adjusting for difficulty), and 30 runs (the SD of talent). But where in between?
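To see why the answer matters, here's a Python sketch of what the regression would look like at a few candidate values. It assumes talent and luck are independent, so that their variances add; the 40 and 35 are hypothetical in-between values, not estimates of anything.

```python
import math

TALENT_SD = 30.0  # runs, the team talent SD quoted earlier


def luck_sd(observed_sd, talent_sd=TALENT_SD):
    """Luck SD implied if talent and luck are independent (an assumption)."""
    return math.sqrt(observed_sd ** 2 - talent_sd ** 2)


def regression_fraction(observed_sd, talent_sd=TALENT_SD):
    """Share of an observed deviation to keep when estimating talent."""
    return (talent_sd / observed_sd) ** 2


# 47 is the unadjusted SD from the text; 40 and 35 are hypothetical.
for sd in (47.0, 40.0, 35.0):
    print(f"observed SD {sd:.0f}: luck SD {luck_sd(sd):.1f}, "
          f"keep {regression_fraction(sd):.0%} of the observed figure")
```

At an observed SD of 47 you'd keep only about 41 percent of a team's observed number, and the fraction climbs toward 100 percent as the observed SD approaches 30. That's a wide range, which is why not knowing where we sit is a real problem.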

In addition: how sure are we that the estimates are even unbiased, in the sense that they're independently just as likely to be too high as too low? If they're truly unbiased, that makes them much easier to live with -- at the very least, you know they'll get more accurate as you average over multiple seasons. But if they inappropriately adjust for park effects, or pitcher talent, you might find some teams being consistently overestimated or underestimated. And that could really screw up your evaluations, especially if you're using those fielding estimates to rejig pitching numbers. 


For now, the estimates I like best are the ones from Tango's "Fan Scouting Report" (FSR). As I understand it, those are actually estimates of talent, rather than estimates of what happened on the field. 

Team FSR has an SD of 23 runs. That's very reasonable. It's even more conservative than it looks. That 23 includes all the "other than range" stuff -- throwing arm, double plays, and so on. So the range portion of FSR is probably a bit lower than 23.

We know the true SD of talent is closer to 30, but there's no way for subjective judgments to be that precise. For one thing, the humans that respond to Tango's survey aren't perfect evaluators of what they see on the field. Second, even if they *were* perfect, a portion of what they're observing is random luck anyway. You have to temper your conclusions for the amount of noise that must be there. 

It might be a little bit apples-to-oranges to compare FSR to the other estimates, because FSR has much more information to work with. The survey respondents don't just use the ball-in-play stats for a single year -- they consider the individual players' entire careers, ages and trajectories; the opinions of their peers and the press; their personal understanding of how fielding works; and anything else they deem relevant.

But, that's OK. If your goal is to try to estimate the influence of team fielding, you might as well just use the best estimate you've got. 

For my part, I think FSR is the one I trust the most. When it comes to evaluating fielding, I think sabermetrics is still way behind the best subjective evaluations.



At Wednesday, August 19, 2015 2:52:00 AM, Anonymous said...

I'm late to the party, but I have a few relevant comments: One, the "weak estimate" of these fielding metrics actually becomes quite a bit stronger with more data.

Two, even though hitting and most pitching metrics record exactly what happened, of what use is that? Most of the time we are using these metrics to either estimate true talent or to project future performance, which is essentially the same thing of course. Given that, even though a single is a single, we care very much about the quality of the single, not just the fact that it was a single. In fact, we could argue that a MUCH better hitting metric is one that ignores the actual result of a batted ball and uses only the type, location and trajectory of that batted ball. In fact, there are some hitting and pitching metrics that do this. If we do that, and it is arguably better than using actual results, especially for small samples, now we are in exactly the same boat as the fielding metrics! So you can't in one breath criticize the fielding metrics because they are only a weak estimate of "what happened" and then in the same breath say that a hitting metric that estimates "what happened" (based on the characteristics of the batted ball) is better than one in which the actual result of the batted ball is used, which in fact it is!

Basically what I am saying is that the fact that we can only estimate "what happened" in these advanced fielding metrics, at least at the player level, is both a virtue and a curse. But, as can be proved by looking at it from a hitting perspective, the virtue far outweighs the curse, otherwise we would simply use something like simple ZR or even range factor, which does in fact essentially tell us exactly what happened (a ball was hit near a fielder and he did or did not turn it into an out).

If you wanted to estimate true talent of a hitter, and I told you two things: One, the batter got a single, or two, the batter hit a hard line drive just over the IF (or a medium hit ground ball to the IF), which one would you prefer? It has to be the second one, and that proves my point that estimating what actually happened on the field is MUCH better than what actually happened, as long as your estimate of what happened has a decent (not perfect and not necessarily unbiased) amount of information in it.

At Wednesday, August 19, 2015 3:05:00 AM, Anonymous said...

All that being said, the problem with the current fielding metrics, including my own UZR, is that they do not do the proper regression toward the mean in order to better estimate "what happened" even in large samples. If they were to do that, then those fielding metrics would be very, very good, even though you call them a "weak estimate" of what happened. However, this problem is a systematic bias, which means that it is not a huge problem other than the fact that a highly rated fielder likely did not field as well as these "unregressed" metrics tell us he did, and ditto for a poorly rated fielder. However, since we have to regress these sample data anyway when we estimate true talent and for making projections, we take care of this problem anyway.

For example, say that UZR says that Jeter was a -10 per 150 games in 2012. He probably only fielded at a -8 clip or something like that. If we wanted to estimate his true talent from that one year only we might regress the -8 to -4 or something like that. OK, since the metric is making a mistake and reporting that -10 is what Jeter actually did in 2012, all we have to do is regress a little more in order to get that -10 down to -4 rather than the more accurate -8. And when we do the math to come up with the regressions in order to estimate true talent, we probably come up with the right regression equation, which does in fact take the -10 down to the -4.

So even though the reported "runs saved or cost" is too extreme, we get to the correct place anyway when we estimate true talent.

One interesting thing is that the same bias actually happens with offensive (and pitching) stats. Anytime we report that, say, a player batted .320 in 2012, even though that is a "fact," it is guaranteed that on average, he will have gotten more than his fair share of "cheap hits." So to some extent, even though he did in fact hit .320, we sort of want to regress that number to report what actually happened, if we include in the definition of what actually happened, the quality of those hits and whether they were overall lucky or not.

I for one don't ever care what "actually happened anyway" so it doesn't matter to me whether a -10 UZR was really a -8. It also doesn't matter to me whether a player actually hit .320 if he got more than his fair share of cheap hits by luck alone. All I care about is estimates of true talent. I can estimate that from the -10 just as well as I can get that from the -8. I also can get a player's estimated true talent from the .320 AND I can do even better if I know the quality of the hits and outs within that .320. In fact, as I said earlier, I do even better with my true talent BA estimate if I know a lot about those batted balls and I don't even know that the player in question batted .320!

At Thursday, August 20, 2015 10:18:00 AM, Blogger Phil Birnbaum said...

MGL (assuming these are MGL),

I generally agree with what you say here. However, we (and you) still do talk about a player's "traditional" statistics (which I refer to as "what happened on the field"). In other words, we DO say the player hit .320, even when we have trajectory statistics that are better for our purposes.

And there's a legitimate use for "what happened" statistics. Why did the Pirates win? Because "this is what happened on the field" in Bill Mazeroski's last PA. For those discussions, it doesn't matter what "true talent" was.

My point is that the .320 for batting is much, much more accurate for "what happened" than the "-10" for fielding, if you want to use those respective numbers to explain why teams won or lost.

