Do defensive statistics overrate the importance of fielding?
The job of a baseball team's pitchers and fielders is to prevent the opposition from scoring. It's easy to see how well they succeeded, collectively -- just count the number of runs that crossed the plate.
But how can you tell how many of those runs were because of the actions of the pitchers, and how many were because of the actions of the fielders?
I think it's very difficult, and that existing attempts do more harm than good.
The breakdowns I've been looking at lately are the ones from Baseball Reference, on their WAR pages -- pitching Wins Above Replacement (pWAR) and defensive Wins Above Replacement (dWAR).
They start out with team Runs Allowed. Then, they try to figure out how good the defense was, behind the pitchers, by using play-by-play data. If they conclude that the defense cost the team, say, 20 runs, they bump the pitchers by 20 runs to balance that out.
So, if the team gave up 100 runs more than average, the pitchers might come in at -80, and the fielders at -20.
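The bookkeeping in that example is a plain subtraction. Here's a minimal sketch of it; the function name and the sign convention (runs relative to average, negative meaning worse) are my own assumptions for illustration, not Baseball Reference's actual code:

```python
def split_runs(team_runs_vs_avg, fielding_runs_vs_avg):
    """Split a team's runs-prevented total into pitching and fielding.

    Both inputs are runs relative to league average, negative meaning worse.
    Whatever the fielding estimate does not claim is left to the pitchers.
    """
    pitching_runs_vs_avg = team_runs_vs_avg - fielding_runs_vs_avg
    return pitching_runs_vs_avg, fielding_runs_vs_avg


# Team allowed 100 more runs than average; fielding is estimated at -20.
print(split_runs(-100, -20))  # (-80, -20)
```

Because the pitching figure is a residual, any error in the fielding estimate lands in the pitching column with the opposite sign.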
I'm going to argue that method doesn't work right. The results are unreliable, inaccurate, and hard to interpret.
Before I start, five quick notes:
1. My argument is strongest when fielding numbers are based on basic play-by-play data, like the kind from Retrosheet. When using better data, like the Fielding Bible numbers (which use subjective observations and batted ball timing measurements), the method becomes somewhat better. (But, I think, still not good enough.) Baseball Reference uses the better method for 2003 and later.
2. The WAR method tries to estimate the value of what actually took place, not the skill level of the players who made it happen. That is: it's counting performance, not measuring talent.
3. Defensive WAR (dWAR) correlates highly with opposition batting average on balls in play (BAbip). For team-seasons from 1982 to 2009, the correlation was +0.60. For 2003 to 2009, with the improved defense data, the correlation was still +0.56.
dWAR actually includes much more than just BAbip. There are adjustments for outfielder arms, double plays, hit types, and so on. But, to keep things simple, I'm going to argue as if dWAR and BAbip are measuring the same thing. The argument wouldn't change much if I kept adding disclaimers for the other stuff.
4. I'll be talking about *team* defense and pitching. The calculation for individuals has other issues that I'm not going to deal with here.
5. I don't mean to pick on B-R specifically ... I think there are other systems that do things the same way. I just happened to run across this one most recently. Also, even though the example is in the context of WAR, the criticism isn't about WAR at all; it's only about splitting observed performance between pitching and fielding.
The problem here is a specific case of a more general issue: it's easy to see what happens on the field, but often very difficult to figure out how to allocate it to the various players involved.
That's why hockey and football and basketball and soccer are so much harder to figure out than baseball. In basketball, when there's a rebound, how do you figure out who "caused" it? It could be the rebounder being skillful, or the other players drawing the defenders away, or even the coach's strategy.
But in baseball, when a single is hit, you pretty much know you can assign all the offensive responsibility to the batter. (The baserunners might have some effect, but it's small.) And even though you have to assign the defensive responsibility to the nine fielders collectively, the pitcher is so dominant that, for years, we've chosen to almost ignore the other eight players completely.
But now that we *don't* want to ignore them ... well, how do you figure out which players "caused" a run to be prevented? Now it gets hard.
Even for a single ball in play, it's hard.
The pitcher gives up a ground ball to the shortstop, who throws the batter out. We all agree that we observed an out, and we credit the out we observed to the pitcher and fielders collectively (under "opposition batting").
But how do we allocate it separately rather than collectively? Do we credit the pitcher? The fielders? Both, in some proportion?
It probably depends on the specifics of the play.
With two outs in the bottom of the ninth, when a third baseman dives to catch a screaming line drive over the bag, we say his defense saved the game. When it's a soft liner hit right to him, we say the pitcher saved the game. Our observation of what happened on the field actually depends on our perception of where the credit lies.
We might want to credit the fielder in proportion to the difficulty of the play. If it's an easy grounder, one that would be a hit only 5% of the time, we might credit the defense with 5% of the value of the out, with the rest to the pitcher. If it's a difficult play, we might go 80% instead of 5%.
That sounds reasonable. And it's actually what dWAR does; it apportions the credit based on how often it thinks an average defense would have turned that batted ball into an out.
The problem is: how do you estimate that probability? For older seasons, where you have only Retrosheet data, you can't. At best, from play-by-play data, you can figure out whether it's a ground ball, fly ball, or liner. If you don't have that information, you have to just assume that every ball in play is league-average, around 30 percent chance of being a hit and 70 percent chance of being an out.
Here is the problem, which I think is not immediately obvious: when you assume certain balls in play are the same, you wind up, mathematically, giving 100 percent of the allocation of credit to the fielders. Even if you go through the arithmetic of calculating everything to the average of 70/30, and assigning 30 percent of every hit to the pitcher, and so forth ... if you do all that, you'll still find, at the end of the calculation, that if the fielders gave up 29 percent hits instead of 30 percent, the fielders get ALL the credit for the one "missing" hit per hundred balls in play.
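To see the arithmetic, here's a small sketch of a plays-above-average style allocation. This is a simplified model I'm assuming for illustration, not the actual dWAR calculation: on each ball in play the pitcher is credited for how much easier than league average the ball is estimated to be, and the fielders are credited for the gap between that estimate and the outcome. The function name and the 30 percent league figure are placeholders.

```python
def allocate_hits_saved(balls, league_hit_prob=0.30):
    """Return (pitcher_hits_saved, fielder_hits_saved).

    balls: list of (estimated_hit_prob, was_hit) pairs, one per ball in play.
    Pitcher: credited for allowing balls estimated to be easier than average.
    Fielders: credited for the gap between the estimate and the outcome.
    """
    pitcher = 0.0
    fielders = 0.0
    for est_hit_prob, was_hit in balls:
        pitcher += league_hit_prob - est_hit_prob
        fielders += est_hit_prob - (1.0 if was_hit else 0.0)
    return pitcher, fielders


# 100 balls in play, all assumed league-average, of which 29 became hits.
balls = [(0.30, True)] * 29 + [(0.30, False)] * 71
pitcher, fielders = allocate_hits_saved(balls)
print(round(pitcher, 6), round(fielders, 6))  # 0.0 1.0
```

When every ball carries the same estimate, the pitcher term is zero on every play, so the whole missing hit shows up in the fielders' total no matter who was really responsible for it.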
Of course, the actual Baseball Reference numbers don't treat every BIP the same ... they do have the hit type, and, for recent seasons, they have trajectory data to classify BIPs better. But, it's not perfect. There are still discrepancies, all of which wind up in the fielding column. [UPDATE: I think even the best data currently available barely puts a dent in the problem.]
For Retrosheet data, and making up numbers: suppose a fly ball has a 26 percent chance of being a hit, but a ground ball has a 33 percent chance. The pitchers will get the credit for what types of balls they gave up, but the fielders will get 100% of the credit after that. So, if the pitcher gives up fewer ground balls, he gets the credit. But if he gives up *easier* ground balls, the fielders get the credit instead.
This is the key point. Everything that's not known about the ball in play, everything that's random, or anything that has to be averaged or guessed or estimated or ignored -- winds up entirely in the fielders' column.
Now, maybe someone could argue that's actually what we want, that all the uncertainty goes to the fielders. Because, it's been proven, pitchers don't differ much in control over balls in play.
But that argument fails. "Pitchers don't differ much in BAbip" is a statement about *talent*, not about *observations*. In actual results, pitchers DO differ in BAbip, substantially, because of luck. Take two identical pitcher APBA cards, and play out seasons, and you'll find big differences.
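You don't need the APBA cards to try this. Here's a rough sketch that treats every ball in play as an independent draw with the same hit probability, which is a simplifying assumption of mine, and plays out several seasons for one fixed "pitcher." The .300 true BAbip and the 600 balls in play per season are made-up inputs.

```python
import math
import random


def babip_luck_sd(true_babip, balls_in_play):
    """Binomial standard deviation of observed BAbip from luck alone."""
    return math.sqrt(true_babip * (1.0 - true_babip) / balls_in_play)


def simulate_babip(true_babip, balls_in_play, rng):
    """Observed BAbip for one season of independent balls in play."""
    hits = sum(rng.random() < true_babip for _ in range(balls_in_play))
    return hits / balls_in_play


rng = random.Random(2024)  # fixed seed so the run is repeatable
seasons = [simulate_babip(0.300, 600, rng) for _ in range(10)]
print([round(s, 3) for s in seasons])
print(round(babip_luck_sd(0.300, 600), 4))  # 0.0187
```

Under those assumptions, luck alone gives a standard deviation of almost 19 points of BAbip between seasons of the very same pitcher.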
Observations are a combination of talent and luck. If you want to divide the observed balls in play into observed pitching and observed fielding, you're also going to have to divide the luck properly. Not, zero to the pitcher and 100 percent to the fielders.
Traditionally, with observations at the team level, our observations were close to 100% perfect, in how they reflected what happened on the field. But when you take those observations and allocate them, it's no longer 100% perfect. You're estimating, rather than observing.
For the first time, we are "guessing" about how to allocate the observation.
Watching a specific game, in 1973 or whenever, it would have been obvious that a certain run was prevented by the pitcher inducing three weak ground balls to second base. Now, we can't tell that from Retrosheet data, and so we (mistakenly, in this case) assign the credit to the fielders.
Nowhere else do we have to guess. We'll observe the same number of doubles, home runs, strikeouts, and popups to first base as the fan in the seats in 1973. We'll observe the same runs and runs allowed, and innings pitched, and putouts by the left fielder. But we will NOT "observe" the same "performance" of the fielders. Fans who watched the games got a reasonable idea how many runs the fielders saved (or cost) by their play. We don't; we have to guess.
Of course, in guessing, our error could go either way. Sometimes we'll overcredit the defense (compared to the pitcher), and sometimes we'll undercredit the defense. Doesn't it all come out in the wash? Nope. A single season is nowhere near enough for the variation to even out.
For single seasons, we will be substantially overestimating the extent of the fielders' responsibility for runs prevented (or allowed) on balls in play.
This means that you actually have to regress team dWAR to the mean to get an unbiased estimate for what happened on the field. That's never happened before. Usually, we need to regress to estimate *talent*. Here, we need to regress just to estimate *observations*. (To estimate talent, we'd need to regress again, afterwards.)
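Mechanically, regressing to the mean is just shrinking the raw figure toward average by some fraction. A minimal sketch, where the function name and the one-half fraction are hypothetical placeholders and not a measured value:

```python
def regress_to_mean(observed, keep_fraction, mean=0.0):
    """Shrink an observed figure toward the mean.

    keep_fraction is the share of the deviation we believe is real
    (1.0 keeps everything, 0.0 returns the mean).
    """
    return mean + keep_fraction * (observed - mean)


# Hypothetical: raw fielding figure of -20 runs, of which we trust half.
print(regress_to_mean(-20.0, 0.5))  # -10.0
```

Estimating talent would mean calling it a second time on the result, with a different fraction.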
Here's an analogy I think will make it much clearer.
Imagine that we didn't have data about home runs directly, only about deep fly balls (say, 380 feet or more). And we found that, on average, 75% of those fly balls turn out to be home runs.
One year, the opposition hits 200 fly balls, but instead of 150 of them being home runs, only 130 of them are.
And we say, "wow, those outfielders must be awesome, to have saved 20 home runs. They must have made 20 spectacular, reaching-over-the-wall, highlight-reel catches!"
No, probably not. It most likely just turned out that only 130 of those 200 deep fly balls had enough distance. By ignoring the substantial amount of luck in the relationship between fly balls and home run potential, we wind up overestimating the outfielders' impact on runs.
That's exactly what's happening here.
"What actually happened on the field" is partly subjective. We observe what we think is important, and attribute it where we think it makes sense. You could take the outcome of an AB and assign it to the on-deck hitter instead of the batter, but that would make no sense to the way our gut assigns credit or blame. (Our gut, of course, is based on common-sense understandings of how baseball and physics work.)
We assign the observations to the players we think "caused" the result. But we do it even when we know the result is probably the result of luck. It's just the way our brains work. If it was luck, we want to assign it to the lucky party. Otherwise, our observation is *wrong*.
Here's another analogy, a coin-flipping game. The rules go like this:
-- The pitcher flips a coin. If it's a head, it's a weak ground ball, which is always an out. If it's a tail, it's a hard ground ball.
-- If it's a hard ground ball, the fielders take over, and flip their own coin. If that coin is a head, they make the play for an out. If it's a tail, it's a hit.
You play a season of this, and you see the team allowed 15 more hits than expected. How do you assign the blame?
Suppose the "fielders" coin flipped tails exactly half the time, as expected, but the "pitchers" coin flipped too many tails, so the pitchers gave up 30 too many hard ground balls. In that case, we'd say that the fielders can't be blamed, that the 15 extra hits were the pitcher's fault.
If it were the other way around -- the pitcher coin flipped heads half the time, but the "fielder" coin flipped 15 too few heads, letting 15 too many balls drop in for hits -- we'd "blame" the fielders.
We have very specific criteria about how to assign the observations properly, even when they're just random.
The dWAR calculation violates those criteria. It refuses to look at the pitcher coin, or at least, it doesn't have complete data for it. So it just assigns all 15 hits, or whatever the incomplete pitcher coin data can't explain, to the fielder coin.
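The coin game is easy to simulate. The sketch below plays many seasons, splits each season's excess hits the "proper" way described above, and then asks how much of the naive figure (all excess hits blamed on the fielder coin) is really the fielder coin's doing, using a least-squares slope. Season length, number of seasons, and function names are my own arbitrary choices, and the result applies only to this toy game, not to real baseball.

```python
import random


def play_season(n_balls, rng):
    """Play one season of the coin game; return (pitcher_excess, fielder_excess).

    Both are measured in hits above expectation, so they add up to the
    season's total excess hits.
    """
    hard = sum(rng.random() < 0.5 for _ in range(n_balls))  # pitcher tails
    hits = sum(rng.random() < 0.5 for _ in range(hard))     # fielder tails
    pitcher_excess = 0.5 * (hard - 0.5 * n_balls)  # extra hard balls, half drop in
    fielder_excess = hits - 0.5 * hard             # hits beyond half the hard balls
    return pitcher_excess, fielder_excess


def slope_true_on_naive(n_seasons=2000, n_balls=500, seed=1):
    """Least-squares slope of true fielder excess on the naive figure."""
    rng = random.Random(seed)
    seasons = [play_season(n_balls, rng) for _ in range(n_seasons)]
    naive = [p + f for p, f in seasons]  # all excess hits blamed on fielders
    true = [f for _, f in seasons]
    mean_naive = sum(naive) / n_seasons
    mean_true = sum(true) / n_seasons
    cov = sum((x - mean_naive) * (y - mean_true) for x, y in zip(naive, true))
    var = sum((x - mean_naive) ** 2 for x in naive)
    return cov / var


print(round(slope_true_on_naive(), 2))
```

With these rules the fielder coin accounts for two-thirds of the variance in excess hits and the pitcher coin for one-third, so the slope should land near 2/3. In this game, a naive "15 hits on the fielders" is best read as about 10.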
Why is this a big deal? What does it matter, that the split between pitching and defense has this flaw?
1. First, it matters because it introduces something that's new to us, and not intuitively obvious -- the need to regress the dWAR to the mean *just to get an unbiased estimate of what happened on the field*. It took me a lot of thinking until I realized that's what's going on, partly because it's so counterintuitive.
2. It matters because most estimates of fielding runs saved don't do any regression to the mean. This leads to crazy overestimates of the impact of fielding.
My guess, based on some calculations I did, is that you have to regress dWAR around halfway to the mean, for cases where you just use BAbip as your criterion. If my guess is right, it means fielding is only half as important as dWAR thinks it is.
Of course, if you have better data, like the Fielding Bible's, you may have to regress less -- depending on how accurate your estimates are of how hard each ball is to field. Maybe with that data you only have to regress, say, a third of the way to the mean, instead of a half. I have no idea.
The first edition of the Fielding Bible figures fielding cost the 2005 New York Yankees a total of 164 hits -- more than a hit a game. That's about 130 runs, or 13 wins.
They were saying that if you watched the games, and evaluated what happened, you'd see the fielders screw up often enough that you'd observe an average of a hit per game.
I'm saying ... no way, that's too hard to believe. I don't know what the real answer is, but I'd be willing to bet that it's closer to half a hit than a full hit -- with the difference caused by the Fielding Bible's batted-ball observations not being perfect.
I'll investigate this further, how much you have to regress.
3. The problem *really* screws up the pitching numbers. What you're really trying to do is start with Runs Allowed and subtract observed defense. But the measure of observed defense you're using is far too extreme. So, effectively, you're subjecting true pWAR to an overadjustment, along with a random shock.
Even so, that doesn't necessarily mean pWAR becomes less accurate. If the errors were the same magnitude as the true effect of fielding, it would generally be a break-even. If the Yankees are actually 40 runs worse than average, the error is the same whether you credit the pitchers 80 runs, or 0 runs ... it's just a matter of which direction.
Except: the errors aren't fixed. Even if you were to adjust dWAR by regressing it to the mean exactly the right amount, it would still be just an estimate, with a random error. Add that in, and you'd still be behind.
And, perhaps more importantly, with the adjustment, we lose our understanding of what the numbers might mean. The traditional way, when the error is due to *not* adjusting for defense, we intuitively know how to deal with the numbers, what they might not mean. We've always known we can't evaluate pitchers based on runs allowed, unless we adjust for fielding. But, we've developed a gut feel for what the unadjusted numbers mean, because we've dealt with them so often.
We probably even have an idea what direction the adjustment has to go, whether the pitchers in question had a good or bad defense behind them -- because we know the players, and we've seen them play, and we know their reputations.
But, the dWAR way ... well, we have no gut feel for how we need to adjust, because it's no longer about adding in the fielders; it's about figuring out how bad the defense overestimate might have been, and how randomness might have affected the final number.
When you adjust for dWAR, what you're saying is: "instead of living with the fact that team pitching stats are biased by the effects of fielding, I prefer to have team pitching stats biased by some random factor that's actually a bigger bias than the original."
All things considered, I think I'd rather just stick with runs allowed.