Sunday, January 31, 2021

Splitting defensive credit between pitchers and fielders (Part III)

(This is part 3.  Part 1 is here; part 2 is here.)

UPDATE, 2021-02-01: Thanks to Chone Smith in the comments, who pointed out an error.  I investigated and found an error in my code. I've updated this post -- specifically, the root mean error and the final equation. The description of how everything works remains the same.

------

Last post, we estimated that in 2018, Phillies fielders were 3 outs better than league average when Aaron Nola was on the mound. That estimate was based on the team's BAbip and Nola's own BAbip.

Our first step was to estimate the Phillies' overall fielding performance from their BAbip. We had to do that because BAbip is a combination of both pitching and fielding, and we had to guess how to split those up. To do that, we just used the overall ratio of fielding BAbip to overall BAbip, which was 47 percent. So we figured that the Phillies fielders were -24, which is 47 percent of their overall park-adjusted -52.

We can do better than that kind of estimate, because, at least for recent years, we have actual fielding data to use instead. Statcast tells us that the Phillies fielders were -39 outs above average (OAA) for the season*. That's 75 percent of the team's overall park-adjusted BAbip discrepancy, not 47 percent ... but still well within typical variation for teams. 

(*The published estimate is -31, but I'm adding 25 percent (per Tango's suggestion) to account for games not included in the OAA estimate.)  

So we can get much more accurate by starting with the true zone fielding number of -39, instead of the weaker estimate of -24. 

-------

First, let's convert the -39 back to BAbip, by dividing it by 3903 BIP. That gives us ... almost exactly -10 points.

The SD of fielding talent is 6.1. The SD of fielding luck in 3903 BIP is 3.65. So it works out that luck is 2.6 of the 10 points, and talent is the remaining 7.3. (That's because luck's share of the variance is 3.65^2/(3.65^2+6.1^2), which is about 26 percent of the 10 points.)
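If you want to check that arithmetic, here's a rough Python sketch. The SDs are the ones from Part I (fielding luck scaled to the Phillies' 3903 BIP); nothing here is new data:

sd_fielding_talent = 6.1      # points of BAbip, from Part I
sd_fielding_luck   = 3.65     # points, at 3903 BIP

observed = -39 / 3903 * 1000  # Statcast fielding converted to points: about -10.0
luck_share = sd_fielding_luck**2 / (sd_fielding_luck**2 + sd_fielding_talent**2)

print(round(observed * luck_share, 1))        # about -2.6 points of fielding luck
print(round(observed * (1 - luck_share), 1))  # about -7.4; the text rounds to -7.3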

We have no reason (yet) to believe Nola is any different from the rest of the team, so we'll start out with an estimate that he got team average fielding talent of -7.3, and team average fielding luck of -2.6.

Nola's BAbip was .254, in a league that was .295. That's an observed 41 point benefit. But, with fielders that averaged -.0073 talent and -.0026 luck, in a park that was +.0025, that +41 becomes +48.5.  

That's what we have to break down. 

Here's Nola's SD breakdown, for his 519 BIP. We will no longer include fielding talent in the chart, because we're using the fixed team figure for Nola, which is estimated elsewhere and not subject to revision. But we keep a reduced SD for fielding luck relative to team, because that's different for every pitcher.

 9.4 fielding luck
 7.6 pitching talent
17.3 pitching luck
 1.5 park
--------------------
21.2 total

Converting to percentages:

 20% fielding luck
 13% pitching talent
 67% pitching luck
  1% park
--------------------
100% total

Using the above percentages, the 48.5 becomes:

+ 9.5 points fielding luck
+ 6.3 points pitching talent
+32.5 points pitching luck
+ 0.2 points park
-------------------
+48.5 points
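Here's that allocation as a rough Python sketch -- it just replays the SDs in the chart above, so any small differences from the numbers shown are rounding:

sds = {
    "fielding luck":   9.4,
    "pitching talent": 7.6,
    "pitching luck":  17.3,
    "park":            1.5,
}
total_var = sum(sd**2 for sd in sds.values())
print(round(total_var**0.5, 1))          # about 21.2 points total

advantage = 48.5
for name, sd in sds.items():
    share = sd**2 / total_var            # variance share = share of the 48.5 points
    print(name, round(100 * share), "%", round(advantage * share, 1), "points")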

Adding back in the -7.3 points for estimated Phillies fielding talent, -2.6 for Phillies fielding luck, and 2.5 points for the park, gives

 -7.3 points fielding talent [0 - 7.3]
 +6.9 points fielding luck   [+9.5 - 2.6]
 +6.3 points pitching talent
+32.5 points pitching luck
 +2.7 points park            [0.2 + 2.5]
-----------------------------------------
 41   points

Stripping out the two fielding rows:

-7.3 points fielding talent 
+6.9 points fielding luck
-----------------------------
-0.4 points fielding

The conclusion: instead of hurting him by 10 points, as the raw team BAbip might suggest, or helping him by 6 points, as we figured last post ... Nola's fielders only hurt him by 0.4 points. That's less than a fifth of a run. Basically, Nola got league-average fielding.

--------

Like before, I ran this calculation for all the pitchers in my database. Here are the correlations to actual "gold standard" OAA behind the pitcher:

r=0.23 assume pitcher fielding BAbip = team BAbip
r=0.37 BAbip method from last post
r=0.48 assume pitcher OAA = team OAA
r=0.53 this method

And the root mean square error:

13.7 assume pitcher fielding BAbip = team BAbip
11.3 BAbip method from last post
10.2 assume pitcher OAA = team OAA
10.0 this method

-------

Like in the last post, here's a simple formula that comes very close to the result of all these manipulations of SDs:

F = 0.8*T + 0.2*P

Here, "F" is fielding behind the pitcher, which is what we're trying to figure out. "T" is team OAA converted to points of BAbip (team OAA divided by team BIP). "P" is the pitcher's BAbip compared to league.

Unlike the last post, here the team *does* include the pitcher you're concerned with. We had to do it this way because presumably we don't have OAA data for the team without the pitcher. (If we did, we'd just subtract it from the team total and get the pitcher's number directly!)

It looks like 20% of a pitcher's discrepancy is attributable to his fielders. That number is for workloads similar to those in my sample -- around 175 IP. It does vary with playing time, but only slightly. At 320 IP, you can use 19% instead. At 40 IP, you can use 22%. Or, just use 20% for everyone, and you won't be too far wrong.
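If you'd rather have that as code, here's a minimal sketch. The function name and the sign convention -- everything in points of BAbip (or OAA per BIP) relative to league -- are my reading of the formula, not anything official:

def fielding_behind_pitcher(T, P, team_coef=0.8, pitcher_coef=0.2):
    # Shortcut formula: F = 0.8*T + 0.2*P
    # T: team OAA divided by team BIP, in points relative to league
    # P: the pitcher's BAbip relative to league, in points
    # Per the text, nudge the 0.2 to 0.19 at about 320 IP, or 0.22 at 40 IP.
    return team_coef * T + pitcher_coef * P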

-------

Full disclosure: the real life numbers for 2017-19 are different. The theory is correct -- I wrote a simulation, and everything came out pretty much perfect. But on real data, not so perfect.

When I ran a linear regression to predict a pitcher's OAA from his team's rate and his own BAbip, the coefficient on the pitcher didn't come out to 20%. It came out to only about 11.5%. The 95% confidence interval only brings it up to 15% or 16%.

The same thing happened for the formula from the last post: instead of the predicted 26%, the actual regression came out to 17.5%.
  
For the record, these are the empirical regression equations, all numbers relative to league:

F = 0.23*(Team BAbip without pitcher) + 0.175*P
F = 0.92*(Team OAA/BIP including pitcher) + 0.115*P
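If you want to re-run those regressions on your own data, they're just least squares with no intercept (everything is already relative to league). A minimal sketch -- the function and argument names are mine, and you'd have to build the per-pitcher arrays yourself:

import numpy as np

def fit_fielding_coefficients(team_rate, pitcher_rate, fielder_rate):
    # Each argument is a 1-D array, one entry per pitcher-season, in points
    # of BAbip (or OAA per BIP) relative to league. No intercept term, since
    # everything is already centered on the league average.
    X = np.column_stack([team_rate, pitcher_rate])
    coefs, *_ = np.linalg.lstsq(X, fielder_rate, rcond=None)
    return coefs   # the post got roughly (0.92, 0.115) for the OAA version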

Why so much lower than expected? I'm pretty sure it's random variation. The empirical estimate of 11.5% is very sensitive to small variations in the seasonal balance of pitching and fielding luck vs. talent -- so sensitive that the difference between 11.5 percent and 20 percent is not statistically significant. Also, the actual number changes from year to year because of that variation. So, I believe that the 20% number is correct as a long-term average, but for the seasons in the study, the actual number is probably somewhere between 11.5% and 20%.

I should probably explain that in a future post. But, for now, if you don't believe me, feel free to use the empirical numbers instead of my theoretical ones. Whether you use 11.5% or 20%, you'll still be much more accurate than using 100%, which is effectively what happens when you use the traditional method of assigning the overall team number equally to every pitcher.


Monday, January 11, 2021

Splitting defensive credit between pitchers and fielders (Part II)

(Part 1 is here.  This is Part 2.  If you want to skip the math and just want the formula, it's at the bottom of this post.)

------

When evaluating a pitcher, you want to account for how good his fielders were. The "traditional" way of doing that is to scale the team fielding to the pitcher. Suppose a pitcher was +20 plays better than normal, and his team's fielding was -5 plays for the season. If the pitcher pitched 10 percent of the team's innings, you might figure the fielding cost him 0.5 plays, and adjust him from +20 to +20.5.

I have argued that this isn't right. Fielding performance varies from game to game, just like run support does. Pitchers with better ball-in-play numbers probably got better fielding during their starts than pitchers with worse ball-in-play numbers.

By analogy to run support: in 1972, Steve Carlton famously went 27-10 on a Phillies team that was 32-87 without him. Imagine how good he must have been to go 27-10 for a team that scored only 3.22 runs per game!

Except ... in the games Carlton started, the Phillies actually scored 3.76 runs per game. In games he didn't start, the Phillies scored only 3.03 runs per game. 

The fielding version of Steve Carlton might be Aaron Nola in 2018. A couple of years ago, Tom Tango pointed out the problem using Nola as an example, so I'll follow his lead.

Nola went 17-6 for the Phillies with a 2.37 ERA, and gave up a batting average on balls in play (BAbip) of only .254, against a league average of .295 -- that, despite an estimate that his fielders were 0.60 runs per game worse than average. If you subtract 0.60 from Nola's stat line, you wind up with Nola's pitching equivalent to an ERA in the 1s. As a result, Baseball-Reference winds up assigning Nola a WAR of 10.2, tied with Mike Trout for best in MLB that year.

But ... could Nola really have been hurt that much by his fielders? A BAbip of .254 is already exceptionally low. An estimate of -0.60 runs per game implies his BAbip with average fielders would have been .220, which is almost unheard of.

(In fairness: the Phillies 0.60 DRS fielding estimate, which comes from Baseball Info Solutions, is much, much worse than estimates from other sources -- three times the UZR estimate, for instance. I suspect there's some kind of scaling bug in recent BIS ratings, because, roughly, if you divide DRS by 3, you get more realistic numbers, and standard deviations that now match the other measures. But I'll save that for a future post.)

So Nola was almost certainly hurt less by his fielders than his teammates were, the same way Steve Carlton was hurt less by his hitters than his teammates were. But, how much less? 

Phrasing the question another way: Nola's BAbip (I will leave out the word "against") was .254, on a team that was .306, in a league that was .295. What's the best estimate of how his fielders did?

I think we can figure that out, extending the results in my previous post.

------

First, let's adjust for park. In the five years prior to 2018, BAbip for both teams combined was .0127 ("12.7 points") lower at Citizens Bank Park than in Phillies road games. Since only half of Phillies games were at home, that's 6.3 points of park factor. Since there's a lot of luck involved, I regressed 60 percent to the mean of zero (with a limit of 5 points of regression, to avoid ruining outliers like Coors Field), leaving the Phillies with 2.5 points of park factor.
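Here's that park adjustment as a little Python function -- my reading of the procedure, with the 60 percent regression and the 5-point cap as described:

def park_factor(home_vs_road_diff, regress=0.60, cap=0.005):
    # home_vs_road_diff: how much lower BAbip was at home than on the road,
    # both teams combined (0.0127 for Citizens Bank Park, 2013-2017)
    half = home_vs_road_diff / 2                # only half the games are at home
    regression = min(abs(half) * regress, cap)  # regress 60%, but never by more than 5 points
    return half - regression if half > 0 else half + regression

print(round(park_factor(0.0127), 4))            # about 0.0025 -- the 2.5 points used here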

Now, look at how the Phillies did with all the other pitchers. For non-Nolas, the team BAbip was .3141, against a league average of .2954. That's 18.7 points; add back the 2.5 points the park saved them, and the Phillies were 21 points worse than average.

How much of those 21 points came from below-average fielding talent? To figure that out, here's the SD breakdown from the previous post, but adjusted. I've bumped luck upwards for the lower number of BIP, dropped park down to 1.5 since we now have an actual estimate, and increased the SD of pitching because the Phillies had more high-inning guys than average:

6.1 points fielding talent
3.9 points fielding luck
5.6 points pitching talent
6.8 points pitching luck
1.5 points park
---------------------------
11.5 points total

Of the Phillies' 21 points in BAbip, what percentage is fielding talent? The answer: (6.1/11.5)^2, or 28 percent. That's 5.9 points.

So, we assume that the Phillies' fielding talent was 5.9 points of BAbip worse than average. With that number in hand, we'll leave the Phillies without Nola and move on to Nola himself.
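In code, that's just the fielding-talent share of the total variance, applied to the 21 points (these are the chart's numbers, nothing new):

sds = {"fielding talent": 6.1, "fielding luck": 3.9,
       "pitching talent": 5.6, "pitching luck": 6.8, "park": 1.5}

total_var = sum(s**2 for s in sds.values())
share = sds["fielding talent"]**2 / total_var
print(round(share, 2), round(21 * share, 1))   # about 0.28, and 5.9 points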

-------

On the raw numbers, Nola was 41 points better than the league average. But, we estimated, his fielding was about 6 points worse, while his park helped him by 2.5 points, so he was really 44.5 points better.

For an individual pitcher with 700 BIP, here's the breakdown of SDs, again from the previous post:

 6.1  fielding talent
 7.6  fielding luck
 7.6  pitching talent
15.5  pitching luck
 3.5  park
---------------------
20.2  total

We have to adjust all of these for Nola.

First, fielding talent goes down to 5.2. Why? Because we estimated it from other data, so there's less variance left than if we had just used the league-wide average. (A simulation suggests multiplying the 6.1 by the ratio, from the "team without Nola" case, of the SD without fielding talent to the SD with it -- that is, 9.75/11.5.)

Fielding luck and pitching luck increase because Nola had only 519 BIP, not 700.

Finally, park goes to 1.5 for the same reason as before. 

 5.2 fielding talent
10.0 fielding luck  
 7.6 pitching talent
17.3 pitching luck
 1.5 park
--------------------
22.1 total

Convert to percentages:

 5.5% fielding talent
20.4% fielding luck
11.8% pitching talent
61.3% pitching luck
 0.5% park
---------------------
100% total

Multiply by Nola's 44.5 points:

 2.5 fielding talent 
 9.1 fielding luck
 5.3 pitching talent
27.3 pitching luck
 0.2 park
--------------------
44.5 total

Now we add in our previous estimates of fielding talent and park, to get back to Nola's raw total of 41 points:
 
-3.4 fielding talent [2.5-5.9]
 9.1 fielding luck
 5.3 pitching talent
27.3 pitching luck
 2.7 park            [0.2+2.5]
------------------------------
41 total

Consolidate fielding and pitching:

 5.6 fielding
32.6 pitching 
 2.7 park  
-------------          
41   total

Conclusion: The best estimate is that Nola's fielders actually *helped him* by 5.6 points of BAbip. That's about 3 extra outs in his 519 BIP. At 0.8 runs per out, that's 2.4 runs, in 212.1 IP, for about 0.24 WAR or 10 points of ERA.
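Here's the whole chain compressed into a few lines of Python. It just replays the post's rounded numbers (everything in "points of benefit," so positive means fewer hits allowed), so treat it as a sketch rather than a calculator:

sds = {"fielding talent": 5.2, "fielding luck": 10.0,
       "pitching talent": 7.6, "pitching luck": 17.3, "park": 1.5}
total_var = sum(s**2 for s in sds.values())

adjusted_benefit = 44.5              # 41 raw, plus 5.9 for bad fielding talent, minus 2.5 for park
split = {k: adjusted_benefit * s**2 / total_var for k, s in sds.items()}

split["fielding talent"] -= 5.9      # add back the team fielding-talent estimate
split["park"] += 2.5                 # and the park estimate

fielding = split["fielding talent"] + split["fielding luck"]
print(round(fielding, 1))            # about +5.7; the post rounds to +5.6 points of help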

Baseball-reference had him at 60 points of ERA; we have him at 10. Our estimate brings his WAR down from 10.3 to 9.1, or something like that. (Again, in fairness, most of that difference is the weirdly-high DRS estimate of 0.60. If DRS had him at a more reasonable .20, we'd have adjusted him from 9.4 to 9.1, or something.)

-------

Our estimate of +3 outs is ... just an estimate. It would be nice if we had real data instead. We wouldn't have to do all this fancy stuff if we had a reliable zone-based estimate specifically for Nola.

Actually, we do! Since 2017, Statcast has been analyzing batted balls and tabulating "outs above average" (OAA) for every pitcher. For Nola, in 2018, they have +2. Tom Tango told me Statcast doesn't have data for all games, so I should multiply the OAA estimate by 1.25. 

That brings Statcast to +2.5. We estimated +3. Not bad!

But Nola is just one case. And we might be biased in the case of Nola. This method is based on a pitcher of average talent. Nola is well above average, so it's likely some of the difference we attributed to fielding is really due to Nola's own BAbip pitching tendencies. Maybe instead of +3, his fielders were really +1 or something.

So I figured I'd better test other players too.

I found all pitchers from 2017 to 2019 that had Statcast estimates, with at least 300 BIP for a single team. There were a few players whose names didn't quite match up with my Lahman database, so I just let those go instead of fixing them. That left 342 pitcher-seasons. I assume almost all of them were starters.

For each pitcher, I ran the same calculation as for Nola. For comparison, I also did the "traditional" estimate where I gave the pitcher the same fielding as the rest of the team. Here are the correlations to the "gold standard" OAA:

r=0.37 this method
r=0.23 traditional

Here are the approximate root-mean-square errors (lower is better):

11.3 points of BAbip this method
13.7 points of BAbip traditional

This method is meant to be especially relevant for a pitcher like Nola, whose own BAbip is very different from his team's. Here are the root-mean-squared errors for pitchers who, like Nola, had a BAbip at least 10 plays better than their team's:

 9.3 points this method
11.9 points traditional 

And for pitchers at least 10 plays worse:

 9.3 points this method
10.9 points traditional

------

Now, the best part: there's an easy formula to get our estimates, so we don't have to use the messy sums-of-squares stuff we've been doing so far. 

We found that the original estimate for team fielding talent was 28% of observed-BAbip-without-pitcher. And then, our estimate for additional fielding behind that pitcher was 26% of the difference between that pitcher and the team. In other words, if the team's non-Nola BAbip (relative to the league) is T, and Nola's is P,

Fielders = .28T + .26(P-.28T)

The coefficients vary by numbers of BIPs. But the .28 is pretty close for most teams. And, the .26 is pretty close for most single-season pitchers: luck is 25% fielding, and talent is about 30% fielding, so no matter your proportion of randomness-to-skill, you'll still wind up between 25% and 30%.

Expanding that out gives an easier version of the fielding adjustment, which I'll print bigger.

------

Suppose you have an average pitcher, and you want to know how much his fielders helped or hurt him in a given season. You can use this estimate:

F = .21T + .26P 

Where: 

T is his team's BAbip relative to league for the other pitchers on the team, and

P is the pitcher's BAbip relative to league, and 

F is the estimated BAbip performance of the fielders, relative to league, when that pitcher was on the mound.
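As a function, with the sign convention that positive numbers mean more hits than league average (so a negative F means the fielders saved hits). The Nola figures in the example are the raw, non-park-adjusted ones from earlier in the post, so the answer is only roughly the +5.6 points of help found above:

def estimated_fielding(T, P):
    # T: team BAbip (other pitchers only) minus league BAbip
    # P: this pitcher's BAbip minus league BAbip
    return 0.21 * T + 0.26 * P

print(round(estimated_fielding(0.0187, -0.0414), 4))   # about -0.0068: the fielders helped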


-----

Next: Part III, splitting team OAA among pitchers.

Tuesday, December 29, 2020

Splitting defensive credit between pitchers and fielders (Part I)

(Update, 2020-12-29: This is take 2. I had posted this a few days ago, but, after further research, I tweaked the numbers and this is the result. Explanations are in the text.)

-----

Suppose a team has a good year in terms of opposition batted ball quality. Instead of giving up a batting average on balls in play (BAbip) of .300, their opponents hit only .280. In other words, they were .020 better than average in turning (inside-the-park) batted balls into outs. 

How much of those "20 points" was because of the fielders, and how much was because of the pitcher?

Thanks to previous work by Tom Tango, Sky Andrecheck, and others, I think we have what we need to figure this out. If you don't want to see the math or logic, just head to the last section of this post for the two-sentence answer.

------

In 2003, a paper called "Solving DIPS" (by Erik Allen, Arvin Hsu, Tom Tango, et al) did a great job in trying to establish what factors affect BAbip, and in what proportion. I did my own estimation in 2015 (having forgotten about the previous paper). I'll use my breakdown here. 

Looking at a large number of actual team-seasons, I found that the observed SD of BAbip was 11.2 points. I estimated the breakdown of SDs as:


 7.7  fielding talent
 2.5  pitching staff talent
 7.1  luck
 2.5  park
--------------------------
11.0  total

(If you haven't seen this kind of chart before, the "total" doesn't actually add up to the components unless you square them all. That's how SDs work -- when you have two independent variables, the SD of their sum is the square root of the sum of their squares.)

OK, this is where I update a bit from the numbers in the previous version of this post.

First, I'm bumping the SD of park from 2.5 points to 3.5 points, to match Tango's numbers for 1999-2002.  Second, I'm bumping luck to 7.3, since that's the theoretical value (as I'll calculate later).  Third, I'm bumping the pitching staff to 4.3, because after checking, it turns out I made an incorrect mathematical assumption in the previous post.  Finally, fielding talent drops to 6.1 to make it all add up.  So the new breakdown:


 6.1  fielding talent
 4.3  pitching staff talent
 7.3  luck
 3.5  park
--------------------------
11.0  total
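For anyone checking the arithmetic, the combination rule in code is just the root of the sum of squares, and the same one-liner works for every chart that follows:

components = [6.1, 4.3, 7.3, 3.5]   # fielding talent, pitching talent, luck, park
print(round(sum(c**2 for c in components) ** 0.5, 1))   # 11.0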


----

We can use that chart to break the team's 20-point advantage into its components. But ... we can't yet calculate how much of that 20 points goes to the fielders, and how much to the pitchers. Because, we have an entry called "luck". We need to know how to break down the luck and assign it to either side. 

Your first reaction might be -- it's luck, so why should we care? If we're looking to assign deserved credit, why would we want to assign randomness?

But ... if we want to know how the players actually performed, we *do* want to include the luck. We want to know that Roger Maris hit 61 home runs in 1961, even if it's undoubtedly the case that he played over his head in doing so. In this context, "luck" just means the team did somewhat better or worse than their actual talent. That's still part of their record.

Similarly here. If a team gets lucky in opponent BAbip, all that means is they did better than their talent suggests. But how much of that extra performance was the pitchers, giving up easier balls in play? And how much was the fielders, making more and better plays than expected?

That's easy to figure out if we have zone-type fielding stats, calculated by watching where the ball is hit (and sometimes how fast and at what angle), and figuring out the difficulty of every ball, and whether or not the fielders were able to turn it into an out. With those stats, we don't have to risk "blaming" a fielder for not making a play on a bloop single he really had no chance on. 

So where we have those stats, and they work, we have the answer right there, and this post is unnecessary. If the team was +60 runs on balls in play, and the fielders' zone ratings add up to +30, that's half-and-half, so we can say that the 20-point BAbip advantage was 10 points pitching and 10 points fielding.

But for seasons where we don't have the zone rating, what do we do, if we don't know how to split up the luck factor?

Interestingly, it will be the stats compiled by the Zone Rating people that allow us to calculate estimates for the years in which we don't have them.

------

Intuitively, the more common "easy outs" and "sure hits" are, the less fielders matter. In fact, if *all* balls in play were 0% or 100%, fielding performance wouldn't matter at all, and fielding luck wouldn't come into play. All the luck would be in what proportion of 0s and 100s the pitcher gave up. 

On the other hand, if all balls in play were exactly the league average of 30%, it would be the other way around. There would be no difference in the types of hits pitchers gave up, which means there would be no BAbip pitching luck at all. All the luck would be in whether the fielders handled more or fewer than 30% of the chances.

So: the more BIP are "near-automatic" hits or "near-automatic" outs, the more pitchers matter. The more BIP that could go either way, the more fielders matter.

That means we need to know the distribution of ball-in-play difficulty. And that's the data that we wouldn't have without the development of Zone ratings now keeping track of it. 

The data I'm using comes from Sky Andrecheck, who actually published it in 2009, but I didn't realize what it could do until now. (Actually, I'm repeating some of Sky's work here, because I got his data before I saw his analysis of it.  See also Tango's post at his old blog.)

Here's the distribution. Actually, I tweaked it just a tiny bit to make the average work out to .300 (.29987) instead of Sky's .310, for no other reason than I've been thinking .300 forever and didn't want to screw up and forget I need to use .310. Either way, the results that follow would be almost the same. 


43.0% of BIP:  .000 to  .032 chance of a hit*
23.0% of BIP:  .032 to  .140 chance of a hit
10.3% of BIP:  .140 to  .700 chance of a hit
 4.7% of BIP:  .700 to 1.000 chance of a hit
19.0% of BIP:          1.000 chance of a hit
---------------------------------------------
overall average: really close to .300

(*Within a group, the probability is uniform, so anything between .032 and .140 is equally likely once that group is selected.)


The SD of this distribution is around .397. Over 3900 BIP, which I used to represent a team-season, it's .00636. That's the SD of pitcher luck.

The random binomial SD of BAbip over 3900 BIP is the square root of (.3)(1-.3)/3900, which comes out to .00733. That's the SD of overall luck.

Since var(overall luck) = var(pitcher luck) + var(fielder luck), we can solve for fielder luck, which turns out to be .00367.


6.36 points pitcher luck (.00636)
3.67 points fielder luck (.00367)
--------------------------------
7.33 points overall luck (.00733)
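Here's that whole calculation as a short Python sketch, using the tweaked distribution above. Each group is uniform between its endpoints; nothing here is new data, it just reproduces the three numbers in the chart:

groups = [            # (share of BIP, low hit probability, high hit probability)
    (0.430, 0.000, 0.032),
    (0.230, 0.032, 0.140),
    (0.103, 0.140, 0.700),
    (0.047, 0.700, 1.000),
    (0.190, 1.000, 1.000),
]

mean = sum(w * (lo + hi) / 2 for w, lo, hi in groups)                 # about .300
ex2  = sum(w * (lo*lo + lo*hi + hi*hi) / 3 for w, lo, hi in groups)   # E[p^2] for uniform pieces
var_p = ex2 - mean**2                # spread of hit probabilities = pitcher luck, per BIP

bip = 3900
sd_pitcher_luck = (var_p / bip) ** 0.5                               # about .00636
sd_total_luck   = (mean * (1 - mean) / bip) ** 0.5                   # about .00733 (binomial)
sd_fielder_luck = (sd_total_luck**2 - sd_pitcher_luck**2) ** 0.5     # about .00367

print(round(100 * sd_pitcher_luck**2 / sd_total_luck**2, 1))         # about 75 percent pitcher luck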

If you square all the numbers and convert to percentages, you get


 75.3 percent pitcher luck
 24.7 percent fielder luck
--------------------------
100.0 percent overall luck

So there it is. BAbip luck is, on average, 75 percent pitching and 25 percent fielding. Of course, it varies randomly around that, but those are the averages.

What does that mean in practice? Suppose you notice that a team from the past, which you know has average talent in both pitching and fielding, gave up 20 fewer hits than expected on balls in play. If you were to go back and watch re-broadcasts of all 162 games, you'd expect to find that the fielders made 5 more plays than expected, based on what types of balls in play they were. And, you'd expect to find that the other 15 plays were the result of balls having been hit a bit easier to field than average.

Again, we are not estimating talent here: we are estimating *what happened in games*. This is a substitute for actually watching the games and measuring balls in play, or having zone ratings, which are based on someone else actually having done that. 

------

So, now that we know the luck breaks down 75/25, we can take our original breakdown, which was this:


 6.1  fielding talent
 4.3  pitching staff talent
 7.3  luck
 3.5  park
--------------------------
11.0  total

And split up the 7.3 points of luck as we calculated:


6.36 pitching luck
3.67 fielding luck
--------------------------
7.3  total luck

And substitute that split back in to the original:


 6.1  fielding talent
 3.67 fielding luck
 4.3  pitching staff talent
 6.36 pitching staff luck
 3.5  park
--------------------------
11.0  total

Since talent+luck = observed performance, and talent and luck are independent, we can consolidate each pair of "talent" and "luck" by summing their squares and taking the square root:


 7.1 fielding observed
 7.7 pitching observed 
 3.5 park
----------------------
11.0 total

Squaring, taking percentages, and rounding, we get

 42 percent fielding
 48 percent pitching
 10 percent park
--------------------
100 percent total 

If you're playing in an average park, or you're adjusting for park some other way, the park portion doesn't apply, and you can say 


 47 percent fielding
 53 percent pitching
---------------------
100 percent total
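In code, the consolidation and the percentage split look like this (the chart's numbers again, nothing new):

fielding = (6.1**2 + 3.67**2) ** 0.5     # observed fielding, about 7.1
pitching = (4.3**2 + 6.36**2) ** 0.5     # observed pitching, about 7.7
park     = 3.5

total_var = fielding**2 + pitching**2 + park**2
print([round(100 * x**2 / total_var) for x in (fielding, pitching, park)])   # roughly [42, 48, 10]
print(round(100 * fielding**2 / (fielding**2 + pitching**2)))                # 46-47 with park stripped out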

So now we have our answer. If you see a team's stats one year that show them to have been particularly good or bad at turning batted balls into outs, on average, after adjusting for park, 47 percent of the credit goes to the fielders, and 53 percent to the pitchers.

But it varies. Some teams might have been 40/60, or 60/40, or even 120/-20! (The latter result might happen if, say, the fielders saved 24 hits, but the pitchers gave up harder BIPs that cost 4 extra hits.)

How can you know how far a particular team is from the 47/53 average? Watch the games and calculate zone ratings. Or, just rely on someone else's reliable zone rating. Or, start with 47/53, and adjust for what you know about how good the pitching and fielding were, relative to each other. Or, if you don't know, just use 47/53 as your estimate.

To verify empirically whether I got this right, find a bunch of published Zone Ratings that you trust, and see if they work out to about 42 percent of what you'd expect if the entire excess BAbip was allocated to fielding.  (I say 42 percent because I assume zone ratings correct for park.)

(Actually, I ran across about five years of data, and tried it, and it came out to 39 percent rather than 42 percent. Maybe I'm a bit off, or it's just random variation, or I'm way off and there's lots of variation.)

-------

So what we've found so far:

-- Luck in BAbip belongs 25% to fielders, 75% to pitchers;

-- For a team-season, excess performance in observed BAbip belongs 42% to fielders, 48% to pitchers, and 10% to park.

-------

That 42 percent figure is for a team-season only. For an individual pitcher, it's different. 

Here's the breakdown for an individual pitcher who allows 700 BIP for the season. 


 6.1  fielding talent
 7.6  pitching talent
17.3  luck
 3.5  park
---------------------------
20.2  total


The SD of pitching talent is larger now, because you're dealing with one specific pitcher, rather than the average of all the team's pitchers (who will partially offset each other, reducing variability). Also, luck has jumped from 7.3 points to 17.3, because of the smaller sample size.

OK, now let's break up the luck portion again:


 6.1  fielding talent
 7.6  fielding luck
 7.6  pitching talent
15.5  pitching luck
 3.5  park
---------------------------
20.2  total

And consolidating:


 9.75 observed fielding
17.3  observed pitching
 3.5  park
---------------------------
20.2  total

Converting to percentages and rounding:


 23%  observed fielding
 73%  observed pitching
  3%  park
---------------------------
100%  total

If we've already adjusted for park, then

 24%  observed fielding
 76%  observed pitching
---------------------------
100%  total


So it's quite different for an individual pitcher than for a team season, because luck and talent break down differently between pitchers and fielders. 

The conclusion: if you know nothing specific about the pitcher, his fielders, his park, or his team, your best guess is that 25 percent of his BAbip (compared to average) came from how well his fielders made plays, and 75 percent of his BAbip comes from what kind of balls in play he gave up.

------

Here's the two-sentence summary. On average,

-- For teams with 3900 BIP, 47 percent of BABIP is fielding and 53 percent is pitching.

-- For starters with 700 BIP, 24 percent of BABIP is fielding and 76 percent is pitching.

------

Next: Part II, where I try applying this to pitcher evaluation, such as WAR.

Friday, June 19, 2015

Can fans evaluate fielding better than sabermetric statistics?

Team defenses differ in how well they turn batted balls into outs. How do you measure the various factors that influence the differences? The fielders obviously have a huge role, but do the pitchers and parks also have an influence?

Twelve years ago, in a group discussion, Erik Allen, Arvin Hsu, and Tom Tango broke down the variation in batting average on balls in play (BAbip). Their analysis was published in a summary called "Solving DIPS" (.pdf).

A couple of weeks ago, I independently repeated their analysis -- I had forgotten they had already done it -- and, reassuringly, got roughly the same result. In round numbers, it turns out that:

The SD of team BAbip fielding talent is roughly 30 runs over a season.

------

There are several competing systems for evaluating which players and teams are best in the field, and by how much. The Fangraphs stats pages list some of those stats, and let you compare.

I looked at those team stats for the 2014 season. Specifically, these three:

1. DRS, from The Fielding Bible -- specifically, the rPM column, runs above average from plays made. (That's the one we want, because it doesn't include outfielder/catcher arms, or double-play ability.)

2. The Fan Scouting Report (FSR), which is based on an annual fan survey run by Tom Tango.

3. Ultimate Zone Rating (UZR), a stat originally developed by Mitchel Lichtman, but which, as I understand it, is now public. I used the column "RngR," which is the range portion (again to leave out arms and other defensive skills).

All three stats are denominated in runs. Here are their team SDs for the 2014 season, rounded:

37 runs -- DRS (rPM column)
23 runs -- Fan Scouting Report (FSR)
29 runs -- UZR (RngR)
------------------------------------
30 runs -- team talent

The SD of DRS is much higher than the SD of team talent. Does that mean it's breaching the "speed of light" limit of forecasting, trying to (retrospectively) predict random luck as well as skill?

No, not necessarily. Because DRS isn't actually trying to evaluate talent.  It's trying to evaluate what actually happened on the field. That has a wider distribution than just talent, because there's luck involved.

A team with fielding talent of +30 runs might have actually saved +40 runs last year, just like a player with 30-home-run talent may have actually hit 40.

The thing is, though, that in the second case, we actually KNOW that the player hit 40 homers. For team fielding, we can only ESTIMATE that it saved 40 runs, because we don't have good enough data to know that the extra runs didn't just result from getting easier balls to field.

In defense, the luck of "made more good plays than average" is all mixed up with "had more easier balls to field than average."  The defensive statistics I've seen try their best to figure out which is which, but they can't, at least not very well.

What they do, basically, is classify every ball in play according to how difficult it was, based on location and trajectory. I found this post from 2003, which shows some of the classifications for UZR. For instance, a "hard" ground ball to the "56" zone (a specific portion of the field between third and short) gets turned into an out 43.5 percent of the time, and becomes a hit the other 56.5 percent. 

If it turns out a team had 100 of those balls to field, and converted them to outs at 45 percent instead of 43.5 percent, that's 1.5 extra outs it gets credited for, which is maybe 1.2 runs saved.

The problem with that is: the 43.5 percent is a very imprecise estimate of what the baseline should be. Because, even in the "hard-hit balls to zone 56" category, the opportunities aren't all the same. 

Some of them are hit close to the fielder, and those might be turned into outs 95 percent of the time, even for an average or bad-fielding team. Some are hit with a trajectory and location that makes them only 8 percent. And, of course, each individual case depends where the fielders are positioned, so the identical ball could be 80 percent in one case and 10 percent in another.

In a "Baseball Guts" thread at Tango's site, data from Sky Andrecheck and BIS suggested that only 20 percent of ground balls, and 10 percent of fly balls, are "in doubt", in the sense that if you were watching the game, you'd think it could have gone either way. In other words, at least 80% of balls in play are either "easy outs" or "sure hits."  ("In doubt" is my phrase, meaning BIPs in which it wasn't immediately at least 90 percent obvious to the observer whether it would be a hit or an out.)

That means that almost all the differences in talent and performance manifest themselves in just 10 to 20 percent of balls in play.

But, even the best fielding systems have few zones that are less than 20 percent or more than 80 percent. That means that there is still huge variation in difficulty *even accounting for zone*. 

So, when a team makes 40 extra plays over a season, it's a combination of:

(a) those 40 plays came from extra performance from the few "in doubt" balls;
(b) those 40 plays came from easier balls overall.

I think (b) is much more a factor than (a), and that you have to regress the +40 to the mean quite a bit to get a true estimate. 

Maybe when the zones get good enough to show large differences between teams -- like, say, 20% for a bad fielder and 80% for a good fielder -- well, at that point, you have a system that might work. But, without that, doesn't it almost have to be the case that most of the difference is just from what kinds of balls you get?

Tango made a very relevant point, indirectly, in a recent post. He asked, "Is it possible that Manny Ramirez never made an above-average play in the outfield?"  The consensus answer, which sounds right to me, was ... it would be very rare to see Manny make a play that an average outfielder wouldn't have made. (Leaving positioning out of the argument for now.)

Suppose BIPs to a certain difficult zone get caught 30% of the time by an average fielder, and Manny catches them 20% of the time. Since ANY outfielder would catch a ball that Manny gets to ... well, that zone must really be at least TWO zones: a "very easy" zone with a 100% catch rate, and a "harder" zone with a 10% catch rate for an average fielder, and a 0% catch rate for Manny.

In other words, if Manny makes plays on 30% of the balls in that zone and a Gold Glove outfielder makes plays on 25%, it's almost certain that Manny just got easier balls to catch. 

The only way to eliminate that kind of luck is to classify the zones in enough micro detail that you get close to 0% for the worst, or close to 100% for the best.

And that's not what's happening. Which means, there's no way to tell how many runs a defense saved.

------

And this brings us back to the point I made last month, about figuring out how to split observed runs allowed into observed pitching and observed fielding. There's really no way to do it, because you can't tell a good fielding play from an average one with the numbers currently available. 

Which means: the DRS and UZR numbers in the Fangraphs tables are actually just estimates -- not estimates of talent, but estimates of *what happened in the field*. 

There's nothing wrong with that, in principle: but, I don't think it's generally realized that that's what those are, just estimates. They wind up in the same statistical summaries as pitching and hitting metrics, which themselves are reliable observations. 

At baseball-reference, for instance, you can see, on the hitting page, that Robinson Cano hit .302-28-118 (fact), which was worth 31 runs above average (close enough to be called fact).

On his fielding page, you can see that Cano had 323 putouts (fact) and 444 assists (fact), which, by Total Zone Rating, was worth 4 runs below average (uh-oh).

Unlike the other columns, the zone-rating column is an *estimate*. Maybe it really was -4 runs, but it could easily have been -10 runs, or -20 runs, or +6 runs. 

To the naked eye, the hitting and fielding numbers both look equally official and reliable, as accurate observations of what happened. But one is based on an observation of what happened, and the other is based on an estimate of what happened.

------

OK, that's a bit of an exaggeration, so let me backtrack and explain what I mean.

Cano had 28 home runs, and 444 assists. Those are "facts", in the sense that the error is zero, if the observations are recorded correctly.

Cano's offense was 31 runs above average. I'm saying that's accurate enough to be called a "fact."  But admittedly, it is, in fact, an estimate. Even if the Linear Weights formula (or whatever) is perfectly accurate, the "runs above average" number is after adjusting for park effects (which are imperfect estimates, albeit pretty good ones). Also, the +31 assumes Cano faced league-average pitching. That, again, is an estimate, but, again, it's a pretty strong one.

For defense, comparatively, the UZR of "-4" is a very, very, weak estimate. It carries an implicit assumption that Cano's "relative difficulty of balls in play" was zero. That's much less reliable than the estimate that his "relative difficulty of pitchers faced" was zero. If you wanted, you could do the math, and show how much weaker the one estimate is than the other; the difference is huge.

But, here's a thought experiment to make it clear. Suppose Cano faces the worst pitcher in the league, and hits a home run. In that case, he's at worst 1.3 runs above average for that plate appearance, instead of our estimate of 1.4. It's a real difference in how we evaluate his performance, but a small one.

On the other hand, suppose Cano faces a grounder in a 50% zone, but one of the easy ones, that almost any fielder would get to. Then, he's maybe +0.01 hits above average, but we're estimating +0.5. That is a HUGE difference. 

It's also completely at odds with our observation of what happens on the field. After an easy ground ball, even the most casual fan would say he observed Cano saving his team 0 runs over what another player would do. But we write it down as +0.4 runs, which is ... well, it's so big, you have to call it *wrong*. We are not accurately recording what happened on the field.

So, if you take "what happened on the field" in broad, intuitive terms, the home run matches: "he did a good thing on the field and created over a run" both to the observer and the statistic. But for the ground ball, the statistic lies. It says Cano "did a good thing on the field and saved almost half a run," but the observer says Cano "made a routine play." 

The batting statistics match what a human would say happened. The fielding stats do not.

------

How much random error is in those fielding statistics? When UZR gives an SD of 29 runs, how much of that is luck, and how much is talent? If we knew, we could at least regress to the mean. But we don't. 

That's because we don't know the idealized actual SD of observed performance, adjusted for the difficulty of the balls in play. It must be somewhere between 47 runs (the SD of observed performance without adjusting for difficulty), and 30 runs (the SD of talent). But where in between?

In addition: how sure are we that the estimates are even unbiased, in the sense that they're independently just as likely to be too high as too low? If they're truly unbiased, that makes them much easier to live with -- at the very least, you know they'll get more accurate as you average over multiple seasons. But if they inappropriately adjust for park effects, or pitcher talent, you might find some teams being consistently overestimated or underestimated. And that could really screw up your evaluations, especially if you're using those fielding estimates to rejig pitching numbers. 

-------

For now, the estimates I like best are the ones from Tango's "Fan Scouting Report" (FSR). As I understand it, those are actually estimates of talent, rather than estimates of what happened on the field. 

Team FSR has an SD of 23 runs. That's very reasonable. It's even more conservative than it looks. That 23 includes all the "other than range" stuff -- throwing arm, double plays, and so on. So the range portion of FSR is probably a bit lower than 23.

We know the true SD of talent is closer to 30, but there's no way for subjective judgments to be that precise. For one thing, the humans that respond to Tango's survey aren't perfect evaluators of what they see on the field. Second, even if they *were* perfect, a portion of what they're observing is random luck anyway. You have to temper your conclusions for the amount of noise that must be there. 

It might be a little bit apples-to-oranges to compare FSR to the other estimates, because FSR has much more information to work with. The survey respondents don't just use the ball-in-play stats for a single year -- they consider the individual players' entire careers, ages and trajectories; the opinions of their peers and the press; their personal understanding of how fielding works; and anything else they deem relevant.

But, that's OK. If your goal is to try to estimate the influence of team fielding, you might as well just use the best estimate you've got. 

For my part, I think FSR is the one I trust the most. When it comes to evaluating fielding, I think sabermetrics is still way behind the best subjective evaluations.

Thursday, May 28, 2015

Pitchers influence BAbip more than the fielders behind them

It's generally believed that when pitchers' teams vary in their success rate in turning batted balls into outs, the fielders should get the credit or blame. That's because of the conventional wisdom that pitchers have little control over balls in play.

I ran some numbers, and ... well, I think that's not right. I think individual pitchers actually have as much influence on batting average on balls in play (BAbip) as the defense behind them, and maybe even a bit more.

------

UPDATE: turns out all the work I did is just confirming a result from 2003, in a document called "Solving DIPS" (.pdf).  It's by Erik Allen, Arvin Hsu, and Tom Tango. (I had read it, too, several years ago, and promptly forgot about it.)

It's striking how close their numbers are to these, even though I'm calculating things in a different way than they did. That suggests that we're all measuring the same thing with the same accuracy.

One advantage of their analysis over mine is they have good park effect numbers.  See the first comment in this post for Tango's links to "batting average on balls in play" park effect data.

------

For the first step, I'll run the usual "Tango method" to divide BAbip into talent and luck.

For all team-seasons from 2001 to 2011, I figured the SD of team BAbip, adjusted for the league average. That SD turned out to be .01032, which I'll refer to as "10.3 points", as in "points of batting average."  

The average SD of binomial luck for those seasons was 7.1 points. Since

SD(observed)^2 = SD(luck)^2 + SD(talent)^2

We can calculate that SD(talent) = 7.5 points.

"Talent," here, doesn't yet differentiate between pitcher and fielder talent. Actually, it's a conglomeration of everything other than luck -- fielders, pitchers, slight randomness of opposition batters, day/night effects, and park effects. (In this context, we're saying that Oakland's huge foul territory has the "talent" of reducing BAbip by producing foul pop-ups.)

So:

7.1 = SD(luck) 
7.5 = SD(talent) 

For a team-season from 2001 to 2011, talent was more important than luck, but not by much. 

I did the same calculation for other sets of seasons. Here's the summary:


            Obsrvd  Luck Talent
--------------------------------
1960-1968    11.41  6.95   9.05
1969-1976    12.24  6.86  10.14
1977-1991    10.95  6.94   8.46
1992-2000    11.42  7.22   8.85
2001-2011    10.32  7.09   7.50
-------------------------------
"Average"    11.00  7.00   8.50

I've arbitrarily decided to "average" the eras out to round numbers:  7 points for luck, and 8.5 points for talent. Feel free to use actual averages if you like. 

It's interesting how close that breakdown is to the (rounded) one for team W-L records:

          Observed  Luck  Talent
--------------------------------
BABIP        11.00  7.00   8.50
Team Wins    11.00  6.50   9.00
--------------------------------

That's just coincidence, but still interesting and intuitively helpful.

-------

That works for separating BAbip into skill and luck, but we still need to break down the skill into pitching and fielding.

I found every pitcher-season from 1981 to 2011 where the pitcher faced at least 400 batters. I compared his BAbip allowed to that of the rest of his team. The comparison to teammates effectively controls for defense, since, presumably, the defense is the same no matter who's on the mound. 

Then, I took the player/rest-of-team difference, and calculated the Z-score: if the difference were all random, how many SDs of luck would it be? 

If BAbip was all luck, the SD of the Z-scores would be exactly 1.0000. It wasn't, of course. It was actually 1.0834. 

Using the "observed squared = talent squared plus luck squared", we can calculate that SD(talent) is 0.417 times as big as SD(luck). For the full dataset, the (geometric) average SD(luck) was 21.75 points. So, SD(talent) must be 0.417 times 21.75, which is 9.07 points.

We're not quite done. The 9.07 isn't an estimate of a single pitcher's talent SD; it's the estimate of the difference between that pitcher and his teammates. There's randomness in the teammates, too, which we have to remove.

I arbitrarily chose to assume the pitcher has 8 times the luck variance of the teammates (he probably pitched more than 1/8 of the innings, but there are more than 8 other pitchers to dilute the SD; I just figured maybe the two forces balance out). That would mean 8/9 of the total variance belongs to the individual pitcher, or the square root of 8/9 of the SD. That reduces the 9.07 points to 8.55 points.

8.55 = SD(single pitcher talent)
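Here's that arithmetic in a few lines, with the same arbitrary 8-to-1 assumption about pitcher vs. teammate luck:

sd_z    = 1.0834    # SD of the Z-scores (pitcher minus teammates, in luck SDs)
sd_luck = 21.75     # points; geometric average across the pitcher-seasons

talent_vs_luck    = (sd_z**2 - 1) ** 0.5            # about 0.417
sd_diff_talent    = talent_vs_luck * sd_luck        # about 9.07 points, pitcher minus teammates
sd_pitcher_talent = sd_diff_talent * (8/9) ** 0.5   # keep sqrt(8/9) of it for the pitcher alone

print(round(sd_pitcher_talent, 2))                  # about 8.55 points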

That's for individual pitchers. The SD for the talent of a *pitching staff* would be lower, of course, since the individual pitchers would even each other out. If there were nine pitchers on the team, each with equal numbers of BIP, we'd just divide that by the square root of 9, which would give 2.85. I'll drop that to 2.5, because real life is probably a bit more dilute than that.

So for a single team-season, we have

8.5 = SD(overall talent) 
-------------------------------------
2.5 = SD(pitching staff talent) 
8.1 = SD(fielding + all other talent)

------

What else is in that 8.1 other than fielding? Well, there's park effects. The only effect I have good data for, right now (I was too lazy to look hard), is foul outs. I searched for those because of all the times I've read about the huge foul territory in Oakland, and how big an effect it has.

Google found me a FanGraphs study by Eno Sarris, showing huge differences in foul outs among parks. The difference between top and bottom is more than double -- 398 outs in Oakland over two years, compared to only 139 in Colorado. 

The team SD from Sarris's chart was about 24 outs per year. Only half of those go to the home pitchers' BAbip, so that's 12 per year. Just to be conservative, I'll reduce that to 10.

Ten extra outs on a team-season's worth of BIP is around 2.5 points.

So: if 8.1 is the remaining unexplained talent SD, we can break it down as 2.5 points of foul territory, and 7.7 points of everything else (including fielding).

Our breakdown is now:

11.0 = SD(observed) 
--------------------------
 7.1 = SD(luck) 
 2.5 = SD(pitching staff)
 2.5 = SD(park foul outs)
 7.7 = SD(fielders + unexplained)

We can combine the first three lines of the breakdown to get this:

11.0 = SD(observed) 
--------------------------
 7.9 = SD(luck/pitchers/park) 
 7.7 = SD(fielders/unexplained)

Fielding and non-fielding are almost exactly equal. Which is why I think you have to regress BAbip around halfway to the mean to get an unbiased estimate for the contribution of fielding.

UPDATE: as mentioned, Tango has better park effect data, here.

------

Now, remember when I said that pitchers differ more in BAbip than fielders? Not for a team, but for an individual pitcher,

8.5 = SD(individual pitcher)
7.7 = SD(fielders + unexplained)

The only reason fielding is more important than pitching for a *team*, is that the multiple pitchers on a staff tend to cancel each other out, reducing the 8.5 SD down to 2.5.

-------

Well, those last three charts are the main conclusions of this study. The rest of this post is just confirming the results from a couple of different angles.

-------

Let's try this, to start. Earlier, when we found that SD(pitchers) = 8.5, we did it by comparing a pitcher's BAbip to that of his teammates. What if we compare his BAbip to the rest of the pitchers in the league, the ones NOT on his team?

In that case, we should get a much higher SD(observed), since we're adding the effects of different teams' fielders.

We do. When I convert the pitchers to Z-scores, I get an SD of 1.149. That means SD(talent) is 0.57 times as big as SD(luck). With SD(luck) calculated to be about 20.54 points, based on the average number of BIPs in the two samples ... that makes SD(talent) equal to 11.6 points.

In the other study, we found SD(pitcher) was 8.5 points. Subtracting the square of 8.5 from the square of 11.6, as usual, gives

11.6 = SD(pitcher+fielders+park)
--------------------------------
 8.5 = SD(pitcher)
 7.9 = SD(fielding+park)

So, SD(fielding+park) works out to 7.9 by this method, 8.1 by the other method. Pretty good confirmation.

-------

Let's try another. This time, we'll look at pitchers' careers, rather than single seasons. 

For every player who pitched at least 4,000 outs (1333.1 innings) between 1980 and 2011, I looked at his career BAbip, compared to his teammates' weighted BAbip in those same seasons. 

And, again, I calculated the Z-scores for number of luck SDs he was off. The SD of those Z-scores was 1.655. That means talent was 1.32 times as important as luck (since 1.32 squared plus 1 squared equals 1.655 squared).

The SD of luck, averaged for all pitchers in the study, was 6.06 points. So SD(talent) was 1.32 times 6.06, or 8.0 points.

10.0 = SD(pitching+luck)
------------------------
 6.1 = SD(luck)
 8.0 = SD(pitching)

The 8.0 is pretty close to the 8.5 we got earlier. And, remember, we didn't include all pitchers in this study, just those with long careers. That probably accounts for some of the difference.

Here's the same thing, but for 1960-1979:

 9.3 = SD(pitching+luck)
------------------------
 6.0 = SD(luck)
 7.2 = SD(pitching)

It looks like variation in pitcher BAbip skill was lower in the olden times than it is now. Or, it's just random variation.

--------

I did the career study again, but compared each pitcher to OTHER teams' pitchers. Just like when we did this for single seasons, the SD should be higher, because now we're not controlling for differences in fielding talent. 

And, indeed, it jumps from 8.0 to 8.8. If we keep our estimate that 8.0 is pitching, the remainder must be fielding. Doing the breakdown:

10.5 = SD(pitching+fielding+luck)
---------------------------------
 5.8 = SD(luck)
 8.0 = SD(pitching)
 3.6 = SD(fielding)

That seems to work out. Fielding is smaller for a career than a season, because the quality of the defense behind the pitcher tends to even out over a career. I was surprised it was even that large, but, then, it does include park effects (and those even out less than fielders do). 

For 1960-1979:

10.2 = SD(pitching+fielding+luck)
---------------------------------
 5.7 = SD(luck)
 7.2 = SD(pitching)
 4.4 = SD(fielding)

Pretty much along the same lines.

------

Unless I've screwed up somewhere, I think we've got these as our best estimates for BAbip variation in talent:

8.5 = SD(individual pitcher BAbip talent)
2.5 = SD(team pitching staff BAbip talent)
7.7 = SD(team fielding staff BAbip talent)
2.5 = SD(park foul territory BAbip talent)

And, for a single team-season,

7.1 = SD(team season BAbip luck)

For a single team-season, it appears that luck, pitching, and park effects, combined, are about as big an influence on BAbip as fielding skill.  