Sunday, January 31, 2021

Splitting defensive credit between pitchers and fielders (Part III)

(This is part 3.  Part 1 is here; part 2 is here.)

UPDATE, 2021-02-01: Thanks to Chone Smith in the comments, who pointed out an error.  I investigated and found an error in my code. I've updated this post -- specifically, the root mean error and the final equation. The description of how everything works remains the same.

------

Last post, we estimated that in 2018, Phillies fielders were 3 outs better than league average when Aaron Nola was on the mound. That estimate was based on the team's BAbip and Nola's own BAbip.

Our first step was to estimate the Phillies' overall fielding performance from their BAbip. We had to do that because BAbip is a combination of both pitching and fielding, and we had to guess how to split those up. To do that, we just used the overall ratio of fielding BAbip to overall BAbip, which was 47 percent. So we figured that the Phillies fielders were -24, which is 47 percent of their overall park-adjusted -52.

We can do better than that kind of estimate, because, at least for recent years, we have actual fielding data to use instead. Statcast tells us that the Phillies fielders were -39 outs above average (OAA) for the season*. That's 75 percent of the team's overall BAbip shortfall, not 47 percent ... but still well within typical variation for teams. 

(*The published estimate is -31, but I'm adding 25 percent (per Tango's suggestion) to account for games not included in the OAA estimate.)  

So we can get much more accurate by starting with the true zone fielding number of -39, instead of the weaker estimate of -24. 

-------

First, let's convert the -39 back to BAbip, by dividing it by 3903 BIP. That gives us ... almost exactly -10 points.

The SD of fielding talent is 6.1. The SD of fielding luck in 3903 BIP is 3.65. So it works out that luck is 2.6 of the 10 points, and talent is the remaining 7.3. (That's because 3.65^2/(3.65^2+6.1^2) is about 26 percent, and 26 percent of 10 points is 2.6.)
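
In code, that split looks something like this (Python; the constants are the ones above, and the variable names are just mine):

team_oaa = -39.0          # Statcast OAA, bumped 25 percent for missing games
team_bip = 3903

# Convert OAA to points of BAbip (1 point = .001)
observed_fielding = team_oaa / team_bip * 1000     # about -10.0

sd_talent = 6.1           # SD of team fielding talent, in points
sd_luck   = 3.65          # SD of team fielding luck over 3903 BIP

# Split the observed -10 points in proportion to the variances
luck_share = sd_luck**2 / (sd_luck**2 + sd_talent**2)
fielding_luck   = observed_fielding * luck_share        # about -2.6
fielding_talent = observed_fielding * (1 - luck_share)  # about -7.4 before rounding (I used -7.3 above)

print(round(fielding_luck, 1), round(fielding_talent, 1))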

We have no reason (yet) to believe Nola is any different from the rest of the team, so we'll start out with an estimate that he got team average fielding talent of -7.3, and team average fielding luck of -2.6.

Nola's BAbip was .254, in a league that was .295. That's an observed 41 point benefit. But his fielders' estimated talent was about -.0073, their estimated luck -.0026, and his park was worth +.0025 in his favor; adjusting those out, the +41 becomes +48.5.  

That's what we have to break down. 

Here's Nola's SD breakdown, for his 519 BIP. We will no longer include fielding talent in the chart, because we're using the fixed team figure for Nola, which is estimated elsewhere and not subject to revision. But we keep a reduced SD for fielding luck relative to team, because that's different for every pitcher.

 9.4 fielding luck
 7.6 pitching talent
17.3 pitching luck
 1.5 park
--------------------
21.2 total

Converting to percentages:

 20% fielding luck
 13% pitching talent
 67% pitching luck
  1% park
--------------------
100% total

Using the above percentages, the 48.5 becomes:

+ 9.5 points fielding luck
+ 6.3 points pitching talent
+32.5 points pitching luck
+ 0.2 points park
-------------------
+48.5 points

Adding back in the -7.3 points of estimated Phillies fielding talent, the -2.6 points of estimated fielding luck, and the 2.5 points for the park, gives

 -7.3 points fielding talent [0 - 7.3]
 +6.9 points fielding luck   [+9.5 - 2.6]
 +6.3 points pitching talent
+32.5 points pitching luck
 +2.7 points park            [0.2 + 2.5]
-----------------------------------------
 41   points

Stripping out the two fielding rows:

-7.3 points fielding talent 
+6.9 points fielding luck
-----------------------------
-0.4 points fielding

The conclusion: instead of hurting him by 10 points, as the raw team BAbip might suggest, or helping him by 6 points, as we figured last post ... Nola's fielders only hurt him by 0.4 points. That's less than a fifth of a run. Basically, Nola got league-average fielding.
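
Here's that whole chain as a Python sketch, using the numbers above (small differences from the charts are just rounding):

sds = {"fielding luck": 9.4, "pitching talent": 7.6,
       "pitching luck": 17.3, "park": 1.5}
total_var = sum(sd**2 for sd in sds.values())

adjusted = 48.5    # Nola's deviation after adjusting for team talent, luck, and park

# Allocate the 48.5 points in proportion to each component's variance
shares = {name: adjusted * sd**2 / total_var for name, sd in sds.items()}
# roughly 9.6 fielding luck, 6.3 pitching talent, 32.4 pitching luck, 0.2 park

# Add back the fixed team-level estimates to get Nola's own breakdown
fielding = -7.3 + (shares["fielding luck"] - 2.6)    # talent plus luck
park     = shares["park"] + 2.5

print(round(fielding, 1))    # about -0.3 to -0.4 points: league-average fielding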

--------

Like before, I ran this calculation for all the pitchers in my database. Here are the correlations to actual "gold standard" OAA behind the pitcher:

r=0.23 assume pitcher fielding BAbip = team BAbip
r=0.37 BAbip method from last post
r=0.48 assume pitcher OAA = team OAA
r=0.53 this method

And the root mean square error:

13.7 assume pitcher fielding BAbip = team BAbip
11.3 BAbip method from last post
10.2 assume pitcher OAA = team OAA
10.0 this method

-------

Like in the last post, here's a simple formula that comes very close to the result of all these manipulations of SDs:

F = 0.8*T + 0.2*P

Here, "F" is fielding behind the pitcher, which is what we're trying to figure out. "T" is team OAA/BAbip. "P" is player BAbip compared to league.

Unlike the last post, here the team *does* include the pitcher you're concerned with. We had to do it this way because presumably we *don't* have OAA data for the team without the pitcher. (If we did, we'd just subtract it from the team total and get the pitcher's number directly!)

It looks like 20% of a pitcher's discrepancy is attributable to his fielders. That number is for workloads similar to those in my sample -- around 175 IP. It does vary with playing time, but only slightly. At 320 IP, you can use 19% instead. At 40 IP, you can use 22%. Or, just use 20% for everyone, and you won't be too far wrong.
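
As a sketch, here's that shortcut as a hypothetical helper function. The name and the sign convention are mine: I'm assuming everything is expressed relative to league, with positive meaning fewer hits allowed than average, so T is team OAA per BIP and P is league BAbip minus the pitcher's BAbip.

def fielding_behind_pitcher(team_oaa_per_bip, pitcher_babip_vs_league,
                            team_weight=0.8, pitcher_weight=0.2):
    """Estimated fielding behind the pitcher, relative to league.

    team_oaa_per_bip: team OAA divided by team BIP (the team includes the pitcher).
    pitcher_babip_vs_league: league BAbip minus the pitcher's BAbip.
    """
    return team_weight * team_oaa_per_bip + pitcher_weight * pitcher_babip_vs_league

If you'd rather use the empirical coefficients below, pass 0.92 and 0.115 instead of the defaults.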

-------

Full disclosure: the real life numbers for 2017-19 are different. The theory is correct -- I wrote a simulation, and everything came out pretty much perfect. But on real data, not so perfect.

When I ran a linear regression to predict the fielding OAA behind the pitcher from team OAA and the pitcher's BAbip, the coefficient on the pitcher's BAbip didn't come out to 20%. It came out to only about 11.5%. The 95% confidence interval only brings it up to 15% or 16%.

The same thing happened for the formula from the last post: instead of the predicted 26%, the actual regression came out to 17.5%.
  
For the record, these are the empirical regression equations, all numbers relative to league:

F = 0.23*(Team BAbip without pitcher) + 0.175*P
F = 0.92*(Team OAA/BIP including pitcher) + 0.115*P

Why so much lower than expected? I'm pretty sure it's random variation. The empirical estimate of 11.5% is very sensitive to small variations in the seasonal balance of variation in pitching and fielding luck vs. talent -- so sensitive that the difference between 11.5 percent and 20 percent is not statistically significant. Also, the actual number changes from year to year because of random variation. So, I believe that the 20% number is correct as a long-term average, but for the seasons in the study, the actual number is probably somewhere between 11.5% and 20%.

I should probably explain that in a future post. But, for now, if you don't believe me, feel free to use the empirical numbers instead of my theoretical ones. Whether you use 11.5% or 20%, you'll still be much more accurate than using 100%, which is effectively what happens when you use the traditional method of assigning the overall team number equally to every pitcher.

Monday, January 11, 2021

Splitting defensive credit between pitchers and fielders (Part II)

(Part 1 is here.  This is Part 2.  If you want to skip the math and just want the formula, it's at the bottom of this post.)

------

When evaluating a pitcher, you want to account for how good his fielders were. The "traditional" way of doing that is, you scale the team fielding to the pitcher. Suppose a pitcher was +20 plays better than normal, and his team fielding was -5 for the season. If the pitcher pitched 10 percent of the team innings, you might figure the fielding cost him 0.5 plays, and adjust him from +20 to +20.5.

I have argued that this isn't right. Fielding performance varies from game to game, just like run support does. Pitchers with better ball-in-play numbers probably got better fielding during their starts than pitchers with worse ball-in-play numbers.

By analogy to run support: in 1972, Steve Carlton famously went 27-10 on a Phillies team that was 32-87 without him. Imagine how good he must have been to go 27-10 for a team that scored only 3.22 runs per game!

Except ... in the games Carlton started, the Phillies actually scored 3.76 runs per game. In games he didn't start, the Phillies scored only 3.03 runs per game. 

The fielding version of Steve Carlton might be Aaron Nola in 2018. A couple of years ago, Tom Tango pointed out the problem using Nola as an example, so I'll follow his lead.

Nola went 17-6 for the Phillies with a 2.37 ERA, and gave up a batting average on balls in play (BAbip) of only .254, against a league average of .295 -- that, despite an estimate that his fielders were 0.60 runs per game worse than average. If you subtract 0.60 runs per game from Nola's stat line, his pitching works out to the equivalent of an ERA in the 1s. As a result, Baseball-Reference winds up assigning Nola a WAR of 10.2, tied with Mike Trout for best in MLB that year.

But ... could Nola really have been hurt that much by his fielders? A BAbip of .254 is already exceptionally low. An estimate of -0.60 runs per game implies his BAbip with average fielders would have been .220, which is almost unheard of.

(In fairness: the Phillies 0.60 DRS fielding estimate, which comes from Baseball Info Solutions, is much, much worse than estimates from other sources -- three times the UZR estimate, for instance. I suspect there's some kind of scaling bug in recent BIS ratings, because, roughly, if you divide DRS by 3, you get more realistic numbers, and standard deviations that now match the other measures. But I'll save that for a future post.)

So Nola was almost certainly hurt less by his fielders than his teammates were, the same way Steve Carlton was hurt less by his hitters than his teammates were. But, how much less? 

Phrasing the question another way: Nola's BAbip (I will leave out the word "against") was .254, on a team that was .306, in a league that was .295. What's the best estimate of how his fielders did?

I think we can figure that out, extending the results in my previous post.

------

First, let's adjust for park. In the five years prior to 2018, the Phillies BAbip for both teams combined was .0127 ("12.7 points") better at Citizens Bank Park than in Phillies road games. Since only half of Phillies games were at home, that's 6.3 points of park factor. Since there's a lot of luck involved, I regressed 60 percent to the mean of zero (with a limit of 5 points of regression, to avoid ruining outliers like Coors Field), leaving the Phillies with 2.5 points of park factor.
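
In code, my reading of that park adjustment is something like this (the function name is mine, and I'm assuming the 5-point limit applies to the amount of the regression, as the parenthetical says):

def babip_park_factor(home_road_diff_points, regress=0.60, max_regression=5.0):
    """home_road_diff_points: size of the home-vs-road BAbip difference, in
    points, both teams combined (12.7 for the Phillies, 2013-2017)."""
    half = home_road_diff_points / 2          # only half the games are at home
    shrink = min(regress * half, max_regression)
    return half - shrink

print(round(babip_park_factor(12.7), 1))      # about 2.5 points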

Now, look at how the Phillies did with all the other pitchers. For non-Nolas, the team BAbip was .3141, against a league average of .2954 -- a difference of 18.7 points. Strip out the 2.5-point park benefit, and the Phillies were about 21 points worse than average.

How much of those 21 points came from below-average fielding talent? To figure that out, here's the SD breakdown from the previous post, but adjusted. I've bumped luck upwards for the lower number of BIP, dropped park down to 1.5 since we have an actual estimate, and increased the SD of pitching because the Phillies had more high-inning guys than average:

6.1 points fielding talent
3.9 points fielding luck
5.6 points pitching talent
6.8 points pitching luck
1.5 points park
---------------------------
11.5 points total

Of the Phillies' 21 points in BAbip, what percentage is fielding talent? The answer: (6.1/11.5)^2, or 28 percent. That's 5.9 points.

So, we assume that the Phillies' fielding talent was 5.9 points of BAbip worse than average. With that number in hand, we'll leave the Phillies without Nola and move on to Nola himself.

-------

On the raw numbers, Nola was 41 points better than the league average. But, we estimated, his fielders' talent was about 6 points worse than average, while his park helped him by 2.5 points, so he was really 44.5 points better.

For an individual pitcher with 700 BIP, here's the breakdown of SDs, again from the previous post:

 6.1  fielding talent
 7.6  fielding luck
 7.6  pitching talent
15.5  pitching luck
 3.5  park
---------------------
20.2  total

We have to adjust all of these for Nola.

First, fielding talent goes down to 5.2. Why? Because we estimated it from other data, and so we have less variance than if we just took the all-time average. (A simulation suggests we multiply the 6.1 by the ratio (SD without fielding talent)/(SD with fielding talent) from the "team without Nola" case -- here, 9.75/11.5, which gives 5.2.)

Fielding luck and pitching luck increase because Nola had only 519 BIP, not 700.

Finally, park goes to 1.5 for the same reason as before. 

 5.2 fielding talent
10.0 fielding luck  
 7.6 pitching talent
17.3 pitching luck
 1.5 park
--------------------
22.1 total

Convert to percentages:

 5.5% fielding talent
20.4% fielding luck
11.8% pitching talent
61.3% pitching luck
 0.5% park
---------------------
100% total

Multiply by Nola's 44.5 points:

 2.5 fielding talent 
 9.1 fielding luck
 5.3 pitching talent
27.3 pitching luck
 0.2 park
--------------------
44.5 total

Now we add in our previous estimates of fielding talent and park, to get back to Nola's raw total of 41 points:
 
-3.4 fielding talent [2.5-5.9]
 9.1 fielding luck
 5.3 pitching talent
27.3 pitching luck
 2.7 park            [0.2+2.5]
------------------------------
41 total

Consolidate fielding and pitching:

 5.6 fielding
32.6 pitching 
 2.7 park  
-------------          
41   total

Conclusion: The best estimate is that Nola's fielders actually *helped him* by 5.6 points of BAbip. That's about 3 extra outs in his 519 BIP. At 0.8 runs per out, that's 2.4 runs, in 212.1 IP, for about 0.24 WAR or 10 points of ERA.

Baseball-Reference had his fielders costing him 60 points of ERA; we have them helping him by about 10. Our estimate brings his WAR down from 10.3 to 9.1, or something like that. (Again, in fairness, most of that difference is the weirdly-high DRS estimate of 0.60. If DRS had him at a more reasonable .20, we'd have adjusted him from 9.4 to 9.1, or something.)

-------

Our estimate of +3 outs is ... just an estimate. It would be nice if we had real data instead. We wouldn't have to do all this fancy stuff if we had a reliable zone-based estimate specifically for Nola.

Actually, we do! Since 2017, Statcast has been analyzing batted balls and tabulating "outs above average" (OAA) for every pitcher. For Nola, in 2018, they have +2. Tom Tango told me Statcast doesn't have data for all games, so I should multiply the OAA estimate by 1.25. 

That brings Statcast to +2.5. We estimated +3. Not bad!

But Nola is just one case. And we might be biased in the case of Nola. This method is based on a pitcher of average talent. Nola is well above average, so it's likely some of the difference we attributed to fielding is really due to Nola's own BAbip pitching tendencies. Maybe instead of +3, his fielders were really +1 or something.

So I figured I'd better test other players too.

I found all pitchers from 2017 to 2019 that had Statcast estimates, with at least 300 BIP for a single team. There were a few players whose names didn't quite match my Lahman database, so I just let those go instead of fixing them. That left 342 pitcher-seasons. I assume almost all of them were starters.

For each pitcher, I ran the same calculation as for Nola. For comparison, I also did the "traditional" estimate where I gave the pitcher the same fielding as the rest of the team. Here are the correlations to the "gold standard" OAA:

r=0.37 this method
r=0.23 traditional

Here are the approximate root-mean-square errors (lower is better):

11.3 points of BAbip this method
13.7 points of BAbip traditional

This method is meant to be especially relevant for a pitcher like Nola, whose own BAbip is very different from his team's. Here are the root-mean-squared errors for pitchers who, like Nola, had a BAbip at least 10 plays better than their team's:

 9.3 points this method
11.9 points traditional 

And for pitchers at least 10 plays worse:

 9.3 points this method
10.9 points traditional

------

Now, the best part: there's an easy formula to get our estimates, so we don't have to use the messy sums-of-squares stuff we've been doing so far. 

We found that the original estimate for team fielding talent was 28% of observed-BAbip-without-pitcher. And then, our estimate for additional fielding behind that pitcher was 26% of the difference between that pitcher and the team. In other words, if the team's non-Nola BAbip (relative to the league) is T, and Nola's is P,

Fielders = .28T + .26(P-.28T)

The coefficients vary by numbers of BIPs. But the .28 is pretty close for most teams. And, the .26 is pretty close for most single-season pitchers: luck is 25% fielding, and talent is about 30% fielding, so no matter your proportion of randomness-to-skill, you'll still wind up between 25% and 30%.

Expanding that out gives an easier version of the fielding adjustment, which I'll print bigger.

------

Suppose you have an average pitcher, and you want to know how much his fielders helped or hurt him in a given season. You can use this estimate:

F = .21T + .26P 

Where: 

T is his team's BAbip relative to league for the other pitchers on the team, and

P is the pitcher's BAbip relative to league, and 

F is the estimated BAbip performance of the fielders, relative to league, when that pitcher was on the mound.
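
In code, with my own sign convention (everything as "BAbip minus league," so negative means fewer hits allowed than average), the estimate looks like this:

def fielding_estimate(team_babip_vs_league, pitcher_babip_vs_league):
    # team_babip_vs_league: team BAbip minus league BAbip, pitcher excluded
    # pitcher_babip_vs_league: the pitcher's BAbip minus league BAbip
    return 0.21 * team_babip_vs_league + 0.26 * pitcher_babip_vs_league

# Nola, 2018: the team without him was +18.7 points, Nola himself -41.4
print(round(fielding_estimate(18.7, -41.4), 1))              # about -6.8

# Park-adjusting both inputs first (the park is worth about 2.5 points)
# gives about -5.7, close to the 5.6 points of help from the full calculation
print(round(fielding_estimate(18.7 + 2.5, -41.4 + 2.5), 1))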


-----

Next: Part III, splitting team OAA among pitchers.




Labels: , , ,

Tuesday, December 29, 2020

Splitting defensive credit between pitchers and fielders (Part I)

(Update, 2020-12-29: This is take 2. I had posted this a few days ago, but, after further research, I tweaked the numbers and this is the result. Explanations are in the text.)

-----

Suppose a team has a good year in terms of opposition batted ball quality. Instead of giving up a batting average on balls in play (BAbip) of .300, their opponents hit only .280. In other words, they were .020 better than average in turning (inside-the-park) batted balls into outs. 

How much of those "20 points" was because of the fielders, and how much was because of the pitcher?

Thanks to previous work by Tom Tango, Sky Andrecheck, and others, I think we have what we need to figure this out. If you don't want to see the math or logic, just head to the last section of this post for the two-sentence answer.

------

In 2003, a paper called "Solving DIPS" (by Erik Allen, Arvin Hsu, Tom Tango, et al) did a great job of trying to establish what factors affect BAbip, and in what proportion. I did my own estimation in 2015 (having forgotten about the previous paper). I'll use my breakdown here. 

Looking at a large number of actual team-seasons, I found that the observed SD of BAbip was 11.2 points. I estimated the breakdown of SDs as:


 7.7  fielding talent
 2.5  pitching staff talent
 7.1  luck
 2.5  park
--------------------------
11.0  total

(If you haven't seen this kind of chart before, the "total" doesn't actually add up to the components unless you square them all. That's how SDs work -- when you have two independent variables, the SD of their sum is the square root of the sum of their squares.)
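For instance, that's how the chart above adds up: the square root of (7.7^2 + 2.5^2 + 7.1^2 + 2.5^2) is the square root of about 122, which is 11.0.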

OK, this is where I update a bit from the numbers in the previous version of this post.

First, I'm bumping the SD of park from 2.5 points to 3.5 points, to match Tango's numbers for 1999-2002.  Second, I'm bumping luck to 7.3, since that's the theoretical value (as I'll calculate later).  Third, I'm bumping the pitching staff to 4.3, because after checking, it turns out I made an incorrect mathematical assumption in the previous post.  Finally, fielding talent drops to 6.1 to make it all add up.  So the new breakdown:


 6.1  fielding talent
 4.3  pitching staff talent
 7.3  luck
 3.5  park
--------------------------
11.0  total


----

We can use that chart to break the team's 20-point advantage into its components. But ... we can't yet calculate how much of that 20 points goes to the fielders, and how much to the pitchers. Because, we have an entry called "luck". We need to know how to break down the luck and assign it to either side. 

Your first reaction might be -- it's luck, so why should we care? If we're looking to assign deserved credit, why would we want to assign randomness?

But ... if we want to know how the players actually performed, we *do* want to include the luck. We want to know that Roger Maris hit 61 home runs in 1961, even if it's undoubtedly the case that he played over his head in doing so. In this context, "luck" just means the team did somewhat better or worse than their actual talent. That's still part of their record.

Similarly here. If a team gets lucky in opponent BAbip, all that means is they did better than their talent suggests. But how much of that extra performance was the pitchers, giving up easier balls in play? And how much was the fielders, making more and better plays than expected?

That's easy to figure out if we have zone-type fielding stats, calculated by watching where the ball is hit (and sometimes how fast and at what angle), and figuring out the difficulty of every ball, and whether or not the fielders were able to turn it into an out. With those stats, we don't have to risk "blaming" a fielder for not making a play on a bloop single he really had no chance on. 

So where we have those stats, and they work, we have the answer right there, and this post is unnecessary. If the team was +60 runs on balls in play, and the fielders' zone ratings add up to +30, that's half-and-half, so we can say that the 20-point BAbip advantage was 10 points pitching and 10 points fielding.

But for seasons where we don't have the zone rating, what do we do, if we don't know how to split up the luck factor?

Interestingly, it will be the stats compiled by the Zone Rating people that allow us to calculate estimates for the years in which we don't have them.

------

Intuitively, the more common "easy outs" and "sure hits" are, the less fielders matter. In fact, if *all* balls in play were 0% or 100%, fielding performance wouldn't matter at all, and fielding luck wouldn't come into play. All the luck would be in what proportion of 0s and 100s the pitcher gave up. 

On the other hand, if all balls in play were exactly the league average of 30%, it would be the other way around. There would be no difference in the types of hits pitchers gave up, which means there would be no BAbip pitching luck at all. All the luck would be in whether the fielders handled more or fewer than 30% of the chances.

So: the more BIP are "near-automatic" hits or "near-automatic" outs, the more pitchers matter. The more BIP that could go either way, the more fielders matter.

That means we need to know the distribution of ball-in-play difficulty. And that's data we wouldn't have without the Zone Rating systems that now keep track of it. 

The data I'm using comes from Sky Andrecheck, who actually published it in 2009, but I didn't realize what it could do until now. (Actually, I'm repeating some of Sky's work here, because I got his data before I saw his analysis of it.  See also Tango's post at his old blog.)

Here's the distribution. Actually, I tweaked it just a tiny bit to make the average work out to .300 (.29987) instead of Sky's .310, for no other reason than I've been thinking .300 forever and didn't want to screw up and forget I need to use .310. Either way, the results that follow would be almost the same. 


43.0% of BIP:  .000 to  .032 chance of a hit*
23.0% of BIP:  .032 to  .140 chance of a hit
10.3% of BIP:  .140 to  .700 chance of a hit
 4.7% of BIP:  .700 to 1.000 chance of a hit
19.0% of BIP:          1.000 chance of a hit
---------------------------------------------
overall average: really close to .300

(*Within a group, the probability is uniform, so anything between .032 and .140 is equally likely once that group is selected.)


The SD of this distribution is around .397. Over 3900 BIP, which I used to represent a team-season, it's .00636. That's the SD of pitcher luck.

The random binomial SD of BAbip over 3900 BIP is the square root of (.3)(1-.3)/3900, which comes out to .00733. That's the SD of overall luck.

Since var(overall luck) = var(pitcher luck) + var(fielder luck), we can solve for fielder luck, which turns out to be .00367.


6.36 points pitcher luck (.00636)
3.67 points fielder luck (.00367)
--------------------------------
7.33 points overall luck (.00733)
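
Here's that calculation in Python, using the distribution table above (each group uniform between its endpoints, with the last group a point mass at 1.000; the code is just my sketch of it):

import math

groups = [                 # (weight, low, high): chance a BIP becomes a hit
    (0.430, 0.000, 0.032),
    (0.230, 0.032, 0.140),
    (0.103, 0.140, 0.700),
    (0.047, 0.700, 1.000),
    (0.190, 1.000, 1.000),
]
bip = 3900

mean = sum(w * (lo + hi) / 2 for w, lo, hi in groups)               # ~.300
ex2  = sum(w * (lo*lo + lo*hi + hi*hi) / 3 for w, lo, hi in groups)
sd_difficulty = math.sqrt(ex2 - mean**2)                            # ~.397

pitcher_luck = sd_difficulty / math.sqrt(bip)                       # ~.00636
overall_luck = math.sqrt(mean * (1 - mean) / bip)                   # ~.00733
fielder_luck = math.sqrt(overall_luck**2 - pitcher_luck**2)         # ~.00367

print(round(sd_difficulty, 3), round(pitcher_luck, 5),
      round(fielder_luck, 5), round(overall_luck, 5))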

If you square all the numbers and convert to percentages, you get


 75.3 percent pitcher luck
 24.7 percent fielder luck
--------------------------
100.0 percent overall luck

So there it is. BAbip luck is, on average, 75 percent pitching and 25 percent fielding. Of course, it varies randomly around that, but those are the averages.

What does that mean in practice? Suppose you notice that a team from the past, which you know has average talent in both pitching and fielding, gave up 20 fewer hits than expected on balls in play. If you were to go back and watch re-broadcasts of all 162 games, you'd expect to find that the fielders made 5 more plays than expected, based on what types of balls in play they were. And, you'd expect to find that the other 15 plays were the result of balls having been hit a bit easier to field than average.

Again, we are not estimating talent here: we are estimating *what happened in games*. This is a substitute for actually watching the games and measuring balls in play, or having zone ratings, which are based on someone else actually having done that. 

------

So, now that we know the luck breaks down 75/25, we can take our original breakdown, which was this:


 6.1  fielding talent
 4.3  pitching staff talent
 7.3  luck
 3.5  park
--------------------------
11.0  total

And split up the 7.3 points of luck as we calculated:


6.36 pitching luck
3.67 fielding luck
--------------------------
7.3  total luck

And substitute that split back in to the original:


 6.1  fielding talent
 3.67 fielding luck
 4.3  pitching staff talent
 6.36 pitching staff luck
 3.5  park
--------------------------
11.0  total

Since talent+luck = observed performance, and talent and luck are independent, we can consolidate each pair of "talent" and "luck" by summing their squares and taking the square root:


 7.1 fielding observed
 7.7 pitching observed 
 3.5 park
----------------------
11.0 total

Squaring, taking percentages, and rounding, we get

 42 percent fielding
 48 percent pitching
 10 percent park
--------------------
100 percent total 

If you're playing in an average park, or you're adjusting for park some other way, the park share doesn't apply, and you can say 


 47 percent fielding
 53 percent pitching
---------------------
100 percent total

So now we have our answer. If you see a team's stats one year that show them to have been particularly good or bad at turning batted balls into outs, on average, after adjusting for park, 47 percent of the credit goes to the fielders, and 53 percent to the pitchers.

But it varies. Some teams might have been 40/60, or 60/40, or even 120/-20! (The latter result might happen if, say, the fielders saved 24 hits, but the pitchers gave up harder BIPs that cost 4 extra hits.)

How can you know how far a particular team is from the 47/53 average? Watch the games and calculate zone ratings. Or, just rely on someone else's reliable zone rating. Or, start with 47/53, and adjust for what you know about how good the pitching and fielding were, relative to each other. Or, if you don't know, just use 47/53 as your estimate.

To verify empirically whether I got this right, find a bunch of published Zone Ratings that you trust, and see if they work out to about 42 percent of what you'd expect if the entire excess BAbip was allocated to fielding.  (I say 42 percent because I assume zone ratings correct for park.)

(Actually, I ran across about five years of data, and tried it, and it came out to 39 percent rather than 42 percent. Maybe I'm a bit off, or it's just random variation, or I'm way off and there's lots of variation.)

-------

So what we've found so far:

-- Luck in BAbip belongs 25% to fielders, 75% to pitchers;

-- For a team-season, excess performance in observed BAbip belongs 42% to fielders, 48% to pitchers, and 10% to park.

-------

That 42 percent figure is for a team-season only. For an individual pitcher, it's different. 

Here's the breakdown for an individual pitcher who allows 700 BIP for the season. 


 6.1  fielding talent
 7.6  pitching talent
17.3  luck
 3.5  park
---------------------------
20.2  total


The SD of pitching talent is larger now, because you're dealing with one specific pitcher, rather than the average of all the team's pitchers (who will partially offset each other, reducing variability). Also, luck has jumped from 7.3 points to 17.3, because of the smaller sample size.
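
(That 17.3 is just the team-season luck number scaled for the smaller sample: 7.33 times the square root of 3900/700 is about 17.3.)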

OK, now let's break up the luck portion again:


 6.1  fielding talent
 7.6  fielding luck
 7.6  pitching talent
15.5  pitching luck
 3.5  park
---------------------------
20.2  total

And consolidating:


 9.75 observed fielding
17.3  observed pitching
 3.5  park
---------------------------
20.2  total

Converting to percentages:


 23%  observed fielding
 74%  observed pitching
  3%  park
---------------------------
100%  total

If we've already adjusted for park, then

 24%  observed fielding
 76%  observed pitching
---------------------------
100%  total


So it's quite different for an individual pitcher than for a team season, because luck and talent break down differently between pitchers and fielders. 

The conclusion: if you know nothing specific about the pitcher, his fielders, his park, or his team, your best guess is that 25 percent of his BAbip (compared to average) came from how well his fielders made plays, and 75 percent of his BAbip comes from what kind of balls in play he gave up.

------

Here's the two-sentence summary. On average,

-- For teams with 3900 BIP, 47 percent of BABIP is fielding and 53 percent is pitching.

-- For starters with 700 BIP, 24 percent of BABIP is fielding and 76 percent is pitching.

------

Next: Part II, where I try applying this to pitcher evaluation, such as WAR.

Saturday, October 31, 2020

Calculating park factors from batting lines instead of runs

I missed a post Tango wrote back in 2019 about park factors. In the comments, he said,

"That’s one place where we failed with our park factors, using actual runs instead of "component" runs. They should be based on Linear Weights or RC or wOBA, something like that.

"Using actual runs means introducing unnecessary random variation in the mix."

Yup. One of those bits of brilliance that's obvious in retrospect.

The idea is, there's a certain amount of luck involved in turning batting events into runs, which depends on the sequence -- in other words, "clutch hitting," which is thought to be mostly random. If teams wind up scoring, say, 20 runs above average in a certain park, it could be that the park lends itself to higher offense. But, it could also be that the park is neutral, and those 20 runs just came from better clutch hitting.

So if we calculated park factors from raw batting lines, instead of actual runs, we eliminate that luck, and should get better estimates. We can still convert to expected runs afterwards.

Let's do it. I'll start with using runs as usual. Then, I'll do it for wOBA, and we'll compare.

-------

I used team-seasons from 2000-2019, except Coors Field (because it's so extreme an outlier). I included only parks that were used at least 16 of the 20 seasons. 

To get the observed park effects, I just took home scoring (both teams combined) and subtracted road scoring (both teams combined). 

For those 444 datapoints, I got

SD(observed) = 81.6 runs

To estimate luck, I used the rule of thumb that SD(runs) for a single team's games is about 3. (Tango uses the square root of total runs for both teams, but I didn't bother.)  

If SD(1 game) = 3, then SD(81 games) = 27. But we want both teams combined, so multiply by root 2. Then, we want (home - road), so multiply by root 2 again. That gives us 54.

SD(luck) = 54 runs

Since var(observed) = var(luck) + var(non-luck), we get*

SD(non-luck) = 61.2 runs

*"var" is variance, the square of SD. I'm using it instead of "SD^2" because it makes it much easier to read.

Now, what's this thing I called "non-luck"? It's a combination of the differences between parks, and season-to season differences within the same park -- weather, how well the players are suited to the park, the parks used by other teams in the division (because of the unbalanced schedule), the parks used by interleague opponents, the somewhat-random distribution of opposing pitchers ... stuff like that.

var(non-luck) = var(between parks) + var(within park)

To estimate SD(within park), I just looked at the observed SDs of the same park across the 16-20 seasons in the dataset. There were 23 parks in the sample, and I took the root-mean-square of those 23 individual SDs. I got

SD(different seasons of park) = 64.1

But ... that 64.1 includes luck, and we want only the non-luck portion. So let's remove luck:

var(diff. seas. of park)= var(luck) + var(within park)
64.1 squared = 54 squared  + var(within park)
SD(within park) = 34.5 runs

And now we can estimate SD(between parks):

var(non-luck) = var(between parks) + var(within park)
61.2 squared = var(between parks) + 34.5 squared
SD(between parks) = 50.5 runs

Summarizing:

81.6  runs total
---------------------------------
54    luck
50.5  between parks
34.5  within park between seasons

The between-park component squared is only 38 percent of the total squared. That means that only 38 percent of the observed park effect is real, and you have to regress the observed effect to the mean by 62 percent to get an unbiased estimate.

That's a lot. And it's one reason that most sites publish park factors based on more than one season, to give luck a chance to even out.
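
Here's the decomposition as a Python sketch, with the three inputs from above (the variable names are mine):

import math

sd_observed  = 81.6   # SD of (home minus road) runs across park-seasons
sd_luck      = 54.0   # from SD(runs) of 3 per game, 81 games, two factors of root 2
sd_same_park = 64.1   # RMS of each park's season-to-season SD (still includes luck)

sd_non_luck = math.sqrt(sd_observed**2 - sd_luck**2)        # ~61.2
sd_within   = math.sqrt(sd_same_park**2 - sd_luck**2)       # ~34.5
sd_between  = math.sqrt(sd_non_luck**2 - sd_within**2)      # ~50.5

real_share = sd_between**2 / sd_observed**2                 # ~0.38
print(round(sd_between, 1), round(1 - real_share, 2))       # regress ~62% to the mean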

-------

Now, let's try Tango's suggestion to use wOBA instead, and see how much luck that squeezes out.

For the same individual parks, I calculated every year's observed park difference the same way as for runs -- home wOBA minus road wOBA, both teams combined.

For the sample, SD(observed) was 0.01524, against an average wOBA of .3248. That's a ratio of 4.7%. I did a regression and found runs-per-PA increases 1.8x as fast as wOBA (probably proportional to the 1.77th power, or something), so 4.7% in wOBA is 8.45% in runs.

In the full sample, there were .118875 runs per PA, and an average 6207 PA for each home park-season. That's about 738 runs. Taking 8.45 percent of that works out to an SD of 67.3 runs.

SD(observed) = 67.3 runs

The luck SD for wOBA for a single PA is .532 (as calculated from an average batting line APBA card). I'll spare you repeating the percentage calculations, but for 6207 PA,

SD(luck) = 41.9 runs

As before, var(observed) = var(luck) + var(non-luck), so

SD(non-luck) = 52.7 runs

Looking at the RMS between-season SD of the 23 teams in the sample, 

SD(different seasons of park) = 51.2 runs

Eliminating luck to get true season-to-season differences:

var(diff. seas. of park)= var(luck) + var(within park)
51.2 squared = 41.9 squared  + var(within park)
SD(within park) = 29.4 runs

And, finally,

var(non-luck) = var(between parks) + var(within park)
52.7 squared  = var(between parks) + 29.4 squared
SD(between parks) = 43.7 runs

The summary:

67.3  runs total
---------------------------------
41.9  luck
43.7  between park
29.4  within park between seasons

Here the "between park" variance is 42 percent of the total, up from 38 percent when we used runs. So we have, in fact, gotten more accurate estimates.

------

But wait! The two methods really should give us the same estimate of the SD of the "between" and "within" park factors, since they're trying to measure the same thing. But they don't:

runs  wOBA
-----------------------------------------
81.6  67.3   runs total
-----------------------------------------
54    41.9   luck
50.5  43.7   between park
34.5  29.4   within park between seasons

(The "luck" SD is supposed to be different, since that was the whole purpose of using wOBA, to eliminate some of the random noise.)

I think the difference is due to the fact that the wOBA variances were all based on averages per PA, while the runs variances were based on averages per game (roughly, per 27 outs).

On average, the more runs you score, the more PA you'll have. So changing the denominator to PA reduces the high-scoring games relative to the low-scoring games, which compresses the differences, which reduces the SD. 

Although the differences in PA look small, they actually indicate large differences in scoring. Because, per season, every park gets roughly the same number of outs, which means roughly the same number of PA that are outs. So any "extra PA" are mostly baserunners, and very valuable in terms of runs.

If you switch from "observed runs per game" to "observed runs per 6207 PA," the observed SD drops from 81.6 to 72.7 runs.  That's an 11 percent drop. When I did the same for wOBA, the observed SD dropped by 13 percent. So, let's estimate that the difference between "per game" and "per PA" is 12 percent, and reduce everything in the runs column by 12 percent:

runs  wOBA
--------------------------------------------
71.8  67.3   runs total
--------------------------------------------
47.5  41.9   luck
44.4  43.7   between park
30.4  29.4   within park between seasons
--------------------------------------------
62%   58%    regression to long-term mean 

I'm not 100% sure this is legitimate, but it's probably pretty close. One thing I want to do to make the comparison better is to use the same value for "between park" and "within park", since we expect the methods to produce the same estimate, and we expect that any difference is random (in things like wOBA to run conversion, or how PA vary between games, or the fact that the wOBA calculation omits factors like baserunning).

So after my manual adjustment, we have:

runs  wOBA
--------------------------------------------
71.4  67.8   runs total
--------------------------------------------
47.5  41.9   luck
44    44     between park
30    30     within park between seasons
--------------------------------------------
62%   58%    regression to long-term mean 

-------

That's still a fair bit you have to regress either way -- more than half -- but that would be reduced if you used more than one season in your sample. If we go to the average of four seasons, "luck" and "within park" both get cut in half (the square root of 1/4). 

I'll divide both of those by 2, and recalculate the top and bottom line:

runs  wOBA
--------------------------------------------
52.3  51.0   runs total
--------------------------------------------
24    21     luck
44    44     between park
15    15     within park between seasons
--------------------------------------------
29%   26%    regression to long-term mean 

So if we use a four-year park average, we should only have to regress 29 percent (for runs) or 26 percent (for wOBA). 

-------

Thanks to Tango for the wOBA data making this possible, and for other observations I'm saving for a future post.

My three previous posts on park factors are here:  one two three

Thursday, August 27, 2020

Charlie Pavitt: Open the Hall of Fame to sabermetric pioneers

This guest post is from occasional contributor Charlie Pavitt. Here's a link to some of Charlie's previous posts.

-----

Induction into the National Baseball Hall of Fame (HOF) is of course the highest honor available to those associated with the game.  When one thinks of the HOF, one first thinks of the greatest players, such as the first five inductees in 1936 (Cobb, Johnson, Mathewson, Ruth, and Wagner). But other categories of contributors were added almost immediately; league presidents (Morgan Bulkeley, Ban Johnson) and managers (Mack, McGraw) plus George Wright in 1937, pioneers (Alexander Cartwright and Henry Chadwick) in 1938, owners (Charles Comiskey) in 1939, umpires (Bill Klem) and what would now be considered general managers (Ed Barrow) in 1953, and even union leaders (Marvin Miller, this year for induction next year). There is an additional type of honor associated with the HOF for contributions to the game; the J. G. Taylor Spink Award (given by the Baseball Writers Association of America) annually since 1962, the Ford C. Frick Award for broadcasters annually since 1978, and thus far five Buck O’Neil Lifetime Achievement Awards given every three years since 2008.  Even songs get honored ("Centerfield", 2010; "Talkin' Baseball", 2011).

But what about sabermetricians? Are they not having a major influence on the game?  Are there not some who are deserving of an honor of this magnitude?

I am proposing that an honor analogous to the Spink, Frick, and O’Neil awards be given to sabermetricians who have made significant and influential contributions to the analytic study of baseball. I would have called it the Henry Chadwick Award to pay tribute to the inventor of the box score, batting average, and earned run average, but SABR has already reserved that title for its award for research contributions, a few of which have gone to sabermetricians but most to other contributors. So instead I will call it the F. C. Lane award, not in reference to Frank C. Lane (general manager of several teams in the 1950s and 1960s) but rather Ferdinand C. Lane, editor of the Baseball Magazine between 1911 and 1937. Lane wrote two articles for the publication ("Why the System of Batting Should Be Reformed," January 1917, pages 52-60; "The Base on Balls," March 1917, pages 93-95) in which he proposed linear weight formulas for evaluating batting performance, the second of which is remarkably accurate.

I shall now list those whom I think have made "significant and influential contributions to the analytic study of baseball" (that phrase was purposely worded in order to delineate the intent of the award). The HOF began inductions with five players, so I will propose who I think should be the first five recipients:


George Lindsey

Between 1959 and 1963, based on data from a few hundred games either he or his father had scored, George Lindsey published three academic articles in which he examined issues such as the stability of the batting average, average run expectancies for each number of outs during an inning and for different innings, the length of extra-inning games, the distribution of total runs for each team in a game, the odds of winning games with various leads in each inning, and the value of intentional walks and base stealing. It was revolutionary work, and opened up areas of study that have been built upon by generations of sabermetricians since.

 

Bill James

Starting with his first Baseball Abstract, self-published back in 1977, James built up an audience that resulted in the Abstract becoming a conventionally-published best seller between 1982 and 1988.  During those years, he proposed numerous concepts – to name just three, Runs Created, the Pythagorean Equation, and the Defensive Spectrum – that have influenced sabermetric work ever since.  But at least as important, if not more so, were his other contributions.  He proposed and got off the ground Project Scoresheet, the first volunteer effort to compile pitch-by-pitch data for games to be made freely available to researchers; this was the forerunner and inspiration for Retrosheet. During the same years as the Abstract was conventionally published, he oversaw a sabermetric newsletter/journal, the Baseball Analyst, which provided a pre-Internet outlet for amateur sabermetricians (including myself) who had few if any other opportunities to get their work out to the public.  Perhaps most importantly, his work was the first serious sabermetric (a term he coined) analysis many of us saw, and served as an inspiration for us to try our hand at it too. I might add that calls for James to be inducted into the Hall itself can be found in a New York Times article from January 20, 2019 by Jamie Malinowski and on the Last Word on Baseball website by its editor Evan Thompson.


Pete Palmer

George Lindsey’s work was not readily available. The Hidden Game of Baseball, written by Palmer and John Thorn, was, and included both a history of previous quantitative work and advancement on that work in the spirit of Lindsey’s. Palmer’s use of linear-weight equations to measure offensive performance and of run expectancies to evaluate strategy options was not entirely new, as Lane and Lindsey had respectively been first, but it was Palmer’s presentation that served to familiarize those that followed with these possibilities, and as with James these were inspirations to many of us to try our hands at baseball analytics ourselves.  Probably the most important of Palmer’s contributions has been On-base Plus Slugging (OPS), one of the few sabermetric concepts to have become commonplace on baseball broadcasts.

 

David Smith

I’ve already mentioned Project Scoresheet, which lasted as a volunteer organization from 1984 through 1989. I do not wish to go into its fiery ending, a product of a fight about conflict of interest and, in the end, money.  Out of its ashes like the proverbial phoenix rose Retrosheet, the go-to online source for data describing what occurred during all games dating back to 1973, most games back to 1920, and some from before then. Since its beginning, those involved with Retrosheet have known not to repeat the Project’s errors and have made data freely available to everyone even if the intended use for that data is personal financial profit. Dave Smith was the last director of Project Scoresheet, the motivator behind the beginning of Retrosheet, and the latter’s president ever since. Although it is primed to continue when Dave is gone, Retrosheet’s existence would be inconceivable without him.  Baseball Prospectus’s analyst Russell Carleton, whose work relies on Retrosheet, has made it clear in print that he thinks that Dave should be inducted into the Hall itself.

 

Sean Forman

It is true that Forman copied from other sources, but no matter; it took a lot of work to begin what is now the go-to online source for data on seasonal performance. Baseball Reference began as a one-man sideline for an academic, and has become home to information about all American major team sports plus world-wide info on “real” football. 

 

-----

Here are two others that I believe should eventually be recipients.

Sherri Nichols

Only two women have received HOF-related awards; Claire Smith is a past winner of the Spink Award and Rachel Robinson is a recipient of the O’Neil Award.  Sherri Nichols would become the third. I became convinced that she deserved it after reading Ben Lindbergh’s tribute, and recommend it for all interested in learning about the "founding mother" of sabermetrics. I remember when the late Pete DeCoursey (I was scoring Project Scoresheet Phillies games and he was our team captain) proposed the concept of Defensive Average, for which (as Lindbergh’s article noted) Nichols did the computations. This was revolutionary work at that time, and laid the groundwork for all of the advanced fielding information we now have at our disposal.

 

Tom Tango

Tango has had significant influence on many areas of sabermetric work, two of which have joined Palmer’s OPS as commonplaces on baseball-related broadcasts. Wins Above Replacement (WAR) was actually Bill James’s idea, but James never tried to implement it. Tango has helped define it, and his offensive index wOBA is at the basis of the two most prominent instantiations, those from Baseball Reference (alternatively referred to as bWAR and rWAR) and FanGraphs (fWAR).  Leverage was an idea whose time had come, as our blogmaster Phil Birnbaum came up with the same concept at about the same time, but it was Tango’s usage that became definitive. His Fielding Independent Pitching (FIP) corrective to weaknesses in ERA is also well-known and often used. Tango currently oversees data collection for MLB Advanced Media, and has done definitive work on MLBAM’s measurement of fielding (click here for a magisterial discussion of that topic).

There are some historical figures that might be deserving; Craig Wright, Dick Cramer, and Allan Roth come to mind as possibilities. Maybe even Earnshaw Cook, as wrong as he was about just about everything, because of what he was attempting to do without the data he needed to do it right (see his Percentage Baseball book for a historically significant document). Perhaps the Award could also go to organizations as a whole, such as Baseball Prospectus and FanGraphs; if so, SABR should get it first.



Wednesday, August 05, 2020

The NEJM hydroxychloroquine study fails to notice its largest effect

Before hydroxychloroquine was a Donald Trump joke, the drug was considered a promising possibility for prevention and treatment of Covid-19. It had been previously shown to work against respiratory viruses in the lab, and, for decades, it was safely and routinely given to travellers before departing to malaria-infested regions. A doctor friend of mine (who, I am hoping, will have reviewed this post for medical soundness before I post it) recalls having taken it before a trip to India.

Travellers start on hydroxychloroquine two weeks before departure; this gives the drug time to build up in the body. Large doses at once can cause gastrointestinal side effects, but since hydroxychloroquine has a very long half-life in the body -- three weeks or so -- you build it up gradually.

For malaria, hydroxychloroquine can also be used for treatment. However, several recent studies have found it to be ineffective at treating advanced Covid-19.

That leaves prevention. Can hydroxychloroquine be used to prevent Covid-19 infections? The "gold standard" would be a randomized double-blind placebo study, and we got one a couple of months ago, in the New England Journal of Medicine (NEJM). 

It found no statistically significant difference between the treatment and placebo groups, and concluded

"After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19 or confirmed infection when used as postexposure prophylaxis within 4 days after exposure."

But ... after looking at the paper in more detail, I'm not so sure.

-------

The study reported on 821 subjects who had been exposed, within the past four days, to a patient testing positive for Covid-19. They received a dose of either hydroxychloroquine or placebo for the next five days (the first day was a higher "loading dose"), and were followed over the next couple of weeks to see if they contracted the virus.

The results:

49 of 414 treatment subjects (11.8%) became infected
58 of 407   placebo subjects (14.3%) became infected.

That's about 17 percent fewer cases in patients who got the real drug. 

But that wasn't a large enough difference to show statistical significance, with only about 400 subjects in each group. The paper recognizes that, stating the study was designed only with sufficient power to find a reduction of at least 50 percent, not the 17 percent reduction that actually appeared. Still, by the usual academic standards for this sort of thing, the authors were able to declare that "hydroxychloroquine did not prevent illness."

At this point I would normally rant about statistical significance and how "absence of evidence is not evidence of absence."  But I'll skip that, because there's something more interesting going on.

------

Recall that the study tested hydroxychloroquine on subjects who feared they were already exposed to the virus. That's not really testing prevention ... it's testing treatment, albeit early treatment. It does have elements of prevention in it, as perhaps the subjects may not have been infected at that point, but would be infected later. (The study doesn't say explicitly, but I would assume some of the exposures were to family members, so repeated exposures over the next two weeks would be likely.)

Also: it did take five days of dosing until the full dose of hydroxychloroquine was taken. That means the subject didn't get a full dose until up to nine days after exposure to the virus.

So this is where it gets interesting. Here's Figure 2 from the paper:



These lines are cumulative infections during the course of the study. As of day 5, there were actually more infections in the group that took hydroxychloroquine than in the group that got the placebo ... which is perhaps not that surprising, since the subjects hadn't finished their full doses until that fifth day. By day 10, the placebo group has caught up, but the groups are still about equal.

But now ... look what happens from Day 10 to Day 14. The group that got the hydroxychloroquine doesn't move much ... but the placebo group shoots up.

What's the difference in new cases? The study doesn't give the exact numbers that correspond to the graph, so I used a pixel ruler to measure the distances between points of the graph. It turns out that from Day 10 to Day 14, they found:

-- 11 new infections in the placebo group
--  2 new infections in the hydroxychloroquine group.

What is the chance that of 13 new infections, they would get split 11:2? 

About 1.12 percent one-tailed, 2.24 percent two-tailed.
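
That's just a binomial tail probability. Here's the quick check in Python, assuming each new infection was equally likely to land in either group (the groups were 414 and 407 subjects, close enough to equal):

from math import comb

n, k = 13, 11                  # 13 late infections, 11 of them in the placebo group
one_tailed = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(one_tailed, 2 * one_tailed)    # about 0.0112 and 0.0224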

Now, I know that it's usually not legitimate to pick specific findings out of a study ... with 100 findings, you're bound to find one or two random ones that fall into that significance level. But this is not an arbitrary random pattern -- it's exactly what we would have expected to find if hydroxychloroquine worked as a preventative. 

It takes, on average, about a week for symptoms to appear after COVID-19 infection. So for those subjects in the "1-5" group, most were probably infected *before* the start of their hydroxychloroquine regimen (up to four days before, as the study notes). So those don't necessarily provide evidence of prevention. 

In the "6-10" group, we'd expect most of them to have been already infected before the drugs were administered; the reason they were admitted to the study in the first place was because they feared they had been exposed. So probably many of those who didn't experience symptoms until, say, Day 9, were already infected but had a longer incubation period. Also, most of the subsequently-infected subjects in that group probably got infected in the first five days, while they didn't have a full dose of the drug yet.

But in the last group, the "11-14" group, that's when you'd expect the largest preventative effect -- they'd have had a full dose of the drug for at least six days, and they were the most likely to have become infected only after the start of the trial.

And that's when the hydroxychloroquine group had an 84 percent lower infection rate than the placebo group.

------

In everything I've been reading about hydroxychloroquine and this study, I have not seen anyone notice this anomaly, that beyond ten days, there were almost seven times as many infections among those who didn't get the hydroxychloroquine. In fact, even the authors of the study didn't notice. They stopped the study on the basis of "futility" once they realized they were not going to achieve statistical significance (or, in other words, once they realized the reduction in infections was much less than the 50% minimum they would endorse). In other words: they stopped the study just as the results were starting to show up! 

And then the FDA, noting the lack of statistical significance, revoked authorization to use hydroxychloroquine.

I'm not trying to push hydroxychloroquine here ... and I'm certainly not saying that I think it will definitely work. If I had to give a gut estimate, based on this data and everything else I've seen, I'd say ... I dunno, maybe a 15 percent chance. Your guess may be lower. Even if your gut says there's only a one-in-a-hundred chance that this 84 percent reduction is real and not a random artifact ... in the midst of this kind of pandemic, isn't even 1 percent enough to say, hey, maybe it's worth another trial?

I know hydroxychloroquine is considered politically unpopular, and it's fun to make a mockery of it. But these results are strongly suggestive that there might be something there. If we all agree that Trump is an idiot, and even a stopped clock is right twice a day, can we try evaluating the results of this trial on what the evidence actually shows? Can't we elevate common sense over the politics of Trump, and the straitjacket of statistical significance, and actually do some proper science?



