Here are the 1961 pitching lines of Whitey Ford and Jack Kralick that Bill James wrote about back in 2019:
          IP    W-L     H    K   BB   ERA
Ford     283   25- 4  242  209   92  3.21
Kralick  242   13-11  257  101   97  3.61
Kralick's season is decent, but clearly no match for Whitey, who has him beat in every category.
But, surprisingly, Baseball Reference has Kralick leading the American League with a WAR of 6.0. Whitey Ford, on the other hand, is 12th with only 3.7 WAR.
          IP    W-L     H    K   BB   ERA   WAR
Ford     283   25- 4  242  209   92  3.21   3.7
Kralick  242   13-11  257  101   97  3.61   6.0
What's going on?
The answer, I think, is that the fielding and park adjustments WAR uses are inflated. That's for a future post. This post is about how, while trying to figure out what happened, I think I found an issue with the adjustments that happens, by chance, to hit Whitey Ford's 1961 Yankees especially hard.
-----
B-R uses Sean Smith's "Total Zone Rating" (TZR) to estimate fielding and calculate defensive WAR (dWAR). For seasons before 1989, TZR is based on Retrosheet data, which, for most games, includes information on the type and location of balls in play (BIP). Basing dWAR on TZR means that for the team as a whole, the defensive evaluation is roughly equivalent to what you'd see just looking at batting average on balls in play (BABIP).
There are other factors too -- baserunner advancement, catcher arm -- but it's mostly BABIP. To confirm that, I ran a regression to predict dWAR based on BABIP (compared to league) with a dummy variable for franchise (which Sean's website says TZ adjusts for to take outfield park variation into account). For the years 1960-73, the correlation was high (r-squared .77), with the coefficient of BABIP coming out very close to the win value of turning outs into hits. For the 14 Yankee seasons specifically, the correlation was over 0.9.
So it seems like most of dWAR up to 1988 is BABIP.
-----
In 1961, the Yankees allowed an opposition BABIP of .261, compared to the AL average .275. That's an advantage of .014, or "14 points".
Yankee pitchers allowed 4414 balls in play that year. So the extra .014 represents about 62 hits turned into outs. I use 0.8 as the run value of each of those outs, so that's 49.4 runs. Call it 50 for short.
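Here's that arithmetic as a quick Python sketch. The inputs are the figures quoted above; the function and variable names are just mine, for illustration.

```python
# Figures quoted in the post for the 1961 Yankees.
TEAM_BABIP = 0.261
LEAGUE_BABIP = 0.275      # AL average
BALLS_IN_PLAY = 4414
RUN_VALUE = 0.8           # the post's run value for a hit turned into an out


def babip_runs_saved(team_babip, league_babip, balls_in_play, run_value):
    """Return (hits turned into outs, runs saved) relative to the league."""
    extra_outs = (league_babip - team_babip) * balls_in_play
    return extra_outs, extra_outs * run_value


extra_outs, runs_saved = babip_runs_saved(
    TEAM_BABIP, LEAGUE_BABIP, BALLS_IN_PLAY, RUN_VALUE
)
print(f"hits turned into outs: {extra_outs:.1f}")
print(f"runs saved:            {runs_saved:.1f}")
```

Swapping in a different baseline (say, the combined MLB BABIP instead of the AL's) only means changing LEAGUE_BABIP.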
-----
As an aside: Baseball Reference has the 1961 Yanks at 72 runs, not 50. Why such a big difference? I'm not sure.
Maybe they use MLB instead of AL as their baseline. That would add 14 runs or so, because in 1961 the BABIP for both leagues combined was .279 instead of the AL-only .275.
It could also be a result of TZR not including popups and line drives in its evaluation (because presumably there's not much difference in fielding those). It also adds measures of outfielder arms (by looking at baserunner advancement), double play ability, and caught stealings for catchers. And there's that franchise adjustment. All those might contribute to the difference.
But there's another anomaly in the raw data. The MLB average for the 1961 season was almost +12 defensive runs per team. You'd think the average would have to be zero, by definition. It could be that the system uses an average based on a large number of seasons, and 1961 just happened to be a great year for fielding. But that doesn't seem like it could be the answer. For 1959-1967, the MLB total defensive runs saved is positive every one of those nine seasons. Then, it switches over: from 1968 to 1975, every season total is negative. It doesn't seem plausible to me that MLB spent nine years with good fielders, then the next eight with worse fielders.
Whatever it is ... over all 17 years, 1961 is the biggest outlier. The total fielding runs for 1961 is +214, which is +11.9 per team. The next highest, 1960, is only +146. None of the other positives break 100, and the largest negative is -71 in 1971.
Anyway, that still isn't my main point; it's just something that I noticed.
------
OK, now the interesting part.
As I said, the 1961 Yankees' opponents' BABIP was 14 points lower than the league. All things being equal, you'd expect the fielding to be equally good at home and on the road -- about the same 14 points either way.
But the Yankees BABIP advantage was much higher at home.
Overall, AL teams were 8 points better at home than on the road. But the Yankees were 37 points better:
        NYY     AL
------------------
home   .242   .271
road   .279   .279
------------------
diff   .037   .008
On the road, the Yankee fielders were the same as the AL average, holding opponents to a .279 BABIP. At home, though, they were 29 points better than the league.
So, in effect, all 50 runs the Yankees fielders saved via BABIP were saved at Yankee Stadium.
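As a rough check on that, here's a sketch that converts the home and road BABIP gaps into runs separately. The post doesn't give home/road balls-in-play counts, so the 50/50 split below is my assumption, and the names are illustrative only.

```python
# Opposition BABIP from the table above.
HOME = {"NYY": 0.242, "AL": 0.271}
ROAD = {"NYY": 0.279, "AL": 0.279}

TOTAL_BIP = 4414
RUN_VALUE = 0.8
HOME_SHARE = 0.5   # ASSUMPTION: half the balls in play came at home


def points(babip_gap):
    """Express a BABIP gap in 'points' (thousandths)."""
    return round(babip_gap * 1000)


home_edge = HOME["AL"] - HOME["NYY"]
road_edge = ROAD["AL"] - ROAD["NYY"]

home_runs = home_edge * TOTAL_BIP * HOME_SHARE * RUN_VALUE
road_runs = road_edge * TOTAL_BIP * (1 - HOME_SHARE) * RUN_VALUE

print(f"home: {points(home_edge)} points better, about {home_runs:.0f} runs")
print(f"road: {points(road_edge)} points better, about {road_runs:.0f} runs")
```

Under that assumption the home gap alone comes to roughly 51 runs and the road gap to zero, which lines up with the 50 runs from the full-season number.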
Why does that matter? Because those 50 runs are going to be *double counted* against Yankee pitchers and their WAR.
First, those 50 runs will be attributed to the skill of the Yankee fielders. Whitey Ford's WAR will drop, because it appears the fielders behind him were responsible for turning so many of his balls in play into outs.
Second, those same 50 runs are going to be used in calculating the Park Factor, which is based on actual runs scored (by both teams). With 50 fewer runs scored at Yankee Stadium because of BABIP, and no fewer runs scored on the road, the calculation will implicitly attribute those 50 runs to the park and the park factor will drop. Whitey Ford's WAR again will drop because he pitches in a park where it appears to be easier to prevent runs.
The adjustment for BABIP is made twice: the first time it's attributed to the fielders, and the second time it's attributed to the park. But it can't be both. At least not fully both -- it could be 50% fielding and 50% park, but the WAR method treats it as 100% fielding and 100% park.
Specifically, according to Baseball Reference, the dWAR calculation credits the Yankee defense with 0.43 runs per game behind Whitey Ford. Over Whitey's 283 innings, that's 13.5 runs. At 10 runs per win, that's 1.35 WAR. At 8.5 runs per win -- which is what B-R seems to be using for the 1961 American League -- it's 1.6 WAR.
Whitey is being adjusted, implicitly, by 3.2 WAR instead of 1.6. Turning that double-counting back into single-counting would raise Whitey from 3.7 to 5.3, which seems much more reasonable for his performance.
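The arithmetic behind those numbers, as a short sketch. All inputs are the figures quoted above (the 8.5 runs per win is my reading of what B-R uses, as noted); the variable names are just for illustration.

```python
# Figures quoted in the post.
DEFENSE_RUNS_PER_GAME = 0.43   # credit to the Yankee defense behind Ford
FORD_INNINGS = 283
FORD_LISTED_WAR = 3.7
RUNS_PER_WIN = 8.5             # what B-R seems to use for the 1961 AL

# Convert innings to nine-inning games, then runs, then wins.
defense_runs = DEFENSE_RUNS_PER_GAME * FORD_INNINGS / 9
defense_war = defense_runs / RUNS_PER_WIN

# If the same amount is also removed through the park factor,
# undoing that second deduction adds it back once.
single_counted_war = FORD_LISTED_WAR + defense_war

print(f"defense credit: {defense_runs:.1f} runs = {defense_war:.1f} WAR")
print(f"Ford, single-counted: {single_counted_war:.1f} WAR")
```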
-----
Except ... not quite. My calculation assumed that park factor is based on a single season's runs. It's actually the average of three seasons -- in this case, 1960, 1961, and 1962.
So only a third of the Yankee defense is being double-counted in 1961. That means you'd only adjust Whitey for 0.5 WAR, not the full 1.6. That brings him only to 4.2.
However, the remaining 1.1 WAR or so is still double-counted: it's just that one third of the total is moving to 1960, and one third is moving to 1962.
That means that Whitey will be shorted 0.5 WAR in 1960, and again in 1962. If you fixed that, his 1960 would go from 2.0 to 2.5, and his 1962 from 5.1 to 5.6.
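Here's a sketch of how the three-year park factor spreads the double-count around. It assumes, as I did above, that the same 1.6 WAR figure applies to Whitey in all three seasons (it ignores his different innings totals in 1960 and 1962), and it only tracks the 1961 home/road split, not whatever splits 1960 and 1962 had on their own.

```python
DOUBLE_COUNT_WAR = 1.6                  # Ford's 1961 fielding adjustment
PARK_FACTOR_YEARS = (1960, 1961, 1962)  # seasons averaged into the park factor
LISTED_WAR = {1960: 2.0, 1961: 3.7, 1962: 5.1}

# Each season in the three-year window absorbs an equal share.
share = DOUBLE_COUNT_WAR / len(PARK_FACTOR_YEARS)

corrected = {year: LISTED_WAR[year] + share for year in PARK_FACTOR_YEARS}

for year in PARK_FACTOR_YEARS:
    print(f"{year}: {LISTED_WAR[year]:.1f} -> {corrected[year]:.1f}")
```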
Over Whitey's career, though, the Yankees' overall home BABIP (compared to road) will indeed wind up double counted towards his total WAR (with the exception of his first two and last two years, which will be "1.3-counted" or "1.6-counted").
------
I think this is something that will happen all the time, if my understanding is correct of how pitching WAR is calculated.
Any difference between home and road fielding will be counted as part of the park factor adjustment in addition to being counted as a fielding adjustment. If BABIP is better at home, the pitcher will be debited twice. If BABIP is worse at home, the pitcher will be credited twice.
BABIP is, like any other stat, subject to random variation. By my calculation, the SD of luck for home-minus-road BABIP is about 14 points, or 24 runs for a team-season. That's a lot. Whitey Ford pitched about 19.5 percent of his team's innings in 1961, so the SD of his random BABIP luck is about 5 runs. (The 1961 number looks like it was double that, or 2 SD, assuming no park effects. Whitey was double counted by about 10 runs by raw BABIP, and a little more than that by the dWAR calculation.)
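One way to get numbers in that range is a simple binomial model, sketched below. This is an assumption about the method, not necessarily the exact calculation behind the figures above: it treats every ball in play as an independent trial at the league BABIP, and splits the team's balls in play evenly between home and road.

```python
import math

LEAGUE_BABIP = 0.275
TEAM_BIP = 4414
HOME_BIP = TEAM_BIP / 2    # ASSUMPTION: even home/road split
ROAD_BIP = TEAM_BIP / 2
RUN_VALUE = 0.8
FORD_SHARE = 0.195         # Ford's share of team innings, from the post

# Binomial variance of a BABIP estimate over n balls in play is p(1-p)/n;
# the variance of (home minus road) is the sum of the two.
p = LEAGUE_BABIP
sd_diff = math.sqrt(p * (1 - p) / HOME_BIP + p * (1 - p) / ROAD_BIP)

# A home-minus-road gap applies to the home balls in play only.
sd_team_runs = sd_diff * HOME_BIP * RUN_VALUE
sd_ford_runs = sd_team_runs * FORD_SHARE

print(f"SD of home-minus-road BABIP: {sd_diff * 1000:.1f} points")
print(f"SD in team runs:             {sd_team_runs:.1f}")
print(f"SD in runs behind Ford:      {sd_ford_runs:.1f}")
```

Under those assumptions the SD comes out a bit over 13 points, about 24 team runs, and a little under 5 runs for Ford.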
Now, dWAR does remove popups and line drives from consideration ... that will reduce the luck SD (I'm not sure how much) compared to raw BABIP. But even if the SD drops from 0.5 WAR to 0.4, or something, that's still pretty big.
-------
We could just adjust dWAR for this (the BABIP home/road numbers are readily available on B-R). But I think the adjustments for fielding and park are exaggerated in other ways -- as I wrote about in previous sets of posts.
A different algorithm for adjusting pitcher WAR -- where we regress both fielding and park to the mean by significant amounts -- might reduce the double-count enough that we won't really need to make a correction for it. It will probably adjust both Whitey Ford and Jack Kralick enough that Whitey winds up on top, although I haven't checked that in detail yet.
I'll work on that for the next post.
Phil,
I don't think it's double counting, but will have to look into it more. I did a park adjustment on TZ, which is done at the position level. The biggest place you can see that is the home/road TZ splits for Fenway Park left fielders.
Bill James thought it might be double counting for Oakland, specifically the 1980 season. I think he looked at that many years ago and I just recently stumbled upon it in his archives.
I might have made a mistake somewhere in there, but theoretically I don't think it's double counting. So long as the Yankee defense rating represents true good defense and not just the benefit of being in a pitcher's park. For 1961 it doesn't look good, but I'd have to check a few more years.
Glad I stumbled upon this. You don't post much these days, but when you do it's grade A stuff.
On why +73 runs and not +50, which you get from DER:
Part is the popups/line drives/ other plays that don’t exactly match but part is this:
Catcher defense (pb, wp, sb/cs) +4
Outfield arms +4
Infielder DP +9
So that leaves 56 runs for TZ, from converting more balls in play.
On the things not adding up - I did use MLB totals, not league specific. If I could go back I’d make it add up by league. As for the years where the whole of MLB doesn’t add up, I can’t say much. I’d probably have a better chance starting from scratch than figuring that out.
Thanks, Sean!
I would argue that the double-counting isn't anything wrong with the fielding metric; it's just a result of the combination of fielding and park methodologies. I'd actually put the "blame" -- although that's too strong a word -- on the park factor calculation.
PPF (or BPF) is based on actual runs scored in games, and attributes any difference between home and road to the park -- WITHOUT taking defense into account. In those years where fielding happens, by luck, to be unbalanced between home and road, PPF counts the H-minus-R difference a second time towards the park.
Most of the time, it's minor, because H and R fielding wind up close. But, occasionally, like the 1961 Yankees, you randomly get big splits and pitchers can be impacted by 1 WAR or more.
Thanks for the breakdown of the 73 runs. Those +17 runs you mention are definitely fielding ... I'm getting ready to put together a post on trying to break down the '61 Yankees fielding numbers between fielders and pitchers [giving up easier BIP], and now I know I have to add +17 back in for the fielding that isn't BABIP.
Hope all this makes sense! Thanks again, appreciate these comments.