## Saturday, March 18, 2023

### 1961 Yankees fielding is double-counted against Whitey Ford

Here are the 1961 pitching lines of Whitey Ford and Jack Kralick that Bill James wrote about back in 2019:

| Pitcher | IP | W-L | H | K | BB | ERA |
|---|---|---|---|---|---|---|
| Ford | 283 | 25-4 | 242 | 209 | 92 | 3.21 |
| Kralick | 242 | 13-11 | 257 | 101 | 97 | 3.61 |

Kralick's season is decent, but clearly no match for Whitey, who has him beat in every category.

But, surprisingly, Baseball Reference has Kralick leading the American League with a WAR of 6.0. Whitey Ford, on the other hand, is 12th with only 3.7 WAR.

| Pitcher | IP | W-L | H | K | BB | ERA | WAR |
|---|---|---|---|---|---|---|---|
| Ford | 283 | 25-4 | 242 | 209 | 92 | 3.21 | 3.7 |
| Kralick | 242 | 13-11 | 257 | 101 | 97 | 3.61 | 6.0 |

What's going on?

The answer, I think, is that the fielding and park adjustments WAR uses are inflated. That's for a future post. This post is about something I found while trying to figure out what happened: an issue with the adjustments that, by random chance, turns out to hit Whitey Ford's 1961 Yankees specifically.

-----

B-R uses Sean Smith's "Total Zone Rating" (TZR) to estimate fielding and calculate defensive WAR (dWAR). For seasons before 1989, TZR is based on Retrosheet data, which, for most games, includes information on the type and location of balls in play (BIP). Basing dWAR on TZR means that for the team as a whole, the defensive evaluation is roughly equivalent to what you'd see just looking at batting average on balls in play (BABIP).

There are other factors too -- baserunner advancement, catcher arm -- but it's mostly BABIP. To confirm that, I ran a regression to predict dWAR based on BABIP (compared to league) with a dummy variable for franchise (which Sean's website says TZ adjusts for to take outfield park variation into account). For the years 1960-73, the correlation was high (r-squared .77), with the coefficient of BABIP coming out very close to the win value of turning outs into hits. For the 14 Yankee seasons specifically, the correlation was over 0.9.

So it seems like most of dWAR up to 1988 is BABIP.

-----

In 1961, the Yankees allowed an opposition BABIP of .261, compared to the AL average .275. That's an advantage of .014, or "14 points".

Yankee pitchers allowed 4414 balls in play that year. So the extra .014 represents about 62 hits turned into outs. I use 0.8 as the run value of each of those outs, so that's 49.4 runs. Call it 50 for short.
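Here's that arithmetic as a quick Python sketch, for anyone who wants to plug in other teams. The numbers are the ones quoted above; the helper name `babip_runs_saved` is just something I made up for illustration.

```python
# Figures quoted in the post for the 1961 Yankees.
BALLS_IN_PLAY = 4414
YANKEES_BABIP = 0.261
AL_BABIP = 0.275
RUNS_PER_HIT_SAVED = 0.8  # the post's run value for turning a hit into an out


def babip_runs_saved(bip, team_babip, baseline_babip, run_value=RUNS_PER_HIT_SAVED):
    """Return (hits turned into outs, runs saved) relative to a baseline BABIP.

    Illustrative helper, not part of any WAR implementation.
    """
    hits_saved = bip * (baseline_babip - team_babip)
    return hits_saved, hits_saved * run_value


hits, runs = babip_runs_saved(BALLS_IN_PLAY, YANKEES_BABIP, AL_BABIP)
print(f"hits turned into outs: {hits:.1f}")
print(f"runs saved: {runs:.1f}")
```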

-----

As an aside: Baseball Reference has the 1961 Yanks at 72 runs, not 50. Why such a big difference? I'm not sure.

Maybe they use MLB instead of AL as their baseline. That would add 14 runs or so, because in 1961 the BABIP for both leagues combined was .279 instead of the AL-only .275.
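A quick check of that "14 runs or so", assuming the only thing that changes is the baseline BABIP (the 0.8 run value and the 4414 balls in play are the same figures I used above; the variable names are mine):

```python
BALLS_IN_PLAY = 4414
YANKEES_BABIP = 0.261
AL_BABIP = 0.275
MLB_BABIP = 0.279   # both leagues combined, 1961
RUN_VALUE = 0.8     # runs per hit turned into an out

# Runs saved against each baseline.
runs_vs_al = BALLS_IN_PLAY * (AL_BABIP - YANKEES_BABIP) * RUN_VALUE
runs_vs_mlb = BALLS_IN_PLAY * (MLB_BABIP - YANKEES_BABIP) * RUN_VALUE

# How much of the gap a baseline switch alone would explain.
extra_runs = runs_vs_mlb - runs_vs_al
print(f"vs AL: {runs_vs_al:.1f}, vs MLB: {runs_vs_mlb:.1f}, extra: {extra_runs:.1f}")
```

That gets from 50 to about 64, which still leaves some of the gap to B-R's 72 unexplained.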

It could also be a result of TZR not including popups and line drives in its evaluation (because presumably there's not much difference in fielding those). It also adds measures of outfielder arms (by looking at baserunner advancement), double play ability, and caught stealings for catchers. And there's that franchise adjustment. All those might contribute to the difference.

But there's another anomaly in the raw data. The MLB average for the 1961 season was almost +12 defensive runs per team. You'd think the average would have to be zero, by definition. It could be that the system uses an average based on a large number of seasons, and 1961 just happened to be a great year for fielding. But that doesn't seem like it could be the answer. For 1959-1967, the MLB total defensive runs saved is positive in every one of those nine seasons. Then, it switches over: from 1968 to 1975, every season total is negative. It doesn't seem plausible to me that MLB spent nine years with good fielders, then the next eight with worse fielders.

Whatever it is ... over all 17 years, 1961 is the biggest outlier. The total fielding runs for 1961 is +214, which is +11.9 per team. The next highest, 1960, is only +146. None of the other positives break 100, and the largest negative is -71 in 1971.

Anyway, that still isn't my main point; it's just something that I noticed.

------

OK, now the interesting part.

As I said, the 1961 Yankees' opponents' BABIP was 14 points lower than the league. All things being equal, you'd expect the fielding to be equally good at home and on the road -- about the same 14 points either way.

But the Yankees' BABIP advantage was much higher at home.

Overall, AL teams were 8 points better at home than on the road. But the Yankees were 37 points better:

| | NYY | AL |
|---|---|---|
| home | .242 | .271 |
| road | .279 | .279 |
| diff | .037 | .008 |

On the road, the Yankee fielders were the same as the AL average, holding opponents to a .279 BABIP. At home, though, they were 29 points better than the league.
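The home/road comparison, spelled out in a few lines of Python. The BABIP values are the ones from the post; the dictionary layout and the `points` helper are just my own shorthand.

```python
# Opposition BABIP by venue, 1961.
splits = {
    "NYY": {"home": 0.242, "road": 0.279},
    "AL": {"home": 0.271, "road": 0.279},
}


def points(babip_diff):
    """Convert a BABIP difference to 'points' (thousandths)."""
    return round(babip_diff * 1000)


# How much better each was at home than on the road.
home_edge = {team: points(s["road"] - s["home"]) for team, s in splits.items()}

# Yankees relative to the league, venue by venue.
nyy_vs_al_home = points(splits["AL"]["home"] - splits["NYY"]["home"])
nyy_vs_al_road = points(splits["AL"]["road"] - splits["NYY"]["road"])

print(home_edge)        # home-minus-road edge, in points
print(nyy_vs_al_home)   # Yankees vs. league at home
print(nyy_vs_al_road)   # Yankees vs. league on the road
```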

So, in effect, all 50 runs the Yankees fielders saved via BABIP were saved at Yankee Stadium.

Why does that matter? Because those 50 runs are going to be *double counted* against Yankee pitchers and their WAR.

First, those 50 runs will be attributed to the skill of the Yankee fielders. Whitey Ford's WAR will drop, because it appears the fielders behind him were responsible for turning so many of his balls in play into outs.

Second, those same 50 runs are going to be used in calculating the Park Factor, which is based on actual runs scored (by both teams). With 50 fewer runs scored at Yankee Stadium because of BABIP, and no fewer runs scored on the road, the calculation will implicitly attribute those 50 runs to the park and the park factor will drop. Whitey Ford's WAR again will drop because he pitches in a park where it appears to be easier to prevent runs.

The adjustment for BABIP is made twice: the first time it's attributed to the fielders, and the second time it's attributed to the park. But it can't be both. At least not fully both -- it could be 50% fielding and 50% park, but the WAR method treats it as 100% fielding and 100% park.

Specifically, according to Baseball Reference, the dWAR calculation credits the Yankee defense with 0.43 runs per game behind Whitey Ford. Over Whitey's 283 innings, that's 13.5 runs. At 10 runs per win, that's 1.35 WAR. At 8.5 runs per win -- which is what B-R seems to be using for the 1961 American League -- it's 1.6 WAR.

Whitey is being adjusted, implicitly, by 3.2 WAR instead of 1.6. Turning that double-counting back into single-counting would raise Whitey from 3.7 to 5.3, which seems much more reasonable for his performance.
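Here's the chain of arithmetic in Python, in case you want to try it with a different runs-per-win figure. The 0.43 runs per game and the 8.5 runs per win are the numbers I quoted above; treating the park adjustment as exactly equal in size to the fielding adjustment is my simplifying assumption, not something B-R publishes.

```python
DEFENSE_RUNS_PER_GAME = 0.43   # B-R's credit to the defense behind Ford
FORD_INNINGS = 283
FORD_LISTED_WAR = 3.7

# Convert innings to nine-inning games, then to runs credited to the fielders.
fielding_runs = DEFENSE_RUNS_PER_GAME * FORD_INNINGS / 9

war_at_10 = fielding_runs / 10.0    # generic 10 runs per win
war_at_8_5 = fielding_runs / 8.5    # roughly what B-R seems to use for the 1961 AL

# Assumption: the same amount is charged a second time through the park factor,
# so undoing the double count adds one copy of it back.
single_counted_war = FORD_LISTED_WAR + war_at_8_5

print(f"fielding runs behind Ford: {fielding_runs:.1f}")
print(f"WAR at 10 R/W: {war_at_10:.2f}, at 8.5 R/W: {war_at_8_5:.1f}")
print(f"Ford with single-counting: {single_counted_war:.1f}")
```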

-----

Except ... not quite. My calculation assumed that park factor is based on a single season's runs. It's actually the average of three seasons -- in this case, 1960, 1961, and 1962.

So only a third of the Yankee defense is being double-counted in 1961. That means you'd only adjust Whitey for 0.5 WAR, not the full 1.6. That brings him only to 4.2.

However, the remaining 1.1 WAR or so is still double-counted: it's just that one third of the total is moving to 1960, and one third is moving to 1962.

That means that Whitey will be shorted 0.5 WAR in 1960, and again in 1962. If you fixed that, his 1960 would go from 2.0 to 2.5, and his 1962 from 5.1 to 5.6.
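A rough sketch of how the three-year park factor spreads the double count around. It follows the simplification I'm making in the text: the full 1.6 WAR from 1961 is split evenly over the three seasons in the park factor window, ignoring that Ford's innings differ from year to year. The variable names are mine.

```python
DOUBLE_COUNTED_WAR = 1.6                 # the 1961 double count, from above
PARK_FACTOR_YEARS = (1960, 1961, 1962)   # seasons averaged into the 1961 park factor

# Ford's WAR as listed at Baseball Reference.
listed_war = {1960: 2.0, 1961: 3.7, 1962: 5.1}

# Each season in the window absorbs an equal share of the double count.
share_per_year = DOUBLE_COUNTED_WAR / len(PARK_FACTOR_YEARS)

corrected_war = {
    year: round(listed_war[year] + share_per_year, 1)
    for year in PARK_FACTOR_YEARS
}

print(f"share per season: {share_per_year:.2f} WAR")
print(corrected_war)
```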

Over Whitey's career, though, the Yankees' overall home BABIP (compared to road) will indeed wind up double-counted towards his total WAR (with the exception of his first two and last two years, which will be "1.3-counted" or "1.7-counted").

------

I think this is something that will happen all the time, if my understanding of how pitching WAR is calculated is correct.

Any difference between home and road fielding will be counted as part of the park factor adjustment in addition to being counted as a fielding adjustment. If BABIP is better at home, the pitcher will be debited twice. If BABIP is worse at home, the pitcher will be credited twice.

BABIP is, like any other stat, subject to random variation. By my calculation, the SD of luck for home-minus-road BABIP is about 14 points, or 24 runs for a team-season. That's a lot. Whitey Ford pitched about 19.5 percent of his team's innings in 1961, so the SD of his random BABIP luck is about 5 runs. (The 1961 number looks like it was double that, or 2 SD, assuming no park effects. Whitey was double counted by about 10 runs by raw BABIP, and a little more than that by the dWAR calculation.)
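One way to get a figure in that neighborhood is a plain binomial model. This is a sketch under assumptions I'm adding here: every ball in play is an independent trial at the league BABIP of .275, and the Yankees' 4414 balls in play split evenly between home and road. It's not necessarily the exact calculation behind the 14 points, but it lands close.

```python
import math

LEAGUE_BABIP = 0.275
TEAM_BIP = 4414
RUN_VALUE = 0.8       # runs per hit turned into an out
FORD_SHARE = 0.195    # Ford's share of the team's innings

# Assumption: balls in play split evenly between home and road.
bip_home = bip_road = TEAM_BIP / 2

# Binomial variance of each BABIP is p(1-p)/n; variances add for a difference.
variance = LEAGUE_BABIP * (1 - LEAGUE_BABIP) * (1 / bip_home + 1 / bip_road)
sd_diff = math.sqrt(variance)

# Convert the home-minus-road SD to runs over the home balls in play.
sd_runs_team = sd_diff * bip_home * RUN_VALUE
sd_runs_ford = sd_runs_team * FORD_SHARE

print(f"SD of home-minus-road BABIP: {sd_diff * 1000:.1f} points")
print(f"team-season SD: {sd_runs_team:.1f} runs")
print(f"Ford's share: {sd_runs_ford:.1f} runs")
```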

Now, dWAR does remove popups and line drives from consideration ... that will reduce the luck SD (I'm not sure how much) compared to raw BABIP. But even if Whitey's SD drops from 0.5 WAR (the 5 runs above) to 0.4 WAR, or something, that's still pretty big.

-------

We could just adjust dWAR for this (the BABIP home/road numbers are readily available on B-R). But I think the adjustments for fielding and park are exaggerated in other ways -- as I wrote about in previous sets of posts.

A different algorithm for adjusting pitcher WAR -- where we regress both fielding and park to the mean by significant amounts -- might reduce the double-count enough that we won't really need to make a correction for it. It will probably adjust both Whitey Ford and Jack Kralick enough that Whitey winds up on top, although I haven't checked that in detail yet.

I'll work on that for the next post.