Monday, November 28, 2016

How should we evaluate Detroit's defense behind Verlander?

Privately and publicly, Bill James, Tom Tango, and Joe Posnanski have been arguing about Baseball Reference's version of Wins Above Replacement. Specifically, they're questioning the 2016 WAR totals for Justin Verlander and Rick Porcello:

Verlander +6.6
Porcello  +5.0

Verlander is evaluated to have created 1.6 more wins than Porcello. But their stat lines aren't that much different:

            W-L    IP    K   BB   ERA
Verlander  16-9   227  254   57  3.04
Porcello   22-4   223  189   32  3.15

So why does Verlander finish so far ahead of Porcello?


Baseball Reference credits Verlander with an extra 13 runs, compared to Porcello, to adjust for team fielding. 13 runs corresponds to 1.3 WAR -- roughly, a half-run per nine innings pitched. 
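For anyone who wants to check that arithmetic, here's a quick sketch. The 10-runs-per-win conversion is my assumption, implied by "13 runs corresponds to 1.3 WAR" rather than taken from Baseball Reference's documentation, and I'm using Verlander's rounded innings total from the stat line above.

```python
# Assumption: about 10 runs per win, implied by "13 runs ~ 1.3 WAR" in the post.
RUNS_PER_WIN = 10.0

fielding_adjustment_runs = 13   # Verlander's extra credit relative to Porcello
verlander_ip = 227              # rounded innings pitched, from the stat line

# Convert the run adjustment to wins, and to runs per nine innings.
war_from_adjustment = fielding_adjustment_runs / RUNS_PER_WIN
runs_per_nine = fielding_adjustment_runs / verlander_ip * 9

print(war_from_adjustment, round(runs_per_nine, 2))
```

That comes out to a bit over half a run per nine innings, which matches the "roughly" above.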

Why so big an adjustment? Because the Red Sox fielders were much better than the Tigers'. Baseball Info Solutions (which evaluates fielding performance from ball trajectory data) had Boston at 108 runs better than Detroit for the season. The 13-run difference between Porcello and Verlander is their share of that difference.

It all seems to make sense, except ... it doesn't. Posnanski, backed by the stats, thinks that even though Detroit's defense was worse than Boston's, the difference didn't affect those two particular pitchers that much. Posnanski argues, plausibly, that even though Detroit's fielders didn't play well over the season as a whole, they DID play well when Verlander was on the mound:

"For one thing, I think it’s quite likely that Detroit played EXCELLENT defense behind Verlander, even if they were shaky behind everyone else. I’m not sure how you can expect a defense to allow less than a .256 batting average on balls in play (the second-lowest of Verlander’s career and second lowest in the American League in 2016) or allow just three runners to reach on error all year (the lowest total of Verlander’s career).

"For another, the biggest difference in the two defenses was in right and centerfield. The Red Sox centerfielder and rightfielder saved 44 runs, because Jackie Bradley and Mookie Betts are awesome. The Tigers centerfield and rightfielder cost 49 runs because Cameron Maybin, J.D. Martinez and a cast of thousands are not awesome.

"But the Tigers outfield certainly didn’t cost Verlander. He allowed 216 fly balls in play, and only 16 were hits. Heck, the .568 average he allowed on line drives was the lowest in the American League. I find it almost impossible to believe that the Boston outfield would have done better than that."


So, that's the debate. Accepting that the Tigers' fielding, overall, was 49 runs worse than average for the season, can we simultaneously accept that the reverse was true on those days when Verlander was pitching? Could the crappy Detroit fielders have turned good -- or at least average -- one day out of every five?

Here's an analogy that says yes.

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most. Because, it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. They were farther ahead of the second-place Yankees than the Yankees were ahead of the 26th-place Reds.

Unless, of course, Toronto's powerhouse offense just happened to sputter on those 29 days when Dickey was on the mound. Is that possible?


It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

Of course, it's not really that the Blue Jays turned into a bad-hitting team, that their skill level actually changed. It's just randomness. Some days, even great-hitting teams have trouble scoring, and, by dumb luck, there happened to be more of those days when Dickey pitched than when Buehrle pitched.

Generally, runs per game has a standard deviation of about 3, so the SD over 29 games is around 0.56. Dickey's bad luck was only around 1.6 SDs from zero, not even statistically significant.
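Here's that calculation as a minimal sketch. It assumes the games are independent, so that the SD of an average over n games is the single-game SD divided by the square root of n. The variable names are mine; the numbers are the ones quoted above.

```python
import math

sd_per_game = 3.0       # SD of runs scored in a single game (figure from the post)
starts = 29             # Dickey's starts
team_avg = 5.5          # Blue Jays' runs per game over the season
dickey_support = 4.6    # runs per game in Dickey's starts

# SD of the average run support over 29 independent games
sd_of_mean = sd_per_game / math.sqrt(starts)

# How many SDs Dickey's support was from the team average
z = (dickey_support - team_avg) / sd_of_mean

print(round(sd_of_mean, 2), round(z, 1))
```

The sign is negative because Dickey's support was below the team average. The magnitude is the 1.6 SDs mentioned above.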

(* Note: As I was writing this post, Posnanski posted a followup using a similar run support analogy.)


Just as we only have season fielding stats for evaluating Verlander's defense, imagine that we only had season batting stats for evaluating Dickey's run support.

In that case, when we evaluated Dickey's record, we'd say, "Dickey looks like an average pitcher, at 11-11. But his team scored a lot more runs than average. If he could only finish with a .500 record with such a good offense, he's worse than his 11-11 record shows. So, we have to adjust him down, maybe to 9-13 or something, to accurately compare him to pitchers on average-hitting teams."

And that wouldn't be right, in light of the further information we have: that the Jays did NOT score that many runs on days that Dickey pitched. 

Well, the same is true for the Verlander/Porcello case, right? It's quite possible that even though the Tigers were a bad defensive team, they happened to play good defense during Verlander's starts, just because the sample size is small enough that that kind of thing can happen. In that light, Posnanski's analysis is crucial -- it's evidence that, yes, the Tigers fielders DID play well (or at least, appear to play well) behind Verlander, even if they didn't play well behind the Tigers' other pitchers.

Because, fielding is subject to variation just like hitting is. Some games, an outfielder makes a great diving catch, and, other days, the same outfielder just misses the catch on an identical ball. More importantly, some days the balls in play are just easier to field than others, and even the BIS data doesn't fully capture that fact, and the fielders look better than they actually played. 

(In fact, I suspect that the errors from misclassifying the difficulty of balls in play are much bigger than the effect of actual randomness in how balls are fielded. But that's not important for this argument.)


What if we don't have evidence, either way, on whether Detroit's fielders were better or worse with Verlander on the mound? In that case, it's OK to use the season numbers, right?

No, I don't think so. If the pitcher had better results than expected, you have to assume that the defense played better as well. Otherwise, you'll consistently overrate the pitchers who performed well on bad-fielding teams, and underrate the pitchers who performed poorly on good-fielding teams.

The argument is pretty simple -- it's the usual "regression to the mean" argument to adjust for luck.

When a pitcher does well, he was probably lucky. Not just lucky in how well he himself played, but in EVERY possible area where he could be lucky -- parks, defense, umpire calls, weather ... everything. (See MGL's comment here.)  If a pitcher pitched well, he was probably lucky in how he pitched, and he was probably lucky in how his team fielded.

You roll ten dice, and wind up with a total of 45. You were lucky to get such a high sum, because the expected total was only 35.

Since the overall total was lucky, each individual roll, taken independently, is more likely to have been lucky than unlucky. Because, obviously, you can't be lucky with the total without being lucky with the numbers that sum to the total. We don't know which of the ten were lucky and which were not, but, for each die, we should retrospectively be willing to bet that it was higher than 3.5.

It would be wrong to say something like: "Overall for each of these dice, the expectation was 3.5. That means the first six tosses probably totaled 21. That means that the last four tosses probably scored 24. Wow! Your last four tosses were 6-6-6-6! They were REALLY lucky!"

It's wrong, of course, because you can't arbitrarily attribute all the luck to the last four tosses. All ten are equally likely to have been lucky ones.
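If you'd rather see the dice argument worked out exactly, here's a sketch. It assumes ten fair six-sided dice and counts outcomes directly. The function name `ways` and the other variable names are just mine for illustration.

```python
from fractions import Fraction

def ways(n_dice, total):
    """Count ordered outcomes of n_dice fair six-sided dice that sum to total."""
    counts = {0: 1}
    for _ in range(n_dice):
        nxt = {}
        for s, c in counts.items():
            for face in range(1, 7):
                nxt[s + face] = nxt.get(s + face, 0) + c
        counts = nxt
    return counts.get(total, 0)

N, TOTAL = 10, 45
all_ways = ways(N, TOTAL)

# P(one particular die shows `face` | the ten dice total 45)
cond = {face: Fraction(ways(N - 1, TOTAL - face), all_ways)
        for face in range(1, 7)}

expected_face = sum(face * p for face, p in cond.items())
p_above_average = cond[4] + cond[5] + cond[6]

print(expected_face, float(p_above_average))
```

Given a total of 45, the expected value of any single die is 4.5 (by symmetry, the total divided by ten), and the chance that a given die came up 4 or higher is better than even. Nothing singles out the last four tosses.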

And the same is true for Verlander. His excellent ERA is the sum of pitching and fielding. You can't arbitrarily assume all the good luck came in the rolls of his pitching dice, and he had exactly zero luck in the rolls of his team's fielding dice.


If that isn't obvious, try this. 

The WAR method works like this: it takes a single game Verlander started, assigns the results to Verlander, and adjusts for the average of what the fielders did in ALL the games they played, not just this one.

Imagine that we reverse it: we take a single game Verlander started, assign the results to the FIELDERS, and adjust for the average of what Verlander did in ALL the games he pitched, not just this one.

One day, Verlander and the Tigers give up 7 runs, and the argument goes something like this:

"Wow! Normally, the Tigers fielders give up only 5 runs, so today they were -2. But wait! Justin Verlander was on the mound, and he's a great pitcher, and saves an average of two runs a game! If they gave up 7 runs despite Verlander's stellar pitching, the fielders must have been exceptionally bad, and we need to give them a -4 instead of a -2!"

Verlander's stats aren't just a measure of Verlander's own performance. As Tango would say, they're a measure of *what happened when Verlander was on the mound*. That encompasses Verlander's pitching AND his teammates' fielding. 

So, if the results with Verlander on the mound are better than expected, chances are that BOTH of those things were better than expected. 


I should probably leave it there, but if you're still not convinced, here's an explicit model.

There's a wheel of fortune with ten slots. You spin the wheel to represent a ball in play. Normally, slots 1, 2, and 3 represent a hit, and 4 through 10 represent an out. But because the Tigers fielders are so bad, number 4 is changed to a hit instead of an out.

In the long term, you expect that the Tigers' defense, compared to average, will cost Verlander one hit for every 10 balls in play.

But: your expectation of how many hits it actually cost depends on the specific pitcher's results.

(1) Suppose Verlander's results were better than expected. Out of 10 balls in play, he gets 8 outs. How many hits did the defense cost him?

Eight of Verlander's spins must have landed somewhere in slots 5 through 10. Out of those spins, the defense didn't cost him anything, since the defense is only at fault when the wheel stops at slot 4. 

For hits, we expect that one in four came from slot 4. For the two spins that wound up a hit, that works out to half a hit.

So, with the Tigers having given up few hits, we estimate his defense cost Verlander only 0.5 hits, not 1.0 hits.

(2) Suppose Verlander's results were below average -- he gave up 6 hits. Slot 4 hits, which are the fielders' fault, are a quarter of the 6 hits allowed. So, the defense here cost him 1.5 hits, not 1.0 hits.

(3) Suppose Verlander's results were exactly as predicted -- he gave up four hits. On average, one out of those four hits is from slot 4. So, in this case, yes, the defense would have cost him one hit per ten balls in play, exactly the average rate for the team. 

Which means, the better Verlander's stat line, the more likely the fielders played better than their overall average.
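Here's the wheel model as a sketch you can run. It computes the exact expectation (a quarter of the hits allowed come from slot 4) and checks it with a simulation. The function names, the number of trials, and the seed are my own choices for illustration, not part of the model.

```python
import random
from collections import defaultdict
from fractions import Fraction

HIT_SLOTS = {1, 2, 3, 4}   # slot 4 is a hit only because of the bad defense
DEFENSE_SLOT = 4

def expected_defense_cost(hits):
    """Each hit is equally likely to be slot 1, 2, 3 or 4, so a quarter are slot 4."""
    return hits * Fraction(1, len(HIT_SLOTS))

def simulate(trials=100_000, balls_in_play=10, seed=0):
    """Average number of slot-4 hits, grouped by total hits allowed."""
    rng = random.Random(seed)
    slot4_sum = defaultdict(int)
    n = defaultdict(int)
    for _ in range(trials):
        spins = [rng.randint(1, 10) for _ in range(balls_in_play)]
        hits = sum(s in HIT_SLOTS for s in spins)
        slot4_sum[hits] += spins.count(DEFENSE_SLOT)
        n[hits] += 1
    return {h: slot4_sum[h] / n[h] for h in n}

sim = simulate()
for hits in (2, 4, 6):
    # hits allowed, exact expected cost, simulated cost
    print(hits, float(expected_defense_cost(hits)), round(sim[hits], 2))
```

The exact values are 0.5, 1.0 and 1.5 hits for the three cases above, and the simulated averages should land close to those.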



At Monday, November 28, 2016 8:33:00 PM, Blogger Don Coffin said...

More concretely, the Tigers as a team had a .280 BABIP. The Red Sox, .272.
Verlander's BABIP was .255; Porcello's was .269.

Verlander's defense was, it would appear, considerably better when he pitched than when other Tiger pitchers pitched.

Porcello's defense was essentially the same when he pitched as it was when other pitchers pitched.

So, in this case, it looks pretty clear that the FIP version of WAR at FanGraphs would do a better job of assessing their performances.

At Tuesday, November 29, 2016 1:42:00 PM, Anonymous Tangotiger said...

I'd be careful in equating BABIP with "defense" as the above poster is doing. BABIP is a combination of luck, pitcher, fielders, park.

