Thursday, March 02, 2017

How much to park-adjust a performance depends on the performance itself

In 2016, the Detroit Tigers' fielders were below average -- by about 50 runs, according to Baseball Reference. Still, Justin Verlander had an excellent season, going 16-9 with a 3.04 ERA. Should we rate Verlander's season even higher than his stat line, since he had to overcome his team's poor fielding behind him?

Not necessarily, I have argued. A team's defense is better some games than others (in results, not necessarily in talent). The fact that Verlander had a good season suggests that his starts probably got the benefit of the better games. 

I used this analogy:

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most -- because it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. 

Except that ... it WAS run support. 

It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

------

Just for fun, I decided to run a little study to see how big the effect actually is, for pitcher run support.

I found all starters from 1950 to 2015, who:

-- played for teams with below-league-average runs scored;

-- had at least 15 starts and 15 decisions, pitching no more than 5 games in relief; and

-- had a W-L record at least 10 games above .500 (e.g. 16-6).

There were 102 qualifying pitchers, mostly from the modern era. Their average record was 20-8 (19.8-7.7). 

They played in leagues where an average of 4.41 runs was scored per game, but their below-average teams scored only 4.22. 

A first instinct might be to say, "hey, these pitchers should have had a W-L record even better than they did, because their teams gave them worse run support than the league average, by 0.19 runs per start!"

But, I'm arguing, you can't say that. Run support varies from game to game. Since we're doing selective sampling, concentrating on pitchers with excellent W-L records, we're more likely to have selected pitchers who got better run support than the rest of their team.

And the results show that. 

As mentioned, the pitchers' teams scored only 4.22 runs per game that season, compared with the league average of 4.41. But, in the specific games those pitchers started, their teams gave them 4.54 runs of support. 

That's not just more than the team normally scored -- it's actually even more than the league average.

4.22 team
4.41 league
4.54 these pitchers

That's a pretty large effect. The size is due in part to the fact that we took pitchers with exceptionally good records.

Suppose a pitcher goes 22-8. Because run support varies, it could be that:

-- he pitched to (say) a 20-10 level, but got better run support;
-- he pitched to (say) a 24-6 level, but got worse run support.

But it's much less common to pitch at a 24-6 level than it is to pitch at a 20-10 level. So, the 22-8 guy was much more likely to be a 20-10 guy who got good run support than a 24-6 guy who got poor run support.

The same is true for lesser pitchers, to a lesser extent. Pitching at (say) a 14-10 level isn't that much rarer than pitching at a 12-12 level. So, the effect should still be there for those pitchers, but it should be smaller.

I reran the study, but this time, pitchers were included if they were even one game over .500. That increased the sample size to 1,024 pitcher-seasons. The average pitcher in the sample was 14-10 (14.4-9.7).

Here are the run support numbers:

4.15 team
4.40 league
4.32 these pitchers

This time, the effect wasn't so big that the pitchers actually got more support than the league average. But it did move them two-thirds of the way there. 

And, of course, not *every* pitcher in the study got better run support than his teammates. That figure was only 62.1 percent. The point is, we should expect it to be more than half.
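To make the selection argument concrete, here's a minimal simulation sketch. Everything in it is an assumption for illustration -- identical pitchers, Poisson run scoring, 36-start seasons -- not the actual data from the study. But it shows the same pattern: condition on a W-L record well over .500, and average run support in the selected seasons comes out above the team's true scoring level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, simplified settings: every pitcher has identical talent, runs
# scored and allowed are independent Poisson draws, ties are ignored.
LEAGUE_AVG = 4.4   # league runs per game (assumption)
TEAM_AVG   = 4.2   # the below-average team's runs per game (assumption)
STARTS, SEASONS = 36, 50_000

support = rng.poisson(TEAM_AVG,  size=(SEASONS, STARTS))   # runs scored for him
allowed = rng.poisson(LEAGUE_AVG, size=(SEASONS, STARTS))  # runs he gives up
wins   = (support > allowed).sum(axis=1)
losses = (support < allowed).sum(axis=1)

good = wins - losses >= 10   # pitchers who finished 10+ games over .500
print("team's true average support     :", TEAM_AVG)
print("support in the selected seasons :", support[good].mean().round(2))
```

The exact numbers depend on the assumed distributions, but the direction of the bias doesn't.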

-------

Suppose a player has an exceptionally good result -- an extremely good W-L record, or a lot of home runs, or a high batting average, or whatever. 

Then, in any way that it's possible for him to have been lucky or unlucky -- that is, influenced by external factors that you might want to correct for -- he's more likely to have been lucky than unlucky.

If a player hits 40 home runs in an extreme pitcher's park, he probably wasn't hurt by the park as much as other players. If a player steals 80 bases and is only caught 6 times, he probably faced weaker-throwing catchers than the league average. If a shortstop rates very high for his fielding runs one year, he was probably lucky in that he got easier balls to field than normal (relative to the standards of the metric you're using).

"Probably" doesn't mean "always," of course. It just means more than a 50 percent chance. It could be anywhere from 50.0001 percent to 99.9999 percent. (As I mentioned, it was 62.1 percent for the run support study.)

The actual probability, and the size of the effect, depends on a lot of things. It depends on how you define "extreme" performances. It depends on the variances of the performances and the factor you're correcting for. It depends on how many external factors actually affect the extreme performance you're seeing.

So: for any given case, is the effect big, or is it small? You have to think about it and make an argument. Here's an argument you could make for run support, without actually having to do the study.

In most seasons, the SD of a single team's runs per game is about 3. That means that in a season of 36 starts, the SD of average run support is 0.5 runs (which is 3 divided by the square root of 36). 

In the 2015 AL, the SD of season runs scored between teams was only 0.4 runs per game.

0.5 runs of variation between pitchers on a team
0.4 runs of variation between teams

That means that, for a given starting pitcher's W-L record, randomness in which games he starts matters *more* than his team's overall level of run support. 

That's why we should expect the effect to be large.
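Here's that back-of-the-envelope arithmetic spelled out, using the same rough figures quoted above (an SD of 3 runs for a single game, 0.4 runs for the 2015 AL between-team spread):

```python
import math

SD_GAME  = 3.0    # rough SD of a team's runs in a single game
STARTS   = 36
SD_TEAMS = 0.4    # 2015 AL: SD of season runs/game between teams

sd_pitcher_support = SD_GAME / math.sqrt(STARTS)   # SD of one pitcher's average support
print(round(sd_pitcher_support, 2), "runs: luck between pitchers on the same team")
print(SD_TEAMS, "runs: real differences between teams")
```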

There are other sources of luck that might affect a pitcher's W-L record. Home/road starts, for instance. If you find a pitcher with a good record, there's better than a 50-50 shot that he started more games at home than on the road. But, the amount of overall randomness in that stat is so small -- especially since there's usually a regular rotation -- that the expectation is probably closer to, say, 50.1 percent, than to the 62.1 percent that we found for run support.

But, in theory, the effect must exist, at some magnitude. Whether it's big enough to worry about is something you have to figure out.

I've always wanted to try to study this for park effects. I've always suspected that when a player hits 40 home runs in a pitcher's park, and he gets adjusted up to 47 or something ... that that's way too high. But I haven't figured out how to figure it out.








Monday, November 28, 2016

How should we evaluate Detroit's defense behind Verlander?

Privately and publicly, Bill James, Tom Tango, and Joe Posnanski have been arguing about Baseball Reference's version of Wins Above Replacement. Specifically, they're questioning the 2016 WAR totals for Justin Verlander and Rick Porcello:

Verlander +6.6
Porcello  +5.0

Verlander is credited with 1.6 more wins than Porcello. But their stat lines aren't that much different:

            W-L   IP   K   BB   ERA
------------------------------------
Verlander  16-9  227  254  57  3.04
Porcello   22-4  223  189  32  3.15

So why does Verlander finish so far ahead of Porcello?

Fielding.

Baseball Reference credits Verlander with an extra 13 runs, compared to Porcello, to adjust for team fielding. 13 runs corresponds to 1.3 WAR -- roughly, a half-run per nine innings pitched. 
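For reference, here's the arithmetic behind those two figures; the 10-runs-per-win conversion is the usual rough rule of thumb, not a number taken from Baseball Reference's actual calculation:

```python
runs = 13          # fielding adjustment credited to Verlander vs. Porcello
ip   = 227         # Verlander's innings pitched
runs_per_win = 10  # rough rule-of-thumb conversion (assumption)

print("WAR difference:", runs / runs_per_win)        # ~1.3 wins
print("runs per 9 IP :", round(runs * 9 / ip, 2))    # ~0.5
```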

Why so big an adjustment? Because the Red Sox fielders were much better than the Tigers'. Baseball Info Solutions (who evaluate fielding performance from ball trajectory data) had Boston at 108 runs better than Detroit for the season. The 13-run difference between Porcello and Verlander is their share of that difference.

It all seems to make sense, except ... it doesn't. Posnanski, backed by the stats, thinks that even though Detroit's defense was worse than Boston's, the difference didn't affect those two particular pitchers that much. Posnanski argues, plausibly, that even though Detroit's fielders didn't play well over the season as a whole, they DID play well when Verlander was on the mound:


"For one thing, I think it’s quite likely that Detroit played EXCELLENT defense behind Verlander, even if they were shaky behind everyone else. I’m not sure how you can expect a defense to allow less than a .256 batting average on balls in play (the second-lowest of Verlander’s career and second lowest in the American League in 2016) or allow just three runners to reach on error all year (the lowest total of Verlander’s career).

"For another, the biggest difference in the two defenses was in right and centerfield. The Red Sox centerfielder and rightfielder saved 44 runs, because Jackie Bradley and Mookie Betts are awesome. The Tigers centerfield and rightfielder cost 49 runs because Cameron Maybin, J.D. Martinez and a cast of thousands are not awesome.

"But the Tigers outfield certainly didn’t cost Verlander. He allowed 216 fly balls in play, and only 16 were hits. Heck, the .568 average he allowed on line drives was the lowest in the American League. I find it almost impossible to believe that the Boston outfield would have done better than that."

------

So, that's the debate. Accepting that the Tigers' fielding, overall, was 49 runs worse than average for the season, can we simultaneously accept that the reverse was true on those days when Verlander was pitching? Could the crappy Detroit fielders have turned good -- or at least average -- one day out of every five?

Here's an analogy that says yes.

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most -- because it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. They were farther ahead of the second-place Yankees than the Yankees were above the 26th-place Reds. 

Unless, of course, Toronto's powerhouse offense just happened to sputter on those 29 days when Dickey was on the mound. Is that possible?

Yup. 

It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

Of course, it's not really that the Blue Jays turned into a bad-hitting team, that their skill level actually changed. It's just randomness. Some days, even great-hitting teams have trouble scoring, and, by dumb luck, there happened to be more of those days when Dickey pitched than when Buehrle pitched.

Generally, runs per game has a standard deviation of about 3, so the SD over 29 games is around 0.56. Dickey's bad luck was only around 1.6 SDs from zero, not even statistically significant.
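The arithmetic behind that sentence, using the same rough SD of 3 runs per game:

```python
import math

SD_GAME = 3.0                              # rough SD of a team's runs in one game
starts  = 29                               # Dickey's starts
sd_support = SD_GAME / math.sqrt(starts)   # ~0.56 runs
shortfall  = 5.5 - 4.6                     # Jays' average minus Dickey's support

print("SD of average support over 29 starts:", round(sd_support, 2))
print("Dickey's shortfall in SDs:", round(shortfall / sd_support, 1))   # ~1.6
```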

(* Note: As I was writing this post, Posnanski posted a followup using a similar run support analogy.)

------

Just as we only have season fielding stats for evaluating Verlander's defense, imagine that we only had season batting stats for evaluating Dickey's run support.

In that case, when we evaluated Dickey's record, we'd say, "Dickey looks like an average pitcher, at 11-11. But his team scored a lot more runs than average. If he could only finish with a .500 record with such a good offense, he's worse than his 11-11 record shows. So, we have to adjust him down, maybe to 9-13 or something, to accurately compare him to pitchers on average-hitting teams."

And that wouldn't be right, in light of the further information we have: that the Jays did NOT score that many runs on days that Dickey pitched. 

Well, the same is true for the Verlander/Porcello case, right? It's quite possible that even though the Tigers were a bad defensive team, they happened to play good defense during Verlander's starts, just because the sample size is small enough that that kind of thing can happen. In that light, Posnanski's analysis is crucial -- it's evidence that, yes, the Tigers fielders DID play well (or at least, appear to play well) behind Verlander, even if they didn't play well behind the Tigers' other pitchers.

Because, fielding is subject to variation just like hitting is. Some games, an outfielder makes a great diving catch, and, other days, the same outfielder just misses the catch on an identical ball. More importantly, some days the balls in play are just easier to field than others, and even the BIS data doesn't fully capture that fact, and the fielders look better than they actually played. 

(In fact, I suspect that the errors from misclassifying the difficulty of balls in play are much bigger than the effect of actual randomness in how balls are fielded. But that's not important for this argument.)

------

What if we don't have evidence, either way, on whether Detroit's fielders were better or worse with Verlander on the mound? In that case, it's OK to use the season numbers, right?

No, I don't think so. If the pitcher had better results than expected, you have to assume that the defense played better as well. Otherwise, you'll consistently overrate the pitchers who performed well on bad-fielding teams, and underrate the pitchers who performed poorly on good-fielding teams.

The argument is pretty simple -- it's the usual "regression to the mean" argument to adjust for luck.

When a pitcher does well, he was probably lucky. Not just lucky in how well he himself played, but in EVERY possible area where he could be lucky -- parks, defense, umpire calls, weather ... everything. (See MGL's comment here.)  If a pitcher pitched well, he was probably lucky in how he pitched, and he was probably lucky in how his team fielded.

You roll ten dice, and wind up with a total of 45. You were lucky to get such a high sum, because the expected total was only 35.

Since the overall total was lucky, each individual roll, taken independently, is more likely to have been lucky than unlucky. Because, obviously, you can't be lucky with the total without being lucky with the numbers that sum to the total. We don't know which of the ten were lucky and which were not, but, for each die, we should retrospectively be willing to bet that it was higher than 3.5.

It would be wrong to say something like: "Overall for each of these dice, the expectation was 3.5. That means the first six tosses probably totaled 21. That means the last four tosses must have totaled 24. Wow! Your last four tosses were 6-6-6-6! They were REALLY lucky!"

It's wrong, of course, because you can't arbitrarily attribute all the luck to the last four tosses. All ten are equally likely to have been lucky ones.
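A quick simulation sketch makes the dice point concrete (pure illustration, no baseball data involved): roll ten dice many times, keep only the trials that total 45, and look at what each individual die averaged in those trials.

```python
import numpy as np

rng = np.random.default_rng(1)

# Roll ten dice in each of 500,000 trials, keep only the trials that
# totaled 45, and average each die position across those trials.
rolls = rng.integers(1, 7, size=(500_000, 10))
lucky = rolls[rolls.sum(axis=1) == 45]

print("trials that totaled 45:", len(lucky))
print("average of each die in those trials:", lucky.mean(axis=0).round(2))
# Every position averages about 4.5, not 3.5 -- each die was probably "lucky".
```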

And the same is true for Verlander. His excellent ERA is the sum of pitching and fielding. You can't arbitrarily assume all the good luck came in the rolls of his pitching dice, and that he had exactly zero luck in the rolls of his team's fielding dice.

-------

If that isn't obvious, try this. 

The WAR method works like this: it takes a single game Verlander started, assigns the results to Verlander, and adjusts for the average of what the fielders did in ALL the games they played, not just this one.

Imagine that we reverse it: we take a single game Verlander started, assign the results to the FIELDERS, and adjust for the average of what Verlander did in ALL the games he pitched, not just this one.

One day, Verlander and the Tigers give up 7 runs, and the argument goes something like this:

"Wow! Normally, the Tigers fielders give up only 5 runs, so today they were -2. But wait!  Justin Velander was on the mound, and he's a great pitcher, and saves an average of two runs a game! If they gave up 7 runs despite Verlander's stellar pitching, the fielders must have been exceptionally bad, and we need to give them a -4 instead of a -2!"

Verlander's stats aren't just a measure of Verlander's own performance. As Tango would say, they're a measure of *what happened when Verlander was on the mound*. That encompasses Verlander's pitching AND his teammates' fielding. 

So, if the results with Verlander on the mound are better than expected, chances are that BOTH of those things were better than expected. 

------

I should probably leave it there, but if you're still not convinced, here's an explicit model.

There's a wheel of fortune with ten slots. You spin the wheel to represent a ball in play. Normally, slots 1, 2, and 3 represent a hit, and 4 through 10 represent an out. But because the Tigers fielders are so bad, number 4 is changed to a hit instead of an out.

In the long term, you expect that the Tigers' defense, compared to average, will cost Verlander one hit for every 10 balls in play. 

But: your expectation of how many hits it actually cost depends on the specific pitcher's results.

(1) Suppose Verlander's results were better than expected. Out of 10 balls in play, he gets 8 outs. How many hits did the defense cost him?

Eight of Verlander's spins must have landed somewhere in slots 5 through 10. Out of those spins, the defense didn't cost him anything, since the defense is only at fault when the wheel stops at slot 4. 

For hits, we expect that one in four came from slot 4. For the two spins that wound up a hit, that works out to half a hit.

So, with the Tigers having given up few hits, we estimate his defense cost Verlander only 0.5 hits, not 1.0 hits.

(2) Suppose Verlander's results were below average -- he gave up 6 hits. Slot 4 hits, which are the fielders' fault, are a quarter of the 6 hits allowed. So, the defense here cost him 1.5 hits, not 1.0 hits.

(3) Suppose Verlander's results were exactly as predicted -- he gave up four hits. On average, one out of those four hits is from slot 4. So, in this case, yes, the defense would have cost him one hit per ten balls in play, exactly the average rate for the team. 

Which means, the better Verlander's stat line, the more likely the fielders played better than their overall average.
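If you want to verify those three cases, here's a short simulation of that exact wheel -- again, just the toy model above, not real data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Wheel of fortune: slots 1-3 are hits against any defense; slot 4 is a hit
# only because the fielders are bad; slots 5-10 are outs. Spin it 10 times
# per "start" and ask: given the number of hits allowed, how many of them
# came from slot 4 (i.e., were the fielders' fault)?
spins = rng.integers(1, 11, size=(500_000, 10))
hits       = (spins <= 4).sum(axis=1)   # hits allowed with the bad defense
slot4_hits = (spins == 4).sum(axis=1)   # hits charged to the fielders

for h in (2, 4, 6):
    print(f"given {h} hits allowed, defense cost on average:",
          round(slot4_hits[hits == h].mean(), 2), "hits")
# Prints roughly 0.5, 1.0, and 1.5 -- matching cases (1), (3), and (2) above.
```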


