Sabermetric Research: The Bayesian Cy Young

At Fangraphs, Dave Cameron and Eric Seidman have a nice discussion (hat tip: Tango) on who's the better Cy Young candidate: Clayton Kershaw, or Roy Halladay?

Part of the discussion hinges on BABIP: batting average on balls in play. As Voros McCracken discovered years ago, pitchers generally don't differ much in what happens when a non-home-run ball is hit off them. Most of the observed differences between pitchers, then, are due partly to the fielders behind them, but mostly to luck.

So far in 2011, Clayton Kershaw has a BABIP of .272, which Eric describes as "absurdly low." Still, Eric thinks it might actually be skill rather than luck, since .272 is not that much different from what Kershaw allowed in previous years. Dave argues that Kershaw's three seasons are still a fairly small sample, and points out that most of his BABIP advantage comes from his record at home (he's about average on the road).

Anyway, my point isn't to weigh in on which one is right -- they do a fine job hashing things out in their discussion. What I want to talk about is something they both seem to agree on: that it's important whether the BABIP is luck or skill. If it's luck, that reduces Kershaw's Cy Young credentials. If it's skill, he's a better candidate.

Seems reasonable, and I don't necessarily disagree. But let's see where that logic leads.

Because, there are other kinds of luck, or factors that pitchers can't control. For instance, there's park (which is usually already adjusted for in WAR, the statistic Eric and Dave cite most in this debate).

There's also quality of opposition batting. It's probably not too hard, if you have good data, to figure out how much either of the pitchers gained by being able to pitch to inferior hitters. You could also check whether one of them had the platoon advantage more often, and whether one of them pitched more at home than the other did.

We'd probably all agree, right, that we'd want to adjust for those kinds of things if we had the information? To be clear, I'm not criticizing Dave or Eric for not spending hours figuring this stuff out. I'm just saying that if you have the data, it's relevant in comparing the pitchers.

There are other things too, that eventually we'll be able to figure out, that we can't right now because (as far as I know) the research hasn't been done. Suppose Kershaw throws a pitch at a certain speed, with a certain break, on a certain count. And, someday, we'll know that kind of pitch is swung on and missed 30% of the time, called a ball 5% of the time, called a strike 10% of the time, fouled off 10% of the time, and hit in play 45% of the time with an OPS of .850. Maybe, overall, that pitch is worth (say) +0.05 runs (in favor of the pitcher).

Once we have that kind of information, we can check for "batter swing luck". If it turns out that batters just randomly happened to go +0.03 on that pitch from Kershaw this season, instead of +0.05, we should credit him the extra 0.02, right? He delivered a certain performance, and the batters just happened to get a bit lucky on it, as if his BABIP was too high. (This measure would probably substitute for BABIP: it includes balls in play, but also home runs, swings-and-misses, and walk potential.)
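To make that bookkeeping concrete, here's a minimal sketch in Python. Only the +0.05 and +0.03 come from the example above; the function name and the 500-pitch count are made up for illustration, and nobody actually has these per-pitch run values yet.

```python
def swing_luck_credit(expected_value, actual_value, n_pitches):
    """Runs to credit back to the pitcher for batter swing luck.

    expected_value: long-run value of this kind of pitch, in runs per
                    pitch, in favor of the pitcher (the +0.05 above).
    actual_value:   what the pitcher actually got from it this season
                    (the +0.03 above).
    n_pitches:      how many such pitches he threw (hypothetical).
    """
    return (expected_value - actual_value) * n_pitches


# 500 pitches is an invented number, just to show the scale.
credit = swing_luck_credit(0.05, 0.03, 500)
print(round(credit, 1))
```

At 0.02 runs per pitch, 500 such pitches would add up to roughly 10 runs of bad luck to hand back to the pitcher.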

So we'd adjust Kershaw and Halladay for how lucky the batters were on those swings.

That's not unrealistic, and it'll probably eventually happen, to some degree of accuracy. Here's one that probably won't, at least not for a few decades, but it works as a thought experiment.

Imagine we hook a probe to every batter's brain, so on every pitch we can tell if he's guessing fastball or curve, and if he's guessing inside or outside. After a couple of years of analyzing this data, we figure that when he guesses right, it's worth +0.1 runs (for the batter), when he guesses half-right, it's worth 0, and when he guesses wrong, it's -0.1.

That, again, is something out of the control of the pitcher (especially if both batter and pitcher are randomizing using game theory). So you'd want to control for it, right? If Halladay is having a good year just because batters were unlucky enough to guess right only 23% of the time instead of 25%, you have to adjust, just like you'd adjust for a lucky BABIP.
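Here's a rough sketch of how big that could be, in Python. The +0.1 / 0 / -0.1 values and the 23%-versus-25% figures are from the thought experiment. Everything else is my assumption: that a batter guessing at random is half-right 50% of the time and wrong 25%, that the missing two points of "right" guesses all became "wrong" guesses, and that the pitcher threw 3,000 pitches.

```python
# Run values to the batter, per pitch, from the thought experiment.
VALUE_RIGHT, VALUE_HALF, VALUE_WRONG = 0.1, 0.0, -0.1


def batter_runs_per_pitch(p_right, p_half, p_wrong):
    """Expected runs per pitch for the batter, given his guess rates."""
    assert abs(p_right + p_half + p_wrong - 1.0) < 1e-9
    return (p_right * VALUE_RIGHT
            + p_half * VALUE_HALF
            + p_wrong * VALUE_WRONG)


# Assumed baseline: random guessing on two independent 50/50 choices.
neutral = batter_runs_per_pitch(0.25, 0.50, 0.25)

# Unlucky batters: right 23% of the time; assume the other 2% went to "wrong".
unlucky = batter_runs_per_pitch(0.23, 0.50, 0.27)

N_PITCHES = 3000  # hypothetical season workload
pitcher_gain = (neutral - unlucky) * N_PITCHES
print(round(pitcher_gain, 1))
```

Under those assumptions the batters lose 0.004 runs per pitch, which works out to about 12 runs over the season: more than a win's worth of luck that never shows up in the stat line.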

This will change the definition of "batter swing luck," but not replace it. First, the batter may have been lucky enough to guess right, which is worth something. Then, he might have been lucky enough to get better than expected wood on the ball even controlling for the fact that he guessed right.

So you've got lots of sources of luck:

-- park
-- day/night
-- distribution of batters
-- platoon luck
-- BABIP luck
-- batter swing luck
-- batter guess luck

You'd want to adjust for all of these. Right now, as I understand WAR, we're adjusting for park and BABIP.

What about the others? Well, we can't really adjust for those. We *want* to, but we can't.

So, we make do with just park and BABIP. But no matter how many decimal places we take the Kershaw/Halladay debate to, we're only going to have our best guess.

At least we can argue that if all the other things are random, we should still be unbiased. Right?

Well, not really. From a Bayesian standpoint, we have a pretty good idea who had more luck. It's much more likely to be Kershaw.

Why? Because Halladay's performance is much more consistent with his career than Kershaw's. Kershaw's a good pitcher, but wasn't expected to be *that* good. Halladay, on the other hand, is having a typical Halladay season. Well, a bit better than typical, but not much.

I'd be willing to bet a lot of money that if you found 50 pitchers who had a better-than-career season, by at least (say) 1.5 WAR, you would find that those 50 pitchers had above-average BABIP luck. It stands to reason. I won't make a full statistical argument, but here's a quick oversimplification of one.

A pitcher can have his talent go up or down from year to year. He can have his luck go up or down from year to year. That's four combinations. Only three of them are possibly consistent with a big improvement in WAR: talent up/luck up; talent up/luck down; talent down/luck up. Two of those have his luck going up. So, two times out of three, the pitcher was lucky.
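If you'd rather see that than take the counting argument on faith, here's a small simulation sketch in Python. It assumes things the argument doesn't: that year-to-year talent change and luck change are independent, normally distributed, and equally variable (a standard deviation of 1.0 WAR each, a number I made up). The 1.5 WAR cutoff is the one mentioned above.

```python
import random


def share_lucky_among_improvers(n=200_000, threshold=1.5,
                                sd_talent=1.0, sd_luck=1.0, seed=0):
    """Among simulated pitchers who improved by at least `threshold` WAR,
    return the fraction whose luck went up rather than down."""
    rng = random.Random(seed)
    improvers = 0
    lucky = 0
    for _ in range(n):
        talent_change = rng.gauss(0, sd_talent)  # real change in ability
        luck_change = rng.gauss(0, sd_luck)      # change in random luck
        if talent_change + luck_change >= threshold:
            improvers += 1
            if luck_change > 0:
                lucky += 1
    return lucky / improvers


share = share_lucky_among_improvers()
print(f"{share:.2f}")
```

Under these assumptions the share of big improvers with positive luck comes out well above one half. The four-combination count treats the three surviving cases as equally likely, which they aren't, so "two times out of three" is only a rough figure; the direction of the effect is the point.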

The argument applies to *all* sources of luck. Even after taking BABIP into account, if a pitcher's adjusted performance is still above his career average, he's still more likely to have had good luck than bad, in other ways (batter swings, say).

I don't have an easy way to quantify this, but still I'd give you better-than-even odds that, stripping out all the above, Halladay is performing better than Kershaw -- even after adjusting for park and BABIP.

If you have two players with similar, outstanding performances, the player with the better expectation of talent is probably the one who's actually having the better year. To believe that Kershaw was really likely to have had a better year than Halladay, you really need him to have put up *much* better numbers. Either that, or you need a way to actually work out all the luck, and prove that the residual still favors Kershaw.
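One standard way to put numbers on this is a normal-normal Bayesian update, sketched below in Python. This is my illustration, not a calculation anyone in the debate actually did. The two pitchers and all of their WAR figures are invented, and I'm assuming the pre-season uncertainty and the in-season luck have the same spread (1.5 WAR each).

```python
def posterior_mean(prior_mean, prior_sd, observed, luck_sd):
    """Best guess at luck-free performance after seeing the stat line.

    prior_mean, prior_sd: what we expected before the season, and how
                          sure we were (e.g. from career record).
    observed:             the stat line, which is performance plus luck.
    luck_sd:              how much luck a single season can contain.
    """
    prior_precision = 1.0 / prior_sd ** 2
    obs_precision = 1.0 / luck_sd ** 2
    return ((prior_precision * prior_mean + obs_precision * observed)
            / (prior_precision + obs_precision))


# Hypothetical pitchers, not real numbers for anyone.
# Ace: expected 6.5 WAR, put up 7.0. Upstart: expected 4.5, put up 7.3.
ace = posterior_mean(prior_mean=6.5, prior_sd=1.5, observed=7.0, luck_sd=1.5)
upstart = posterior_mean(prior_mean=4.5, prior_sd=1.5, observed=7.3, luck_sd=1.5)
print(round(ace, 2), round(upstart, 2))
```

With equal spreads, the estimate lands halfway between expectation and stat line: 6.75 for the ace and 5.9 for the upstart, even though the upstart's raw number is better. The upstart would need a much bigger raw lead to come out ahead.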

I should emphasize that I am NOT talking about talent here. I think most people would agree that Halladay is still more talented than Kershaw, but would nonetheless argue Kershaw might still be having the better season.

But, what I'm saying is, no, I bet Kershaw is NOT having a better season, even if his numbers look better. I'm saying that it's likely that Kershaw *is actually not pitching better*. If we had the data, it's more likely than not that we'd see that batters are just having bad luck -- not only are they (perhaps) hitting the ball directly to fielders, as BABIP suggests, but they're probably swinging and missing at hittable pitches.

---------

Another way to look at it: if two pitchers have mostly the same results, but one has better stuff, what does that mean? It means that the pitcher with the better stuff must have been unluckier than the pitcher with the worse stuff. In other words, the batters facing the better stuff must have been luckier.

We don't know for sure, of course, that Halladay had better stuff than Kershaw. But history suggests that's more likely. And so, the odds are on the side of Kershaw having been luckier than Halladay. How much so?

I don't know. One mitigating factor is that Kershaw is young, so you'd expect more of his improvement to be real. But, still, a small improvement is more likely than a large improvement, so the odds are still on the side of positive luck over negative luck.

---------

Does that take some of the fun out of the Cy Young? I think it certainly does make it a little bit less entertaining, at least until we have better data. That's because, as long as we remain ignorant of a significant amount of luck, there's a much bigger hurdle to clear before awarding the honor to anyone other than Halladay.

This is a bit counterintuitive, but it's true. Suppose a good but not great pitcher -- Matt Cain, say -- has almost exactly the same stat line as Roy Halladay, including BABIP, but is actually better in some categories. Perhaps he has a couple of extra strikeouts, and a couple fewer walks.

From the usual arguments, there would be absolutely no debate that Cain's season is better, right? He's better than Halladay in some categories, and the same as Halladay in all the others.

But ... if you're trying to bet on which player actually pitched better after removing all the luck, you'd still have to go with Halladay.

-----

UPDATE: on his blog, Tango writes,

Aside to Phil: Marcel had Kershaw with a 3.07 ERA for 2011, and Halladay at 3.04. So, while you make great points in your article, you didn’t have the right examples! Sabathia and Verlander would have been better examples.

Oops! I'll just leave it the way it is for now, but point taken.

Labels: BABIP, baseball, bayes, Cy Young, pitching, statistics

3 Comments:

At Thursday, September 22, 2011 1:53:00 PM, mettle said...: I think you've really hit on the absurdity of using advanced metrics for awards. Most advanced metrics attempt to sift out the true talent from all the luck and variability in players' seasons. This is all well and good for GMs building a team, but using these sorts of metrics for awards is sort of absurd -- there would be no suspense, and the award would go to the same players year after year.
Trying to find some happy mean (e.g., ignore wins, but use ERA) seems to satisfy some people, but you've really done a great job of pointing out the hypocrisy of doing that.
That's why, to me, it makes sense to go whole hog into the output camp. Use WPA, or something similar (wins, RBI) for awards and leave the advanced metrics for the personnel decisions and the barroom arguments.

At Thursday, September 22, 2011 3:21:00 PM, Mike Fast said...: Regarding "batter swing luck", check out Josh Smolow's two-part series on the "luckiest" and "unluckiest" pitches this year:

The Luckiest Pitches in the Majors This Year

The Un-Luckiest Pitches in the Majors This Year

At Thursday, September 22, 2011 6:17:00 PM, Phil Birnbaum said...: Mike,

That's great stuff! I'm not sure I get it though ... where does the "expected" value for a pitch come from? For instance, if Heath Bell got a lot of strikes on his four-seam fastball, how does the study make sure it's not just that he has a really, really awesome four-seam fastball? The article is unclear on that.


Thursday, September 22, 2011
