Thursday, March 02, 2017

How much to park-adjust a performance depends on the performance itself

In 2016, the Detroit Tigers' fielders were below average -- by about 50 runs, according to Baseball Reference. Still, Justin Verlander had an excellent season, going 16-9 with a 3.04 ERA. Should we rate Verlander's season even higher than his stat line, since he had to overcome his team's poor fielding behind him?

Not necessarily, I have argued. A team's defense is better some games than others (in results, not necessarily in talent). The fact that Verlander had a good season suggests that his starts probably got the benefit of the better games. 

I used this analogy:

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most. Because, it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. 

Except that ... it WAS run support. 

It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

------

Just for fun, I decided to run a little study to see how big the effect actually is, for pitcher run support.

I found all starters from 1950 to 2015, who:

-- played for teams with below-league-average runs scored;

-- had at least 15 starts and 15 decisions, pitching no more than 5 games in relief; and

-- had a W-L record at least 10 games above .500 (e.g. 16-6).

There were 102 qualifying pitchers, mostly from the modern era. Their average record was 20-8 (19.8-7.7). 

They played in leagues where an average 4.41 runs were scored per game, but their below-average teams scored only 4.22. 

A first instinct might be to say, "hey, these pitchers should have had a W-L record even better than they did, because their teams gave them worse run support than the league average, by 0.19 runs per start!"

But, I'm arguing, you can't say that. Run support varies from game to game. Since we're doing selective sampling, concentrating on pitchers with excellent W-L records, we're more likely to have selected pitchers who got better run support than the rest of their team.

And the results show that. 

As mentioned, the pitchers' teams scored only 4.22 runs per game that season, compared with the league average 4.41. But, in the specific games those pitchers started, their teams gave them 4.54 runs of support. 

That's not just more than the team normally scored -- it's actually even more than the league average.

4.22 team
4.41 league
4.54 these pitchers

That's a pretty large effect. The size is due in part to the fact that we took pitchers with exceptionally good records.

Suppose a pitcher goes 22-8. Because run support varies, it could be that:

-- he pitched to (say) a 20-10 level, but got better run support;
-- he pitched to (say) a 24-6 level, but got worse run support.

But it's much less common to pitch at a 24-6 level than it is to pitch at a 20-10 level. So, the 22-8 guy was much more likely to be a 20-10 guy who got good run support than a 24-6 guy who got poor run support.

The same is true for lesser pitchers, to a lesser extent. Pitching at a 14-10 level is rarer than pitching at a 12-12 level, but not by nearly as much. So the effect should be there for those pitchers too, but it should be smaller.

I reran the study, but this time, pitchers were included if they were even one game over .500. That increased the sample size to 1024 team-seasons. The average pitcher in the sample was 14-10 (14.4 and 9.7).

Here are the run support numbers:

4.15 team
4.40 league
4.32 these pitchers

This time, the effect wasn't so big that the pitchers actually got more support than the league average. But it did move them two-thirds of the way there. 

And, of course, not *every* pitcher in the study got better run support than his teammates. That figure was only 62.1 percent. The point is, we should expect it to be more than half.
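
The selection effect can be reproduced with a toy simulation. This is only a sketch under invented assumptions, not the study itself: every pitcher gets the same team offense (4.2 runs per game, game-to-game SD of 3), pitchers differ in runs allowed with an assumed SD of 0.5, and the record comes from a Pythagorean-style formula. The function name and all parameters are hypothetical.

```python
import random
import statistics

def simulate_selection(n_pitchers=20000, starts=32, team_mean=4.2,
                       game_sd=3.0, pitcher_sd=0.5, min_win_pct=0.65, seed=1):
    """Toy model: pick pitchers by record, then look at their run support."""
    rng = random.Random(seed)
    support_sd = game_sd / starts ** 0.5   # SD of a season's average run support
    everyone, selected = [], []
    for _ in range(n_pitchers):
        # Run support is pure luck here: the team offense is the same for everybody.
        support = max(rng.gauss(team_mean, support_sd), 0.1)
        # Runs allowed per game reflects the pitcher (assumed spread).
        allowed = max(rng.gauss(team_mean, pitcher_sd), 0.1)
        # Pythagorean-style expected winning percentage (an assumption).
        win_pct = support ** 2 / (support ** 2 + allowed ** 2)
        everyone.append(support)
        if win_pct >= min_win_pct:
            selected.append(support)
    return statistics.mean(everyone), statistics.mean(selected), len(selected)

if __name__ == "__main__":
    overall, chosen, n = simulate_selection()
    print(f"all pitchers: {overall:.2f} runs of support")
    print(f"{n} pitchers with great records: {chosen:.2f} runs of support")
```

Even though nobody in this model has any real run-support advantage, the pitchers picked for their records come out with more support than the group as a whole, which is the same pattern as in the tables above.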

-------

Suppose a player has an exceptionally good result -- an extremely good W-L record, or a lot of home runs, or a high batting average, or whatever. 

Then, in any way that it's possible for him to have been lucky or unlucky -- that is, influenced by external factors that you might want to correct for -- he's more likely to have been lucky than unlucky.

If a player hits 40 home runs in an extreme pitcher's park, he probably wasn't hurt by the park as much as other players. If a player steals 80 bases and is only caught 6 times, he probably faced weaker-throwing catchers than the league average. If a shortstop rates very high for his fielding runs one year, he was probably lucky in that he got easier balls to field than normal (relative to the standards of the metric you're using).

"Probably" doesn't mean "always," of course. It just means more than a 50 percent chance. It could be anywhere from 50.0001 percent to 99.9999 percent. (As I mentioned, it was 62.1 percent for the run support study.)

The actual probability, and the size of the effect, depends on a lot of things. It depends on how you define "extreme" performances. It depends on the variances of the performances and the factor you're correcting for. It depends on how many external factors actually affect the extreme performance you're seeing.

So: for any given case, is the effect big, or is it small? You have to think about it and make an argument. Here's an argument you could make for run support, without actually having to do the study.

In most seasons, the SD of a single team's runs per game is about 3. That means that in a season of 36 starts, the SD of average run support is 0.5 runs (which is 3 divided by the square root of 36). 

In the 2015 AL, the SD of season runs scored between teams was only 0.4 runs per game.

0.5 runs of variation between pitchers on a team
0.4 runs of variation between teams

That means that, for a given starting pitcher's W-L record, randomness in which games he starts matters *more* than his team's overall level of run support. 

That's why we should expect the effect to be large.
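
Here is the same arithmetic as a short sketch. The "share of variance" figure is an added illustration and assumes the two sources of variation are independent. The function names are made up for the example.

```python
import math

def sd_of_average(per_game_sd, n_games):
    """SD of an average taken over n independent games."""
    return per_game_sd / math.sqrt(n_games)

def luck_share(within_sd, between_sd):
    """Fraction of total variance that comes from the within-team part."""
    return within_sd ** 2 / (within_sd ** 2 + between_sd ** 2)

within_team = sd_of_average(3.0, 36)   # which games a pitcher happens to start
between_teams = 0.4                    # 2015 AL figure quoted in the text
share = luck_share(within_team, between_teams)
print(f"within-team SD: {within_team:.2f}, share of variance: {share:.0%}")
```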

There are other sources of luck that might affect a pitcher's W-L record. Home/road starts, for instance. If you find a pitcher with a good record, there's better than a 50-50 shot that he started more games at home than on the road. But, the amount of overall randomness in that stat is so small -- especially since there's usually a regular rotation -- that the expectation is probably closer to, say, 50.1 percent, than to the 62.1 percent that we found for run support.

But, in theory, the effect must exist, at some magnitude. Whether it's big enough to worry about is something you have to figure out.

I've always wanted to try to study this for park effects. I've always suspected that when a player hits 40 home runs in a pitcher's park, and he gets adjusted up to 47 or something ... that that's way too high. But I haven't figured out how to figure it out.








Monday, November 28, 2016

How should we evaluate Detroit's defense behind Verlander?

Privately and publicly, Bill James, Tom Tango, and Joe Posnanski have been arguing about Baseball Reference's version of Wins Above Replacement. Specifically, they're questioning the 2016 WAR totals for Justin Verlander and Rick Porcello:

Verlander +6.6
Porcello  +5.0

Verlander is evaluated to have created 1.6 more wins than Porcello. But their stat lines aren't that much different:

            W-L    IP    K  BB   ERA
-----------------------------------
Verlander  16-9  227  254  57  3.04
Porcello   22-4  223  189  32  3.15

So why does Verlander finish so far ahead of Porcello?

Fielding.

Baseball Reference credits Verlander with an extra 13 runs, compared to Porcello, to adjust for team fielding. 13 runs corresponds to 1.3 WAR -- roughly, a half-run per nine innings pitched. 

Why so big an adjustment? Because the Red Sox fielders were much better than the Tigers'. Baseball Info Solutions (who evaluate fielding performance from ball trajectory data) had Boston at 108 runs better than Detroit for the season. The 13-run difference between Porcello and Verlander is their share of that difference.

It all seems to make sense, except ... it doesn't. Posnanski, backed by the stats, thinks that even though Detroit's defense was worse than Boston's, the difference didn't affect those two particular pitchers that much. Posnanski argues, plausibly, that even though Detroit's fielders didn't play well over the season as a whole, they DID play well when Verlander was on the mound:


"For one thing, I think it’s quite likely that Detroit played EXCELLENT defense behind Verlander, even if they were shaky behind everyone else. I’m not sure how you can expect a defense to allow less than a .256 batting average on balls in play (the second-lowest of Verlander’s career and second lowest in the American League in 2016) or allow just three runners to reach on error all year (the lowest total of Verlander’s career).

"For another, the biggest difference in the two defenses was in right and centerfield. The Red Sox centerfielder and rightfielder saved 44 runs, because Jackie Bradley and Mookie Betts are awesome. The Tigers centerfield and rightfielder cost 49 runs because Cameron Maybin, J.D. Martinez and a cast of thousands are not awesome.

"But the Tigers outfield certainly didn’t cost Verlander. He allowed 216 fly balls in play, and only 16 were hits. Heck, the .568 average he allowed on line drives was the lowest in the American League. I find it almost impossible to believe that the Boston outfield would have done better than that."

------

So, that's the debate. Accepting that the Tigers' fielding, overall, was 49 runs worse than average for the season, can we simultaneously accept that the reverse was true on those days when Verlander was pitching? Could the crappy Detroit fielders have turned good -- or at least average -- one day out of every five?

Here's an analogy that says yes.

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most. Because, it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. They were farther ahead of the second-place Yankees than the Yankees were above the 26th place Reds. 

Unless, of course, Toronto's powerhouse offense just happened to sputter on those 29 days when Dickey was on the mound. Is that possible?

Yup. 

It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

Of course, it's not really that the Blue Jays turned into a bad-hitting team, that their skill level actually changed. It's just randomness. Some days, even great-hitting teams have trouble scoring, and, by dumb luck, there happened to be more of those days when Dickey pitched than when Buehrle pitched.

Generally, runs per game has a standard deviation of about 3, so the SD over 29 games is around 0.56. Dickey's bad luck was only around 1.6 SDs from zero, not even statistically significant.
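
The calculation can be sketched in a few lines. It assumes games are independent and uses a normal approximation for the tail probability. The function names are hypothetical.

```python
import math

def support_z_score(pitcher_avg, team_avg, per_game_sd, n_starts):
    """How many standard errors a pitcher's run support sits from the team average."""
    standard_error = per_game_sd / math.sqrt(n_starts)
    return (pitcher_avg - team_avg) / standard_error

def two_sided_p(z):
    """Two-sided tail probability under a normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

z = support_z_score(4.6, 5.5, 3.0, 29)   # Dickey: 4.6 runs vs. the Jays' 5.5
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.2f}")
```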

(* Note: As I was writing this post, Posnanski posted a followup using a similar run support analogy.)

------

Just as we only have season fielding stats for evaluating Verlander's defense, imagine that we only had season batting stats for evaluating Dickey's run support.

In that case, when we evaluated Dickey's record, we'd say, "Dickey looks like an average pitcher, at 11-11. But his team scored a lot more runs than average. If he could only finish with a .500 record with such a good offense, he's worse than his 11-11 record shows. So, we have to adjust him down, maybe to 9-13 or something, to accurately compare him to pitchers on average-hitting teams."

And that wouldn't be right, in light of the further information we have: that the Jays did NOT score that many runs on days that Dickey pitched. 

Well, the same is true for the Verlander/Porcello case, right? It's quite possible that even though the Tigers were a bad defensive team, they happened to play good defense during Verlander's starts, just because the sample size is small enough that that kind of thing can happen. In that light, Posnanski's analysis is crucial -- it's evidence that, yes, the Tigers fielders DID play well (or at least, appear to play well) behind Verlander, even if they didn't play well behind the Tigers' other pitchers.

Because, fielding is subject to variation just like hitting is. Some games, an outfielder makes a great diving catch, and, other days, the same outfielder just misses the catch on an identical ball. More importantly, some days the balls in play are just easier to field than others, and even the BIS data doesn't fully capture that fact, and the fielders look better than they actually played. 

(In fact, I suspect that the errors from misclassifying the difficulty of balls in play are much bigger than the effect of actual randomness in how balls are fielded. But that's not important for this argument.)

------

What if we don't have evidence, either way, on whether Detroit's fielders were better or worse with Verlander on the mound? In that case, it's OK to use the season numbers, right?

No, I don't think so. If the pitcher had better results than expected, you have to assume that the defense played better as well. Otherwise, you'll consistently overrate the pitchers who performed well on bad-fielding teams, and underrate the pitchers who performed poorly on good-fielding teams.

The argument is pretty simple -- it's the usual "regression to the mean" argument to adjust for luck.

When a pitcher does well, he was probably lucky. Not just lucky in how well he himself played, but in EVERY possible area where he could be lucky -- parks, defense, umpire calls, weather ... everything. (See MGL's comment here.)  If a pitcher pitched well, he was probably lucky in how he pitched, and he was probably lucky in how his team fielded.

You roll ten dice, and wind up with a total of 45. You were lucky to get such a high sum, because the expected total was only 35.

Since the overall total was lucky, each individual roll, taken independently, is more likely to have been lucky than unlucky. Because, obviously, you can't be lucky with the total without being lucky with the numbers that sum to the total. We don't know which of the ten were lucky and which were not, but, for each die, we should retrospectively be willing to bet that it was higher than 3.5.

It would be wrong to say something like: "Overall for each of these dice, the expectation was 3.5. That means the first six tosses probably totaled 21. That means that the last four tosses probably totaled 24. Wow! Your last four tosses were 6-6-6-6! They were REALLY lucky!"

It's wrong, of course, because you can't arbitrarily attribute all the luck to the last four tosses. All ten are equally likely to have been lucky ones.
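
The dice claim can be checked exactly by counting. The sketch below works out the distribution of any single die given that ten fair dice summed to 45. The helper names are made up for the example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ways(n_dice, total):
    """Number of ordered ways n fair six-sided dice can sum to total."""
    if n_dice == 0:
        return 1 if total == 0 else 0
    if total < n_dice or total > 6 * n_dice:
        return 0
    return sum(ways(n_dice - 1, total - face) for face in range(1, 7))

def conditional_face_probs(n_dice, total):
    """P(one particular die shows each face | all n dice sum to total)."""
    denom = ways(n_dice, total)
    return {face: ways(n_dice - 1, total - face) / denom for face in range(1, 7)}

probs = conditional_face_probs(10, 45)
expected = sum(face * p for face, p in probs.items())
p_above_average = sum(p for face, p in probs.items() if face >= 4)
print(f"expected value of each die: {expected:.2f}")
print(f"chance a given die beat 3.5: {p_above_average:.3f}")
```

By symmetry the expected value of each die has to be 45/10 = 4.5, and the code confirms it. The same holds for every die, the first six as much as the last four.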

And the same is true for Verlander. His excellent ERA is the sum of pitching and fielding. You can't arbitrarily assume all the good luck came in the rolls of his pitching dice, and that he had exactly zero luck in the rolls of his team's fielding dice.

-------

If that isn't obvious, try this. 

The WAR method works like this: it takes a single game Verlander started, assigns the results to Verlander, and adjusts for the average of what the fielders did in ALL the games they played, not just this one.

Imagine that we reverse it: we take a single game Verlander started, assign the results to the FIELDERS, and adjust for the average of what Verlander did in ALL the games he pitched, not just this one.

One day, Verlander and the Tigers give up 7 runs, and the argument goes something like this:

"Wow! Normally, the Tigers fielders give up only 5 runs, so today they were -2. But wait!  Justin Verlander was on the mound, and he's a great pitcher, and saves an average of two runs a game! If they gave up 7 runs despite Verlander's stellar pitching, the fielders must have been exceptionally bad, and we need to give them a -4 instead of a -2!"

Verlander's stats aren't just a measure of Verlander's own performance. As Tango would say, they're a measure of *what happened when Verlander was on the mound*. That encompasses Verlander's pitching AND his teammates' fielding. 

So, if the results with Verlander on the mound are better than expected, chances are that BOTH of those things were better than expected. 

------

I should probably leave it there, but if you're still not convinced, here's an explicit model.

There's a wheel of fortune with ten slots. You spin the wheel to represent a ball in play. Normally, slots 1, 2, and 3 represent a hit, and 4 through 10 represent an out. But because the Tigers fielders are so bad, number 4 is changed to a hit instead of an out.

In the long term, you expect that the Tigers' defense, compared to average, will cost Verlander one hit for every 10 balls in play. 

But: your expectation of how many hits it actually cost depends on the specific pitcher's results.

(1) Suppose Verlander's results were better than expected. Out of 10 balls in play, he gets 8 outs. How many hits did the defense cost him?

Eight of Verlander's spins must have landed somewhere in slots 5 through 10. Out of those spins, the defense didn't cost him anything, since the defense is only at fault when the wheel stops at slot 4. 

For hits, we expect that one in four came from slot 4. For the two spins that wound up a hit, that works out to half a hit.

So, with the Tigers having given up few hits, we estimate his defense cost Verlander only 0.5 hits, not 1.0 hits.

(2) Suppose Verlander's results were below average -- he gave up 6 hits. Slot 4 hits, which are the fielders' fault, are a quarter of the 6 hits allowed. So, the defense here cost him 1.5 hits, not 1.0 hits.

(3) Suppose Verlander's results were exactly as predicted -- he gave up four hits. On average, one out of those four hits is from slot 4. So, in this case, yes, the defense would have cost him one hit per ten balls in play, exactly the average rate for the team. 

Which means, the better Verlander's stat line, the more likely the fielders played better than their overall average.
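
The three cases can be checked with a short sketch of the wheel model: an exact formula, plus a simulation that spins the wheel and groups the results by how many hits were allowed. The function names and the trial count are choices made for the example.

```python
import random
from collections import defaultdict

def expected_defense_hits(hits, bad_slots=1, hit_slots=4):
    """Given the observed hits, each hit is equally likely to be from slots 1-4."""
    return hits * bad_slots / hit_slots

def simulate_wheel(n_trials=50000, balls_in_play=10, seed=0):
    """Average number of slot-4 hits, grouped by total hits allowed."""
    rng = random.Random(seed)
    totals = defaultdict(lambda: [0, 0])   # hits -> [trials, slot-4 hits]
    for _ in range(n_trials):
        spins = [rng.randint(1, 10) for _ in range(balls_in_play)]
        hits = sum(slot <= 4 for slot in spins)   # slots 1-4 are hits for Detroit
        totals[hits][0] += 1
        totals[hits][1] += spins.count(4)         # hits charged to the fielders
    return {h: slot4 / n for h, (n, slot4) in totals.items()}

if __name__ == "__main__":
    simulated = simulate_wheel()
    for hits in (2, 4, 6):
        print(hits, expected_defense_hits(hits), round(simulated[hits], 2))
```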




Friday, January 20, 2012

GiveWell: Overcomplicating research studies can cost lives

"GiveWell" is an organization that evaluates charities. Not just the usual things -- how well they're run, or how much money goes to administrative expenses -- but also how much good they do for the money they receive.

The idea is: if you have $100 to give to try to make the world a better place, shouldn't you give that $100 where it would give the most benefit? Not just to whoever shows up at your door that day, or whatever organization makes you feel guiltiest, or whoever's suffering kids look the cutest ... but, seriously, to where you can do the most good.

That might not appeal to everyone. If you donate to maximize your own good feelings, instead of the good your donation actually does, GiveWell's evaluations won't make much difference to you. Some people hate to say "no", and so they prefer to give $5 to each of the twenty charities that ask for money. Some people prefer to give to diseases that killed their loved ones, or diseases associated with heroes like Terry Fox. Some people give to causes that signal their political views. Most people prefer to give to help people in their own city or country, even when their dollars will save many more lives abroad.

(I've done all these things, and I'm a bit embarrassed about some of them. But I'm not alone. I mean, people give money to the Children's Wish Foundation to send a terminally ill kid to Disneyland ... which is nice, but, that same amount of money might actually save ten lives if they sent it to Africa where kids are actually dying of things that are easily preventable. I'm not sure what's up with me, and my fellow humans, sometimes. But I digress.)

So, in at least one sense, GiveWell is to donors what sabermetrics is to Joe Morgan. It does analysis to reach conclusions that some might find uncomfortable.

However, in another sense, what GiveWell does is *unlike* sabermetrics, in that it usually doesn't try to get down to the third decimal place. It argues that it can evaluate charities heuristically, that the differences are big enough that they can figure out which charities are the best, using the charities' own reports. As I interpret what they're saying, GiveWell can very easily tell you whether a charity is a Danny Ainge or an Albert Pujols, and it can even tell you more subtle things, like whether a charity is a Joe Carter or an Albert Pujols. But it doesn't try to figure out if a charity is a Ryan Braun or an Albert Pujols. It will just tell you that both are recommended.

That is, GiveWell argues that its goals are better met by the transparency of its recommendations than by any detailed, opaque analyses.

Which is almost exactly what I argued in one of my recent posts -- that, in research, simplicity and transparency are more important than rigor. Simple studies make it much easier to understand the results and catch the inevitable errors. A gentleman from GiveWell, Elie Hassenfeld, read that post, and pointed me to a particular example of a serious error that his organization uncovered.

(Disclaimer: I don't really know much about GiveWell. However, I've been impressed by what I've seen, and at least two of the blogs I read and respect (here's one) say very good things about them. So my Bayesian evaluation of them is quite high.)

-----

As I said, GiveWell doesn't believe they need detailed statistical cost/benefit studies to decide which charities to recommend. However, charities themselves often use such analyses to decide where the money should be spent. There's a whole bunch of organizations and academics devoted to figuring out how to save the most lives for the fewest dollars.

With that objective, the Bill and Melinda Gates Foundation donated $3.5 million to fund a study, "Disease Control Priorities in Developing Countries". They published a report ranking various interventions on cost-effectiveness. The Gates Foundation didn't do that itself -- it was done jointly by The World Bank, the National Institutes of Health, the World Health Organization, and the Population Reference Bureau. Those sound like heavyweights in the world health field.

The results found that -- unsurprisingly to me -- hygiene promotion was the cheapest way to reduce death and disease. The second cheapest, though, was deworming. Specifically, "soil-transmitted helminth" (STH) deworming treatments.

After the report was released, the Gates Foundation provided another $4.4 million to promote the findings. And the findings did indeed attract serious attention. GiveWell writes,

The DCP2’s cost-effectiveness estimates for deworming have been cited widely to advocate a greater focus on treating STH infections, including in:

-- an article in The Lancet

-- a report by REACH, a consortium of large international NGOs and other organizations working to end child hunger, which labeled deworming one of 11 “promoted interventions”

-- the most-cited paper published in the journal International Health

-- an editorial by Peter Hotez, a co-founder of the Global Network for Neglected Tropical Diseases, which has received more than $40 million in funding from the Gates Foundation

-- work by charity evaluators, such as GiveWell, Giving What We Can, and the University of Pennsylvania’s Center for High Impact Philanthropy.


But, as GiveWell later discovered, it turns out the STH estimate was wrong.

That doesn't sound too serious, but here's the thing: it's not just that the estimate was wrong. It was wrong by a factor of almost ONE HUNDRED. The study said that you could save one "disability-adjusted life year" by spending $3.41 on deworming treatments. But, after correcting for the (acknowledged) errors in the study, the actual number was $326.43.

All these well-respected organizations, with serious researchers and serious money, wound up promoting a conclusion that was about as wrong as it could have been. Until the error was caught, then, effectively, 99% of the money devoted to STH treatment was wasted.
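
For the record, here is the arithmetic behind "almost one hundred" and "99%", using only the two dollar figures quoted above. The variable names are made up for the example.

```python
claimed_cost = 3.41      # dollars per DALY in the original estimate
corrected_cost = 326.43  # dollars per DALY after the corrections

overstatement = corrected_cost / claimed_cost
shortfall = 1 - claimed_cost / corrected_cost
print(f"off by a factor of {overstatement:.1f}")
print(f"share of the expected benefit that wasn't there: {shortfall:.1%}")
```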

How did GiveWell catch the error? Subject matter expertise, mostly. In reading the report, they noticed that the STH estimate was much, much lower than other estimates they had seen. Instead of just assuming that this research was somehow better than the previous studies, they investigated.

That seems like just common sense, right? If you see a study that says an iPod can be bought for $3, when you know it usually costs $300, you should look again, shouldn't you? But that didn't happen until someone at GiveWell decided to figure out what was going on.

So they wrote to one researcher, who sent them to other researchers, who sent them complicated spreadsheets. They tried to figure those out, but they couldn't, so they wrote back and forth with questions and explanations. They were referred to still another researcher, who sent them a copy of yet another study that was the source of some of the data.

Eventually, they figured out where the issues were ... if you want a full explanation, it's in their post. It was a lot of detailed, technical effort to figure out what went wrong, and which parameters were in error.

GiveWell's conclusions:

We believe that the errors we’ve found in the estimate would have been caught by a helminth expert independently examining the estimate. Therefore, the presence of these errors implies to us that there has been no such examination. If this is the case, it would argue against the reliability of the DCP2’s estimates in general.

We’ve previously argued for a limited role for cost-effectiveness estimates; we now think that the appropriate role may be even more limited, at least for opaque estimates (e.g., estimates published without the details necessary for others to independently examine them) like the DCP2’s.

More generally, we see this case as a general argument for expecting transparency, rather than taking recommendations on trust - no matter how pedigreed the people making the recommendations. Note that the DCP2 was published by the Disease Control Priorities Project, a joint enterprise of The World Bank, the National Institutes of Health, the World Health Organization, and the Population Reference Bureau, which was funded primarily by a $3.5 million grant from the Gates Foundation. The DCP2 chapter on helminth infections, which contains the $3.41/DALY estimate, has 18 authors, including many of the world’s foremost experts on soil-transmitted helminths.

Absolutely right. You can't substitute credentials for subject matter expertise, and you can't substitute complexity for transparency.

And, one thing I would add: when a study appears to discover that you can get benefits at 99% off the original, well-accepted price ... you have to be suspicious about accepting that conclusion, even if you have no other reason to believe there was any mistake.

-----

P.S. GiveWell expands on the theme here.




Wednesday, November 30, 2011

Why it's hard to estimate small effects

Here's a great 2009 paper (.pdf) by Andrew Gelman and David Weakliem (whom I'll call "G/W"), on the difficulty of finding small effects in a research study.

I'm translating to baseball to start.

-----

Let's suppose you have someone who claims to be a clutch hitter. He's a .300 hitter, but, with runners on base, he claims to be a bit better.

So, you say, show us! You watch his 2012 season, and see how well he hits in the clutch. You decide in advance that if it's statistically significantly different from .300, that will be evidence he's a clutch hitter.

Will that work? No, it won't.

Over 100 AB, the standard deviation of batting average is about 46 points. To find statistical significance, you want 2 SD. That means to convince you, the player would have to hit .392 in the clutch.
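
The 46-point figure is just the binomial standard deviation. A minimal sketch, treating each at-bat as an independent trial (the function name is hypothetical):

```python
import math

def batting_avg_sd(true_avg, at_bats):
    """Binomial SD of an observed batting average over a given number of AB."""
    return math.sqrt(true_avg * (1 - true_avg) / at_bats)

sd = batting_avg_sd(0.300, 100)
threshold = 0.300 + 2 * sd
print(f"SD: {sd:.3f}, average needed for 2 SD: {threshold:.3f}")
```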

The problem is, he's not a .392 hitter! He, himself, is only claiming to be a little bit better than .300. So, in your study, the only evidence you're willing to allow, is evidence that you *know* can't be taken at face value.

Let's say the batter actually does meet your requirement. In fact, let's suppose he exceeds it, and hits .420. What can you conclude?

Well, suppose you didn't know in advance that you were looking for a small effect. Suppose you were just doing a "normal" paper. You'd say, "look, he beat his expectation by 2.6 SD, which is statistically significant. Therefore, we conclude he's a clutch hitter." And then you write a "conclusions" section with all the implications of having a .420 clutch hitter in your lineup.

But, in this case, that would be wrong, because you KNOW he's not a .420 clutch hitter, even though that's what he hit and you found statistical significance. He's .310 at best, maybe .320, if you stretch it. You KNOW that the .420 was mostly due to luck.

Still ... even if you can't conclude that the guy is truly a .420 clutch hitter, you SHOULD be able to at least conclude that he's better than .300 right? Because you did get that statistical significance.

Well ... not really, I don't think. Because, the same evidence that purports to show he's not a .300 hitter ALSO shows he's not a .320 hitter! That is, .420 is also more than 2 standard deviations from .320, which is the best he possibly could be.

What you CAN do, perhaps, is compare the two discrepancies. .420 is 2.6 SDs from .300, but only 2.2 SDs from .320. That does appear to make .320 more likely than .300. In fact, the probability of a .320 hitter going 42-for-100 is almost three times as high as the probability of the .300 hitter going 42-for-100.

But, first, that's only about 3 in 4. Second, that ignores the fact that there are a lot more .300 hitters than .320 hitters, which you have to take into account.
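
The comparison between the two hypotheses can be checked directly with the binomial formula. A minimal sketch follows. The 10 percent prior share of true .320 clutch hitters is an invented number, used only to show how the base rate enters; the function names are hypothetical too.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k hits in n at-bats for a true average p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def likelihood_ratio(k, n, p_alt, p_null):
    """How much better p_alt explains k-for-n than p_null does."""
    return binom_pmf(k, n, p_alt) / binom_pmf(k, n, p_null)

def posterior_alt(ratio, prior_alt):
    """Posterior probability of the alternative, given a prior and the ratio."""
    odds = ratio * prior_alt / (1 - prior_alt)
    return odds / (1 + odds)

ratio = likelihood_ratio(42, 100, 0.320, 0.300)
print(f"likelihood ratio, .320 vs .300: {ratio:.2f}")
print(f"posterior with even prior: {posterior_alt(ratio, 0.5):.2f}")
print(f"posterior with a 10% prior (assumed): {posterior_alt(ratio, 0.1):.2f}")
```

Once the prior reflects that true .320 clutch hitters are the rarer type, the 42-for-100 line leaves .300 as the better bet.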

So, all things considered, you should know in advance that you won't be able to conclude much from this study. The sample size is too small.

-------

That's Gelman and Weakliem's point: if you're looking for a very small effect, and you don't have much data, you're ALWAYS going to have this problem. If you're looking for the difference between .300 and .320, that's a difference of 20 points. If the standard error of your experiment is a lot more than 20 points ... how are you ever going to prove anything? Your instrument is just too blunt.

In our example, the standard error is 46 points. To find statistical significance, you'd have to observe an effect of at least 92 points! And so, if you're pretty sure clutch hitting talent is less than 92 points, why do the experiment at all?

But what if you don't know if clutch hitting talent is less than 92 points? Well, fine. But you're still never going to find an effect less than 92 points. And so, your experiment is biased, in a way: it's set up to only find effects of 92 points or more.

That means that if the effect is small, no matter how many scientists you have independently searching for it, they'll never find it. Moreover, they will frequently find a LARGE effect.

No matter what happens, the experiment will either be wrong too high, or wrong too low. It is impossible for it to be accurate for a small effect. The only way to find a small effect is to increase the sample size. But even then, that doesn't eliminate the problem: it just reduces it. No matter what your experiment, and how big your sample size, if the effect you're looking for is smaller than 2 SDs, you'll never find it.
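To put rough numbers on that, here is a sketch that assumes the true clutch effect really is 20 points (.320 vs. .300) and that each study is 100 at-bats, judged by the 2-SD rule. Both the 20-point effect and the helper names are assumptions for illustration, not anything from G/W's paper.

```python
from math import ceil, comb, sqrt

def binom_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

n = 100
null_avg = 0.300  # normal (non-clutch) talent
true_avg = 0.320  # assumed true clutch talent: a 20-point effect

se = sqrt(null_avg * (1 - null_avg) / n)
# Fewest hits that land at least 2 SEs above .300.
k_min = ceil((null_avg + 2 * se) * n)

# Chance that a single study comes out significant, given the true effect.
power = binom_tail(k_min, n, true_avg)
# Smallest effect a significant study can possibly report.
smallest_reported_effect = k_min / n - null_avg

print(f"hits needed for significance: {k_min}")
print(f"chance one study finds significance: {power:.3f}")
print(f"smallest effect a significant study reports: {smallest_reported_effect * 1000:.0f} points")
```

Because hits come in whole numbers, the cutoff works out to 40 hits, so every study that does clear the bar reports an effect of at least 100 points, five times the assumed true effect, and most studies never clear it at all.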

That's G/W's criticism. It's a good one.

-------

G/W's example, of course, is not about clutch hitting. It's about a previously-published paper, which found that good-looking people are more likely to produce female offspring than male offspring. That study found an 8 percentage point difference between the nicest-looking parents and the worst-looking parents -- 52 percent girls vs. 44 percent girls.

And what G/W are saying is, that 8 point difference is HUGE. How do they know? Well, it's huge as compared to a wide range of other results in the field. Based on the history of studies on birth sex bias, two or three points is about the limit. Eight points, on the other hand, is unheard of.

Therefore, they argue, this study suffers from the "can't find the real effect" problem. The standard error of the study was over 4 points. How can you find an effect of less than 3 points, if your standard error is 4 points? Any reasonable confidence interval will cover so much of the plausible territory, that you can't really conclude anything at all.

Gelman and Weakliem don't say so explicitly, but this is a Bayesian argument. In order to make it, you have to argue that the plausible effect is small, compared to the standard error. How do you know the plausible effect is small? Because of your subject matter expertise. In Bayesian terms, you know, from your prior, that the effect is most likely in the 0-3 range, so any study that can only find an 8-point difference must be biased.

Every study has its own limits of how the standard error compares to the expected "small" effect. You need to know what "small" is. If a clutch hitting study was only accurate to within .0000001 points of batting average ... well, that would be just fine, because we know, from prior experience, that a clutch effect of .0000002 is relatively plausible. On the other hand, if it's only accurate to within .046, that's too big -- because a clutch effect of .092 is much too large to be plausible.

It's our prior that tells us that. As I've argued, interpreting the conclusions of your study is an informal Bayesian process. G/W's paper is one example of how that kind of argument works.

--------

Hat tip: Alex Tabarrok at Marginal Revolution



Monday, November 28, 2011

Why p-value isn't enough, reiterated

Question 1:

People are routinely tested for disease X, which 1 in 1000 people have overall. It is known that if the person has the disease, the test is correct 99% of the time. If the person does not have the disease, the test is also correct 99% of the time.

A patient goes to his doctor for the test. It comes out positive.

What is the probability that the patient has the disease?



Question 2:

Researchers routinely run studies to test unexpected hypotheses (such as: can outside prayer help cure disease?), of which 1 in 1000 tend to be true overall. It is known that if a hypothesis is true, a study correctly finds statistical significance 99% of the time. If the hypothesis is false, the study correctly finds NO statistical significance 99% of the time.

A researcher tests one such unexpected hypothesis. He finds statistical significance.

What is the probability that the hypothesis is true?



--------

Hat Tip: Inspired by Jeremy's last paragraph of comment #25, here.

--------

P.S. Answer to question 1 (very slightly modified question, but the same answer) at my previous post, here.



Wednesday, November 23, 2011

Research conclusions *have* to be bayesian

The last couple of posts here have been about interpreting the results of statistical studies. I argued that the statistical method itself might be just fine, but the *interpretation* of what it means, the conclusions you draw about real life, require an argument. That is, you can get the regression right, but the conclusions wrong, because the conclusions call for argument and judgment.

Or, as some commenters have substituted, "intuition" and "subjectivity". Those are negative things, in academic circles. Objectivity is the ideal, and the idea that the reliability of a work of scholarship depends on a subjective evaluation of the author's judgment doesn't seem to be something that people like.

But, I think it absolutely has to follow. If you find a connection between A and B, how do you know if it's A that causes B, or B that causes A, or if it's all just random? That's something no statistical analysis can tell you. By definition, it calls for judgment, doesn't it? At least a little bit. Recall the recent (contrived) study that showed that listening to kids' music is linked to being physically older. Nobody would conclude that the music MAKES you older, right? But that's not a result of the statistical analysis -- it's a judgment based on outside knowledge. An easy, obvious judgment, but a judgment nonetheless.

It occurred to me that this judgment, that takes you from regression results to conclusions, is really an informal Bayesian inference. I don't think this is a particularly novel insight, but it helps to make the issue clearer. My argument is this: first, even if you do a completely normal ("frequentist") experiment, the step from the results to the conclusions HAS to be Bayesian. And, more importantly, because Bayesian techniques sometimes require judgment, and are therefore not completely objective, the convention has been to avoid such judgment in academic papers. Therefore, these studies have locked themselves in to a situation in which they have to suspend judgment, and use strict rules, which sometimes lead to wrong -- or seemingly absurd -- answers.

OK, let me start by explaining Bayesianism, as I understand it, first intuitively, then in a baseball context. As always, real statisticians should correct me where I got it wrong.

----------

Generally, Bayesian inference is a process by which you refine your probability estimate. You start out with whatever evidence you have which leads you to a "prior" estimate for how things are. Then, you get more evidence. You add that to the pile, and refine your estimate by combining the evidence. That gives you a new, "posterior" estimate for how things are.

You're a juror at a trial. At the beginning of the trial, you have no idea whether the guy is guilty or not. You might think it's 50/50 -- not necessarily explicitly, but just intuitively. Then, a witness comes up that says he saw the crime happen, and he's "pretty sure" this is the guy. Combining that with the 50/50, you might now think it's 80/20.

Then, the defense calls the guy's boss, who said he was at work when the crime happened. Hmmm, you say, that sounds like he couldn't have done it. But there's still the eyewitness. Maybe, then, it's now 40/60.

And so on, as the other evidence unfolds.


That's how Bayesian inference works. You start out with your "prior" estimate, based on all the evidence to date: 50/50. Then, you see some new evidence: there's an eyewitness, but the boss provides an alibi. You combine that new evidence with the prior, and you adjust your estimate accordingly. So your new best estimate, your "posterior," is now 40/60.

---------

That's an intuitive example, but there is a formal mathematical way this works. There's one famous example, which goes like this:

People are routinely tested for disease X, which 1 in 1000 people have overall. It is known that if the person has the disease, the test is correct 100% of the time. If the person does not have the disease, the test is correct 99% of the time.

A patient goes to his doctor for the test. It comes out positive. What is the probability that the patient has the disease?


If you've never seen this problem before, you might think the chance is pretty high. After all, the test is correct at least 99% of the time! But that's not right, because you're ignoring all the "prior" evidence, which is that only 1 in 1000 people have the disease to begin with. Therefore, there's still a strong chance that the test is a false positive, despite the 99 percent accuracy.

The answer turns out to be about 1 in 11. The (non-rigorous) explanation goes like this: 1000 people see the doctor. One has the disease and tests positive. Of the other 999 who don't have the disease, about 10 test positive. So the positive tests comprise 10 people who don't have the disease, and 1 person who does. So the chance of having the disease is 1 in 11.

Phrasing the answer in terms of Bayesian analysis: The "prior" estimate, before the evidence of the test, is 0.001 (1 in 1000). The new evidence, though, is very significant, which means it changes things a fair bit. So, when we combine the new evidence with the prior, we get a "posterior" of 0.091 (1 in 11).
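The head-counting explanation is easy to turn into a few lines of code. This is a minimal sketch; the function name and the population size are arbitrary choices for illustration, and the three inputs are the ones given in the problem (1 in 1000 have the disease, 100% correct on the sick, 99% correct on the healthy).

```python
def chance_sick_given_positive(prevalence, sensitivity, specificity, population=1_000_000):
    """Count true and false positives in a population, then take the ratio."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity
    false_positives = healthy * (1 - specificity)
    return true_positives / (true_positives + false_positives)

posterior = chance_sick_given_positive(prevalence=0.001, sensitivity=1.00, specificity=0.99)
print(f"chance of disease after a positive test: {posterior:.3f}")
```

Dropping the prevalence to 1 in 10,000 while leaving the test alone pushes the answer down to about 1 in 100, which is the "competing odds" idea in action.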

If that still seems counterintuitive to you, think of it this way: if the test is 99% accurate, that's 1 in 100 that it's wrong. That's low odds, which makes you think the test is probably right! But ... the original chance of having the disease is only 1 in 1000. Those are even worse odds. The prior of 1/1000 competes with the new evidence of 1/100. Because the new number (test being wrong) is more likely than the old number (having the disease), the odds are skewed to the test being wrong: odds of 10:1 that the test is wrong, compared to the patient having the disease.

Another way to put it: the less likely the disease was to start with, the more evidence you need to overcome those low odds. 1/100 isn't enough to completely overcome 1/1000.

(Perhaps you can see where this will be going, which is: if a research study's hypothesis is extremely unlikely in the first place, even a .01 significance level shouldn't be enough to overcome your skepticism. But I'm getting ahead of myself here.)

---------

Let's do an oversimplified baseball example. At the beginning of the 2011 baseball season, you (unrealistically) know there's a 50% chance that Albert Pujols' batting average talent will be .300 for the season, and a 50% chance that his batting average talent will be .330. Then, in April, he goes 26 for 106 (.245). What is your revised estimate of the chance that he's actually a .300 hitter?

You start with your "prior" -- a 50% chance he's a .300 hitter. Then, you add the new evidence: 26 for 106. Doing some calculation, you get your "posterior." I won't do the math here, but if I've got it right, the answer is that now the chance is about 74% that Pujols is actually a .300 hitter and not a .330 hitter.
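For anyone who does want to see the math, here is one way to do it: a minimal sketch assuming each at-bat is an independent trial with a fixed hit probability (an exact binomial model). The function name is invented for illustration. Under that model the posterior for .300 comes out a bit under three in four; a different approximation would shift the figure somewhat, but not the direction of the update.

```python
from math import comb

def posterior_over_talents(prior, hits, at_bats):
    """Update a {talent: probability} prior with an observed batting line."""
    weights = {
        avg: p * comb(at_bats, hits) * avg**hits * (1 - avg) ** (at_bats - hits)
        for avg, p in prior.items()
    }
    total = sum(weights.values())
    return {avg: w / total for avg, w in weights.items()}

prior = {0.300: 0.5, 0.330: 0.5}
posterior = posterior_over_talents(prior, hits=26, at_bats=106)

for avg, prob in posterior.items():
    print(f"chance his talent is {avg:.3f}: {prob:.2f}")
```

The same function works with more than two candidate talent levels, or with an uneven prior, by changing the dictionary.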

That should be in line with your intuition. Before, you thought there was a good chance he was a .330 hitter. After, you think there's still a chance, but less of a chance.

We started thinking Pujols was still awesome. Then he hit .245 in April. We thought, "Geez, he probably isn't really a .245 hitter, because we have many years of prior evidence that he's great! But, on the other hand, maybe there's something wrong, because he just hit .245. Or maybe it's just luck, but still ... he's probably not as good as I thought."

That's how Bayesian thinking works. We start with an estimate based on previous evidence, and we update that estimate based on the new evidence we add to the pile.

--------

Now, for the good part, where we talk about academic regression studies.

You want to figure out whether using product X causes cancer. You do a study, and you find statistical significance at p=0.02, and the coefficient says that using product X is linked with a 1% increased chance of cancer. You are excited about your new discovery. What do you put in the "conclusions" section of your paper?

Well, maybe you say "this study has found evidence consistent with X causing cancer." But that isn't helpful, is it? I mean, you also found evidence that's consistent with X *not* causing cancer -- because, after all, it could have just been random luck. (A significance level of .02 would happen by chance 1 out of 50 times.)

Can you say, "this is strong evidence that X causes cancer?" Well, if you do, it's subjective. "Strong" is an imprecise, subjective word. And what makes the evidence "strong"? You better have a good argument about why it's strong and not weak, or moderate. The .02 isn't enough. As we saw in the disease example, a positive test -- which is equivalent to a significance level of .01, since a healthy patient tests positive only 1 in 100 times -- was absolutely NOT strong evidence of having the disease. (It meant only a 1 in 11 chance.)

Similarly, you can't say "weak" evidence, because how do you know? You can't say anything, can you?

It turns out that ANY conclusion about what this study means in real life has to be Bayesian, based not just on the result, but on your PRIOR information about the link between cancer and X. There is no conclusion you can draw otherwise.

Why? Well, it's because the study has it backwards.

What we want to know is, "assuming the data came up the way it did, what is the chance that X causes cancer?"

But the study only tells us the converse: "assuming X does not cause cancer, what is the chance that the data would come up the way it did?"

The p=0.02 is the answer to the second question only. It is NOT the answer to the first question, which is what we really want to know. There is a step of logic required to go from the second question to the first question. In fact, Bayes' Theorem gives us the equation for finding the answer to the first question given the second. That equation requires us to know the prior.

What the study is asking is, "given that we got p=0.02 in this experiment, what's the chance that X causes cancer?" Bayes' Theorem tells us the question is unanswerable. All we can answer is, "given that we got p=0.02 in this experiment, what is the chance that X causes cancer, given our prior estimate before this experiment?"

That is: you CANNOT make a conclusion about the likelihood of "X causes cancer" after the experiment, unless you had a reliable estimate of the likelihood of "X causes cancer" BEFORE the experiment. (In mathematical terms, to calculate P(A|B) from P(B|A), you need to know P(B) and P(A).)
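Here is a rough sketch of how much the answer swings with the prior. It simplifies things by treating "significant at .02" as the evidence, so a no-effect X comes up significant 2% of the time, and it assumes a real effect would be detected 80% of the time. That 80% figure and the list of priors are hypothetical numbers picked only for illustration.

```python
def chance_effect_is_real(prior, alpha, power):
    """Bayes' Theorem: P(effect is real | the study came out significant)."""
    real_and_significant = prior * power
    null_and_significant = (1 - prior) * alpha
    return real_and_significant / (real_and_significant + null_and_significant)

ALPHA = 0.02  # chance of a significant result when X does nothing
POWER = 0.80  # hypothetical chance of a significant result when X is harmful

for prior in (0.5, 0.1, 0.01, 0.000001):
    posterior = chance_effect_is_real(prior, ALPHA, POWER)
    print(f"prior {prior:<8g} -> posterior {posterior:.5f}")
```

Same experiment, same .02, and the conclusion runs anywhere from "almost certainly real" to "almost certainly nothing," depending entirely on where you started.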

Does this sound wrong? Do you think you can get a good intuitive estimate just from this experiment alone? Do you feel like the .02 we got is enough to be convincing?

Well, then, let me ask you this: what's your answer? What do you think the chance is that X causes cancer?

If you don't agree with me that there's no answer, then figure out what you think the answer is. You may assume the experiment is perfectly designed, the sample size is adequate, and so on.

If you don't have a number -- you probably don't -- think of a description, at least. Like, "X probably causes cancer." Or, "I doubt that X causes cancer." Or, "by the precautionary principle, I think everyone should avoid X." Or, "I don't know, but I'd sure keep my kids away from X until there's evidence that it's safe!"

Go ahead. I'll leave some white space for you. Get a good intuitive idea of what your answer is.




(Link to Jeopardy music while you think (optional))





OK. Now, I'm going to tell you: product X is a bible.

Does that change your mind?

It should. Your conclusion about the dangers of X should absolutely depend on what X is -- more specifically, what you knew about X before. That is, your PRIOR. Your prior, I hope, had a probability of close to 0% that a Bible can cause cancer. That's not just a wild-ass intuition. There are very good, rational, objective reasons to believe it. Indeed, there is no evidence that the information content of a book can cause cancer, and there is no evidence or logic that would lead you to believe that bibles are more carcinogenic than, say, copies of the 1983 Bill James Baseball Abstract.

Call this "intuition" or "subjectivity" if you want. But if you decide not to use your own subjective judgment, what are you going to do? Are you going to argue that bibles cause cancer just to avoid having to take a stand?

I suppose you can stop at saying, "this study shows a statistically significant relationship between bible use and cancer." That's objectively true, but not very useful. Because the whole point of the study is: do bibles cause cancer? What good is the study if you can't apply the evidence to the question?

--------

You could do the Bayesian approach thing more formally. That's what researchers usually mean when they talk about "Bayesian methods" -- they mean formal statistical algorithms.

To do a Bayesian analysis, you need a prior. You could just arbitrarily take something you think is reasonable. "Well, we don't believe there's much of a chance bibles cause cancer, so we're going to assume a prior 99.9999% probability that there's no effect, and we'll split up the last remaining .0001% in a range between -2% and +2%." Now, you do the study, and recalculate your posterior distribution, to see if you now have enough evidence to conclude there's a danger.

If you did that, you'd find that your posterior distribution -- your conclusion -- was that the probability of no effect went down, but only from 99.9999% to 99.995%, or something. That would make your conclusion easy: "the evidence should increase our worry that bibles cause cancer, but only from 1 in a million to 1 in 20,000."

But, that Bayesian technique is not really welcome in academic studies. Why? Because that prior distribution is SUBJECTIVE. The author can choose any distribution he wants, really. I chose 99.9999%, but why not 99.99999% (which is probably more realistic)? The rule is that academic papers are required to be objective. If you allow the author to choose any prior he wants, based on his own intuition or judgment, then, first, the paper is no longer objective, and second, there is the fear that the author could get any conclusion he wanted just by choosing the appropriate prior.

So papers don't want to assume a prior. So instead of arguing about the chance the effect is real, the paper just assumes it's real, and takes it at face value. If X appears to increase cancer by 1%, and it's statistically significant, then the conclusion will assume that X actually *does* increase cancer by 1%.

That sounds like it's not Bayesian. But, in a sense, it is. It's exactly the result you'd get from a Bayesian analysis with a prior that assumes every result is equally likely. Yes, it's objective, because you're always using the same prior. But it's the *wrong* prior. You're using a fixed assumption, instead of the best assumption you can, just because the best assumption is a matter of discretion. You're saying, "Look, I don't want to make any subjective assumptions, because then I'm not an objective scientist. So I'm going to assume that bibles are just as likely to cause 1% more cancers as they are to cause 0% more cancers."

That's obviously silly in the bible case, and, when it's that obvious, it looks "objective" enough that the study can acknowledge it. But most of the time, it's not obvious. In those cases, the studies will just take their results at face value, *as if theirs is the only evidence*. That way, they don't have to decide if their result is plausible or not, in terms of real-life considerations.

Suppose you have two baseball studies. One says that certain batters can hit .375 when the pitcher throws lots of curve balls. Another says that batters gain 100 points on their batting average after the manager yells at them in the dugout. Both studies find exactly the same size effect, with exactly the same significance level of, say, .04.

Of the two conclusions, which one is more likely to be true? The curve ball study, of course. We know that some batters hit curve balls better than others, and we know some batters hit well over .300 in talent. It's fairly plausible that someone might actually have .375 talent against curve balls.

But the "manager yells at them" study? No way. We have a strong reason to believe it's very, very unlikely that batters would improve by that much just because they were yelled at. We have clutch hitting studies, that barely find an effect even when the game is on the line. We have lots of other studies that, even when they do find an effect, like platooning, find it to be much, much less than 100 points. Our prior for the "manager yelling is worth 100 points" hypothesis is so low that a .04 will barely move it.

Still ... I guarantee you that if these two studies were published, the two "conclusions" sections would not give the reader any indication of the relative real-life likelihood of the conclusions being correct, except by reference to the .04. In their desire to be objective, the two studies would not only fail to give shadings of their hypotheses' overall plausibility, but they'd probably just treat both conclusions as if they were almost certainly true. That's the general academic standard: if you have statistical significance, you're entitled to just go ahead and assume the null hypothesis is false. To do anything else would be "subjective."

But while that eliminates subjectivity, it also eliminates truth, doesn't it? What you're doing, when you use a significance level instead of an argument, is that you're choosing what's most objective, instead of what's most likely to be right. You're saying, "I refuse to make a judgment, and so I'm going to go by rote and not consider that I might be wrong." That's something that sounds silly in all other aspects of life. Doesn't it also sound silly here?


--------

So, am I arguing that academics need to start doing explicit Bayesian analysis, with formal mathematical priors? No, absolutely not. I disagree with that approach for the same reasons other critics do: it's too subjective, and too subject to manipulation. As opponents argue, how do you know you have the right prior? And how can you trust the conclusions if you don't?

So, that's why I actually prefer the informal, "casual Bayesian" approach, where you use common sense and make an informal argument. You take everything you already know about the subject -- which is your prior -- and discuss it informally. Then, you add the new evidence from your study. Then, finally, you give your evaluation of the real-life implications of what you found.

You say, "Well, the study found that reading the bible is associated with a 1% increase in cancer. But, that just sounds so implausible, based on our existing [prior] knowledge of how cancer works, that it would be silly to believe it."

Or, you say, "Yes, the study found that batters hit 100 points better after being yelled at by their manager. But, if that were true, it would be very surprising, given the hundreds of other [prior] studies that never found any kind of psychological effect even 1/20 that big. So, take it with a grain of salt, and wait for more studies."

Or, you say, "We found that using this new artificial sweetener is linked to one extra case of cancer per 1,000,000 users. That's not much different from what was found in [prior] studies with chemicals in the same family. So, we think there's a good chance the effect is real, and advise caution until other evidence makes the answer clearer."

That's what I meant, two posts ago, where I said "you have to make an argument." If you want to go from "I found a statistically significant 4% connection between cancer and X," to "There is a good chance X causes cancer," you can't do that, logically or mathematically, without a prior. The p value is NEVER enough information.

The argument is where you informally think about your prior, even if you don't use that word explicitly. The argument is where you say that it's implausible that bibles cause cancer, but more plausible that artificial sweeteners cause cancer. It's where you say that it's implausible that songs make you older, but not that the effect is just random. It's where you say that there's so much existing evidence that triples are a good thing, that the fact that this one correlation is negative is not enough to change your mind about that, and there must be some other explanation.

You always, always, have to make that argument. If you disagree, fine. But don't blame me. Blame Bayes' Theorem.




Thursday, September 22, 2011

The Bayesian Cy Young

At Fangraphs, Dave Cameron and Eric Seidman have a nice discussion (hat tip: Tango) on who's the better Cy Young candidate: Clayton Kershaw, or Roy Halladay?

Part of the discussion hinges on BABIP: batting average on balls in play. As Voros McCracken discovered years ago, pitchers generally don't differ much in what happens when a non-home-run ball is hit off them. The overall differences between pitchers, then, are partly due to the fielders behind them, but mostly due to luck.

So far in 2011, Clayton Kershaw has a BABIP of .272, which Eric describes as "absurdly low." Still, Eric thinks it might actually be skill rather than luck, since .272 isn't that much different from what Kershaw allowed in previous years. Dave argues that Kershaw's three seasons is still a fairly small sample size, and points out that most of his BABIP advantage comes from his record at home (he's about average on the road).

Anyway, my point isn't to weigh in on which one is right -- they do a fine job hashing things out in their discussion. What I want to talk about is something they both seem to agree on: that it's important whether the BABIP is luck or skill. If it's luck, that reduces Kershaw's Cy Young credentials. If it's skill, he's a better candidate.

Seems reasonable, and I don't necessarily disagree. But let's see where that logic leads.

Because, there are other kinds of luck, or factors that pitchers can't control. For instance, there's park (which is usually already adjusted for in WAR, the statistic Eric and Dave cite most in this debate).

There's also quality of opposition batting. It's probably not too hard, if you have good data, to figure out how much either of the pitchers gained by being able to pitch to inferior hitters. You could also check if one of them had the platoon advantage more often. And, if one of them pitched more at home than the other one did.

We'd probably all agree, right, that you'd want to adjust for those kinds of things if we had the information? To be clear, I'm not criticizing Dave or Eric for not spending hours figuring this stuff out. I'm just saying that if you have the data, it's relevant in comparing the pitchers.

There are other things too, that eventually we'll be able to figure out, that we can't right now because (as far as I know) the research hasn't been done. Suppose Kershaw throws a pitch at a certain speed, with a certain break, on a certain count. And, someday, we'll know that kind of pitch is swung on and missed 30% of the time, called a ball 5% of the time, called a strike 10% of the time, fouled off 10% of the time, and hit in play 45% of the time with an OPS of .850. Maybe, overall, that pitch is worth (say) +0.05 runs (in favor of the pitcher).

Once we have that kind of information, we can check for "batter swing luck". If it turns out that batters just randomly happened to go +0.03 on that pitch from Kershaw this season, instead of +0.05, we should credit him the extra 0.02, right? He delivered a certain performance, and the batters just happened to get a bit lucky on it, as if his BABIP was too high. (This measure would probably substitute for BABIP: it includes balls in play, but also home runs, swings-and-misses, and walk potential.)

So we'd adjust Kershaw and Halladay for how lucky the batters were on those swings.

That's not unrealistic, and it'll probably eventually happen, to some degree of accuracy. Here's one that probably won't, at least not for a few decades, but it works as a thought experiment.

Imagine we hook a probe to every batter's brain, so on every pitch we can tell if he's guessing fastball or curve, and if he's guessing inside or outside. After a couple of years of analyzing this data, we figure that when he guesses right, it's worth +0.1 runs (for the batter), when he guesses half-right, it's worth 0, and when he guesses wrong, it's -0.1.

That again, is something out of the control of the pitcher (especially if both batter and pitcher are randomizing using game theory). So you'd want to control for it, right? If Halladay is having a good year just because batters were unlucky enough to guess right only 23% of the time instead of 25%, you have to adjust, just like you'd adjust for a lucky BABIP.

This will change the definition of "batter swing luck," but not replace it. First, the batter may have been lucky enough to guess right, which is worth something. Then, he might have been lucky enough to get better than expected wood on the ball even controlling for the fact that he guessed right.

So you've got lots of sources of luck:

-- park
-- day/night
-- distribution of batters
-- platoon luck
-- BABIP luck
-- batter swing luck
-- batter guess luck

You'd want to adjust for all of these. Right now, as I understand WAR, we're adjusting for park and BABIP.

What about the others? Well, we can't really adjust for those. We *want* to, but we can't.

So, we make do with just park and BABIP. Still, no matter how many decimal places we go to with the debate on Kershaw/Halladay, we're still only going to have our best guess.

At least we can argue that if all the other things are random, we should still be unbiased. Right?

Well, not really. From a Bayesian standpoint, we have a pretty good idea who had more luck. It's much more likely to be Kershaw.

Why? Because Halladay's performance is much more consistent with his career than Kershaw's. Kershaw's a good pitcher, but wasn't expected to be *that* good. Halladay, on the other hand, is having a typical Halladay season. Well, a bit better than typical, but not much.

I'd be willing to bet a lot of money that if you found 50 pitchers who had a better-than-career season, by at least (say) 1.5 WAR, you would find that those 50 pitchers had above-average BABIP luck. It stands to reason. I won't make a full statistical argument, but here's a quick oversimplification of one.

A pitcher can have his talent go up or down from year to year. He can have his luck go up or down from year to year. That's four combinations. Only three of them are possibly consistent with a big improvement in WAR: talent up/luck up; talent up/luck down; talent down/luck up. Two of those have his luck going up. So, two times out of three, the pitcher was lucky.
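A small simulation makes the same point without the four-way counting. This is only a sketch, not a study: it assumes year-to-year changes in talent and in luck are independent and normally distributed, and the spreads (1 WAR each) are hypothetical numbers chosen for illustration.

```python
import random

random.seed(2011)  # fixed seed so the sketch is reproducible

TALENT_SD = 1.0  # hypothetical spread of year-to-year talent change, in WAR
LUCK_SD = 1.0    # hypothetical spread of year-to-year luck change, in WAR
BIG_JUMP = 1.5   # "better than career by at least 1.5 WAR"

improvers = 0
lucky_improvers = 0
for _ in range(200_000):
    talent_change = random.gauss(0, TALENT_SD)
    luck_change = random.gauss(0, LUCK_SD)
    if talent_change + luck_change >= BIG_JUMP:
        improvers += 1
        if luck_change > 0:
            lucky_improvers += 1

share_lucky = lucky_improvers / improvers
print(f"pitchers with a big jump: {improvers}")
print(f"share of them whose luck went up: {share_lucky:.2f}")
```

Under these made-up spreads, the share of big improvers who had good luck comes out well above the two-in-three from the quick count, because a big jump is much easier to reach when luck and talent push the same way.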

The argument applies to *all* sources of luck. Even after taking BABIP into account, if a pitcher's adjusted performance is still above his career average, he's still more likely to have had good luck than bad, in other ways (batter swings, say).

I don't have an easy way to quantify this, but still I'd give you better-than-even odds that, stripping out all the above, Halladay is performing better than Kershaw -- even after adjusting for park and BABIP.

If you have two players with similar, outstanding performances, the player with the better expectation of talent is probably the one who's actually having the better year. To believe that Kershaw was really likely to have had a better year than Halladay, you really need him to have put up *much* better numbers. Either that, or you need a way to actually work out all the luck, and prove that the residual still favors Kershaw.
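One simple way to put numbers on "how much better" is the standard normal-prior, normal-noise update, where the best guess at luck-free performance is a weighted average of what you expected and what you observed. This is a sketch under assumptions: the expectations, the observed figure, and both variances below are hypothetical WAR-like numbers, not estimates for any real pitcher.

```python
def luck_free_estimate(expected, observed, prior_var, luck_var):
    """Posterior mean: shrink the observed season toward what was expected."""
    weight_on_observed = prior_var / (prior_var + luck_var)
    return expected + weight_on_observed * (observed - expected)

PRIOR_VAR = 1.0  # hypothetical uncertainty about true performance going in
LUCK_VAR = 1.0   # hypothetical variance of unmeasured luck in a season

# Two pitchers with identical observed seasons but different expectations.
veteran = luck_free_estimate(expected=6.0, observed=7.0, prior_var=PRIOR_VAR, luck_var=LUCK_VAR)
breakout = luck_free_estimate(expected=4.0, observed=7.0, prior_var=PRIOR_VAR, luck_var=LUCK_VAR)

# Observed season the breakout pitcher would need just to tie the veteran.
weight = PRIOR_VAR / (PRIOR_VAR + LUCK_VAR)
needed = 4.0 + (veteran - 4.0) / weight

print(f"veteran's luck-free estimate:  {veteran:.1f}")
print(f"breakout's luck-free estimate: {breakout:.1f}")
print(f"observed season breakout needs to tie: {needed:.1f}")
```

With these made-up inputs, identical stat lines translate into a full-win gap in the luck-free estimates, and the lower-expectation pitcher needs a visibly better season just to pull even.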

I should emphasize that I am NOT talking about talent here. I think most people would agree that Halladay is still more talented than Kershaw, but would nonetheless argue Kershaw might still be having the better season.

But, what I'm saying is, no, I bet Kershaw is NOT having a better season, even if his numbers look better. I'm saying that it's likely that Kershaw *is actually not pitching better*. If we had the data, it's more likely than not that we'd see that batters are just having bad luck -- not only are they (perhaps) hitting the ball directly to fielders, as BABIP suggests, but they're probably swinging and missing at hittable pitches.

---------

Another way to look at it: if two pitchers have mostly the same results, but one has better stuff, what does that mean? It means that the pitcher with the better stuff must have been unluckier than the pitcher with the worse stuff. In other words, the batters facing the better stuff must have been luckier.

We don't know for sure, of course, that Halladay had better stuff than Kershaw. But history suggests that's more likely. And so, the odds are on the side of Kershaw having been luckier than Halladay. How much so?

I don't know. One mitigating factor is that Kershaw is young, so you'd expect more of his improvement to be real. But, still, a small improvement is more likely than a large improvement, so the odds are still on the side of positive luck over negative luck.

---------

Does that take some of the fun out of the Cy Young? I think it certainly does make it a little bit less entertaining, at least until we have better data. That's because, as long as we remain ignorant of a significant amount of luck, there's a much higher hurdle to clear before awarding the honor to anyone other than Halladay.

This is a bit counterintuitive, but it's true. Suppose a good but not great pitcher -- Matt Cain, say -- has almost exactly the same stat line as Roy Halladay, including BABIP, but is actually better in some categories. Perhaps he has a couple of extra strikeouts, and a couple fewer walks.

From the usual arguments, there would be absolutely no debate that Cain's season is better, right? He's better than Halladay in some categories, and the same as Halladay in all the others.

But ... if you're trying to bet on which player actually pitched better after removing all the luck, you'd still have to go with Halladay.
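
One way to put rough numbers on that bet is the standard normal-normal shrinkage formula: treat what we expected of the pitcher coming in as a prior, treat the stat line as that prior's true value plus luck, and take the weighted average. This is a sketch only. The figures below are made up for two hypothetical pitchers (they are not Halladay's or Cain's actual numbers), the units are an arbitrary "runs better than average" scale, and the function name and both standard deviations are my assumptions.

```python
def estimated_true_performance(prior_mean, observed, prior_sd, luck_sd):
    """Best guess at luck-free performance under a normal-normal model.

    prior_mean: what we expected before the season (assumed)
    observed:   what the stat line says
    prior_sd:   how far real performance tends to stray from expectation (assumed)
    luck_sd:    how much luck typically moves a season's stat line (assumed)
    """
    prior_var = prior_sd ** 2
    luck_var = luck_sd ** 2
    # Weight on the stat line: higher when luck is small relative to real variation.
    weight_on_observed = prior_var / (prior_var + luck_var)
    return prior_mean + weight_on_observed * (observed - prior_mean)


# Hypothetical numbers: the established ace was expected to be much better,
# but the other pitcher's stat line is slightly ahead.
ace = estimated_true_performance(prior_mean=40, observed=45, prior_sd=10, luck_sd=15)
challenger = estimated_true_performance(prior_mean=20, observed=47, prior_sd=10, luck_sd=15)

if __name__ == "__main__":
    print(round(ace, 1), round(challenger, 1))
```

With these made-up inputs the ace's estimate stays ahead even though the challenger's stat line is better, because most of the gap between the challenger's expectation and his results gets attributed to luck. The conclusion flips only if the challenger's numbers are far better, or if `luck_sd` is small enough that the stat line can be taken nearly at face value.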

-----

UPDATE: on his blog, Tango writes,

Aside to Phil: Marcel had Kershaw with a 3.07 ERA for 2011, and Halladay at 3.04. So, while you make great points in your article, you didn’t have the right examples! Sabathia and Verlander would have been better examples.

Oops! I'll just leave it the way it is for now, but point taken.
