## Friday, April 04, 2008

### Carl Bialik on DiMaggio streak probabilities

There have been lots of studies on the chances of beating Joe DiMaggio's 56-game hitting streak – SABR's "Baseball Research Journal" had one in 1994, and at least three more over the past five years or so. And there was another one, recently, in the New York Times, where two Cornell mathematicians ran simulations of baseball history.

Yesterday, Carl Bialik discussed the math involved in calculating the odds of such a streak, and summarized a reader's objection to the simulation's methodology. The Cornell authors had assumed every player got the same number of plate appearances in every game. That would inflate the odds. Bialik summarizes the logic perfectly:

"Suppose one batter has the identical 81% chance of getting at least one hit in each of 10 games — which is what the authors assumed for DiMaggio for every game of the 1941 season. Now suppose that during those 10 games, his teammate has an 89% chance of getting a hit in each of the first nine games, and a 9% chance in the 10th game. For both players, their mean probability across those 10 games is 81%. But DiMaggio has about a 12.2% chance of getting at least one hit in each game (0.81 to the 10th power). His teammate’s probability of starting off with a nine-game hitting streak is 35%. But that tough 10th game makes his chance of a 10-game hitting streak just 3.2% — or barely one-quarter of DiMaggio’s."

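Bialik's arithmetic is easy to check. Here is a minimal Python sketch, assuming (as the quote does) that games are independent, so the streak probability is just the product of the per-game probabilities. The names `streak_probability`, `steady` and `uneven` are made up for this illustration.

```python
from math import prod

def streak_probability(per_game_probs):
    """Probability of a hit in every game, assuming games are independent."""
    return prod(per_game_probs)

steady = [0.81] * 10          # same 81% chance in every game
uneven = [0.89] * 9 + [0.09]  # nine easier games, then one very tough one

# Both batters have the same mean per-game probability (81%) ...
print(sum(steady) / len(steady), sum(uneven) / len(uneven))

# ... but very different chances of a 10-game streak.
print(f"steady: {streak_probability(steady):.3f}")
print(f"uneven: {streak_probability(uneven):.3f}")
```

The two lists have the same mean, but the product punishes the one bad game far more than the nine good games help.
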
The moral is that when you multiply 56 probabilities together, little things make a big difference, and you have to make sure you get every assumption right. And there are many issues that you have to just estimate. For instance, suppose Joe is on his 46th game, is hitless so far, and is up for his last time in the game. Is Joe less likely to take pitches, reducing his chances for a walk? Will the opposing pitcher pitch outside, hoping to take advantage of Joe's desperation, or is there some unwritten rule that would actually have him throw down the middle and challenge DiMaggio?

Who knows? But it makes a big difference. If you assume that Joe has a (say) 32% probability of getting a hit in each of his first three PA, but that on his fourth, his chances increase to 35% (because he won't walk), his chance of getting a hit in that game is 79.56%. But if you treat his fourth appearance the same as his first, his chance is only 78.62%. That doesn't seem like much, but over 20 games – say, from game 37 to 56 – the lower figure cuts the probability of the 20-game substreak from 1 in 97 to 1 in 123.
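
For anyone who wants to play with these numbers, here is a rough Python sketch of the calculation. It assumes exactly four plate appearances per game and independence between them, which are simplifications; the function name `game_hit_probability` is just for illustration.

```python
def game_hit_probability(pa_probs):
    """Chance of at least one hit, given independent per-PA hit probabilities."""
    p_no_hit = 1.0
    for p in pa_probs:
        p_no_hit *= 1.0 - p
    return 1.0 - p_no_hit

# Fourth PA bumped to 35% (he won't take a walk) vs. all four PA at 32%.
boosted = game_hit_probability([0.32, 0.32, 0.32, 0.35])
flat = game_hit_probability([0.32] * 4)

for label, p in (("boosted", boosted), ("flat", flat)):
    streak = p ** 20  # hit in each of 20 consecutive games
    print(f"{label}: per game {p:.4f}, 20 games about 1 in {1 / streak:.0f}")
```

Changing the 35% or the number of games shows how quickly a one-point difference per game compounds.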

And that's only one of many questions you could ask. Would official scorers be more likely to credit Joe with a hit on a bobbled ball? Would opposing managers be more, or less likely to bring in a reliever in the 9th to get the platoon advantage on Joe? Would DiMaggio consciously try to hit singles instead of homers to maximize his chance of getting a hit?

But the most important issue, I think, is one that most analyses I've seen get wrong. It's that when you're projecting DiMaggio's chances of getting a hit, you can't use his raw 1941 batting stats as the Cornell simulation did (Joe hit .357 that year). Rather, you have to estimate his true talent level, which is much closer to the mean than .357. This makes a huge, huge difference. I haven't done the math, but I'd bet that if you use something more realistic – maybe around .320? – it'll change the results by around a factor of 10.
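
A crude way to get a feel for the size of this effect is the Python sketch below. It is not the Cornell simulation or anything close to it: it assumes exactly 4 at-bats in every game (the very simplification criticized above), independent at-bats, and it looks only at one specific 56-game stretch rather than a whole season or career. The function name `fixed_ab_streak_probability` and the 4-AB figure are assumptions made for this illustration.

```python
def fixed_ab_streak_probability(batting_avg, at_bats_per_game=4, games=56):
    """Chance of a hit in each of `games` consecutive games.

    Assumes a fixed number of at-bats per game and independent at-bats;
    both are simplifications.
    """
    p_game = 1.0 - (1.0 - batting_avg) ** at_bats_per_game
    return p_game ** games

raw = fixed_ab_streak_probability(0.357)        # raw 1941 average
regressed = fixed_ab_streak_probability(0.320)  # rough guess at true talent

print(f"raw: {raw:.3g}, regressed: {regressed:.3g}, ratio: {raw / regressed:.1f}")
```

Plugging in .330 or .340 instead of .320 shows how sensitive the answer is to the true-talent estimate.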


At Friday, April 04, 2008 11:10:00 AM,  Don Coffin said...

Hi, Phil. You wrote: "But the most important issue, I think, is one that most analyses I've seen get wrong. It's that when you're projecting DiMaggio's chances of getting a hit, you can't use his raw 1941 batting stats as the Cornell simulation did (Joe hit .357 that year). Rather, you have to estimate his true talent level, which is much closer to the mean than .357."

I fully agree with the logic, but the issue is what we take to be DiMaggio's "true" level of talent. His career BA was .325, but here's what he did at the beginning:
1936---.324
1937---.346
1938---.324
1939---.381
1940---.352
Total, to 1940, .343, in more than 2800 AB.

So, .357 is probably a stretch on the high side. But .320 is probably a stretch on the low side. (One thing that's clear is that DiMaggio was not the same hitter beginning in 1942--and clearly not after WWII--that he was before: .305, .290 (1946), .315, .320, .345, .301, .263 (1951).) But by 1942, why wouldn't you think he was an established .340 hitter?

Still, it makes the streak look less likely if you evaluate him at .340...

At Friday, April 04, 2008 11:18:00 AM,  Phil Birnbaum said...

I agree that you might think that Joe was an established .340 hitter in 1941, but perhaps I discounted the later years more than you did. You say he wasn't the same hitter, but maybe he WAS the same hitter, just lucky earlier?

But I think you're right that .320 is too low. On second thought, if you showed me all his seasons from 1936-1942 (except 1941), and asked me to guess 1941, I'd probably put the over/under somewhere in the .330s. I should have used that.

(Tango, how many ABs of league-average performance do you add to regress a batting average?)