Carl Bialik on DiMaggio streak probabilities
There have been lots of studies on the chances of beating Joe DiMaggio's 56-game hitting streak – SABR's "Baseball Research Journal" had one in 1994, and at least three more over the past five years or so. And there was another one, recently, in the New York Times, where two Cornell mathematicians ran simulations of baseball history.
Yesterday, Carl Bialik discussed the math involved in calculating the odds of such a streak, and summarized a reader's objection to the simulation's methodology. The Cornell authors had assumed every player got the same number of plate appearances in every game. That would inflate the odds. Bialik summarizes the logic perfectly:
"Suppose one batter has the identical 81% chance of getting at least one hit in each of 10 games — which is what the authors assumed for DiMaggio for every game of the 1941 season. Now suppose that during those 10 games, his teammate has an 89% chance of getting a hit in each of the first nine games, and a 9% chance in the 10th game. For both players, their mean probability across those 10 games is 81%. But DiMaggio has about a 12.2% chance of getting at least one hit in each game (0.81 to the 10th power). His teammate’s probability of starting off with a nine-game hitting streak is 35%. But that tough 10th game makes his chance of a 10-game hitting streak just 3.2% — or barely one-quarter of DiMaggio’s."
The moral is that when you multiply 56 probabilities together, little things make a big difference, and you have to make sure you get every assumption right. And there are many things that you can only estimate. For instance, suppose Joe is on his 46th game, is hitless so far, and is up for his last time in the game. Is Joe less likely to take pitches, reducing his chances for a walk? Will the opposing pitcher pitch outside, hoping to take advantage of Joe's desperation, or is there some unwritten rule that would actually have him throw down the middle and challenge DiMaggio?
Who knows? But it makes a big difference. If you assume that Joe has a (say) 32% probability of getting a hit in each of his first three PA, but that on his fourth his chances increase to 35% (because he won't walk), his chance of getting a hit in that game is 79.56%. But if you treat his fourth appearance the same as his first, his chance is only 78.62%. That doesn't seem like much, but over 20 games – say, from game 37 to 56 – the lower figure cuts the probability of the 20-game substreak from 1 in 97 to 1 in 123.
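The per-game numbers come from asking how often Joe goes hitless in all four trips and subtracting from one. A small sketch of that calculation, assuming exactly four plate appearances per game and independence between them (the function name `game_hit_probability` is just for illustration):

```python
def game_hit_probability(pa_probs):
    """Chance of at least one hit, given a hit probability for each plate appearance."""
    hitless = 1.0
    for p in pa_probs:
        hitless *= 1.0 - p   # he has to fail every time to go hitless
    return 1.0 - hitless

# Fourth plate appearance bumped to 35% versus treated like the first three.
boosted = game_hit_probability([0.32, 0.32, 0.32, 0.35])
flat = game_hit_probability([0.32] * 4)

# Chance of hitting in 20 straight games (say, games 37 through 56).
boosted_20 = boosted ** 20
flat_20 = flat ** 20

print(f"per game: {boosted:.2%} vs {flat:.2%}")
print(f"20-game substreak: 1 in {1 / boosted_20:.0f} vs 1 in {1 / flat_20:.0f}")
```

A difference of less than one percentage point per game turns into a gap of about a quarter in the odds over 20 games, and the gap keeps widening over all 56.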
And that's only one of many questions you could ask. Would official scorers be more likely to credit Joe with a hit on a bobbled ball? Would opposing managers be more or less likely to bring in a reliever in the 9th to get the platoon advantage on Joe? Would DiMaggio consciously try to hit singles instead of homers to maximize his chance of getting a hit?
Whatever your answers, you're going to affect the estimate. A lot.
But the most important issue, I think, is one that most analyses I've seen get wrong. It's that when you're projecting DiMaggio's chances of getting a hit, you can't use his raw 1941 batting stats as the Cornell simulation did (Joe hit .357 that year). Rather, you have to estimate his true talent level, which is much closer to the mean than .357. This makes a huge, huge difference. I haven't done the math, but I'd bet that if you use something more realistic – maybe around .320? – it'll change the results by around a factor of 10.
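For anyone who wants to try the comparison, here is a very rough sketch. It assumes exactly four at-bats in every game, no walks, and independence, which is exactly the kind of simplification criticized above, and it only looks at one specific 56-game stretch rather than a whole season or career. The constant `AT_BATS_PER_GAME` and the function name are my own assumptions, so treat the output as an illustration of sensitivity, not an estimate.

```python
AT_BATS_PER_GAME = 4   # assumption for illustration, not a figure from any of the studies
STREAK_LENGTH = 56

def streak_probability(batting_average, at_bats=AT_BATS_PER_GAME, games=STREAK_LENGTH):
    """Chance of a hit in each of `games` consecutive games at a fixed batting average."""
    game_hit = 1.0 - (1.0 - batting_average) ** at_bats
    return game_hit ** games

raw_1941 = streak_probability(0.357)    # DiMaggio's actual 1941 average
regressed = streak_probability(0.320)   # the "maybe around .320?" talent guess
ratio = raw_1941 / regressed

print(f"using .357: {raw_1941:.2e}")
print(f"using .320: {regressed:.2e}")
print(f"ratio: {ratio:.1f}")
```

Changing `AT_BATS_PER_GAME` or the talent estimate moves the ratio around quite a bit, which is itself a fair illustration of how fragile these streak estimates are.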