Pythagoras, log5, and the independence of game scores
"My only note is that Phil's assuming implicitly in these examples that the two teams' "scoreboard scores" are statistically independent. Under that assumption, what he's written here is right (at least to first order).
However, the "latent score" that underlies the Bradley-Terry-Luce-log5 model is not necessarily the "scoreboard score" that you see on the scoreboard (hence why I'm calling it the "scoreboard score"). It is possible for log5 still to work (or, work better than these calculations suggest) even if scoreboard scores are not extreme-value distributed, if the scoreboard scores aren't independent.
But that is a fine point, because, in fact, the number of types of sport or game in which log5 works precisely is exactly zero. The number of them in which log5 works reasonably well seems to be quite large (depending on how much accuracy you demand)."
Ted makes three important points here.
1. Log5's Accuracy

I agree that log5 works reasonably well for most sports. I may have given the impression that I thought it was too wrong to be useful ... that's not the case. I do think, though, that it's too wrong to be useful to, say, the third decimal place.
In most cases, any inaccuracy in log5 gets lost in all the other inevitable biases involved in estimation. For instance, do we ever really know teams' talent against a .500 team, to plug into the formula? We don't. And the error in estimating them is likely to be at least as big as any inaccuracy in log5 itself, especially if we're dealing with middle-of-the-road teams.
Another commenter last post asked if there's anything better, or more accurate, than log5. Not that I know of. And it seems unlikely, because every sport has a differently shaped distribution of scores.
I think log5 is the best candidate because (1) it seems to work reasonably well for most sports, and (2) we have two cases where we know it works perfectly: retrospectively, and for sudden-death binary games. Those two cases suggest to me that there's some theoretical underpinning that makes it a good anchor.
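For reference, here's the log5 formula itself in a few lines of Python (the two team talents are invented for illustration):

```python
def log5(p_a, p_b):
    """Chance team A beats team B, where p_a and p_b are each team's
    talent expressed as a winning percentage against a .500 team."""
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + p_b * (1 - p_a))

# A .600 team against a .450 team (made-up talents): about .647
p = log5(0.600, 0.450)
```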
My point in all this log5 stuff is not to deny that it's useful. My point is, why does it seem to work so well, and what's the underlying theory?
Speaking only for my own intuition ... I'm probably more confident in log5 now than when I only had counterexamples like "height baseball." That's because I now have a certain gut understanding that, on those occasions when log5 fails, it fails only for the most extreme cases.
2. Latent Scores
Ted points out that it's not actually the "scoreboard score" that needs to be logistically distributed for log5 to be theoretically correct. It can be a "latent score" instead.
What's a "latent score"? It's an alternative measure of what happened in the game, but one that always preserves who won and who lost.
In baseball, the home team doesn't play the bottom of the ninth inning if it's ahead. Imagine changing the rules so that it *does* play that inning. In that case, if the score differential was logistically distributed before, it won't be logistically distributed now (since the new distribution will be wider -- teams will win by more runs than they used to, on average).
But: if log5 worked before the change, it'll work after the change. Because, log5 only deals with talent in terms of probabilities of winning, and those aren't affected at all by whether the team in the lead plays the bottom of the ninth.
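Here's a toy simulation of that bottom-of-the-ninth argument. The per-inning run distribution is made up, and I'm ignoring extra innings and walk-off truncations; the point is only that dropping the home team's ninth inning whenever it already leads can never change the winner, but it does compress the score differential:

```python
import random

random.seed(1)

# Made-up per-inning run distribution (weights invented for illustration)
INNING_RUNS = [0] * 70 + [1] * 18 + [2] * 8 + [3] * 4

def sim_game():
    visitor = [random.choice(INNING_RUNS) for _ in range(9)]
    home = [random.choice(INNING_RUNS) for _ in range(9)]
    latent = sum(home) - sum(visitor)   # both teams bat all nine innings
    if sum(home[:8]) > sum(visitor):
        # Home team leads going into the bottom of the ninth, so it
        # doesn't bat: its ninth-inning runs come off the scoreboard.
        scoreboard = sum(home[:8]) - sum(visitor)
    else:
        scoreboard = latent
    return latent, scoreboard

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

games = [sim_game() for _ in range(20000)]
latent = [g[0] for g in games]
board = [g[1] for g in games]

# Winner identical in every game, but the scoreboard spread is narrower.
same_winner = all((l > 0) == (b > 0) and (l < 0) == (b < 0)
                  for l, b in zip(latent, board))
```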
Something similar happens in basketball and football. In the NFL, if a team is up by 17 points with, say, four minutes to play, it won't try too hard to score more points. In effect, it will forfeit its "bottom of the ninth" by concentrating on running out the clock instead. That strategy might even result in the opposition scoring more points than they would otherwise -- but not to the extent that they win where they would have otherwise lost. So, maybe you can consider the "latent score" in football to be "what would have happened if the team in the lead had kept trying to run up the score."
In basketball, that kind of thing happens too -- running out the clock, playing "garbage time" bench players, and deliberate fouling. Those don't all work in the same direction, but they change the distribution of the score without (usually) affecting the outcome of the game.
Which brings us to:

3. Non-Independence of Scores
In these examples, the "latent score" effect is the result of the fact that the two teams' scores aren't independent. A team's strategy, and therefore score, depends on the other team's score.
In baseball, when you know one team got shut out, your expectation for the other team's score should be lower -- because it's more likely to have been the home team and not have played the bottom of the ninth. In football, you'd expect blowouts to be less frequent than raw talent suggests, because a team that's safely (if only barely) ahead late in the game is too busy managing the clock to occasionally score three quick touchdowns in the fourth quarter.
For both baseball and football, you'd expect the score differentials to be narrower than a logistic distribution, but for log5's accuracy to be unaffected by that score compression.
Can we measure that? Well, I'll try one possibility. It's not really strong evidence, because there are so many other factors involved besides non-independence, but, I think, it's definitely suggestive.
Assume that the log5 assumptions are true for MLB -- that score differential is logistic, and team scores are independent. If that's true, what's the expected standard deviation of score differential over an MLB season?
It seems like there's not enough information to answer that question. And there isn't. We have to add one more thing: that the Pythagorean exponent that works best in baseball is 1.83.
Now, we can actually figure out the SD of the difference in team runs per game. Or, we can come close.
Last post, I linked to a paper (pdf) that showed that, where log5 is accurate, you can compute the chance of a team winning by treating the "latent score differential" as a logistic distribution with mean equal to the log odds ratio and a constant SD of 1.81. Then, you just compute the area under the curve on each side of zero -- the left area is one team's chance of winning, and the right area is the other's.
For the distribution of "scoreboard score," we want the mean to be equal to the point spread, not the log odds ratio. To do that, we can multiply the mean by (point spread / log odds ratio).
If we do that, then we need to also multiply the SD by the same amount. That keeps the shape of the distribution the same, which preserves the areas on both sides of zero. (Those areas are the two teams' respective win probabilities.)
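Here's a sketch of that scaling argument in code, using the logistic CDF. (The 1.81 and 0.04 figures are from the text; to convert an SD into the logistic scale parameter you multiply by root 3 over pi.) Multiplying the mean and SD by the same factor leaves the win probability untouched:

```python
import math

def win_prob(mean, sd):
    """P(a logistic variable with this mean and SD comes out above zero)."""
    s = sd * math.sqrt(3) / math.pi   # SD -> logistic scale parameter
    return 1 / (1 + math.exp(-mean / s))

# Latent scale: mean = log odds ratio (0.04), SD = 1.81 -- about .510
p_latent = win_prob(0.04, 1.81)

# Rescale mean and SD by the same factor (2.5, making the mean 0.1 runs)
p_scoreboard = win_prob(0.04 * 2.5, 1.81 * 2.5)

# p_latent and p_scoreboard are identical
```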
Suppose one team is 0.1 runs better than the other, in talent, outscoring its opponent by an expected 4.6 to 4.5. With a Pythagorean exponent of 1.83, that's a winning percentage of .510, which is an odds ratio of 1.04, which is a log odds ratio of 0.04.
We want the mean to be 0.1, rather than 0.04. So we multiply the mean by 2.5. And, we also multiply the SD by that same 2.5. Now, the mean is 0.1, which we know is correct ... and the SD is now 4.5.
So, for a given MLB game, the SD of (team A score - team B score) is, in theory, 4.5 runs.
That's the SD against the spread. We need another step if we want the SD for a full MLB season: we have to add in the variance of the spread itself. Suppose the average talent differential SD is 0.5 runs (at 10 runs per game, that means the average favorite is about a .550 team). The square root of (4.5 squared plus 0.5 squared) works out to a bit over 4.5 -- with some generous rounding, call it 4.6.
So, in theory, we expect an SD of 4.6. Which is pretty close to real life! In 2009, the actual SD of score differential was 4.4.
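Here's the whole chain of arithmetic, carried through without rounding the intermediate steps. (The 1.81 constant and the 0.5 talent SD come from the discussion above; depending on how aggressively you round, you land somewhere between 4.5 and 4.6, close to the observed 4.4 either way.)

```python
import math

exponent = 1.83
fav, dog = 4.6, 4.5          # expected runs per game for the two teams

wp = fav ** exponent / (fav ** exponent + dog ** exponent)   # ~ .510
log_odds = math.log(wp / (1 - wp))                           # ~ 0.04
scale = (fav - dog) / log_odds                               # ~ 2.5
sd_game = 1.81 * scale                                       # ~ 4.5 runs

sd_talent = 0.5              # assumed SD of the talent differential
sd_season = math.sqrt(sd_game ** 2 + sd_talent ** 2)         # ~ 4.5
```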
Why the difference? My gut says: (1) the difference between 4.6 and 4.4 is the extent of the "non-independence" of scores in baseball; (2) the bottom-of-the-ninth effect is the cause of most of the difference, since team strategy doesn't differ that much based on score; and (3) that the fact that 4.4 and 4.6 are so close together suggests that the "latent score" is close to logistic, and so log5 should work quite well for baseball.
I don't have strong arguments for these three conclusions; I might be wrong.
Now, let's look at the NFL. Suppose, again, that one team is 0.1 points better than the other, 26.1 points to 26. And assume that the best Pythagorean exponent for the NFL is 2.37, as Wikipedia says.
Instead of going through the calculation, I'll just show the formula that follows from the logic I described:

SD(score diff) = spread * (1.814) / [ Pythagorean exponent * log(favorite score / underdog score) ]
(In this formula, "spread" has to be equal to "expected favorite score" - "expected underdog score". I just used "spread" to make it easier to read.)
Plug in the numbers, and you get
SD(score diff) = 19.94
That's for a given game, so the 19.94 is the theoretical difference against the spread.
But, in real life, NFL game outcomes have an SD of only about 14 points against the spread.
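Plugging the NFL numbers from above into the formula (a quick sketch; the 26.1-to-26 scores and the 2.37 exponent are the ones assumed in the text):

```python
import math

spread = 0.1
fav, dog = 26.1, 26.0
exponent = 2.37

# Theoretical SD of score differential: about 19.94 points,
# versus roughly 14 observed against the spread.
sd_theory = spread * 1.814 / (exponent * math.log(fav / dog))
```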
Why the difference? Non-independence of scores, I'd argue again. If teams in the lead didn't deliberately run out the clock, they'd score a few more points. The difference here is 6 points. Could it be that if teams in the lead played more aggressively, they'd gain an extra 6 points on their opponents? Seems a bit high, but maybe not -- I haven't done any calculations, even the back-of-the-envelope variety.
My wild-ass guess says ... non-independence is maybe 4 points of the difference, and football scores being non-logistic (even if there were independence) is the other 2 points. But I don't really know.
One way you could check: treat the first half as if it were the entire game. Find the best Pythagorean exponent for half-game results, use the formula to predict the SD, and compare it to the actual first half SD. My guess: the two SDs would be closer than they are here.
For the NBA, the Pythagorean exponent is 13.91. Again assuming a 0.1-point favorite, the formula says:
SD(score diff) = 0.1 * (1.814) / [ 13.91 * log(100.1/100) ]
SD(score diff) = 13.05
The actual SD against the spread is about 11.5. So, basketball doesn't work out as well as baseball, but it does come closer than football.
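The same plug-in for the NBA numbers, as a check on the arithmetic above:

```python
import math

spread = 0.1
fav, dog = 100.1, 100.0
exponent = 13.91

# Theoretical SD of score differential: about 13.05 points,
# versus about 11.5 observed against the spread.
sd_theory = spread * 1.814 / (exponent * math.log(fav / dog))
```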
Why the discrepancy between 11.5 and 13.05? I'm guessing yet again, but, in this case, I think it's mostly the distribution, not the non-independence.
In MLB and the NFL, there are few possessions, and runs/points come in bunches. Most possessions result in zero, so the SD per possession is high. Those things make for fat tails, so the distribution is closer to logistic.
In the NBA, there are about 100 possessions per team, and points are scored frequently. Scores are the sum of a large number of (nearly) identically distributed (nearly) independent variables with a low SD. That means that by Central Limit Theorem, you'd expect the differential to be closer to normal, with skinny tails, rather than logistic, with fat tails.
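You can put a rough number on that "skinny tails" argument. For a sum of n independent, identically distributed draws, the excess kurtosis is the single-draw excess kurtosis divided by n; the logistic distribution's excess kurtosis is 1.2, and the normal's is 0. Here's a sketch with a made-up per-possession scoring distribution (the probabilities are invented for illustration):

```python
# Made-up per-possession scoring distribution: points and probabilities
outcomes = [0, 1, 2, 3]
probs = [0.45, 0.10, 0.35, 0.10]   # mean ~1.1 points per possession

mean = sum(x * p for x, p in zip(outcomes, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(outcomes, probs))
m4 = sum((x - mean) ** 4 * p for x, p in zip(outcomes, probs))
excess_one = m4 / var ** 2 - 3     # excess kurtosis of one possession

# A score differential is roughly the sum of ~200 possessions (100 per
# team), so its excess kurtosis is excess_one / 200 -- nearly 0 (normal),
# nowhere near the logistic's 1.2.
excess_game = excess_one / 200
```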
Again, I could be wrong. There is some non-independence going on. Late in the game, the team in the lead sacrifices points for clock, which makes the SD smaller. A team with a big lead plays its bench, which again makes the SD smaller. On the other hand, the trailing team will commit deliberate fouls, which usually results in the opposition padding its lead, which makes the SD larger.
So, yes, there is non-independence going on, but I still suspect the size of the discrepancy would be roughly the same even if team scores were completely independent.
You can use the formula to predict the Pythagorean exponent for a given sport, when you only know the actual SD of game outcomes against the spread. You just switch the SD and exponent terms:

Pythagorean exponent = spread * (1.814) / [ Observed SD against spread * log(favorite score / underdog score) ]
In the NFL case, if you sub in an SD of 14 under "Observed SD against spread", and then solve for "Pythagorean Exponent," it works out to 3.4. That's much higher than the 2.37 that's accepted as the exponent for the best Pythagorean estimator.
But we knew that would happen, because this is just the mirror image of the discrepancy in SD that we found earlier for the NFL. Here, the actual exponent is about 70 percent of the theoretical one. There, the actual SD was about 70 percent of the theoretical one.
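The mirror image is exact, because the formula makes the exponent and the SD inversely proportional to each other. A quick check, reusing the NFL numbers from above:

```python
import math

spread, fav, dog = 0.1, 26.1, 26.0
log_ratio = math.log(fav / dog)

sd_theory = spread * 1.814 / (2.37 * log_ratio)     # ~ 19.94 points
exp_implied = spread * 1.814 / (14.0 * log_ratio)   # ~ 3.4

# The two discrepancies are the same ~70 percent, by construction:
ratio_sd = 14.0 / sd_theory       # observed SD / theoretical SD
ratio_exp = 2.37 / exp_implied    # accepted exponent / implied exponent
```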
And, again: the discrepancy, the fact that you get 70 percent instead of 100 percent, has at least two causes:
1. The team scores aren't independent, in a way where the non-independence preserves the win probabilities; and
2. Even after correcting the "scoreboard score" for non-independence, and creating a "latent score," the distribution of score differential doesn't actually match the logistic distribution.
Number (1) doesn't affect how well log5 works. Number (2) *does* affect how well log5 works.
And that's why I suspect log5 might work better for football than for basketball, despite the fact that the discrepancy is higher. Because, in football, my gut says that most of the discrepancy is (1), but, for basketball, I suspect most of it is (2).