When and why log5 doesn't work
Six years ago, Tom Tango described a hypothetical sprinting tournament where log5 didn't seem to give accurate results. I think I have an understanding, finally, of why it doesn't work, thanks to
(a) Ted Turocy,
(b) a paper from Kristi Tsukida and Maya R. Gupta (.pdf; see section 3.4, keeping in mind that "Bradley-Terry model" basically means "log5"), and
(c) this excellent post by John Cook.
(You don't have to follow the links now; I'll give them again later, in context.)
It turns out that the log5 formula makes a certain assumption about the sport, an assumption that makes the log5 formula work out perfectly. That assumption is: that the set of score differentials follows a logistic distribution.
What's the logistic distribution? It's a lot like the normal distribution, a bell-shaped curve. They can be so similar in shape that I'd have trouble telling them apart by eye. But, the logistic distribution has fatter tails relative to the "bell". In other words: the logistic distribution predicts rare events, like teams beating expectations by large amounts, will happen more often than the normal distribution would predict. And it predicts that certain common events, like close games, will happen less often.
The log5 formula works perfectly when scores are distributed logistically. That's been proven mathematically. But, where scores aren't actually logistic, the formula will fail, in proportion to how real life varies from the particular logistic distribution the log5 formula assumes.
That's why the formula didn't work in the sprinting example. Tango explicitly made the results normally distributed. Then, he found that log5 started to break down when the competitors became seriously mismatched. That is: log5 started to fail in the tail of the distribution, exactly where normal and logistic differ the most.
Here's a rudimentary basketball game. Two teams each get 100 possessions, and have a certain probability (talent) of scoring on each possession. Defense doesn't matter, and all baskets are two points.
Suppose you have A, a 55 percent team (expected score: 110) and B, a 45 percent team (expected score, 90). We expect each team's score to be normally distributed, with an SD of an SD of almost exactly 10 points for each team's individual score.*
(* This is just the normal approximation to binomial. To get the SD, calculate .45 multiplied by (1-.45) divided by 100. Take the square root. Multiply by 2 since each basket is 2 points. Then, multiply by 100 for the number of possessions. You get 9.95.)
Since the two teams are independent, the SD of the score difference is the square root of 2 times as big as the individual standard deviations. So, the SD of the differential is 14 points.
By talent, A is a 20-point favorite over underdog B. For B to win, it must beat the spread by 20 points. 20 divided by 14 equals 1.42 SD. Under the normal distribution, the probability of getting a result greater than 1.42 SD is 0.0778.
That means B has a 7.78 percent chance of winning, and A a 92.22 chance. The odds ratio for A is 92.22 divided by 7.78, which is 11.85.
So, that's the answer. Now, how well does log5 approximate it? To figure that out, we need to figure the talent of A and B against a .500 team.
Let's say Team C is that .500 team. Against C, team A has an advantage of 0.71 SD. The normal distribution table says that's a winning percentage of .7611, which is an odds ratio of 3.186. Similarly, team B is 0.71 SD worse than C, so it wins with probability (1 - .7611), or odds ratio 1 / 3.186.
Using log5, what's the estimated odds ratio of A over B? It's 3.186 squared, or 10.15. That works out to a win probability of only .910, an underestimate of the .922 true probability.
Log5 estimate: .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85
To recap what we just did: we started with the correct, theoretically-derived probabilities of A beating C, and of B beating C. When we plugged those exact probabilities into log5, we got the wrong answer.
Why the wrong answer? Because of the log5 formula's implicit assumption: that the distribution of the score difference between A and B is logistic, rather than normal.
What specific logistic distribution does log5 expect? The one where the mean is the log of the odds ratio and the SD is about 1.8. (Logistic distributions are normally described by a "shape parameter, which is the SD divided by 1.8 (pi divided by the square root of 3, to be exact). So, actually, the log5 formula assumes a logistic distribution with a shape parameter of exactly 1.)
So, in this case, the log5 formula treats the score distribution as if it's logistic, with a mean of 2.3 (the log of the log5-calculated odds ratio of 10.15) and an SD of 1.8 (shape parameter 1).
We can rescale that to be more intuitive, more basketball-like, and still get the same probability answer. We just multiply the mean, SD, and shape parameter by a constant. That's like taking a normal curve for height denominated in feet, and multiplying the mean and SD by 12 to convert to inches. The proportion of people under 5 feet under the first curve works out to the same as the proportion of people under 60 inches in the second curve.
In this case, we want the mean to be 20 (basketball points) instead of 2.3. So, we can multiply both the mean and the scale by 20/2.3. That gives us a new logistic distribution with mean 20 and SD 15.6.
That's reasonably close to the actual normal curve we know is correct:
Normal: mean 20, SD 14
Logistic: mean 20, SD 15.6
It's reasonably close, but still incorrect. In this case, log5 overestimates the underdog in two different ways. First, it assumes a logistic distribution instead of normal, which means fatter tails. Second, it assumes a higher SD than actually occurs, which again means fatter tails and more extreme performances.
Here, I'll give you some stolen visuals. I'm swiping this diagram from that great post John Cook post I mentioned, which compares and contrasts the two distributions. Here's a comparison of a normal and logistic distribution with the same SD:
The dotted red line is the normal distribution, the solid blue line is the logistic, and both have an SD of about 1.8. It looks like the logistic tails start getting fatter than the normal tails at around 4.3 or something, which is around two-and-a-half SD from the mean.
But, for our purposes, it's the area under the curve that we care about, the CDF. Here's that comparison, shamefully stolen from the Tsukida/Gupta .pdf I linked to earlier:
From here, you can see that from infinity to around, I dunno, maybe 1.7 or so, the area under the logistic curve is larger than the area under the normal curve.
And again, this is when the curves have equal SDs. In this basketball example, the log5 assumption has a higher SD than the actual normal, by 15.6 points to 14. So the overestimate of the underdog is even higher, and probably starts earlier than 1.7 SD.
Just to make sure my logic was correct, I ran a simulation for this exact basketball game. I created seven teams, with expectations running from 110 points down to 90 points. I ran a huge season, where each team played each other team 200,000 times.
Then, I observed the simulated record of champion team A (the 110 point team) against team C (the team of average talent, the 100-point team). And I observed the simulated record of basement team B (the 90 point team) against team C.
Here's what I got:
A versus C: 151878- 48122 (.7594), odds ratio 3.156
B versus C: 47599-152401 (.2380), odds ratio 0.312
To estimate A versus B using log5, we just divide 3.156 by 0.312:
A versus B: 181990-18010 (.9099), odds ratio 10.105
But, the actual results from the simulated season were:
A versus B: 184395-15605 (.9220), odds ratio 11.816
In the simulation, log5 understimated the favorite almost exactly the same way as it did in theory:
Log5 estimate: .909, odds ratio 10.11
Correct answer: .922, odds ratio 11.82
Log5 estimate: .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85
So that's what I think is going on. If the distribution of the difference in team scores doesn't follow the logistic distribution well enough, you'll get log5 working poorly for those talent differentials for which the curves match the worst. In the case where score differentials are normal, like this basketball simulation, the worst match is when you have a heavy favorite.
For other sports, it's an empirical question. Take the actual score distribution for that particular sport, and fit it to the scaled logistic distribution assumed by log5. Where the curves differ, I suspect, will tell you where log5 projections will be least accurate, and in which direction.