Monday, October 24, 2016

Why the 2016 AL was harder to predict than the 2016 NL

In 2016, team forecasts for the National League turned out more accurate than they had any right to be, with FiveThirtyEight's predictions coming in with a standard error (SD) of only 4.5 wins. The forecasts for the American League, however, weren't nearly as accurate ... FiveThirtyEight came in at 8.9, and Bovada at 8.8. 

That isn't all that great. You could have hit 11.1 just by predicting each team to duplicate their 2015 record. And, 11 wins is about what you'd get most years if you just forecasted every team at 81-81.

Which is kind of what the forecasters did! Well, not every team at 81-81 exactly, but every team *close* to 81-81. If you look at FiveThirtyEight's actual predictions, you'll see that they had a standard deviation of only 3.4 wins. No team was predicted to win or lose more than 87 games.

Generally, team talent has an SD of around 9 wins. If you were a perfect evaluator of talent, your forecasts would also have an SD of 9. If, however, you acknowledge that there are things that you don't know (and many that can't be known, like injuries and suspensions), you'll forecast with an SD somewhat less than 9 -- maybe 6 or 7.

But, 3.4? That seems way too narrow. 

Why so narrow? I think it was because, last year, the AL standings were themselves exceptionally narrow. In 2015, no American League team won or lost more than 95 games. Only three teams were at 89 or more. 

The SD of team wins in the 2015 AL was 7.2. That's much lower than the usual figure of around 11, and the lowest for either league since 1961. In fact, I checked, and it's the lowest for any league in baseball history! (Second narrowest: the 1974 American League, at 7.3.)

Why were the standings so compressed? There are three possibilities:

1. The talent was compressed;

2. There was less luck than normal;

3. The bad teams had good luck and the good teams had bad luck, moving both sets closer to .500.

I don't think it was #1. In 2016, the SD of standings wins was back near normal, at 10.2. The year before, 2014, it was 9.6. It doesn't really make sense that team talent regressed so far to the mean between 2014 and 2015, and then suddenly jumped back to normal in 2016. (I could be wrong -- if you can find trades and signings those years that showed good teams got significantly worse in 2015 and then significantly better in 2016, that would change my mind.)

And I don't think it was #2, based on Pythagorean luck. The SD of the discrepancy between actual wins and "first-order wins" was 4.3, which is larger than the usual 4.0. 

So, that leaves #3 -- and I think that's what it was. In the 2015 AL, the correlation between first-order wins and Pythagorean luck was -0.54 instead of the expected 0.00. So, yes, the good teams had bad luck and the bad teams had good luck. (The NL figure was -0.16.)
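
(If you want to run that check yourself, here's the shape of it in Python. The numbers below are randomly generated stand-ins, not the real 2015 data -- with real standings, you'd plug in each team's actual and first-order wins.)

import numpy as np

rng = np.random.default_rng(0)

# Stand-in data, NOT real 2015 figures: "first-order wins" are the wins
# implied by runs scored and allowed; "luck" is actual wins minus that.
first_order = 81 + rng.normal(0, 9, size=15)
pythag_luck = rng.normal(0, 4, size=15)
actual = first_order + pythag_luck

# In a typical season this correlation is near zero; in the 2015 AL it
# came out around -0.54.
print(np.corrcoef(first_order, pythag_luck)[0, 1])

# Compare the spread of the actual standings to the spread of the
# first-order standings: a negative correlation above is what
# compresses the former relative to the latter.
print(actual.std(), first_order.std())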


When luck compresses the standings like that, it definitely makes forecasting harder, because there's not as much information about how teams differ. To see that, consider the extreme case. If, by some weird fluke, every team wound up 81-81, how would you know which teams were talented but unlucky, and which were less skilled but lucky? You wouldn't, and so you wouldn't know what to expect next season.

Of course, that's only a problem if there *is* a wide spread of talent, one that got overcompressed by luck. If the spread of talent actually *is* narrow, then forecasting works OK. 

That's what many forecasting methods assume: that if the standings are narrow, the talent must be narrow. If you do the usual "just take the standings and regress to the mean" operation, you'll wind up implicitly assuming that the spread of talent shrank at the same time as the spread in the standings shrank.

Which is fine, if that's what you think happened ... but, do you really think that's plausible? The AL talent distribution was pretty close to average in 2014. It makes more sense to me to guess that the difference between 2014 and 2015 was luck, not wholesale changes in personnel that made the bad teams better and the good teams worse.

Of course, I have the benefit of hindsight, knowing that the AL standings returned to near-normal in 2016 (with an SD of 10.2). But it's happened before -- the then-record-low 7.3 figure for the 1974 AL jumped back to an above-average 11.9 in 1975.

I'd think that if I were forecasting the 2016 standings, I might want to make an effort to figure out which teams were lucky and which ones weren't, in order to be able to forecast a more realistic talent SD than 3.4 wins.

Besides, you have more than the raw standings. If you adjust for Pythagoras, the SD jumps from 7.2 to 8.6. And, according to Baseball Prospectus, when you additionally adjust for cluster luck, the SD rises to 9.4. (As I wrote in the P.S. to the last post, I'm not confident in that number, but never mind for now.)

An SD of 9.4 is still smaller than 11, but it should be workable.

Anyway, my gut says that you should be able to differentiate the good teams from the bad with a spread higher than 3.4 games ... but I could be wrong. Especially since Bovada's spread was even smaller, at 3.3.


It's a bad idea to second-guess the bookies, but let's proceed anyway.

Suppose you thought that the standings compression of 2015 was a luck anomaly, and the distribution of talent for 2016 should still be as wide as ever. So, you took FiveThirtyEight's projections and expanded them, regressing them away from the mean by a factor of 1.5. Since FiveThirtyEight predicted the Red Sox at four games above .500 (85-77), you bump that up to six games (87-75).

If you did that, the SD of your predictions would now be a more reasonable 5.1. And those predictions, it turns out, would have been better: their errors would have had an SD of 8.4. You would have beaten both FiveThirtyEight and Bovada.
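
(Here's that adjustment as a couple of Python functions. The names are mine; "error_sd" is just the root-mean-squared-error measure used throughout these posts.)

import numpy as np

def expand_forecasts(forecast_wins, factor=1.5):
    # Regress *away* from the mean: stretch each forecast's distance
    # from the league average by the given factor.
    f = np.asarray(forecast_wins, dtype=float)
    return f.mean() + factor * (f - f.mean())

def error_sd(forecast, actual):
    # SD of the forecast errors: the square root of the average
    # squared error.
    e = np.asarray(forecast, dtype=float) - np.asarray(actual, dtype=float)
    return np.sqrt((e ** 2).mean())

print(expand_forecasts([85, 81, 77]))  # the 85-win Red Sox become 87 wins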

If that's too complicated, try this. If you had tried to take advantage of Bovada's compressed projections by betting the "over" on their top seven teams, and the "under" on their bottom seven teams, you would have gone 9-5 on those bets.

Now, I'm not going to go so far as to say this is a workable strategy ... bookmakers are very, very good at what they do. Maybe that strategy just turned out to be lucky. But it's something I noticed, and something to think about.


If compressed standings make predicting more difficult, then a larger spread in the standings should make it easier.

Remember how the 2016 NL predictions were much more accurate than expected, with an SD of 4.5 (FiveThirtyEight) and 5.5 (Bovada)? As it turns out, last year, the SD of the 2015 NL standings was higher than normal, at 12.65 wins. That's the highest of the past three years:

2014  AL= 9.59, NL= 9.20
2015  AL= 6.98, NL=12.65
2016  AL=10.15, NL=10.71

It's not historically high, though. I looked at 1961 to 2011 ... if the 2015 NL were included, it would be well above average, but only 70th percentile.*

(* If you care: of the 10 most extreme of the 102 league-seasons in that timespan, most were expansion years, or years following expansion. But the 2001, 2002, and 2003 AL made the list, with SDs of 15.9, 17.1, and 15.8, respectively. The 1962 National League was the most extreme, at 20.1, and the 2002 AL was second.)

A high SD won't necessarily make your predictions beat the speed of light, and a low SD won't necessarily make them awful. But both contribute. As an analogy: just because you're at home doesn't mean you're going to pitch a no-hitter. But if you *do* pitch a no-hitter, odds are, you had the help of home-field advantage.

So, given how accurate the 2016 NL forecasts were, I'm not surprised that the SD of the 2015 NL standings was higher than normal.


Can we quantify how much compressed standings hurt next year's forecasts? I was curious, so I ran a little simulation. 

First, I gave every team a random 2015 talent, so that the SD of team talent came out between 8.9 and 9.1 games. Then, I ran a simulated 2015 season. (I ran each team with 162 independent games, instead of having them play each other, so the results aren't perfect.)

Then, I regressed each team's 2015 record to the mean, to get an estimate of their talent. I assumed that I "knew" that the SD of talent was around 9, so I "unregressed" each regressed estimate away from the mean by the exact amount that gets the SD of talent to exactly 9.00. That became the official forecast for 2016. 

Finally, I ran a simulation of 2016 (with team talent being the same as 2015). I compared the actual to the forecast, and calculated the SD of the forecast errors.
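
(Here's a sketch of that simulation in Python. Two details are my own simplifications: I force the talent SD to exactly 9 by rescaling, and I collapse the regress-then-unregress step into a single rescaling of the standings, which works out to the same thing.)

import numpy as np

rng = np.random.default_rng(2016)
TEAMS, GAMES, TALENT_SD, SEASONS = 15, 162, 9.0, 4000

def play(talent):
    # One season: each team plays 162 independent games (the same
    # simplification as above). Returns wins relative to 81.
    return rng.binomial(GAMES, 0.5 + talent / GAMES) - GAMES // 2

standings_sd, forecast_err = [], []
for _ in range(SEASONS):
    talent = rng.normal(0, 1, TEAMS)
    talent *= TALENT_SD / talent.std()   # force the talent SD to 9.00

    year1 = play(talent)
    # Regress to the mean, then "unregress" so the forecast spread
    # matches the assumed talent SD of 9.00.
    forecast = (year1 - year1.mean()) * (TALENT_SD / year1.std())

    year2 = play(talent)                 # same talent the next season
    standings_sd.append(year1.std())
    forecast_err.append((forecast - year2).std())

# Overall accuracy; bucket forecast_err by standings_sd to build the
# table below.
print(np.mean(forecast_err))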

The results came out, I think, very reasonable.

Over 4,000 simulated seasons, the average accuracy was an SD of 7.9. But, the higher the SD of last year's standings, the better the accuracy:

SD standings    SD of next year's forecast errors
7.0             8.48 (2015 AL)
8.0             8.31
9.0             8.14
10.0            7.98
11.0            7.81
12.0            7.64
12.6            7.54 (2015 NL)
13.0            7.47
14.0            7.31
20.1            6.29 (1962 NL)

So, by this reckoning, you'd expect the 2016 NL predictions to have been about one win more accurate than the AL predictions. 

They were "much more accurater" than that, of course, by 3.4 or 4.5. The main reason, of course, is that there's a lot of luck involved. Less importantly, this simulation is very rough. The model is oversimplified, and there's no assurance that the relationship is actually linear. (In fact, the relationship *can't* be linear, since the "speed of light" limit is 6.4, and the model says the 1974 AL would beat that, at 6.3). 

It's just a very rough regression to get a very rough estimate. 

But the results seem reasonable to me. Going into 2016, we had (a) the narrowest standings in baseball history in the 2015 AL, and (b) a wider-than-average, 70th-percentile spread in the 2015 NL. In that light, an expected difference of 1 win, in terms of forecasting accuracy, seems very plausible. 


So that's my explanation of why this year's NL forecasts were so accurate, while this year's AL forecasts were mediocre. A large dose of luck -- assisted by a small (but significant) dose of extra information content in the standings.


Friday, October 21, 2016

National League forecasts were too accurate in 2016

FiveThirtyEight predicted the National League surprisingly accurately this year.

The standard error of their predictions -- that is, the SD of the difference between their team forecasts, and what actually happened -- was only 4.5 games.* (Here's the link to their forecast -- go to the bottom and choose "April 2".)

(* The SD is the square root of the average squared error. If you prefer just the average error, in this case, it was three-and-a-third games. But I'll be using just the SD in the rest of this post. In most cases, to estimate average error when you only have the SD, you can multiply by the square root of 2/pi (approximately 0.8).)

4.5 games is very, very good. In fact, it's so good it can't possibly be all skill. The "speed of light" limit on forecasting MLB is about 6.4 games. That is, even if you knew absolutely everything about the talent of a team and its opposition, every game, an SD of 6.4 is the very best you could expect to do.

Of course, you can get lucky, and beat 6.4 games. You could even get to zero, if fortune smiles on you and every team hits your projection exactly. But, 6.4 is the best you can do by skill.**  

(** Actually, it might be a bit less, 6.3 or something, because 6.4 is what you get when teams are evenly matched ... mismatches are somewhat easier to predict. But never mind.)
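
(Where does the 6.4 come from? It's just binomial luck: even if you knew a team was a true .500 talent in every single game, the SD of its win total over 162 games would be the square root of 162 times .5 times .5, which is about 6.4 wins.)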

How unusual is an SD of 4.5? Well, not *that* unusual. By my estimate, the SD of the observed SD -- sorry if that's a little confusing -- is somewhere around 1.7, for a league of 15 teams. So, FiveThirtyEight was a little over one standard deviation lucky, which isn't really a big deal. Even taking into account that FiveThirtyEight couldn't have been perfectly accurate in their talent assessments, it's still not that big a deal. If they were off, on talent, by around 3 games per team, that would bring them to only about 1.5 SDs of luck.

Still not a huge deal, but interesting nonetheless.


It wasn't just FiveThirtyEight whose projections did well ... the Vegas bookmakers did OK too. Well, at least the one I looked at, Bovada. (I assume the others would be pretty close.)  They had an SD of 5.5 games, which is also better than the "speed of light."  (I can't find the page I got them from, but this one, from a month earlier, is close.)

That suggests that it probably wasn't any particular flash of brilliance from either FiveThirtyEight or Bovada ... it must have been something about the way the season unfolded. 

Maybe, in 2016, there was less random deviation than usual? One type of random variation is whether a team exceeds their Pythagorean Projection -- that is, whether they win more (or fewer) games than you'd expect from their runs scored and allowed. To check that, I used Baseball Prospectus's numbers -- specifically, the difference between actual and "first-order wins."***

(*** Why didn't I use second-order wins? See the P.S. at the bottom of the post.)

In the National League in 2016, the SD of Pythagorean error was 3.55. That is indeed a little smaller than the average of around 4.0. But that small difference isn't nearly enough to explain why the projections were so good.

Here's what I think is the bigger factor -- actually, a combination of two factors.

First, by random chance, the better teams happened to undershoot their Pythagorean expectation, and the worse teams happened to exceed it. 

The Cubs were the best team in the league, and also the team with the most bad luck, at -4.8 games. The Phillies were the worst team in the league with luck removed; you'd expect them to have won only 61.4 games, but they played 9.6 games above their Pythagorean projection to go 71-91.

Those two were the most obvious examples, but the pattern continued through the league. Overall, the correlation between first-order wins (which is an approximation of talent) and Pythagorean luck was huge: -0.61. Normally, you'd expect it to be close to zero. (In the American League, it was, indeed, close to zero, at -0.06.)

Second, there was a similar, offsetting relationship in the predictions themselves. 

It turns out that the forecast errors had a strong pattern this year.  Instead of being random, they came out too "conservative" -- they underestimated the talent of the better teams, and overestimated the talent of the worse teams. Here's the distribution of FiveThirtyEight's forecast errors, with the teams sorted by their forecast:

Top 5 teams: average error -4 wins (underestimate)
Mid 5 teams: average error +4 wins (overestimate)
Btm 5 teams: average error +1 win  (overestimate)

So, in summary:

-- FiveThirtyEight predicted teams too close to the mean
-- Teams' Pythagorean luck moved them closer to the mean

Those two things cancelled each other out to a significant extent. And that's why FiveThirtyEight was so accurate.


Next post: The American League, which is interesting for completely different reasons.


P.S. Baseball Prospectus also produces "second-order wins," which attempts to remove a second kind of luck, what I call "Runs Created luck" (and others call "cluster luck"), which is teams scoring more or fewer runs than would be expected by their batting line. I started to do that, but ... I stopped, because I found something weird.

When you remove luck from the standings, you expect to make them tighter, to bring teams closer together. (To see that better, imagine removing luck from coin tosses. Every team reverts to .500.)

Removing first-order (Pythagorean) luck does seem to reduce the SD of the standings. But, removing second-order (Cluster) luck seems to do the *opposite*.

I checked four seasons of BP data, and, in every case, the SD of second-order wins (for the set of all 30 teams) was higher than the SD of first-order wins:  

         Actual  First-order  Second-order
2016      10.7        10.8        13.1
2015      10.4        10.1        11.8
2014       9.6         8.9         9.6
2013      12.2        12.2        12.8

So, either the good teams got lucky all four years, or there's something weird about how BP is computing second-order luck. 


Sunday, October 02, 2016

Pythagoras, log5, and the independence of game scores

Ted Turocy's comment from the previous post is worth repeating in full:

"My only note is that Phil's assuming implicitly in these examples that the two teams' "scoreboard scores" are statistically independent. Under that assumption, what he's written here is right (at least to first order). 

However, the "latent score" that underlies the Bradley-Terry-Luce-log5 model is not necessarily the "scoreboard score" that you see on the scoreboard (hence why I'm calling it the "scoreboard score"). It is possible for log5 still to work (or, work better than these calculations suggest) even if scoreboard scores are not extreme-value distributed, if the scoreboard scores aren't independent. 

But that is a fine point, because, in fact, the number of types of sport or game in which log5 works precisely is exactly zero. The number of them in which log5 works reasonably well seems to be quite large (depending on how much accuracy you demand)."

Ted makes three important points here.

1. Accuracy

I agree that log5 works reasonably well for most sports. I may have given the impression that I thought it was too wrong to be useful ... that's not the case. I do think, though, that it's too wrong to be useful to, say, the third decimal place.

In most cases, any inaccuracy in log5 gets lost in all the other inevitable biases involved in estimation. For instance, do we ever really know teams' talent against a .500 team, to plug into the formula? We don't. And the error in estimating them is likely to be at least as big as any inaccuracy in log5 itself, especially if we're dealing with middle-of-the-road teams.

Another commenter last post asked if there's anything better, or more accurate, than log5. Not that I know of. And it seems unlikely, because every sport has a differently shaped distribution of scores.

I think log5 is the best candidate because (1) it seems to work reasonably well for most sports, and (2) we have two cases where we know it works perfectly: retrospectively, and for sudden-death binary games. Those two cases suggest to me that there's some theoretical underpinning that makes it a good anchor. 

My point in all this log5 stuff is not to deny that it's useful. My point is, why does it seem to work so well, and what's the underlying theory? 

Speaking only for my own intuition ... I'm probably more confident in log5 now, than when I only had counterexamples like "height baseball."  That's because now, I have a certain gut understanding that on those occasions when log5 fails, it fails only for the most extreme cases.

2. Latent Scores

Ted points out that it's not actually the "scoreboard score" that needs to be logistically distributed for log5 to be theoretically correct. It can be a "latent score" instead.

What's a "latent score?"  It's an alternative measure of what happened in the game, but one that always preserves who won and who lost.

In baseball, the home team doesn't play the bottom of the ninth inning if it's ahead. Imagine changing the rules so that it *does* play that inning. In that case, if the score differential was logistically distributed before, it won't be logistically distributed now (since the new distribution will be wider -- teams will win by more runs than they used to, on average). 

But: if log5 worked before the change, it'll work after the change. Because, log5 only deals with talent in terms of probabilities of winning, and those aren't affected at all by whether the team in the lead plays the bottom of the ninth.

Something similar happens in basketball and football. In the NFL, if a team is up by 17 points with, say, four minutes to play, it won't try too hard to score more points. In effect, it will forfeit its "bottom of the ninth" by concentrating on running out the clock instead. That strategy might even result in the opposition scoring more points than they would otherwise -- but not to the extent that they win where they would have otherwise lost. So, maybe you can consider the "latent score" in football to be "what would have happened if the team in the lead had kept trying to run up the score."

In basketball, that kind of thing happens too -- running out the clock, playing "garbage time" bench players, and deliberate fouling. Those don't all work in the same direction, but they change the distribution of the score without (usually) affecting the outcome of the game.

Which brings us to:

3. Non-independence

In these examples, the "latent score" effect is the result of the fact that the two teams' scores aren't independent. A team's strategy, and therefore score, depends on the other team's score.

In baseball, when you know one team got shut out, your expectation for the other team's score should be lower -- because it's more likely to have been the home team and not have played the bottom of the ninth. In football, you'd expect blowouts to be less frequent than raw talent suggests, because a team that's safely (but barely safely) up late in the game is too busy managing the clock to occasionally score three quick touchdowns in the fourth quarter.

For both baseball and football, you'd expect the score differentials to be narrower than a logistic distribution, but for log5's accuracy to be unaffected by that score compression.

Can we measure that? Well, I'll try one possibility. It's not really that strong as evidence, because there are so many other factors involved besides non-independence, but, I think, it's definitely suggestive. 


Assume that the log5 assumptions are true for MLB -- that score differential is logistic, and team scores are independent. If that's true, what's the expected standard deviation of score differential over an MLB season?

It seems like there's not enough information to answer that question. And there isn't. We have to add one more thing: that the Pythagorean exponent that works best in baseball is 1.83.

Now, we can actually figure out the SD of the difference in team runs per game. Or, we can come close.

Last post, I linked to a paper (pdf) that showed that, where log5 is accurate, you can compute the chance of a team winning by treating one "latent score differential" as a logistic with mean equal to the log odds ratio, and constant SD of 1.81. Then, you just compute the area under the curve on both sides of zero -- the left area is one team winning, and the right area is the other team winning.

For the distribution of "scoreboard score," we want the mean to be equal to the point spread, not the log odds ratio. To do that, we can multiply the mean by (point spread / log odds ratio). 

If we do that, then we need to also multiply the SD by the same amount. That keeps the shape of the distribution the same, which preserves the areas on both sides of zero. (Those areas are the two teams' respective win probabilities.)

Suppose one team is 0.1 runs better than the other, in talent, outscoring its opponent by an expected 4.6 to 4.5. With a Pythagorean exponent of 1.83, that's a winning percentage of .510, which is an odds ratio of 1.04, which is a log odds ratio of 0.04.

We want the mean to be 0.1, rather than 0.04. So we multiply the mean by 2.5. And, we also multiply the SD by that same 2.5. Now, the mean is 0.1, which we know is correct ... and the SD is now 4.5. 

So, for a given MLB game, the SD of (team A score - team B score) is, in theory, 4.5 runs.
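
(Here's that calculation collapsed into a few lines of Python -- my own packaging of the logic, with the 1.81 written as what it actually is, pi divided by the square root of 3.)

import math

LOGISTIC_SD = math.pi / math.sqrt(3)   # ~1.814, SD of a "shape 1" logistic

def theoretical_sd(fav_score, dog_score, pythag_exponent):
    # SD of (favorite score - underdog score) implied by log5 plus the
    # sport's Pythagorean exponent.
    spread = fav_score - dog_score
    log_odds = pythag_exponent * math.log(fav_score / dog_score)
    return spread * LOGISTIC_SD / log_odds

print(theoretical_sd(4.6, 4.5, 1.83))  # MLB: about 4.5 runs

The same function reproduces the NFL and NBA numbers coming up below.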

That's the SD against the spread. We need another step if we want the SD for a full MLB season: we have to add in the variance of the spread itself. Suppose the SD of talent differentials is 0.5 runs (at 10 runs per game, that means the average favorite is about .550). Then, 4.5 squared plus 0.5 squared works out to roughly 4.6 squared.

So, in theory, we expect an SD of 4.6. Which is pretty close to real life! In 2009, the actual SD of score differential was 4.4. 

Why the difference? My gut says: (1) the difference between 4.6 and 4.4 is the extent of the "non-independence" of scores in baseball; (2) the bottom-of-the-ninth effect is the cause of most of the difference, since team strategy doesn't differ that much based on score; and (3) that the fact that 4.4 and 4.6 are so close together suggests that the "latent score" is close to logistic, and so log5 should work quite well for baseball. 

I don't have strong arguments for these three conclusions; I might be wrong.


Now, let's look at the NFL. Suppose, again, that one team is 0.1 points better than the other, 26.1 points to 26. And assume that the best Pythagorean exponent for the NFL is 2.37, as Wikipedia says.

Instead of going through the calculation, I'll just show the formula that follows from the logic I described:

SD(score diff) = spread * 1.814 / [ Pythagorean exponent * log(favorite score / underdog score) ]

(In this formula, "spread" has to be equal to "expected favorite score" - "expected underdog score". I just used "spread" to make it easier to read.)

Plug in the numbers, and you get

SD(score diff) = 19.94

That's for a given game, so the 19.94 is the theoretical difference against the spread.

But, in real life, NFL game outcomes have an SD of only about 14 points against the spread.

Why the difference? Non-independence of scores, I'd argue again. If teams in the lead didn't deliberately run out the clock, they'd score a few more points. The difference here is 6 points. Could it be that if teams in the lead played more aggressively, they'd gain an extra 6 points on their opponents? Seems a bit high, but maybe not -- I haven't done any calculations, even the back-of-the-envelope variety.

My wild-ass guess says ... non-independence is maybe 4 points of the difference, and football scores being non-logistic (even if there were independence) is the other 2 points. But I don't really know.

One way you could check: treat the first half as if it were the entire game. Find the best Pythagorean exponent for half-game results, use the formula to predict the SD, and compare it to the actual first half SD. My guess: the two SDs would be closer than they are here.


For the NBA, the Pythagorean exponent is 13.91. Again assuming an 0.1 point favorite, the formula says:

SD(score diff) = 0.1 * (1.814) / [ 13.91 * log(100.1/100) ]
SD(score diff) = 13.05

The actual SD against the spread is about 11.5. So, basketball doesn't work out as well as baseball, but it does come closer than football. 

Why the discrepancy between 11.5 and 13.05? I'm guessing yet again, but, in this case, I think it's mostly the distribution, not the non-independence. 

In MLB and the NFL, there are few possessions, and runs/points come in bunches. Most possessions result in zero, so the SD per possession is high. Those things make for fat tails, so the distribution is closer to logistic.

In the NBA, there are about 100 possessions per team, and points are scored frequently. Scores are the sum of a large number of (nearly) identically distributed (nearly) independent variables with a low SD. That means that by Central Limit Theorem, you'd expect the differential to be closer to normal, with skinny tails, rather than logistic, with fat tails.

Again, I could be wrong. There is some non-independence going on. Late in the game, the team in the lead sacrifices points for clock, which makes the SD smaller. A team with a big lead plays its bench, which again makes the SD smaller. On the other hand, the trailing team will commit deliberate fouls, which usually results in the opposition padding its lead, which makes the SD larger. 

So, yes, there is non-independence going on, but I still suspect the size of the discrepancy would be roughly the same even if team scores were completely independent.


You can use the formula to predict the Pythagorean exponent for a given sport, when you only know the actual SD of game outcomes against the spread. You just switch the SD and exponent terms:

Pythagorean exponent = spread * 1.814 / [ SD(score diff) * log(favorite score / underdog score) ]

In the NFL case, if you sub in an SD of 14 for "SD(score diff)" and solve, the Pythagorean exponent works out to 3.4. That's much higher than the 2.37 that's accepted as the exponent for the best Pythagorean estimator. 

But we knew that would happen, because this is just the mirror image of the discrepancy in SD that we found earlier for the NFL. Here, the actual exponent is about 70 percent of the theoretical one. There, the actual SD was about 70 percent of the theoretical one. 


And, again: the discrepancy, the fact that you get 70 percent instead of 100 percent, has at least two causes:

1. The team scores aren't independent, in a way where the non-independence preserves the win probabilities; and

2. Even after correcting the "scoreboard score" for non-independence, and creating a "latent score," the distribution of score differential doesn't actually match the logistic distribution.

Number (1) doesn't affect how well log5 works. Number (2) *does* affect how well log5 works.

And that's why I suspect log5 might work better for football than for basketball, despite the fact that the discrepancy is higher. Because, in football, my gut says that most of the discrepancy is (1), but, for basketball, I suspect most of it is (2). 


Thursday, September 22, 2016

When and why log5 doesn't work

Six years ago, Tom Tango described a hypothetical sprinting tournament where log5 didn't seem to give accurate results. I think I have an understanding, finally, of why it doesn't work, thanks to 

(a) Ted Turocy, 

(b) a paper from Kristi Tsukida and Maya R. Gupta (.pdf; see section 3.4, keeping in mind that "Bradley-Terry model" basically means "log5"), and 

(c) this excellent post by John Cook. 

(You don't have to follow the links now; I'll give them again later, in context.)


It turns out that the log5 formula makes a certain assumption about the sport, an assumption that makes the formula work out perfectly. That assumption is: that the set of score differentials follows a logistic distribution.

What's the logistic distribution? It's a lot like the normal distribution, a bell-shaped curve. They can be so similar in shape that I'd have trouble telling them apart by eye. But, the logistic distribution has fatter tails relative to the "bell". In other words: the logistic distribution predicts rare events, like teams beating expectations by large amounts, will happen more often than the normal distribution would predict. And it predicts that certain common events, like close games, will happen less often.

The log5 formula works perfectly when scores are distributed logistically. That's been proven mathematically. But, where scores aren't actually logistic, the formula will fail, in proportion to how real life varies from the particular logistic distribution the log5 formula assumes.

That's why the formula didn't work in the sprinting example. Tango explicitly made the results normally distributed. Then, he found that log5 started to break down when the competitors became seriously mismatched. That is: log5 started to fail in the tail of the distribution, exactly where normal and logistic differ the most. 


Here's a rudimentary basketball game. Two teams each get 100 possessions, and have a certain probability (talent) of scoring on each possession. Defense doesn't matter, and all baskets are two points.

Suppose you have A, a 55 percent team (expected score: 110) and B, a 45 percent team (expected score: 90). We expect each team's score to be normally distributed, with an SD of almost exactly 10 points for each team's individual score.*

(* This is just the normal approximation to binomial. To get the SD, calculate .45 multiplied by (1-.45) divided by 100. Take the square root. Multiply by 2 since each basket is 2 points. Then, multiply by 100 for the number of possessions. You get 9.95.)

Since the two teams are independent, the SD of the score difference is the square root of 2 times as big as the individual standard deviations. So, the SD of the differential is 14 points.

By talent, A is a 20-point favorite over underdog B. For B to win, it must beat the spread by 20 points. 20 divided by 14 equals 1.42 SD. Under the normal distribution, the probability of getting a result greater than 1.42 SD is 0.0778. 

That means B has a 7.78 percent chance of winning, and A a 92.22 chance. The odds ratio for A is 92.22 divided by 7.78, which is 11.85.
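
(Checking that arithmetic in code, using scipy's normal distribution:)

from math import sqrt
from scipy.stats import norm

sd_team = 2 * sqrt(100 * 0.55 * 0.45)  # ~9.95 points for one team's score
sd_diff = sd_team * sqrt(2)            # ~14.07 for the score difference

p_b = norm.sf(20 / sd_diff)            # chance B overcomes the 20-point spread
print(p_b, (1 - p_b) / p_b)            # ~.078, odds ratio ~11.9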

So, that's the answer. Now, how well does log5 approximate it? To figure that out, we need to figure the talent of A and B against a .500 team.

Let's say Team C is that .500 team. Against C, team A has an advantage of 0.71 SD. The normal distribution table says that's a winning percentage of .7611, which is an odds ratio of 3.186. Similarly, team B is 0.71 SD worse than C, so it wins with probability (1 - .7611), or odds ratio 1 / 3.186.

Using log5, what's the estimated odds ratio of A over B? It's 3.186 squared, or 10.15. That works out to a win probability of only .910, an underestimate of the .922 true probability.

Log5 estimate:  .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85

To recap what we just did: we started with the correct, theoretically-derived probabilities of A beating C, and of B beating C. When we plugged those exact probabilities into log5, we got the wrong answer.

Why the wrong answer? Because of the log5 formula's implicit assumption: that the distribution of the score difference between A and B is logistic, rather than normal.

What specific logistic distribution does log5 expect? The one where the mean is the log of the odds ratio and the SD is about 1.8. (Logistic distributions are normally described by a "shape parameter," which is the SD divided by 1.8 (pi divided by the square root of 3, to be exact). So, actually, the log5 formula assumes a logistic distribution with a shape parameter of exactly 1.)

So, in this case, the log5 formula treats the score distribution as if it's logistic, with a mean of 2.3 (the log of the log5-calculated odds ratio of 10.15) and an SD of 1.8 (shape parameter 1). 

We can rescale that to be more intuitive, more basketball-like, and still get the same probability answer. We just multiply the mean, SD, and shape parameter by a constant. That's like taking a normal curve for height denominated in feet, and multiplying the mean and SD by 12 to convert to inches. The proportion of people under 5 feet under the first curve works out to the same as the proportion of people under 60 inches in the second curve.

In this case, we want the mean to be 20 (basketball points) instead of 2.3. So, we can multiply both the mean and the scale by 20/2.3. That gives us a new logistic distribution with mean 20 and SD 15.6.

That's reasonably close to the actual normal curve we know is correct:

Normal:   mean 20, SD 14
Logistic: mean 20, SD 15.6

It's reasonably close, but still incorrect. In this case, log5 overestimates the underdog in two different ways. First, it assumes a logistic distribution instead of normal, which means fatter tails. Second, it assumes a higher SD than actually occurs, which again means fatter tails and more extreme performances.


Here, I'll give you some stolen visuals. I'm swiping this diagram from that great John Cook post I mentioned, which compares and contrasts the two distributions. Here's a comparison of a normal and logistic distribution with the same SD:

The dotted red line is the normal distribution, the solid blue line is the logistic, and both have an SD of about 1.8.  It looks like the logistic tails start getting fatter than the normal tails at around 4.3 or something, which is around two-and-a-half SD from the mean.

But, for our purposes, it's the area under the curve that we care about, the CDF. Here's that comparison, shamefully stolen from the Tsukida/Gupta .pdf I linked to earlier:

From here, you can see that out in the tail, beyond around, I dunno, maybe 1.7 SD or so, the area under the logistic curve is larger than the area under the normal curve.

And again, this is when the curves have equal SDs.  In this basketball example, the log5 assumption has a higher SD than the actual normal, by 15.6 points to 14. So the overestimate of the underdog is even higher, and probably starts earlier than 1.7 SD.


Just to make sure my logic was correct, I ran a simulation for this exact basketball game. I created seven teams, with expectations running from 110 points down to 90 points. I ran a huge season, where each team played each other team 200,000 times. 

Then, I observed the simulated record of champion team A (the 110 point team) against team C (the team of average talent, the 100-point team). And I observed the simulated record of basement team B (the 90 point team) against team C. 
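
(A condensed version of that simulation, if you want to replicate it. One shortcut of mine: I drop tied games instead of replaying them, which amounts to the same thing.)

import numpy as np

rng = np.random.default_rng(1)

def win_pct(p_a, p_b, n=200_000):
    # n games: each team gets 100 possessions, 2 points per make.
    a = 2 * rng.binomial(100, p_a, n)
    b = 2 * rng.binomial(100, p_b, n)
    decided = a != b                   # drop ties rather than replay them
    return (a > b)[decided].mean()

p_ac = win_pct(0.55, 0.50)             # A (expected 110) vs. C (expected 100)
p_bc = win_pct(0.45, 0.50)             # B (expected 90) vs. C
p_ab = win_pct(0.55, 0.45)             # A vs. B -- the "correct answer"

log5_est = (p_ac / (1 - p_ac)) / (p_bc / (1 - p_bc))
print(log5_est, p_ab / (1 - p_ab))     # ~10.1 versus ~11.8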

Here's what I got:

A versus C: 151878- 48122 (.7594), odds ratio 3.156
B versus C:  47599-152401 (.2380), odds ratio 0.312

To estimate A versus B using log5, we just divide 3.156 by 0.312:

A versus B: 181990-18010 (.9099), odds ratio 10.105

But, the actual results from the simulated season were:

A versus B: 184395-15605 (.9220), odds ratio 11.816

In the simulation, log5 underestimated the favorite almost exactly the same way as it did in theory:

Simulation:
Log5 estimate:  .909, odds ratio 10.11
Correct answer: .922, odds ratio 11.82

Theory:
Log5 estimate:  .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85


So that's what I think is going on. If the distribution of the difference in team scores doesn't follow the logistic distribution well enough, you'll get log5 working poorly for those talent differentials for which the curves match the worst. In the case where score differentials are normal, like this basketball simulation, the worst match is when you have a heavy favorite.

For other sports, it's an empirical question. Take the actual score distribution for that particular sport, and fit it to the scaled logistic distribution assumed by log5. Where the curves differ, I suspect, will tell you where log5 projections will be least accurate, and in which direction.


Wednesday, September 14, 2016

Another case where log5 works perfectly

Here's another case where log5 works perfectly: sudden death foul shooting.

Two players each take a foul shot. The player who makes his shot wins, if the other misses. If both players sink the shot, or both players miss, they each take another one. This continues until there's a winner.

Suppose player A is an 80% shooter, and player B is a 40% shooter. The chance that A makes and B misses is .800 multiplied by (1 - .400), which works out to .480. The chance that B makes and A misses is (1 - .800) multiplied by .400, which is .080. 

So, A beats B 480 times for each 80 times that A loses to B. So, A's odds of winning are 6:1.

And that's exactly what log5 predicts. A's make ratio is 4:1, and B's make ratio is 2:3. Divide 4/1 by 2/3 and you get 6/1, which is the right answer.


Actually, that's a bit of a cheat. The log5 formula isn't based on chances of making a foul shot; it's based on the chance of beating a .500 player.

We can fix that problem. Consider player C, who happens to hit exactly 50% of foul shots. I won't do the calculation again, but you can easily figure out that player A beats player C 80% of the time, but player B beats player C only 40% of the time.

So, player A is an .800 player against a .500 player, and player B is a .400 player against a .500 player. 

We're not quite done yet. Player C was defined as one who hits 50% of foul shots, not one that wins 50% of games. They're not the same thing. We can fix that by just assuming the league average player is both a .500 player and a 50% shooter. That seems arbitrary, but I'm just trying to come up with an example of when log5 works, so arbitrary is fine.

But, actually, we don't need that assumption. 

I've been saying all along that for talent, you need to use the expected odds ratio against a .500 team. But, actually, you don't need to be that specific. You can actually use the expected odds ratio against *any* other team (as long as it's the same other team for both sides).

So, it doesn't matter if player C is a .500 player, or a .600 player, or a .979 player. If you know A beats him X% of the time, and B beats him Y% of the time, you can just use X and Y in the log5 formula and it'll still work.

(Why does that work? Because the odds ratio against any given player is always a fixed multiple of the odds ratio against any other given player. So it doesn't matter whether you calculate A's odds ratio over B as a/b, or xa/xb -- it comes out the same regardless of x.)

That means that the log5 calculation using .800 (A's record against C) and .400 (B's record against C) is valid. The log5 formula works perfectly for sudden-death foul shooting.


What if we change the game a little, by extending it to two tries instead of one? This time, each player takes two shots, and whoever makes more wins the game. (Again, if it's a tie, repeat the game with two more shots each.)

If I've done my calculations right ... log5 does NOT work for this new game.

Assuming shots are independent, these are the probabilities of the three players hitting 2, 1, and 0 shots:

            0-for-2  1-for-2  2-for-2
A (.800)      .04      .32      .64
B (.400)      .36      .48      .16
C (.500)      .25      .50      .25

From that, we can calculate the exact probability that A beats B on the first two shots. It works out to 65.28%:

A wins 2-0   .64 * .36 = .2304
A wins 2-1   .64 * .48 = .3072
A wins 1-0   .32 * .36 = .1152
Total                    .6528

The chance that B beats A works out to 7.68 percent:

B wins 2-0   .16 * .04 = .0064
B wins 2-1   .16 * .32 = .0512
B wins 1-0   .48 * .04 = .0192
Total                    .0768

For every 6,528 games that A wins, B wins only 768 games. That's an odds ratio of exactly 8.5:1.

Now, bring C into the picture. I won't repeat all the calculations, but instead of eight-and-a-half, log5 gives an estimate of eight-and-three-elevenths:

Odds ratio of A over C: 56:11
Odds ratio of B over C: 24:39
Odds ratio of A over B: 2184:264 = 8.2727:1

So, in this case, log5 fails:

log5 estimate of A over B: 8.2727:1 
Correct odds  of A over B: 8.5000:1 

The log5 estimate works out to a winning percentage of .892. The correct win probability is .895. It's close, but it's still wrong.
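
(You can verify those numbers by brute force. Here's a quick enumeration in Python; "beat_odds" is my own helper name.)

from itertools import product

def beat_odds(p, q):
    # Odds that shooter p beats shooter q in the two-shot game, with
    # ties replayed (equivalently, ignored).
    def dist(r):   # P(making 0, 1, or 2 of two independent shots)
        return [(1 - r) ** 2, 2 * r * (1 - r), r ** 2]
    wins = sum(dist(p)[i] * dist(q)[j]
               for i, j in product(range(3), repeat=2) if i > j)
    losses = sum(dist(p)[i] * dist(q)[j]
                 for i, j in product(range(3), repeat=2) if i < j)
    return wins / losses

print(beat_odds(0.8, 0.4))                        # true odds: 8.5
print(beat_odds(0.8, 0.5) / beat_odds(0.4, 0.5))  # log5 estimate: ~8.27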


So, why doesn't log5 work here, and in Tango's case? 

Because: it's a known result (hat tip: Ted Turocy) that for log5 to work, scores have to follow a certain, specific distribution. In most sports, they don't. How well log5 works depends on how well the real-life distribution of scores follows the assumed, theoretical distribution.

I'll get to that (finally) next post.

(Previous log5 posts: One, and Two)

(Updated 9/15 to remove incorrect reference to "height baseball.")


Saturday, September 03, 2016

A case where log5 works perfectly

Suppose there's a coin-flipping league where every team has the same talent. After the season is over, you notice one team in the standings at .800, and another is at .400. Those records include the two teams facing each other at least once.

What is the probability, in retrospect, that the .800 team beat the .400 team in a particular game where they met? The log5 formula says you figure it out like this:

.800 is a ratio of 4 wins per loss
.400 is a ratio of 2/3 wins per loss

4 divided by 2/3 equals 6 

6 wins per loss is 6 wins per 7 games, which is .857.

(You can use the traditional form of the log5 formula if you want, to get the same .857.)
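
(For reference, that traditional form is p(A beats B) = (a - ab) / (a + b - 2ab), where a and b are the two teams' winning percentages. Here: (.800 - .320) / (.800 + .400 - .640) = .480/.560 = .857.)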

And it turns out that, in this case, the log5 formula DOES work. It works perfectly. The probability is indeed .857, and you can prove that.

I'll work out this particular example. Call the two teams A (.800) and B (.400). Suppose there were only 5 games in the season, so that A went 4-1 and B went 2-3. 

Suppose the two teams only met one time. What's the chance A won that game?

Start with the case where A beat B. If that happened, A would have to have gone 3-1 in its other four games, and B would have to have gone 2-2. 

There are four permutations where A goes 3-1 (WWWL, WWLW, WLWW, LWWW), and six ways for B to go 2-2 (WWLL, WLWL, WLLW, LWWL, LWLW, LLWW).

That means there are 24 (6 x 4) ways to draw up the season when A beats B.

Now, suppose that B beat A. That means A went 4-0 otherwise, and B went 1-3. 

There is only one way for A to go unbeaten (WWWW), and only four ways to arrange B's 1-3 (WLLL, LWLL, LLWL, LLLW). 

That means that there are 4 (1 x 4) ways to draw up the season when B beats A.

Since this is coin flipping, all the cases have equal probability of happening. So, A beats B 24 times for every 4 times that B beats A. 

That's a ratio of 6:1, which is 6/7, which is .857 -- exactly as log5 predicts.
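
(Here's the count verified by brute force in Python, enumerating every equally likely season.)

from itertools import product

a_beats_b, b_beats_a = 0, 0
# The season: one A-vs-B game, plus four other coin-flip games for
# each of A and B.
for h2h, a_rest, b_rest in product([0, 1],
                                   product([0, 1], repeat=4),
                                   product([0, 1], repeat=4)):
    if h2h + sum(a_rest) == 4 and (1 - h2h) + sum(b_rest) == 2:
        if h2h:                # h2h = 1 means A beat B
            a_beats_b += 1
        else:
            b_beats_a += 1

print(a_beats_b, b_beats_a)    # 24 and 4 -- a 6:1 ratio, or .857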


It's not that hard to go from this example to a proof. Just replace the raw numbers by variables for number of games total (n), number of games A wins (a), and number of games B wins (b). When you count permutations, you'll wind up with factorial terms, and when you divide the A permutations by the B permutations, the factorials will cancel out, and you'll be left with

odds(A beat B) = (a/(n-a)) / (b/(n-b))

Which is exactly the log5 formula, in odds-ratio form.

I don't know much about the history of log5, but some of you do. Was this part of the genesis of log5, that it could be proven to work retrospectively, so when it seemed to work pretty decently as a forecast, it became the standard?


But wait a minute -- last month, I argued that log5 couldn't possibly work when you used season records. If you recall, I posted this chart:

 matchup           log5
.800 vs .800       .500
.800 vs .700       .631
.800 vs .600       .727
.800 vs .500       .800
.800 vs .400       .857
.800 vs .300       .903
.800 vs .200       .941
 Average           .766

This says that an .800 team, playing against the league as log5 would predict, would actually play at a .766 pace. That's a contradiction -- it should be .800 -- so log5 must be wrong!

One difference between then and now is that, before, we had the .800 team playing against a clone of itself. That's not true here, so let's redo the chart without the first line: 

 matchup           log5
.800 vs .700       .631
.800 vs .600       .727
.800 vs .500       .800
.800 vs .400       .857
.800 vs .300       .903
.800 vs .200       .941
 Average           .810

Well, the average still isn't .800, so we still have a problem.

So, what's going on? Is my logic wrong here, or is my logic wrong there? Does log5 work, or doesn't it?

This bugged me for a while, until I sorted it out in my head. I think both conclusions are correct. The log5 formula actually *does* work in this case, and it actually *does not* work in the other case, for exactly the reasons described. 

But what about these charts that show the contradiction? They apply there, but they don't apply here. 

The difference is: when .800 is the *talent* of the team, it's constant, and you can use it on every line of the chart. But, when you use the *retrospectively observed performance*, it changes with every game. So you can't use .800 in every line of the chart.

Suppose the (eventual) 4-1 team wins the first game. In that case, it's only an (eventual) 3-1 team after that. That means its retrospectively observed performance next game isn't .800, it's .750. That means you have to draw up the chart like this:

  matchup           log5
 .800 vs .700       .631
 .750 vs .600       .667

If it loses the first game, it's 1.000 after, and you draw up the chart like this:

  matchup           log5
 .800 vs .700       .631
1.000 vs .600      1.000

So the chart has to be different every time, based on what actually happens in the games. 

I believe that if you were to do every possible permutation of the season, weighted by log5 probability, and average the averages, you would indeed wind up with .800. 


Now, the proof that validates the retrospective use of log5 only works because we assumed that games are decided by coin flips. If that weren't the case, then all the permutations wouldn't prospectively have an equal chance of happening, and the logic would fall apart. 

But would the *result* still hold? If you don't know A's talent or B's talent, but they still go 4-1 and 2-3, respectively, does the 6 out of 7 still hold?

I don't think it does. Again, imagine "height baseball," where the taller team always wins. It could be that A is the second-tallest team out of 6, and B is the fourth-tallest. That would be consistent with the 4-1 and 2-3 records (imagine a round-robin season).  But A would have a 100% chance of beating B, not 85.7%. 

So this is a special case. Whether log5 works here because there's something special about the 50%, or whether it's because all teams are the same, or whether it's just that the average record against all teams happens to equal the record against the average team ... I don't know.

But still. To me, there are no coincidences in math, just relationships that look coincidental until we see the connection. Maybe when I understand log5 better, it'll be self-evident why it works here. 

As I said, some of you guys reading this are much more familiar with the intricacies of log5 than I am. Is this a known result? Am I reinventing a wheel?
