Thursday, September 22, 2016

When and why log5 doesn't work

Six years ago, Tom Tango described a hypothetical sprinting tournament where log5 didn't seem to give accurate results. I think I have an understanding, finally, of why it doesn't work, thanks to 

(a) Ted Turocy, 

(b) a paper from Kristi Tsukida and Maya R. Gupta (.pdf; see section 3.4, keeping in mind that "Bradley-Terry model" basically means "log5"), and 

(c) this excellent post by John Cook. 

(You don't have to follow the links now; I'll give them again later, in context.)

-----

It turns out that the log5 formula makes a certain assumption about the sport, an assumption that makes the log5 formula work out perfectly. That assumption is: that the set of score differentials follows a logistic distribution.

What's the logistic distribution? It's a lot like the normal distribution, a bell-shaped curve. They can be so similar in shape that I'd have trouble telling them apart by eye. But, the logistic distribution has fatter tails relative to the "bell". In other words: the logistic distribution predicts rare events, like teams beating expectations by large amounts, will happen more often than the normal distribution would predict. And it predicts that certain common events, like close games, will happen less often.

The log5 formula works perfectly when scores are distributed logistically. That's been proven mathematically. But, where scores aren't actually logistic, the formula will fail, in proportion to how real life varies from the particular logistic distribution the log5 formula assumes.
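A quick numerical check of that claim, as a minimal Python sketch. It assumes each team has a hypothetical "strength" number on the log-odds scale, and that the win probability is the standard logistic CDF (scale 1) of the strength difference. The strengths and function names are made up for illustration.

```python
import math

def logistic_cdf(x):
    """CDF of the standard logistic distribution (mean 0, scale 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def log5(p_a, p_b):
    """log5 estimate that A beats B, from each team's record against a common opponent."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

# Hypothetical strengths for teams A and B, and for a reference team C.
strength_a, strength_b, strength_c = 1.2, -0.4, 0.0

p_a_vs_c = logistic_cdf(strength_a - strength_c)
p_b_vs_c = logistic_cdf(strength_b - strength_c)

direct = logistic_cdf(strength_a - strength_b)  # true probability under the logistic model
estimate = log5(p_a_vs_c, p_b_vs_c)             # log5 built from the two records against C

print(round(direct, 6), round(estimate, 6))     # the two numbers agree
```

Under this assumed model the two numbers match to floating-point precision, whatever strengths you pick.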

That's why the formula didn't work in the sprinting example. Tango explicitly made the results normally distributed. Then, he found that log5 started to break down when the competitors became seriously mismatched. That is: log5 started to fail in the tail of the distribution, exactly where normal and logistic differ the most. 

-------

Here's a rudimentary basketball game. Two teams each get 100 possessions, and have a certain probability (talent) of scoring on each possession. Defense doesn't matter, and all baskets are two points.

Suppose you have A, a 55 percent team (expected score: 110) and B, a 45 percent team (expected score: 90). We expect each team's score to be normally distributed, with an SD of almost exactly 10 points for each team's individual score.*

(* This is just the normal approximation to binomial. To get the SD, calculate .45 multiplied by (1-.45) divided by 100. Take the square root. Multiply by 2 since each basket is 2 points. Then, multiply by 100 for the number of possessions. You get 9.95.)

Since the two teams are independent, the SD of the score difference is the square root of 2 times as big as the individual standard deviations. So, the SD of the differential is 14 points.

By talent, A is a 20-point favorite over underdog B. For B to win, it must beat the spread by 20 points. 20 divided by 14 equals 1.42 SD. Under the normal distribution, the probability of getting a result greater than 1.42 SD is 0.0778. 

That means B has a 7.78 percent chance of winning, and A a 92.22 percent chance. The odds ratio for A is 92.22 divided by 7.78, which is 11.85.

So, that's the answer. Now, how well does log5 approximate it? To figure that out, we need to figure the talent of A and B against a .500 team.

Let's say Team C is that .500 team. Against C, team A has an advantage of 0.71 SD. The normal distribution table says that's a winning percentage of .7611, which is an odds ratio of 3.186. Similarly, team B is 0.71 SD worse than C, so it wins with probability (1 - .7611), or odds ratio 1 / 3.186.

Using log5, what's the estimated odds ratio of A over B? It's 3.186 squared, or 10.15. That works out to a win probability of only .910, an underestimate of the .922 true probability.

Log5 estimate:  .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85
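
The arithmetic above can be reproduced with a short Python sketch. It uses the same normal approximation to the binomial, but carries unrounded SDs (for example, exactly 10 points for the .500 team), so the last digit can differ slightly from the hand calculation. The helper names are illustrative only.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def score_sd(p, possessions=100, points=2):
    """SD of a team's score under the normal approximation to the binomial."""
    return points * math.sqrt(possessions * p * (1.0 - p))

def win_prob(p_x, p_y, possessions=100, points=2):
    """Probability X outscores Y when both scores are independent normals."""
    mean_diff = points * possessions * (p_x - p_y)
    sd_diff = math.hypot(score_sd(p_x, possessions, points),
                         score_sd(p_y, possessions, points))
    return normal_cdf(mean_diff / sd_diff)

def odds(p):
    return p / (1.0 - p)

p_a, p_b, p_c = 0.55, 0.45, 0.50

true_ab = win_prob(p_a, p_b)                                  # the "correct answer"
log5_odds = odds(win_prob(p_a, p_c)) / odds(win_prob(p_b, p_c))
log5_ab = log5_odds / (1.0 + log5_odds)                       # the log5 estimate

print(f"Log5 estimate:  {log5_ab:.3f}, odds ratio {log5_odds:.2f}")
print(f"Correct answer: {true_ab:.3f}, odds ratio {odds(true_ab):.2f}")
```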

To recap what we just did: we started with the correct, theoretically-derived probabilities of A beating C, and of B beating C. When we plugged those exact probabilities into log5, we got the wrong answer.

Why the wrong answer? Because of the log5 formula's implicit assumption: that the distribution of the score difference between A and B is logistic, rather than normal.

What specific logistic distribution does log5 expect? The one where the mean is the log of the odds ratio and the SD is about 1.8. (Logistic distributions are normally described by a "shape parameter," which is the SD divided by 1.8 (pi divided by the square root of 3, to be exact). So, actually, the log5 formula assumes a logistic distribution with a shape parameter of exactly 1.)

So, in this case, the log5 formula treats the score distribution as if it's logistic, with a mean of 2.3 (the log of the log5-calculated odds ratio of 10.15) and an SD of 1.8 (shape parameter 1). 

We can rescale that to be more intuitive, more basketball-like, and still get the same probability answer. We just multiply the mean, SD, and shape parameter by a constant. That's like taking a normal curve for height denominated in feet, and multiplying the mean and SD by 12 to convert to inches. The proportion of people under 5 feet under the first curve works out to the same as the proportion of people under 60 inches in the second curve.

In this case, we want the mean to be 20 (basketball points) instead of 2.3. So, we can multiply both the mean and the scale by 20/2.3. That gives us a new logistic distribution with mean 20 and SD 15.6.

That's reasonably close to the actual normal curve we know is correct:

Normal:   mean 20, SD 14
Logistic: mean 20, SD 15.6

It's reasonably close, but still incorrect. In this case, log5 overestimates the underdog in two different ways. First, it assumes a logistic distribution instead of normal, which means fatter tails. Second, it assumes a higher SD than actually occurs, which again means fatter tails and more extreme performances.
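
The rescaling step can be checked in a few lines of Python. This is a sketch under the assumptions already stated: the log5 odds ratio of 10.15, a logistic distribution with parameter 1 on the log-odds scale, and a 20-point spread. The variable names are illustrative.

```python
import math

SD_PER_UNIT = math.pi / math.sqrt(3)     # SD of a logistic distribution with parameter 1 (about 1.81)

log5_odds = 10.15                        # log5 odds ratio of A over B, from the text
mean_log_odds = math.log(log5_odds)      # about 2.3
factor = 20 / mean_log_odds              # stretch so the mean becomes the 20-point spread

logistic_mean = mean_log_odds * factor   # 20 points
logistic_sd = SD_PER_UNIT * factor       # about 15.6 points

# The favorite's win probability, P(differential > 0), before and after rescaling.
p_before = 1 / (1 + math.exp(-mean_log_odds / 1.0))
p_after = 1 / (1 + math.exp(-logistic_mean / factor))

print(round(logistic_mean, 1), round(logistic_sd, 1))
print(round(p_before, 4), round(p_after, 4))   # identical: rescaling doesn't change the probability
```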

------

Here, I'll give you some stolen visuals. I'm swiping this diagram from that great John Cook post I mentioned, which compares and contrasts the two distributions. Here's a comparison of a normal and logistic distribution with the same SD:





The dotted red line is the normal distribution, the solid blue line is the logistic, and both have an SD of about 1.8.  It looks like the logistic tails start getting fatter than the normal tails at around 4.3 or something, which is around two-and-a-half SD from the mean.

But, for our purposes, it's the area under the curve that we care about, the CDF. Here's that comparison, shamefully stolen from the Tsukida/Gupta .pdf I linked to earlier:




From here, you can see that from infinity to around, I dunno, maybe 1.7 or so, the area under the logistic curve is larger than the area under the normal curve.

And again, this is when the curves have equal SDs.  In this basketball example, the log5 assumption has a higher SD than the actual normal, by 15.6 points to 14. So the overestimate of the underdog is even higher, and probably starts earlier than 1.7 SD.
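
For readers without the charts, here is a rough Python sketch that tabulates the upper-tail areas of the two curves when they share the same SD (about 1.8, the logistic with parameter 1). The upper tail is what matters for an underdog needing to beat the spread. The function names are illustrative.

```python
import math

SD = math.pi / math.sqrt(3)   # SD of the logistic distribution with parameter 1

def normal_upper_tail(x, sd):
    """P(X > x) for a normal distribution with mean 0 and the given SD."""
    return 0.5 * math.erfc(x / (sd * math.sqrt(2.0)))

def logistic_upper_tail(x, scale=1.0):
    """P(X > x) for a logistic distribution with mean 0 and the given scale."""
    return 1.0 / (1.0 + math.exp(x / scale))

print("SDs out   normal   logistic")
for k in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    x = k * SD
    print(f"{k:6.1f}   {normal_upper_tail(x, SD):.4f}   {logistic_upper_tail(x):.4f}")
```

Close to the mean, the normal tail area is the larger of the two; far enough out, the logistic tail area takes over. The printed table shows where the switch happens for equal SDs.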

------

Just to make sure my logic was correct, I ran a simulation for this exact basketball game. I created seven teams, with expectations running from 110 points down to 90 points. I ran a huge season, where each team played each other team 200,000 times. 

Then, I observed the simulated record of champion team A (the 110 point team) against team C (the team of average talent, the 100-point team). And I observed the simulated record of basement team B (the 90 point team) against team C. 

Here's what I got:

A versus C: 151878- 48122 (.7594), odds ratio 3.156
B versus C:  47599-152401 (.2380), odds ratio 0.312

To estimate A versus B using log5, we just divide 3.156 by 0.312:

A versus B: 181990-18010 (.9099), odds ratio 10.105

But, the actual results from the simulated season were:

A versus B: 184395-15605 (.9220), odds ratio 11.816

In the simulation, log5 underestimated the favorite almost exactly the same way as it did in theory:

Simulation
--------------------------------------
Log5 estimate:  .909, odds ratio 10.11
Correct answer: .922, odds ratio 11.82

Theory
--------------------------------------
Log5 estimate:  .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85
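
A simulation along these lines can be sketched in pure Python. This is not the original code, which isn't shown. It makes two assumptions of its own: a tie counts as half a win for each side (the post doesn't say how ties were handled), and each matchup is only 20,000 games rather than 200,000 so that it runs in a few seconds. Only the three teams needed for the comparison are simulated.

```python
import random

def team_score(p, rng, possessions=100):
    """Points scored: 2 per made basket over independent possessions."""
    return 2 * sum(rng.random() < p for _ in range(possessions))

def win_fraction(p_x, p_y, games, rng):
    """Fraction of games X wins against Y; a tie counts as half a win (an assumption)."""
    wins = 0.0
    for _ in range(games):
        x, y = team_score(p_x, rng), team_score(p_y, rng)
        if x > y:
            wins += 1.0
        elif x == y:
            wins += 0.5
    return wins / games

def odds(p):
    return p / (1.0 - p)

rng = random.Random(2016)   # fixed seed so the run is repeatable
GAMES = 20_000

a_vs_c = win_fraction(0.55, 0.50, GAMES, rng)
b_vs_c = win_fraction(0.45, 0.50, GAMES, rng)
a_vs_b = win_fraction(0.55, 0.45, GAMES, rng)

log5_odds = odds(a_vs_c) / odds(b_vs_c)
log5_estimate = log5_odds / (1.0 + log5_odds)

print(f"Log5 estimate:  {log5_estimate:.3f}")
print(f"Simulated A-B:  {a_vs_b:.3f}")
```

With fewer games the numbers are noisier than the ones above, but the gap between the log5 estimate and the simulated record should still be visible.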

-------

So that's what I think is going on. If the distribution of the difference in team scores doesn't follow the logistic distribution well enough, you'll get log5 working poorly for those talent differentials for which the curves match the worst. In the case where score differentials are normal, like this basketball simulation, the worst match is when you have a heavy favorite.

For other sports, it's an empirical question. Take the actual score distribution for that particular sport, and fit it to the scaled logistic distribution assumed by log5. Where the curves differ, I suspect, will tell you where log5 projections will be least accurate, and in which direction.




Wednesday, September 14, 2016

Another case where log5 works perfectly

Here's another case where log5 works perfectly: sudden death foul shooting.

Two players each take a foul shot. The player who makes his shot wins, if the other misses. If both players sink the shot, or both players miss, they each take another one. This continues until there's a winner.

Suppose player A is an 80% shooter, and player B is a 40% shooter. The chance that A makes and B misses is .800 multiplied by (1 - .400), which works out to .480. The chance that B makes and A misses is (1 - .800) multiplied by .400, which is .080. 

So, A beats B 480 times for each 80 times that A loses to B. So, A's odds of winning are 6:1.

And that's exactly what log5 predicts. A's make ratio is 4:1, and B's make ratio is 2:3. Divide 4/1 by 2/3 and you get 6/1, which is the right answer.
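
The same check in Python, using exact fractions so that no rounding gets in the way. It's a minimal sketch; the function names are made up for illustration.

```python
from fractions import Fraction

def sudden_death_win_prob(p_a, p_b):
    """Chance A wins: only rounds where exactly one player scores decide the game."""
    a_only = p_a * (1 - p_b)   # A makes, B misses
    b_only = p_b * (1 - p_a)   # B makes, A misses
    return a_only / (a_only + b_only)

def log5(p_a, p_b):
    """The usual log5 formula."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

a, b = Fraction(4, 5), Fraction(2, 5)   # the 80% and 40% shooters

exact = sudden_death_win_prob(a, b)
estimate = log5(a, b)

print(exact, estimate)   # 6/7 6/7
```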

------

Actually, that's a bit of a cheat. The log5 formula isn't based on chances of making a foul shot; it's based on the chance of beating a .500 player.

We can fix that problem. Consider player C, who happens to hit exactly 50% of foul shots. I won't do the calculation again, but you can easily figure out that player A beats player C 80% of the time, but player B beats player C only 40% of the time.

So, player A is an .800 player against a .500 player, and player B is a .400 player against a .500 player. 

We're not quite done yet. Player C was defined as one who hits 50% of foul shots, not one that wins 50% of games. They're not the same thing. We can fix that by just assuming the league average player is both a .500 player and a 50% shooter. That seems arbitrary, but I'm just trying to come up with an example of when log5 works, so arbitrary is fine.

But, actually, we don't need that assumption. 

I've been saying all along that for talent, you need to use the expected odds ratio against a .500 team. But, actually, you don't need to be that specific. You can actually use the expected odds ratio against *any* other team (as long as it's the same other team for both sides).

So, it doesn't matter if player C is a .500 player, or a .600 player, or a .979 player. If you know A beats him X% of the time, and B beats him Y% of the time, you can just use X and Y in the log5 formula and it'll still work.

(Why does that work? Because the odds ratio against any given player is always a fixed multiple of the odds ratio against any other given player. So it doesn't matter whether you calculate A's odds ratio over B as a/b, or xa/xb -- it comes out the same regardless of x.)

That means that the log5 calculation using .800 (A's record against C) and .400 (B's record against C) is valid. The log5 formula works perfectly for sudden-death foul shooting.
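
To see the "any common opponent" point concretely, here is a small Python sketch for the sudden-death game. It tries three different reference shooters, with the percentages taken from the examples in the text, and uses exact fractions. The function names are illustrative.

```python
from fractions import Fraction

def sudden_death_win_prob(p_x, p_y):
    """Chance X beats Y in sudden-death foul shooting."""
    x_only = p_x * (1 - p_y)
    y_only = p_y * (1 - p_x)
    return x_only / (x_only + y_only)

def odds(p):
    return p / (1 - p)

a, b = Fraction(4, 5), Fraction(2, 5)

estimates = []
for c in (Fraction(1, 2), Fraction(3, 5), Fraction(979, 1000)):
    a_vs_c = sudden_death_win_prob(a, c)
    b_vs_c = sudden_death_win_prob(b, c)
    ratio = odds(a_vs_c) / odds(b_vs_c)    # log5 in odds-ratio form
    estimates.append(ratio / (1 + ratio))  # back to a probability

print(estimates)   # 6/7 each time, regardless of the reference shooter
```

The reference player's own odds appear in both the numerator and the denominator, so they cancel.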

------

What if we change the game a little, by extending it to two tries instead of one? This time, each player takes two shots, and whoever makes more wins the game. (Again, if it's a tie, repeat the game with two more shots each.)

If I've done my calculations right ... log5 does NOT work for this new game.

Assuming shots are independent, these are the probabilities of the three players hitting 2, 1, and 0 shots:

            0-for-2  1-for-2  2-for-2
---------------------------------------
A (.800)      .04      .32      .64
B (.400)      .36      .48      .16
C (.500)      .25      .50      .25

From that, we can calculate the exact probability that A beats B on the first two shots. It works out to 65.28%:

A wins 2-0   .64 * .36 = .2304
A wins 2-1   .64 * .48 = .3072
A wins 1-0   .32 * .36 = .1152
------------------------------
Total                    .6528

The chance that B beats A works out to 7.68 percent:

B wins 2-0   .16 * .04 = .0064
B wins 2-1   .16 * .32 = .0512
B wins 1-0   .48 * .04 = .0192
------------------------------
Total                    .0768

For every 6,528 games that A wins, B wins only 768 games. That's an odds ratio of exactly 8.5:1.

Now, bring C into the picture. I won't repeat all the calculations, but instead of eight-and-a-half, log5 gives an estimate of eight-and-three-elevenths:

Odds ratio of A over C: 56:11
Odds ratio of B over C: 24:39
-------------------------------------------
Odds ratio of A over B: 2184:264 = 8.2727:1

So, in this case, log5 fails:

log5 estimate of A over B: 8.2727:1 
Correct odds  of A over B: 8.5000:1 

The log5 estimate works out to a winning percentage of .892. The correct win probability is .895. It's close, but it's still wrong.
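
These numbers can be verified with exact fractions. The Python sketch below assumes independent shots and a full replay on ties, as described above. The function names are illustrative, and the `shots` parameter is added only so the one-shot game can be checked with the same code.

```python
from fractions import Fraction
from math import comb

def makes_distribution(p, shots=2):
    """Probability of making 0, 1, ..., `shots` baskets."""
    return [comb(shots, k) * p**k * (1 - p)**(shots - k) for k in range(shots + 1)]

def win_odds(p_x, p_y, shots=2):
    """Odds that X beats Y; tied rounds are replayed, so only decisive rounds count."""
    dx, dy = makes_distribution(p_x, shots), makes_distribution(p_y, shots)
    x_wins = sum(dx[i] * dy[j] for i in range(shots + 1) for j in range(shots + 1) if i > j)
    y_wins = sum(dx[i] * dy[j] for i in range(shots + 1) for j in range(shots + 1) if j > i)
    return x_wins / y_wins

a, b, c = Fraction(4, 5), Fraction(2, 5), Fraction(1, 2)

true_odds = win_odds(a, b)                   # 17/2, i.e. 8.5 to 1
log5_odds = win_odds(a, c) / win_odds(b, c)  # 91/11, about 8.27 to 1

print(true_odds, log5_odds)
print(float(true_odds / (1 + true_odds)), float(log5_odds / (1 + log5_odds)))
```

Calling the same functions with `shots=1` gives 6 to 1 both ways, which is the sudden-death case where log5 is exact.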

------

So, why doesn't log5 work here, and in Tango's case? 

Because: it's a known result (hat tip: Ted Turocy) that for log5 to work, scores have to follow a certain, specific distribution. In most sports, they don't. How well log5 works depends on how well the real-life distribution of scores follows the assumed, theoretical distribution.

I'll get to that (finally) next post.



(Previous log5 posts: One, and Two)

(Updated 9/15 to remove incorrect reference to "height baseball.")


Saturday, September 03, 2016

A case where log5 works perfectly

Suppose there's a coin-flipping league where every team has the same talent. After the season is over, you notice one team in the standings at .800, and another is at .400. Those records include the two teams facing each other at least once.

What is the probability, in retrospect, that the .800 team beat the .400 team in a particular game where they met? The log5 formula says you figure it out like this:

.800 is a ratio of 4 wins per loss
.400 is a ratio of 2/3 wins per loss

4 divided by 2/3 equals 6 

6 wins per loss is 6 wins per 7 games, which is .857.

(You can use the traditional form of the log5 formula if you want, to get the same .857.)

And it turns out that, in this case, the log5 formula DOES work. It works perfectly. The probability is indeed .857, and you can prove that.

I'll work out this particular example. Call the two teams A (.800) and B (.400). Suppose there were only 5 games in the season, so that A went 4-1 and B went 2-3. 

Suppose the two teams only met one time. What's the chance A won that game?

Start with the case where A beat B. If that happened, A would have to have gone 3-1 in its other four games, and B would have to have gone 2-2. 

There are four permutations where A goes 3-1 (WWWL, WWLW, WLWW, LWWW), and six ways for B to go 2-2 (WWLL, WLWL, WLLW, LWWL, LWLW, LLWW).

That means there are 24 (6 x 4) ways to draw up the season when A beats B.

Now, suppose that B beat A. That means A went 4-0 otherwise, and B went 1-3. 

There is only one way for A to go unbeaten (WWWW), and only four ways to arrange B's 1-3 (WLLL, LWLL, LLWL, LLLW). 

That means that there are 4 (1 x 4) ways to draw up the season when B beats A.

Since this is coin flipping, all the cases have equal probability of happening. So, A beats B 24 times for every 4 times that B beats A. 

That's a ratio of 6:1, which is 6/7, which is .857 -- exactly as log5 predicts.

-------

It's not that hard to go from this example to a proof. Just replace the raw numbers by variables for number of games total (n), number of games A wins (a), and number of games B wins (b). When you count permutations, you'll wind up with factorial terms, and when you divide the A permutations by the B permutations, the factorials will cancel out, and you'll be left with

odds of A beating B = (a/(n-a)) / (b/(n-b))

Which is exactly the log5 formula, expressed as an odds ratio.
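
Here is the counting argument as a short Python sketch. It assumes, as in the worked example, that the two teams met exactly once and that each team's other n-1 games can be arranged independently. The function name is made up for illustration.

```python
from fractions import Fraction
from math import comb

def season_counts(n, a, b):
    """Ways to arrange the other n-1 games of each team, split by who won the A-B game."""
    a_beats_b = comb(n - 1, a - 1) * comb(n - 1, b)   # A needs a-1 more wins, B still needs b
    b_beats_a = comb(n - 1, a) * comb(n - 1, b - 1)   # A still needs a wins, B needs b-1 more
    return a_beats_b, b_beats_a

print(season_counts(5, 4, 2))   # (24, 4), the 6:1 ratio from the example

# The factorials cancel for any season length and any records with 0 < a, b < n.
for n in range(3, 9):
    for a in range(1, n):
        for b in range(1, n):
            wins, losses = season_counts(n, a, b)
            assert Fraction(wins, losses) == Fraction(a, n - a) / Fraction(b, n - b)
```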

I don't know much about the history of log5, but some of you do. Was this part of the genesis of log5, that it could be proven to work retrospectively, so when it seemed to work pretty decently as a forecast, it became the standard?

-------

But wait a minute -- last month, I argued that log5 couldn't possibly work when you used season records. If you recall, I posted this chart:

 matchup           log5
---------------------------
.800 vs .800       .500
.800 vs .700       .631
.800 vs .600       .727
.800 vs .500       .800
.800 vs .400       .857
.800 vs .300       .903
.800 vs .200       .941
---------------------------
 Average           .766

This says that an .800 team, playing against the league as log5 would predict, would actually play at a .766 pace. That's a contradiction -- it should be .800 -- so log5 must be wrong!

One difference between then and now is that, before, we had the .800 team playing against a clone of itself. That's not true here, so let's redo the chart without the first line: 

 matchup           log5
---------------------------
.800 vs .700       .631
.800 vs .600       .727
.800 vs .500       .800
.800 vs .400       .857
.800 vs .300       .903
.800 vs .200       .941
---------------------------
 Average           .810

Well, the average still isn't .800, so we still have a problem.

So, what's going on? Is my logic wrong here, or is my logic wrong there? Does log5 work, or doesn't it?

This bugged me for a while, until I sorted it out in my head. I think both conclusions are correct. The log5 formula actually *does* work in this case, and it actually *does not* work in the other case, for exactly the reasons described. 

But what about these charts that show the contradiction? They apply there, but they don't apply here. 

The difference is: when .800 is the *talent* of the team, it's constant, and you can use it on every line of the chart. But, when you use the *retrospectively observed performance*, it changes with every game. So you can't use .800 in every line of the chart.

Suppose the (eventual) 4-1 team wins the first game. In that case, it's only an (eventual) 3-1 team after that. That means its retrospectively observed performance next game isn't .800, it's .750. That means you have to draw up the chart like this:

  matchup           log5
---------------------------
 .800 vs .700       .631
 .750 vs .600       .667
 ...

If it loses the first game, it's 1.000 after, and you draw up the chart like this:

  matchup           log5
---------------------------
 .800 vs .700       .631
1.000 vs .600      1.000
...

So the chart has to be different every time, based on what actually happens in the games. 

I believe that if you were to do every possible permutation of the season, weighted by log5 probability, and average the averages, you would indeed wind up with .800. 

-------

Now, the proof that validates the retrospective use of log5 only works because we assumed that games are decided by coin flips. If that weren't the case, then all the permutations wouldn't prospectively have an equal chance of happening, and the logic would fall apart. 

But would the *result* still hold? If you don't know A's talent or B's talent, but they still go 4-1 and 2-3, respectively, does the 6 out of 7 still hold?

I don't think it does. Again, imagine "height baseball," where the taller team always wins. It could be that A is the second-tallest team out of 6, and B is the fourth-tallest. That would be consistent with the 4-1 and 2-3 records (imagine a round-robin season).  But A would have a 100% chance of beating B, not 85.7%. 

So this is a special case. Whether log5 works here because there's something special about the 50%, or whether it's because all teams are the same, or whether it's just that the average record against all teams happens to equal the record against the average team ... I don't know.

But still. To me, there are no coincidences in math, just relationships that look coincidental until we see the connection. Maybe when I understand log5 better, it'll be self-evident why it works here. 

As I said, some of you guys reading this are much more familiar with the intricacies of log5 than I am. Is this a known result? Am I reinventing a wheel?




Friday, August 26, 2016

"Bias" in log5 estimates -- a clarification

Last post, I argued that the log5 method has a bias. When you estimate a team's talent, in the sense of how well it would do playing a normal season against a league's worth of teams, you wind up being too conservative, giving the underdog too much of a chance to win.

Why does that happen? Because for the log5 formula to work, you need to use a team's expectation against a .500 team, not a team's expectation averaged out among all teams. The two aren't the same. You could come up with a method to figure out how big the difference is; it varies by the empirical spread of talent in the league.  

It's easy to see why this is the case with a simple example. Suppose I know more statistics than 90 percent of the population with a degree. If I played a season's worth of stats exams against all of them, I'd finish with a .900 record. But, consider someone with an average amount of stats knowledge, a .500 graduate. That person probably took one or two stats courses, at most. So, I'd beat him or her almost 100 percent of the time, not just 90 percent.

The differences aren't that big in pro sports. By my estimate, an NFL or NBA team that has a .686 talent over an average season actually has a .700 talent against an average team. In the normal range of MLB team talent, the difference is negligible. A team with .565 talent -- that's 91.5 wins out of 162 -- would probably play only .566 against an average team, a discrepancy of only one point.

-------

Anyway, after I posted that, Ted Turocy wrote me, disagreeing with how I described the problem. Ted agreed with my argument itself, but felt strongly that it doesn't show that log5 is "biased."

Here's how I understand his objections:

1.  The log5 (or odds ratio) method has been used successfully for years.  In the academic literature, it's called the "Bradley-Terry" method, named after two academic researchers who introduced it in 1952. In one of the fields Ted studies, Contest Theory, it's been the standard for decades. In the academic world, researchers don't make the mistake I described -- it's understood completely that talent estimates relate to performance against a .500 team.  In fact, the algorithms used to estimate talent don't usually even mention season-against-league performance.

2.  The log5 formula (or the odds ratio formula, which is algebraically identical) has been formally proven to provide unbiased estimates under certain assumptions (which I'll talk about in a future post).  

3.  My objection, that log5 is biased if you use "against league" estimates of talent instead of "against .500" estimates of talent, applies to ANY estimator, not just log5. That's because for "average talent against all teams" to always equal "talent against an average team", the formula would have to be linear. But linearity won't work for a correct formula, since all estimates have to be between .000 and 1.000, and linear formulas would routinely exceed those limits.

I agree with all three of these objections, with one minor nitpick: the academic literature rarely uses the term "log5." It mostly uses "Bradley-Terry," or "odds ratio."  While the formula is the same, the application is different.  

In "normal" sabermetrics, "log5" just uses a season's record as an estimate of talent -- I have *never* seen a mainstream sabermetric study acknowledge that the "against .500" talent should be used instead. In my experience, it's just been commonly assumed that "talent against league" and "talent against .500" were exactly the same number -- and, sometimes that's been stated explicitly.  (In fairness, while the two aren't the same, it turns out that in baseball, they're close enough for most purposes.)

So, I was prepared to say, OK, maybe we can say that "log5" is biased the way it's used in the sabermetric literature, but Bradley-Terry isn't biased in the way it's used in the academic literature. But, Ted let me know that, no, that won't work -- the term "log5" actually *is* used in the academic literature to mean "odds ratio formula," and it's used properly.  So my description of it as "biased" is still wrong.

OK, fair enough. Ted has convinced me that my title is misleading, that it implies that the log5 formula *itself* is biased, even when used properly and talent is assumed to mean "against a .500 team." I had considered "log5" to implicitly mean "using talent against league," which I shouldn't have.

I should have said something like: "A log5 estimate is biased against the favorite when 'record against league' is used as the measure of team talent instead of 'record against .500.'"

Having said that, I say again that both Ted and I agree that the bias is there, the explanation is correct, and it's just the characterization of "log5 is biased" that's in dispute.

I'll soon update the previous post to make that clear.

------

BTW, during our e-mail exchange, Ted educated me about other aspects of the issue, for which I thank him. My current understanding of log5, as I will describe it in future posts, is much clearer because of his help.  However, I think Ted still disagrees with me on a few of the things I will be posting. I may wind up being wrong, but probably less wrong than if Ted hadn't helped me out.




Friday, August 05, 2016

log5 estimates are biased when we use the wrong measure of "talent"

The "log5" method tries to predict a team's chance of winning a game based on its talent and that of its opponent. The basic formula, for teams A and B, is  

P = (A - AB)/(A+B-2AB)

A few months ago, I wrote that there's no theoretical reason for the formula to always work. In fact, there's an obvious counterexample where it doesn't work. Consider "height baseball," where the taller team always wins. Suppose team A is .700, because it's taller than 70 percent of its opponents, while team B is .400, being taller than only 40 percent of its opponents. The formula predicts team A will win 77.8 percent of games against B, but, of course, it will win 100 percent.
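
For reference, here is the formula as a small Python function, together with the equivalent odds-ratio form. It's a minimal sketch and the function names are illustrative. The example reproduces the 77.8 percent figure for the .700 and .400 teams.

```python
def log5(p_a, p_b):
    """P = (A - AB) / (A + B - 2AB)"""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

def log5_from_odds(p_a, p_b):
    """Same estimate via odds: divide A's wins-per-loss by B's, then convert back."""
    ratio = (p_a / (1 - p_a)) / (p_b / (1 - p_b))
    return ratio / (1 + ratio)

print(round(log5(0.700, 0.400), 3))            # 0.778
print(round(log5_from_odds(0.700, 0.400), 3))  # 0.778
```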

So why doesn't log5 work? I think I've found one reason, which I'll explain in this post. 

(There's a second reason -- which is actually a first reason, since it came back in 2011. In a blog post, Tango showed another example, using sprinter times, of how the odds ratio method (on which log5 is based) doesn't work, and Kincaid explained why in the comments. When I started writing this post, I originally thought mine was the same argument, just explained differently. But it's not. My argument actually doesn't apply to Tango's example ... I'll try to explain the Tango/Kincaid logic in a future post.)

------

Suppose you have team A, an .800 talent, playing team D, a .500 talent. What is the probability team A wins?

It seems that the answer should be ... well, .800. If team A is .800 against the league, which averages .500, then you'd think it should be .800 against a bona fide average team. And, the log5 method confirms the intuition -- plug in the numbers, and you do, indeed, get .800.

But that can't be right. I think it has to be the case that the .800 team plays *better* than .800 against the league-average team, and that it's easy to see why without any fancy math.

It actually doesn't depend on any technicality about what it actually means to be an .800 team. 

For instance, it's not because, if a team is .800 against the rest of the league, it must be *worse* than .800 in general, since it doesn't have to play itself. Even if you fix that problem, team A will have to be better than .800 against team D.

It's not because of home/road issues either, or the difference between observed .800 and talent .800 ... adjust for those, and the result still holds.

Let me restate the question in more detail, to try to eliminate some of those technicalities: 

----

In a league with no home field advantage, there are seven teams, A through G. 

If team A played a balanced schedule against all of them -- including itself (or a clone of itself) -- you would expect it to finish with an .800 record. So, in that respect, team A "has .800 talent".

By the same definition, teams B through G, respectively, have .700 talent, .600, .500 ... all the way down to .200 talent.

When team A (.800) plays team D (.500), what's the probability A wins?

----

The answer: not .800.

----

Let's create a little spreadsheet of team A's performance against all seven teams. It looks like this:

 matchup         probability
---------------------------
.800 vs .800
.800 vs .700
.800 vs .600
.800 vs .500
.800 vs .400
.800 vs .300
.800 vs .200

Now, let's fill in the log5 estimate for each one of those matchups:

 matchup           log5
---------------------------
.800 vs .800       .500
.800 vs .700       .631
.800 vs .600       .727
.800 vs .500       .800
.800 vs .400       .857
.800 vs .300       .903
.800 vs .200       .941

Those look quite reasonable, except that ... they don't average out to .800! They average out only to .766.

 matchup           log5
---------------------------
.800 vs .800       .500
.800 vs .700       .631
.800 vs .600       .727
.800 vs .500       .800
.800 vs .400       .857
.800 vs .300       .903
.800 vs .200       .941
---------------------------
 Average           .766

There's no trick here. This is a real, valid counterexample, one that shows that log5 doesn't actually work. And there's nothing special about our choice of .800. The average would always wind up too close to .500 (too low for a winning team, too high for a losing one), except for a team that's exactly .500.
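
The chart's average is easy to recompute. The Python sketch below applies the log5 formula to the .800 team against each of the seven talent levels and averages the results. The names are illustrative.

```python
def log5(p_a, p_b):
    """P = (A - AB) / (A + B - 2AB)"""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

team = 0.800
opponents = [0.800, 0.700, 0.600, 0.500, 0.400, 0.300, 0.200]

estimates = [log5(team, opp) for opp in opponents]
average = sum(estimates) / len(estimates)

print(round(average, 3))   # 0.766, short of the .800 the team is supposed to play
```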

Suppose we abandon the log5 estimates, then, and just try to fill in probabilities that seem reasonable. Can we do that, while insisting that the middle number stay .800? 

We have to hold the first number at .500, since, when a team plays a clone of itself, it must win 50 percent of its games, by definition. So we start with a chart that looks like this:

 matchup       probability
---------------------------
.800 vs .800        .500
.800 vs .700
.800 vs .600
.800 vs .500        .800
.800 vs .400
.800 vs .300
.800 vs .200
---------------------------
 overall avg        .800

From here, how do we fill in the second and third lines? One obvious way, that seems not too unreasonable, is just to stick in ".600" and ".700". 

 matchup       probability
---------------------------
.800 vs .800        .500
.800 vs .700        .600
.800 vs .600        .700
.800 vs .500        .800
.800 vs .400
.800 vs .300
.800 vs .200
---------------------------
 overall avg        .800

Having done that, it seems reasonable to just continue the pattern:

 matchup       probability
---------------------------
.800 vs .800        .500
.800 vs .700        .600
.800 vs .600        .700
.800 vs .500        .800
.800 vs .400        .900
.800 vs .300       1.000 
.800 vs .200       1.100
---------------------------
 overall avg        .800

That does, indeed, keep the average at .800. But it's obviously wrong -- it makes no sense to estimate that team A beats the .200 team 110% of the time.

So, is there another way we can fill this in, while keeping the .500 and .800 estimates, so that it all makes sense? No, I don't think that's possible. 

Right now, the first three lines of the chart average .600, which is .200 points below the .800 average we're shooting for. Therefore, the bottom three lines must average .200 points *above* .800. In other words, the bottom three lines have to average 1.000! Clearly, that can't be done.

So, we have to decrease the second and third lines. Maybe we change them to, say, .650 and .750. If we do that, then the first three lines average only .167 points below .800. Now, the bottom has to average "only" .967. Which, again, doesn't pass the sniff test.

Try if you want, but I'm pretty sure that you're not going to find anything that seems like a plausible breakdown. The only way to get something that looks reasonable, I think, requires the middle line to be something higher than .800.
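
For anyone who does want to try, here is a minimal Python sketch of the arithmetic in the last few paragraphs. The helper name `required_average` is made up for illustration and is not from the original analysis. It answers one question: given the lines already filled in, what do the remaining lines have to average to hit the target?

```python
# Hypothetical helper (name is an assumption, not from the post):
# given the chart entries already filled in, return the average the
# remaining entries would need for the overall mean to hit `target`.
def required_average(filled, n_total, target):
    remaining = n_total - len(filled)
    return (target * n_total - sum(filled)) / remaining

# Top three lines at .500 / .600 / .700, middle line held at .800
print(round(required_average([0.500, 0.600, 0.700, 0.800], 7, 0.800), 3))  # 1.0

# Softer guesses for lines two and three: .650 and .750
print(round(required_average([0.500, 0.650, 0.750, 0.800], 7, 0.800), 3))  # 0.967
```

Any other guesses for the second and third lines can be plugged in the same way to see what they force on the bottom three.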

--------

How much higher? From the original log5 chart, we see that

(a) team A was .766 overall, but
(b) team A played .800 ball against the .500 team.

If a .766 team goes .800 against an average team, maybe we can extrapolate that an .800 team would go, say, .840 against an average team.

Plugging .840 into the middle slot, and filling in the rest in some plausible fashion to average .800, maybe the chart would look something like this:

 matchup       probability
---------------------------
.800 vs .800        .500
.800 vs .700        .690
.800 vs .600        .770
.800 vs .500        .840
.800 vs .400        .900
.800 vs .300        .940
.800 vs .200        .960
---------------------------
 overall avg        .800

That's just a guess, of course. But, no matter what the true values are, the point remains: the middle entry must be significantly higher than .800.

And, another consequence: all the outcome probabilities, other than between equal teams, are *more extreme* than log5 suggests. The log5 formula is too conservative, always underestimating the favorite's chances of winning, when there is a favorite.

--------

So, log5 doesn't actually work. But, I think, there's an easy way to tweak it so that it DOES work. 

And that is: instead of using the log5 formula with the respective teams' expected talent against the league, we use their expected record against a .500 team. 

In our league, a team that finished .800 overall beats an average team 84% of the time, not 80%. Which means, for this new definition of log5, it's not an .800 talent, it's an .840 talent.

Let's reserve the word "talent" for its usual meaning, the expected record against the league, and use the made-up word "5talent" to mean talent against a .500 team. In our seven-team league, a team with a talent of .800 has a "5talent" of .840.

What's the 5talent of the rest of the teams? We can guess. If an .800 talent is an .840 5talent, maybe a .700 team is .720, a .600 team is .610, and so on. 

Repeating the log5 calculation using 5talent instead of talent, we get:

5talent matchup     log5
------------------------------
.840 vs .840        .500
.840 vs .720        .671
.840 vs .610        .771
.840 vs .500        .840
.840 vs .390        .891
.840 vs .280        .931
.840 vs .160        .965
------------------------------
 Overall avg        .796

Not bad! Under our estimates, a team that's an .840 5talent works out to a .796 talent. You could easily tweak the assumptions to get the average to .800 exactly, if you wanted to.
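
The chart above can be reproduced with a short Python sketch. It assumes the standard two-team log5 formula, and it uses the guessed 5talents from the text, so the result is only as good as those guesses. The names `log5` and `five_talents` are mine, for illustration.

```python
def log5(a, b):
    """Standard log5: chance that a team of strength a beats a team of strength b."""
    return (a - a * b) / (a + b - 2 * a * b)

# The guessed 5talents from the text (assumptions, not measured values)
five_talents = [0.840, 0.720, 0.610, 0.500, 0.390, 0.280, 0.160]

top = five_talents[0]
probs = [log5(top, opp) for opp in five_talents]

for opp, p in zip(five_talents, probs):
    print(f"{top:.3f} vs {opp:.3f}    {p:.3f}")

# Arithmetic mean of the win probabilities = talent against the league
overall = sum(probs) / len(probs)
print(f"overall avg        {overall:.3f}")
```

Changing the entries in `five_talents` is the quickest way to see how much tweaking it takes to move the overall average to .800.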

--------

Why does this happen? Why doesn't log5 work if you use performance against the league?

Because the win probabilities are based on the odds ratio. 

The log5 method works like this: suppose you have an .800 team against a .400 team. The .800 team has average 4:1 odds of winning. The .400 team has average 2:3 odds of winning. Divide 4/1 by 2/3, and you get 6/1. So the .800 team has 6:1 odds of beating the .400 team. That works out to an .857 winning percentage.

(I used odds ratios instead of the "usual" log5 formula, but it's exactly the same thing. If you do some algebra on the odds ratio calculation, you can actually derive the log5 formula at the top of this post.

When I calculate log5 probabilities, I actually use the odds ratio method, because the method is easy to remember, and I don't have to memorize the formula.)
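
Here is a small Python sketch of the odds ratio method, checked numerically against the standard log5 formula. The function names are made up for illustration. A numeric check over a grid is not a proof, just a quick sanity check that the two routes agree.

```python
def to_odds(p):
    """Convert a winning percentage to odds of winning (e.g. .800 -> 4.0)."""
    return p / (1 - p)

def from_odds(odds):
    """Convert odds of winning back to a winning percentage."""
    return odds / (1 + odds)

def log5_via_odds(a, b):
    """Odds ratio method: divide one team's odds by the other's."""
    return from_odds(to_odds(a) / to_odds(b))

def log5_formula(a, b):
    """Standard log5 formula."""
    return (a - a * b) / (a + b - 2 * a * b)

# The example from the text: .800 team vs .400 team, 4:1 divided by 2:3
print(round(log5_via_odds(0.800, 0.400), 3))  # 0.857

# Compare the two methods for every pair of talents from .01 to .99
grid = [i / 100 for i in range(1, 100)]
max_gap = max(abs(log5_via_odds(a, b) - log5_formula(a, b))
              for a in grid for b in grid)
print(max_gap < 1e-9)  # the methods agree up to floating-point rounding
```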

The log5 formula is based on *multiplication* of *odds ratios*. But a team's overall average record, the one we normally talk about, is based on *addition* of *probabilities*. Those are two different operations on two different quantities. 

We calculated the bottom line of the chart, the overall winning percentage, as the arithmetic mean of the win probabilities. But, the odds ratio method doesn't know about arithmetic means and win probabilities. It knows only about geometric means of odds ratios. 

And, as it turns out, if we calculate the average as the geometric mean of the odds ratios ... well, then everything works! The geometric mean of the odds ratios against all teams is the same as the odds ratio against the average team.

Going back to the chart, and going back to the .800 talent, I'll convert the probabilities to odds ratios. (The odds ratio is the probability of winning divided by the probability of losing.)

talent matchup     log5     odds ratio
------------------------------------------
.800 vs .800       .500        1.00
.800 vs .700       .631        1.71
.800 vs .600       .727        2.67
.800 vs .500       .800        4.00
.800 vs .400       .857        6.00
.800 vs .300       .903        9.33
.800 vs .200       .941       16.00
------------------------------------------
 arithmetic mean   .766        
  geometric mean               4.00 (.800)

     
The .766 arithmetic mean doesn't equal the .800 talent, but the 4.00 geometric mean of the odds ratios *does* equal the 4.00 odds ratio talent. 

In other words, a team that's a 4.00 odds ratio talent against the league overall is also a 4.00 talent against the league average team. 
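
The two bottom lines of that chart can be checked with a few lines of Python. This is a sketch using the seven-team league from the text. The geometric mean is computed by averaging the logs of the odds ratios, which is one common way to do it and not necessarily how the chart was originally built.

```python
import math

talents = [0.800, 0.700, 0.600, 0.500, 0.400, 0.300, 0.200]
team = 0.800

def to_odds(p):
    return p / (1 - p)

# Odds ratio of the .800 team against each opponent (including its clone)
matchup_odds = [to_odds(team) / to_odds(opp) for opp in talents]
matchup_probs = [o / (1 + o) for o in matchup_odds]

# Arithmetic mean of the win probabilities
arith_mean = sum(matchup_probs) / len(matchup_probs)

# Geometric mean of the odds ratios = exp of the average log odds ratio
geo_mean = math.exp(sum(math.log(o) for o in matchup_odds) / len(matchup_odds))

print(f"arithmetic mean of probabilities: {arith_mean:.3f}")
print(f"geometric mean of odds ratios:    {geo_mean:.2f} "
      f"({geo_mean / (1 + geo_mean):.3f})")
```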

-------

OK, I cheated a bit. The reason the geometric mean works out perfectly is that, in our league, the team talents are symmetrically distributed around .500. 

If the talents aren't symmetrical, it doesn't work out perfectly -- I found the geometric mean comes out a little too high. However: it's a lot closer than the arithmetic mean works out to be. And, most leagues are symmetrical enough that it wouldn't be an issue in real life. There aren't many leagues with, say, 20 teams with .480 talent, but one team at .900. 

(Also... I wonder whether, if you use the 5talents in the chart instead of the talents, the geometric mean might work out perfectly even for non-symmetrical leagues. That's just a gut feeling, and I need to think about it more.)

-------

I think that what makes the arithmetic mean give such a serious underestimate is that our league has large extremes of team talent, spanning a range from .200 to .800. The farther the numbers are from .500, the more the geometric mean of the odds ratio differs from the arithmetic mean of the probabilities. 

One measure of "extremes" is the standard deviation. In our hypothetical seven-team league, the SD of talent is .200. In actual Major League Baseball, the SD of talent is only around .055.

So, let's simulate the MLB spread with an example where, instead of teams varying from .800 to .200, they only vary from .565 to .435. Here's the log5 chart:

 talent matchup    log5     odds ratio
------------------------------------------
.565 vs .565       .500      1.0000
.565 vs .500       .565      1.2989
.565 vs .435       .628      1.6870
------------------------------------------
arithmetic mean    .564
 geometric mean              1.2989 (.565)

This league has a talent SD of .053, similar to MLB. And, with that smaller SD, log5 is a very good fit. Now, a team with a 5talent of .565 has a talent of .564 -- so close that it makes no material difference. 

It looks like the log5 method is very, very sensitive to the dispersion of talent in the league. Bumping the SD from .053 to .200 -- only a factor of 4 -- made the log5 discrepancy jump from 1 point all the way to 40 points.

So that explains why, even though log5 produces conservatively biased estimates when we use the "usual" definition of talent, it nonetheless works so well for baseball.  It's because in MLB, the spread in talent is reasonably small.

------

We can repeat this for other sports. Ten years ago, Tom Tango found the SD of talent to be .134 in the NBA, and .143 in the NFL.

 SD of talent
-------------------------------
.200 hypothetical 7-team league
.143 NFL
.134 NBA
.055 MLB
.053 hypothetical 3-team league
-------------------------------

We know that log5 is pretty far off for the 7-team league, and pretty good for MLB. The two leagues in between -- the NFL and NBA -- are right in the middle. 

If we do a five-team league, with talents ranging from .700 to .300, the SD is .141, which is pretty close to the NBA and NFL. Here's the calculation for the .700 team:

talent matchup     log5     odds ratio
------------------------------------------
.700 vs .700       .500        1.00
.700 vs .600       .609        1.56
.700 vs .500       .700        2.33
.700 vs .400       .778        3.50
.700 vs .300       .845        5.44
------------------------------------------
 arithmetic mean   .686        
  geometric mean               2.33 (.700)


We can estimate that in the NFL or NBA, a .686 talent has a "5talent" of .700, meaning that it'll beat a .500 team 70% of the time. 
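
To see the three hypothetical leagues side by side, here is a Python sketch that computes the SD of talent and the top team's arithmetic-mean record for each. It assumes the population standard deviation (`statistics.pstdev`), which is what matches the .200, .141 and .053 figures above. The league labels and the `summarize` helper are made up for illustration.

```python
import statistics

def log5(a, b):
    """Standard log5: chance that a team of strength a beats a team of strength b."""
    return (a - a * b) / (a + b - 2 * a * b)

# The three hypothetical leagues from the text
leagues = {
    "7-team": [0.800, 0.700, 0.600, 0.500, 0.400, 0.300, 0.200],
    "5-team": [0.700, 0.600, 0.500, 0.400, 0.300],
    "3-team": [0.565, 0.500, 0.435],
}

def summarize(talents):
    """Return (SD of talent, top team's strength, its arithmetic-mean log5 record)."""
    sd = statistics.pstdev(talents)  # population SD (an assumption)
    top = max(talents)
    avg = sum(log5(top, opp) for opp in talents) / len(talents)
    return sd, top, avg

for name, talents in leagues.items():
    sd, top, avg = summarize(talents)
    print(f"{name}: SD {sd:.3f}, top team {top:.3f}, "
          f"arithmetic mean {avg:.3f}, gap {top - avg:.3f}")
```

The gap column is the difference between the number fed into log5 and the record log5 hands back against the league, which is the quantity the charts above show as .800 vs. .766, .700 vs. .686, and .565 vs. .564.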

That means that if you're estimating a .686 talent against a .500 talent, your estimate is going to be .014 too conservative. That's approximately what it'll be for any pair of teams about that far apart. It'll be more accurate for opponents closer in talent, and less accurate for talent mismatches, at least to the point of diminishing returns (if log5 says your team is .990, it's not possible that it really should be 1.005).

Does that bother you, that when you use log5 on teams with significantly different talent, your estimate is off by as much as .014? If it doesn't, just go ahead and keep using log5. 

But for any study that's looking for small effects ... well, it seems to me that .014 points is probably as big as the effect you're looking for. If you don't correct for it, you're underestimating the favorite by, what, maybe a quarter as much as home field advantage?

If you do some kind of "record after a time zone change on a hot streak" study, and you find an effect of .014 points ... since hot streaks mean good teams, could it just be that all you did was rediscover that log5 is biased too conservatively when you use the wrong definition of talent?  



UPDATE, 8/25/16: Changed title of post and reworded a few sentences to emphasize that the "bias" in log5 is not intrinsic to the formula itself, but occurs when we improperly use "talent" instead of "5talent." Hat tip: Ted Turocy. See longer explanation here.  
