Friday, November 17, 2017

How Elo ratings overweight recent results

"Elo" is a rating system widely used to rate players in various games, most notably chess. Recently, FiveThirtyEight started using it to maintain real-time ratings for professional sports teams. Here's the Wikipedia page for the full explanation and formulas.

In Elo, everyone gets a number that represents their skill. The exact number doesn't matter on its own -- only the difference between you and the other players you're being compared to does. If you're 1900 and your opponent is 1800, it's exactly the same as if you're 99,900 and your opponent is 99,800.

In chess, they start you off with a rating of 1200, because that happens to be the number they picked. You can interpret the 1200 either as "beginner," or "guy who hasn't played competitively yet so they don't know how good he is."  In the NBA system, FiveThirtyEight decided to start teams off with 1500, which represents a .500 team. 

A player's rating changes with every game he or she plays. The winner gets points added to his or her rating, and the loser gets the same number of points subtracted. It's like the two players have a bet, and the loser pays points to the winner.

How many points? That depends on two things: the "K factor," and the odds of winning.

The "K factor" is chosen by the organization that does the ratings. I think of it as double the number of points the loser pays for an evenly matched game. If K=20 (which FiveThirtyEight chose for the NBA), and the odds are even, the loser pays 10 points to the winner.

If it's not an even match, the loser pays the number of points by which he underperformed expectations. If the Warriors had a 90% chance of beating the Knicks, and they lost, they'd lose 18 points, which is 90% of K=20. If the Warriors win, they only gain 2 points, since they only exceeded expectations by 10 percentage points.
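Here's that update rule in code -- a minimal sketch (the function name is mine, not FiveThirtyEight's):

```python
def elo_update(rating, expected, actual, k=20):
    """Standard Elo update: move the rating by K times the amount by
    which the player over- or under-performed expectations.
    `expected` is the pre-game win probability; `actual` is 1 for a
    win, 0 for a loss."""
    return rating + k * (actual - expected)

# The Warriors as 90% favorites, with K=20:
# a loss costs 18 points; a win gains only 2.
warriors_lose = elo_update(1500, 0.9, 0)   # 1482
warriors_win = elo_update(1500, 0.9, 1)    # 1502
```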

How does Elo calculate the odds? By the difference between the two players' ratings. The Elo formula is set so that a 400-point difference represents a 10:1 favorite. An 800 point favorite is 100:1 (10 for the first 400, multiplied by 10 for the second 400). A 200 point favorite is 3.16 to 1 (3.16 is the square root of 10). A 100 point favorite is 1.78 to 1 (the fourth root of 10), and so on.
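The odds formula is a one-liner in code (again a sketch; the function name is mine):

```python
def elo_expected(diff):
    """Win probability implied by a rating difference `diff` (your
    rating minus your opponent's). A 400-point edge is 10:1 odds,
    i.e. a win probability of 10/11."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A 400-point favorite wins about .909 of the time (10:1).
# A 200-point favorite wins about .760 (3.16:1).
```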

In chess, the K factor varies depending on which chess federation is doing the ratings, and the level of skill of the players. For experienced non-masters, K seems to vary between 15 and 32 (says Wikipedia). 

------

Suppose A and B have equal ratings of 1600, and A beats B. With K=20, A's rating jumps to 1610, and B's rating falls to 1590. 

A and B are now separated by 20 points in the ratings, so A is deemed to have odds of 1.12:1 of beating B. (That's because the 20th root of 10 -- that is, 10 to the power of 20/400 -- is 1.12.)  That corresponds to an expected winning percentage of .529.

After lunch, they play again, and this time B beats A. Because B was the underdog, he gets more than 10 points -- 10.6 points (.529 times K=20), to be exact. And A loses the identical 10.6 points.

That means after the two games, A has a rating of 1599.4, and B has a rating of 1600.6.
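You can replay the two games in a few lines to check those numbers (a self-contained sketch; the helper name is mine):

```python
def expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

K = 20
a, b = 1600.0, 1600.0

# Game 1: A beats B. Even match, so 10 points change hands.
e_a = expected(a - b)
a += K * (1 - e_a)
b -= K * (1 - e_a)

# Game 2: B, now a 20-point underdog, beats A.
e_b = expected(b - a)
b += K * (1 - e_b)
a -= K * (1 - e_b)

# A ends up around 1599.4, B around 1600.6; the 3200 total
# points are conserved, since the loser pays the winner.
```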

------

That example shows one of the properties of Elo -- it weights recent performance higher than past performance. In their two games, A and B effectively tied, each going 1-1. But B winds up with a higher rating than A because his win was more recent.

Is that reasonable? In a way, it is. People's skill at chess changes over their lifetimes, and it would be weird to give the same weight to a game Garry Kasparov played when he was 8, as you would to a game Garry Kasparov played as World Champion.

But in the A and B case, it seems weird. A and B played both games the same day, and their respective skills couldn't have changed that much during the hour they took their lunch break. In this case, it would make more sense to weight the games equally.

Well, according to Wikipedia, that's what would actually happen. Instead of updating the ratings every game, the Federation would wait until the end of the tournament, and then compare each player to his or her overall expectation based on ratings going in. In this case, A and B would be expected to go 1-1 in two games, which they did, so their ratings wouldn't change at all.

But, if A and B's games were days or weeks apart, as part of different tournaments, the two games would be treated separately, and B might indeed wind up 1.2 points ahead of A.

------

Is that a good thing, giving a higher weight to recency? It depends how much higher a weight. 

People's skill does indeed change daily, based on mood, health, fatigue -- and, of course, longer-term trends in skill. In the four big North American team sports, it's generally true that players tend to improve in talent until a certain age (27 in baseball), then decline. And, of course, there are non-age-related talent changes, like injuries, or cases where players just got a lot better or a lot worse partway through their careers.

That's part of the reason we tend to evaluate players based on their most recent season. If a player hit 15 home runs last year, but 20 this year, we expect the 20 to be more indicative of what we can expect next season.

Still, I think Elo gives far too much weight to recent results, when applied to professional sports teams. 

Suppose you're near the end of the season, and you're looking at a team with a 40-40 record -- the Bulls, say. From that, you'd estimate their talent as average -- they're a .500 team.

Now, they win an even-money game, and they're 41-40, which is .506. How do you evaluate them now? You take the .506, and regress to the mean a tiny bit, and maybe estimate they're a .505 talent. (I'll call that the "traditional method," where you estimate talent by taking the W-L record and regressing to the mean.)

What would Elo say? Before the 81st game, the Bulls were probably rated at 1500. After the game, they've gained 10 points for the win, bringing them to 1510.

But, 1510 represents a .514 record, not .505. So, Elo gives that one game almost three times the weight that the traditional method does.
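You can check that directly: against an average (1500) opponent, a 1510 rating implies an expected winning percentage of about .514.

```python
def expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

elo_estimate = expected(1510 - 1500)   # about .514
traditional = 41 / 81                  # .506, before regressing to the mean

# Elo moved the estimate about .014 off .500 for one win;
# the traditional method (after a little regression) moves it only ~.005.
```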

Could that be right? Well, you'd have to argue that maybe because of personnel changes, team talent changes so much from the beginning of the year to the end that the April games are worth three times as much as the average game. But ... well, that still doesn't seem right. 

-------

Technical note: I should mention that FiveThirtyEight adds a factor to their Elo calculation -- they give more or fewer points to the winner of the game based on the score differential. If a favorite wins by 1 point, they'll get a lot fewer points than if they won by 15 points. Same for the underdog, except that the underdog always gets more points than the favorite for the same point differential -- which makes sense.

FiveThirtyEight doesn't say so explicitly, but I think they set the weighting factor so that the average margin of victory corresponds to the number of points the regular Elo would award the winner.

Here's the explanation of their system.

-------

Elo starts with a player's rating, then updates it based on results. But, when it updates it, it has no idea how much evidence was behind the player's rating in the first place. If a team is at 1500, and then it wins an even game, it goes to 1510 regardless of whether it was at 1500 because it's an expansion team's first game, or because it was 40-40, or (in the case of chess) it's 1000-1000.

The traditional method, on the other hand, does know. If a team goes from 1-1 to 2-1, that's a move of .167 points (less after regressing to estimate talent, of course). If a team goes from 40-40 to 41-40, that's a move of only .005 points. 

Which makes sense; the more evidence you already have, the less your new evidence should move your estimate. But if your estimate moves the same amount regardless of the previous evidence, you're seriously underweighting that previous evidence (which means you're overweighting the new evidence).

The chess Federations implicitly understand this, that you should give less weight to new results when you have more older results. That's why they vary the K-values based on who's playing. 

FIDE, for instance, weights new players at K=40, experienced players at K=20, and masters (who presumably have the most experience) at K=10.

------

As I said last post, I did a simulation. I created a team that was exactly average in talent, and assumed that FiveThirtyEight had accurately given them an average rating at the beginning of the year. I played out 1000 random seasons, and, on average, the team wound up at right where it started, just as you would expect.

Then, I modified the simulation as if FiveThirtyEight had underrated the team by 50 points, which would peg them as a .429 team. (They use their "CARM-Elo" player projection system for those pre-season ratings. I'm not saying that system is wrong, just checking what happens when a projection happens to be off.)

It turned out that, at the end of the 82-game season, Elo had indeed figured out the team was better than their initial rating, and had restored 45 of the 50 points. They were still underrated, but only by 5 points (.493) instead of 50 (.429). 

Effectively, the current season wiped out 90% of the original rating. Since the original rating was based on the previous seasons, that means that, to get the final rating, Elo effectively weighted this year at 90%, and the previous years at 10%. 

10% is close to 12.5%, which is one-eighth; I'll use one-eighth because it makes the calculation a bit easier. At that rate, the NBA season contains three "half lives" of about 27 games each.

That is: after 27 games, the gap of 50 points is reduced by half, to 25. After another 27 games, it's down to 12. After a third 27 games, it's down to 6, which is 12.5% of where the gap started.
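Here's a deterministic sketch of that decay: a team with a true rating of 1550 starts the season rated 1500, and each game its rating moves by the *expected* Elo change against an average opponent. (This ignores both game-to-game randomness and FiveThirtyEight's margin-of-victory adjustment, so the half-life comes out a little shorter than 27 games here -- but the team recovers roughly 45 of its missing 50 points over 82 games, matching the simulation.)

```python
def expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

K, TRUE = 20, 1550.0
rating = 1500.0                        # underrated by 50 points
p_true = expected(TRUE - 1500)         # true win prob vs. an average team

for game in range(82):
    # Expected rating change: K times (true win prob - rated win prob).
    rating += K * (p_true - expected(rating - 1500))

recovered = rating - 1500              # roughly 45 of the 50 points
```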

That means that to calculate a final rating, the thirds of seasons are effectively weighted in a ratio of 1:2:4. A game in April has four times the weight of a game in November. Last post, I argued why I think that's too high.

-------

Here's another way of illustrating how recency matters. 

I tweaked the simulation to do something a little different. Instead of creating 1,000 different seasons, I created only one season, but randomly reordered the games 1,000 times. The opponents and final scores were the same; only the sequence was different. 

Under the traditional method, the talent estimates would be the same, since all 1,000 teams had the same W-L record. But the Elo ratings varied, because of recency effects. They varied with an SD of about 26 points. That's about .037 in winning percentage, or 3 wins per 82 games.

If you consider the SD to be, in a sense, the "average" discrepancy, that means that, on average, Elo will misestimate a team's talent by 3 wins. That's for teams with the same actual record -- based only on the randomness of *when* they won or lost. 

And you can't say, "well, those three wins might be because talent changed over time."  Because, that's just the random part. Any actual change in talent is additional to that. 
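The reordering experiment is easy to sketch. This is a simplified version of my simulation -- every game is treated as an even-money game against a 1500 opponent, with no margin-of-victory adjustment -- so the spread here won't match the 26-point SD exactly, but it shows that order alone moves the final rating:

```python
import random
import statistics

def expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def final_elo(results, k=20):
    rating = 1500.0
    for won in results:                # same games, possibly a new order
        rating += k * (won - expected(rating - 1500))
    return rating

random.seed(0)
season = [random.random() < 0.5 for _ in range(82)]   # one fixed 82-game season

finals = []
for _ in range(1000):
    random.shuffle(season)             # same wins and losses, new sequence
    finals.append(final_elo(season))

spread = statistics.stdev(finals)      # nonzero: sequencing alone moves the rating
```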

--------

If all NBA games were pick-em, the SD of team luck in an 82-game season would be around 4.5 wins. Because there are lots of mismatches, which are more predictable, the actual SD is lower, maybe, say, 4.1 games. 

Elo ratings are fully affected by that 4.1 games of binomial luck, but also by another 3 games worth of luck for the random order in which games are won or lost. 

Why would you want to dilute the accuracy of your talent estimate by adding 3 wins worth of randomness to your SD? Only if you're gaining 3 wins worth of accuracy some other way. Like, for instance, if you're able to capture team changes in talent from the beginning of the season to the end. If teams vary in talent over time, like chess players, maybe weighting recent games more highly could give you a better estimate of a team's new level of skill.

Do teams vary in talent, from the beginning to the end of the year, by as much as 3 games (.037)?

Actually, 3 games is a bit of a red herring. You need more than 3 games of talent variance to make up for the 3 games of sequencing luck.

Because, suppose a team goes smoothly from a 40-win talent at the beginning of the year to a 43-win talent at the end of the year. That team will have won 41.5 games, not 40, so the discrepancy between estimate and talent won't be 3 games, but just one-and-a-half games.

As expected, Elo does improve on the 1.5 game discrepancy you get from the traditional method. I ran the simulation again, and found that Elo picked up about 90% of the talent difference rather than 50%. That means that Elo would peg the (eventual) 43-game talent at 42.7.

For a team that transitions from a 40- to a 43-game talent, the traditional method was off by 1.5 games. The Elo method was off by only 0.3 games. 

It looks like Elo is only a 1.2 game improvement over the traditional method, in its ability to spot changes in talent. But it "costs" a 3-game SD for extra randomness. So it doesn't seem like it's a good deal.

To compensate for the 3-game recency SD, you'd need the average within-season talent change to be much more than 3 games. You'd need it to be 7.5 games.

Do teams really change in talent, on average, by 7.5/82 games, over the course of a single season? Sure, some teams must, like they have injury problems to their star players. But on average? That doesn't seem plausible.

------

Besides, what's stopping you from adjusting teams on a case-by-case basis? If Stephen Curry gets hurt ... well, just adjust the Warriors down. If you think Curry is worth 15 games a season, just drop the Warriors' estimate by that much until he comes back.

It's when you try to do things by formula that you run into problems. If you expect Elo to automatically figure out that Curry is hurt, and adjust the Warriors accordingly ... well, sure, that'll happen. Eventually. As we saw, it will take 27 games, on average, until Elo adjusts just for half of Curry's value. And, when he comes back, it'll take 27 games until you get back only half of what Elo managed to adjust by. 

In our example, we assumed that talent changed constantly and slowly over the course of the season. That makes it very easy for Elo to track. But if you lose Curry suddenly, and get him back suddenly 27 games later ... then Elo isn't so good. If losing Curry is worth -.100 in winning percentage, Elo will start at .000 Curry's first game out, and only reach -.050 by Curry's 27th game out. Then, when he's back, Elo will take another 27 games just to bounce back from -.050 to -.025.
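A sketch of that step change, using the same deterministic expected-update trick as before: a -.100 drop in winning percentage corresponds to a rating drop of about 70 points (since a 70-point underdog wins about .400 of the time).

```python
def expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

K = 20
true_rating = 1430.0                   # star goes down: true talent drops ~70 points
rating = 1500.0                        # Elo doesn't know that yet
p_true = expected(true_rating - 1500)  # about .400

for game in range(27):
    rating += K * (p_true - expected(rating - 1500))

# After 27 games, Elo has closed only about half of the 70-point gap.
closed = 1500 - rating
```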

In other words, Elo will be significantly off for at least 54 games. Because Elo does weight recent games more heavily, it'll still be better than the traditional method. But neither method really distinguishes itself. When you have a large, visible shock to team talent, I don't see why you wouldn't just adjust for it based on fundamentals, instead of waiting a whole season for your formula to figure it out.

-------

Anyway, if you disagree with me, and believe that team talent does change significantly, in a smooth and gradual way, here's how you can prove it.

Run a regression to predict a team's last 12 games of the season, from their previous seven ten-game records (adjusted for home/road and opponent talent, if you can). 

You'll get seven coefficients. If the seventh group has a significantly higher coefficient than the first group, then you have evidence it needs to be weighted higher, and by how much.

If the weight for the last group turns out to be three or four times as high as the weight for the first group, then you have evidence that Elo might be doing it right after all.
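That regression is straightforward to set up. Here's a sketch with fake data standing in for real team-seasons (each team gets a fixed talent level with no recency effect built in, so with this data all seven weights should come out similar; real records would replace the simulated ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: 300 team-seasons, each with seven ten-game blocks
# and a final 12 games, all drawn from one fixed talent level.
n = 300
talent = np.clip(rng.normal(0.5, 0.05, n), 0.05, 0.95)
blocks = rng.binomial(10, talent[:, None], (n, 7)) / 10.0
last12 = rng.binomial(12, talent) / 12.0

# Regress the last-12 record on the seven block records (plus intercept).
X = np.column_stack([np.ones(n), blocks])
coefs, *_ = np.linalg.lstsq(X, last12, rcond=None)

# coefs[1:] are the seven block weights. With real data, a seventh
# weight several times the first would be evidence in Elo's favor.
```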

I doubt that would happen. I could be wrong. 




4 Comments:

At Friday, November 17, 2017 3:29:00 PM, Blogger Zach said...

If ELO predicts a team will have a .500 record and at the end of the season they do, should it matter that the reason they do is they lost their star player for half the season? What's the talent estimation of injury luck?

 
At Friday, November 17, 2017 4:49:00 PM, Blogger Phil Birnbaum said...

I'm not sure I understand the question. Elo might predict, at the beginning of the year, that the team has a .500 talent. But then that estimate changes every game.

 
At Friday, November 17, 2017 5:49:00 PM, Blogger Unknown said...

To make the last game worth twice as much as the first instead of 4 times as much, the K factor would need to be reduced. However this will cause the initial rating to have a higher value. I believe the best way to improve Elo ratings (though this will still have problems) is to have a flexible K based on how many games have been played and the amount of time since the last game.

Honestly my biggest problems with ELO ratings are the SOS adjustments. If you beat a 1900 team with true talent of 1500 than your rating will shoot up a ton even though the other teams rating will most likely drop. ELO's SOS is entirely based on the games that happen before the date and ignores anything that happens in the future even though it gives valuable information on their true talent during the game.

 
At Sunday, November 19, 2017 5:32:00 AM, Blogger James Willoughby said...

Really good post, as usual. One comment:

"When you have a large, visible shock to team talent, I don't see why you wouldn't just adjust for it based on fundamentals, instead of waiting a whole season for your formula to figure it out."

Because you are not always right about its true impact, and learning it incrementally is the preferred strategy averaged over all cases. An injured star may be replaced by a teammate who is actually surprisingly better, for instance.
I agree with the post though. Bayesian methods dominate Elo, but they are more difficult to understand for all consumers of the ratings output. Longitudinal changes to Elo are easier to follow than modifications to probability distributions.

 
