Friday, November 17, 2017

How Elo ratings overweight recent results

"Elo" is a rating system widely used to rate players in various games, most notably chess. Recently, FiveThirtyEight started using it to maintain real-time ratings for professional sports teams. Here's the Wikipedia page for the full explanation and formulas.

In Elo, everyone gets a number that represents their skill. The exact number only matters in the difference between you and the other players you're being compared to. If you're 1900 and your opponent is 1800, it's exactly the same as if you're 99,900 and your opponent is 99,800.

In chess, they start you off with a rating of 1200, because that happens to be the number they picked. You can interpret the 1200 either as "beginner," or "guy who hasn't played competitively yet so they don't know how good he is."  In the NBA system, FiveThirtyEight decided to start teams off with 1500, which represents a .500 team. 

A player's rating changes with every game he or she plays. The winner gets points added to his or her rating, and the loser gets the same number of points subtracted. It's like the two players have a bet, and the loser pays points to the winner.

How many points? That depends on two things: the "K factor," and the odds of winning.

The "K factor" is chosen by the organization that does the ratings. I think of it as double the number of points the loser pays for an evenly matched game. If K=20 (which FiveThirtyEight chose for the NBA), and the odds are even, the loser pays 10 points to the winner.

If it's not an even match, the loser pays the number of points by which he underperformed expectations. If the Warriors had a 90% chance of beating the Knicks, and they lost, they'd lose 18 points, which is 90% of K=20. If the Warriors win, they only gain 2 points, since they only exceeded expectations by 10 percentage points.

How does Elo calculate the odds? By the difference between the two players' ratings. The Elo formula is set so that a 400-point difference represents a 10:1 favorite. An 800-point favorite is 100:1 (10 for the first 400, multiplied by 10 for the second 400). A 200-point favorite is 3.16:1 (3.16 is the square root of 10). A 100-point favorite is 1.78:1 (the fourth root of 10), and so on.
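
If you like code better than formulas, the expectation side of Elo is a one-liner. Here's a minimal Python sketch (the function name is just mine):

    def elo_expected(rating, opp_rating):
        """Expected winning percentage under Elo: a 400-point gap
        means 10:1 odds, an 800-point gap means 100:1, and so on."""
        return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

    print(elo_expected(1900, 1500))   # 400-point favorite: ~.909 (10:1)
    print(elo_expected(1700, 1500))   # 200-point favorite: ~.760 (3.16:1)
    print(elo_expected(1600, 1500))   # 100-point favorite: ~.640 (1.78:1)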

In chess, the K factor varies depending on which chess federation is doing the ratings, and the level of skill of the players. For experienced non-masters, K seems to vary between 15 and 32 (says Wikipedia). 

------

Suppose A and B have equal ratings of 1600, and A beats B. With K=20, A's rating jumps to 1610, and B's rating falls to 1590. 

A and B are now separated by 20 points in the ratings, so A is deemed to have odds of 1.12:1 of beating B. (That's because 10 to the power of 20/400 -- the 20th root of 10 -- is about 1.12.)  That corresponds to an expected winning percentage of .529.

After lunch, they play again, and this time B beats A. Because B was the underdog, he gets more than 10 points -- 10.6 points (.529 times K=20), to be exact. And A loses the identical 10.6 points.

That means after the two games, A has a rating of 1599.4, and B has a rating of 1600.6.
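
Here's that two-game sequence as a quick sketch, reusing elo_expected from above, with K=20:

    K = 20

    def elo_update(winner_rating, loser_rating, k=K):
        """The loser pays the winner k times the winner's shortfall
        from a guaranteed (1.000) result."""
        exchange = k * (1 - elo_expected(winner_rating, loser_rating))
        return winner_rating + exchange, loser_rating - exchange

    a, b = 1600.0, 1600.0
    a, b = elo_update(a, b)          # A wins: A 1610.0, B 1590.0
    b, a = elo_update(b, a)          # B wins after lunch
    print(round(a, 1), round(b, 1))  # 1599.4 1600.6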

------

That example shows one of the properties of Elo -- it weights recent performance higher than past performance. In their two games, A and B effectively tied, each going 1-1. But B winds up with a higher rating than A because his win was more recent.

Is that reasonable? In a way, it is. People's skill at chess changes over their lifetimes, and it would be weird to give the same weight to a game Garry Kasparov played when he was 8, as you would to a game Garry Kasparov played as World Champion.

But in the A and B case, it seems weird. A and B played both games the same day, and their respective skills couldn't have changed that much during the hour they took their lunch break. In this case, it would make more sense to weight the games equally.

Well, according to Wikipedia, that's what would actually happen. Instead of updating the ratings every game, the federation would wait until the end of the tournament, and then compare each player to his or her overall expectation based on ratings going in. In this case, A and B would be expected to go 1-1 in two games, which they did, so their ratings wouldn't change at all.

But, if A and B's games were days or weeks apart, as part of different tournaments, the two games would be treated separately, and B might indeed wind up 1.2 points ahead of A.

------

Is that a good thing, giving a higher weight to recency? It depends how much higher a weight. 

People's skill does indeed change daily, based on mood, health, fatigue -- and, of course, longer-term trends in skill. In the four big North American team sports, it's generally true that players tend to improve in talent until a certain age (27 in baseball), then decline. And, of course, there are non-age-related talent changes, like injuries, or cases where players just got a lot better or a lot worse partway through their careers.

That's part of the reason we tend to evaluate players based on their most recent season. If a player hit 15 home runs last year, but 20 this year, we expect the 20 to be more indicative of what we can expect next season.

Still, I think Elo gives far too much weight to recent results, when applied to professional sports teams. 

Suppose you're near the end of the season, and you're looking at a team with a 40-40 record -- the Bulls, say. From that, you'd estimate their talent as average -- they're a .500 team.

Now, they win an even-money game, and they're 41-40, which is .506. How do you evaluate them now? You take the .506, and regress to the mean a tiny bit, and maybe estimate they're a .505 talent. (I'll call that the "traditional method," where you estimate talent by taking the W-L record and regressing to the mean.)

What would Elo say? Before the 81st game, the Bulls were probably rated at 1500. After the game, they've gained 10 points for the win, bringing them to 1510.

But, 1510 represents a .514 record, not .505. So, Elo gives that one game almost three times the weight that the traditional method does.
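
A quick check of those numbers, again with elo_expected from the first sketch:

    # One win moved the traditional estimate .006 (about .005 after a
    # little regression), but moved the Elo-implied estimate .014 --
    # almost three times as much.
    traditional_before = 40 / 80               # .500
    traditional_after = 41 / 81                # .506
    elo_after = elo_expected(1510, 1500)       # ~.514 vs. an average opponent
    print(round(traditional_after - traditional_before, 3),
          round(elo_after - 0.500, 3))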

Could that be right? Well, you'd have to argue that maybe because of personnel changes, team talent changes so much from the beginning of the year to the end that the April games are worth three times as much as the average game. But ... well, that still doesn't seem right. 

-------

Technical note: I should mention that FiveThirtyEight adds a factor to their Elo calculation -- they give more or fewer points to the winner of the game based on the score differential. If a favorite wins by 1 point, they'll get a lot fewer points than if they had won by 15 points. Same for the underdog, except that the underdog always gets more points than the favorite for the same point differential -- which makes sense.

FiveThirtyEight doesn't say so explicitly, but I think they set the weighting factor so that the average margin of victory corresponds to the number of points the regular Elo would award the winner.

Here's the explanation of their system.

-------

Elo starts with a player's rating, then updates it based on results. But, when it updates it, it has no idea how much evidence was behind the player's rating in the first place. If a team is at 1500, and then it wins an even game, it goes to 1510 regardless of whether it was at 1500 because it's an expansion team's first game, or because it was 40-40, or (in the case of chess) it's 1000-1000.

The traditional method, on the other hand, does know. If a team goes from 1-1 to 2-1, that's a move of .167 points (less after regressing to estimate talent, of course). If a team goes from 40-40 to 41-40, that's a move of only .005 points. 

Which makes sense; the more evidence you already have, the less your new evidence should move your prior. But if your estimate moves by the same amount regardless of the previous evidence, you're seriously underweighting that previous evidence (which means you're overweighting the new evidence).

The chess federations implicitly understand this -- that you should give less weight to new results when you have more old results. That's why they vary the K-values based on who's playing.

FIDE, for instance, weights new players at K=40, experienced players at K=20, and masters (who presumably have the most experience) at K=10.

------

As I said last post, I did a simulation. I created a team that was exactly average in talent, and assumed that FiveThirtyEight had accurately given them an average rating at the beginning of the year. I played out 1,000 random seasons, and, on average, the team wound up right where it started, just as you would expect.

Then, I modified the simulation as if FiveThirtyEight had underrated the team by 50 points, which would peg them as a .429 team. (They use their "CARM-Elo" player projection system for those pre-season ratings. I'm not saying that system is wrong, just checking what happens when a projection happens to be off.)

It turned out that, at the end of the 82-game season, Elo had indeed figured out the team was better than their initial rating, and had restored 45 of the 50 points. They were still underrated, but only by 5 points (.493) instead of 50 (.429). 

Effectively, the current season wiped out 90% of the original rating. Since the original rating was based on the previous seasons, that means that, to get the final rating, Elo effectively weighted this year at 90%, and the previous years at 10%. 

10% is close to 12.5%. I'll use that because it makes the calculation a bit easier. At 12.5%, which is one-eighth, it means the NBA season contains three "half-lives" of about 27 games each.

That is: after 27 games, the gap of 50 points is reduced by half, to 25. After another 27 games, it's down to 12. After a third 27 games, it's down to 6, which is 12.5% of where the gap started.
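
Here's a stripped-down sketch of that kind of simulation -- simplified so that every game is an even match against an exactly average (1500) opponent, but it lands in the same place:

    import random

    def remaining_gap(start_gap=50.0, games=82, k=20, trials=1000):
        """Average end-of-season rating shortfall for a team whose true
        talent is average (1500) but which starts rated 1500 - start_gap."""
        total = 0.0
        for _ in range(trials):
            rating = 1500.0 - start_gap
            for _ in range(games):
                won = random.random() < 0.5   # true talent really is .500
                rating += k * (won - elo_expected(rating, 1500.0))
            total += 1500.0 - rating
        return total / trials

    print(round(remaining_gap(), 1))   # about 5 of the 50 points remain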

That means that to calculate a final rating, the thirds of seasons are effectively weighted in a ratio of 1:2:4. A game in April has four times the weight of a game in November. Last post, I argued why I think that's too high.

-------

Here's another way of illustrating how recency matters. 

I tweaked the simulation to do something a little different. Instead of creating 1,000 different seasons, I created only one season, but randomly reordered the games 1,000 times. The opponents and final scores were the same; only the sequence was different. 

Under the traditional method, the talent estimates would be the same, since all 1,000 teams had the same W-L record. But the Elo ratings varied, because of recency effects. They varied with an SD of about 26 points. That's about .037 in winning percentage, or 3 wins per 82 games.

If you consider the SD to be, in a sense, the "average" discrepancy, that means that, on average, Elo will misestimate a team's talent by 3 wins. That's for teams with the same actual record -- based only on the randomness of *when* they won or lost. 

And you can't say, "well, those three wins might be because talent changed over time."  Because, that's just the random part. Any actual change in talent is additional to that. 
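
Here's a sketch of the reshuffling experiment, with a made-up 41-41 season against average opponents standing in for the real schedule:

    import random
    import statistics

    def final_rating(results, k=20):
        """Run Elo over a season with a fixed set of results."""
        rating = 1500.0
        for opp_rating, won in results:
            rating += k * (won - elo_expected(rating, opp_rating))
        return rating

    # One fixed 41-41 season; only the order of the games will change.
    season = [(1500.0, 1)] * 41 + [(1500.0, 0)] * 41

    ratings = []
    for _ in range(1000):
        random.shuffle(season)
        ratings.append(final_rating(season))

    # The SD comes out in the neighborhood of 25 points -- from
    # sequencing alone, since the record never changes.
    print(round(statistics.stdev(ratings), 1))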

--------

If all NBA games were pick-em, the SD of team luck in an 82-game season would be around 4.5 wins. Because there are lots of mismatches, which are more predictable, the actual SD is lower, maybe, say, 4.1 games. 

Elo ratings are fully affected by that 4.1 games of binomial luck, but also by another 3 games worth of luck for the random order in which games are won or lost. 
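
The 4.5 is just the usual binomial formula, and, assuming the two sources of randomness are independent, they add in quadrature:

    import math

    binomial_sd = math.sqrt(82 * 0.5 * 0.5)        # ~4.5 wins, if every game were a coin flip
    combined_sd = math.sqrt(4.1 ** 2 + 3.0 ** 2)   # ~5.1 wins once ordering luck is added
    print(round(binomial_sd, 1), round(combined_sd, 1))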

Why would you want to dilute the accuracy of your talent estimate by adding 3 wins worth of randomness to your SD? Only if you're gaining 3 wins worth of accuracy some other way. Like, for instance, if you're able to capture team changes in talent from the beginning of the season to the end. If teams vary in talent over time, like chess players, maybe weighting recent games more highly could give you a better estimate of a team's new level of skill.

Do teams vary in talent, from the beginning to the end of the year, by as much as 3 games (.037)?

Actually, 3 games is a bit of a red herring. You need more than 3 games of talent variance to make up for the 3 games of sequencing luck.

Because, suppose a team goes smoothly from a 40-win talent at the beginning of the year to a 43-win talent at the end of the year. That team will have won 41.5 games, not 40, so the discrepancy between estimate and talent won't be 3 games, but just one-and-a-half games.

As expected, Elo does improve on the 1.5 game discrepancy you get from the traditional method. I ran the simulation again, and found that Elo picked up about 90% of the talent difference rather than 50%. That means that Elo would peg the (eventual) 43-game talent at 42.7.

For a team that transitions from a 40- to a 43-game talent, the traditional method was off by 1.5 games. The Elo method was off by only 0.3 games. 

It looks like Elo is only a 1.2 game improvement over the traditional method, in its ability to spot changes in talent. But it "costs" a 3-game SD for extra randomness. So it doesn't seem like it's a good deal.

To compensate for the 3-game recency SD, you'd need the average within-season talent change to be much more than 3 games. Elo's edge over the traditional method works out to about 40 percent of the talent change -- the traditional method misses half of it, while Elo misses only about a tenth. To buy back 3 games of accuracy at that rate, you'd need the talent change to be 3 divided by 0.4 -- that is, 7.5 games.

Do teams really change in talent, on average, by 7.5 games out of 82, over the course of a single season? Sure, some teams must, like the ones with injury problems to their star players. But on average? That doesn't seem plausible.

------

Besides, what's stopping you from adjusting teams on a case-by-case basis? If Stephen Curry gets hurt ... well, just adjust the Warriors down. If you think Curry is worth 15 games a season, just drop the Warriors' estimate by that much until he comes back.

It's when you try to do things by formula that you run into problems. If you expect Elo to automatically figure out that Curry is hurt, and adjust the Warriors accordingly ... well, sure, that'll happen. Eventually. As we saw, it will take 27 games, on average, until Elo adjusts for just half of Curry's value. And, when he comes back, it'll take 27 games until you get back only half of what Elo managed to adjust by.

In our example, we assumed that talent changed continuously and slowly over the course of the season. That makes it very easy for Elo to track. But if you lose Curry suddenly, and get him back suddenly 27 games later ... then Elo isn't so good. If losing Curry is worth -.100 in winning percentage, Elo's adjustment will start at .000 in Curry's first game out, and reach only -.050 by Curry's 27th game out. Then, when he's back, Elo will take another 27 games just to bounce back from -.050 to -.025.
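
Here's a deterministic sketch of that lag -- no randomness, just the expected drift of the rating, with the usual simplification that every opponent is exactly average:

    def track_injury(k=20, games_out=27, true_drop=0.100):
        """A 1500 team suddenly plays .400 ball for games_out games,
        then returns to .500; watch how slowly the rating follows."""
        rating = 1500.0
        for g in range(games_out * 2):
            true_pct = 0.5 - true_drop if g < games_out else 0.5
            rating += k * (true_pct - elo_expected(rating, 1500.0))
            if g + 1 in (games_out, games_out * 2):
                print(g + 1, round(rating, 1))

    # A -.100 team "should" rate about 70 points below 1500 (~1430);
    # after 27 games Elo has made it only about halfway there.
    track_injury()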

In other words, Elo will be significantly off for at least 54 games. Because Elo does weight recent games more heavily, it'll still be better than the traditional method. But neither method really distinguishes itself. When you have a large, visible shock to team talent, I don't see why you wouldn't just adjust for it based on fundamentals, instead of waiting a whole season for your formula to figure it out.

-------

Anyway, if you disagree with me, and believe that team talent does change significantly, in a smooth and gradual way, here's how you can prove it.

Run a regression to predict a team's last 12 games of the season, from their previous seven ten-game records (adjusted for home/road and opponent talent, if you can). 

You'll get seven coefficients. If the seventh group has a significantly higher coefficient than the first group, then you have evidence it needs to be weighted higher, and by how much.

If the weight for the last group turns out to be three or four times as high as the weight for the first group, then you have evidence that Elo might be doing it right after all.

I doubt that would happen. I could be wrong. 
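
For anyone who wants to try it, here's a skeleton of that regression in Python. The data is synthetic -- constant talent plus binomial-sized noise, standing in for real schedule-adjusted records -- so as written it illustrates the null case:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic team-seasons: seven 10-game winning percentages, then
    # the final 12 games. Swap in real (adjusted) records to run the test.
    n = 500
    talent = rng.normal(0.5, 0.06, size=(n, 1))
    chunks = talent + rng.normal(0, 0.158, size=(n, 7))   # 10-game blocks
    last12 = talent[:, 0] + rng.normal(0, 0.144, size=n)  # final 12 games

    X = np.column_stack([np.ones(n), chunks])
    coefs, *_ = np.linalg.lstsq(X, last12, rcond=None)

    # If recent games deserve extra weight, the later coefficients should
    # come out meaningfully larger than the earlier ones. Here, talent is
    # constant, so they should all be roughly equal.
    print(np.round(coefs[1:], 3))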




Wednesday, November 01, 2017

Does previous playoff experience matter in the NBA?

Conventional wisdom says that playoff experience matters. All else being equal, players who have been in the post-season before are more capable of adapting to the playoff environment -- the pressure, the intensity, the refereeing, and so on.

FiveThirtyEight now has a study they say confirms this in the NBA:

"In the NBA postseason since 1980, the team with the higher initial Elo rating has won 74 percent of playoff series. But if a team has both a higher Elo rating and much more playoff experience, that win percentage shoots up to 86 percent. Conversely, teams with the higher Elo rating but much less playoff experience have won just 52 percent of playoff series. These differences are highly statistically and practically significant."

I don't dispute their results, but I disagree that they have truly found evidence that playoff experience matters.

------

There's a certain amount of random luck in the outcomes of games. NBA results have less luck than most other sports, but there's significant randomness nonetheless.

So, if you have two teams, each of which finish with identical records and identical Elo ratings, they're not necessarily equal in talent. One team is probably better than the other, but just had worse luck in the regular season. That's the team from which you would expect better playoff performance. 

But how do you tell them apart? If you have two teams, each of which finishes 55-27, with an Elo of (say) 1600, how can you tell which one is better?

One way is to look at their previous season records. If team A was .500 last year, while team B was .650, it's more likely that B is better. Sure, maybe not: it could be that team B lost a hall-of-famer in the off-season, while team A got a superstar back from injury. But, most of the time, that didn't happen, and you're going to find that team B is still the better team.

If you're looking at last season, the team with the better record is probably the team that got farther in the playoffs. And the team that got farther in the playoffs is probably the one whose players have more playoff experience.

So, when FiveThirtyEight notices that teams with playoff experience tend to outperform Elo expectations, it's not necessarily the actual playoff experience that's the cause. It could be that the teams are actually better than their ratings -- a situation that correlates with playoff experience.

And, of course, "team being better" is a much more plausible explanation for good performance than "team has more playoff experience."

------

Here's a possible counterargument. 

The study doesn't just look at players' *last year's* playoff experience -- it looks at their *career* playoff experience. You'd think that would dilute the effect, somewhat. But, still. Teams tend to stay good or bad for a few years before you could say their talent has changed significantly. Also, players with a lot of playoff experience, even with other teams, are more likely to be good players, and good players tend to play for more talented teams (even if all that makes the team "more talented" comes from them).

------

Another counterargument might be: well, the previous season's performance is already baked into Elo. If a team did well last season, it starts out with a higher rating than a team that didn't. So, checking again what a team did last season shouldn't make any difference. It would be like checking which team played better on Mondays. That shouldn't matter, because Elo's conclusion that the teams are equal has already used the results of Monday games.

That's a strong argument, and it would hold up if Elo did, in fact, give last season the appropriate consideration. But I don't think it does. When it calculates the rating, Elo gives previous seasons a very low weighting.

I did a little simulation (which I'll describe next post), and found that, when two NBA teams start with different Elo ratings, but perform identically, half the difference is wiped out after about 27 games.

So, team A starts out with a rating of 1500 (projection of 41-41). Team B starts out with 1600 (52-30). After 27 games playing identically against identical opponents, the Elo difference drops from 100 points to 50. 

After a second 27 games, the difference gets cut in half again, and the teams are now only 25 points apart. After a third 27 games, the difference cuts in half again, to around 12 points. That takes us to 81 games, roughly an NBA season. 
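
Here's a self-contained sketch of that convergence, simplified so both teams play .500 ball against exactly average opponents. The half-life in this stripped-down version comes out in the same ballpark as the 27 games from my fuller simulation:

    def elo_expected(rating, opp_rating):       # standard Elo expectation
        return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

    def converging_gap(gap=100.0, games=81, k=20):
        hi, lo = 1500.0 + gap / 2, 1500.0 - gap / 2
        for g in range(1, games + 1):
            # Both teams actually win half their games vs. average opponents.
            hi += k * (0.5 - elo_expected(hi, 1500.0))
            lo += k * (0.5 - elo_expected(lo, 1500.0))
            if g in (27, 54, 81):
                print(g, round(hi - lo, 1))   # gap roughly halves each step

    converging_gap()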

So, at the beginning of the season, Elo thought team B was 100 points better than team A. Now, because both teams had equal seasons, Elo thinks B is only 12 points better than A.

And that's even after considering personnel changes between seasons. If the two ratings started out 100 points apart, their performance last season was actually 133 points apart, because FiveThirtyEight regresses a quarter of the way to the mean during the off-season, to account for player aging and roster changes.

133 points is about 15 games out of 82. So, last year B was 56-26, while A was 41-41. They now have identical 49-33 seasons, and Elo thinks B is only 1.4 games better than A.
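
Those points-to-games conversions are just the expectation formula again (using elo_expected from the sketch above):

    def games_vs_average(points_below, n_games=82):
        """Wins lost per season for a team rated points_below average."""
        return (0.5 - elo_expected(1500.0 - points_below, 1500.0)) * n_games

    print(round(games_vs_average(133), 1))   # ~15 games
    print(round(games_vs_average(12), 1))    # ~1.4 games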

In other words, after a combined 162 games, the system thinks A was less than two games luckier than B.

That seems like too much convergence. 

------

Under the FiveThirtyEight system, previous years' performance contributes only 12 percent of the rating; this year's performance is the remaining 88 percent.  And that's *in addition* to adjusting for personnel changes off-season. 

That's far lower than traditional sabermetric standards. For baseball player performance, Bill James (and Tango, and others) have traditionally put previous seasons at 50 percent, using a "1/2/3" weighting. By contrast, FiveThirtyEight's NBA system works out to roughly "1/5/44".

Of course, the "1/2/3" is for players; for teams, it should be lower, because of personnel changes. Especially in the NBA, where personnel changes make a much bigger difference because a superstar has such a big impact. 

But, still, 12 percent is far too little weight to give to previous NBA seasons. That's why, when you want to know whose Elo ratings are unlucky, you actually do add valuable information about talent, by checking which teams played much better the last few years than they did this year. 

And that's why playoff experience seems to matter. It correlates highly with teams that did well in the past. 

------

I could be wrong; it shouldn't be too hard to test. 

1. If this hypothesis is right, then playoff experience won't just predict playoff success; it will also predict late regular season success, because the same logic holds. That might be tricky to test because some teams might not give their stars as many minutes in those April games. But you could still check using teams that are fighting for a playoff spot.

2. Instead of breaking Elo ties by looking at previous playoff experience, look instead at the teams' start-of-season rating. I bet you find an even larger effect. And I bet that after you adjust for that, the apparent effect of playoff experience will be much smaller.

3. For teams whose Elo ratings at the end of the season are close to their ratings at the beginning of the season, I predict that the apparent effect of playoff experience will be much smaller. That's because those teams will tend to have been less lucky or unlucky, so you won't need to look as hard at previous performance (playoff experience) to counterbalance the luck.

4. Instead of using Elo as an estimate of team talent entering the playoffs, estimate talent from Vegas odds in April regular-season games (or late-March games, if you're worried about teams who bench their stars in April). I predict you'll find much less of a "playoff experience" effect.



Hat Tip and thanks: GuyM, for link and e-mail discussion
