Friday, November 17, 2017

How Elo ratings overweight recent results

"Elo" is a rating system widely used to rate players in various games, most notably chess. Recently, FiveThirtyEight started using it to maintain real-time ratings for professional sports teams. Here's the Wikipedia page for the full explanation and formulas.

In Elo, everyone gets a number that represents their skill. The number itself doesn't matter; what matters is the difference between your rating and the ratings of the players you're being compared to. If you're 1900 and your opponent is 1800, it's exactly the same as if you're 99,900 and your opponent is 99,800.

In chess, they start you off with a rating of 1200, because that happens to be the number they picked. You can interpret the 1200 either as "beginner," or "guy who hasn't played competitively yet so they don't know how good he is."  In the NBA system, FiveThirtyEight decided to start teams off with 1500, which represents a .500 team. 

A player's rating changes with every game he or she plays. The winner gets points added to his or her rating, and the loser gets the same number of points subtracted. It's like the two players have a bet, and the loser pays points to the winner.

How many points? That depends on two things: the "K factor," and the odds of winning.

The "K factor" is chosen by the organization that does the ratings. I think of it as double the number of points the loser pays for an evenly matched game. If K=20 (which FiveThirtyEight chose for the NBA), and the odds are even, the loser pays 10 points to the winner.

If it's not an even match, the loser pays the number of points by which he underperformed expectations. If the Warriors had a 90% chance of beating the Knicks, and they lost, they'd lose 18 points, which is 90% of K=20. If the Warriors win, they only gain 2 points, since they only exceeded expectations by 10 percentage points.

How does Elo calculate the odds? By the difference between the two players' ratings. The Elo formula is set so that a 400-point difference represents a 10:1 favorite. An 800 point favorite is 100:1 (10 for the first 400, multiplied by 10 for the second 400). A 200 point favorite is 3.16 to 1 (3.16 is the square root of 10). A 100 point favorite is 1.78 to 1 (the fourth root of 10), and so on.
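Here is a minimal sketch of that odds calculation, assuming the standard "400 points equals 10:1" scale described above. The function names are mine, chosen for illustration.

```python
def win_odds(rating_diff):
    """Odds (x to 1) in favor of the higher-rated side; 400 points = 10:1."""
    return 10 ** (rating_diff / 400)


def win_probability(rating_diff):
    """Convert the odds into an expected winning percentage."""
    odds = win_odds(rating_diff)
    return odds / (1 + odds)


# Print the odds and winning percentage for a few rating gaps.
for diff in (0, 20, 100, 200, 400, 800):
    print(diff, round(win_odds(diff), 2), round(win_probability(diff), 3))
```

The same two functions cover every rating gap mentioned in this post, since only the difference between the two ratings enters the formula.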

In chess, the K factor varies depending on which chess federation is doing the ratings, and the level of skill of the players. For experienced non-masters, K seems to vary between 15 and 32 (says Wikipedia). 


Suppose A and B have equal ratings of 1600, and A beats B. With K=20, A's rating jumps to 1610, and B's rating falls to 1590. 

A and B are now separated by 20 points in the ratings, so A is deemed to have odds of 1.12:1 of beating B. (That's because the "400/20th" root of 10 is 1.12.)  That corresponds to an expected winning percentage of .529.

After lunch, they play again, and this time B beats A. Because B was the underdog, he gets more than 10 points -- 10.6 points (.529 times K=20), to be exact. And A loses the identical 10.6 points.

That means after the two games, A has a rating of 1599.4, and B has a rating of 1600.6.
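The two-game example can be replayed in a few lines. This is only a sketch of the plain Elo update with K=20, with no margin-of-victory adjustment; the helper names `expected` and `update` are hypothetical.

```python
K = 20


def expected(rating, opponent):
    """Expected winning percentage of `rating` against `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


def update(winner, loser, k=K):
    """The loser pays the winner K times the winner's shortfall from certainty."""
    transfer = k * (1 - expected(winner, loser))
    return winner + transfer, loser - transfer


a, b = 1600.0, 1600.0
a, b = update(a, b)  # morning game: A beats B
b, a = update(b, a)  # after lunch: B beats A
print(round(a, 1), round(b, 1))
```

Swapping the order of the two `update` calls leaves A ahead instead of B, which is the recency effect in miniature.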


That example shows one of the properties of Elo -- it weights recent performance higher than past performance. In their two games, A and B effectively tied, each going 1-1. But B winds up with a higher rating than A because his win was more recent.

Is that reasonable? In a way, it is. People's skill at chess changes over their lifetimes, and it would be weird to give the same weight to a game Garry Kasparov played when he was 8, as you would to a game Garry Kasparov played as World Champion.

But in the A and B case, it seems weird. A and B played both games the same day, and their respective skills couldn't have changed that much during the hour they took their lunch break. In this case, it would make more sense to weight the games equally.

Well, according to Wikipedia, that's what would actually happen. Instead of updating the ratings every game, the Federation would wait until the end of the tournament, and then compare each player to his or her overall expectation based on ratings going in. In this case, A and B would be expected to go 1-1 in two games, which they did, so their ratings wouldn't change at all.

But, if A and B's games were days or weeks apart, as part of different tournaments, the two games would be treated separately, and B might indeed wind up 1.2 points ahead of A.


Is that a good thing, giving a higher weight to recency? It depends how much higher a weight. 

People's skill does indeed change daily, based on mood, health, fatigue -- and, of course, longer-term trends in skill. In the four big North American team sports, it's generally true that players tend to improve in talent until a certain age (27 in baseball), then decline. And, of course, there are non-age-related talent changes, like injuries, or cases where players just got a lot better or a lot worse partway through their careers.

That's part of the reason we tend to evaluate players based on their most recent season. If a player hit 15 home runs last year, but 20 this year, we expect the 20 to be more indicative of what we can expect next season.

Still, I think Elo gives far too much weight to recent results, when applied to professional sports teams. 

Suppose you're near the end of the season, and you're looking at a team with a 40-40 record -- the Bulls, say. From that, you'd estimate their talent as average -- they're a .500 team.

Now, they win an even-money game, and they're 41-40, which is .506. How do you evaluate them now? You take the .506, and regress to the mean a tiny bit, and maybe estimate they're a .505 talent. (I'll call that the "traditional method," where you estimate talent by taking the W-L record and regressing to the mean.)

What would Elo say? Before the 81st game, the Bulls were probably rated at 1500. After the game, they've gained 10 points for the win, bringing them to 1510.

But, 1510 represents a .514 record, not .505. So, Elo gives that one game almost three times the weight that the traditional method does.
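To check the "almost three times" figure, here is a quick sketch that converts the 1510 rating to a winning percentage against an average (1500) team and compares it with the .505 traditional estimate. The .505 is taken from the text above, not computed.

```python
def elo_to_win_pct(rating, baseline=1500):
    """Expected winning percentage against a team rated at `baseline`."""
    return 1 / (1 + 10 ** ((baseline - rating) / 400))


elo_estimate = elo_to_win_pct(1510)
traditional_estimate = 0.505  # regressed W-L estimate quoted in the text

# How far each method moved the team away from .500 after one win.
ratio = (elo_estimate - 0.5) / (traditional_estimate - 0.5)
print(round(elo_estimate, 3), round(ratio, 2))
```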

Could that be right? Well, you'd have to argue that maybe because of personnel changes, team talent changes so much from the beginning of the year to the end that the April games are worth three times as much as the average game. But ... well, that still doesn't seem right. 


Technical note: I should mention that FiveThirtyEight adds a factor to their Elo calculation -- they give more or fewer points to the winner of the game based on the score differential. If a favorite wins by 1 point, they'll get a lot fewer points than if they won by 15 points. Same for the underdog, except that the underdog always gets more points than the favorite for the same point differential -- which makes sense.

FiveThirtyEight doesn't say so explicitly, but I think they set the weighting factor so that the average margin of victory corresponds to the number of points the regular Elo would award the winner.

Here's the explanation of their system.


Elo starts with a player's rating, then updates it based on results. But, when it updates it, it has no idea how much evidence was behind the player's rating in the first place. If a team is at 1500, and then it wins an even game, it goes to 1510 regardless of whether it was at 1500 because it's an expansion team's first game, or because it was 40-40, or (in the case of chess) it's 1000-1000.

The traditional method, on the other hand, does know. If a team goes from 1-1 to 2-1, that's a move of .167 points (less after regressing to estimate talent, of course). If a team goes from 40-40 to 41-40, that's a move of only .006 points. 

Which makes sense; the more evidence you already have, the less your new evidence should move your prior. But if your prior moves the same way regardless of the previous evidence, you're seriously underweighting that previous evidence (which means you're overweighting the new evidence).
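A small sketch makes the contrast concrete: the size of the traditional method's move (before any regression to the mean) depends on how many games are already in the record, while plain Elo with K=20 moves 10 points for an even-money win no matter what. The function name is illustrative.

```python
def win_pct_move(wins, losses):
    """Change in raw winning percentage caused by one additional win."""
    before = wins / (wins + losses)
    after = (wins + 1) / (wins + losses + 1)
    return after - before


# The same single win, added to records of very different lengths.
for wins, losses in ((1, 1), (40, 40), (1000, 1000)):
    print(wins, losses, round(win_pct_move(wins, losses), 4))
```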

The chess Federations implicitly understand this, that you should give less weight to new results when you have more older results. That's why they vary the K-values based on who's playing. 

FIDE, for instance, weights new players at K=40, experienced players at K=20, and masters (who presumably have the most experience) at K=10.


As I said last post, I did a simulation. I created a team that was exactly average in talent, and assumed that FiveThirtyEight had accurately given them an average rating at the beginning of the year. I played out 1000 random seasons, and, on average, the team wound up right where it started, just as you would expect.

Then, I modified the simulation as if FiveThirtyEight had underrated the team by 50 points, which would peg them as a .429 team. (They use their "CARM-Elo" player projection system for those pre-season ratings. I'm not saying that system is wrong, just checking what happens when a projection happens to be off.)

It turned out that, at the end of the 82-game season, Elo had indeed figured out the team was better than their initial rating, and had restored 45 of the 50 points. They were still underrated, but only by 5 points (.493) instead of 50 (.429). 

Effectively, the current season wiped out 90% of the original rating. Since the original rating was based on the previous seasons, that means that, to get the final rating, Elo effectively weighted this year at 90%, and the previous years at 10%. 
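This is not the simulation itself, but a rough deterministic sketch of the same idea: it follows the *average* path of the rating by applying the expected update each game. It assumes plain Elo with K=20, opponents fixed at 1500, and no margin-of-victory or home-court adjustment, so it only approximates what the full system does.

```python
K = 20


def expected(rating, opponent=1500.0):
    """Winning percentage the rating system predicts for the team."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


rating = 1450.0       # team starts out rated 50 points too low
true_win_prob = 0.5   # in reality it is an average team facing 1500 opponents

for game in range(82):
    # Average update: K times (actual win rate minus predicted win rate).
    rating += K * (true_win_prob - expected(rating))

remaining_gap = 1500.0 - rating
print(round(remaining_gap, 1))
```

Under these simplified assumptions the remaining gap comes out at a handful of points, in the same neighborhood as the simulated result.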

10% is close to 12.5%. I'll use that because it makes the calculation a bit easier. At 12.5%, which is one-eighth, it means the NBA season contains three "half lives" of about 27 games each. 

That is: after 27 games, the gap of 50 points is reduced by half, to 25. After another 27 games, it's down to 12. After a third 27 games, it's down to 6, which is 12.5% of where the gap started.

That means that to calculate a final rating, the thirds of seasons are effectively weighted in a ratio of 1:2:4. A game in April has four times the weight of a game in November. Last post, I argued why I think that's too high.
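The 1:2:4 ratio follows directly from the half-life. Here is a sketch assuming an exact 27-game half-life and an 81-game season (three clean thirds); both are the rounded figures used above, not fitted values.

```python
HALF_LIFE = 27
decay = 0.5 ** (1 / HALF_LIFE)  # per-game shrink factor for an old result's weight

# Weight of each game in the final rating; game 0 is the oldest, game 80 the newest.
weights = [decay ** (80 - g) for g in range(81)]

# Total weight carried by each third of the season, relative to the first third.
thirds = [sum(weights[i:i + 27]) for i in (0, 27, 54)]
ratios = [t / thirds[0] for t in thirds]
print([round(r, 2) for r in ratios])
```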


Here's another way of illustrating how recency matters. 

I tweaked the simulation to do something a little different. Instead of creating 1,000 different seasons, I created only one season, but randomly reordered the games 1,000 times. The opponents and final scores were the same; only the sequence was different. 

Under the traditional method, the talent estimates would be the same, since all 1,000 teams had the same W-L record. But the Elo ratings varied, because of recency effects. They varied with an SD of about 26 points. That's about .037 in winning percentage, or 3 wins per 82 games.

If you consider the SD to be, in a sense, the "average" discrepancy, that means that, on average, Elo will misestimate a team's talent by 3 wins. That's for teams with the same actual record -- based only on the randomness of *when* they won or lost. 

And you can't say, "well, those three wins might be because talent changed over time."  Because, that's just the random part. Any actual change in talent is additional to that. 
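Here is a stripped-down sketch of the reordering experiment. It is not the simulation described above: it assumes a 41-41 team, every opponent fixed at 1500, plain Elo with K=20, and no margin-of-victory adjustment, so the spread it prints need not match the 26 points quoted.

```python
import random
import statistics

K = 20


def final_rating(results, start=1500.0, opponent=1500.0):
    """Run one season of plain Elo updates; `results` holds 1 for a win, 0 for a loss."""
    rating = start
    for won in results:
        predicted = 1 / (1 + 10 ** ((opponent - rating) / 400))
        rating += K * (won - predicted)
    return rating


rng = random.Random(0)            # fixed seed so the run is repeatable
season = [1] * 41 + [0] * 41      # same 41-41 record every time

finals = []
for _ in range(1000):
    rng.shuffle(season)           # only the order of wins and losses changes
    finals.append(final_rating(season))

sd = statistics.pstdev(finals)
print(round(sd, 1))
```

Every one of the 1,000 orderings has the identical W-L record, so any spread in the final ratings is pure sequencing.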


If all NBA games were pick-em, the SD of team luck in an 82-game season would be around 4.5 wins. Because there are lots of mismatches, which are more predictable, the actual SD is lower, maybe, say, 4.1 games. 

Elo ratings are fully affected by that 4.1 games of binomial luck, but also by another 3 games worth of luck for the random order in which games are won or lost. 

Why would you want to dilute the accuracy of your talent estimate by adding 3 wins worth of randomness to your SD? Only if you're gaining 3 wins worth of accuracy some other way. Like, for instance, if you're able to capture team changes in talent from the beginning of the season to the end. If teams vary in talent over time, like chess players, maybe weighting recent games more highly could give you a better estimate of a team's new level of skill.

Do teams vary in talent, from the beginning to the end of the year, by as much as 3 games (.037)?

Actually, 3 games is a bit of a red herring. You need more than 3 games of talent variance to make up for the 3 games of sequencing luck.

Because, suppose a team goes smoothly from a 40-win talent at the beginning of the year to a 43-win talent at the end of the year. That team will have won 41.5 games, not 40, so the discrepancy between estimate and talent won't be 3 games, but just one-and-a-half games.

As expected, Elo does improve on the 1.5 game discrepancy you get from the traditional method. I ran the simulation again, and found that Elo picked up about 90% of the talent difference rather than 50%. That means that Elo would peg the (eventual) 43-game talent at 42.7.

For a team that transitions from a 40- to a 43-game talent, the traditional method was off by 1.5 games. The Elo method was off by only 0.3 games. 

It looks like Elo is only a 1.2 game improvement over the traditional method, in its ability to spot changes in talent. But it "costs" a 3-game SD for extra randomness. So it doesn't seem like it's a good deal.

To compensate for the 3-game recency SD, you'd need the average within-season talent change to be much more than 3 games. You'd need it to be 7.5 games.
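The 7.5 figure is just the break-even point of the arithmetic above. A sketch, using the 90% and 50% capture rates and the 3-game sequencing SD from the text (variable names are mine):

```python
sequencing_sd = 3.0          # extra randomness Elo adds, in wins
elo_capture = 0.9            # share of a within-season talent change Elo picks up
traditional_capture = 0.5    # share picked up by the season-long W-L record


def improvement(talent_change):
    """Wins of accuracy Elo gains over the traditional method."""
    return talent_change * (elo_capture - traditional_capture)


# Talent change at which the gain equals the 3-game cost.
break_even = sequencing_sd / (elo_capture - traditional_capture)
print(round(improvement(3.0), 2), round(break_even, 2))
```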

Do teams really change in talent, on average, by 7.5 games out of 82, over the course of a single season? Sure, some teams must, like when their star players have injury problems. But on average? That doesn't seem plausible.


Besides, what's stopping you from adjusting teams on a case-by-case basis? If Stephen Curry gets hurt ... well, just adjust the Warriors down. If you think Curry is worth 15 games a season, just drop the Warriors' estimate by that much until he comes back.

It's when you try to do things by formula that you run into problems. If you expect Elo to automatically figure out that Curry is hurt, and adjust the Warriors accordingly ... well, sure, that'll happen. Eventually. As we saw, it will take 27 games, on average, until Elo adjusts just for half of Curry's value. And, when he comes back, it'll take 27 games until you get back only half of what Elo managed to adjust by. 

In our example, we assumed that talent changed constantly and slowly over the course of the season. That makes it very easy for Elo to track. But if you lose Curry suddenly, and get him back suddenly 27 games later ... then Elo isn't so good. If losing Curry is worth -.100 in winning percentage, Elo will start at .000 Curry's first game out, and only reach -.050 by Curry's 27th game out. Then, when he's back, Elo will take another 27 games just to bounce back from -.050 to -.025.
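The Curry example can be traced with a simple exponential-tracking sketch. It assumes Elo closes half of any gap every 27 games, which is the approximation used above rather than the exact rating formula; `track` is a hypothetical helper.

```python
HALF_LIFE = 27
decay = 0.5 ** (1 / HALF_LIFE)  # fraction of the gap that survives each game


def track(estimate, target, games):
    """Move the estimate toward the true level, closing half the gap per half-life."""
    for _ in range(games):
        estimate = target + (estimate - target) * decay
    return estimate


after_injury = track(0.000, -0.100, 27)        # 27 games with Curry out
after_return = track(after_injury, 0.000, 27)  # 27 games with Curry back
print(round(after_injury, 3), round(after_return, 3))
```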

In other words, Elo will be significantly off for at least 54 games. Because Elo does weight recent games more heavily, it'll still be better than the traditional method. But neither method really distinguishes itself. When you have a large, visible shock to team talent, I don't see why you wouldn't just adjust for it based on fundamentals, instead of waiting a whole season for your formula to figure it out.


Anyway, if you disagree with me, and believe that team talent does change significantly, in a smooth and gradual way, here's how you can prove it.

Run a regression to predict a team's last 12 games of the season, from their previous seven ten-game records (adjusted for home/road and opponent talent, if you can). 

You'll get seven coefficients. If the seventh group has a significantly higher coefficient than the first group, then you have evidence it needs to be weighted higher, and by how much.

If the weight for the last group turns out to be three or four times as high as the weight for the first group, then you have evidence that Elo might be doing it right after all.
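Here is a sketch of what that regression might look like, run on made-up data because I don't have the real game logs in this form. The synthetic teams have *constant* talent (an assumed spread of .150 in winning percentage), so the seven coefficients should come out roughly equal; with real data, a rising pattern would be the evidence for recency weighting. The least-squares solver is a bare-bones stand-in for whatever regression package you prefer, and no home/road or opponent adjustment is made.

```python
import random


def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(xs, ys):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear(xtx, xty)


def simulate_team(rng):
    """One synthetic team-season: seven 10-game records, then the last 12 games."""
    talent = min(max(rng.gauss(0.5, 0.15), 0.05), 0.95)  # assumed talent spread

    def win_pct(games):
        return sum(rng.random() < talent for _ in range(games)) / games

    blocks = [win_pct(10) for _ in range(7)]
    return blocks, win_pct(12)


rng = random.Random(1)
data = [simulate_team(rng) for _ in range(3000)]
coefs = ols([x for x, _ in data], [y for _, y in data])
intercept, slopes = coefs[0], coefs[1:]
print([round(s, 3) for s in slopes])
```

With real team-seasons in place of `simulate_team`, the ratio of the last slope to the first is the number to compare against Elo's implied three-or-four-to-one weighting.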

I doubt that would happen. I could be wrong. 


Wednesday, November 01, 2017

Does previous playoff experience matter in the NBA?

Conventional wisdom says that playoff experience matters. All else being equal, players who have been in the post-season before are more capable of adapting to the playoff environment -- the pressure, the intensity, the refereeing, and so on.

FiveThirtyEight now has a study they say confirms this in the NBA:

"In the NBA postseason since 1980, the team with the higher initial Elo rating has won 74 percent of playoff series. But if a team has both a higher Elo rating and much more playoff experience, that win percentage shoots up to 86 percent. Conversely, teams with the higher Elo rating but much less playoff experience have won just 52 percent of playoff series. These differences are highly statistically and practically significant."

I don't dispute their results, but I disagree that they have truly found evidence that playoff experience matters.


There's a certain amount of random luck in the outcomes of games. NBA results have less luck than most other sports, but there's significant randomness nonetheless.

So, if you have two teams, each of which finish with identical records and identical Elo ratings, they're not necessarily equal in talent. One team is probably better than the other, but just had worse luck in the regular season. That's the team from which you would expect better playoff performance. 

But how do you tell them apart? If you have two teams, each of which finishes 55-27, with an Elo of (say) 1600, how can you tell which one is better?

One way is to look at their previous season records. If team A was .500 last year, while team B was .650, it's more likely that B is better. Sure, maybe not: it could be that team B lost a hall-of-famer in the off-season, while team A got a superstar back from injury. But, most of the time, that didn't happen, and you're going to find that team B is still the better team.

If you're looking at last season, the team with the better record is probably the team that got farther in the playoffs. And the team that got farther in the playoffs is probably the one whose players have more playoff experience.

So, when FiveThirtyEight notices that teams with playoff experience tend to outperform Elo expectations, it's not necessarily the actual playoff experience that's the cause. It could be that those teams are simply better than their ratings -- a situation that correlates with playoff experience.

And, of course, "team being better" is a much more plausible explanation for good performance than "team has more playoff experience."


Here's a possible counterargument. 

The study doesn't just look at players' *last year's* playoff experience -- it looks at their *career* playoff experience. You'd think that would dilute the effect, somewhat. But, still. Teams tend to stay good or bad for a few years before you could say their talent has changed significantly. Also, players with a lot of playoff experience, even with other teams, are more likely to be good players, and good players tend to play for more talented teams (even if all that makes the team "more talented" comes from them).


Another counterargument might be: well, the previous season's performance is already baked into Elo. If a team did well last season, it starts out with a higher rating than a team that didn't. So, checking again what a team did last season shouldn't make any difference. It would be like checking which team played better on Mondays. That shouldn't matter, because the Elo's conclusion that the teams are equal has already used the results of Monday games. 

That's a strong argument, and it would hold up if Elo did, in fact, give last season the appropriate consideration. But I don't think it does. When it calculates the rating, Elo gives previous seasons a very low weighting.

I did a little simulation (which I'll describe next post), and found that, when two NBA teams start with different Elo ratings, but perform identically, half the difference is wiped out after about 27 games.

So, team A starts out with a rating of 1500 (projection of 41-41). Team B starts out with 1600 (52-30). After 27 games playing identically against identical opponents, the Elo difference drops from 100 points to 50. 

After a second 27 games, the difference gets cut in half again, and the teams are now only 25 points apart. After a third 27 games, the difference cuts in half again, to around 12 points. That takes us to 81 games, roughly an NBA season. 

So, at the beginning of the season, Elo thought team B was 100 points better than team A. Now, because both teams had equal seasons, Elo thinks B is only 12 points better than A.

And that's even after considering personnel changes between seasons. If the two ratings started out 100 points apart, their performance last season was actually 133 points apart, because FiveThirtyEight regresses a quarter of the way to the mean during the off-season, to account for player aging and roster changes. 

133 points is about 15 games out of 82. So, last year B was 56-26, while A was 41-41. They now have identical 49-33 seasons, and Elo thinks B is only 1.4 games better than A.
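The conversions in this example can be reproduced with a short sketch. It assumes the standard Elo curve against an average (1500) opponent, an exact halving every 27 games, and the one-quarter off-season regression mentioned above; the function name is illustrative.

```python
def elo_gap_to_wins(points, games=82):
    """Wins above .500 implied by a rating edge of `points` over an average team."""
    pct = 1 / (1 + 10 ** (-points / 400))
    return (pct - 0.5) * games


preseason_gap = 100.0
last_season_gap = preseason_gap / 0.75   # undo the quarter regression to the mean

gap = preseason_gap
for _ in range(3):                       # three 27-game half-lives
    gap /= 2

print(round(last_season_gap), round(elo_gap_to_wins(last_season_gap), 1))
print(round(gap, 1), round(elo_gap_to_wins(gap), 1))
```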

In other words, after a combined 164 games, the system thinks A was less than two games luckier than B.

That seems like too much convergence. 


Under the FiveThirtyEight system, previous years' performance contributes only 12 percent of the rating; this year's performance is the remaining 88 percent.  And that's *in addition* to adjusting for personnel changes off-season. 

That's far lower than traditional sabermetric standards. For baseball player performance, Bill James (and Tango, and others) have traditionally put previous seasons at 50 percent. They use a "1/2/3" weighting. By contrast, the NBA is using "1/5/44".

Of course, the "1/2/3" is for players; for teams, it should be lower, because of personnel changes. Especially in the NBA, where personnel changes make a much bigger difference because a superstar has such a big impact. 

But, still, 12 percent is far too little weight to give to previous NBA seasons. That's why, when you want to know whose Elo ratings are unlucky, you actually do add valuable information about talent, by checking which teams played much better the last few years than they did this year. 

And that's why playoff experience seems to matter. It correlates highly with teams that did well in the past. 


I could be wrong; it shouldn't be too hard to test. 

1. If this hypothesis is right, then playoff experience won't just predict playoff success; it will also predict late regular season success, because the same logic holds. That might be tricky to test because some teams might not give their stars as many minutes in those April games. But you could still check using teams that are fighting for a playoff spot.

2. Instead of breaking Elo ties by looking at previous playoff experience, look instead at the teams' start-of-season rating. I bet you find an even larger effect. And I bet that after you adjust for that, the apparent effect of playoff experience will be much smaller.

3. For teams whose Elo ratings at the end of the season are close to their ratings at the beginning of the season, I predict that the apparent effect of playoff experience will be much smaller. That's because those teams will tend to have been less lucky or unlucky, so you won't need to look as hard at previous performance (playoff experience) to counterbalance the luck.

4. Instead of using Elo as an estimate of team talent entering the playoffs, estimate talent from Vegas odds in April regular-season games (or late-March games, if you're worried about teams who bench their stars in April). I predict you'll find much less of a "playoff experience" effect.

Hat Tip and thanks: GuyM, for link and e-mail discussion


Friday, August 04, 2017

Deconstructing an NBA time-zone regression

Warning: for regression geeks only.


Recently, I came across an NBA study that found an implausibly huge effect of teams playing in other time zones. The study uses a fairly simple regression, so I started thinking about what could be happening. 

My point here isn't to call attention to the study, just to figure out the puzzle of how such a simple regression could come up with such a weird result. 


The authors looked at every NBA regular-season game from 1991-92 to 2001-02. They tried to predict which team won, using these variables:

-- indicator for home team / season
-- indicator for road team / season
-- time zones east for road team
-- time zones west for road team

The "time zones" variable was set to zero if the game was played in the road team's normal time zone, or if it was played in the opposite direction. So, if an east-coast team played on the west coast, the west variable would be 3, and the east variable would be 0.

The team indicators are meant to represent team quality. 


When the authors ran the regression, they found the "number of time zones" variable large and statistically significant. For each time zone moving east, teams played .084 better than expected (after controlling for teams). A team moving west played .077 worse than expected. 

That means a .500 road team on the West Coast would actually play .756 ball on the East Coast. And that's regardless of how long the visiting team has been in the home team's time zone. It could be a week or more into a road trip, and the regression says it's still .756.

The authors attribute the effect to "large, biological effects of playing in different time zones discovered in medicine and physiology research." 


So, what's going on? I'm going to try to get to the answer, but I'll start with a couple of dead ends that nonetheless helped me figure out what the regression is actually doing. I should say in advance that I can't prove any of this, because I don't have their data and I didn't repeat their regression. This is just from my armchair.

Let's start with this. Suppose it were true, that for physiological reasons, teams always play worse going west, and teams always play better going east. If that were the case, how could you ever know? No matter what you see in the data, it would look EXACTLY like the West teams were just better quality than the East teams. (Which they have been, lately.)  

To see that argument more easily: suppose the teams on the West Coast are all NBA teams. The MST teams are minor-league AAA. The CST teams are AA. And the East Coast teams are minor league A ball. But all the leagues play against each other.

In that case, you'd see exactly the pattern the authors got: teams are .500 against each other in the same time zone, but worse when they travel west to play against better leagues, and better when they travel east to play against worse leagues.

No matter what results you get, there's no way to tell whether it's time zone difference, or team quality.

So is that the issue, that the regression is just measuring a quality difference between teams in different time zones? No, I don't think so. I believe the "time zone" coefficient of the regression is measuring something completely irrelevant (and, in fact, random). I'll get to that in a bit. 


Let's start by considering a slightly simpler version of this regression. Suppose we include all the team indicator variables, but, for now, we don't include the time-zone number. What happens?

Everything works, I think. We get decent estimates of team quality, both home and road, for every team/year in the study. So far, so good. 

Now, let's add a bit more complexity. Let's create a regression with two time zones, "West" and "East," and add a variable for the effect of that time zone change.

What happens now?

The regression will fail. There's an infinite number of possible solutions. (In technical terms, the regression matrix is "singular."  We have "collinearity" among the variables.)

How do we know? Because there's more than one set of coefficients that fits the data perfectly. 

(Technical note: a regression will always fail if you have an indicator variable for every team. To get around this, you'll usually omit one team (and the others will come out relative to the one you omitted). The collinearity I'm talking about is even *after* doing that.)

Suppose the regression spit out that the time-zone effect is actually  .080, and it also spit out quality estimates for all the teams.

From that solution, we can find another solution that works just as well. Change the time-zone effect to zero. Then, add .080 to the quality estimate of every West team. 

Every team/team estimate will wind up working out exactly the same. Suppose, in the first result, the Raptors were .400 on the road, the Nuggets were .500 at home, and the time-zone effect is .080. In that case, the regression will estimate the Raptors at .320 against the Nuggets. (That's .400 - (.500 - .500) - .080.)

In the second result, the regression leaves the Raptors at .400, but moves the Nuggets to .580, and the time-zone effect to zero. The Raptors are still estimated at .320 against the Nuggets. (This time, it's .400 - (.580 - .500) - .000.)

You can create as many other solutions as you like that fit the data identically: just subtract any X from the time-zone estimate, and add the same X to every Western team.

The regression is able to figure out that the data doesn't give a unique solution, so it craps out, with a message that the regression matrix is singular.
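The Raptors/Nuggets arithmetic can be checked with a tiny sketch. The `predict` function is my own shorthand for the way the estimates are combined in the example, not the authors' actual model.

```python
def predict(road_quality, home_quality, tz_effect):
    """Road team's expected winning percentage, as combined in the example above."""
    return road_quality - (home_quality - 0.500) - tz_effect


# First solution: a real time-zone effect, Nuggets rated .500 at home.
first = predict(0.400, 0.500, 0.080)

# Second solution: no time-zone effect, but every West team rated .080 better.
second = predict(0.400, 0.580, 0.000)

print(round(first, 3), round(second, 3))
```

Both parameter sets give the same prediction, so the data alone can't tell them apart.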


All that was for a regression with only two time zones. If we now expand to include all four zones, that gives six different effects each direction (E moving to C, C to M, M to P, E to M, C to P, and E to P). What if we include six time-zone variables, one for each effect?

Again, we get an infinity of solutions. We can produce new solutions almost the same way as before. Just take any solution, subtract X from each E team quality, and add X to the E-C, E-M and E-P coefficients. You wind up with the same estimates.


But the authors' regression actually did have one unique best fit solution. That's because they did one more thing that we haven't done.

We can get to their regression in two steps.

First, we collapse the six variables into three -- one for "one time zone" (regardless of which zone it is), one for "two time zones," and one for "three time zones". 

Second, we collapse those three variables into one, "number of time zones," which implicitly forces the two-zone effect and three-zone effect to be double and triple, respectively, the value of the one-zone effect. I'll call that the "x/2x/3x rule" and we'll assume that it actually does hold.

So, with the new variable, we run the regression again. What happens?

In the ideal case, the regression fails again. 

By "ideal case," I mean one where all the error terms are zero, where every pair of teams plays exactly as expected. That is, if the estimates predict the Raptors will play .350 against the Nuggets, they actually *do* play .350 against the Nuggets. It will never happen that every pair will go perfectly in real life, but maybe assume that the dataset is trillions of games and the errors even out.

In that special "no errors" case, you still have an infinity of solutions. To get a second solution from a first, you can, for instance, double the time zone effects from x/2x/3x to 2x/4x/6x. Then, subtract x from each CST team, subtract 2x from each MST team, and subtract 3x from each PST team. You'll wind up with exactly the same estimates as before.
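Here is a sketch verifying that the second solution reproduces the first for every pairing of zones. It makes simplifying assumptions of my own: one quality number per zone (used for both home and road), a single signed "zones west" count, and made-up quality values.

```python
ZONES = {"E": 0, "C": 1, "M": 2, "P": 3}  # time zones, numbered east to west


def predict(road_q, home_q, road_zone, home_zone, x):
    """Road team's expected winning percentage with a per-zone effect of x."""
    zones_west = ZONES[home_zone] - ZONES[road_zone]  # negative when traveling east
    return road_q - (home_q - 0.500) - x * zones_west


x = 0.080
quality = {"E": 0.45, "C": 0.50, "M": 0.52, "P": 0.55}  # hypothetical team qualities

# Second solution: double the effect, and subtract x per zone from each team.
shifted = {zone: q - x * ZONES[zone] for zone, q in quality.items()}

max_diff = 0.0
for road in ZONES:
    for home in ZONES:
        original = predict(quality[road], quality[home], road, home, x)
        alternate = predict(shifted[road], shifted[home], road, home, 2 * x)
        max_diff = max(max_diff, abs(original - alternate))

print(max_diff)
```

The largest difference between the two sets of predictions is zero up to floating-point rounding, which is what "an infinity of solutions" means in practice.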


For this particular regression to not crap out, there have to be errors. Which is not a problem for any real dataset. The Raptors certainly won't go the exact predicted .350 against the Nuggets, either because of luck, or because it's not mathematically possible (you'd need to go 7-for-20, and the Raptors aren't playing 20 games a season in Denver).

The errors make the regression work.

Why? Before, x/2x/3x fit all the observations perfectly. So you could create duplicate solutions by adding and subtracting X and 2X from the teams, and adding X and 2X to the one-zone effects and two-zone effects. Now, because of errors, not all the observed two-zone effects are exactly double the one-zone effects. So not everything cancels out, and you get different residuals. 

That means that this time there's a unique solution, and the regression spits it out.


In this new, valid, regression, what's the expected value of the estimate for the time-zone effect?

I think it must be zero.

The estimate of the coefficient is a function of the observed error terms in the data. But the errors are, by definition, just as likely to be negative as positive. I believe (but won't prove) that if you reverse the signs of all the error terms, you also reverse the sign of the time zone coefficient estimate.

So, the coefficient is as likely to be negative as positive, which means by symmetry, its expected value must be zero.

In other words: the coefficient in the study, the one that looks like it's actually showing the physiological effects of changing time zone ... is actually completely random, with expected value zero.

It literally has nothing at all to do with anything basketball-related!


So, that's one factor that's giving the weird result, that the regression is fitting the data to randomness. Another factor, and (I think) the bigger one, is that the model is wrong. 

There's an adage, "All models are wrong; some models are useful." My argument is that this model is much too wrong to be useful. 

Specifically, the "too wrong" part is the requirement that the time-zone effect must be proportional to the number of zones -- the "x/2x/3x" assumption.

It seems like a reasonable assumption, that the effect should be proportional to the time lag. But, if it's not, that can distort the results quite a bit. Here's a simplified example showing how that distortion can happen.

Suppose you were to run the regression without the time-zone coefficient, and you get talent estimates for the teams, and you look at the errors in predicted vs. actual. For East teams, you find the errors are

+.040 against Central
+.000 against Mountain
-.040 against Pacific

That means that East teams played .040 better than expected against Central teams (after adjusting for team quality). They played exactly as expected against Mountain Time teams, and .040 worse than expected against West Coast teams.

The average of those numbers is zero. Intuitively, you'd look at those numbers and think: "Hey, there's no appreciable time-zone effect. Sure, the East teams lost a little more than normal against the Pacific teams, but they won a little more than normal against the Central teams, so it's mostly a wash."

Also, you'd notice that it really doesn't look like the observed errors follow x/2x/3x. The closest fit seems to be when you make x equal to zero, to get 0/0/0.

So, does the regression see that and spit out 0/0/0, accepting the errors it found? No. It actually finds a way to make everything fit perfectly!

To do that, it increases its estimates of every Eastern team by .080. Now, every East team appears to underperform by .080 against each of the three other time zones. Which means the observed errors are now 

-.040 against Central
-.080 against Mountain
-.120 against Pacific

And that DOES follow the x/2x/3x model -- which means you can now fit the data perfectly. Using 0/0/0, the .500 Raptors were expected to be .500 against an average Central team (.500 minus 0), but they actually went .540. Using -.040/-.080/-.120, the .580 Raptors are expected to be .540 against an average Central team (.580 minus .040), and that's exactly what they did.
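The arithmetic of that shift can be sketched in a few lines of Python. The variable names are invented for illustration; the numbers are the ones from the example above.

```python
# Zones between Eastern and each opponent's time zone.
zones_away = {"Central": 1, "Mountain": 2, "Pacific": 3}

# Errors (actual minus expected) for East teams with no time-zone term.
errors = {"Central": 0.040, "Mountain": 0.000, "Pacific": -0.040}

# Raising every East team's talent estimate by .080 raises what they are
# expected to do, so each error drops by .080.
shift = 0.080
shifted = {z: round(e - shift, 3) for z, e in errors.items()}

# If the shifted errors follow x/2x/3x, dividing by zones gives one constant.
per_zone = {z: round(shifted[z] / zones_away[z], 3) for z in shifted}

print(shifted)   # {'Central': -0.04, 'Mountain': -0.08, 'Pacific': -0.12}
print(per_zone)  # {'Central': -0.04, 'Mountain': -0.04, 'Pacific': -0.04}
```

The per-zone value comes out the same in all three rows, which is what lets a single "number of time zones" coefficient soak up every error.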

So the regression says, "Ha! That must be the effect of time zone! It follows the x/2x/3x requirement, and it fits the data perfectly, because all the errors now come out to zero!"

So you conclude that 

(a) over a 20-year period, the East teams were .580 teams but played down to .500 because they suffered from a huge time-zone effect.

Well, do you really want to believe that? 

You have at least two other options you can justify: 

(b) over a 20-year period, the East teams were .500 teams and there was a time-zone effect of +40 points playing in CST, and -40 points playing in PST, but those effects weren't statistically significant.

(c) over a 20-year period, the East teams were .500 teams and due to lack of statistical significance and no obvious pattern, we conclude there's no real time-zone effect.

The only reason to choose (a) is if you are almost entirely convinced of two things: first, that x/2x/3x is the only reasonable model to consider, and, second, that 40/80/120 points is plausible enough to not assume that it's just random crap, despite the statistical significance.

You have to abandon your model at this point, don't you? I mean, I can see how, before running the regression, the x/2x/3x assumption seemed as reasonable as any. But, now, to maintain that it's plausible, you have to also believe it's plausible that an Eastern team loses .120 points of winning percentage when it plays on the West Coast. Actually, it's worse than that! The .120 was from this contrived example. The real data shows a drop of more than .200 when playing on the West Coast!

The results of the regression should change your mind about the model, and alert you that the x/2x/3x is not the right hypothesis for how time-zone effects work.


Does this seem like cheating? We try a regression, we get statistically-significant estimates, but we don't like the result so we retroactively reject the model. Is that reasonable?

Yes, it is, because you have to either reject the model or accept its implications. IF we accept the model, then we're forced to accept that there's a 240-point West-to-East time zone effect, and we're forced to accept that West Coast teams that play at a 41-41 level against other West Coast teams somehow raise their game to the 61-21 level against East Coast teams that are equal to them on paper.

Choosing the x/2x/3x model led you to an absurd conclusion. Better to acknowledge that your model, therefore, must be wrong.

Still think it's cheating? Here's an analogy:

Suppose I don't know how old my friend's son is. I guess he's around 4, because, hey, that's a reasonable guess, from my understanding of how old my friend is and how long he's been married. 

Then, I find out the son is six feet tall.

It would be wrong for me to keep my assumption, wouldn't it? I can't say, "Hey, on the reasonable model that my friend's son is four years old, the regression spit out a statistically significant estimate of 72 inches. So, I'm entitled to conclude my friend's son is the tallest four-year-old in human history."

That's exactly what this paper is doing.  

When your model spews out improbable estimates for your coefficients, the model is probably wrong. To check, try a different, still-plausible model. If the result doesn't hold up, you know the conclusions are the result of the specific model you chose. 


By the way, if the statistical significance is concerning you, consider this. When the authors repeated the analysis for a later group of years, the time-zone effect was much smaller. It was .012 going east and -.008 going west, which wasn't even close to statistical significance. 

If the study had combined both samples into one, it wouldn't have found significance at all.

Oh, and, by the way: it's a known result that when you have strong correlation in your regression variables (like here), you get wide confidence intervals and weird estimates (like here). I posted about that a few years ago.  


The original question was: what's going on with the regression, that it winds up implying that a .500 team on the West Coast is a .752 team on the East Coast?

The summary is: there are three separate things going on, all of which contribute:

1.  there's no way to disentangle time zone effects from team quality effects.

2.  the regression only works because of random errors, and the estimate of the time-zone coefficient is only a function of random luck.

3.  the x/2x/3x model leads to conclusions that are too implausible to accept, given what we know about how the NBA works. 


UPDATE, August 6/17: I got out of my armchair and built a simulation. The results were as I expected. The time-zone effect I built in wound up absorbed by the team constants, and the time-zone coefficient varied around zero in multiple runs.


Friday, June 23, 2017

Juiced baseballs, part II

Last post, I showed how MGL found the variation (SD) of MLB baseballs to be in the range of about 7 feet difference for a typical fly ball. I wondered if that were truly the case, or if some of it wasn't real, just imprecision due to measurement error.

After some Twitter conversations that led me to other sources, I'm leaning to the conclusion that the variance is real.


Two of the three measurements in MGL's study (co-authored with Ben Lindbergh) were the circumference of the baseball and its average seam height. For both of those factors, the higher the measure, the more air resistance, and therefore shorter distance travelled.

It occurred to me -- why not measure distance directly, if that's what you're interested in? MGL told me, on Twitter, that that's been done. I found one study via a Google search (a study that Kevin later linked to in a comment).

That study took a box of one dozen MLB balls, fired them from a cannon one by one, and observed how far each travelled. Crucially, the authors adjusted that distance for the original speed and angle, because the cannon itself produces variations in initial conditions. So, what remains is mostly about the ball.

For one of the two boxes, the balls varied (SD) by 8 feet. For the second box, the SD was only 3 feet.

It's still possible that some of that variation is due to initial conditions that weren't controlled for, like small fluctuations in temperature, or air movement within the flight path, or whatever. Fortunately, the authors repeated the procedure, but for a single ball fired multiple times. 

The SD for the single ball was 3 feet.

Using the usual method, we know

SD for different balls ^ 2 = SD for a single ball ^ 2 + SD caused by ball differences ^ 2

That means for the first box, we estimate that the balls vary by 7 feet. For the second box, it's 0 feet. That's a big difference. Fortunately again, the authors repeated the procedure for different types of balls.

NCAA balls have higher seams and therefore less carry. The study found an overall SD of 11 feet, and single ball variation of 2 feet. That means different balls vary by an expected 10.8 feet, which I'll round to 11. 

For minor league balls, the study found an SD of 8 feet overall, but didn't test single balls. Taking 3 feet as a representative estimate for single-ball variation, we get that MiLB balls vary by 7 feet. (8 squared minus 3 squared equals 7 squared, roughly.)

So we have:

-- MLB  balls vary  0 feet in air resistance
-- MLB  balls vary  7 feet in air resistance
-- MiLB balls vary  7 feet in air resistance
-- NCAA balls vary 11 feet in air resistance
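For anyone who wants to check those four numbers, here's a short sketch of the variance subtraction. The function name is made up; the inputs are the SDs quoted above, with the 3-foot single-ball figure for MiLB being the same stand-in assumption the text makes.

```python
import math

def ball_sd(total_sd, single_ball_sd):
    """SD due to ball-to-ball differences, found by subtracting variances."""
    diff = total_sd ** 2 - single_ball_sd ** 2
    return math.sqrt(max(diff, 0.0))  # clamp so rounding can't go negative

# (overall SD, single-ball SD), both in feet
cases = {
    "MLB box 1": (8, 3),
    "MLB box 2": (3, 3),
    "MiLB": (8, 3),   # single-ball SD of 3 is assumed, not measured
    "NCAA": (11, 2),
}

for label, (total, single) in cases.items():
    print(f"{label}: {ball_sd(total, single):.1f} feet")
```

Unrounded, the two 8-and-3 cases come out closer to 7.4 feet than 7, and the NCAA figure is 10.8.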

In that light, the 7 feet found in MGL's study doesn't seem out of line. Actually, that 7 feet is a bit of an overestimate. It includes variation in COR (bounciness), which doesn't factor into air resistance, as far as I can tell. Limiting only to air resistance, MGL's study found an SD of only 6 feet.


One thing I noticed in the MGL data is that even for balls within the same era, the COR "bounciness" measure correlates highly with both circumference (-.46 overall) and seam height (-.35 overall). (For the 10 balls after the 2016 All-Star break, it's -.36 and -.56, respectively.)

I don't know if those measures are related on some kind of physics basis, or if it's just coincidence that they varied together that way. 


One thing I wonder: are balls within the same batch (whether the definition of "batch" is a box, a case, or a day's production) more uniform than balls from different batches? I haven't found a study that tells us that. From MGL's data, and treating day of use as a "batch," my eyeballs say batches are slightly more uniform than expected, but not much. My eyeballs could be wrong.

If batches *are* more uniform, teams could get valuable information by grabbing a few balls from today's batch, and getting them tested in advance. They'd be more likely to know, then, if they were dealing with livelier or deader balls that night.

Even if there's no difference within batches compared to between batches, it's still worth the testing. I don't know if any teams actually did this, but if any of them were testing balls in 2016, they'd have had advance knowledge that the balls were getting livelier. 

I have no idea what a team would do with that information, that home runs were about to jump significantly over last year ... but you'd think it would be valuable some way.


MGL tweeted, and I agreed, that it doesn't take much variation in a ball to make a huge difference to home run rates. He also thinks that any change in liveliness is likely to have been inadvertent on the part of the manufacturer, since it takes so little to make balls fly farther. I agree with that too.

But, why are MLB standards so lenient? As Lindbergh quotes from an earlier report,

" ... two baseballs could meet MLB specifications for construction but one ball could be theoretically hit 49.1 feet further."

Why doesn't MLB just put tighter control on the baseballs it uses? If the manufacturers can't make baseballs that precise, just put out a net at a standard distance, fire all the balls, and discard (or save for batting practice) all the balls that land outside the net. (That can't be so hard, can it? It can't be that the cannon would damage the balls too much, since MLB reuses balls that have been hit for line drives, which is a much more violent impact.)

You could even assign the balls to different liveliness groups, and require that different batches be stored at different humidor settings to equalize their bounciness.

Even if that's not practical, couldn't MLB, at least, test the balls regularly, so as to notice the variation before it shows up so obviously in the HR totals?


Finally, one last thought I had. If a ball is hit for a deep fly ball, doesn't that suggest that, at least as a matter of probability, it's juicier than average? If I were the pitching team, I might not want to pitch that ball again. It might be an expected difference of only a foot or two, but every little bit helps.


Monday, June 19, 2017

Are some of today's baseballs twice as lively as others?

Over at The Ringer, Ben Lindbergh and Mitchel Lichtman (MGL) claim to have evidence of a juiced ball in MLB.

They got the evidence in the most direct way possible -- by obtaining actual balls, and having them tested. MLB sells some of their game-used balls directly to the public, with certificates of authenticity that include the date and play in which the ball was used. MGL bought 36 of those balls, and sent them to a lab for testing.

It never once occurred to me that you could do that ... so simple an idea, and so ingenious! Kudos to MGL. I wonder why mainstream sports journalists didn't think of it? It would be trivial for Sports Illustrated or ESPN to arrange for that.

Anyway ... it turned out that the 13 more recent balls -- the ones used in 2016 -- were indeed "juicier" than the 10 older balls used before the 2015 All-Star break. Differences in COR (Coefficient of Restitution, a measure of "bounciness"), seam height, and circumference were all in the expected "juicy" direction in favor of the newer baseballs. (The difference was statistically significant at 2.6 SD.)

The article says,

"While none of these attributes in isolation could explain the increase in home runs that we saw in the summer of 2015, in combination, they can."

If I read that right, it means the magnitude of the difference in the balls matches the magnitude of the increase in home runs. The sum of the three differences translated to the equivalent of 7.1 feet in fly ball distance.

The authors posted the results of the lab tests, for each of the 36 balls in the study; you can find their spreadsheet here.


One thing I noticed: there sure is a lot of variation between balls, even within the same era, even used on the same day. Consider, for instance, the balls marked "MSCC0041" and "MSCC0043," both used on June 15, 2016.

The "43" ball had a COR of .497, compared to .486 for the "41" ball. That's a difference of 8 feet (I extrapolated from the chart in the article).

The "43" ball had a seam height of .032 inches, versus .046 for the other ball. That's a difference of *17 feet*.

The "43" ball had a circumference of 9.06 inches, compared to 9.08. That's another 0.5 feet.

Add those up, and you get that one ball, used the same day as another, was twenty-five feet livelier.

If 7.1 feet (what MGL observed between seasons) is worth, say, 30 percent more home runs, then the 25 foot difference means the "43" ball is worth DOUBLE the home runs of the "41" ball. And that's for two balls that look identical, feel identical, and were used in MLB game play on exactly the same day.
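The jump from "30 percent" to "double" depends on how you scale up from 7.1 feet, and the post doesn't spell that out. Here's a rough sketch of two possibilities, a straight-line extrapolation and a compounding one. Both scaling rules and all the variable names are assumptions for illustration, and the 30 percent is only the "say" figure from the text.

```python
feet_per_season_gap = 7.1   # distance difference MGL found between eras
hr_gain_per_gap = 0.30      # the "say, 30 percent" guess from the text

# Per-factor distance differences between the "43" and "41" balls, in feet.
cor_ft, seam_ft, circ_ft = 8.0, 17.0, 0.5
total_ft = cor_ft + seam_ft + circ_ft

steps = total_ft / feet_per_season_gap  # how many 7.1-foot steps that is

# Straight-line: each 7.1 feet adds another 30 percent of the base rate.
linear = 1 + hr_gain_per_gap * steps

# Compounding: each 7.1 feet multiplies the rate by 1.3.
compound = (1 + hr_gain_per_gap) ** steps

print(total_ft, round(linear, 2), round(compound, 2))
```

The straight-line version lands almost exactly on double; the compounding version gives something closer to two and a half times. Either way, "DOUBLE" is not an exaggeration.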


That 25-foot difference is bigger than typical, because I chose a relative outlier for the example. But the average difference is still pretty significant. Even within eras, the SD of difference between balls (adding up the three factors) is 7 or 8 feet.

Which means, if you take two random balls used on the same day in MLB, on average, one of them is *40 percent likelier* to be hit for a home run.

Of course, you don't know which one. If it were possible to somehow figure it out in real time during a game, what would that mean for strategy?


UPDATE: thinking further ... could it just be that the lab tests aren't that precise, and the observed differences between same-era balls are mostly random error? 

That would explain the unintuitive result that balls vary so hugely, and it would still preserve the observation that the eras are different.


Thursday, May 25, 2017

Pete Palmer on luck vs. skill

Pete Palmer has a new article on skill and luck in baseball, in which he crams a whole lot of results into five pages. 

It's called "Calculating Skill and Luck in Major League Baseball," and appears in the new issue of SABR's "Baseball Research Journal."  It's downloadable only by SABR members at the moment, but will be made publicly available when the next issue comes out this fall.

For most of the results, Pete uses what I used to call the "Tango method," which I should call the "Palmer method," because I think Pete was actually the first to use it in the context of sabermetrics, in the 2005 book "Baseball Hacks."  (The mathematical method is very old; Wikipedia says it's the "Bienaymé formula," discovered in 1853. But its use in sabermetrics is recent, as far as I can tell.)

Anyway, to go through the method yet one more time ... 

Pete found that the standard deviation (SD) of MLB season team wins, from 1981 to 1990, was 9.98. Mathematically, you can calculate that the expected SD of luck is 6.25 wins. Since a team's wins is the total of (a) its expected wins due to talent, and (b) deviation due to luck, the 1853 formula says

SD(actual)^2 = SD(talent)^2 + SD(luck)^2

Subbing in the numbers, we get

9.98 ^ 2 = SD(talent)^2 + 6.25^2 

Which means SD(talent) = 7.78.
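Here's that calculation as a small Python sketch. The function names are invented for illustration. The luck figure of 6.25 is taken as given from the article; the coin-flip formula is included only to show roughly where a number like that comes from.

```python
import math

def luck_sd(games, win_prob=0.5):
    """SD of wins from luck alone, treating each game as a weighted coin flip."""
    return math.sqrt(games * win_prob * (1 - win_prob))

def talent_sd(sd_actual, sd_luck):
    """SD of talent, from SD(actual)^2 = SD(talent)^2 + SD(luck)^2."""
    return math.sqrt(sd_actual ** 2 - sd_luck ** 2)

print(round(luck_sd(162), 2))          # 6.36 for a full 162-game season
print(round(talent_sd(9.98, 6.25), 2)) # 7.78, matching the figure above
```

A full 162-game season gives 6.36 rather than 6.25; the lower figure is presumably because the decade includes the strike-shortened 1981 season.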

In terms of the variation in team wins for single seasons from 1981 to 1990, we can estimate that differences in skill were only slightly more important than differences in luck -- 7.8 games to 6.3 games.


That 7.8 is actually the narrowest range of team talent for any decade. Team skill has been narrowing since the beginning of baseball, but seems to have widened a bit since 1990. Here's part of Pete's table:

Decade ending   SD(talent)
    1880           9.93
    1890          14.44
    1900          14.72
    1910          15.33
    1920          13.06
    1930          12.51
    1940          13.66
    1950          12.99
    1960          11.95
    1970          11.17
    1980           9.75
    1990           7.78
    2000           8.46
    2010           9.87
    2016           8.91

Anyway, we've seen that many times, in various forms (although perhaps not by decade). But that's just the beginning of what Pete provides. I don't want to give away his entire article, but here are some of the findings I hadn't seen before, at least not in this form:

1. For players who had at least 300 PA in a season, the spread in their batting average is roughly evenly caused by luck and skill.

2. Switching from BA to NOPS (normalized on-base plus slugging), skill now surpasses luck, by an SD of 20 points to 15.

3. For pitchers with 150 IP or more, luck and skill are again roughly even.

In the article, these are broken down by decade. There's other stuff too, including comparisons with the NBA and NFL (OK, that's not new, but still). Check it out if you can.


OK, one thing that surprised me. Pete used simulations to estimate the true talent of teams, based on their W-L record. For instance, teams who win 95-97 games are, on average, 5.6 games lucky -- they're probably 90 or 91-win talents rather than 96.

That makes sense, and is consistent with other studies that tried to figure out the same thing. But Pete went one step further: he found actual teams that won 95-97 games, and checked how they did next year.

For the year in question, you'd expect them to have been 91 win teams. For the following year, though, you'd expect them to be *worse* than 91 wins, because team talent tends to revert to .500 over the medium term, unless you're a Yankee dynasty or something.

But ... for those teams, the difference was only six-tenths of a win. Instead of being 91 wins (90.8), they finished with an average of 90.2.

I would have thought the difference would have been more than 0.6 wins. And it's not just this group. For teams who finished between 58 and 103 wins, no group regressed more than 1.8 wins beyond their luck estimate. 

I guess that makes sense, when you think about it. A 90-win team is really an 87-win talent. If they regress to 81-81 over the next five seasons, that's only about one win per year. It's my intuition that was off, and it took Pete's chart to make me see that.
