Friday, August 04, 2017

Deconstructing an NBA time-zone regression

Warning: for regression geeks only.

----

Recently, I came across an NBA study that found an implausibly huge effect of teams playing in other time zones. The study uses a fairly simple regression, so I started thinking about what could be happening. 

My point here isn't to call attention to the study, just to figure out the puzzle of how such a simple regression could come up with such a weird result. 

------

The authors looked at every NBA regular-season game from 1991-92 to 2001-02. They tried to predict which team won, using these variables:

-- indicator for home team / season
-- indicator for road team / season
-- time zones east for road team
-- time zones west for road team

The "time zones" variable was set to zero if the game was played in the road team's normal time zone, or if it was played in the opposite direction. So, if an east-coast team played on the west coast, the west variable would be 3, and the east variable would be 0.

The team indicators are meant to represent team quality. 
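
For concreteness, here's a rough sketch in Python of the kind of design that regression implies. It's my own reconstruction, not the authors' code, and it assumes a plain linear probability model on "road team wins" (the paper's actual specification may well differ):

import numpy as np

def design_row(home, road, zones_east, zones_west, n_teams):
    # one game: a dummy for the home team, a dummy for the road team,
    # and the two "time zones travelled" counts (at least one is always 0)
    row = np.zeros(2 * n_teams + 2)
    row[home] = 1.0                 # home-team indicator
    row[n_teams + road] = 1.0       # road-team indicator
    row[-2] = zones_east            # zones the road team moved east
    row[-1] = zones_west            # zones the road team moved west
    return row

def fit(games, n_teams):
    # games: list of (home, road, zones_east, zones_west, road_won)
    X = np.array([design_row(h, r, e, w, n_teams) for h, r, e, w, _ in games])
    y = np.array([g[-1] for g in games], dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    return beta

In this sketch, the last two fitted coefficients play the role of the per-zone "east" and "west" effects.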

------

When the authors ran the regression, they found the "number of time zones" variable large and statistically significant. For each time zone moving east, teams played .084 better than expected (after controlling for teams). A team moving west played .077 worse than expected. 

That means a .500 road team from the West Coast would actually play .752 ball on the East Coast. And that's regardless of how long the visiting team has been in the home team's time zone. It could be a week or more into a road trip, and the regression says it's still .752.

The authors attribute the effect to "large, biological effects of playing in different time zones discovered in medicine and physiology research." 

------

So, what's going on? I'm going to try to get to the answer, but I'll start with a couple of dead ends that nonetheless helped me figure out what the regression is actually doing. I should say in advance that I can't prove any of this, because I don't have their data and I didn't repeat their regression. This is just from my armchair.

Let's start with this. Suppose it were true, that for physiological reasons, teams always play worse going west, and teams always play better going east. If that were the case, how could you ever know? No matter what you see in the data, it would look EXACTLY like the West teams were just better quality than the East teams. (Which they have been, lately.)  

To see that argument more easily: suppose the teams on the West Coast are all NBA teams. The MST teams are minor-league AAA. The CST teams are AA. And the East Coast teams are minor league A ball. But all the leagues play against each other.

In that case, you'd see exactly the pattern the authors got: teams are .500 against each other in the same time zone, but worse when they travel west to play against better leagues, and better when they travel east to play against worse leagues.

No matter what results you get, there's no way to tell whether it's time zone difference, or team quality.

So is that the issue, that the regression is just measuring a quality difference between teams in different time zones? No, I don't think so. I believe the "time zone" coefficient of the regression is measuring something completely irrelevant (and, in fact, random). I'll get to that in a bit. 

------

Let's start by considering a slightly simpler version of this regression. Suppose we include all the team indicator variables, but, for now, we don't include the time-zone number. What happens?

Everything works, I think. We get decent estimates of team quality, both home and road, for every team/year in the study. So far, so good. 

Now, let's add a bit more complexity. Let's create a regression with two time zones, "West" and "East," and add a variable for the effect of that time zone change.

What happens now?

The regression will fail. There's an infinite number of possible solutions. (In technical terms, the regression matrix is "singular."  We have "collinearity" among the variables.)

How do we know? Because there's more than one set of coefficients that fits the data perfectly. 

(Technical note: a regression will always fail if you have an indicator variable for every team. To get around this, you'll usually omit one team (and the others will come out relative to the one you omitted). The collinearity I'm talking about is even *after* doing that.)

Suppose the regression spit out that the time-zone effect is actually  .080, and it also spit out quality estimates for all the teams.

From that solution, we can find another solution that works just as well. Change the time-zone effect to zero. Then, add .080 to the quality estimate of every West team. 

Every team/team estimate will wind up working out exactly the same. Suppose, in the first result, the Raptors were .400 on the road, the Nuggets were .500 at home, and the time-zone effect is .080. In that case, the regression will estimate the Raptors at .320 against the Nuggets. (That's .400 - (.500 - .500) - .080.)

In the second result, the regression leaves the Raptors at .400, but moves the Nuggets to .580, and the time-zone effect to zero. The Raptors are still estimated at .320 against the Nuggets. (This time, it's .400 - (.580 - .500) - .000.)

You can create as many other solutions as you like that fit the data identically: just add any X to the time-zone estimate, and add the same X to every Western team.

The regression is able to figure out that the data doesn't give a unique solution, so it craps out, with a message that the regression matrix is singular.
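
You can check the Raptors/Nuggets arithmetic in two lines. The encoding here is mine (the home-team column holds the hit the visitor takes for facing that home team, so an average home team is 0), but the arithmetic is the same as in the example above:

import numpy as np

# one game: Raptors (road) visiting the Nuggets (home), one zone west
# columns: [hit from facing this home team, Raptors road quality, zones west]
X = np.array([[1.0, 1.0, 1.0]])

beta1 = np.array([0.000, 0.400, -0.080])   # average Nuggets, .080 travel penalty
beta2 = np.array([-0.080, 0.400, 0.000])   # Nuggets .080 better, no penalty

print(X @ beta1, X @ beta2)   # both come out 0.32 -- the data can't tell them apart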

------

All that was for a regression with only two time zones. If we now expand to include all four zones, that gives six different effects in each direction (E moving to C, C to M, M to P, E to M, C to P, and E to P). What if we include six time-zone variables, one for each effect?

Again, we get an infinity of solutions. We can produce new solutions almost the same way as before. Just take any solution, subtract X from each E team quality, and add X to the E-C, E-M and E-P coefficients. You wind up with the same estimates.

------

But the authors' regression actually did have one unique best fit solution. That's because they did one more thing that we haven't done.

We can get to their regression in two steps.

First, we collapse the six variables into three -- one for "one time zone" (regardless of which zone it is), one for "two time zones," and one for "three time zones". 

Second, we collapse those three variables into one, "number of time zones," which implicitly forces the two-zone effect and three-zone effect to be double and triple, respectively, the value of the one-zone effect. I'll call that the "x/2x/3x rule" and we'll assume that it actually does hold.

So, with the new variable, we run the regression again. What happens?

In the ideal case, the regression fails again. 

By "ideal case," I mean one where all the error terms are zero, where every pair of teams plays exactly as expected. That is, if the estimates predict the Raptors will play .350 against the Nuggets, they actually *do* play .350 against the Nuggets. It will never happen that every pair will go perfectly in real life, but maybe assume that the dataset is trillions of games and the errors even out.

In that special "no errors" case, you still have an infinity of solutions. To get a second solution from a first, you can, for instance, double the time zone effects from x/2x/3x to 2x/4x/6x. Then, subtract x from each CST team, subtract 2x from each MST team, and subtract 3x from each PST team. You'll wind up with exactly the same estimates as before.

-------

For this particular regression to not crap out, there have to be errors. Which is not a problem for any real dataset. The Raptors certainly won't go the exact predicted .350 against the Nuggets, either because of luck, or because it's not mathematically possible (you'd need to go 7-for-20, and the Raptors aren't playing 20 games a season in Denver).

The errors make the regression work.

Why? Before, with no errors, x/2x/3x fit all the observations perfectly, so you could create duplicate solutions by shifting X, 2X, and 3X between the team estimates and the one-, two-, and three-zone effects. Now, because of errors, the observed two-zone and three-zone shortfalls aren't exactly double and triple the one-zone shortfall. So not everything cancels out, and different candidate solutions leave different residuals.

That means that this time there's a unique solution, and the regression spits it out.

-------

In this new, valid, regression, what's the expected value of the estimate for the time-zone effect?

I think it must be zero.

The estimate of the coefficient is a function of the observed error terms in the data. But the errors are, by definition, just as likely to be negative as positive. I believe (but won't prove) that if you reverse the signs of all the error terms, you also reverse the sign of the time zone coefficient estimate.

So, the coefficient is as likely to be negative as positive, which means by symmetry, its expected value must be zero.

In other words: the coefficient in the study, the one that looks like it's actually showing the physiological effects of changing time zone ... is actually completely random, with expected value zero.

It literally has nothing at all to do with anything basketball-related!

-------

So, that's one factor that's giving the weird result, that the regression is fitting the data to randomness. Another factor, and (I think) the bigger one, is that the model is wrong. 

There's an adage, "All models are wrong; some models are useful." My argument is that this model is much too wrong to be useful. 

Specifically, the "too wrong" part is the requirement that the time-zone effect must be proportional to the number of zones -- the "x/2x/3x" assumption.

It seems like a reasonable assumption, that the effect should be proportional to the time lag. But, if it's not, that can distort the results quite a bit. Here's a simplified example showing how that distortion can happen.

Suppose you run the regression without the time-zone coefficient, get talent estimates for the teams, and then look at the errors, predicted vs. actual. For East teams, you find the errors are

+.040 against Central
+.000 against Mountain
-.040 against Pacific

That means that East teams played .040 better than expected against Central teams (after adjusting for team quality). They played exactly as expected against Mountain Time teams, and .040 worse than expected against West Coast teams.

The average of those numbers is zero. Intuitively, you'd look at those numbers and think: "Hey, there's no appreciable time-zone effect. Sure, the East teams lost a little more than normal against the Pacific teams, but they won a little more than normal against the Central teams, so it's mostly a wash."

Also, you'd notice that it really doesn't look like the observed errors follow x/2x/3x. The closest fit seems to be when you make x equal to zero, to get 0/0/0.

So, does the regression see that and spit out 0/0/0, accepting the errors it found? No. It actually finds a way to make everything fit perfectly!

To do that, it increases its estimates of every Eastern team by .080. Now, every East team appears to underperform by .080 against each of the three other time zones. Which means the observed errors are now 

-.040 against Central
-.080 against Mountain
-.120 against Pacific

And that DOES follow the x/2x/3x model -- which means you can now fit the data perfectly. Using 0/0/0, the .500 Raptors were expected to be .500 against an average Central team (.500 minus 0), but they actually went .540. Using -.040/-.080/-.120, the .580 Raptors are expected to be .540 against an average Central team (.580 minus .040), and that's exactly what they did.

So the regression says, "Ha! That must be the effect of time zone! It follows the x/2x/3x requirement, and it fits the data perfectly, because all the errors now come out to zero!"
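
If you want to verify that the contrived numbers really do fit the x/2x/3x form exactly, a two-variable least squares does it. (The .540/.500/.460 line is just the made-up example above, not real data.)

import numpy as np

# East teams' opponent-adjusted winning pct. against C, M, and P opponents
pct = np.array([0.540, 0.500, 0.460])
zones_west = np.array([1.0, 2.0, 3.0])

# model: pct = team quality + (per-zone effect) * zones  -- the x/2x/3x form
X = np.column_stack([np.ones(3), zones_west])
quality, per_zone = np.linalg.lstsq(X, pct, rcond=None)[0]

print(quality, per_zone)                             # 0.580 and -0.040
print(pct - X @ np.array([quality, per_zone]))       # residuals: zero, up to rounding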

So you conclude that 

(a) over a 20-year period, the East teams were .580 teams but played down to .500 because they suffered from a huge time-zone effect.

Well, do you really want to believe that? 

You have at least two other options you can justify: 

(b) over a 20-year period, the East teams were .500 teams and there was a time-zone effect of +40 points playing in CST, and -40 points playing in PST, but those effects weren't statistically significant.

(c) over a 20-year period, the East teams were .500 teams and due to lack of statistical significance and no obvious pattern, we conclude there's no real time-zone effect.

The only reason to choose (a) is if you are almost entirely convinced of two things: first, that x/2x/3x is the only reasonable model to consider; and, second, that 40/80/120 points is plausible enough that you don't just write it off as random crap that happened to come out statistically significant.

You have to abandon your model at this point, don't you? I mean, I can see how, before running the regression, the x/2x/3x assumption seemed as reasonable as any. But, now, to maintain that it's plausible, you have to also believe it's plausible that an Eastern team loses .120 points of winning percentage when it plays on the West Coast. Actually, it's worse than that! The .120 was from this contrived example. The real data shows a drop of more than .200 when playing on the West Coast!

The results of the regression should change your mind about the model, and alert you that the x/2x/3x is not the right hypothesis for how time-zone effects work.

-------

Does this seem like cheating? We try a regression, we get statistically-significant estimates, but we don't like the result so we retroactively reject the model. Is that reasonable?

Yes, it is. Because you have to either reject the model, or accept its implications. IF we accept the model, then we're forced to accept that there's a 240-point West-to-East time zone effect, and we're forced to accept that West Coast teams that play at a 41-41 level against other West Coast teams somehow raise their game to the 61-21 level against East Coast teams that are equal to them on paper.

Choosing the x/2x/3x model led you to an absurd conclusion. Better to acknowledge that your model, therefore, must be wrong.

Still think it's cheating? Here's an analogy:

Suppose I don't know how old my friend's son is. I guess he's around 4, because, hey, that's a reasonable guess, from my understanding of how old my friend is and how long he's been married. 

Then, I find out the son is six feet tall.

It would be wrong for me to keep my assumption, wouldn't it? I can't say, "Hey, on the reasonable model that my friend's son is four years old, the regression spit out a statistically significant estimate of 72 inches. So, I'm entitled to conclude my friend's son is the tallest four-year-old in human history."

That's exactly what this paper is doing.  

When your model spews out improbable estimates for your coefficients, the model is probably wrong. To check, try a different, still-plausible model. If the result doesn't hold up, you know the conclusions are the result of the specific model you chose. 

------

By the way, if the statistical significance is concerning you, consider this. When the authors repeated the analysis for a later group of years, the time-zone effect was much smaller. It was .012 going east and -.008 going west, which wasn't even close to statistical significance. 

If the study had combined both samples into one, it wouldn't have found significance at all.

Oh, and, by the way: it's a known result that when you have strong correlation in your regression variables (like here), you get wide confidence intervals and weird estimates (like here). I posted about that a few years ago.  

-------

The original question was: what's going on with the regression, that it winds up implying that a .500 team on the West Coast is a .752 team on the East Coast?

The summary is: there are three separate things going on, all of which contribute:

1.  there's no way to disentangle time zone effects from team quality effects.

2.  the regression only works because of random errors, and the estimate of the time-zone coefficient is only a function of random luck.

3.  the x/2x/3x model leads to conclusions that are too implausible to accept, given what we know about how the NBA works. 





-----

UPDATE, August 6/17: I got out of my armchair and built a simulation. The results were as I expected. The time-zone effect I built in wound up absorbed by the team constants, and the time-zone coefficient varied around zero in multiple runs.

Monday, October 24, 2016

Why the 2016 AL was harder to predict than the 2016 NL

In 2016, team forecasts for the National League turned out more accurate than they had any right to be, with FiveThirtyEight's predictions coming in with a standard error (SD) of only 4.5 wins. The forecasts for the American League, however, weren't nearly as accurate ... FiveThirtyEight came in at 8.9, and Bovada at 8.8. 

That isn't all that great. Just predicting each team to duplicate its 2015 record would have come in at 11.1. And 11 wins is about what you'd get most years if you just forecast every team at 81-81.

Which is kind of what the forecasters did! Well, not every team at 81-81 exactly, but every team *close* to 81-81. If you look at FiveThirtyEight's actual predictions, you'll see that they had a standard deviation of only 3.4 wins. No team was predicted to win or lose more than 87 games.

Generally, team talent has an SD of around 9 wins. If you were a perfect evaluator of talent, your forecasts would also have an SD of 9. If, however, you acknowledge that there are things that you don't know (and many that can't be known, like injuries and suspensions), you'll forecast with an SD somewhat less than 9 -- maybe 6 or 7.

But, 3.4? That seems way too narrow. 

Why so narrow? I think it was because, last year, the AL standings were themselves exceptionally narrow. In 2015, no American League team won or lost more than 95 games. Only three teams were at 89 or more. 

The SD of team wins in the 2015 AL was 7.2. That's much lower than the usual figure of around 11. In fact, 7.2 is the lowest for either league since 1961 -- and, I checked, it's the lowest for any league in baseball history! (Second narrowest: the 1974 American League, at 7.3.)

Why were the standings so compressed? There are three possibilities:

1. The talent was compressed;

2. There was less luck than normal;

3. The bad teams had good luck and the good teams had bad luck, moving both sets closer to .500.

I don't think it was #1. In 2016, the SD of standings wins was back near normal, at 10.2. The year before, 2014, it was 9.6. It doesn't really make sense that team talent regressed so far to the mean between 2014 and 2015, and then suddenly jumped back to normal in 2016. (I could be wrong -- if you can find trades and signings those years that showed good teams got significantly worse in 2015 and then significantly better in 2016, that would change my mind.)

And I don't think it was #2, based on Pythagorean luck. The SD of the discrepancy in "first-order wins" was 4.3, which is larger than the usual 4.0. 

So, that leaves #3 -- and I think that's what it was. In the 2015 AL, the correlation between first-order-wins and Pythagorean luck was -0.54 instead of the expected 0.00. So, yes, the good teams had bad luck and the bad teams had good luck. (The NL figure was -0.16.)

-------

When that happens -- when luck compresses the standings -- it definitely makes forecasting harder, because there's not as much information about how the teams differ. To see that, consider the extreme case. If, by some weird fluke, every team wound up 81-81, how would you know which teams were talented but unlucky, and which were less skilled but lucky? You wouldn't, and so you wouldn't know what to expect next season.

Of course, that's only a problem if there *is* a wide spread of talent, one that got overcompressed by luck. If the spread of talent actually *is* narrow, then forecasting works OK. 

That's what many forecasting methods assume, that if the standings are narrow, the talent must be narrow. If you do the usual "just take the standings and regress to the mean" operation, you'll wind up implicitly assuming that the spread of talent shrank at the same time as the spread in the standings shrank.

Which is fine, if that's what you think happened ... but, do you really think that's plausible? The AL talent distribution was pretty close to average in 2014. It makes more sense to me to guess that the difference between 2014 and 2015 was luck, not wholesale changes in personnel that made the bad teams better and the good teams worse.
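
For reference, here's what the usual regress-to-the-mean arithmetic looks like, using the round numbers from earlier in the post (a talent SD of about 9 wins, and binomial luck of about 6.4 wins over 162 games). This is my own back-of-the-envelope version, not what any particular forecaster actually does:

import math

sd_talent = 9.0                          # spread of true team talent, in wins
sd_luck = math.sqrt(162 * 0.5 * 0.5)     # binomial luck over 162 games: ~6.4 wins

# observed standings should spread out by roughly this much in a normal year
sd_standings = math.sqrt(sd_talent**2 + sd_luck**2)      # ~11 wins

# standard shrinkage: how much of a team's deviation from .500 to keep
shrink = sd_talent**2 / (sd_talent**2 + sd_luck**2)      # ~0.67

print(round(sd_standings, 1), round(shrink, 2))
# e.g. a 91-71 team (+10) gets estimated as roughly a +6.7 (88-win) talent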

Of course, I have the benefit of hindsight, knowing that the AL standings returned to near-normal in 2016 (with an SD of 10.2). But it's happened before -- the then-record-low 7.3 figure for the 1974 AL jumped back to an above-average 11.9 in 1975.

If I had been forecasting the 2016 standings, I'd have wanted to make an effort to figure out which teams were lucky and which ones weren't, in order to forecast a more realistic talent SD than 3.4 wins.

Besides, you have more than the raw standings. If you adjust for Pythagoras, the SD jumps from 7.2 to 8.6. And, according to Baseball Prospectus, when you additionally adjust for cluster luck, the SD rises to 9.4. (As I wrote in the P.S. to the last post, I'm not confident in that number, but never mind for now.)

An SD of 9.4 is still smaller than 11, but it should be workable.

Anyway, my gut says that you should be able to differentiate the good teams from the bad with a spread higher than 3.4 games ... but I could be wrong. Especially since Bovada's spread was even smaller, at 3.3.

-------

It's a bad idea to second-guess the bookies, but let's proceed anyway.

Suppose you thought that the standings compression of 2015 was a luck anomaly, and the distribution of talent for 2016 should still be as wide as ever. So, you took FiveThirtyEight's projections, and you expanded them, by regressing them away from the mean, by a factor of 1.5. Since FiveThirtyEight predicted the Red Sox at four games above .500 (85-77), you bump that up to six games (87-75).

If you did that, the SD of your actual predictions is now a more reasonable 5.1. And those predictions, it turns out, would have been better. The accuracy of your new predictions would have been an SD of 8.4. You would have beat FiveThirtyEight and Bovada.

If that's too complicated, try this. If you had tried to take advantage of Bovada's compressed projections by betting the "over" on their top seven teams, and the "under" on their bottom seven teams, you would have gone 9-5 on those bets.

Now, I'm not going to go so far as to say this is a workable strategy ... bookmakers are very, very good at what they do. Maybe that strategy just turned out to be lucky. But it's something I noticed, and something to think about.

-------

If compressed standings make predicting more difficult, then a larger spread in the standings should make it easier.

Remember how the 2016 NL predictions were much more accurate than expected, with an SD of 4.5 (FiveThirtyEight) and 5.5 (Bovada)? As it turns out, last year, the SD of the 2015 NL standings was higher than normal, at 12.65 wins. That's the highest of the past three years:

2014  AL= 9.59, NL= 9.20
2015  AL= 6.98, NL=12.65
2016  AL=10.15, NL=10.71

It's not historically high, though. I looked at 1961 to 2011 ... if the 2015 NL were included, it would be well above average, but only 70th percentile.*

(* If you care: of the 10 most extreme of the 102 league-seasons in that timespan, most were expansion years, or years following expansion. But the 2001, 2002, and 2003 AL made the list, with SDs of 15.9, 17.1, and 15.8, respectively. The 1962 National League was the most extreme, at 20.1, and the 2002 AL was second.)

A high SD won't necessarily make your predictions beat the speed of light, and a low SD won't necessarily make them awful. But both contribute. As an analogy: just because you're at home doesn't mean you're going to pitch a no-hitter. But if you *do* pitch a no-hitter, odds are, you had the help of home-field advantage.

So, given how accurate the 2016 NL forecasts were, I'm not surprised that the SD of the 2015 NL standings was higher than normal.

-------

Can we quantify how much compressed standings hurt next year's forecasts? I was curious, so I ran a little simulation. 

First, I gave every team a random 2015 talent, so that the SD of team talent came out between 8.9 and 9.1 games. Then, I ran a simulated 2015 season. (I ran each team with 162 independent games, instead of having them play each other, so the results aren't perfect.)

Then, I regressed each team's 2015 record to the mean, to get an estimate of their talent. I assumed that I "knew" that the SD of talent was around 9, so I "unregressed" each regressed estimate away from the mean by the exact amount that gets the SD of talent to exactly 9.00. That became the official forecast for 2016. 

Finally, I ran a simulation of 2016 (with team talent being the same as 2015). I compared the actual to the forecast, and calculated the SD of the forecast errors.
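
Here's a compressed sketch of that simulation -- a reconstruction for illustration rather than the exact code, with the details (number of teams, how the "unregressing" is done) filled in with what seem like reasonable choices:

import numpy as np

rng = np.random.default_rng(0)
n_teams, games = 15, 162

def simulate_season(talent):
    # every team plays 162 independent games against average opposition
    return rng.binomial(games, talent)

for _ in range(3):                                    # a few example seasons
    talent = rng.normal(0.5, 9.0 / games, n_teams)    # true talent, SD of about 9 wins
    wins_prev = simulate_season(talent)               # the "2015" season

    # regress the standings to the mean, then stretch the estimates back out
    # so their spread is exactly 9 wins (the "unregressing" step)
    shrink = 81.0 / (81.0 + games * 0.25)             # talent variance / total variance
    est = 81 + shrink * (wins_prev - 81)
    forecast = 81 + (est - 81) * (9.0 / np.std(est))

    wins_next = simulate_season(talent)               # same talent, the "2016" season
    # last year's standings SD, and this year's forecast-error SD
    print(round(np.std(wins_prev), 1), round(np.std(wins_next - forecast), 1))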

The results came out, I think, very reasonable.

Over 4,000 simulated seasons, the average accuracy was an SD of 7.9. But, the higher the SD of last year's standings, the better the accuracy:

SD Standings    SD next year's forecast
------------------------------------
7.0             8.48 (2015 AL)
8.0             8.31
9.0             8.14
10.0            7.98
11.0            7.81
12.0            7.64
12.6            7.54 (2015 NL)
13.0            7.47
14.0            7.31
20.1            6.29 (1962 NL)

So, by this reckoning, you'd expect the 2016 NL predictions to have been one win more accurate than the AL predictions. 

They were "much more accurater" than that, of course, by 3.4 or 4.5. The main reason, of course, is that there's a lot of luck involved. Less importantly, this simulation is very rough. The model is oversimplified, and there's no assurance that the relationship is actually linear. (In fact, the relationship *can't* be linear, since the "speed of light" limit is 6.4, and the model says the 1974 AL would beat that, at 6.3). 

It's just a very rough regression to get a very rough estimate. 

But the results seem reasonable to me. Going into 2016, we had (a) the narrowest standings in baseball history in the 2015 AL, and (b) a wider-than-average, 70th-percentile spread in the 2015 NL. In that light, an expected difference of 1 win, in terms of forecasting accuracy, seems very plausible. 

--------

So that's my explanation of why this year's NL forecasts were so accurate, while this year's AL forecasts were mediocre. A large dose of luck -- assisted by a small (but significant) dose of extra information content in the standings.


Friday, October 21, 2016

National League forecasts were too accurate in 2016

FiveThirtyEight predicted the National League surprisingly accurately this year.

The standard error of their predictions -- that is, the SD of the difference between their team forecasts, and what actually happened -- was only 4.5 games.* (Here's the link to their forecast -- go to the bottom and choose "April 2".)

(* The SD is the square root of the average squared error. If you prefer just the average error, in this case, it was three-and-a-third games. But I'll be using just the SD in the rest of this post. In most cases, to estimate average error when you only have the SD, you can multiply by the square root of 2/pi (approximately 0.8).)
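
(If you don't trust that conversion factor, a two-line simulation confirms it -- for normally distributed errors, the average absolute error is about 0.8 times the SD.)

import numpy as np

x = np.random.default_rng(0).normal(0, 1, 1_000_000)
print(np.mean(np.abs(x)), np.sqrt(2 / np.pi))   # both come out around 0.80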

4.5 games is very, very good. In fact, it's so good it can't possibly be all skill. The "speed of light" limit on forecasting MLB is about 6.4 games. That is, even if you knew absolutely everything about the talent of a team and its opposition, every game, an SD of 6.4 is the very best you could expect to do.

Of course, you can get lucky, and beat 6.4 games. You could even get to zero, if fortune smiles on you and every team hits your projection exactly. But, 6.4 is the best you can do by skill.**  

(** Actually, it might be a bit less, 6.3 or something, because 6.4 is what you get when teams are evenly matched ... mismatches are somewhat easier to predict. But never mind.)

How unusual is an SD of 4.5? Well, not *that* unusual. By my estimate, the SD of the observed SD -- sorry if that's a little confusing -- is somewhere around 1.7, for a league of 15 teams. So, FiveThirtyEight was a little over one standard deviation lucky, which isn't really a big deal. Even taking into account that FiveThirtyEight couldn't have been perfectly accurate in their talent assessments, it's still not that big a deal. If they were off, on talent, by around 3 games per team, that would bring them to only about 1.5 SDs of luck.

Still not a huge deal, but interesting nonetheless.

------

It wasn't just FiveThirtyEight whose projections did well ... the Vegas bookmakers did OK too. Well, at least the one I looked at, Bovada. (I assume the others would be pretty close.)  They had an SD of 5.5 games, which is also better than the "speed of light."  (I can't find the page I got them from, but this one, from a month earlier, is close.)

That suggests that it probably wasn't any particular flash of brilliance from either FiveThirtyEight or Bovada ... it must have been something about the way the season unfolded. 

Maybe, in 2016, there was less random deviation than usual? One type of random variation is whether a team exceeds their Pythagorean Projection -- that is, whether they win more (or fewer) games than you'd expect from their runs scored and allowed. To check that, I used Baseball Prospectus's numbers -- specifically, the difference between actual and "first-order wins."***

(*** Why didn't I use second-order wins? See the P.S. at the bottom of the post.)

In the National League in 2016, the SD of Pythagorean error was 3.55. That is indeed a little smaller than the average of around 4.0. But that small difference isn't nearly enough to explain why the projections were so good.

Here's what I think is the bigger factor -- actually, a combination of two factors.

First, by random chance, the better teams happened to undershoot their Pythagorean expectation, and the worse teams happened to exceed it. 

The Cubs were the best team in the league, and also the team with the most bad luck, -4.8 games. The Phillies were the worst team in the league with luck removed; you'd expect them to have won only 61.4 games, but they played +9.6 games above their Pythagorean projection to go 71-91.

Those two were the most obvious examples, but the pattern continued through the league. Overall, the correlation between first-order wins (which is an approximation of talent) and Pythagorean error was huge: +0.61. Normally, you'd expect it to be close to zero. (In the American league, it was, indeed, close to zero, at -0.06.)

Second, there was a similar, offsetting relationship in the predictions themselves. 

It turns out that the forecast errors had a strong pattern this year.  Instead of being random, they came out too "conservative" -- they underestimated the talent of the better teams, and overestimated the talent of the worse teams. Here's the distribution of FiveThirtyEight's forecast errors, with the teams sorted by their forecast:

Top 5 teams: average error -4 wins (underestimate)
Mid 5 teams: average error +4 wins (overestimate)
Btm 5 teams: average error +1 win  (overestimate)

So, in summary:

-- FiveThirtyEight predicted teams too close to the mean
-- Teams' Pythagorean luck moved them closer to the mean

Those two things cancelled each other out to a significant extent. And that's why FiveThirtyEight was so accurate.

-------

Next post: The American League, which is interesting for completely different reasons.

-------

P.S. Baseball Prospectus also produces "second-order wins," which attempts to remove a second kind of luck, what I call "Runs Created luck" (and others call "cluster luck"), which is teams scoring more or fewer runs than would be expected by their batting line. I started to do that, but ... I stopped, because I found something weird.

When you remove luck from the standings, you expect to make them tighter, to bring teams closer together. (To see that better, imagine removing luck from coin tosses. Every team reverts to .500.)

Removing first-order (Pythagorean) luck does seem to reduce the SD of the standings. But, removing second-order (Cluster) luck seems to do the *opposite*.

I checked four seasons of BP data, and, in every case, the SD of second-order wins (for the set of all 30 teams) was higher than the SD of first-order wins:  

         Actual  First-order  Second-order
------------------------------------------
2016      10.7        10.8        13.1
2015      10.4        10.1        11.8
2014       9.6         8.9         9.6
2013      12.2        12.2        12.8

So, either the good teams got lucky all four years, or there's something weird about how BP is computing second-order luck. 


Thursday, September 22, 2016

When and why log5 doesn't work

Six years ago, Tom Tango described a hypothetical sprinting tournament where log5 didn't seem to give accurate results. I think I have an understanding, finally, of why it doesn't work, thanks to 

(a) Ted Turocy, 

(b) a paper from Kristi Tsukida and Maya R. Gupta (.pdf; see section 3.4, keeping in mind that "Bradley-Terry model" basically means "log5"), and 

(c) this excellent post by John Cook. 

(You don't have to follow the links now; I'll give them again later, in context.)

-----

It turns out that the log5 formula makes a certain assumption about the sport, an assumption that makes the log5 formula work out perfectly. That assumption is: that the set of score differentials follows a logistic distribution.

What's the logistic distribution? It's a lot like the normal distribution, a bell-shaped curve. They can be so similar in shape that I'd have trouble telling them apart by eye. But, the logistic distribution has fatter tails relative to the "bell". In other words: the logistic distribution predicts rare events, like teams beating expectations by large amounts, will happen more often than the normal distribution would predict. And it predicts that certain common events, like close games, will happen less often.

The log5 formula works perfectly when scores are distributed logistically. That's been proven mathematically. But, where scores aren't actually logistic, the formula will fail, in proportion to how real life varies from the particular logistic distribution the log5 formula assumes.

That's why the formula didn't work in the sprinting example. Tango explicitly made the results normally distributed. Then, he found that log5 started to break down when the competitors became seriously mismatched. That is: log5 started to fail in the tail of the distribution, exactly where normal and logistic differ the most. 

-------

Here's a rudimentary basketball game. Two teams each get 100 possessions, and have a certain probability (talent) of scoring on each possession. Defense doesn't matter, and all baskets are two points.

Suppose you have A, a 55 percent team (expected score: 110), and B, a 45 percent team (expected score: 90). We expect each team's score to be normally distributed, with an SD of almost exactly 10 points for each team's individual score.*

(* This is just the normal approximation to binomial. To get the SD, calculate .45 multiplied by (1-.45) divided by 100. Take the square root. Multiply by 2 since each basket is 2 points. Then, multiply by 100 for the number of possessions. You get 9.95.)

Since the two teams are independent, the SD of the score difference is the square root of 2 times as big as the individual standard deviations. So, the SD of the differential is 14 points.

By talent, A is a 20-point favorite over underdog B. For B to win, it must beat the spread by 20 points. 20 divided by 14 equals 1.42 SD. Under the normal distribution, the probability of getting a result greater than 1.42 SD is 0.0778. 

That means B has a 7.78 percent chance of winning, and A a 92.22 chance. The odds ratio for A is 92.22 divided by 7.78, which is 11.85.

So, that's the answer. Now, how well does log5 approximate it? To figure that out, we need to figure the talent of A and B against a .500 team.

Let's say Team C is that .500 team. Against C, team A has an advantage of 0.71 SD. The normal distribution table says that's a winning percentage of .7611, which is an odds ratio of 3.186. Similarly, team B is 0.71 SD worse than C, so it wins with probability (1 - .7611), or odds ratio 1 / 3.186.

Using log5, what's the estimated odds ratio of A over B? It's 3.186 squared, or 10.15. That works out to a win probability of only .910, an underestimate of the .922 true probability.

Log5 estimate:  .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85
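
If you'd rather check that with code than with a normal table, here's the whole calculation in a few lines; scipy's normal CDF stands in for the table lookups, and it reproduces the numbers above to within rounding:

from math import sqrt
from scipy.stats import norm

sd_team = 2 * sqrt(100 * 0.55 * 0.45)       # ~9.95 points per team
sd_diff = sd_team * sqrt(2)                 # ~14.1 points for the score difference

p_a_beats_b = norm.cdf(20 / sd_diff)        # favorite by 20 points: ~.922
p_a_beats_c = norm.cdf(10 / sd_diff)        # favorite by 10 points: ~.761

# by symmetry, B beats C with probability 1 - p_a_beats_c, so log5's
# odds ratio for A over B is the A-over-C odds ratio squared
odds_ac = p_a_beats_c / (1 - p_a_beats_c)   # ~3.19
log5_ab = odds_ac**2 / (1 + odds_ac**2)     # log5 estimate: ~.91

print(round(p_a_beats_b, 3), round(log5_ab, 3))   # correct answer vs. log5 estimate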

To recap what we just did: we started with the correct, theoretically-derived probabilities of A beating C, and of B beating C. When we plugged those exact probabilities into log5, we got the wrong answer.

Why the wrong answer? Because of the log5 formula's implicit assumption: that the distribution of the score difference between A and B is logistic, rather than normal.

What specific logistic distribution does log5 expect? The one where the mean is the log of the odds ratio and the SD is about 1.8. (Logistic distributions are normally described by a "scale parameter," which is the SD divided by 1.8 (pi divided by the square root of 3, to be exact). So, actually, the log5 formula assumes a logistic distribution with a scale parameter of exactly 1.)

So, in this case, the log5 formula treats the score distribution as if it's logistic, with a mean of 2.3 (the log of the log5-calculated odds ratio of 10.15) and an SD of 1.8 (scale parameter 1). 

We can rescale that to be more intuitive, more basketball-like, and still get the same probability answer. We just multiply the mean, SD, and scale parameter by a constant. That's like taking a normal curve for height denominated in feet, and multiplying the mean and SD by 12 to convert to inches. The proportion of people under 5 feet under the first curve works out to the same as the proportion of people under 60 inches in the second curve.

In this case, we want the mean to be 20 (basketball points) instead of 2.3. So, we can multiply both the mean and the scale by 20/2.3. That gives us a new logistic distribution with mean 20 and SD 15.6.

That's reasonably close to the actual normal curve we know is correct:

Normal:   mean 20, SD 14
Logistic: mean 20, SD 15.6

It's reasonably close, but still incorrect. In this case, log5 overestimates the underdog in two different ways. First, it assumes a logistic distribution instead of normal, which means fatter tails. Second, it assumes a higher SD than actually occurs, which again means fatter tails and more extreme performances.

------

Here, I'll give you some stolen visuals. I'm swiping this diagram from that great John Cook post I mentioned, which compares and contrasts the two distributions. Here's a comparison of a normal and logistic distribution with the same SD:

[Figure: normal (dotted red) and logistic (solid blue) densities with equal SDs]

The dotted red line is the normal distribution, the solid blue line is the logistic, and both have an SD of about 1.8.  It looks like the logistic tails start getting fatter than the normal tails at around 4.3 or something, which is around two-and-a-half SD from the mean.

But, for our purposes, it's the area under the curve that we care about, the CDF. Here's that comparison, shamefully stolen from the Tsukida/Gupta .pdf I linked to earlier:

[Figure: CDFs of the normal and logistic distributions, from Tsukida and Gupta]

From here, you can see that from minus infinity up to around, I dunno, maybe 1.7 SDs below the mean or so, the area under the logistic curve is larger than the area under the normal curve.

And again, this is when the curves have equal SDs.  In this basketball example, the log5 assumption has a higher SD than the actual normal, by 15.6 points to 14. So the overestimate of the underdog is even higher, and probably starts earlier than 1.7 SD.

------

Just to make sure my logic was correct, I ran a simulation for this exact basketball game. I created seven teams, with expectations running from 110 points down to 90 points. I ran a huge season, where each team played each other team 200,000 times. 

Then, I observed the simulated record of champion team A (the 110 point team) against team C (the team of average talent, the 100-point team). And I observed the simulated record of basement team B (the 90 point team) against team C. 

Here's what I got:

A versus C: 151878- 48122 (.7594), odds ratio 3.156
B versus C:  47599-152401 (.2380), odds ratio 0.312

To estimate A versus B using log5, we just divide 3.156 by 0.312:

A versus B: 181990-18010 (.9099), odds ratio 10.105

But, the actual results from the simulated season were:

A versus B: 184395-15605 (.9220), odds ratio 11.816

In the simulation, log5 underestimated the favorite almost exactly the same way as it did in theory:

Simulation
--------------------------------------
Log5 estimate:  .909, odds ratio 10.11
Correct answer: .922, odds ratio 11.82

Theory
--------------------------------------
Log5 estimate:  .910, odds ratio 10.15
Correct answer: .922, odds ratio 11.85
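
And here's a stripped-down version of that simulation you can run yourself -- a reconstruction rather than the original code, with just the three teams that matter, and with ties broken by a coin flip (the post's setup allows ties, and a coin flip is my assumption for handling them):

import numpy as np

rng = np.random.default_rng(1)

def win_prob(p1, p2, n=200_000):
    # 100 possessions each, 2 points per successful possession
    s1 = 2 * rng.binomial(100, p1, n)
    s2 = 2 * rng.binomial(100, p2, n)
    coin = rng.random(n) < 0.5                    # tie-breaker assumption
    return np.mean((s1 > s2) | ((s1 == s2) & coin))

a_vs_c = win_prob(0.55, 0.50)                     # should land near .76
b_vs_c = win_prob(0.45, 0.50)                     # should land near .24
a_vs_b = win_prob(0.55, 0.45)                     # should land near .92

odds_log5 = (a_vs_c / (1 - a_vs_c)) / (b_vs_c / (1 - b_vs_c))
print("log5 estimate:", odds_log5 / (1 + odds_log5))   # about .91
print("simulated    :", a_vs_b)                        # about .92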

-------

So that's what I think is going on. If the distribution of the difference in team scores doesn't follow the logistic distribution well enough, you'll get log5 working poorly for those talent differentials for which the curves match the worst. In the case where score differentials are normal, like this basketball simulation, the worst match is when you have a heavy favorite.

For other sports, it's an empirical question. Take the actual score distribution for that particular sport, and fit it to the scaled logistic distribution assumed by log5. Where the curves differ, I suspect, will tell you where log5 projections will be least accurate, and in which direction.


Thursday, January 07, 2016

When log5 does and doesn't work

Team A, with an overall winning percentage talent of .600, plays against a weaker Team B with an overall winning percentage of .450. What's the probability that team A wins? 

In the 1980s, Bill James created the "log5" method to answer that question. The formula is

P = (A - AB)/(A+B-2AB)

... where A is the talent level of team A winning (in this case, .600), and B is the talent level of team B (.450).

Plug in the numbers, and you get that team A has a .647 chance of winning against team B. 
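
In code, the formula is a one-liner; here's a quick version with the numbers from this example:

def log5(a, b):
    # probability that a team with talent "a" beats a team with talent "b",
    # where talent = winning percentage against a .500 (average) opponent
    return (a - a * b) / (a + b - 2 * a * b)

print(round(log5(0.600, 0.450), 3))   # 0.647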

That makes sense: A is .600 against average teams. Since opponent B is worse than average, A should be better than .600. 

Team B is .050 worse than average, so you'd kind of expect A to "inherit" those 50 points, to bring it to .650. And it does, almost. The final number is .647 instead of .650. The difference is because of diminishing returns -- those ".050 lost wins" are what B loses to *average* teams because it's bad. Because A is better than average, it would have got some of those .050 wins anyway because it's good, so B can't "lose them again" no matter how bad it is.

In baseball, the log5 formula has been proven to work very well.

------

There was some discussion of log5 lately on Tango's site (unrelated to this post, but very worthwhile), and that got me thinking. Specifically, it got me thinking: log5 CANNOT be right. It can be *almost* right, but it can never be *exactly* right.

In the baseball context, it can be very, very close, indistinguishable from perfect. But in other sports, or other contexts, it could be way wrong. 

Here's one example where it doesn't work at all.

Suppose that, instead of actually playing baseball games, teams just measured their players' average height, and the taller team wins. And, suppose there are 11 teams in the league, and there's a balanced 100-game season.

What happens? Well, the tallest team beats everyone, and goes 100-0. The second-tallest team beats everyone except the tallest, and winds up 90-10. The third-tallest goes 80-20. And so on, all the way down to 0-100.

Now: when a .600 team plays a .400 team, what happens? The log5 formula says it should win 69.2 percent of those games. But, of course, that's not right -- it will win 100 percent of those games, because it's always taller.

For height, the log5 method fails utterly.

------

What's the difference between real baseball and "height baseball" that makes log5 work in one case but not the other?

I'm not 100% sure of this, but I think it's due to a hidden, unspoken assumption in the log5 method. 

When we say, "Team A is a .600 talent," what does that mean? It could mean either of two things:

-- A1. Team A is expected to beat 60 percent of the opponents it plays.

-- A2. If Team A plays an average team, it is expected to win 60 percent of the time.

Those are not the same! And, for the log5 method to work, assumption A1 is irrelevant. It's assumption A2 that, crucially, must be true. 

In both real baseball and "height baseball," A1 is true. But that doesn't matter. What matters is A2. 

In real baseball, A2 is close enough. So log5 works.

In "height baseball," A2 is absolutely false. If Team A (.600) plays an average team (.500), it will win 100 percent of the time, not 60 percent! And that's why log5 doesn't work there.

-------

What it's really coming down to is our old friend, the question of talent vs. luck. In real baseball, for a single game, luck dwarfs talent. In "height baseball," there's no luck at all -- the winner is just the team with the most talent (height). 

Here are two possible reasons a sports team might have a .600 record:

-- B1: Team C is more talented than exactly 60 percent of its opponents

-- B2: Team C is more talented than average, by some unknown amount (which varies by sport) that leads to it winning exactly 60 percent of its games.

Again, these are not the same. And, in real life, all sports (except "height baseball") are some combination of the two. 

B1 refers completely to talent, but B2 refers mostly to luck. The more luck there is, in relation to talent, the better log5 works.

Baseball has a pretty high ratio of luck to talent -- on any given day, the worst team in baseball can beat the best team in baseball, and nobody bats an eye. But in the NBA, there's much less randomness -- if Philadelphia beats Golden State, it's a shocking upset. 

So, my prediction is: the less that luck is a factor in an outcome, the more log5 will underestimate the better team's chance of winning.

Specifically, I would predict: log5 should work better for MLB games than for NBA games.

--------

Maybe someone wants to do some heavy thinking and figure how to move this forward mathematically.  For now, here's how I started thinking about it.

In MLB, the SD of team talent seems to be about 9 games per season. That's 90 runs. Actually, it's less, because you have to regress to the mean. Let's call it 81 runs, or half a run per game. (I'm too lazy to actually calculate it.) Combining the team and opponent, multiply by the square root of two, to give an SD of around 0.7 runs.

The SD of luck, in a single game, is much higher. I think that if you computed the SD of a single team's game-by-game runs scored over its 162 games, you'd get around 3. The SD of runs allowed is also around 3, so the SD of the difference would be around 4.2.

SD(MLB talent) = 0.7 runs
SD(MLB luck)   = 4.2 runs

Now, let's do the NBA. From basketball-reference.com, the SD of the SRS rating seems to be just under 5 points. That's based on outcomes, so it's too high to be an estimate of talent, and we need to regress to the mean. Let's arbitrarily reduce it to 4 points. Combining the two teams, we're up to 5.7 points.

What about the SD of luck? This site shows that, against the spread, the SD of score differential is around 11 points. So we have

SD(NBA talent) =  5.7 points
SD(NBA luck)   = 11.0 points

In an MLB game, luck is 6 times as important as talent. In an NBA game, luck is only 2 times as important as talent. 

But, how you apply that to fix log5, I haven't figured out yet. 

What I *do* think I know is that the MLB ratio of 6:1 is large enough that you don't notice that log5 is off. (I know that from studies that have tested it and found it works almost perfectly.) But I don't actually know whether the NBA ratio of 2:1 is also large enough. My gut says it's not -- I suspect that, for the NBA, in extreme cases, log5 will overestimate the underdog enough so that you'll notice. 

-------

Anyway, let me summarize what I speculate is true:

1. The log5 formula never works perfectly. Only as the luck/talent ratio goes to infinity will log5 be theoretically perfect. (But, then, the predictions will always be .500 anyway.)  In all other cases, log5 will underestimate, to some extent, how much the better team will dominate.

2. For practical purposes, log5 works well when luck is large compared to talent. The 6:1 ratio for a given MLB game seems to be large enough for log5 to give good results.

3. When comparing sports, the more likely it is that the more-talented team beats the less-talented team, the worse log5 will perform. In other words: the bigger the Vegas odds on underdogs, the worse log5 will perform for that sport.

4. You can also estimate how well log5 will perform with a simple test. Take a team near the extremes of the performance scale (say, a .600 or .400 team in MLB, or a .750 or .250 team in the NBA), and see how it performed specifically against only those teams with talent close to .500.

If a .750 team has a .750 record against teams known to be average, log5 will work great. But if it plays .770 or .800 or .900 ball against teams known to be average, log5 will not work well. 

-------

All this has been mostly just thinking out loud. I could easily be wrong.
