Sabermetric Research: Why the 2016 AL was harder to predict than the 2016 NL

In 2016, team forecasts for the National League turned out more accurate than they had any right to be, with FiveThirtyEight's predictions coming in with a standard error (the SD of the forecast errors) of only 4.5 wins. The forecasts for the American League, however, weren't nearly as accurate ... FiveThirtyEight came in at 8.9, and Bovada at 8.8.

That isn't all that great. You could have hit 11.1 just by predicting each team to duplicate their 2015 record. And, 11 wins is about what you'd get most years if you just forecasted every team at 81-81.

Which is kind of what the forecasters did! Well, not every team at 81-81 exactly, but every team *close* to 81-81. If you look at FiveThirtyEight's actual predictions, you'll see that they had a standard deviation of only 3.4 wins. No team was predicted to win or lose more than 87 games.

Generally, team talent has an SD of around 9 wins. If you were a perfect evaluator of talent, your forecasts would also have an SD of 9. If, however, you acknowledge that there are things that you don't know (and many that can't be known, like injuries and suspensions), you'll forecast with an SD somewhat less than 9 -- maybe 6 or 7.

But, 3.4? That seems way too narrow.

Why so narrow? I think it was because, last year, the AL standings were themselves exceptionally narrow. In 2015, no American League team won or lost more than 95 games. Only three teams were at 89 or more.

The SD of team wins in the 2015 AL was 7.2. That's much lower than the usual figure of around 11. In fact, 7.2 is the lowest for either league since 1961. In fact, I checked, and it's the lowest for any league in baseball history! (Second narrowest: the 1974 American League, at 7.3.)

Why were the standings so compressed? There are three possibilities:

1. The talent was compressed;

2. There was less luck than normal;

3. The bad teams had good luck and the good teams had bad luck, moving both sets closer to .500.

I don't think it was #1. In 2016, the SD of standings wins was back near normal, at 10.2. The year before, 2014, it was 9.6. It doesn't really make sense that team talent regressed so far to the mean between 2014 and 2015, and then suddenly jumped back to normal in 2016. (I could be wrong -- if you can find trades and signings those years that showed good teams got significantly worse in 2015 and then significantly better in 2016, that would change my mind.)

And I don't think it was #2, based on Pythagorean luck. The SD of the discrepancy in "first-order wins" was 4.3, which is larger than the usual 4.0.

So, that leaves #3 -- and I think that's what it was. In the 2015 AL, the correlation between first-order-wins and Pythagorean luck was -0.54 instead of the expected 0.00. So, yes, the good teams had bad luck and the bad teams had good luck. (The NL figure was -0.16.)
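Here's a rough arithmetic check of that explanation. Actual wins are first-order wins plus Pythagorean luck, so the variance of actual wins is the sum of the two variances plus twice their covariance. The sketch below plugs in the figures quoted in this post. It assumes that the Pythagoras-adjusted SD of 8.6 (mentioned further down) is the SD of first-order wins, and the function name `sd_of_sum` is just for illustration.

```python
import math

def sd_of_sum(sd_a: float, sd_b: float, corr: float) -> float:
    """SD of A + B, given the SD of each part and their correlation."""
    return math.sqrt(sd_a**2 + sd_b**2 + 2 * corr * sd_a * sd_b)

# Figures quoted in the post for the 2015 AL.
SD_FIRST_ORDER = 8.6  # assumed: SD of first-order (Pythagorean) wins
SD_PYTH_LUCK = 4.3    # SD of actual wins minus first-order wins
CORR = -0.54          # correlation between first-order wins and luck

observed = sd_of_sum(SD_FIRST_ORDER, SD_PYTH_LUCK, CORR)
uncorrelated = sd_of_sum(SD_FIRST_ORDER, SD_PYTH_LUCK, 0.0)
print(f"with r = -0.54: {observed:.2f}   with r = 0: {uncorrelated:.2f}")
```

With the -0.54 correlation, the result is about 7.2, in line with the observed standings SD. With zero correlation, the same inputs would give about 9.6.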

-------

When that happens, when luck compresses the standings, it definitely makes forecasting harder, because there's not as much information on how teams differ. To see that, consider the extreme case. If, by some weird fluke, every team wound up 81-81, how would you know which teams were talented but unlucky, and which were less skilled but lucky? You wouldn't, and so you wouldn't know what to expect next season.

Of course, that's only a problem if there *is* a wide spread of talent, one that got overcompressed by luck. If the spread of talent actually *is* narrow, then forecasting works OK.

That's what many forecasting methods assume, that if the standings are narrow, the talent must be narrow. If you do the usual "just take the standings and regress to the mean" operation, you'll wind up implicitly assuming that the spread of talent shrank at the same time as the spread in the standings shrank.
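To make that implicit assumption concrete, here is a minimal sketch of the usual variance bookkeeping: observed variance equals talent variance plus luck variance. It assumes luck is binomial over 162 coin-flip games (an SD of about 6.4 wins), which is a simplification and not necessarily what any particular forecaster does. The function names are hypothetical.

```python
import math

GAMES = 162
# Assumed luck model: a .500 team playing 162 independent coin-flip games.
LUCK_SD = math.sqrt(GAMES * 0.5 * 0.5)  # about 6.36 wins

def implied_talent_sd(standings_sd: float) -> float:
    """Talent SD implied by var(observed) = var(talent) + var(luck)."""
    return math.sqrt(max(standings_sd**2 - LUCK_SD**2, 0.0))

def shrink_factor(standings_sd: float) -> float:
    """Share of each team's deviation from the mean kept after regressing."""
    return implied_talent_sd(standings_sd) ** 2 / standings_sd**2

for sd in (11.0, 7.2):
    print(sd, round(implied_talent_sd(sd), 1), round(shrink_factor(sd), 2))
```

Under those assumptions, a standings SD of 11 implies a talent SD of about 9 and keeps roughly two-thirds of each team's deviation. A standings SD of 7.2 implies a talent SD of only about 3.4 and keeps only about a fifth.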

Which is fine, if that's what you think happened ... but, do you really think that's plausible? The AL talent distribution was pretty close to average in 2014. It makes more sense to me to guess that the difference between 2014 and 2015 was luck, not wholesale changes in personnel that made the bad teams better and the good teams worse.

Of course, I have the benefit of hindsight, knowing that the AL standings returned to near-normal in 2016 (with an SD of 10.2). But it's happened before -- the record-low 7.3 figure for the 1974 AL jumped back to an above-average 11.9 in 1975.

I'd think when I was forecasting the 2016 standings, I might want to make an effort to figure out which teams were lucky and which ones weren't, in order to be able to forecast a more realistic talent SD than 3.5 wins.

Besides, you have more than the raw standings. If you adjust for Pythagoras, the SD jumps from 7.2 to 8.6. And, according to Baseball Prospectus, when you additionally adjust for cluster luck, the SD rises to 9.4. (As I wrote in the P.S. to the last post, I'm not confident in that number, but never mind for now.)

An SD of 9.4 is still smaller than 11, but it should be workable.

Anyway, my gut says that you should be able to differentiate the good teams from the bad with a spread higher than 3.4 games ... but I could be wrong. Especially since Bovada's spread was even smaller, at 3.3.

-------

It's a bad idea to second-guess the bookies, but let's proceed anyway.

Suppose you thought that the standings compression of 2015 was a luck anomaly, and the distribution of talent for 2016 should still be as wide as ever. So, you took FiveThirtyEight's projections, and you expanded them, by regressing them away from the mean, by a factor of 1.5. Since FiveThirtyEight predicted the Red Sox at four games above .500 (85-77), you bump that up to six games (87-75).

If you did that, the SD of your actual predictions is now a more reasonable 5.1. And those predictions, it turns out, would have been better. The accuracy of your new predictions would have been an SD of 8.4. You would have beat FiveThirtyEight and Bovada.
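The expansion step is simple enough to sketch. The projections in the list below are made up for illustration (only the 85-win Red Sox example comes from the text), and `expand_from_mean` is a hypothetical helper name.

```python
import statistics

def expand_from_mean(wins, factor=1.5, mean=81.0):
    """Stretch each projection away from the mean by a constant factor."""
    return [mean + factor * (w - mean) for w in wins]

# Hypothetical projections, for illustration only. They are NOT
# FiveThirtyEight's actual numbers, apart from the 85-win example.
projected = [85, 83, 81, 79, 77]
expanded = expand_from_mean(projected)

print(expanded)  # the 85-win team becomes an 87-win team
ratio = statistics.pstdev(expanded) / statistics.pstdev(projected)
print(ratio)     # the spread grows by the same factor
```

Because every deviation is multiplied by the same constant, the SD of the projections is multiplied by it too, which is how 3.4 becomes 5.1.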

If that's too complicated, try this. If you had tried to take advantage of Bovada's compressed projections by betting the "over" on their top seven teams, and the "under" on their bottom seven teams, you would have gone 9-5 on those bets.

Now, I'm not going to go so far as to say this is a workable strategy ... bookmakers are very, very good at what they do. Maybe that strategy just turned out to be lucky. But it's something I noticed, and something to think about.
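As a rough gauge of how easily luck alone could produce a 9-5 record, here's a sketch that treats each of the 14 bets as an independent 50/50 coin flip. That is an assumption: it ignores the bookmaker's vig, and teams in the same league play each other, so the bets aren't truly independent.

```python
from math import comb

def prob_at_least(wins: int, bets: int, p: float = 0.5) -> float:
    """Chance of winning at least `wins` of `bets` independent bets."""
    return sum(
        comb(bets, k) * p**k * (1 - p) ** (bets - k)
        for k in range(wins, bets + 1)
    )

print(round(prob_at_least(9, 14), 3))
```

Under that coin-flip model, going 9-5 or better happens about 21 percent of the time, so the record by itself is weak evidence.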

-------

If compressed standings make predicting more difficult, then a larger spread in the standings should make it easier.

Remember how the 2016 NL predictions were much more accurate than expected, with an SD of 4.5 (FiveThirtyEight) and 5.5 (Bovada)? As it turns out, last year, the SD of the 2015 NL standings was higher than normal, at 12.65 wins. That's the highest of the past three years:

2014 AL= 9.59, NL= 9.20
2015 AL= 6.98, NL=12.65
2016 AL=10.15, NL=10.71

It's not historically high, though. I looked at 1961 to 2011 ... if the 2015 NL were included, it would be well above average, but only 70th percentile.*

(* If you care: of the 10 most extreme of the 102 league-seasons in that timespan, most were expansion years, or years following expansion. But the 2001, 2002, and 2003 AL made the list, with SDs of 15.9, 17.1, and 15.8, respectively. The 1962 National League was the most extreme, at 20.1, and the 2002 AL was second.)

A high SD won't necessarily make your predictions beat the speed of light, and a low SD won't necessarily make them awful. But both contribute. As an analogy: just because you're at home doesn't mean you're going to pitch a no-hitter. But if you *do* pitch a no-hitter, odds are, you had the help of home-field advantage.

So, given how accurate the 2016 NL forecasts were, I'm not surprised that the SD of the 2015 NL standings was higher than normal.

-------

Can we quantify how much compressed standings hurt next year's forecasts? I was curious, so I ran a little simulation.

First, I gave every team a random 2015 talent, so that the SD of team talent came out between 8.9 and 9.1 games. Then, I ran a simulated 2015 season. (I ran each team with 162 independent games, instead of having them play each other, so the results aren't perfect.)

Then, I regressed each team's 2015 record to the mean, to get an estimate of their talent. I assumed that I "knew" that the SD of talent was around 9, so I "unregressed" each regressed estimate away from the mean by the exact amount that gets the SD of talent to exactly 9.00. That became the official forecast for 2016.

Finally, I ran a simulation of 2016 (with team talent being the same as 2015). I compared the actual to the forecast, and calculated the SD of the forecast errors.
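A minimal sketch of that procedure follows. It is not the original simulation code, and several details are assumptions: 15 teams per league, the sample (n-1) SD, root-mean-square error as the accuracy measure, and forecasts centered on 81 wins. All function names are hypothetical, and the exact figures it produces will differ from the ones reported here.

```python
import random
import statistics

GAMES = 162
TALENT_SD = 9.0  # talent spread assumed in the post

def draw_talents(rng, n_teams=15):
    """Draw talents (in wins) until their sample SD is between 8.9 and 9.1."""
    while True:
        talents = [rng.gauss(GAMES / 2, TALENT_SD) for _ in range(n_teams)]
        if 8.9 <= statistics.stdev(talents) <= 9.1:
            return talents

def play_season(rng, talents):
    """162 independent games per team, win probability = talent / 162."""
    return [sum(rng.random() < t / GAMES for _ in range(GAMES)) for t in talents]

def forecast(wins):
    """Regressing and then 'unregressing' are both linear rescalings of each
    team's deviation from the mean, so the net effect is a single rescale
    that forces the SD of the forecasts to exactly 9."""
    mean, sd = statistics.mean(wins), statistics.stdev(wins)
    return [GAMES / 2 + TALENT_SD * (w - mean) / sd for w in wins]

def simulate(n_seasons=500, seed=0):
    """Return (SD of 2015 standings, RMS 2016 forecast error) per league."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_seasons):
        talents = draw_talents(rng)
        wins_2015 = play_season(rng, talents)
        predicted = forecast(wins_2015)
        wins_2016 = play_season(rng, talents)  # same talent, fresh luck
        sq = [(a - f) ** 2 for a, f in zip(wins_2016, predicted)]
        rows.append((statistics.stdev(wins_2015), (sum(sq) / len(sq)) ** 0.5))
    return rows

def ols_slope(rows):
    """Least-squares slope of forecast error on standings SD."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)

if __name__ == "__main__":
    rows = simulate()
    print("average forecast error:", round(statistics.mean([r[1] for r in rows]), 2))
    print("change in error per win of standings SD:", round(ols_slope(rows), 3))
```

The slope is noisy with only a few hundred leagues, so raise `n_seasons` if you want a stable estimate of how forecast error changes with the spread of last year's standings.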

The results came out, I think, very reasonable.

Over 4,000 simulated seasons, the average accuracy was an SD of 7.9. But, the higher the SD of last year's standings, the better the accuracy:

SD Standings    SD next year's forecast
---------------------------------------
 7.0            8.48   (2015 AL)
 8.0            8.31
 9.0            8.14
10.0            7.98
11.0            7.81
12.0            7.64
12.6            7.54   (2015 NL)
13.0            7.47
14.0            7.31
20.1            6.29   (1962 NL)

So, by this reckoning, you'd expect the 2016 NL predictions to have been one win more accurate than the AL predictions.

They were "much more accurater" than that, of course, by 3.4 or 4.5. The main reason, of course, is that there's a lot of luck involved. Less importantly, this simulation is very rough. The model is oversimplified, and there's no assurance that the relationship is actually linear. (In fact, the relationship *can't* be linear, since the "speed of light" limit is 6.4, and the model says the 1962 NL would beat that, at 6.3.)
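The linearity problem is easy to check with a line drawn through two rows of the table. This sketch assumes the "speed of light" figure is the binomial SD for a .500 team over 162 games, which is presumably where the 6.4 comes from. The helper name `line_through` is hypothetical.

```python
import math

# Two rows from the table:
# (SD of last year's standings, SD of next year's forecast errors).
LOW = (7.0, 8.48)
HIGH = (14.0, 7.31)

def line_through(p, q):
    """Return the straight line through points p and q as a function of x."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    return lambda x: p[1] + slope * (x - p[0])

predict = line_through(LOW, HIGH)
floor = math.sqrt(162 * 0.5 * 0.5)  # assumed luck-only limit, about 6.36

gap = predict(7.0) - predict(12.6)  # 2015 AL vs. 2015 NL
print(round(gap, 2), round(predict(20.1), 2), round(floor, 2))
```

The line gives a gap of a bit under one win between the two 2015 leagues, and its value at a standings SD of 20.1 falls below the luck-only floor, which no real forecast could beat on average.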

It's just a very rough regression to get a very rough estimate.

But the results seem reasonable to me. Going into 2016, we had (a) the narrowest standings in baseball history in the 2015 AL, and (b) a wider-than-average, 70th percentile spread in the 2015 NL. In that light, an expected difference of 1 win, in terms of forecasting accuracy, seems very plausible.

--------

So that's my explanation of why this year's NL forecasts were so accurate, while this year's AL forecasts were mediocre. A large dose of luck -- assisted by a small (but significant) dose of extra information content in the standings.

Labels: baseball, forecasting, luck, mlb, pythagoras, statistics

4 Comments:

At Monday, October 24, 2016 10:52:00 PM, Anonymous said...: Couple of comments Phil. Good article by the way. Any projection system that uses data such as last year's w/l records, is garbage. Any credible system is going to use player projections along with playing time projections. Of course last year's w/l records won't matter except to the extent that they reflect real parity, which they should regardless of the w/l records of the previous or subsequent years (to be fair, those provide somewhat of a pre and post Bayesian probability of there being parity in the middle year.) unless the third order wins have a normal SD. That's because 3rd order wins always trump regular wins (more or less).

Also, I am pretty sure that "reverse regressing" w/l records is always wrong and will always give you a worse projection. You ALWAYS want to take last year's w/l and regress them toward the mean regardless of how small the SD is. If you get a better projection by regressing away from the mean, it is just a fluke. It seems like you suggest in the article that it was "correct" to regress away from the mean when the SD of wins is small.

MGL
At Monday, October 24, 2016 11:13:00 PM, Phil Birnbaum said...: Hi, MGL,

I think that "reverse regressing" CAN work, if you know (or have a very good idea) that the true talent is wider than the observed standings.

Imagine that you have two teams, one you know has 91 game talent and another you know has 71 game talent, but you don't know which is which. Team A goes 88-73, and team B goes 73-88. Your best estimate is to regress both AWAY from the mean towards 91 and 71 wins. (Or, 90.9 and 71.1, or whatever the Bayesian calculation gives you.)
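For anyone who wants to see the Bayesian calculation for that two-team example, here is a sketch. It assumes a 50/50 prior on which team is which and treats every game as independent, with win probability equal to talent divided by 162. The function names are hypothetical.

```python
import math

def log_likelihood(wins, losses, talent_wins, games=162):
    """Log-likelihood of a record for a team of the given talent."""
    p = talent_wins / games
    return wins * math.log(p) + losses * math.log(1 - p)

def posterior_a_is_better(rec_a, rec_b, hi=91, lo=71):
    """P(team A is the `hi`-talent team), assuming a 50/50 prior."""
    a_hi = log_likelihood(*rec_a, hi) + log_likelihood(*rec_b, lo)
    a_lo = log_likelihood(*rec_a, lo) + log_likelihood(*rec_b, hi)
    return 1 / (1 + math.exp(a_lo - a_hi))

p = posterior_a_is_better((88, 73), (73, 88))
est_a = p * 91 + (1 - p) * 71  # expected talent of team A
est_b = p * 71 + (1 - p) * 91  # expected talent of team B
print(round(p, 4), round(est_a, 2), round(est_b, 2))
```

Under those assumptions the posterior is above 99.9 percent that Team A is the 91-win team, so the estimates land very close to 91 and 71, further from the mean than the observed records.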
At Monday, October 24, 2016 11:18:00 PM, Phil Birnbaum said...: Agreed that projection systems will use player projections. But if last year's standings are compressed, even after adjusting for Pythag and Runs Created luck, probably the player performances were "too close" to the mean. So, widening your SD over what the normal system will give you might be a good idea.

It all hinges on how confident you are that the standings won't be that compressed.

Of course, if you're using more than one year of player data for your current projections, last year being compressed won't affect your forecasts that much, and you may not have to regress away from the mean much (if at all).

BTW, doesn't FiveThirtyEight use player projections initially, but then just use game results (ELO) to adjust their forecasts as the season progresses? Or maybe not, I haven't looked too deeply.
At Sunday, November 06, 2016 8:10:00 PM, Don Coffin said...: Phil--I've got something that may fit in BTN, but I'm not set up to use outlook, so I can't access your email address...could you email me at dcoffin@iun.edu? Thanks.


Monday, October 24, 2016
