Predictions should be narrower than real life
Every year since 1983 (strike seasons excepted), at least one MLB team finished with 97 wins or more. More than half the time, the top team had 100 wins or more.
In contrast, if you look at ESPN's 2014 team projections, their highest team forecast is 93-69.
What's going on? Does ESPN really expect no team to win more than 93 games?
Nope. I bet ESPN would give you pretty good odds that some team *will* win 94 games or more, once you add in random luck.
The standard deviation (SD) of team wins due to binomial randomness is around 6.4. That means that, on average, five or six of the 30 teams each season will be lucky by six wins or more. If you have a bunch of teams forecasted in the low 90s -- and ESPN has five of those -- chances are, one of them will get lucky and finish around a hundred wins.
But you can't predict which teams will get that luck. So, if you care only about getting the best accuracy of the individual team forecasts, you're always going to project a narrower range than the actual outcomes.
A more obvious way to see that is to imagine simulating a season by flipping coins -- heads the home team wins, tails the visiting team wins. Obviously, any one team is as "good" as any other. Under those circumstances, the best prediction you can make is that every team will go 81-81. Of course, that probably won't happen, and some teams, by just random chance, will go 92-70, or something. But you don't know which teams those will be, and, since it's just luck, there's no way of being able to guess.
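Here's a rough sketch of that coin-flip league in Python. It assumes every game is an independent fair coin flip for each team, which is a simplification: it ignores the schedule, so league-wide wins and losses won't balance exactly. The function names and the seed are just mine, for illustration.

```python
import math
import random
import statistics


def binomial_sd(games=162, p=0.5):
    """SD of wins when every game is an independent coin flip."""
    return math.sqrt(games * p * (1 - p))


def simulate_season(teams=30, games=162, rng=None):
    """Win totals for one season in which every team is a .500 coin."""
    if rng is None:
        rng = random.Random()
    return [sum(rng.random() < 0.5 for _ in range(games)) for _ in range(teams)]


rng = random.Random(2014)  # arbitrary seed, only for repeatability
# Pool 200 simulated seasons so the SD estimate is reasonably stable.
wins = [w for _ in range(200) for w in simulate_season(rng=rng)]

print("theoretical SD of luck:", round(binomial_sd(), 2))
print("simulated SD of wins:  ", round(statistics.pstdev(wins), 2))
print("best and worst records:", max(wins), min(wins))
```

Every team in this league has the same 81-81 expectation, yet the simulated standings still spread out by about six wins per team.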
It's the same logic for real baseball. No *specific* team can be expected to win more than 93 games. Some teams will probably win more than 96 by getting lucky, but there's no way of predicting which ones.
That's why your range of projections has to be narrower than the expected standings. How much narrower?
Over the past few seasons, the SD of team wins has been around 11. Actually, it fluctuates a fair bit (which is expected, due to luck and changing competitive balance). In 2002, it was over 14.5 wins; in 2007, it was as low as 9.3. But 11 is pretty typical.
Since a team's observed performance is the sum of talent and luck, and because talent and luck are independent,
SD(observed)^2 = SD(talent)^2 + SD(luck)^2.
Since SD(observed) equals 11, and SD(luck) = 6.4, we can figure that, after rounding,
SD(talent) = 9
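For anyone who wants to check the arithmetic, here's a small sketch. The function name is mine; the inputs are the 11 and 6.4 from above.

```python
import math


def talent_sd(observed_sd, luck_sd):
    """Solve SD(observed)^2 = SD(talent)^2 + SD(luck)^2 for SD(talent)."""
    return math.sqrt(observed_sd**2 - luck_sd**2)


print(round(talent_sd(11.0, 6.4), 2))  # 8.95, which rounds to 9
```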
So: if a season prediction has an SD that's significantly larger than 9, that's a sign that someone is trying to predict which teams will be lucky. And that's impossible.
As I wrote before, it's *really* impossible, as in, "exceeding the speed of light" impossible. It's not a question of just having better information about the teams and the players. The "9" already assumes that your information is perfect -- "perfect" in the sense of knowing the exact probability of every team winning every game. If your information isn't perfect, you have to settle for even less than 9.
Let's break down the observed SD of 11 even further. Before, we had
Observed = talent + luck
We can change that to:
Observed = talent we can estimate + talent we can't estimate + luck
Clearly, we'd be dumb in trying to estimate talent we don't know about -- by definition, we'd just be choosing random numbers. What kind of things are there that affect team talent that we can't estimate? Lots:
-- which players will get injured, and for how long?
-- which players will blossom in talent, and which will get worse?
-- how will mid-season trades change teams' abilities?
-- which players will the manager choose to play more or less than expected?
How big are those issues? I'll try guessing.
For injuries: looking at this FanGraphs post, the SD of team player-days lost to injury seems to be around 400. If the average injured player has a WAR of 2.0, that's an SD of about 5 wins (400 games is around 2.5 seasons).
But that's too high. A WAR of 2.0 is the standard estimate for full-time players, but there are many part-time players whose lost WAR would be negligible. The FanGraphs data might also include long-term DL days, where the team would have had time to find a better-than-replacement substitute.
I don't know what the right number is ... my gut says, let's change the SD to 2 wins instead of 5.
What about players blossoming in talent? I have no idea. Another 2 wins? Trades ... call it 1 win? And, playing time ... that could be significant. Call that another 2 wins.
Use your own estimates if you think I don't know what I'm talking about (which I don't). But for now, we have:
9 squared equals
-- SD of known talent squared +
-- 2 squared [injuries] +
-- 2 squared [blossoming] +
-- 1 squared [trades] +
-- 2 squared [playing time].
Solving, and rounding, we get
SD of known talent = 8
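If you want to plug in your own guesses, something like this sketch would do it. It assumes, as the calculation above does, that the unknowable pieces are independent of each other, so their variances add. The helper name and the dictionary are mine, for illustration.

```python
import math


def remaining_sd(total_sd, *component_sds):
    """SD left after removing independent components from a total SD."""
    remaining_var = total_sd**2 - sum(sd**2 for sd in component_sds)
    if remaining_var < 0:
        raise ValueError("components explain more variance than the total")
    return math.sqrt(remaining_var)


# The guesses from the text, in wins.
unknowable = {"injuries": 2, "blossoming": 2, "trades": 1, "playing time": 2}

known = remaining_sd(9, *unknowable.values())
print(round(known, 2))  # 8.25, i.e. about 8
```

Swap in different numbers for the four guesses and the "known talent" SD moves accordingly.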
What that means is: the SD of your predictions shouldn't be more than around 8. If it is, you're trying to predict something that's random and unpredictable.
And, again: that calculation is predicated on *perfect knowledge* of everything that we haven't listed here. If you don't have perfect knowledge -- which you don't -- you should be even lower than 8.
In a sense, the SD of your projections is a measure of your confidence. The higher the SD, the more you think you know. A high SD is a brag. A low SD is humility. And, a too-high SD -- one that violates the "speed of light" limit -- is a sign that there's something wrong with your methodology.
What about an easy, naive prediction, where we just project based on last year's record?
This blog post found a correlation of .58 between team wins in 2012 and 2013. That would suggest that, to predict next year, you take last year, and regress it to the mean by around 42 percent.
If you do that, your projections would have an SD of 6.38. (It's coincidence that it works out nearly identical to the SD of luck.)
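Here's a sketch of where the 6.38 comes from. Regressing 42 percent to the mean multiplies every team's distance from 81 by .58, so the SD of the projections is .58 times the SD of last year's wins. The win totals below are made up for illustration (not real standings), and the function name is mine.

```python
import statistics

LEAGUE_MEAN = 81
R = 0.58  # year-to-year correlation quoted in the post


def naive_projection(last_year_wins, r=R, mean=LEAGUE_MEAN):
    """Regress last year's win total toward the mean by (1 - r)."""
    return mean + r * (last_year_wins - mean)


# Made-up win totals, for illustration only (not real standings).
last_year = [97, 92, 88, 85, 81, 78, 74, 70, 64]
projected = [naive_projection(w) for w in last_year]

# Shrinking every deviation by r shrinks the SD by the same factor.
print(round(statistics.pstdev(projected) / statistics.pstdev(last_year), 2))
print(round(R * 11, 2))  # 6.38, the SD of projections if last year's SD was 11
```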
I'd want to check the correlation for other pairs of years, to get a bigger sample size for the .58 estimate. But, 6.38 does seem reasonable. It's less than 8, which assumes excellent information, and it's closer to 8 than 0, which makes sense, since last year's record is still a pretty good indicator of how good a team will be this year.
A good practical number has to be somewhere between 6.38 (where we only use last year's record), and 8 (where we have perfect information, everything that can truly be known).
Where in between? For that, we can look to the professional bookmakers.
I think it's safe to assume that bookies pretty much need to have the most accurate predictions of anyone. If they didn't, smart bettors would bankrupt them.
The Bovada pre-season Over/Under lines had a standard deviation of 7.16 wins. The Las Vegas Hotel and Casino also came in at 7.16. (That's probably coincidence -- their lines weren't identical, just the SD.)
7.16 seems about right. It's almost exactly halfway between 6.38 and 8.00.
If we accept that number, it turns out that more of the season is unpredictable than predictable. The SD of 11 wins per team comes from 7.2 wins that the sports book can figure out, and 8.3 that the sports book *can't* figure out (and is probably unknowable).
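As a quick check of that split, here's a sketch using the rounded 7.2. It assumes the predictable and unpredictable parts are independent, so the variances add up to the observed 11 squared. The variable names are mine.

```python
import math

observed_sd = 11.0
predictable_sd = 7.2  # the bookmakers' SD, rounded as in the text

unpredictable_sd = math.sqrt(observed_sd**2 - predictable_sd**2)
predictable_share = predictable_sd**2 / observed_sd**2

print(round(unpredictable_sd, 1))   # 8.3
print(round(predictable_share, 2))  # 0.43 of the variance is predictable
```

In variance terms, less than half of the spread in the standings is something the sports book can see coming.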
So, going back to ESPN: how did they do? When I saw they didn't predict any teams higher than 93 wins, I suspected their SD would come out reasonable. And, yes, it's OK -- 7.79 wins. A little bit immodest, in my judgment, but not too bad.
I decided to check some others. I did a Google search to find all the pre-season projections I could, and then added the one I found in Sports Illustrated, and the one from "ESPN The Magazine" (without their "chemistry" adjustments).
Here they are, in order of [over]confidence:
11.50 Sports Illustrated
9.23 Mark Townsend (Yahoo!)
9.00 Speed of Light (theoretical est.)
8.76 Jeff Passan (Yahoo!)
8.72 ESPN The Magazine
8.53 Jonah Keri (Grantland)
8.51 Sports Illustrated (runs/10)
8.00 Speed of Light (practical est.)
7.83 Average ESPN voter (FiveThirtyEight)
7.79 ESPN Website
7.78 Mike Oz (Yahoo!)
7.58 David Brown (Yahoo!)
7.16 Vegas Betting Line
6.90 Tim Brown (Yahoo!)
6.38 Naive previous year method (est.)
5.55 FanGraphs (4/7/14)
0.00 Predict 81-81 for all teams
Anything that's not from an actual prediction is an estimate, as discussed in the post. Even the theoretical "speed of light" is an estimate, since we arbitrarily chose 11.00 as the SD of observed wins. None of the estimates are accurate to two decimals (or even one decimal), but I left them in to make the chart look nicer.
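If you want to run the same check on a set of projections, a sketch like this would do it. It assumes the population SD (dividing by the number of teams); the chart doesn't say which flavor of SD it uses, so treat that as my assumption. The six-team example is made up, and the function names and verdict labels are mine.

```python
import statistics

PRACTICAL_LIMIT = 8.0    # the post's "practical" estimate
THEORETICAL_LIMIT = 9.0  # the post's "speed of light" estimate


def projection_sd(projected_wins):
    """Population SD of a list of projected win totals."""
    return statistics.pstdev(projected_wins)


def verdict(sd):
    """Compare a forecast's SD against the two estimated limits."""
    if sd > THEORETICAL_LIMIT:
        return "beyond the theoretical limit"
    if sd > PRACTICAL_LIMIT:
        return "beyond the practical limit"
    return "within the limits"


# Made-up projections for a six-team league, for illustration only.
example = [93, 88, 84, 79, 75, 67]
print(round(projection_sd(example), 2), verdict(projection_sd(example)))
print(projection_sd([81] * 30), verdict(0.0))  # all 81-81 gives an SD of 0
```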
Sports Illustrated ran two sets of predictions: wins, and run differential. You'd think they would have predicted wins by using Pythagoras or "10 runs equals one win", but they didn't. It turns out that their runs estimates are much more reasonable than their win predictions.
FanGraphs seems way too humble. I'm not sure why. FanGraphs updates their estimates every day, for the season's remaining games. At time of writing, most teams had 156 games to go, so I pro-rated everything from 156 to 162. Still, I think I did it right; their estimates are very narrow, with no team projected better than 90-72.
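One plausible way to do that pro-rating is sketched below. It assumes a straight scaling of the rest-of-season win projection by 162/156; the function name and the example number are hypothetical.

```python
def prorate_wins(projected_wins, games_remaining, season_games=162):
    """Scale a rest-of-season win projection up to a full season."""
    return projected_wins * season_games / games_remaining


# A made-up team projected for 78 wins over its remaining 156 games.
print(prorate_wins(78, 156))  # 81.0, a .500 team over a full season
```

Scaling every team by the same factor scales the SD by that factor too, so this adjustment widens the spread only slightly.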
FiveThirtyEight got access to the raw projections of the ESPN voters, and ran a story on them. It would be fun to look at the individual voters, and see how they ranged, but FiveThirtyEight gave us only the averages (which, after rounding, are identical to those published on the ESPN website).
If you have any others, let me know and I'll add them in.
If you're a forecaster, don't think you need to figure out what your SD should be, and then adjust your predictions to it. If you have a logical, reasonable algorithm, it should just work out. Like, when we realized we had to predict 81-81 for every team in the coin-flip league. We didn't need to say, "hmmm, how do I get SD to come out to zero?" We just realized that we knew nothing, so 81-81 was the right call.
The SD should be a check on your logic, not necessarily part of your algorithm.
-- Last year, Tango and commenters discussed this in some depth and tested the SDs of some 2013 predictions.
-- Here's my post on why you shouldn't judge predictions by how well they match outliers.
-- Yesterday, FiveThirtyEight used this argument in the context of not wanting to predict the inevitable big changes in the standings.
-- IIRC, Tango had a nice post on how narrow predictions and talent estimates should be, but I can't find it right now.