How well can anyone predict the NFL standings?
In a post at The Sports Economist, Brian Goff reports that pre-season predictions of NFL teams’ rankings turned out to be not all that accurate. Comparing predicted to actual ranking within the conference, Peter King (of Sports Illustrated) came up with an r-squared of only .11. His colleague “Dr. Z” did a bit better, at .21, while the ranking implied by Las Vegas oddsmakers did the best, at .26.
But you’d expect that correlations would appear fairly weak if you use rankings. If the talent distribution of teams is shaped like a bell curve, many of the teams will be bunched together in the middle. Those teams would be so close together in talent that their actual results are effectively random. Once you combine the “random” teams in the middle with the obvious choices at the top and bottom, those numbers don’t seem all that bad.
To check, I ran a simulation. First, I chose a random “talent” for each of the 16 teams in a conference, from a normal distribution with mean .500 and SD of .143 (per Tangotiger’s technique). Then, I ran an independent 16-game season for each team, where their chances of winning any given game were equal to their talent. Finally, I computed the r-squared between their talent rank and the rank of their actual performance (both ranks from 1 to 16). I ran this simulation 10,000 times and got the average r-squared.
r-squared = 0.53
This was a lot higher than I expected; but, remember, this is an upper bound on how well any prediction can do. It assumes that the predictor is capable of knowing a team’s actual talent level. In reality, that’s not possible. Even if you’ve watched every NFL game ever, the players’ historical performances have a fair bit of luck embedded in them too. If a QB had a passer rating of, say, 80 in his rookie season, you don’t really know if he was a 90 who had an unlucky year, or a 70 who had a lucky year, so your estimates are going to be somewhat off.
Suppose we build that uncertainty into the model, so that instead of correlating wins and actual talent, we correlate wins with a *guess* at the actual talent. We get the guess by starting with the talent, but adding an error term, with mean 0 and SD of 0.050.
If we do that, we now get:
r-squared = 0.48
Now, what about talent lost to injuries and criminal convictions and so forth? Those are unpredictable too. Suppose we lump those together with the other SD, and raise it from .050 to .070. Repeating the simulation, I get:
r-squared = 0.44
This is still quite a bit higher than the actual performance of even the Las Vegas oddsmakers. What could explain the difference? It could be that our assumptions were too conservative: maybe the limit on human ability to predict talent isn’t actually .070, but even higher. But to get the r-squared down to .26, I had to raise the uncertainty from .070 all the way to .160, which seems way too high.
Another possibility is that the pundits just had a bad year. The SDs of the r-squareds over the 10,000 trials were all in the 0.19 range. So an r-squared in the .20s would be only a bit more than 1 SD below the simulated mean (.44 − .26 = .18), and should happen reasonably often.
One more factor: in my simulation, all the teams’ records were independent (as if every game was against a team in the other conference). But in real life, teams play each other, and an upset thus impacts two teams instead of one. I’m too lazy to simulate that, but I’d bet the theoretical r-squared would go down, because every upset simultaneously pushes one team’s record up and its opponent’s down, increasing the variance of the gap between them.
But in any case, and even with knowledge as perfect as anyone can have, the r-squared couldn’t get any higher than somewhere in the .40s. In that light, the Vegas figure of .26 is looking pretty reasonable.