Thursday, August 30, 2007

How well can anyone predict the NFL standings?

In a post at The Sports Economist, Brian Goff reports that pre-season predictions of NFL teams’ rankings turned out to be not all that accurate. Comparing predicted to actual ranking within the conference, Peter King (of Sports Illustrated) came up with an r-squared of only .11. His colleague “Dr. Z” did a bit better, at .21, while the ranking implied by Las Vegas oddsmakers did the best, at .26.

But you’d expect that correlations would appear fairly weak if you use rankings. If the talent distribution of teams is shaped like a bell curve, many of the teams will be bunched together in the middle. Those teams would be so close together in talent that their actual results are effectively random. Combining the “random” teams in the middle with the obvious choices at the top and bottom, those numbers don’t seem all that bad.

To check, I ran a simulation. First, I chose a random “talent” for each of the 16 teams in a conference, from a normal distribution with mean .500 and SD of .143 (per Tangotiger’s technique). Then, I ran an independent 16-game season for each team, where their chances of winning any given game were equal to their talent. Finally, I computed the r-squared between their talent rank and the rank of their actual performance (both ranks from 1 to 16). I ran this simulation 10,000 times and got the average r-squared.
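The simulation described above can be sketched in a few lines of Python. This is a rough reconstruction, not the original code: the function names, the fixed seed, the clipping of talent to the range 0 to 1, and the use of average ranks for teams tied on wins are all assumptions, since the post does not say how ties were handled.

```python
import random

def ranks(values):
    """Rank values from 1 (best) to n; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = shared
        start = end + 1
    return result

def r_squared(x, y):
    """Squared Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def one_season(rng, n_teams=16, n_games=16, talent_sd=0.143):
    """Simulate one conference; return r-squared of talent rank vs. win rank."""
    # Talent is a true winning percentage, clipped to [0, 1] (an assumption).
    talents = [min(max(rng.gauss(0.5, talent_sd), 0.0), 1.0) for _ in range(n_teams)]
    # Each team plays its own independent schedule, as in the post.
    wins = [sum(rng.random() < t for _ in range(n_games)) for t in talents]
    return r_squared(ranks(talents), ranks(wins))

def average_r_squared(trials=10000, seed=2007):
    """Average the per-season r-squared over many simulated seasons."""
    rng = random.Random(seed)
    return sum(one_season(rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print(round(average_r_squared(), 2))
```

The exact average will depend on the seed and on the tie-handling choice, so it need not match the figure reported in the post to the second decimal.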

The results:

r-squared = 0.53

This was a lot higher than I expected; but, remember, this is an upper bound on how well any prediction can do. It assumes that the predictor is capable of knowing a team’s actual talent level. In reality, that’s not possible. Even if you’ve watched every NFL game ever, the players’ historical performances have a fair bit of luck embedded in them too. If a QB had a passer rating of, say, 80 in his rookie season, you don’t really know if he was a 90 who had an unlucky year, or a 70 who had a lucky year, so your estimates are going to be somewhat off.

Suppose we build that uncertainty into the model, so that instead of correlating wins and actual talent, we correlate wins with a *guess* at the actual talent. We get the guess by starting with the talent, but adding an error term, with mean 0 and SD of 0.050.

If we do that, we now get:

r-squared = 0.48

Now, what about talent lost to injuries and criminal convictions and so forth? Those are unpredictable too. Suppose we lump those together with the other SD, and raise it from .050 to .070. Repeating the simulation, I get

r-squared = 0.44

This is still quite a bit higher than the actual performance of even the Las Vegas oddsmakers. What could explain the difference? It could be that our assumptions were too conservative: maybe the limit on human ability to predict talent isn’t actually .070, but even higher. But to get the r-squared down to .26, I had to raise the uncertainty from .070 all the way to .160, which seems way too high.
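To get a feel for how large those error terms are, it helps to compare them with the talent spread itself. The sketch below uses the standard result that if a guess equals talent plus independent noise, the correlation between guess and talent is the talent SD divided by the SD of the guess. The function name is hypothetical, and the calculation is on raw talent values rather than ranks, so it is only a rough guide to the rank-based figures in the post.

```python
import math

TALENT_SD = 0.143  # spread of true talent, from the post

def guess_talent_correlation(error_sd, talent_sd=TALENT_SD):
    """Correlation between true talent and (talent + independent noise)."""
    return talent_sd / math.sqrt(talent_sd ** 2 + error_sd ** 2)

# The three error SDs discussed in the post.
for error_sd in (0.050, 0.070, 0.160):
    r = guess_talent_correlation(error_sd)
    print(f"error SD {error_sd:.3f}: r = {r:.2f}, r-squared = {r * r:.2f}")
```

At an error SD of .160, the noise in the guess is larger than the spread of talent itself (.143), which is one way to see why that figure looks implausibly high.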

Another possibility is that the pundits just had a bad year. The SDs of the r-squareds over the 10,000 trials were all in the 0.19 range. So an r-squared in the .20s would be only a bit more than 1 SD away, and so should happen reasonably often.
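The "bad year" arithmetic can be made explicit. The sketch below measures how many SDs each forecaster's r-squared falls below the simulated figure. Using .44 as the baseline is an assumption (the .48 figure would be just as defensible), as is treating each forecaster's result as a single draw from the simulated distribution.

```python
R2_SD = 0.19      # SD of r-squared across simulated seasons, from the post
BASELINE = 0.44   # simulated r-squared with error SD .070 (assumed baseline)

# r-squared figures reported for each forecaster in the post.
observed = {"Peter King": 0.11, "Dr. Z": 0.21, "Las Vegas": 0.26}

def sd_below(value, center=BASELINE, sd=R2_SD):
    """Number of SDs by which value falls below center."""
    return (center - value) / sd

for name, r2 in observed.items():
    print(f"{name}: {sd_below(r2):.2f} SDs below {BASELINE}")
```

Against the .44 baseline, the Vegas figure sits just under 1 SD below, Dr. Z about 1.2 SDs below, and Peter King about 1.7.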

One more factor: in my simulation, all the teams’ records were independent (as if every game was against a team in the other conference). But in real life, teams play each other, and upsets thus impact two teams instead of one. I’m too lazy to simulate that, but I’d bet the theoretical r-squared would go down, because the variance of the difference between two teams who wound up playing each other would increase.
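For anyone less lazy, here is one way the head-to-head version might be sketched. Everything in it is an assumption beyond what the post describes: each team plays every other team in the conference once (15 games, not a real NFL schedule), the chance of one team beating another comes from Bill James's log5 formula, talent is clipped to the range .01 to .99 to keep that formula well defined, and ties in wins get average ranks. No result is claimed here; the point is only to show the shape of the experiment.

```python
import random

def ranks(values):
    """Rank values from 1 (best) to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            result[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return result

def r_squared(x, y):
    """Squared Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def log5(a, b):
    """Chance that a team of talent a beats a team of talent b (log5 formula)."""
    return a * (1 - b) / (a * (1 - b) + b * (1 - a))

def play_round_robin(rng, talents):
    """Every pair of teams meets once; each game adds a win to exactly one team."""
    wins = [0] * len(talents)
    for i in range(len(talents)):
        for j in range(i + 1, len(talents)):
            if rng.random() < log5(talents[i], talents[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    return wins

def average_r_squared(trials=10000, n_teams=16, talent_sd=0.143, seed=2007):
    """Average r-squared of talent rank vs. win rank over many round-robin seasons."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        talents = [min(max(rng.gauss(0.5, talent_sd), 0.01), 0.99) for _ in range(n_teams)]
        total += r_squared(ranks(talents), ranks(play_round_robin(rng, talents)))
    return total / trials

if __name__ == "__main__":
    print(round(average_r_squared(), 2))
```

Because the schedule here is 15 games against conference opponents only, the output is not directly comparable to the 16-game independent-schedule figures above without also rerunning the independent version at 15 games.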

But in any case, and even with knowledge as perfect as anyone can have, the r-squared couldn’t get any higher than somewhere in the .40s. In that light, the Vegas figure of .26 is looking pretty reasonable.



At Friday, August 31, 2007 10:57:00 AM, Blogger Brian Burke said...

Your 0.48 r-squared result is almost exactly what I found doing a very similar exercise a few weeks ago.

And just last night I started putting together an article analyzing Vegas over-under win predictions and Football Outsiders predictions. It turns out that even the best pre-season predictions are worthless.

The test of a prediction should not be a comparison against zero knowledge. It should be against obvious knowledge. For example, the test of NFL season predictions should be a comparison of how much better it does than just using last year's results.

It turns out that just using a regression of last year's win totals to predict the following year's win totals has just as good predictive capability (judged by r-squared and mean absolute error) as Football Outsiders and Vegas. (Data from '05 and '06).

In fact, if you rank the teams in their divisions based on Vegas' current over-under totals, the rankings are virtually identical to last year's final standings! (The only exceptions are 2 ties.) I'll have more in a forthcoming article.

At Monday, September 03, 2007 12:39:00 AM, Blogger Tangotiger said...

In the blog post of mine you cite, I showed that the var(true) is .143^2, and var(observed) is .19^2. This implies an r of .143^2/.19^2 = .57, or r-squared of .32.

That is, given one 16 game sample, and presuming no change at all in talent level, you'll get an r-squared of .32 with another 16 game sample.

Of course, we have more information than just 16 games. So, we'd expect an r-squared of better than .32. Perfect knowledge would be the square root of .32, or .57. So, somewhere between the two is what the forecasters are fighting for.
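The arithmetic in this comment can be checked in a few lines. The sketch below is an illustration, not the commenter's own code: the variable names are hypothetical, and the binomial luck term (variance .25/16 for a .500 team over 16 games) is added here only to show that it is consistent with the .19 observed SD quoted above.

```python
import math

TRUE_SD = 0.143       # SD of true talent, from the comment
OBSERVED_SD = 0.19    # SD of observed winning percentage, from the comment
GAMES = 16

# Luck variance of a winning percentage over 16 games for a .500 team (approximation).
luck_var = 0.25 / GAMES
implied_observed_sd = math.sqrt(TRUE_SD ** 2 + luck_var)

# Share of observed variance due to talent: the r between two 16-game samples.
season_to_season_r = TRUE_SD ** 2 / OBSERVED_SD ** 2
season_to_season_r2 = season_to_season_r ** 2

# With perfect knowledge of talent, r-squared against one season equals that same share.
perfect_knowledge_r2 = math.sqrt(season_to_season_r2)

print(f"implied observed SD: {implied_observed_sd:.3f}")
print(f"season-to-season r: {season_to_season_r:.2f}, r-squared: {season_to_season_r2:.2f}")
print(f"perfect-knowledge r-squared: {perfect_knowledge_r2:.2f}")
```

These are correlations on winning percentages, not on ranks, so they are a cross-check on the simulated figures in the post rather than a replacement for them.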
