Last December, David Lewin and Dan Rosenbaum released a working version of a fascinating APBRmetrics paper. It's called "The Pot Calling the Kettle Black – Are NBA Statistical Models More Irrational than 'Irrational' Decision Makers?"
I can't find the paper online any more (anyone know where it is?), but a long message-board discussion is here, and some presentation slides are here.
Basically, the paper compares the state-of-the-art basketball sabermetric statistics to more naive factors that, presumably, uneducated GMs use when deciding who to sign and how much to pay. You'd expect that the sabermetric stats should correlate much better with winning. But, as it turns out, they don't, at least not by much.
Here's what the authors did. For each team, they computed various statistics, sabermetric and otherwise, for each player. Let me list them:
-- Minutes per game
-- Points per game
-- NBA Efficiency
-- Player Efficiency Rating
-- Wins Produced
-- Alternate Win Score
Except for the first two, the study adjusted each stat by normalizing it to the position average. More importantly, they scaled all six stats so that a team's individual player stats summed to the team's actual efficiency (points scored per 100 possessions minus points allowed per 100 possessions).
That team adjustment is important. I can phrase the result of that adjustment a different way: the authors took the team's performance (as measured by efficiency) and *allocated it among the players* six different ways, corresponding to the six different statistics listed above. (Kind of like six different versions of Win Shares, I guess).
For the most "naive" stat, minutes per game: suppose the team, on average, scored five more points per 100 possessions than its opponents. And suppose one player played half the game (24 minutes), which is one-tenth of the team's 240 player-minutes. Then he'd be credited with one-tenth of the +5, or +0.5 points per 100 possessions.
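Here's a minimal sketch of that allocation in Python – my own reconstruction for illustration, not the authors' code, and the player names and minutes are made up:

```python
# Sketch of the team adjustment described above (my reconstruction, not the
# paper's code). Each player's share of the team's net efficiency is
# allocated in proportion to his share of some stat, e.g. minutes per game.

def allocate_team_efficiency(player_stats, team_efficiency):
    """Split a team's net efficiency (points per 100 possessions minus
    opponents' points per 100 possessions) among its players, proportional
    to each player's share of the chosen stat."""
    total = sum(player_stats.values())
    return {name: team_efficiency * value / total
            for name, value in player_stats.items()}

# The worked example from the text: a +5 team, 240 player-minutes per game,
# and a player ("E") who logs 24 of them.
minutes_per_game = {"A": 38, "B": 36, "C": 34, "D": 30, "E": 24, "Bench": 78}
credits = allocate_team_efficiency(minutes_per_game, team_efficiency=5.0)
print(credits["E"])  # 24/240 of +5.0 = +0.5 points per 100 possessions
```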
So the authors did this for all six statistics. Now, for the current season, every team's player credits will sum to its actual efficiency, because that's the way the stats were engineered. So you can't learn anything about the relative worth of the stats by using the current season.
But what if you use *next* season? Now, you can evaluate how well the stats predict wins. That's because some players will have moved around. Suppose a team loses players A, B, and C in the off-season, but signs players X, Y, and Z.
Using minutes per game, maybe A, B, and C were worth +1 win combined, while X, Y, and Z were worth +2 wins. In that case, the team "should" – if minutes per game is a good stat – gain one win over last year.
But using Wins Produced, maybe A, B, and C were worth 1.5 wins, and X, Y, and Z were also worth 1.5 wins. Then, if Wins Produced is accurate, the team should finish the same as last year.
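The bookkeeping behind that is simple: a team's prediction for this season is just the sum of last season's player credits over whoever is on the roster now. A rough sketch, again with hypothetical numbers rather than the paper's data:

```python
# Sketch of the prediction step (hypothetical data, not the paper's code):
# a team's predicted value this season is the sum of last season's per-player
# credits for the players currently on its roster.

def predicted_value(last_year_credits, current_roster):
    """Sum last season's allocated credits over this season's roster.
    Players with no credit last season (e.g. rookies) count as zero."""
    return sum(last_year_credits.get(player, 0.0) for player in current_roster)

# Under minutes per game: the departed A, B, C were worth +1.0 combined,
# the newly signed X, Y, Z were worth +2.0, so the prediction rises by one win.
last_year = {"A": 0.3, "B": 0.3, "C": 0.4, "D": 1.0, "E": 0.5,
             "X": 0.8, "Y": 0.7, "Z": 0.5}
old_roster = ["A", "B", "C", "D", "E"]
new_roster = ["D", "E", "X", "Y", "Z"]
print(predicted_value(last_year, old_roster))  # 2.5
print(predicted_value(last_year, new_roster))  # 3.5, one win higher
```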
By running this analysis on all six stats, and all teams, you should be able to figure out which stat is best. And you'd expect that the naive methods should be the worst – if sabermetric analysis is worth anything, wouldn't you think it should be able to improve on "minutes per game" in telling us which players are best?
But, surprisingly, the naive methods weren't that much worse than the sabermetric ones. Lewin and Rosenbaum regressed this year's wins on last year's player stats (summed over each team's current roster), and here are the correlation coefficients (r) they got:
0.823 -- Minutes per game
0.817 -- Points per game
0.820 -- NBA Efficiency
0.805 -- Player Efficiency Rating
0.803 -- Wins Produced
0.829 -- Alternate Win Score
It turns out that the method you'd think was least effective – minutes per game – outperformed almost all the other stats. The worst predictor was "Wins Produced," the carefully derived stat featured in "The Wages of Wins." (BTW, not all the differences in correlations were statistically significant, but the more extreme ones were.)
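If you wanted to replicate the comparison, the calculation itself is straightforward: for each stat, correlate the teams' summed player credits with their actual results the following season, and the stat with the higher r predicts better. A toy illustration with invented numbers (the real study used all NBA teams over several seasons):

```python
# Toy illustration of the comparison, with invented numbers (not the paper's data).
from statistics import correlation  # Pearson's r; requires Python 3.10+

predicted_wins = [45, 52, 38, 60, 41]  # e.g. implied by minutes-per-game credits
actual_wins    = [43, 55, 35, 58, 46]  # the same teams' actual wins next season
print(round(correlation(predicted_wins, actual_wins), 3))
```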
And when the authors repeated the analysis predicting two and three seasons forward, the results were very similar.
So what's going on? Are general managers actually better at evaluating NBA players than existing sabermetric analysis? The authors think so:
"Our findings ... suggest that models that assume simplistic NBA decision-making often outperform more sophisticated statistical models."
I agree. But I don't think it's because GMs are omniscient – I think it's because even the best statistics measure only part of the game.
All of the above measures are based on "box score statistics" – things that are actually counted during the game. And there are more things counted on offense than on defense. For instance, shooting percentage factors into most of the above stats, but what about *opponents'* shooting percentage? That isn't considered at all, even though we can all agree that forcing your opponent to take low-percentage shots is a major part of the defense's job. The team adjustment does fold it into the player ratings, but the credit gets spread across the whole roster rather than going to the defenders actually responsible.
So: if coaches and general managers know how good a player is on defense (which presumably they do), and Wins Produced doesn't, then it's no surprise that GMs outperform stats.
-----
Take a baseball analogy. In the National League, which correlates better with team wins when summed over a team's players – wOBA, or GMs' evaluations? It would definitely be the GMs' evaluations. Why? Because of pitching. A GM would certainly take pitching into account, but wOBA doesn't. That doesn't mean wOBA is a bad stat, just that it doesn't measure *total* goodness, only hitting goodness.
Another one, more subtle: what would correlate better with wins – wOBA or at-bats (AB)? It could go either way, again because of pitching. Better pitchers get more playing time, and therefore more AB, so good pitching correlates with AB (albeit weakly). But good pitchers don't necessarily have a better wOBA. So AB would be the better of the two for capturing pitching prowess (although, of course, it would still be a very poor measure of it).
That means that if you run a regression using AB, you get a worse measure for hitters, but a better measure for pitchers. If you use wOBA, you get a better measure for hitters, but a worse measure for pitchers. Which gives you a better correlation with wins? We can't tell without trying.
-----
What Lewin and Rosenbaum are saying is that, in basketball right now, sabermetric measures aren't good enough to compete with the judgments of GMs, and that APBRmetricians' confidence in their methods is unjustified. I agree. However, I'd argue that it's not that the new statistical methods are completely wrong, or bad, just that they don't measure enough of what needs to be measured.
If I wanted to reliably evaluate basketball players, I'd start with the most reliable of the above six sabermetric measures – Alternate Win Score. Then, I'd list all the areas of the game that AWS doesn't measure, like various aspects of defensive play. I'd ask the GMs, or knowledgeable fans, to rate each player in each of those areas. Then, after combining those evaluations with the results of AWS, I'd bet I'd wind up with a rating that kicks the other methods' respective butts.
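To make that concrete, here's a toy version of the blend I have in mind. The 70/30 weighting is pulled out of thin air, purely for illustration; a real version would fit the weight by regressing on future team results:

```python
# Toy version of the hybrid rating proposed above. The weight is arbitrary
# (illustration only); a real version would be fit empirically.

def hybrid_rating(aws_credit, defense_rating, box_score_weight=0.7):
    """Blend a box-score measure (a player's Alternate Win Score credit)
    with a subjective rating of what the box score misses, such as defense."""
    return box_score_weight * aws_credit + (1 - box_score_weight) * defense_rating

print(round(hybrid_rating(aws_credit=2.1, defense_rating=1.5), 2))  # 1.92
```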
But, until then, I have to agree with the authors of this paper – the pot is indeed calling the kettle black. It looks like humans *are* better at evaluating talent than any of the widely available stats.
Labels: basketball, NBA