## Friday, December 12, 2008

### Do NBA general managers outperform sabermetric stats?

Last December, David Lewin and Dan Rosenbaum released a working version of a fascinating APBRmetrics paper. It's called "The Pot Calling the Kettle Black – Are NBA Statistical Models More Irrational than 'Irrational' Decision Makers?"

I can't find the paper online any more (anyone know where it is?), but a long message-board discussion is here, and some presentation slides are here.

Basically, the paper compares the state-of-the-art basketball sabermetric statistics to more naive factors that, presumably, uneducated GMs use when deciding who to sign and how much to pay. You'd expect that the sabermetric stats should correlate much better with winning. But, as it turns out, they don't, at least not by much.

Here's what the authors did. For each team, they computed various statistics, sabermetric and otherwise, for each player. Let me list them:

-- Minutes per game

-- Points per game

-- NBA Efficiency

-- Player Efficiency Rating

-- Wins Produced

-- Alternate Win Score

Except for the first two, the study adjusted each stat by normalizing it to the position average. More importantly, they normalized all six stats so that a team's individual player values would sum to the team's actual efficiency (points scored per possession minus points allowed per possession).

That team adjustment is important. I can phrase the result of that adjustment a different way: the authors took the team's performance (as measured by efficiency) and *allocated it among the players* six different ways, corresponding to the six different statistics listed above. (Kind of like six different versions of Win Shares, I guess).

For the most "naive" stat, minutes per game: suppose the team, on average, scored five more points per 100 possessions than its opponent. And suppose one player played half-time (24 minutes per game). With five players on the court for 48 minutes, a team has 240 player-minutes per game, so he's responsible for one-tenth of his team's minutes, and he'd be credited with 0.5 points per 100 possessions.
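Here's a minimal sketch of that minutes-based allocation, just to make the arithmetic concrete. The function name, the roster, and the 240-minute default are my own illustrative assumptions, not anything taken from the paper.

```python
def allocate_by_minutes(team_margin, minutes_per_game, team_minutes=240.0):
    """Split a team's efficiency margin among players in proportion to minutes.

    team_minutes defaults to 240: five players on court for a 48-minute game.
    """
    return {
        player: team_margin * minutes / team_minutes
        for player, minutes in minutes_per_game.items()
    }


# Hypothetical roster whose minutes add up to 240 per game.
roster_minutes = {"A": 24.0, "B": 36.0, "C": 36.0, "D": 36.0,
                  "E": 36.0, "F": 36.0, "G": 36.0}

# Team outscored its opponents by 5 points per 100 possessions.
credits = allocate_by_minutes(5.0, roster_minutes)

print(credits["A"])           # the 24-minute player's share of the margin
print(sum(credits.values()))  # the shares add back up to the team margin
```

Player A, with one-tenth of the minutes, gets one-tenth of the five-point margin, and the shares sum back to the team total by construction.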

So the authors did this for all six statistics. Now, for the current season, all the teams will sum to their actual efficiency, because that's the way the stats were engineered. So you can't learn anything about the relative worth of the stats by using the current season.

But what if you use *next* season? Now, you can evaluate how well stats predict wins. That's because some players will have moved around. Suppose a team loses players A, B, and C in the off-season, but signs players X, Y, and Z.

Using minutes per game, maybe A, B, and C were worth +1 win, and players X, Y and Z were worth +2 wins. In that case, the team "should" – if minutes per game is a good stat – gain one win over last year.

But, using Wins Produced, maybe A, B and C were worth 1.5 wins, and X, Y and Z are also worth 1.5 wins. Then, if Wins Produced is accurate, the team should finish the same as last year.
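To spell out that comparison, here's a rough sketch that nets out the arriving and departing players under each rating system. The individual player values are invented so that the totals match the example above; the function name is hypothetical too.

```python
def projected_change(departed, arrived):
    """Net change in team value: credits arriving minus credits leaving."""
    return sum(arrived.values()) - sum(departed.values())


# Hypothetical player values (in wins) under two rating systems.
minutes_based = {
    "departed": {"A": 0.5, "B": 0.25, "C": 0.25},  # totals +1 win
    "arrived": {"X": 1.0, "Y": 0.5, "Z": 0.5},     # totals +2 wins
}
wins_produced = {
    "departed": {"A": 0.5, "B": 0.5, "C": 0.5},    # totals 1.5 wins
    "arrived": {"X": 0.5, "Y": 0.5, "Z": 0.5},     # totals 1.5 wins
}

change_minutes = projected_change(**minutes_based)
change_wp = projected_change(**wins_produced)

print(change_minutes)  # predicted gain if minutes per game is the right stat
print(change_wp)       # predicted gain if Wins Produced is the right stat
```

The two systems make different predictions for the same roster moves, and that difference is what next season's results can test.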

By running this analysis on all six stats, and all teams, you should be able to figure out which stat is best. And you'd expect that the naive methods should be the worst – if sabermetric analysis is worth anything, wouldn't you think it should be able to improve on "minutes per game" in telling us which players are best?

But, surprisingly, the naive methods weren't that much worse than the sabermetric ones. Lewin and Rosenbaum regressed this year's wins on last year's player stats, and here are the correlation coefficients (r) they got:

0.823 -- Minutes per game

0.817 -- Points per game

0.820 -- NBA Efficiency

0.805 -- Player Efficiency Rating

0.803 -- Wins Produced

0.829 -- Alternate Win Score

It turns out that the method you'd think was least effective – minutes per game – outperformed almost all the other stats. The worst predictor was "Wins Produced," the carefully derived stat featured in "The Wages of Wins." (BTW, not all the differences in correlations were statistically significant, but the more extreme ones were.)
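For readers who want to see what one of those r values is, here's a small sketch of a Pearson correlation between projected and actual team results. The five data points are made up for illustration; they are not from the paper, and I'm assuming the reported figures are ordinary Pearson correlations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: each entry is one team.
# projected = sum of last year's player values on this year's roster
# actual    = this year's real team efficiency margin
projected = [3.0, -1.5, 0.5, -4.0, 2.0]
actual = [2.0, -2.5, 1.5, -3.0, 0.5]

r = pearson_r(projected, actual)
print(round(r, 3))
```

You'd run this once per rating system, with the same `actual` list each time, and compare the resulting values of r.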

And repeating the analysis on teams two years forward, and three years forward, the authors find the results to be very similar.

So what's going on? Are general managers actually better at evaluating NBA players than existing sabermetric analysis? The authors think so:

"Our findings ... suggest that models that assume simplistic NBA decision-making often outperform more sophisticated statistical models."

I agree. But I don't think it's because GMs are omniscient – I think it's because even the best statistics measure only part of the game.

All of the above measures are based on "box score statistics" – things that are actually counted during the game. And there are more things counted on offense than on defense. For instance, shooting percentage factors into most of the above stats, but what about *opponent's* shooting percentage? That isn't considered at all, but we could all agree that forcing your opponent to take low-percentage shots is a major part of the defense's job. That's factored into the player ratings as part of the team adjustment, but all players get equal credit for it.

So: if coaches and general managers know how good a player is on defense (which presumably they do), and Wins Produced doesn't, then it's no surprise that GMs outperform stats.

-----

Take a baseball analogy. In the National League, what correlates better with team wins, summed at the player level – wOBA, or GMs' evaluations? It would definitely be GMs' evaluations. Why? Because of pitching. The GM would certainly take pitching into account, but wOBA doesn't. That doesn't mean that wOBA is a bad stat, just that it doesn't measure *total* goodness, just hitting goodness.

Another one, and more subtle: what would correlate better with wins – wOBA or At Bats? It could go either way, again because of pitching. Better pitchers get more playing time, and therefore more AB, so good pitching correlates with AB (albeit weakly). But good pitchers don't necessarily have a better wOBA. So AB would be better for measuring pitching prowess (although, of course, it would still be a very poor measure).

That means that if you run a regression using AB, you get a worse measure for hitters, and a better measure for pitchers. If you use wOBA, you get a better measure for hitters, but a worse measure for pitchers. Which would give you a better correlation with wins? We can't tell without trying.

-----

What Lewin and Rosenbaum are saying is that, in basketball right now, sabermetric measures aren't good enough to compete with the judgments of GMs, and that APBRmetricians' confidence in their methods is unjustified. I agree. However, I'd argue that it's not that the new statistical methods are completely wrong, or bad, just that they don't measure enough of what needs to be measured.

If I wanted to reliably evaluate basketball players, I'd start with the most reliable of the above six sabermetric measures – Alternate Win Score. Then, I'd list all the areas of the game that AWS doesn't measure, like various aspects of defensive play. I'd ask the GMs, or knowledgeable fans, to rate each player in each of those areas. Then, after combining those evaluations with the results of AWS, I'd bet I'd wind up with a rating that kicks the other methods' respective butts.

But, until then, I have to agree with the authors of this paper – the pot is indeed calling the kettle black. It looks like humans *are* better at evaluating talent than any of the widely available stats.


At Friday, December 12, 2008 4:08:00 PM,  Anonymous said...

GMs essentially have been using boxscore stats to some degree and intuitive adjusted +/- to make decisions. That set of tools gets used better by some than others. Now pure adjusted +/- is available widely. How widely used by insiders I can't say.

In private, second generation pure/statistical adjusted +/- or adjusted lineup data with splits done to offense and defense and impact at counterpart level vs team impact are beginning to be developed or refined. How much are they being used and whether they will do better than boxscore stats and intuitive adjusted +/- will be an interesting story to listen for clues on, though the insiders make it hard to know how much they have and how much it is used and how much it contributed to the success. Not within the control of the consultants to reveal, yet. When / if an actual GM committed to analytics (like Daryl Morey) wins a title we will probably hear about the role of the new analytics more. But will still have to evaluate the claimed share vs the shares for the coaches and players themselves.

At Friday, December 12, 2008 4:20:00 PM,  Don Coffin said...

I'm actually not surprised that playing time correlates closely with (future) wins. Assuming that the coach is trying to win, and assuming that the coach can actually determine which players contribute to winning (I know...but the best coaches probably do a better job of this than the worst coaches...), then playing time ought to correlate well. And better than in baseball, actually, because the coach has more control over playing time (substitution is easier, and a player can be reinserted after coming out). I wonder whether NBA teams have private metrics that they use?

At Friday, December 12, 2008 4:26:00 PM,  Anonymous said...

I don't think the conclusion reached here ("in basketball right now, sabermetric measures aren't good enough to compete with the judgments of GMs, and that APBRmetricians' confidence in their methods is unjustified") is very fair, since the study isn't based on metrics that front offices would actually use. Is any team seriously using Wins Produced to make decisions? Or PER? No, of course not, they're novelty stats designed to sell books. In essence, Lewin & Rosenbaum are knocking down a straw man here: Wins Produced doesn't predict future wins any better than MPG? Really? I had no idea!

Just because these junk stats don't predict wins well doesn't mean all APBRmetric stats aren't good enough to compete with GMs (Isiah Thomas & Kevin McHale beg to differ). In this study, they simply picked bad metrics which obviously aren't going to correlate well with future performance.

At Friday, December 12, 2008 4:30:00 PM,  Phil Birnbaum said...

Anonymous: So if these are "junk stats," then what stat DOES correlate with future performance better than MPG?

At Friday, December 12, 2008 4:44:00 PM,  Anonymous said...

Anything nonlinear. Adjusted Plus/Minus, Kevin Pelton's WARP, Justin Kubatko's Win Shares, plus proprietary stats the public has never even heard of. To suggest that these simplistic metrics are representative of what APBRmetrics has to offer is akin to saying that batting average is the sabermetric stat of choice.

At Friday, December 12, 2008 5:10:00 PM,  Anonymous said...

Actually, Adjusted +/- correlated the worst of all measures with future records.

At Friday, December 12, 2008 5:14:00 PM,  Anonymous said...

Which form, though? The single-season version is notoriously unreliable, but Steve Ilardi's multiyear version is much more steady from year to year.

At Friday, December 12, 2008 9:24:00 PM,  Anonymous said...

The other reason why GMs' choices may be better than stats which rely on boxscores (and even play-by-play data) is that GMs can use qualitative information to predict whose performances will improve or degrade in the near future. There are some rough models in basketball as well as baseball that attempt to do this: taking a player's age or experience into account, or doing a version of Nate Silver's PECOTA analysis of finding similar players and seeing what their career trajectory looked like, but the public versions of the estimates are still IMO relatively crude, and it wouldn't be surprising if an experienced human being could outperform a formula by taking into account not just age and statistics but also body type and, more importantly, observations about a player's rate of improvement, willingness to practice, perseverance, injury history, lifestyle (drugs vs working out), coachability, etc. etc.

At Friday, December 12, 2008 10:24:00 PM,  Anonymous said...

Right MKT.

Using quantitative information should never completely crowd out thoughtful, experience-based, self-refining use of qualitative information. But the more you account for quantitatively, the less you have to try to span with judgment.

Last season my stats + judgment based projections for team wins at apbr board beat the contenders which were strictly numbers based. Did I get lucky or did I do well at blending the 2?

Coaches and GMs try their hand at the blend too. Being rigorous and talented at both will get the best results. Boston did a very good job at combining the numbers and the qualitative.

Doc, I assume some of the best teams with analytic shops have private metrics or ways to blend the signal reading from multiple sources. Dan Rosenbaum presented pure adjusted +/- and then immediately built an overall +/- that combined it with statistical +/-. It would seem odd to me if they didn't use the blend in-house, but that is behind closed doors and closed lips.

At Saturday, December 13, 2008 12:01:00 PM,  Anonymous said...

"Which form, though? The single-season version is notoriously unreliable, but Steve Ilardi's multiyear version is much more steady from year to year."

The Rosenbaum paper showed how unreliable it was, and it was the worst, I think. Did someone show the multiyear version is both more accurate and more steady? Until someone shows that this method is not the worst, and no one has, how do we trust it? Steady is not the same as predicting.

At Saturday, December 13, 2008 4:41:00 PM,  Anonymous said...

Anonymous said:
Is any team seriously using Wins Produced to make decisions? Or PER? No, of course not, they're novelty stats designed to sell books.

Oh really? What if I told you I had proof to the contrary?

At Monday, December 15, 2008 10:44:00 PM,  Brian Burke said...

It was pointed out to me how strange the methodology was in this paper. Basically, the team residuals for each of the advanced stats were distributed among the players on the team.

So for the team win projections in the next year, the players carried a share of the residuals with them.

Why do it that way? Why not just leave the residuals out? Is this an accepted statistical methodology? I honestly don't know.

It seems that if you carry all the residuals forward with each player, then add up all the projected wins, you're guaranteeing that you'll end up with a statistically insignificant comparison among the competing stat systems. You're only going to get the correlation from one year's wins to the next year's wins, which in the NBA I'd bet is about r=0.8--because that's what the stat systems all came out to be.

Phil, can you explain?

At Tuesday, December 16, 2008 12:52:00 AM,  Phil Birnbaum said...

Brian,

I think you want to keep the residuals in because, without them, you get crappy results. The point is that "lots of minutes on a good team" is a good indicator that the player is good, but "lots of minutes on a bad team" is less so. Using the residuals is how you tell the first case from the second.

That is: how the team does is evidence of how good the player is. Not great evidence, but evidence nonetheless. So you add it to the player's evaluation.

At Tuesday, December 16, 2008 10:18:00 AM,  Unknown said...

Why do it that way? Why not just leave the residuals out?

Actually, I'm pretty sure there is a very specific reason for this. WoW has claimed it is better at predicting wins than NBA efficiency. To do this it uses a team adjustment that is supposed to take into account things such as D and fit. That team adjustment boiled down to a residual. This meant WoW got its residual into its prediction, but didn't use one for the NBA efficiency prediction. Without that advantage WoW performed no better than other metrics.

At Tuesday, December 16, 2008 10:47:00 PM,  Brian Burke said...

Ok. So the residual is basically the team adjustment--mostly defense I suppose. I think I get it.

But (as I suggested above) won't that guarantee all the projection systems end up with essentially identical correlations?