Wednesday, December 20, 2006

Do teams "choose to lose" to improve their draft position?

Like other sports leagues, the NBA awards the best draft picks to teams that perform the worst, in order to even out team quality over time.

Up until 1984, the worst teams in each of the two conferences flipped a coin for the number one pick. After that, draft choices went to teams in reverse order of their finish the previous season.

But because this system awarded better picks to worse teams, the NBA worried that this drafting method gave teams an incentive to lose. And so, for 1985, the league changed the rule so that the draft order became a lottery among all non-playoff teams. Once a team knew it was going to miss the playoffs, it would have no further incentive to lose – its draft position would wind up the same either way.

The new system, of course, didn't promote competitive balance as well as the previous one. Therefore, in 1990, the NBA changed the system once more. The draft order would still be determined by lottery, but the worst teams would get a higher probability of winning than the less-bad teams. There would still be some incentive to lose, the theory went, but much less than under the pre-1985 system.

(It's important to understand that the question isn't about whether players deliberately throw games. Teams can decide to increase their chance of losing in other ways -- sitting out stars, playing their bench more, giving good players more time to come back from injury, trying out a different style of defense, playing a little "safer" to avoid getting hurt, and so on.)

The repeated changes to the system provided a perfect natural experiment, and in a paper called "Losing to Win: Tournament Incentives in the National Basketball Association," economists Beck A. Taylor and Justin G. Trogdon check to see if bad teams actually did respond to these incentives – losing more when the losses benefited their draft position, and losing less when it didn't matter. (A subscription is required for the full paper – I was able to download it at my public library.)

The study ran a regression on all games in three different seasons, each representing a different set of incentives: 1983-84, 1984-85, and 1989-90. They (logistically) regressed the probability of winning on several variables: team winning percentage, opposition winning percentage, and dummy variables for home/away/neutral court, whether the team and opposition had clinched a playoff spot, and whether the team and opposition had been mathematically eliminated from the playoffs (as of that game). For the "eliminated" variables, they used different dummies for each of the three seasons. Comparing the different-year dummy coefficients would provide evidence of whether the teams did indeed respond to the incentives facing them.

One of the study's findings was that once teams were eliminated in 1983-84, when the incentive to lose was the strongest, they played worse than you would expect. That year, eliminated teams appeared to be .220 worse than expected from their W-L record.


That number is huge. Teams mathematically eliminated from the playoffs already have pretty bad records. Suppose they're .400 teams. By the results of this study, after they're eliminated, the authors have them becoming .180 teams! It seems to me, unscientifically, that if these teams – and that's the average team in this situation, not just one or two -- were actually playing .180 ball in a race to the bottom, everyone would have noticed.

The authors don't notice the .180 number explicitly, which is too bad – because if they had, they might also have noticed a flaw in their interpretation of the results.

The flaw is this: in choosing their "winning percentage" measure for their regression, Taylor and Trogdon didn't use the season winning percentage. Instead, they used the team's winning percentage up to that game of the season. For a team that started 1-0, the second entry in the regression data would have pegged them as a 1.000 team.

What that means is that the winning percentages used in the early games of the season are an unreliable measure of the quality of the team. For the late games, the winning percentages will be much more reliable.

For games late in the season, there will be a much higher correlation of winning percentage with victory. And games where a team has been eliminated are all late in the season. Therefore, the "eliminated" variable isn't actually measuring elimination – it's measuring a combination of elimination and late-season games. The way the authors set up the study, there's actually no way to isolate the actual effects of being eliminated.

For instance: the regression treats a 1-2 team the same as a 25-50 team – both are .333. But the 1-2 team is much more likely to win its next game than the 25-50 team. The study sees this as the "not yet eliminated" team playing better than the "already eliminated" team, and assumes it's because the 25-50 team is shirking.
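A quick simulation makes the point concrete. This is only a sketch under assumptions that are mine, not the paper's: team talent is drawn from a normal distribution (mean .500, SD .150, clipped to .100-.900), every opponent is league-average, and nobody ever changes effort. The function name and the bucket boundaries are made up for illustration.

```python
import random

def next_game_win_rates(n_teams=3000, n_games=82, talent_sd=0.15, seed=1):
    """Compare next-game results for teams whose record to date is about .333.

    Assumptions (hypothetical, for illustration only): fixed true talent,
    league-average opponents, and no tanking of any kind.
    """
    rng = random.Random(seed)
    early = [0, 0]  # [next-game wins, observations] with 10 or fewer games played
    late = [0, 0]   # same, with 60 or more games played
    for _ in range(n_teams):
        talent = min(0.9, max(0.1, rng.gauss(0.5, talent_sd)))
        wins = 0
        for played in range(n_games):
            won = rng.random() < talent  # result of the team's next game
            if played > 0 and 0.30 <= wins / played <= 0.37:
                if played <= 10:
                    bucket = early
                elif played >= 60:
                    bucket = late
                else:
                    bucket = None
                if bucket is not None:
                    bucket[0] += won
                    bucket[1] += 1
            wins += won
    return early[0] / early[1], late[0] / late[1]

early_rate, late_rate = next_game_win_rates()
print(f"to-date ~.333, early season: wins next game {early_rate:.3f}")
print(f"to-date ~.333, late season:  wins next game {late_rate:.3f}")
```

Even though no simulated team ever shirks, the late-season ".333" teams should win their next game noticeably less often than the early-season ones. A regression on to-date winning percentage would attribute that gap to whatever dummy happens to mark late-season games.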

The same pattern holds for the "clinch" variable. Teams that have clinched are .550 teams who are really .550 teams. Those are better than .550 teams of the 11-9 variety, and that's why teams that have clinched appear to be .023 points better than expected.

The same is true for the "opposition clinched" dummy variable (which comes in at .046 points), and the "opposition eliminated" variable (at .093 points).

All four of the indicator variables for "clinched" and "eliminated" are markers for "winning percentages are more reliable because of sample size." And it's clear from the text that the authors are unaware of this bias.

I'm not sure we can disentangle the two causes, but perhaps we can take a shot.

Suppose a .333 team is facing a .667 team. The first week of the season, the chance of the 1-2 team beating the 2-1 team is maybe .490. The last game of the season, the chance of the (likely to be) 27-54 team beating the (likely to be) 54-27 team is maybe .230. The middle of the season, maybe it's .350, which is what the regression would have found for a "base" value. (I'm guessing at all these numbers, of course.)

So even if eliminated and clinched teams played no differently than ever, the study would still find a difference of .120 just based on the late-season situation. The actual difference the study found was an "eliminated facing clinched" difference of .266 (.046 for "opposition clinched" plus .220 for "team eliminated"). Therefore, by our assumptions, the real effect is .266 minus .120. That's about .150 points – still a lot.
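Spelled out, the arithmetic looks like this. The two study coefficients are the ones quoted above; the mid-season and late-season probabilities are my guesses from the previous paragraph, and the variable names are just labels for this sketch.

```python
# Guessed chance that the .333 team beats the .667 team (assumptions, not data)
mid_season_chance = 0.350   # roughly what the regression treats as the base case
late_season_chance = 0.230  # last game of the season

# Coefficients reported in the study
opposition_clinched = 0.046
team_eliminated = 0.220

# Gap that appears late in the season even if nobody changes how they play
late_season_artifact = mid_season_chance - late_season_chance

# What the study measured for an eliminated team facing a clinched team
measured_gap = opposition_clinched + team_eliminated

# What is left over after removing the guessed late-season artifact
real_effect = measured_gap - late_season_artifact

print(f"artifact {late_season_artifact:.3f}, measured {measured_gap:.3f}, "
      f"remaining {real_effect:.3f}")
```

The remainder comes to .146, which is where "about .150" comes from. Change either guessed probability and the answer moves one-for-one with it.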

But that's a back-of-the-envelope calculation, and I may have done something wrong. I'd be much more comfortable just rerunning the study, but using full-season winning percentage instead of only-up-to-the-moment winning percentage.

Here is how the predicted marginal change in winning percentage for "eliminated" compares across the three seasons:

1983-84: -.220 (as discussed above)
1984-85: -.069
1989-90: -.192

and the changes for "opposition eliminated":

1983-84: +.237
1984-85: +.093
1989-90: +.252

The middle year, 1984-85, is the year the authors expect the "eliminated" effect to be zero – because, once eliminated, there's no further way to improve your draft choice by losing. The results partially conform to expectations – the middle year shows a significantly lower effect than the other two.

The results for that middle year are not statistically significant, in the sense of being different from zero. The authors therefore treat it as zero – "nonplayoff teams were no more likely to lose than playoff-bound teams." I don't agree with that conclusion, as I complained here. However, the effect as seen is not that much different from our (probably unreliable) estimated size of the late-season effect. Subtract the two, and it might turn out that eliminated teams actually played no worse after being eliminated – just as the authors hypothesize. The standard errors of these middle-year estimates, though, are pretty high. As commenter Guy points out (in a comment to the post I linked to a couple of sentences ago), it would be better if the authors used more than one year's worth of data in studies like these. Although transcribing every NBA game for three seasons must have been a hell of a lot of work – is there a Retrohoop? – I agree that one season just isn't enough.

Also, it's possible that 1984-85 unfolded in such a way as to make the coefficients look different. If, that year, teams played the early part of the season entirely consistently with their eventual record, that would cause the "late-season" factor in the coefficients to be small. That is, if the standings after one week were exactly the same as the standings at season's end, the difference in reliability between late games and early games would be zero. That could account for all the apparent difference in the "eliminated" effect. It's unlikely, but possible – and I don't think there's any easy way to figure out a confidence interval for the effect without running a simulation.

As it stands, my personal feeling is that the authors have found a real effect, but I can't justify that feeling enough that anyone should take my word for it.

My bottom line is that the authors had a great idea, but they failed in their execution. Rerunning the study for more than one season per dummy, and using full-season winning percentages instead of just season-to-date, would probably give a solid answer to this important question.

(Hat tip to David Berri at The Sports Economist.)



At Wednesday, December 20, 2006 10:56:00 PM, Anonymous Anonymous said...

Try for win results from many different sports.

WRT in-season win%, adding 5 wins and 5 losses to each team's record regresses them nicely. (That's not arbitrary -- those are actually the results of a best fit I did once. The details are at APBRMetrics somewhere.) E.g., your 1-0 team is best regarded as a 6-5 team, for the purposes of retrodicting the results of their next game.

At Wednesday, December 20, 2006 11:04:00 PM, Blogger Phil Birnbaum said...

Ed K., is very nice! Thanks! I've added a link here for lazy people.

And thanks for the 5-5 info. That means a 1-0 team is actually .545, which is higher than I thought. It must be somewhat proportional to the "number of atoms" and the SD of team quality, so for baseball you must need more games, like 13-13 or something. I think Tango wrote about this once.
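For anyone who wants to try the 5-and-5 padding on other records, here is a minimal sketch. The function name is made up for illustration; the default of 5 is the NBA value suggested above, and other sports would presumably need a different number.

```python
def padded_pct(wins, losses, pad=5):
    """Winning percentage after adding `pad` wins and `pad` losses to the record."""
    return (wins + pad) / (wins + losses + 2 * pad)

for wins, losses in [(1, 0), (1, 2), (25, 50)]:
    raw = wins / (wins + losses)
    print(f"{wins}-{losses}: raw {raw:.3f}, padded {padded_pct(wins, losses):.3f}")
```

With the default padding, the 1-0 team comes out at .545, the 1-2 team at .462, and the 25-50 team at .353. The two ".333" teams from the post end up more than 100 points apart.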

At Thursday, December 21, 2006 12:46:00 AM, Blogger Fifth Outfielder said...

I suppose we could also say there may be a sample bias in this study. I'm not set up here to access the full text and I don't know the draft lottery history, but intuitively one would assume that there must have been some extreme perception of teams tanking to get higher picks based on 83-84. In other words, knowing that the draft was changed for the following season, we would expect that the previous season would have shown an exaggerated effect.

(Of course, conspiracy theorists might disagree and argue that the lottery was solely put in place to land Ewing in New York.)

Moreover, the 1990 season may have involved an artificial race to the bottom fueled by the rule change, an effect which may have cooled in later seasons. You would also think that those changes would be compared to 1995, when the lottery became more heavily weighted because the Magic had the best record and got the first pick, IIRC.

At Thursday, December 21, 2006 2:54:00 AM, Blogger Phil Birnbaum said...

Tom: wouldn't there be a one-year lag? That is, if the extreme perception would have been in 82-83, they would have changed the rule in the off-season for 83-84.

Or was the rule change announced in mid-season to take effect immediately?

At Thursday, December 21, 2006 11:07:00 AM, Anonymous Anonymous said...

Great analysis. Another approach that could be used would simply be to compare teams' post-elimination record with their pre-elimination record, controlling for strength of opposition and court, and see if teams play worse after elimination.

But this raises a larger issue that we've discussed before, which is the failure of peer review in sports economics. This paper was published in The Journal of Labor Economics, and Berri says it is "One of the best recent articles written in the field of sports economics." Yet the error you describe is so large and so fundamental that we can have no confidence at all in the paper's main finding. More importantly, this is not a failure to acknowledge and learn from prior non-academic research (as you often find), but a basic methodological failure. How does this paper get published and cited favorably by economists?

At Thursday, December 21, 2006 11:36:00 AM, Blogger JavaGeek said...

Wouldn't there be a reasonably easy way to test this by looking at the distribution and testing for a "gap"?

For example, in the last three seasons in the NHL there's been 1 team with 86, 87 or 88 points (chosen for effect, not as a random sample - 6.9% of teams should be in this region). This works out to a Z of -2.06, which says there's probably something strange going on.

At Thursday, December 21, 2006 11:40:00 AM, Blogger Tangotiger said...

In my post here, I said:

"To get an r of .5, you need only 14 NBA games! Sheesh. This is a huge problem here. 14 NBA games tells you as much as 36 NHL games. "

This means that you want to add 7 wins and 7 losses for NBA. If Ed says it should be 5 & 5, he may be right. It all depends on the sample chosen.
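The link between "games needed for an r of .5" and "wins and losses to add" can be sketched as follows. The formula r = n / (n + k) is the usual regression-to-the-mean shortcut, and it's an assumption here rather than something spelled out in this thread; the function names and the `half_point` parameter are hypothetical.

```python
def reliability(games, half_point=14):
    """Assumed shrinkage weight r = n / (n + k), where k games gives r = .5."""
    return games / (games + half_point)

def regressed_pct(wins, losses, half_point=14):
    """Pull the raw winning percentage toward .500 by the reliability weight."""
    games = wins + losses
    raw = wins / games
    return 0.5 + reliability(games, half_point) * (raw - 0.5)

def padded_pct(wins, losses, half_point=14):
    """Same estimate, written as adding half_point/2 wins and half_point/2 losses."""
    pad = half_point / 2
    return (wins + pad) / (wins + losses + 2 * pad)

for k in (14, 10):  # 7 & 7 versus 5 & 5
    print(k, round(regressed_pct(25, 50, k), 3), round(padded_pct(25, 50, k), 3))
```

Under that assumption the two forms are algebraically identical, so "14 games for an r of .5" and "add 7 and 7" are the same statement. Setting `half_point=10` gives the 5-and-5 version.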

At Thursday, December 21, 2006 11:42:00 AM, Blogger Phil Birnbaum said...

Guy (11:07): I agree, and I've been meaning to get around to writing something about peer review, in the line of a response to what David Berri wrote on his blog in early December.

It does seem like a lot of papers have flaws that you'd want to see caught in peer review. I agree with the idea that peer review is a good thing, but a disinterested observer would have to admit that it doesn't seem to work well in practice.

At Thursday, December 21, 2006 1:07:00 PM, Blogger Tangotiger said...

If your "peer" is someone just like you, then of course it won't work. Bill James said that Win Shares was "peer-reviewed", but not so formally.

What you need are SME (subject matter experts), people who will point out obvious flaws that you yourself would not know.

Coming up with a crappy model just means you are leaving a lot of uncertainty on the table, uncertainty that an SME would be able to reduce.

I on the other hand prefer the power of the people, thousands of SME who volunteer their time and expertise. In 30 years, peer reviews will be done blog-style. That time can't come soon enough.

At Thursday, December 21, 2006 1:14:00 PM, Blogger Phil Birnbaum said...

In a discussion somewhere, I recently suggested that people active in sabermetrics should be consulted in connection with this kind of academic peer review.

My argument went something like this: if an academic economist is writing about the cost/benefit of cancer treatments, he'd run it by an oncologist to make sure he got the medical details right. If an academic economist is writing about baseball, why not run it by a sabermetrician to make sure the sabermetrics is OK?

The suggestion was met with complete silence.

So far, no academic that I know of -- including some who have strong sabermetric skills and whose work I respect very much -- has been receptive to the idea that the peer review process for sabermetric articles is not completely adequate.

At Thursday, December 21, 2006 2:31:00 PM, Anonymous Anonymous said...

I agree with your suggestion, Phil. But one doesn't need to know anything about basketball to see the flaw in this model. Anyone who understands the relationship between sample size and predictive power -- which should include everyone who reviewed this paper -- should get this. My guess is that peer reviewers mainly ask 1) did the author use the statistical method currently 'accepted' in the field to address the issue, and 2) did the author cite every previous work on the topic he should? I suspect that relatively few actually dig deep into the methodology and raise challenging questions. It's a friendly little fraternity, with a lot of shared assumptions (and shared blindspots). I'm sure everyone is honest, but the result is still poor self-policing.

One thing that Tango's SMEs could do for these academic studies is simply to look at the results and conclusions and ask "Does this make sense? Is it consistent with what we know about the world?" The .220 exponent is absurd on its face, as you clearly saw. That's a red flag.

Same thing with Sauer's Moneyball paper. It's simply not plausible that MLB teams corrected a huge undervaluation of OBP in a single season. Given multi-year contracts, the high correlation of OBP and SLG, and other factors, it can't have happened. Re-examine your methods, try other approaches.

JCB did a study that found that Tony LaRussa improved his team K/BB ratio by 1.0 just by arguing with umpires about balls and strikes. That would be worth about .5 R/G, or 9 wins a year. It's impossible that we (and every other manager) wouldn't notice this if it were true. Doesn't happen; try again.

It seems to me that good peer reviewers would ask these questions and identify red flags. But it's obvious that they often don't.

