Friday, August 10, 2007

NCAA point-shaving study convincingly debunked

There have been a couple of Justin Wolfers papers in the news recently. There was the “racial bias in the NBA” study a few months ago (co-authored with Joseph Price), which I reviewed here. And, last May, there was an article on NCAA point shaving, which I’ll talk about now.

Wolfers analyzed the outcomes and betting lines for over 70,000 NCAA Division I games. He found that, overall, the favorite covered the spread almost exactly half the time (50.01%), exactly as you would expect. But heavy favorites (defined as a spread of more than 12 points) covered only 48.37% of the time.

He then found that, for heavily favored teams, the results weren’t symmetrical. That is, for teams favored by, say, 4.5 points, they would cover by 0-4 almost exactly as often as by 5-8. But for large spreads, like 15.5 points, teams covered by 0-15 a lot more than by 16-31.

“Among teams favored to win by 12 points or more, 46.2% of teams won but did not cover, while 40.7% were in the comparison range of outcomes.”

The difference is about 6% of those games, which we can call the “Wolfers Discrepancy” for his dataset.

Wolfers concludes that the 6% figure is evidence of cheating – that point shaving caused 3% of potential covers to become non-covers. Since only half of games can be shaved (games where the favorite is leading), the proportion of corrupt games would be double that. Conclusion: the fix was in for about 6% of 12-point-plus-spread games.
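The arithmetic behind that figure can be sketched in a few lines of Python. This is only my reading of the reasoning described above, using the two percentages from the quote; the function name is hypothetical and nothing here comes from Wolfers' paper itself.

```python
def implied_shaving_rate(won_no_cover_pct, comparison_pct):
    """Follow the three steps described in the post (all values in percent)."""
    # Step 1: the asymmetry between the two outcome ranges.
    discrepancy = won_no_cover_pct - comparison_pct
    # Step 2: a shaved game leaves the "cover" range and lands in the
    # "won but did not cover" range, so each one widens the gap twice.
    shifted = discrepancy / 2
    # Step 3: assume only about half of games (favorite leading) can be shaved.
    implied = shifted * 2
    return discrepancy, shifted, implied


discrepancy, shifted, implied = implied_shaving_rate(46.2, 40.7)
print(f"discrepancy {discrepancy:.2f}%, shifted {shifted:.2f}%, implied {implied:.2f}%")
```

With the quoted inputs the gap is 5.5 points, which the post rounds to 6%, and the halving and doubling bring the implied rate back to that same number.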

This study, and the 6% figure, got a lot of publicity.

Here’s a post from The Wages of Wins. Here’s an article from the New York Times. And you can find a lot more by Googling.

Not receiving anywhere near as much publicity is a rebuttal paper by Dan Bernhardt and Steven Heston. And that’s too bad, because it’s a great study, and it thoroughly debunks Wolfers’ conclusions. These guys nail it.

Bernhardt and Heston argue that there’s no reason to expect the symmetry that Wolfers assumes. For instance,

"Consider a 14 point favorite that is up by only seven points with five minutes to play. To maximize its chances of winning, the favorite will “manage the clock,” holding the ball to reduce the number of opportunities that the other team has to score ... In contrast, the same favorite up by 21 is sure to win, and has no need to manage the clock, raising the expected increment in winning margin, generating an asymmetric distribution in winning margins."

That’s very plausible, and the authors back it up with some clever analysis. They argue that if the Wolfers Discrepancy is caused by cheating, then it should be much higher than 6% in games that are more likely to be corrupt, and less than 6% in games that are less likely to be corrupt. And so they check – four different ways.

First, they note that a fixed game will attract heavy betting on the underdog, and all that money will move the betting line to reduce the spread. So they split games into two groups: games where the line moved the “wrong” way, towards the favorite, and all other games. It turns out that the Wolfers Discrepancy is almost exactly the same between the two groups, suggesting that it’s a natural part of the scoring distribution rather than an artifact of corruption.

Second, they note that if game fixing happens, it happens in games where the final score is very close to the spread. So, for a 14-point spread, instead of comparing finals of 1-14 points to finals of 15-28 points, they compare 8-14 to 15-21, to narrow the range. Now, the number of corrupt games in these smaller samples should be equal to the number in the bigger sample. But the denominator, the number of total games, is smaller. Therefore, the percentage of apparently corrupt games – the Wolfers Discrepancy – should rise.

It didn’t. In fact, it *dropped*, by more than two-thirds.
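To see why a rise was the expected result under the corruption story, here is a toy calculation in Python. The game counts are made up purely for illustration (they are not from either paper), and the helper name is hypothetical; the only thing carried over from the argument is that the same number of shaved games gets divided by a smaller total.

```python
def discrepancy_pct(excess_games, games_in_window):
    """Excess 'won but did not cover' games as a percentage of the window."""
    return 100.0 * excess_games / games_in_window


# Hypothetical counts: suppose shaving accounts for 60 excess games, and
# every one of them finished close to the spread.
excess_from_shaving = 60
wide_window_games = 1000    # e.g. finals of 1-28 for a 14-point spread
narrow_window_games = 500   # e.g. finals of 8-21 for the same spread

wide = discrepancy_pct(excess_from_shaving, wide_window_games)
narrow = discrepancy_pct(excess_from_shaving, narrow_window_games)
print(f"wide window: {wide:.1f}%, narrow window: {narrow:.1f}%")
```

If corruption were driving the effect, the narrow-window figure would be the bigger of the two. Bernhardt and Heston found the opposite.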

As a third test, they check games where the line did move towards the underdog, games where corruption is plausible. You’d expect, in those games, that the more the line moved, the more likely the game is fixed, and so the larger the Wolfers Discrepancy for those games.

The result -- nothing statistically significant:

5.61% -- line moved towards the favorite
6.64% -- line didn’t move
7.17% -- line moved towards the underdog by 0 - 0.5 points
8.43% -- line moved towards the underdog by 0 - 1 point
6.83% -- line moved towards the underdog by 0 - 1.5 points.

Finally, they analyzed games where there was no betting line, which usually happens when there isn’t enough interest in the game for the sports books to bother. (For those games, Bernhardt and Heston had to estimate a spread using team strength ratings.)

If there is no betting, there’s no incentive to shave points. And so, if the Wolfers Discrepancy is really caused by corruption, it should be zero for those games.

But it wasn’t zero. In fact, it was 6% -- almost exactly the same as Wolfers found in his own study!

Truly outstanding work by Bernhardt and Heston. They took a statistical effect that Wolfers claimed showed corruption, and proved, four different ways, that the effect exists even when corruption is unlikely. In addition, they provided a plausible explanation of what the effect might be.

The NCAA should buy these guys a beer.

Hat tip: Zubin Jelveh



At Friday, August 10, 2007 6:28:00 PM, Blogger Justin said...

I like the Bernhardt and Heston study - a lot - so I'm reluctant to say anything critical. But I think that they oversell their results a bit, and your posting oversells their results a lot.

Their key point is the first one you make: That if point shaving is occurring, one might expect it to be more prevalent in games in which there is heavy betting on the underdog. So they take the sample, and split it into two. In particular, they create a "shaving" sample, in which betting line movements suggest strong betting on the underdog. And indeed, they find stronger evidence of point shaving in the "shaving" sample (see their Table 1). That is, they were able to locate a sample in which (logically enough) the effects I pointed to earlier were even stronger.

But they actually say something a bit stronger: "when the closing line does not exceed the opening line, point shaving is implausible." That is, they treat the alternative sample as a "control" sample as if they know that there is no point shaving going on. Thus, rather than look only at the absolute level of the "Wolfers discrepancy", they look at how it differs across the two samples. They find that it is indeed greater in the "shaving" sample, but not statistically significantly different. And this is the evidence against shaving.

But note the problem here: They are assuming zero point-shaving in those games in which the betting line doesn't move as predicted. If this is false, then the test is problematic. To see why it is false, just think about Tim Donaghy. There is no evidence that the outcomes in games in which he ref'd are at all correlated with line movements. That is, Donaghy likely cheated as much when the betting line moved as when it didn't. In fact, if you look at the history of known point-shaving episodes over the past century, you will often find that the betting line didn't move. There are two reasons for this. Some point shaving episodes are just half-assed - a college kid betting a few hundred dollars on the opposition, and this doesn't move the betting line. And second, only a very very small proportion of gambling on NCAA basketball occurs through Las Vegas, the source of the betting line movements relied on by the authors.

If instead you thought - as seems plausible from historical experience - two-thirds of point-shaving episodes appear in the "shaving" sample, and one-third in the "control" sample, then this would fully explain why they find a noticeable, but not large difference in point shaving across the two samples. All told, I regard these results (truly!) as supportive of my analysis.

The second test is also quite interesting: they narrow the game-margins in which to look for point shaving. In fact, they find greater evidence of point shaving when they modify my results to looking at a 12-point window; it is only a 6-point window that yields less evidence. And I think that makes sense: If I had bet $1000 on a game, I might be unwilling to shave enough to just win my bet by, say, 3 points, because then a fluke at the end of the game might mean I lose my bet. So perhaps the 6-point window is just too narrow to find all of the point-shavers. (Certainly a two-point window would be too narrow, so I'm not sure where to stop.)

The third point you note is actually supportive of my results: The data you cite show that the more betting there was on the underdog, the stronger the evidence of point-shaving. Look again at your numbers, but don't focus as much on what is or isn't statistically significantly different from zero, but instead on whether the pattern is statistically significantly different from what you would expect if point shaving were occurring. The more the line moves, the greater the evidence of point-shaving.

I find their fourth point most interesting: that in games in which there is no point spread, the authors find similar evidence of point-shaving. That worries me a lot. But the problem is that in games in which there is no point spread, there is no point spread, and my test can't actually be performed! The authors sidestep that neatly by "predicting" a point spread, but then one is left to wonder whether we are learning something about their predictions of what the point spread would have been, rather than about point shaving. Betting markets are a lot smarter than econometric models produced by even very clever economists, and so I suspect that some of this result reflects them mis-predicting what the spread would have been.

But lest this sound defensive, let me just note that I describe my own research as providing a “prima facie” case that point-shaving may be occurring. Forensic economics can’t do more than point out suspicious patterns. Bernhardt and Heston do a nice job in extending the analysis, but arguably strengthen, rather than weaken the case that there is something fishy in the relationship between outcomes in basketball games and betting lines.

At Saturday, August 11, 2007 1:41:00 AM, Blogger Phil Birnbaum said...

Hi, Justin,

Thanks for dropping by and responding; much appreciated. I will try to respond to all the points you raise.

Let me start with where you’ve convinced me, where we can agree to agree. That would be the fourth test, on the games where there’s no point spread, and Bernhardt and Heston (B&H) calculate their own based on Jeff Sagarin’s ratings. The authors give a reference for the algorithm used to calculate the spread, and, on first reading their paper, I assumed that whatever algorithm they used, if reasonable, would be good enough. But, on reading your answer, and thinking about it some more, I think you’re right. The effect you observe is very sensitive to the point spread, and even a small inaccuracy in the algorithm would ruin the results. Given that B&H do not prove, or even assert, that their methods are accurate enough for this analysis, I have to agree with you that we can’t rely on that fourth test.

Also, I think I agree with you on some of your comments about the first test. That’s the one where B&H split the sample into two – the ones where the spread moved towards the favorite, and the ones where it didn’t. I wrote that the Wolfers discrepancy was almost the same between the two samples. You argue that it’s actually higher in the expected direction, the direction that supports corruption. You’re right that it’s higher – 6.64% to 5.61% – and I am willing to accept that my “almost exactly the same” is perhaps an overstatement.

You also argue against B&H’s assumption that *100%* of fixed games involve the point spread moving, and suggest 2/3 of games as a better estimate. That would mean that the estimate of shaved games would be triple the observed numbers (since 1 – 0 equals three times 2/3 – 1/3). So instead of 1.03%, that would make the effect 3.09%.
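Here's that arithmetic as a quick Python sketch. It assumes, as the paragraph above does, that the observed gap between the two samples is the true effect scaled by the difference in shares of fixed games; the function name is my own shorthand, not anything from B&H.

```python
def rescaled_effect(observed_gap, share_in_shaving_sample):
    """Scale the observed gap up by the difference in shares of fixed games."""
    share_in_control = 1 - share_in_shaving_sample
    return observed_gap / (share_in_shaving_sample - share_in_control)


# Gap between the two samples, in percentage points.
observed_gap = 6.64 - 5.61

# B&H's assumption: every fixed game is in the "shaving" sample.
all_in_shaving = rescaled_effect(observed_gap, 1.0)
# The 2/3 vs. 1/3 split suggested instead.
two_thirds_split = rescaled_effect(observed_gap, 2 / 3)

print(f"100% assumption: {all_in_shaving:.2f}")
print(f"2/3 assumption:  {two_thirds_split:.2f}")
```

The closer the split gets to 50/50, the larger the multiplier, so the conclusion is quite sensitive to that assumed share.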

On this, I think I’m still agreeing with you; your point makes sense, and I wish I’d caught it. However, I note that the test does come up with a figure only about half of the 6% that you argue for in your own paper – and that’s going to be the largest figure any of the B&H tests come up with. More on that in a bit.

Now, here’s where I disagree with you.

For the second test, you say that B&H find greater evidence of point-shaving (i.e., a greater “Wolfers discrepancy”) when they look at a +/- 12-point window instead of the larger +/- point-spread-sized window (these are for games where the spread is at least 12). I don’t see that: for both halves of the sample, the percentages for the +/- 12 are smaller than yours. (Am I misunderstanding your point?) In any case, you acknowledge that the 6-point-window sample yields smaller corruption estimates, but argue that supports your conclusion. That’s because, you argue, in a game that’s only decided by 6 points, the cheating player is too scared to cheat, lest his cheating cost his team not just the spread, but the game itself.

But I think you’ve just mistaken what the 6 points means here. It’s not the raw margin of victory, but the games that ended *within 6 points of the spread*. Since the spreads are at least 12, that means that you’re talking about games where the favorite won by between 6 and 18 points. So fear of losing the game is not actually an issue, and I think that B&H’s argument is still valid.

Moving on to the third test, which looks at the Wolfers Discrepancies based on the amount of change in the point spread from the opening line to the closing line: you argue that I shouldn’t look at whether those numbers are significantly different from zero, but, rather, whether they’re what we’d expect if point shaving were occurring. And, indeed, for point shaving, we’d expect the numbers to be increasing, and they roughly are:

5.61, 6.64, 7.17, 8.42, 6.83

Actually, that’s why I reported them in the original post: they do in fact show evidence of increased shaving (although the last one, where the point spread moves the most, is the exception). I probably should have mentioned that explicitly, and that the reader could judge for himself. And I’m certainly willing to agree that the pattern is roughly consistent with some point shaving happening.

But what’s crucially important -- and I think this is the source of most of our differences in interpretation of these results – is the reason that I am implicitly comparing these numbers to zero. It’s because *your original paper* implies I should! More specifically, your analysis found a discrepancy of 6%, and repeatedly states that the full 6% should be the estimate of the incidence of point shaving. Let me quote your paper:

“... an indicative, albeit rough, estimate suggesting that around 6 percent of strong favorites have been willing to manipulate their performance.”

“ ... suggesting that the proportion of strong favorites agreeing to point shave is 6 percent.”

And I won’t go searching the media reports, but I’m pretty sure that they quoted the “6% of strong favorites corrupt” figure too.

While I’m willing to accept that B&H found *some* evidence of corruption – such as the fact that the numbers do go up *slightly* in the right direction – they also repeatedly find evidence that strongly contradicts your 6% estimate.

The bottom line, as I see it, is this: your analysis found a “Wolfers Discrepancy” of about 6%. Where does the 6% asymmetry in scoring come from? It’s a combination of (a) corruption, and (b) the natural characteristics of basketball games in which one team wins by a lot of points (such as playing cautiously and “managing the clock,” as suggested by B&H).

How much of the 6% is (a) and how much is (b)? It could be any combination of the two. But your paper assumes that it’s all (a), 6% corruption and 0% game strategy. That’s what B&H are refuting. They do so by responding, “well, if the Wolfers Discrepancy of 6% is all corruption, how come we still get a discrepancy of 5% in games where corruption is very unlikely? How come when we study only games that ended within 6 points of the spread (but were still moderate blowouts), and should therefore contain an even larger proportion of corrupt games, we now only get 2%? How come, when the line changes in a way that strongly counter-indicates corruption (by your suggested estimate, reducing the chance by half), we *still* get a discrepancy of 5.6%?”

Their reasonable conclusion is that almost all of the 6% you found must be due to (b), the natural characteristics of the game of basketball. And, respectfully, I still think they’re right.

At Saturday, August 11, 2007 3:45:00 AM, Anonymous Anonymous said...

I just want to commend Justin for posting. Phil's reviews of sabermetric-type studies are great and would be a benefit for all academics to read. Justin is one of very few who actually has responded to comments about his research, and that immediately raises his profile in my eyes.

