Wednesday, June 18, 2014

Absence of evidence: the Oregon Medicaid study

In 2008, the State of Oregon created a randomized, controlled experiment to study the effect of Medicaid on health. For the randomization, they held a lottery to choose which 10,000 of the 90,000 applicants would receive coverage. Over the following years, researchers were able to compare the covered and uncovered groups, to check for differences in subsequent health outcomes, and to determine whether the covered individuals had more conditions diagnosed and treated.

Last month brought the publication of another study. In 2006, the state of Massachusetts had instituted health care reform, and this new study compared covered individuals pre- and post-reform, within MA and compared to other states.

The two states' results appeared to differ. In Oregon, the studies found improvement in some health measures, but no statistically significant change in most others. On the other hand, the Massachusetts study found large benefits across the board.

Why the differences? It turns out the Oregon study had a much smaller dataset, so it was unable to find statistical significance in most of the outcomes it studied. Austin Frakt, of "The Incidental Economist," massaged the results in the Oregon study to make them comparable to the Massachusetts study. Here's his diagram comparing the two confidence intervals:

The OR study's interval is ten times as wide as the MA study's! So, obviously, there's no way it would have been able to find statistical significance for the size of the effect the MA study found.

On the surface, it looks like the two studies had radically different findings: MA found large benefits in cases where OR failed to find any benefit.  But that's not right. What really happened is: MA found benefits in cases where OR really didn't have enough data to decide one way or the other.


The Oregon study is another case of "absence of evidence is not evidence of absence." But, I think, what really causes this kind of misinterpretation is the conventional language used to describe non-significant results.  

In one of the Oregon studies, the authors say this:

"We found no significant effect of Medicaid coverage on the prevalence or diagnosis of hypertension."

Well, that's not true. For one thing, the authors say "significant" instead of "statistically significant." Those are different -- crucially different. "Not significant" means "has little effect." "Not statistically significant" means "the sample size was too small to provide evidence of what the effect might be."

When the reader sees "no significant improvements," the reasonable inference is that the authors had reasonable evidence, and concluded that any improvements were negligible. That's almost the opposite of the truth -- insufficient evidence either way.  

In fact, it's even more "not true," because the estimate of hypertension WAS "significant" in the real-life sense:

"The 95% confidence intervals for many of the estimates of effects on individual physical health measures were wide enough to include changes that would be considered clinically significant — such as a 7.16-percentage-point reduction in the prevalence of hypertension."

So, at the very least, the authors should have put "statistically" in front of "significant," like this:

"We found no statistically significant effect of Medicaid coverage on the prevalence or diagnosis of hypertension."

Better!  Now, the sentence is no longer false.  But now it's meaningless.

It's meaningless because, as I wrote before, statistical significance is never a property of a real-world effect. It's a property of the *data*, a property of the *evidence* of the effect.

Saying "Medicaid had no statistically significant effect on patient mortality" is like saying, "OJ Simpson had no statistically significant effect on Nicole Brown Simpson's mortality." It uses an adjective that should apply only to the evidence, and improperly applies it to the claim itself.

Let's add the word "evidence," so the sentence makes sense:

"We found no statistically significant evidence of Medicaid coverage's effect on the prevalence or diagnosis of hypertension."

We're getting there.  Now, the sentence is meaningful. But, in my opinion, it's misleading. To my ear, it implies that, specifically, you found no evidence that an effect exists. But, you also didn't find any evidence that an effect *doesn't* exist -- especially relevant in this case, where the point estimate was "clinically significant."

So, change it to this:

"We found no statistically significant evidence of whether or not Medicaid coverage affects the prevalence or diagnosis of hypertension."

Better again. But it's still misleading, in a different way. It's phrased in such a way that it implies that it's an important fact that they found no statistically significant evidence.  

Because, why else say it at all? These researchers aren't the only ones with no evidence. I didn't find any evidence either. In fact, of the 7 billion people on earth, NONE of them, as far as I know, found any statistically significant evidence for what happened in Oregon. And none of us are mentioning that in print.

The difference is: these researchers *looked* for evidence. But, does that matter enough?

Mary is murdered.  The police glance around the murder scene. They call Mary's husband John, and ask him if he did it. He says no. The police shrug. They don't do DNA testing, they don't take fingerprints, they don't look for witnesses, and they just go back to the station and write a report. And then they say, "We found no good evidence that John did it."

That's the same kind of "true but misleading." When you say you didn't find evidence, that implies you searched.  Doesn't it? Otherwise, you'd say directly, "We didn't search," or "We haven't found any evidence yet."

Not only does it imply that you searched ... I think it implies that you searched *thoroughly*.  That's because of the formal phrasing: not just, "we couldn't find any evidence," but, rather, "We found NO evidence." In English, saying it that way, with the added dramatic emphasis ... well, it's almost an aggressive, pre-emptive rebuttal.

Ask your kid, "Did you feed Fido like I asked you to?" If he replies, "I couldn't find him," you'll say, "well, you didn't look hard enough ... go look again!"

But if he replies, "I found no trace of him," you immediately think ... "OMG, did he run away, did he get hit by a car?" If it turns out Fido is safely in his doggie bed in the den, and your kid didn't even leave the bedroom ... well, it's literally true that he found "no trace" of Fido in his room, but that doesn't make him any less a brat.

In real life, "we found no evidence for X" carries the implication, "we looked hard enough that you should interpret the absence of evidence of X as evidence of absence of X." In the Oregon study, the implication is obviously not true. The researchers weren't able to look hard enough.  Not that they weren't willing -- just that they weren't able, with only 10,000 people in the dataset they were given.

In that light, instead of "no evidence of a significant effect," the study should have said something like,

"The Oregon study didn't contain enough statistically significant evidence to tell us whether or not Medicaid coverage affects the prevalence or diagnosis of hypertension."

If the authors did that, there would have been no confusion. Of course, people would wonder why Oregon bothered to do the experiment at all, if they could have realized in advance there wouldn't be enough data to reach a conclusion.


My feeling is that for most studies, the authors DO want to imply "evidence of absence" when they find a result that's not statistically significant.  I suspect the phrasing has evolved in order to evoke that conclusion, without having to say it explicitly. 

And, often, "evidence of absence" is the whole point. Naturopaths will say, "our homeopathic medicines can cure cancer," and scientists will do a study, and say, "look, the treatment group didn't have any statistically significant difference from the control group." What they really mean -- and often say -- is, "that's enough evidence of absence to show that your stupid pseudo-medicine is useless, you greedy quacks."

And, truth be told, I don't actually object to that. Sometimes absence of evidence IS evidence of absence. Actually, from a Bayesian standpoint, it's ALWAYS evidence of absence, albeit perhaps to a tiny, tiny degree.  Do unicorns exist? Well, there isn't one in this room, so that tips the probability of "no" a tiny bit higher. But, add up all the billions of times that nobody has seen a unicorn, and the probability of no unicorns is pretty close to 1.  

You don't need to do any math ... it's just common sense. Do I have a spare can opener in the house? If I open the kitchen drawer, and it isn't there ... that's good evidence that I don't have one, because that's probably where I would have put it. On the other hand, if I open the fridge and it's not there, that's weak evidence at best.
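To put rough numbers on the can-opener intuition, here's a toy Bayes calculation. All the probabilities below are invented for illustration; only the shape of the argument matters.

```python
# Toy Bayesian update for "absence of evidence" (all numbers invented).
# Prior: 50% chance I own a spare can opener. If I do own one, it's very
# likely in the drawer (90%) and very unlikely in the fridge (1%).

def update_on_absence(prior, p_found_if_exists):
    """Posterior P(it exists) after searching one spot and NOT finding it."""
    p_miss = 1 - p_found_if_exists        # P(empty search | it exists)
    # P(empty search | it doesn't exist) = 1
    return prior * p_miss / (prior * p_miss + (1 - prior))

print(round(update_on_absence(0.5, 0.90), 3))  # drawer empty -> 0.091
print(round(update_on_absence(0.5, 0.01), 3))  # fridge empty -> 0.497
```

An empty drawer drags the probability of owning one down to about 9 percent; an empty fridge barely budges it from 50.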

We do that all the time. We use our brains, and all the common-sense prior knowledge we have. In this case, our critical prior assumption is that spare can openers are much more likely to be found in the drawer than in the fridge.  

If you want to go from "absence of evidence" to "evidence of absence," you have to be Bayesian. You have to use "priors," like your knowledge of where the can opener is more likely to be. And if you want to be intellectually honest, you have to use ALL your priors, even those that work against your favored conclusion. If you only look in the fridge, and the toilet, and the clothes closet, and you tell your wife, "I checked three rooms and couldn't find it," ... well, you're being a dick. You're telling her the literal truth, but hoping to mislead her into reaching an incorrect conclusion.

If you want to claim "evidence of absence," you have to show that if the effect *was* there, you would have found it. In other words, you have to convince us that you really *did* look everywhere for the can opener.  

One way to do that is to formally look at the statistical "power" of your test. But, there's an easier way: just look at your confidence interval.  If it's narrow around zero, that's good "evidence of absence". If it's wide, that's weak "evidence of absence."  
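As a quick sketch of the narrow-versus-wide distinction, with made-up estimates and standard errors (not the Oregon numbers):

```python
# 95% CI as a rough gauge of "evidence of absence" (numbers hypothetical).

def ci95(point_estimate, standard_error):
    return (point_estimate - 1.96 * standard_error,
            point_estimate + 1.96 * standard_error)

# High-powered study: estimate near zero, tight interval.
print(ci95(-0.3, 0.5))   # about (-1.3, 0.7): decent evidence of absence

# Underpowered study: same point estimate, huge interval.
print(ci95(-0.3, 4.0))   # about (-8.1, 7.5): no clue either way
```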

In the Oregon study, the confidence interval for hypertension is obviously quite wide. Since the point estimate is "clinically significant," the edge of the confidence interval -- the point estimate plus 2 SDs -- must be *really* clinically significant.  

The thing is, the convention for academic studies is that even if the estimate isn't statistically significant, you don't treat it differently in high-power studies versus low-power studies. The standard phrasing is the same either way: "There was no statistically significant effect."  

And that's misleading.

Especially when, as in the Oregon case, the study is so underpowered that even your "clinically significant" result is far from statistical significance. Declaring "the effect was not statistically significant," for a study that weak, is as misleading as saying "The missing can opener could not be found in the kitchen," when all you did was look in the fridge.

If you want to argue for evidence of absence, you have to be Bayesian. You have to acknowledge that your conclusions about absence are subjective. You have to make an explicit argument about your prior assumptions, and how they lead to your conclusion of evidence of absence.  

If you don't want to do that, fine. But then, your conclusion should clearly and directly acknowledge your ignorance. "We found no evidence of a significant effect" doesn't do that: it's a "nudge nudge, wink wink" way to imply "evidence of absence" under the table.

If you don't have statistical significance, here's what you -- and the Oregon study -- should say instead:

"We have no clue.  Our study doesn't have enough data to tell."


Friday, June 13, 2014

Does diversity help soccer teams win?

A new academic paper (.pdf) claims that linguistic diversity can be a huge factor in helping high-level soccer teams win games. Here's a Washington Post blog item where they explain their study and conclusions.

They looked at ten seasons' (2003-2013) worth of teams from the Champions League Group Round and beyond -- 168 team-seasons in all. To measure which teams were the most diverse, they used something called the "Automated Similarity Judgment Program" to rate every pair of languages on a 100-point scale. The higher the number, the more distant the languages are from each other. (Roughly speaking, the European languages are fairly close to each other, as are the East Asian languages, and the Middle-East languages. But each of those three groups is relatively distant from the others.)

Then, they calculate the overall team score by averaging the linguistic distance for every pair of players. 
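As I understand it, the measure works like this (the distances below are invented for illustration, not actual ASJP values):

```python
from itertools import combinations

def team_diversity(languages, distance):
    """Average linguistic distance over every pair of players;
    two players sharing a language count as distance 0."""
    pairs = list(combinations(languages, 2))
    total = sum(0 if a == b else distance[frozenset((a, b))] for a, b in pairs)
    return total / len(pairs)

# Invented distances on the paper's 100-point-style scale:
dist = {frozenset(("Spanish", "Portuguese")): 40,
        frozenset(("Spanish", "Korean")): 98,
        frozenset(("Portuguese", "Korean")): 97}

squad = ["Spanish", "Spanish", "Portuguese", "Korean"]
print(round(team_diversity(squad, dist), 1))  # 62.2
```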

What they found: the more diverse teams have significantly better goal differential. 

Could it simply be that the more diverse teams also have better players?  The authors corrected for that by including "transfer value" in their regression. They use the log of the total estimated value a team could receive for its players, as estimated by . 

What does that value represent?  Well, as I understand it: in soccer, teams sell players to other teams all the time. The team gets the proceeds from the sale, and the player negotiates a contract with his new team. The contract value is significantly less than the sale price. 

In 2013, Gareth Bale was transferred from Tottenham Hotspur to Real Madrid, for a sum of more than 78 million UK pounds, the highest sum ever. Bale's salary, though, is 15.6 million pounds per year. (Of course, the transfer price is a one-time payment, but Bale's salary is paid every year.) The site posts its own, real-time estimates of transfer value for every player in most European leagues. I think they base the estimates on the player's current level of performance, so the values are a good proxy for how talented the team actually is.

The authors' main result says that after correcting for team value, if you increase diversity by 1 standard deviation, you gain 0.31 goals per game.


Now, 0.31 goals per game is a LOT. How big a lot?  I'll translate that to MLS terms (even though the study was for the European Champions League), just because the teams are more comparable to each other in talent.

In MLS, teams play 34-game seasons, so an extra 0.31 goals per game works out to 10.5 goals over a full season. In both 2011 and 2012, *more than half* of MLS teams were within 10 goals of even.

That's how big 0.31 goals per game is. Can you really improve your standings position that much by favoring players who speak a foreign language?

I don't think so. I think there's a big problem with the study, one you may already have noticed.

The authors based their regression not on transfer value itself, but on the logarithm of team value. But, as I've written before, that doesn't make sense. What a log(salary) model assumes is that what matters -- what translates value into performance -- is not the dollar difference, but the *percentage difference*. It assumes that if you go from $10 million to $20 million, you get the same number of goals as if you go from $100 million to $200 million.

That can't be right, can it?  It doesn't work that way for Mountain Dew. If you go from $1 to $2, you get one extra can. If you go from $10 to $20, do you still get only one extra can?  

It's got to be almost as wrong for players. If you spend 20 million Euros on a striker, how many extra goals is he going to give you?  According to the model, he'll give you more than twice as many goals if you're AC Milan than if you're Real Madrid. That's because 20 million Euros increases Milan's value by 9 percent, but it increases Madrid's value by only 3.5 percent.

That makes no sense. Well, I suppose it *could* make sense if higher-value teams are so good that there are diminishing returns. A team of eight Barry Bondses probably won't win that many more games if you add a ninth. But, real life European football teams aren't even close to that level.
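Here's what the log model implies in code, with an invented coefficient (only the shape of the relationship matters, not the number):

```python
import math

def extra_goals(team_value, signing_cost, b=1.0):
    """Goals per game the log(value) model credits to one new signing.
    b is a hypothetical regression coefficient."""
    return b * (math.log(team_value + signing_cost) - math.log(team_value))

# The same 20M-Euro striker, bought by clubs of different total value
# (values chosen to roughly match the 9% and 3.5% figures above):
print(round(extra_goals(220, 20), 3))   # smaller club: 0.087
print(round(extra_goals(570, 20), 3))   # bigger club:  0.034
```

The identical purchase gets credited with about two and a half times as many goals at the smaller club -- which is the absurdity in question.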

Also: if that were true, we'd see almost every player sold to a weaker team. Suppose the last few players Barcelona signed gave them 0.25 extra goals per game. Barcelona's team value is 11 times Celtic FC's. So, why wouldn't Celtic have bought the players instead, and got 11 times the benefit -- almost *four goals per game* -- for the same money?

Because, of course, they wouldn't get four goals per game. It would give them the same 0.25 goals per game -- or, perhaps, a fraction more, or less.

Empirical evidence has shown that, in MLB, free agent salaries are close to linear in expected runs. There's no reason to think that European soccer is any different.


Why does the incorrect use of log(value) lead to the overvaluing of diversity?  

Well, if a relationship is linear, but you plot it as proportional to the log, you get a curved line, an exponential line. The graph of y=x is linear, of course, but here's a picture of what happens when you put log(x) on the x-axis instead.  You get the equivalent of y=e^x:

It's obviously not linear, but if you don't realize that, and fit a straight line anyway, you get this:

What happens?  The top and bottom teams get badly underestimated. And the middle teams get badly overestimated.
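You can reproduce that pattern numerically: generate a perfectly linear y = x, regress it on log(x), and check the signs of the residuals.

```python
import math
import statistics

# True relationship is linear in x, but we regress y on log(x).
xs = list(range(1, 101))
ys = [float(x) for x in xs]              # y = x exactly
ls = [math.log(x) for x in xs]

# ordinary least squares of y on log(x)
lbar, ybar = statistics.mean(ls), statistics.mean(ys)
slope = (sum((l - lbar) * (y - ybar) for l, y in zip(ls, ys))
         / sum((l - lbar) ** 2 for l in ls))
intercept = ybar - slope * lbar
resid = [y - (intercept + slope * l) for l, y in zip(ls, ys)]

# bottom underestimated, middle overestimated, top underestimated:
print(resid[0] > 0, resid[50] < 0, resid[-1] > 0)  # True True True
```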

Suppose that, now, you take the regression and throw in your variable for "linguistic diversity". What happens? 

Well, diversity happens to be positively correlated with team value (r=+.23, the paper says). So, the regression "sees" that log(salary) underestimates the top teams. It "sees" that the top teams also have high diversity. Then, it connects the two, and "notices" that high diversity is related to teams that appeared to do better than otherwise expected. So it "concludes" that diversity is a positive and significant factor.

Well, not quite; there's one more thing we have to explain. It's not just the top teams that are overestimated -- it's the bottom teams, too.  So, if you add in diversity, there's no reason yet to assume it will help the fit. If it makes the right-side dots fit better because of high diversity, the left-side dots will fit worse because of low diversity, and it'll cancel out.

Well, here's a possible explanation (which I suspect is the right one). 

In the study, the top teams are the same, successful ones over and over -- Chelsea, Madrid, Barcelona, Bayern Munich, and so forth. Presumably, those have a consistent (higher than normal) diversity score.

But the bottom teams are probably those who qualified for the Champions League maybe only one or two of the 10 years in the study.  With so many of those bad teams, the low- and high-diversities will cancel out more than the few good teams at the high end. So, there would be high diversity at the top, but average (or slightly-below average) diversity in the middle and bottom.   

If that's true, then the diversity variable will fix the underestimates of the top teams, while not affecting the bottom teams much. 


I can't prove this is happening, without having the full dataset, but I bet it is. In any event, it's not up to me to prove it. It's enough to show that (a) the logarithmic model is wrong, and (b) the logarithmic model is wrong in a way that will plausibly produce the exact kind of (spurious) result the authors found.

In other words: if you say the Yankees won last night, and it turns out your evidence is last week's newspaper ... well, that's enough to refute your claim. I don't actually have to prove that the score in your paper is actually different from last night's score.


But, in the interests of double-checking my hypothesis, I gathered data for the 32 teams in the 2013-14 Champions League (which is outside the sample the authors used). I got team values from the same site the authors used, although the values I used may have changed slightly by now.  I used the group round only.

Here's the graph I got for a linear relationship:

Looks quite reasonable. Compare it to the graph I get for a log relationship:

It isn't obvious that the dots on the log graph actually form a curved pattern. But, once you fit the regression line, it shows up -- the top and bottom teams are underestimated, as theory suggests. (I'm sure the curving would be much more obvious if I used the full 10-season dataset instead of just the one year.)

In fairness to the authors, there's a reason it might have been less obvious to them. They didn't use all 32 teams each year, as I did. They included only teams from the five most successful countries (England, France, Italy, Spain, and Germany). All their teams were valued at 123MM Euros or higher (4.8 or higher on the log scale). The remaining 16 team values were generally lower -- nine were under 100MM Euros, and one was as low as 19MM (2.9 on the log scale).

Mathematical theory says that the smaller the range of log(X), the less curved the line will be.  Which is why, if you compress the X-axis by eliminating every point to the left of 4.8, it's less obvious the dots form a curve.

To illustrate, here's the full sample again, but this time on one graph. (The linear is linear, and the log is curved.)  The lines are quite different:

But, for the smaller sample, not quite as much:

It turns out that in this truncated sample, the correlation is actually *higher* for the log scale than for the linear scale, .88 to .83. That doesn't make the model right; it's just that the random errors happen to offset the errors in the model, in this case. (Especially those two dots at the bottom.)

My guess is: if I plotted all 10 years, instead of just the one, the pattern of dots would look more like a curve. The highest-value teams would still fit the straight line much better than the curved line, and that would cause "diversity" -- which would apply disproportionately to the highest-value teams -- to make up the difference.

But, as I said, (a) I don't have the full dataset, and (b) it doesn't really matter. If your regression assumes that players are worth orders of magnitude more to poor teams than to rich teams, there's no obvious way to interpret what comes out the back end.


(Hat tip to GuyM for pointing me to the article.)


Sunday, June 08, 2014

Team fatigue and the NBA playoffs, part II

Last month, a FiveThirtyEight study by Nate Silver found that, after winning their first round series 4-0, NBA teams significantly outperformed expectations in the second round, by about 3 points per game. Conversely, teams that took seven games in the first round fell well short in the second, underperforming expectations by a hefty 5.7 points per game.

The study argued that it's fatigue. Teams that sweep their series have lots of time to rest and recover, while the 4-3 teams have to jump right back in immediately. 

I was skeptical of the fatigue explanation. Last week, I thought it might be just a mathematical anomaly. I posted about it, and then immediately realized that, while the analysis was correct, the effect wasn't nearly enough to explain what Nate got. In fact, it explained less than one-tenth.

So, I figured, if it's not that, maybe I can try to figure out what it really is. After a couple of days of working on simulations, I have an answer. Well, I think I have an answer for part of it, and an opinion for the other part.


First, there's the effect I talked about last post, where the expected point differential for a favorite should be artificially high because of how games are weighted. My estimate last post was kind of back-of-the-envelope, so I decided to use a simulation to get a better handle on it.

I created a conference of 15 teams, and assigned each a random point differential "talent," with a mean of zero and an SD of 4 points. Then, I played independent 82-game seasons for each, where a game was 100 possessions, two-point field goals only. After the season, I ranked the teams by W-L record, and had the top 8 make the playoffs. I paired them off 1-8, 2-7, 3-6, and 4-5, and played a first-round best-of-seven series. I ranked the four winners, paired them up 1-4 and 2-3, and played a second-round best-of-seven series. 

After all that, I took the four second-round teams -- actually, close to 120,000, because I ran 30,000 repetitions -- and compared their simulated second-round point differential to their talent. I expected the teams that went 4-0 in the first round would appear to exceed their talent in the second. (If they did so, the only possible reason would be the weighting anomaly, because of the way the simulation was set up.)
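Here's a condensed sketch of that simulation -- my reconstruction of the setup described above, not the exact code I ran -- down to the first-round series:

```python
import random

def play_game(tdiff):
    """Point differential for a team that's `tdiff` points better in
    talent: 100 two-point possessions per side, shooting percentage
    shifted by talent (tdiff/400 each way, so E[diff] = tdiff)."""
    pa, pb = 0.5 + tdiff / 400, 0.5 - tdiff / 400
    a = sum(random.random() < pa for _ in range(100))
    b = sum(random.random() < pb for _ in range(100))
    return 2 * (a - b)

def best_of_seven(tdiff):
    """Return (games played, True if the stronger team won).
    Ties count as losses; they're rare and immaterial here."""
    wins = losses = 0
    while wins < 4 and losses < 4:
        if play_game(tdiff) > 0:
            wins += 1
        else:
            losses += 1
    return wins + losses, wins == 4

def first_round():
    """One season: 15 talents ~ N(0, 4), 82 games each vs. average
    opposition, top 8 make the playoffs, paired 1-8, 2-7, 3-6, 4-5."""
    talents = [random.gauss(0, 4) for _ in range(15)]
    records = sorted(((sum(play_game(t) > 0 for _ in range(82)), t)
                      for t in talents), reverse=True)
    top8 = [t for _, t in records[:8]]
    results = []
    for i in range(4):
        hi, lo = top8[i], top8[7 - i]
        length, hi_won = best_of_seven(hi - lo)
        results.append((hi if hi_won else lo, length))
    return results  # (winner's talent, series length) for each series
```

From there, the four winners get re-ranked and paired for the second round, and the second-round differentials get compared to talent, as described above.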

They did appear to score higher than their talent, but not by much:

4 games: +0.17 points/game
5 games: +0.03 
6 games: -0.03
7 games: -0.11

I found a difference of 0.28 points between the 4-0 teams and the 4-3 teams. The FiveThirtyEight study found 8.7 points. So, the logic is right, but the magnitude is nowhere near enough to explain the real-life differential.


There's another good reason you'd expect the 4-0 teams to outperform in the second round -- their first round gives us more information about the team. Specifically, the first round sweep suggests that the team is probably better than our original talent estimate. 

The FiveThirtyEight study used season SRS as their estimate of talent. That's the regular-season point differential after adjusting for strength of schedule. Like any other observed performance, it's subject to randomness, and will vary from true talent.  

Taking the simplified "100 possessions, 2-point attempts only" model, you can calculate that the SD of single game point differential is 14.1 points (10 times the square root of 2). For a season average, you divide that by the square root of 82, which gives 1.56. That means that even in this oversimplified model, the typical team's SRS is more than 1 point different from its true talent.
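Checking that arithmetic (each side's score is 2 points times a Binomial(100, 0.5), so a side's SD is 2 times the square root of 100 times 0.25, which is 10):

```python
import math

game_sd = 10 * math.sqrt(2)          # two independent sides, 10 points each
season_sd = game_sd / math.sqrt(82)  # SD of an 82-game average
print(round(game_sd, 1), round(season_sd, 2))  # 14.1 1.56
```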

Some teams' SRSs are underestimates, and some are overestimates. The teams that go 4-0 are now more likely to be underestimates. So, you'd expect them to outperform in the next round.

To check, I re-ran the simulation, but, this time, instead of checking whether 4-0 teams performed better than their talent, I checked whether they performed better than their 82-game SRS. As expected, they did. But the effect was still pretty small:

4 games: +0.22 points/game
5 games: +0.28
6 games: -0.09
7 games: -0.35

We're up to 0.58 points difference between 4-0 and 4-3, still far short of FiveThirtyEight's finding of 8.7 points.

However: the simulation is still missing some hidden talent variation. For one thing, it assumes team talent is exactly the same every game. But that's not the case. Aside from home court advantage (which I ignored, because it wouldn't change the results much), there are things like injuries, trades, changes in player talent as they learn, and so forth.

For instance: if a team acquires a star player worth 2 points per game halfway through the season, the single SRS will blend the "before" and "after."  As a result, the overall SRS for the year will be 1 point short of playoff reality.

To simulate that, I introduced a "playoff variation" factor. For each of the eight playoff teams, I tweaked their talent by a random number of points, mean 0 and SD 1. The results:

4 games: +0.39 points/game
5 games: +0.17
6 games: -0.08
7 games: -0.38

Larger again. Now, the difference is up to 3/4 of a point. 

When I up the "playoff variation" to have an SD of 2, it gets bigger still:

4 games: +0.75 points/game
5 games: +0.17
6 games: -0.19
7 games: -0.63

Now, we're up to 1.38 points. Still well short of 8.7, but enough that we would be able to say that this is at least *part* of what the FiveThirtyEight study found.

This suggests that, if fatigue isn't the explanation, maybe it has something to do with differences between SRS and actual talent.


Well, I finished writing all of that, and then I thought, hey, we have an easy way to figure out how well SRS estimates talent -- the Vegas betting line!  So I went to check on those, and now I'm writing the rest of this post two days later.

In the FiveThirtyEight study, which covered 2003-2013, there were 17 teams that went 4-0 in the first round.

In 2013, the Heat took on the Bulls in the second round after sweeping the first. SRS ratings had the Heat at 7.04 point-per-game favorites over the Bulls. But the Vegas line had them at 9.5 points better. (The betting line was 13 points, and I subtracted 3.5 for home court advantage.)

So, we can say Vegas estimated the Heat as 2.46 points better (relative to Chicago) than their SRS estimate. I'll chart that like this:

                    Vegas   SRS     Diff
2013 Heat/Bulls      +9.5  +7.04   +2.46

Notes: (a) The Vegas numbers may vary, because I used more than one site, and they might vary by a half point here and there. (b) For all Vegas lines, I looked only at the first game of the series. (c) It's conventional to write a Vegas favorite as "-9.5", but I'm going to use "+9.5" for consistency with SRS; hope that's not too confusing. (d) I used 3.5 points for home court advantage. (e)  I'm going to talk about how SRS rated a single team (the one I'm talking about at the time), even though it's actually how SRS rated that team minus how SRS rated the opposing team. That's just to make things easier to read.

Here's all seventeen of the 4-0 teams:

4-0 teams           Vegas   SRS     Diff
2013 Heat/Bulls      +9.5  +7.04   +2.46
2013 Spurs/Warriors  +6    +5.35   +0.65
2012 Spurs/Clippers  +8    +4.36   +3.64
2012 Thunder/Lakers  +4.5  +5.32   -0.82
2011 Celtics/Heat    -1.5  -1.93   +0.43
2010 Magic/Hawks     +5.5  +2.78   +2.72
2009 Cavs/Pistons    +8    +6.97   +1.03
2008 Lakers/Jazz     +4.5  +0.47   +3.13
2007 Bulls/Pistons   -1.5  +0.84   -2.34
2007 Pistons/Bulls   +1.5  -0.84   +2.34
2007 Cavs/Nets       +2.5  +4.33   -1.83
2006 Mavs/Grizzlies  -1.5  -0.73   -0.77
2005 Suns/Mavs       +3    +1.23   +1.77
2005 Heat/Wizards    +7.5  +6.48   +1.02
2004 Spurs/Lakers    +1.5  +3.16   -1.66
2004 Nets/Pistons    -2    -3.16   +1.16
2004 Pacers/Heat     +8    +5.06   +2.94
Average              +3.74 +2.81   +0.93 
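Rechecking the Vegas and Diff column averages from the table above:

```python
vegas = [9.5, 6, 8, 4.5, -1.5, 5.5, 8, 4.5, -1.5, 1.5, 2.5,
         -1.5, 3, 7.5, 1.5, -2, 8]
diffs = [2.46, 0.65, 3.64, -0.82, 0.43, 2.72, 1.03, 3.13, -2.34,
         2.34, -1.83, -0.77, 1.77, 1.02, -1.66, 1.16, 2.94]
print(round(sum(vegas) / 17, 2))  # 3.74
print(round(sum(diffs) / 17, 2))  # 0.93
```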

For those 17 series, the bookmakers rated the average favorite 0.93 points better than their SRS differential. In other words, the 4-0 teams were a point better than the FiveThirtyEight study gave them credit for. That explains roughly one point of the three points by which FiveThirtyEight found the favorites outperforming. 

Does that mean the "fatigue" effect can now only be two points?  Not necessarily. You could still argue that the reason for the extra Vegas point is that bookmakers and bettors *knew* about the fatigue factor, and adjusted their expectations accordingly. 

But, in that case, you could also ask, why did Vegas only adjust for one point out of the three?  Actually, I don't think even the one point is fatigue adjustment. There's a better explanation, in my view.

Suppose the +0.93 was all fatigue adjustment. In that case, if we repeat the chart for the first round, the discrepancy should be zero, right?  Because all teams had roughly equal rest before the first round.

It's not zero. I won't give you the full chart, but the first round average is still positive, at +0.49 points. 

You could still defend fatigue. You could say, sure, maybe SRS was wrong by +0.49 points all along, but the remaining +0.44 points for the second round favorites could still be Vegas acknowledging the fatigue factor.

But, I think there's a better explanation for that +0.49 points. After the first round, we have good reason to believe those teams are better than we thought before the first round. After all, they just swept.

The 17 teams were six-point favorites, on average, in the first round. According to a simulation I did, when a +6 team wins, it wins by an average of 14 points. (When it loses, it loses by 9.5 points, but that doesn't factor into these 4-0 series.)  

If you add four +14 games to a +6 regular season, it increases the SRS from 6.00 to 6.37 -- almost exactly the +0.44 the favorites moved.
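That blend is easy to verify. A minimal sketch, using the numbers from the text (an 82-game regular season at +6, plus four playoff wins at +14 each):

```python
# Back-of-envelope: fold four +14 playoff games into a +6 SRS
# built over an 82-game regular season (figures from the text).
reg_games, reg_rating = 82, 6.0
playoff_games, playoff_margin = 4, 14.0

new_rating = (reg_games * reg_rating + playoff_games * playoff_margin) / (
    reg_games + playoff_games)
print(round(new_rating, 2))  # 6.37
```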

So, I think it's not fatigue that the lines were correcting for, but just new evidence for what the team talent was all along.


But we still have those remaining 2.07 points to account for. Actually, let's adjust for the mathematical anomaly effect, which is .17 points. (It's not a big adjustment, but I did all that work, dammit, and I don't want to waste it.)

That brings us down to 1.9 points. Where did those come from?

Well, it could just be luck. If the SD of a single game point differential is 14, the SD of the average of 56 independent games would be 1.87 points. In that light, the 1.9 point differential is only one SD.
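For concreteness, here's that standard-error calculation, using the text's figures (a per-game SD of 14 points, and 56 games as the text's count):

```python
import math

sd_game = 14.0   # assumed SD of a single game's point differential
n_games = 56     # the text's count of independent second-round games
sd_mean = sd_game / math.sqrt(n_games)
print(round(sd_mean, 2))  # 1.87
```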

Actually, it's probably a bit more than that. Two of the 17 series were actually identical, just in reverse -- when the 2007 Bulls faced the Pistons, after they both went 4-0 in the first round. Since those two teams have to cancel to zero, the effect is bigger than it looks, perhaps by a factor of the square root of 17/15. (The observed effect rises by 17/15, and the SD rises by only the square root of 17/15.)

Still, it's not statistically significant by normal standards, if that's what you like to look at.

Another way of looking at the same discrepancy: in the second round, the 17 teams went a combined 50-41 against the spread. Eliminating the two duplicates, they went 44-35. That's also about one SD away from .500, which you'd expect, since the W-L and point differential are essentially two ways of looking at the same result. 

It still *could* be fatigue, but I think you need better evidence than 44-35.


Now let's look at the fourteen 4-3 teams, the ones that underperformed in the second round:

4-3 teams           Vegas   SRS     Diff
2013 Bulls/Heat      -9     -7.04  -1.96
2012 Clippers/Spurs  -8     -4.85  -3.15
2012 Lakers/Thunder  -4     -4.48  +0.48
2010 Hawks/Magic     -5.5   -2.68  -2.82
2009 Hawks/Cavs      -8     -6.97  -1.03
2009 Celtics/Magic   -2     +0.95  -2.95
2008 Celtics/Cavs    +6     +9.84  -3.84
2007 Jazz/Warriors   +0.5   +3.07  -2.57
2006 Suns/Clippers   +2     +3.73  -1.73
2005 Pacers/Pistons  -5     -2.82  -2.18
2005 Mavs/Suns       -3     -1.23  -1.77
2004 Heat/Pacers     -8     -4.80  -3.20
2003 Mavs/Kings      -5     +1.22  -6.22
2003 Pistons/76ers   -5.5   +1.22  -6.72
Average              -3.89  -1.06  -2.73 

These are much bigger SRS errors than for the 4-0 teams ... compared to the bookies, SRS overestimated the teams by an average of 2.7 points. The FiveThirtyEight study found a 5.7-point difference, which leaves three points unexplained -- 2.8 points after adjusting for the mathematical anomaly.

SRS had also rated those teams higher than the bookies in the first round -- but only by 0.45 points. That's only 1/6 of the full effect. So, this time, if you believe Vegas is adjusting for fatigue, you have a better argument. (But, again: why did bettors only adjust by half the observed effect?)

The 2.8-point shortfall, relative to Vegas, resulted in those teams going 29-45 against the spread. That's a bit less than 2 SDs below .500. (Actually, by "30 points equals one win," 2.8 points works out to roughly 30-44 -- within one game of the actual record.)
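As a quick check of that conversion (a sketch, using the "30 points equals one win" rule of thumb from the text, over the 74 games those teams played):

```python
games = 74            # the 4-3 teams' second-round games (29 + 45)
shortfall = 2.8       # points per game below the spread
pts_per_win = 30.0    # rule-of-thumb points-to-wins conversion

expected_wins = games / 2 - games * shortfall / pts_per_win
print(round(expected_wins))  # 30, i.e. roughly 30-44 against the spread
```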


So we have 44-35 for the sweeping teams, and 29-45 for the seven-game teams. Even if they aren't significant individually, doesn't the *combination* of the two suggest something real is going on?

Not as much as it seems, because the 4-0 and 4-3 results aren't independent. The 4-0 teams played the 4-3 teams some of the time. So, some of the extreme results are counted in both samples. 

In 2010, the Magic went 4-0 in the first round, while the Hawks went 4-3. When they faced each other next round, Orlando absolutely crushed Atlanta, with an *average* score of 107-82. That's 22 points per game more than expected.

That shows up as a +22 for the 4-0 teams, and a -22 for the 4-3 teams. You can see those as the two most extreme dots in the FiveThirtyEight chart: 

In fact, exactly half the series are independent, and half are exact mirror images, except for which column of the chart they appear in. In the chart, for every dot, you'll find a mirror image dot in one of the four columns somewhere.

If I erased the 4-3 column from the chart, you could reproduce it perfectly. You'd just find all the dots in the first three columns that aren't offset by a mirror-reflection dot, and those must be the ones that go in the last column. 

You can still say, "Not only did the 4-0 teams outperform, but, also, the 4-3 teams underperformed!"  But that's like saying, "Not only did the Magic score 20 more points than the Hawks last night, but the Hawks also scored 20 fewer points than the Magic!"  Well, not exactly, because not every 4-0 team played a 4-3 team -- some of them played 4-1 teams and 4-2 teams. So it's only partially like that. But still enough that you have to keep it in mind.


In the FiveThirtyEight study, the traditional evidence for significance comes when they do a regression, and they find a "first round games" effect that's 3 SDs from the mean. But that's an overestimate of the significance, for the reasons discussed:

1. The series aren't independent; half are duplicates of the other half.

2. The regression doesn't adjust for the mathematical weighting anomaly, which is around 0.15 points for the first and last columns. 

3. The regression doesn't adjust for SRS under/overestimating Vegas by the 0.5 points we should all be able to agree on (looking at the first round, where fatigue didn't apply).

4. The regression doesn't adjust for the fact that our estimate of team skill should change for the second round, even independent of fatigue, because of the evidence of how they played in the first round. 

After all that, what's left?

1. The unexplained observed point differentials: +1.85 points for the first group, -2.8 points for the second group. Or, equivalently converted to wins against the spread: the unexplained record of 44-35 for the sweeping teams, and 29-45 for the seven-game teams.

2. The possible argument that the difference between SRS and Vegas in the second round -- after subtracting off the difference in the first round, and the new information about team talent -- might be evidence that Vegas is adjusting for fatigue.

Even without a formal significance test for what's left, those effects seem small enough to me that they could just be random luck. 


Going from data interpretation to personal opinion, here's my argument for it being just random:

(a) It's probably not significant at the 5% level, or, at best, just barely.

(b) It's rare to find a large effect that bookies and sharp bettors haven't also found;

(c) Nate said the effect didn't repeat for other rounds;

(d) This year failed to follow the pattern (after the study appeared). The 4-3 Pacers performed against the 4-0 Wizards exactly as SRS predicted. The 4-0 Heat performed against the 4-3 Nets a tiny bit worse than predicted. And the 4-3 Spurs handily beat expectations against the 4-2 Trail Blazers. (The remaining two 4-3 teams faced each other, cancelling out.)

(e) The difference is so huge that it's just implausible on its face. Even Nate doesn't really believe it: "the effects are so pronounced I don't trust them." Taking the results at face value would have made Indiana a 35% underdog to win its series against Washington, instead of a 76% favorite. (The Pacers wound up winning, 4 games to 2.)

(f) It seems implausible that a fatigue advantage would persist throughout the entire second round. The first game or two, maybe. But, the numbers are too big for that. A 3-point-per-game advantage over a 5-game series is 15 points overall ... and there's no way one team could have a 7.5 point advantage in the first two games, or a 15-point advantage in the first game.

Feel free to disagree with me.


(P.S. Credit to regular commenter GuyM for suggestions in an e-mail conversation we had. Guy was big on the "teams may be different in the playoffs from the regular season" explanation, which turns out to be the important one.)


Sunday, June 01, 2014

Team fatigue and the NBA playoffs

(UPDATE: after I posted this, I discovered that my results weren't as strong as I originally thought.  I've edited to emphasize that.)

Is fatigue a factor affecting NBA playoff success?  Yes, it's a huge factor, a FiveThirtyEight article concluded last month.

Looking at the past 11 post-seasons, Nate Silver's study found that the more games it took a team to win its first-round series, the worse it did compared to expectations in the second round. That was after carefully accounting for team talent and home court advantage.

Teams that won the first round in four games beat second round expectations by 3 points per game. Teams that took seven games underperformed by 5.7 points. Here's their graph that illustrates it best:

But ... I think part of what the study found is a statistical artifact. I think there is a mathematical reason you will *always* get a certain amount of that kind of effect, in any sport, regardless of scheduling and fatigue.


To keep things simple, I'm going to use wins for my examples, instead of points. Also, I'm going to ignore home court advantage. (The argument would still work otherwise; it would just make the math too complicated.)

So: let's suppose you have a best-of-seven series, and the favorite (Team "F") has an independent .667 probability of winning each game. So, the probability that F wins Game 1 is .667. The probability that F wins Game 2 is .667. And so on, so that the probability that F wins Game 7 (if necessary) is also .667.

What is team F's expected winning percentage for the series?

It's not .667. 

And that's the key, understanding why it's not .667. 


Suppose that you have a large number of Team Fs, and, true to form, they win exactly the expected number of games. That means their overall winning percentage is .667. 

But what's the average winning percentage of their *series*?

It would be the same .667, if every series goes like this:

4-2    .667
4-2    .667
4-2    .667
avg    .667

It would also be .667 if all series were four straight:

4-0   1.000
4-0   1.000
0-4    .000
avg    .667

But, for any other combination, the average will have to be *more* than .667. For instance, this:

4-3    .571
4-1    .800
4-2    .667
avg    .679

Team F still won .667 of its games -- going 12-6 -- but the average of the three series is .679. Here's another one:

4-2    .667
1-4    .200
4-0   1.000
4-0   1.000
3-4    .429
4-0   1.000
4-2    .667
avg    .709

This time, Team F's series average is .709, even though it still wins games at the expected .667 clip (24-12). 

What's going on?  Well, when you look at the overall winning percentage by summing games, you're treating all *games* equally. But when you look at the overall percentage by averaging the individual series, you're treating all *series* equally.

But all series aren't equal. They have different numbers of games. If you're giving a four-game series the same weight as a seven-game series, you're weighting each of the "four-game" games higher than each of the "seven-game" games. (75% higher, in fact.)

The shorter the series, the more you're overweighting individual games. 

Now, here's the key: the shorter the series, the higher the expected winning percentage. Why is that?  Because of the heavy favorite. 

If a series goes 7 games, either the favorite went 4-3 (.571) or 3-4 (.429). Obviously, 4-3 is more likely. In our example, if you do the arithmetic, it comes out exactly twice as likely. That means for series that go 7 games, the favorite's expected winning percentage is .524. (That's the average of two .571s and one .429.)

If a series goes 4 games, it could be 4-0 (1.000) or 0-4 (.000). Again, 4-0 is more likely. But this time it's much, much more likely -- *sixteen* times as likely. So, in sweeps, the favorite's expected winning percentage is .941. (That's the average of sixteen 1.000s and one .000.)
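Those two conditional percentages are easy to verify directly. A sketch, for the .667 favorite of the example:

```python
p = 2 / 3  # favorite's per-game win probability

# Seven-game series: conditional on reaching 3-3, the favorite wins
# game 7 with probability p, so 4-3 is twice as likely as 3-4.
pct_seven = (2 * (4 / 7) + 1 * (3 / 7)) / 3
print(round(pct_seven, 3))  # 0.524

# Sweeps: 4-0 is (p/(1-p))^4 = 16 times as likely as 0-4.
ratio = (p / (1 - p)) ** 4
pct_sweep = (ratio * 1.0 + 1 * 0.0) / (ratio + 1)
print(round(ratio), round(pct_sweep, 3))  # 16 0.941
```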

The difference makes sense intuitively. The underdog might have a decent chance to beat the favorite in a close series, but to beat them four games straight?  Not likely.


So, putting it together:

1. shorter series are overweighted
2. shorter series have higher winning percentages

which means:

3. higher winning percentages are overweighted

And that's why you'd expect the winning percentage for a series to be higher than the expected winning percentage for individual games.


Here's a second way to look at it. Suppose I give you five wins and five losses, and ask you to split them into series, any size you like, so that you win the highest percentage of series. Here's what you'll do:

Series 1: W
Series 2: W
Series 3: W
Series 4: W
Series 5: W
Series 6: LLLLL

You went 5-5 in games, but 5-1 in series. The average winning percentage in games is .500. The average winning percentage in series is .833. (Five series of 1.000, and one series of .000).

They don't match. And, the reason they don't match is that you stuck the losses together, so all the losses can only "ruin" one series. That is: there's a high variance of losses between series.

It's the same in the NBA. The losses aren't always evenly distributed from one series to another. If all the series go 4-2, everything matches. If there's any deviation from that, even randomly, the losses are distributed with higher variance, which means the losses cluster, which means they "ruin" fewer series, which means you get a higher winning percentage.


Here's a third way to prove it: just do the math. Here are all the possible series outcomes, with the binomial probability of occurrence, and the resulting winning percentage:

result  prob   pct
4-0    .198  1.000
4-1    .263   .800
4-2    .219   .667
4-3    .146   .571
3-4    .073   .429
2-4    .055   .333
1-4    .033   .200
0-4    .012   .000

Out of 1000 series, Team F will go 4-0 in 198 of them, obtaining a 1.000 winning percentage. They'll go 4-1 in 263 of them, obtaining an .800 winning percentage. And so on.

If you average out the 1,000 results, you get .694. That is: when a team goes into a series with a .667 probability of winning each game, its expected winning percentage *for the series* is .694.
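Here's that averaging done exactly from the binomial probabilities (a sketch; `comb` counts the ways to arrange the losing team's wins among the games before the clincher):

```python
from math import comb

p = 2 / 3  # favorite's per-game win probability

expected = 0.0
for k in range(4):
    # favorite wins the series 4-k: its k losses fall in the first 3+k games
    expected += comb(3 + k, k) * p**4 * (1 - p)**k * (4 / (4 + k))
    # favorite loses the series k-4: it wins k of the first 3+k games
    expected += comb(3 + k, k) * (1 - p)**4 * p**k * (k / (4 + k))

print(round(expected, 3))  # 0.694
```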


The difference between the expected series percentage (.694) and the expected game percentage (.667) is .027. Let's call that the "discrepancy".

When the favorite is only .500, the discrepancy is zero, since it's no longer true that a short series has a higher winning percentage than a long series.

When the favorite is close to 1.000, the discrepancy is also zero, since every series is the exact same 4-0. 

Since the discrepancy goes from zero to .027 back to zero, it must be that there's a peak somewhere. I ran a simulation, and, not surprisingly, it turns out that the peak is exactly halfway between, at .750. Here are some winning percentages (for the favorite) and the resulting discrepancy:

wpct discrepancy
.500  .000
.550  .010
.600  .020
.650  .027
.667  .028
.700  .030
.750  .032
.800  .030
.850  .025

So, roughly speaking, the closer the favorite is to .750, the higher the discrepancy. In practical terms, the stronger the favorites, the more they will appear to outperform.
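The simulation isn't strictly necessary: the discrepancy can be computed exactly the same way as the .694 figure. A sketch that reproduces the table above:

```python
from math import comb

def series_discrepancy(p):
    """Favorite's expected per-series winning percentage, minus p."""
    q = 1 - p
    expected = 0.0
    for k in range(4):
        expected += comb(3 + k, k) * p**4 * q**k * (4 / (4 + k))  # wins 4-k
        expected += comb(3 + k, k) * q**4 * p**k * (k / (4 + k))  # loses k-4
    return expected - p

for p in (0.500, 0.667, 0.750, 0.850):
    print(p, round(series_discrepancy(p), 3))  # matches the table: ~0, .028, .032, .025
```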


Can that explain the FiveThirtyEight results?

Recall what the FiveThirtyEight study found: that teams who won the first round in four games did better in the second round relative to expectations. That is: they had a higher discrepancy.

That fits. Teams that won the first round 4-0 are likely to be stronger teams than teams that won 4-3. In fact, teams that won 4-3 might even be underdogs. That would explain why the 4-0 teams came out "too high," and the 4-3 teams came out "too low".

The FiveThirtyEight study used point differential, rather than wins. Let me try to rejig the example to be in points, so we can compare results. 

Assume that when a .667 team wins, it wins by 10 points on average. When it loses, it's by 5 points. That works out to +5 points per game overall (the average of +10, +10, and -5). Those +5 points correspond to a .667 talent. (The rule of thumb is around 30 points per win, so 5 points above average results in .167 wins above average.)

When the favorite goes 4-0, it winds up at +10 points per game. When it goes 4-1, it's +7 points per game (average of four +10s and one -5). And so on. Here's the chart:

result  prob  pts/g
4-0    .198    +10
4-1    .263    +7
4-2    .219    +5
4-3    .146    +3.5
3-4    .073    +1.4
2-4    .055     0
1-4    .033    -2
0-4    .012    -5
average        +5.4 

Under our assumptions, strong favorites would show 0.4 points better than their talent in a best-of-seven series.

(UPDATE: I originally said 5.4 points better, because I forgot to subtract off the 5 points we started with.  Oops!)
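The chart's average can be reproduced exactly from the same binomial probabilities. A sketch, where the +10/-5 margins are the text's assumptions:

```python
from math import comb

p = 2 / 3
win_margin, loss_margin = 10.0, -5.0  # assumed per-game margins from the text

avg = 0.0
for k in range(4):
    p_win = comb(3 + k, k) * p**4 * (1 - p)**k    # favorite goes 4-k
    p_lose = comb(3 + k, k) * (1 - p)**4 * p**k   # favorite goes k-4
    avg += p_win * (4 * win_margin + k * loss_margin) / (4 + k)
    avg += p_lose * (k * win_margin + 4 * loss_margin) / (4 + k)

print(round(avg, 1))  # 5.4 -- about 0.4 points above the +5 talent
```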

How does that compare to the real study? FiveThirtyEight found that teams going 4-0 in the first round overperformed by 3 points ... and teams going 4-3 in the first round underperformed by 5.7 points.

Hmmm ... not very close!  

Maybe it's our back-of-the-envelope assumptions: (a) that 4-0 teams would be favored by +5 in the second round, and (b) that the +10/-5 breakdown for wins and losses is reasonable. 

What if we assume it's +12 when they win, and -9 when they lose?  Then, it shows 0.6 points better -- not much difference.

I thought the effect would be bigger.  Maybe not.  


In any case, it's easy to back out this effect to see if any "fatigue" factor remains.  All you have to do is: for that chart, where there's one circle for every series ... just change it so there's one circle for every *game*. I'm betting that if you do that, the sloped line becomes closer to horizontal.  But, it appears, maybe not that much closer.


Here's something else that's probably part of the effect. 

The study based the second round expectations on regular season point differentials.  But, a team's first round performance gives you additional information about the team.

Suppose a team goes 4-0, with an average +10 point differential per game. If you combine that with a regular season +5 rating, you get a new rating of 5.23 points.  

In footnote #4, the article points out that a playoff game gives you two to three times as much information about future performance as a regular season game. If we use "twice," the +5 rating becomes 5.44.  If we use "three times," it becomes 5.64.
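Those updated ratings follow from a simple weighted average of the regular-season rating and the playoff point differential (an assumption on my part, but it reproduces the figures above):

```python
def updated_rating(reg_rating, reg_games, playoff_margin, playoff_games, weight):
    """Blend a regular-season rating with playoff point differential,
    counting each playoff game `weight` times as heavily."""
    eff = playoff_games * weight  # effective number of playoff games
    return (reg_games * reg_rating + eff * playoff_margin) / (reg_games + eff)

for w in (1, 2, 3):
    print(w, round(updated_rating(5.0, 82, 10.0, 4, w), 2))
    # weights 1, 2, 3 give 5.23, 5.44, 5.64
```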

So, we have: +0.4 points for the "weighting by series" anomaly, and another 0.5 points or so for the added information the 4-0 gives us.  We're up to +0.9 of the observed +3 points.

Still, that leaves 2.1 points for which fatigue is still the only explanation on the table. 


(More to come.)
