Tuesday, July 22, 2014

Did McDonald's get shafted by the Consumer Reports survey?

McDonald's was the biggest loser in Consumer Reports' latest fast food survey, ranking them dead last out of 21 burger chains. CR readers rated McDonald's only 5.8 out of 10 for their burgers, and 71 out of 100 for overall satisfaction. (Ungated results here.)

CR wrote,


"McDonald's own customers ranked its burgers significantly worse than those of [its] competitors."

Yes, that's true. But I think the ratings are a biased measure of what people actually think. I suspect that McDonald's is actually much better loved than the survey says. In fact, the results could even be backwards.  It's theoretically possible, and fully consistent with the results, that people actually like McDonald's *best*. 

I don't mean because of statistical error -- I mean because of selective sampling.

-----

According to CR's report, 32,405 subscribers reported on 96,208 dining experiences. That's 2.97 restaurants per respondent, which leads me to suspect that they asked readers to report on the three chains they visit most frequently. (I haven't actually seen the questionnaire -- they used to send me one in the mail to fill out, but not any more.)

Limiting respondents to their three most frequented restaurants would, obviously, tend to skew the results upward. If you don't like a certain chain, you probably wouldn't have gone lately, so your rating of "meh, 3 out of 10" wouldn't be included. It's going to be mostly people who like the food who answer the questions.

But McDonald's might be an exception. Because even if you don't like their food that much, you probably still wind up going occasionally:

-- You might be travelling, and McDonald's is all that's open (I once had to eat Mickey D's three nights in a row, because everything else nearby closed at 10 pm). 

-- You might be short of time, and there's a McDonald's right in Wal-Mart, so you grab a burger on your way out and eat it in the car.

-- You might be with your kids, and kids tend to love McDonald's.

-- There might be only McDonald's around when you get hungry. 

Those "I'm going for reasons other than the food" respondents would depress McDonald's ratings, relative to other chains.

Suppose there are two types of people in America. Half of them rate McDonald's a 9, and Fuddruckers a 5. The other half rate Fuddruckers an 8, but McDonald's a 6.

So, consumers think McDonald's is a 7.5, and Fuddruckers is a 6.5.

But the people who prefer McDonald's seldom set foot anywhere else -- where there's a Fuddruckers, the Golden Arches are never far away. On the other hand, fans of Fuddruckers can't find one when they travel. So, they wind up eating at McDonald's a few times a year.

So what happens when you do the survey?  McDonald's gets a rating of 7.5 -- the average of 9s from the loyal customers, and 6s from the reluctant ones. Fuddruckers, on the other hand, gets an average of 8 -- since only their fans vote. 

That's how, even if people actually like McDonald's more than Fuddruckers, selective sampling might make McDonald's look worse.
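
If it helps, here's that toy example worked out as a short Python sketch. Everything in it -- the two customer types, their ratings, and who actually eats where -- is just the made-up numbers from above, not anything from CR's data.

type_a = {"mcdonalds": 9, "fuddruckers": 5, "visits": ["mcdonalds"]}
type_b = {"mcdonalds": 6, "fuddruckers": 8, "visits": ["mcdonalds", "fuddruckers"]}
population = [type_a, type_b]   # assume the two types are equally common

def true_average(chain):
    # what people actually think, averaged over everyone
    return sum(p[chain] for p in population) / len(population)

def survey_average(chain):
    # what a "rate the places you actually go to" survey reports
    ratings = [p[chain] for p in population if chain in p["visits"]]
    return sum(ratings) / len(ratings)

for chain in ("mcdonalds", "fuddruckers"):
    print(chain, "true:", true_average(chain), "survey:", survey_average(chain))

# mcdonalds true: 7.5 survey: 7.5
# fuddruckers true: 6.5 survey: 8.0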

------

It seems likely this is actually happening. If you look at the burger chain rankings, it sure does seem like the biggest chains are clustered near the bottom. Of the five chains with the most locations (by my Googling and estimates), all of them rank within the bottom eight of the rankings: Wendy's (burger score 6.8), Sonic (6.7), Burger King (6.6), Jack In The Box (6.6), and McDonald's (5.8). 

As far as I can tell, Hardee's is next biggest, with about 2,000 US restaurants. It ranks in the middle of the pack, at 7.5. 

Of the ten chains ranked higher than Hardee's, every one of them has fewer than 1,000 locations. The top two, Habit Burger Grill (8.1) and In-N-Out (8.0), have only 400 restaurants between them. Burgerville, which ranked 7.7, has only 39 stores. (Five Guys (7.9) now has more than 1,000, but the survey covered April, 2012, to June, 2013, when there were fewer.)

The pattern was the same in other categories, where the largest chains were also at or near the bottom. KFC ranked worst for chicken; Subway rated second-worst for sandwiches; and Taco Bell scored worst for Mexican.

And, the clincher, for me at least: the chain with the worst "dining experience," according to the survey, was Sbarro, at 65/100. 

What is Sbarro, if not the "I'm stuck at the mall" place to get pizza? Actually, I think there's even a Sbarro at the Ottawa airport -- one of only two fast food places in the departure area. If you get hungry waiting for your flight, it's either them or Tim Hortons.

The Sbarro ratings are probably dominated by customers who didn't have much of a choice. 

(Not that I'm saying Sbarro is actually awesome food -- I don't ever expect to hear someone say, unironically, "hey, I feel like Sbarro tonight."  I'm just saying they're probably not as bad as their rating suggests.)

------

Another factor: CR asked readers to rate the burgers, specifically. In-N-Out sells only burgers. But McDonald's has many other popular products. You can be a happy McDonald's customer who doesn't like the burgers, but you can't be a happy In-N-Out customer who doesn't like the burgers. Again, that's selective sampling that would skew the results in favor of the burger-only joints.

And don't forget: a lot of people *love* McDonald's french fries. So, their customers might prefer "C+ burger with A+ fries" to a competitor who's a B- in both categories. 

That thinking actually *supports* CR's conclusion that people like McDonald's burgers less ... but, at the same time, it makes the arbitrary ranking-by-burger-only seem a little unfair. It's as if CR rated baseball players by batting average, and ignored power and walks.

For evidence, you can compare CR's two sets of rankings. 

In burgers, the bottom eight are clustered from 6.6 to 6.8 -- except McDonald's, a huge outlier at 5.8, as far from second-worst as second-worst is from average.

In overall experience, though, McDonald's makes up the difference completely, perhaps by hitting McNuggets over the fences. It's still last, but now tied with Burger King at 71. And the rest aren't that far away. The next six range from 74 to 76 -- and, for what it's worth, CR says a difference of five points is "not meaningful".

-----

A little while ago, I read an interesting story about people's preferences for pies. I don't remember where I read it so I may not have the details perfect. (If you recognize it, let me know.)

For years, Apple Pie was the biggest selling pie in supermarkets. But that was when only full-size pies were sold, big enough to feed a family. Eventually, one company decided to market individual-size pies. To their surprise, Apple was no longer the most popular -- instead, Blueberry was. In fact, Apple dropped all the way to *fifth*. 

What was going on? It turns out that Apple wasn't anyone's most liked pie, but neither was it anyone's least liked pie. In other words, it ranked high as a compromise choice, when you had to make five people happy at once.

I suspect that's what happens with McDonald's. A bus full of tourists isn't going to stop at a specialty place which may be a little weird, or have limited variety. They're going to stop at McDonald's, where everyone knows the food and can find something they like.

McDonald's is kind of the default fast food, everybody's second or third choice.

------

But having said all that ... it *does* look to me like the ratings are roughly in line with what I consider "quality" in a burger. So I suspect there is some real signal in the results, despite the selective sampling issue.

Except for McDonald's. 

Because, first, I don't think there's any way their burgers are *that* much "worse" than, say, Burger King's. 

And, second, every argument I've made here applies significantly more to McDonald's than to any of the other chains. They have almost twice as many locations as Burger King, almost three times as many as Wendy's, and almost four times as many as Sonic. Unless you truly can't stand them, you'll probably find yourself at McDonald's at some point, even if you'd much rather be dining somewhere else.

All the big chains probably wind up shortchanged in CR's survey. But McDonald's, I suspect, gets spectacularly screwed.








Saturday, July 12, 2014

Nate Silver and the 7-1 blowout

Brazil entered last Tuesday's World Cup semifinal match missing two of their best players -- Neymar, who was out with an injury, and Silva, who was sitting out a yellow-card suspension. Would they still be good enough to beat Germany?

After crunching the numbers, Nate Silver, at FiveThirtyEight, forecasted that Brazil still had a 65 percent chance of winning the match -- that the depleted Brazilians were still better than the Germans. In that prediction, he was taking a stand against the betting markets, which actually had Brazil as underdog -- barely -- at 49 percent. 

Then, of course, Germany beat the living sh!t out of Brazil, by a score of 7-1. 


"Time to eat some crow," Nate wrote after Brazil had been humiliated. "That prediction stunk."

I was surprised; I had expected Nate to defend his forecast. Even in retrospect, you can't say there was necessarily anything wrong with it.

What's the argument that the prediction stunk?  Maybe it goes something like this:

-- Defying the oddsmakers, Nate picked Brazil as the favorite.
-- Brazil suffered the worst disaster in World Cup history. 
-- Nate's prediction was WAY off.
-- So that has to be a bad prediction, right?

No, it doesn't. It's impossible to know in advance what's going to happen in a soccer game, and, in fact, anything at all could happen. The best anyone can do is try to assign the best possible estimate of the probabilities. Which is what Nate did: he said that there was a 65% chance that Brazil would win, and a 35% chance they would lose. 

Nate said Brazil had about twice as much chance of winning as Germany did. He did NOT say that Brazil would play twice as well. He didn't say Brazil would score twice as many goals. He didn't say Brazil would control the ball twice as much of the time. He didn't say the game would be close, or that Brazil wouldn't get blown out. 

All he said was, Brazil has twice the probability of winning. 

The "65:35" prediction *did* imply that Nate thought Brazil was a better team than Germany. But that's not the same as implying that Brazil would play better this particular game. It happens all the time, in sports, that the better team plays like crap, and loses. That's all built in to the "35 percent". 

Here's an analogy. 

FIFA is about to draw a ball from an urn containing a million balls, numbered from 1 to 1,000,000, where each number represents a number of dollars. I say, there's a 65 percent chance that the number drawn will be higher than the value of a three-bedroom bungalow, which is $350,000. 

That's absolutely a true statement, right?  650,000 "winning" balls out of a million is 65 percent. I've made a perfect forecast.

After I make my prediction, FIFA reaches into the urn, pulls out one of the million balls, and it's ... number 14. 

Was my prediction wrong?  No, it wasn't. It was exactly, perfectly correct, even in retrospect.

It might SEEM that my prediction was awful, if you don't understand how probability works, or you didn't realize how the balls were numbered, or you didn't understand the question. In that case, you might gleefully assume I'm an idiot. You might say, "Are you kidding me?  Phil predicted you could buy a house for $14! Obviously, there's something wrong with his model!"

But, there isn't. I knew all along that there was a chance of "14" coming up, and that factored into my "35 percent" prediction. "14" is, in fact, a surprisingly low outcome, but one that was fully anticipated by the model.
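
(If you want to check the urn arithmetic for yourself, here's a few lines of simulation. The only inputs are the numbers from the analogy: a million numbered balls and a $350,000 threshold.)

import random

draws = [random.randint(1, 1_000_000) for _ in range(1_000_000)]
print(sum(d > 350_000 for d in draws) / len(draws))   # comes out around 0.65, as predicted
print(min(draws))   # and yet single-digit and double-digit balls still get drawn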

When Nate said that Brazil had a 35 percent chance of losing, a small portion of that 35 percent was the chance of those rare events, like a 7-1 score -- in the same way my own 35 percent chance included the rare event of a really small number getting drawn. 

As unintuitive as it sounds, you can't judge Nate's forecast by the score of the game. 

-------

Critics might dispute my analogy by arguing something like this:

"The "14" result in Phil's model doesn't show he was wrong, because, obviously, which ball comes out of the urn it just a random outcome. On the other hand, a soccer game has real people and real strategies, and a true expert would have been able to foresee how Germany would come out so dominating against Brazil."

But ... no. An expert probably *couldn't* know that. That's something that was probably unknowable. For one thing, the betting markets didn't know -- they had the two teams about even. I didn't hear any bettor, soccer expert, sportswriter, or sabermetrician say otherwise -- say, that Germany should be expected to win by multiple goals. That suggests, doesn't it, that it was legitimately impossible to foresee?

I say, yes, it was definitely unknowable. You can't predict the outcome of a single game to that extent -- it's a violation of the "speed of light" limit. I would defy you to find any single instance where anyone, with money at stake, seriously predicted a single game outcome that violates conventional wisdom to anything near this extent. 

Try it for any sport. On August 22, 2007, the Rangers were 2:3 underdogs on the road against the Orioles. They won 30-3. Did anyone predict that?  Did anyone even say the Rangers should be heavy favorites?  Is there something wrong with Vegas, that they so obviously misjudged the prowess of the Texas batters?

Of course not. It was just a fluke occurrence, literally unpredictable by human minds. Like, say, 7-1 Germany.


"Huh? [Nate Silver] says his prediction "stunk," but it was probabilistic. No way to know if it was even wrong."

Exactly correct. 

--------

So I don't think you can fault Nate's prediction, here. Actually, that's too weak a statement. I don't mean you have to forgive him, as in, "yeah, he was wrong, but it was a tough one to predict."  I don't mean, "well, nobody's perfect."  I mean: you have no basis even for *questioning* Nate's prediction, if your only evidence is the outcome of the game. Not as in, "you shouldn't complain unless you can do better," but, as in, "his prediction may well have been right, despite the 7-1 outcome."  

But I did a quick Google search for "Brazil 7-1 Nate Silver," and every article I saw that talked about Nate's prediction treated it as certain that his forecast was wrong.

1. Here's one guy who agrees that it's very difficult to predict game results. From there, he concludes that all predictions must therefore be bullsh!t (his word). "Why did they even bother updating their odds for the last three remaining teams at numbers like 64 percent for Germany, 14 percent for the Netherlands, when we just saw how useless those numbers can be?"

Because, of course, the numbers *aren't* bullsh!t, if you correctly interpret them as probabilities and not certainties. If you truly believe that no estimate of odds is useful unless it can successfully call the winner of every game, then how about you bet me every game, taking the Vegas underdog at even money?  Then we'll see who's bullsh!tting.

2. This British columnist gets it right, but kind of hides his defense of Nate in a discussion of how sports pundits are bad at predicting. Except that he means that sabermetricians are bad at correctly guessing outcomes. Well, yes, and we know that. But we *are* fairly decent at predicting probabilities, which is all that Nate was trying to do, because he knows that's all that can realistically be done.


"I love Nate Silver and 538, but this result might be breaking his model. Haven't been super impressed with the predictions."

What, in particular, wasn't this guy impressed with?  He can't just be talking about game results, can he?  Because, in the knockout round, Nate's predicted favorites won *every game* up to Germany/Brazil. Twelve in a row. What would have "super impressed" this guy, 13 out of 12?

Here's another one: 

"To be fair to Nate Silver + 538, their model on the whole was excellent. It's how they dealt with Brazil where I (and others) had problems."

What kind of problems?  Not picking them to lose 7-1?  

In fairness, sure, there's probably some basis for critiquing Nate's model, since he's been giving Brazil significantly higher odds than the bookies. But, in this case, the difference was between 65% and 49%, not between 65% and "OMG, it's a history-making massacre!"  So this is not really a convincing argument against Nate's method.

It's kind of like when your doctor says, "you should stop smoking, or you're going to die before you're 50!"  You refuse, and the day before your fiftieth birthday, a piano falls on your head and kills you. And the doctor says, "See? I was right!" 

4. Here's a mathematical one, from a Guardian blogger. He notes that Nate's model assumed goals were independent and Poisson, but, in real life, they're not -- especially when a team collapses and the opponent scores in rapid-fire fashion.

All very true, but that doesn't invalidate Nate's model. Nate didn't try to predict the score -- just the outcome. Whether a team collapses after going down 3-0, or not, doesn't much affect the probability of winning after that, which is why any reasonable model doesn't have to go into that level of detail.

Which is why, actually, a 7-1 loss isn't necessarily inconsistent with being a favorite. Imagine if God had told the world, "if Brazil falls behind 2-0, they'll collapse and lose 7-1." Nate would have figured: "Hmmm, OK, so we have to subtract off the chance of 'Brazil gives up the first two goals, but then dramatically comes back to win the game,' since God says that can't happen."

Nate would have figured that's maybe 1 percent of all games, and said, "OK, I'm reducing my 65% to 64%."  

So, that particular imperfection in the model isn't really a serious flaw. 

But, now that I think about it ... imagine that when Nate published his 65% estimate, he explicitly mentioned, "hey, there's still a 1-in-3 chance that Brazil could lose ... and that includes a chance that Germany will kick the crap out of them. So don't get too cocky."  That would have helped him, wouldn't it?  It might even have made him look really good!

I mean, he shouldn't need to say it to statisticians, because it's an obvious logical consequence of his 65% estimate. But maybe it needs to be said to the public.


"It's hard to imagine how Silver could have been more wrong."

No, it's not hard to imagine at all. If Nate had predicted, "Germany has a 100% chance of winning 7-1," that would have been MUCH more wrong. 

6. Finally, and worst for last ... here's a UNC sociology professor lecturing Nate on how he screwed up, without apparently really understanding what's going on at all. I could spend an entire post on this one, but I'll just give you a summary. 

First, she argues that Nate should have paid attention to sportswriters, who said Brazil would struggle without those missing players. Researchers, she says, need to know when to listen to subject-matter experts, who know things Nate's mathematical models don't. 

Well, first, she's cherry-picking her sportswriters -- they didn't ALL say Brazil would lose badly, did they?  You can always find *someone*, after the fact, who bet the right way. So what?

As for subject-matter experts ... Nate actually *is* a subject-matter expert -- not on soccer strategy, specifically, but on how sports works mathematically. 

On the other hand, a sociology professor is probably an expert in neither. And it shows. At one point, she informs Nate that since the Brazilian team has been subjected to the emotional trauma of losing two important players, Nate shouldn't just sub in the skills of the two new players and run with it as if psychology isn't an issue. He should have *known* that kind of thing makes teams, and statistical models, collapse.

Except that ... it's not true, and subject-matter experts like Nate who study these things know that. There are countless cases of teams who are said to "come together" after a setback and win one for the Gipper -- probably about as many as appear to "collapse". There's no evidence of significant differences at all -- and certainly no evidence that's obvious to a sociologist in an armchair. 

Injuries, deaths, suspensions ... those happen all the time. Do teams play worse than expected afterwards?  I doubt it. I mean, you can study it, there's no shortage of data. After the deaths of Thurman Munson, Lyman Bostock, Ray Chapman, did their teams collapse?  I doubt it. What about other teams that lost stars to red cards?  Did they all lose their next game 7-1?  Or even 6-2, or 5-3?

Anyway, that's only about one-third of the post ... I'm going to stop, here, but you should read the whole thing. I'm probably being too hard on this professor, who didn't realize that Nate is the expert and not her, and wrote like she was giving a stock lecture to a mediocre undergrad student quoting random regressions, instead of to someone who actually wrote a best-selling book on this very topic.

So, moving along. 

------

There is one kind of argument that would legitimately count as evidence that Nate was wrong: if any of the critics had presented convincing evidence that Brazil actually had much less TALENT than Nate and others estimated -- evidence that was freely available before the game.

Something like, "Brazil, as a team, is 2.5 goals above replacement with all their players in action, but I can prove that, without Neymar and Silva, they're 1.2 goals *below* replacement!"

That would work. 

And, indeed, some of the critiques seem to be actually suggesting that. They imply, *of course* Brazil wouldn't be any good without those players, and how could anyone have expected they would be?  

Fine. But, then, why did the bookmakers think they still had a 49% chance? Are you so smart that you saw something the bookmakers didn't? OK, if you have a good argument that shows Brazil should have been 30%, or 20%, then, hey, I'm listening.

If the missing two players dropped Brazil from a 65% talent to a 20% talent, what is each worth individually? Silva is back for today's third-place game against Holland. What's your new estimate for Brazil ... maybe back to 40%?

Well, then, you're bucking the experts again. Brazil is actually the favorite today. The betting markets give them a 62% chance of beating the Netherlands, even though Neymar is still out. Nate has Brazil at 71%. If you think the Brazilians are really that bad, and Nate's model is a failure, I hope you'll be putting a lot of money on the Dutch today. 

Because, you can't really argue that Brazil is back to their normal selves today, right?  An awful team doesn't improve its talent that much, from 1-7 to favorite, just from the return of a single player, who's not even the team's best. No amount of coaching or psychology can do that.

If you thought Brazil's 7-1 humiliation was because of bad players, you should be interpreting today's odds as a huge mistake by the oddsmakers. I think they're confirmation that Tuesday's outcome was just a fluke. 

As I write this, the game has just started. Oh, it's 2-0 Netherlands. Perfect. You're making money, right?  Because, if you want to persuade me that you have a good argument that Nate was obviously and egregiously incorrect, now you can prove it: first, show me where you wrote he was still wrong and why; and, second, tell me how much you bet on the underdog Netherlands.

Otherwise, I'm going to assume you're just blowing hot air. Even if Brazil loses again today. 


-----

Update/clarification: I am not trying to defend Nate's methodology against others, and especially not against the Vegas line (which I trust more than Nate's, until there's evidence I shouldn't).  

I'm just saying: the 7-1 outcome is NOT, in and of itself, sufficient evidence (or even "good" evidence) that Nate's prediction was wrong.  




Wednesday, July 09, 2014

"The Cult of Statistical Significance"

"The Cult of Statistical Significance" is a critique of social science's overemphasis on confidence levels and its convention that only statistically-significant results are worthy of acceptance. It's by two academic economists, Stephen Ziliak and Deirdre McCloskey, and my impression is that it made a little bit of a splash when it was released in 2008.

I've had the book for a while now, and I've been meaning to write a review. But, I haven't finished reading it, yet; I started a couple of times, and only got about halfway through. It's a difficult read for me ... it's got a flowery style, and it jumps around a bit too much for my brain, which isn't good at multi-tasking. But a couple of weeks ago, someone on Twitter pointed me to this .pdf -- a short paper by the same authors, summarizing their arguments. 

------

Ziliak and McCloskey's thesis is that scientists are too fixated on significance levels, and not enough on the actual size of the effect. To illustrate that, they use an example of two weight-loss pills:


"The first pill, called "Oomph," will shed from Mom an average of 20 pounds. Fantastic! But Oomph is very uncertain in its effects—at [a standard error of] plus or minus 10 pounds. ... Could be ten pounds Mom loses; could be thrice that.

"The other pill you found, pill "Precision," will take 5 pounds off Mom on average but it is very precise—at plus or minus 0.5 pounds. Precision is the same as Oomph in price and side effects but Precision is much more certain in its effects. Great! ...

"Fine. Now which pill do you choose—Oomph or Precision? Which pill is best for Mom, whose goal is to lose weight?"

Ziliak and McCloskey -- I'll call them "ZM" for short -- argue that "Oomph" is the more effectual pill, and therefore the best choice. But, because its effect is not statistically significantly different from zero*, scientists would recommend "Precision". Therefore, the blind insistence on statistical significance costs Mom, and society, a high price in lost health and happiness.

(*In their example, the effect actually *is* statistically significant, at 2 SDs, but the authors modify the example later so it isn't.)

But: that isn't what happens in real life. In actual research, scientists would *observe* 20 pounds plus or minus 10, and try to infer the true effect as best they can. But here, the authors proceed as if we *already know* the true effect on Mom is 20 +/- 10. But if we did already know that, then, *of course* we wouldn't need significance testing!

Why do the authors wind up having their inference going the wrong way?  I think it's a consequence of failing to notice the elephant in the room, the fact that's the biggest reason significance testing becomes necessary. That elephant is: most pills don't work. 

What I suspect is that when the authors see an estimate of 20, plus or minus 10, they think that must be a reasonable, unbiased estimate of the actual effect. They don't consider that most true values are zero, therefore, most observed effects are just random noise, and that the "20 pounds" estimate is likely spurious.

That's the key to the entire issue of why we have to look at statistical significance -- to set a high-enough bar that we don't wind up inundated with false positives.

At best, the authors are setting up an example in which they already assume the answer, then castigating statistical significance for getting it wrong. And, sure, insisting on p < .05 will indeed cause false negatives like this one. But ZM fail to weigh those false negatives against the inevitable false positives we'd be flooded with if we didn't look at significance at all -- they don't acknowledge that we first need evidence that an effect exists.
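
Here's a rough sketch of that point as a Bayes calculation, using the pill numbers from the example. The prior -- only 1 pill in 100 really works, and a pill that works really does shed about 20 pounds -- is my own illustrative assumption, not anything from ZM's paper. The point is just that, under any prior in that spirit, an observed "20 plus or minus 10" is still probably noise.

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

observed, se = 20.0, 10.0
prior_works = 0.01            # assumption: 99 pills out of 100 are duds
effect_if_works = 20.0        # assumption: a working pill really sheds about 20 pounds

like_works = normal_pdf(observed, effect_if_works, se)
like_dud = normal_pdf(observed, 0.0, se)

posterior_works = (prior_works * like_works) / (
    prior_works * like_works + (1 - prior_works) * like_dud)

print(posterior_works)        # about 0.07 -- the observed "20 pounds" is probably spurious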

-----

In fairness, Ziliak and McCloskey don't say explicitly that they're rejecting the idea that most pills are useless. They might not actually even believe it. They might just be making statistical assumptions that necessarily assume it's true. Specifically:

-- In their example, they assume that, because the "Oomph" study found a mean of 20 pounds and SD of 10 pounds, that's what Mom should expect in real life. But that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- They also seem to assume the implication of that, that when you come up with a 95% confidence interval for the size of the effect, there is actually a 95% probability that the effect lies in that range. Again, that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- And, I think they assume that if a result comes out with a p-value of .75, it implies a 75% chance that the true effect is greater than zero. Same thing: that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

I can't read minds, and I probably shouldn't assume that's what ZM were actually thinking. But that one single assumption would easily justify their entire line of argument -- if only it were true. 

And it certainly *seems* justifiable, to assume that every effect size is equally likely. You can almost hear the argument being made: "Why assume that the drug is most likely useless?  Isn't that an assumption without a basis, an unscientific prejudice?  We should keep a completely open mind, and just let the data speak."  

It sounds right, but it's not. "All effects are equally likely" is just as strong a prejudice as "Zero is most likely."  It just *seems* more open-minded because (a) it doesn't have to be said explicitly, (b) it keeps everything equal, which seems less arbitrary, and (c) "don't be prejudiced" seems like a strong precedent, being such an important ethical rule for human relationships.

If you still think "most pills don't work" is an unacceptable assumption ... imagine that instead of "Oomph" being a pill, it was a magic incantation. Are you equally unwilling to accept the prejudice "most incantations don't work"?

If it is indeed true that most pills (and incantations) are useless, ignoring the fact might make you less prejudiced, but it will also make you more wrong. 

----

And "more wrong" is something that ZM want to avoid, not tolerate. That's why they're so critical of the .05 rule -- it causes "a loss of jobs, justice, profit, and even life."  Reasonably, they say we should evaluate the results not just on significance, but on the expected economic or social gain or loss. When a drug appears to have an effect on cancer that would save 1,000 lives a year ... why throw it away because there's too much noise?  Noise doesn't cost lives, while the pill saves them!

Except that ... if you're looking to properly evaluate economic gain -- costs and benefits -- you have to consider the prior. 

Suppose that 99 out of 100 experimental pills don't work. Then, when you get a p-value of .05, there's only about a 17 percent chance that the pill has a real effect. Do you want to approve cancer pills when you know five-sixths of them don't do anything?

(Why 5/6?  Of the 99 worthless drugs, about 5 of them will show significance just randomly. So you accept 5 spurious effects for each real effect.)

And that 17 percent is when you *do* have p=.05 significance. If you lower your significance threshold, it gets worse. When you have p=.20, say, you get 20 false positives for every real one.

Doing the cost-benefit analysis for Mom's diet pill ... if there's only a 1 in 6 chance that the effect is real, her expectation is a loss of 3.3 pounds, not 20. In that case, she is indeed better off taking "Precision" than "Oomph".
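
To make that arithmetic explicit, here it is in code, under the same assumptions as above: 1 pill in 100 really works, the working pill always reaches p < .05, and 5 percent of the useless ones get there by luck.

n_useless, n_real = 99, 1

false_positives = n_useless * 0.05    # about 5 duds sneak past the .05 threshold
true_positives = n_real * 1.0         # assume the real pill always gets flagged

p_real_given_significant = true_positives / (true_positives + false_positives)
print(p_real_given_significant)       # about 0.17 -- roughly 1 in 6

# Mom's expected loss from an "Oomph" that passed at p = .05:
print(p_real_given_significant * 20)  # about 3.3 pounds, not 20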

-----

If you don't read the article or book, here's the one sentence summary: Scientists are too concerned with significance, and not enough with real-life effects. Or, as Ziliak and McCloskey put it, 


"Precision is Nice but Oomph is the Bomb."

The "oomph" -- the size of the coefficient -- is the scientific discovery that tells you something about the real world. The "precision" -- the significance level -- tells you only about your evidence and your experiment.

I agree with the authors on this point, except for one thing. Precision is not merely "nice". It's *necessary*. 

If you have a family of eight and shop at Costco and need a new vehicle, "Tires are Nice but Cargo Space is the Bomb." That's true -- but the "Bomb" is useless without the "Nice".

Even if you're only concerned with real-world effects, you still need to consider p-values in a world where most  hypotheses are false. As critical as I have been about the way significance is used in practice, it's still something that's essential to consider, in some way, in order to filter out false positives, where you mistakenly approve treatments that are no better than sugar pills. 

None of that ever figures into the authors' arguments. Failing to note the false positives -- the word "false" doesn't appear anywhere in their essay, never mind "false positive" -- the authors can't figure out why everyone cares about significance so much. The only conclusion they can think of is that scientists must worship precision for its own sake. They write, 


"[The] signal to noise ratio of pill Oomph is 2-to-1, and of pill Precision 10-to-1. Precision, we find, gives a much clearer signal—five times clearer.

"All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way. "Well," say our significance testing colleagues, "the pill with the highest signal to noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision.” 

"But Precision—precision commonly defined as a large t-statistic or small p-value on a coefficient—is obviously the wrong choice. Wrong for Mother's weight-loss plan and wrong for the many other victims of the sizeless scientist. The sizeless scientist decides whether something is important or not—he decides "whether there exists an effect," as he puts it—by looking not at the something's oomph but at how precisely it is estimated. Mom wants to lose weight, not gain precision."

Really?  I have much, much less experience with academic studies than the authors, but ... I don't recall ever having seen papers boast about how precise their estimates are, except as evidence that effects are significant and real. I've never seen anything like, "My estimates are 7 SDs from zero, while yours are only 4.5 SDs, so my study wins!  Even though yours shows cigarettes cause millions of cancer deaths, and mine shows that eating breakfast makes you marginally happier."

Does that really happen?

-------

Having said that, I agree emphatically with the part of ZM's argument that says scientists need to pay more attention to oomph. I've seen many papers that spend many, many words arguing that an effect exists, but then hardly any examining how big it is or what it means. Ziliak and McCloskey refer to these significance-obsessed authors as "sizeless scientists." 

(I love the ZM terminology: "cult," "oomph," "sizeless".) 

Indeed, sometimes studies find an effect size that's so totally out of whack that it's almost impossible -- but they don't even notice, so focused are they on significance levels.

I wish I could recall an example ... well, I can make one up, just to give you the flavor of how I vaguely remember the outrageousness. It's like, someone finds a statistically-significant relationship between baseball career length and lifespan, and trumpets how he has statistical significance at the 3 percent level ... but doesn't realize that his coefficient estimates a Hall-of-Famer's lifespan at 180 years. 

If it were up to me, every paper would have to show the actual "oomph" of its findings in real-world terms. If you find a link between early-childhood education and future salary, how many days of preschool does it take to add, say, a dollar an hour?  If you find a link between exercising and living longer, how many marathons does it take to add a month to your life?  If fast food is linked with childhood obesity, how many pounds does a kid gain from each Happy Meal?  

And we certainly do also need less talk of precision. My view is that you should spend maybe one paragraph confirming that you have statistical significance. Then, shut up about it and talk about the real world. 

If you're publishing in the Journal of Costcological Science, you want to be talking about cargo space, and what the findings mean for those who benefit from Costcology. How many fewer trips to Costco will you make per year?  Is it now more efficient to get your friends to buy you gift cards instead of purchasing a membership? Are there safety advantages to little Joey no longer having to make the trip home with an eleven-pound jar of Nutella between his legs?

You don't want to be going on and on about, how, yes, the new vehicle does indeed have four working tires!  And, look, I used four different chemical tests to make sure they're actually made out of rubber!  And did I mention that when I redo the regression but express the cargo space in metric, the car still tests positive for tires?  It did!  See, tires are robust with respect to the system of mensuration!

For me, one sentence is enough: "The tire treads are significant, more than 2 mm from zero."  

-----

So I agree that you don't need to talk much about the tires. The authors, though, seem to be arguing that the tires themselves don't really matter. They think drivers must just have some kind of weird rubber fetish. Because, if the vehicle has enough cargo space, who cares if the tires are slashed?

You need both. Significance to make sure you're not just looking at randomness, and oomph to tell you what the science actually means.



Wednesday, July 02, 2014

When a null hypothesis makes no sense

In criminal court, you're "innocent until proven guilty."  In statistical studies, it's "null hypothesis until proven significant."

The null hypothesis, generally, is the position that what you're looking for isn't actually there. If you're trying to prove that early-childhood education leads to success in adulthood, the default position is "we're going to assume it doesn't until evidence proves otherwise."

Why do we make "no" the null?  It's because, most times, there really IS nothing there. Pick a random thing and a random life outcome: shirts, marriage. Is there a relationship between shirt color and how happy a marriage you'll have?  Probably not. So "not" becomes the null hypothesis.

Carl Sagan famously said, "extraordinary claims require extraordinary evidence."  And, in a world where most things are unrelated, "my drug shrinks tumors" is indeed an extraordinary claim.

The null hypothesis is the one that's the LEAST extraordinary -- the one that's most likely, in some common-sense way. "Randomness caused it until proven otherwise," not "Fairies caused it until proven otherwise."

In studies, authors usually gloss over that, and just use the convention that the null is always "zero". They'll say, "the difference between the treatment and control groups is not statistically-significantly different from zero, so we do not reject the hypothesis that the drug is of no benefit."

-------

But, "zero" isn't always the least extraordinary claim.

I believe that teams holding a 1-0 lead in hockey games get overconfident and wind up doing worse than expected. So, I convince Gary Bettman to randomly pick a treatment group of teams, and give them a free goal to start the first game of their season. At the end of the year, I compare goal differential between the treatment and control groups.

Which should my null hypothesis be?

--  The treatment has an effect of 0
--  The treatment has an effect of +1

Obviously, it's the second one. The first one, even though it includes the typical "zero," is, nonetheless, an extraordinary claim: that you give one group a one-goal advantage, but by the end of the year, that advantage has disappeared. Instead of saying "innocent until proven guilty," you're saying, "one goal guilty unless proven otherwise."  But that's hidden, because you use the word "zero" instead of "one goal guilty."

If you use 0 instead of +1, you're effectively making your hypothesis the default, by stealth. 

(In this case, the null should be +1 ... in real life, the researcher would probably keep that null in substance, but transform the model to put the conventional "0" back in. Instead of Y = b(treatment dummy), they'd use the model Y = (b+1)(treatment dummy), so that b=0 now means "no effect other than the obvious extra goal".)
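
As a concrete sketch of the difference between the two nulls, here's a small simulation. The goal-differential numbers are invented, the sample size is exaggerated so the contrast is visible, and the "treatment effect" is exactly the free goal and nothing more. The t-statistic against the conventional null of 0 will usually look "significant" -- but all it has detected is the goal you handed out. Against the honest null of +1, there's nothing there.

import math
import random
import statistics

random.seed(1)
n = 20_000      # exaggerated number of team-seasons, just to make the contrast visible
sd = 30         # made-up standard deviation of a season's goal differential

control = [random.gauss(0, sd) for _ in range(n)]
treatment = [random.gauss(0, sd) + 1 for _ in range(n)]   # the free goal, and nothing else

diff = statistics.mean(treatment) - statistics.mean(control)
se = math.sqrt(statistics.variance(treatment) / n + statistics.variance(control) / n)

print("t against a null of 0: ", (diff - 0) / se)   # tests "the free goal evaporated"
print("t against a null of +1:", (diff - 1) / se)   # tests "no effect beyond the free goal"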

What that shows is: it's not enough that you use "0". You have to make an argument about whether your zero is an appropriate null hypothesis for your model. If you choose the right model, and the peer reviewers don't notice, you can "privilege your hypothesis" by making "zero" represent anything you like.

But that's actually not my main point here.

------

A while ago, I saw a blog post where an economist ran a regression to predict wins from salaries, for a certain sport. The coefficient was not statistically-significantly different from zero, so the author declared that we can't reject the null hypothesis that team payroll is unrelated to team performance.

But, in this case, "salary has an effect of zero" is not a reasonable null hypothesis. Why?  Because we have strong, common-sense knowledge that salary DOES sometimes have an effect. 

That knowledge is: we all know that better free-agent players get paid higher salaries. If you don't believe that's the case -- if you don't believe that LeBron James will earn more than a bench player next season -- you are excused from this argument. But, the economist who did that regression certainly believes it.

In light of that, "zero" is no longer the likeliest, least extraordinary, possibility, so it doesn't qualify as a null. 

That doesn't mean it can't still be the right answer. It could indeed turn out that the relationship between salary and wins is truly 0.00000. For that to happen, it would have to be that other factors exactly cancel out the LeBron factor.

Suppose every million dollars more you spend on LeBron James gives you 0.33446 extra wins, on average (from a God's-eye view). In that case, if you use "zero" in your null hypothesis, it's exactly equivalent to this alternative:

"For every million dollars less you spend on LeBron, you just happen to get exactly 0.33446 extra wins from other players."

Well, that's completely arbitrary!  Why would 0.33446 be more likely than 0.33445, or any other number?  There's no reason to believe that 0.33446 is "least extraordinary."  And so there's no reason to believe that the original "zero" is least extraordinary.

Moreover, if you use a null hypothesis of zero, you're contradicting yourself, because you're insisting on two contradictory things:

(1)  Players who sign for a lot more money, like LeBron, are generally much better players.

(2) We do not reject the assumption that the amount of money a team pays is irrelevant to how good it is.

You can believe either one of these, but not both. 

-----

It used to be conventional wisdom that women over 40 should get a mammogram every year. The annual scan, it was assumed, would help discover cancer earlier, and lead to better outcomes.

Recent studies, though, dispute that conclusion. They say that there is no evidence of any difference in cancer survival or diagnosis rates between women who get the procedure and women who don't.  

Well, same problem: the null of "no difference" is an arbitrary one. It's the same argument as in the salary case:

Imagine two identical women, with the same cancer. One of them gets a mammogram, the cancer is discovered, and she starts treatment. The other doesn't get the mammogram, and her cancer isn't discovered until later.

Obviously, the diagnosis MUST make a difference in the expected outcomes for these two patients. Nobody believes that whether you get treatment early or late makes NO difference, right?  Otherwise, doctors would just shrug and ignore the mammogram.

But, the null hypothesis of "zero difference" suggests that, when you add in all the other women, the expected overall survival rates should be *exactly the same*. 

That's an extraordinary claim. Sure it's *possible* that the years of life lost by the undiagnosed cancer are exactly offset by the years lost from the unnecessary treatment from false positives after a mammogram. Like, for instance, the 34 cancer patients who didn't get the mammogram each lose 8.443 years off their lives, and the 45 false-positives each lose 6.379 years, and if you work it out, it comes to exactly zero. 

"We can't reject that there is no difference" is exactly as arbitrary as "We can't reject that the difference is the cosine of 1.2345".

Unless, of course, you have an argument about how zero is a special case. If you DID want to argue that cancer treatment is completely useless, then, certainly, your zero null would be appropriate. 

------

"Zero" works well as a null hypothesis when it's most plausible that there's nothing there at all, when it's quite possible that there isn't any trace of a relationship. It's inappropriate otherwise: when there's SOME  evidence of SOME real relationship, SOME of the time. 

In other words, zero works when it's a synonym for "there's no relationship at all."   It doesn't work when it's a synonym for, "the relationship is so small that it might as well be zero."

The null hypothesis works as a defense against the placebo effect. It does not work as a defense against actual effects that happen to be close to zero.

But, isn't it almost the same thing?  Isn't it just splitting hairs?

No, not at all. It's an important distinction.

There are two different questions you may want a study to answer. First: is there actually a relationship there?  And, second, if there is a relationship there, how big is it?

The traditional approach is: if you don't get statistical significance, you're considered to have not proved it's really there -- and, therefore, you're not allowed to talk about how big it might be. You have to stop dead.

But, in the case of the mammogram studies, you shouldn't have to prove it's really there. Under any reasonable assumptions a researcher might have about mammograms and cancer, there MUST be an effect. Whether the observed size is bigger or smaller than twice the SD -- which is the criterion for "existence" -- is completely irrelevant. You already know that an effect must exist.

If you demand statistical proof of existence when you already know it's there, you're choosing to ignore perfectly good information, and you're misleading yourself.

That's what happened in the Oregon Medicaid study. It found that Medicaid coverage was associated with "clinically significant" improvements in reducing hypertension. But they ignored those improvements, because there wasn't enough data to constitute sufficient evidence -- evidence that there actually is a relationship between having free doctor visits and having your hypertension treated. 

But that's silly. We KNOW that people behave differently when they have Medicaid than when they don't. That's why they want it, so they can see doctors more and pay less. There MUST be actual differences in the two groups. We just don't know how large.  

But, because the authors of the study chose to pretend that existence was in doubt, they threw away perfectly good evidence. Imprecise evidence, certainly -- the confidence interval was very wide. But imprecision was not the problem. If the point estimate had been just an SD higher than it was, they would have accepted it at face value, imprecision be damned. 

-------

One last analogy:

The FDA has 100 untested pills that drugmakers say treat cancer. The FDA doesn't know anything about them. However, God knows, and He tells you. 

It turns out 96 of the 100 compounds don't work at all -- they have no effect on cancer whatsoever, no more than sugar pills. The other four do something. They may help cancer, or they may hurt it, and all to different degrees. (Some of the four may even have an effect size of zero -- but if that's the case, they actually do something to the cancer, but the good things they do are balanced out by the bad things.)

You test one of the pills. The result is clinically significant, but only 0.6 SD from zero, not nearly strong enough to be statistically significant. It's reasonable for you to say, "well, I'm not even going to look at the magnitude of the effect, because, it's likely that it's just random noise from one of the 96 sugar pills."

You test another one of the pills, and get the same result. But this time, God pops His head into the lab and says, "By the way, that drug is one of the four that actually do something!"  

This time, the size of the effect matters, doesn't it?  You'd look like a fool to refuse to consider the evidence, given that you now know the pill is doing *something* to the cancer.

Well, that's what happens in real life. God has been telling us -- through common sense and observation -- that better players cost more money, that patients on Medicaid get more attention from doctors, and that patients with a positive mammogram get treated earlier. 

Clinging to a null hypothesis of "no real effect," when we know that null hypothesis is false, makes no sense at all.







Wednesday, June 18, 2014

Absence of evidence: the Oregon Medicaid study

In 2008, the State of Oregon created a randomized, controlled experiment to study the effect of Medicaid on health. For the randomization, they held a lottery to choose which 10,000 of the 90,000 applicants would receive coverage. Over the following years, researchers were able to compare the covered and uncovered groups, to check for differences in subsequent health outcomes, and to determine whether the covered individuals had more conditions diagnosed and treated.

Last month brought the publication of another study. In 2006, the state of Massachusetts had instituted health care reform, and this new study compared covered individuals pre- and post-reform, both within Massachusetts and relative to other states.

The two states' results appeared to differ. In Oregon, the studies found improvement in some health measures, but no statistically-significant change in most others. On the other hand, the Massachusetts study found large benefits across the board.

Why the differences? It turns out the Oregon study had a much smaller dataset, so it was unable to find statistical significance in most of the outcomes it studied. Austin Frakt, of "The Incidental Economist," massaged the results in the Oregon study to make them comparable to the Massachusetts study. Here's his diagram comparing the two confidence intervals:




The OR study's interval is ten times as wide as the MA study's! So, obviously, there's no way it would have been able to find statistical significance for the size of the effect the MA study found.  

On the surface, it looks like the two studies had radically different findings: MA found large benefits in cases where OR failed to find any benefit.  But that's not right. What really happened is: MA found benefits in cases where OR really didn't have enough data to decide one way or the other.

-----

The Oregon study is another case of "absence of evidence is not evidence of absence." But, I think, what really causes this kind of misinterpretation is the conventional language used to describe non-significant results.  

In one of the Oregon studies, the authors say this:


"We found no significant effect of Medicaid coverage on the prevalence or diagnosis of hypertension."

Well, that's not true. For one thing, the authors say "significant" instead of "statistically significant." Those are different -- crucially different.  "Not significant" means, "has little effect."  "Not statistically significant" means, "the sample size was too small to provide evidence of what the effect might be."

When the reader sees "no significant improvements," the natural inference is that the authors had reasonable evidence, and concluded that any improvements were negligible. That's almost the opposite of the truth -- insufficient evidence either way.  

In fact, it's even more "not true," because the estimated effect on hypertension WAS "significant" in the real-life sense:


"The 95% confidence intervals for many of the estimates of effects on individual physical health measures were wide enough to include changes that would be considered clinically significant — such as a 7.16-percentage-point reduction in the prevalence of hypertension."

So, at the very least, the authors should have put "statistically" in front of "significant," like this:


"We found no statistically-significant effect of Medicaid coverage on the prevalence or diagnosis of hypertension."

Better!  Now, the sentence is no longer false.  But now it's meaningless.

It's meaningless because, as I wrote before, statistical significance is never a property of a real-world effect. It's a property of the *data*, a property of the *evidence* of the effect.

Saying "Medicaid had no statistically-significant effect on patient mortality" is like saying, "OJ Simpson had no statistically-significant effect on Nicole Brown Simpson's mortality." It uses an adjective that should apply only to the evidence, and improperly applies it to the claim itself.

Let's add the word "evidence," so the sentence makes sense:


"We found no statistically-significant evidence of Medicaid coverage's effect on the prevalence or diagnosis of hypertension."

We're getting there.  Now, the sentence is meaningful. But, in my opinion, it's misleading. To my ear, it implies that, specifically, you found no evidence that an effect exists. But, you also didn't find any evidence that an effect *doesn't* exist -- especially relevant in this case, where the point estimate was "clinically significant."  

So, change it to this:


"We found no statistically-significant evidence of whether or not Medicaid coverage affects the prevalence or diagnosis of hypertension."

Better again. But it's still misleading, in a different way. It's phrased in such a way that it implies that it's an important fact that they found no statistically-significant evidence.  

Because, why else say it at all? These researchers aren't the only ones with no evidence. I didn't find any evidence either. In fact, of the 7 billion people on earth, NONE of them, as far as I know, found any statistically-significant evidence for what happened in Oregon. And none of us are mentioning that in print.

The difference is: these researchers *looked* for evidence. But, does that matter enough?

Mary is murdered.  The police glance around the murder scene. They call Mary's husband John, and ask him if he did it. He says no. The police shrug. They don't do DNA testing, they don't take fingerprints, they don't look for witnesses, and they just go back to the station and write a report. And then they say, "We found no good evidence that John did it."

That's the same kind of "true but misleading." When you say you didn't find evidence, that implies you searched.  Doesn't it? Otherwise, you'd say directly, "We didn't search," or "We haven't found any evidence yet."

Not only does it imply that you searched ... I think it implies that you searched *thoroughly*.  That's because of the formal phrasing: not just, "we couldn't find any evidence," but, rather, "We found NO evidence." In English, saying it that way, with the added dramatic emphasis ... well, it's almost an aggressive, pre-emptive rebuttal.

Ask your kid, "Did you feed Fido like I asked you to?". If he replies, "I couldn't find him," you'll say, "well, you didn't look hard enough ... go look again!" 

But if he replies, "I found no trace of him," you immediately think ... "OMG, did he run away, did he get hit by a car?" If it turns out Fido is safely in his doggie bed in the den, and your kid didn't even leave the bedroom ... well, it's literally true that he found "no trace" of Fido in his room, but that doesn't make him any less a brat.

In real life, "we found no evidence for X" carries the implication, "we looked hard enough that you should interpret the absence of evidence of X as evidence of absence of X." In the Oregon study, the implication is obviously not true. The researchers weren't able to look hard enough.  Not that they weren't willing -- just that they weren't able, with only 10,000 people in the dataset they were given.

In that light, instead of "no evidence of a significant effect," the study should have said something like,


"The Oregon study didn't contain enough statistically-significant evidence to tell us whether or not Medicaid coverage affects the prevalence or diagnosis of hypertension."

If the authors did that, there would have been no confusion. Of course, people would wonder why Oregon bothered to do the experiment at all, if they could have realized in advance there wouldn't be enough data to reach a conclusion.

------

My feeling is that for most studies, the authors DO want to imply "evidence of absence" when they find a result that's not statistically significant.  I suspect the phrasing has evolved in order to evoke that conclusion, without having to say it explicitly. 

And, often, "evidence of absence" is the whole point. Naturopaths will say, "our homeopathic medicines can cure cancer," and scientists will do a study, and say, "look, the treatment group didn't have any statistically-significant difference from the control group." What they really mean -- and often say -- is, "that's enough evidence of absence to show that your belief in clutch hitters stupid pseudo-medicine is useless, you greedy quacks."  

And, truth be told, I don't actually object to that. Sometimes absence of evidence IS evidence of absence. Actually, from a Bayesian standpoint, it's ALWAYS evidence of absence, albeit perhaps to a tiny, tiny degree.  Do unicorns exist? Well, there isn't one in this room, so that tips the probability of "no" a tiny bit higher. But, add up all the billions of times that nobody has seen a unicorn, and the probability of no unicorns is pretty close to 1.  

You don't need to do any math ... it's just common sense. Do I have a spare can opener in the house? If I open the kitchen drawer, and it isn't there ... that's good evidence that I don't have one, because that's probably where I would have put it. On the other hand, if I open the fridge and it's not there, that's weak evidence at best.

We do that all the time. We use our brains, and all the common-sense prior knowledge we have. In this case, our critical prior assumption is that spare can openers are much more likely to be found in the drawer than in the fridge.

If you want to go from "absence of evidence" to "evidence of absence," you have to be Bayesian. You have to use "priors," like your knowledge of where the can opener is more likely to be. And if you want to be intellectually honest, you have to use ALL your priors, even those that work against your favored conclusion. If you only look in the fridge, and the toilet, and the clothes closet, and you tell your wife, "I checked three rooms and couldn't find it," ... well, you're being a dick. You're telling her the literal truth, but hoping to mislead her into reaching an incorrect conclusion.
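
If you want to see the can-opener logic as an actual Bayesian update, here's a toy version. Every probability in it is invented for illustration.

p_have = 0.5                  # prior: 50-50 that a spare opener exists somewhere in the house
p_in_drawer_if_have = 0.9     # if it exists, it's probably in the kitchen drawer
p_in_fridge_if_have = 0.01    # ...and almost certainly not in the fridge

def posterior_after_miss(p_found_there_if_have):
    # P(a spare opener exists | I looked in one place and didn't find it)
    p_miss_if_have = 1 - p_found_there_if_have
    p_miss_if_not = 1.0       # if there's no opener, you certainly won't find one
    return (p_have * p_miss_if_have) / (p_have * p_miss_if_have + (1 - p_have) * p_miss_if_not)

print(posterior_after_miss(p_in_drawer_if_have))   # about 0.09: decent evidence of absence
print(posterior_after_miss(p_in_fridge_if_have))   # about 0.50: you've learned almost nothing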

If you want to claim "evidence of absence," you have to show that if the effect *was* there, you would have found it. In other words, you have to convince us that you really *did* look everywhere for the can opener.  

One way to do that is to formally look at the statistical "power" of your test. But, there's an easier way: just look at your confidence interval.  If it's narrow around zero, that's good "evidence of absence". If it's wide, that's weak "evidence of absence."  

In the Oregon study, the confidence interval for hypertension is obviously quite wide. Since the point estimate is "clinically significant," the edge of the confidence interval -- the point estimate plus 2 SDs -- must be *really* clinically significant.  
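
Here's what that "just look at the confidence interval" check might look like in code. The numbers are invented -- the second pair is loosely in the spirit of the hypertension example, not the study's actual estimates -- and "clinically meaningful" is whatever threshold makes sense for the problem at hand.

def describe(point_estimate, se, clinically_meaningful):
    lo, hi = point_estimate - 2 * se, point_estimate + 2 * se
    if max(abs(lo), abs(hi)) < clinically_meaningful:
        return "CI (%.1f, %.1f): narrow around zero -- decent evidence of absence" % (lo, hi)
    return "CI (%.1f, %.1f): includes clinically meaningful effects -- no clue either way" % (lo, hi)

# Two results, both "not statistically significant," that deserve very different write-ups:
print(describe(point_estimate=-0.5, se=0.4, clinically_meaningful=5.0))
print(describe(point_estimate=-7.2, se=4.5, clinically_meaningful=5.0))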

The thing is, the convention for academic studies is that even if the estimate isn't statistically significant, you don't treat it differently in high-power studies versus low-power studies. The standard phrasing is the same either way: "There was no statistically-significant effect."  

And that's misleading.

Especially when, as in the Oregon case, the study is so underpowered that even your "clinically significant" result is far from statistical significance. Declaring "the effect was not statistically significant," for a study that weak, is as misleading as saying "The missing can opener could not be found in the kitchen," when all you did was look in the fridge.

If you want to argue for evidence of absence, you have to be Bayesian. You have to acknowledge that your conclusions about absence are subjective. You have to make an explicit argument about your prior assumptions, and how they lead to your conclusion of evidence of absence.  

If you don't want to do that, fine. But then, your conclusion should clearly and directly acknowledge your ignorance. "We found no evidence of a significant effect" doesn't do that: it's a "nudge nudge, wink wink" way to imply "evidence of absence" under the table.

If you don't have statistical significance, here's what you -- and the Oregon study -- should say instead:


"We have no clue.  Our study doesn't have enough data to tell."

