Tuesday, July 29, 2014

Are CEOs overpaid or underpaid?

Corporate executives make a lot of money. Are they worth it? Are higher-paid CEOs actually better than their lower-paid counterparts?

Business Week magazine says, no, they're not, and they have evidence to prove it. They took 200 highly-paid CEOs, and did a regression to predict their company's stock performance from their chief executive's pay. The plot looks highly random, with an r-squared of 0.01. Here's a stolen copy:

The magazine says,

"The comparison makes it look as if there is zero relationship between pay and performance ... The trend line shows that a CEO’s income ranking is only 1 percent based on the company’s stock return. That means that 99 percent of the ranking has nothing to do with performance at all. ...

"If 'pay for performance' was really a factor in compensating this group of CEOs, we’d see compensation and stock performance moving in tandem. The points on the chart would be arranged in a straight, diagonal line."

I think there are several reasons why that might not be right.

First, you can't go by the apparent size of the r-squared. There are a lot of factors involved in stock performance, and it's actually not unreasonable to think that the CEO would only be 1 percent of the total picture.

Second, an r-squared of 0.01 implies a correlation of 0.1. That's actually quite large. I bet if you ran a correlation of baseball salaries to one-week team performance, the r-squared would probably be just as small -- but that wouldn't mean players aren't paid by performance. As I've written before, you have to look at the regression equation, because even the smallest correlation could imply a large effect.

Third, the study appears to be based on a dataset created by Equilar, a consulting firm that advises on executive pay. But Equilar's study was limited to the 200 best-paid CEOs, and that artificially reduces the correlation. 

If you take only the 30 best-paid baseball players, and look at this year's performance on the field, the correlation will be only moderate. But if you add in the rest of the players, and minor-leaguers too, the correlation will be much higher. 

(If you don't believe me: find any scatterplot that shows a strong correlation. Take a piece of paper and cover up the leftmost 90% of the datapoints. The 10% that remain will look much more random.)
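If you'd rather simulate than cover up a scatterplot, here's a rough Python sketch of the same effect. The data is made up (y is just x plus random noise, nothing to do with CEOs or ballplayers), and `pearson_r` is a hypothetical helper name, so treat it as an illustration rather than a re-analysis of anything.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out so no extra libraries are needed."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up data with a strong overall relationship: y = x + noise.
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [x + rng.gauss(0, 0.5) for x in xs]
r_all = pearson_r(xs, ys)

# "Cover up" the leftmost 90% of the points: keep only the top 10% by x.
pairs = sorted(zip(xs, ys))
top = pairs[int(0.9 * len(pairs)):]
r_top = pearson_r([p[0] for p in top], [p[1] for p in top])

print(f"all points: r = {r_all:.2f}")
print(f"top 10%:    r = {r_top:.2f}")
```

The relationship between x and y is identical in both groups; only the range of x has been cut down, and the correlation drops with it.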

Fourth, the observed correlation is fairly statistically significant, at p=0.08 (one-tailed -- calculate it here). That could be just random chance, but, on its face, 0.08 does suggest there's a good chance there's something real going on. On the other hand, the result probably comes out "too" significant because the 200 datapoints aren't really independent. It could be the case, for instance, that CEOs tend to get paid more in the oil industry, and, coincidentally, oil stocks happen to have done well recently.
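For anyone who wants to check the p-value without an online calculator, here's a quick sketch: convert r to a t-statistic, then look up the upper tail. One assumption to flag: this uses a normal approximation in place of the exact t distribution, which should be close with 198 degrees of freedom, and it takes the 200 points as independent, which (as noted above) they probably aren't.

```python
import math
from statistics import NormalDist

r = 0.1   # correlation implied by an r-squared of 0.01
n = 200   # number of CEOs in the chart

# Standard t-statistic for testing a correlation against zero.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Assumption: normal approximation to the t distribution (df = 198).
p_one_tailed = 1 - NormalDist().cdf(t_stat)

print(f"t = {t_stat:.3f}, one-tailed p = {p_one_tailed:.3f}")
```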


BTW, I don't think there's a full article accompanying the Business Week chart; I think what's in that link is all we get. Which is annoying, because it doesn't tell us how the 200 CEOs were chosen, or what years' stock performance was looked at. I'm not even sure that the salaries were negotiated in advance. If they weren't, of course, the result is meaningless, because it could just be that successful companies rewarded their executives after the fact. 

Furthermore, the chart doesn't match the text. The reporters say they got an r-squared of 0.01. I measured the slope of the regression line in the chart, by counting pixels, and it appears to be around 0.06. But an r of 0.06 implies an r-squared of 0.0036, which is far short of the 0.01 figure. Maybe the authors rounded up, for effect? 

It could be that my pixel count was off. If you raise the slope from 0.06 to 0.071, you now get an r-squared of about 0.0050, which does round to 0.01. So, for purposes of this post, I'm going to assume the r is actually 0.07.


A correlation of 0.07 means that, to predict a company's performance ranking, you have to regress its CEO pay ranking 93% towards the mean. (This works out because the X and Y variables have the same SD, both consisting of numbers from 1 to 200.)

In other words, 7 percent of the differences are real. That doesn't sound like much, but it's actually pretty big. 

Suppose you're the 20th ranked CEO in salary. What does that say about your company's likely performance? It means you have to regress it 93% of the way back to 100.5. That takes you to 95th. 

So, CEOs that get paid 20th out of 200 improve their company's stock price by 5 rankings more than CEOs who get paid 100.5th out of 200.
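Here's that regression-to-the-mean step as a small Python sketch. The function name `predicted_rank` is made up for illustration; the 0.07 is my assumed r from above, and the shortcut only works because both variables are rankings from 1 to 200 with the same SD.

```python
def predicted_rank(pay_rank, r=0.07, n=200):
    """Expected performance rank given pay rank, when both are ranks 1..n."""
    mean_rank = (n + 1) / 2          # 100.5 for n = 200
    # Keep only a fraction r of the distance from the mean.
    return mean_rank + r * (pay_rank - mean_rank)

for pay_rank in (1, 20, 100.5, 200):
    print(pay_rank, round(predicted_rank(pay_rank), 1))
```

The 20th-ranked CEO comes out at about 94.9, which is where the "95th" comes from.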

How big is five rankings?

I found a website that allowed me to rank all the stocks in the S&P 500 by one-year return. (They use one year back from today, so, your numbers may be different by the time you try it.  Click on the heading "1-Year Percent.")  

The top stock, Micron Technology, gained 151.47%. The bottom stock, Avon, lost 42.80%.

The difference between #1 and #500 is 194.27 percentage points. Divide that by 499, and the average one-spot-in-the-rankings difference is 0.39 percentage points.

Micron is actually a big outlier -- it's about 33 points higher than #2 (Facebook), and 52 points higher than #5 (Under Armour). So, I'm going to arbitrarily reduce the difference from 0.39 to 0.3, just to be conservative.

On that basis, five rankings is the equivalent of 1.5 percentage points in performance.

How much money is that, in real-life terms, for a stock to overperform by 1.5 points?

On the S&P 500, the average company has a market capitalization (that is, the total value of all outstanding stock) of $28 billion (.pdf). For the average company, then, 1.5 points works out to $420 million in added value.

If you want to use the median rather than the mean, it's $13.4 billion and $200 million, respectively.

Either way, it's a lot more than the difference in CEO compensation.

From the Business Week chart, the top CEO made about $142 million. The 200th CEO made around $12.5 million. The difference is $130 million over 199 rankings, or $650K per ranking. (The top four CEOs are outliers. If you remove them, the spread drops by half. But I'll leave them in anyway.)

Moving up the 80 rankings in our hypothetical example is worth only a $52 million raise -- much less than the apparent value added:

Pay difference:      $52 million
Median value added: $200 million
Mean value added:   $420 million
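In case you want to check the arithmetic behind those three numbers, here it is as a short Python sketch. Every input is a figure quoted in this post; the variable names are just mine, and the 0.3 points per ranking is my arbitrary round-down from 0.39, not something measured.

```python
# One-year S&P 500 returns, in percent, as quoted above.
top_return = 151.47       # Micron
bottom_return = -42.80    # Avon
per_rank_raw = (top_return - bottom_return) / 499   # about 0.39 points per spot
per_rank = 0.3            # assumption: conservative round-down for the outlier
points = 5 * per_rank     # five rankings -> 1.5 percentage points

mean_cap = 28e9           # mean S&P 500 market cap, dollars
median_cap = 13.4e9       # median S&P 500 market cap, dollars
mean_value = mean_cap * points / 100
median_value = median_cap * points / 100

# Pay: $142M at the top, $12.5M at 200th, spread over 199 rankings.
pay_per_rank = (142e6 - 12.5e6) / 199
pay_diff = 80 * pay_per_rank   # moving up 80 rankings

print(f"per-rank return:    {per_rank_raw:.2f} points")
print(f"pay difference:     ${pay_diff / 1e6:.0f} million")
print(f"median value added: ${median_value / 1e6:.0f} million")
print(f"mean value added:   ${mean_value / 1e6:.0f} million")
```

The median comes out at $201 million, which I've been rounding to $200 million.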

Moreover ... the value of a good CEO is much higher, obviously, for a bigger company. The ten biggest companies on the S&P 500 have a market cap of at least $200 billion each. For a company of that size, the equivalently "good" CEO -- the one paid 20th out of 200 -- is worth three billion dollars. That's *60 times* the average executive salary.

Assuming my arithmetic is OK, and I didn't drop a zero somewhere.


So, I think the Business Week regression shows the opposite of what they believe it shows. Taking the data at face value, you'd have to conclude that executives are underpaid according to their talent, not overpaid.

I'm not willing to go that far. There's a lot of randomness involved, and, as I suggested before, other possible explanations for the positive correlation. But, if you DO want to take the chart as evidence of anything, it's evidence that there is, indeed, a substantial connection between pay and performance. The r-squared of less than 0.01 only *looks* tiny.


Although I think this is weak evidence that CEOs *do* make a difference that's bigger than their salary, the numbers certainly suggest that they *can* make that big an impact.

Suppose you own shares of Apple, and they're looking for a new CEO. A "superstar" candidate comes along. He wants twice as much money as normal. As a shareholder, do you want the company to pay it? 

It depends what you expect his (or her) production to be. What kind of difference do you think a good CEO will make in the company's performance?

Suppose that, next year, you think Apple will earn $6.50 a share with a "replacement level" CEO. How much more do you expect with the superstar CEO?

If you think he or she can make a 1% difference, that's an extra 6.5 cents per share. That might be too high. How about one cent a share -- $6.51 instead of $6.50? Does that seem reasonable? 

Apple trades at around 15 times annual earnings. So, one cent in earnings means about 15 cents on the stock price. With six billion Apple shares outstanding, 15 cents a share gives the superstar CEO a "value above replacement" of $900 million.

So, for a company as big as Apple, if you *do* think a CEO can make a 1-part-in-650 difference in earnings, even the top CEO salary of $142 million looks cheap.
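The Apple calculation, spelled out as a Python sketch. The inputs are the round numbers from this post (one cent of extra earnings, a multiple of 15, six billion shares); the variable names are mine, and none of this is a forecast.

```python
eps_gain = 0.01          # extra earnings per share, dollars ($6.51 vs. $6.50)
pe_ratio = 15            # approximate price/earnings multiple
shares = 6e9             # approximate shares outstanding
top_ceo_pay = 142e6      # highest pay on the Business Week chart

price_gain = eps_gain * pe_ratio        # about 15 cents per share
value_above_replacement = price_gain * shares

print(f"price gain per share: ${price_gain:.2f}")
print(f"value added: ${value_above_replacement / 1e6:.0f} million")
print(f"multiple of top CEO pay: {value_above_replacement / top_ceo_pay:.1f}x")
```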

Apple has the largest market cap of all 500 companies in the index, at about 15 times the average, so it's perhaps a special case. But it shows that CEOs can certainly create, or destroy, a lot more value than their salaries.


So can you conclude that corporate executives are underpaid? Not unless you can provide good evidence that a particular CEO really is that much better than the alternatives. 

There's a lot of luck involved in how a company's business goes -- it depends on the CEO's decisions, sure, but also on the overall market, and the actions of competitors, and advances in technology in general, and world events, and Fed policy, and random fads, and a million other things. It's probably very hard to figure out who the best CEOs are, even based on a whole career. I bet it's as hard as, say, figuring out baseball's best hitters based on only a week's worth of AB. 

Or maybe not. Steve Jobs was fired as Apple's CEO, then, famously, returned to the struggling company a few years later to mastermind the iPod, iPhone, and iPad. Apple is now worth around 100 times as much as it was before Jobs came back. That's an increase in value of somewhere around $500 billion. It was maybe closer to $300 billion at the time of Jobs' death in 2011.

How much of that is due to Jobs' actual "talent" as CEO? Was he just lucky that his ethos of "insanely great" wound up leading to the iPhone? Maybe Jobs just happened to win the lottery, in that he had the right engineers and creative people to create exactly the right product for the right time?

It's obvious that Apple created hundreds of billions of dollars worth of value during Jobs' tenure, but I have no idea how much of that is actually due to Jobs himself. Well, I shouldn't say *no* idea. From what I've read and seen, I'd be willing to bet that he's at least, say, 1 percent responsible. 

One percent of $300 billion is $3 billion. Divide that by 14 years, and it's more than $200 million per year.

If you give Steve Jobs even just one percent of the credit for Apple's renaissance, he was still worth 50 percent more than today's highest-paid CEO, 300 percent more than today's eighth-highest paid CEO, and 1500 percent more than today's 200th-highest-paid CEO. 
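A quick Python sketch of the Jobs arithmetic, using the round numbers above. The 1 percent share of credit is my guess, not a measurement, and the variable names are just for illustration.

```python
value_created = 300e9    # rough increase in Apple's value by 2011, dollars
jobs_share = 0.01        # assumption: 1 percent of the credit goes to Jobs
years = 14               # length of his second stint, as used above
top_ceo_pay = 142e6      # highest pay on the Business Week chart

per_year = value_created * jobs_share / years
ratio_to_top = per_year / top_ceo_pay

print(f"Jobs' share per year: ${per_year / 1e6:.0f} million")
print(f"relative to top CEO pay: {ratio_to_top:.2f}x")
```

That works out to roughly $214 million a year, or about one and a half times the top salary on the chart.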


Tuesday, July 22, 2014

Did McDonald's get shafted by the Consumer Reports survey?

McDonald's was the biggest loser in Consumer Reports' latest fast food survey, ranking dead last out of 21 burger chains. CR readers rated McDonald's only 5.8 out of 10 for its burgers, and 71 out of 100 for overall satisfaction. (Ungated results here.)

CR wrote,

"McDonald's own customers ranked its burgers significantly worse than those of [its] competitors."

Yes, that's true. But I think the ratings are a biased measure of what people actually think. I suspect that McDonald's is actually much better loved than the survey says. In fact, the results could even be backwards.  It's theoretically possible, and fully consistent with the results, that people actually like McDonald's *best*. 

I don't mean because of statistical error -- I mean because of selective sampling.


According to CR's report, 32,405 subscribers reported on 96,208 dining experiences. That's 2.97 restaurants per respondent, which leads me to suspect that they asked readers to report on the three chains they visit most frequently. (I haven't actually seen the questionnaire -- they used to send me one in the mail to fill out, but not any more.)

Limiting respondents to their three most frequented restaurants would, obviously, tend to skew the results upward. If you don't like a certain chain, you probably wouldn't have gone lately, so your rating of "meh, 3 out of 10" wouldn't be included. It's going to be mostly people who like the food who answer the questions.

But McDonald's might be an exception. Because even if you don't like their food that much, you probably still wind up going occasionally:

-- You might be travelling, and McDonald's is all that's open (I once had to eat Mickey D's three nights in a row, because everything else nearby closed at 10 pm). 

-- You might be short of time, and there's a McDonald's right in Wal-Mart, so you grab a burger on your way out and eat it in the car.

-- You might be with your kids, and kids tend to love McDonald's.

-- There might be only McDonald's around when you get hungry. 

Those "I'm going for reasons other than the food" respondents would depress McDonald's ratings, relative to other chains.

Suppose there are two types of people in America. Half of them rate McDonald's a 9, and Fuddruckers a 5. The other half rate Fuddruckers an 8, but McDonald's a 6.

So, consumers think McDonald's is a 7.5, and Fuddruckers is a 6.5.

But the people who prefer McDonald's seldom set foot anywhere else -- where there's a Fuddruckers, the Golden Arches are never too far away. On the other hand, fans of Fuddruckers can't find one when they travel. So, they wind up eating at McDonald's a few times a year.

So what happens when you do the survey? McDonald's gets a rating of 7.5 -- the average of 9s from the loyal customers, and 6s from the reluctant ones. Fuddruckers, on the other hand, gets an average of 8 -- since only their fans vote. 

That's how, even if people actually like McDonald's more than Fuddruckers, selective sampling might make McDonald's look worse.
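Here's the two-types-of-people example as a tiny Python sketch. The ratings are the made-up ones from above, and the key assumption is that people only rate chains they've actually visited. The names `true_mean` and `survey_mean` are hypothetical helpers.

```python
# Two equally common types of people, with the hypothetical ratings above.
population = [
    # McDonald's loyalists: never end up at Fuddruckers.
    {"mcdonalds": 9, "fuddruckers": 5, "visits": {"mcdonalds"}},
    # Fuddruckers fans: still wind up at McDonald's when travelling.
    {"mcdonalds": 6, "fuddruckers": 8, "visits": {"mcdonalds", "fuddruckers"}},
]

def true_mean(chain):
    """What everyone thinks of the chain, whether or not they go there."""
    return sum(p[chain] for p in population) / len(population)

def survey_mean(chain):
    """What the survey sees: ratings only from people who visited the chain."""
    ratings = [p[chain] for p in population if chain in p["visits"]]
    return sum(ratings) / len(ratings)

for chain in ("mcdonalds", "fuddruckers"):
    print(chain, "true:", true_mean(chain), "survey:", survey_mean(chain))
```

McDonald's leads 7.5 to 6.5 in the population, but trails 7.5 to 8.0 in the survey.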


It seems likely this is actually happening. If you look at the burger chain rankings, it sure does seem like the biggest chains are clustered near the bottom. Of the five chains with the most locations (by my Googling and estimates), all of them rank within the bottom eight of the rankings: Wendy's (burger score 6.8), Sonic (6.7), Burger King (6.6), Jack In The Box (6.6), and McDonald's (5.8). 

As far as I can tell, Hardee's is next biggest, with about 2,000 US restaurants. It ranks in the middle of the pack, at 7.5. 

Of the ten chains ranked higher than Hardee's, every one of them has fewer than 1,000 locations. The top two, Habit Burger Grill (8.1) and In-N-Out (8.0), have only 400 restaurants between them. Burgerville, which ranked 7.7, has only 39 stores. (Five Guys (7.9) now has more than 1,000, but the survey covered April, 2012, to June, 2013, when there were fewer.)

The pattern was the same in other categories, where the largest chains were also at or near the bottom. KFC ranked worst for chicken; Subway rated second-worst for sandwiches; and Taco Bell scored worst for Mexican.

And, the clincher, for me at least: the chain with the worst "dining experience," according to the survey was Sbarro, at 65/100. 

What is Sbarro, if not the "I'm stuck at the mall" place to get pizza? Actually, I think there's even a Sbarro at the Ottawa airport -- one of only two fast food places in the departure area. If you get hungry waiting for your flight, it's either them or Tim Hortons.

The Sbarro ratings are probably dominated by customers who didn't have much of a choice. 

(Not that I'm saying Sbarro is actually awesome food -- I don't ever expect to hear someone say, unironically, "hey, I feel like Sbarro tonight."  I'm just saying they're probably not as bad as their rating suggests.)


Another factor: CR asked readers to rate the burgers, specifically. In-N-Out sells only burgers. But McDonald's has many other popular products. You can be a happy McDonald's customer who doesn't like the burgers, but you can't be a happy In-N-Out customer who doesn't like the burgers. Again, that's selective sampling that would skew the results in favor of the burger-only joints.

And don't forget: a lot of people *love* McDonald's french fries. So, their customers might prefer "C+ burger with A+ fries" to a competitor who's a B- in both categories. 

That thinking actually *supports* CR's conclusion that people like McDonald's burgers less ... but, at the same time, it makes the arbitrary ranking-by-burger-only seem a little unfair. It's as if CR rated baseball players by batting average, and ignored power and walks.

For evidence, you can compare CR's two sets of rankings. 

In burgers, the bottom eight are clustered from 6.6 to 6.8 -- except McDonald's, a huge outlier at 5.8, as far from second-worst as second-worst is from average.

In overall experience, though, McDonald's makes up the difference completely, perhaps by hitting McNuggets over the fences. It's still last, but now tied with Burger King at 71. And the rest aren't that far away. The next six range from 74 to 76 -- and, for what it's worth, CR says a difference of five points is "not meaningful".


A little while ago, I read an interesting story about people's preferences for pies. I don't remember where I read it so I may not have the details perfect. (If you recognize it, let me know.)

For years, Apple Pie was the biggest selling pie in supermarkets. But that was when only full-size pies were sold, big enough to feed a family. Eventually, one company decided to market individual-size pies. To their surprise, Apple was no longer the most popular -- instead, Blueberry was. In fact, Apple dropped all the way to *fifth*. 

What was going on? It turns out that Apple wasn't anyone's most liked pie, but neither was it anyone's least liked pie. In other words, it ranked high as a compromise choice, when you had to make five people happy at once.

I suspect that's what happens with McDonald's. A bus full of tourists isn't going to stop at a specialty place which may be a little weird, or have limited variety. They're going to stop at McDonald's, where everyone knows the food and can find something they like.

McDonald's is kind of the default fast food, everybody's second or third choice.


But having said all that ... it *does* look to me that the ratings are roughly in line with what I consider "quality" in a burger. So I suspect there is some real signal in the results, despite the selective sampling issue.

Except for McDonald's. 

Because, first, I don't think there's any way their burgers are *that* much "worse" than, say, Burger King's. 

And, second, every argument I've made here applies significantly more to McDonald's than to any of the other chains. They have almost twice as many locations as Burger King, almost three times as many as Wendy's, and almost four times as many as Sonic. Unless you truly can't stand them, you'll probably find yourself at McDonald's at some point, even if you'd much rather be dining somewhere else.

All the big chains probably wind up shortchanged in CR's survey. But McDonald's, I suspect, gets spectacularly screwed.


Saturday, July 12, 2014

Nate Silver and the 7-1 blowout

Brazil entered last Tuesday's World Cup semifinal match missing two of their best players -- Neymar, who was out with an injury, and Silva, who was sitting out a red-card suspension. Would they still be good enough to beat Germany?

After crunching the numbers, Nate Silver, at FiveThirtyEight, forecasted that Brazil still had a 65 percent chance of winning the match -- that the depleted Brazilians were still better than the Germans. In that prediction, he was taking a stand against the betting markets, which actually had Brazil as underdog -- barely -- at 49 percent. 

Then, of course, Germany beat the living sh!t out of Brazil, by a score of 7-1. 

"Time to eat some crow," Nate wrote after Brazil had been humiliated. "That prediction stunk."

I was surprised; I had expected Nate to defend his forecast. Even in retrospect, you can't say there was necessarily anything wrong with it.

What's the argument that the prediction stunk?  Maybe it goes something like this:

-- Defying the oddsmakers, Nate picked Brazil as the favorite.
-- Brazil suffered the worst disaster in World Cup history. 
-- Nate's prediction was WAY off.
-- So that has to be a bad prediction, right?

No, it doesn't. It's impossible to know in advance what's going to happen in a soccer game, and, in fact, anything at all could happen. The best anyone can do is try to assign the best possible estimate of the probabilities. Which is what Nate did: he said that there was a 65% chance that Brazil would win, and a 35% chance they would lose. 

Nate said Brazil had about twice as much chance of winning as Germany did. He did NOT say that Brazil would play twice as well. He didn't say Brazil would score twice as many goals. He didn't say Brazil would control the ball twice as much of the time. He didn't say the game would be close, or that Brazil wouldn't get blown out. 

All he said was, Brazil has twice the probability of winning. 

The "65:35" prediction *did* imply that Nate thought Brazil was a better team than Germany. But that's not the same as implying that Brazil would play better this particular game. It happens all the time, in sports, that the better team plays like crap, and loses. That's all built in to the "35 percent". 

Here's an analogy. 

FIFA is about to pick a random number of dollars between 1 and 1,000,000. I say, there's a 65 percent chance that the number drawn will be higher than the value of a three-bedroom bungalow, which is $350,000. 

That's absolutely a true statement, right?  650,000 "winning" balls out of a million is 65 percent. I've made a perfect forecast.

After I make my prediction, FIFA reaches into the urn, pulls out one of the million balls, and it's ... number 14. 

Was my prediction wrong?  No, it wasn't. It was exactly, perfectly correct, even in retrospect.

It might SEEM that my prediction was awful, if you don't understand how probability works, or you didn't realize how the balls were numbered, or you didn't understand the question. In that case, you might gleefully assume I'm an idiot. You might say, "Are you kidding me?  Phil predicted you could buy a house for $14! Obviously, there's something wrong with his model!"

But, there isn't. I knew all along that there was a chance of "14" coming up, and that factored into my "35 percent" prediction. "14" is, in fact, a surprisingly low outcome, but one that was fully anticipated by the model.

When Nate said that Brazil had a 35 percent chance of losing, a small portion of that 35 percent was the chance of those rare events, like a 7-1 score -- in the same way my own 35 percent chance included the rare event of a really small number getting drawn. 
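The urn arithmetic, as a short Python sketch. The setup is the hypothetical one from my analogy (a million balls, a $350,000 house); the variable names are just for illustration.

```python
n_balls = 1_000_000      # balls numbered 1 through 1,000,000
house_price = 350_000    # value of the hypothetical bungalow
drawn = 14               # the ball FIFA actually pulled

# Balls numbered above the house price are the "winning" ones.
p_higher = (n_balls - house_price) / n_balls
p_not_higher = house_price / n_balls

# Chance of a draw at least as low as the one we saw.
p_this_low = drawn / n_balls

print(f"P(higher than house price) = {p_higher:.2f}")
print(f"P(not higher)              = {p_not_higher:.2f}")
print(f"P(ball <= {drawn})              = {p_this_low:.6f}")
```

The 14 is a tiny slice of the 35 percent, but it's inside it, so the 65 percent forecast already accounted for it.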

As unintuitive as it sounds, you can't judge Nate's forecast by the score of the game. 


Critics might dispute my analogy by arguing something like this:

"The '14' result in Phil's model doesn't show he was wrong, because, obviously, which ball comes out of the urn is just a random outcome. On the other hand, a soccer game has real people and real strategies, and a true expert would have been able to foresee how Germany would come out so dominating against Brazil."

But ... no. An expert probably *couldn't* know that. That's something that was probably unknowable. For one thing, the betting markets didn't know -- they had the two teams about even. I didn't hear any bettor, soccer expert, sportswriter, or sabermetrician say anything otherwise, like that Germany should be expected to win by multiple goals. That suggests, doesn't it, that it was legitimately impossible to foresee?

I say, yes, it was definitely unknowable. You can't predict the outcome of a single game to that extent -- it's a violation of the "speed of light" limit. I would defy you to find any single instance where anyone, with money at stake, seriously predicted a single game outcome that violates conventional wisdom to anything near this extent. 

Try it for any sport. On August 22, 2007, the Rangers were 2:3 underdogs on the road against the Orioles. They won 30-3. Did anyone predict that?  Did anyone even say the Rangers should be heavy favorites?  Is there something wrong with Vegas, that they so obviously misjudged the prowess of the Texas batters?

Of course not. It was just a fluke occurrence, literally unpredictable by human minds. Like, say, 7-1 Germany.

Huh? [Nate Silver] says his prediction “stunk,” but it was probabilistic. No way to know if it was even wrong. 

Exactly correct. 


So I don't think you can fault Nate's prediction, here. Actually, that's too weak a statement. I don't mean you have to forgive him, as in, "yeah, he was wrong, but it was a tough one to predict."  I don't mean, "well, nobody's perfect."  I mean: you have no basis even for *questioning* Nate's prediction, if your only evidence is the outcome of the game. Not as in, "you shouldn't complain unless you can do better," but, as in, "his prediction may well have been right, despite the 7-1 outcome."  

But I did a quick Google search for "Brazil 7-1 Nate Silver," and every article I saw that talked about Nate's prediction treated it as certain that his forecast was wrong.

1. Here's one guy who agrees that it's very difficult to predict game results. From there, he concludes that all predictions must therefore be bullsh!t (his word). "Why did they even bother updating their odds for the last three remaining teams at numbers like 64 percent for Germany, 14 percent for the Netherlands, when we just saw how useless those numbers can be?"

Because, of course, the numbers *aren't* bullsh!t, if you correctly interpret them as probabilities and not certainties. If you truly believe that no estimate of odds is useful unless it can successfully call the winner of every game, then how about you bet me every game, taking the Vegas underdog at even money?  Then we'll see who's bullsh!tting.

2. This British columnist gets it right, but kind of hides his defense of Nate in a discussion of how sports pundits are bad at predicting. Except that he means that sabermetricians are bad at correctly guessing outcomes. Well, yes, and we know that. But we *are* fairly decent at predicting probabilities, which is all that Nate was trying to do, because he knows that's all that can realistically be done.

3. "I love Nate Silver and 538, but this result might be breaking his model. Haven't been super impressed with the predictions."

What, in particular, wasn't this guy impressed with?  He can't just be talking about game results, can he?  Because, in the knockout round, Nate's predicted favorites won *every game* up to Germany/Brazil. Twelve in a row. What would have "super impressed" this guy, 13 out of 12?

Here's another one: 

"To be fair to Nate Silver + 538, their model on the whole was excellent. It's how they dealt with Brazil where I (and others) had problems."

What kind of problems?  Not picking them to lose 7-1?  

In fairness, sure, there's probably some basis for critiquing Nate's model, since he's been giving Brazil significantly higher odds than the bookies. But, in this case, the difference was between 65% and 49%, not between 65% and "OMG, it's a history-making massacre!"  So this is not really a convincing argument against Nate's method.

It's kind of like your doctor says, "you should stop smoking, or you're going to die before you're 50!"  You refuse, and the day before your fiftieth birthday, a piano falls on your head and kills you. And the doctor says, "See? I was right!" 

4. Here's a mathematical one, from a Guardian blogger. He notes that Nate's model assumed goals were independent and Poisson, but, in real life, they're not -- especially when a team collapses and the opponent scores in rapid-fire fashion.

All very true, but that doesn't invalidate Nate's model. Nate didn't try to predict the score -- just the outcome. Whether a team collapses after going down 3-0, or not, doesn't much affect the probability of winning after that, which is why any reasonable model doesn't have to go into that level of detail.
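To make the "independent and Poisson" idea concrete, here's a minimal Python sketch of that kind of model. To be clear about the assumptions: the scoring rates (1.6 and 1.2 goals per game) are numbers I made up, not anything from Nate's actual model; the function names are hypothetical; and draws are counted separately, with no attempt to model extra time or penalties.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when goals arrive at average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(lam_a, lam_b, max_goals=20, blowout_margin=6):
    """Win/draw/loss odds if each team's goals are independent Poisson counts."""
    p_a = p_draw = p_b = p_b_blowout = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                p_a += p
            elif a == b:
                p_draw += p
            else:
                p_b += p
                if b - a >= blowout_margin:
                    p_b_blowout += p   # e.g. a 7-1 loss for team A
    return p_a, p_draw, p_b, p_b_blowout

# Assumption: made-up scoring rates, with team A the modest favorite.
p_a, p_draw, p_b, p_b_blowout = outcome_probs(1.6, 1.2)
print(f"A wins {p_a:.3f}, draw {p_draw:.3f}, B wins {p_b:.3f}")
print(f"B wins by 6 or more: {p_b_blowout:.5f}")
```

Under these assumptions a six-goal loss for the favorite is a very small part of the underdog's winning chances. If goals come in bunches after a collapse, the real blowout chance would be higher than this, but that mostly moves probability around within the "loss" bucket rather than changing who wins.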

Which is why, actually, a 7-1 loss isn't necessarily inconsistent with being a favorite. Imagine if God had told the world, "if Brazil falls behind 2-0, they'll collapse and lose 7-1." Nate would have figured: "Hmmm, OK, so we have to subtract off the chance of 'Brazil gives up the first two goals, but then dramatically comes back to win the game,' since God says that can't happen."

Nate would have figured that's maybe 1 percent of all games, and said, "OK, I'm reducing my 65% to 64%."  

So, that particular imperfection in the model isn't really a serious flaw. 

But, now that I think about it ... imagine that when Nate published his 65% estimate, he explicitly mentioned, "hey, there's still a 1-in-3 chance that Brazil could lose ... and that includes a chance that Germany will kick the crap out of them. So don't get too cocky."  That would have helped him, wouldn't it?  It might even have made him look really good!

I mean, he shouldn't need to say it to statisticians, because it's an obvious logical consequence of his 65% estimate. But maybe it needs to be said to the public.

5. "It's hard to imagine how Silver could have been more wrong."

No, it's not hard to imagine at all. If Nate had predicted, "Germany has a 100% chance of winning 7-1," that would have been MUCH more wrong. 

6. Finally, and worst for last ... here's a UNC sociology professor lecturing Nate on how he screwed up, without apparently really understanding what's going on at all. I could spend an entire post on this one, but I'll just give you a summary. 

First, she argues that Nate should have paid attention to sportswriters, who said Brazil would struggle without those missing players. Researchers need to know when to listen to subject-matter experts, who knew something Nate's mathematical models don't. 

Well, first, she's cherry-picking her sportswriters -- they didn't ALL say Brazil would lose badly, did they?  You can always find *someone*, after the fact, who bet the right way. So what?

As for subject-matter experts ... Nate actually *is* a subject matter expert -- not on soccer strategy, specifically, but on how sports works mathematically. 

On the other hand, a sociology professor is probably an expert in neither. And it shows. At one point, she informs Nate that since the Brazilian team has been subjected to the emotional trauma of losing two important players, Nate shouldn't just sub in the skills of the two new players and run with it as if psychology isn't an issue. He should have *known* that kind of thing makes teams, and statistical models, collapse.

Except that ... it's not true, and subject-matter experts like Nate who study these things know that. There are countless cases of teams who are said to "come together" after a setback and win one for the Gipper -- probably about as many as appear to "collapse". There's no evidence of significant differences at all -- and certainly no evidence that's obvious to a sociologist in an armchair. 

Injuries, deaths, suspensions ... those happen all the time. Do teams play worse than expected afterwards?  I doubt it. I mean, you can study it, there's no shortage of data. After the deaths of Thurman Munson, Lyman Bostock, Ray Chapman, did their teams collapse?  I doubt it. What about other teams that lost stars to red cards?  Did they all lose their next game 7-1?  Or even 6-2, or 5-3?

Anyway, that's only about one-third of the post ... I'm going to stop, here, but you should read the whole thing. I'm probably being too hard on this professor, who didn't realize that Nate is the expert and not her, and wrote like she was giving a stock lecture to a mediocre undergrad student quoting random regressions, instead of to someone who actually wrote a best-selling book on this very topic.

So, moving along. 


There is one argument that would legitimately provide evidence that Nate was wrong. If any of the critics had chosen to present convincing evidence that Brazil actually had much less TALENT than Nate and others estimated, evidence that was freely available before the game ... that would certainly be legitimate.

Something like, "Brazil, as a team, is 2.5 goals above replacement with all their players in action, but I can prove that, without Neymar and Silva, they're 1.2 goals *below* replacement!"

That would work. 

And, indeed, some of the critiques seem to be actually suggesting that. They imply, *of course* Brazil wouldn't be any good without those players, and how could anyone have expected they would be?  

Fine. But, then, why did the bookmakers think they still had a 49% chance? Are you that smart that you saw something? OK, if you have a good argument that shows Brazil should have been 30%, or 20%, then, hey, I'm listening.

If the missing two players dropped Brazil from a 65% talent to a 20% talent, what is each worth individually? Silva is back for today's third-place game against Holland. What's your new estimate for Brazil ... maybe back to 40%?

Well, then, you're bucking the experts again. Brazil is actually the favorite today. The betting markets give them a 62% chance of beating the Netherlands, even though Neymar is still out. Nate has Brazil at 71%. If you think the Brazilians are really that bad, and Nate's model is a failure, I hope you'll be putting a lot of money on the Dutch today. 

Because, you can't really argue that Brazil is back to their normal selves today, right?  An awful team doesn't improve its talent that much, from 1-7 to favorite, just from the return of a single player, who's not even the team's best. No amount of coaching or psychology can do that.

If you thought Brazil's 7-1 humiliation was because of bad players, you should be interpreting today's odds as a huge mistake by the oddsmakers. I think they're confirmation that Tuesday's outcome was just a fluke. 

As I write this, the game has just started. Oh, it's 2-0 Netherlands. Perfect. You're making money, right?  Because, if you want to persuade me that you have a good argument that Nate was obviously and egregiously incorrect, now you can prove it: first, show me where you wrote he was still wrong and why; and, second, tell me how much you bet on the underdog Netherlands.

Otherwise, I'm going to assume you're just blowing hot air. Even if Brazil loses again today. 


Update/clarification: I am not trying to defend Nate's methodology against others, and especially not against the Vegas line (which I trust more than Nate's, until there's evidence I shouldn't).  

I'm just saying: the 7-1 outcome is NOT, in and of itself, sufficient evidence (or even "good" evidence) that Nate's prediction was wrong.  


Wednesday, July 09, 2014

"The Cult of Statistical Significance"

"The Cult of Statistical Significance" is a critique of social science's overemphasis on confidence levels and its convention that only statistically-significant results are worthy of acceptance. It's by two academic economists, Stephen Ziliak and Deirdre McCloskey, and my impression is that it made a little bit of a splash when it was released in 2008.

I've had the book for a while now, and I've been meaning to write a review. But, I haven't finished reading it, yet; I started a couple of times, and only got about halfway through. It's a difficult read for me ... it's got a flowery style, and it jumps around a bit too much for my brain, which isn't good at multi-tasking. But a couple of weeks ago, someone on Twitter pointed me to this .pdf -- a short paper by the same authors, summarizing their arguments. 


Ziliak and McCloskey's thesis is that scientists are too fixated on significance levels, and not enough on the actual size of the effect. To illustrate that, they use an example of two weight-loss pills:

"The first pill, called "Oomph," will shed from Mom an average of 20 pounds. Fantastic! But Oomph is very uncertain in its effects—at [a standard error of] plus or minus 10 pounds. ... Could be ten pounds Mom loses; could be thrice that.

"The other pill you found, pill "Precision," will take 5 pounds off Mom on average but it is very precise—at plus or minus 0.5 pounds. Precision is the same as Oomph in price and side effects but Precision is much more certain in its effects. Great! ...

"Fine. Now which pill do you choose—Oomph or Precision? Which pill is best for Mom, whose goal is to lose weight?"

Ziliak and McCloskey -- I'll call them "ZM" for short -- argue that "Oomph" is the more effectual pill, and therefore the best choice. But, because its effect is not statistically significantly different from zero*, scientists would recommend "Precision". Therefore, the blind insistence on statistical significance costs Mom, and society, a high price in lost health and happiness.

(*In their example, the effect actually *is* statistically significant, at 2 SDs, but the authors modify the example later so it isn't.)

But: that isn't what happens in real life. In actual research, scientists would *observe* 20 pounds plus or minus 10, and try to infer the true effect as best they can. But here, the authors proceed as if we *already know* the true effect on Mom is 20 +/- 10. But if we did already know that, then, *of course* we wouldn't need significance testing!

Why do the authors wind up having their inference going the wrong way?  I think it's a consequence of failing to notice the elephant in the room, the fact that's the biggest reason significance testing becomes necessary. That elephant is: most pills don't work. 

What I suspect is that when the authors see an estimate of 20, plus or minus 10, they think that must be a reasonable, unbiased estimate of the actual effect. They don't consider that most true values are zero, that most observed effects are therefore just random noise, and that the "20 pounds" estimate is likely spurious.

That's the key to the entire issue of why we have to look at statistical significance -- to set a high-enough bar that we don't wind up inundated with false positives.

At best, the authors are setting up an example in which they already assume the answer, then castigating statistical significance for getting it wrong. And, sure, insisting on p < .05 will indeed cause false negatives like this one. But ZM fail to weigh those false negatives against the inevitable false positives that would result if we stopped looking at significance. They don't seem to realize that we first need evidence that an effect exists at all.


In fairness, Ziliak and McCloskey don't say explicitly that they're rejecting the idea that most pills are useless. They might not actually even believe it. They might just be making statistical assumptions that necessarily assume it's true. Specifically:

-- In their example, they assume that, because the "Oomph" study found a mean of 20 pounds and SD of 10 pounds, that's what Mom should expect in real life. But that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- They also seem to assume the implication of that, that when you come up with a 95% confidence interval for the size of the effect, there is actually a 95% probability that the effect lies in that range. Again, that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- And, I think they assume that if a result comes out with a p-value of .75, it implies a 75% chance that the true effect is greater than zero. Same thing: that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

I can't read minds, and I probably shouldn't assume that's what ZM were actually thinking. But that one single assumption would easily justify their entire line of argument -- if only it were true. 

And it certainly *seems* justifiable, to assume that every effect size is equally likely. You can almost hear the argument being made: "Why assume that the drug is most likely useless?  Isn't that an assumption without a basis, an unscientific prejudice?  We should keep a completely open mind, and just let the data speak."  

It sounds right, but it's not. "All effects are equally likely" is just as strong a prejudice as "Zero is most likely."  It just *seems* more open-minded because (a) it doesn't have to be said explicitly, (b) it keeps everything equal, which seems less arbitrary, and (c) "don't be prejudiced" seems like a strong precedent, being such an important ethical rule for human relationships.

If you still think "most pills don't work" is an unacceptable assumption ... imagine that instead of "Oomph" being a pill, it was a magic incantation. Are you equally unwilling to accept the prejudice "most incantations don't work"?

If it is indeed true that most pills (and incantations) are useless, ignoring the fact might make you less prejudiced, but it will also make you more wrong. 


And "more wrong" is something that ZM want to avoid, not tolerate. That's why they're so critical of the .05 rule -- it causes "a loss of jobs, justice, profit, and even life."  Reasonably, they say we should evaluate the results not just on significance, but on the expected economic or social gain or loss. When a drug appears to have an effect on cancer that would save 1,000 lives a year ... why throw it away because there's too much noise?  Noise doesn't cost lives, while the pill saves them!

Except that ... if you're looking to properly evaluate economic gain -- costs and benefits -- you have to consider the prior. 

Suppose that 99 out of 100 experimental pills don't work. Then, when you get a p-value of .05, there's only about a 17 percent chance that the pill has a real effect. Do you want to approve cancer pills when you know five-sixths of them don't do anything?

(Why 5/6?  Of the 99 worthless drugs, about 5 of them will show significance just randomly. So you accept 5 spurious effects for each real effect.)

And that 17 percent is when you *do* have p=.05 significance. If you lower your significance threshold, it gets worse. When you have p=.20, say, you get 20 false positives for every real one.

Doing the cost-benefit analysis for Mom's diet pill ... if there's only a 1 in 6 chance that the effect is real, her expectation is a loss of 3.3 pounds, not 20. In that case, she is indeed better off taking "Precision" than "Oomph".
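For anyone who wants to check that arithmetic, here is a rough Python sketch. It assumes, as the numbers above implicitly do, that every pill that truly works shows up as significant (a "power" of 1.0), and that a spurious effect is worth zero pounds. The function and variable names are made up for illustration.

```python
def prob_effect_is_real(prior_real, alpha, power=1.0):
    """Share of 'significant' results that come from pills that truly work.

    prior_real: fraction of pills that really work (assumption from the text: 1 in 100)
    alpha:      significance threshold (chance a useless pill looks significant)
    power:      chance a working pill looks significant (assumed 1.0 here)
    """
    true_positives = prior_real * power
    false_positives = (1.0 - prior_real) * alpha
    return true_positives / (true_positives + false_positives)


def false_positives_per_real(prior_real, alpha, power=1.0):
    """How many spurious 'significant' pills you accept per real one."""
    return ((1.0 - prior_real) * alpha) / (prior_real * power)


PRIOR_REAL = 0.01  # 99 out of 100 experimental pills don't work

p_real_05 = prob_effect_is_real(PRIOR_REAL, 0.05)
ratio_05 = false_positives_per_real(PRIOR_REAL, 0.05)
ratio_20 = false_positives_per_real(PRIOR_REAL, 0.20)

# Mom's expected loss on "Oomph": 20 pounds if real, 0 if spurious.
expected_oomph_loss = p_real_05 * 20.0

print(f"P(real | significant at .05): {p_real_05:.3f}")
print(f"False positives per real pill at .05: {ratio_05:.2f}")
print(f"False positives per real pill at .20: {ratio_20:.2f}")
print(f"Expected pounds lost on Oomph: {expected_oomph_loss:.2f}")
```

The unrounded figures come out to roughly 17 percent, about 5 and about 20 false positives per real pill, and an expected loss a bit over 3.3 pounds. If real pills are detected less than 100 percent of the time, lower the `power` argument and the picture gets worse still.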


If you don't read the article or book, here's the one sentence summary: Scientists are too concerned with significance, and not enough with real-life effects. Or, as Ziliak and McCloskey put it, 

"Precision is Nice but Oomph is the Bomb."

The "oomph" -- the size of the coefficient -- is the scientific discovery that tells you something about the real world. The "precision" -- the significance level -- tells you only about your evidence and your experiment.

I agree with the authors on this point, except for one thing. Precision is not merely "nice". It's *necessary*. 

If you have a family of eight and shop at Costco and need a new vehicle, "Tires are Nice but Cargo Space is the Bomb." That's true -- but the "Bomb" is useless without the "Nice".

Even if you're only concerned with real-world effects, you still need to consider p-values in a world where most hypotheses are false. As critical as I have been about the way significance is used in practice, it's still something that's essential to consider, in some way, in order to filter out false positives, where you mistakenly approve treatments that are no better than sugar pills. 

None of that ever figures into the authors' arguments. Failing to note the false positives -- the word "false" doesn't appear anywhere in their essay, never mind "false positive" -- the authors can't figure out why everyone cares about significance so much. The only conclusion they can think of is that scientists must worship precision for its own sake. They write, 

"[The] signal to noise ratio of pill Oomph is 2-to-1, and of pill Precision 10-to-1. Precision, we find, gives a much clearer signal—five times clearer.

"All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way. "Well," say our significance testing colleagues, "the pill with the highest signal to noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision.” 

"But Precision—precision commonly defined as a large t-statistic or small p-value on a coefficient—is obviously the wrong choice. Wrong for Mother's weight-loss plan and wrong for the many other victims of the sizeless scientist. The sizeless scientist decides whether something is important or not—he decides "whether there exists an effect," as he puts it—by looking not at the something's oomph but at how precisely it is estimated. Mom wants to lose weight, not gain precision."

Really?  I have much, much less experience with academic studies than the authors, but ... I don't recall ever having seen papers boast about how precise their estimates are, except as evidence that effects are significant and real. I've never seen anything like, "My estimates are 7 SDs from zero, while yours are only 4.5 SDs, so my study wins!  Even though yours shows cigarettes cause millions of cancer deaths, and mine shows that eating breakfast makes you marginally happier."

Does that really happen?


Having said that, I agree emphatically with the part of ZM's argument that says scientists need to pay more attention to oomph. I've seen many papers that spend many, many words arguing that an effect exists, but then hardly any examining how big it is or what it means. Ziliak and McCloskey refer to these significance-obsessed authors as "sizeless scientists." 

(I love the ZM terminology: "cult," "oomph," "sizeless".) 

Indeed, sometimes studies find an effect size that's so totally out of whack that it's almost impossible -- but they don't even notice, so focused are they on significance levels.

I wish I could recall an example ... well, I can make one up, just to give you the flavor of how I vaguely remember the outrageousness. It's like, someone finds a statistically-significant relationship between baseball career length and lifespan, and trumpets how he has statistical significance at the 3 percent level ... but doesn't realize that his coefficient estimates a Hall-of-Famer's lifespan at 180 years. 

If it were up to me, every paper would have to show the actual "oomph" of its findings in real-world terms. If you find a link between early-childhood education and future salary, how many days of preschool does it take to add, say, a dollar an hour?  If you find a link between exercising and living longer, how many marathons does it take to add a month to your life?  If fast food is linked with childhood obesity, how many pounds does a kid gain from each Happy Meal?  

And we certainly do also need less talk of precision. My view is that you should spend maybe one paragraph confirming that you have statistical significance. Then, shut up about it and talk about the real world. 

If you're publishing in the Journal of Costcological Science, you want to be talking about cargo space, and what the findings mean for those who benefit from Costcology. How many fewer trips to Costco will you make per year?  Is it now more efficient to get your friends to buy you gift cards instead of purchasing a membership? Are there safety advantages to little Joey no longer having to make the trip home with an eleven-pound jar of Nutella between his legs?

You don't want to be going on and on about, how, yes, the new vehicle does indeed have four working tires!  And, look, I used four different chemical tests to make sure they're actually made out of rubber!  And did I mention that when I redo the regression but express the cargo space in metric, the car still tests positive for tires?  It did!  See, tires are robust with respect to the system of mensuration!

For me, one sentence is enough: "The tire treads are significant, more than 2 mm from zero."  


So I agree that you don't need to talk much about the tires. The authors, though, seem to be arguing that the tires themselves don't really matter. They think drivers must just have some kind of weird rubber fetish. Because, if the vehicle has enough cargo space, who cares if the tires are slashed?

You need both. Significance to make sure you're not just looking at randomness, and oomph to tell you what the science actually means.


Wednesday, July 02, 2014

When a null hypothesis makes no sense

In criminal court, you're "innocent until proven guilty."  In statistical studies, it's "null hypothesis until proven significant."

The null hypothesis, generally, is the position that what you're looking for isn't actually there. If you're trying to prove that early-childhood education leads to success in adulthood, the default position is "we're going to assume it doesn't until evidence proves otherwise."

Why do we make "no" the null?  It's because, most times, there really IS nothing there. Pick a random thing and a random life outcome: shirts, marriage. Is there a relationship between shirt color and how happy a marriage you'll have?  Probably not. So "not" becomes the null hypothesis.

Carl Sagan famously said, "extraordinary claims require extraordinary evidence."  And, in a world where most things are unrelated, "my drug shrinks tumors" is indeed an extraordinary claim.

The null hypothesis is the one that's the LEAST extraordinary -- the one that's most likely, in some common-sense way. "Randomness caused it until proven otherwise," not "Fairies caused it until proven otherwise."

In studies, authors usually gloss over that, and just use the convention that the null is always "zero". They'll say, "the difference between the treatment and control groups is not statistically-significantly different from zero, so we do not reject the hypothesis that the drug is of no benefit."


But, "zero" isn't always the least extraordinary claim.

I believe that teams with a 1-0 lead in hockey games get overconfident and wind up doing worse than expected. So, I convince Gary Bettman to randomly pick a treatment group of teams, and give them a free goal to start the first game of their season. At the end of the year, I compare goal differential between the treatment and control groups.

Which should my null hypothesis be?

--  The treatment has an effect of 0
--  The treatment has an effect of +1

Obviously, it's the second one. The first one, even though it includes the typical "zero," is, nonetheless, an extraordinary claim: that you give one group a one-goal advantage, but by the end of the year, that advantage has disappeared. Instead of saying "innocent until proven guilty," you're saying, "one goal guilty unless proven otherwise."  But that's hidden, because you use the word "zero" instead of "one goal guilty."

If you use 0 instead of +1, you're effectively making your hypothesis the default, by stealth. 

(In this case, the null should be +1 ... in real life, the researcher would probably keep the same null, but also transform the model to put the conventional "0" back in. Instead of Y = b(treatment dummy), they'll use the model Y = (b+1)(treatment dummy), so that b=0 now means "no effect other than the obvious extra goal".)
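To make the difference concrete, here is a minimal Python sketch that tests the same simulated data against both nulls. Everything about the data is hypothetical: the sample sizes, the spread of goal differentials, and the assumption that the free goal is worth exactly +1 are invented for illustration, and the test is a plain difference-in-means z-score rather than anything a real study would necessarily use.

```python
import random
import statistics


def diff_and_se(treated, control):
    """Difference in mean goal differential, and its standard error."""
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(control) / len(control)) ** 0.5
    return diff, se


# Hypothetical data: the free goal is worth exactly +1, and nothing else.
rng = random.Random(0)
control = [rng.gauss(0.0, 3.0) for _ in range(200)]
treated = [rng.gauss(1.0, 3.0) for _ in range(200)]

diff, se = diff_and_se(treated, control)

# Null of 0: "the free goal vanished by the end of the year."
z_vs_zero = (diff - 0.0) / se
# Null of +1: "no effect other than the obvious extra goal."
# This is the same thing as refitting with the (b+1) model and testing b = 0.
z_vs_one = (diff - 1.0) / se

print(f"observed difference: {diff:.2f} (SE {se:.2f})")
print(f"z against a null of 0:  {z_vs_zero:.2f}")
print(f"z against a null of +1: {z_vs_one:.2f}")
```

The two z-scores always differ by exactly 1/SE, so the choice of null shifts the whole test by a fixed amount. With a null of 0, a "significant" result here would just be detecting the goal that was handed out, not any overconfidence.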

What that shows is: it's not enough that you use "0". You have to make an argument about whether your zero is an appropriate null hypothesis for your model. If you choose the right model, and the peer reviewers don't notice, you can "privilege your hypothesis" by making "zero" represent anything you like.

But that's actually not my main point here.


A while ago, I saw a blog post where an economist ran a regression to predict wins from salaries, for a certain sport. The coefficient was not statistically-significantly different from zero, so the author declared that we can't reject the null hypothesis that team payroll is unrelated to team performance.

But, in this case, "salary has an effect of zero" is not a reasonable null hypothesis. Why?  Because we have strong, common-sense knowledge that salary DOES sometimes have an effect. 

That knowledge is: we all know that better free-agent players get paid higher salaries. If you don't believe that's the case -- if you don't believe that LeBron James will earn more than a bench player next season -- you are excused from this argument. But, the economist who did that regression certainly believes it.

In light of that, "zero" is no longer the likeliest, least extraordinary, possibility, so it doesn't qualify as a null. 

That doesn't mean it can't still be the right answer. It could indeed turn out that the relationship between salary and wins is truly 0.00000. For that to happen, it would have to be that other factors exactly cancel out the LeBron factor.

Suppose every million dollars more you spend on LeBron James gives you 0.33446 extra wins, on average (from a God's-eye view). In that case, if you use "zero" in your null hypothesis, it's exactly equivalent to this alternative:

"For every million dollars less you spend on LeBron, you just happen to get exactly 0.33446 extra wins from other players."

Well, that's completely arbitrary!  Why would 0.33446 be more likely than 0.33445, or any other number?  There's no reason to believe that 0.33446 is "least extraordinary."  And so there's no reason to believe that the original "zero" is least extraordinary.

Moreover, if you use a null hypothesis of zero, you're contradicting yourself, because you're insisting on two contradictory things:

(1)  Players who sign for a lot more money, like LeBron, are generally much better players.

(2) We do not reject the assumption that the amount of money a team pays is irrelevant to how good it is.

You can believe either one of these, but not both. 


It used to be conventional wisdom that women over 40 should get a mammogram every year. The annual scan, it was assumed, would help discover cancer earlier, and lead to better outcomes.

Recent studies, though, dispute that conclusion. They say that there is no evidence that there's any difference in cancer survival or diagnosis rates for women who get the procedure and women who don't.  

Well, same problem: the null of "no difference" is an arbitrary one. It's the same argument as in the salary case:

Imagine two identical women, with the same cancer. One of them gets a mammogram, the cancer is discovered, and she starts treatment. Another one doesn't get the mammogram, and the cancer isn't discovered until later.

Obviously, the diagnosis MUST make a difference in the expected outcomes for these two patients. Nobody believes that whether you get treatment early or late makes NO difference, right?  Otherwise, doctors would just shrug and ignore the mammogram.

But, the null hypothesis of "zero difference" suggests that, when you add in all the other women, the expected overall survival rates should be *exactly the same*. 

That's an extraordinary claim. Sure it's *possible* that the years of life lost by the undiagnosed cancer are exactly offset by the years lost from the unnecessary treatment from false positives after a mammogram. Like, for instance, the 34 cancer patients who didn't get the mammogram each lose 8.443 years off their lives, and the 45 false-positives each lose 6.379 years, and if you work it out, it comes to exactly zero. 

"We can't reject that there is no difference" is exactly as arbitrary as "We can't reject that the difference is the cosine of 1.2345".

Unless, of course, you have an argument about how zero is a special case. If you DID want to argue that cancer treatment is completely useless, then, certainly, your zero null would be appropriate. 


"Zero" works well as a null hypothesis when it's most plausible that there's nothing there at all, when it's quite possible that there isn't any trace of a relationship. It's inappropriate otherwise: when there's SOME evidence of SOME real relationship, SOME of the time. 

In other words, zero works when it's a synonym for "there's no relationship at all."  It doesn't work when it's a synonym for, "the relationship is so small that it might as well be zero."

The null hypothesis works as a defense against the placebo effect. It does not work as a defense against actual effects that happen to be close to zero.

But, isn't it almost the same thing?  Isn't it just splitting hairs?

No, not at all. It's an important distinction.

There are two different questions you may want a study to answer. First: is there actually a relationship there?  And, second, if there is a relationship there, how big is it?

The traditional approach is: if you don't get statistical significance, you're considered to have not proved it's really there -- and, therefore, you're not allowed to talk about how big it might be. You have to stop dead.

But, in the case of the mammogram studies, you shouldn't have to prove it's really there. Under any reasonable assumptions a researcher might have about mammograms and cancer, there MUST be an effect. Whether the observed size is bigger or smaller than twice the SD -- which is the criterion for "existence" -- is completely irrelevant. You already know that an effect must exist.

If you demand statistical proof of existence when you already know it's there, you're choosing to ignore perfectly good information, and you're misleading yourself.

That's what happened in the Oregon Medicaid study. It found that Medicaid coverage was associated with "clinically significant" reductions in hypertension. But the authors ignored those improvements, because there wasn't enough data to constitute sufficient evidence -- evidence that there actually is a relationship between having free doctor visits and having your hypertension treated. 

But that's silly. We KNOW that people behave differently when they have Medicaid than when they don't. That's why they want it, so they can see doctors more and pay less. There MUST be actual differences in the two groups. We just don't know how large.  

But, because the authors of the study chose to pretend that existence was in doubt, they threw away perfectly good evidence. Imprecise evidence, certainly -- the confidence interval was very wide. But imprecision was not the problem. If the point estimate had been just an SD higher than it was, they would have accepted it at face value, imprecision be damned. 


One last analogy:

The FDA has 100 untested pills that drugmakers say treat cancer. The FDA doesn't know anything about them. However, God knows, and He tells you. 

It turns out 96 of the 100 compounds don't work at all -- they have no effect on cancer whatsoever, no more than sugar pills. The other four do something. They may help cancer, or they may hurt it, and all to different degrees. (Some of the four may even have an effect size of zero -- but if that's the case, they actually do something to the cancer, but the good things they do are balanced out by the bad things.)

You test one of the pills. The result is clinically significant, but only 0.6 SD from zero, not nearly strong enough to be statistically significant. It's reasonable for you to say, "well, I'm not even going to look at the magnitude of the effect, because, it's likely that it's just random noise from one of the 96 sugar pills."

You test another one of the pills, and get the same result. But this time, God pops His head into the lab and says, "By the way, that drug is one of the four that actually do something!"  

This time, the size of the effect matters, doesn't it?  You'd look like a fool to refuse to consider the evidence, given that you now know the pill is doing *something* to the cancer.

Well, that's what happens in real life. God has been telling us -- through common sense and observation -- that better players cost more money, that patients on Medicaid get more attention from doctors, and that patients with a positive mammogram get treated earlier. 

Clinging to a null hypothesis of "no real effect," when we know that null hypothesis is false, makes no sense at all.
