Tuesday, August 26, 2014

Sabermetrics vs. second-hand knowledge

Does the earth revolve around the sun, or does the sun revolve around the earth?

The earth revolves around the sun, of course. I know that, and you know that.

But do we really? 

If you know the earth revolves around the sun, you should be able to prove it, or at least show evidence for it. Confronted by a skeptic, what would you argue?  I'd be at a loss. Honestly, I can't think of a single observable fact that I could use to make a case.

I say that I "know" the earth orbits the sun, but what I really mean by that is, certain people told me that's how it is, and I believe them. 

Not all knowledge is like that. I truly *do* know that the sun rises in the east, because I've seen it every day. If a skeptic claimed otherwise, it would be easy to show evidence: I'd make sure he shared my definition of "east," and then I'd wake him up at 6 am and take him outside.

But that sun/earth thing?  I can only say I "know" it because I believe that astronomers *truly* know it, from direct evidence.

------

It occurred to me that almost all of our "knowledge" of scientific theories comes from that kind of hearsay. I couldn't give you evidence that atoms consist, roughly, of electrons orbiting a nucleus. I couldn't prove that every action has an equal and opposite reaction. There's no way I could come close to figuring out why and how e=mc^2, or that something called "insulin" exists and is produced by the pancreas. And I couldn't give you one bit of scientific evidence for why evolution is correct and not creationism. 

That doesn't stop us from believing, really, really strongly, that we DO know these things. We go and take a couple of undergraduate courses in, say, geology, and we write down what the professors tell us, and we repeat them on exams, and we solve mathematical problems based on formulas and principles we are told are true. And we get our credits, and we say we're "knowledgeable" in geology. 

But it's a different kind of knowledge. It's not knowledge that we have by our own experience or understanding. It's knowledge that we have by our own experience of how to evaluate what we're told -- how and when to believe other people. We extrapolate from our social knowledge. We believe that there are indeed people, "geologists," who have firsthand evidence. We believe that evidence gets disseminated among those geologists, who interact to reliably determine which hypotheses are supported and which ones are not. We believe that, in general, the experts are keeping enough of a watchful eye on what gets put in textbooks and taught at universities, that if Geology 101 was teaching us falsehoods, they'd get exposed in a hurry.

In other words, we believe that the system of scientists and professors and Ph.D.s and provosts and deans and journals and textbook publishers is a reliable separator of truth from falsehood. We believe that, if the earth really were only 6,000 years old, that's what scientists would be telling us.

------

Most of the time, it doesn't matter that our knowledge is secondhand. We don't need to be able to prove that swallowing arsenic is fatal; we just need to know not to do it. And, we can marvel at Einstein's discovery that matter and energy are the same thing, even if we can't explain why.

But it's still kind of unsatisfying. 

That's one of the reasons I like math. With math, you don't have to take anyone's word for anything. You start with a few axioms, and then it's all straight logic. You don't need geology labs and test tubes and chemicals. You don't need drills and excavators. You don't actually have to believe anyone on indirect evidence. You can prove everything for yourself.

The supply of primes is infinite. No matter how large a prime you find, there will always be one larger. That's a fact. If you like, you can look it up on the internet, or ask your math teacher, or find it in a textbook. It's a fact, like the earth revolving around the sun.

If you do it that way, you know it, but you don't really KNOW it. You can't defend it. In a sense, you're believing it on faith. 

On the other hand, you can look at a proof. Euclid's proof that there is no largest prime number is considered one of the most elegant in mathematics. The versions I found on the internet use a lot of math notation, so I'll paraphrase.

-----

Suppose you have a really big prime number, X. The question is: is there always a prime bigger than X?  

Try this: take all the numbers from 1 to X, and multiply them together: 1 times 2 times 3 .... times X. Now, add 1. Call that really huge number N. That huge N is either prime, or is the product of some number of primes. 

But N can't be divisible by X, or by anything else from 2 up to X, because that division always leaves a remainder of 1. Therefore: either N is prime, or, when you factor N into primes, they're all bigger than X. 

Either way, there is a prime bigger than X.
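
(If you'd rather see the construction in action than take my paraphrase on faith, here are a few lines of Python. The factorial blows up quickly, so I've kept X small; the point is just that the smallest prime factor of N really does come out bigger than X.)

# Euclid's construction: for a prime X, build N = (1 * 2 * ... * X) + 1
# and check that its smallest prime factor is bigger than X.
from math import factorial

def smallest_prime_factor(n):
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no divisor found, so n itself is prime

X = 13                   # a prime we already know about
N = factorial(X) + 1     # 1 times 2 times 3 ... times X, plus 1
p = smallest_prime_factor(N)
print(N, p, p > X)       # prints 6227020801 83 True -- a prime bigger than 13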

------

I may not have explained that very well. But, if you get it ... now you know that there is no highest prime. If you read it in a book, you "know" it, but if you understand the proof, you KNOW it, in the sense that you can explain it and prove it to others.

In fact ... if you read it in a textbook, and someone tells you the textbook is wrong, you may have some doubt. But once you see the proof, you will *never* have doubt (except in your own logic). Even if the greatest mathematician in the world tells you there's a largest prime, you still know he's wrong. 

-----

In theory, everything in math is like that, provable from axioms. In practice ... not so much. The proofs get complicated pretty quickly. (When Andrew Wiles announced his proof of Fermat's Last Theorem in 1993, it ran about 200 pages.)  Still, there are significant mathematical results that we can all say we know from our own efforts. For years, I wondered why it was that multiplication goes both ways -- why 8 x 7 has to equal 7 x 8. Then it hit me -- if you draw eight rows of seven dots, and turn it sideways, you get seven rows of eight dots.
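
(That dot-grid argument is easy enough to spell out in a couple of lines of Python, for whatever that's worth: transposing the grid turns rows into columns without adding or removing a single dot.)

# Eight rows of seven dots, turned sideways, is seven rows of eight dots.
rows = [["*"] * 7 for _ in range(8)]     # 8 rows of 7
sideways = list(zip(*rows))              # transpose: 7 rows of 8
print(sum(len(r) for r in rows))         # 56 dots
print(sum(len(r) for r in sideways))     # still 56 dots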

There are other fields like math that way ... you and I can know things on our own, fairly easily, in economics, and finance, and computer science. Other sciences, like physics and chemistry, take more time and equipment. I can probably prove to myself, with a stopwatch and ruler, that gravitational acceleration on earth is 9.8 m/s/s, but there's no way I could find evidence of what it is on the moon. 

But: sabermetrics. What started me on all this is realizing that the stuff we know about sabermetrics is more like infinite primes than like the earth revolving around the sun. We active researchers don't know sabermetrics just because Bill James and Pete Palmer told us. We know because we actually see how to replicate their work, and we see, all the way back to first principles, where everything came from. 

I can't defend "e equals mc squared," but I can defend Linear Weights. It's not that hard, and all I need is play-by-play data and a simple argument. Same with Runs Created: I can pull out publicly-available data and show that it's roughly unbiased and reasonably accurate. (I can even go further ... I can take partial derivatives of Runs Created and show that the values of the individual events are roughly in line with Linear Weights.)
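
(Here's a rough sketch of that last exercise, using the *basic* version of Runs Created -- hits plus walks, times total bases, divided by at-bats plus walks -- and team totals I made up for illustration. A real check would use actual league data and the full formula, but even the toy version puts the events in the familiar order.)

# Marginal value of each event under basic Runs Created, with invented totals.
def runs_created(singles, doubles, triples, hr, bb, outs):
    h = singles + doubles + triples + hr
    ab = h + outs
    tb = singles + 2 * doubles + 3 * triples + 4 * hr
    return (h + bb) * tb / (ab + bb)

base = dict(singles=1000, doubles=280, triples=30, hr=150, bb=550, outs=4100)
rc_base = runs_created(**base)

for event in ("bb", "singles", "doubles", "triples", "hr"):
    bumped = dict(base)
    bumped[event] += 1                            # one extra event, all else equal
    print(event, round(runs_created(**bumped) - rc_base, 2))
# The marginal runs come out ordered BB < 1B < 2B < 3B < HR, which is the
# Linear Weights ordering; the basic formula's magnitudes are only roughly right.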

DIPS?  No problem, I know what the evidence is there, and I can generate it myself. On-base percentage more important than batting average?  Geez, you don't even need data for that, but you can still show it formally, without too much difficulty, if you need to. 

For my own part -- and, again, many of you active analysts reading this would be able to say the same thing --  I don't think I could come up with a single major result in sabermetrics that I couldn't prove, from scratch, if I had to. Even the ones from advanced data, or proprietary data, I'm confident I could reproduce if you gave me the database.

For all the established principles that are based on, say, Retrosheet-level data ... honestly, I can't think of a single thing in sabermetrics that I "know" where I would need to rely on other people to tell me it's true. That might change: if something significant comes out of some new technique -- neural nets, "soft" sabermetrics, biomechanics -- I might have to start "knowing" things secondhand. But for now, I can't think of anything.

If you come to me and say, "I have geological proof that the earth is only 6,000 years old," I'm just going to shrug and say, "whatever."  But if you come to me and say, "I have proof that a single is worth only 1/3 of a triple" ... well, in that case, I can meet you head on and prove that you're wrong. 

I don't really know that creationism isn't right -- I only know what others have told me. But I *do* know firsthand what a triple is worth, just as I *do* know firsthand that there is no highest prime. 

------

And that, I think, is why I love sabermetrics so much -- it's the only chance I've ever had to actually be a scientist, to truly know things directly, from evidence rather than authority.

I have a degree in statistics, but if nuclear war wiped out all the statistics books, how much of that science could I restore from my own mind?  Maybe, a first-year probability course, at best. I could describe the Central Limit Theorem in general terms, but I have no idea how to prove it ... one of the most fundamental results in statistics, one they teach you in your first statistics class, and I still only know it from hearsay.

But if nuclear war wipes out all the sabermetrics books ... as long as someone finds me a copy of the Retrosheet database, I can probably reestablish everything. Nowhere near as eloquently as Bill James and Palmer/Thorn, and I probably wouldn't think of certain methods that Tango/MGL/Dolphin did, but ... yeah, I'm pretty sure I could restore almost all of it. 

To me, that's a big deal. It's the difference between knowing something, and only knowing that other people know it. Not to put down the benefits of getting knowledge from others -- after all, that's where most of our useful education comes from. It's just that, for me, knowing stuff on my own ... it's much more fulfilling, a completely different state of mind. As good as it may be to get the Ten Commandments from Moses, it's even better to get them directly from God.




Tuesday, August 12, 2014

More r-squared analogies

OK, so I've come up with yet another analogy for the difference between the regression equation coefficient and the r-squared.

The coefficient is the *actual signal* -- the answer to the question you're asking. The r-squared is the *strength of the signal* relative to the noise for an individual datapoint.

Suppose you want to find the relationship between how many five-dollar bills someone has, and how much money those bills are worth. If you do the regression, you'll find:

Coefficient = 5.00 (signal)
r-squared = 1.00 (strength of signal)
1 minus r-squared = 0.00 (strength of noise)
Signal-to-noise ratio = infinite (1.00 / 0.00)

The signal is: a five-dollar bill is worth $5.00. How strong is the signal?  Perfectly strong --  the r-squared is 1.00, the highest it can be.  (In fact, the signal to noise ratio is infinite, because there's no noise at all.)

Now, change the example a little bit. Suppose a lottery ticket gives you a one-in-a-million chance of winning five million dollars. Then, the expected value of each ticket is $5.  (Of course, most tickets win nothing, but the *average* is $5.)

You want to find out the relationship between how many tickets someone has, and how much money those tickets will win. With a sufficiently large sample size, the regression will give you something like:

Coefficient = 5.00 (signal)
r-squared = 0.0001 (strength of signal)
1 minus r-squared = 0.9999 (strength of noise)
Signal-to-noise ratio = 0.0001 (0.0001 / 0.9999)

The average value of a ticket is the same as a five-dollar bill: $5.00. But the *noise* around $5.00 is very, very large, so the r-squared is small. For any given ticketholder, the distribution of his winnings is going to be pretty wide.

In this case, the signal-to-noise ratio is something like 0.0001 divided by 0.9999, or 1:10,000. There's a lot of noise in with the signal.  If you hold 10 lottery tickets, your expected winnings are $50. But, there's so much noise, that you shouldn't count on the result necessarily being close to $50. The noise could turn it into $0, or $5,000,000.

On the other hand, if you own 10 five-dollar bills, then you *should* count on the $50, because it's all signal and no noise.
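
(You don't have to take my word for it, either -- here's a quick simulation. The ticket counts and sample size are made up, but both regressions come out the way the analogy says: the same coefficient of about 5, with completely different r-squareds.)

# Five-dollar bills vs. one-in-a-million lottery tickets: same signal,
# very different noise. (All quantities here are invented.)
import numpy as np

rng = np.random.default_rng(1)
n_people = 1_000_000
p_win, prize = 1e-6, 5_000_000                   # expected value per ticket = $5

tickets = rng.integers(0, 1001, n_people)        # tickets held by each person
winnings = rng.binomial(tickets, p_win) * prize  # what each person actually won

bills = tickets                                  # same counts, but of $5 bills
cash = bills * 5.0

def slope_and_r2(x, y):
    slope = np.polyfit(x, y, 1)[0]
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return round(float(slope), 2), round(float(r2), 4)

print(slope_and_r2(bills, cash))                 # (5.0, 1.0): all signal, no noise
print(slope_and_r2(tickets, winnings))           # coefficient near 5, r-squared near zero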

It's not a perfect analogy, but it's a good way to get a gut feel. In fact, you can simplify it a bit and make it even easier:

-- the coefficient is the signal.
-- the r-squared is the signal-to-noise ratio.

You can even think of it this way, maybe:

-- the coefficient is the "mean" effect.
-- the (1 - r-squared) is the "variance" (or SD) of the effect.

Five-dollar bills have a mean value of $5, and variance of zero. Five-dollar lottery tickets have a mean value of $5, but a very large variance.  

------

So, keeping in mind these analogies, you can see that this is wrong: 

"The r-squared between lottery tickets and winnings is very close to zero, which means that lottery tickets have very little value."

It's wrong because the r-squared doesn't tell you the actual value of a ticket (mean). It just tells you the noise (variance) around the realized value for an individual ticket-holder. To really see the value of a ticket, you have to look at the coefficient.  

From the r-squared alone, however, you *can* say this:

"The r-squared between lottery tickets and winnings is very close to zero, which means that it's hard to predict what your lottery tickets are going to be worth just based on how many you have."

You can conclude "hard to predict" based on the r-squared. But if you want to conclude "little value on average," you have to look at the coefficient.  

------

In the last post, I linked to a Business Week study that found an r-squared of 0.01 between CEO pay and performance. Because the 0.01 is a small number, the authors concluded that there's no connection, and CEOs aren't paid by performance.

That's the same problem as the lottery tickets.

If you want to see if CEOs who get paid more do better, you need to know the size of the effect. That is: you want to know the signal, not the *strength* of the signal, and not the signal-to-noise ratio. You want the coefficient, not the r-squared.

And, in that study, the signal was surprisingly high -- around 4, by my lower estimate. That is: for every $1 in additional salary, the CEO created an extra $4 for the shareholders. That's the number the magazine needs in order to answer its question.

The low r-squared just shows that the noise is high. The *expected value* is $4, but, for a particular case, it could be far from $4, in either direction.  I haven't checked, but I bet that some companies with relatively low-paid executives might create $100 per dollar, and some companies who pay their CEOs double or triple the average might nonetheless wind up losing value, or even going bankrupt.

------

Now that I think about it, maybe a "lottery ticket" analogy would be good too: 


Think of every effect as a combination of lottery tickets and cash money.

-- The regression coefficient tells you the total value of the tickets and money combined.

-- The r-squared tells you what proportion of that total value is in money.  

That one works well for me.

------

Anyway, the idea is not that these analogies are completely correct, but that they make it easier to interpret the results, and to spot errors of interpretation. When Business Week says, "the r-squared is 0.01, so there is no relationship," you can instantly respond:

"... All that r-squared tells you is, whatever the relationship actually turns out to be, the signal-to-noise ratio is 1:99. But, so what? Maybe it's still an important signal, even if it's drowned out by noise. Tell us what the coefficient is, so we can evaluate the signal on its own!"

Or, when someone says, "the r-squared between team payroll and wins is only .18, which means that money doesn't buy wins," you can respond:

"... All that r-squared tells you is, whatever the relationship actually turns out to be, 82 percent of it comes in the form of lottery tickets, and only 18 percent comes in cash. But those tickets might still be valuable! Tell us what the coefficient is, so we can see that value, and we can figure out if spending money on better players is actually worth it."

------

Does either one of those work for you?  




(You can find more of my old stuff on r-squared by clicking here.)



Tuesday, July 29, 2014

Are CEOs overpaid or underpaid?

Corporate executives make a lot of money. Are they worth it? Are higher-paid CEOs actually better than their lower-paid counterparts?

Business Week magazine says, no, they're not, and they have evidence to prove it. They took 200 highly-paid CEOs, and did a regression to predict their company's stock performance from their chief executive's pay. The plot looks highly random, with an r-squared of 0.01. Here's a stolen copy:




The magazine says,


"The comparison makes it look as if there is zero relationship between pay and performance ... The trend line shows that a CEO’s income ranking is only 1 percent based on the company’s stock return. That means that 99 percent of the ranking has nothing to do with performance at all. ...

"If 'pay for performance' was really a factor in compensating this group of CEOs, we’d see compensation and stock performance moving in tandem. The points on the chart would be arranged in a straight, diagonal line."

I think there are several reasons why that might not be right.

First, you can't go by the apparent size of the r-squared. There are a lot of factors involved in stock performance, and it's actually not unreasonable to think that the CEO would only be 1 percent of the total picture.

Second, an r-squared of 0.01 implies a correlation of 0.1. That's actually quite large. I bet if you ran a correlation of baseball salaries to one-week team performance, the r-squared would probably be just as small -- but that wouldn't mean players aren't paid by performance. As I've written before, you have to look at the regression equation, because even the smallest correlation could imply a large effect.

Third, the study appears to be based on a dataset created by Equilar, a consulting firm that advises on executive pay. But Equilar's study was limited to the 200 best-paid CEOs, and that artificially reduces the correlation. 

If you take only the 30 best-paid baseball players, and look at this year's performance on the field, the correlation will be only moderate. But if you add in the rest of the players, and minor-leaguers too, the correlation will be much higher. 

(If you don't believe me: find any scatterplot that shows a strong correlation. Take a piece of paper and cover up the leftmost 90% of the datapoints. The 10% that remain will look much more random.)

Fourth, the observed correlation is fairly statistically significant, at p=0.08 (one-tailed -- calculate it here). That could be just random chance, but, on its face, 0.08 does suggest there's a good chance there's something real going on. On the other hand, the result probably comes out "too" significant because the 200 datapoints aren't really independent. It could be the case, for instance, that CEOs tend to get paid more in the oil industry, and, coincidentally, oil stocks happen to have done well recently.
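
(For the record, here's the back-of-the-envelope version of that p-value, using the usual t-test on a correlation coefficient: r of 0.1, from the reported r-squared of 0.01, on 200 datapoints.)

# One-tailed p-value for r = 0.1 with n = 200, via the standard t-test.
from scipy import stats

r, n = 0.1, 200
t = r * (n - 2) ** 0.5 / (1 - r * r) ** 0.5
p = stats.t.sf(t, n - 2)                 # one-tailed
print(round(t, 2), round(p, 3))          # roughly t = 1.41, p = 0.08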

-----

BTW, I don't think there's a full article accompanying the Business Week chart; I think what's in that link is all we get. Which is annoying, because it doesn't tell us how the 200 CEOs were chosen, or what years' stock performance was looked at. I'm not even sure that the salaries were negotiated in advance. If they weren't, of course, the result is meaningless, because it could just be that successful companies rewarded their executives after the fact. 

Furthermore, the chart doesn't match the text. The reporters say they got an r-squared of 0.01. I measured the slope of the regression line in the chart, by counting pixels, and it appears to be around 0.06. But an r of 0.06 implies an r-squared of 0.0036, which is far short of the 0.01 figure. Maybe the authors rounded up, for effect? 

It could be that my pixel count was off. If you raise the slope from 0.06 to 0.071, you now get an r-squared of 0.0051, which does round to 0.01. So, for purposes of this post, I'm going to assume the r is actually 0.07.

-----

A correlation of 0.07 means that, to predict a company's performance ranking, you have to regress its CEO pay ranking 93% towards the mean. (This works out because the X and Y variables have the same SD, both consisting of numbers from 1 to 200.)

In other words, 7 percent of the differences are real. That doesn't sound like much, but it's actually pretty big. 

Suppose you're the 20th ranked CEO in salary. What does that say about your company's likely performance? It means you have to regress it 93% of the way back to 100.5. That takes you to 95th. 

So, CEOs that get paid 20th out of 200 improve their company's stock price by 5 rankings more than CEOs who get paid 100.5th out of 200.

How big is five rankings?

I found a website that allowed me to rank all the stocks in the S&P 500 by one-year return. (They use one year back from today, so, your numbers may be different by the time you try it.  Click on the heading "1-Year Percent.")  

The top stock, Micron Technology, gained 151.47%. The bottom stock, Avon, lost 42.80%.

The difference between #1 and #500 is 194.27 percentage points. Divide that by 499, and the average one-spot-in-the-rankings difference is 0.39 percentage points.

Micron is actually a big outlier -- it's about 33 points higher than #2 (Facebook), and 52 points higher than #5 (Under Armour). So, I'm going to arbitrarily reduce the difference from 0.39 to 0.3, just to be conservative.

On that basis, five rankings is the equivalent of 1.5 percentage points in performance.

How much money is that, in real-life terms, for a stock to overperform by 1.5 points?

On the S&P 500, the average company has a market capitalization (that is, the total value of all outstanding stock) of 28 billion (.pdf). For the average company, then, 1.5 points works out to $420 million in added value.

If you want to use the median rather than the mean, it's $13.4 billion and $200 million, respectively.

Either way, it's a lot more than the difference in CEO compensation.

From the Business Week chart, the top CEO made about $142 million. The 200th CEO made around $12.5 million. The difference is $130 million over 199 rankings, or $650K per ranking. (The top four CEOs are outliers. If you remove them, the spread drops by half. But I'll leave them in anyway.)

Moving up the 80 rankings in our hypothetical example is worth only a $52 million raise -- much less than the apparent value added:

Pay difference:      $52 million
--------------------------------
Median value added: $200 million
Mean value added:   $420 million
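
(Just to keep the arithmetic straight, here's the whole chain from the last few paragraphs in a few lines of Python, using the rounded figures from above. It's a sketch of my calculation, not a re-analysis of the data.)

# From CEO pay ranking to dollars of shareholder value, step by step.
r = 0.07                                   # correlation implied by the chart
pay_rank, mean_rank = 20, 100.5

ranks_gained = (mean_rank - pay_rank) * r  # about 5.6; rounded to 5 in the text
points_per_rank = 0.3                      # conservative percentage points per ranking
points_gained = 5 * points_per_rank        # 1.5 percentage points of stock return

mean_cap, median_cap = 28e9, 13.4e9        # S&P 500 market caps
print(points_gained / 100 * mean_cap)      # ~ $420 million of added value
print(points_gained / 100 * median_cap)    # ~ $200 million

pay_per_rank = (142e6 - 12.5e6) / 199      # ~ $650K of salary per ranking
print((mean_rank - pay_rank) * pay_per_rank)  # ~ $52 million of extra pay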

Moreover ... the value of a good CEO is much higher, obviously, for a bigger company. The ten biggest companies on the S&P 500 have a market cap of at least $200 billion each. For a company of that size, the equivalently "good" CEO -- the one paid 20th out of 200 -- is worth three billion dollars. That's *60 times* the average executive salary.

Assuming my arithmetic is OK, and I didn't drop a zero somewhere.

-----

So, I think the Business Week regression shows the opposite of what they believe it shows. Taking the data at face value, you'd have to conclude that executives are underpaid according to their talent, not overpaid.

I'm not willing to go that far. There's a lot of randomness involved, and, as I suggested before, other possible explanations for the positive correlation. But, if you DO want to take the chart as evidence of anything, it's evidence that there is, indeed, a substantial connection between pay and performance. The r-squared of less than 0.01 only *looks* tiny.

-----

Although I think this is weak evidence that CEOs *do* make a difference that's bigger than their salary, the numbers certainly suggest that they *can* make that big an impact.

Suppose you own shares of Apple, and they're looking for a new CEO. A "superstar" candidate comes along. He wants twice as much money as normal. As a shareholder, do you want the company to pay it? 

It depends what you expect his (or her) production to be. What kind of difference do you think a good CEO will make in the company's performance?

Suppose that, next year, you think Apple will earn $6.50 a share with a "replacement level" CEO. How much more do you expect with the superstar CEO?

If you think he or she can make a 1% difference, that's an extra 6.5 cents per share. That might be too high. How about one cent a share, from $6.51 instead of $6.50? Does that seem reasonable? 

Apple trades at around 15 times annual earnings. So, one cent in earnings means about 15 cents on the stock price. With six billion Apple shares outstanding, 15 cents a share gives the superstar CEO a "value above replacement" of $900 million.

So, for a company as big as Apple, if you *do* think a CEO can make a 1-part-in-650 difference in earnings, even the top CEO salary of $142 million looks cheap.

Apple has the largest market cap of all 500 companies in the index, at about 15 times the average, so it's perhaps a special case. But it shows that CEOs can certainly create, or destroy, a lot more value than their salaries.

----- 

So can you conclude that corporate executives are underpaid? Not unless you can provide good evidence that a particular CEO really is that much better than the alternatives. 

There's a lot of luck involved in how a company's business goes -- it depends on the CEO's decisions, sure, but also on the overall market, and the actions of competitors, and advances in technology in general, and world events, and Fed policy, and random fads, and a million other things. It's probably very hard to figure out who the best CEOs are, even based on a whole career. I bet it's as hard as, say, figuring out baseball's best hitters based on only a week's worth of AB. 

Or maybe not. Steve Jobs was forced out of Apple, then, famously, returned to the struggling company a dozen years later to mastermind the iPod, iPhone, and iPad. Apple is now worth around 100 times as much as it was before Jobs came back. That's an increase in value of somewhere around $500 billion. It was maybe closer to $300 billion at the time of Jobs' death in 2011.

How much of that is due to Jobs' actual "talent" as CEO? Was he just lucky that his ethos of "insanely great" wound up leading to the iPhone? Maybe Jobs just happened to win the lottery, in that he had the right engineers and creative people to create exactly the right product for the right time?

It's obvious that Apple created hundreds of billions of dollars worth of value during Jobs' tenure, but I have no idea how much of that is actually due to Jobs himself. Well, I shouldn't say *no* idea. From what I've read and seen, I'd be willing to bet that he's at least, say, 1 percent responsible. 

One percent of $300 billion is $3 billion. Divide that by 14 years, and it's more than $200 million per year.

If you give Steve Jobs even just one percent of the credit for Apple's renaissance, he was still worth 50 percent more than today's highest-paid CEO, 300 percent more than today's eighth-highest paid CEO, and 1500 percent more than today's 200th-highest-paid CEO. 




Tuesday, July 22, 2014

Did McDonald's get shafted by the Consumer Reports survey?

McDonald's was the biggest loser in Consumer Reports' latest fast food survey, ranking them dead last out of 21 burger chains. CR readers rated McDonald's only 5.8 out of 10 for their burgers, and 71 out of 100 for overall satisfaction. (Ungated results here.)

CR wrote,


"McDonald's own customers ranked its burgers significantly worse than those of [its] competitors."

Yes, that's true. But I think the ratings are a biased measure of what people actually think. I suspect that McDonald's is actually much better loved than the survey says. In fact, the results could even be backwards.  It's theoretically possible, and fully consistent with the results, that people actually like McDonald's *best*. 

I don't mean because of statistical error -- I mean because of selective sampling.

-----

According to CR's report, 32,405 subscribers reported on 96,208 dining experiences. That's 2.97 restaurants per respondent, which leads me to suspect that they asked readers to report on the three chains they visit most frequently. (I haven't actually seen the questionnaire -- they used to send me one in the mail to fill out, but not any more.)

Limiting respondents to their three most frequented restaurants would, obviously, tend to skew the results upward. If you don't like a certain chain, you probably wouldn't have gone lately, so your rating of "meh, 3 out of 10" wouldn't be included. It's going to be mostly people who like the food who answer the questions.

But McDonald's might be an exception. Because even if you don't like their food that much, you probably still wind up going occasionally:

-- You might be travelling, and McDonald's is all that's open (I once had to eat Mickey D's three nights in a row, because everything else nearby closed at 10 pm). 

-- You might be short of time, and there's a McDonald's right in Wal-Mart, so you grab a burger on your way out and eat it in the car.

-- You might be with your kids, and kids tend to love McDonald's.

-- There might be only McDonald's around when you get hungry. 

Those "I'm going for reasons other than the food" respondents would depress McDonald's ratings, relative to other chains.

Suppose there are two types of people in America. Half of them rate McDonald's a 9, and Fuddruckers a 5. The other half rate Fuddruckers an 8, but McDonald's a 6.

So, consumers think McDonald's is a 7.5, and Fuddrucker's is a 6.5.

But the people who prefer McDonald's seldom set foot anywhere else -- wherever there's a Fuddrucker's, the Golden Arches are never too far away. On the other hand, fans of Fuddrucker's can't find one when they travel. So, they wind up eating at McDonald's a few times a year.

So what happens when you do the survey?  McDonald's gets a rating of 7.5 -- the average of 9s from the loyal customers, and 6s from the reluctant ones. Fuddruckers, on the other hand, gets an average of 8 -- since only their fans vote. 

That's how, even if people actually like McDonald's more than Fuddrucker's, selective sampling might make McDonald's look worse.
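
(Here's that toy example spelled out in a few lines of Python, with the made-up ratings from above, just to make the flip explicit.)

# Two equal groups of diners; only people who actually visit a chain rate it.
mcd_fans  = {"mcdonalds": 9, "fuddruckers": 5}   # rarely set foot anywhere else
fudd_fans = {"mcdonalds": 6, "fuddruckers": 8}   # eat at McDonald's when stuck

# True average opinion, counting everybody:
for chain in ("mcdonalds", "fuddruckers"):
    print("true", chain, (mcd_fans[chain] + fudd_fans[chain]) / 2)
# true mcdonalds 7.5, true fuddruckers 6.5 -- McDonald's is better liked

# Survey average, counting only actual visitors:
survey = {
    "mcdonalds": (mcd_fans["mcdonalds"] + fudd_fans["mcdonalds"]) / 2,  # both groups go
    "fuddruckers": fudd_fans["fuddruckers"],                            # only fans go
}
print("survey", survey)    # mcdonalds 7.5, fuddruckers 8.0 -- the ranking flips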

------

It seems likely this is actually happening. If you look at the burger chain rankings, it sure does seem like the biggest chains are clustered near the bottom. Of the five chains with the most locations (by my Googling and estimates), all of them rank within the bottom eight of the rankings: Wendy's (burger score 6.8), Sonic (6.7), Burger King (6.6), Jack In The Box (6.6), and McDonald's (5.8). 

As far as I can tell, Hardee's is next biggest, with about 2,000 US restaurants. It ranks in the middle of the pack, at 7.5. 

Of the ten chains ranked higher than Hardee's, every one of them has less than 1,000 locations. The top two, Habit Burger Grill (8.1) and In-N-Out (8.0), have only 400 restaurants between them. Burgerville, which ranked 7.7, has only 39 stores. (Five Guys (7.9) now has more than 1,000, but the survey covered April, 2012, to June, 2013, when there were fewer.)

The pattern was the same in other categories, where the largest chains were also at or near the bottom. KFC ranked worst for chicken; Subway rated second-worst for sandwiches; and Taco Bell scored worst for Mexican.

And, the clincher, for me at least: the chain with the worst "dining experience," according to the survey, was Sbarro, at 65/100. 

What is Sbarro, if not the "I'm stuck at the mall" place to get pizza? Actually, I think there's even a Sbarro at the Ottawa airport -- one of only two fast food places in the departure area. If you get hungry waiting for your flight, it's either them or Tim Hortons.

The Sbarro ratings are probably dominated by customers who didn't have much of a choice. 

(Not that I'm saying Sbarro is actually awesome food -- I don't ever expect to hear someone say, unironically, "hey, I feel like Sbarro tonight."  I'm just saying they're probably not as bad as their rating suggests.)

------

Another factor: CR asked readers to rate the burgers, specifically. In-N-Out sells only burgers. But McDonald's has many other popular products. You can be a happy McDonald's customer who doesn't like the burgers, but you can't be a happy In-N-Out customer who doesn't like the burgers. Again, that's selective sampling that would skew the results in favor of the burger-only joints.

And don't forget: a lot of people *love* McDonald's french fries. So, their customers might prefer "C+ burger with A+ fries" to a competitor who's a B- in both categories. 

That thinking actually *supports* CR's conclusion that people like McDonald's burgers less ... but, at the same time, it makes the arbitrary ranking-by-burger-only seem a little unfair. It's as if CR rated baseball players by batting average, and ignored power and walks.

For evidence, you can compare CR's two sets of rankings. 

In burgers, the bottom eight are clustered from 6.6 to 6.8 -- except McDonald's, a huge outlier at 5.8, as far from second-worst as second-worst is from average.

In overall experience, though, McDonald's makes up the difference completely, perhaps by hitting McNuggets over the fences. It's still last, but now tied with Burger King at 71. And the rest aren't that far away. The next six range from 74 to 76 -- and, for what it's worth, CR says a difference of five points is "not meaningful".

-----

A little while ago, I read an interesting story about people's preferences for pies. I don't remember where I read it so I may not have the details perfect. (If you recognize it, let me know.)

For years, Apple Pie was the biggest selling pie in supermarkets. But that was when only full-size pies were sold, big enough to feed a family. Eventually, one company decided to market individual-size pies. To their surprise, Apple was no longer the most popular -- instead, Blueberry was. In fact, Apple dropped all the way to *fifth*. 

What was going on? It turns out that Apple wasn't anyone's most liked pie, but neither was it anyone's least liked pie. In other words, it ranked high as a compromise choice, when you had to make five people happy at once.

I suspect that's what happens with McDonald's. A bus full of tourists isn't going to stop at a specialty place which may be a little weird, or have limited variety. They're going to stop at McDonald's, where everyone knows the food and can find something they like.

McDonald's is kind of the default fast food, everybody's second or third choice.

------

But having said all that ... it *does* look to me that the ratings are roughly in line with what I consider "quality" in a burger. So I suspect there is some real signal in the results, despite the selective sampling issue.

Except for McDonald's. 

Because, first, I don't think there's any way their burgers are *that* much "worse" than, say, Burger King's. 

And, second, every argument I've made here applies significantly more to McDonald's than to any of the other chains. They have almost twice as many locations as Burger King, almost three times as many as Wendy's, and almost four times as many as Sonic. Unless you truly can't stand them, you'll probably find yourself at McDonald's at some point, even if you'd much rather be dining somewhere else.

All the big chains probably wind up shortchanged in CR's survey. But McDonald's, I suspect, gets spectacularly screwed.








Saturday, July 12, 2014

Nate Silver and the 7-1 blowout

Brazil entered last Tuesday's World Cup semifinal match missing two of their best players -- Neymar, who was out with an injury, and Silva, who was serving a one-game suspension. Would they still be good enough to beat Germany?

After crunching the numbers, Nate Silver, at FiveThirtyEight, forecasted that Brazil still had a 65 percent chance of winning the match -- that the depleted Brazilians were still better than the Germans. In that prediction, he was taking a stand against the betting markets, which actually had Brazil as underdog -- barely -- at 49 percent. 

Then, of course, Germany beat the living sh!t out of Brazil, by a score of 7-1. 


"Time to eat some crow," Nate wrote after Brazil had been humiliated. "That prediction stunk."

I was surprised; I had expected Nate to defend his forecast. Even in retrospect, you can't say there was necessarily anything wrong with it.

What's the argument that the prediction stunk?  Maybe it goes something like this:

-- Defying the oddsmakers, Nate picked Brazil as the favorite.
-- Brazil suffered the worst disaster in World Cup history. 
-- Nate's prediction was WAY off.
-- So that has to be a bad prediction, right?

No, it doesn't. It's impossible to know in advance what's going to happen in a soccer game, and, in fact, anything at all could happen. The best anyone can do is try to assign the best possible estimate of the probabilities. Which is what Nate did: he said that there was a 65% chance that Brazil would win, and a 35% chance they would lose. 

Nate said Brazil had about twice as much chance of winning as Germany did. He did NOT say that Brazil would play twice as well. He didn't say Brazil would score twice as many goals. He didn't say Brazil would control the ball twice as much of the time. He didn't say the game would be close, or that Brazil wouldn't get blown out. 

All he said was, Brazil has twice the probability of winning. 

The "65:35" prediction *did* imply that Nate thought Brazil was a better team than Germany. But that's not the same as implying that Brazil would play better this particular game. It happens all the time, in sports, that the better team plays like crap, and loses. That's all built in to the "35 percent". 

Here's an analogy. 

FIFA is about to draw one ball from an urn containing a million balls, each marked with a dollar amount from 1 to 1,000,000. I say, there's a 65 percent chance that the number drawn will be higher than the value of a three-bedroom bungalow, which is $350,000. 

That's absolutely a true statement, right?  650,000 "winning" balls out of a million is 65 percent. I've made a perfect forecast.

After I make my prediction, FIFA reaches into the urn, pulls out one of the million balls, and it's ... number 14. 

Was my prediction wrong?  No, it wasn't. It was exactly, perfectly correct, even in retrospect.

It might SEEM that my prediction was awful, if you don't understand how probability works, or you didn't realize how the balls were numbered, or you didn't understand the question. In that case, you might gleefully assume I'm an idiot. You might say, "Are you kidding me?  Phil predicted you could buy a house for $14! Obviously, there's something wrong with his model!"

But, there isn't. I knew all along that there was a chance of "14" coming up, and that factored into my "35 percent" prediction. "14" is, in fact, a surprisingly low outcome, but one that was fully anticipated by the model.

When Nate said that Brazil had a 35 percent chance of losing, a small portion of that 35 percent was the chance of those rare events, like a 7-1 score -- in the same way my own 35 percent chance included the rare event of a really small number getting drawn. 

As unintuitive as it sounds, you can't judge Nate's forecast by the score of the game. 

-------

Critics might dispute my analogy by arguing something like this:

"The "14" result in Phil's model doesn't show he was wrong, because, obviously, which ball comes out of the urn it just a random outcome. On the other hand, a soccer game has real people and real strategies, and a true expert would have been able to foresee how Germany would come out so dominating against Brazil."

But ... no. An expert probably *couldn't* know that. That's something that was probably unknowable. For one thing, the betting markets didn't know -- they had the two teams about even. I didn't hear any bettor, soccer expert, sportswriter, or sabermetrician say anything otherwise, like that Germany should be expected to win by multiple goals. That suggests, doesn't it, that it was legitimately impossible to foresee?

I say, yes, it was definitely unknowable. You can't predict the outcome of a single game to that extent -- it's a violation of the "speed of light" limit. I would defy you to find any single instance where anyone, with money at stake, seriously predicted a single game outcome that violates conventional wisdom to anything near this extent. 

Try it for any sport. On August 22, 2007, the Rangers were 2:3 underdogs on the road against the Orioles. They won 30-3. Did anyone predict that?  Did anyone even say the Rangers should be heavy favorites?  Is there something wrong with Vegas, that they so obviously misjudged the prowess of the Texas batters?

Of course not. It was just a fluke occurrence, literally unpredictable by human minds. Like, say, 7-1 Germany.


"Huh? [Nate Silver] says his prediction 'stunk,' but it was probabilistic. No way to know if it was even wrong."

Exactly correct. 

--------

So I don't think you can fault Nate's prediction, here. Actually, that's too weak a statement. I don't mean you have to forgive him, as in, "yeah, he was wrong, but it was a tough one to predict."  I don't mean, "well, nobody's perfect."  I mean: you have no basis even for *questioning* Nate's prediction, if your only evidence is the outcome of the game. Not as in, "you shouldn't complain unless you can do better," but, as in, "his prediction may well have been right, despite the 7-1 outcome."  

But I did a quick Google search for "Brazil 7-1 Nate Silver," and every article I saw that talked about Nate's prediction treated it as certain that his forecast was wrong.

1. Here's one guy who agrees that it's very difficult to predict game results. From there, he concludes that all predictions must therefore be bullsh!t (his word). "Why did they even bother updating their odds for the last three remaining teams at numbers like 64 percent for Germany, 14 percent for the Netherlands, when we just saw how useless those numbers can be?"

Because, of course, the numbers *aren't* bullsh!t, if you correctly interpret them as probabilities and not certainties. If you truly believe that no estimate of odds is useful unless it can successfully call the winner of every game, then how about you bet me every game, taking the Vegas underdog at even money?  Then we'll see who's bullsh!tting.

2. This British columnist gets it right, but kind of hides his defense of Nate in a discussion of how sports pundits are bad at predicting. Except that he means that sabermetricians are bad at correctly guessing outcomes. Well, yes, and we know that. But we *are* fairly decent at predicting probabilities, which is all that Nate was trying to do, because he knows that's all that can realistically be done.


"I love Nate Silver and 538, but this result might be breaking his model. Haven't been super impressed with the predictions."

What, in particular, wasn't this guy impressed with?  He can't just be talking about game results, can he?  Because, in the knockout round, Nate's predicted favorites won *every game* up to Germany/Brazil. Twelve in a row. What would have "super impressed" this guy, 13 out of 12?

Here's another one: 

"To be fair to Nate Silver + 538, their model on the whole was excellent. It's how they dealt with Brazil where I (and others) had problems."

What kind of problems?  Not picking them to lose 7-1?  

In fairness, sure, there's probably some basis for critiquing Nate's model, since he's been giving Brazil significantly higher odds than the bookies. But, in this case, the difference was between 65% and 49%, not between 65% and "OMG, it's a history-making massacre!"  So this is not really a convincing argument against Nate's method.

It's kind of like your doctor says, "you should stop smoking, or you're going to die before you're 50!"  You refuse, and the day before your fiftieth birthday, a piano falls on your head and kills you. And the doctor says, "See? I was right!" 

4. Here's a mathematical one, from a Guardian blogger. He notes that Nate's model assumed goals were independent and Poisson, but, in real life, they're not -- especially when a team collapses and the opponent scores in rapid-fire fashion.

All very true, but that doesn't invalidate Nate's model. Nate didn't try to predict the score -- just the outcome. Whether a team collapses after going down 3-0, or not, doesn't much affect the probability of winning after that, which is why any reasonable model doesn't have to go into that level of detail.

Which is why, actually, a 7-1 loss isn't necessarily inconsistent with being a favorite. Imagine if God had told the world, "if Brazil falls behind 2-0, they'll collapse and lose 7-1." Nate would have figured: "Hmmm, OK, so we have to subtract off the chance of 'Brazil gives up the first two goals, but then dramatically comes back to win the game,' since God says that can't happen."

Nate would have figured that's maybe 1 percent of all games, and said, "OK, I'm reducing my 65% to 64%."  

So, that particular imperfection in the model isn't really a serious flaw. 
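
(If you're curious what a bare-bones Poisson model actually looks like, here's a sketch. To be clear, this is NOT Nate's actual model, and the expected-goals numbers are invented; the point is just that a model like this spits out win/draw/loss probabilities, and that a lopsided score like 7-1 gets a small but nonzero probability along the way.)

# Toy Poisson-goals model: two made-up scoring rates in, outcome probabilities out.
from math import exp, factorial

def pois(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

lam_brazil, lam_germany = 1.7, 1.3       # invented expected goals, not Nate's numbers

win = draw = loss = 0.0
for b in range(11):                      # 0 to 10 goals covers nearly all the probability
    for g in range(11):
        p = pois(b, lam_brazil) * pois(g, lam_germany)
        if b > g:
            win += p
        elif b == g:
            draw += p
        else:
            loss += p

print(round(win, 3), round(draw, 3), round(loss, 3))   # Brazil win / draw / Germany win
print(pois(1, lam_brazil) * pois(7, lam_germany))      # P(exactly 1-7): tiny, but not zero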

But, now that I think about it ... imagine that when Nate published his 65% estimate, he explicitly mentioned, "hey, there's still a 1-in-3 chance that Brazil could lose ... and that includes a chance that Germany will kick the crap out of them. So don't get too cocky."  That would have helped him, wouldn't it?  It might even have made him look really good!

I mean, he shouldn't need to say it to statisticians, because it's an obvious logical consequence of his 65% estimate. But maybe it needs to be said to the public.


"It's hard to imagine how Silver could have been more wrong."

No, it's not hard to imagine at all. If Nate had predicted, "Germany has a 100% chance of winning 7-1," that would have been MUCH more wrong. 

6. Finally, and worst for last ... here's a UNC sociology professor lecturing Nate on how he screwed up, without apparently really understanding what's going on at all. I could spend an entire post on this one, but I'll just give you a summary. 

First, she argues that Nate should have paid attention to sportswriters, who said Brazil would struggle without those missing players. Researchers need to know when to listen to subject-matter experts, who know something Nate's mathematical models don't. 

Well, first, she's cherry-picking her sportswriters -- they didn't ALL say Brazil would lose badly, did they?  You can always find *someone*, after the fact, who bet the right way. So what?

As for subject-matter experts ... Nate actually *is* a subject-matter expert -- not on soccer strategy, specifically, but on how sports works mathematically. 

On the other hand, a sociology professor is probably an expert in neither. And it shows. At one point, she informs Nate that since the Brazilian team has been subjected to the emotional trauma of losing two important players, Nate shouldn't just sub in the skills of the two new players and run with it as if psychology isn't an issue. He should have *known* that kind of thing makes teams, and statistical models, collapse.

Except that ... it's not true, and subject-matter experts like Nate who study these things know that. There are countless cases of teams who are said to "come together" after a setback and win one for the Gipper -- probably about as many as appear to "collapse". There's no evidence of significant differences at all -- and certainly no evidence that's obvious to a sociologist in an armchair. 

Injuries, deaths, suspensions ... those happen all the time. Do teams play worse than expected afterwards?  I doubt it. I mean, you can study it, there's no shortage of data. After the deaths of Thurman Munson, Lyman Bostock, Ray Chapman, did their teams collapse?  I doubt it. What about other teams that lost stars to red cards?  Did they all lose their next game 7-1?  Or even 6-2, or 5-3?

Anyway, that's only about one-third of the post ... I'm going to stop, here, but you should read the whole thing. I'm probably being too hard on this professor, who didn't realize that Nate is the expert and not her, and wrote like she was giving a stock lecture to a mediocre undergrad student quoting random regressions, instead of to someone who actually wrote a best-selling book on this very topic.

So, moving along. 

------

There is one argument that would legitimately provide evidence that Nate was wrong. If any of the critics had offered convincing evidence that Brazil actually had much less TALENT than Nate and others estimated -- evidence that was freely available before the game -- that would certainly be legitimate.

Something like, "Brazil, as a team, is 2.5 goals above replacement with all their players in action, but I can prove that, without Neymar and Silva, they're 1.2 goals *below* replacement!"

That would work. 

And, indeed, some of the critiques seem to be actually suggesting that. They imply, *of course* Brazil wouldn't be any good without those players, and how could anyone have expected they would be?  

Fine. But, then, why did the bookmakers think they still had a 49% chance? Are you that smart that you saw something? OK, if you have a good argument that shows Brazil should have been 30%, or 20%, then, hey, I'm listening.

If the missing two players dropped Brazil from a 65% talent to a 20% talent, what is each worth individually? Silva is back for today's third-place game against Holland. What's your new estimate for Brazil ... maybe back to 40%?

Well, then, you're bucking the experts again. Brazil is actually the favorite today. The betting markets give them a 62% chance of beating the Netherlands, even though Neymar is still out. Nate has Brazil at 71%. If you think the Brazilians are really that bad, and Nate's model is a failure, I hope you'll be putting a lot of money on the Dutch today. 

Because, you can't really argue that Brazil is back to their normal selves today, right?  An awful team doesn't improve its talent that much, from 1-7 to favorite, just from the return of a single player, who's not even the team's best. No amount of coaching or psychology can do that.

If you thought Brazil's 7-1 humiliation was because of bad players, you should be interpreting today's odds as a huge mistake by the oddsmakers. I think they're confirmation that Tuesday's outcome was just a fluke. 

As I write this, the game has just started. Oh, it's 2-0 Netherlands. Perfect. You're making money, right?  Because, if you want to persuade me that you have a good argument that Nate was obviously and egregiously incorrect, now you can prove it: first, show me where you wrote he was still wrong and why; and, second, tell me how much you bet on the underdog Netherlands.

Otherwise, I'm going to assume you're just blowing hot air. Even if Brazil loses again today. 


-----

Update/clarification: I am not trying to defend Nate's methodology against others, and especially not against the Vegas line (which I trust more than Nate's, until there's evidence I shouldn't).  

I'm just saying: the 7-1 outcome is NOT, in and of itself, sufficient evidence (or even "good" evidence) that Nate's prediction was wrong.  




Wednesday, July 09, 2014

"The Cult of Statistical Significance"

"The Cult of Statistical Significance" is a critique of social science's overemphasis on confidence levels and its convention that only statistically-significant results are worthy of acceptance. It's by two academic economists, Stephen Ziliak and Deirdre McCloskey, and my impression is that it made a little bit of a splash when it was released in 2008.

I've had the book for a while now, and I've been meaning to write a review. But, I haven't finished reading it, yet; I started a couple of times, and only got about halfway through. It's a difficult read for me ... it's got a flowery style, and it jumps around a bit too much for my brain, which isn't good at multi-tasking. But a couple of weeks ago, someone on Twitter pointed me to this .pdf -- a short paper by the same authors, summarizing their arguments. 

------

Ziliak and McCloskey's thesis is that scientists are too fixated on significance levels, and not enough on the actual size of the effect. To illustrate that, they use an example of two weight-loss pills:


"The first pill, called "Oomph," will shed from Mom an average of 20 pounds. Fantastic! But Oomph is very uncertain in its effects—at [a standard error of] plus or minus 10 pounds. ... Could be ten pounds Mom loses; could be thrice that.

"The other pill you found, pill "Precision," will take 5 pounds off Mom on average but it is very precise—at plus or minus 0.5 pounds. Precision is the same as Oomph in price and side effects but Precision is much more certain in its effects. Great! ...

"Fine. Now which pill do you choose—Oomph or Precision? Which pill is best for Mom, whose goal is to lose weight?"

Ziliak and McCloskey -- I'll call them "ZM" for short -- argue that "Oomph" is the more effectual pill, and therefore the better choice. But, because its effect is not statistically significantly different from zero*, scientists would recommend "Precision". Therefore, the blind insistence on statistical significance costs Mom, and society, a high price in lost health and happiness.

(*In their example, the effect actually *is* statistically significant, at 2 SDs, but the authors modify the example later so it isn't.)

But: that isn't what happens in real life. In actual research, scientists would *observe* 20 pounds plus or minus 10, and try to infer the true effect as best they can. But here, the authors proceed as if we *already know* the true effect on Mom is 20 +/- 10. But if we did already know that, then, *of course* we wouldn't need significance testing!

Why do the authors wind up having their inference going the wrong way?  I think it's a consequence of failing to notice the elephant in the room, the fact that's the biggest reason significance testing becomes necessary. That elephant is: most pills don't work. 

What I suspect is that when the authors see an estimate of 20, plus or minus 10, they think that must be a reasonable, unbiased estimate of the actual effect. They don't consider that most true values are zero, therefore, most observed effects are just random noise, and that the "20 pounds" estimate is likely spurious.

That's the key to the entire issue of why we have to look at statistical significance -- to set a high-enough bar that we don't wind up inundated with false positives.

At best, the authors are setting up an example in which they already assume the answer, then castigating statistical significance for getting it wrong. And, sure, insisting on p < .05 will indeed cause false negatives like this one. But ZM fail to weigh those false negatives against the inevitable false positives that would result if significance were ignored -- without realizing that we first need evidence the effect exists at all.

-----

In fairness, Ziliak and McCloskey don't say explicitly that they're rejecting the idea that most pills are useless. They might not actually even believe it. They might just be making statistical assumptions that necessarily assume it's true. Specifically:

-- In their example, they assume that, because the "Oomph" study found a mean of 20 pounds and SD of 10 pounds, that's what Mom should expect in real life. But that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- They also seem to assume the implication of that, that when you come up with a 95% confidence interval for the size of the effect, there is actually a 95% probability that the effect lies in that range. Again, that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

-- And, I think they assume that if a result comes out with a p-value of .75, it implies a 75% chance that the true effect is greater than zero. Same thing: that only follows if every effect has the same probability of occurrence -- which isn't the case, since most true effects are actually zero.

I can't read minds, and I probably shouldn't assume that's what ZM were actually thinking. But that one single assumption would easily justify their entire line of argument -- if only it were true. 

And it certainly *seems* justifiable, to assume that every effect size is equally likely. You can almost hear the argument being made: "Why assume that the drug is most likely useless?  Isn't that an assumption without a basis, an unscientific prejudice?  We should keep a completely open mind, and just let the data speak."  

It sounds right, but it's not. "All effects are equally likely" is just as strong a prejudice as "Zero is most likely."  It just *seems* more open-minded because (a) it doesn't have to be said explicitly, (b) it keeps everything equal, which seems less arbitrary, and (c) "don't be prejudiced" seems like a strong precedent, being such an important ethical rule for human relationships.

If you still think "most pills don't work" is an unacceptable assumption ... imagine that instead of "Oomph" being a pill, it was a magic incantation. Are you equally unwilling to accept the prejudice "most incantations don't work"?

If it is indeed true that most pills (and incantations) are useless, ignoring the fact might make you less prejudiced, but it will also make you more wrong. 

----

And "more wrong" is something that ZM want to avoid, not tolerate. That's why they're so critical of the .05 rule -- it causes "a loss of jobs, justice, profit, and even life."  Reasonably, they say we should evaluate the results not just on significance, but on the expected economic or social gain or loss. When a drug appears to have an effect on cancer that would save 1,000 lives a year ... why throw it away because there's too much noise?  Noise doesn't cost lives, while the pill saves them!

Except that ... if you're looking to properly evaluate economic gain -- costs and benefits -- you have to consider the prior: the probability that the pill does anything at all, before you ever see the data. 

Suppose that 99 out of 100 experimental pills don't work. Then, when you get a p-value of .05, there's only about a 17 percent chance that the pill has a real effect. Do you want to approve cancer pills when you know five-sixths of them don't do anything?

(Why 5/6?  Of the 99 worthless drugs, about 5 will show significance just by random chance, while the one genuinely effective drug -- assuming it's potent enough to show significance reliably -- adds one more. So you accept about 5 spurious effects for every real one.)

And that 17 percent is when you *do* have p=.05 significance. If you loosen the threshold, it gets worse. At p=.20, say, you get about 20 false positives for every real one.

Doing the cost-benefit analysis for Mom's diet pill ... if there's only a 1 in 6 chance that the effect is real, her expectation is a loss of about 3.3 pounds (20 times 1/6), not 20. In that case, she is indeed better off taking "Precision" than "Oomph".
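
If you want the arithmetic all in one place, here's a sketch. As above, I'm assuming 1 candidate pill in 100 really works, and, to keep things simple, that a genuinely effective pill always clears the significance bar:

    real, worthless = 1, 99     # assumption: 1 real pill per 99 duds
    oomph_effect = 20           # pounds, if the effect is real

    for alpha in (0.05, 0.20):
        false_positives = worthless * alpha        # duds that pass by luck
        p_real = real / (real + false_positives)   # chance a passing pill is real
        expected_loss = p_real * oomph_effect      # Mom's expected weight loss
        print(alpha, round(p_real, 2), round(expected_loss, 1))

    # At .05: about a 17% chance the pill is real, expected loss ~3.4 pounds
    #         (3.3 if you round the odds to 1 in 6).
    # At .20: about a 5% chance -- roughly 20 false positives per real one.

Either way, the expected payoff from "Oomph" falls well short of the near-certain 5 pounds from "Precision".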

-----

If you don't read the article or book, here's the one-sentence summary: Scientists are too concerned with significance, and not enough with real-life effects. Or, as Ziliak and McCloskey put it, 


"Precision is Nice but Oomph is the Bomb."

The "oomph" -- the size of the coefficient -- is the scientific discovery that tells you something about the real world. The "precision" -- the significance level -- tells you only about your evidence and your experiment.

I agree with the authors on this point, except for one thing. Precision is not merely "nice". It's *necessary*. 

If you have a family of eight and shop at Costco and need a new vehicle, "Tires are Nice but Cargo Space is the Bomb." That's true -- but the "Bomb" is useless without the "Nice".

Even if you're only concerned with real-world effects, you still need to consider p-values in a world where most hypotheses are false. As critical as I have been of the way significance is used in practice, it's still essential to consider it, in some form, to filter out the false positives -- cases where you'd mistakenly approve treatments that are no better than sugar pills. 

None of that ever figures into the authors' arguments. Because they never consider the false positives -- the word "false" doesn't appear anywhere in their essay, never mind "false positive" -- the authors can't figure out why everyone cares about significance so much. The only conclusion they can come up with is that scientists must worship precision for its own sake. They write, 


"[The] signal to noise ratio of pill Oomph is 2-to-1, and of pill Precision 10-to-1. Precision, we find, gives a much clearer signal—five times clearer.

"All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way. "Well," say our significance testing colleagues, "the pill with the highest signal to noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision.” 

"But Precision—precision commonly defined as a large t-statistic or small p-value on a coefficient—is obviously the wrong choice. Wrong for Mother's weight-loss plan and wrong for the many other victims of the sizeless scientist. The sizeless scientist decides whether something is important or not—he decides "whether there exists an effect," as he puts it—by looking not at the something's oomph but at how precisely it is estimated. Mom wants to lose weight, not gain precision."

Really?  I have much, much less experience with academic studies than the authors, but ... I don't recall ever having seen papers boast about how precise their estimates are, except as evidence that effects are significant and real. I've never seen anything like, "My estimates are 7 SDs from zero, while yours are only 4.5 SDs, so my study wins!  Even though yours shows cigarettes cause millions of cancer deaths, and mine shows that eating breakfast makes you marginally happier."

Does that really happen?

-------

Having said that, I agree emphatically with the part of ZM's argument that says scientists need to pay more attention to oomph. I've seen many papers that spend many, many words arguing that an effect exists, but then hardly any examining how big it is or what it means. Ziliak and McCloskey refer to these significance-obsessed authors as "sizeless scientists." 

(I love the ZM terminology: "cult," "oomph," "sizeless".) 

Indeed, sometimes studies find an effect size that's so totally out of whack that it's almost impossible -- but they don't even notice, so focused are they on significance levels.

I wish I could recall a real example ... well, I can make one up, just to give you the flavor of the outrageousness I vaguely remember. It's like, someone finds a statistically significant relationship between baseball career length and lifespan, and trumpets that he has statistical significance at the 3 percent level ... but doesn't notice that his coefficient puts a Hall-of-Famer's lifespan at 180 years. 

If it were up to me, every paper would have to show the actual "oomph" of its findings in real-world terms. If you find a link between early-childhood education and future salary, how many days of preschool does it take to add, say, a dollar an hour?  If you find a link between exercising and living longer, how many marathons does it take to add a month to your life?  If fast food is linked with childhood obesity, how many pounds does a kid gain from each Happy Meal?  
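
Something like this, where every number is invented just to show the kind of translation -- and sanity check -- I have in mind:

    # Hypothetical finding: each season of a baseball career adds 6 years of
    # life on top of a 60-year baseline. (Both numbers invented.)
    baseline, years_per_season = 60, 6
    print(baseline + years_per_season * 20)   # 180 -- a 20-season Hall-of-Famer
                                              # "lives" to 180; something's wrong

    # Hypothetical finding: each day of preschool adds $0.004/hour to adult wages.
    print(1 / 0.004)                          # 250 days of preschool per extra
                                              # dollar an hour

Two lines of arithmetic, and the reader immediately knows whether the effect is big, small, or absurd.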

And we certainly also need less talk of precision. My view is that you should spend maybe one paragraph confirming that you have statistical significance. Then, shut up about it and talk about the real world. 

If you're publishing in the Journal of Costcological Science, you want to be talking about cargo space, and what the findings mean for those who benefit from Costcology. How many fewer trips to Costco will you make per year?  Is it now more efficient to get your friends to buy you gift cards instead of purchasing a membership? Are there safety advantages to little Joey no longer having to make the trip home with an eleven-pound jar of Nutella between his legs?

You don't want to be going on and on about how, yes, the new vehicle does indeed have four working tires!  And, look, I used four different chemical tests to make sure they're actually made out of rubber!  And did I mention that when I redo the regression but express the cargo space in metric, the car still tests positive for tires?  It did!  See, tires are robust with respect to the system of mensuration!

For me, one sentence is enough: "The tire treads are significant, more than 2 mm from zero."  

-----

So I agree that you don't need to talk much about the tires. The authors, though, seem to be arguing that the tires themselves don't really matter. They think drivers must just have some kind of weird rubber fetish. Because, if the vehicle has enough cargo space, who cares if the tires are slashed?

You need both: significance to make sure you're not just looking at randomness, and oomph to tell you what the science actually means.

