Monday, August 06, 2012

Why r-squared works

My most visited post on this blog, by far, is this one, which is about regression, rather than sports.  Most readers got there by googling "difference between r and r-squared."  

I've been meaning to do a long follow-up post explaining the differences more formally ... but, I guess, that was too ambitious, because I never got around to it.  So, what I figured I'd do is just write about certain aspects as I think of them.  Here's the first one, which is only about r-squared, for now.

------

You roll a single, fair die.  I try to predict what number you'll roll.  I'm trying to be as accurate as possible.  That means, I want my average error to be as small as possible. 

If the die is fair, my best strategy is to guess 3.5, which is the average roll.  If I do that, and you roll a 1 or a 6, I'll be off by 2.5.  If you roll a 2 or a 5, I'll be off by 1.5.  And if you roll a 3 or a 4, I'll be off by 0.5.

My average error, then, will be 1.5. 

What if you roll two dice instead of one?  My best guess will be 7, which again is the average.  One time out of 36, you'll roll a 2, and I'll be off by five.  Two times out of 36, you'll roll a 3, and I'll be off by four.  And so on.  Average over all 36 equally likely rolls -- the errors total 70 -- and my mean error works out to 70/36, or about 1.94.

With three dice, my best guess is again the average, or 10.5.  This time, there are 216 possible rolls.  I did the calculation ... if I've got it right, my mean error is now 2.42.

Let me repeat these for you:

1.50 -- mean error with one die
1.94 -- mean error with two dice
2.42 -- mean error with three dice
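
If you want to check those numbers for yourself, here's a quick Python sketch (nothing beyond the standard library) that enumerates every equally likely roll, guesses the average, and takes the mean error:

    from itertools import product
    from statistics import mean

    # Enumerate every equally likely roll of 1, 2, and 3 dice, guess the
    # average each time, and compute the mean (absolute) error.
    for n_dice in (1, 2, 3):
        totals = [sum(roll) for roll in product(range(1, 7), repeat=n_dice)]
        guess = mean(totals)                              # 3.5, 7, 10.5
        mean_error = mean(abs(t - guess) for t in totals)
        print(n_dice, round(float(mean_error), 2))        # 1.5, 1.94, 2.42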

As I said, these are the mean errors when I have no information at all.  If you give me some additional information, I can reduce the mean error.

Suppose we go back to the three dice case, where I guess 10.5 and my mean error is 2.42.  But, this time, let's say you tell me the number showing on the first die (designated in advance -- let's suppose it's the blue one).

That is: you roll the dice out of my sight, and come back and say, "the blue die shows a 6."  How can I reduce my error?

Well, I should guess that the total of all three dice is 13.  That's because the first die is 6, and the expectation is that the other two will add up to 7.  The obvious strategy is just to add 7 to the number on the blue die, and guess that number.

If I do that, what's my average error?  Well, it's obviously 1.94, because it's exactly the same as trying to predict two dice.

Now, what if you tell me the numbers on two dice, the blue one and the red one?  Then, my best guess is that total, plus 3.5.  My mean error will now be 1.50, because this is exactly the same as the case of guessing one die.

And, what if you tell me the numbers on all three dice?  Now, my best guess is the total, and it's going to be exactly right, for an error of zero.

Here's another summary, this time of what my mean error is depending on what information you give me:

2.42 -- no information
1.94 -- given the number on the blue die
1.50 -- also given the number on the red die
0.00 -- also given the number on the green die
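
Here's a quick Python sketch, again just enumerating all 216 equally likely rolls, that verifies those four numbers:

    from itertools import product
    from statistics import mean

    # All 216 equally likely (blue, red, green) rolls.
    rolls = list(product(range(1, 7), repeat=3))

    no_info   = mean(abs(sum(r) - 10.5) for r in rolls)                  # 2.42
    know_blue = mean(abs(sum(r) - (r[0] + 7.0)) for r in rolls)          # 1.94
    know_two  = mean(abs(sum(r) - (r[0] + r[1] + 3.5)) for r in rolls)   # 1.50
    know_all  = mean(abs(sum(r) - sum(r)) for r in rolls)                # 0.00

    print(round(no_info, 2), round(know_blue, 2),
          round(know_two, 2), round(know_all, 2))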

Since the blue die reduces my mean error from 2.42 to 1.94, it's "worth" 0.48.  Since the red die reduces my mean error from 1.94 to 1.50, it's "worth" 0.44.  Since the green die reduces my mean error from 1.50 to 0.00, it's "worth" 1.50:

0.48 -- value of blue die
0.44 -- value of red die
1.50 -- value of green die

Now, in a way, this doesn't make sense, does it?  The dice are identical except for their colors.  How can one be worth more than another?

What's really happening is that the difference isn't because of the color, but because of the *order*.  The last die reduces the error much more than the first two dice -- even by more than the first two dice combined. 

It's like, if you have a car with no tires.  Adding the first tire does you no good.  Adding the second tire does you no good.  Adding the third tire does you no good.  But, adding the fourth tire ... that fixes everything, and now you can drive away.  Even though the tires are identical, they have different values because of the different situations.

-----

But ... let's do the same thing again, but, this time, instead of the mean error, let's look at the mean SQUARE error.  That is, let's square all the errors before taking the average.

For one die, my best strategy is still to guess the average, 3.5.  One third of the time, the roll will be 1 or 6, which means my error will be 2.5, which means my *square* error will be 6.25.  One third of the time, the roll will be 2 or 5, so my error will be 1.5, and my *square* error will be 2.25.  Finally, one third of the time the roll will be 3 or 4, which means my error will be 0.5, which means my *square* error will be 0.25.

Average out all the square errors, and you get a mean square error of 2.92.

If you repeat this exercise for two dice, when I guess 7, you get a mean square error of 5.83.  And for three dice, when I guess 10.5, the mean square error is 8.75.
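
Same enumeration sketch as before, just squaring the errors this time instead of taking absolute values:

    from itertools import product
    from statistics import mean

    # Enumerate every equally likely roll, guess the average, and compute
    # the mean SQUARE error this time.
    for n_dice in (1, 2, 3):
        totals = [sum(roll) for roll in product(range(1, 7), repeat=n_dice)]
        guess = mean(totals)
        mse = mean((t - guess) ** 2 for t in totals)
        print(n_dice, round(float(mse), 2))       # 2.92, 5.83, 8.75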

Again repeating the logic: if I have to guess three dice with no information, my mean square error will be 8.75.  If you tell me the number on the blue die, we now have the equivalent of the two-dice case, and the mean square error drops to 5.83.  If you tell me the values on the blue and red dice, we're now in the one-die case and the mean square error drops further, to 2.92.  And, finally, if you tell me all three dice, the error drops all the way to zero.

8.75 -- no information
5.83 -- given the number on the blue die
2.92 -- also given the number on the red die
0.00 -- also given the number on the green die

So if we again subtract to figure out how much each die is "worth", we get:

2.92 -- value of blue die
2.92 -- value of red die
2.92 -- value of green die

Ha!  Now, we have something that makes sense: each die is worth exactly the same value, regardless of where it came in the order.  When we didn't square the errors, the last die was worth much more than the other dice.  But, when we *do* square the errors, each die is worth the same!

How come?

It's the way God designed the math.  It just so happens that errors of independent variables behave like the sides of a right triangle.  Just like, in a right triangle, "red squared" plus "blue squared" equals "hypotenuse squared," it's also true that "mean square red error" plus "mean square blue error" equals "mean square red+blue error".

It's not something statisticians made up, any more than the right triangle Pythagorean Theorem is something geometers made up.  It's just the way the universe is.

------

So, the second way, the one where all the dice seem to have equal values, is "nicer" in at least two ways.  First, it means the order in which we consider the dice doesn't matter.  Second, it means that the dice all have equal values. 

This "niceness" is the reason that statisticians use the mean *square* error, instead of just the mean error.  The mean square error, as you probably know, is a common statistical term, called the "variance".  The mean error -- without squaring -- is just called the "mean error".  It doesn't get used much in statistics at all, because it doesn't have those two nice properties (and others).

Instead of the mean error, statisticians will use the "standard deviation," or "SD", which is just the square root of the variance.  That is, it's the square root of the mean square error.  The SD is usually not too much different from the mean error.  For instance, in the two-dice case, the mean error was 1.94.  The SD, on the other hand, is the square root of 5.83, which is about 2.41.  They're roughly similar, though not equal.  (The SD will always be equal to or larger than the mean error, but usually not by much.  For a normal distribution, the SD is higher than the mean error by a factor of the square root of (pi/2) -- about 25.3 percent higher.)
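
If you want to check that normal-distribution factor, here's a quick sketch -- it assumes you have numpy installed -- that draws a large normal sample and compares its SD to its mean absolute error:

    import numpy as np

    # For normal data, SD divided by mean absolute error should come out
    # near sqrt(pi/2), about 1.253 -- i.e., the SD is about 25.3% higher.
    x = np.random.default_rng(0).normal(0.0, 1.0, size=1_000_000)
    print(np.std(x) / np.mean(np.abs(x)))         # about 1.253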

The statistical Pythagorean Theorem, then, says that if you have two independent variables, A and B, then

SD(A+B)^2  = SD(A)^2 + SD(B)^2
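
And here's a quick numerical check of that identity, a small Python sketch using two independent dice as A and B:

    from itertools import product
    from statistics import pstdev

    # Two independent dice, A and B.
    faces = range(1, 7)
    sd_a = pstdev(faces)                                          # about 1.71
    sd_b = pstdev(faces)
    sd_sum = pstdev(a + b for a, b in product(faces, repeat=2))   # about 2.42

    print(sd_sum ** 2, sd_a ** 2 + sd_b ** 2)     # both about 5.83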

------

This is really important ... I learned it in the first statistics course I took, but I finished my entire statistics degree without realizing just how important it was, and how it makes certain regression results work.  I searched the internet, and I only found one page that talks about it: it's a professor from Cornell calling it "the second most important theorem" in statistics.  He's right ... it really is amazing, and useful, that it all works out. 

In the three dice case, the Pythagorean relationship meant that each die reduced the variance by the same amount, 2.92 out of 8.75 total.  That allows statisticians to say, "The blue die explains one-third of the variance.  The red die explains one-third of the variance.  And the green die explains one-third of the variance.  Therefore, together, the three dice explain 100% of the total variance."

Because it works out so neatly, it means statisticians are going to like to talk in variances, in squares of errors instead of just errors.  That's because they can just add them up, which they couldn't do if they didn't square them first.

And that's what r-squared is all about.

Let's suppose you actually ran an experiment.  You got someone to run 1,000 rolls of three dice each, and report to you (a) the total of the three dice, and (b) the number on the blue die.  He reports back to you with 1,000 lines like this:

Total was 13; blue die was 4
Total was 11; blue die was 6
Total was 08; blue die was 1
Total was 17; blue die was 6
...

With those 1,000 lines, you run a univariate regression, to try to predict the total from just the blue die.  The regression equation might wind up looking like this (which I made up):

Predicted total = 6.97 + (1.012 * blue die)

(The coefficients should "really" be 7 and 1.000, but they won't be exact because of random variation in the rolls of the green and red dice.)

And the regression software will spit out an r-squared of approximately 0.33 -- meaning, the mean square error is reduced by approximately one-third when you change your guess from the average (around 10.5) to "6.97 + (1.012 * blue die)".

That is: "the blue die explains 33% of the variance." 

------

This makes perfect sense.  But, as I argued in my 2006 post -- and others -- the figure 33 percent is limited in its practical application.  Because, in real life, you're normally not concerned about squared errors -- you're concerned about actual errors. 

Suppose you can't perfectly estimate how many customers your restaurant will have.  Sometimes, you order too many lobsters, and you have to throw out the extras, at a cost of $10 each.  Sometimes, you order too few, and you have to turn away customers, which again costs you $10 each.  Suppose the SD of the discrepancy is 10 lobsters a day. 

If a statistician comes along and says he can reduce your error by 50% of the variance, what does that really mean to you?  Well, your squared error was 100 square lobsters.  The statistician can reduce it to 50 square lobsters.  The square root of that is about 7 lobsters.  So, even though the statistician is claiming to reduce your error by half ... in practical terms, he's only reducing it from 10 lobsters to 7.  That's an improvement of 30 percent, not 50 percent, since your monetary loss is proportional to the number of lobsters, not the number of square lobsters.
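
Here's that lobster arithmetic spelled out, a tiny Python sketch in case you want to fiddle with the numbers:

    import math

    sd_before = 10.0                       # lobsters
    var_before = sd_before ** 2            # 100 "square lobsters"
    var_after = var_before * (1 - 0.5)     # statistician removes half the variance
    sd_after = math.sqrt(var_after)        # about 7.07 lobsters

    print(sd_after, 1 - sd_after / sd_before)   # about 7.07, about 29 percent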

The r-squared of 50% is something that's meaningful to statisticians in some ways, but not always that meaningful to people in real life. 

Statements like "the model explains 50% of the variance" are useful in certain specific ways, that I'll talk about in a future post ... but it usually makes more sense, in my opinion, to think about things in lobsters, rather than in square lobsters.




7 Comments:

At Monday, August 06, 2012 7:51:00 PM, Anonymous mettle said...

Great post - really informative.

One question: So what should the statistician tell the restaurant owner? The "r" in this case (the sqrt of r^2) is .7, but in your example, the SD is reduced by 3. How is r (.7) related to the reduction in SD (3)?
How is r any more informative here?

 
At Monday, August 06, 2012 8:01:00 PM, Blogger Phil Birnbaum said...

I didn't talk about r yet. I don't think r factors into the restaurant owner's situation.

I'd tell the restaurant owner that we can reduce her errors by 30%, roughly.

 
At Monday, August 06, 2012 9:09:00 PM, Anonymous A Liberal Columnist said...

It's pretty clear when to use r-squared. If you're trying to trick your readers into thinking an effect is alarmingly large, use r-squared. But when you're trying to trick your readers into thinking an effect is modest, use r. That's what I always do.

Also, you should always obfuscate things by using the term 'variation' instead of variance. That way, people who know better can't call you out on it.

 
At Tuesday, August 07, 2012 1:22:00 PM, Anonymous Lex Logan said...

Great columns on r and r-squared. "Lib" brings up an interesting point: my stats textbooks clearly say that a regression explains r-squared percent of the "variation" in y; any idea why they don't just say variance?

 
At Tuesday, August 07, 2012 1:28:00 PM, Blogger Phil Birnbaum said...

Lib and Lex: No, I have no idea why they say "variation" instead of "variance". Anyone else know?

 
At Tuesday, August 07, 2012 3:12:00 PM, Anonymous Sean said...

Great post, Phil!

I am not certain about why the stats book would use variation instead of variance, but variance is typically considered a specific measure of variation. Alternatively, it's possible that the stats book is trying to distinguish between population vs. sample measures? Can't say if it's imprecise/lazy language or not without actually seeing the context.

 
At Wednesday, August 08, 2012 1:57:00 AM, Anonymous Patrick D said...

Hey Phil,
There's not much to comment on here, but I just wanted to stop by and say that this series of articles has been phenomenal. I think I speak for all your long-time readers yet non-commenters by saying, thanks!

 
