Wednesday, November 21, 2007

Still more on r-squared

This post is about "r-squared" again. It's one in a series of posts about how r-squared, and statements like "payroll explains 18% of wins," are not very meaningful.

(Previous posts are here and here and here and here and here.)


Suppose you want to get from Meville to Youtown. The cities are 11.66 miles apart, as the crow flies. But you can't get there in a straight line. Instead, you have to go east 6.32 miles, then north 9.72 miles.

Now, ask yourself: what percentage of the 11.66-mile distance is "explained" by the 6.32-mile eastbound leg? And what percentage is explained by the 9.72-mile northbound leg?

One way to answer the question is to figure, hey, the trip is 16.04 miles total. So the 6.32-mile leg is 39.4% of the total, and the 9.72-mile leg is 60.6% of the total:

39.4% short leg
60.6% long leg

Alternatively, you might figure that distance doesn't matter, just time. And it turns out the short leg takes 20 minutes to ride on your bike, but the second leg also takes 20 minutes, even though it's longer, because it's on smoother terrain. So it's

50% short leg
50% long leg

If you don't like those choices, then, intuitively, you might do it by the amount of gas used by each leg if you drive. Or how much you have to pay for a bus ticket for each leg. Or any other measure like that.

Or, you could do it this way. You could say: by the Pythagorean theorem and the little map above, we know that 6.32 squared plus 9.72 squared = 11.66 squared. Put into numbers, the short leg squared is 39.94, the long leg squared is 94.48, and the total distance squared is 135.96. So if we divide the squares, we can say that the short leg is 39.94/135.96, and the long leg is 94.48/135.96:

29.4% short leg
70.6% long leg
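
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the split by distance and the split by squared distance. The function name `shares` and its `power` argument are made up for illustration. It normalizes by the sum of the two legs after raising each to the given power, so the squared split comes out a few tenths of a point away from the figures above: the text divides by 11.66 squared, and the quoted leg lengths satisfy the Pythagorean relation only approximately.

```python
# Leg lengths in miles, as quoted in the post.
EAST_LEG = 6.32
NORTH_LEG = 9.72


def shares(legs, power):
    """Return each leg's share of the total after raising every leg to `power`.

    power=1 splits the trip by plain distance; power=2 splits it by
    squared distance, which is the r-squared style of accounting.
    """
    raised = [leg ** power for leg in legs]
    total = sum(raised)
    return [value / total for value in raised]


for power in (1, 2):
    short_share, long_share = shares([EAST_LEG, NORTH_LEG], power)
    print(f"power {power}: short leg {short_share:.1%}, long leg {long_share:.1%}")
```

Nothing about the trip changes between the two lines of output. Only the exponent does.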

To me, among all these alternatives, this last one makes the *least* amount of sense. Why use the squares? The numbers 39.94 and 94.48 don't really mean anything. There is no human-related description of this trip that could make the 29.4% figure mean something intuitive. Squaring the distances squeezes the meaning out of them.

But this last method is exactly what r-squared is doing!

Take a baseball example. The numbers in the above example are exactly the standard deviations that apply to team wins. For a certain group of seasons (that I don't have in front of me right now), the standard deviation of team wins is 11.66 wins. Decomposing that, and making a few simplifying assumptions, you can figure out that it's comprised of

Luck, with a standard deviation of 6.32 wins;
Talent, with a standard deviation of 9.72 wins.

If you assume luck and talent are independent, then, by the mathematical properties of standard deviations, the relationship between the three SDs (luck, talent, and total) is the same as the sides of a right triangle: 6.32 squared plus 9.72 squared equals 11.66 squared. That's the same right triangle we had for the Meville to Youtown trip.
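
The property being used here can be written out in a couple of lines. The symbols L, T, and W are labels introduced for this sketch (luck, talent, and their sum, wins); they are not notation from the post.

```latex
\begin{aligned}
\operatorname{Var}(W) &= \operatorname{Var}(L + T)
  = \operatorname{Var}(L) + \operatorname{Var}(T) + 2\operatorname{Cov}(L, T) \\
  &= \sigma_L^2 + \sigma_T^2
  \quad \text{when } L \text{ and } T \text{ are independent, so } \operatorname{Cov}(L, T) = 0 .
\end{aligned}
```

Taking square roots gives \sigma_W = \sqrt{\sigma_L^2 + \sigma_T^2}, which has the same form as the Pythagorean theorem with the two SDs as the legs.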

Ask yourself now: how much of team wins is explained by luck, and how much by talent? An intuitive answer (which I'm not advocating, just noting that it's intuitive) would be to notice that the talent number is about 50% higher than the luck number.

But statisticians wouldn't do that. They'd square everything, like in the Meville-to-Youtown example. Since 6.32 squared divided by 11.66 squared equals .294, statisticians would say that the r-squared between luck and wins is .294. Or, they'd say

29.4% of the variance in wins is explained by luck;
70.6% of the variance in wins is explained by talent.
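
One way to see where percentages like these come from is to simulate it. The sketch below is not the calculation behind the numbers above. It assumes luck and talent are independent, centered on zero, and normally distributed (the normal shape is an assumption made here for convenience), and it uses the two SDs quoted earlier. All the names in it are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

LUCK_SD = 6.32    # SD of luck, in wins, as quoted in the post
TALENT_SD = 9.72  # SD of talent, in wins, as quoted in the post
N = 100_000       # number of simulated team-seasons (arbitrary)

# Assumption: both components are independent normal draws around zero,
# so each "wins" value is a deviation from a league-average record.
luck = [random.gauss(0, LUCK_SD) for _ in range(N)]
talent = [random.gauss(0, TALENT_SD) for _ in range(N)]
wins = [l + t for l, t in zip(luck, talent)]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def r_squared(xs, ys):
    """Squared correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov ** 2 / (variance(xs) * variance(ys))


wins_sd = math.sqrt(variance(wins))
r2_luck = r_squared(luck, wins)
r2_talent = r_squared(talent, wins)

print(f"SD of wins:                {wins_sd:.2f}")
print(f"r-squared, luck vs wins:   {r2_luck:.3f}")
print(f"r-squared, talent vs wins: {r2_talent:.3f}")
print(f"talent SD / luck SD:       {TALENT_SD / LUCK_SD:.2f}")
```

With the quoted SDs, the simulated r-squared for luck lands near .30 rather than exactly .294, because 6.32 squared plus 9.72 squared is about 134.4 rather than 11.66 squared. The last line prints the plain ratio of the two SDs, which is the "about 50% higher" comparison.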

That's true, as a mathematical statement. Notice the words "of the variance in wins". The variance is the standard deviation squared – about 136 square wins. And that number doesn't mean anything. Squaring the standard deviation is arbitrary, as arbitrary as squaring the 6.32 mile distance travelled on the eastbound leg of the trip. If you're going to square it, why not cube it? Or take the square root, or the fifth power? Well, actually, there are very good reasons to take the square – but those reasons are related to statistical properties, not real life.

And if you're talking about real life concepts, those are what you should be measuring. If a $20 shirt is on sale for $10, it's 50% off. The fact that if a $20-squared shirt were on sale for $10-squared, it would be 75% off ... well, that just doesn't matter.

I still argue that when a statistician says "XX% of the variance in Y is explained by Z," that is something that is only meaningful to statisticians. It is not something that should have a lot of meaning to a casual fan, or in real life.



At Friday, November 23, 2007 8:58:00 AM, Anonymous said...

I would base the "explain" on the standard deviation, not the variance.

The standard deviation among team pitching is around double that of fielding. The variance is obviously therefore 4 times as large.

If you were to ask me how much of defense is pitching, I would say 67%, not 80%.

You pay based on the standard deviation, not on the variance.

