Sabermetric Research: Yet another r vs. r-squared explanation

There's an election with one million voters, who randomly choose between Party A and Party B. The margin of victory is important, not just who got the most votes.

If you run a regression to predict the margin of victory, based on a single vote, what will the r-squared be? It will be 1/1,000,000. That's because, if you knew all the votes, you'd know the margin of victory perfectly and the r-squared would be 1. Since no vote is more important than any other, each must have the same r-squared. And the r-squareds have to add up to that total of 1, since the voters are independent. So, 1/1,000,000 is the answer.

But, here's a different question: what is the impact of one vote on the margin of victory? Well, that margin will be fairly small -- if you do the calculation, the SD of the difference between A and B will be 1,000 votes. So, a single vote will be 1/1,000 of a typical margin. Not one in a million, but one in a thousand.

It's not that hard to see why. When a million people vote, their votes will mostly cancel out, since they're all choosing randomly. We know that they cancel down to about the square root of the number of votes, since the SD of a sum of independent votes grows only as the square root of the sample size. So they'll cancel out to a difference of 1,000 votes. John Smith becomes one vote in a margin of 1,000, instead of 1 vote in 1,000,000.

That's the r: 1/1,000. It's the square root of the r-squared of 1/1,000,000.

Effectively, a single vote sticks out higher because everyone else cancels out.
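Here's a minimal simulation sketch of the election, in Python with NumPy, for anyone who wants to check the numbers. To keep it quick, it assumes 100 voters per election instead of a million, so the predicted SD of the margin is 10, the predicted r is 1/10, and the predicted r-squared is 1/100. The function name, the seed, and the number of simulated elections are just illustrative choices.

```python
import numpy as np


def simulate_elections(n_voters=100, n_elections=50_000, seed=0):
    """Simulate coin-flip elections; return (SD of margin, r for one voter)."""
    rng = np.random.default_rng(seed)
    # Each vote is +1 (Party A) or -1 (Party B), chosen at random.
    votes = rng.integers(0, 2, size=(n_elections, n_voters), dtype=np.int8) * 2 - 1
    # Margin of victory for A in each election (negative means B won).
    margin = votes.sum(axis=1)
    # Correlation between voter 0's vote and the margin, across elections.
    r = np.corrcoef(votes[:, 0], margin)[0, 1]
    return float(margin.std()), float(r)


if __name__ == "__main__":
    sd_margin, r = simulate_elections()
    print(f"SD of margin: {sd_margin:.2f}")
    print(f"r for one voter: {r:.3f}, r-squared: {r ** 2:.4f}")
```

With a finite number of simulated elections the estimates will bounce around a bit, but they should land close to 10 for the SD and close to 0.1 for r.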

-----

This, I think, is a good analogy to visualize the difference between r-squared and r:

-- r-squared tells you how important the factor is relative to all the other factors.

-- r tells you how important the factor is relative *to the size of the final outcome*.

The size of the outcome -- in the sense of the difference from the mean -- is proportional to the square root of the number of independent "others", which is why this works out.

-----

This appears to lead to a contradiction: if there are a million voters, and they're all equal, how can they ALL be 1/1000 of the outcome? That would add up to one thousand outcomes!

But that's OK. Remember, we're talking about the *size* of the outcome, not the responsibility for it. If all million voters voted for A, the outcome would have been a 1,000,000 vote margin, instead of just 1,000. So, all the voters combined DO add up to 1,000 outcomes -- by size.

If you don't like that, here's a non-statistical analogy. Suppose there's an election where party A wins by one vote. Whose vote tipped the balance? Everyone's! That is, everyone who voted for party A. There might have been 500,001 votes for A, and 500,000 votes for B. If *any* of the A voters had voted the other way instead, B would have won.

That is: 500,001 voters can say that *they* made 100% of the difference in the election -- and they'd all be right! That is, it's perfectly OK that the voters' effects add up to a huge number. It only seems like they shouldn't be able to; that's an illusion.
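Going back to the million random voters, the bookkeeping can be checked with a few lines of arithmetic. This is just a sketch in Python; the variable names are mine, and it takes the 1/1,000,000 r-squared per voter from the argument above as given.

```python
import math

n_voters = 1_000_000

# Each voter's r-squared with the margin, and the corresponding r.
r_squared = 1 / n_voters
r = math.sqrt(r_squared)

# r-squareds are shares of the variance, so they total 1 across all voters.
total_r_squared = n_voters * r_squared
# r's are sizes relative to a typical margin, so they total sqrt(n_voters).
total_r = n_voters * r

print(total_r_squared)
print(total_r)
```

The r-squareds sum to about 1 (one election's worth of variance), while the r's sum to about 1,000 (one thousand typical margins, by size). Both totals are fine; they're just measuring different things.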

-----

Moving to a sports example ... let's go back to payroll vs. wins in baseball. Suppose you do a regression, like the ones in this Freakonomics post, and you find that the r-squared equals .1, as it was in 2008.

What that means is: payroll explained about 1/10 of the variance of wins. That means that, in a sense, there could be 9 other factors that are just as important as payroll. (Or, one factor that's "nine times" as important. Or one factor "five times" as important, and two other factors "two times" as important, or some combination like that.)

That is: payroll gets "one vote out of 10" in determining wins.

OK, fair enough.

But: those 9 other factors can be treated as independent and random (as an assumption of the regression -- and, also, if they were correlated with payroll, the regression would lump that in with payroll). Therefore, they mostly cancel each other out, down to their square root. So the SD of the other 9 factors combined is only 3 times the SD of payroll (3 being the square root of 9).

If you add payroll back in as the 10th factor, you get that the SD of the total is the square root of 10 times the SD of payroll (the square root of: 3 squared from the other factors, plus 1 squared from payroll). That's around 3.16.

If payroll represents a single vote, the margin of victory is 3.16 votes. So, salary influences wins not by 1/10 (0.1), but by 1/3.16 (0.32), which is the square root of 0.1.

Which is why we say, if you increase your payroll by 1 SD, you increase your wins by 0.32 SD. If you move one inch up or down the normal curve for payroll, you'll move 0.32 inches up or down the normal curve for wins.

That's fairly large.
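To see the 0.32 come out of an actual regression, here's a rough sketch in Python with NumPy. It uses synthetic data, not real payroll figures: payroll is one factor with SD 1, and wins are payroll plus nine other independent factors of the same size. The function name, the seed, and the 200,000 made-up team-seasons are all assumptions for illustration.

```python
import numpy as np


def payroll_sketch(n_teams=200_000, seed=0):
    """Return (r, r_squared, slope in SD units) for synthetic payroll vs. wins."""
    rng = np.random.default_rng(seed)
    # Payroll effect with SD 1 -- synthetic, not real payroll data.
    payroll = rng.normal(0.0, 1.0, n_teams)
    # Nine other independent factors, each with the same SD as payroll.
    others = rng.normal(0.0, 1.0, (n_teams, 9)).sum(axis=1)
    wins = payroll + others
    r = np.corrcoef(payroll, wins)[0, 1]
    # Regress standardized wins on standardized payroll.
    z_payroll = (payroll - payroll.mean()) / payroll.std()
    z_wins = (wins - wins.mean()) / wins.std()
    slope = np.polyfit(z_payroll, z_wins, 1)[0]
    return float(r), float(r ** 2), float(slope)


if __name__ == "__main__":
    r, r_squared, slope = payroll_sketch()
    print(f"r = {r:.3f}, r-squared = {r_squared:.3f}, slope in SDs = {slope:.3f}")
```

The r-squared should come out near 0.1 and the r near 0.32, and the slope of the standardized regression should match the r: one SD of payroll goes with about 0.32 SD of wins.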

-----

The moral of the story is:

1. The r-squared tells you what percentage of the *votes* you got.

2. The r tells you what percentage of the *result* was because of you.

For a cause-and-effect relationship, like payroll and wins, you almost always want number 2.

-----

(I've written about r and r-squared numerous times in the past, such as here.)

Labels: r-squared, regression, statistics

Tuesday, October 23, 2012
