## Friday, August 17, 2012

### A benefit of r-squared

(This is a sequel to a previous post on r-squared.)

-----

Bob and Sam are arguing about how raffles are won.  Bob says, "People tend to win because they buy lots of tickets.  That's the biggest factor."  And Sam replies, "Well, you can buy a lot of tickets, but you still have to get lucky.  People win more because they get lucky, rather than because of how many tickets they buy."

You do a regression to predict prizes won, based on how many tickets were bought.  And the r-squared winds up at .52.

Well, Bob has won the argument, hasn't he?  The number of tickets explains 52 percent of the variance in prizes won.  If Sam were right, then luck would have to explain more than 52 percent.  Since tickets and luck are presumably independent, you can just add the variances (by the Pythagorean theorem of statistics).  That adds up to more than 104 percent, which is impossible.

Anything above 50 percent must be the biggest factor -- at least, bigger than any other factor that's independent of it.
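
To see the "variances add" claim in action, here is a rough simulation sketch.  It is not real raffle data: the model (result = ticket effect + luck, drawn independently from normal distributions) and the names `tickets`, `luck`, and `prizes` are my own assumptions, with the spread of `luck` picked so that tickets come out near 52 percent.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation, written out by hand to avoid dependencies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2012)  # fixed seed so reruns give the same numbers
n = 20000

# Toy assumption: result = ticket effect + luck, drawn independently.
# The 0.96 spread for luck is picked so tickets land near 52 percent.
tickets = [rng.gauss(0, 1.0) for _ in range(n)]
luck = [rng.gauss(0, 0.96) for _ in range(n)]
prizes = [t + l for t, l in zip(tickets, luck)]

# In a one-variable regression, r-squared is the squared correlation.
r2_tickets = pearson_r(tickets, prizes) ** 2
r2_luck = pearson_r(luck, prizes) ** 2

print(f"tickets: {r2_tickets:.3f}  luck: {r2_luck:.3f}  "
      f"sum: {r2_tickets + r2_luck:.3f}")
```

Because the two inputs are independent, the two r-squareds should sum to roughly 1, so whichever one is above one half leaves less than one half for the other.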

----

And that's where it gets tricky.  Because, it's hard to find factors that are legitimately independent.

Suppose you're looking to "explain" differences in salary.  Why does Chris make \$30,000, while Pat makes \$50,000?  What factors contribute to a higher or lower salary?

Someone does a regression, to predict salary based on intelligence (as imperfectly measured by an IQ test, say).  He winds up with an r-squared of, say, .41.  That's pretty impressive!  He concludes that how smart you are is a big predictor of how much money you'll make.

But, then, a colleague comes along.  She thinks it's education that leads to higher salaries.  She does her own regression, to predict salary based on number of years of schooling.  The r-squared is .36.  Again, impressive!  She writes a study claiming that schooling increases salaries.

Finally, a third colleague thinks it's family culture.  He does a third regression, this time using parental income.  There's again a high r-squared, this time .43.  The conclusion, this time, is that high salary is something that parents influence you to achieve.

(These r-squareds are all made up; they're probably way too high to be realistic.)

Now, if you just add up the three r-squareds, you might conclude that those three factors, taken together, explain 120% of the variation in salary!  Obviously, that can't be right.

And it's not right -- because you can only add variances when the variables are independent.

These aren't.  They're highly correlated.  If you have a high IQ, you're more likely to stay in school longer.  If your parents are academics, they probably had a high IQ, and therefore, probably, so would you.

So you can't just add up the r-squareds, like you could for the lottery example, or the dice example from the last post.
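
Here is a small sketch of how separate r-squareds can sum past 100 percent when the predictors overlap.  Everything in it is hypothetical: I assume a single hidden `shared` trait that feeds IQ, schooling, parental income, and salary alike, which is just one convenient way to make the predictors correlated, not a claim about how salaries actually work.

```python
import random

def r_squared(xs, ys):
    """r-squared of a one-variable regression (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(17)  # fixed seed so reruns give the same numbers
n = 20000

# Hypothetical setup: one hidden trait plus separate noise for each variable.
shared = [rng.gauss(0, 1.0) for _ in range(n)]
iq = [s + rng.gauss(0, 0.7) for s in shared]
schooling = [s + rng.gauss(0, 0.7) for s in shared]
parent_income = [s + rng.gauss(0, 0.7) for s in shared]
salary = [s + rng.gauss(0, 0.8) for s in shared]

# Three separate one-variable "studies", as in the story above.
single = {
    "IQ": r_squared(iq, salary),
    "schooling": r_squared(schooling, salary),
    "parental income": r_squared(parent_income, salary),
}
total = sum(single.values())

for name, value in single.items():
    print(f"{name}: {value:.2f}")
print(f"naive sum: {total:.2f}")
```

With these made-up settings, each study lands somewhere around .4 on its own, and the naive sum goes over 1.  The three studies are mostly "explaining" the same variance three times.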

All you can do, if you choose, is perhaps to say that the r-squared for IQ is the highest, so that's the most plausible theory right now.  But, really, it's probably some combination of all three factors, which overlap quite a bit.  The r-squareds don't help a whole lot, here, in supporting one hypothesis over another.

What we *can* do, to help a little bit, is run a multiple regression, using all three variables.  Multiple regression is smart enough to adjust for the fact that the variables aren't independent, by taking them all at once.

Let's suppose we did that, and we got an r-squared of 0.6. That's half of the total sum, which says that exactly half the variance is shared by the three factors.  That means that in the aggregate, our three researchers are "half right".  It doesn't mean they're all half right; one of them might be 100% right, one might be 40%, and one might be 10%.  But, on average, they're half.  That doesn't really do us a lot of good.  We still don't know which of the three factors are important in what proportion.

Or, even, if there are other factors that just happen to correlate with these.

------

It's obvious when I spell it out like this, with all three studies.  But, suppose only the schooling study had been done.  You might just read that schooling explains 36% of the variation in salary, and, without thinking, conclude that getting more formal education makes you richer.  But, you'd probably be wrong.  It might be IQ.  It might be culture.  It might be other things, that correlate with schooling, that we haven't thought of yet (how much you study, for instance, or what kind of degree you have).

This is why, I say, you always have to make an argument.  Sure, you got a high r-squared when you looked at schooling.  But why do you assume cause and effect?  How do you know it's not something else instead, something that correlates with schooling?  No statistical test can tell you.  You have to argue for it.

-------

Let me give you a baseball example.

I took every Major League Baseball team since 1970 (excluding 1981 and 1994), and ran a regression to predict their runs scored from their doubles hit.

The r-squared came out to .462.

I was surprised how high that was.  Knowing 2B, you can reduce the variance by almost half.

Can we conclude that 46 percent of baseball is doubles?  No, of course not.  It's not the doubles -- it's a confounding factor, something that correlates with doubles.  What I think is actually happening is that teams that hit a lot of doubles also hit a lot of home runs.  (The correlation between HR and 2B was .559.)

Let's get rid of doubles and substitute home runs.  Now, the r-squared is .594.

Can we conclude that 59 percent of baseball is HR?  Again, of course not.  Again, what's likely happening is that teams that hit a lot of home runs also hit a lot of doubles (and probably other things).

All we can say is that knowing HR lets us reduce our mean squared error by 59 percent.  If we want to argue *why*, that's fine, but the r-squared alone doesn't tell us.

--------

Another thing we can do is a multiple regression: predict runs based on both HR and 2B.  I did that, and I got an r-squared of .684.
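
As a sanity check, with only two predictors you can get the combined r-squared from the three pairwise correlations, using the standard textbook formula R² = (r₁² + r₂² − 2·r₁·r₂·r₁₂) / (1 − r₁₂²).  The sketch below plugs in the figures from this post.  The function name is mine, and I'm assuming that both HR and 2B correlate positively with runs, so I can take the positive square roots of the single-variable r-squareds.

```python
from math import sqrt

def r2_two_predictors(r_y1, r_y2, r_12):
    """Combined r-squared for two predictors, from pairwise correlations.

    r_y1, r_y2: correlation of each predictor with the outcome.
    r_12: correlation between the two predictors.
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Figures reported in this post (positive roots are an assumption).
r_hr = sqrt(0.594)   # HR vs. runs scored
r_2b = sqrt(0.462)   # 2B vs. runs scored
r_hr_2b = 0.559      # HR vs. 2B

combined = r2_two_predictors(r_hr, r_2b, r_hr_2b)
print(f"combined r-squared: {combined:.3f}")
```

That comes out at about .684, in line with what the multiple regression gave.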

So, if you start with HR (r-squared .594), and then add doubles, you gain an extra .090.  If you start with doubles (r-squared .462), and then add home runs, you gain an extra .222.

So, HR and 2B have .372 in common.  HR adds .222 to that.  2B adds .090 to that.  You could draw a Venn diagram to illustrate the overlap, if you wanted to.
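
The Venn-diagram bookkeeping is just subtraction on the three reported r-squareds.  A minimal sketch, with a helper name (`partition`) that I made up for illustration:

```python
def partition(r2_a, r2_b, r2_both):
    """Split a two-predictor r-squared into unique and shared pieces."""
    unique_a = r2_both - r2_b        # what A adds once B is already in
    unique_b = r2_both - r2_a        # what B adds once A is already in
    shared = r2_a + r2_b - r2_both   # the overlap region
    return unique_a, unique_b, shared

# A = home runs, B = doubles, using the r-squareds reported above.
hr_only, doubles_only, overlap = partition(0.594, 0.462, 0.684)

print(f"HR only: {hr_only:.3f}")
print(f"2B only: {doubles_only:.3f}")
print(f"shared:  {overlap:.3f}")
```

The three pieces sum back to the combined r-squared by construction.  One caution: with other data the "shared" piece can come out negative, which is a hint that the Venn picture is only a loose metaphor.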

But ... those numbers don't mean a whole lot to me, in real life terms.

-------

So what good is the r-squared, then?

For me, there's one particular task that r-squared is great for: helping figure out how much luck is embedded in performance.  For instance, how much of the variation in team W-L records is based on clutch performance, as opposed to just scoring and preventing runs?

Most sabermetricians say, not much.  My impression is that most mainstream baseball people would say, quite a bit.

Well, here's what I did.  I ran a regression to predict team wins based on runs scored and runs allowed, for 1973-2011 (omitting 1981 and 1994).  The r-squared came out to .87.

That leaves only .13 remaining for clutch performance.  That is, assuming clutch is uncorrelated with RS and RA.  It isn't, quite, but it's probably close.  In any case, you can argue that the *independent* portion has only .13 remaining to claim.  That should satisfy the clutch advocates, since they usually argue that clutch matters in a way *not measured* by raw runs.  (As in, "sure, team A and B scored 750 runs each, but team B scored them when they counted most.")

That's what I like r-squared for.  It lets you estimate the "explanation space" available for the unusual theories, the ones that don't correlate with other variables.  Effectively, it takes some of the air out of the weird hypotheses.

Suppose I have a theory that having a job interview on a good biorhythm day is an important explanation for differences in salary.  If I run a regression to account for the other, mainstream variables -- IQ, education, study habits, height, sex, race, and so on -- and I get an r-squared of .90, that means there's only .10 left for unrelated factors.  So, I have to be aware that those factors are *at least* nine times more important than my biorhythm theory.

So that's the thing I like about r-squared: it gives you a mental pie chart of the strength of competing explanations -- or, at least, competing *independent* explanations.
