On correlation, r, and r-squared
The ballpark is ten miles away, but a friend gives you a ride for the first five miles. You’re halfway there, right? Nope, you’re actually only one quarter of the way there.
That’s according to traditional regression analysis, which bases some of its conclusions on the square of the distance, not the distance itself. You had ten times ten, or 100 miles squared to go – your buddy gave you a ride of five times five, or 25 miles squared. So you’re really only 25% of the way there.
This makes no sense in real life, but, if this were a regression, the "r-squared" (which is sometimes called the "coefficient of determination") would indeed be 0.25, and statisticians would say the ride "explains 25% of the variance." There are good mathematical reasons why they say this, but they mean "explains" in the mathematical sense, not in the real-life sense.
For real life, you can use "r" instead. That’s the correlation coefficient, which is the square root of the r-squared: here, the square root of 0.25, or 0.5. In this example, r=0.5 is obviously the value that makes the most sense in the context of getting to the ballpark, because you really are, in the real-life sense, halfway there.
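The ballpark arithmetic can be written out in a few lines of Python. This is only an illustration of the analogy, not a regression; the variable names are made up for the example.

```python
import math

total_miles = 10
ride_miles = 5

# Share of the trip measured in plain miles (the everyday reading).
share_of_distance = ride_miles / total_miles

# Share of the trip measured in squared miles (the r-squared style reading).
share_of_squared_distance = ride_miles**2 / total_miles**2

# Taking the square root recovers the plain-miles share.
recovered_share = math.sqrt(share_of_squared_distance)

print(share_of_distance, share_of_squared_distance, recovered_share)
# prints: 0.5 0.25 0.5
```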
r is usually the value you use to draw real-life conclusions from a regression. According to "The Hidden Game of Baseball," if you regress Runs Scored against Winning Percentage, you get an r of .737, which is an r-squared of .543. A statistician might use the r-squared to say that runs "explain 54.3% of the variation in winning percentage." Which is true if you are concerned with the sums of the squares of the differences – and only a statistician cares about those.
What real people are concerned about is what conclusions we can draw about baseball. And those conclusions are based on the "r", the 0.737. What that tells us is that (a) if a team ranks one standard deviation above average in runs scored, then (b) on average, it will rank 0.737 standard deviations above average in winning percentage.
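One way to see why r reads as "standard deviations of y per standard deviation of x" is to convert both variables to SD units and fit a line: the slope comes out equal to r. The sketch below checks that on invented data. The team numbers are randomly generated for illustration and are not from "The Hidden Game of Baseball," and the helper names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def standardize(values):
    """Convert values to z-scores: SDs above or below the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


# Invented team-seasons: runs scored and a noisy winning percentage.
rng = random.Random(0)
runs = [rng.gauss(700, 80) for _ in range(500)]
win_pct = [0.5 + 0.0005 * (r - 700) + rng.gauss(0, 0.04) for r in runs]

r = pearson_r(runs, win_pct)
slope_in_sd_units = ols_slope(standardize(runs), standardize(win_pct))

# The two numbers agree up to floating-point rounding.
print(round(r, 6), round(slope_in_sd_units, 6))
```

Whatever data you feed it, the standardized slope matches r, which is the sense in which an r of 0.737 means 0.737 SDs of winning percentage per SD of runs.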
The 73.7% is useful information about the value of runs to winning ballgames. But the 54.3% figure doesn’t tell you anything you need to know.
I made this point in my review of "The Wages of Wins," where the authors found that payroll "explains only 18%" of wins. They were using r-squared. The r is the square root of .18, which is about .42. Every standard deviation (SD) of increased salary leads to an increase of 0.42 SD in wins. In real life, salary explains 42% of wins – although a statistician would probably never put it that way.
Sometimes, the correlation coefficient is used not to predict anything, but just to give you an idea of the relationship between variables. Everyone knows that +1 is a perfect positive relationship, -1 is a perfect negative relationship, and 0 is no relationship at all. And the higher the absolute value of the number, the stronger the relationship. So an r of 0.1 is a weak relationship, but -0.9 is a very strong relationship.
But a "very strong relationship" depends on the context. Sean Forman reports that the year-to-year correlation of players’ batting averages is 0.45. That’s pretty high. But if the game-to-game correlation were 0.45, that would be enormous! It would indicate a huge "hot hand" effect. It would mean that if a player was two hits above average one night – say, he went 3-for-4 instead of 1-for-4 – he would be expected to be 0.45 times two, or 0.9, hits above average the next night. That would mean that a .250 hitter turns into a .475 hitter after a 3-for-4 game!
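The hot-hand arithmetic above fits in one small function. This is a sketch under the same simplifying assumptions the paragraph makes: four at-bats every game, and the same spread of hits from one game to the next, so that the correlation can be applied directly to hits above average. The function name is hypothetical.

```python
def expected_next_game_average(correlation, hits_above_average,
                               typical_hits, at_bats):
    """Expected batting average next game, given tonight's surplus of hits.

    Assumes equal game-to-game spread, so the expected surplus next game
    is simply correlation * tonight's surplus.
    """
    expected_extra_hits = correlation * hits_above_average
    return (typical_hits + expected_extra_hits) / at_bats


# A .250 hitter (1-for-4 on a typical night) who just went 3-for-4.
next_avg = expected_next_game_average(
    correlation=0.45, hits_above_average=2, typical_hits=1, at_bats=4
)
print(round(next_avg, 3))  # prints: 0.475
```

Plugging in a small correlation such as 0.04 instead gives an expected bump of only 0.08 hits, which is why a number that small is still plausible at the game level.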
Obviously, if you really did the experiment of computing game-to-game correlations, you’d get a very small number. I’m guessing, but, for the sake of argument, let’s say it might be 0.04.
Now, these two correlations are measuring the same ability – hitting for average. But because of context, a 0.45 can be pretty high in the season case, but earth-shattering in the game case. Conversely, 0.04 is meaningful in the game case, but, in the season case, it would show that batting average is barely a repeatable skill at all.
It all depends on context.
I mention this because of a blog entry on the "Wages of Wins" website. There, David J. Berri compares his book’s quarterback ranking to various versions of more sophisticated stats from Football Outsiders. He finds that the correlations are 90%, 92%, and 95% respectively.
And so he writes, "this exercise reveals that there is a great deal of consistency between the Football Outsiders metrics and the metrics we report in The Wages of Wins."
With which I disagree. The interpretation of a correlation coefficient depends, again, upon the context. If you were completely ignorant about football statistics, then, yes, a 90% correlation would indicate that you’re measuring roughly the same thing. But given the vast amount of sabermetric knowledge we have about football, 90% could mean the statistics are very different at the margins of knowledge.
For instance, I’d bet that, in baseball, Total Average and Runs Created might correlate on the order of 90%. But, given our knowledge of baseball, we know that Total Average is unsatisfactory in many ways, and the differences are significant at the level of detail that we need for future research. 90% is enough to put Babe Ruth on the top and Mario Mendoza on the bottom. But it’s not good enough to tell the productive base stealers from the unproductive, or give us reliable information about the relative value of hits, or even to distinguish the 55th percentile player from the 45th.
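To put a rough number on "good enough for Ruth versus Mendoza, not good enough for the middle of the pack," here is a minimal sketch under a strong assumption that is not in the post: the two statistics are treated as jointly normal with correlation r, and we ask how often the second statistic orders two players the same way the first one does. The function name and the percentile choices are illustrative only.

```python
import math
from statistics import NormalDist


def prob_same_ordering(r, upper_pct, lower_pct):
    """Chance that metric B ranks two players the same way metric A does.

    Assumes A and B are standard bivariate normal with correlation r, and
    that the two players sit exactly at the given percentiles of A.
    """
    std_normal = NormalDist()
    gap_in_a = std_normal.inv_cdf(upper_pct) - std_normal.inv_cdf(lower_pct)
    # Given A, B is centered on r * A with variance 1 - r^2, so the
    # difference between two independent players has variance 2 * (1 - r^2).
    expected_gap_in_b = r * gap_in_a
    noise_sd = math.sqrt(2 * (1 - r * r))
    return std_normal.cdf(expected_gap_in_b / noise_sd)


extremes = prob_same_ordering(0.9, 0.99, 0.01)   # best versus worst
middle = prob_same_ordering(0.9, 0.55, 0.45)     # two near-average players
print(extremes, middle)
```

Under these assumptions the two statistics agree on the extremes almost without fail, while for the 55th versus 45th percentile pair the agreement rate lands closer to a coin flip than to certainty.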
To sum up: in one example, a 0.45 correlation was huge; in another example, a 0.9 correlation was mediocre. If your analysis starts and stops with the correlation coefficient, you really haven’t proven anything at all.
More posts on r and/or r-squared:
The regression equation vs. r-squared
Still more on r-squared
Why r-squared doesn't tell you much, revisited
"The Wages of Wins" on r and r-squared