### On correlation, r, and r-squared

The ballpark is ten miles away, but a friend gives you a ride for the first five miles. You’re halfway there, right? Nope, you’re actually only one quarter of the way there.

That’s according to traditional regression analysis, which bases some of its conclusions on the square of the distance, not the distance itself. You had ten times ten, or 100 miles squared to go – your buddy gave you a ride of five times five, or 25 miles squared. So you’re really only 25% of the way there.

This makes no sense in real life, but, if this were a regression, the "r-squared" (which is sometimes called the "coefficient of determination") would indeed be 0.25, and statisticians would say the ride "explains 25% of the variance." There are good mathematical reasons why they say this, but they mean "explains" in the mathematical sense, not in the real-life sense.

For real life, you can also use "r". That's the correlation coefficient, which is the square root of 0.25, or 0.5. In this example, obviously r = 0.5 is the value that makes the most sense in the context of getting to the ballpark. Because you really are, in the real-life sense, halfway there.

r is usually the value you use to draw real life conclusions from a regression. According to "The Hidden Game of Baseball," if you regress Runs Scored against Winning Percentage, you get an r of .737, which is an r-squared of .543. A statistician might use the r-squared to say that runs "explains 54.3% of the variation in winning percentage." Which is true if you are concerned with the sums of the squares of the differences – and only a statistician cares about those.

What real people are concerned about is what conclusions we can draw *about baseball*. And those conclusions are based on the "r", the 0.737. What that tells us is that (a) if a team ranks one standard deviation above average in runs scored, then (b) on average, it will rank 0.737 standard deviations above average in winning percentage.
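A quick sketch of that reading in code (simulated data with the .737 built in as an assumption; these are not the actual "Hidden Game" numbers): if you standardize both variables and run the regression, the fitted slope comes out equal to r.

```python
import numpy as np

# Simulated teams with a runs/win% correlation of about .737 baked in
# (an assumption for illustration, not the actual Hidden Game data).
rng = np.random.default_rng(42)
runs = rng.normal(size=10_000)
win_pct = 0.737 * runs + np.sqrt(1 - 0.737**2) * rng.normal(size=10_000)

r = np.corrcoef(runs, win_pct)[0, 1]

# Standardize both variables and fit a one-variable regression.
z_runs = (runs - runs.mean()) / runs.std()
z_win = (win_pct - win_pct.mean()) / win_pct.std()
slope = np.polyfit(z_runs, z_win, 1)[0]

print(round(r, 3), round(slope, 3))  # the standardized slope equals r
```

In a univariate regression, the standardized slope and r are identical, which is exactly the "one SD of runs buys you r SDs of winning percentage" reading.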

The 73.7% is useful information about the value of runs to winning ballgames. But the 54.3% figure doesn’t tell you anything you need to know.

I made this point in my review of "The Wages of Wins," where the authors found that payroll "explains only 18%" of wins. They were using r-squared. The r is the square root of .18, which is about .42. Every SD of increased salary leads to an increase of 0.42 SD in wins. In real life, salary explains 42% of wins – although a statistician would probably never put it that way.

Sometimes, the correlation coefficient is used not to predict anything, but just to give you an idea of the relationship between variables. Everyone knows that +1 is a perfect positive relationship, -1 is a perfect negative relationship, and 0 is no relationship at all. And the higher the absolute value of the number, the stronger the relationship. So an r of 0.1 is a weak relationship, but -0.9 is a very strong relationship.

But a "very strong relationship" depends on the context. Sean Forman reports that the year-to-year correlation of players' batting averages is 0.45. That's pretty high. But if the game-to-game correlation were 0.45, that would be enormous! It would indicate a huge "hot hand" effect. It would mean that if a player were two hits above average one night – say, he went 3-for-4 instead of 1-for-4 – he would be 0.9 hits above average the next night. That would mean that a .250 hitter turns into a .475 hitter after a 3-for-4 game!

Obviously, if you really did the experiment of computing game-to-game correlations, you’d get a very small number. I’m guessing, but, for the sake of argument, let’s say it might be 0.04.
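If you want to convince yourself, here's a rough version of that experiment as a simulation (the .030 talent spread and the 500-AB season are assumptions for illustration, not measured values):

```python
import numpy as np

# A league of hitters whose true batting averages are spread Normal(.250, .030)
# (assumed numbers). Correlate one game's hits with the next game's hits, and
# one season's BA with the next season's BA.
rng = np.random.default_rng(0)
n_players = 100_000
true_ba = np.clip(rng.normal(0.250, 0.030, n_players), 0.05, 0.45)

game1 = rng.binomial(4, true_ba)            # hits in 4 at-bats
game2 = rng.binomial(4, true_ba)
season1 = rng.binomial(500, true_ba) / 500  # BA over a 500-AB season
season2 = rng.binomial(500, true_ba) / 500

r_game = np.corrcoef(game1, game2)[0, 1]
r_season = np.corrcoef(season1, season2)[0, 1]
print(round(r_game, 3), round(r_season, 3))  # game r is tiny, season r is large
```

Same underlying skill both times; only the number of at-bats per observation changes, and the correlation moves from nearly zero to quite large.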

Now, these two correlations are measuring the same ability – hitting for average. But because of context, a 0.45 can be pretty high in the season case, but earth-shattering in the game case. Conversely, 0.04 is meaningful in the game case, but, in the season case, it would show that batting average is barely a repeatable skill at all.

It all depends on context.

I mention this because of a blog entry on the "Wages of Wins" website. There, David J. Berri compares his book’s quarterback ranking to various versions of more sophisticated stats from Football Outsiders. He finds that the correlations are 90%, 92%, and 95% respectively.

And so he writes, "this exercise reveals that there is a great deal of consistency between the Football Outsiders metrics and the metrics we report in The Wages of Wins."

With which I disagree. The interpretation of correlation coefficient depends, again, upon the context. If you were completely ignorant about football statistics, then, yes, a 90% correlation would indicate that you’re measuring roughly the same thing. But given the vast amount of sabermetric knowledge we have about football, 90% could mean the statistics are very different at the margins of knowledge.

For instance, I’d bet that, in baseball, Total Average and Runs Created might correlate on the order of 90%. But, given our knowledge of baseball, we know that Total Average is unsatisfactory in many ways, and the differences are significant at the level of detail that we need for future research. 90% is enough to put Babe Ruth on the top and Mario Mendoza on the bottom. But it’s not good enough to tell the productive base stealers from the unproductive, or give us reliable information about the relative value of hits, or even to distinguish the 55th percentile player from the 45th.

To sum up: in one example, a 0.45 correlation was huge; in another example, a 0.9 correlation was mediocre. If your analysis starts and stops with the correlation coefficient, you really haven’t proven anything at all.

----

More posts on r and/or r-squared:

The regression equation vs. r-squared

Still more on r-squared

Why r-squared doesn't tell you much, revisited

R-squared abuse

"The Wages of Wins" on r and r-squared

Labels: regression, statistics

## 22 Comments:

There was an excellent academic article related to this topic -- using baseball actually -- published in Psychological Bulletin in 1985 (Vol. 97) by Robert Abelson, entitled:

A Variance Explanation Paradox: When a Little is a Lot

Through calculations and simulations, Abelson concludes that, for any single at-bat, the variance in outcome (hit/non-hit) explained by batting skill is only about 1%.

Abelson also notes, however, that "good teams usually win." In other words, even before the season starts, teams whose hitters look good on paper actually tend to do well over the long haul of the season.

Hence the paradox: Batting skill accounts for seemingly microscopic variance in any at-bat, yet teams with hitters identifiable a priori as "good" usually win.

Abelson resolves the conundrum with the idea of cumulativity:

"...a team scores runs by conjunctions of hits, so a team with many high-average batters is more likely to stage rallies than a team with many low-average batters."

var(observed) = var(true) + var(random)

where var=variance

If we look at OBP, var(true) in MLB is around .030^2.

For any single PA, it's either safe or out. That makes our var(random) = .474^2.

var(observed) therefore will be .475^2.

Your regression toward the mean therefore will be over 99%.

That's for one PA. But, there are 80 PAs per game, more or less (hitters and pitchers). The var(random) drops down to .053^2.
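Here's that arithmetic in code, using the comment's numbers (the .340 league OBP behind the single-PA figure is an assumption):

```python
import math

# Variance decomposition from the comment: var(observed) = var(true) + var(random).
var_true = 0.030 ** 2                    # true OBP spread in MLB, per the comment
p = 0.340                                # assumed league OBP for the Bernoulli trial
sd_rand_1pa = math.sqrt(p * (1 - p))     # random SD for a single safe/out PA

def regression_to_mean(n_pa: int) -> float:
    """Share of an observed OBP deviation that is just luck, after n_pa PAs."""
    var_rand = (sd_rand_1pa ** 2) / n_pa
    return var_rand / (var_true + var_rand)

print(round(sd_rand_1pa, 3))                   # about 0.474 for one PA
print(round(regression_to_mean(1), 3))         # over .99: one PA is almost all luck
print(round(sd_rand_1pa / math.sqrt(80), 3))   # about 0.053 after 80 PAs
```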

So, it all depends on the number of "trials". In football, you probably have around 150 possessions? Basketball is what, 200? Hockey likely in the 100+ neighborhood? Tennis, 4 matches x 9 games x 6 to 8 points = 250?

The fewer the trials, and the closer the var(true) is to zero, the more luck plays a role. My guess is that tennis has far fewer upsets simply because the trials are so numerous, and the spread in talent is so much wider.

On the subject of tennis upsets, I think there was a mathematical treatment in "Game, Set, Math" by Ian Stewart. I'll look for it next time I'm at the library.

The conclusion, if I remember correctly, was that the number of trials is so large that the probability that the better player wins is very close to 100%, even if one player is only a bit better than the other.

Another interesting and relevant article is by Dan Ozer, 1985, Psychological Bulletin. Here he shows that there are really two models for understanding the correlation between two variables. The first, the one referred to in this comment as the regression approach, is called the variance decomposition model. Here, the predictor represents some component of a larger outcome criterion. For example, if we were interested in knowing the relationships between gender and voting preference (e.g., republican vs democrat). There are many factors that go into voting preference, and gender is just one. In this case, to understand the amount of overlapping variance, the correlation would be squared. Thus if gender correlated .40 with voting preference, it would be said to explain 16% of the variance. This is the common model presented in stat books.

However, there is another model that is commonly used but not much acknowledged. This is what Ozer refers to as the Common Elements model. Here, the two variables are believed to share a common cause. In this case, the correlation coefficient DOES GIVE THE AMOUNT OF SHARED VARIANCE. We see this model operating most clearly in a test-retest reliability paradigm. Here, the same instrument is given to the same people at two different points in time. Then, scores on the two administrations of the test are correlated. The correlation IS the retest reliability coefficient (as you know, a reliability coefficient is a variance estimate). So if the retest correlation is .80, we say that 80% of the variance is reliable.

Thus, in looking at the baseball example, it needs to be determined whether the predictor is a component of a larger criterion, or if the two variables are themselves reflections of a common, underlying factor. So, one would need to specify the conceptual model that underlies the relationship between the two variables in order to determine whether to square r or not.

I don't see how it can be close to 100%. It's not like we always see players ranked #1 through #16 in every tournament.

In any case, the exact answer can be determined either empirically (we have enough data), or through the process I explained.

Tango,

The "close to 100%" was in theory for a very simplified model ... it may have assumed each player had the same chance of winning any given point. It may have taken service into account; I'm not sure.

Or, actually, it's more likely that I'm wrong about the "a bit better than the other" part. I remember there was one example where one player randomly won 60% of the points, which is a huge difference in ability.

I'll look it up.

Ah, 60% is huge! I guess the simple question is: if a guy who wins 60% of the points faces a guy who wins 40% of the points, how often will the first guy win more than 50% of the points, over 250 trials? I get 99.9%.

If the probability we expected was simply 51% to 49% for any single point, the better guy will win 62% of the time. If, let's say, this were Sampras and Agassi's head-to-head record, it shows you how very close they are, and it's only the setup in tennis that allows Sampras to stand out much more.
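For the record, here's the exact binomial calculation behind those two numbers (with the 250 points treated as independent trials, and a 125-125 tie split evenly):

```python
from math import comb

def majority_prob(p: float, n: int) -> float:
    """P(the p-side wins more of the n points than the opponent), ties split."""
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    win = sum(probs[n // 2 + 1:])
    tie = probs[n // 2] if n % 2 == 0 else 0.0
    return win + 0.5 * tie

print(round(majority_prob(0.60, 250), 3))  # the 60% point-winner almost always wins
print(round(majority_prob(0.51, 250), 2))  # the 51% point-winner wins far less often
```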

Agreed.

It could even be that early in the sport, it was realized that players are very close in ability, and they deliberately chose a long-match format so that you could find out who the better players are.

That is, maybe the "length" of the game is deliberately chosen to produce an aesthetically-pleasing frequency of upsets -- not too many, not too few.

If this is true, then maybe a prediction would be that sports in which there was a large variation in team or player ability (when the rules were being made or changed) would have a "shorter" game, and sports in which the variation was smaller would have a "longer" game.

Or alternatively, maybe the "length" is chosen to make a game 2-3 hours, and the length of the *season* or *tournament* is chosen so that the best team comes out on top.

There does seem to be a correlation between season length and single-game balance.

Baseball has more one-game upsets, which is why it needs a 162-game season. Football can make do with a lot less because the better team is much more likely to win a given game.

For tennis, this is likely the case, with women. The spread in talent in women's tennis is likely far wider than in men's tennis. To ensure that the same women don't always win, you need fewer games per match.

As for baseball, var(true) for a baseball team is about .060 (which can be calculated in many ways).

var(random) reaches .060, when the number of games played is 69. That is, after 69 games, the "r" is .50.

I don't know what the var(true) for a football team is. I'm sure it's quite a bit higher. Just taking a quick stab at it now, let's say var(true) is .150 for football. To get var(random) to be .150, you need 11 games. That is, after 69 baseball games, you'll know about as much about the true talent of teams as you would after 11 NFL games.

Here is one way to figure out the var(true) for any league.

Step 1 - Take a sufficiently large number of teams (preferably all with the same number of games).

Step 2 - Figure out each team's winning percentage.

Step 3 - Figure out the standard deviation of that winning percentage.

I just did it quick, and I took the last few years in the NFL, and the SD is .19, which makes var(observed) = .19^2

Step 4 - Figure out the random standard deviation. That's easy: sqrt(.5*.5/16)

16 is the number of games for each team.

So, var(random) = .125^2

Solve for:

var(obs) = var(true) + var(rand)

var(true), in this case, is .143^2

Knowing that var(true) is .143, to get an "r" of .50, you need var(rand) to also be .143. For that to happen, the number of games played equals 12. That is sqrt(.5*.5/12)= .144
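The four steps, plus the solve for r = .50, as a quick sketch (numbers from the comment: an observed NFL win% SD of about .19 over 16-game seasons):

```python
import math

def true_talent_sd(sd_observed: float, games: int) -> float:
    """Back out the SD of true talent from the observed win% SD."""
    var_rand = 0.5 * 0.5 / games   # binomial luck in win% over `games` games
    return math.sqrt(sd_observed ** 2 - var_rand)

def games_for_r_half(sd_true: float) -> float:
    # r = .50 when var(rand) equals var(true): solve .25 / g = sd_true^2 for g
    return 0.25 / sd_true ** 2

nfl_true = true_talent_sd(0.19, 16)
print(round(nfl_true, 3))                  # about 0.143, matching the comment
print(round(games_for_r_half(nfl_true)))   # about 12 games
```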

In baseball, var(true) is .060.

I haven't figured out what it is in NHL, or NBA, but perhaps someone wants to look at it?

I'll take a quick guess with the NHL. The number of points (adjusted for ties/OT) parallels somewhat the number of wins in baseball.

So, if the observed SD win% in baseball is .072, then it would be around .080 for the NHL, I'll guess.

The random variation for 82 games is .055^2

which makes var(true) = .058

Therefore, in the NHL, 82 games is pretty much the point where r=.50. That is, 12 NFL games, 69 MLB games, and 82 NHL games are all equivalent.

MLB decides to play 162 games instead. The NHL decides to allow 16 teams in the playoffs.

Hmmm... should have checked first. var(obs) in the NHL is .100^2, making var(true) = .083^2. To match var(rand) of .083^2, you need to play 36 games.

So, 12 NFL games, 36 NHL games, and 69 MLB games are equivalent.

In the NFL, with only 16 games, luck plays a huge role. In the NHL and MLB, those game counts are both 43-44% of their respective seasons. There's no "true talent" reason for the NHL to have all those playoff games.

what the heck... NBA var(obs) is around .145^2. var(rand) just like hockey, or .055^2. So var(true) = .134^2

To get an r of .5, you need only 14 NBA games! Sheesh. This is a huge problem here. 14 NBA games tells you as much as 36 NHL games.

On top of which, 16 NBA teams make the playoffs.

NBA games need to be cut down from 48 minutes to something a lot less.
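Replaying the thread's numbers for all four leagues in one place (the observed SDs below are the comments' estimates, not freshly computed):

```python
import math

# (observed SD of win%, games per season), per the comments above
leagues = {
    "NFL": (0.190, 16),
    "NHL": (0.100, 82),
    "NBA": (0.145, 82),
    "MLB": (0.072, 162),
}
for name, (sd_obs, games) in leagues.items():
    var_true = sd_obs**2 - 0.25 / games   # subtract binomial luck
    games_needed = 0.25 / var_true        # games until var(rand) = var(true), i.e. r = .50
    print(name, round(math.sqrt(var_true), 3), round(games_needed))
```

The printed figures land on the thread's corrected equivalences: roughly 12 NFL games, 36 NHL games, 14 NBA games, and 69 MLB games.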

I apologize for the cross-posting here and on my site, but here's another thought.

======================

Which makes me think about the home field advantage. We all know in basketball it's way high. I always figured it was because of travel and fatigue. But, maybe it's something similar here. Let's say that all athletes get a 1% boost by playing at home. In basketball, because of the way the game is laid out (100 possessions per team as opposed to 40 for baseball), then they get to keep piling up on that. That is, if basketball were only played for one quarter (25 possessions each), and you look at the home record, I'm sure it won't be .620. Likely, it'll be something like .530.

Is Roland Beech around?
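Here's a toy version of that "piling up" idea (the 52% per-possession home edge is invented for illustration, and possessions are treated as independent trials won by one side or the other):

```python
from math import comb

def home_win_pct(p: float, possessions: int) -> float:
    """P(home side wins a majority of possessions), ties split evenly."""
    probs = [comb(possessions, k) * p**k * (1 - p)**(possessions - k)
             for k in range(possessions + 1)]
    win = sum(probs[possessions // 2 + 1:])
    tie = probs[possessions // 2] if possessions % 2 == 0 else 0.0
    return win + 0.5 * tie

print(round(home_win_pct(0.52, 200), 2))  # full game: many trials to compound over
print(round(home_win_pct(0.52, 50), 2))   # one quarter: the edge piles up much less
```

The same small per-possession edge produces a noticeably bigger game-level home record when there are more possessions, which is the shape of the argument above.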

I actually commented on the home field advantage, as applied to this topic, in my review of "The Wages of Wins" ... didn't do any calculations, just a quick summary.

In reference to your comment about correlation coefficients for various run estimators, I wrote about this awhile back at:

http://www.hardballtimes.com/main/article/ops-for-the-masses/

Home win percentage in the NBA last year was 61%.

This relationship between r-squared and the coefficient of correlation only holds in a univariate regression setting. And the discussion of "real life" use of statistics sounds like you mean "people who can't be bothered to take the time to learn statistics."

Very interesting article and comments. I have one comment about the use of r versus r^2. I'm always looking for simple explanations of statistics and I think r^2 can actually have a practical and intuitive interpretation in some cases. Suppose, you are looking at the correlation between team winning percentage and runs scored. Here is how I would explain r^2 in this case: The winning percentages vary a lot. Why are they different? One of the reasons is differences in runs scored. How much of the variance in winning percentages is explained by runs scored? About 33%. How much of the variance is explained by runs and runs allowed together? About 83%.

Lee

Hi, Lee,

>"How much of the variance in winning percentages is explained by runs scored? About 33%."

I'd argue that "variance" is not an intuitive real-life concept. Standard deviation is, but not variance. If adult male height has a standard deviation of 4 inches, that makes sense. But if the variance is 16 inches squared, what the heck does *that* mean to the non-statistician?
