Why r-squared doesn't tell you much, revisited
In a blog post I wrote about yesterday, "Wages of Wins" author Stacey Brook ran a regression to try to figure out what kind of relationship there is between an NBA team's payroll and its success on the court.
The regression gives you several pieces of information. Which ones should you use to best explain the relationship?
Brook says it's the r-squared. He writes,
"We use R2 since we are interested in the proportion of variance that is in common between NBA team payroll and NBA team performance."
But is that truly what we're interested in? I don't think so.
I do agree with Brook when he says that R-squared gives you "the proportion of variance that is in common between NBA team payroll and NBA team performance." But what does that mean? Almost nothing, unless you're a statistician.
When you do research like this, there's a question that you want to answer. In this case, if your question is "what proportion of variance is in common between NBA team payroll and NBA team performance?", well, then, there's your answer. But that's not the question. It's not even Brook's real question. His real question is implied by the first paragraph of his post:
"I have to disagree that NBA (or for that matter NHL, MLB or NFL) teams that have high payrolls result in higher winning percentages; nor am I the first to say this."
The question is: do teams with higher payrolls do better on the court? And that question is different from "what proportion of variance is in common between NBA team payroll and NBA team performance?"
If you want to see what payroll does to performance, what you want to see is the regression equation. The way regression works, of course, is to plot all the data points on a graph, then draw the best-fit straight line through those points. That line represents the best-fit relationship between payroll and wins.
If you do that for the 2008-09 NBA teams, you get
Wins = 0.61 (millions of $ spent) - 0.76
This, basically, answers your question, in several ways:
-- every extra million dollars you spend on salaries gives you three-fifths of a win.
-- every extra $1.64 million you spend gives you an extra win.
-- if you spend $100 million, like the Knicks, you should win about 60 games.
-- if you spend only $45 million, like the Grizzlies, you should win only about 27 games.
Not that complicated, right? If you want to know about the direct relationship between salary and wins, the regression equation does it.
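If you'd rather see that arithmetic spelled out, here's a minimal Python sketch. The coefficients are the ones from the equation above; the function names are mine, made up for illustration.

```python
# Coefficients from the regression equation quoted above:
#   Wins = 0.61 * (millions of $ spent) - 0.76
SLOPE = 0.61       # wins per extra $1 million of payroll
INTERCEPT = -0.76  # wins predicted at a payroll of zero


def predicted_wins(payroll_millions):
    """Wins the regression line predicts for a given payroll ($ millions)."""
    return SLOPE * payroll_millions + INTERCEPT


def cost_per_win():
    """Millions of dollars of payroll associated with one extra win."""
    return 1 / SLOPE


print(f"$100 million payroll: about {predicted_wins(100):.0f} wins")
print(f"$45 million payroll: about {predicted_wins(45):.0f} wins")
print(f"Cost of one extra win: ${cost_per_win():.2f} million")
```

Everything in the bullet list comes from just those two coefficients. Notice that the r-squared doesn't appear anywhere.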
Of course, you want to check the statistical significance; it's possible that the slope of the best-fit straight line, the 0.61 wins per million dollars that works out to $1.64 million per win, might not be significantly different from zero. (As it turns out, it IS significant, at the 99.5% level. In fairness to Brook, it appears his data source had incorrect information, and because of that, his results were not, in fact, significant.)
I think we can all agree, from these results, that it certainly does appear that spending leads to winning. When the highest-spending team is expected to go 60-22, and the lowest-spending team is expected to go 27-55, you can't really claim that payroll is irrelevant. (Again, in fairness to Brook, he didn't get results this extreme. With the incorrect data, the regression suggests the highest-spending team should only be 45-37.)
So if the regression equation is the gold standard for making these kinds of calculations, what's with the r-squared? Well, the r-squared answers a different question.
Let's suppose that you had no idea what makes teams win basketball games. You see the Cavs go 66-16, and you see the Clippers go 19-63, and you think, what causes the difference?
What you could do is list as many plausible things as you could think of. Payroll would be one of them. Maybe average days of rest. Maybe whether they're an offensive or defensive team. Maybe average age. Maybe pace of play. Just list them all, as many as you want. Then, run a regression, and look at the r-squared.
What the r-squared will do is tell you, in a certain mathematical sense, what percentage of all the variation in wins you have explained once you account for all those variables. What you're trying to do is get as close to 100% as you can. The closer you get, the more you've explained what makes teams win and what makes teams lose. Maybe, if you actually ran this regression, you'd get to something like 40%. If you adjusted team wins for all those variables, as best you could, your variance would decrease by 40%.
In this particular case, our regression didn't include all that other stuff, like pace of play or average age. We only had one variable, payroll. And it turned out that the r-squared was .256, which means that 25.6% of the variation is "explained" by payroll.
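To make "percentage of the variation explained" concrete, here's a rough sketch of how that number is calculated for a one-variable regression. The five payroll and win figures are hypothetical, invented only so the arithmetic is easy to follow; they are not the real NBA data.

```python
# Hypothetical payrolls ($ millions) and win totals, for illustration only.
payrolls = [50, 60, 70, 80, 90]
wins = [30, 45, 35, 55, 40]

n = len(payrolls)
mean_x = sum(payrolls) / n
mean_y = sum(wins) / n

# Ordinary least squares slope and intercept for a single predictor.
sxx = sum((x - mean_x) ** 2 for x in payrolls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(payrolls, wins))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in wins, and what is left over once the line is fitted.
ss_total = sum((y - mean_y) ** 2 for y in wins)
ss_residual = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(payrolls, wins)
)

# r-squared is the share of the total variation that the line removes.
r_squared = 1 - ss_residual / ss_total

print(f"slope = {slope:.2f} wins per $ million")
print(f"r-squared = {r_squared:.3f}")
```

The slope and the r-squared come out of the same fit, but they're different things: the slope is in wins per million dollars, while the r-squared is a fraction of ss_total, whatever ss_total happens to be.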
It doesn't sound like a lot. In "The Wages of Wins," Brook (and co-authors David Berri and Martin Schmidt) did that for MLB, and came up with only 18%. That doesn't sound like a very big number either, and those authors decide that means that payroll isn't very important.
But that doesn't follow.
The r-squared, the seemingly-low 25.6% number, does NOT tell you about the relationship between payroll and wins. It just tells you that payroll accounts for 25.6% of the total variance, and other factors account for the remaining 74.4%. But if the total variance is large, 25.6% of it can still be substantial.
When you go into the car dealership and ask for a price, you want the amount in dollars. If you ask "how much for that Camry," and the salesman says, "it's 700% of your monthly pay," it may sound like a lot. If he says, "it's 9.5% of your net worth," it may sound cheaper. And if he says, "it's less than 0.01% of Bill Gates' disposable income for the week," it may sound cheaper still. But those all represent the same number of dollars. The fact that one percentage is a large number, and one percentage is a small number, doesn't change that fact.
It's the same thing for r-squared. The size of the percentage number depends on what it's a percentage of -- which happens to be the total variance of wins in the league. Do you know, intuitively, what that variance is? I don't. But I know that a lot of it is random chance. And random variation depends on sample size. You could have exactly the same relationship between salary and wins, but, in one case, the r-squared is .25, and in another case, it's .04, and in another case, it's .5.
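Here's a small simulation sketch of that claim. It is NOT the real NBA data: I'm assuming a made-up league where the true relationship is always the 0.61 wins per million dollars from the equation above, payrolls are spread evenly between $45 and $100 million, and the only thing that changes is how much random luck gets added to each team's wins. I'm also simulating 3,000 team-seasons rather than 30, so the estimates settle down, and I let simulated wins go below 0 or above 82 to keep things simple. All of those choices are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def simulate(noise_sd, n_teams=3000, seed=1):
    """Simulate team-seasons with a fixed true slope and a chosen amount of luck."""
    rng = random.Random(seed)
    payrolls = [rng.uniform(45, 100) for _ in range(n_teams)]
    # Same underlying relationship every time; only the luck term varies.
    wins = [0.61 * p - 0.76 + rng.gauss(0, noise_sd) for p in payrolls]
    return fit_line(payrolls, wins)


for noise_sd in (5, 10, 20):
    slope, _, r_squared = simulate(noise_sd)
    print(f"luck SD {noise_sd:>2} wins: slope = {slope:.2f}, r-squared = {r_squared:.2f}")
```

The fitted slope should land near 0.61 in every run, while the r-squared drops as the luck term grows. The relationship is the same; only the denominator has changed.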
I wrote before about one example of how that can happen. But I can do another right now.
Want to see how you can use the same data to get a larger r-squared? Easy. I'm going to take the actual data for the 30 teams, but group them into threes according to payroll. So instead of the three data points "$100 million, 32 wins" (Knicks), "$90.1 million, 66 wins" (Cavs), and "$86 million, 50 wins" (Mavericks), I'm going to add them all up into the one data point "$276.1 million, 148 wins". Then I'm going to repeat for the other 27 teams, until I have 10 sums of three teams. Then, I'm going to run a regression on those 10 data points.
What happens? The r-squared now goes up to .497 -- almost double what it was!
But while I was able to arbitrarily double the r-squared, the regression line stayed almost the same -- which makes sense, since the actual relationship between salary and wins shouldn't change just because we arranged the data differently. Using all 30 teams, we got 0.61 wins per million dollars. Using the 10 groups of three teams, we get 0.68 wins per million dollars. Pretty close.
Here, let me give you everything in one place:
30 teams.... r-squared = 0.256
10 groups... r-squared = 0.497
30 teams.... Wins = 0.61 ($millions) - 0.76
10 groups... Wins = 0.68 ($millions) - 5.5
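You can try the grouping trick yourself without the real payroll figures. The sketch below uses simulated leagues, not the actual 2008-09 data: I'm assuming 30 teams with payrolls drawn evenly between $45 and $100 million, a true slope of 0.61, and a luck term with a standard deviation of 16 wins (a number I picked so the team-level r-squared lands somewhere near the real one). It sorts each league by payroll, sums the teams in threes, and averages the results over 500 simulated leagues so that one lucky or unlucky draw doesn't dominate.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def simulate_league(rng, n_teams=30, noise_sd=16.0):
    """One simulated league, teams sorted by payroll ($ millions)."""
    payrolls = sorted(rng.uniform(45, 100) for _ in range(n_teams))
    wins = [0.61 * p - 0.76 + rng.gauss(0, noise_sd) for p in payrolls]
    return payrolls, wins


def group_in_threes(payrolls, wins):
    """Sum payroll and wins over consecutive groups of three teams."""
    starts = range(0, len(payrolls), 3)
    grouped_payrolls = [sum(payrolls[i:i + 3]) for i in starts]
    grouped_wins = [sum(wins[i:i + 3]) for i in starts]
    return grouped_payrolls, grouped_wins


def average_results(n_leagues=500, seed=2):
    """Average slope and r-squared, by team and by group, over many leagues."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0, 0.0]
    for _ in range(n_leagues):
        payrolls, wins = simulate_league(rng)
        team_slope, team_r2 = fit_line(payrolls, wins)
        group_slope, group_r2 = fit_line(*group_in_threes(payrolls, wins))
        for i, value in enumerate((team_slope, team_r2, group_slope, group_r2)):
            totals[i] += value
    return [total / n_leagues for total in totals]


team_slope, team_r2, group_slope, group_r2 = average_results()
print(f"30 teams:  slope = {team_slope:.2f}, r-squared = {team_r2:.2f}")
print(f"10 groups: slope = {group_slope:.2f}, r-squared = {group_r2:.2f}")
```

Under those assumptions, the two average slopes should come out close to each other and close to 0.61, while the grouped r-squared should be noticeably higher than the team-level one. Any single simulated league can stray from that pattern, which is why the sketch averages over many of them.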
If Stacey Brook did the analysis his way, using all 30 teams, he'd say "salary explains 25.6% of the variance in wins." If I do the analysis my way, using groups of three teams, I'd say "salary explains 49.7% of the variance in wins." Which one of us would be right? Both of us! Because we are using different denominators, different variances. The same Toyota Camry can be a smaller percentage of Brook's salary than of my salary, because our salaries are different.
And so saying "payroll explains 25.6% of the variance of wins" is like saying "a Camry costs 35% of salary." Whose salary, and how much does he earn? Unless you know that, the "35%" figure is useless.
But, again, despite the fact that Brook and I did our regression differently, the equation should come out very similar. It won't come out exactly the same, because of random fluctuation, but you should *expect* it to come out the same, in the same sense as you expect a coin to come up heads 50% of the time. 0.61 wins per $million and 0.68 wins per $million are pretty close.
The regression equation is meaningful, it requires less information to interpret, and its expected value is the same regardless of your sample size. Most importantly, it answers the exact question that you want answered.
The r-squared, on the other hand, is unintuitive, can be made to come out to almost anything you like by tweaking the sample size to get a different total variance, and requires you to know how the study was done in order to interpret what it means. In terms of answering real-life questions, it's not very useful at all.