Sabermetric Research: Why r-squared doesn't tell you much, revisited

In a blog post I wrote about yesterday, "Wages of Wins" author Stacey Brook ran a regression to try to figure out what kind of relationship there is between an NBA team's payroll and its success on the court.

The regression gives you several pieces of information. Which ones should you use to best explain the relationship?

Brook says it's the r-squared. He writes,

"We use R2 since we are interested in the proportion of variance that is in common between NBA team payroll and NBA team performance."

But is that truly what we're interested in? I don't think so.

I do agree with Brook when he says that R-squared gives you "the proportion of variance that is in common between NBA team payroll and NBA team performance." But what does that mean? Almost nothing, unless you're a statistician.

When you do research like this, there's a question that you want to answer. In this case, if your question is "what proportion of variance is in common between NBA team payroll and NBA team performance?", well, then, there's your answer. But that's not the question. It's not even Brook's real question. His real question is implied by the first paragraph of his post:

"I have to disagree that NBA (or for that matter NHL, MLB or NFL) teams that have high payrolls result in higher winning percentages; nor am I the first to say this."

The question is: do teams with higher payrolls do better on the court? And that question is different from "what proportion of variance is in common between NBA team payroll and NBA team performance?"

If you want to see what payroll does to performance, what you want to see is the regression equation. The way regression works, of course, is to plot all the datapoints on a graph, then draw the best fit straight line among those points. That line represents the best-fit relationship between payroll and wins.
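For anyone who wants to see the mechanics, here is a minimal sketch of how a best-fit line is found for one predictor, using the standard least-squares formulas. The four payroll and win figures are made up for illustration; they are not the 2008-09 NBA data, and `fit_line` is just a name I picked.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up payrolls (in $ millions) and win totals, for illustration only.
payrolls = [50, 60, 70, 80]
wins = [30, 38, 40, 48]

slope, intercept = fit_line(payrolls, wins)
print(f"Wins = {slope:.2f} ($millions) {intercept:+.2f}")
```

The slope is the part that answers "how many wins does a million dollars buy?"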

If you do that for the 2008-09 NBA teams, you get

Wins = 0.61 (millions of $ spent) - 0.76

This, basically, answers your question, in several ways:

-- every extra million dollars you spend on salaries gives you three-fifths of a win.
-- every extra $1.64 million you spend gives you an extra win.
-- if you spend $100 million, like the Knicks, you should win about 60 games.
-- if you spend only $45 million, like the Grizzlies, you should win only about 27 games.

Not that complicated, right? If you want to know about the direct relationship between salary and wins, the regression equation does it.
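The arithmetic behind those four bullets is easy to check. A small sketch, plugging in the slope and intercept from the equation above; the function names are mine, and the payroll figures are the rounded ones quoted in the bullets.

```python
SLOPE = 0.61       # wins per $1 million of payroll, from the equation above
INTERCEPT = -0.76  # from the equation above


def predicted_wins(payroll_millions):
    """Wins the regression line predicts for a given payroll."""
    return SLOPE * payroll_millions + INTERCEPT


def cost_per_win_millions():
    """Extra payroll ($ millions) associated with one extra win."""
    return 1 / SLOPE


for team, payroll in [("Knicks", 100), ("Grizzlies", 45)]:
    print(f"{team}: ${payroll}M -> about {predicted_wins(payroll):.0f} wins")
print(f"One extra win costs about ${cost_per_win_millions():.2f}M")
```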

Of course, you want to check the statistical significance; it's possible that while the best-fit straight line says 0.61 wins per million dollars (that is, $1.64 million per win), that slope might not be significantly different from zero. (As it turns out, it IS significant, at the 99.5% level. In fairness to Brook, it appears his data source had incorrect information, and because of that, his results were not, in fact, significant.)

I think we can all agree, from these results, that it certainly does appear that spending leads to winning. When the highest-spending team is expected to go 60-22, and the lowest-spending team is expected to go 27-55, you can't really claim that payroll is irrelevant. (Again, in fairness to Brook, he didn't get results this extreme. With the incorrect data, the regression suggests the highest-spending team should only be 45-37.)

So if the regression equation is the gold standard for making these kinds of calculations, what's with the r-squared? Well, the r-squared answers a different question.

Let's suppose that you had no idea what makes teams win basketball games. You see the Cavs go 66-16, and you see the Clippers go 19-63, and you think, what causes the difference?

What you could do is list as many plausible things as you could think of. Payroll would be one of them. Maybe average days of rest. Maybe whether they're an offensive or defensive team. Maybe average age. Maybe pace of play. Just list them all, as many as you want. Then, run a regression, and look at the r-squared.

What the r-squared will do is tell you, in a certain mathematical sense, what percentage of all the variation in wins you have explained after correcting for all those variables. What you're trying to do is get as close to 100% as you can. The closer you get, the more you've explained what makes teams win and what makes teams lose. Maybe, if you actually ran this regression, you'd get to something like 40%. That would mean that if you adjusted team wins for all those variables, as best you could, the variance would decrease by 40%.

In this particular case, our regression didn't include all that other stuff, like pace of play or average age. We only had one variable, payroll. And it turned out that the r-squared was .256, which means that 25.6% of the variation is "explained" by payroll.
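To make "percentage of variation explained" concrete, here is a sketch of the usual calculation for the one-variable case: fit the line, then compare the variation left over around the line to the variation you started with. The data are made up for illustration (they will not reproduce the .256 figure), and `r_squared` is a name I chose.

```python
def r_squared(xs, ys):
    """Fit a least-squares line, then return 1 - SS_residual / SS_total."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Variation left over after adjusting wins for payroll.
    ss_residual = sum((y - (slope * x + intercept)) ** 2
                      for x, y in zip(xs, ys))
    # Variation in wins before adjusting for anything.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Made-up payrolls (in $ millions) and win totals, for illustration only.
payrolls = [50, 60, 70, 80]
wins = [35, 30, 50, 41]

r2 = r_squared(payrolls, wins)
print(f"r-squared = {r2:.3f}")
```

Notice that the denominator is the total variation in wins. Nothing in the calculation reports how many wins a dollar buys.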

It doesn't sound like a lot. In "The Wages of Wins," Brook (and co-authors David Berri and Martin Schmidt) did that for MLB, and came up with only 18%. That doesn't sound like a very big number either, and those authors decided that it means payroll isn't very important.

But that doesn't follow.

The r-squared, the seemingly-low 25.6% number, does NOT tell you about the relationship between payroll and wins. It just tells you that payroll accounts for 25.6% of the total variance, and other factors account for 74.4%. But, if the total variance is large, 25.6% of it can still be substantial.

When you go into the car dealership and ask for a price, you want the amount in dollars. If you ask "how much for that Camry," and the salesman says, "it's 700% of your monthly pay," it may sound like a lot. If he says, "it's 9.5% of your net worth," it may sound cheaper. And if he says, "it's less than 0.01% of Bill Gates' disposable income for the week," it may sound cheaper still. But those all represent the same number of dollars. The fact that one percentage is a large number, and one percentage is a small number, doesn't change that fact.

It's the same thing for r-squared. The size of the percentage number depends on what it's a percentage of -- which happens to be the total variance of wins in the league. Do you know, intuitively, what that variance is? I don't. But I know that a lot of it is random chance. And random variation depends on sample size. You could have exactly the same relationship between salary and wins, but, in one case, the r-squared is .25, and in another case, it's .04, and in another case, it's .5.
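Here is a rough sketch of that last claim. The data are synthetic: six invented payrolls, wins generated from the 0.61 / -0.76 line, plus a "luck" pattern that I deliberately built to sum to zero and be uncorrelated with payroll, so the fitted slope cannot move. Only the size of the luck changes from one run to the next.

```python
def fit_and_r2(xs, ys):
    """Least-squares fit for one predictor: (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic payrolls in $ millions (not real teams).
payrolls = [40, 50, 60, 70, 80, 90]
# Assumed "luck" pattern: sums to zero and is uncorrelated with the payrolls
# above by construction, so it adds variance without changing the slope.
luck_pattern = [1, -2, 1, 1, -2, 1]

results = {}
for luck_scale in (2, 5, 10):
    wins = [0.61 * p - 0.76 + luck_scale * e
            for p, e in zip(payrolls, luck_pattern)]
    slope, _, r2 = fit_and_r2(payrolls, wins)
    results[luck_scale] = (slope, r2)
    print(f"luck x{luck_scale:>2}: slope = {slope:.2f}, r-squared = {r2:.2f}")
```

The slope stays at 0.61 wins per million every time, while the r-squared drops as the luck gets bigger. Real data would not be this tidy, since real luck is never exactly uncorrelated with payroll.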

I wrote before about one example of how that can happen. But I can do another right now.

Want to see how you can use the same data to get a larger r-squared? Easy. I'm going to take the actual data for the 30 teams, but group them into threes according to payroll. So instead of the three data points "$100 million, 32 wins" (Knicks), "$90.1 million, 66 wins" (Cavs), and "$86 million, 50 wins" (Mavericks), I'm going to add them all up into the one data point "$276.1 million, 148 wins". Then I'm going to repeat for the other 27 teams, until I have 10 sums of three teams. Then, I'm going to run a regression on those 10 data points.

What happens? The r-squared now goes up to .497 -- almost double what it was!

But while I was able to arbitrarily double the r-squared, the regression line stayed almost the same -- which makes sense, since the actual relationship between salary and wins shouldn't change just because we arranged the data differently. Using all 30 teams, we got 0.61 wins per million dollars. Using the 10 groups of three teams, we get 0.68 wins per million dollars. Pretty close.

Here, let me give you everything in one place:

30 teams.... r-squared = 0.256
10 groups... r-squared = 0.497

30 teams.... Wins = 0.61 ($millions) - 0.76
10 groups... Wins = 0.68 ($millions) - 5.5
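If you want to try the grouping trick yourself, here is a minimal sketch. I don't reproduce the 30 real teams here, so it uses nine synthetic teams: wins come from the 0.61 / -0.76 line plus an invented luck term that I constructed to be uncorrelated with payroll. That construction is an assumption made to keep the example exact; it is why the slope comes out identical both ways here, rather than merely close as with the real data.

```python
def fit_and_r2(xs, ys):
    """Least-squares fit for one predictor: (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Nine synthetic teams, already sorted by payroll ($ millions).
payrolls = [40, 45, 50, 60, 65, 70, 80, 85, 90]
# Invented luck, built to sum to zero and be uncorrelated with payroll.
luck = [8, -10, 8, 2, -16, 2, 8, -10, 8]
wins = [0.61 * p - 0.76 + e for p, e in zip(payrolls, luck)]

# Add up payroll and wins within each consecutive group of three teams.
group_payrolls = [sum(payrolls[i:i + 3]) for i in range(0, len(payrolls), 3)]
group_wins = [sum(wins[i:i + 3]) for i in range(0, len(wins), 3)]

team_slope, team_icpt, team_r2 = fit_and_r2(payrolls, wins)
group_slope, group_icpt, group_r2 = fit_and_r2(group_payrolls, group_wins)

print(f"9 teams....  slope = {team_slope:.2f}, r-squared = {team_r2:.3f}")
print(f"3 groups...  slope = {group_slope:.2f}, r-squared = {group_r2:.3f}")
```

Summing teams cancels out some of the team-level luck, so the total variance shrinks relative to the part that payroll explains. The r-squared rises even though the wins-per-million figure does not change.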

If Stacey Brook did the analysis his way, using all 30 teams, he'd say "salary explains 25.6% of the variance in wins." If I do the analysis my way, using groups of three teams, I'd say "salary explains 49.7% of the variance in wins." Which one of us would be right? Both of us! Because we are using different denominators, different variances. The same Toyota Camry can be a smaller percentage of Brook's salary than of my salary, because our salaries are different.

And so saying "payroll explains 25.6% of the variance of wins" is like saying "a Camry costs 35% of salary." Whose salary, and how much does he earn? Unless you know that, the "35%" figure is useless.

But, again, despite the fact that Brook and I did our regression differently, the equation should come out very similar. It won't come out exactly the same, because of random fluctuation, but you should *expect* it to come out the same, in the same sense as you expect a coin to come up heads 50% of the time. 0.61 wins per $million and 0.68 wins per $million are pretty close.

The regression equation is meaningful, it requires less information to interpret, and its expected value is the same regardless of your sample size. Most importantly, it answers the exact question that you want answered.

The r-squared, on the other hand, is unintuitive, can be made to come out to almost anything you like by tweaking the sample size to get a different total variance, and requires you to know how the study was done in order to interpret what it means. In terms of answering real-life questions, it's not very useful at all.

Labels: basketball, Berri, NBA, payroll, statistics, The Wages of Wins

5 Comments:

At Friday, May 08, 2009 10:37:00 AM, Brian Burke said...: If I understand correctly, one other thing that r-squared won't tell you is covariance. If you run a regression with a single predictor to find the importance of that variable, you won't get the whole story.

In this case, payroll accounts for 25% (or whatever) of the variance in wins. But that 25% contribution is only for payroll in pure abstract isolation from any other factor. In reality, payroll will interact with many other variables--player age, injuries, coaching, strength of schedule, etc. There will be some amount of overlapping covariance of payroll with those other factors that would not be captured by the r-squared in the single-variable regression.

So the total effect of payroll, plus the interactive effect of payroll combined with other factors, is likely to be different than reported.
At Friday, May 08, 2009 11:35:00 AM, Phil Birnbaum said...: Right. Does the NBA have slave players? If it does, then payroll is probably more important than this regression says. In a regression, what you get is the effect of a variable (payroll) *holding all other variables constant*. But there are no other variables in the regression to hold constant.

It could turn out that if you include "number of slave players," payroll becomes even more significant, because, without considering how many free agents are on the team, high payroll could mean either (a) lots of free agents, or (b) good free agents. Once you take out (a) by including "number of slave players" in the regression, you might find the regression equation shows that payroll now matters more than before.

I'm not 100% sure, but I think it's possible (even likely) that the portion of the r-squared attributed to payroll could go *down*, even while the regression equation shows that the number of wins per dollar goes *up*. Can someone confirm this?
At Friday, May 08, 2009 11:59:00 AM, James said...: This paper seems timely - "If the Team Doesn't Win, Nobody Wins:" A Team-Level Analysis of Pay and Performance Relationships in Major League Baseball:

http://www.bepress.com/jqas/vol5/iss2/6/
At Friday, May 08, 2009 5:19:00 PM, Anonymous said...: Uh, well ... duh!

The limitations on the value of r^2 have long been understood, even though r^2 has long been the lingua franca of the social science community. The social science community has long been looked down upon by their cousins in the physical sciences: What passes for acceptable among the latter (.98 <= r^2 <= 1.00) differs greatly from that among the former (.50 <= r^2 <= 1.00). Nevertheless, it is always good when a minute subgroup of the social science community -- SABR -- examines the metric and rightly concludes that, yea verily, r^2 doesn't tell you much. Of course, that's not to say that r^2 is bereft of value, for reading about the shortcomings of r^2 from within said subgroup has long been entertaining. Akin to Moliere's Jourdain's discovery that he had been speaking prose all his life.
At Monday, September 28, 2009 12:07:00 AM, Alex said...: Your viewpoint here is fairly mistaken. If you'd like to see my take, I posted it here: http://xkonk.wordpress.com/2009/09/27/i-apologize-in-advance/


Friday, May 08, 2009
