The regression equation versus r-squared
OK, I hope I'm not beating a dead horse here, but here's another way to think of the difference between r-squared and the regression equation.
The r-squared comes from the standpoint of stepping back and looking at the distribution of wins among teams in your dataset. Some teams have over 60 wins, some teams have under 20 wins, and some teams are in the middle. If you look at the standings, and ask yourself, "how important are differences in salary to how we got this way?", then you're asking about r-squared.
The regression equation matters more if you're interested in the future, if you care about how much you can influence wins by increasing payroll. If you ask yourself, "how much do I have to spend to get a few extra wins?", then you want the regression equation.
The r-squared looks at the past, and asks, "was salary important to how we got to this variance in wins?" The regression equation looks to the future, and asks, "can we use salary to influence wins?"
It's very possible, and very easy, to have two different answers to these two questions. Here's an example.
Suppose you're trying to see what activities 25-year-olds partake in that affect their life expectancy. You might discover that the average 25-year-old lives to 80, but you want to try to figure out what factors influence that. You run a multiple regression, and you figure out that if the person smokes at 25, it appears to cut five years off his life expectancy. If he eats healthy, it adds four years. If he commits suicide at 25, it cuts off 55 years (since he dies at 25 instead of 80).
Your regression equation would look something like:
life expectancy = 80 - (5 * smoker) + (4 * eats healthy) - (55 * commits suicide).
We should all agree that committing suicide has a big effect on life expectancy, right?
Now, let's look at the r-squared. To do that, look at all the 25-year-olds in the sample (which might be several thousand). You'll see a few that live to 25, some that live to 45, a bunch that live to 65, a larger bunch that live to 80, and some that live to 100. The distribution is probably bell-shaped.
For the r-squared, ask yourself: how much did suicide contribute to the curve looking like this? The answer: very little. There are probably very few suicides at 25, and even if you adjusted for those, by taking those points out of the left side of the curve and moving them to the peak, the curve would still look roughly the same. Suicide is not a very big factor in making the curve look like it does.
And so, you get a very low r-squared for suicide. Maybe it would be .01, or even less.
See the apparent contradiction?
-- suicide has a HUGE effect on lifespan.
-- r-squared for suicide vs. lifespan is very low.
And, again, that's because:
-- the regression equation tells you what effect the input has on the output;
-- the r-squared tells you how important that input was in creating the distribution you see.
The regression equations tell you that having a piano drop on your head is very dangerous. The low r-squared tells you that pianos haven't historically been a major source of death.
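If you'd like to see that contradiction in actual numbers, here's a minimal sketch in Python. Everything in it is made up for illustration: the 1-in-2,000 suicide rate, the bell curve (mean 80, SD 15) for everyone else, and the helper name `simple_regression`. It's also a one-input regression rather than the multiple regression described above, so take it as a toy, not a reconstruction.

```python
import random

def simple_regression(x, y):
    """Return (slope, intercept, r_squared) for a one-input least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

random.seed(0)
N = 100_000            # hypothetical sample of 25-year-olds
SUICIDE_EVERY = 2_000  # invented rate: exactly 1 person in 2,000

# 1 = commits suicide at 25, 0 = doesn't
suicide = [1 if i % SUICIDE_EVERY == 0 else 0 for i in range(N)]
# Suicides die at 25; everyone else is drawn from an assumed bell curve
lifespan = [25.0 if s else random.gauss(80, 15) for s in suicide]

slope, intercept, r2 = simple_regression(suicide, lifespan)
print(f"coefficient on suicide: {slope:.1f} years")
print(f"r-squared: {r2:.4f}")
```

Under these assumptions, the coefficient should come out close to -55 years, while the r-squared lands below .01: a huge effect, and almost no explanatory power over the distribution.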
Here's a different way to explain this, which might make more sense to gamblers:
Suppose that you had to predict the lifespan of a random 25-year-old. Obviously, the more information you have, the more accurate your estimate will be. And, imagine the amount you lose is the square of the error in your guess. So if you guess 80, and the random person dies at 60, you're off by 20 years and you lose $400 (20 squared).
Without any information, your best strategy is to guess the average, which we said was 80. Your average loss will be the variance, which is the square of the SD. Suppose that SD is 15. Then, your average loss would be $225.
Now, how valuable is it to know whether or not the guy committed suicide? It's probably not that valuable. Most of the time, the answer will be "no", and you're only slightly better off than when you started (maybe you guess 80.05 now instead of 80). A tiny, tiny proportion of the time, the answer will be "yes," and you can safely guess 25 and be right on. On balance, you're a little better off, but not much.
On average, how much less will you lose given the extra information? The answer is given by the r-squared. If the r-squared of the suicide vs. lifespan regression is .01, as estimated above, then your loss will be reduced by 1%. Instead of losing $225, on average, you'll lose only about $222.75.
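Here's a rough Python check of the claim that the cut in your average loss equals the r-squared. The population is invented (2,000 people, one suicide, everyone else dying at 65, 80 or 95), so its SD isn't the 15 used above, and the function names are my own. The only point is that the two numbers match.

```python
def mean(values):
    return sum(values) / len(values)

def avg_squared_loss(guesses, actual):
    """Average of (guess - actual)^2, i.e. the average dollars lost per bet."""
    return mean([(g - a) ** 2 for g, a in zip(guesses, actual)])

# Invented population: 1 suicide at 25, plus 1,999 others averaging 80
others = [65.0, 80.0, 95.0] * 666 + [80.0]
lifespans = [25.0] + others
is_suicide = [1] + [0] * len(others)

# No information: always guess the overall average
overall_avg = mean(lifespans)
loss_blind = avg_squared_loss([overall_avg] * len(lifespans), lifespans)

# Knowing the answer: guess 25 for "yes", the non-suicide average for "no".
# For a single yes/no input, these are the regression's own predictions.
avg_no = mean(others)
guesses = [25.0 if s else avg_no for s in is_suicide]
loss_informed = avg_squared_loss(guesses, lifespans)

reduction = 1 - loss_informed / loss_blind

# r-squared of the suicide vs. lifespan regression (squared correlation)
mean_x = mean(is_suicide)
sxx = sum((x - mean_x) ** 2 for x in is_suicide)
syy = sum((y - overall_avg) ** 2 for y in lifespans)
sxy = sum((x - mean_x) * (y - overall_avg) for x, y in zip(is_suicide, lifespans))
r_squared = sxy ** 2 / (sxx * syy)

print(f"loss without info: ${loss_blind:.2f}")
print(f"loss with info:    ${loss_informed:.2f}")
print(f"reduction in loss: {reduction:.4f}")
print(f"r-squared:         {r_squared:.4f}")
```

In this made-up population both figures come out at about .01, which is the sense in which the r-squared "is" the fraction of your betting loss you save.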
Again: the r-squared doesn't tell you that suicide is dangerous. It just tells you that, because of *some combination of dangerousness of suicide and historical frequency of suicide*, you can shave 1% off your error by taking it into account.
Now, let's reapply this to basketball. The r-squared for salary vs. wins was .2561. The SD of wins was 14.1, so the variance was the square of that, or 199.
If you took a bet where you had to guess a random team's wins, and had to pay the square of the difference, you'd pick "41" and, on average, owe $199. But let's suppose someone tells you the team's payroll. Now, you can adjust your guess, to predict higher if the team has a high payroll, or lower if the team has a low payroll. If you adjust your guess optimally -- by using the results of the regression equation -- you'll cut your average loss by 25.61%. So, on average, you'd lose only 74.39% as much as before. That works out to about $148.
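The arithmetic, as a short Python sketch using only the rounded figures quoted above (SD of 14.1, r-squared of .2561). The variable names are mine, and the last digits of the answer depend on how much the SD was rounded, so treat the result as "roughly $148."

```python
import math

sd_wins = 14.1        # SD of team wins, as quoted in the text
r_squared = 0.2561    # salary vs. wins

variance = sd_wins ** 2                       # average loss when guessing 41
loss_with_salary = variance * (1 - r_squared) # average loss after using payroll
remaining_sd = math.sqrt(loss_with_salary)    # typical miss, in wins

print(f"average loss, no information: ${variance:.2f}")
print(f"average loss, knowing salary: ${loss_with_salary:.2f}")
print(f"typical miss in wins: {sd_wins:.1f} before, {remaining_sd:.1f} after")
```

Put in terms of wins rather than dollars, knowing the payroll shrinks your typical miss from about 14 wins to about 12.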
What Berri, Brook and Schmidt are saying, in "The Wages of Wins," is, "look, if you can only cut your losses by 25.61% by knowing salary, then money can't be that important in buying wins." But that's wrong. What they should conclude is that "how important money is, combined with how often it's been used to buy wins," isn't that important.
And, really, if you look at the full results of the regression, it turns out that money IS important in buying wins, but that not too many teams took advantage of that fact in 2008-09.
The equation shows that every $1.6 million in additional salary will buy you a win -- so if you want to go 61-21, it should only cost you $32 million more than the league-average payroll of $68.5 million.
That's pretty important, and so the low r-squared must mean that not a lot of teams varied much in salary. If you look at the salary chart, there's a huge group bunched near the average: there are 18 teams between $62 million and $75 million, within $6.5 million of the average. Those teams are so close together that there's not much difference in their expected wins.
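That calculation, sketched in Python. It assumes the relationship stays a straight line all the way out to 61 wins, and that a team at the league-average payroll is expected to go 41-41; the function name `payroll_for_wins` is just for illustration.

```python
LEAGUE_AVG_PAYROLL = 68.5  # $ millions, from the text
COST_PER_WIN = 1.6         # $ millions per extra win, from the regression
AVERAGE_WINS = 41          # a .500 record in an 82-game season

def payroll_for_wins(target_wins):
    """Payroll ($ millions) the regression line implies for a given win total."""
    extra_wins = target_wins - AVERAGE_WINS
    return LEAGUE_AVG_PAYROLL + COST_PER_WIN * extra_wins

total = payroll_for_wins(61)
extra = total - LEAGUE_AVG_PAYROLL
print(f"extra spending for 61-21: ${extra:.1f} million")
print(f"implied total payroll:    ${total:.1f} million")
```

The extra 20 wins cost $32 million, for an implied total payroll of about $100.5 million.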
If you have to bet, and the random team you pick turns out to be the lowest-spending in the league, you'll reduce your estimate. You would have lost a lot of money guessing 41, so the information that you picked a low-spending team will cut your losses a lot. If it turns out to be one of the highest-spending in the league, same thing. But if it turns out to be one of the 18 teams in the middle, the salary information won't help you much. And that's why the r-squared is only about 25% -- for many of the teams in the sample, knowing the salary doesn't help you cut your losses much.
What if we take out those 18 teams, and regress only on the remaining 12? Well, the regression equation stays almost the same -- $1.5 million per win instead of $1.6 million. But the r-squared increases to .4586. Why does the r-squared increase? Because salary is a much more significant factor for those 12 teams than for the ones in the middle. Before, knowing the salary might not do you much good for your estimate if it's one of the teams bunched in the middle. But, now, those teams are gone. Your random team is much more likely to be the Cavaliers or the Clippers, so knowing the salary is a much bigger help, and it lets you cut your betting losses by almost half.
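You can see the same pattern in a simulation. This is a sketch on synthetic data, not the actual 2008-09 payrolls: I've invented a huge "league" of 30,000 team-seasons so the results are stable, kept 60% of them (18 out of every 30) within $6.5 million of the average, spread the rest out by up to $25 million, and picked a noise SD of 12 wins. The true relationship is fixed at $1.6 million per win throughout.

```python
import random

def fit(x, y):
    """Least-squares slope and r-squared for a one-input regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

random.seed(1)
AVG_PAYROLL = 68.5          # $ millions, from the text
WINS_PER_MILLION = 1 / 1.6  # i.e. $1.6 million per win
NOISE_SD = 12.0             # invented: luck, injuries, everything but salary

payroll, wins = [], []
for i in range(30_000):
    if i % 5 < 3:
        # 60% of teams bunched within $6.5M of the average
        p = AVG_PAYROLL + random.uniform(-6.5, 6.5)
    else:
        # the rest are big or small spenders (spread is an assumption)
        p = AVG_PAYROLL + random.choice([-1, 1]) * random.uniform(6.5, 25.0)
    payroll.append(p)
    wins.append(41 + WINS_PER_MILLION * (p - AVG_PAYROLL)
                + random.gauss(0, NOISE_SD))

slope_all, r2_all = fit(payroll, wins)

# Drop the bunched-up middle and regress on the spread-out teams only
outer = [(p, w) for p, w in zip(payroll, wins) if abs(p - AVG_PAYROLL) > 6.5]
slope_outer, r2_outer = fit([p for p, _ in outer], [w for _, w in outer])

print(f"all teams:   ${1 / slope_all:.2f}M per win, r-squared {r2_all:.3f}")
print(f"outer teams: ${1 / slope_outer:.2f}M per win, r-squared {r2_outer:.3f}")
```

The cost per win should barely move between the two fits, while the r-squared comes out noticeably higher once the middle teams are removed. Nothing about the power of salary changed; only the mix of teams did.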
One last summary:
1. The regression equation tells you how powerful the input is in affecting output -- is it a nuclear weapon, or a pea-shooter?
2. The r-squared tells you how powerful the input is, "multiplied by" how extensively the input was historically used. That is: a nuclear weapon used once might give you the same r-squared as a pea-shooter used a billion times.
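For a regression with a single input, the "multiplied by" can be written down exactly. The notation here is mine, not from the regression output: b is the coefficient from the regression equation, x is the input, y is the output, and p is how often a yes/no input happens.

```latex
r^2 = \frac{b^2 \,\operatorname{Var}(x)}{\operatorname{Var}(y)},
\qquad\text{and for a yes/no input that occurs with frequency } p:\qquad
\operatorname{Var}(x) = p\,(1 - p)
```

So a big b (the nuclear weapon) with a tiny Var(x) can give the same r-squared as a small b (the pea-shooter) with a large Var(x). As a purely hypothetical example, take b = -55 for suicide, p = 0.001, and Var(y) = 225: that gives r-squared of about 3025 x 0.001 / 225, or roughly .013.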
So a low r-squared might mean:
-- an input that doesn't have much effect on the output (e.g., shoe size probably doesn't affect lifespan much);
-- an input that has a big effect on output but doesn't happen much (e.g., suicide curtails 100% of lifespan but happens rarely); or
-- an input that doesn't affect output and also doesn't happen much (e.g., fluorescent purple shoes' effect on lifespan).
In the case of the 2008-09 NBA, the regression equation shows that salary is a fairly powerful bomb. And the moderate r-squared shows that not every team uses it to its full potential.
Bottom line: salary can indeed very effectively buy wins. The r-squared is as small as it is because, in 2008-09, NBA teams differed only moderately in how much they spent.