## Friday, May 08, 2009

### The regression equation versus r-squared

OK, I hope I'm not beating a dead horse here, but here's another way to think of the difference between r-squared and the regression equation.

The r-squared comes from the standpoint of stepping back and looking at the distribution of wins among teams in your dataset. Some teams have over 60 wins, some teams have under 20 wins, and some teams are in the middle. If you look at the standings, and ask yourself, "how important are differences in salary to how we got this way?", then you're asking about r-squared.

The regression equation matters more if you're interested in the future, if you care about how much you can influence wins by increasing payroll. If you ask yourself, "how much do I have to spend to get a few extra wins?", then you want the regression equation.

The r-squared looks at the past, and asks, "was salary important to how we got to this variance in wins?". The regression equation looks to the future, and says, "can we use salary to influence wins?"

It's very possible, and very easy, to have two different answers to these two questions. Here's an example.

Suppose you're trying to see what activities 25-year-olds partake in that affect their life expectancy. You might discover that the average 25-year-old lives to 80, but you want to try to figure out what factors influence that. You run a multiple regression, and you figure out that if the person smokes at 25, it appears to cut five years off his life expectancy. If he eats healthy, it adds four years. If he commits suicide at 25, it cuts off 55 years (since he dies at 25 instead of 80).

Your regression equation would look something like:

life expectancy = 80 - (5 * smoker) + (4 * eats healthy) - (55 * commits suicide).
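The equation is easy to encode directly, as a sanity check (the coefficients are the invented ones from the example above, with each input a 0/1 indicator):

```python
# The hypothetical regression equation from the text, as a function.
# Each input is a 0/1 indicator; the coefficients are made up for the example.
def life_expectancy(smoker, eats_healthy, commits_suicide):
    return 80 - 5 * smoker + 4 * eats_healthy - 55 * commits_suicide

print(life_expectancy(0, 0, 0))  # baseline 25-year-old: 80
print(life_expectancy(1, 1, 0))  # smoker who eats healthy: 79
print(life_expectancy(0, 0, 1))  # suicide at 25: 25
```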

We should all agree that committing suicide has a big effect on life expectancy, right?

Now, let's look at the r-squared. To do that, look at all the 25-year-olds in the sample (which might be several thousand). You'll see a few that live to 25, some that live to 45, a bunch that live to 65, a larger bunch that live to 80, and some that live to 100. The distribution is probably bell-shaped.

For the r-squared, ask yourself: how much did suicide contribute to the curve looking like this? The answer: very little. There are probably very few suicides at 25, and even if you adjusted for those, by taking those points out of the left side of the curve and moving them to the peak, the curve would still look roughly the same. Suicide is not a very big factor in making the curve look like it does.

And so, you get a very low r-squared for suicide. Maybe it would be .01, or even less.
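A quick simulation bears this out. The population below is invented to match the example (100 suicides out of 100,000 people, everyone else living to about 80 with an SD of 15), but the punchline is general: the slope comes out near -55, huge, while the r-squared comes out near .01, tiny.

```python
import random

# A made-up population of 100,000 25-year-olds (numbers invented to match
# the example): 100 die by suicide at 25; everyone else lives to roughly
# 80, with an SD of 15.
random.seed(0)
n, n_suicide = 100_000, 100
suicide = [1.0] * n_suicide + [0.0] * (n - n_suicide)
lifespan = [25.0 if s else random.gauss(80, 15) for s in suicide]

mx = sum(suicide) / n
my = sum(lifespan) / n
cov = sum((x - mx) * (y - my) for x, y in zip(suicide, lifespan)) / n
var_x = sum((x - mx) ** 2 for x in suicide) / n
var_y = sum((y - my) ** 2 for y in lifespan) / n

slope = cov / var_x                      # about -55: a huge effect
r_squared = cov ** 2 / (var_x * var_y)   # about .013: a tiny r-squared
print(round(slope, 1), round(r_squared, 3))
```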

-- suicide has a HUGE effect on lifespan;
-- the r-squared for suicide vs. lifespan is very low.

And, again, that's because:

-- the regression equation tells you what effect the input has on the output;
-- the r-squared tells you how important that input was in creating the distribution you see.

The regression equation tells you that having a piano drop on your head is very dangerous. The low r-squared tells you that pianos haven't historically been a major source of death.

----

Here's a different way to explain this, which might make more sense to gamblers:

Suppose that you had to predict the lifespan of a random 25-year-old. Obviously, the more information you have, the more accurate your estimate will be. And, imagine the amount you lose is the square of the error in your guess. So if you guess 80, and the random person dies at 60, you lose \$400 (the square of 80 minus 60).

Without any information, your best strategy is to guess the average, which we said was 80. Your average loss will be the variance, which is the square of the SD. Suppose that SD is 15. Then, your average loss would be \$225.

Now, how valuable is knowing whether or not the guy committed suicide? It's probably not that valuable. Most of the time, the answer will be "no", and you're only slightly better off than when you started (maybe you guess 80.05 now instead of 80). A tiny, tiny proportion of the time, the answer will be "yes," and you can safely guess 25 and be right on. On balance, you're a little better off, but not much.

On average, how much less will you lose given the extra information? The answer is given by the r-squared. If the r-squared of the suicide vs. lifespan regression is .01, as estimated above, then your loss will be reduced by 1%. Instead of losing \$225, on average, you'll lose only about \$222.75.
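The arithmetic, using the rough .01 r-squared for suicide estimated above:

```python
sd = 15
variance = sd ** 2     # average loss when you just guess the mean: $225
r_squared = 0.01       # rough r-squared for suicide vs. lifespan, from above

# Knowing the input cuts your average squared-error loss by the r-squared.
avg_loss = variance * (1 - r_squared)
print(avg_loss)  # 222.75
```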

Again: the r-squared doesn't tell you that suicide is dangerous. It just tells you that, because of *some combination of dangerousness of suicide and historical frequency of suicide*, you can shave 1% off your error by taking it into account.

----

Now, let's reapply this to basketball. The r-squared for salary vs. wins was .2561. The SD of wins was 14.1, so the variance was the square of that, or 199.

If you took a bet where you had to guess a random team's wins, and had to pay the square of the difference, you'd pick "41" and, on average, owe \$199. But let's suppose someone tells you the team's payroll. Now, you can adjust your guess, to predict higher if the team has a high payroll, or lower if the team has a low payroll. If you adjust your guess optimally -- by using the results of the regression equation -- you'll cut your average loss by 25.61%. So, on average, you'd lose only 74.39% as much as before. That works out to \$148.11.
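The same loss-reduction arithmetic, with the basketball numbers (small rounding differences aside, this lands within pennies of the figures quoted):

```python
sd_wins = 14.1
variance = round(sd_wins ** 2)   # 199, as in the text
r_squared = 0.2561               # salary vs. wins

# Using salary optimally cuts the average squared-error loss by 25.61%.
avg_loss = variance * (1 - r_squared)
print(round(avg_loss, 2))        # about $148
```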

What Berri, Brook and Schmidt are saying, in "The Wages of Wins," is, "look, if you can only cut your losses by 25.61% by knowing salary, then money can't be that important in buying wins." But that's wrong. What they should conclude is that "how important money is, combined with how often it's been used to buy wins," isn't that important.

And, really, if you look at the full results of the regression, it turns out that money IS important in buying wins, but that not too many teams took advantage of that fact in 2008-09.

The equation shows that every \$1.6 million in additional salary will buy you a win -- so if you want to go 61-21, it should only cost you \$32 million more than the league-average payroll of \$68.5 million.
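Spelled out, using the regression's \$1.6 million per win and the league averages quoted above:

```python
cost_per_win = 1.6     # $ millions of salary per extra win, per the regression
avg_wins = 41          # league-average wins in an 82-game season
avg_payroll = 68.5     # league-average payroll, $ millions

target_wins = 61       # a 61-21 season
extra_payroll = (target_wins - avg_wins) * cost_per_win
print(extra_payroll, avg_payroll + extra_payroll)  # 32.0 100.5
```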

That's pretty important, and so the low r-squared must mean that not a lot of teams varied much in salary. If you look at the salary chart, there's a huge group bunched near the average: there are 18 teams between \$62mm and \$75mm, within \$6.5 million of the average. Those teams are so close together that there's not much difference in their expected wins.

If you have to bet, and the random team you pick turns out to be the lowest-spending in the league, you'll reduce your estimate. You would have lost a lot of money guessing 41, so the information that you picked a low-spending team will cut your losses a lot. If it turns out to be one of the highest-spending in the league, same thing. But if it turns out to be one of the 18 teams in the middle, the salary information won't help you much. And that's why the r-squared is only about 25% -- for many of the teams in the sample, knowing the salary doesn't help you cut your losses much.

What if we take out those 18 teams, and regress only on the remaining 12? Well, the regression equation stays almost the same -- \$1.5 million per win instead of \$1.6. But the r-squared increases to .4586. Why does the r-squared increase? Because salary is much more significant a factor for those 12 teams than for the ones in the middle. Before, knowing the salary might not do you much good for your estimate if it's one of the teams bunched in the middle. But, now, those teams are gone. Your random team is much more likely to be the Cavaliers or the Clippers, so knowing the salary is a much bigger help, and it lets you cut your betting losses by almost half.
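The mechanism is visible in the standard variance decomposition: for y = a + (slope × x) + noise, the r-squared equals slope² × Var(x) divided by [slope² × Var(x) + Var(noise)]. Dropping the bunched-up middle teams raises the variance of salary, so the r-squared rises even though the slope barely moves. A sketch with invented variances, chosen only to roughly reproduce the two r-squareds in the post:

```python
def r_squared(slope, var_x, var_noise):
    # For y = a + slope*x + noise: r-squared = explained / total variance.
    explained = slope ** 2 * var_x
    return explained / (explained + var_noise)

slope = 1 / 1.6   # wins per $ million of salary, from the regression
var_noise = 150   # hypothetical variance of wins not explained by salary

r_all = r_squared(slope, var_x=130, var_noise=var_noise)       # bunched salaries
r_extremes = r_squared(slope, var_x=330, var_noise=var_noise)  # spread-out salaries
print(round(r_all, 2), round(r_extremes, 2))  # 0.25 0.46
```

Same slope both times; only the spread in salaries changes, and the r-squared nearly doubles.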

----

One last summary:

1. The regression equation tells you how powerful the input is in affecting output -- is it a nuclear weapon, or a pea-shooter?

2. The r-squared tells you how powerful the input is, "multiplied by" how extensively the input was historically used. That is: a nuclear weapon used once might give you the same r-squared as a pea-shooter used a billion times.

So a low r-squared might mean

-- an input that doesn't have much effect on the output (e.g., shoe size probably doesn't affect lifespan much);

-- an input that has a big effect on output but doesn't happen much (e.g., suicide curtails 100% of lifespan but happens rarely); or

-- an input that doesn't affect output and also doesn't happen much (e.g., fluorescent purple shoes' effect on lifespan).

In the case of the 2008-09 NBA, the regression equation shows that salary is a fairly powerful bomb. And the moderate r-squared shows that not every team uses it to its full potential.

Bottom line: salary can indeed very effectively buy wins. The r-squared is as small as it is because, in 2008-09, NBA teams varied only moderately in their spending.

At Friday, May 08, 2009 7:27:00 PM,  Ryan J. Parker said...

Awesome post. Not much more to say. :)

At Saturday, May 09, 2009 4:04:00 PM,  Cyril Morong said...

Does Brook speculate as to why salary would have no effect on winning? A couple of possibilities are that GMs are not judging talent very well, so there are many bad players getting paid a lot and many good players not getting paid much. Or, the GMs do a good job of judging talent, but what happens on the basketball floor is so random that talent matters very little. Maybe in that case we should see a very different mix of teams in the playoffs every year.

At Wednesday, May 13, 2009 4:31:00 PM,  Anonymous said...

In this same vein, pp. 211-212 of Making Social Sciences More Scientific by Rein Taagepera (Oxford Univ. Press, 2008) notes the following: "It may be argued that one might want to see if a set of variables 'explains' an outcome better than another set, by comparing their R^2, without aiming at prediction. But what does an 'explanation' stand for, if devoid of ability to predict? ... If all authors obtained roughly the same intercept and slope in V = a + bN, we would have a solid empirical regularity even if all of them found low values of R^2. The numerical values of a and b would still look reproducible. On the other hand, if they all got high R^2 but for wildly different regression lines, then we would have nothing, unless we could introduce some other factor to explain the discrepancy. In such a case, high values of R^2 would actually enhance the confusion .... Correlation coefficients are pointless in the absence of equations [emphasis in the original]. It should be realized that even the worst-quality data fit equation is worth more than an excellent value of R^2, reported devoid of the substantive equation to which it applies. If a physicist reported R^2 alone, he would meet blank stares: You say you have good quality results, but what are the results? The practice of reporting R^2 without the equation to which it refers has fortunately become rare. The cult of R^2 carried to that point would pull any journal down to the level of pseudo-scientific formalism." The Taagepera assessment complements the comparison/contrast between R^2 and the equation itself as made by Birnbaum.

At Wednesday, May 13, 2009 5:03:00 PM,  Phil Birnbaum said...

Anonymous: very cool ... thanks for posting that!

At Thursday, May 14, 2009 4:47:00 PM,  Cliff Blau said...

So, you are saying that if the Knicks had paid their players more this season, they would have won more games? Perhaps they would have played harder due to their gratitude towards their benevolent bosses?

No, there is some relationship between payroll and wins because payroll is a proxy for the ability of the players. This shows for the umpteenth time that correlation does not prove causation. We don't want to know what the relationship between payroll and wins is, we want to know how we can predict wins. In this case, we'd be better off knowing the playing ability of the team than its payroll. If we have a direct way to measure that, we will no doubt find a much stronger correlation with wins.

Looking at the regression equation without considering the correlation statistic will often give us misleading results, since normally even two completely unrelated numbers will have some small correlation just due to random chance.

We need to look at both the r-squared and the equation, as well as use some common sense.

At Thursday, May 14, 2009 5:26:00 PM,  Phil Birnbaum said...

Cliff,

1. Yes, of course the increase in wins comes from the salary buying talent. I assumed that was understood.

2. Yes, you do have to look at the significance level to make sure you're not just seeing random noise. I mentioned that in a previous post, and assumed here that it was understood.

At Monday, May 18, 2009 8:36:00 PM,  Robert Quinn said...

Hey Phil this is not Sabermetric related or related to the Post but I wanted to run this by you.

Why don't National League away teams put their pitcher from the night before in the 9th spot to start games? It sounds strange, but there is no downside to a manager doing this.

In the top half of the first inning, if the team bats around, the manager has three options: 1) use a pinch hitter in that spot; 2) use the pitcher from the day before, who is in the lineup; or 3) sub in the pitcher who was scheduled to pitch that day. There is no downside to this move, because if you take the 3rd path, you are in the exact same position you would have been in in the first place.

Of course, if the team didn't bat around, you would just sub in the day's starter.

Since the previous night's starting pitcher will never throw the next day (I imagine a team would exhaust position players before resorting to this option), there is no downside to taking him out of the game.

I am assuming that there must be a rule against this, because I doubt that no one in the history of baseball has come across this before; if it were legal, every National League away team would do it.

At Monday, May 18, 2009 8:59:00 PM,  Phil Birnbaum said...

Doesn't the announced starting pitcher have to pitch to at least one batter? That would make the strategy not very workable ...

At Tuesday, May 19, 2009 9:17:00 PM,  Cliff Blau said...

It's rule 3.05(a)