Saturday, April 14, 2007

Can payroll buy wins? "Percentage of variance explained" doesn't tell you

I've always been uncomfortable with studies that express their results in "percentage of variance explained." For instance, in "The Wages of Wins," the authors run a regression of wins on salaries and get an R-squared of .18, which means that "payroll explains about 18% of [the variance of] wins." Since 18 percent is a small number, they argue that the relationship is not strong, concluding that "you can't buy the fan's love" by spending money to get wins.

I don't think that's correct – and for reasons that have to do with statistics, not baseball. The percentage of variance explained does NOT necessarily tell you anything about the strength of the relationship between the variables.

(Technical note: I'm going to use "percentage of variance explained" most of the time, but you can probably just substitute "R-squared" if you prefer. They're the same thing.)

-----

First, the percentage figure doesn't tell you, outright, the importance of payroll – it tells you the importance of payroll *relative to the combined importance of all factors*. If the other factors' importance goes up, the percentage goes down.

One of those other factors is luck. The more games you have, the more the luck evens out. So if you were to analyze five seasons instead of one, the luck would shrink and the payroll-to-luck ratio would increase. Then, instead of payroll explaining only 18% of the variance of wins, maybe it would explain 30%, or even 40%.

Going the other way, if you ran a regression on a single day's worth of games, there's a lot more luck. Over an entire season, there's no way the Yankees are going to be worse than the Brewers – but on a single night, it's quite possible that New York might lose and Milwaukee might win. So, on a single game, payroll might explain only, say, 3% of the variance.

The relationship between salary and expected wins is a constant: if paying $200 million buys you an (expected) .625 winning percentage, it should buy you that .625 over a day, a week, a month, or a year. But depending on how you do the study, you can get a "percentage of variance explained" of 3%, or 18%, or 30%, or 40%. So, obviously, the number you *do* come up with can't, by itself, tell you anything about the strength of the relationship between payroll and wins.
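Here's a toy simulation of that point. Every number in it is invented – the payroll spread, a one-win-per-$5-million talent rule, some non-payroll talent – but the structure is the argument above: the payroll effect is held constant while the length of the "study" changes.

import numpy as np

rng = np.random.default_rng(0)
payroll = rng.uniform(50, 150, 30)           # team payrolls in $millions, invented
skill = rng.normal(0, 9, 30)                 # non-payroll talent, also invented
p_win = (65 + payroll / 5 + skill) / 162     # true talent: exactly one win per $5M

for games in (1, 162, 810):                  # a day, a season, five seasons
    wins = rng.binomial(games, p_win) * 162 / games   # scaled to a 162-game pace
    r = np.corrcoef(payroll, wins)[0, 1]
    print(f"{games:3d} games: R-squared = {r * r:.2f}")

The built-in payroll effect never changes, but the printed R-squared climbs from roughly nothing for a single day to something much larger over five seasons, because the luck variance shrinks as the games pile up.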

-----

We know that smoking causes heart disease, and that the relationship is pretty strong.

Now: suppose you do a regression. How much of the variance in heart disease can be explained by lifetime smoking?

There's no way to tell. It depends on the distribution of smokers.

Suppose, out of the entire world population, only one person smokes. His risk of heart disease increases substantially. But how much of the variance in heart disease is explained by smoking? Close to zero percent. Why? Because even in the absence of smoking, there's substantial variance in the population of six billion people. There's age, there's diet, there's exercise, and there's genetic predisposition, among many other things. The variance contributed by the one smoker, compared to the variance caused by six billion different sets of other causes, is effectively zero.

Now, suppose that half the world smokes, and half doesn't. Now, there is lots more variance in heart disease rates. Instead of just variance caused by genetics and eating habits, you now have, on top of that, 50% of the population varying hugely from the other half in this one area. If smoking is extremely risky compared to eating and genetics, you might find that (say) 40% of the variance is explained by smoking. If it's only moderately more risky, you might get a figure of 15% or something.

What if everyone in the world smokes except one person? In that case, everyone's risk has risen by the same amount, except for that one non-smoker. So the variance caused by smoking is, again, very low. It's the same situation when only one person *doesn't* smoke as when only one person *does* smoke. And so, again, about zero percent of the variance is explained by smoking.

So, in theory, how much of the variance in heart disease should be explained by smoking? It could be almost any number. It depends on the variance of smoking behavior just as much as it depends on the effects of smoking.
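You can watch that happen in a toy simulation. Everything below is invented; the one thing held fixed across the three runs is how dangerous smoking is.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
effect, other_sd = 5.0, 5.0     # smoking's fixed risk bump vs. the spread of all other causes

for prevalence in (0.00001, 0.5, 0.99999):
    smoker = (rng.random(n) < prevalence).astype(float)
    risk = effect * smoker + rng.normal(0, other_sd, n)    # same risk model every time
    r = np.corrcoef(smoker, risk)[0, 1]
    print(f"fraction smoking {prevalence:8.5f}: R-squared = {r * r:.3f}")

The 50/50 case explains something like 20% of the variance; the two lopsided cases explain essentially none – even though smoking is exactly as dangerous in all three.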

If you did an actual study, and you found that 18% of the variance in heart disease was explained by cigarette use, what would that tell you about the riskiness of smoking?

Almost nothing! It could be that (a) there is lower variance in how much people smoke, but the risk of smoking is higher; or (b) there is higher variance in how much people smoke, but the risk is lower. Either of those possibilities is consistent with the 18% figure.

Going back to the baseball example: if 18% of the variance is explained by salaries, it could be that (a) teams vary little in how much they spend, but money buys wins fairly reliably, or (b) teams vary a lot in how much they spend, but money buys wins fairly weakly.

Which is correct? We can't tell yet. The 18% number, on its own, simply does not tell you anything about whether money can buy wins.

-----

So if the percentage of variance explained doesn't tell you a whole lot about the relationship between the variables, then what does? Answer: the regression equation.

"The Wages of Wins" doesn't give full regression results for their payroll study, but they do mention the figure of about $5 million dollars per win. So the computer output from their regression probably looks something like

R-squared = .18
Expected Wins = 65 + (Payroll divided by 5 million)


The .18 tells us little, but the equation for wins tells us almost everything we need to know – that the actual relationship between wins and payroll is about $5 million a win.

Unlike the R-squared, the $5 million per win should work out the same regardless of whether our regression is based on a day, a season, or even five seasons. And it should come out the same regardless of whether teams vary a lot in spending, or just a little.

If you want to know the strength of the relationship between X and Y, the R-squared won't tell you. But the equation will.
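Here's a sketch of what reading those two numbers off a regression looks like, using nine invented team-seasons. The made-up data is far less noisy than real baseball, so the R-squared will come out unrealistically high – ignore it. The point is that the slope and its confidence interval are where the payroll question actually gets answered.

import numpy as np
from scipy import stats

payroll = np.array([30, 45, 60, 75, 90, 110, 130, 160, 200])   # $millions, invented
wins = np.array([68, 77, 72, 84, 79, 91, 85, 99, 103])         # invented to track ~$5M/win

fit = stats.linregress(payroll, wins)
print(f"R-squared = {fit.rvalue ** 2:.2f}")
print(f"cost of a win = ${1 / fit.slope:.1f} million")
lo, hi = fit.slope - 2 * fit.stderr, fit.slope + 2 * fit.stderr
print(f"rough 95% CI for the slope: {lo:.3f} to {hi:.3f} wins per $million")

With these numbers, the slope works out to about 0.2 wins per $million – that is, about $5 million a win – and the confidence interval tells you how seriously to take it.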

Of course, the estimate will be much less precise for samples in which there's less data. If we took 162 different single days, and ran 162 different regressions, some would come out to $20 million a win, some would come out to $7 million a win, some would come out to $0 a win, and some would even come out negative, to maybe -$10 million a win. But those 162 estimates should average out to $5 million a win, and cluster around $5 million with a normal distribution.
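That claim is easy to check by simulation. Here's a sketch using the same sort of invented league as before, with one win per $5 million baked in as the truth, running one regression per day for 162 days:

import numpy as np

rng = np.random.default_rng(2)
payroll = rng.uniform(50, 150, 30)       # $millions, invented
p_win = (65 + payroll / 5) / 162         # true talent: one win per $5 million

daily = []
for day in range(162):
    result = rng.binomial(1, p_win)      # every team plays (and wins or loses) one game
    slope = np.polyfit(payroll, result, 1)[0] * 162   # scaled to wins-per-season per $M
    daily.append(slope)

print(f"average of the 162 daily slopes: {np.mean(daily):.2f} wins per $M (truth: 0.20)")
print(f"SD of the daily estimates: {np.std(daily):.2f}")

The individual daily estimates swing wildly – plenty come out negative – but their average lands near 0.2 wins per $million, the built-in $5 million a win.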

If you only had a single day's worth of data, you might find that the 95% confidence interval for the cost of a win comes out very wide. For instance, the interval might say that a win could cost as much as $20 million, or as little as –$6 million. That huge range isn't very useful information – in fact, since zero sits inside the interval, you can't even conclude with statistical significance that money buys wins at all! So you'd probably decide not to conclude anything from the equation, at least until you could rerun the study on a full season's worth of data and get a more precise estimate. But you might still be tempted to look at the R-squared of .03 and say that "only 3% of the variance is accounted for by the model."

You can say it, but, taken alone, it doesn't tell you much about whether there's a strong relationship in real-life terms. Only the regression equation can tell you that.

-----

So what *does* the "percentage explained" figure tell you? It tells you how much more accurate your predictions of wins would be if you had the extra information provided by salary.

Suppose that at the beginning of the season, you had to predict how many wins each team would get, knowing absolutely nothing about any of them. Your best prediction would be that each team would win 81 games. Some of your predictions would be off by many games; one or two might be right on. The standard deviation of your errors would probably be about 12 games, which means the variance would be the square of that, or about 144.

Now, suppose you had, in advance, the information from the regression: the R-squared of .18, each team's payroll, and the finding that each $5 million in payroll buys one win. You would now adjust your prediction for each team based on its salary. You'd predict the Yankees at 97 wins, the Nationals at 76, and so on.

The R-squared of .18 tells you that the extra information covers 18% of that variance of 144. Of the 144 points of variance, 26 points are "explained by payroll." What's left is 118 points. The square root of 118 is about 11, so the extra information cuts your typical error from 12 games down to about 11 games.
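The arithmetic, spelled out:

total_var = 12 ** 2               # variance of the know-nothing predictions: 144
explained = 0.18 * total_var      # the 18% that payroll accounts for: about 26
leftover = total_var - explained  # about 118
print(f"typical error drops from 12 games to {leftover ** 0.5:.1f}")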

The "18%" figure answers the question, "How valuable is knowing the team's payroll if you're trying to predict team wins for the year?"

And that's a completely different question – and a less important one – than "Can increasing your payroll buy you more wins?"

One last example, just for the sake of overkill. Suppose, in 2020, a vaccine for cancer is invented. It works 90% of the time. Almost the entire population rushes out to get vaccinated.

At that time, you ask yourself these two questions:

1. If I'm trying to predict whether Joe Smith will get cancer, how valuable is knowing if he had the vaccine? Answer: it's not valuable at all – almost everyone has had the vaccine, so knowing that Joe is one of them doesn't give me any useful information. The "percentage of variance explained" is zero.

2. But how strong is the relationship between the vaccine and the incidence of cancer? Answer: extremely strong.

-----

"The Wages of Wins" study shows that payroll does indeed buy wins, at a rate of about $5 million each. The "percentage of variance explained" is almost completely irrelevant.


16 Comments:

At Sunday, April 15, 2007 1:31:00 AM, Blogger Brian Burke said...

Phil – I've followed your criticism of Wages on this point for a while. You're completely correct in terms of statistics.

But the authors don't know baseball nearly as well as they know economics or basketball. Baseball is a highly random game of razor-thin win expectations. That's why a season needs so many games to be significant. Applying the authors' logic to the 0.18 r-squared, that's the difference between 1st place and 4th or 5th place every year (0.18*162 = 29 wins).

As I recall, the more appropriate measure is r, not r-squared, which would be 0.42. That's a lot of wins.

 
At Sunday, April 15, 2007 1:40:00 AM, Blogger Phil Birnbaum said...

Hi, Brian,

Agreed on the razor-thin expectations, and that r is more appropriate than r-squared (although r alone doesn't tell you much without knowing the SDs of payroll and wins).

But I'm not sure why you're multiplying the 0.18 by 162 ...

 
At Sunday, April 15, 2007 9:25:00 PM, Blogger Brian Burke said...

Just following the authors' faulty logic.

 
At Sunday, April 15, 2007 9:27:00 PM, Blogger Phil Birnbaum said...

Oh, I get it now. Sorry.

Phil

 
At Monday, April 16, 2007 3:17:00 PM, Blogger Don Coffin said...

Just because I'm weird (and happened to have data for 2002 - 2005), here's what I get for the correlation between winning percentage and total team salary annually from 2002 to 2005:

2002: 0.466
2003: 0.446
2004: 0.460
2005: 0.488

So that's pretty much constant on an annual basis. Then I looked at the correlation between average team winning percentage over the 2002 - 2005 period and average team salary over the 2002 - 2005 period. The correlation is 0.530, about 10% higher. This suggests that, yes, the correlation rises as we use a longer time period, but, no, even using a 4-year period, it's still not all that high. Longer runs of data are, of course, possible. (I have average team salary in my data set back to 1998, but have been too lazy to add winning percentages further back than 2002...don't ask me why...)

 
At Monday, April 16, 2007 3:31:00 PM, Blogger Don Coffin said...

I'm apparently also bored today. Here's a regression relating average wins per year to average team salary (in millions) per year, across the 2002 - 2005 period:

AveWins = 67.4 + 5.7*AveSal(Mil$)

The adjusted R-squared is 0.255. The constant and the coefficient on salary are both significantly different from zero at the 1% level. So apparently, on average over the 2002 - 2005 period, you could "buy" one more win for $5.7 million. The coefficient of variation (standard deviation divided by mean) for salary is 0.415; the coefficient of variation of wins is 0.131. The relative variation in salaries is more than 3 times as much as the relative variation in wins (even when averaged over the 4 year period).

In a statistician's terms, what we have here is clearly an omitted-variables problem. Something quite significant is going on that is not captured by variation in salary.

 
At Monday, April 16, 2007 3:40:00 PM, Blogger Phil Birnbaum said...

Not weird at all ... thanks for doing all that work!

There are at least three obvious "omitted variables":

1. Binomial luck. A .500 team could win 91 games, just by luck alone, for the same reason a fair coin could come up heads more often than tails, just by luck.

2. Player luck. A team could pay $12M for a player, expecting him to contribute two wins above average, but he has an off-year (or wasn't as good as the team thought). That is, teams don't always spend their money wisely.

3. Draft luck. A team might wind up drafting an Albert Pujols, who plays several seasons for them for next to nothing.

You could make the point that if you were to take out all the luck, the percentage of variance explained – expressed as a percentage of total NON-LUCK variance – is pretty high.

Of course, only #1 is pure luck – #2 and #3 can be controlled to at least some extent by spending more money on player evaluation and drafting. I bet you couldn't control them *that* much, though.

BTW, your r-squareds increase from 21% (using 2004 as typical) to 28% (over all four years combined). So when I said it would rise from 16% to 30%, I wasn't *that* far off. :)

 
At Monday, April 16, 2007 3:42:00 PM, Blogger Phil Birnbaum said...

Doc, are you sure about that regression equation? That puts wins at about $175,000 each, not $5.7 million.

Did you mean to say (Salary divided by 5.7 million)? That would work.

 
At Monday, April 16, 2007 3:43:00 PM, Blogger Don Coffin said...

I really should do some work. Here's the regression using the individual observations from 2002 - 2005 (so there are 120 data points, not just 30):

Wins = 69.1 + 5.02*AveSal(Mil$)

The adjusted R-squared is 0.169. Everything is highly statistically significant. The relationship is essentially the same on an annual basis as it is averaged over four years – about $5 million buys a win. The coefficients of variation are larger (45% for salaries, 16% for wins), but not a lot (indeed, not significantly different, for that matter).

How much any of this means, I'm not sure. My gut tells me that, yes, on average a larger team salary buys more wins, but that it's not a highly certain way of doing business. The larger variation in salaries than in wins suggests that there are significant determinants of wins that are uncorrelated with salary and that may be more important than salary.

 
At Monday, April 16, 2007 3:48:00 PM, Blogger Phil Birnbaum said...

Doc,

That last post just showed something very interesting. When you do single years, you get r-squareds from .2 to .24. But when you combine all the observations, you get .17, which is about what TWOW got.

I thought what might have caused this was salary inflation: if you were to normalize each observation to the payroll norm for that year, you'd get a higher r-squared.

But TWOW *did* normalize, and they still only got .18. Hmmm, not sure what to think.

 
At Monday, April 16, 2007 4:02:00 PM, Blogger Don Coffin said...

Two things. I have to pay more attention to what I'm doing. My salary data is not total team salary, it's average player salary. But, doing the math, I still get about $5 million-plus to add one win.

Second, there's more variation in average salary (and in wins) in a single year than across years. Averaging reduces the amount of variation to be explained, and so would lead to an increase in R-squared. (It's sort of the same thing you get when you do a t-test for the difference in means – the variation in sample means is much less than the variation within a sample.) So, no, I'm not surprised. Even the effect of the larger sample size isn't enough to counteract the effect of the larger variance in individual years.

 
At Monday, April 16, 2007 4:06:00 PM, Blogger Phil Birnbaum said...

OK, that makes sense. $175K per player to get one win is $4.4 million for the team.

As for the reduced variance ... yes, of course. Geez, I just mentioned that twice in my own post, that r-squared depends on the observed variance. Should have caught that.

 
At Monday, April 16, 2007 4:12:00 PM, Blogger Tangotiger said...

My equation is:
win% = (P+1)/(P+3)

where P = payroll index (team payroll divided by average payroll)

So, a team that pays twice the league average will be expected to win 60% of the time. A team that pays at the league minimum (about 15% of the league average payroll) will win 36.5% of the time.

Around the 50% mark (P=1), we see that the marginal cost is around $4MM per win.

That number is also how much free agents are making.
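To spell out the middle step: the derivative of (P+1)/(P+3) is 2/(P+3)^2, which is 0.125 at P=1. Turning that into dollars means assuming a league-average payroll – call it $75 million, a mid-2000s ballpark figure:

def win_pct(p):
    # Tango's rule of thumb; p = team payroll as a multiple of the league average
    return (p + 1) / (p + 3)

AVG_PAYROLL = 75e6                                     # assumed league average, a guess
slope_at_1 = (win_pct(1.01) - win_pct(0.99)) / 0.02    # derivative at P=1: about 0.125
wins_per_avg_payroll = slope_at_1 * 162                # about 20 wins per $75M
print(f"${AVG_PAYROLL / wins_per_avg_payroll / 1e6:.1f} million per marginal win")
print(f"{win_pct(2.0):.3f} {win_pct(0.15):.3f}")       # sanity checks: 0.600 and 0.365

That prints a bit under $4 million per marginal win, along with the 0.600 and 0.365 figures from above.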

 
At Monday, April 16, 2007 4:18:00 PM, Blogger Phil Birnbaum said...

Tom: I bet your formula works out similar to the regression-based one. I'll graph the two if I have time.

 
At Tuesday, April 17, 2007 1:32:00 PM, Blogger Unknown said...

If you calculate R as var(expected)/var(observed) you get the following ... using rough numbers here.

Var(observed)=121 -- assumes 1 std deviation in observed wins is 11

Var(random)= 41 -- from binomial where st dev = 6.4 wins over 162 games

 
At Tuesday, April 17, 2007 1:34:00 PM, Blogger Unknown said...

Sorry the last part of my post was missed out.

Therefore R = 0.66; R^2 = 0.43 ... in that context an R^2 of 0.2 or whatever it was isn't too shabby when you consider all the other factors at play
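Spelling that out, with the 11-win SD and the binomial luck term as the only inputs:

var_observed = 11 ** 2        # assumes the SD of observed team wins is about 11
var_luck = 162 * 0.5 * 0.5    # binomial luck variance over 162 games: about 40.5
skill_share = (var_observed - var_luck) / var_observed
print(f"non-luck share of win variance: {skill_share:.2f}")
print(f"payroll's 0.18 measured against skill variance only: {0.18 / skill_share:.2f}")

However you label the 0.66 – as R or as a share of variance – the point stands: measured against only the variance teams can actually control, the payroll figure looks considerably better than the raw number.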


Thanks
John Beamer

 
