I've always been uncomfortable with studies that express their results in "percentage of variance explained." For instance, in "The Wages of Wins," the authors run a regression of wins on salaries, and get an R-squared of .18, which means that "payroll explains about 18% of [the variance of] wins." Since 18 percent is a small number, they argue that the relationship is not strong, and the authors conclude that "you can't buy the fan's love" by spending money to get wins.
I don't think that's correct – and for reasons that have to do with statistics, not baseball. The percentage of variance explained does NOT necessarily tell you anything about the strength of the relationship between the variables.
(Technical note: I'm going to use "percentage of variance explained" most of the time, but you can probably just substitute "R-squared" if you prefer. They're the same thing.)
First, the percentage figure doesn't tell you, outright, the importance of payroll – it tells you the importance of payroll *relative to the total importance of all the factors that affect wins*. If that total goes up, the percentage goes down.
One of those other factors is luck. The more games you have, the more the luck evens out. So if you were to analyze five seasons instead of one, the luck would drop, and the payroll-to-luck ratio would increase. Then, instead of payroll explaining only 18% of the variance in wins, maybe it would explain 30%, or even 40%.
Going the other way, if you ran a regression on a single day's worth of games, there's a lot more luck. Over an entire season, there's no way the Yankees are going to be worse than the Brewers – but on a single night, it's quite possible that New York might lose and Milwaukee might win. So, on a single game, payroll might explain only, say, 3% of the variance.
The relationship between salary and expected wins is a constant: if paying $200 million buys you an (expected) .625 winning percentage, it should buy you that .625 over a day, a week, a month, or a year. But depending on how you do the study, you can get a "percentage of variance explained" of 3%, or 18%, or 30%, or 40%. So, obviously, the number you *do* come up with can't, by itself, tell you anything about the strength of the relationship between payroll and wins.
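Here's a quick sketch of that point, with invented numbers (these payrolls and winning percentages are assumptions for illustration, not the book's data): the payroll-to-wins relationship is held perfectly constant, and only the number of games changes. The R-squared changes anyway.

```python
# Hypothetical simulation: 30 teams whose payrolls imply a fixed expected
# winning percentage, plus game-by-game luck. The payroll-to-wins link
# never changes, but the R-squared does, depending on how many games we use.
import random

random.seed(1)
payrolls = [50 + i * 5 for i in range(30)]            # $50M to $195M (invented)
true_pct = [0.35 + (p - 50) / 500 for p in payrolls]  # fixed payroll-to-wins link

def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

results = {}
for games in (1, 162, 810):  # one day, one season, five seasons
    pcts = [sum(random.random() < p for _ in range(games)) / games
            for p in true_pct]
    results[games] = r_squared(payrolls, pcts)
    print(games, round(results[games], 2))
```

The single-day R-squared comes out tiny, and the five-season R-squared comes out large – same underlying relationship, different amounts of luck left in the data.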
We know that smoking causes heart disease, and that the relationship is pretty strong.
Now: suppose you do a regression. How much of the variance in heart disease can be explained by lifetime smoking?
There's no way to tell. It depends on the distribution of smokers.
Suppose, out of the entire world population, only one person smokes. His risk of heart disease increases substantially. But how much of the variance in heart disease is explained by smoking? Close to zero percent. Why? Because even in the absence of smoking, there's substantial variance in the population of six billion people. There's age, there's diet, there's exercise, and there's genetic predisposition, among many other things. The variance contributed by the one smoker, compared to the variance caused by six billion different sets of other causes, is effectively zero.
Now, suppose that half the world smokes, and half doesn't. Now, there is lots more variance in heart disease rates. Instead of just variance caused by genetics and eating habits, you now have, on top of that, 50% of the population varying hugely from the other half in this one area. If smoking is extremely risky compared to eating and genetics, you might find that (say) 40% of the variance is explained by smoking. If it's only moderately more risky, you might get a figure of 15% or something.
What if everyone in the world smokes except one person? In that case, everyone's risk has risen by the same amount, except for that one non-smoker. So the variance caused by smoking is, again, very low. It's the same situation when only one person *doesn't* smoke as when only one person *does* smoke. And so, again, about zero percent of the variance is explained by smoking.
So, in theory, how much of the variance in heart disease should be explained by smoking? It could be almost any number. It depends on the variance of smoking behavior just as much as it depends on the effects of smoking.
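You can see the same thing in a toy simulation (all the numbers here – the baseline risk, the size of the smoking effect – are made up for illustration): the riskiness of smoking is held fixed, and only the *share* of the population that smokes changes.

```python
# Hypothetical: smoking adds a fixed EFFECT to each person's heart-disease
# risk; baseline risk varies person to person (age, diet, genetics, etc.).
# Only the fraction of smokers changes between runs.
import random

random.seed(2)
N = 100_000
EFFECT = 0.2  # assumed fixed bump in risk from smoking (invented number)

def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

results = {}
for share in (0.001, 0.5, 0.999):  # almost nobody, half, almost everybody
    smokes = [1 if random.random() < share else 0 for _ in range(N)]
    risk = [random.gauss(0.3, 0.1) + EFFECT * s for s in smokes]
    results[share] = r_squared(smokes, risk)
    print(share, round(results[share], 3))
```

With almost no smokers, or almost all smokers, the R-squared is near zero; with half the population smoking, it's substantial – even though the effect of smoking never changed.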
If you did an actual study, and you found that 18% of the variance in heart disease was explained by cigarette use, what would that tell you about the riskiness of smoking?
Almost nothing! It could be that (a) there is lower variance in how much people smoke, but the risk of smoking is higher; or (b) there is higher variance in how much people smoke, but the risk is lower. Either of those possibilities is consistent with the 18% figure.
Going back to the baseball example: if 18% of the variance is explained by salaries, it could be that (a) teams vary little in how much they spend, but money buys wins fairly reliably, or (b) teams vary a lot in how much they spend, but money buys wins fairly weakly.
Which is correct? We can't tell yet. The 18% number, on its own, simply does not tell you anything about whether money can buy wins.
So if the percentage of variance explained doesn't tell you a whole lot about the relationship between the variables, then what does? Answer: the regression equation.
"The Wages of Wins" doesn't give full regression results for their payroll study, but they do mention the figure of about $5 million dollars per win. So the computer output from their regression probably looks something like
R-squared = .18
Expected Wins = 65 + (Payroll divided by 5 million)
The .18 tells us little, but the equation for wins tells us almost everything we need to know – that the actual relationship between wins and payroll is about $5 million a win.
Unlike the R-squared, the $5 million per win should work out the same regardless of whether our regression is based on a day, a season, or even five seasons. And it should come out the same regardless of whether teams vary a lot in spending, or just a little.
If you want to know the strength of the relationship between X and Y, the R-squared won't tell you. But the equation will.
Of course, the estimate will be much less precise when there's less data. If we took 162 different single days, and ran 162 different regressions, some would come out to $20 million a win, some to $7 million a win, some to $0 a win, and some would even come out negative, at maybe –$10 million a win. But those 162 estimates should average out to $5 million a win, clustering around it in a roughly normal distribution.
If you only had a single day's worth of data, you might find that the 95% confidence interval for the cost of a win comes out very wide. For instance, the interval might say that a win could cost as much as $20 million, or as little as –$6 million. That huge range isn't very useful information – in fact, since zero is in the interval, you can't even conclude, with statistical significance, that money buys wins at all! And so you'd probably decide not to conclude anything from the equation, at least until you could rerun the study with a full season's worth of data to get a more precise estimate. But you might still be tempted to look at the R-squared of .03, and say that "only 3% of the variance is accounted for by the model."
You can say it, but, taken alone, it doesn't tell you much about whether there's a strong relationship in real-life terms. Only the regression equation can tell you that.
So what *does* the "percentage explained" figure tell you? It tells you how much more accurate your predictions of wins would be if you had the extra information provided by salary.
Suppose that at the beginning of the season, you had to predict how many wins each team would get, knowing absolutely nothing about any of them. Your best prediction would be that each team would win 81 games. Some of your predictions would be off by many games; one or two might be right on. The standard deviation of your errors would probably be about 12 games, which means the average variance would be the square of that, or about 144.
Now, suppose you had, in advance, the information from the regression: the R-squared of .18, each team's payroll, and the finding that each $5 million in payroll buys one win. You would now adjust your prediction for each team based on its salary. You'd predict the Yankees would be at 97 wins, the Nationals would be at 76, and so on.
The R-squared of .18 would tell you that the extra information covers 18% of the variance of 144. Of the 144 points of variance, 26 points are "explained by payroll". What's left is 118 points. The square root of 118 is about 11, so the extra information allowed you to cut your typical error from 12 games down to 11 games.
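The arithmetic in that paragraph, spelled out:

```python
# Error reduction from knowing payroll, using the numbers above.
var_total = 12 ** 2             # naive error: sd of 12 games -> variance 144
explained = 0.18 * var_total    # 18% of 144 = 25.92 points "explained by payroll"
remaining = var_total - explained
print(round(explained), round(remaining), round(remaining ** 0.5, 1))
# prints: 26 118 10.9
```

So an 18% R-squared shaves only about a game off your typical prediction error – from 12 games down to roughly 11.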
The "18%" figure answers the question, "How valuable is knowing the team's payroll if you're trying to predict team wins for the year?"
And that's a completely different question – and a less important one – than "Can increasing your payroll buy you more wins?"
One last example, just for the sake of overkill. Suppose, in 2020, a vaccine for cancer is invented. It works 90% of the time. Almost the entire population rushes out to get vaccinated.
At that time, you ask yourself these two questions:
1. If I'm trying to predict whether Joe Smith will get cancer, how valuable is knowing if he had the vaccine? Answer: it's not valuable at all – almost everyone has had the vaccine, so knowing that Joe is one of them doesn't give me any useful information. The "percentage of variance explained" is close to zero.
2. But how strong is the relationship between the vaccine and the incidence of cancer? Answer: extremely strong.
"The Wages of Wins" study shows that payroll does indeed buy wins, at a rate of about $5 million each. The "percentage of variance explained" is almost completely irrelevant.
Labels: competitive balance, statistics