### r-squared abuse

I bet you think a Ferrari is expensive. You're wrong. I did a statistical study, and I found out that 99.999% of Bill Gates' wealth is explained by factors other than the cost of a Ferrari.

That leaves only 0.001%, just a thousandth of one percent. That's a very small number. Clearly, buying a Ferrari doesn't affect your wealth!

Not falling for it? You shouldn’t. And neither should you fall for this Andrew Zimbalist quote (as discussed in a recent Tangotiger blog post):

"If you do a statistical analysis [of] the relationship between team payroll and team win percentage, you find that 75 to 80 percent of win percentage is determined by things other than payroll," says Andrew Zimbalist, a noted sports economist ...

This figure seems to be quoted by economists everywhere, in the service of somehow proving that "you can't buy wins" by spending money on free agents. But there are a few problems. First, it's misinterpreted. Second, it doesn't measure anything real. And, third, as Tango points out, you can make the number come out to be wildly different, depending on your sample size.

------

(For purposes of this discussion, I'm going to turn Zimbalist's statement around. If 75% is *not* determined by payroll, it implies that 25% *is* determined by payroll. I'm going to use that 25% here, since that's how the result is usually presented.)

First, and least important, Zimbalist misstates what the number means. It's actually the r-squared of a regression between season wins and season payroll (for the record, using 2007 data for all 30 teams, I got .2456 – almost exactly 25%). That r-squared, technically speaking, is the percentage of *the total variance* that can be explained by taking payroll into account. Zimbalist simply says it's "win percentage," instead of *the variance* of win percentage. But the way he phrases it, readers would infer that, if you look at the Red Sox at 96-66, payroll somehow accounts for 25% of that.

What could that mean? Does it mean that payroll was responsible for exactly 24 wins and 16.5 losses? Does it mean that without paying their players, the Red Sox would have been .444 instead of .593? Of course not. Phrased the way Zimbalist did, the statement simply makes no sense.

But that's nitpicking. Zimbalist really didn't mean it that way.

What he *did* mean is that payroll accounts for 25% of the total variance of wins among the 30 MLB teams. But there's no plain English interpretation of what that number means. The only way to explain it is mathematically. Here's the explanation:

----

Look at all 30 teams. Guess at how each one was supposed to do in 2007. With no additional information, you have to project them all at 81-81.

Now, take the difference between the actual number of wins and 81. Square that number. (Why square it? Because that's just the way variance works.) So, for instance, for the Red Sox, you figure that 96 minus 81 is 15. Fifteen squared is 225.

Repeat this for the other 29 teams – Arizona is 81, Milwaukee is 4, and so on. Add all those numbers up. I did, and I got 2,488. That's the total variance for the league.

Now, get a sheet of graph paper. Put payroll on the X axis, and wins on the Y axis. Place 30 points on the graph, one for each team. Now, figure out how to draw a straight line that comes as close as possible to the 30 points. By "as close as possible," we mean the line – there's only one – that minimizes the sum of the squares of each of the 30 vertical distances from the line to each point. (Fortunately, there's an algorithm to figure this out for you automatically, so you don’t have to test an infinity of possible lines and square an infinity of vertical distances.)

That line is now your new guess at how each team would do, adjusted for payroll. For instance, the Red Sox, with a payroll of $143 million, would come out to 89-73. Arizona, with a payroll of $52 million, comes out at 77-85.

Now, repeat all the squaring, this time using the projections. The Red Sox won 96 games, not 89, so the difference is 7. Square that to get 49. The Diamondbacks won 90, not 77, so the difference is 13. Square that to get 169. Repeat for the other 28 teams. I think you should get something around 1,877.

Originally, we had 2,488. After the payroll line, we have 1,877. The reduction is around 25%.

Therefore, payroll explains 25% of the variance in winning percentage.
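For anyone who'd rather see those steps as code, here's a minimal Python sketch. The function name `r_squared_by_hand` and the five-team payroll and win figures are made up for illustration (they are not the 2007 data); only the two totals at the bottom, 2,488 and 1,877, come from the numbers above. The sketch measures the baseline from the sample's average win total, which is the 81 used above when the whole league is included.

```python
def r_squared_by_hand(payroll, wins):
    """Share of the squared misses removed by a least-squares line."""
    n = len(wins)
    mean_x = sum(payroll) / n
    mean_y = sum(wins) / n

    # Step 1: guess the average for everybody, and total the squared misses.
    total_ss = sum((y - mean_y) ** 2 for y in wins)

    # Step 2: the least-squares line (closed form, no trial and error).
    sxx = sum((x - mean_x) ** 2 for x in payroll)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(payroll, wins))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Step 3: squared misses again, this time guessing from the line.
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(payroll, wins))

    # Step 4: the fraction of the original total that went away.
    return 1 - residual_ss / total_ss


# Hypothetical five-team league (payroll in $ millions), illustration only.
payroll = [30, 60, 90, 120, 150]
wins = [75, 70, 88, 80, 92]
example_r2 = r_squared_by_hand(payroll, wins)

# The totals quoted above: 2,488 before the line, 1,877 after.
article_r2 = 1 - 1877 / 2488
print(round(article_r2, 4))  # 0.2456
```

Nothing in the function knows anything about baseball. It only compares two piles of squared misses.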

----

This entire process, taken together, is linear regression. And it's all correct – mathematically, at least. But the statement, "payroll explains 25% of the variance in winning percentage," has meaning only within the definitions of linear regression. It means very little in terms of baseball. Indeed, the 25% is not a statement about baseball at all, or even a statement about payroll, or about wins. It is a statement about what happens when you start accounting for *squares of differences in wins*. I think of it, not completely correctly, as a statement about "square wins." And who cares about square wins?

Most statements that involve a number like 25% have a coherent meaning – when you hear those statements, you can use them to draw conclusions. For instance, suppose I tell you "toilet paper is 25% off today." From that, you can calculate:

-- the same amount of money that bought 3 rolls yesterday will buy 4 rolls today

-- if it cost $10.00 per jumbo pack yesterday, it's $7.50 today

-- when the sale is over, the price will increase by 33%

-- if I use eight rolls a week, which normally costs $4.00, and I buy an 8-week supply today, I will save $8.00, which works out to 12.5 cents per roll.

That's how you know you got useful information – when you can make predictions and calculations based on the figure.
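The toilet paper calculations above can be checked in a few lines of Python. This is only a sketch of the arithmetic; the variable names are my own.

```python
discount = 0.25

# Yesterday's money for 3 rolls buys this many rolls today.
rolls_today = 3 / (1 - discount)              # 4.0

# A $10.00 jumbo pack at 25% off.
sale_price = 10.00 * (1 - discount)           # 7.5

# Going from the sale price back to full price.
increase_after_sale = 1 / (1 - discount) - 1  # about 0.33

# Eight rolls a week at $4.00 per week, bought eight weeks ahead.
rolls_bought = 8 * 8
saved = 4.00 * 8 * discount                   # 8.0
saved_per_roll = saved / rolls_bought         # 0.125
```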

But suppose I now tell you, "payroll explains 25% of the variance of winning percentage." What can you tell me? Nothing! Even if you're very familiar with regression analysis, I challenge you to write an English sentence about wins and payroll – one that doesn't include any stats words like "variance" – that uses the number 25%, but refers to wins and payroll. I think it can't be done, at least not without taking the square root of 25%.

You can't even tell me, from this analysis, if payroll is an important factor in wins or not. If we can't make any statement about what the 25% means, how do we know whether it's important or not? Our intuition suggests that payroll must not be very important, because the percentage is "only" 25. But that's not true, as Tango said. I'll get back to that in a bit.

-----

The thing is that the regression analysis DID tell us lots of useful information about payroll and wins. The 25% figure is actually one of the least important things we get out of it, and it's strange that economists would emphasize it so much.

The most important thing we get is the regression equation itself. That actually answers our most urgent question about payroll – how many extra wins do high-spending teams get? The answer is in this equation (which I've rounded off to keep things simple):

Wins = 70 + (payroll / $7.4 million)

This gives us an exact answer:

-- in 2007, on average, every $7.4 million teams spent on salaries added an extra win above 70-92.

Did payroll buy wins? Yes, it did – at $7.4 million per win. Can't get much more straightforward than that. If you take that number, then figure out how many wins the free-spending teams bought ... well, work it out. The Yankees spent $190 million, which should have made them 96-66. The Red Sox, as we said, should have been 89-73. The Devil Rays, at only $24 million, should have been 73-89.

And that's the answer, based on the regression, on how payroll bought wins in 2007. Based on that, you may think that means payroll is important. Or you may think payroll is unimportant. But your answer should be based on $7.4 million, not "25%" of some mathematical construction.
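The projections quoted above follow directly from the rounded equation. Here's a short Python sketch; the function name `projected_wins` is just for illustration, and the payrolls are the rounded figures given in this post.

```python
def projected_wins(payroll_millions, base_wins=70, cost_per_win=7.4):
    """Rounded 2007 equation: Wins = 70 + payroll / $7.4 million."""
    return base_wins + payroll_millions / cost_per_win


# Payrolls (in $ millions) as quoted in the post.
payrolls = {"Yankees": 190, "Red Sox": 143, "Diamondbacks": 52, "Devil Rays": 24}

for team, payroll in payrolls.items():
    wins = round(projected_wins(payroll))
    print(f"{team}: {wins}-{162 - wins}")
# Yankees: 96-66
# Red Sox: 89-73
# Diamondbacks: 77-85
# Devil Rays: 73-89
```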

(By the way, I think $7.4 million is unusually high – there were few extreme records in 2007, which made wins look more expensive last year than in 2006. But that's not important right now.)

------

Which brings us back to Tango's assertion, that the 25% can be almost anything depending on sample size.

Why is that true? Because the 25% means 25% of *total variance*. And total variance depends on the sample size.

Suppose there are only two things that influence the number of wins – payroll, and luck. Since payroll is 25% of the total, luck must be the other 75%. (This is one of the reasons statisticians like to use r-squared even though it's not that valuable – it's additive, and all the explanations will add up to 100%. That's very convenient.)

Let's suppose that over a single season, payroll accounts for 100 "units" of variance, and luck accounts for 300 "units":

100 units – payroll

300 units – luck

Payroll is 25% of the total.

But now, instead of basing this on 30 teams over one season, what if we based it on the same 30 teams, but on their average payroll and wins over two consecutive seasons?

Over two seasons, payroll should cause variance in winning percentage exactly the same way as over one season. But luck will have *less* variance. The more games you have, the more chance luck will even out. Mathematically, twice the sample size means half the variance of the mean. And so, if you take two seasons, you get:

100 units – payroll

150 units – luck

And now payroll is 40% of the variance, not 25%.

If you go three seasons, you get 100 units payroll, 100 units luck. And now payroll is 50%. Go ten seasons, and payroll is 77% of the variance. The more seasons, the higher the r and the r-squared. That's why Tango says:

"If [games played] approach infinity, r approaches 1. If GP approached 0, r approaches 0. You see, the correlation tells you NOTHING, absolutely NOTHING, unless you know the sample size."

(Update clarification: this happens ONLY if payroll and luck are the only factors, and payroll reflects actual talent. In real life, there are many other factors, and it's impossible to assess talent perfectly -- so while the r-squared between payroll and wins will increase with sample size, it won't come anywhere near 1.)
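The payroll-and-luck arithmetic above fits in one small function. This is a sketch under the same simplifying assumption (payroll and luck are the only two sources of variance, at 100 and 300 units for a single season); the name `payroll_share` is made up for illustration.

```python
def payroll_share(seasons, payroll_var=100, luck_var_one_season=300):
    """Payroll's share of total variance when averaging over `seasons` seasons."""
    # The variance of an average of n seasons of luck shrinks as 1/n.
    luck_var = luck_var_one_season / seasons
    return payroll_var / (payroll_var + luck_var)


for n in (1, 2, 3, 10):
    print(n, f"{payroll_share(n):.0%}")
# 1 25%
# 2 40%
# 3 50%
# 10 77%
```

The payroll term never changes. Only the denominator does.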

And it goes the other way, too – the higher the variance due to luck, the lower the percentage that's payroll. Suppose instead of winning percentage over a *season*, you use winning percentage over a *game*. I did that – I ran a regression using the 4,860 team-games of the 2007 season (30 teams times 162 games), where each winning percentage was (obviously) zero or 1.000.

Now the r-squared was .003. Instead of payroll explaining 25% of total variance, it explained only 0.3%!

Payroll explains 25% of winning percentage variance over a season.

Payroll explains 0.3% of winning percentage variance over a game.

But the importance of payroll to winning has not changed. Payroll has the same effect on winning a single game as it does on a series of 162 games. But, on a single-game basis, the variance due to luck increases by a huge factor, and that dwarfs the variance due to payroll.

It's like the Ferrari. It explains only 0.001 percent of Bill Gates' net worth. But it explains 1% of a more typical CEO's net worth, 100% of a middle class family's net worth, and 1,000,000% of a beggar's net worth. The important thing is not what percentage of anything the Ferrari explains, but how much the damn thing costs in the first place!

And for payroll, what's important is not what percentage of wins are explained, but how much a win costs.

As Tango notes, the r-squared figure all depends on the denominator. Just like the Ferrari, you can wind up with a big number, or a small number. Both are correct, and neither is correct. Just seizing on the 25% figure, and noting that it's a small number ... that's not helpful.

A pinch of salt has more than 1,000,000,000,000,000,000,000,000 molecules, but weighs less than 0.000001 tons. And payroll explains 25% of season wins, and 0.3% of single game wins. As Tango notes, it's all in the unit of measure. The r-squared, without its dimension, is just a number. And it's almost meaningless.

The r-squared is meaningless, but the regression equation is not. Indeed, the equation does NOT depend on the sample size at all! When I ran the single game regression, with the 4,860 datapoints, I got an r-squared of 0.003, but this equation:

Wins per game = 0.43 + (payroll / $1.2 billion)

Multiplied by 162, that gives

Wins per season = 70 + (payroll / $7.4 million)

Which is EXACTLY THE SAME regression equation as when I did the season.
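You can see the same thing in a simulation. The Python sketch below is not the 2007 data: it invents a 30-team league with evenly spaced payrolls, assumes each team's true chance of winning a game follows the rounded equation above, and lets luck decide each of 162 games. The helper `fit_line`, the seed, and the payroll spread are all assumptions made for illustration. Because payroll and luck are the only factors here, the r-squared values won't match the real ones; the point is how they compare with each other.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope, and r-squared for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)


random.seed(2007)  # arbitrary seed so the run is repeatable
GAMES = 162

# Hypothetical league: 30 payrolls (in $ millions) spread evenly from 25 to 190.
payrolls = [25 + 165 * i / 29 for i in range(30)]

game_x, game_y, season_x, season_y = [], [], [], []
for payroll in payrolls:
    # Assumed true chance of winning any one game, from the rounded equation.
    p_win = 0.43 + payroll / 1200
    results = [1 if random.random() < p_win else 0 for _ in range(GAMES)]
    game_x += [payroll] * GAMES   # one row per team-game
    game_y += results
    season_x.append(payroll)      # one row per team-season
    season_y.append(sum(results))

g_int, g_slope, g_r2 = fit_line(game_x, game_y)
s_int, s_slope, s_r2 = fit_line(season_x, season_y)

print(f"per game:   r-squared {g_r2:.3f}, wins/game = {g_int:.3f} + payroll * {g_slope:.5f}")
print(f"per season: r-squared {s_r2:.3f}, wins = {s_int:.1f} + payroll * {s_slope:.4f}")
print(f"per-game line times 162: wins = {g_int * GAMES:.1f} + payroll * {g_slope * GAMES:.4f}")
```

With every team playing the same number of games, the per-game line multiplied by 162 matches the per-season line up to floating-point rounding, while the per-game r-squared comes out far smaller than the per-season one.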

------

And so:

-- the r-squared changes drastically depending on the total variance, which depends on the sample size, but the regression equation does not;

-- the r-squared doesn't answer any real baseball question, but the regression equation gives you an answer to the exact question you're asking.

That is: if you do it right, and use the regression equation, you get the same answer regardless of what sample you use. If you blindly quote the r-squared, you get wildly different answers depending on what sample you use.

So why are all these economists just spouting r-squared? Do they really not get it? Or is it me?
