The flaw in the Scully pay/performance regression
In 1974, Gerald Scully published an academic article called "Pay and Performance in Major League Baseball." (Here's a Google search that finds a copy on David Berri's site.) It's a very famous paper, because it reached the conclusion that, in the pre-free-agent era, teams were paying players far, far less than the players were earning for their employers.
It was also one of the first academic papers to try to find a connection between individual performance and team wins. Unfortunately, Scully chose SLG and K/BB ratio as his measures of performance, although, I suppose, those probably seemed like reasonable choices at the time. But it turns out there's a much more serious problem.
Alongside SLG and K/BB in his set of variables for predicting winning percentage, Scully also included dummy variables for how far the team finished out of first place. That is, in trying to predict how well a team did, Scully based his estimates partially on ... how well the team did!
That biases the results so much that I can't believe nobody's mentioned it before ... at least they haven't in all the mentions I've seen of this study.
If it's not obvious why that's the wrong thing to do, let me try to explain.
Suppose you predict that, this season, your favorite team will slug .390 and have a K/BB ratio of 1.5. What will its winning percentage be?
Well, if Scully had used only SLG and K/BB in his regression, it would be easy to figure out: you just take his regression equation, which would look something like
PCT = (a * SLG) + (b * K/BB) + c (if an NL team) + d
Plug in Scully's estimates for a, b, c, and d, plug in .390 and 1.5, and there you go -- your estimate.
But Scully's actual equation included those two extra terms:
PCT = (.92 * SLG) + (.90 * K/BB) - 38.57 (if an NL team) + 37.24 + 43.78 (if the team finished within 5 games of first) - 75.64 (if the team finished more than 20 games out)
So now how do you calculate your team's expected PCT? You can't! Because you don't know whether to include those last two variables. After all, how can you predict in advance whether your team will wind up having finished near the top or the bottom? You can't! If you could, you probably wouldn't need this regression in the first place!
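To see the circularity concretely, here's the fitted equation written out as a Python function. The coefficients are the ones quoted above; the variable scaling (SLG in points, K/BB ratio times 100) is my assumption, since the paper's exact units aren't shown here.

```python
# A sketch of Scully's fitted equation as a "prediction" function.
# Coefficients are from the post; the input scaling (SLG in points,
# K/BB ratio times 100) is an assumption made for illustration.
def predicted_pct(slg_points, kbb_x100, nl_team, within_5, more_than_20_out):
    pct = 0.92 * slg_points + 0.90 * kbb_x100 + 37.24
    if nl_team:
        pct -= 38.57
    if within_5:            # finished within 5 games of first
        pct += 43.78
    if more_than_20_out:    # finished more than 20 games out
        pct -= 75.64
    return pct

# The catch: to call this at all, you must already know how the team
# finished -- the very thing the regression is supposed to predict.
print(f"{predicted_pct(390, 150, False, False, False):.2f}")
```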
Not only does the regression not make sense, but, more importantly, by including those two dummy variables, Scully's estimates of productivity wind up completely wrong. For instance: what is the effect of raising your SLG by 10 points? Well, that depends. Keeping all the other variables constant, it's .92 * 10, or 9.2 points. That's .0092, or about 1.5 wins in a 162-game season.
But wait! Those dummy variables for standings position won't necessarily stay constant. What if those 1.5 wins lifted you from 21 games back to 19.5 games back? In that case, the equation would give 9.2 for the SLG, plus an extra 75.6 for the change in the dummy, for a total of 84.8 points! And what if they lifted you from 6 games back to 4.5 games back? In that case, the equation would estimate an extra 43.8 point bump, for a total of 53.0!
So what's the benefit of an extra 10 points slugging on the team's winning percentage?
--> 9.2 points -- for a team 22 or more games out
--> 84.8 points -- for a team 21.5 to 20.5 games out
--> 9.2 points -- for a team 20 to 7 games out
--> 53.0 points -- for a team 6.5 to 5.5 games out
--> 9.2 points -- for a team 5 or fewer games out
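Here's that table sketched as code, using the coefficients quoted above and treating 1.5 extra wins as 1.5 fewer games back, as in the example:

```python
# Marginal effect of +10 SLG points implied by Scully's equation,
# depending on whether the resulting ~1.5 wins move the team across
# one of the standings-dummy boundaries.
def slg_bump_effect(games_back_before, games_back_after):
    effect = 0.92 * 10                       # direct SLG term: 9.2 points
    # leaving the "more than 20 games out" group removes the -75.64 penalty
    if games_back_before > 20 >= games_back_after:
        effect += 75.64
    # entering the "within 5 games of first" group adds the +43.78 bonus
    if games_back_before > 5 >= games_back_after:
        effect += 43.78
    return effect

print(f"{slg_bump_effect(12, 10.5):.2f}")   # mid-pack: just 9.20
print(f"{slg_bump_effect(21, 19.5):.2f}")   # crosses the 20-out boundary: 84.84
print(f"{slg_bump_effect(6, 4.5):.2f}")     # crosses the within-5 boundary: 52.98
```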
Laid out like that, it makes no sense. You can't figure out how much the player's productivity is worth unless you know which of the five groups the team is in. But which group the team is in is exactly what you're trying to predict!
In any case, it's obvious that using 9.2 points as the measure of the player's increased productivity is wrong. It's *at least* 9.2 points, but sometimes substantially more. You need to average all five cases out, in proportion to how often they'd occur (and how can you know how often they occur without further study?). When you do that, you'll obviously get more than 9.2 points. But, as far as I can tell, Scully just used the SLG coefficient as his measure -- the player only got credit for the 9.2 points! And so he *severely underestimated* how much a player's performance helps his team.
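As a toy illustration of that averaging, here's a sketch with completely made-up group weights -- the real frequencies would need exactly the further study the paragraph mentions:

```python
# Averaging the five cases into one expected effect of +10 SLG points.
# The weights below are MADE UP purely for illustration; only the
# effect sizes come from the equation's coefficients.
cases = [
    (9.2,  0.15),   # 22 or more games out
    (84.8, 0.05),   # crosses the 20-games-out boundary
    (9.2,  0.55),   # mid-pack
    (53.0, 0.05),   # crosses the within-5 boundary
    (9.2,  0.20),   # already within 5
]
expected = sum(effect * weight for effect, weight in cases)
print(f"{expected:.2f}")   # more than 9.2, whatever the weights
```

However you set the weights, any nonzero chance of landing in the boundary-crossing groups pulls the average above 9.2.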
Here's an example that will make it clearer. Suppose a lottery gives you a 1 in a million chance of winning $500,000. Then, if you do a regression to predict winnings based on how many tickets you buy, you'll probably get something close to
Winnings = 0.5 * tickets bought
Which makes sense: a 1 in a million chance of winning half a million dollars is worth 50 cents.
But now, suppose I add a term that says whether or not you won. Now, I'll get
Winnings = 0.0 * tickets bought + $500,000 (if you won)
That's true, but it completely hides the relationship between the ticket and the winnings. If you ignore the dummy variable, it looks like the ticket is worthless!
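This is easy to verify with a quick simulation. Here's a sketch using a scaled-down lottery of my own invention -- a 1-in-1,000 shot at $500, the same 50-cent expected value per ticket -- fit by ordinary least squares with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# Scaled-down lottery (hypothetical numbers, chosen so a ticket's
# expected value is $0.50, like the post's 1-in-a-million / $500,000):
n = 200_000
tickets = rng.integers(1, 11, size=n)        # each person buys 1-10 tickets
won = rng.random(n) < tickets / 1000         # more tickets, better odds
winnings = np.where(won, 500.0, 0.0)

# Regression 1: winnings on tickets alone.
X1 = np.column_stack([tickets, np.ones(n)])
coef1 = np.linalg.lstsq(X1, winnings, rcond=None)[0]
print(f"ticket coefficient, no dummy:   {coef1[0]:.2f}")   # close to 0.50

# Regression 2: add a dummy for whether you won.
X2 = np.column_stack([tickets, won.astype(float), np.ones(n)])
coef2 = np.linalg.lstsq(X2, winnings, rcond=None)[0]
print(f"ticket coefficient, with dummy: {coef2[0]:.2f}")   # collapses to ~0
print(f"'won' dummy coefficient:        {coef2[1]:.2f}")   # absorbs the ~500
```

With the dummy included, the fit is exact (winnings are literally $500 times the dummy), so the ticket coefficient collapses to zero -- the ticket looks worthless, just as described above.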
Same for Scully's regression. By including part of the "winnings" for having a good SLG or K/BB "ticket" in a different term, he underestimates the value of the "ticket".
So, since Scully's conclusion was that players are underpaid for their productivity, and Scully himself had underestimated that productivity ... well, the conclusion is completely unjustified by the results of the study. It may be true that players were underpaid -- I think it almost certainly is -- but this particular study, famous as it is, doesn't even come close to proving it.
UPDATE: As commenter MattyD points out (thanks!), I got it backwards. For the previous paragraph, I should have said:
However, Scully's conclusion on pay and productivity still holds. The study underestimated player productivity, but, if it found that players are paid less than even that underestimated production, it's certainly true that they are underpaid relative to their true production.