Monday, October 16, 2006

On predicting log(salary)

It’s common in the economics literature to run studies trying to predict how various factors affect a person’s salary. When those studies run regressions, they don’t try to predict salary – they try to predict the logarithm of the salary.

The reason you’d use log(salary) is that you expect an arithmetic (additive) change in the inputs to produce a geometric (multiplicative) change in the outputs. For instance, leaving your money in a 5% savings account for one extra year (adding 1 to the number of years) increases your balance by five percent (multiplying by 1.05). Because of that, if you were to run a regression to predict log(balance) based on years, you’d get a perfect correlation of 1.00. But if you ran the regression on just balance, instead of log(balance), it wouldn’t work as well.

On the other hand, the relationship between wages and hours worked is just the opposite. If you work an extra day (at, say, \$10 per hour), you’re going to increase your wages by a fixed \$80. It’s additive, not multiplicative – add one day, add \$80. In this case, using just plain “wages” would give the perfect correlation, and using log(wages) would be the less accurate method.
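The savings-account and wages examples can be checked with a few lines of Python. This is only a sketch: the \$1000 starting deposit, the 30-year and 30-day ranges, and the helper name pearson_r are assumptions made for illustration, not anything from a real study.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Multiplicative case: a hypothetical $1000 deposit growing 5% per year.
years = list(range(0, 31))
balance = [1000 * 1.05 ** t for t in years]

# Additive case: $10 per hour, 8 hours per day, so $80 per extra day.
days = list(range(1, 31))
wages = [80 * d for d in days]

r_balance_raw = pearson_r(years, balance)
r_balance_log = pearson_r(years, [math.log(b) for b in balance])
r_wages_raw = pearson_r(days, wages)
r_wages_log = pearson_r(days, [math.log(w) for w in wages])

print("balance vs years:     ", round(r_balance_raw, 4))
print("log(balance) vs years:", round(r_balance_log, 4))
print("wages vs days:        ", round(r_wages_raw, 4))
print("log(wages) vs days:   ", round(r_wages_log, 4))
```

In each case the "right" scale gives a correlation of 1 (up to rounding), and the "wrong" scale gives something lower.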

So sometimes “salary” is the better choice, and sometimes “log(salary)” is the better choice. Which to choose depends on whether the relationship is additive or multiplicative. Does adding one X change salary by a fixed number? If so, don’t use the log. Does adding one X change salary by a certain proportion? If so, then the logarithm is necessary.

But why does it have to be one or the other? Isn’t there actually an infinite number of possible relationships between salary and other factors?

Here’s a hypothetical example. Suppose you sell pay-per-use cell phones to high school kids, and you get a commission every time two of your customers call each other. (And assume that every user is equally likely to call any other user.) In that case, the right regression is not log(salary), and it’s not just salary. A better choice is to use the square root of salary. That’s because, as the number of users gets large, the number of calls starts to become almost proportional to the square of the number of users.
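A quick sketch of the cell phone case, assuming the commission is proportional to the number of distinct pairs of customers, n(n-1)/2. The range of 2 to 200 users and the helper name pearson_r are arbitrary choices for illustration.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

users = list(range(2, 201))
# Number of distinct pairs of customers who could call each other.
pairs = [n * (n - 1) // 2 for n in users]

r_raw = pearson_r(users, pairs)
r_sqrt = pearson_r(users, [math.sqrt(p) for p in pairs])
r_log = pearson_r(users, [math.log(p) for p in pairs])

print("pairs vs users:      ", round(r_raw, 5))
print("sqrt(pairs) vs users:", round(r_sqrt, 5))
print("log(pairs) vs users: ", round(r_log, 5))
```

The square root comes out closest to a straight line, which is the point of the example: neither the raw value nor the log is the natural scale here.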

And what do you do if some of the independent variables have an arithmetic relationship but others have a geometric one? Suppose you want to figure the impact of hiring assistants for a car salesman. Some assistants scout for clients and bring in 5 new prospects a day. This increases the salesman’s commission arithmetically. Other assistants help the salesman close, and each assistant increases the probability of closing by 1%. This increases the salesman’s commission geometrically. Which means that the effect of assistants on salary is neither arithmetic nor geometric.

So what should you use, salary or log(salary)? My guess is that researchers would try both and see which works better, and use that one. This seems a completely reasonable way to decide.

But whether you use salary or log(salary), you’re probably going to be somewhat wrong. Baseball studies that I’ve seen all use log(salary) – but it’s very unlikely that any performance measure actually does work exactly geometrically, at all points on the salary scale. Suppose hitting an extra 5 home runs increases pay by 10%. Does it necessarily follow that hitting an extra 10 home runs should increase pay by 21% (10% on top of 10% compounded)? Or that hitting an extra 15 home runs should increase pay by exactly 33.1%? Who says it has to work that way? Shouldn’t you have to show it?
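To make the compounding explicit, here is a small sketch of what each model implies if 5 extra home runs are worth a 10% raise. The function names are made up for illustration.

```python
def log_model_increase(extra_hr, pct_per_5_hr=0.10):
    """Raise implied by a log(salary) model: the 10% compounds per 5 HR."""
    return (1 + pct_per_5_hr) ** (extra_hr / 5) - 1

def linear_model_increase(extra_hr, pct_per_5_hr=0.10):
    """Raise implied by a plain salary model: the 10% just adds up."""
    return pct_per_5_hr * (extra_hr / 5)

for extra_hr in (5, 10, 15):
    print(
        extra_hr, "extra HR:",
        "log model", round(100 * log_model_increase(extra_hr), 1), "%,",
        "linear model", round(100 * linear_model_increase(extra_hr), 1), "%",
    )
```

The log model gives 10%, 21%, and 33.1%; the linear model gives 10%, 20%, and 30%. The gap is small over this range and grows quickly as the inputs get further from typical values.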

What you can do is show that a regression appears to give reasonable results – that Barry Bonds doesn’t come out to be worth a billion dollars a year. But that’s far from proving that log(salary) is as accurate as it needs to be.

And, of course, there are many different performance measures, all of which differ in some way. Even the most accepted measurements can have significant differences. If log(salary) is proportional to linear weights batting runs, it cannot also be proportional to VORP. If log(salary) is proportional to VORP, then it cannot be proportional to Win Shares. If log(salary) is proportional to Win Shares, it cannot also be proportional to RC27. It can be roughly proportional to all of them at the same time, of course, but not exactly proportional. But what is the size of the “roughly” error? It can be pretty substantial. I’d bet that at the extremes, an increase of 10% in Linear Weights would easily be a 20% difference in RC27.

Which brings me, finally, to the point of all this. The academic studies I’ve seen that do this kind of thing are all very careful about the detailed calculations they do on the independent variables. They’ll adjust for season, they’ll adjust for DH. They won’t hesitate to criticize other studies for using offensive measure X when measure Y has a higher correlation to runs scored. They’ll add lots of indicator variables for all kinds of things, like managers and parks, just to make sure they’ve considered everything reasonable.

But isn’t that kind of overkill, to try to make sure all the independent variables are correct to three decimal places, when the results might be so much more dependent on which function of salary you take as the dependent variable? Your results are only as valid as your weakest link. And the salary function, to me, seems like a pretty weak link.

At Monday, October 16, 2006 6:32:00 PM,  Arb said...

Actually, you're wrong about why these studies use log(salary). If you don't use log(salary), your residuals are not normal, and therefore you can't do inference.

At Monday, October 16, 2006 6:41:00 PM,  Phil Birnbaum said...

Ah, that makes sense ... thanks!

Wouldn't log(salary) have to be a good fit for the real-life phenomenon in order for the residuals to be normal? That is, is it true that if you ran wage vs. hours worked, you wouldn't get normal residuals with log(salary), but you would with just plain salary?

That is, would testing to see if the residuals are normal be one way to see if log(salary) is the right function?

And, now that I think about it, taking it one step further ... could you run a bunch of regressions using different functions of salary -- salary, log(salary), root(salary), and so on -- and choose the one that gave you the "normalest" residuals?

Apologies if I'm over my head here and spouting nonsense.
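A rough sketch of the "normalest residuals" idea from the comment above. Everything here is an assumption for illustration: the data are simulated so that log(salary) really is linear in x with normal errors, and residual skewness is used as a crude stand-in for a proper normality test.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_skewness(xs, ys):
    """Skewness of the OLS residuals (zero for a symmetric distribution)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    n = len(residuals)
    # OLS residuals with an intercept have mean zero, so no centering needed.
    sd = math.sqrt(sum(r * r for r in residuals) / n)
    return sum((r / sd) ** 3 for r in residuals) / n

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(2000)]
# Simulated salaries: log(salary) is linear in x, with normal errors.
salary = [math.exp(13 + 0.2 * x + rng.gauss(0, 0.5)) for x in xs]

transforms = {
    "salary": lambda s: s,
    "sqrt(salary)": math.sqrt,
    "log(salary)": math.log,
}
skew = {
    name: residual_skewness(xs, [f(s) for s in salary])
    for name, f in transforms.items()
}
best = min(skew, key=lambda name: abs(skew[name]))

for name, value in skew.items():
    print(name, round(value, 3))
print("least skewed residuals:", best)
```

On data built this way the log transform wins, as it should. Whether the same procedure picks the best-fitting function on real salary data is a separate question.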

At Monday, October 16, 2006 8:55:00 PM,  JavaGeek said...

The problem, of course, with doing a regression on log(salary) is that you get:
log(salary) = a + b*x + ...
or
salary = exp(a + b*x)
with a +/- c
and b +/- d.
With an error of just 1% in the a and b estimates, you can get
a salary of:
4,500,000 +/- 700,000 (or 16% error)...
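The arithmetic in the comment above can be reproduced under one reading of it, which is an assumption: that a and b are both off by 1% in the same direction, so the fitted value a + b*x is off by 1% as well.

```python
import math

predicted_salary = 4_500_000
log_salary = math.log(predicted_salary)   # the fitted a + b*x, a bit over 15

# Assumption: a and b both off by 1%, so a + b*x is off by 1% too.
log_error = 0.01 * log_salary

high = math.exp(log_salary + log_error)
low = math.exp(log_salary - log_error)

print("fitted log(salary):", round(log_salary, 3))
print("high estimate:", round(high), "(+", round(100 * (high / predicted_salary - 1), 1), "%)")
print("low estimate: ", round(low), "(-", round(100 * (1 - low / predicted_salary), 1), "%)")
```

Because log(salary) is a number around 15, a 1% error there is about 0.15 in log units, and exponentiating turns that into a swing of roughly 14% to 17% in dollars, in line with the 16% quoted.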

At Monday, October 16, 2006 9:04:00 PM,  Phil Birnbaum said...

Javageek,

You're right ... I'll have to think about that a bit. Wow.

At Monday, October 16, 2006 11:22:00 PM,  Phil Birnbaum said...

OK, I think I get what arb is trying to tell me. I think my last comment is wrong.

Suppose that the errors are roughly proportional to the salaries, instead of fixed. So the standard error on a salary of \$1 million might be \$50,000 (5%), but the standard error on a salary of \$10 million might be \$500,000 – still 5%.

In that case, if your regression uses salary, instead of log(salary), your errors won’t be normal, because they’ll have a very, very long tail for the high values, and a shorter tail for the small values. And, as arb points out, if the errors aren’t normal, you can’t do inference (such as confidence intervals, etc.).

If you use log(salary), the errors now have constant standard error – in the 5% case, .049. Whether the log(salary) is log(1 million) or log(10 million), the standard error is log(1.05) for each observation. This means the residuals are normal, and you can proceed with tests.
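A two-line check of that claim, using natural logs. The function name log_spread is just for illustration.

```python
import math

def log_spread(salary, pct=0.05):
    """Size of a +pct error once salary is put on the log scale."""
    return math.log(salary * (1 + pct)) - math.log(salary)

for salary in (1_000_000, 10_000_000):
    dollars = salary * 0.05
    print(salary, "-> error in dollars:", round(dollars),
          "| error in log units:", round(log_spread(salary), 4))
```

The dollar error grows tenfold between the two salaries, while the log-scale error stays at log(1.05), about .049, for both.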

However, the fact that the residuals are normal doesn’t mean you’ve got the best fitting model. Suppose we know the “true” formula for wages is

Wages = \$10 * hours + e

But the e’s aren’t normal: each e is proportional to the total wages. This isn’t that farfetched. Imagine the paymaster just estimates how large a stack of bills is necessary to pay each worker. The \$1000 worker is probably within \$20 or so, but the \$100,000 worker’s stack is probably off by a standard error of at least \$2000. In this case, the errors are "better" in the log case, but the estimate of pay to hours is worse in the log case. Using log(wages) helps the inferences, but makes all the actual estimates biased.

To test that, here’s what I did. I set up a bunch of players with OPS normally distributed around .800. I assumed salary is directly proportional to OPS (no log), but that the errors were normal only in the log sense. That is, I chose normally distributed errors, but instead of adding them to the right side of the equation, I added them to 1 and *multiplied* them:

Salary = OPS * 1,000,000 * (1+e)

Then I did two regressions: one using salary, and one using log(salary).

What happened? The regression using log(salary) had a higher r^2 and normal residuals. Looks good, right? But the estimates are off.

We know that a player with an OPS of 1.000 should be making a million dollars a year. The salary model (without the log) predicts about \$999,xxx – very close. But the log(salary) model predicts \$1,019,783 – almost 2% too high.

What you gain in the ability to do inference, you lose to bias.
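Here is a sketch of that experiment for anyone who wants to try it. The number of players, the spread of OPS (SD .100), and the size of the multiplicative error (SD 5%) are assumptions, since the comment doesn't give them, so the exact dollar figures will not match the ones quoted.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(42)
n_players = 10_000

# Assumed: OPS is normal around .800 with SD .100; errors have SD 5%.
ops = [rng.gauss(0.800, 0.100) for _ in range(n_players)]
# Salary = OPS * 1,000,000 * (1 + e): linear in OPS, multiplicative error.
salary = [x * 1_000_000 * (1 + rng.gauss(0, 0.05)) for x in ops]

# Regression 1: salary on OPS.
slope, intercept = fit_line(ops, salary)
pred_linear = intercept + slope * 1.000

# Regression 2: log(salary) on OPS, converted back to dollars.
slope_log, intercept_log = fit_line(ops, [math.log(s) for s in salary])
pred_log = math.exp(intercept_log + slope_log * 1.000)

print("true salary at OPS 1.000: 1000000")
print("salary model predicts:     ", round(pred_linear))
print("log(salary) model predicts:", round(pred_log))
```

With these settings the plain salary model lands very close to a million, and the log(salary) model comes out a couple of percent high. Most of that gap comes from fitting a straight line in OPS to what is really log(OPS).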

Is there a regression method that handles this situation – a linear relationship with errors that are non-linear in this particular specific way? It could be that there is one, and that’s what all these academic studies are actually using. If that’s the case, my post is wrong to complain about this problem.

And even if not, I have to admit that the error in using log(wages) instead of wages is small – much smaller than I expected. It could be that this problem is known, but its effects are minimal.

Or, as always, I could be wrong about all this. My knowledge of these kinds of econometric methods is very small compared to practitioners like arb.

At Tuesday, October 17, 2006 11:02:00 AM,  Guy said...

I have a hard time making sense of the coefficients you get using log(salary) as dependent variable. Let's use Bradbury's DIPS paper cited in the other post. His average coefficient for strikeouts is .128. So the difference between striking out 5 hitters/game and 8 hitters/game would mean multiplying the salary by about 2.5, a 150% increase, if I'm interpreting this correctly. Now, for a pitcher whose other skills make him a \$2M pitcher, the extra Ks are worth \$3M, but for a \$6M pitcher (sans Ks) the extra Ks mean a \$9M bonus. That doesn't make a lot of sense.

The real value of 3 extra Ks per game is about .70 in ERA, or 2 wins a season, so probably something like \$5-6M for a free agent.

I'm not saying the marginal price of wins is constant. I think high-win players probably do get paid more on a per-win basis. But if I had to choose, a linear relationship is probably closer to reality than what this model is giving us.

At Tuesday, October 17, 2006 11:13:00 AM,  Phil Birnbaum said...

Guy,

I'm getting a 47% increase on 3 extra strikeouts ... e to the (.128 x 3) equals 1.468, so a 46.8% increase. Am I doing it wrong?
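For anyone checking along, the calculation looks like this. The \$2M and \$6M base salaries are the ones from Guy's example; the variable names are made up.

```python
import math

coefficient = 0.128        # log(salary) per strikeout per game
extra_strikeouts = 3       # going from 5 K/G to 8 K/G

multiplier = math.exp(coefficient * extra_strikeouts)
print("salary multiplier:", round(multiplier, 3))

# Dollar value of the same 3 strikeouts at two different base salaries.
for base in (2_000_000, 6_000_000):
    print("base", base, "-> extra", round(base * (multiplier - 1)))
```

The same three strikeouts are worth a little under \$1M to the \$2M pitcher and a little under \$3M to the \$6M pitcher, which is the multiplicative behavior Guy is questioning.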

At Tuesday, October 17, 2006 11:30:00 AM,  Guy said...

You're right, Phil, left out a ")" there. So it's essentially a 50% increase. But the point stands, I think: a 50% increase is far too low for most pitchers (5 K/G is below avg, 8 K/G is an all-star), perhaps about right for an otherwise-outstanding pitcher. It just seems like you'd be better off with a linear relationship, or a much less radical adjustment than Log(sal) gives you.

On a separate point, I think including all 3 classes of players (non-arb, arb-elig, and FAs) is hugely problematic. In the first group there is essentially no measurable relationship btwn performance and pay -- they should be excluded. Whether the other two can be combined with use of a dummy variable, or need to be in separate regressions, I'm not sure.

At Tuesday, October 17, 2006 11:49:00 AM,  Phil Birnbaum said...

Guy,

Agreed. I'd vote to make them separate regressions -- there's no guarantee that what gives you a 20% pay increase as a free agent would give you the same 20% pay increase in arbitration. And if you use a dummy variable, that's what you're forcing.

Also, I'm not really sure that arbitration results should be considered anyway ... since arbitration salaries aren't determined by the free market, they don't tell you much about GMs' decision making. They give you an estimate of what the arbitrator is going to decide about the player's value, not about what GMs believe about a player's value.

(For the record, I should say that my point about log(salary) wasn't meant to comment on this study in particular, but, rather, on the general practice. Everyone does it, for the reason arb gave, and I'm just wondering if there are other costs to it.)

At Tuesday, October 17, 2006 12:39:00 PM,  Guy said...

Right, it applies to other studies as well. For example, this paper by Hakes and Sauer uses a similar methodology: http://hubcap.clemson.edu/~sauerr/working/moneyball-v2.pdf

In this case, we see similar problems: 150 points of SLG translates into only a 43% salary increase (and the authors speculate that the coefficient for SLG suggests it is overvalued by the market). But 150 points of SLG is worth about 28 runs or 3 wins from a FT player, or \$8-12M depending on what you think wins cost today. So 43% doesn't really come close.

At Tuesday, October 17, 2006 1:01:00 PM,  Phil Birnbaum said...

Guy,

Good eye. I wonder how much of the apparent discrepancy is caused by using dummy variables for arbitration and free agency instead of just sticking to free agency -- and how much of it is caused by using the log function instead of something more suitable.

BTW, do we have any idea what a good type of function is for turning player offense into salary? Is it reasonable to assume that every win should cost about the same, or should there be some kind of "bonus" to players who create more wins without using up outs?

For instance, in rotisserie, Rickey Henderson is worth more than, say, three other players whose stats sum to his, because with Rickey, you get two replacement-value players in addition. Is it the same in real baseball?

For instance, adding 150 slugging points to Jeff Kent is better than adding 150 slugging points to Neifi Perez, because by adding to Kent you're free to get rid of Perez and get someone decent to replace him. If you add them to Perez, it's much harder to improve your team.

Any suggestions for the way to find out the right answer?

At Tuesday, October 17, 2006 2:58:00 PM,  Tangotiger said...

I use a linear relationship, whether in baseball or back when I was playing fantasy games.

It serves me well. Just do wins above replacement times some constant. That value of the constant is rather easy to figure out, once you've set the replacement level.

Sum(wins above repl) * k = Sum(salary above repl)

To illustrate, if the wins above repl is about 1000 wins, and the payroll above repl is 2 billion\$, then your marginal \$/marginal win is 2 million\$/win.
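The same calculation as a short sketch, using the round numbers from the comment above. The function name is made up for illustration.

```python
total_wins_above_repl = 1000
total_payroll_above_repl = 2_000_000_000

# k = Sum(salary above repl) / Sum(wins above repl)
k = total_payroll_above_repl / total_wins_above_repl

def salary_above_repl(wins_above_repl):
    """Linear valuation: every marginal win costs the same k dollars."""
    return k * wins_above_repl

print("dollars per marginal win:", round(k))
print("two +4 players:      ", round(salary_above_repl(4) + salary_above_repl(4)))
print("a +7 and a +1 player:", round(salary_above_repl(7) + salary_above_repl(1)))
```

Under a linear model the two pairs of players cost the same in total, because only the sum of wins matters.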

I know some like to use a rising value for marginal\$/marginal win. I'm not convinced that should happen, as I prefer to pay two +4 wins above repl guys the same as a +7, +1.

At Tuesday, October 17, 2006 6:26:00 PM,  David Gassko said...

Here's a post I made on The Book Blog. I think it's somewhat pertinent:

Player valuation is an extremely tricky subject. I’ve been looking at it quite a bit for the past year, and besides some articles on specific contracts, I’ve never published any kind of article on the topic because I’m still not at the point where I can be sure of anything. Nate Silver has done some good stuff in this area as well, with MORP and his chapter in Baseball Between the Numbers. Still, even Nate hasn’t scratched the surface.

I will make three points, however:

(1) Wins in the free agent market are extremely expensive—\$3.5 to \$4 million per marginal win. Studes has documented this in detail on the Hardball Times website and Annual(s). This suggests that teams are highly over-valuing the worth of a marginal win (which is about \$2.5 million).

But that’s wrong, I think. There are a few things that come into play. One, it may be that pre-FA players are severely underpaid (Neifi Perez makes four times what Miguel Cabrera does). So owners have lots of extra cash to spend on free agents. You can’t make the playoffs with a team full of replacement-level players, and as Nate documented, the playoffs are where the real money is (and again, his research undervalues marginal wins because he used WARP).

There’s more than just that, though. One thing Nate did not look at, and I do (or will—when I publish something) include in my analysis is the increasing value of the team itself, which goes way beyond extra revenues. For example, Forbes says that the Yankees lost \$50 million last year, but their VALUE grew \$76 million. The Twins made \$7 million, and their value grew \$28 million. So while traditional analysis might say that the Twins did \$57 million better than the Yankees, they actually only did \$9 million better. And that’s with the Yankees sporting a \$200+ million payroll. Across town, the Mets lost *just* \$16 million, and their value increased by \$99 million (though that number is higher than it would be in a normal year because of their new cable network). The Red Sox lost \$18.5 million but the value of the team grew \$54 million.
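The Forbes figures quoted above can be tallied in a few lines. The dictionary layout and variable names are just for illustration; all amounts are in millions of dollars as given in the paragraph.

```python
# (operating income, growth in team value), in millions of dollars
teams = {
    "Yankees": (-50.0, 76.0),
    "Twins": (7.0, 28.0),
    "Mets": (-16.0, 99.0),
    "Red Sox": (-18.5, 54.0),
}

# Total return to the owner = operating income + growth in team value.
total_return = {name: income + growth for name, (income, growth) in teams.items()}

for name, value in total_return.items():
    print(name, value)

income_gap = teams["Twins"][0] - teams["Yankees"][0]
total_gap = total_return["Twins"] - total_return["Yankees"]
print("Twins minus Yankees, income only:", income_gap)
print("Twins minus Yankees, with value growth:", total_gap)
```

On income alone the Twins come out \$57 million ahead of the Yankees; once growth in team value is added, the gap shrinks to \$9 million, and all four teams show a positive total.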

But that’s still not all. Once you factor in growing team values, you’ll see that all owners make money, pretty much. So owning a baseball team—even a poorly-managed one—is a cash cow. (And this isn’t even mentioning all the tax loopholes that owning a team opens up.) So the longer you can own a team, the better. But guess what? Owners of bad teams are run out quickly. Part of it might be that they just want to make a quick buck (though I doubt it), but part of it is certainly public pressure (and maybe that REALLY bad teams do end up losing money because eventually the fans don’t show up). Teams like the Royals or Pirates are constantly on the selling block—Steinbrenner has owned the Yankees for decades.

So if a team is guaranteed to make money no matter how much you spend (pretty much), it is in an owner’s best interests to hang on to it for as long as possible. That further increases the value of each marginal win. All of these things need to be taken into account when appraising the value of a marginal win.

(2) Player value is non-linear. Both Nate and I have documented this. It’s much more expensive to get a 4 WARP player than it is to get a 2 WARP player, but a team of 2 WARP players will only be about average. So you need the 4 WARP guys to make the playoffs, and unless you can bring them up through your minor league system (which is also expensive, BTW), you’re going to need to overpay them.

(3) Inflation in MLB is running at about 10% per year. So if \$10 million buys you four marginal wins today, five years from now it’ll buy you about 2.5. That’s why large long-term contracts are not nearly as bad as they look. If you sign a 4 WARP player for 5 years/\$50 million, and he declines by .4 wins a year, he’ll be worth his salary in the last year and not overpaid by \$4 million as most would think.
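A sketch of the arithmetic behind point (3). The timing conventions are assumptions: the player is taken to produce 4 wins in year one and .4 fewer each year after, and the price of a win is taken to have risen for four years by the start of year five.

```python
inflation = 0.10
price_per_win_today = 10_000_000 / 4          # $2.5M per marginal win

# How many wins $10M buys after five years of 10% inflation.
wins_per_10m_in_5_years = 10_000_000 / (price_per_win_today * (1 + inflation) ** 5)
print("wins per $10M in five years:", round(wins_per_10m_in_5_years, 2))

# A 5-year, $50M contract: $10M per year for a declining 4 WARP player.
annual_salary = 50_000_000 / 5
wins_in_final_year = 4 - 0.4 * 4              # four declines by year five

# Naive view: value the final year at today's price per win.
value_flat = wins_in_final_year * price_per_win_today
flat_shortfall = annual_salary - value_flat

# Inflation view: price per win has grown for four years by year five.
value_inflated = wins_in_final_year * price_per_win_today * (1 + inflation) ** 4

print("final-year value at today's prices:", round(value_flat))
print("apparent overpayment:", round(flat_shortfall))
print("final-year value with inflation:", round(value_inflated))
```

At today's prices the final year looks like a \$4 million overpayment. With inflation applied, the final-year value moves most of the way back toward the \$10 million salary, and applying a fifth year of inflation would close nearly all of the remaining gap.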

Of course, you also have to ask whether those wins could be acquired more cheaply. Making money isn’t the same thing as maximizing profits, because the opportunity cost might be greater than what you’re making. This has to do with the difference between absolute and comparative advantages and a lot of complex economics that I’m not going to explain, nor am I particularly qualified to talk about. Economists solve these kinds of problems for a living, so let’s not make it sound simpler than it is.

Seriously, Nate’s chapter was a nice start, but a whole book could be written about this. It’s on my to-do list as part of a larger GM handbook—sort of like The Book for general managers. As Mickey, Tom, and Andy can attest, these things take a long time to do.

Told you it wasn’t easy… :)