On predicting log(salary)
It’s common in the economics literature to run studies trying to predict how various factors affect a person’s salary. When those studies run regressions, they don’t try to predict salary – they try to predict the logarithm of the salary.
The reason you’d use log(salary) is that you expect an arithmetic (additive) change in the inputs to produce a geometric (multiplicative) change in the output. For instance, leaving your money in a 5% savings account for one extra year (adding 1 to the number of years) increases your balance by five percent (multiplying by 1.05). Because of that, if you were to run a regression to predict log(balance) from years, you’d get a perfect correlation of 1.00. But if you ran the regression on just balance, instead of log(balance), it wouldn’t work as well.
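Here’s a quick sanity check of the savings-account example – a minimal sketch, with a made-up $1,000 principal. Since log(balance) is exactly linear in years, its correlation with years comes out to a perfect 1.00, while the raw balance falls a little short:

```python
# Savings-account sketch: balance grows geometrically with years,
# so log(balance) is perfectly linear in years.
# (Hypothetical numbers: $1,000 principal, 5% interest.)
import math

years = list(range(0, 21))
balances = [1000 * 1.05 ** t for t in years]
logs = [math.log(b) for b in balances]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

print(round(pearson(years, logs), 4))      # log(balance) vs years: 1.0
print(round(pearson(years, balances), 4))  # raw balance vs years: high, but below 1.0
```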
On the other hand, the relationship between wages and hours worked is just the opposite. If you work an extra eight-hour day (at, say, $10 per hour), you’re going to increase your wages by a fixed $80. It’s additive, not multiplicative – add one day, add $80. In this case, using just plain “wages” would give the perfect correlation, and using log(wages) would be the less accurate method.
So sometimes “salary” is the better choice, and sometimes “log(salary)” is the better choice. Which to choose depends on whether the relationship is additive or multiplicative. Does adding one X change salary by a fixed number? If so, don’t use the log. Does adding one X change salary by a certain proportion? If so, then the logarithm is necessary.
But why does it have to be one or the other? Isn’t there actually an infinite number of possible relationships between salary and other factors?
Here’s a hypothetical example. Suppose you sell pay-per-use cell phones to high school kids, and you get a commission every time two of your customers call each other. (And assume that every user is equally likely to call any other user.) In that case, the right regression is not log(salary), and it’s not just salary. A better choice is to use the square root of salary. That’s because, with n users, there are n(n-1)/2 possible calling pairs, so as the number of users gets large, the number of calls becomes almost proportional to the square of the number of users.
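A toy check of the cell-phone example (the $0.10 commission per calling pair is an invented number): if commission is proportional to the number of pairs, then sqrt(commission) divided by the user count settles down to a constant as the user base grows – exactly the behavior a square-root transformation wants.

```python
# With n users there are n*(n-1)//2 possible calling pairs, which
# approaches n^2/2 for large n. So sqrt(commission) grows almost
# linearly with the user count.
import math

rate = 0.10  # hypothetical commission per calling pair
for n in (10, 100, 1000, 10000):
    pairs = n * (n - 1) // 2
    commission = rate * pairs
    print(n, round(math.sqrt(commission) / n, 4))  # ratio settles near sqrt(rate / 2)
```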
And what do you do if some of the independent variables have an arithmetic relationship but others have a geometric one? Suppose you want to figure out the impact of hiring assistants for a car salesman. Some assistants scout for clients and bring in 5 new prospects a day. This increases the salesman’s commission arithmetically. Other assistants help the salesman close, and each one increases the probability of closing by 1% – that is, multiplies it by 1.01. This increases the salesman’s commission geometrically. Which means that the effect of assistants on salary is neither arithmetic nor geometric.
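A toy model of that mixed case, with entirely invented numbers (20 base prospects, a 10% base closing rate, a $500 payout per sale, and reading “increases the probability of closing by 1%” as multiplying the rate by 1.01):

```python
# Mixed arithmetic/geometric commission model (all numbers invented).
base_prospects, payout, base_rate = 20, 500.0, 0.10

def daily_commission(scouts, closers):
    prospects = base_prospects + 5 * scouts   # scouting: arithmetic effect
    rate = base_rate * 1.01 ** closers        # closing help: geometric effect
    return prospects * rate * payout

# Each scout adds a fixed dollar amount; each closer multiplies the total.
# The combined effect of "assistants" is neither purely additive nor
# purely multiplicative.
print(daily_commission(0, 0), daily_commission(1, 0), daily_commission(0, 1))
```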
So what should you use, salary or log(salary)? My guess is that researchers would try both and see which works better, and use that one. This seems a completely reasonable way to decide.
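Here’s a rough sketch of what “try both” might look like, with entirely invented numbers: simulate salaries that respond multiplicatively to home runs (each home run multiplies pay by a made-up 1.05, with some noise), then see which form of salary lines up more closely with performance.

```python
# "Try both and see which works better": on multiplicative data,
# log(salary) should correlate more strongly with performance than
# raw salary does. (All numbers here are invented.)
import math
import random

random.seed(0)
hr = [random.uniform(0, 50) for _ in range(200)]  # a performance measure
salary = [50000 * 1.05 ** x * random.lognormvariate(0, 0.1) for x in hr]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

r_raw = pearson(hr, salary)
r_log = pearson(hr, [math.log(s) for s in salary])
print(round(r_raw, 3), round(r_log, 3))  # here, the log form fits better
```

Of course, on data generated additively, plain salary would win the same comparison – which is the whole point.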
But whether you use salary or log(salary), you’re probably going to be somewhat wrong. Baseball studies that I’ve seen all use log(salary) – but it’s very unlikely that any performance measure actually does work exactly geometrically, at all points on the salary scale. Suppose hitting an extra 5 home runs increases pay by 10%. Does it necessarily follow that hitting an extra 10 home runs should increase pay by 21% (10% on top of 10% compounded)? Or that hitting an extra 15 home runs should increase pay by exactly 33.1%? Who says it has to work that way? Shouldn’t you have to show it?
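The compounding arithmetic above is easy to verify – if every 5 extra home runs multiplies pay by 1.10 (the hypothetical figure from the text, not a real estimate), the percentages stack like this:

```python
# Compounding a 10% raise per 5 extra home runs.
for extra_hr in (5, 10, 15):
    factor = 1.10 ** (extra_hr // 5)
    print(f"{extra_hr} extra HR -> +{(factor - 1) * 100:.1f}%")
# prints +10.0%, +21.0%, +33.1%
```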
What you can do is show that a regression appears to give reasonable results – that Barry Bonds doesn’t come out to be worth a billion dollars a year. But that’s far from proving that log(salary) is as accurate as it needs to be.
And, of course, there are many different performance measures, all of which differ in some way. Even the most accepted measurements can have significant differences. If log(salary) is proportional to Linear Weights batting runs, it cannot also be proportional to VORP. If log(salary) is proportional to VORP, then it cannot be proportional to Win Shares. If log(salary) is proportional to Win Shares, it cannot also be proportional to RC27. It can be roughly proportional to all of them at the same time, of course, but not exactly proportional. But what is the size of the “roughly” error? It can be pretty substantial. I’d bet that at the extremes, an increase of 10% in Linear Weights would easily be a 20% difference in RC27.
Which brings me, finally, to the point of all this. The academic studies I’ve seen that do this kind of thing are all very careful about the detailed calculations they do on the independent variables. They’ll adjust for season, they’ll adjust for DH. They won’t hesitate to criticize other studies for using offensive measure X when measure Y has a higher correlation to runs scored. They’ll add lots of indicator variables for all kinds of things, like managers and parks, just to make sure they’ve considered everything reasonable.
But isn’t that kind of overkill, to try to make sure all the independent variables are correct to three decimal places, when the results might be so much more dependent on which function of salary you take as the dependent variable? Your results are only as valid as your weakest link. And the salary function, to me, seems like a pretty weak link.