## Tuesday, December 11, 2012

### Linear regressions cause problems when reality isn't linear

"Moneyball" popularized the idea that in baseball, walks are undervalued.  I believe, however, that baseball insiders didn't learn from the book, and that they continued to overvalue batting average.

So I did a study.

I took every player-season from 2005 to 2011 (minimum 50 AB), and ran a regression to predict playing time.  I used RC27 (runs created per 27 outs) as a proxy for skill; it's a fairly accurate measure of a batter's performance, and you'd think that better batters would get more plate appearances.

Then, I added a dummy variable for batting average.*  (*Update: Oops!  It's not really a dummy, as Alex points out in the comments.  Dummy variables are binary 1/0 indicators, not linear ones like BA.)

If managers are evaluating players correctly, the dummy coefficient should be zero.  After all, if a player creates 5.00 runs per game, it shouldn't matter whether he does it with walks, or power, or batting average.  If it does matter, it must be that managers aren't evaluating offense properly.

It turned out that the dummy variable was very significant, at 10 SD.  Every additional point of batting average was related to an extra 0.93 plate appearances for that player.
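For anyone who wants to try something similar, here is a minimal sketch of that kind of model in Python with NumPy. It is not the code used for the study; the function name `fit_playing_time`, the variable names, and the synthetic numbers are all made up for illustration. Batting average is expressed in points (.280 becomes 280) so that the coefficient reads as "PA per point."

```python
import numpy as np

def fit_playing_time(rc27, ba_points, pa):
    """Ordinary least squares fit of PA = b0 + b1*RC27 + b2*BA.

    ba_points is batting average in points (.280 -> 280).
    Returns (coefficients, residuals), residuals = actual - predicted.
    """
    rc27 = np.asarray(rc27, dtype=float)
    ba_points = np.asarray(ba_points, dtype=float)
    pa = np.asarray(pa, dtype=float)
    # Design matrix: intercept column, RC27, batting average.
    X = np.column_stack([np.ones_like(rc27), rc27, ba_points])
    coefs, *_ = np.linalg.lstsq(X, pa, rcond=None)
    residuals = pa - X @ coefs
    return coefs, residuals

# Synthetic, exactly linear data (hypothetical numbers, not real players),
# used only to check that the fit recovers known coefficients.
rng = np.random.default_rng(1)
rc27 = rng.uniform(1.0, 8.0, size=200)
ba_points = rng.uniform(180.0, 330.0, size=200)
pa = 120.0 + 60.0 * rc27 + 0.5 * ba_points
coefs, residuals = fit_playing_time(rc27, ba_points, pa)
```

Here `coefs[2]` is the batting average coefficient, the number that would be zero if only RC27 mattered and the linear model were right.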

This suggests that managers are indeed overvaluing batting average.

------

I'm not making this up; I actually did this regression, and got that result.  But, I don't believe the conclusion.

Why not?  Mainly, because I don't believe that playing time is linearly related to RC27.  Past a certain point of good performance, you're not going to get any additional plate appearances, because you're already a full-time player.  And, at the bottom, you're not going to get less than zero, no matter how bad you are.  And you'll probably get roughly the same playing time whether you're at .050 or .090 -- either way, the estimate of your future performance isn't much different.

Taking that and oversimplifying, I think you'd expect an S-curve, not a straight line.  For instance, just guessing, playing time might be low and horizontal from (say) 0.00 to 2.00, then sloping up from 2.00 to 6.00, then high and horizontal from 6.00 to infinity.
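That guessed-at shape can be written down as a clipped linear ramp. The sketch below uses the 2.00 and 6.00 breakpoints from the guess above; the floor and ceiling of 50 and 650 PA are purely hypothetical values chosen for illustration, and `expected_pa` is a made-up name.

```python
import numpy as np

def expected_pa(rc27, low=2.0, high=6.0, min_pa=50.0, max_pa=650.0):
    """Oversimplified 'S-curve' for playing time.

    Flat at min_pa below `low`, rising linearly between `low` and `high`,
    flat at max_pa above `high`. min_pa and max_pa are assumed values.
    """
    rc27 = np.asarray(rc27, dtype=float)
    # Fraction of the way up the ramp, clipped to [0, 1] outside it.
    frac = np.clip((rc27 - low) / (high - low), 0.0, 1.0)
    return min_pa + frac * (max_pa - min_pa)

# Expected PA at a few RC27 levels, from hopeless to superstar.
curve = expected_pa([0.0, 1.0, 2.0, 4.0, 6.0, 9.0])
```

A straight line fitted through points generated this way has to overshoot at one end and undershoot at the other, which is the whole problem.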

I'd argue that problem is serious enough that we shouldn't trust the results.

------

Now, I'm pretty sure my objection wouldn't necessarily change any minds.  There are always lots of skeptics for any paper, saying, "hey, it could be this," or, "it could be that."  I may be sure my argument is valid, but, it's not obvious that it's valid, or that it's important enough an objection to dismiss the paper's conclusions.  Furthermore, most readers wouldn't have the time or inclination to follow my objections.  They'd think, well, the paper passed peer review, and if it's wrong, another paper will come around later with other evidence.

And, so, generally, people will believe the paper's conclusion to be true.  In fact, even as you read this, you might think I'm just nitpicking, and that the regression did, in fact, find something real.

------

So, let me try to convince you, and then I'll get to my "real" point.

There's a crucial assumption in linear regressions, that estimation errors are random and unbiased.  That means, when the regression tries to predict playing time based on the other variables, we should expect a positive error (player got more PA than expected) about as often as a negative error (player got fewer PA than expected).  That should be the case regardless of the values of the Xs -- that is, regardless of how well the player batted.

But that didn't happen here.  Here's a graph of the regression's errors (residuals), plotted against the quality of the player:

[Graph: regression residuals plotted against the quality of the player]

That's not random.

The really bad players got a lot more PA than the prediction, and the really good players fewer.  That's exactly what you'd get by my "S-shaped" hypothesis.

How big are the errors?  Huge.  For the worst 100 batters, the average actual playing time was 48 PA higher than the estimate.  For the best 100 batters, the average was 165 PA lower than the estimate.

Compare that to the effect we got for batting average.  At 0.93 PA per point, a typical 20-point difference in batting average worked out to about 19 PA.

How can I argue that I've found a real 19 PA effect, when my measuring stick is obviously biased by a lot more than 19 PA?  All it would take, for the effect to be artificial, is for the +20 BA players to be concentrated in the top half of the graph.

That's probably what's happening.  The dots on the far left and right are mostly part-time players, because no full-time player performs that well or that badly.  There are many more part-time players on the left than the right.  And the residuals are biased high on the left.  So, the BA effect is likely just a part-time effect.

But you don't have to buy my explanation.  All you have to do is look at the graph, and see that there are huge biases at the extremes -- biases that are higher in magnitude than the effect we've found.  At that point, you shouldn't need an explanation -- you should just realize that there's something wrong, and that we really shouldn't be drawing any conclusions from this study.

-------

In the real world, these types of regression studies don't normally use sabermetric stats like RC27 and such.  More likely, they'll throw in all kinds of primary stats, like HR, BB, outs, and so on.  And they'll add in other things, like all-star status, and draft position, and whatever else seems to be significant.

But the problem remains.  That's because, in this case, the primary issue isn't the way performance is measured -- it's that the rate of performance is not a linear predictor of playing time.  The model just doesn't work.  You could have the most perfect performance statistic ever, one that's accurate down to the third decimal, and you'd still have the problem.  If it's not linear, it's not linear.

------

Anyway, that was just a very long way of getting to my real point, which is: is there an "automated" test to check for problems like this, where the readers won't have to listen to my argument, and the problem will make itself obvious?

My first thought was: maybe we could see the problem by just looking at the correlation coefficient of that graph.  But we can't.  The correlation is zero!  It always is, when you look at the residuals of a linear regression -- the math makes it work out that way.
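Here's a small demonstration of that, on synthetic data rather than the real player-seasons. The "true" relationship is an assumed flat-ramp-flat curve with made-up noise, and a straight line is fitted to it. The correlation between the predictor and the residuals comes out as zero (up to rounding error), even though the residuals at the two extremes are clearly biased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "skill" values and S-shaped playing time (all numbers assumed).
x = rng.uniform(0.0, 10.0, size=2000)
true_pa = 50.0 + 600.0 * np.clip((x - 2.0) / 4.0, 0.0, 1.0)
y = true_pa + rng.normal(0.0, 40.0, size=x.size)

# Fit a straight line by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Correlation between predictor and residuals: zero by construction.
corr = np.corrcoef(x, residuals)[0, 1]

# Average residual for the 100 lowest and 100 highest x values.
order = np.argsort(x)
bottom_mean = residuals[order[:100]].mean()
top_mean = residuals[order[-100:]].mean()
```

In this setup `bottom_mean` is positive and `top_mean` is negative, the same direction as the biases in the real regression, while `corr` tells you nothing.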

But just because the correlation is zero, doesn't mean there's no bias.  A graph of zero correlation can have all kinds of fancy shapes.  For instance, a symmetrical smiley face pattern has r=.00, and so does a frowny face, and a sine wave.  (See more here.)  Those all are obviously biased for certain Xs.  Only the traditional "cloud" shape is unbiased everywhere.
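The smiley and frowny cases are easy to check numerically. This sketch uses a parabola on a symmetric grid as a stand-in for the "smile" (an assumption about what the shape looks like, nothing more):

```python
import numpy as np

# Symmetric grid of X values, and a centered parabola as the "smiley face".
x = np.linspace(-1.0, 1.0, 201)
smile = x**2 - np.mean(x**2)
frown = -smile

# Correlation with X is zero for both shapes (up to rounding error).
r_smile = np.corrcoef(x, smile)[0, 1]
r_frown = np.corrcoef(x, frown)[0, 1]

# But the average "error" is far from zero at the edges and in the middle.
edge_mean = np.concatenate([smile[:20], smile[-20:]]).mean()
middle_mean = smile[90:111].mean()
```

For the smile, `edge_mean` is positive and `middle_mean` is negative, so the bias depends heavily on X even though r is zero.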

But how do you automatically test for "shape"?  One way I can think of is to examine the extremes of the graph, because that's often where the effect is strongest.

So, I'd suggest, for every variable in your model (other than the dummy you're concerned with):

1.  Show the average error for the rows with the highest values;
2.  Show the average error for the rows with the lowest values;
3.  Show the average error for the middle values.

I used 100 rows for the top and bottom, but anything reasonable is fine; if you don't have a huge dataset, use the top and bottom 25%, or the top and bottom 10%, or whatever.
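The three steps above can be sketched as a small helper. The name `extreme_residual_means` and the default of 100 rows are just illustrative choices; the residuals are assumed to be actual minus predicted, from whatever regression you ran.

```python
import numpy as np

def extreme_residual_means(values, residuals, n_extreme=100):
    """Average residual for the lowest, middle, and highest rows of a variable.

    values    : one explanatory variable, one entry per row
    residuals : regression residuals (actual - predicted) for the same rows
    n_extreme : how many rows to treat as the bottom and as the top
    """
    values = np.asarray(values, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    if 2 * n_extreme >= values.size:
        raise ValueError("n_extreme is too large for this many rows")
    # Row indices sorted from the lowest value of the variable to the highest.
    order = np.argsort(values, kind="stable")
    return {
        "lowest": float(residuals[order[:n_extreme]].mean()),
        "middle": float(residuals[order[n_extreme:-n_extreme]].mean()),
        "highest": float(residuals[order[-n_extreme:]].mean()),
    }

# Tiny made-up example: six rows, two rows at each extreme.
example = extreme_residual_means(
    values=[1, 2, 3, 4, 5, 6],
    residuals=[3.0, 1.0, -1.0, -1.0, -2.0, 0.0],
    n_extreme=2,
)
```

You'd call this once per explanatory variable and compare the three averages to the size of the coefficient you care about.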

Whatever cutoff you choose, it should turn out that those average errors, subject to random variation, can't be too much larger than the effect you think you've found.  Suppose you're interested in a "rookie" dummy variable, and you find a statistically significant coefficient of 15 blorgs.  But then you find that the fastest 10 percent of players are biased by 18 blorgs.  That's probably OK -- for the 18 to cause the 15, you'd need 11/12 of rookies to be in the top 10%, which is unlikely.

On the other hand, if you find a "fast" effect of 115 blorgs, you're in trouble.  Then you'd need only a weak relationship between rookies and speed to cause an effect of 15.  That's quite likely.

So, it's not just the *shape* of the curve: it's the *magnitude* of the worst biases, compared to the effect.  If you find a small effect, and you want to believe it's real, you have to prove that the regression controls for other factors at least that precisely.

-----

(It's true that one existing recommendation is to eyeball the residual graph to spot non-linearity.  But, usually, they suggest looking only at the residuals for the regression as a whole -- errors vs. expected Y.  That would work here, because the bias is strongly related to the expected Y.  But, often, it's not (for instance, defense vs. age when your Y variable is salary).  So, you really do need to look at each variable separately.)

-----

I'd like to see something like this, for a hypothetical study:

"We attempted to find an effect for age, and we found an effect that young players outperformed by 13 runs, significant at 4 SD.  However, examining the residuals for the top 10% of each variable, we found an average of +35 for stolen bases, and +49 for defensive skill.  We have reason to believe that young players are significantly more likely to be faster and better fielders, and therefore we believe our dummy might be evidence of a biased, ill-fitting model, rather than an actual effect."

------

You should do the "10%" thing for every variable -- and then do it for *combinations* of variables that measure similar things, or that might be correlated.  Find the 10% of players most above average in a combination of doubles, home runs, walks -- and see if those are biased.  Find the 10% with a combination of few triples, few stolen bases, and games at catcher.  You have to combine, a bit, because you might not find big enough effects for single factors.  We found big biases for bad performance overall: if you split that into "bad performance for doubles", and "bad performance for triples," and so on, you might find only small biases and miss the big one.

All this could easily be automated by regression software. Most packages already can give you the correlation between your independent variables -- that is, they can tell you that players with lots of home runs tend to draw lots of walks and hit lots of doubles.  So, have the software automatically look at players in the top 25% of all three categories, and see if there's a bias.  Do the same for the bottom.  Flag anything bigger than the size of the dummy you're looking at, along with a significance level.  Then you can decide.
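As a rough idea of what that check might look like, here's a sketch. Nothing here is an existing feature of any regression package; `joint_extreme_bias`, its parameters, and the toy numbers are hypothetical. The "significance level" is a crude z-score of the average residual against zero, which is one simple choice among several.

```python
import numpy as np

def joint_extreme_bias(columns, residuals, effect_size, quantile=0.75, top=True):
    """Mean residual for rows that are extreme in ALL of the given variables.

    columns     : dict of variable name -> values, one entry per row
    residuals   : regression residuals (actual - predicted)
    effect_size : the coefficient you are worried about, in the same units
    quantile    : 0.75 means "top 25%" (or "bottom 25%" when top=False)
    """
    residuals = np.asarray(residuals, dtype=float)
    mask = np.ones(residuals.size, dtype=bool)
    for values in columns.values():
        values = np.asarray(values, dtype=float)
        if top:
            mask &= values >= np.quantile(values, quantile)
        else:
            mask &= values <= np.quantile(values, 1.0 - quantile)
    n = int(mask.sum())
    if n < 2:
        return {"n": n, "mean": float("nan"), "z": float("nan"), "flag": False}
    selected = residuals[mask]
    mean = float(selected.mean())
    # Standard error of the mean residual within the selected rows.
    se = float(selected.std(ddof=1) / np.sqrt(n))
    z = mean / se if se > 0 else float("nan")
    # Flag when the bias is bigger than the effect under study.
    return {"n": n, "mean": mean, "z": z, "flag": bool(abs(mean) > abs(effect_size))}

# Toy data: eight rows, two correlated variables, effect of interest = 15.
cols = {"hr": [1, 2, 3, 4, 5, 6, 7, 8], "bb": [1, 2, 3, 4, 5, 6, 7, 8]}
resid = [-5.0, -5.0, -5.0, 0.0, 0.0, 0.0, 20.0, 30.0]
top_check = joint_extreme_bias(cols, resid, effect_size=15.0, top=True)
bottom_check = joint_extreme_bias(cols, resid, effect_size=15.0, top=False)
```

In the toy data the top group gets flagged and the bottom group doesn't. With real data you'd want enough rows in the group for the z-score to mean anything.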

Another thing the software could do, is this: taking the variable of concern -- batting average, in my case -- look at all the other variables that correlate highly with it (RC27, in my case).  Check the bias when those variables are all high, and when they're all low.  And, again, flag anything large, for further review.
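A sketch of that second check, again with made-up names and toy numbers. It assumes "correlate highly" means a positive correlation of at least 0.5 with the variable of concern, and "high" and "low" mean the top and bottom 25%; both cutoffs are arbitrary.

```python
import numpy as np

def bias_among_correlates(target, others, residuals, min_corr=0.5, quantile=0.75):
    """Mean residual where the variables correlated with `target` are all
    high, and where they are all low.

    target    : the variable of concern (e.g. batting average)
    others    : dict of name -> values for the other explanatory variables
    residuals : regression residuals (actual - predicted)
    """
    target = np.asarray(target, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Keep only variables positively correlated with the target.
    chosen = {}
    for name, values in others.items():
        values = np.asarray(values, dtype=float)
        if np.corrcoef(target, values)[0, 1] >= min_corr:
            chosen[name] = values
    if not chosen:
        return {"variables": [], "all_high": float("nan"), "all_low": float("nan")}
    high = np.ones(residuals.size, dtype=bool)
    low = np.ones(residuals.size, dtype=bool)
    for values in chosen.values():
        high &= values >= np.quantile(values, quantile)
        low &= values <= np.quantile(values, 1.0 - quantile)

    def mean_or_nan(mask):
        return float(residuals[mask].mean()) if mask.any() else float("nan")

    return {
        "variables": sorted(chosen),
        "all_high": mean_or_nan(high),
        "all_low": mean_or_nan(low),
    }

# Toy data: "rc27" tracks the target closely, "sb" does not.
ba = [1, 2, 3, 4, 5, 6, 7, 8]
others = {
    "rc27": [2, 4, 6, 8, 10, 12, 14, 16],
    "sb": [1, -1, 1, -1, 1, -1, 1, -1],
}
resid = [10.0, 10.0, 0.0, 0.0, 0.0, 0.0, -20.0, -20.0]
report = bias_among_correlates(ba, others, resid)
```

Anything large in `all_high` or `all_low`, relative to the coefficient on the variable of concern, is what you'd flag for a closer look.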

------

Would this work?  It seems to me that it would, at least enough of the time to make it worthwhile.  Is it already being done?  Am I reinventing the wheel?


At Tuesday, December 11, 2012 5:21:00 PM,  mettle said...

Please look up logistic regression or higher-order regression analysis.
Many (most?) relationships in nature aren't linear-to-linear. You can easily transform ABs in such a way that your regression will be 100% valid.

At Tuesday, December 11, 2012 5:29:00 PM,  Phil Birnbaum said...

Sure, a lot of the time you can do that. Squared terms for age, for instance, you see that one a lot.

The point is, you have to know to do that. So you need some way of checking that your non-linearity is big enough that it could be causing your observed coefficient. Once you know that, then you can experiment with making it work.

At Tuesday, December 11, 2012 6:12:00 PM,  Kristine said...

"For instance, just guessing, playing time might be low and horizontal from (say) 0.00 to 2.00, then sloping up from 2.00 to 6.00, then high and horizontal from 6.00 to infinity."

You could try piecewise regression to solve this problem. That might end up fitting the data better.

At Wednesday, December 12, 2012 1:38:00 PM,  Alex said...

If you graph your data in the first place, these issues don't arise. If the author of this study looked at the scatterplot for BA versus R27, he would see it was non-linear and not run a linear regression. Or if he simply hypothesized that BA can't be normally distributed, he wouldn't run a linear regression.

To be more nitpicky, a dummy variable usually refers to something that can be coded as a 1 or 0. Batting average can't be a dummy variable. Also, towards the end when you say that regression software spits out correlations among your variables, I think you mean 'independent' instead of 'dependent'.

At Wednesday, December 12, 2012 5:09:00 PM,  Phil Birnbaum said...

Alex, thanks. I was thinking about the race case when I wrote about dummies, and I probably forgot to reset when I switched to batting average. And I'll fix the "dependent" error too.

Appreciate it!

At Wednesday, December 19, 2012 2:01:00 AM,  Anonymous said...

"The point is, you have to know to do that. So you need some way of checking that your non-linearity is big enough that it could be causing your observed coefficient. Once you know that, then you can experiment with making it work."

Something like the "coefficient of determination"?
http://en.wikipedia.org/wiki/Coefficient_of_determination

At Thursday, December 27, 2012 10:16:00 AM,  G Wolf said...

Sorry if I'm a little late to the game here, but honestly these problems aren't really that confounding or unique.

Logistic regression is a very well-known and often-utilized concept. Your response of "you have to know to do that" can be applied to anything else as well, even basic OLS.

As for error checking, there are pretty common tests for things like constant variance (homoscedasticity), nonindependence of error terms, or nonnormality of error terms. Some are based on a visual assessment of a plot, while others (eg, Bartlett's test) are more rigorous.