Two regression puzzles
Here are a couple of interesting sports regression problems I ran into in the past week, if you're into that kind of thing. What struck me about them is how simple the actual regressions are, but how hard you have to think to figure out what they really mean.
The first one comes from Brian Burke.
Brian ran a regression to link an NFL quarterback's performance to his salary. He got a decent relationship, with a correlation of .46. Based on that regression, it looked like Aaron Rodgers should be worth around $25 million a year.
So far so good.
Then, Brian ran exactly the same regression, but switched the X and Y axes. He got the same correlation, of course. And the points on the graph were exactly the same, just sideways. But, this time, it looked like Rodgers should be worth only about $11 million!
How is that possible?
Here's the post where Brian lays out both arguments -- along with pictures -- and asks which is right. It took me a couple of hours of pondering, but I think I figured it out.
My answer is in the comments to Brian's post. I think it's correct, but I'm not completely sure ... and I don't think I even convinced Brian.
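If you want to see the effect for yourself, here's a minimal sketch on made-up data. It isn't Brian's data or his method: the variable names, the sample size, and the "3 standard deviations above average" quarterback are all assumptions for illustration. It only shows that the two regressions, run on exactly the same points, imply different salaries for the same performance whenever the correlation is less than 1. It doesn't say which answer is the right one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical, standardized data: performance and salary correlated at roughly .46.
perf = rng.normal(size=n)
salary = 0.46 * perf + np.sqrt(1 - 0.46**2) * rng.normal(size=n)

r = np.corrcoef(perf, salary)[0, 1]

# Regression 1: predict salary from performance.
b1, a1 = np.polyfit(perf, salary, 1)

# Regression 2: same points, axes switched (predict performance from salary).
b2, a2 = np.polyfit(salary, perf, 1)

# A hypothetical star quarterback, 3 standard deviations above average.
star_perf = 3.0

# Salary implied by regression 1: read straight off the line.
pred_direct = a1 + b1 * star_perf

# Salary implied by regression 2: solve perf = a2 + b2 * salary for salary.
pred_inverted = (star_perf - a2) / b2

print(f"correlation:               {r:.3f}")
print(f"salary from regression 1:  {pred_direct:.2f}")
print(f"salary from regression 2:  {pred_inverted:.2f}")
print(f"product of the two slopes: {b1 * b2:.3f}  (r squared: {r**2:.3f})")
```

The last line is the key: the product of the two slopes always equals the correlation squared, so the two lines coincide only when the correlation is perfect. The further the player is from average, the bigger the gap between the two answers.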
The second one you can understand, probably, without pictures. I'll elaborate in the next post, but I'll just lay it out for now.
It's an established result, in baseball analysis, that a point of on-base percentage is worth about 1.7 times as much as a point of slugging percentage. (Here's a discussion at Tango's old blog; you can probably Google and find others.)
But ... if you do a regression, that's not what you get.
I ran a regression to predict team winning percentage from OBP and SLG, using seasons from 1960 to 2011. My equation was:
wpct = (2.52 OBP) + (0.71 SLG) - 0.62
By this regression, it looks like a point of OBP is worth 3.5 times a point of SLG -- about twice the true ratio of 1.7. Also, the 2.52 and the 0.71 aren't right either, individually.
It's not just random error ... even if you move the two coefficients toward each other by 2 standard errors each (OBP down, SLG up), the ratio still won't reach 1.7. Also, if you break this down into subsets, you get roughly similar results for each (as long as you keep enough seasons to reduce the randomness enough).
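To make the arithmetic concrete, here's a small sketch that just plugs in the coefficients from the equation above. The function name and the sample team (a .330 OBP and a .400 SLG) are my own assumptions, picked only to show the equation in use.

```python
# Coefficients from the regression equation above.
OBP_COEF = 2.52
SLG_COEF = 0.71
INTERCEPT = -0.62

# The established OBP-to-SLG value ratio cited above.
ACCEPTED_RATIO = 1.7


def predict_wpct(obp, slg):
    """Predicted team winning percentage from the fitted equation."""
    return OBP_COEF * obp + SLG_COEF * slg + INTERCEPT


# How much a point of OBP is worth relative to a point of SLG, per the regression.
ratio = OBP_COEF / SLG_COEF

print(f"regression ratio OBP:SLG = {ratio:.2f}")
print(f"relative to the accepted 1.7: {ratio / ACCEPTED_RATIO:.2f}x")

# Hypothetical team, for illustration only.
print(f"predicted wpct at .330/.400: {predict_wpct(0.330, 0.400):.3f}")
```

The equation gives a believable winning percentage for a typical team, which is part of what makes the puzzle interesting: the fit as a whole looks fine even though the ratio of the two coefficients doesn't match the accepted value.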
What's going on?
It took me a while -- again -- but I think I figured this one out too. I'll explain in the next post.
UPDATE, Friday 5/17: Upon further reflection, I *haven't* figured out the second one yet. But I'm working on it!