Sabermetric Research: Two regression puzzles

Thursday, May 16, 2013

Two regression puzzles

Here are a couple of interesting sports regression problems I ran into in the past week, if you're into that kind of thing. What struck me about them is how simple the actual regressions are, but how hard you have to think to figure out what they really mean.

----

The first one comes from Brian Burke.

Brian ran a regression to link an NFL quarterback's performance to his salary. He got a decent relationship, with a correlation of .46. Based on that regression, it looked like Aaron Rodgers should be worth around $25 million a year.

So far so good.

Then, Brian ran exactly the same regression, but switched the X and Y axes. He got the same correlation, of course. And the points on the graph were exactly the same, just sideways. But, this time, it looked like Rodgers should be worth only about $11 million!

How is that possible?

Here's the post where Brian lays out both arguments -- along with pictures -- and asks which is right. It took me a couple of hours of pondering, but I think I figured it out.

My answer is in the comments to Brian's post. I think it's correct, but I'm not completely sure ... and I don't think I even convinced Brian.
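To see the mechanics of the disagreement without Brian's pictures, here's a minimal sketch on made-up data. It assumes standardized, normally distributed "performance" and "salary" values with a true correlation of .46 (the only number taken from the post); the variable names and the two-standard-deviation "star" are hypothetical. It doesn't say which line is right, only that the two fitted lines must give different answers whenever the correlation is less than 1.

```python
import random

rng = random.Random(0)
TARGET_R = 0.46  # correlation quoted in the post; all data below is synthetic

# Hypothetical standardized performance and salary values.
performance = [rng.gauss(0, 1) for _ in range(2000)]
salary = [TARGET_R * p + (1 - TARGET_R ** 2) ** 0.5 * rng.gauss(0, 1)
          for p in performance]

def centered_sums(x, y):
    """Return sums of squares and cross-products about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

sxx, syy, sxy = centered_sums(performance, salary)
r = sxy / (sxx * syy) ** 0.5

slope_salary_on_perf = sxy / sxx   # regress salary on performance
slope_perf_on_salary = sxy / syy   # same points, axes swapped

# A hypothetical star whose performance is 2.0 units above the mean.
star = 2.0
implied_direct = slope_salary_on_perf * star     # read salary off the first line
implied_inverted = star / slope_perf_on_salary   # read salary off the swapped line

print(f"sample r = {r:.3f}")
print(f"salary above mean, salary-on-performance line: {implied_direct:.2f}")
print(f"salary above mean, performance-on-salary line: {implied_inverted:.2f}")
```

The two slopes multiply to r squared, so the two implied salaries differ by a factor of 1/r squared. With a correlation near .46, that's a big gap.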

----

The second one you can understand, probably, without pictures. I'll elaborate in the next post, but I'll just lay it out for now.

It's an established result, in baseball analysis, that a point of on-base percentage is worth about 1.7 times as much as a point of slugging percentage. (Here's a discussion at Tango's old blog; you can probably Google and find others.)

But ... if you do a regression, that's not what you get.

I ran a regression to predict team winning percentage from OBP and SLG, using seasons from 1960 to 2011. My equation was:

wpct = (2.52 OBP) + (0.71 SLG) - 0.62

By this regression, it looks like a point of OBP is worth 3.5 times a point of SLG -- about twice the true value of 1.7. Also, the 2.52 and the 0.71 aren't right either, individually.
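Here's the arithmetic behind that ratio, as a small sketch. The coefficients come from the equation above; the .330/.420 team is a hypothetical example, and the 162-game season is an assumption used only to turn winning percentage into wins.

```python
OBP_COEF = 2.52
SLG_COEF = 0.71
INTERCEPT = -0.62
ACCEPTED_RATIO = 1.7  # the established OBP:SLG value cited in the post

def predicted_wpct(obp, slg):
    """Winning percentage implied by the fitted equation."""
    return OBP_COEF * obp + SLG_COEF * slg + INTERCEPT

# How many points of SLG one point of OBP is "worth" under this fit.
regression_ratio = OBP_COEF / SLG_COEF

# One "point" is .001, so the coefficient times .001 is the wpct gain.
wins_per_obp_point = OBP_COEF * 0.001 * 162  # assumes a 162-game season
wins_per_slg_point = SLG_COEF * 0.001 * 162

# Hypothetical team with a .330 OBP and a .420 SLG.
example_wpct = predicted_wpct(0.330, 0.420)

print(f"regression ratio: {regression_ratio:.2f}")
print(f"relative to accepted value: {regression_ratio / ACCEPTED_RATIO:.2f}x")
print(f"wins per OBP point: {wins_per_obp_point:.2f}")
print(f"wins per SLG point: {wins_per_slg_point:.2f}")
print(f"example team wpct: {example_wpct:.3f}")
```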

It's not just random error ... even if you move the two coefficients together by 2 standard errors each, the ratio still won't reach 1.7. Also, if you break this down into subsets, you get roughly similar results for each (as long as you keep enough seasons to reduce the randomness enough).

What's going on?

It took me a while -- again -- but I think I figured this one out too. I'll explain in the next post.

UPDATE, Friday 5/17: Upon further reflection, I *haven't* figured out the second one yet. But I'm working on it!

Labels: regression, statistics

25 Comments:

At Thursday, May 16, 2013 11:10:00 AM, BMMillsy said...: As I tweeted, there was a recent comment in JSE addressing issues with past regressions and OBP vs. SLG.

http://jse.sagepub.com/content/14/2/203.abstract

However, I think this is a poorly posed question to begin with. When we're talking about a large sample of a small number of discrete events, it is pretty easy to simply simulate games to evaluate the relative importance of each of the events. SLG and OBP are summary measures of these events. Why not just use the events? (not in a regression)
At Thursday, May 16, 2013 12:18:00 PM, Alex said...: I'm not familiar enough with baseball to have strong insight if that's the issue, but is it because you ran the regression on win percentage? That Tango post seemed to use runs or wins, and I could imagine a non-linear transition from either of those to win percentage. That would distort the relationship between OBP and SLG.
At Thursday, May 16, 2013 12:20:00 PM, Phil Birnbaum said...: Nope, don't think that's it. Using runs would probably improve the results, but I suspect the coefficients would stay roughly the same. I'll confirm that later when I have time.

Wins is close to linear in terms of runs, for most levels of actual team offense.
At Thursday, May 16, 2013 1:03:00 PM, Anonymous said...: The answer is multi-collinearity. OBP and SLG are not independent, both have H in the numerator and something close to PA in the denominator. So, one has to be very careful in interpreting coefficients like this in an environment where the IVs are highly correlated.
At Friday, May 17, 2013 1:29:00 PM, Guy said...: Phil: Millsy is of course correct that regression is the wrong tool here, but the puzzle is still interesting. I agree with Anonymous that multi-collinearity is a potential problem here. Here are my guesses on other reasons SLG is being undervalued:

1) Your sample includes pre- and post-1993 data -- almost always a problem. After 1993, SLG increased proportionately more than OBP -- but of course wins didn't change at all -- making it harder for your regression to see the value of SLG.

2) SLG is more likely to be park-dependent than OBP, and so high SLG rates (CO, TX, old Fenway) are more likely to be offset by a high opponent SLG (this won't be a problem if you use RS rather than wins).

3) It may be that OBP is a skill that is better correlated with other useful skills -- speed, defense -- than SLG.

4) Building a high OBP roster may better reflect management skill, that then manifests itself on the pitching/fielding side of the ledger as well.
At Friday, May 17, 2013 1:55:00 PM, aweb said...: Anonymous is certainly right about at least a large portion of the effect. I always think of this as one variable "winning" over the other. In this case, OBP and SLG are colinear, but OBP is better, so it "wins" the coefficient battle. In my head, OBP is capturing part of SLG that is common between the two in some way.

Hits are ~75% of OBP and ~60% of SLG. If you broke out the regression results in terms of the formulae for OBP and SLG, it might be reconstructable. I think you might need to break it into the components (BB, HBP, Sin, Dbl, Trp, HR) to fit them, and then reconstruct OBP and SLG from those separate events. Although they still aren't independent events, it would be a lot closer.
At Friday, May 17, 2013 3:13:00 PM, Guy said...: BTW, the Hakes-Sauer Moneyball paper looked at the impact of SLG and OBP on winning in exactly this way, and they found an OBP:SLG ratio of about 2:1. They then did a similar regression with salary as the dependent variable, and concluded that the market radically undervalued OBP until Moneyball was published.

http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.20.3.173

Another helpful reminder that publication in a refereed journal is no assurance of quality!
At Friday, May 17, 2013 4:19:00 PM, Phil Birnbaum said...: Guy,

Thanks, I looked at some of the stuff you suggested. (And, actually, I got the idea from the Hakes/Sauer paper.)

1. The effect is there before and after 1993, I think. Will confirm that. H/S did find that too.

2. I included opposition stats, as a check (which is what H/S did). That is, I subtracted opposition OBP/SLG from the team's own. Those should cancel out park/defense/pitching effects. Results were very similar.

3, 4. Possible, but the effect would have disappeared again when I included opposition ...
At Friday, May 17, 2013 5:38:00 PM, Matt Crawford said...: Interesting stuff.

What happens if you use (OBP above/below league average) and (SLG above/below league average) as your inputs? Because winning percentage is a funny Y variable, as Guy said. OBP and SLG could change all over the place but the overall average winning percentage will still be .500.

Multi-collinearity doesn't affect the coefficients at all, so I'm not convinced that's the issue. It does affect the standard errors, and it definitely affects the interpretation of the coefficients. I.e., it doesn't make much sense to say "increasing OBP by 1 unit will increase your winning percentage by 2.52 units, if you hold SLG constant" because that doesn't happen very often in real-life (increasing OBP without affecting SLG).
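The claim above, that correlated predictors leave the coefficients alone but widen their standard errors, can be checked with a rough simulation. This is only a sketch on synthetic data: it assumes the linear model is correctly specified with true coefficients 2 and 1, and the function names, sample sizes, and correlation values are all made up for illustration. It says nothing about whether the real OBP/SLG data meets those assumptions.

```python
import random
import statistics

def fit_two_predictors(x1, x2, y):
    """Return OLS slopes (b1, b2) for y ~ b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # shrinks toward 0 as x1 and x2 become collinear
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def simulate_b1(rho, trials=300, n=200, seed=1):
    """Mean and spread of the estimated x1 coefficient across many samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has correlation rho with x1 by construction.
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        # True model: coefficient 2 on x1, 1 on x2, plus unit-variance noise.
        y = [2.0 * a + 1.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        estimates.append(fit_two_predictors(x1, x2, y)[0])
    return statistics.mean(estimates), statistics.stdev(estimates)

mean_low, sd_low = simulate_b1(rho=0.0)
mean_high, sd_high = simulate_b1(rho=0.9)

print(f"rho=0.0: mean b1 = {mean_low:.3f}, spread = {sd_low:.3f}")
print(f"rho=0.9: mean b1 = {mean_high:.3f}, spread = {sd_high:.3f}")
```

In both cases the average estimate should sit near the true value of 2, while the spread of the estimates is noticeably wider when the predictors are highly correlated.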
At Friday, May 17, 2013 6:08:00 PM, Phil Birnbaum said...: Aweb ... hmmm, maybe. But it does seem to be consistent in that range, which it wouldn't be if it were just multicollinearity. Or maybe not.

Matt: that's what I did, actually, normalized each team to the MLB average. Maybe I should have used league instead of MLB ...
At Saturday, May 18, 2013 7:45:00 AM, Guy said...: Phil: I tried using wOBA as my dependent variable, which should be a "correct" offensive rate measure, rather than win%. On a much smaller dataset, I get a ratio of about 1.5:1 for the coefficients on OBP and SLG. Not 1.7, but in the right ballpark, unlike your results. And when I use individual players rather than teams, I get the same 1.5.

So I do think the use of win% is the main source of the problem here. Though I'm not sure why that isn't fixed when you include opponent OBP and SLG.

And I'd be surprised if mixing pre- and post-1993 teams has no impact at all. A huge proportion of your high SLG teams will be post-1993.
At Saturday, May 18, 2013 11:19:00 AM, Phil Birnbaum said...: Oops! May have screwed up somewhere ... now runs seems to work fine, and opposition runs too. So, maybe that *is* the answer, that there's something about winning percentage as the dependent variable.

Will verify everything today or tomorrow.
At Saturday, May 18, 2013 11:59:00 AM, Phil Birnbaum said...: When I do runs per game, I get the expected 1.7. When I do OPPOSITION runs per game, I get 2.05, which is not that much higher.

When I predict *differential* runs per game, based on *differential* OBP and SLG, I get 2.09.

When I predict WINNING PERCENTAGE based on differential OBP and SLG, I get 2.33.

So, one possible answer is: "predict runs instead, Phil!" But that doesn't really address the question of why it doesn't work as well when you use winning percentage.

Could the 2.33 just be random? Hakes/Sauer did four different eras, and got 2.43/2.36/2.46/3.09. The first one was 1986-1993. The last one was 2004-06.

It *could* be coincidence that they're all over 1.7 ... but, when I do 1960 to 1985 -- years that Hakes/Sauer did *not* do -- I get 2.26. It seems to be very consistently above 1.7.
At Saturday, May 18, 2013 12:50:00 PM, Guy said...: Well, the run-to-win ratio changes in different environments. In high-scoring environments, the win value of a run is less. And within your samples, I think that will be in situations where SLG is disproportionately high (compared to OBP): in AL games (DH) and in certain ballparks. In other words, I think the variance in SLG and differential-SLG will be highest when runs have the lowest win value. This will also be true for OBP, but not as much.
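The point that a run buys fewer wins in a high-scoring environment can be sketched with the Pythagorean formula. This assumes the classic exponent of 2 and an otherwise average team; the function names and the 4.0 and 5.0 runs-per-game environments are hypothetical choices for illustration.

```python
def pythag_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expected winning percentage (exponent 2 assumed)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def wpct_gain_per_run(runs_per_game, extra=0.1):
    """Approximate wpct gained per extra run per game for an average team."""
    baseline = pythag_wpct(runs_per_game, runs_per_game)  # .500 by construction
    improved = pythag_wpct(runs_per_game + extra, runs_per_game)
    return (improved - baseline) / extra

gain_low_scoring = wpct_gain_per_run(4.0)   # hypothetical low-run environment
gain_high_scoring = wpct_gain_per_run(5.0)  # hypothetical high-run environment

print(f"wpct per extra run/game at 4.0 R/G: {gain_low_scoring:.4f}")
print(f"wpct per extra run/game at 5.0 R/G: {gain_high_scoring:.4f}")
```

For an average team the gain works out to roughly 1/(2R), where R is the run environment, so each run is worth less as scoring rises.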
At Saturday, May 18, 2013 1:03:00 PM, Phil Birnbaum said...: I'll try normalizing by league instead of MLB. Park effects changed the "own OBP/SLG" case, but not the "own and opposition" case, which is what you'd expect if "more runs per win" were an issue.

But I'll try those again, more formally.

It's interesting how there are so many possibilities. Normally, these minor adjustments make minor differences, but when you have a ratio, they're magnified.
At Saturday, May 18, 2013 5:02:00 PM, Phil Birnbaum said...: It does seem like the ratio is consistently higher for winning percentage than runs. It could be the run/win ratio causing this, but might it be other things? I can check if certain types of teams beat their Pythag. Maybe it doesn't take much to cause this effect?
At Saturday, May 18, 2013 10:50:00 PM, Tangotiger said...: Instead of OBP - oppOBP, can you do OBP/oppOBP? And/or, rather than W/L, can you do (W/L)^0.5?
At Saturday, May 18, 2013 10:51:00 PM, Tangotiger said...: I meant rather than W/(W+L). Try W/L and (W/L)^0.5.
At Saturday, May 18, 2013 11:17:00 PM, Phil Birnbaum said...: Will do. If all goes well, will post all the findings tomorrow.
At Monday, May 20, 2013 10:00:00 AM, Phil Birnbaum said...: Tango, I tried both of those things, and the ratio closed a little bit .. 1.96 and 1.93. But not much.

I have an idea what's happening, writing it up now.
At Thursday, May 23, 2013 9:04:00 PM, Jared Cross said...: Just playing around with this using the same dataset and seeing the same things.

The standard deviation of team slugging is .033 but the standard deviation of team slg - lg slg (adjusted team slugging, that is) is 0.028, whereas with obp I get 0.015 whether it's adjusted or not so I think it's important to control for league as you do (since the league explains a considerable amount of the variance in slg but not in obp).

Adjusting everything for league averages and regressing R/G on OBP and SLG I get coefficients of:

OBP : 17.8 +/- 0.5 (one standard error)
SLG: 9.9 +/- 0.3

which is just what we'd expect. Replacing R/G with winning percentage though I get:

OBP: 1.39 +/- 0.16
SLG: 0.66 +/- 0.09

1.39/0.66 = 2.1

BUT, 1) this ratio is within range of 1.7 given the standard errors 2) the standard errors are somewhat underestimated here since these data points aren't quite independent (the same high slugging team with crap pitching could have stayed intact for a few years, say).

So, maybe there's nothing going on here? btw, are high OBP players on average better defensively than high SLG players?
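One rough way to check the "within range of 1.7" claim is a delta-method standard error for the ratio of the two coefficients. The sketch below uses the winning-percentage coefficients and standard errors quoted in the comment above, and it assumes the two estimates are uncorrelated, which is a simplification: with correlated predictors like OBP and SLG the estimates usually covary, and that would change the answer.

```python
import math

# Coefficients and one-standard-error values quoted in the comment above.
obp_coef, obp_se = 1.39, 0.16
slg_coef, slg_se = 0.66, 0.09
ACCEPTED_RATIO = 1.7

ratio = obp_coef / slg_coef

# First-order (delta method) standard error of a ratio,
# ASSUMING zero covariance between the two coefficient estimates.
ratio_se = ratio * math.sqrt((obp_se / obp_coef) ** 2 + (slg_se / slg_coef) ** 2)

# How many standard errors the observed ratio sits above 1.7.
z = (ratio - ACCEPTED_RATIO) / ratio_se

print(f"ratio = {ratio:.2f} +/- {ratio_se:.2f}")
print(f"distance from 1.7: {z:.2f} standard errors")
```

Under that assumption, the ratio is only about one standard error above 1.7, so this single regression can't rule out 1.7 on its own.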
At Thursday, May 23, 2013 11:36:00 PM, Phil Birnbaum said...: Hi, Jared,

My numbers are a little different from yours ... different season sets? Yes, the offensive ratio when you use R/G is the closest to 1.7. But ... try using runs *allowed*, and OBP *allowed*, and SLG *allowed*. For that, I got a much higher ratio, I think around 12.

Most of the numbers I looked at are over 2.0. The one you cite was the smallest.

It might be random, but the fact that different eras seem to have similar results suggests not ...
At Friday, May 24, 2013 1:35:00 PM, Guy said...: Phil/Jared: One possible complication: are you both using runs per 27 outs? The fact that winning home teams don't bat in the 9th can screw things up if you fail to adjust for that. (A mistake I've made a few times).

Phil's different results for offense allowed is interesting. I wonder if the variance in the two metrics, and amount of correlation between them, is subtly different when you look at offense *allowed*? I would think hitters vary much more than pitchers in terms of having OBP-heavy vs SLG-heavy skills.
At Friday, May 24, 2013 1:38:00 PM, Phil Birnbaum said...: Guy: Yes, quite possible that offense allowed is so homogeneous that the effects of SLG can't be determined precisely. Alternatively, it could be that the homogeneity is such that SLG has diminishing returns, because of what kind of SLG increases, and so comes out as not being important.

All part of the puzzle, I guess ...
At Friday, May 24, 2013 1:39:00 PM, Phil Birnbaum said...: IIRC, Guy, the results didn't change much when I adjusted offense and defense into the same number of batting outs per game.
