The OBP/SLG regression puzzle
In the second "puzzle" of my last post, I noticed that, when you run a regression to predict winning percentage from on-base percentage and slugging percentage, you get that a point of OBP is worth between two and three times as much as a point of SLG. That's different from the consensus value of 1.7 (which Tango derives here). Why the difference?
When I wrote that post, I thought I knew the answer ... but I actually didn't. So I spent the last few days trying to figure it out. I jumped around among a whole bunch of possibilities, and hit a lot of dead ends ... but I think I finally got somewhere. Here's the current state of my thinking.
As usual, I could be wrong. I was wrong a few times in the course of working on all this ...
------
I think the issue is one of non-linearity. The regression assumes that runs/wins are linear in OBP and SLG, but they might not be. In fact, in a bit, I'll show they're not.
Why does non-linearity matter? Because, the "1.7" comes from adding events to an average team, and looking at the marginal impact. If there's linearity, then we know that impact must be the same for all teams. But, if there's not, that doesn't necessarily work.
To see why: consider a relationship that's actually cubic:
0, 1, 8, 27, 64, 125, 216
If you do a linear regression to predict x-cubed from x, for those seven values, you get
y = 34x - 39
The regression says that if you increase X by 1, Y increases by 34.
But ... that's not true for the *average* value of X. The average value is 3. The difference between 3.5 cubed and 2.5 cubed isn't 34 -- it's 27.25. (To be more precise, we can take the first derivative of x-cubed at 3, and get 27.)
So the average coefficient is higher than the coefficient at the average.
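Here's a minimal sketch that reproduces those numbers. The helper name ols_slope_intercept is just something made up for this illustration; it fits the ordinary least-squares line through the seven points and compares its slope to the slope of x-cubed at the average x.

```python
# Fit a straight line to (x, x**3) for x = 0..6 and compare the fitted
# slope with the local slope of x**3 at the average value of x.

def ols_slope_intercept(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = list(range(7))              # 0, 1, ..., 6
ys = [x ** 3 for x in xs]        # 0, 1, 8, 27, 64, 125, 216
slope, intercept = ols_slope_intercept(xs, ys)

mean_x = sum(xs) / len(xs)                 # the average x, which is 3
slope_at_mean = 3 * mean_x ** 2            # derivative of x**3 at the average
one_unit_change = 3.5 ** 3 - 2.5 ** 3      # change across one unit centered on 3

print(slope, intercept)                    # regression coefficient and constant
print(slope_at_mean, one_unit_change)      # the two "at the average" figures
```

The fitted slope is the "average coefficient"; the last two numbers are the "coefficient at the average".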
Sometimes the coefficient at the average will be "too high", and sometimes, like now, it will be "too low". I'm not completely sure of the exact conditions for each. The point is, though, that if you have non-linearity, the two coefficients will probably be different.
And that means if you have two variables, the ratio will be different. Suppose you have Y = a^3 + b. The regression will give you coefficients of 34 and 1. But the values at the average will be 27 and 1. So the ratio is 34 overall, but 27 at the average.
That might be what's happening here. OBP and SLG are non-linear in separate ways, and that changes the ratio from 2.3 overall, to 1.7 at the average.
-------
OBP and SLG are indeed non-linear in a certain way. In the way I'm going to show you, you don't need any baseball knowledge. I'm going to show you that they're non-linear, not in terms of *runs*, but in terms of *raw events*.
Suppose you have a batting line, and you want to add walks to raise the OBP by one point (.001). How many walks do you have to add? Well, it depends on your original OBP.
Suppose you're at .333 -- you have 333 "OBs" (walks or hits) in 1000 PA. How do you get to .334? You can't just add one walk, because that only brings you to .333666 (334/1001). It turns out you have to add approximately 1.5015 walks. That brings you to 334.5015 OBs in 1001.5015 PA, which brings you to .334.
But, now, suppose you're at .400, and you want to get to .401. How many walks do you have to add now? This time, it's 1.66945. 401.66945 divided by 1001.66945 equals .401.
(I did a little algebra to get the formula that, for every 1000 PA you start with, the number of additional walks you need is (1 divided by (.999 minus OBP)). That's where those two numbers came from.)
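For anyone who wants to check the arithmetic, here's a small sketch. It solves (OB + w) / (PA + w) = OBP + .001 for w, the number of walks to add. The function name walks_needed and its parameters are my own choices for the illustration, not anything standard.

```python
def walks_needed(obp, pa=1000.0, points=1):
    """Walks to add so that OBP rises by `points` thousandths.

    Solves (obp * pa + w) / (pa + w) = obp + points / 1000 for w.
    """
    target = obp + points / 1000.0
    return pa * (target - obp) / (1.0 - target)

w_333 = walks_needed(0.333)   # starting at .333 over 1000 PA
w_400 = walks_needed(0.400)   # starting at .400 over 1000 PA

# Confirm the new OBP really is one point higher in each case.
new_obp_333 = (333 + w_333) / (1000 + w_333)
new_obp_400 = (400 + w_400) / (1000 + w_400)

print(round(w_333, 4), round(new_obp_333, 6))
print(round(w_400, 5), round(new_obp_400, 6))
```

The number of walks needed grows as the starting OBP grows, which is the increasing-returns effect described above.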
That is: points of OBPs give you increasing returns *in terms of number of walks*. So the more OBPs you get, the more each additional one is worth. Or, if you want to put it another way, the more OBPs you have, the harder it is to get another one, because you need more walks to get it.
Again, this is not a baseball observation. The same thing applies to, say, games of gin rummy. If you're at .333 and you want to get to .334, you only need to win your next 1.5015 games. If you're at .400 and you want to get to .401, though, you have to win your next 1.66945 games.
-------
So: as we saw, a point of OBP offers a higher return when OBP is already high. That, by itself, is enough to make the coefficient of OBP different from the marginal value *for an average team*, which is where the 1.7 came from.
But ... what about SLG? If SLG also offers increasing returns, its coefficient will vary, too. If it varies the same way, we should still get 1.7!
Yes, indeed. But, who knows if SLG *does* have increasing returns? And, if it does, who knows if they're exactly equal to the increasing returns of OBP? That would be quite a coincidence, wouldn't it?
Since we have no reason to expect OBP and SLG to offer the exact same distortion caused by increasing returns ... we have no reason to expect the ratio OBP/SLG to be exactly 1.7.
This doesn't explain why it's at the level it's at -- "slightly higher than 1.7," we could call it. From the logic we've seen so far, it could be anything: lower than 1.7, higher than 1.7, much different, a little different, whatever.
But: that's why, in theory, it won't be exactly 1.7. If that's all you're looking for, an explanation of why it *could* be different, there it is. I'm going to keep going, but it gets boring and technical and long for the next bit.
------
OK, so we talked about adding a point of OBP by adding walks. Now, let's talk about adding a point of SLG by adding an extra base.
Adding extra bases doesn't change the denominator of SLG (which is at-bats). So, if you want to add one point of SLG where there's 1,000 AB, you can just add one extra base. Change a double to a triple, or something.
But: the denominator, the number of AB, is not the same for every team. The more AB you have, the more valuable a point of SLG. At 1,000 AB, you need only 1 extra base. At 1,020 AB, you need 1.02 extra bases, which is 2% more valuable.
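As a quick sketch of that arithmetic (the function name extra_bases_needed is made up for this example): since extra bases leave AB alone, raising SLG by one point takes AB/1000 extra bases.

```python
def extra_bases_needed(at_bats, points=1):
    """Extra bases to add so SLG rises by `points` thousandths.

    Extra bases change total bases but not at-bats, so
    (TB + b) / AB = SLG + points / 1000 gives b = AB * points / 1000.
    """
    return at_bats * points / 1000.0

print(extra_bases_needed(1000))   # one extra base at 1,000 AB
print(extra_bases_needed(1020))   # slightly more at 1,020 AB
```

Unlike the walks case, this grows in a straight line with AB, which is why the effect is so much weaker.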
AB is hits plus outs. In our regression, every team has roughly the same number of outs (since we did full seasons only), so the only difference is hits. So, the more hits a team has, the more valuable a point of SLG from extra bases. And hits correlates highly with OBP.
So: the more OBP a team has, the more valuable a point of SLG. But ... well, it's a weak increase, compared to the OBP increase. I'm almost willing to call this one linear.
------
What about adding a point of SLG by adding a single? That's different, because a single affects both SLG and OBP. So, we need to do this in two steps: we add enough singles to raise SLG by a point, and then subtract enough walks to lower OBP back to where it was before.
How many singles do we have to add to SLG? That's the same formula as for how many walks we had to add to OBP. For 1,000 AB, it's
1 divided by (.999 minus SLG)
That increases OBP by that many "events", so we subtract that exact number of walks, and OBP is back to where it was before. (Effectively, we've just converted walks to singles at the exact rate that OBP stays the same, but SLG goes up a point.)
The increase in runs is, therefore,
[1 / (.999 - SLG)] * [value of single - value of walk]
We're assuming singles and walks have constant value -- .47 and .34, say -- so we get that adding one point of SLG adds
+.13 * [1 / (.999 - SLG)] runs.
That's a higher number when SLG is higher, so we see that a point of SLG also has increasing returns. (I'm not going to try to figure out by how much.)
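Here's a sketch of that two-step swap, to confirm that it moves SLG by a point and leaves OBP alone. The batting line is hypothetical (made up purely to have something to plug in), PA is taken to be AB plus walks (ignoring HBP and sacrifices, which is my simplification), and the run values are the "say" figures of .47 and .34.

```python
RUN_SINGLE = 0.47   # assumed run value of a single
RUN_WALK = 0.34     # assumed run value of a walk

def obp(line):
    # Simplification: PA = AB + walks (no HBP or sacrifices).
    return (line["hits"] + line["walks"]) / (line["ab"] + line["walks"])

def slg(line):
    return line["tb"] / line["ab"]

def swap_walks_for_singles(line):
    """Turn n walks into n singles, with n chosen so SLG rises one point."""
    n = line["ab"] * 0.001 / (0.999 - slg(line))
    new_line = {
        "ab": line["ab"] + n,        # each single is a new at-bat
        "tb": line["tb"] + n,        # ... and one more total base
        "hits": line["hits"] + n,
        "walks": line["walks"] - n,  # same number of walks removed
    }
    runs = n * (RUN_SINGLE - RUN_WALK)
    return n, new_line, runs

# Hypothetical batting line: .400 SLG over 1,000 AB.
before = {"ab": 1000.0, "tb": 400.0, "hits": 260.0, "walks": 100.0}
n, after, runs = swap_walks_for_singles(before)

print(round(n, 5), round(runs, 4))
print(round(slg(before), 4), round(slg(after), 4))
print(round(obp(before), 6), round(obp(after), 6))
```

Starting from a higher SLG makes n bigger, and the run gain scales with n, so the returns are increasing in SLG.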
------
The last case is adding a point of OBP by singles (and leaving SLG alone).
How many singles do we need to add? Same formula:
[1 / (.999 - OBP)]
But, that will also increase SLG, so we have to subtract enough "extra bases" from SLG to bring it back to where it was before.
Adding singles increases total bases by the same number as it increases AB. But, to keep slugging the same when adding AB, we need to increase total bases by only SLG times the number of added AB. So, we need to subtract (1-SLG) total bases for each single added.
That is, we need to subtract
[1 / (.999 - OBP)] * (1-SLG) bases in total.
Combining the two steps gives a batting line change of
[1 / (.999 - OBP)] cases of "add one single, and subtract (1-SLG) bases".
Assigning run values here -- say, .47 for a single, and .26 for a base -- gives a run increase of
[1 / (.999 - OBP)] * (.47 - .26 (1-SLG))
That gives increasing returns in OBP, and also increasing returns in SLG. Again, I'm not going to try and quantify which is bigger.
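The same kind of check works for this case. Again, the batting line is hypothetical, PA is simplified to AB plus walks, and the run values (.47 per single, .26 per base) are the "say" figures from the text.

```python
RUN_SINGLE = 0.47   # assumed run value of a single
RUN_BASE = 0.26     # assumed run value of one extra base

def obp(line):
    # Simplification: PA = AB + walks (no HBP or sacrifices).
    return (line["hits"] + line["walks"]) / (line["ab"] + line["walks"])

def slg(line):
    return line["tb"] / line["ab"]

def add_obp_point_with_singles(line):
    """Add n singles to lift OBP one point, then strip bases so SLG is unchanged."""
    pa = line["ab"] + line["walks"]
    n = pa * 0.001 / (0.999 - obp(line))
    bases_removed = n * (1.0 - slg(line))     # (1 - SLG) bases per added single
    new_line = {
        "ab": line["ab"] + n,
        "tb": line["tb"] + n - bases_removed,
        "hits": line["hits"] + n,
        "walks": line["walks"],
    }
    runs = n * (RUN_SINGLE - RUN_BASE * (1.0 - slg(line)))
    return n, new_line, runs

# Hypothetical batting line: .333 OBP over 1,000 PA, .400 SLG over 900 AB.
before = {"ab": 900.0, "tb": 360.0, "hits": 233.0, "walks": 100.0}
n, after, runs = add_obp_point_with_singles(before)

print(round(n, 4), round(runs, 4))
print(round(obp(before), 4), round(obp(after), 4))
print(round(slg(before), 4), round(slg(after), 4))
```

The count n rises with OBP, and the per-single value (.47 - .26 (1-SLG)) rises with SLG, so the run gain is increasing in both.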
------
Those are the only four cases I see of how to increase one of OBP and SLG at the margin. (For extra-base hits, you just add the two cases -- add singles, and then add extra bases. The math works out the same.)
That means, in terms of increasing returns, we have:
Increase SLG by bases -- roughly linear
Increase SLG by hits -- increasing in SLG
Increase OBP by walks -- increasing in OBP
Increase OBP by hits -- increasing in OBP and SLG
So, some ways are increasing in OBP, and some in SLG, and ... it looks like OBP and SLG are represented roughly equally. It looks like we should expect a ratio that's not too far from 1.7. It might not be *exactly* 1.7, but our gut says it should be not too different. Which is about right -- it's in the 2s.
This is all theory. Is there any evidence we can look at?
Well, it looks like teams with lots of walks should be different from teams with lots of hits. The walking teams should see lots of increasing returns in OBP, and so a higher ratio. And the hitting teams should see lots of increasing returns in SLG, and so a lower ratio.
So, I repeated the regression, but included only teams who were at least two percentage points higher than normal in their BB/H ratio. This is the regression for those teams:
wpct = 2.69 OBP + 1.05 SLG - .845 (ratio: 2.5)
And for teams who walked two percentage points *less* than normal:
wpct = 1.62 OBP + 1.03 SLG - .491 (ratio: 1.6)
So, that seems to support the theory! More walks = higher ratio, as hypothesized.
The results are similar if I use other point thresholds for higher/lower than average:
0 points: 2.6 low, 3.7 high
1 point : 2.0 low, 4.6 high
2 points: 1.6 low, 2.5 high
3 points: 0.8 low, 1.5 high
4 points: 7.1 low, 3.6 high
(The theory seems to fail in the extreme case ... but it's probably sample size. If you up the SLG coefficient by 2 SDs, the ratio drops from 7.1 all the way to 1.6.)
Overall, I'd say, the test seems to support the theory.
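For anyone who wants to try the split on their own team data, here's a rough sketch of the mechanics. It's not the code I ran. The field names (obp, slg, wpct, bb_h_diff) are hypothetical, and bb_h_diff is assumed to already hold each team's BB/H measure minus "normal", however you choose to define that.

```python
def fit_two_predictors(rows):
    """Least-squares fit of wpct = a*OBP + b*SLG + c.

    rows is a list of (obp, slg, wpct) tuples. Returns (a, b, c).
    """
    n = len(rows)
    mo = sum(r[0] for r in rows) / n
    ms = sum(r[1] for r in rows) / n
    mw = sum(r[2] for r in rows) / n
    # Centered sums of squares and cross-products.
    soo = sum((r[0] - mo) ** 2 for r in rows)
    sss = sum((r[1] - ms) ** 2 for r in rows)
    sos = sum((r[0] - mo) * (r[1] - ms) for r in rows)
    sow = sum((r[0] - mo) * (r[2] - mw) for r in rows)
    ssw = sum((r[1] - ms) * (r[2] - mw) for r in rows)
    # Solve the 2x2 normal equations for the two slopes.
    det = soo * sss - sos ** 2
    a = (sow * sss - ssw * sos) / det
    b = (ssw * soo - sow * sos) / det
    c = mw - a * mo - b * ms
    return a, b, c

def split_by_walk_tendency(teams, threshold):
    """Return (low, high) row lists for teams beyond +/- threshold in bb_h_diff."""
    def as_row(t):
        return (t["obp"], t["slg"], t["wpct"])
    low = [as_row(t) for t in teams if t["bb_h_diff"] <= -threshold]
    high = [as_row(t) for t in teams if t["bb_h_diff"] >= threshold]
    return low, high

def obp_slg_ratio(rows):
    """Ratio of the OBP coefficient to the SLG coefficient."""
    a, b, _ = fit_two_predictors(rows)
    return a / b
```

You'd call split_by_walk_tendency on the full list of teams, then obp_slg_ratio on each half, and compare the two ratios.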
-----
OK, now the bad news. I don't think this is the real answer. Yes, I think it's all correct, but I wonder if the effect is much too small to make such a big difference, from 1.7 to 2.3.
Also, another explanation occurred to me, one that seems bigger:
Walks get lumped in with singles in OBP. Extra bases get lumped in with singles in SLG. Which is worth more: a single, or the exact number of walks and extra bases that have the same impact on OBP and SLG? Whichever is worth more, if the good teams get more of that one relative to the other, that will show up in a higher coefficient. If the good teams get fewer of that one, the coefficient would be lower.
This last explanation seems to me like the effect would be bigger. Further research required, I guess.
-----
Part II is here.
Labels: OBP, OPS, regression, SLG
8 Comments:
OBP and SLG are both derivative stats with overlapping features. By that I mean they are calculated from hits, walks, ABs, etc., and many of the features that go into OBP also go into SLG. As a result they are highly correlated with each other. This can lead to poor results in a regression when the only features used are highly correlated with each other. I haven't done this myself so it may not work, but I would recommend trying a regression using 1B, 2B, 3B, HR, BB, AB, etc. as the covariate features. This should give you cleaner results in your regression as there is no overlap in the features. I'm not sure, but it may be possible to then aggregate the weights from these features to determine the importance of OBP vs. SLG.
While I don't agree with the regression tactic Anonymous says above (Phil, you have done this before with linear weights), I still just don't understand the "point of OBP vs. point of SLG".
The ultimate result becomes misleading when we think about these: that a point of OBP is worth more than a point of SLG.
But what does that actually mean? We know that a 1B (marginal), 2B, 3B, HR are worth more than a BB. Saying that OBP is more important than SLG is nonsense. OBP is nested within SLG, and any point in SLG necessarily increases OBP.
So again, I think this is an ill-conceived question. These are non-mutually exclusive measures. We want to know on-base skill vs. hitting for power, but they just don't measure these things separately. While it might be an interesting regression exercise, I find it to be a pointless endeavor from a practical standpoint.
Phil:
I thought you had found a ratio of about 1.7 when predicting runs, but a larger ratio when predicting wins. So isn't the mystery to be explained here the disconnect between wins and runs? I don't see how this new exercise, which is entirely about runs, can explain that disconnect.
Practically speaking, I think Millsy is right. The focus on determining the "correct" values for OBP and SLG is not helpful. What would be interesting I think would be for someone to replicate the Moneyball study, but instead look at players for whom on-base skills were a large vs. small portion of their value. For example, see if a metric like OBP/wOBA has an impact on a player's salary, separate from his wOBA?
Still, the divergent results you get from using wins vs. runs as the dependent variable is an interesting puzzle....
I agree with all three of you, that it doesn't really make much sense to use OBP and SLG if you really want to do serious analysis. As guy says, I present this as an interesting regression puzzle.
For wins vs runs ... I think that might just be randomness. Not certain, of course. It could also have to do with not normalizing runs for number of outs (bottom of ninth effects). I think the results were similar with/without normalizing outs, but I'm not sure if that would be enough to explain the difference between runs/wins. That wasn't the question I was most concerned about at the time.
The reason the 1.7/2.3ish thing is more important, IMO, is that the 2.3ish is confirmed among multiple eras, so it's not randomness. The 1.7 also has a strong theoretical justification. So, that's a real puzzle that can't be explained by randomness, it would seem.
Also: it's true that when you have two independent variables that correlate with each other, the coefficients are unpredictable. But that would show up in confidence intervals and significance. In this case, we have enough data that the mutual correlation is not a problem, that way. It might be a problem in other ways, but not that one.
"For wins vs runs ... I think that might just be randomness..... That wasn't the question I was most concerned about at the time. The reason the 1.7/2.3ish thing is more important..."
I guess I'm lost. I thought the higher coefficients (over 2) arise precisely when you try to predict wins rather than runs. What "1.7/2.3 thing" are we trying to figure out?
The 1.7 happens only in the one case: individual team runs. All the other cases are higher.
Team runs: 1.7
Opposition runs: 1.9
Combined: 2.1
Combined, both teams' runs equalized to the same number of batting outs: 2.1
I guess that could be random variation, and the question could be, why are wins different from runs?
But ... I'm not sure we really took care of bottom of the 9th issues, here. And, the fact that the results are so different for teams with different walk tendencies, suggests that maybe there's something else happening too.
Had no intention of devaluing the regression puzzle--these are good food for thought--but I think it's difficult (impossible?) to determine when we have hit the *right* relative values, no?
In any case, I am not sure what is all going on.