The OBP/SLG regression puzzle
In the second "puzzle" in my last post, I noticed that, when you run a regression to predict winning percentage from on-base percentage and slugging percentage, you get that a point of OBP is worth between two and three times as much as a point of SLG. That's different from the consensus value of 1.7 (which Tango derives here). Why the difference?
When I wrote that post, I thought I knew the answer ... but I actually didn't. So I spent the last few days trying to figure it out. I jumped around among a whole bunch of possibilities, and hit a lot of dead ends ... but I think I finally got somewhere. Here's the current state of my thinking.
As usual, I could be wrong. I was wrong a few times in the course of working on all this ...
I think the issue is one of non-linearity. The regression assumes that runs/wins are linear in OBP and SLG, but they might not be. In fact, in a bit, I'll show they're not.
Why does non-linearity matter? Because the "1.7" comes from adding events to an average team, and looking at the marginal impact. If there's linearity, then we know that impact must be the same for all teams. But, if there's not, the impact for an average team doesn't necessarily match the impact for everyone else.
To see why: consider a relationship that's actually cubic:
0, 1, 8, 27, 64, 125, 216
If you do a linear regression to predict x-cubed from x, for those seven values, you get
y = 34x - 39
The regression says that if you increase X by 1, Y increases by 34.
But ... that's not true for the *average* value of X. The average value is 3. The difference between 3.5 cubed and 2.5 cubed isn't 34 -- it's 27.25. (To be more precise, we can take the first derivative of x-cubed at 3, and get 27.)
So the average coefficient is higher than the coefficient at the average.
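The cubic example is easy to check by hand or by machine. Here's a minimal sketch in plain Python (the variable names are mine, just for illustration) that fits the least-squares line to the seven points and compares its slope to the derivative of x-cubed at the average x:

```python
# Ordinary least squares on the seven points (x, x**3), x = 0..6.
xs = list(range(7))
ys = [x ** 3 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = sum of cross-deviations / sum of squared x-deviations;
# the intercept then follows from the two means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Local slope at the average x: the derivative of x**3 is 3 * x**2.
slope_at_mean = 3 * mean_x ** 2

print(slope, intercept, slope_at_mean)  # 34.0 -39.0 27.0
```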
Sometimes the coefficient at the average will be "too high", and sometimes, like now, it will be "too low". I'm not completely sure of the exact conditions for each. The point is, though, that if you have non-linearity, the two coefficients will probably be different.
And that means if you have two variables, the ratio will be different. Suppose you have Y = a^3 + b, where "a" takes the same seven values as before and "b" is unrelated to it. The regression will give you coefficients of 34 and 1. But the values at the average will be 27 and 1. So the ratio is 34 overall, but 27 at the average.
That might be what's happening here. OBP and SLG are non-linear in separate ways, and that changes the ratio from 2.3 overall, to 1.7 at the average.
OBP and SLG are indeed non-linear in a certain way, and you don't need any baseball knowledge to see it. I'm going to show you that they're non-linear, not in terms of *runs*, but in terms of *raw events*.
Suppose you have a batting line, and you want to add walks to raise the OBP by one point (.001). How many walks do you have to add? Well, it depends on your original OBP.
Suppose you're at .333 -- you have 333 "OBs" (walks or hits) in 1000 PA. How do you get to .334? You can't just add one walk, because that only brings you to .333666 (334/1001). It turns out you have to add approximately 1.5015 walks. That brings you to 334.5015 OBs in 1001.5015 PA, which brings you to .334.
But, now, suppose you're at .400, and you want to get to .401. How many walks do you have to add now? This time, it's 1.66945. 401.66945 divided by 1001.66945 equals .401.
(I did a little algebra to get the formula that, for every 1000 PA you start with, the number of additional walks you need is (1 divided by (.999 minus OBP)). That's where those two numbers came from: 1/.666 and 1/.599.)
That is: points of OBP give you increasing returns *in terms of number of walks*. So the more OBP you have, the more each additional point is worth. Or, if you want to put it another way, the more OBP you have, the harder it is to get another point, because you need more walks to get it.
Again, this is not a baseball observation. The same thing applies to, say, games of gin rummy. If you're at .333 and you want to get to .334, you only need to win your next 1.5015 games. If you're at .400 and you want to get to .401, though, you have to win your next 1.66945 games.
So: as we saw, a point of OBP offers a higher return when OBP is already high. That, by itself, is enough to make the coefficient of OBP different from the marginal value *for an average team*, which is where the 1.7 came from.
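To double-check the walk counts, you can solve (OB + w) / (PA + w) = target directly for w. A small sketch, where the function name and arguments are hypothetical and only walks are being added:

```python
def walks_needed(ob, pa, points=1):
    """Walks w such that (ob + w) / (pa + w) is `points` thousandths above ob / pa."""
    target = ob / pa + points / 1000
    # Rearranging (ob + w) / (pa + w) = target gives w * (1 - target) = target * pa - ob.
    return (target * pa - ob) / (1 - target)

print(round(walks_needed(333, 1000), 4))  # 1.5015
print(round(walks_needed(400, 1000), 5))  # 1.66945
```

Starting from 1000 PA, the numerator works out to 1 in both cases, so the answer depends only on how far the target OBP is from 1.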
But ... what about SLG? If SLG also offers increasing returns, its coefficient will vary, too. If it varies the same way, we should still get 1.7!
Yes, indeed. But, who knows if SLG *does* have increasing returns? And, if it does, who knows whether they're exactly equal to the increasing returns of OBP? That would be quite a coincidence, wouldn't it?
Since we have no reason to expect OBP and SLG to offer the exact same distortion caused by increasing returns ... we have no reason to expect the ratio OBP/SLG to be exactly 1.7.
This doesn't explain why it's at the level it's at -- "slightly higher than 1.7," we could call it. From the logic we've seen so far, it could be anything: lower than 1.7, higher than 1.7, much different, a little different, whatever.
But: that's why, in theory, it won't be exactly 1.7. If that's all you're looking for, an explanation of why it *could* be different, there it is. I'm going to keep going, but it gets boring and technical and long for the next bit.
OK, so we talked about adding a point of OBP by adding walks. Now, let's talk about adding a point of SLG by adding an extra base.
Adding extra bases doesn't change the denominator of SLG (which is at-bats). So, if you want to add one point of SLG where there's 1,000 AB, you can just add one extra base. Change a double to a triple, or something.
But: the denominator, the number of AB, is not the same for every team. The more AB you have, the more valuable a point of SLG. At 1,000 AB, you need only 1 extra base. At 1,020 AB, you need 1.02 extra bases, which is 2% more valuable.
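The arithmetic here is simple because AB doesn't move when you only add bases. A tiny sketch (the function name is hypothetical):

```python
def extra_bases_needed(ab, points=1):
    """Extra bases that raise SLG by `points` thousandths when AB stays fixed."""
    return ab * points / 1000

for ab in (1000, 1020):
    print(ab, extra_bases_needed(ab))
```

Unlike the walks case, the answer doesn't depend on the starting SLG at all, only on the number of AB.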
AB is hits plus outs. In our regression, every team has roughly the same number of outs (since we did full seasons only), so the only difference is hits. So, the more hits a team has, the more valuable a point of SLG from extra bases. And hits correlates highly with OPS.
So: the more OPS a team has, the more valuable a point of SLG. But ... well, it's a weak increase, compared to the OPS increase. I'm almost willing to call this one linear.
What about adding a point of SLG by adding a single? That's different, because a single affects both SLG and OBP. So, we need to do this in two steps: we add enough singles to raise SLG by a point, and then subtract enough walks to lower OBP back to where it was before.
How many singles do we have to add to raise SLG by a point? That's the same formula as for how many walks we had to add to OBP. For 1,000 AB, it's
1 divided by (.999 minus SLG)
That increases OBP by that many "events", so we subtract that exact number of walks, and OBP is back to where it was before. (Effectively, we've just converted walks to singles at the exact rate that OBP stays the same, but SLG goes up a point.)
The increase in runs is, therefore,
[1 / (.999 - SLG)] * [value of single - value of walk]
We're assuming singles and walks have constant value -- .47 and .34, say -- so we get that adding one point of SLG adds
+.13 * [1 / (.999 - SLG)] runs.
That's a higher number when SLG is higher, so we see that a point of SLG also has increasing returns. (I'm not going to try to figure out by how much.)
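Here's a rough sketch of that walks-for-singles swap, solving for the number of singles directly rather than relying on a closed-form formula. The function names and the three sample SLG levels are made up for illustration; the .47 and .34 run values are the ones assumed above.

```python
RUN_SINGLE = 0.47  # run value of a single, as assumed in the post
RUN_WALK = 0.34    # run value of a walk, as assumed in the post

def singles_for_slg_point(tb, ab):
    """Singles s such that (tb + s) / (ab + s) is one point (.001) above tb / ab."""
    target = tb / ab + 0.001
    return (target * ab - tb) / (1 - target)

def runs_from_slg_point_via_singles(tb, ab):
    """Run gain from turning that many walks into singles (OBP stays put)."""
    return singles_for_slg_point(tb, ab) * (RUN_SINGLE - RUN_WALK)

# Hypothetical teams with 1000 AB and SLG of .350, .400, .450.
for total_bases in (350, 400, 450):
    print(total_bases, round(runs_from_slg_point_via_singles(total_bases, 1000), 4))
```

The run gain climbs as the starting SLG climbs, which is the increasing-returns pattern described above.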
The last case is adding a point of OBP by singles (and leaving SLG alone).
How many singles do we need to add? Same formula:
[1 / (.999 - OBP)]
But, that will also increase SLG, so we have to subtract enough "extra bases" from SLG to bring it back to where it was before.
Adding singles increased total bases by the same number as it increased AB. But, to keep slugging the same when adding AB, we need to increase total bases by only SLG times the number of added AB. So, we need to subtract (1-SLG) total bases for each single added.
That is, we need to subtract
[1 / (.999 - OBP)] * (1-SLG) bases in total.
Combining the two steps gives a batting line change of
[1 / (.999 - OBP)] cases of "add one single, and subtract (1-SLG) bases".
Assigning run values here -- say, .47 for a single, and .26 for a base -- gives a run increase of
[1 / (.999 - OBP)] * (.47 - .26 (1-SLG))
That gives increasing returns in OBP, and also increasing returns in SLG. Again, I'm not going to try and quantify which is bigger.
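To see that the two-step adjustment really does raise OBP by a point while leaving SLG alone, here's a sketch on a made-up batting line. The line itself (900 AB, 240 H, 100 BB, 360 TB) is hypothetical, and I'm simplifying PA to AB plus walks; the .47 and .26 run values are the ones assumed above.

```python
RUN_SINGLE = 0.47  # run value of a single, as assumed in the post
RUN_BASE = 0.26    # run value of one extra base, as assumed in the post

def obp_point_via_singles(h, bb, ab, tb):
    """Return (singles added, bases removed, run gain) for +.001 OBP at constant SLG."""
    pa = ab + bb  # simplification: PA = AB + BB only
    obp, slg = (h + bb) / pa, tb / ab
    target = obp + 0.001
    # Solve (h + bb + s) / (pa + s) = target for the number of singles s.
    singles = (target * pa - (h + bb)) / (1 - target)
    # Each added single brings one AB and one base; SLG only "needs" slg bases per AB.
    bases_removed = singles * (1 - slg)
    runs = singles * (RUN_SINGLE - RUN_BASE * (1 - slg))
    return singles, bases_removed, runs

# Hypothetical batting line: OBP .340, SLG .400.
h, bb, ab, tb = 240, 100, 900, 360
singles, bases_removed, runs = obp_point_via_singles(h, bb, ab, tb)

new_obp = (h + singles + bb) / (ab + singles + bb)
new_slg = (tb + singles - bases_removed) / (ab + singles)
print(round(new_obp, 4), round(new_slg, 4), round(runs, 4))
```

Trying other starting lines shows the run gain rising with both OBP (more singles needed) and SLG (fewer bases given back per single).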
Those are the only four cases I see of how to increase one of OBP and SLG at the margin. (For extra-base hits, you just add the two cases -- add singles, and then add extra bases. The math works out the same.)
That means, in terms of increasing returns, we have:
Increase SLG by bases -- roughly linear
Increase SLG by hits -- increasing in SLG
Increase OBP by walks -- increasing in OBP
Increase OBP by hits -- increasing in OBP and SLG
So, some ways are increasing in OBP, and some in SLG, and ... it looks like OBP and SLG are represented roughly equally. It looks like we should expect a ratio that's not too far from 1.7. It might not be *exactly* 1.7, but our gut says it should be not too different. Which is about right -- it's in the 2s.
This is all theory. Is there any evidence we can look at?
Well, it looks like teams with lots of walks should be different from teams with lots of hits. The walking teams should see lots of increasing returns in OBP, and so a higher ratio. And the hitting teams should see lots of increasing returns in SLG, and so a lower ratio.
So, I repeated the regression, but included only teams who were at least two percentage points higher than normal in their BB/H ratio. This is the regression for those teams:
wpct = 2.69 OBP + 1.05 SLG - .845 (ratio: 2.5)
And for teams who walked two percentage points *less* than normal:
wpct = 1.62 OBP + 1.03 SLG - .491 (ratio: 1.6)
So, that seems to support the theory! More walks = higher ratio, as hypothesized.
The results are similar if I use other point thresholds for higher/lower than average:
0 points: 2.6 low, 3.7 high
1 point : 2.0 low, 4.6 high
2 points: 1.6 low, 2.5 high
3 points: 0.8 low, 1.5 high
4 points: 7.1 low, 3.6 high
(The theory seems to fail in the extreme case ... but it's probably sample size. If you up the SLG coefficient by 2 SDs, the ratio drops from 7.1 all the way to 1.6.)
Overall, I'd say, the test seems to support the theory.
OK, now the bad news. I don't think this is the real answer. Yes, I think it's all correct, but I suspect the effect is much too small to make such a big difference, from 1.7 to 2.3.
Also, another explanation occurred to me, one that seems bigger:
Walks get lumped in with singles in OBP. Extra bases get lumped in with singles in SLG. Which is worth more: a single, or the exact number of walks and extra bases that have the same impact on OBP and SLG? Whichever is worth more, if the good teams get more of that one relative to the other, that will show up in a higher coefficient. If the good teams get fewer of that one, the coefficient would be lower.
It seems to me that the effect from this last explanation would be bigger. Further research required, I guess.