The OBP/SLG regression puzzle -- Part II
A point of OBP seems to be worth 2+ times as much as a point of SLG, when you run a regression on team performance. But, when you look at the marginal effects on an average team, you get 1.7. Why the difference?
Last post, I suggested two reasons:
1. Different rates of increasing returns on OBP and SLG;
2. Different values of OBP and SLG depending on ratio of singles/walks/power.
My argument then was mostly about #1. But, now, I realize the answer is probably almost all #2. The evidence was there all along.
When Tango did the analysis that got him the 1.7 factor, he showed the OBP and SLG run equivalents for the various events. Here they are:
1B: actual 0.474, estimate 0.485
2B: actual 0.764, estimate 0.786
3B: actual 1.063, estimate 1.087
HR: actual 1.409, estimate 1.389
BB: actual 0.336, estimate 0.313
out: actual -0.302, estimate -0.286
("Actual" refers to the known, accepted values from other methods; "Estimate" refers to the approximation from the OBP/SLG method.)
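As a quick sketch, here's the gap between the actual and estimated run values for each event, with the numbers transcribed from the list above (Python, just for arithmetic):

```python
# Actual linear-weight run values vs. the OBP/SLG-method estimates,
# transcribed from the list above.
values = {
    "1B": (0.474, 0.485),
    "2B": (0.764, 0.786),
    "3B": (1.063, 1.087),
    "HR": (1.409, 1.389),
    "BB": (0.336, 0.313),
    "out": (-0.302, -0.286),
}

for event, (actual, estimate) in values.items():
    diff = estimate - actual
    print(f"{event}: estimate minus actual = {diff:+.3f}")
```

Notice the walk is the most underestimated positive event, and most hit types are overestimated, which fits the walks-versus-hits argument below.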
The values are in the right ballpark, but not exact, and the discrepancies are actually pretty large. Why? And, why are there discrepancies at all?
Because there just isn't a way to get a linear OBP/SLG relationship to be as accurate as one where you look at the underlying events.
It's like ... suppose I create two stats for money. "Bigness" (BGN) reflects whether the bill is $50 or higher. "One-ness" (ONE) reflects whether the value contains a "1" ($1, $10, $100). When I use those instead of the real values, I'm obviously losing accuracy, because I'm eliminating valuable information. (For one thing, a $1 and a $10 look exactly the same to those two stats.)
You can get equal points of BGN and ONE in different ways. For example, "$1 and $50" has the same effect on BGN/ONE as "$100 and $5." If you increase your stats with $100s and $5s, you're going to have more money than your BGN+ONE suggests. If you do it with $1s and $50s, you're going to have less than your BGN+ONE suggests.
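The analogy is easy to make concrete. A minimal sketch (the two toy stats are my own encoding of the definitions above, nothing standard):

```python
def bgn(bill):
    """'Bigness': 1 if the bill is $50 or higher, else 0."""
    return 1 if bill >= 50 else 0

def one(bill):
    """'One-ness': 1 if the bill's value contains the digit 1, else 0."""
    return 1 if "1" in str(bill) else 0

wallet_a = [1, 50]    # $1 and $50
wallet_b = [100, 5]   # $100 and $5

# Both wallets look identical to BGN and ONE...
assert sum(map(bgn, wallet_a)) == sum(map(bgn, wallet_b))
assert sum(map(one, wallet_a)) == sum(map(one, wallet_b))

# ...but they hold different amounts of money.
print(sum(wallet_a), sum(wallet_b))  # 51 vs. 105
```

The two stats throw away the information that distinguishes a $1 from a $100, so any formula built only on BGN and ONE has to get some wallets wrong.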
It's the same for OBP and SLG. You can get a high OBP and SLG in two separate ways: walks, or hits. If you do it with walks, you're going to score more runs than your OBP/SLG suggests. If you do it with hits, you're going to score fewer runs (unless you hit more home runs than any other type of hit, which is unlikely).
Now, that wouldn't be a problem, if all teams had the same walk/hit tendencies. In that case, the regression would just smooth everything out, and the discrepancies would all cancel.
Suppose it's true that teams don't have different tendencies. Then, the 1.7 holds everywhere, and a team's success is linear in (OBP*1.7 + SLG). That means that if you have two teams, one at the league-average .333 OBP and .430 SLG (for a 1.7-weighted total of .996), and the other at (say) .350 and .401 (for the same .996), you'd expect the two teams to have the same BB/hit ratio.
That doesn't seem right, does it? It seems like the .350/.401 team should be hitting more singles and fewer home runs. I mean, not necessarily ... I guess you could come up with a scenario where it hits .200 with lots of walks and power. But that seems unlikely. It seems like the higher the contribution of slugging percentage, the more hits relative to walks.
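As a quick arithmetic check on those two hypothetical teams (a sketch in Python; the weighting function is just the 1.7 rule from above):

```python
# 1.7-weighted OBP/SLG total, per the regression coefficient.
def weighted(obp, slg, k=1.7):
    return k * obp + slg

league_avg = weighted(0.333, 0.430)   # .333 OBP, .430 SLG
other_team = weighted(0.350, 0.401)   # more OBP, less SLG

# Both come out to .996 (to three decimal places).
print(round(league_avg, 3), round(other_team, 3))
```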
And, yes, that's how it is. I ran a regression to predict the walk ratio (BB/(BB+H)) from 1.7-weighted OPS and SLG. The results:
Walk ratio = (0.9295*weighted OPS) - (1.5234*SLG) - 0.267
See? Holding the weighted OPS constant, the higher the SLG, the fewer the walks. Therefore, high-SLG teams will underperform their regression estimate, and high-OBP teams will overperform.
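Plugging the fitted coefficients into the two hypothetical teams from earlier shows the size of the effect. Since both teams sit at a weighted OPS of .996, only the SLG term differs (a sketch, using the coefficients as printed above):

```python
def predicted_walk_ratio(weighted_ops, slg):
    # Coefficients from the regression above.
    return 0.9295 * weighted_ops - 1.5234 * slg - 0.267

# Same weighted OPS (.996), different SLG: only the SLG term changes.
gap = predicted_walk_ratio(0.996, 0.401) - predicted_walk_ratio(0.996, 0.430)
print(round(gap, 3))  # the lower-SLG team is predicted to walk more
```

That works out to about 4.4 points of walk ratio (1.5234 times the .029 difference in SLG) in favor of the lower-SLG team.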
And that's why, when you look at all teams, the regression "notices" that OBP teams are underestimated relative to SLG teams. And so, it moves the OBP coefficient higher, and the SLG coefficient lower.
And that's the answer to why the ratio is higher than the 1.7 we'd otherwise expect.