The OBP/SLG regression puzzle -- Part V
(Links: Part I, Part II, Part III, Part IV)
When you run a regression to predict a team's runs per game based on OBP and SLG, you get that a point of OBP is about 1.7 times as important as a point of SLG. When you predict *opposition* runs per game, you get 1.8. But, when you try to predict run differential based on OBP differential and SLG differential, you get 2.1.
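(Just to make the setup concrete, here's a rough Python sketch of how the three regressions might be run. The file and column names are my placeholders, not the actual dataset's.)

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and columns: obp, slg, rpg, plus opposition versions of each.
df = pd.read_csv("team_seasons.csv")

def fit(y, X):
    """OLS of y on X (with an intercept); returns the fitted coefficients."""
    return sm.OLS(y, sm.add_constant(X)).fit().params

teams = fit(df["rpg"], df[["obp", "slg"]])
opp = fit(df["opp_rpg"], df[["opp_obp", "opp_slg"]])
diff = fit(df["rpg"] - df["opp_rpg"],
           pd.DataFrame({"d_obp": df["obp"] - df["opp_obp"],
                         "d_slg": df["slg"] - df["opp_slg"]}))

print("teams ratio:", teams["obp"] / teams["slg"])            # about 1.7
print("opposition ratio:", opp["opp_obp"] / opp["opp_slg"])   # about 1.8
print("differences ratio:", diff["d_obp"] / diff["d_slg"])    # about 2.1
```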
Why the difference?
It all hinges on the idea that the relationship between runs and OBP/SLG is non-linear.
Suppose there are seasons where team OBP and SLG are higher than normal -- steroid years, say. And suppose those are the exact seasons where a single point of OBP or SLG is worth more in terms of runs. That's not farfetched -- an offensive event seems like it would be worth more in years when there's more offense. When there's lots of hitting, a walk has a better chance to score, and a double has more men on base to drive in.
So, it's a double whammy. OBP/SLG are higher, and each point of OBP/SLG is worth more. It's almost like a "squared" relationship, which is non-linear.
-----
A good analogy would be something like, tickets sold vs. ticket revenue. On one level, it's linear, because for each extra ticket you sell, your revenue goes up $25 or whatever. But, then, the second whammy: attendance is much higher now than it was in the sixties. So, if you sell a lot of tickets, it's more likely you're in 2011 than 1966, which means that, along with your higher attendance, you also have higher-priced tickets! So, more tickets means more revenue because of more sales, but also more revenue because of more revenue *per ticket*.
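(Here's a toy simulation of that double whammy -- the numbers are completely made up, but the pattern is the point: because high-attendance seasons are also high-price seasons, a plain regression of revenue on tickets sold gives a per-ticket coefficient well above the price in any typical year.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up trend: attendance and ticket prices both rise over the decades.
years = np.arange(1966, 2012)
tickets = 10_000 + 650 * (years - 1966) + rng.normal(0, 2_000, len(years))
price = 5 + 0.45 * (years - 1966)       # roughly $5 in 1966, $25 in 2011
revenue = tickets * price

# OLS slope of revenue on tickets: covariance over variance.
slope = np.cov(tickets, revenue)[0, 1] / np.var(tickets, ddof=1)
print(f"fitted dollars per ticket: {slope:.2f}")
print(f"average price across the years: {price.mean():.2f}")
# The fitted slope lands far above the average price, because extra tickets
# mostly show up in the years when each ticket is also worth more.
```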
Now, what happens when you switch to *differences*, so you're predicting "revenue over opposition" based on "tickets over opposition"?
Well, that depends. The original source of the "double whammy" was that teams with high tickets sold were more likely to have higher ticket prices. Is that still true for teams with high *differences* of tickets sold?
Maybe, or maybe not. Suppose you have two teams: team A from 2002, which averaged 40,000 tickets, and team B from 1964, which averaged 10,000 tickets. That's a ratio of 4:1 -- four times as many tickets sold in the era when per-ticket prices (the analogue of per-point OBP/SLG values) were high.
After you take the difference, does the 4:1 ratio change? If it stays at 4:1, nothing happens. If it moves higher -- maybe team A outdrew its opposition by 10,000, but B outdrew its opposition by only 1,000, for a ratio of 10:1 -- the relationship becomes even more non-linear. If it moves lower -- team A outdraws by 2,000, team B outdraws by 1,000, for a 2:1 ratio -- the relationship becomes *less* non-linear.
(Actually, I'm not sure if I should be dividing (to get the ratio, which I did) or subtracting (to get the raw difference) or something else. But this is just an intuitive explanation anyway.)
Since there is no reason to expect that the differences will have the exact same ratio as the original attendance figures, we are almost *assured* that the nature of the relationship will change. And, since, as a general rule, the coefficient increases with non-linearity, we expect the coefficients to change.
So it was a mistake, on my part, to originally assume that the "difference" regression should have the same ratio as the "single-team" regression.
-----
BTW, that was the "non-baseball" explanation that I promised you. If tickets is still too basebally, just substitute any non-baseball relationship where a high X is correlated with a high value per X. What works well is a time series featuring some commodity that sells more now, when per-unit prices are obviously higher because of inflation.
Like, say, Starbucks coffee, or bicycle helmets. If you want a "decreasing returns" example, I bet cigarettes are a perfect one -- fewer cigarettes sold is strongly associated with higher prices per cigarette.
------
Now, to the actual baseball data.
For the MLB teams and their oppositions in the dataset, we need to know: when we look at *differences* in OBP and SLG instead of the actual values, do high differences still correlate with times when individual points are more valuable? That is: do the ratios get wider, or narrower? It's hard to know exactly, but we can get a rough idea by looking at the spread of the data before and after.
Let's start with OBP.
For the seasons in the study, the range of team OBP was .273 to .372, which is 99 points. For opposition OBP, the range was 103 points. In terms of standard deviations, it was .015 for the offenses, and .016 for the opposition offenses.
But, for the differences, the spread was wider. The range was 123 points, and the SD was .019.
Teams: spread 99 points, SD .015
Oppositions: spread 103 points, SD .016
Differences: spread 123 points, SD .019
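(If you want to reproduce that kind of check, it's just ranges and standard deviations of the three columns -- something like this, again with placeholder names:)

```python
import pandas as pd

df = pd.read_csv("team_seasons.csv")   # hypothetical file with obp and opp_obp columns

for label, col in [("Teams", df["obp"]),
                   ("Oppositions", df["opp_obp"]),
                   ("Differences", df["obp"] - df["opp_obp"])]:
    spread = col.max() - col.min()
    print(f"{label:12s} spread {spread * 1000:.0f} points, SD {col.std():.3f}")
```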
That makes sense ... when you subtract opposition hitting, it's like adding team pitching. The spread should be wider because if you take a team with awesome hitting, and they also have awesome pitching, they stick out from the average twice as much.
In general, when you subtract one variable from another, if they're independent (or have a negative correlation), the SD and spread increase. If hitting and opposition hitting were, in fact, independent, you'd expect the SD to increase by a factor of root 2 -- from .015 to .021. It only increased to .019, because hitting and opposition hitting are, in fact, somewhat positively correlated. They both are affected by whether you play in hitters' parks or pitchers' parks. They also depend on what era you play in.
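(To put a rough number on "somewhat correlated": the SD of a difference follows SD_diff^2 = SD_A^2 + SD_B^2 - 2*r*SD_A*SD_B, so you can back out the implied correlation from the three SDs. This is back-of-envelope arithmetic on the rounded figures above, so treat it as approximate.)

```python
import math

sd_team, sd_opp, sd_diff = 0.015, 0.016, 0.019   # the OBP figures quoted above

# If the two were independent (r = 0):
print(f"SD if independent: {math.sqrt(sd_team**2 + sd_opp**2):.4f}")   # about .022

# Solve sd_diff^2 = sd_team^2 + sd_opp^2 - 2*r*sd_team*sd_opp for r:
r = (sd_team**2 + sd_opp**2 - sd_diff**2) / (2 * sd_team * sd_opp)
print(f"implied correlation: {r:.2f}")   # about 0.25
```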
Still, taking differences seems to have increased the spread of OBP. We don't know for sure that the new difference numbers are still correlated with high values for each point of OBP, but it seems reasonable to expect that they are.
-------
What about SLG?
Well, the range for teams was .301 to .491 (190 points). The range for opposition teams was .306 to .499 (193 points).
But -- surprisingly -- the range for (team minus opposition) was almost the same. It was 194 points.
And the SD of the differences was *smaller* than the SDs of the originals. The teams were .034, the opposition was .033, but the SD of the differences was only .031.
Teams: spread 190 points, SD .034
Opposition: spread 193 points, SD .033
Differences: spread 194 points, SD .031
Why is the SD actually *lower* when you combine the two variables?
Well, it's the same argument as for OBP: they're correlated with each other. But, it seems, SLG correlates much more highly than OBP did. Probably, park and era effects are more slugging-related than on-base-related -- after all, parks are better known for home runs than for walks, say.
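(Plugging the SLG figures into the same back-of-envelope formula bears that out -- again, just arithmetic on the rounded SDs, so approximate:)

```python
sd_team, sd_opp, sd_diff = 0.034, 0.033, 0.031   # the SLG figures quoted above

# Same identity: sd_diff^2 = sd_team^2 + sd_opp^2 - 2*r*sd_team*sd_opp
r = (sd_team**2 + sd_opp**2 - sd_diff**2) / (2 * sd_team * sd_opp)
print(f"implied SLG correlation: {r:.2f}")   # roughly 0.6, vs. roughly 0.25 for OBP
```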
I wasn't expecting those effects to be so huge ... this seems to suggest that environment may be *more important* than team talent, at least for raw SLG!
So, it would be reasonable to assume that, when we switch to differences -- team SLG minus opposition SLG -- the "non-linearness" of the SLG-to-runs relationship stays about the same, or decreases a bit.
-------
What have we found? We found that when we use "team minus opposition," we're increasing the non-linearity of OBP, but *decreasing* the non-linearity of SLG. So, we'd expect the OBP coefficient to increase, and the SLG coefficient to decrease. And that's what happens:
Teams: 16.62 OBP, 9.98 SLG [ratio: 1.67]
Opposition: 17.38 OBP, 9.38 SLG [ratio: 1.85]
Differences: 18.73 OBP, 8.97 SLG [ratio: 2.09]
And that explains why the ratio for differences is higher than the ratio for individual teams.
By the way: in the above table, every higher coefficient in a column is associated with a higher SD of that variable, and every lower coefficient in a column is associated with a lower SD of that variable. That doesn't have to be the case, but it's more likely that way than any other way, I would think.
Labels: baseball, OBP, regression, SLG
1 Comment:
You know, I just thought of another way you could test all these issues. For the regressions you've been running, you can run it the way you have with OBP and SLG as two separate variables. Then create a third variable, which is OBP/SLG. Run a new regression with whatever DV you're looking at but only using OBP/SLG as the predictor. Then see what the confidence interval is, and how different the fit is from your first regression. If none of the fits are that different and the CIs include or are close to 1.7, I would say you're just looking at noise.