Sunday, June 02, 2013

The OBP/SLG regression puzzle -- Part IV

(Here's part 1, part 2, and part 3.)


A couple of posts ago, Alex and others suggested that I try a regression to predict runs per game (RPG) instead of winning percentage.  Maybe *that* regression would come out to the OBP/SLG ratio of 1.7 that we've been expecting.  

Jared Cross actually did that, in another comment, and it worked!  Here's my version of Jared's result:

RPG = (16.6*OBP) + (10.0*SLG) - 4.9  [ratio: 1.67]

And the same regression, but for opposition runs per game:

RPG = (17.4*OBP) + (9.4*SLG) - 4.9  [ratio: 1.85]

The two ratios, 1.67 and 1.85, are almost perfectly in line with the expected 1.7!
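If you want to try this kind of regression yourself, here's a minimal sketch in Python.  The function name `fit_obp_slg` and the six team lines are made up for illustration.  The toy RPG values are generated directly from the first equation above, so the fit simply hands those coefficients back; this is not the real team dataset.

```python
import numpy as np

def fit_obp_slg(obp, slg, rpg):
    """Ordinary least squares for rpg ~ a*OBP + b*SLG + c.

    Returns (a, b, c, a/b), where a/b is the OBP/SLG ratio.
    """
    obp = np.asarray(obp, dtype=float)
    slg = np.asarray(slg, dtype=float)
    rpg = np.asarray(rpg, dtype=float)
    # Design matrix: one column per predictor, plus a constant term.
    X = np.column_stack([obp, slg, np.ones_like(obp)])
    coef, *_ = np.linalg.lstsq(X, rpg, rcond=None)
    a, b, c = coef
    return a, b, c, a / b

# Made-up team lines (NOT real data), with RPG built from the
# rounded coefficients quoted in the post.
obp = np.array([0.310, 0.320, 0.330, 0.340, 0.350, 0.325])
slg = np.array([0.380, 0.420, 0.400, 0.450, 0.430, 0.390])
rpg = 16.6 * obp + 10.0 * slg - 4.9

a, b, c, ratio = fit_obp_slg(obp, slg, rpg)
```

With the rounded coefficients, the ratio works out to 16.6/10.0 = 1.66; the 1.67 quoted above presumably comes from the unrounded values.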


So, my first reaction was ... I wasted my time!  All those worries about walks and non-linearity and increasing returns ... well, they weren't necessary.  The issue was just that I used a different variable, winning percentage instead of runs!

But ... actually, on reflection, I don't think that's it.  I think it's just a bit of a coincidence that these regressions work out to 1.7.  Let me give you a couple of intuitive arguments that may or may not convince you.  

First argument: I redid the first regression above, the 1.67 one, but three times, with the dataset split based on walk tendencies (BB/(H+BB)).  "High" and "low" mean one percentage point above or below average; "medium" is everyone else.  The ratios:

High walks:   1.70
Medium walks: 1.25
Low walks:    1.89

It's strange: the teams with average numbers of walks had a much lower ratio than the high-walk and low-walk teams.  But the original 1.7 ratio, the one that Tango got, was based on a perfectly average team.  

So: if this is the answer, that this regression is the right one ... shouldn't Tango's result have come in at 1.25 instead of 1.7?  
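For what it's worth, the split itself can be sketched like this.  The function name `walk_groups` is hypothetical, and I'm assuming "average" means the unweighted mean of the team walk rates, which may not be exactly how the original split was done.  The three teams are made up.

```python
import numpy as np

def walk_groups(bb, h, band=0.01):
    """Label each team 'high', 'medium' or 'low' by walk tendency.

    Walk tendency is BB/(H+BB).  'high' and 'low' are more than `band`
    (one percentage point by default) above or below the mean rate.
    Assumption: the mean is the simple average of the team rates.
    """
    bb = np.asarray(bb, dtype=float)
    h = np.asarray(h, dtype=float)
    rate = bb / (h + bb)
    avg = rate.mean()
    return np.where(rate > avg + band, "high",
                    np.where(rate < avg - band, "low", "medium"))

# Three made-up teams (NOT real data): same hits, different walks.
groups = walk_groups(bb=[400, 500, 600], h=[1500, 1500, 1500])
```

The regression would then be run separately on the teams in each group.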

Second argument: I repeated the regression, but this time I combined the team with its opponents.  That is, I predicted (RPG minus opposition RPG) from (OBP minus opposition OBP) and (SLG minus opposition SLG).

Since we're just subtracting the two equations, you'd expect that the ratio would be somewhere in between 1.67 and 1.85.  But, no:

RPG = (18.73*OBP) + (8.97*SLG)   [ratio: 2.09]

The ratio goes up to 2.09.
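Here's a rough sketch of the team-minus-opposition setup.  The name `fit_differential` and all the numbers fed into it are hypothetical.  I'm also fitting a constant term, which is an assumption on my part, since the equation above doesn't show one.  The toy runs are generated from the 18.73 and 8.97 coefficients, so the fit returns them exactly.

```python
import numpy as np

def fit_differential(obp, slg, rpg, opp_obp, opp_slg, opp_rpg):
    """Regress (RPG - opp RPG) on (OBP - opp OBP) and (SLG - opp SLG).

    Returns (a, b, c, a/b).  The constant c is an assumption; for real
    league-wide differentials it should land close to zero.
    """
    d_obp = np.asarray(obp, dtype=float) - np.asarray(opp_obp, dtype=float)
    d_slg = np.asarray(slg, dtype=float) - np.asarray(opp_slg, dtype=float)
    d_rpg = np.asarray(rpg, dtype=float) - np.asarray(opp_rpg, dtype=float)
    X = np.column_stack([d_obp, d_slg, np.ones_like(d_obp)])
    coef, *_ = np.linalg.lstsq(X, d_rpg, rcond=None)
    a, b, c = coef
    return a, b, c, a / b

# Made-up teams and opponents (NOT real data).
obp = np.array([0.310, 0.335, 0.320, 0.345, 0.330])
slg = np.array([0.390, 0.430, 0.400, 0.440, 0.410])
opp_obp = np.array([0.325, 0.320, 0.330, 0.315, 0.335])
opp_slg = np.array([0.410, 0.400, 0.425, 0.395, 0.405])
# Both sides scored from the same coefficients, so the differential
# follows the quoted equation exactly.
rpg = 18.73 * obp + 8.97 * slg
opp_rpg = 18.73 * opp_obp + 8.97 * opp_slg

a, b, c, ratio = fit_differential(obp, slg, rpg, opp_obp, opp_slg, opp_rpg)
```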

Why should that persuade you that the original 1.67 is coincidence?  You might object that this is a red herring: Tango's analysis didn't include opposition.

But ... it did, in a way.  Tango's logic and numbers work exactly the same way if you include opposition.

Instead of asking, "what happens if we add an event to an average batting line," you can ask, "what happens if we add an event to the zero batting line that's the difference of two average teams."  The calculation is exactly the same either way.  

Specifically, if adding a point of OBP to an average team gives 1.7 times as many additional runs as a point of SLG, then adding a point of OBP to an average team *with a given opponent* should add 1.7 times as many additional runs *over that opponent* as adding a point of SLG.  Right?

But ... here, the results are different.  When we added in the opponent, we got 2.09 instead of 1.67.  And the difference, I will argue next post, is real, not just a random artifact.  Actually, I don't even think the difference has anything to do with baseball.


Talking baseball for now, though ... why are these results so much different from the winning percentage case?  Especially the walk breakdown.  With winning percentage, it seemed like walks increased the ratio, but, when it comes to runs, it seems like sometimes walks increase the ratio and sometimes they decrease it!

Well, it's going to sound like I'm just making this up, but here's my latest theory, for what it's worth:

The linear weights values of the various offensive events are based on an average team.  In real life, their actual values are different for good teams and bad teams, but we just assume the differences don't matter.  And they wouldn't, if they all changed roughly equally, because then the regression would adjust.

But, maybe they don't.

I ran a regression on runs to estimate the linear weights values of the basic events.  Then, I did the same for only the best offenses (ranked roughly by a 1.7-weighted OPS stat, 1.7*OBP + SLG), and for only the worst.  (I know it's not very accurate to use regression for this, but I was too lazy to do play-by-play data, and I only need roughly correct values anyway.)

event      1B    2B    HR    BB    out
average   .53   .71  1.45   .34   -.10
high      .58   .69  1.50   .34   -.13
low       .49   .72  1.47   .34   -.08

Almost all the difference is in singles and outs!  
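To make that concrete, here's a quick bit of arithmetic on the high and low rows of the table.  The values are copied straight from the table; the variable names are just for illustration.

```python
# Run values copied from the table above (rough regression estimates).
events = ["1B", "2B", "HR", "BB", "out"]
high = {"1B": 0.58, "2B": 0.69, "HR": 1.50, "BB": 0.34, "out": -0.13}
low = {"1B": 0.49, "2B": 0.72, "HR": 1.47, "BB": 0.34, "out": -0.08}

# Gap in run value between the best and worst offenses, per event.
gap = {e: round(high[e] - low[e], 2) for e in events}

# Events ordered from the biggest gap to the smallest, ignoring sign.
biggest = sorted(events, key=lambda e: abs(gap[e]), reverse=True)
```

The single is worth .09 more for the best offenses, and the out costs .05 more.  Doubles and home runs move by .03 each, and the walk doesn't move at all.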

In our sample, every team had roughly the same number of outs, so the out value doesn't matter that much.  But, not every team has the same number of singles, relative to the other events.  

For a given (1.7 * OBP + SLG), the team with the higher slugging will probably have more singles.  Therefore, they will score more runs than expected from their 1.7-weighted OPS.  Therefore, the regression will attribute that to the SLG, and weight it higher, reducing the ratio.

This is the reverse of what I thought happened in the "winning percentage" case, which is why you should assume I don't know what I'm talking about.  But, if you're still with me ... well, if I'm going to change my mind, I should probably come up with an explanation of why the cases are different. 

Here's an attempt.  I'll put it in block quotes to emphasize that it's a guess and I'm just throwing it out there ...

The singles thing was happening in the winning percentage case, too, obviously.  However, it was overshadowed by another factor.

When a team has a high SLG, much of that is caused by the park.  In that case, the opposition will also have a higher SLG.  So, much of the high SLG doesn't translate into a higher winning percentage (although it *does* translate into more runs).  

On the other hand, if a team has a high OBP, more of that is "real", since park factors don't affect walks and singles as much as, say, doubles and home runs.  

So that's why walks mattered more when we were looking at winning percentage.  When it comes to runs, walks are taken at face value.  But for winning percentage, we have to give them extra weight, because the relationship between slugging and winning percentage is too tainted by park. 

Is that right?  Who knows.  It sounds plausible.  But, so did everything else I argued earlier.  I'm probably full of crap.  Take this part with a grain of salt.


For the record, though: when we combine a team and its opposition, we get a ratio of 2.33 for winning percentage, and 2.09 for run differential.  Those aren't too far off from each other, and still higher than 1.7.


So, the question remains: if I've convinced you that the 1.7 is coincidence, and the 2.09 matters just as much ... then, why do we still have that difference?  I'm going to back off from specific theories, and stick to generalities.

The relationship between OBP/SLG and runs/wins is non-linear in many ways.  One way is that a point of OBP/SLG has different numbers of events depending on how good the offense is.  Another is that a point of OBP has different proportions of walks/hits depending on slugging.  A third is that the different basic events have different run values depending on offense.  And there are probably a lot more.  

So, the answer to the question is: because there is so much non-linearity going on, there's no reason to expect that the coefficient of the average team will equal the coefficient from a regression of all teams.  
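Here's a toy illustration of that last point.  It has nothing to do with baseball data: the five x values are made up and deliberately lopsided, and y is just x squared.  The idea is only to show that the slope of a straight-line fit through all the points need not match the true slope at the average point.

```python
import numpy as np

# Made-up, deliberately skewed sample; y is an exact non-linear function of x.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = x ** 2

# Slope of the best-fit straight line through all five points.
ols_slope = np.polyfit(x, y, 1)[0]

# True slope of y = x^2 at the average x (the derivative is 2x).
slope_at_mean = 2 * x.mean()
```

The fitted slope comes out to about 10.7, while the true slope at the average x is 6.4.  Same data, same relationship, two different answers.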


Next post, I'm going to make an even stronger argument, one that has nothing to do with baseball: I'm going to argue that the ratio we get from these regressions is almost useless anyway.  Tango's 1.7, for an average team, is meaningful -- but these other regression results that I've been doing, these 1.67s and 2.33s and 2.09s, don't tell us anything useful at all.

Here's that next post.
