Wednesday, May 29, 2013

The OBP/SLG regression puzzle -- Part III

I ran regressions in the previous posts to predict winning percentage from on-base percentage and slugging.  In those regressions, I had adjusted all teams to the league SLG and OBP.  I had to.  If you don't adjust, the results vary a lot.

Here's the completely unadjusted regression, using all teams from 1961 to 2009 except strike seasons.  (I'll put the OBP/SLG coefficient ratio in brackets after each equation.)

wpct = (2.19*OBP) + (0.07*SLG) - .2405  [ratio: 30]

That's an OBP/SLG ratio of over 30!  We were expecting 1.7.  It seems like slugging barely matters at all!

Compare that to the "regular" regression, which adjusts for league-season: 

wpct = (2.70*OBP) + (0.89*SLG) - .7843   [ratio: 3]

OK, that's a bit better.  The ratio is down to 3.

Guy argued, in the comments to the first post, that I need to adjust for park, too.  He's right.

If I change winning percentage to what it would be if the team had posted those stats in a neutral park -- while still keeping the league adjustment -- I get this equation: 

wpct = (2.65*OBP) + (1.09*SLG) - .8504  [ratio: 2.43]

Even better: we're down to 2.43!
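
For anyone who wants to check the arithmetic, here's a minimal sketch that recomputes the bracketed ratios from the coefficients in the three equations so far.  The dictionary labels are my own shorthand, not anything official.

```python
# OBP and SLG coefficients copied from the three regressions above.
# The labels are just shorthand for this sketch.
fits = {
    "unadjusted": (2.19, 0.07),
    "league-adjusted": (2.70, 0.89),
    "league- and park-adjusted": (2.65, 1.09),
}

# The "ratio" is simply the OBP coefficient divided by the SLG coefficient.
ratios = {name: obp / slg for name, (obp, slg) in fits.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```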

An easier way might be just to not adjust anything, but include the league and park in the regression:

wpct = (2.63*OBP) + (1.15*SLG) - (2.58*league OBP) - (1.16*league SLG) - (0.0029* BPF) + 0.091  [ratio: 2.1]

Now, the ratio is all the way down to 2.1.

What's going on?

This one's pretty simple.  When a team has a high OBP or SLG, it's a combination of two things:

-- batting talent, and
-- a high run environment for the league and park.

The first one actually has an impact on winning percentage.  The second one doesn't.  A high SLG doesn't help you if it's caused by the park, because the opposition benefits from it too.

The same is true for OBP.  But ... SLG should be affected *more*.  There are more high-HR and low-HR parks than there are, say, low-walk parks.  The "steroids era" was mostly home-run related.  

Comparing 1968 to 2001:

1968: OBP .299, SLG .340
2001: OBP .332, SLG .427

OBP increased 11 percent, but SLG increased 26 percent.
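
Those percentages come straight from the two league lines above.  A quick sketch of the calculation (the function name is just for illustration):

```python
# League averages quoted above.
obp_1968, slg_1968 = 0.299, 0.340
obp_2001, slg_2001 = 0.332, 0.427

def pct_increase(old, new):
    """Percentage increase from old to new."""
    return (new / old - 1) * 100

obp_change = pct_increase(obp_1968, obp_2001)
slg_change = pct_increase(slg_1968, slg_2001)

print(f"OBP: +{obp_change:.0f}%")
print(f"SLG: +{slg_change:.0f}%")
```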

So, when you don't adjust, slugging doesn't appear to matter as much, because more of the variation in SLG comes from the run environment, which benefits the opposition too.  That makes OBP look more important, relatively speaking.

(All credit for this finding goes to Guy ... he actually explained all this to me in his comment.)
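
The argument can be illustrated with a small simulation.  This is only a sketch under made-up assumptions, not the real data: every number in it (the "true" weights of 2 and 1, the talent and environment spreads, the noise level) is invented for illustration, and the environment effects on OBP and SLG are assumed independent.  The only features that matter are that environment doesn't affect winning percentage, and that it moves SLG more than it moves OBP.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # simulated team-seasons (made up, not real data)

# Hypothetical "true" weights, chosen so the real ratio is exactly 2.
TRUE_OBP, TRUE_SLG = 2.0, 1.0

# Batting talent, relative to the run environment.
obp_talent = rng.normal(0.0, 0.010, n)
slg_talent = rng.normal(0.0, 0.020, n)

# League/park environment.  It lifts both teams, so it never enters wpct.
# Assumption: it moves SLG much more than it moves OBP.
obp_env = rng.normal(0.0, 0.008, n)
slg_env = rng.normal(0.0, 0.025, n)

# Only talent wins games (plus some luck).
wpct = 0.5 + TRUE_OBP * obp_talent + TRUE_SLG * slg_talent + rng.normal(0.0, 0.03, n)

# What actually shows up in the raw stat line.
obp_raw = 0.325 + obp_talent + obp_env
slg_raw = 0.400 + slg_talent + slg_env

def obp_slg_ratio(obp, slg):
    """Fit wpct = a + b*OBP + c*SLG by least squares and return b/c."""
    X = np.column_stack([np.ones(n), obp, slg])
    coef, *_ = np.linalg.lstsq(X, wpct, rcond=None)
    return coef[1] / coef[2]

ratio_raw = obp_slg_ratio(obp_raw, slg_raw)
ratio_adj = obp_slg_ratio(obp_raw - obp_env, slg_raw - slg_env)

print(f"unadjusted ratio: {ratio_raw:.2f}")
print(f"adjusted ratio:   {ratio_adj:.2f}")
```

With these invented spreads, the adjusted ratio should land close to the true value of 2, and the unadjusted one noticeably higher.  The environment noise waters down both coefficients, but it waters down SLG more.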


As you'd expect, the problem goes away when you combine team offense with opposition offense in the same regression.  Even without adjusting, you don't have a big problem, because both teams are affected the same way.  

I used the *differences* between team OBP/SLG and opposition OBP/SLG, without any adjustment, and got:

wpct = (2.09*OBP) + (0.897*SLG) + .5  [ratio: 2.33]

That's a ratio of 2.33.
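
Here's a sketch of why differencing works, again on simulated data rather than the real thing.  All the numbers are assumptions chosen for illustration (true weights of 2 and 1, invented spreads).  The team and its opposition share the same environment, so it cancels when you subtract.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000  # simulated team-seasons (made up, not real data)

# Hypothetical "true" weights on the team-minus-opposition differences.
A_OBP, B_SLG = 2.0, 1.0

# Shared league/park environment, felt equally by both sides.
env_obp = rng.normal(0.0, 0.008, n)
env_slg = rng.normal(0.0, 0.025, n)

def observed(base, env, talent_sd):
    """Raw stat = baseline + shared environment + that side's own talent."""
    return base + env + rng.normal(0.0, talent_sd, n)

team_obp = observed(0.325, env_obp, 0.010)
opp_obp = observed(0.325, env_obp, 0.010)
team_slg = observed(0.400, env_slg, 0.020)
opp_slg = observed(0.400, env_slg, 0.020)

# Unadjusted differences: the environment terms subtract out.
d_obp = team_obp - opp_obp
d_slg = team_slg - opp_slg

wpct = 0.5 + A_OBP * d_obp + B_SLG * d_slg + rng.normal(0.0, 0.03, n)

X = np.column_stack([np.ones(n), d_obp, d_slg])
coef, *_ = np.linalg.lstsq(X, wpct, rcond=None)
intercept, b_obp, b_slg = coef

print(f"intercept: {intercept:.3f}")
print(f"ratio:     {b_obp / b_slg:.2f}")
```

In this setup the intercept should come out near .500, like the "+ .5" in the equation above, since a team that exactly matches its opposition should break even.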


But why do we need to care about the opposition at all?  Commenter Alex suggested that if we try to predict "runs per game" instead of "winning percentage," we'll get even better results, because the opposition won't matter.  

I'm checking that out for a future post.


Update: that future post, part IV, is here.



At Friday, May 31, 2013 7:22:00 PM, Blogger Cyril Morong said...

Interesting. You used seasons from 1960 to 2011. Where did you get the opposition OBP and SLG from?

At Friday, May 31, 2013 9:49:00 PM, Blogger Cyril Morong said...

Another thing that might matter is that the error rate changes over time. So a given OBP & SLG in 1960 might lead to more runs being scored than in 2011, since more errors were made back in 1960.

At Saturday, June 01, 2013 10:08:00 AM, Blogger Pete Ridges said...

That "30x more important!" thing... A problem is that OBP and SLG are strongly correlated with each other. In particular, if you had two variables that were perfectly correlated, then you simply would not be able to tell which of them was responsible for levels of a third variable. That means that if you were to change the data slightly, say by giving the 2001 Mariners another 50 points of SLG, then you could perhaps get a very different answer.
A more usual way of measuring connections is to ignore regression and do correlations: "W-L vs OBP" then "W-L vs SLG".

At Saturday, June 01, 2013 12:13:00 PM, Blogger Phil Birnbaum said...


I got the opposition stats by going through Retrosheet play-by-play. I only did them to 2009; that's what I'm using for parts II and III.

Good point about the errors, thanks!


I think the correlation between the two variables shows up in the confidence intervals and significance tests. There's enough data that the regression is pretty good at figuring out what's really going on, in this case, because the standard errors are fairly small.

At Saturday, June 01, 2013 2:37:00 PM, Blogger Cyril Morong said...


Here is an example of how the error rate can affect things. The error rate is 1 - fielding pct. The regression below shows runs per game as a function of OBP & SLG for each season of the NL from 1920-2012 (I used the whole league instead of teams)

R/G = 22.77*OBP + 6.7*SLG - 5.68

Now what if we add in the error rate? The regression becomes

R/G = 15.75*OBP + 9.7*SLG + 15.69*ERATE - 4.94

The relative value of OBP & SLG changed quite a bit. But this is for a whole league. The ERATE applies to the whole league.

I have used the ERATE and applied it to teams. That assumes that the rate of errors made against each team is the same. Not totally realistic, but that is what I have. So if the ERATE was .02 one year in a league, every team got that rate.

I did all teams from 1920-1998. Here are the two regressions

R/G = 19.89*OBP + 9.79*SLG - 5.95

R/G = 17.63*OBP + 10.7*SLG + 13.51*ERATE - 5.87

So again the relative values of OBP & SLG change

