Friday, April 20, 2012

Don't use regression to calculate Linear Weights, Part II

I've written a couple of times now about why it's better to calculate linear weights via the "traditional" play-by-play method than by regression. That was all theory ... this time, I figure I might as well go into more detail, including doing the work and showing the real numbers.

I ran the play-by-play method using all major league games from 1990 to 1998. You probably know this already, but here's how the method works:


1. Look through every baseball game for the period you're interested in (probably by computer, through Retrosheet). For every plate appearance, note the number of outs, bases occupied, and the number of runs that scored in the remainder of that inning.

2. For each of the 24 game states (number of outs, bases occupied), group all the observations and average them, giving you the average number of runs scored in the remainder of that inning.

For instance: from 1990 to 1998, there were 27,580 instances of a runner on second and nobody out. A total of 31,749 runs scored in the remainder of those innings, for an average of 1.151 runs.

There were 9,090 instances of runners on first and third with no outs. The average there was 1.807 runs.

3. Once you have those 24 averages, go through every baseball game again. Find every single. Look at the game state before the single, and after the single. Calculate the difference in run expectation between the two states. The difference, plus the number of runs scored on the play, is what that single was worth.

For instance: with nobody out and a runner on second, the batter singles, runner to third. The value of that single is 1.807 minus 1.151, which is 0.656. So a single in that situation is worth 0.656 runs, on average.

Repeat for every single, and average out all the values. That's your answer, your linear weight for the single.

4. Repeat step 3 for every other event: doubles, home runs, walks ... whatever you want to calculate the linear weight for.
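Putting the four steps together, here's roughly what the computation looks like in Python. This is a sketch under my own assumptions about the data layout, not my actual program -- the tuple format and field names are my own invention, not Retrosheet's:

```python
# A minimal sketch of the play-by-play method. It assumes the Retrosheet
# events have already been parsed into tuples of the form
# (outs, bases, runs_rest_of_inning, event, new_outs, new_bases, runs_on_play).
from collections import defaultdict

def run_expectancy(plays):
    """Steps 1-2: average runs scored in the rest of the inning, by state."""
    totals, counts = defaultdict(float), defaultdict(int)
    for outs, bases, runs_rest, *_ in plays:
        totals[(outs, bases)] += runs_rest
        counts[(outs, bases)] += 1
    return {state: totals[state] / counts[state] for state in counts}

def linear_weight(plays, re, event_type):
    """Steps 3-4: average change in run expectancy, plus runs scored."""
    values = []
    for outs, bases, _, event, new_outs, new_bases, runs_on_play in plays:
        if event != event_type:
            continue
        before = re[(outs, bases)]
        # An inning-ending state (3 outs) has zero expected runs remaining.
        after = 0.0 if new_outs == 3 else re[(new_outs, new_bases)]
        values.append(after - before + runs_on_play)
    return sum(values) / len(values)

# re = run_expectancy(plays)
# lw_single = linear_weight(plays, re, "single")
```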

The logic of this method, I think, is very solid. Because we looked at all the data for those years, it is a *fact* -- not an estimate -- that on average, 1.151 runs were scored with a runner on second with nobody out. It is a *fact* that an average 1.807 runs were scored with runners on first and third with nobody out. Therefore, there is a very strong presumption that a single that moves the runner to third is worth .656 runs.

Technically, the value of the single is only a very close estimate. The reason is that we used three different samples. The 1.151 figure is from a group of 27,580 teams that put a runner on second with nobody out. The 1.807 figure is from a *different* group of 9,090 that put runners on first and third with nobody out. And, the singles we're investigating were hit by a *third* group of teams -- 139,477 of them. Teams in the third group, by definition, form the weighted average of all teams that hit singles. But the other two groups are probably very slightly better than average, which means the calculation isn't quite exact.

But that's a very picky objection. Any bias caused by these circumstances is going to be very small.

-----

Here are the values the play-by-play method gives for the "big five" events:

0.468 single
0.768 double
1.076 triple
1.403 home run
0.314 unintentional walk

These are reasonably in line with the "traditional" Pete Palmer linear weights, circa 1984. That's not surprising -- Pete used the play-by-play method himself. (Actually, since Retrosheet didn't exist at the time, Pete ran a simulation to get random play-by-play data, then ran this method on that.)

-----

Then, there's the regression method.

In terms of the amount of work you have to do, this method is much, much simpler than the traditional method. All you do is take team batting lines, and ask your regression software to give you the best predictor of runs scored based on the other factors.
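For concreteness, here's a rough Python sketch of that regression -- not my actual code. The statsmodels library stands in for whatever package you prefer, and the array layout (and whether to include an intercept, which I've added here) are my own assumptions:

```python
# A sketch of the team-level regression, assuming the batting lines are
# already loaded into arrays.
import statsmodels.api as sm

# X: one row per team batting line, columns = [1B, 2B, 3B, HR, BB, K, outs]
# y: runs scored by each team
def regression_weights(X, y):
    # Whether to include an intercept term is a judgment call; one is added here.
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    return fit.params, fit.bse  # coefficients and their standard errors
```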

I ran this regression for the same seasons as the play-by-play method: 1990-1998. That comprised 248 team batting lines total. I used only 1B, 2B, 3B, HR, BB, K, and outs. Here are the results (with standard errors in parentheses):

0.494 (.030) single
0.730 (.054) double
1.343 (.193) triple
1.465 (.054) home run
0.342 (.025) walk (including IBB)

You'll notice that there are some significant differences between the two methods, especially singles and walks.

So, what's going on?

Well, first, I should point out that the differences aren't statistically significant. All the regression coefficients are within two standard errors of the traditional ones. So, you'd be within your rights to insist that the two sets of results aren't that much different.

And, of course, it's possible that it's just a programming error on my part.

However, I don't think either of those things is what's happening. I think that the differences are real.

I think the play-by-play method gives the correct result -- or close to it -- and the regression method is inherently biased.

Let me show you why I think that, in a bit of a roundabout way. Let me know if I'm wrong.

I ran another version of the same regression, but this time, instead of using team-seasons in the regression, I used individual games (both teams combined into one batting line). So, I had 81 times as much data in the regression. Actually, because of the 1994 strike, it was only about 77 times.

Here are the results of that one:

0.490 (.004) single
0.779 (.009) double
1.097 (.026) triple
1.433 (.011) home run
0.433 (.006) walk

This time, the value of the walk is huge, much bigger than it should be -- and much bigger than the value we got when we used 1/77 the data. Why?

Start with the observation that the value of the walk (or, indeed, any other event) depends on context. On a high-offense team, the base on balls will be worth a lot, because the baserunner has a higher chance of being driven in. On a low-offense team, the value of the walk will be lower, as it's more likely he'll be stranded.

So far, so good. The problem, though, is that the more walks a team gets, the stronger its offense -- and so, the more valuable each individual walk becomes.

Suppose a team has N walks. At the margin, the (N+1)th walk a team gets is worth a certain number of runs. But the (N+2)th walk has to be worth more -- because now it's in the context of a stronger team, a team that has N+1 walks, instead of a team that has only N walks.

It doesn't seem like it should be a big deal -- and it isn't that big, for a season. The range of actual major-league team offenses, in general, is pretty small. In 1998, the Yankees led the major leagues with 678 walks. The Pirates were last, at 393. That means the Yankees walked 73 percent more than the Pirates.

But for games, the difference can be much larger. Some games have 7, 8, or more walks. Some games have only 1 or 2. Now, the difference between most and least is in the hundreds of percents.

The difference between the 8th walk and the 2nd walk, over one game, is much bigger than the difference between the 393rd walk and the 678th walk, over a season.

And so, the problem of non-linearity -- of increasing returns on walks -- is bigger when you go game-by-game.

Now, what happens in the regression when you have increasing returns? Well, first, you probably shouldn't be using linear regression, which, by name and by definition, assumes things are linear. If you do it anyway, your coefficient gets inflated. In this case, the coefficient for the walk got inflated all the way to .433.


If you want a simpler example: Suppose I offer to pay you $1 for one hit, $4 for two hits, and $9 for three hits. That means hits have increasing returns -- the first hit is worth $1, the second is worth $3, and the third is worth $5.

In six games, you get one hit four times, two hits once, and three hits once. You made $17 on 9 hits, so the average hit is worth $1.89.

But, if you run a regression on the six games, you get each hit worth $2.29. (Try it if you don't believe me.)
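Here's the check in a few lines of Python, if you'd rather not fire up your regression software. One detail the example leaves implicit: to reproduce the $2.29, the regression has to be forced through the origin (no intercept term); with an intercept the slope comes out differently.

```python
import numpy as np

hits    = np.array([1, 1, 1, 1, 2, 3])  # six games: four 1-hit, one 2-hit, one 3-hit
dollars = np.array([1, 1, 1, 1, 4, 9])  # the $1 / $4 / $9 payout schedule

# Least squares through the origin: slope = sum(x*y) / sum(x*x)
slope = (hits @ dollars) / (hits @ hits)
print(round(slope, 2))  # 2.29 -- versus the true average of $17 / 9 hits = $1.89
```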

The more increasing returns you have, the worse the bias. For games, the bias was high, pushing the walk to .433. For seasons, the bias is lower -- but not zero. So the regression's value for the walk will still be higher than it should be.

The regression software I use has a test for non-linearity. For games, it comes back that walks (and singles) are definitely non-linear. For seasons, it finds non-linearity, but not enough to be statistically significant. (That insignificance, I think, is why academic studies that use regression don't notice there's a problem.)

------

Here are the results for two other sets of seasons.  2000-2009 NL:

PBP 0.454, reg 0.542, reg games 0.481  (single)
PBP 0.765, reg 0.753, reg games 0.794  (double)
PBP 1.063, reg 1.159, reg games 1.010  (triple)
PBP 1.395, reg 1.576, reg games 1.576  (home run)
PBP 0.303, reg 0.283, reg games 0.416  (walk)

And 2000-2009 AL:

PBP 0.478, reg 0.527, reg games 0.495  (single)
PBP 0.780, reg 0.776, reg games 0.783  (double)
PBP 1.051, reg 1.529, reg games 1.000  (triple)
PBP 1.397, reg 1.396, reg games 1.415  (home run)
PBP 0.334, reg 0.349, reg games 0.430  (walk)

And, for completeness, I'll re-run the original numbers for 1990-1998:

PBP 0.468, reg 0.494, reg games 0.490  (single)
PBP 0.768, reg 0.730, reg games 0.779  (double)
PBP 1.076, reg 1.343, reg games 1.097  (triple)
PBP 1.403, reg 1.465, reg games 1.433  (home run)
PBP 0.314, reg 0.342, reg games 0.433  (walk)

In every case, the regression on individual games ("reg games") seriously inflates the value of walks and singles.  The regression on batting lines ("reg") inflates the value of walks and singles five out of six times. 

-----

Why does the inflation only seem to apply to walks and singles?  Here's my hypothesis.  It might be wrong.

Singles and walks tend to be more concentrated in games than other events are. Some pitchers give up a lot of walks, some give up only a few. The difference between pitchers is not as big for hits. Yes, pitchers do vary a lot in strikeouts, which leads to differences in hits, but each strikeout difference is only 3/10 of a hit (since batting average on balls in play is fairly constant at .300). 


Doubles, triples, and home runs have less non-linearity: two triples in the same inning aren't that much more valuable than two triples in separate innings. Also, home runs may have *diminishing* returns: two HRs in the same inning are probably worth less than two HRs in separate innings.

Also: doubles, triples, and home runs are rare enough that multiples don't happen that often. The number of repeats, I think, should be proportional to the square of the frequency. So if there are twice as many walks as doubles, there are four times as many consecutive walks as consecutive doubles.
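That's easy to sanity-check with a quick simulation. This sketch uses made-up per-PA event rates, not real ones:

```python
import random

def consecutive_pairs(p, n=1_000_000, seed=1):
    """Count back-to-back occurrences of an event that happens with probability p."""
    rng = random.Random(seed)
    events = [rng.random() < p for _ in range(n)]
    return sum(a and b for a, b in zip(events, events[1:]))

print(consecutive_pairs(0.04))  # a doubles-like rate: about 1,600 repeats
print(consecutive_pairs(0.08))  # a walks-like rate: about 6,400 -- four times as many
```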

That's why I think we see most of the bias in the value of the walk and single. 


------


So, that's one reason I think it's preferable to use the traditional method over the regression method -- the regression method is biased too high.

If that's not enough, here's another reason: the traditional method has a lower random error.

Intuitively, that makes sense. The traditional method uses all the play-by-play data available, at a very granular level. The regression method uses only season statistics, each batting line the aggregation of maybe 6,000 plate appearances. It seems obvious that the method that uses six thousand times as much data should be more accurate.

But, that's easy to show empirically.

Here's what I did. I used the same years of data, and the same play-by-play method -- but I divided the data into 13 parts. I then calculated the linear weight independently, for each of the 13 parts.

I took those 13 linear weights, and calculated their standard deviation.

The best estimate of the true value, of course, is the average of the 13 estimates. And the standard error of that average is simply the SD of the 13 estimates, divided by the square root of 13.
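In code, the whole error estimate looks something like this -- a sketch that reuses the hypothetical run_expectancy and linear_weight functions from the earlier sketch:

```python
import random
import statistics

def lw_standard_error(plays, event_type, k=13, seed=1):
    """Split the plays into k random groups, compute the linear weight in each,
    and take SD / sqrt(k) as the standard error of the overall average."""
    shuffled = list(plays)
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]
    estimates = [linear_weight(g, run_expectancy(g), event_type) for g in groups]
    return statistics.mean(estimates), statistics.stdev(estimates) / k ** 0.5
```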

So, here are the results of the play-by-play method again, this time including the standard errors I wound up with:

0.468 (.0022) single
0.768 (.0027) double
1.076 (.0072) triple
1.403 (.0028) home run
0.314 (.0023) unintentional walk

(Technical note: because the way I constructed the 13 groups was random-ish, the standard errors are random too. I tried a different randomization and the values changed somewhat, but were the same order of magnitude. If you really, really wanted a precise estimate of the standard error, you could just do the randomization a couple of hundred times and average them all out.)

If you compare those to the standard errors from the regression, you'll see that they're smaller by at least a factor of 10. Which makes sense, considering they use so much more data, at so granular a level.


-------

So, this is a case where the more technical, mathematicky method is actually less accurate, less precise -- and less rigorous -- than the grade-school level method. Other than ease of computation, I don't see any reason to prefer the regression.


15 Comments:

At Friday, April 20, 2012 3:08:00 PM, Anonymous David said...

Good stuff Phil. Despite my previous comment on your academic post, I generally find the sabermetric way of finding linear weights to be convincing. My prior is probably different from yours as to how much bias might creep in because the sample of batters observed with bases loaded and two outs is different from the population of batters, but I agree it's way better than the alternative of year-by-year linear regression, which is much less motivated. (Although, if I was feeling mischievous I might argue that linear weights in the usual way is actually more academic, because it's basically a well-motivated Markov chain.)

More specific questions: by chance have you run the regression with plate appearances as the unit of observation and “runs until the end of the inning” as the outcome variable? I think this would be most closely analogous to linear weights in terms of comparing standard errors (kudos on basically starting to bootstrap by the way…). I suspect this would introduce more bias relative to game-by-game, but it would also be granular enough that one could control for team quality, etc.

 
At Friday, April 20, 2012 3:26:00 PM, Anonymous Alex said...

I'm not especially familiar with the analyses you're describing, so I want to make sure I understand it - the 'traditional' method uses bat-by-bat data with tens of thousands of observations, right? Whereas the two regressions you ran used season-level and game-level data? If so, I agree with David's comment on letting regression have bat-by-bat data to work on as well so that it's a fair comparison. I would also guess that as you become more fine-grained a linear regression would become less appropriate because runs scored becomes less normally distributed.

 
At Friday, April 20, 2012 3:56:00 PM, Blogger Andy said...

Perhaps a better explanation for why singles and walks have the most nonlinearity is that their value increases the most within an inning. 1 walk/single in an inning is not that much better than 0, but at 3+ in one inning each subsequent single/walk is worth almost a full run.

For example, if we just look at the marginal probability of getting > 2 singles in an inning based on how many singles there are in a game, you get:
2: 0% (of course)
3: +0.1%
4: +0.4%
5: +0.7%
6: +1.1%
7: +1.4%
8: +2%
9: +2.4%
10: +3%
11: +3.3%

 
At Friday, April 20, 2012 5:09:00 PM, Blogger Phil Birnbaum said...

David,

You're suggesting the regression have 24 (23) dummy variables for base/out state, plus another set of dummy variables for events? Sure, you could do that, but it seems a clumsy method, and just as much work as the traditional one.

You could also convert this to a "traditional" method with one change: instead of computing the difference between the beginning and ending states, you just look at the number of runs that score in the remainder of the inning.

That is, you remove the assumption that only the current state matters, and replace it with the assumption that what matters is both the current state and the next thing to happen.

There's a discussion of whether this approach is better on Tango's site. Peter Jensen pointed out (correctly, IMO) that this approach is definitely better for calculating the value of the IBB.

 
At Friday, April 20, 2012 5:10:00 PM, Blogger Phil Birnbaum said...

Alex:

See my comment above. Are you suggesting a regression with 23 dummy variables for base/out state, and other dummy variables for events?

If not, what does your proposed regression look like?

 
At Friday, April 20, 2012 5:12:00 PM, Blogger Phil Birnbaum said...

Andy: I agree. Thanks for those calculations!

 
At Friday, April 20, 2012 5:14:00 PM, Anonymous David said...

@Alex: Check out some of Tom Tango's stuff at "Inside the Book Blog." It'll explain how they usually do linear weights. But yes, it starts out with averaging over all plate appearances across the 24 baserunner-out states of the world.

@Phil: Again, I agree with you that regression is probably biased. But I'm not so sure it's about non-linearity. Couldn't it simply be various omitted variables? Run scoring is higher in games with wild starting pitchers. Wild starting pitchers give up lots of walks but not a very different distribution of hits. It would be interesting to really nail down the sources of difference between the regression and usual linear weights. I suspect sabermetric linear weights is better, but it seems like the evidence could answer it.

 
At Friday, April 20, 2012 5:18:00 PM, Blogger Phil Birnbaum said...

David: if it's wild starting pitchers, isn't that just another way of saying non-linearity? You're saying the walks get concentrated. I'm agreeing, and I'm saying the more the walks get concentrated, the more higher-value walks you have, because the values are increasing rather than constant.

 
At Friday, April 20, 2012 6:42:00 PM, Anonymous Alex said...

It's a little complex. I think one way to do it would be to regress runs scored after a single was hit with the 24 game states, then do it again for doubles, etc (again, not with linear regression). The regression weights would tell you the expected number of runs scored off a single (or double, etc) in any given state. Then you could follow step 3 of the traditional method to get your final linear weights.

The other way would be to simply take runs scored on a given PA as the DV and the outcome of the PA (single, double, etc) as dummy-coded IVs as in your regressions. That would average over game state, but the interpretation of the weights should be the expected value of a single, double, etc. relative to the intercept, which I guess would be the value of an out.

 
At Monday, April 23, 2012 2:19:00 PM, Blogger Don Coffin said...

Phil, this may be odd, but when I run what I think is the same regression you did, on the same data, I don't get the same coefficients. (I'm using SPSS, but that shouldn't matter.)

My coefficients/std. errors:

1B......0.232 (0.025)
2B......0.645 (0.070)
3B......1.114 (0.251)
HR......1.429 (0.066)
IBB.....0.179 (0.159)
UBB.....0.289 (0.033)

So I'm getting a considerably lower coefficient on singles, a somewhat lower coefficient on doubles and on walks.

When I used an expanded set of explanatory variables (i.e., including basically everything -- HBP, SB, CS, SH, SF, SO, DP), the coefficients on the core variables don't change much:

1B......0.297 (0.032)
2B......0.578 (0.070)
3B......1.030 (0.227)
HR......1.569 (0.062)
IBB.....0.106 (0.157)
UBB.....0.302 (0.032)

The coefficient on HBP is a little larger than on UBB (0.364).

But I'm at a loss to come up with an explanation of the differences in our results.

 
At Monday, April 23, 2012 2:39:00 PM, Blogger Don Coffin said...

I also ran a log-linear regression (which is difficult to interpret in the same units; the coefficients are "elasticities" -- the coefficient tells us the percentage change in runs scored divided by the percentage change in the independent variable). For example, the coefficient of HRs is 0.310. That says that a 1% change in HRs leads to a 0.31% change in runs scored... I could work out what that means compared to the linear regression, but I'm lazy.

But the real point of the log-linear regression is that the *sum* of the coefficients gives the *interaction* effect. If the sum of the coefficients is greater than one (and it is -- it's 1.167), that means a 1% increase in *all* the independent variables leads to an increase in the dependent variable of more than 1%. So a 1% increase in everything leads to a 1.167% increase in runs scored in the regression model with everything, and a 1.069% increase in runs scored with only 1B, 2B, 3B, HR, and BB.

 
At Monday, April 23, 2012 3:57:00 PM, Blogger Phil Birnbaum said...

Hi, Doc,

Looking at my code ... I used 1B, 2B, 3B, HR, BB, outs (AB - H - SO), strikeouts, SB, and CS.

I can send you my datafile if you like ... send me an e-mail, and let me know what years you're using.

It's possible I screwed up!

 
At Monday, April 23, 2012 5:58:00 PM, Blogger Don Coffin said...

Ah. I didn't use outs. I'll include that and see if it makes a difference...

 
At Monday, April 23, 2012 6:17:00 PM, Blogger Don Coffin said...

A lot closer now to what you're finding:

1B......0.585 (0.041)
2B......0.803 (0.061)
3B......1.130 (0.202)
HR......1.378 (0.057)
IBB.....0.295 (0.133)
UBB.....0.322 (0.027)

I'd say that's roughly in agreement.

My problem with linear weights, conceptually, has always been the linearity of it. The underlying assumption, mathematically, seems to be that an x% increase in all of the variables used to explain scoring (except outs, which can't increase, really) will lead to an x% increase in scoring. (Actually, with outs constant, this is not quite true; scoring will increase, in either of our regressions, about 2% faster than offensive components.) But I think the reality is that scoring will increase faster than its components (and in the log-linear model I estimated, the interactions suggest that the effect is considerable). I suspect there's a way to adapt the event data underlying linear weights to account for this, though.

 
At Saturday, July 28, 2012 10:31:00 AM, Blogger Ron Johnson said...

Coming in very late. Doc, I think what you and Phil are seeing is evidence that, at the team level, offense is best explained by multiplicative methods (which don't work at all well for individual players).

It's just that it's really tricky to get the weights right.

 
