Don't use regression to calculate Linear Weights, Part II
I've written a couple of times now about why it's better to calculate linear weights via the "traditional" play-by-play method than by regression. That was all theory. This time, I figure I might as well go into more detail, including doing the work and showing the real numbers.
I ran the play-by-play method using all major league games from 1990 to 1998. You probably know this already, but here's how the method works:
1. Look through every baseball game for the period you're interested in (probably by computer, through Retrosheet). For every plate appearance, note the number of outs, bases occupied, and the number of runs that scored in the remainder of that inning.
2. For each of the 24 game states (number of outs, bases occupied), group all the observations and average them, giving you the average number of runs scored in the remainder of that inning.
For instance: from 1990 to 1998, there were 27,580 instances of a runner on second and nobody out. A total of 31,749 runs scored in the remainder of those innings, for an average of 1.151 runs.
There were 9,090 instances of runners on first and third with no outs. The average there was 1.807 runs.
3. Once you have those 24 averages, go through every baseball game again. Find every single. Look at the game state before the single, and after the single. Calculate the difference in run expectation between the two states. The difference, plus the number of runs scored on the play, is what that single was worth.
For instance: with nobody out and a runner on second, the batter singles, runner to third. The value of that single is 1.807 minus 1.151, which is 0.656. So a single in that situation is worth 0.656 runs, on average.
Repeat for every single, and average out all the values. That's your answer, your linear weight for the single.
4. Repeat step 3 for every other event: doubles, home runs, walks ... whatever you want to calculate the linear weight for.
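Here is a minimal sketch of steps 2 through 4 in Python. The `PlateAppearance` record, its field names, and the string encoding of the bases are all hypothetical, chosen just for illustration; this is not Retrosheet's format, and turning Retrosheet event files into records like these is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple

# (outs, bases occupied), e.g. (0, "-2-") = runner on second, nobody out.
# This encoding is an assumption made for the sketch.
State = Tuple[int, str]

@dataclass
class PlateAppearance:
    before: State               # game state when the batter comes up
    after: Optional[State]      # state after the play; None if the inning ended
    event: str                  # "single", "double", "walk", ...
    runs_on_play: int           # runs that scored on this play
    runs_rest_of_inning: int = 0  # runs from this PA to the end of the inning

def run_expectancy(pas):
    """Step 2: average runs scored in the rest of the inning, per state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [runs, observations]
    for pa in pas:
        totals[pa.before][0] += pa.runs_rest_of_inning
        totals[pa.before][1] += 1
    return {state: runs / n for state, (runs, n) in totals.items()}

def linear_weights(pas, run_exp):
    """Steps 3-4: average change in run expectancy, plus runs on the play."""
    sums = defaultdict(lambda: [0.0, 0])  # event -> [total value, count]
    for pa in pas:
        # An inning that has ended has no further run expectation.
        exp_after = 0.0 if pa.after is None else run_exp[pa.after]
        value = exp_after - run_exp[pa.before] + pa.runs_on_play
        sums[pa.event][0] += value
        sums[pa.event][1] += 1
    return {event: total / n for event, (total, n) in sums.items()}

# Check against the worked example, using the two averages quoted above.
quoted = {(0, "-2-"): 1.151, (0, "1-3"): 1.807}
single = PlateAppearance(before=(0, "-2-"), after=(0, "1-3"),
                         event="single", runs_on_play=0)
print(round(linear_weights([single], quoted)["single"], 3))  # 0.656
```

In real use you would call `run_expectancy` on every plate appearance in the period first, then pass the resulting 24-state table to `linear_weights` along with the same list.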
The logic of this method, I think, is very solid. Because we looked at all the data for those years, it is a *fact* -- not an estimate -- that on average, 1.151 runs were scored with a runner on second with nobody out. It is a *fact* that an average 1.807 runs were scored with runners on first and third with nobody out. Therefore, there is a very strong presumption that a single that moves the runner to third is worth .656 runs.
Technically, the value of the single is only a very close estimate. The reason is that we used three different samples. The 1.151 figure is from a group of 27,580 teams that put a runner on second with nobody out. The 1.807 figure is from a *different* group of 9,090 teams that put runners on first and third with nobody out. And, the singles we're investigating were hit by a *third* group of teams -- 139,477 of them. Teams in the third group, by definition, form the weighted average of all teams that hit singles. But the other two groups are probably very slightly better than average, which means the calculation isn't quite exact.
But that's a very picky objection. Any bias caused by these circumstances is going to be very small.
Here are the values the play-by-play method gives for the "big five" events:
0.468 single
0.768 double
1.076 triple
1.403 home run
0.314 unintentional walk
These are reasonably in line with the "traditional" Pete Palmer linear weights, circa 1984. That's not surprising -- Pete used the play-by-play method himself. (Actually, since Retrosheet didn't exist at the time, Pete ran a simulation to get random play-by-play data, then ran this method on that.)
Then, there's the regression method.
In terms of the amount of work you have to do, this method is much, much simpler than the traditional method. All you do is take team batting lines, and ask your regression software to give you the best predictor of runs scored based on the other factors.
I ran this regression for the same seasons as the play-by-play method: 1990-1998. That comprised 248 team batting lines total. I used only 1B, 2B, 3B, HR, BB, K, and outs. Here are the results (with standard errors in parentheses):
0.494 (.030) single
0.730 (.054) double
1.343 (.193) triple
1.465 (.054) home run
0.342 (.025) walk (including IBB)
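For anyone who wants to try this kind of regression, here is a rough sketch using NumPy's least-squares solver. The function name and arguments are made up for illustration, and whether to include an intercept is left as a switch, since that detail of the setup isn't spelled out here. The standard errors are the usual ordinary-least-squares ones.

```python
import numpy as np

def fit_linear_weights(counts, runs, names, intercept=True):
    """Regress team runs on event counts (hypothetical helper).

    counts: one row per team batting line, one column per event
            (e.g. 1B, 2B, 3B, HR, BB, K, outs)
    runs:   runs scored, one value per row
    names:  column labels, in the same order as the columns of `counts`
    Returns {name: (coefficient, standard error)}.
    """
    X = np.asarray(counts, dtype=float)
    y = np.asarray(runs, dtype=float)
    names = list(names)
    if intercept:
        # Prepend a column of ones so the fit includes a constant term.
        X = np.column_stack([np.ones(len(y)), X])
        names = ["intercept"] + names

    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

    # Classical OLS standard errors: s^2 * diag((X'X)^-1).
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = (resid @ resid) / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

    return {n: (float(c), float(s)) for n, c, s in zip(names, coef, se)}
```

With the 248 team batting lines as rows and the seven event counts as columns, the returned coefficients are the regression's linear weights.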
You'll notice that there are some sizable differences between the two methods, especially for singles and walks.
So, what's going on?
Well, first, I should point out that the differences aren't statistically significant. All the regression coefficients are within two standard errors of the traditional ones. So, you'd be within your rights to insist that the two sets of results aren't that much different.
And, of course, it's possible that it's just a programming error on my part.
However, I don't think either of those things is what's happening. I think that the differences are real.
I think the play-by-play method gives the correct result -- or close to it -- and the regression method is inherently biased.
Let me show you why I think that, in a bit of a roundabout way. Let me know if I'm wrong.
I ran another version of the same regression, but this time, instead of using team-seasons in the regression, I used individual games (both teams combined into one batting line). So, I had 81 times as much data in the regression. Actually, because of the 1994 strike, it was only about 77 times.
Here are the results of that one:
0.490 (.004) single
0.779 (.009) double
1.097 (.026) triple
1.433 (.011) home run
0.433 (.006) walk
This time, the value of the walk is huge, much bigger than it should be -- and much bigger than the value we got when we used 1/77 of the data. Why?
Start with the observation that the value of the walk (or, indeed, any other event) depends on context. On a high-offense team, the base on balls will be worth a lot, because the baserunner has a higher chance of being driven in. For low-offense teams, the value of the walk will be lower, as it's more likely he'll be stranded.
So far, so good. The problem, though, is that the more walks a team gets, the stronger its offense -- and so, the more valuable each individual walk becomes.
Suppose a team has N walks. At the margin, the "N+1"th walk a team gets is worth so many runs. But the "N+2"th walk has to be worth more -- because now it's in the context of a stronger team, a team that has N+1 walks, instead of a team that has only N walks.
It doesn't seem like it should be a big deal -- and it isn't that big, for a season. The range of actual major-league team offenses, in general, is pretty small. In 1998, the Yankees led the major leagues with 678 walks. The Pirates were last, at 393. That means the Yankees walked 73 percent more than the Pirates.
But for games, the difference can be much larger. Some games have 7, 8, or more walks. Some games have only 1 or 2. Now, the difference between most and least is in the hundreds of percents.
The difference between the 8th walk and the 2nd walk, over one game, is much bigger than the difference between the 393rd walk and the 678th walk, over a season.
And so, the problem of non-linearity -- of increasing returns on walks -- is bigger when you go game-by-game.
Now, what happens in the regression when you have increasing returns? Well, first, you probably shouldn't be using linear regression, which, by name and by definition, assumes things are linear. If you do it anyway, your coefficient gets inflated. In this case, the coefficient for the walk got inflated all the way to .433.
If you want a simpler example: Suppose I offer to pay you $1 for one hit, $4 for two hits, and $9 for three hits. That means hits have increasing returns -- the first hit is worth $1, the second is worth $3, and the third is worth $5.
In six games, you get one hit four times, two hits once, and three hits once. You made $17 on 9 hits, so the average hit is worth $1.89.
But, if you run a regression on the six games, you get each hit worth $2.29. (Try it if you don't believe me.)
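If you'd rather not do it by hand, here is a short NumPy check of that example. One thing to note: the $2.29 figure is what you get from a least-squares line forced through the origin (no intercept). That's an inference from the arithmetic, not a statement about how any particular software was set up. If you allow an intercept, the slope comes out even higher.

```python
import numpy as np

# Six games: one hit four times, two hits once, three hits once.
hits = np.array([1, 1, 1, 1, 2, 3], dtype=float)
pay = hits ** 2  # $1 for one hit, $4 for two, $9 for three

# Actual average payout per hit: $17 on 9 hits.
average_per_hit = float(pay.sum() / hits.sum())

# Least-squares slope with the line forced through the origin: 39 / 17.
slope_through_origin = float((hits @ pay) / (hits @ hits))

# Least-squares slope when an intercept is allowed.
slope_with_intercept, intercept = (float(v) for v in np.polyfit(hits, pay, 1))

print(round(average_per_hit, 2))        # 1.89
print(round(slope_through_origin, 2))   # 2.29
print(round(slope_with_intercept, 2))   # 3.86
```

Either way, the regression puts the value of a hit above the $1.89 that was actually paid per hit.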
The more increasing returns you have, the worse the bias. For games, the bias was high, pushing the walk to .433. For seasons, the bias is lower -- but not zero. So the regression's value for the walk will still be higher than it should.
The regression software I use has a test for non-linearity. For games, it comes back that walks (and singles) are definitely non-linear. For seasons, it finds non-linearity, but not enough to be statistically significant. (That insignificance, I think, is why academic studies that use regression don't notice there's a problem.)
Here are the results for two other sets of seasons. 2000-2009 NL:
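And, for completeness, here are the original numbers for 1990-1998 again, with the traditional play-by-play values ("trad") next to the two regressions:
trad   reg    reg games
0.468  0.494  0.490      single
0.768  0.730  0.779      double
1.076  1.343  1.097      triple
1.403  1.465  1.433      home run
0.314  0.342  0.433      walk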
In every case, the regression on individual games ("reg games") seriously inflates the value of walks and singles. The regression on batting lines ("reg") inflates the value of walks and singles five out of six times.
Why does the inflation only seem to apply to walks and singles? Here's my hypothesis. It might be wrong.
Singles and walks tend to be more concentrated in games than other events are. Some pitchers give up a lot of walks, some give up only a few. The difference between pitchers is not as big for hits. Yes, pitchers do vary a lot in strikeouts, which leads to differences in hits, but each strikeout difference is only 3/10 of a hit (since batting average on balls in play is fairly constant at .300).
Doubles, triples, and home runs have less non-linearity: two triples in the same inning aren't that much more valuable than two triples in separate innings. Also, home runs may have *diminishing* returns: two HRs in the same inning are probably worth less than two HRs in separate innings.
Also: doubles, triples, and home runs are rare enough that multiples don't happen that often. The number of repeats, I think, should be proportional to the square of the frequency. So if there are twice as many walks as doubles, there are four times as many consecutive walks as consecutive doubles.
That's why I think we see most of the bias in the value of the walk and single.
So, that's one reason I think it's preferable to use the traditional method over the regression method -- the regression method is biased too high.
If that's not enough, here's another reason: the traditional method has a lower random error.
Intuitively, that makes sense. The traditional method uses all the play-by-play data available, at a very granular level. The regression method uses only season statistics, the aggregation of maybe 6,000 plate appearances. It seems obvious that the method that uses six thousand times as much data should be more accurate.
But you don't have to take that on intuition. It's easy to show empirically.
Here's what I did. I used the same years of data, and the same play-by-play method -- but I divided the data into 13 parts. I then calculated the linear weight independently, for each of the 13 parts.
I took those 13 linear weights, and calculated their standard deviation.
The best estimate of the true value, of course, is the average of the 13 estimates. And the standard error of that average is simply the SD of the 13 estimates, divided by the square root of 13.
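In code, that calculation looks something like the sketch below. Two things here are assumptions: "SD" is taken to mean the sample standard deviation (the n-1 version), and the `random_split` helper is just one plausible way to build the groups, since the exact grouping isn't specified. Both function names are made up.

```python
import random
import statistics

def mean_and_standard_error(estimates):
    """Average the per-part estimates and return (mean, SD / sqrt(k))."""
    k = len(estimates)
    if k < 2:
        raise ValueError("need at least two estimates")
    mean = statistics.fmean(estimates)
    # statistics.stdev is the sample SD (divides by k - 1); an assumption.
    std_error = statistics.stdev(estimates) / k ** 0.5
    return mean, std_error

def random_split(items, parts=13, seed=0):
    """Hypothetical helper: shuffle items and deal them into `parts` groups."""
    if len(items) < parts:
        raise ValueError("need at least one item per part")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal round-robin so group sizes differ by at most one.
    return [shuffled[i::parts] for i in range(parts)]
```

You would run the full play-by-play calculation separately on each group that `random_split` returns, collect the 13 values for (say) the single, and pass that list to `mean_and_standard_error`. Changing `seed` gives a different grouping, and so a somewhat different standard error.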
So, here are the results of the play-by-play method again, this time including the standard errors I wound up with:
0.468 (.0022) single
0.768 (.0027) double
1.076 (.0072) triple
1.403 (.0028) home run
0.314 (.0023) unintentional walk
(Technical note: because the way I constructed the 13 groups was random-ish, the standard errors are random too. I tried a different randomization and the values changed somewhat, but were the same order of magnitude. If you really, really wanted a precise estimate of the standard error, you could just do the randomization a couple of hundred times and average them all out.)
If you compare those to the standard errors from the regression, you'll see that they're smaller by at least a factor of 10. Which makes sense, considering they use so much more data, at so granular a level.
So, this is a case where the more technical, mathematicky method is actually less accurate, less precise -- and less rigorous -- than the grade-school level method. Other than ease of computation, I don't see any reason to prefer the regression.