Don't use regression to calculate Linear Weights
About a year ago, in a post titled "Regression, Schmegression," Tom Tango argued that regression is not usually one of the better techniques to use in sabermetric research. He's right, especially for the example he used, which is using regression to find the correct Linear Weight values for the basic offensive events.
What a lot of researchers have done, and are still doing, is listing batting lines for various team-years -- singles, doubles, triples, etc. -- and running a regression to predict runs scored. It's not that bad a technique, but there are other, better ones you can use. Still, by looking a bit closer at the regression results, you can get a good idea of why regression results don't always mean what you think they mean.
Let's start with the triple. How much is a triple worth? That is: how many more runs would an average team score if you gave them exactly one extra triple?
We can run a simple regression, runs scored vs. triples hit. I used a dataset consisting of all full team-seasons from 1961 to 2008 (only for teams that played at least 159 games, to omit strike seasons). That was 1,121 teams. The result of the regression:
Runs = 731 - (0.44 * triples)
That's not a misprint: the regression tells us that every triple actually *costs* its team almost half a run!
It's not a sample size issue, either. The standard error of the -0.44 estimate is 0.27. The estimate was actually significantly different from zero (in the wrong direction!) at the 10% level.
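This kind of result is easy to reproduce on made-up data. Here's a sketch (these numbers are invented for illustration, not the actual 1961-2008 dataset): teams are built so that triples trade off against home-run power, and every triple genuinely adds runs by construction. The single-variable regression still comes out negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1121  # same count as the article's team-seasons, but the data is made up

# Hypothetical profiles: faster teams hit more triples but fewer home runs.
speed = rng.normal(0.0, 1.0, n)
triples = 35 + 10 * speed + rng.normal(0, 5, n)
homers = 150 - 25 * speed + rng.normal(0, 20, n)

# Runs are built so both events genuinely ADD runs (true weights 1.0 and 1.4).
runs = 400 + 1.0 * triples + 1.4 * homers + rng.normal(0, 25, n)

# Simple regression of runs on triples alone.
slope, intercept = np.polyfit(triples, runs, 1)
print(slope)  # negative: triples proxy for "low power", not for their own value
```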
Is it possible that a triple actually lowers your runs scored? Of course not. Our baseball knowledge tells us that's logically impossible. A triple maximizes the value of runners on base (by scoring them all), and then puts a runner on third, where he's also likely to score. It's all positive. There must be something else happening here.
It's pretty obvious, but to understand that we can't take the results at face value, we needed subject matter expertise -- we needed to know something about baseball. In this case, we didn't need to know much, just that triples have to be a good thing. But that's subject matter knowledge nonetheless.
No matter how expert you are in the technique of regression, you have to know something about the subject you're researching to be able to reach the correct conclusions from the evidence. Because, as the saying goes, correlation doesn't imply causation. But it doesn't imply *non-causation* either. It could be that triples cause fewer runs, or it could be that there's some third factor that's positively correlated with triples, but negatively correlated with runs scored. Knowing something about baseball lets us argue for which conclusion makes more sense.
Normally, when you interpret a regression result like this, you say something like: "all else being equal, one extra triple will reduce the number of runs scored by about 0.44." But that's not quite right. The "all else" doesn't refer to everything in the universe -- it only refers to everything else *you controlled for in that regression*. Which, in this case, was nothing -- we only regressed on triples.
A more accurate way to put the regression result is:
"One extra triple is associated with a reduction of the number of runs scored by about 0.44. That's either because of the triple itself, or because of something else about teams who hit more triples, something that wasn't controlled for in the regression."
Now, a possible explanation becomes apparent. Teams that hit lots of triples are usually faster teams. Fast teams tend to have fewer fat strong guys who hit for power. Therefore, maybe hitting lots of triples suggests that your team doesn't have much power, which is why triples are negatively correlated with runs scored.
Again, the regression didn't suggest that: it was our knowledge of baseball.
We can test that hypothesis, and there are a couple of ways to test it. First, we can test for a correlation between triples and other hits. And, yes, the correlation between triples and home runs is -0.31: teams who hit a lot of triples do indeed hit fewer home runs than average.
Or, we can just include home runs in the regression. If we do that, we get the equation
Runs = 373 + (1.84 * triples) + (1.93 * home runs)
"Home runs being equal, one extra triple will increase the number of runs scored by about 1.84. That's either because of the triple itself, or because of something else about teams who hit more triples (something other than home runs, which was controlled for)."
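You can see the sign flip directly in code. This sketch uses invented data where triples trade off against home-run power: regressing on triples alone gives a negative coefficient, but adding home runs to the regression recovers a positive value close to the true weight built into the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1121

# Hypothetical teams where triples trade off against home-run power.
speed = rng.normal(0.0, 1.0, n)
triples = 35 + 10 * speed + rng.normal(0, 5, n)
homers = 150 - 25 * speed + rng.normal(0, 20, n)
runs = 400 + 1.0 * triples + 1.4 * homers + rng.normal(0, 25, n)

# Triples alone: the coefficient comes out negative (confounded).
alone = np.polyfit(triples, runs, 1)[0]

# Triples plus home runs: controlling for power recovers roughly +1.0 per triple.
X = np.column_stack([np.ones(n), triples, homers])
coef, *_ = np.linalg.lstsq(X, runs, rcond=None)
print(alone, coef[1])  # sign flips once home runs are held constant
```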
Of course, there's still something else other than home runs: our baseball knowledge tells us that teams that hit lots of triples are likely to be different in doubles power, too. And, in fact, in almost every other category: singles, outs, walks, steals, and caught stealings. So if we do a regression on all that stuff, we get:
Runs = 42
+ (0.52 * singles)
+ (0.67 * doubles)
+ (1.18 * triples)
+ (1.48 * home runs)
+ (0.33 * walks)
+ (0.18 * steals)
- (0.21 * caught stealing)
- (0.11 * batting outs (which is AB-H)).
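Expressed as code, that fitted equation is just a weighted sum of a team's batting line. The batting line below is a made-up but plausible example, not any real team:

```python
# Runs estimate from the regression coefficients above.
def estimated_runs(singles, doubles, triples, homers, walks, steals, cs, outs):
    return (42
            + 0.52 * singles + 0.67 * doubles + 1.18 * triples
            + 1.48 * homers + 0.33 * walks
            + 0.18 * steals - 0.21 * cs
            - 0.11 * outs)  # outs = AB - H

# A hypothetical team batting line.
print(round(estimated_runs(1000, 250, 35, 150, 550, 100, 50, 4100), 1))  # 730.8
```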
Now we're getting values that are close to traditional Linear Weights. But not completely. For instance (and as Tango noted), we get that a double is worth only 0.67 runs, rather than the 0.8 that we're used to.
Assuming the 0.8 is actually the correct number, we ask: What's going on? Why are we getting only 0.67? It isn't just random variation, because the standard error of the 0.67 estimate in the regression is only 0.02. So what is it?
There must be something about teams that hit a lot of doubles that reduces the number of runs they score, in ways *other* than changing the number of singles, triples, home runs, walks, steals, caught stealings, and batting outs. What could that be?
I don't know the answer, but here are some possibilities:
-- maybe teams that hit a lot of doubles (relative to the other events) are more likely to be intentionally walked. Therefore, their walks are less valuable than those of other teams. Every additional double may correlate with one extra walk turning out to be an IBB, which results in the regression giving a lower coefficient for the double.
-- the regression went from 1961 to 2008. Maybe teams that hit a lot of doubles (relative to other events) played in low-offense eras (like the mid 60s). The extra doubles mark the team as being from that era, which makes all events worth less, which makes the regression adjust by giving a lower coefficient for the double.
-- maybe teams that hit a lot of doubles ground into a lot of double plays. Since double plays are extra outs that don't show up in (AB-H), that would cause runs to be overestimated. The regression adjusts for that by building the extra DPs into the value of the double.
And so on. I don't know what the true answer is; none of the suggestions above seem very likely to me. It's a bit of a mystery. I'd add IBB to the regression, but the Lahman database doesn't seem to have it for teams. Maybe I'll calculate it some other way and try again.
Anyway, it was a bit of a shock to me that the doubles estimate was so far off. I would have thought that a technique like linear regression, with over 1,000 rows of data, would be able to come up with the answer. But it didn't, almost certainly because of outside factors that we didn't control for. Not only that, but we don't even really know what those outside factors are! (Although if you have an idea, let me know in the comments.)
So the accepted value for the double is 0.8 runs, using a method I will explain shortly. The regression, on the other hand, gives only 0.67 runs.
It's not that the regression answer is wrong -- it's just that the regression answers a different question than the one we want.
The 0.8 method asks: "If a team happens to hit an extra double, how many more runs will it score?"
The 0.67 method asks: "If one team hits one more double than another team, how many more runs will it score, taking into account that the extra double suggests the team might be slightly different in other ways?"
To most analysts, the first question is more important. Why? Because we really *do* want to find the cause-and-effect relationship. Confounding variables may be interesting, but they usually get in the way of what we're really trying to find. It's interesting to note that more triples is correlated with fewer runs scored. But that information isn't very useful to our understanding of baseball -- it doesn't tell us what makes teams win. That is, we hopefully aren't about to tell the 1985 Cardinals that they would have scored more runs by hitting fewer triples, at least not unless we really believe that triples hurt offense.
What we usually want to do, using Linear Weights, is answer questions like: if you release player X, and sign player Y, where player Y hits 10 more triples than player X, how much will the team improve? And, for that question, the regression gives us the wrong answer.
So what's the method we use to find the *true* value of a triple? It's pretty simple:
-- For a certain period of baseball history, look at the play-by-plays of all the games, and classify every plate appearance (and baserunning event) according to which of the 24 base/out states (such as no outs, runners on second and third) it occurred in.
-- For each state, calculate how many runs were scored after that state was achieved. (Here's the one for 2009, from Baseball Prospectus.)
-- Now, for every triple, calculate the difference between the runs before the triple, and the runs after the triple. For instance, a leadoff triple would have been worth .79 runs (there was an expectation of .52 runs before the triple, and 1.31 runs after the triple). But a triple with the bases loaded and two outs was worth 2.36 runs (before the triple, .75 runs were expected to score. After the triple, only .11 runs were expected, but 3 runs actually did score. 3.11 minus .75 equals 2.36).
-- Average out all the values of all the triples, as calculated above.
That average is how much an extra triple is worth to an average team.
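The before/after bookkeeping in the steps above fits in a few lines of code. The run-expectancy figures below are just the ones quoted in the examples earlier; a real calculation would use the full 24-state table for the league and era in question.

```python
# Run value of an event = (RE after) - (RE before) + runs scored on the play.
def run_value(re_before, re_after, runs_scored):
    return re_after - re_before + runs_scored

# Run-expectancy figures quoted in the text above (approximate values).
RE = {
    ("empty", 0): 0.52,    # bases empty, no outs
    ("3rd", 0): 1.31,      # runner on third, no outs
    ("loaded", 2): 0.75,   # bases loaded, two outs
    ("3rd", 2): 0.11,      # runner on third, two outs
}

leadoff = run_value(RE[("empty", 0)], RE[("3rd", 0)], 0)        # 0.79 runs
bases_loaded = run_value(RE[("loaded", 2)], RE[("3rd", 2)], 3)  # 2.36 runs
print(round(leadoff, 2), round(bases_loaded, 2))
```

Averaging `run_value` over every triple in the play-by-play data gives the Linear Weight of the triple.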
If you do this for the 1988 American League (which is the one I happen to have on hand), you get that a triple is worth 1.024 runs. A double is worth 0.775.
There are several reasons why this estimate is much better than what you could get from a regression:
-- as we pointed out, the regression estimate is influenced by extraneous factors associated with teams that hit a lot of triples.
-- as Tango pointed out in his link, the regression uses data aggregated into team-seasons, which means you're losing a lot of information. This method uses PA by PA, inning by inning data, for a much more reliable estimate.
-- we have a direct, logical, cause-and-effect relationship.
-- in effect, we are able to hold *everything* constant, even factors we don't know about. That's because we are not comparing team X's triples to team X's runs. We're comparing a league-average triple to the league-average runs. All other confounding factors are averaged out.
Another way to look at it: the regression looks only at inputs and outputs. So it has no idea if the input *caused* the output, or if there's some third factor that links the two. But the play-by-play method isolates the direct effect of the input. It knows for sure that the triple *caused* the change from one state (bases empty, no outs) to the next (runner on third, no outs), and so it's not fooled by outside factors.
Correlation does not imply causation, and regression can only provide correlation. Why not use this method, which is based on causation, and therefore gives you the right answer?
UPDATE: Below, commenter Ted links to a paper (.pdf) where he used regression to figure Linear Weights, and found that when he added variables for GIDP and reached on error, the doubles coefficient increased (from .689 to .722). The HR coefficient also increased (by 10 points). So that, I think, explains part of the mystery: the doubles are artificially low because teams that hit a lot of doubles and home runs are so slow that they also hit into a lot of DPs and don't reach base on error as much. When these are accounted for separately, some of the true value of the double is restored.