Don't use regression to calculate Linear Weights
About a year ago, in a post titled "Regression, Schmegression," Tom Tango argued that regression is not usually one of the better techniques to use in sabermetric research. He's right, especially for the example he used, which is using regression to find the correct Linear Weight values for the basic offensive events.
What a lot of researchers have done, and are still doing, is listing batting lines for various team-years -- singles, doubles, triples, etc. -- and running a regression to predict runs scored. It's not that bad a technique, but there are other, better ones you can use. Still, by looking a bit closer at the regression results, you can get a good idea of why regression results don't always mean what you think they mean.
Let's start with the triple. How much is a triple worth? That is: how many more runs would an average team score if you gave them exactly one extra triple?
We can run a simple regression, runs scored vs. triples hit. I used a dataset consisting of all full team-seasons from 1961 to 2008 (only for teams that played at least 159 games, to omit strike seasons). That was 1,121 teams. The result of the regression:
Runs = 731 - (0.44 * triples)
That's not a misprint: the regression tells us that every triple actually *costs* its team almost half a run!
It's not a sample size issue, either. The standard error of the -0.44 estimate is 0.27. The estimate was actually significantly different from zero (in the wrong direction!) at the 10% level.
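This kind of result is easy to reproduce on made-up data. Here's a sketch (these numbers are invented for illustration, not the actual 1961-2008 dataset): teams are built so that triples trade off against home-run power, and every triple genuinely adds runs by construction. The single-variable regression still comes out negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1121  # same count as the article's team-seasons, but the data is made up

# Hypothetical profiles: faster teams hit more triples but fewer home runs.
speed = rng.normal(0.0, 1.0, n)
triples = 35 + 10 * speed + rng.normal(0, 5, n)
homers = 150 - 25 * speed + rng.normal(0, 20, n)

# Runs are built so both events genuinely ADD runs (true weights 1.0 and 1.4).
runs = 400 + 1.0 * triples + 1.4 * homers + rng.normal(0, 25, n)

# Simple regression of runs on triples alone.
slope, intercept = np.polyfit(triples, runs, 1)
print(slope)  # negative: triples proxy for "low power", not for their own value
```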
Is it possible that a triple actually lowers your runs scored? Of course not. Our baseball knowledge tells us that's logically impossible. A triple maximizes the value of runners on base (by scoring them all), and then puts a runner on third, where he's also likely to score. It's all positive. There must be something else happening here.
It's pretty obvious, but to understand that we can't take the results at face value, we needed subject matter expertise -- we needed to know something about baseball. In this case, we didn't need to know much, just that triples have to be a good thing. But that's subject matter knowledge nonetheless.
No matter how expert you are in the technique of regression, you have to know something about the subject you're researching to be able to reach the correct conclusions from the evidence. Because, as the saying goes, correlation doesn't imply causation. But it doesn't imply *non-causation* either. It could be that triples cause fewer runs, or it could be that there's some third factor that's positively correlated with triples, but negatively correlated with runs scored. Knowing something about baseball lets us argue for which conclusion makes more sense.
Normally, when you interpret a regression result like this, you say something like: "all else being equal, one extra triple will reduce the number of runs scored by about 0.44." But that's not quite right. The "all else" doesn't refer to everything in the universe -- it only refers to everything else *you controlled for in that regression*. Which, in this case, was nothing -- we only regressed on triples.
A more accurate way to put the regression result is:
"One extra triple is associated with a reduction of the number of runs scored by about 0.44. That's either because of the triple itself, or because of something else about teams who hit more triples, something that wasn't controlled for in the regression."
Now, a possible explanation becomes apparent. Teams that hit lots of triples are usually faster teams. Fast teams tend to have fewer fat strong guys who hit for power. Therefore, maybe hitting lots of triples suggests that your team doesn't have much power, which is why triples are negatively correlated with runs scored.
Again, the regression didn't suggest that: it was our knowledge of baseball.
We can test that hypothesis, and there are a couple of ways to test it. First, we can test for a correlation between triples and other hits. And, yes, the correlation between triples and home runs is -0.31: teams who hit a lot of triples do indeed hit fewer home runs than average.
Or, we can just include home runs in the regression. If we do that, we get the equation
Runs = 373 + (1.84 * triples) + (1.93 * home runs)
"Home runs being equal, one extra triple will increase the number of runs scored by about 1.84. That's either because of the triple itself, or because of something else about teams who hit more triples (something other than home runs, which was controlled for)."
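You can see the sign flip directly in code. This sketch uses invented data where triples trade off against home-run power: regressing on triples alone gives a negative coefficient, but adding home runs to the regression recovers a positive value close to the true weight built into the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1121

# Hypothetical teams where triples trade off against home-run power.
speed = rng.normal(0.0, 1.0, n)
triples = 35 + 10 * speed + rng.normal(0, 5, n)
homers = 150 - 25 * speed + rng.normal(0, 20, n)
runs = 400 + 1.0 * triples + 1.4 * homers + rng.normal(0, 25, n)

# Triples alone: the coefficient comes out negative (confounded).
alone = np.polyfit(triples, runs, 1)[0]

# Triples plus home runs: controlling for power recovers roughly +1.0 per triple.
X = np.column_stack([np.ones(n), triples, homers])
coef, *_ = np.linalg.lstsq(X, runs, rcond=None)
print(alone, coef[1])  # sign flips once home runs are held constant
```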
Of course, there's still something else other than home runs: our baseball knowledge tells us that teams that hit lots of triples are likely to be different in doubles power, too. And, in fact, in almost every other category: singles, outs, walks, steals, and caught stealings. So if we do a regression on all that stuff, we get:
Runs = 42
+ (0.52 * singles)
+ (0.67 * doubles)
+ (1.18 * triples)
+ (1.48 * home runs)
+ (0.33 * walks)
+ (0.18 * steals)
- (0.21 * caught stealing)
- (0.11 * batting outs (which is AB-H)).
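Expressed as code, that fitted equation is just a weighted sum of a team's batting line. The batting line below is a made-up but plausible example, not any real team:

```python
# Runs estimate from the regression coefficients above.
def estimated_runs(singles, doubles, triples, homers, walks, steals, cs, outs):
    return (42
            + 0.52 * singles + 0.67 * doubles + 1.18 * triples
            + 1.48 * homers + 0.33 * walks
            + 0.18 * steals - 0.21 * cs
            - 0.11 * outs)  # outs = AB - H

# A hypothetical team batting line.
print(round(estimated_runs(1000, 250, 35, 150, 550, 100, 50, 4100), 1))  # 730.8
```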
Now we're getting values that are close to traditional Linear Weights. But not completely. For instance (and as Tango noted), we get that a double is worth only 0.67 runs, rather than the 0.8 that we're used to.
Assuming the 0.8 is actually the correct number, we ask: What's going on? Why are we getting only 0.67? It isn't just random variation, because the standard error of the 0.67 estimate in the regression is only 0.02. So what is it?
There must be something about teams that hit a lot of doubles that reduces the number of runs they score, in ways *other* than changing the number of singles, triples, home runs, walks, steals, caught stealings, and batting outs. What could that be?
I don't know the answer, but here are some possibilities:
-- maybe teams that hit a lot of doubles (relative to the other events) are more likely to be intentionally walked. Therefore, their walks are less valuable than those of other teams. Every additional double may correlate with one extra walk turning out to be an IBB, which results in the regression giving a lower coefficient for the double.
-- the regression went from 1961 to 2008. Maybe teams that hit a lot of doubles (relative to other events) played in low-offense eras (like the mid 60s). The extra doubles mark the team as being from that era, which makes all events worth less, which makes the regression adjust by giving a lower coefficient for the double.
-- maybe teams that hit a lot of doubles ground into a lot of double plays. Since double plays are extra outs that don't show up in (AB-H), that would cause runs to be overestimated. The regression adjusts for that by building the extra DPs into the value of the double.
And so on. I don't know what the true answer is; none of the suggestions above seem very likely to me. It's a bit of a mystery. I'd add IBB to the regression, but the Lahman database doesn't seem to have it for teams. Maybe I'll calculate it some other way and try again.
Anyway, it was a bit of a shock to me that the doubles estimate was so far off. I would have thought that a technique like linear regression, with over 1,000 rows of data, would be able to come up with the answer. But it didn't, almost certainly because of outside factors that we didn't control for. Not only that, but we don't even really know what those outside factors are! (Although if you have an idea, let me know in the comments.)
So the accepted value for the double is 0.8 runs, using a method I will explain shortly. The regression, on the other hand, gives only 0.67 runs.
It's not that the regression answer is wrong -- it's just that the regression answers a different question than the one we want.
The 0.8 method asks: "If a team happens to hit an extra double, how many more runs will it score?"
The 0.67 method asks: "If one team hits one more double than another team, how many more runs will it score, taking into account that the extra double suggests the team might be slightly different in other ways?"
To most analysts, the first question is more important. Why? Because we really *do* want to find the cause-and-effect relationship. Confounding variables may be interesting, but they usually get in the way of what we're really trying to find. It's interesting to note that more triples is correlated with fewer runs scored. But that information isn't very useful to our understanding of baseball -- it doesn't tell us what makes teams win. That is, we hopefully aren't about to tell the 1985 Cardinals that they would have scored more runs by hitting fewer triples, at least not unless we really believe that triples hurt offense.
What we usually want to do, using Linear Weights, is answer questions like: if you release player X, and sign player Y, where player Y hits 10 more triples than player X, how much will the team improve? And, for that question, the regression gives us the wrong answer.
So what's the method we use to find the *true* value of a triple? It's pretty simple:
-- For a certain period of baseball history, look at the play-by-plays of all the games, and classify every plate appearance (and baserunning event) according to which of the 24 base/out states (such as no outs, runners on second and third) it occurred in.
-- For each state, calculate how many runs were scored after that state was achieved. (Here's the one for 2009, from Baseball Prospectus.)
-- Now, for every triple, calculate the difference between the runs before the triple, and the runs after the triple. For instance, a leadoff triple would have been worth .79 runs (there was an expectation of .52 runs before the triple, and 1.31 runs after the triple). But a triple with the bases loaded and two outs was worth 2.36 runs (before the triple, .75 runs were expected to score. After the triple, only .11 runs were expected, but 3 runs actually did score. 3.11 minus .75 equals 2.36).
-- Average out all the values of all the triples, as calculated above.
That average is how much an extra triple is worth to an average team.
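The before/after bookkeeping in the steps above fits in a few lines of code. The run-expectancy figures below are just the ones quoted in the examples earlier; a real calculation would use the full 24-state table for the league and era in question.

```python
# Run value of an event = (RE after) - (RE before) + runs scored on the play.
def run_value(re_before, re_after, runs_scored):
    return re_after - re_before + runs_scored

# Run-expectancy figures quoted in the text above (approximate values).
RE = {
    ("empty", 0): 0.52,    # bases empty, no outs
    ("3rd", 0): 1.31,      # runner on third, no outs
    ("loaded", 2): 0.75,   # bases loaded, two outs
    ("3rd", 2): 0.11,      # runner on third, two outs
}

leadoff = run_value(RE[("empty", 0)], RE[("3rd", 0)], 0)        # 0.79 runs
bases_loaded = run_value(RE[("loaded", 2)], RE[("3rd", 2)], 3)  # 2.36 runs
print(round(leadoff, 2), round(bases_loaded, 2))
```

Averaging `run_value` over every triple in the play-by-play data gives the Linear Weight of the triple.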
If you do this for the 1988 American League (which is the one I happen to have on hand), you get that a triple is worth 1.024 runs. A double is worth 0.775.
There are several reasons why this estimate is much better than what you could get from a regression:
-- as we pointed out, the regression estimate is influenced by extraneous factors associated with teams that hit a lot of triples.
-- as Tango pointed out in his link, the regression uses data aggregated into team-seasons, which means you're losing a lot of information. This method uses PA by PA, inning by inning data, for a much more reliable estimate.
-- we have a direct, logical, cause-and-effect relationship.
-- in effect, we are able to hold *everything* constant, even factors we don't know about. That's because we are not comparing team X's triples to team X's runs. We're comparing a league-average triple to the league-average runs. All other confounding factors are averaged out.
Another way to look at it: the regression looks only at inputs and outputs. So it has no idea if the input *caused* the output, or if there's some third factor that links the two. But the play-by-play method isolates the direct effect of the input. It knows for sure that the triple *caused* the change from one state (bases empty, no outs) to the next (runner on third, no outs), and so it's not fooled by outside factors.
Correlation does not imply causation, and regression can only provide correlation. Why not use this method, which is based on causation, and therefore gives you the right answer?
UPDATE: Below, commenter Ted links to a paper (.pdf) where he used regression to figure Linear Weights, and found that when he added variables for GIDP and reached on error, the doubles coefficient increased (from .689 to .722). The HR coefficient also increased (by 10 points). So that, I think, explains part of the mystery: the doubles are artificially low because teams that hit a lot of doubles and home runs are so slow that they also hit into a lot of DPs and don't reach base on error as much. When these are accounted for separately, some of the true value of the double is restored.