Sabermetric Research: Don't use regression to calculate Linear Weights, Part II

Friday, October 30, 2009

Don't use regression to calculate Linear Weights, Part II

Last post, I wrote about how using regression to estimate Linear Weights values is a poor choice, and that the play-by-play (PBP) method is better. After thinking a bit more about it, I realized that I could have made a stronger case than I did. Specifically: the regression method gives you an *estimate* of the value of events, but the play-by-play method gives you the *actual answer*. That is, even a "perfect" regression, with an unlimited amount of data, can never be more accurate than the results of the PBP method.

An analogy: suppose that, a while ago, someone randomly dumped some red balls and some white balls into an urn. If you were to draw one ball out of the urn, what would be the probability it's red?

Here are two different ways of trying to figure that out. First, you can observe people coming and drawing balls, and you can see what proportion of balls drawn turned out to be red. Maybe one guy draws ten balls (with replacement) and six are red. Maybe someone else comes up and draws one ball, and it's white. A third person comes along and draws five white out of 11. And so on. Over all, maybe there are 68 balls drawn total, and 40 of them are red.

So what do you do? You figure that the most likely estimate is 40/68, or 58.8%. You then use the normal approximation to the binomial distribution to figure out a confidence interval for your estimate.
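As a rough illustration of that first method, here is a minimal sketch of the interval calculation for 40 red balls in 68 draws. The function name and the choice of a 95% level (z = 1.96) are assumptions made for the example, not something specified above.

```python
import math

def normal_approx_interval(successes, trials, z=1.96):
    """Return (estimate, low, high) using the normal approximation to the binomial.

    z=1.96 corresponds to a 95% interval; that level is an assumption here.
    """
    p_hat = successes / trials
    std_err = math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, p_hat - z * std_err, p_hat + z * std_err

# 40 red balls observed in 68 draws, as in the example above.
estimate, low, high = normal_approx_interval(40, 68)
print(f"estimate = {estimate:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval comes out roughly 0.47 to 0.71, which is wide enough to show how little 68 draws pin down the contents of the urn.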

That's the first way. What's the second way?

The second way is: you just empty the urn and count the balls! Maybe it turns out that the urn contains exactly 60 red balls and 40 white balls. So we now *know* the probability of drawing a red ball is 0.6.

If the second method is available, the first is completely unnecessary. It gives you no additional information about the question once you know what's in the urn. The second method has given you an exact answer.

That, I think, is the situation with Linear Weights. The regression is like observing people draw balls; you can then make inferences about the actual values of the events. But the PBP method is like looking in the urn -- you get the answer to the question you're asking. It's a deterministic calculation of what the regression values will converge to, if you eventually get the regression perfect.

-------

To make my case, let me start by (again) telling you what question the PBP method answers. It's this:

-- If you were to take a team made up of league-average players, and add one double to its batting line, how many more runs would it score?

That question is pretty much identical to the reverse:

-- If you were to take a team made up of league average players, and remove one double from its batting line, how many fewer runs would it score?

I'm going to show how to answer the second question, because the explanation is a bit easier. I'll do it for the 1992 American League.

Start with a listing of the play-by-play for every game (available from Retrosheet, of course). Now, let's randomly choose which double we're going to eliminate. There were 3,596 doubles hit that year; pick one at random, and find the inning in which it was hit.

Now, write out the inning. Write out the inning again, without the double. Then, see what the difference is in runs scored. (The process is almost exactly the same as how you figure earned runs: you replay the inning pretending the error never happened, and see if it saves you some runs.)

Suppose we randomly came up with Gene Larkin's double in the bottom of the fourth inning against the Angels on June 24, 1992. The inning went like this:

Actual: Walk / Double Play / DOUBLE / Fly Out -- 0 runs scored.

Without the double, our hypothetical inning would have been

Hypoth: Walk / Double Play / ------ / Fly Out -- 0 runs scored.

Pretty simple: in this case, taking away the double makes no difference, and costs the team 0 runs.

On the other hand, suppose we chose Brady Anderson's leadoff first-inning double on April 10. With and without the double:

Actual: DOUBLE / Double / Single / Double Play / HR / Ground Out -- 3 runs scored.

Hypoth: ------ / Double / Single / Double Play / HR / Ground Out -- 2 runs scored.

So, in this case, removing the double cost 1 run.

If we were to do this for each of the 3,596 doubles, we could just average out all the values, and we'd know how much a double was worth. The only problem is that sometimes it's hard to recreate the inning. For instance, Don Mattingly's double in the sixth inning on September 8:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Removing the double gives

Hypoth: Out / Single / ------ / Single / Fly Ball / Fly Ball

How many runs score in this reconstructed inning? We don't know. If the second single advanced the runner to third, and the subsequent fly ball was deep enough, one run would have scored. Otherwise, it would be 0 runs. So we don't know which it would have been. What do we do?

The problem arises because we were picking innings randomly and dividing them into halves (the half before the double, and the half after the double). That process creates an inconsistency in the hypothetical inning. The second half of the inning, in real life, started with one out and runners on second and third. The hypothetical second half started with one out and a runner on first. That created the problem.

So, since we're picking randomly anyway, why don't we throw away the *real* second half of the inning, and instead pick the second half of some *other* inning, some inning where there actually IS one out and a runner on first? That will always give us a consistent inning. And while it will give us a different result for this run of our random test, over many random tests, it'll all even out.

We might randomly choose Cleveland's fourth inning against the Royals on July 16. In that frame, Mark Whiten struck out and Glenallen Hill singled, which gives us our required runner-on-first-with-one-out. After that, Jim Thome singled, and Sandy Alomar Jr. grounded into a double play.

Grafting the end of that inning (single, double play) on to the beginning of the original inning gives us our "consistent" hypothetical inning:

Hypoth: (stuff to put a runner on first and one out) / ------ / Single / GIDP -- 0 runs scored.

Since the Yankees scored two runs in the original, unadulterated inning, and zero runs in the hypothetical inning, this run of the simulation winds up with a loss of two runs.

Now, there's nothing special about that Cleveland fourth inning: we just happened to choose it randomly. There were 6,380 cases of a runner on first with one out, and we could have chosen any of them instead.

The inning could have gone:

Out / Single / ------ / result of inng 1 of 6,380
Out / Single / ------ / result of inng 2 of 6,380
Out / Single / ------ / result of inng 3 of 6,380
Out / Single / ------ / result of inng 4 of 6,380
...
Out / Single / ------ / result of inng 6,380 of 6,380

If we run the simulation long enough, we'll choose every one of the 6,380 equally. And so, we'll wind up with just the average of those 6,380 innings. So we can get rid of the randomness in the second half of the inning just by substituting the average of the 6,380. Then our "remove the double" hypothetical becomes:

Out / Single / ------ / average of all 6,380 innings
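The substitution of an average for a random pick can be checked with a small sketch. The list of outcomes below is made up for illustration (it is not Retrosheet data), and the function name is hypothetical; the point is only that repeatedly grafting on a randomly chosen continuation converges to the plain average of all continuations.

```python
import random

# Hypothetical runs scored in the remainder of each inning that reached
# "runner on first, one out". Invented values, NOT real 1992 AL data.
rest_of_inning_runs = [0, 0, 0, 1, 0, 2, 0, 0, 3, 0, 1, 0]

def simulated_average(outcomes, draws, seed=0):
    """Average the result of picking a random continuation `draws` times."""
    rng = random.Random(seed)
    total = sum(rng.choice(outcomes) for _ in range(draws))
    return total / draws

# The exact answer: just average every continuation once.
exact = sum(rest_of_inning_runs) / len(rest_of_inning_runs)

# The simulated answer: many random picks, which drifts toward `exact`.
approx = simulated_average(rest_of_inning_runs, 100_000)

print(f"exact average = {exact:.4f}, simulated average = {approx:.4f}")
```

With enough draws the two numbers agree to a couple of decimal places, which is why the random step can be skipped entirely.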

And, with the help of Retrosheet, we find that after having a runner on first and one out, those 6,380 innings resulted in an average of 0.510 runs being scored. So now we have:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Hypoth: Out / Single / ------ / plus an additional 0.510 runs.

I'll rewrite that to make it a bit easier to see what's going on:

Actual: (stuff that put a runner on first with one out) / DOUBLE / other stuff that caused 2 runs to be scored

Hypoth: (stuff that put a runner on first with one out) / other stuff that caused 0.510 runs to be scored, on average

So, for this inning, we can say that removing the double cost 1.490 runs.

Now, the "actual" inning was again random. We happened to choose the Yankees' 6th inning on September 8. But we might have chosen another, similar, inning where there was a runner on first with one out, and a double was hit, and the runner held at third. This particular September 8 inning led to two runs. Another such inning may have led to six runs, or three runs, or no runs (maybe there were two strikeouts after the double and the runners were stranded).

So, what we can do, is aggregate all these types of innings. If we look to Retrosheet, we would find that there were 796 times where there were runners on second and third with one out. In the remainder of those 796 innings, 1129 runs were scored. That's an average of 1.418 per inning.
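In code, that aggregation is just a grouped average. The sketch below assumes each play-by-play record has already been reduced to a tuple of (bases occupied, outs, runs scored in the rest of the inning); that record layout and the function name are assumptions for illustration, and the three sample records are invented.

```python
from collections import defaultdict

def run_expectancy(records):
    """Average runs scored in the rest of the inning, per base/out state.

    `records` is an iterable of (bases, outs, runs_in_rest_of_inning) tuples.
    This layout is a hypothetical simplification of real play-by-play data.
    """
    totals = defaultdict(lambda: [0, 0])  # state -> [occurrences, total runs]
    for bases, outs, runs in records:
        entry = totals[(bases, outs)]
        entry[0] += 1
        entry[1] += runs
    return {state: runs / count for state, (count, runs) in totals.items()}

# Invented toy records: two innings from second-and-third with one out,
# one inning from runner-on-first with one out.
toy = run_expectancy([("23", 1, 2), ("23", 1, 0), ("1", 1, 1)])
print(toy)

# The figure quoted above: 1129 runs in 796 such innings.
print(round(1129 / 796, 3))  # 1.418
```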

So we can write:

Actual: stuff that put a runner on 1st with one out / DOUBLE leading to runners on 2nd and 3rd with one out / Other stuff leading to 1.418 runs scoring, on average.

Hypoth: stuff that put a runner on 1st with one out / ------ / Other stuff leading to 0.510 runs scoring, on average.

And so, we know that a double with a runner on first and one out, which ends with runners on 2nd and 3rd, is worth, on average, 0.908 runs (1.418 minus 0.510).

Let's write this down this way, as an equation:

+0.908 runs = Runner on 1st and one out + Double, runner holds

We can repeat this analysis. What if the runner scores? Then, it turns out, the average inning led to 1.646 runs scoring instead of 0.510. So:

+1.136 runs = Runner on 1st and one out + Double, run scores

We can repeat this for every combination of bases and double results we want. For instance:

-0.217 runs = Runner on 1st and one out + Double, runner thrown out at home

+1.000 runs = Runner on 2nd and nobody out + Double

+1.166 runs = Runner on 1st and two out + Double, runner safe at home and batter goes to third on the throw

I'm not sure how many of these cases there are, but we can look to Retrosheet and list them all. At the end, we have a huge list of all possible combinations of doubles, and what they were worth in runs. We just have to average them, weighted by how often they happened, and we're done. We then have the answer.

As it turns out, the answer for the 1992 American League works out to 0.763 runs.

The answer is NOT an estimate based on a model with random errors that we have to eliminate. It's the exact answer to the question, the same way counting the balls in the urn gave us an exact answer.

Just to be absolutely clear, here's what we've shown:

Suppose we randomly remove one double from the 1992 American League. Then, we reconstruct the inning from the point of that double forward, by looking at the base/out situation before the double, finding a random inning with that same base/out situation, and substituting that new inning instead of what really happened.

If we do that, we should expect 0.763 runs fewer will be scored. If we were to run this same random test a trillion times, the runs lost will average out to .763 almost exactly.

If you try to answer this question by running a regression, to the extent that your estimate is different from 0.763, you got the wrong answer.

------

Anyway, the explanation above was a complicated way of describing the process. Here's a simpler description of the algorithm.

1. Using Retrosheet data, find every situation where there was a runner on second and no outs. It turns out there were 1,572 such situations in the 1992 AL. Count the total number of runs that were scored in the remainder of those innings. It turns out there were, on average, 1.095 runs scored each time that happened (1,722 runs scored in those 1,572 innings).

2. Repeat this process for the other 23 base-out states (two-outs-bases-loaded, one-out-runners-on-first-and-third, and so on). If you do that, and put the results in the traditional matrix, you get:

0 out    1 out    2 out
------------------------------------
0.482    0.248    0.096    nobody on
0.853    0.510    0.211    first
1.095    0.646    0.293    second
1.494    0.907    0.423    first/second
1.356    0.940    0.377    third
1.804    1.151    0.470    first/third
2.169    1.418    0.598    second/third
2.429    1.549    0.745    loaded

3. Find every double hit in the 1992 AL. For each of those 3,596 doubles, figure (a) the run value from the above table *before* the double was hit; (b) the run value for the situation *after* the double; and (c) the number of runs that scored on the play.

The value of that double is (b) - (a) + (c). For instance, a 3-run double with the bases loaded and 2 outs is worth 0.293 minus 0.745 plus 3. That works out to 2.548 runs.

4. Average out each of the 3,596 run values. You'll get 0.763.
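The four steps can be sketched in a few lines. The matrix values are the ones from step 2; the state labels ("empty", "1", "23", and so on), the function names, and the short list of sample doubles are assumptions made for illustration. A real run would feed in all 3,596 doubles from Retrosheet.

```python
# Run expectancy matrix from step 2 (1992 AL): RUN_EXP[bases] = (0 out, 1 out, 2 out).
# The string labels for the base states are a convention invented for this sketch.
RUN_EXP = {
    "empty": (0.482, 0.248, 0.096),
    "1":     (0.853, 0.510, 0.211),
    "2":     (1.095, 0.646, 0.293),
    "12":    (1.494, 0.907, 0.423),
    "3":     (1.356, 0.940, 0.377),
    "13":    (1.804, 1.151, 0.470),
    "23":    (2.169, 1.418, 0.598),
    "123":   (2.429, 1.549, 0.745),
}

def expectancy(bases, outs):
    """Expected runs in the rest of the inning; zero once the third out is made."""
    return 0.0 if outs >= 3 else RUN_EXP[bases][outs]

def event_value(before, after, runs_scored):
    """Step 3: (b) - (a) + (c), where before/after are (bases, outs) pairs."""
    return expectancy(*after) - expectancy(*before) + runs_scored

def average_value(events):
    """Step 4: plain average over every occurrence of the event."""
    values = [event_value(before, after, runs) for before, after, runs in events]
    return sum(values) / len(values)

# The worked example from step 3: a 3-run double with the bases loaded and 2 outs.
example = event_value(("123", 2), ("2", 2), 3)
print(round(example, 3))  # 2.548

# Three hypothetical doubles, invented for illustration only.
sample_doubles = [
    (("123", 2), ("2", 2), 3),   # bases-clearing double with two out
    (("2", 0), ("2", 0), 1),     # runner on second scores, batter takes his place
    (("empty", 0), ("2", 0), 0), # leadoff double
]
print(round(average_value(sample_doubles), 3))
```

Because every occurrence gets equal weight in step 4, base/out situations that come up more often automatically count for more in the final average.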

It's that simple. You can repeat the above for whatever event you like: triples, stolen bases, strikeouts, whatever. Here are the values I got:

-0.272 strikeout
-0.264 other batting out
+0.178 steal
+0.139 defensive indifference
-0.421 caught stealing
+0.276 wild pitch
+0.286 passed ball
+0.277 balk
+0.307 walk (except intentional)
+0.174 intentional walk
+0.331 HBP
+0.378 interference
+0.491 reached on error
+0.460 single
+0.763 double
+1.018 triple
+1.417 home run
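To connect these values back to the original question (what happens when one double is added to a batting line), here is a small sketch. The weights are the ones listed above; the event counts in the batting line are invented for illustration, and the function name is hypothetical.

```python
# Linear weights for the 1992 AL, as listed above.
WEIGHTS_1992_AL = {
    "strikeout": -0.272,
    "other batting out": -0.264,
    "steal": 0.178,
    "defensive indifference": 0.139,
    "caught stealing": -0.421,
    "wild pitch": 0.276,
    "passed ball": 0.286,
    "balk": 0.277,
    "walk": 0.307,              # excludes intentional walks
    "intentional walk": 0.174,
    "HBP": 0.331,
    "interference": 0.378,
    "reached on error": 0.491,
    "single": 0.460,
    "double": 0.763,
    "triple": 1.018,
    "home run": 1.417,
}

def linear_weights_runs(event_counts):
    """Sum of (weight x count) over the events in a batting line."""
    return sum(WEIGHTS_1992_AL[event] * count for event, count in event_counts.items())

# Hypothetical event counts, invented for illustration only.
base_line = {
    "single": 100, "double": 30, "home run": 20,
    "walk": 50, "strikeout": 90, "other batting out": 310,
}
plus_one_double = dict(base_line, double=base_line["double"] + 1)

difference = linear_weights_runs(plus_one_double) - linear_weights_runs(base_line)
print(round(difference, 3))  # 0.763
```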

-----

Anyway, my point in the original post wasn't meant to be "regression is bad." What I really meant was, why randomly pull balls from urns when Retrosheet gives you enough data to actually count the balls? This method gives you an exact number.

One objection might be that, to do it this way, there's way too much data to use, and so regression is a more practical alternative. But is it really better to use a wrong value just because it's easier to calculate?

Besides, you don't have to calculate them yourself -- they've been calculated, repeatedly, by others who can be referenced. In the worst case, they're close to the "traditional" weights, as calculated by Pete Palmer some 25 years ago. If you need a solid reference, just use Pete's numbers. They're closer than you'll get by running a regression, even a particularly comprehensive one.

Labels: linear weights, regression

32 Comments:

At Friday, October 30, 2009 12:13:00 PM, Phil Birnbaum said...: Disclaimer: when I say this gives you the "right answer," I'm exaggerating a tiny little bit. You could nitpick and say that there may be small features of real life this doesn't capture. For instance, when we figured out the value of the double that left runners on second and third with one out, we looked at *all* innings with runners on second and third with one out, not just those innings that got there via a double. If you believe that this makes a difference, and that there's a reason that getting to that stage via (out, single, double) is different than getting there via (out, single, single, stolen base), you may disagree with the results.

What I'm saying is that this gives the "right answer" if you agree with the assumptions built in to the model.
At Friday, October 30, 2009 1:18:00 PM, Guy said...: Nice post, Phil. I was going to offer a quibble, but then you made it yourself in comment #1! And I agree that there's really no "parsimony" case for regression at this point, since the linear weights have been calculated. (And with the current cost of computing power, parsimony ain't what it used to be.)

It occurs to me that it should be possible to establish general LWs based on a league's OBP or R/G that would be quite accurate (and much easier than customizing for every season). The weight of the HR never really changes, but other events are worth a bit more when OBP is higher. Has anyone ever developed a model for establishing OBP-dependent LWs?
At Friday, October 30, 2009 1:20:00 PM, Phil Birnbaum said...: Guy, I think Tango has a calculator to create linear weights values, but I'm not sure if it's a formula or a simulation ... anyone else know if there's a method to predict LW values from the league stats?

Now *this* might be where a regression could come in handy ...
At Friday, October 30, 2009 1:36:00 PM, Phil Birnbaum said...: I had originally left this comment, which I now realize is not quite right:

-----

At the risk of repeating myself too many times, here's one last way to think about it.

Suppose you decide to simulate an inning of a baseball game this way. First, you start with nobody on and nobody out. You then go to the play-by-play data for the 1992 American League, and randomly pick one of the 21,017 things that happened with nobody on and nobody out. Let's say it turns out to be a single to right field. So you have a runner on first and nobody out.

Now, you randomly pick one of the 5,526 things that happened with a runner on first and nobody out. Suppose it's a fly ball. You have a runner on first and one out.

Now, you randomly pick from things that happened with a runner on first and one out.

And so on, until the inning ends. At that point, you have a number of runs that scored.

If the inning has at least one double in it, pick one randomly (which is easy if there's only one). Take everything that happened up until the double, not including the double. Then repeat the randomizing process from that point on, creating a second, "control" inning. (You might actually wind up with another double, but that's OK.)

You should expect the control inning to yield 0.763 fewer runs than the original inning. Of course, that single test can't yield a fractional number of runs -- but if you repeat this enough times, the average will be 0.763, in the same way that if you toss a fair coin a few quintillion times, the average number of heads per toss will be 0.500.

-----

What's wrong is that if you don't take actual ends of innings, but simulate ends of innings, you might not get exactly 0.763. What you get depends on the idiosyncrasies of that particular season. It will be something probably very *close* to 0.763, but not necessarily exactly 0.763.

You'll still do better than the regression, though!
At Friday, October 30, 2009 2:35:00 PM, p said...: Phil, nice set of posts on LW.

Guy and Phil, there are at least three non-PBP estimation techniques out there:

1) Tango's calculator, which is based on his Markov model

2) Estimates derived from a good non-linear run estimator (like BsR)--you can use the "+1" approach, which is just adding one (or some fraction) of each event to the league totals and finding the change in BsR, or you can use partial derivatives as I did in a BTN article (Nov 05 maybe?)

3) Both Tango and David Smyth have published some quick and dirty estimates of varying degrees of complexity--one was based just on league R/G, Tango had one based largely on league OBA, Smyth had one that used the league run scored/baserunner as a starting point. Maybe Tango will come over here and share some of his.
At Friday, October 30, 2009 3:27:00 PM, Phil Birnbaum said...: Thanks, p. The problem with (1) and (2) is that even though you and I may trust them, academic types might say that (a) Tango's Markov algorithm isn't public or proven to work, and (b) Base Runs (or whatever) hasn't been proven accurate enough to rely on, even using partial derivatives.

I mean, if some people refuse to use LW, which has been around for 25 years, and use regression instead, it's hard to see them using BaseRuns.

The PBP algorithm, on the other hand, speaks for itself.

Just from the point of view of those who want to run their own regressions ...
At Friday, October 30, 2009 3:40:00 PM, JavaGeek said...: You speak like each interval is independent and a "double" in the 1st has no bearing on events that occur in the second inning? (I'm assuming this weights system is done on an inning basis)

The question I have: are the periods used in the regression identical to those used in the weights? If not, why not?

For example if the regression is
Team Runs/year = a x single + b x double etc.
And the weights are done on a per-inning basis - how is that fair or even answering the same question?

Secondly, what is the variance or errors of your weights estimates? One benefit of regressions is that these things are easy to find.

Finally, you speak as if you "know what's in the urn", yet we still need to use "observed random samples".

Also, I'm kind of curious if you have a proper regression:
- Normally distributed
- Linear Relationship
- Measured without error
- Homoscedasticity
At Friday, October 30, 2009 4:09:00 PM, p said...: Phil, I wouldn't want to use any of those techniques in place of the empirical weights either. I was just referencing them as means of estimating the LW given the league stats. You are right, I'm sure, that those techniques would probably not be accepted by the folks who want to run regressions.

I found one of the quick-and-dirty techniques I was referring to, that relies only on R/G. It was published by David Smyth on the old FanHome (so I don't have a link):

1B = R/G/50 + .38
2B = R/G/40 + .65
3B = R/G/20 + .80
HR = R/G/100 + 1.355
SB = R/G/150 + .16
CS = -(R/G/15 + .12)
BB, HB = R/G/50 + .24 (not IBB)
AB-H = R/G/20 + .04

I don't think anyone ever formally tested these, but everyone in the thread thought they looked reasonable. In any event, they answer Guy's question of whether R/G-based LW approximations have ever been explored.

As you can see, those formulas take the form of a linear regression y = mx + b.
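For anyone who wants to try the formulas above, here is a minimal sketch that evaluates them for a given league R/G. The function name and dictionary keys are hypothetical, and the 4.5 runs per game in the usage line is an arbitrary illustrative input, not the 1992 AL figure. The AB-H entry is reproduced exactly as listed, without a sign; it presumably gives the size of the out value.

```python
def smyth_weights(runs_per_game):
    """Evaluate the quick-and-dirty R/G-based formulas quoted above."""
    rpg = runs_per_game
    return {
        "1B": rpg / 50 + 0.38,
        "2B": rpg / 40 + 0.65,
        "3B": rpg / 20 + 0.80,
        "HR": rpg / 100 + 1.355,
        "SB": rpg / 150 + 0.16,
        "CS": -(rpg / 15 + 0.12),
        "BB_HB": rpg / 50 + 0.24,  # walks and hit batsmen, not intentional walks
        "AB-H": rpg / 20 + 0.04,   # listed without a sign in the formulas above
    }

# 4.5 R/G is an arbitrary example input.
weights = smyth_weights(4.5)
for event, value in weights.items():
    print(f"{event:6s} {value:+.3f}")
```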
At Friday, October 30, 2009 4:34:00 PM, Cyril Morong said...: In looking at the run expectancy tables, I wonder if there is some kind of bias (there might not be or it might be really small and there might be nothing anyone can do about it). But the table says something like 2.4 RE with bases loaded and no outs.

But a bad pitcher is more likely to get to that base/out situation. So it may be that everything else is not being held constant. Maybe all of the cases cancel each other out so that we also tend to have more good pitchers pitching with one out and no one on base. Just wondering if any of the run values could be affected by this issue.

Also, I think it is interesting to wonder why regressions generally come close to the "right" run values for the basic events but doubles seem to be a lot farther off. Why would that be? Maybe there is something interesting to learn if we answered that. I think in the other post someone mentioned that teams that hit a lot of HRs hit a lot of doubles, but I don't know if anyone came up with a conclusive answer.
At Friday, October 30, 2009 5:46:00 PM, Phil Birnbaum said...: p,

Cool! That seems like something that would make sense ... regression seems like a decent enough way to estimate that stuff, at least until someone comes up with something better.
At Friday, October 30, 2009 6:34:00 PM, Anonymous said...: One thing I don't understand about the ball/urn analogy:

The balls are discrete units. When you dump over the urn and count the balls, you are counting in integers.

Runs are also discrete units. When a team scores runs, you are counting in integers.

If determining run values was EXACTLY equivalent to counting balls, you would always have integers.

When calculating linear weights, you are taking an average. The average you come up with has a standard deviation, from which you can calculate confidence intervals. These intervals tell you the precision of your calculated average. This seems a lot like the results of a regression, not counting integers.

When you count the number of balls in the urn, there are no averages, no standard deviations, and no confidence intervals. It is simply the number that you count.

Maybe you can explain again how this is the same as counting balls in an urn?
At Friday, October 30, 2009 10:48:00 PM, Phil Birnbaum said...: Anon,

When you count linear weights by the PBP method, there is no standard deviation because there is no sample. You are using *all* the data to come up with an exact average, just as you can count 60 red balls and 30 white balls and get exactly 66.67% red, with no standard deviation.

Note that even though balls are integers, you can still have 66.67% red balls in the urn. And even though runs are integers, a double can be worth 0.763 runs.
At Saturday, October 31, 2009 12:22:00 AM, JavaGeek said...: When you count linear weights by the PBP method, there is no standard deviation because there is no sample. You are using *all* the data to come up with an exact average, just as you can count 60 red balls and 30 white balls and get exactly 66.67% red, with no standard deviation.

So, if any of the 6380 innings that we averaged over had different results we would have ended up with the exact same answer?

For example your 1129 runs in 796 innings could have easily been 1100 runs in 800 innings? I believe runs scored is generally a heavy tailed distribution (someone told me this once, but I'll use Poisson for simplicity).

Would the variance in this situation not be: ~ 1129 so +/-65.0 (95%). So maybe it's 1.33 instead of 1.42?

I'm sure there are some rare play-by-play events that have much higher variations...

Or maybe I just don't understand how it works.
At Saturday, October 31, 2009 12:47:00 AM, Phil Birnbaum said...: Hi, Javageek,

There are two sources of randomness. First, there's randomness in the actual events on the baseball field. Then, there's randomness in how well your sample happens to match reality.

To do the urn analogy again: suppose Bud Selig comes along, and flips a coin 100 times. For each head, he puts a red ball in the urn. For each tail, he puts a white ball in the urn. As it turns out, he flipped 55 heads and 45 tails, so there are 55 red balls in the urn and 45 white balls.

Now, Robert Regression comes along. He pulls 10 balls out of the urn and winds up with 6 red. "Ha!" he says. "I suspect the urn has 60% red balls, plus or minus 15%."

Now Peter Play-by-Play comes along. He dumps out the contents of the urn and counts 55 red balls and 45 white.

Peter's estimate has no variation. Robert's does. That is completely independent of the fact that *the number of red balls in the urn* was itself chosen randomly. Neither Peter nor Robert takes that into account. Robert's confidence interval is ONLY for his sample matching the urn. It is NOT for what "should have" been in the urn.

The analogy to this stuff:

The actual play on the field is Bud Selig stuffing the urn with a random number of red balls. Robert is the regression trying to estimate LW values. Peter is the play-by-play method actually counting the balls.

Sure, the 1129 runs could have been 1100. If it were, the PBP method would come up with different weights. But the fact remains that, GIVEN that it turned out to be 1129 runs, the PBP method gives pretty much the exact answer for how many runs a double was worth *the way the season turned out*.

Does that help?
At Saturday, October 31, 2009 2:44:00 AM, JavaGeek said...: But the fact remains that, GIVEN that it turned out to be 10 tails, the PBP method gives pretty much the exact answer for how many tails were flipped *the way the coins flipped*

What good is that information. There were 10 tails so what? It's not a statistical analysis, but rather an observation about what has happened. Is it a meaningful observation? Is it a useful observation? These are questions that would need to be asked.

I'm curious, what happens to these estimates if say we break the data into two groups - games on even days and games on odd days?

Finally, how would this PBP system deal with correlated innings? What if a DOUBLE in one inning resulted in fewer RUNS in the next inning? This system wouldn't see the "total cost/benefit" of the DOUBLE as it cannot see past its vision (one inning). [If a pitcher is chased, or if the batter cycle changes because of hit etc.]

Anyway, I don't know enough about markov chains and baseball to really discuss these issues, so this is all I really have to say.
At Saturday, October 31, 2009 8:06:00 AM, Guy said...: Patriot: Thanks for the info from David S. The values generally make sense, although I'm surprised that the BB value isn't more heavily dependent on R/G.

Cyril: My theory on the wrong regression coefficient for doubles is that power hitting teams hit "better" doubles, i.e. they advance more runners, and some of that value is then captured by the HR coefficient.

I think regression also tends to overvalue BBs. That's probably because BB rate is so correlated with OBP. So teams that get a lot of BBs have more valuable BBs, those with few BBs have less valuable BBs. That would make it appear the average marginal value of a walk is higher than it really is.
At Saturday, October 31, 2009 8:53:00 AM, Cyril Morong said...: Guy

If power hitting teams hit "better" doubles, then my guess is that power hitters hit "better" doubles than other hitters. So whatever value we apply to a hitter will be biased. Some guys should get more than .8 for a double and some less. Is there a way to figure out how much each guy's double value should be altered?
At Saturday, October 31, 2009 11:41:00 AM, Anonymous said...: When you count linear weights by the PBP method, there is no standard deviation because there is no sample. You are using *all* the data to come up with an exact average, just as you can count 60 red balls and 30 white balls and get exactly 66.67% red, with no standard deviation.

But what about the averages you are using to come up with your final answer? Here are portions of your explanation:

And, with the help of Retrosheet, we find that after having a runner on first and one out, those 6,380 innings resulted in an average of 0.510 runs being scored.

If we look to Retrosheet, we would find that there were 796 times where there were runners on second and third with one out. In the remainder of those 796 innings, 1129 runs were scored. That's an average of 1.418 per inning.

And so, we know that a double with a runner on first and one out, which ends with runners on 2nd and 3rd, is worth, on average, 1.008 runs.

...

All of these numbers are averages, with a certain degree of error that you don't describe. Since these numbers are used to come up with your final answer, your final answer still contains the error of these estimates. So it's not like simply counting the number of balls.

Here's a better way of describing the analogy, IMO:

Each red ball = 1 run
You're trying to determine the composition of each run. So a red ball could be made up of 10 different wavelengths of red.

Single = Red wavelength 1
Double = Red wavelength 2
Triple = Red wavelength 3
...

When you're trying to come up with the value of a double, you're trying to estimate the amount of Red wavelength 2 in the balls. However, each ball has a different amount of Red wavelength 2. So you take the average amount of Red wavelength 2 in these balls, and you come up with your best guess of how much Red wavelength 2 is in each ball. If you choose one ball at random, there's no guarantee that your estimated Red wavelength 2 will be in that ball.

Same thing with runs. Choose one run at random, and there's no guarantee that a double would have contributed X% of that run. It's not counting, it's estimating, and error is still present.
At Saturday, October 31, 2009 12:05:00 PM, Phil Birnbaum said...: >All of these numbers are averages, with a certain degree of error that you don't describe.

Those numbers have no error in the statistical sense. They are averages of the entire set of observations.

Suppose I were to carefully poll every American, and get his/her exact age. At the end of the day, I have some 300,000,000+ ages. I then take their average age.

That average has NO error. The average is plus or minus zero. I am able to say that on October 31, 2009, the average age of a US resident was X. Period.

Same with the runs. If I count every time a runner was on second with one out, and give you the average, that average has NO statistical error.
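
To make the census point concrete, here is a minimal sketch in Python. The population is invented (random ages with a fixed seed), and the names `population`, `sample_a` and `sample_b` are illustrative assumptions, not anything from the post. The average over the full list is a plain count-and-divide with nothing to estimate; only an average taken from a sample carries a standard error.

```python
import random
import statistics

random.seed(0)  # fixed seed so the made-up population is reproducible

# Hypothetical "census": every member of the population is recorded.
population = [random.randint(0, 90) for _ in range(10_000)]

# Full-population average: a description of the data, not an estimate.
population_mean = sum(population) / len(population)

# Two different samples of 100 give two different estimates of that average.
sample_a = random.sample(population, 100)
sample_b = random.sample(population, 100)
mean_a = sum(sample_a) / len(sample_a)
mean_b = sum(sample_b) / len(sample_b)

# Standard error only makes sense for the sample-based estimate.
standard_error_a = statistics.stdev(sample_a) / len(sample_a) ** 0.5

print("population mean:", population_mean)
print("sample means:", mean_a, mean_b)
print("standard error of sample A's mean:", standard_error_a)
```

Note that the individual ages still vary a great deal; what has no sampling error is the average of the complete list.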
At Saturday, October 31, 2009 12:13:00 PM, Anonymous said...: Are you saying the standard deviation is zero? Are you saying the standard error is zero? Are you saying there are no confidence intervals around that average?
At Saturday, October 31, 2009 12:28:00 PM, Phil Birnbaum said...: Exactly! Given that the PBP came out the way it did in the 1992 AL, the 0.763 is *exactly* what an *average* double was worth in the 1992 AL.

Here's one more analogy. Suppose the urn consists of 50% red and 50% white. And Bob draws out 6 red and 4 white balls.

We can say that Bob drew out 60% red balls. But there is no error in that figure, because it's not an estimate! He did draw out EXACTLY 60% red balls, plus or minus zero.

Maybe you're thinking about another question: if we ran the 1992 AL season again, and the PBP was different, how much different would the doubles estimate be, and what's a confidence interval for that?

That's a legitimate question, but a different one. And no regression based only on 1992 data can answer it.

It is NOT the case that the standard error from a regression on 1992 data gives you a confidence interval for what the double might be worth for other seasons. That standard error just tells you the confidence interval for the observed average for THAT season, the 1992 AL. It's a way of estimating the 1992 value of the double, which we KNOW will come out to 0.763.
At Saturday, October 31, 2009 12:32:00 PM, Anonymous said...: OK, never mind. The analogy of measuring everyone's age is meaningless for this example.

Adding or removing doubles from an inning is about as useful as removing someone's 17th birthday. For one, you can only do it post hoc, and it's obvious that doing so doesn't jibe with reality.

And have you really examined every double ever hit in MLB history? If not, then it's not the same as measuring the age of every single American.

Finally, if your answer won't hold for all future doubles, then it's still an estimate. There's no immutable law that states for every double ever hit in the future, it will account for X runs.
At Saturday, October 31, 2009 2:09:00 PM, Cyril Morong said...: In 2008, in MLB, in non bases loaded cases (NBL), the 2Bs ratio (per AB) was .0538. In bases loaded cases (BL), it was .0634. So the 2B rate was 18% higher in BL cases since .0634/.0538 = 1.18.

The HR rate in NBL cases was .0293. In BL cases it was .0289. So it was just 1.4% lower.

Is it possible that the base-out situation affects the batting event? We would have to look at lots more years, of course. But could this affect the run values of events, no matter how they get calculated?
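
As a quick check of the arithmetic above, here is a short Python sketch. The four rates are the ones quoted in the comment; the helper name `pct_change` is just an illustrative choice.

```python
# Rates per at-bat as quoted in the comment (2008 MLB).
doubles_nbl, doubles_bl = 0.0538, 0.0634  # non bases loaded, bases loaded
hr_nbl, hr_bl = 0.0293, 0.0289

def pct_change(base, other):
    """Percentage change of `other` relative to `base`."""
    return (other / base - 1) * 100

doubles_change = pct_change(doubles_nbl, doubles_bl)
hr_change = pct_change(hr_nbl, hr_bl)

print(f"2B rate change with bases loaded: {doubles_change:+.1f}%")
print(f"HR rate change with bases loaded: {hr_change:+.1f}%")
```

The rates by themselves don't say how many at-bats sit behind each one, which is part of why more seasons would be needed before reading much into the gap.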
At Saturday, October 31, 2009 2:19:00 PM, Phil Birnbaum said...: The base/out situation *definitely* affects batting performance. But that's implicitly controlled for in the run expectancy matrix.

For instance, suppose that teams always hit home runs with a runner on first and nobody out. The matrix would then have a very high value for that cell.

Or, if teams ALWAYS hit a double with a runner on second (only), the matrix would show infinity in that cell. :)
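
Here is a minimal Python sketch of how a cell of the matrix gets filled in. It is not the program used for the post. The function names, state labels, and toy observations are hypothetical; the only real figures are the 796 occurrences and 1129 runs quoted earlier in this thread. Because each cell is just the average of whatever actually happened after that state, any tendency to hit differently in that state is already folded into the number.

```python
from collections import defaultdict

def run_expectancy(observations):
    """Average runs scored in the rest of the inning, per base-out state.

    `observations` is an iterable of (state, runs_rest_of_inning) pairs,
    one pair for every time the state occurred.
    """
    counts = defaultdict(int)
    runs = defaultdict(int)
    for state, runs_rest_of_inning in observations:
        counts[state] += 1
        runs[state] += runs_rest_of_inning
    return {state: runs[state] / counts[state] for state in counts}

def event_value(re_before, re_after, runs_on_play=0):
    """Run value of one play: change in run expectancy plus runs scored on it."""
    return re_after - re_before + runs_on_play

# Toy data (invented) just to show the shape of the input.
toy = [("2nd+3rd, 1 out", 2), ("2nd+3rd, 1 out", 0), ("2nd+3rd, 1 out", 1)]
toy_matrix = run_expectancy(toy)

# The one cell quoted in the thread: 1129 runs over 796 occurrences.
re_23_1out = 1129 / 796

# Starting-state value implied by the quoted 1.008 figure (1.418 - 1.008).
# This is an inference from the thread, not a number taken from the post.
re_1st_1out = 0.410
double_value = event_value(re_before=re_1st_1out, re_after=round(re_23_1out, 3))

print(round(re_23_1out, 3), round(double_value, 3))
```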
At Saturday, October 31, 2009 2:21:00 PM, Cyril Morong said...: Here are some other ideas for finding run values.

1. Why not use a good simulator? Have an average team play thousands of games, then program its doubles to go up by 20 (or whatever number you choose). You could simply have 20 singles become doubles. Then instead of having 20 singles become doubles, have 20 singles become HRs. Then you can see how many more runs the team scores in each case.

2. You could find an 8-team league. Then have 8 events: 1Bs, 2Bs, 3Bs, HRs, BBs, batting outs, SBs and CSs. You have 8 team equations for runs and 8 unknowns (the run values for each event). I think an Excel program can do determinants or something like that to solve that. Might be interesting to see what run values you get.
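
The second idea amounts to solving a square system of linear equations. Here is a minimal sketch in Python, shrunk to 3 teams and 3 events to keep it short. Everything in it is an assumption for illustration: the team counts, the "true" weights, and the helper name `solve_linear_system` are invented, and the run totals are generated from those weights rather than taken from any real league.

```python
from fractions import Fraction

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination using exact fractions.

    Raises StopIteration if the matrix is singular (no pivot found).
    """
    n = len(a)
    # Augmented matrix [a | b] with exact arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_val = m[col][col]
        m[col] = [v / pivot_val for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

# Hypothetical season totals per team: singles, doubles, batting outs.
counts = [
    [1000, 300, 4000],
    [950, 280, 4100],
    [1050, 260, 4050],
]

# Made-up run values, used only to generate consistent run totals.
weights = [Fraction("0.5"), Fraction("0.8"), Fraction("-0.1")]
runs = [sum(c * w for c, w in zip(row, weights)) for row in counts]

solved = solve_linear_system(counts, runs)
print([float(x) for x in solved])
```

With as many teams as unknowns, the solution reproduces every team's run total exactly, so any luck in those totals goes straight into the run values. On real data the answers could look quite strange for that reason.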
At Friday, January 22, 2010 10:56:00 AM, upaulo said...: Hi, Phil

Good article.

It should be possible to study the memory(less) property.

Consider none out, none on base.
Following a home run: if we refine the model so that the preceding home run defines a state different from the leadoff state, does the remainder of the inning have a different value?
Following a three-run home run: does the remainder of the inning have a value different from the leadoff state?

Consider two out, bases loaded.
Following a base on balls: if we distinguish that from two out, bases loaded (in general or no runs scored), same as above ...

If the measured run expectancy value of a leadoff home run is greater than one, or that of a bases loaded walk is greater than one, or similarly for more complicated cases, then there is a followup question whether to attribute the refined RE values to batters. In other words, should we then refine the PBP linear weights model of batter value in the population studied, by including simple memory in the event definitions? or should we use those findings only to refine a Markov model of the half-inning? It's debatable.

That's all for now. Among other things, I'm not sure that comments on this page are still current.

Paul Wendt
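
One way to check the memory question is to compute the average twice, once keyed by base-out state alone and once keyed by state plus the preceding event. Here is a minimal Python sketch. The record layout, field names, and the four toy records are all hypothetical, and the numbers are invented, so the output says nothing about real baseball.

```python
from collections import defaultdict

def mean_runs_by_key(records, key):
    """Average rest-of-inning runs, grouped by whatever `key(record)` returns."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for rec in records:
        k = key(rec)
        sums[k] += rec["runs_rest_of_inning"]
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Invented records: same base-out state, with and without a preceding home run.
records = [
    {"state": "empty, 0 out", "prev_event": None, "runs_rest_of_inning": 0},
    {"state": "empty, 0 out", "prev_event": None, "runs_rest_of_inning": 1},
    {"state": "empty, 0 out", "prev_event": "HR", "runs_rest_of_inning": 1},
    {"state": "empty, 0 out", "prev_event": "HR", "runs_rest_of_inning": 2},
]

# Memoryless model: the state alone defines the cell.
plain = mean_runs_by_key(records, key=lambda r: r["state"])

# Refined model: the preceding event splits the cell in two.
refined = mean_runs_by_key(records, key=lambda r: (r["state"], r["prev_event"]))

print(plain)
print(refined)
```

If the refined cells differ by more than a small amount on real play-by-play data, the memoryless assumption is doing some work. Whether to credit that difference to batters is the debatable part.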
At Friday, January 22, 2010 11:42:00 AM, Phil Birnbaum said...: Hi, Paul,

Comments are moderated after 28 days ... I was getting too much spam.

Agreed that what you suggest would be something to study. You'd have to consider other things, like batting order ... you'd expect more runs in an inning where the number 1 leadoff hitter homers than in an inning where the number 6 leadoff hitter homers. And so forth.

Some of that is already built in, because home runs aren't evenly distributed through the batting order, but only some of it.
At Tuesday, February 02, 2010 4:00:00 PM, upaulo said...: Hi, Phil

Last week I discussed the richer state space defined by the familiar base-out states and perhaps one preceding event (some memory). For example: bases empty, no outs, following a home run.

The state definition may be refined by specifying the batting position, too. For example: runner on first, two out, batter 8.

Recall my followup to self: Should we attribute the refined run expectancy values of batting events to the batters? In other words, should we incorporate any findings in the PBP-linear-weights measure of batter value in the population studied? Or use them only for other purposes such as simulation?

The example of batting position rather than preceding event may provide more insight into the nature of PBP linear weights as a method.

Suppose that two-base hits for batter 8 with runner on first and two outs generate fewer runs than do those for batter 4 with runner on first and two outs. Furthermore, suppose that difference is greater in NL 2009 than in AL 2009.

How should we decide whether to value doubles by all batters according to an average over all batting positions? And how should we decide whether to use an average over one modern league-season, or over both leagues, or over multiple seasons?

(Why) should we value doubles by Albert Pujols partly by reference to all half-innings with doubles by number 8 batters? If so, should we draw the line around one league-season? How should we decide?

The urn model fails here, in my opinion. It underwrites your point that PBP linear weights are generated by clerical work rather than modeling, the sense in which they are not a statistical matter at all. That simply means the modeling is in the definition of the urn and I suppose that should be largely a statistical process.

Paul
At Tuesday, February 02, 2010 4:11:00 PM, upaulo said...: Phil,
You have explained to Anon,
>>
When you count linear weights by the PBP method, there is no standard deviation because there is no sample. You are using *all* the data to come up with an exact average, just as you can count 60 red balls and 30 white balls and get exactly 66.67% red, with no standard deviation.
<<

Where there is no sampling (according to your model), I think it's best to say directly that you have observed every member of the population (according to your model). First, some people don't know enough about sampling to know the equivalence here. Second, it's misleading to emphasize that you are using all the *data*.

Paul
At Tuesday, September 28, 2010 11:40:00 AM, Anonymous said...: Thank you, Phil for a great article.

It would serve the community well to further embrace your thinking.

Working with the Retrosheet data can be time consuming, but well worth the challenges.

As the 2009 data is available, it would be revealing to see how actual behavior may have shifted in some areas since your initial study period.

Would you please explain the SQL coding process you ran against the Retrosheet EVENT file to create the run or win expectancy table in your post, which also appears to be nearly identical to the output you provided on your web site:

http://www.philbirnbaum.com/probs2.txt
At Tuesday, September 28, 2010 12:47:00 PM, Phil Birnbaum said...: I didn't use SQL ... I have some programs I wrote in Visual Basic many years ago to read through the Retrosheet text files.

I'm sure others have put the Retrosheet data into a relational database, but not me.
