Don't use regression to calculate Linear Weights, Part II
Last post, I wrote about how using regression to estimate Linear Weights values is a poor choice, and that the play-by-play (PBP) method is better. After thinking a bit more about it, I realized that I could have made a stronger case than I did. Specifically: the regression method gives you an *estimate* of the value of events, but the play-by-play method gives you the *actual answer*. That is, a "perfect" regression, with an unlimited amount of data, can never be more accurate than the results of the PBP method.
An analogy: suppose that, a while ago, someone randomly dumped some red balls and some white balls into an urn. If you were to draw one ball out of the urn, what would be the probability it's red?
Here are two different ways of trying to figure that out. First, you can observe people coming and drawing balls, and you can see what proportion of balls drawn turned out to be red. Maybe one guy draws ten balls (with replacement) and six are red. Maybe someone else comes up and draws one ball, and it's white. A third person comes along and draws five white out of 11. And so on. Overall, maybe there are 68 balls drawn in total, and 40 of them are red.
So what do you do? You figure that the most likely estimate is 40/68, or 58.8%. You then use the normal approximation to the binomial distribution to figure out a confidence interval for your estimate.
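(Just to make that first method concrete, here's a minimal sketch of the calculation in Python, using the made-up 40-out-of-68 numbers from the example; the normal approximation to the binomial is the standard textbook shortcut.)

```python
import math

# Made-up observations from the example: 40 red balls in 68 draws.
red, total = 40, 68

p_hat = red / total                            # point estimate: about 0.588
se = math.sqrt(p_hat * (1 - p_hat) / total)    # standard error of a sample proportion

# 95% confidence interval, via the normal approximation to the binomial.
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"estimate {p_hat:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```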
That's the first way. What's the second way?
The second way is: you just empty the urn and count the balls! Maybe it turns out that the urn contains exactly 60 red balls and 40 white balls. So we now *know* the probability of drawing a red ball is 0.6.
If the second method is available, the first is completely unnecessary. It gives you no additional information about the question once you know what's in the urn. The second method has given you an exact answer.
That, I think, is the situation with Linear Weights. The regression is like observing people draw balls; you can then make inferences about the actual values of the events. But the PBP method is like looking in the urn -- you get the answer to the question you're asking. It's a deterministic calculation of what the regression values will converge to, if you eventually get the regression perfect.
-------
To make my case, let me start by (again) telling you what question the PBP method answers. It's this:
-- If you were to take a team made up of league-average players, and add one double to its batting line, how many more runs would it score?
That question is pretty much identical to the reverse:
-- If you were to take a team made up of league-average players, and remove one double from its batting line, how many fewer runs would it score?
I'm going to show how to answer the second question, because the explanation is a bit easier. I'll do it for the 1992 American League.
Start with a listing of the play-by-play for every game (available from Retrosheet, of course). Now, let's randomly choose which double we're going to eliminate. There were 3,596 doubles hit that year; pick one at random, and find the inning in which it was hit.
Now, write out the inning. Write out the inning again, without the double. Then, see what the difference is in runs scored. (The process is almost exactly the same as how you figure earned runs: you replay the inning pretending the error never happened, and see if it saves you some runs.)
Suppose we randomly came up with Gene Larkin's double in the bottom of the fourth inning against the Angels on June 24, 1992. The inning went like this:
Actual: Walk / Double Play / DOUBLE / Fly Out -- 0 runs scored.
Without the double, our hypothetical inning would have been
Hypoth: Walk / Double Play / ------ / Fly Out -- 0 runs scored.
Pretty simple: in this case, taking away the double makes no difference, and costs the team 0 runs.
On the other hand, suppose we chose Brady Anderson's leadoff first-inning double on April 10. With and without the double:
Actual: DOUBLE / Double / Single / Double Play / HR / Ground Out -- 3 runs scored.
Hypoth: ------ / Double / Single / Double Play / HR / Ground Out -- 2 runs scored.
So, in this case, removing the double cost 1 run.
If we were to do this for each of the 3,596 doubles, we could just average out all the values, and we'd know how much a double was worth. The only problem is that sometimes it's hard to recreate the inning. For instance, Don Mattingly's double in the sixth inning on September 8:
Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.
Removing the double gives
Hypoth: Out / Single / ------ / Single / Fly Ball / Fly Ball
How many runs score in this reconstructed inning? We don't know. If the second single advanced the runner to third, and the subsequent fly ball was deep enough, one run would have scored. Otherwise, it would be 0 runs. So we don't know which it would have been. What do we do?
The situation arose because we picked an inning at random and divided it into halves (the half before the double, and the half after it). Removing the double creates an inconsistency between the two halves: in real life, the second half of the inning started with one out and runners on second and third, but the hypothetical second half starts with one out and just a runner on first. That mismatch is the problem.
So, since we're picking randomly anyway, why don't we throw away the *real* second half of the inning, and instead pick the second half of some *other* inning, some inning where there actually IS one out and a runner on first? That will always give us a consistent inning. And while it will give us a different result for this run of our random test, over many random tests, it'll all even out.
We might randomly choose Cleveland's fourth inning against the Royals on July 16. In that frame, Mark Whiten struck out and Glenallen Hill singled, which gives us our required runner-on-first-with-one-out. After that, Jim Thome singled, and Sandy Alomar Jr. grounded into a double play.
Grafting the end of that inning (single, double play) onto the beginning of the original inning gives us our "consistent" hypothetical inning:
Hypoth: (stuff to put a runner on first and one out) / ------ / Single / GIDP -- 0 runs scored.
Since two runs scored in the original, unadulterated inning, and zero runs in the hypothetical inning, this run of the simulation winds up with a loss of two runs.
Now, there's nothing special about that Cleveland fourth inning: we just happened to choose it randomly. There were 6,380 cases of a runner on first with one out, and we could have chosen any of them instead.
The inning could have gone:
Out / Single / ------ / result of inng 1 of 6,380
Out / Single / ------ / result of inng 2 of 6,380
Out / Single / ------ / result of inng 3 of 6,380
Out / Single / ------ / result of inng 4 of 6,380
...
Out / Single / ------ / result of inng 6,380 of 6,380
If we run the simulation long enough, we'll end up choosing each of the 6,380 innings equally often. And so we'll wind up with just the average of those 6,380 innings. So we can get rid of the randomness in the second half of the inning just by substituting that average. Then our "remove the double" hypothetical becomes:
Out / Single / ------ / average of all 6,380 innings
And, with the help of Retrosheet, we find that after having a runner on first and one out, those 6,380 innings resulted in an average of 0.510 runs being scored. So now we have:
Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.
Hypoth: Out / Single / ------ / plus an additional 0.510 runs.
I'll rewrite that to make it a bit easier to see what's going on:
Actual: (stuff that put a runner on first with one out) / DOUBLE / other stuff that caused 2 runs to be scored
Hypoth: (stuff that put a runner on first with one out) / other stuff that caused 0.510 runs to be scored, on average
So, for this inning, we can say that removing the double cost 1.490 runs.
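(A quick aside on the "it'll all even out" claim from a couple of paragraphs back: a toy sketch makes it easy to convince yourself that randomly sampling continuations from a pool of innings converges to the plain average of that pool. The numbers below are simulated stand-ins, not the real Retrosheet continuations.)

```python
import random

random.seed(1992)

# Stand-in for the 6,380 real "runner on first, one out" continuations: the
# runs scored in the remainder of each such inning.  Simulated here purely for
# illustration; the real figures come from the Retrosheet play-by-play.
continuations = [random.choice([0, 0, 0, 0, 0, 0, 1, 1, 2, 3]) for _ in range(6380)]

exact_average = sum(continuations) / len(continuations)

# The "random test" version: draw continuations at random, over and over.
draws = [random.choice(continuations) for _ in range(1_000_000)]
sampled_average = sum(draws) / len(draws)

print(f"average of all 6,380 continuations: {exact_average:.3f}")
print(f"average of a million random draws:  {sampled_average:.3f}")
```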
Now, the "actual" inning was again random. We happened to choose the Yankees' 6th inning on September 8. But we might have chosen another, similar, inning where there was a runner on first with one out, and a double was hit, and the runner held at third. This particular September 8 inning led to two runs. Another such inning may have led to six runs, or three runs, or no runs (maybe there were two strikeouts after the double and the runners were stranded).
So what we can do is aggregate all the innings that reached this situation. If we look to Retrosheet, we find that there were 796 times where there were runners on second and third with one out. In the remainder of those 796 innings, 1,129 runs were scored. That's an average of 1.418 per inning.
So we can write:
Actual: stuff that put a runner on 1st with one out / DOUBLE leading to runners on 2nd and 3rd with one out / Other stuff leading to 1.418 runs scoring, on average.
Hypoth: stuff that put a runner on 1st with one out / ------ / Other stuff leading to 0.510 runs scoring, on average.
And so, we know that a double with a runner on first and one out, which ends with runners on 2nd and 3rd, is worth, on average, 0.908 runs.
Let's write that down as an equation:
+0.908 runs = Runner on 1st and one out + Double, runner holds
We can repeat this analysis. What if the runner scores? Then, it turns out, the average inning led to 1.646 runs scoring instead of 0.510. So:
+1.136 runs = Runner on 1st and one out + Double, run scores
We can repeat this for every combination of bases and double results we want. For instance:
-0.217 runs = Runner on 1st and one out + Double, runner thrown out at home
+1.000 runs = Runner on 2nd and nobody out + Double
+1.166 runs = Runner on 1st and two out + Double, runner safe at home and batter goes to third on the throw
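If you prefer to see the arithmetic spelled out, here's a minimal sketch of the first two of those equations, using the 1992 AL averages quoted above. (The only assumption is how the numbers combine: the value of the double is the average runs from the point of the double onward, including any runs that scored on the play, minus the average runs from the no-double state.)

```python
# 1992 AL averages quoted above (runs scored from that point to the end of
# the inning, per Retrosheet):
BEFORE = 0.510   # runner on 1st, one out -- the state the double started from

# Runner holds at third: the inning continues from 2nd-and-3rd with one out,
# which averaged 1.418 runs, and nothing scored on the play itself.
print(f"{1.418 + 0 - BEFORE:+.3f}")   # +0.908

# Runner scores: 1.646 runs on average, counting the run that scored on the
# play plus the rest of the inning from the resulting state.
print(f"{1.646 - BEFORE:+.3f}")       # +1.136
```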
I'm not sure how many of these cases there are, but we can look to Retrosheet and list them all. At the end, we have a huge list of all possible combinations of doubles, and what they were worth in runs. We just have to average them, weighted by how often they happened, and we're done. We then have the answer.
As it turns out, the answer for the 1992 American League works out to 0.763 runs.
The answer is NOT an estimate based on a model with random errors that we have to eliminate. It's the exact answer to the question, the same way counting the balls in the urn gave us an exact answer.
Just to be absolutely clear, here's what we've shown:
Suppose we randomly remove one double from the 1992 American League. Then, we reconstruct the inning from the point of that double forward, by looking at the base/out situation before the double, finding a random inning that reached that same base/out situation, and substituting the remainder of that inning for what really happened.
If we do that, we should expect 0.763 fewer runs to be scored. If we were to run this same random test a trillion times, the runs lost would average out to 0.763 almost exactly.
If you try to answer this question by running a regression, to the extent that your estimate is different from 0.763, you got the wrong answer.
------
Anyway, the explanation above was a complicated way of describing the process. Here's a simpler description of the algorithm; a rough code sketch of the whole thing follows the four steps.
1. Using Retrosheet data, find every situation where there was a runner on second and no outs. It turns out there were 1,572 such situations in the 1992 AL. Count the total number of runs that were scored in the remainder of those innings. It turns out there were, on average, 1.095 runs scored each time that happened (1,722 runs scored in those 1,572 innings).
2. Repeat this process for the other 23 base-out states (two-outs-bases-loaded, one-out-runners-on-first-and-third, and so on). If you do that, and put the results in the traditional matrix, you get:
0 out 1 out 2 out
------------------------------
0.482 0.248 0.096 nobody on
0.853 0.510 0.211 first
1.095 0.646 0.293 second
1.494 0.907 0.423 first/second
1.356 0.940 0.377 third
1.804 1.151 0.470 first/third
2.169 1.418 0.598 second/third
2.429 1.549 0.745 loaded
3. Find every double hit in the 1992 AL. For each of those 3,596 doubles, figure (a) the run value from the above table *before* the double was hit; (b) the run value for the situation *after* the double; and (c) the number of runs that scored on the play.
The value of that double is (b) - (a) + (c). For instance, a 3-run double with the bases loaded and 2 outs, with the batter ending up at second, is worth 0.293 minus 0.745 plus 3. That works out to 2.548 runs.
4. Average out each of the 3,596 run values. You'll get 0.763.
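For what it's worth, here's a rough Python sketch of the whole four-step process. It assumes the play-by-play has already been boiled down to two pre-parsed lists with made-up field names (`situations` for steps 1 and 2, and a list of events for steps 3 and 4); actually extracting those fields from the raw Retrosheet event files is the real grunt work, and it's not shown here.

```python
from collections import defaultdict

def run_expectancy(situations):
    """Steps 1 and 2: for every (bases, outs) state, average the runs scored
    from that point to the end of the inning."""
    totals = defaultdict(lambda: [0.0, 0])        # (bases, outs) -> [runs, count]
    for s in situations:
        key = (s["bases"], s["outs"])
        totals[key][0] += s["runs_rest_of_inning"]
        totals[key][1] += 1
    return {key: runs / n for key, (runs, n) in totals.items()}

def average_event_value(events, re_matrix):
    """Steps 3 and 4: value each event as RE(after) - RE(before) + runs scored
    on the play, then average over all the events."""
    values = []
    for e in events:
        before = re_matrix[(e["bases_before"], e["outs_before"])]
        # If the play ends the inning, there's no "after" state left to value.
        after = 0.0 if e["outs_after"] == 3 else re_matrix[(e["bases_after"], e["outs_after"])]
        values.append(after - before + e["runs_on_play"])
    return sum(values) / len(values)

# Toy check of steps 1-2: four made-up "runner on second, nobody out" situations.
toy = [{"bases": "second", "outs": 0, "runs_rest_of_inning": r} for r in (1, 0, 2, 0)]
print(run_expectancy(toy)[("second", 0)])         # 0.75 here; the real 1992 AL figure was 1.095

# Check of steps 3-4 against the worked example in step 3: a three-run double
# with the bases loaded and two out, batter ending up at second.
re_1992 = {("loaded", 2): 0.745, ("second", 2): 0.293}
example = [{"bases_before": "loaded", "outs_before": 2,
            "bases_after": "second", "outs_after": 2, "runs_on_play": 3}]
print(average_event_value(example, re_1992))      # 0.293 - 0.745 + 3 = 2.548
```

Run over every double in the 1992 AL instead of this one toy example, that same average_event_value calculation is what produces the 0.763 figure.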
It's that simple. You can repeat the above for whatever event you like: triples, stolen bases, strikeouts, whatever. Here are the values I got:
-0.272 strikeout
-0.264 other batting out
+0.178 steal
+0.139 defensive indifference
-0.421 caught stealing
+0.276 wild pitch
+0.286 passed ball
+0.277 balk
+0.307 walk (except intentional)
+0.174 intentional walk
+0.331 HBP
+0.378 interference
+0.491 reached on error
+0.460 single
+0.763 double
+1.018 triple
+1.417 home run
-----
Anyway, my point in the original post wasn't meant to be "regression is bad." What I really meant was, why randomly pull balls from urns when Retrosheet gives you enough data to actually count the balls? This method gives you an exact number.
One objection might be that doing it this way requires way too much data to work with, and so regression is the more practical alternative. But is it really better to use a wrong value just because it's easier to calculate?
Besides, you don't have to calculate them yourself -- they've been calculated, repeatedly, by others who can be referenced. In the worst case, they're close to the "traditional" weights, as calculated by Pete Palmer some 25 years ago. If you need a solid reference, just use Pete's numbers. They're closer to the true values than you'll get by running a regression, even a particularly comprehensive one.