Regressing Park Factors (Part II)
Note: math/stats post.
-------------
Last post, I figured the breakdown of variance for (three-year average) park effects (BPFs) from the Lahman database. It came out like this:
All Parks [Chart 1]
-------------------------
4.8 = SD(3-year observed)
-------------------------
4.3 = SD(3-year true)
2.1 = SD(3-year luck)
-------------------------
Using the usual method, we would figure, theoretically, that you have to regress park factor by (2.1/4.8)^2, which is about 20 percent.
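As a quick check of that arithmetic, here is a minimal Python sketch. The function name regression_fraction is just an illustrative label I made up for this post; the inputs are the two numbers from Chart 1.

```python
def regression_fraction(sd_luck, sd_observed):
    """Fraction to regress toward the mean: var(luck) / var(observed)."""
    return (sd_luck / sd_observed) ** 2

# Chart 1 values: SD(luck) = 2.1, SD(observed) = 4.8
theoretical = regression_fraction(2.1, 4.8)
print(f"{theoretical:.1%}")  # 19.1%, i.e. "about 20 percent"
```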
But when we used empirical data to calculate the real-life amount of regression required, it turned out to be 38 percent.
Why the difference? Because the two figures regress to different targets. The 20 percent figure regresses the observed three-year BPF to its true three-year average. The 38 percent figure regresses the observed three-year BPF to a single-year BPF.
My first thought was: the 3-year true value is the average of three 1-year true values. If each of those were independent, we could just break the 3-year SD into three 1-year SDs by dividing by the square root of 3.
But that wouldn't work. That's because when we split a 3-year BPF into three 1-year BPFs, those three are from the same park. So, we'd expect them to be closer to each other than if they were three random BPFs from different parks. (That fact is why we choose a three-year average instead of a single year -- we expect the three years to be very similar, which will increase our accuracy.)
Three years of the same park are similar, but not exactly the same. Parks do change a bit from year to year; more importantly, *other* parks change. (In their first season in MLB, the Rockies had a BPF of 118. All else being equal, the other 13 teams would see their BPF drop by about 1.4 points to keep the average at 100.)
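The Rockies example is simple enough to verify directly. This sketch assumes a 14-team league (inferred from the "other 13 teams" above) and assumes the 18 extra points get spread evenly across everyone else, which is a simplification.

```python
n_teams = 14            # assumed: the Rockies plus the "other 13 teams"
new_park_bpf = 118
league_average = 100

# Points the new park sits above average, spread evenly over the other teams
excess = new_park_bpf - league_average
drop_per_other_team = excess / (n_teams - 1)
print(round(drop_per_other_team, 1))  # 1.4
```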
So, we need to figure out the SD(true) for different seasons of the same park.
--------
From the Lahman database, I took all ballparks (1960-2016) with the same name for at least 10 seasons. For each park, I calculated the SD of its BPF for those years. Then, I took the root mean square of those numbers. That came out to 3.1.
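In code, that step might look something like the sketch below. The data here is made up purely to show the mechanics (it is not from the Lahman database), the function name is hypothetical, and I'm assuming the population SD (statistics.pstdev); using the sample SD instead would give slightly larger numbers.

```python
import math
import statistics

def rms_of_park_sds(bpf_by_park, min_seasons=10):
    """Root mean square of per-park BPF standard deviations.

    bpf_by_park maps a park name to its list of yearly BPFs.
    Parks with fewer than min_seasons seasons are skipped.
    """
    sds = [
        statistics.pstdev(bpfs)   # assumption: population SD
        for bpfs in bpf_by_park.values()
        if len(bpfs) >= min_seasons
    ]
    return math.sqrt(sum(sd ** 2 for sd in sds) / len(sds))

# Invented example data, for illustration only
example = {
    "Park A": [100] * 5 + [102] * 5,   # SD = 1
    "Park B": [98] * 5 + [104] * 5,    # SD = 3
    "Park C": [95, 100, 105],          # too few seasons, skipped
}
result = rms_of_park_sds(example)
print(round(result, 2))  # 2.24
```

Taking the root mean square rather than the plain average of the SDs is the same as averaging the variances, which is what we want if we're going to keep adding and subtracting variances later.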
We already calculated that the SD of luck for the average of three seasons is 2.1. That means we can fill in SD(true)=2.3.
Same Park [Chart 2]
------------------------------------
3.1 = SD(3-year observed, same park)
------------------------------------
2.3 = SD(3-year true, same park)
2.1 = SD(3-year luck, any parks)
------------------------------------
(That's the only number we will actually need from that chart.)
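The 2.3 comes from subtracting variances, which works if we assume luck is independent of the true value. A minimal sketch, with a helper name I've invented for illustration:

```python
import math

def subtract_in_quadrature(sd_total, sd_part):
    """SD of the remaining component, assuming the components are independent."""
    return math.sqrt(sd_total ** 2 - sd_part ** 2)

# Chart 2: observed same-park SD = 3.1, luck SD = 2.1
sd_true_within = subtract_in_quadrature(3.1, 2.1)
print(round(sd_true_within, 1))  # 2.3
```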
Now, from Chart 1, we found SD(true) was 4.3 for all park-years. That 4.3 is the combination of (a) variance of different years from the same park, and (b) variance between the different parks. We now know (a) is 2.3, so we can calculate (b) is root (4.3 squared minus 2.3 squared), which equals 3.6.
So we'll break the "4.3" from Chart 1 into those two parts:
All Parks [Chart 3]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
2.3 = SD(3-year true within park)
2.1 = SD(3-year luck)
-----------------------------------
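Here is the same kind of calculation for the between-park piece, along with a sanity check that the three rows of Chart 3 recombine to roughly the observed 4.8. The variable names are mine, and the inputs are the rounded chart values, so expect small rounding differences.

```python
import math

sd_true_all = 4.3      # Chart 1: SD(3-year true), all park-years
sd_true_within = 2.3   # Chart 2: SD(3-year true, same park)
sd_luck = 2.1          # SD(3-year luck)

# (b) between-park SD = what is left of 4.3 after removing the within-park part
sd_true_between = math.sqrt(sd_true_all ** 2 - sd_true_within ** 2)
print(round(sd_true_between, 1))  # 3.6

# Sanity check: the three pieces should add back up (in variance) to the observed SD
sd_observed = math.sqrt(sd_true_between ** 2 + sd_true_within ** 2 + sd_luck ** 2)
print(round(sd_observed, 1))  # 4.8
```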
Now, let's assume that for a given park, the annual deviations from its overall average are independent from year to year. That's not absolutely true, since some changes are more permanent, like when the Rockies joined the league. But it's probably close enough.
With the assumption of independence, we can break the 3-year SD down into three 1-year SDs. That converts the single 2.3 into three SDs of 1.3 (obtained by dividing 2.3 by the square root of 3):
All Parks [Chart 4]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)
1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------
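The split works on variances: the 2.3 is treated as three equal, independent pieces whose variances sum back to 2.3 squared. A short sketch under that same independence assumption (variable names are illustrative):

```python
import math

sd_within_3yr = 2.3   # Chart 3: SD(3-year true within park)
n_years = 3

# Each year's share of the within-park SD
sd_per_year = sd_within_3yr / math.sqrt(n_years)
print(round(sd_per_year, 1))  # 1.3

# Check: three independent pieces of that size recombine to the original 2.3
recombined = math.sqrt(n_years * sd_per_year ** 2)
```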
What we're interested in is the SD of this year's value. That's the combination of the first two numbers in the breakdown: the SD of the difference between parks, and the SD of this year's true value for the current park.
The bottom three numbers are different kinds of "luck" for what we're trying to measure: the actual luck in run scoring, and the "luck" in how the park changed in the other two years we're using in the smoothing for the current year.
All Parks [Chart 4a]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)
-----------------------------------
1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------
Combining the top two and the bottom three, we get
All Parks [Chart 5]
----------------------------------------------
4.8 = SD(3-year observed)
----------------------------------------------
3.8 = SD(true values impacting observed BPF)
2.8 = SD(random values impacting observed BPF)
----------------------------------------------
So we regress by (2.8/4.8) squared, which works out to 34 percent. That's pretty close to the actual figure of 38 percent.
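To make the grouping explicit, this sketch combines the rounded component SDs in quadrature: the between-park and this-year rows go into "true," and the other two years plus luck go into "random." The combine helper is just something I wrote for this check.

```python
import math

def combine(*sds):
    """Combine independent SDs by adding their variances."""
    return math.sqrt(sum(sd ** 2 for sd in sds))

sd_true = combine(3.6, 1.3)          # between parks + this year's true value
sd_random = combine(1.3, 1.3, 2.1)   # other two years + run-scoring luck
regression = (sd_random / 4.8) ** 2

print(round(sd_true, 1), round(sd_random, 1), f"{regression:.0%}")  # 3.8 2.8 34%
```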
We can do another attempt, with a slightly different assumption. Back in Chart 2, the 3.1 we used for SD(3-year observed, same park) was based on parks with at least ten years of data. If I reduce the requirement to three years of data, SD(3-year observed, same park) goes up to 3.2, and the final result is ... 36 percent regression to the mean.
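Since the only input that changes between the two attempts is the same-park observed SD, the whole chain can be wrapped in one function. This is a rough sketch with a function name and defaults of my own choosing; it carries unrounded intermediate values, so it may differ from the hand calculation in the last decimal.

```python
def regression_to_single_year(sd_obs_same_park, sd_obs_all=4.8, sd_luck=2.1, n_years=3):
    """Estimated regression from an n-year observed BPF to a single-year true BPF.

    Assumes luck, between-park differences, and each year's within-park
    deviation are all independent of one another.
    """
    # Within-park true variance of the n-year figure (Chart 2 step)
    var_within = sd_obs_same_park ** 2 - sd_luck ** 2
    # Equal share for each of the n years (Chart 4 step)
    var_per_year = var_within / n_years
    # "Random" = the other years' deviations plus run-scoring luck (Chart 5 step)
    var_random = (n_years - 1) * var_per_year + sd_luck ** 2
    return var_random / sd_obs_all ** 2

for sd in (3.1, 3.2):
    print(sd, f"{regression_to_single_year(sd):.0%}")
# 3.1 34%
# 3.2 36%
```

Notice that the between-park SD never shows up in the function. It affects how the "true" side is split, but the amount of regression depends only on the random variance relative to the total observed variance.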
So there it is. I think this method is valid, but I'm not completely sure. The 95% confidence interval for the true value seems to be wide -- regression to the mean between 28 percent and 49 percent -- so it might just be coincidence that this calculation matches.
If you see a problem, let me know.
Part III is here.