Monday, March 23, 2020

Regressing Park Factors (Part II)

Note: math/stats post.

-------------

Last post, I figured the breakdown of variance for (three-year average) park effects (BPFs) from the Lahman database. It came out like this:


All Parks   [chart 1]
-------------------------
4.8 = SD(3-year observed)
-------------------------
4.3 = SD(3-year true) 
2.1 = SD(3-year luck)
-------------------------

Using the usual method, we would figure, theoretically, that you have to regress park factor by (2.1/4.8)^2, which is about 20 percent. 

But when we used empirical data to calculate the real-life amount of regression required, it turned out to be 38 percent.

Why the difference? Because the 20 percent figure is to regress the observed three-year BPF to its true three-year average. But the 38 percent is to regress the observed three-year BPF to a single-year BPF.

My first thought was: the 3-year true value is the average of three 1-year true values. If each of those were independent, we could just break the 3-year SD into three 1-year SDs by dividing by the square root of 3. 

But that wouldn't work. That's because when we split a 3-year BPF into three 1-year BPFs, those three are from the same park. So, we'd expect them to be closer to each other than if they were three random BPFs from different parks. (That fact is why we choose a three-year average instead of a single year -- we expect the three years to be very similar, which will increase our accuracy.)

Three years of the same park are similar, but not exactly the same. Parks do change a bit from year to year; more importantly, *other* parks change. (In their first season in MLB, the Rockies had a BPF of 118. All else being equal, the other 13 teams would see their BPF drop by about 1.4 points to keep the average at 100.)
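Here's a quick sketch of that Rockies arithmetic. It assumes the league average is pinned at exactly 100 and that the adjustment is spread evenly over the other parks; the function name is just for illustration.

```python
def rebalanced_shift(new_bpf: float, other_parks: int, mean: float = 100.0) -> float:
    """Average drop in the other parks' BPF needed to keep the league mean fixed."""
    return (new_bpf - mean) / other_parks

# A new park at 118, with 13 other parks absorbing the difference
shift = rebalanced_shift(118, 13)
print(f"{shift:.1f}")  # 1.4
```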

So, we need to figure out the SD(true) for different seasons of the same park. 

--------

From the Lahman database, I took all ballparks (1960-2016) with the same name for at least 10 seasons. For each park, I calculated the SD of its BPF for those years. Then, I took the root mean square of those numbers. That came out to 3.1.
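As a rough sketch, the "SD per park, then root mean square" step might look like this. The park histories are invented toy numbers, not Lahman data, and using the population SD (`pstdev`) rather than the sample SD is my assumption here.

```python
import math
import statistics

# Invented BPF histories, for illustration only
park_bpfs = {
    "Park A": [98, 102] * 5,    # 10 seasons
    "Park B": [97, 103] * 5,    # 10 seasons
    "Park C": [100, 104, 99],   # too short, gets skipped
}

def rms_within_park_sd(histories, min_seasons=10):
    """Root mean square of each qualifying park's SD of BPF."""
    sds = [statistics.pstdev(seasons)
           for seasons in histories.values()
           if len(seasons) >= min_seasons]
    return math.sqrt(sum(sd * sd for sd in sds) / len(sds))

result = rms_within_park_sd(park_bpfs)
print(f"{result:.2f}")  # 2.55
```

Taking the root mean square (rather than the plain average) of the SDs is the same as averaging the variances and then taking the square root.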

We already calculated that the SD of luck for the average of three seasons is 2.1. That means we can fill in SD(true)=2.3.



Same Park     [Chart 2]
------------------------------------
3.1 = SD(3-year observed, same park)
------------------------------------
2.3 = SD(3-year true, same park)
2.1 = SD(3-year luck, any parks)
------------------------------------

(The 2.3 is the only number we will actually need from that chart.)

Now, from Chart 1, we found SD(true) was 4.3 for all park-years. That 4.3 is the combination of (a) variance of different years from the same park, and (b) variance between the different parks. We now know (a) is 2.3, so we can calculate (b) is root (4.3 squared minus 2.3 squared), which equals 3.6.

So we'll break the "4.3" from Chart 1 into those two parts:


All Parks     [Chart 3]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
2.3 = SD(3-year true within park)
2.1 = SD(3-year luck)
-----------------------------------
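The two subtractions that produce the 2.3 and the 3.6 can be checked with a few lines. This is only a sketch of the arithmetic, assuming the components are independent so that their variances add; `quad_sub` is a made-up helper name.

```python
import math

def quad_sub(total: float, part: float) -> float:
    """SD that remains after removing `part` from `total` in quadrature."""
    return math.sqrt(total ** 2 - part ** 2)

sd_within = quad_sub(3.1, 2.1)         # same-park observed, minus luck
sd_between = quad_sub(4.3, sd_within)  # all-park true, minus within-park true
print(f"{sd_within:.1f} {sd_between:.1f}")  # 2.3 3.6
```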

Now, let's assume that for a given park, the annual deviations from its overall average are independent from year to year. That's not absolutely true, since some changes are more permanent, like when Coors Field joins the league. But it's probably close enough.

With the assumption of independence, we can break the 3-year SD down into three 1-year SDs.  That converts the single 2.3 into three SDs of 1.3 (obtained by dividing 2.3 by the square root of 3):


All Parks     [Chart 4]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)
1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------
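A small check of the split, under the same independence assumption: three equal pieces of 2.3 divided by root 3 should recombine, in quadrature, to the original 2.3.

```python
import math

sd_within_3yr = 2.3
sd_one_year = sd_within_3yr / math.sqrt(3)

# Three independent pieces add in variance, not in SD
recombined = math.sqrt(3 * sd_one_year ** 2)
print(f"{sd_one_year:.1f}")  # 1.3
```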

What we're interested in is the SD of this year's value. That's the combination of the first two numbers in the breakdown: the SD of the difference between parks, and the SD of this year's true value for the current park.

The bottom three numbers are different kinds of "luck" for what we're trying to measure: the actual luck in run scoring, and the "luck" in how the park changed in the other two years we're using to smooth the current year's estimate.


All Parks     [Chart 4a]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)

1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------

Combining the top two and the bottom three, we get

All Parks      [Chart 5]
----------------------------------------------
4.8 = SD(3-year observed)
----------------------------------------------
3.8 = SD(true values impacting observed BPF)
2.8 = SD(random values impacting observed BPF)
----------------------------------------------

So we regress by (2.8/4.8) squared, which works out to 34 percent. That's pretty close to the actual figure of 38 percent.
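For anyone who wants to verify the Chart 5 numbers, here's a sketch that combines the rounded figures from Chart 4a in quadrature. `quad_sum` is a made-up helper, and the inputs are the rounded chart values, so the two combined SDs won't recombine to exactly 4.8.

```python
import math

def quad_sum(*sds: float) -> float:
    """Combine independent SDs by adding their variances."""
    return math.sqrt(sum(sd * sd for sd in sds))

sd_true = quad_sum(3.6, 1.3)         # between parks + this year's true value
sd_random = quad_sum(1.3, 1.3, 2.1)  # the other two years + run-scoring luck
regression = (sd_random / 4.8) ** 2
print(f"{sd_true:.1f} {sd_random:.1f} {regression:.0%}")  # 3.8 2.8 34%
```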

We can make another attempt, with a slightly different assumption. Back in Chart 2, when we figured SD(3-year observed, same park) was 3.1 ... that estimate was based on parks with at least ten years of data. If I reduce the requirement to three years of data, SD(3-year observed, same park) goes up to 3.2, and the final result is ... 36 percent regression to the mean.
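To see how sensitive the answer is to that one input, the whole chain can be folded into a single function. This is my own condensation of the steps above, not a separate method: it works with unrounded intermediate values, and the function and parameter names are made up for the sketch.

```python
def one_year_regression(sd_same_park_obs: float,
                        sd_obs: float = 4.8,
                        sd_luck: float = 2.1) -> float:
    """Fraction to regress a 3-year observed BPF toward the mean
    when the target is a single season's true park factor."""
    var_within = sd_same_park_obs ** 2 - sd_luck ** 2  # 3-year true, same park
    var_other_years = var_within * 2 / 3               # the two years that aren't "this year"
    var_random = var_other_years + sd_luck ** 2        # everything acting as noise
    return var_random / sd_obs ** 2

for sd in (3.1, 3.2):
    print(f"{sd}: {one_year_regression(sd):.0%}")
# 3.1: 34%
# 3.2: 36%
```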

So there it is.  I think this method is valid, but I'm not completely sure. The 95% confidence interval for the true value seems to be wide -- regression to the mean between 28 percent and 49 percent -- so it might just be coincidence that this calculation matches. 

If you see a problem, let me know.

Part III is here.


Thursday, March 12, 2020

Regressing Park Factors (Part I)

I think park factors* are substantial overestimates of the effect of the park. 

At their core, park effects are basically calculated based on runs scored at home divided by runs scored on the road. But that figure is heavily subject to the effects of luck. One random 10-8 game at home can skew the park effect by more than half a point.

Because of this, most sabermetric sources calculate park effects based on more than one year of data. A three-year sample is common ... I think Fangraphs, Baseball Reference, and the Lahman database all use three years. 

That helps, but not enough. It looks like park factors are still too extreme, and need to be substantially regressed to the mean.

-------

* Here's a quick explanation of how park factors work, for those not familiar with them.

Park Factor is a number that tells us how relatively easy or difficult it is for a team to score runs because of the characteristics of its home park. The value "100" represents the average park. A number bigger than 100 means it's a hitters' park, where more runs tend to score, and smaller than 100 means it's a pitchers' park, where fewer runs tend to score.


Perhaps confusingly, the park factor averages the home park with an amalgam of road parks, in equal proportion. So if Chase Field has a park factor of 105, which is 5 percent more than average, that really means it's about 10 percent more runs at home, and about average on the road.

The point of park factor is that you can use it to adjust a hitter's stats to account for his home park. So if Eduardo Escobar creates 106 runs for the Diamondbacks, you divide by 1.05 and figure he'd have been good for about 101 runs if he had played in a neutral home park.
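A minimal sketch of both ideas, the adjustment and the home/road blend. The second function assumes the published factor is an exact 50/50 average of the home park and a perfectly neutral (100) set of road parks, which is a simplification; both function names are made up.

```python
def park_adjust(runs: float, park_factor: float) -> float:
    """Convert runs created to a neutral-park equivalent."""
    return runs / (park_factor / 100.0)

def implied_home_factor(park_factor: float) -> float:
    """Home-only factor, if the published factor blends home and a neutral road equally."""
    return 2 * park_factor - 100

print(round(park_adjust(106, 105)))  # 101
print(implied_home_factor(105))      # 110
```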

--------

For my large sample of batters (1960-2016, minimum 300 PA), I calculated their runs created per 500 PA (RC500), and their park-adjusted RC500. Then, I binned the players by batting park factor (BPF), and took the average for each bin. If park adjustment worked perfectly, you'd expect every bin to have the same level of performance. After all, there's no reason to think batters who play for the Red Sox or Rockies should be any better or worse overall than batters who are current Mets or former Astros.

(Because of small sample sizes, I grouped all parks 119+ into a single bin. The average BPF for those parks was 123.4.)
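In code, the binning might be sketched like this. The player-seasons are invented, and pooling PA and runs created within each bin (a PA-weighted average) is my assumption about how to average; the function and variable names are hypothetical.

```python
from collections import defaultdict

# Invented player-seasons for illustration: (BPF, PA, runs created)
player_seasons = [
    (98, 600, 80.0),
    (98, 400, 50.0),
    (105, 500, 70.0),
    (105, 120, 30.0),   # dropped: under the 300 PA minimum
    (121, 550, 90.0),
    (125, 450, 75.0),
]

def rc500_by_bin(rows, min_pa=300, top_bin=119):
    """PA-weighted RC500 per BPF bin; BPFs at or above top_bin share one bin."""
    totals = defaultdict(lambda: [0.0, 0.0])  # bin -> [total PA, total RC]
    for bpf, pa, rc in rows:
        if pa < min_pa:
            continue
        key = min(bpf, top_bin)
        totals[key][0] += pa
        totals[key][1] += rc
    return {key: 500.0 * rc / pa for key, (pa, rc) in sorted(totals.items())}

bins = rc500_by_bin(player_seasons)
```

With these toy numbers, the 121 and 125 parks land together in the 119 bin, the same way the real chart pools its most extreme hitters' parks.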

Here's the chart:


  BPF      PA   RC500   Adj'd Regressed
---------------------------------------
   88    8554   67.07   76.22     72.67
   89    2960   56.57   63.56     60.85
   90   43213   62.53   69.48     66.78
   91   61195   62.42   68.59     66.19
   92  121203   61.04   66.35     64.29
   93  195382   62.21   66.89     65.07
   94  241681   61.93   65.89     64.35
   95  304270   64.72   68.12     66.80
   96  325511   64.13   66.81     65.77
   97  463537   63.05   65.00     64.24
   98  520621   65.62   66.96     66.44
   99  712668   64.29   64.94     64.69
  100  674090   64.86   64.86     64.86
  101  589401   66.53   65.87     66.12
  102  514724   66.19   64.89     65.39
  103  440940   65.48   63.58     64.32
  104  415243   66.07   63.53     64.51
  105  319334   67.35   64.15     65.39
  106  177680   66.15   62.41     63.86
  107  138748   65.85   61.54     63.21
  108  105850   67.58   62.57     64.51
  109   25751   68.50   62.84     65.04
  110   48127   65.46   59.51     61.82
  111   34278   69.61   62.71     65.39
  112   36977   65.12   58.14     60.85
  113   23001   67.94   60.13     63.16
  115   13778   74.55   64.83     68.60
  116    7994   72.21   62.25     66.12
  117   20901   73.62   62.92     67.07
123.4   39629   79.28   64.28     70.10
---------------------------------------
row/row diff    0.60%   0.39%     0.00%


Start with the third column, which is the raw RC500. As you'd expect, the higher the park factor, the higher the unadjusted runs. That's the effect we want BPF to remove.

So, I adjusted by BPF, and that's column 4. We now expect everything to even out, and the column to look uniform. But it doesn't -- now, it goes the other way. Batters in pitchers' parks now look like they're better hitters than the batters in hitters' parks.

That shows that we overadjusted. By how much? 

Take a look at the bottom row of the chart. Unadjusted, each row is about 0.6 percent higher than the row above it. We'd expect about 1 percent, if BPF worked perfectly. Adjusted, each row is about 0.4 percent lower. 

So about 40 percent of the park effect that BPF reports isn't real. Which means, if we regress park factors to the mean by 40 percent, we should remove the bias.

That's what the last column is. And the numbers look pretty level.

Actually, I didn't use 40 percent ... I used 38.8 percent. That's what gave the best flat fit. (Part of the difference is due to rounding, and the rest due to the fact that I ignored nonlinearity when I calculated the percentages.)
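Here's a sketch of the two steps: backing out the regression amount from the row-to-row slopes, and then applying it to a park factor. It uses the same linear shortcut as the rough 40 percent estimate, and it is not meant to reproduce the last column of the chart. The function names are made up.

```python
def implied_regression(observed_slope: float, expected_slope: float = 1.0) -> float:
    """Share of the reported park effect that doesn't show up in actual scoring."""
    return 1.0 - observed_slope / expected_slope

def regress_park_factor(bpf: float, fraction: float, mean: float = 100.0) -> float:
    """Pull a park factor toward the mean by the given fraction."""
    return mean + (bpf - mean) * (1.0 - fraction)

print(f"{implied_regression(0.60):.0%}")         # 40%
print(f"{regress_park_factor(110, 0.388):.1f}")  # 106.1
```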

Just to be more rigorous and get a more accurate estimate, I ran a regression. Instead of binning the players, I used each player separately, and did a "weighted regression" that weights each player by his number of PA. Because of the weights, I was able to drop the minimum from 300 PA to 10 PA. Also, I included a dummy variable for year, just in case there were a lot of pitchers' parks in 1987, or something.

The result came out almost exactly the same -- regress by 38.3 percent.

-------

Could we have calculated this mathematically just from raw park factors? Yes, I think so -- but not quite in the usual way. 

I'll show the usual way here, and save the rest for the next post. If you don't care about the math, you can just stop here.

-------

If we used the usual technique for figuring how much to regress, we'd use

SD^2(observed) = SD^2(true) + SD^2(luck)

We can figure luck. The SD of team runs in a game is about 3. For two teams combined, multiply by root 2. Calculating the difference from a road game, multiply by root 2 again. Then, for 81 games, divide by the square root of 81, which is 9. Finally, because we're using 3 years, divide by the square root of 3. 

You get 0.385 runs. 

That figure, 0.385 runs, is 4.28 percent of the usual 9 runs per game. To convert that to a park factor, take half. That's 2.14 points. I'll round to 2.1.
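The chain of multiplications and divisions is easy to lose track of, so here it is step by step. This is just the arithmetic from the last few paragraphs, with the same inputs (an SD of 3 runs per team-game, 81 games, 3 seasons, 9 runs per game); the variable names are my own.

```python
import math

sd_team_runs = 3.0                 # SD of one team's runs in a game
sd = sd_team_runs * math.sqrt(2)   # both teams combined
sd *= math.sqrt(2)                 # difference between a home and a road game
sd /= math.sqrt(81)                # average over 81 games
sd /= math.sqrt(3)                 # average over three seasons

pct_of_scoring = sd / 9.0 * 100    # as a percentage of 9 runs per game
pf_points = pct_of_scoring / 2     # park factor only counts half the games

print(f"{sd:.3f} {pf_points:.1f}")  # 0.385 2.1
```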

The observed SD, from the Lahman database, is 4.8 points. 

We can now calculate SD(true), since 4.8^2 = SD^2(true) + 2.1^2. It works out to 4.3 points.

SD(observed)= 4.8
-----------------
    SD(true)= 4.3
    SD(luck)= 2.1

So, to regress observed to true, we regress the park factor by (2.1/4.8)^2, which is about 19 percent.
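The same two numbers, worked out in a few lines, in case you want to plug in different SDs. This is only the usual variance split, assuming true talent and luck are independent.

```python
import math

sd_observed, sd_luck = 4.8, 2.1

# SD^2(observed) = SD^2(true) + SD^2(luck), solved for SD(true)
sd_true = math.sqrt(sd_observed ** 2 - sd_luck ** 2)

# Fraction of observed variance that is luck = amount to regress
regression = (sd_luck / sd_observed) ** 2

print(f"{sd_true:.1f} {regression:.0%}")  # 4.3 19%
```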

Why isn't it 38.3 percent?

Because, in this case, the "true" value is the three-year average. For that, regress 19 percent. But, that's not what we really want when it comes to a single year's performance. For that, we want SD(true) and SD(luck) for just that one year's park factor, not the average of the three years.

It makes sense that you need to regress more for one year than for three, because there's more randomness. The first 19 percent is for the randomness in the three-year average. The next 19 percent is for the fact that the park might have been different in the other two of the three years, so the three-year BPF might not be representative of the year you're looking at.

There's no obvious relationship between the 19 percent and the 38.3 percent -- it's just coincidence that it comes out double. 

But I think I've worked out how we could have calculated the 38.3 figure. I'll write that up for Part II.  

(P.S.  You might have noticed that the last column of the chart was fairly level, except for the extreme hitters' parks.  I'll talk about that in part III.)


Update, 3/24/20: Part II is here.

