Regressing Park Factors (Part I)
I think park factors* are substantial overestimates of the effect of the park.
At their core, park effects are basically calculated based on runs scored at home divided by runs scored on the road. But that figure is heavily subject to the effects of luck. One random 10-8 game at home can skew the park effect by more than half a point.
Because of this, most sabermetric sources calculate park effects based on more than one year of data. A three-year sample is common ... I think Fangraphs, Baseball Reference, and the Lahman database all use three years.
That helps, but not enough. It looks like park factors are still too extreme, and need to be substantially regressed to the mean.
-------
* Here's a quick explanation of how park factors work, for those not familiar with them.
Park Factor is a number that tells us how relatively easy or difficult it is for a team to score runs because of the characteristics of its home park. The value "100" represents the average park. A number bigger than 100 means it's a hitters' park, where more runs tend to score, and smaller than 100 means it's a pitchers' park, where fewer runs tend to score.
Perhaps confusingly, the park factor averages the home park with an amalgam of road parks, in equal proportion. So if Chase Field has a park factor of 105, which is 5 percent more than average, that really means it's about 10 percent more runs at home, and about average on the road.
The point of park factor is that you can use it to adjust a hitter's stats to account for his home park. So if Eduardo Escobar creates 106 runs for the Diamondbacks, you divide by 1.05 and figure he'd have been good for about 101 runs if he had played in a neutral home park.
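Here's a minimal sketch of that arithmetic, for anyone who wants to see it spelled out. The function names are mine, made up for illustration, and the "double the deviation" step just assumes the road parks average out to neutral.

```python
def home_park_effect(bpf):
    """Implied runs multiplier for home games only.

    BPF averages the home park with a roughly neutral mix of road parks,
    so the home-only effect is about twice the deviation from 100.
    """
    return 1 + 2 * (bpf - 100) / 100


def park_adjust(runs, bpf):
    """Convert raw runs to a neutral-park equivalent by dividing by BPF/100."""
    return runs / (bpf / 100)


print(round(home_park_effect(105), 2))  # 1.1, about 10 percent more runs at home
print(round(park_adjust(106, 105), 1))  # 101.0
```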
--------
For my large sample of batters (1960-2016, minimum 300 PA), I calculated their runs created per 500 PA (RC500), and their park-adjusted RC500. Then, I binned the players by batting park factor (BPF), and took the average for each bin. If park adjustment worked perfectly, you'd expect every bin to have the same level of performance. After all, there's no reason to think batters who play for the Red Sox or Rockies should be any better or worse overall than batters who are current Mets or former Astros.
(Because of small sample sizes, I grouped all parks 119+ into a single bin. The average BPF for those parks was 123.4.)
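For those who want to try the binning themselves, here's a rough sketch of one way to do it. The data layout (tuples of BPF, PA, and runs created) and the PA-weighted averaging are my assumptions for the sketch, not necessarily the exact procedure behind the chart, and the sample rows are made up just to show the call.

```python
from collections import defaultdict


def bin_by_bpf(player_seasons, top_bin_start=119):
    """PA-weighted raw and park-adjusted RC500 for each BPF bin.

    player_seasons: iterable of (bpf, pa, rc) tuples (hypothetical layout).
    Parks at or above top_bin_start are pooled into a single bin.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])  # PA, raw RC, adjusted RC
    for bpf, pa, rc in player_seasons:
        key = "%d+" % top_bin_start if bpf >= top_bin_start else bpf
        t = totals[key]
        t[0] += pa
        t[1] += rc
        t[2] += rc / (bpf / 100)  # adjust each batter by his own park factor
    return {
        key: {"pa": pa, "rc500": 500 * rc / pa, "adj_rc500": 500 * adj / pa}
        for key, (pa, rc, adj) in totals.items()
    }


# Made-up rows, only to show the call; these are not real player seasons.
rows = [(95, 600, 80.0), (95, 400, 45.0), (120, 500, 90.0), (125, 500, 70.0)]
bins = bin_by_bpf(rows)
print(bins[95]["rc500"])                   # 62.5
print(round(bins["119+"]["adj_rc500"], 1)) # 65.5
```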
Here's the chart:
  BPF       PA    Runs   Adj'd  Regressed
-----------------------------------------
   88     8554   67.07   76.22      72.67
   89     2960   56.57   63.56      60.85
   90    43213   62.53   69.48      66.78
   91    61195   62.42   68.59      66.19
   92   121203   61.04   66.35      64.29
   93   195382   62.21   66.89      65.07
   94   241681   61.93   65.89      64.35
   95   304270   64.72   68.12      66.80
   96   325511   64.13   66.81      65.77
   97   463537   63.05   65.00      64.24
   98   520621   65.62   66.96      66.44
   99   712668   64.29   64.94      64.69
  100   674090   64.86   64.86      64.86
  101   589401   66.53   65.87      66.12
  102   514724   66.19   64.89      65.39
  103   440940   65.48   63.58      64.32
  104   415243   66.07   63.53      64.51
  105   319334   67.35   64.15      65.39
  106   177680   66.15   62.41      63.86
  107   138748   65.85   61.54      63.21
  108   105850   67.58   62.57      64.51
  109    25751   68.50   62.84      65.04
  110    48127   65.46   59.51      61.82
  111    34278   69.61   62.71      65.39
  112    36977   65.12   58.14      60.85
  113    23001   67.94   60.13      63.16
  115    13778   74.55   64.83      68.60
  116     7994   72.21   62.25      66.12
  117    20901   73.62   62.92      67.07
123.4    39629   79.28   64.28      70.10
-----------------------------------------
row/row diff     0.60%   0.39%      0.00%
Start with the third column, which is the raw RC500. As you'd expect, the higher the park factor, the higher the unadjusted runs. That's the effect we want BPF to remove.
So, I adjusted by BPF, and that's column 4. We now expect everything to even out, and the column to look uniform. But it doesn't -- now, it goes the other way. Batters in pitchers' parks now look like they're better hitters than the batters in hitters' parks.
That shows that we overadjusted. By how much?
Take a look at the bottom row of the chart. Unadjusted, each row is about 0.6 percent higher than the row above it. We'd expect about 1 percent, if BPF worked perfectly. Adjusted, each row is about 0.4 percent lower.
So only about 60 percent of the effect that BPF reports shows up in actual scoring; the other 40 percent or so is overstatement. Which means, if we regress park factors to the mean by 40 percent, we should remove the bias.
That's what the last column is. And the numbers look pretty level.
Actually, I didn't use 40 percent ... I used 38.8 percent. That's what gave the best flat fit. (Part of the difference is due to rounding, and the rest due to the fact that I ignored nonlinearity when I calculated the percentages.)
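Here's a small sketch of the regression step. The first function is the plain "pull BPF toward 100" version. The second is a blend of the raw and fully adjusted figures; as far as I can tell from checking a few rows, that blend is what reproduces the chart's last column, but that's my inference from the numbers, and both function names are made up for illustration.

```python
def regress_bpf(bpf, regress_frac=0.388):
    """Pull a park factor toward 100 by the given fraction."""
    return 100 + (bpf - 100) * (1 - regress_frac)


def regressed_rc500(raw, adjusted, regress_frac=0.388):
    """Keep only (1 - regress_frac) of the park adjustment.

    Equivalent to moving the fully adjusted figure back toward the raw one.
    """
    return adjusted + regress_frac * (raw - adjusted)


print(round(regress_bpf(105), 2))              # 103.06
print(round(regressed_rc500(67.07, 76.22), 2)) # 72.67, the BPF 88 row
print(round(regressed_rc500(79.28, 64.28), 2)) # 70.1, the 123.4 row
```

Other rows can come out a hundredth off, because the chart's inputs are themselves rounded.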
Just to be more rigorous and get a more accurate estimate, I ran a regression. Instead of binning the players, I used each player separately, and did a "weighted regression" that weights each one by his number of PA, which is effectively what the bins did. Because of the weights, I was able to drop the minimum from 300 PA to 10 PA. Also, I included a dummy variable for year, just in case there were a lot of pitchers' parks in 1987, or something.
The result came out almost exactly the same -- regress by 38.3 percent.
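For the curious, here's a bare-bones sketch of what a PA-weighted regression like that could look like. This is not the actual specification: I'm assuming a log-log form (so a slope of 1 would mean BPF is exactly right), I've left out the year dummies to keep it short, and the check at the bottom uses synthetic numbers built to have a slope of 0.6, not real data.

```python
import math


def weighted_slope(xs, ys, ws):
    """Weighted least-squares slope of y on x (one predictor plus intercept)."""
    w_total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    return sxy / sxx


def regression_fraction(bpfs, rc500s, pas):
    """How far to regress BPF: 1 minus the slope of log(RC500) on log(BPF/100)."""
    xs = [math.log(b / 100) for b in bpfs]
    ys = [math.log(r) for r in rc500s]
    return 1 - weighted_slope(xs, ys, pas)


# Synthetic check: scoring responds to only 60 percent of the stated park factor.
bpfs = [90, 95, 100, 105, 110]
pas = [300, 450, 600, 500, 350]
rc500s = [65 * (b / 100) ** 0.6 for b in bpfs]
print(round(regression_fraction(bpfs, rc500s, pas), 3))  # 0.4
```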
-------
Could we have calculated this mathematically just from raw park factors? Yes, I think so -- but not quite in the usual way.
I'll show the usual way here, and save the rest for the next post. If you don't care about the math, you can just stop here.
-------
If we used the usual technique for figuring how much to regress, we'd use
SD^2(observed) = SD^2(true) + SD^2(luck)
We can figure luck. The SD of team runs in a game is about 3. For two teams combined, multiply by root 2. To get the difference between a home game and a road game, multiply by root 2 again. Then, for 81 games, divide by the square root of 81, which is 9. Finally, because we're using 3 years, divide by the square root of 3.
You get 0.385 runs.
That figure, 0.385 runs, is 4.28 percent of the usual 9 runs per game. To convert that to a park factor, take half. That's 2.14 points. I'll round to 2.1.
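Here's the same chain of steps as a quick sketch, in case it's easier to follow that way. The variable names are mine; the inputs (an SD of 3 runs, 81 games, 3 years, 9 runs per game) are the ones from the text.

```python
import math

sd_team_runs = 3.0                              # SD of one team's runs in a game
sd_game = sd_team_runs * math.sqrt(2)           # both teams combined
sd_home_minus_road = sd_game * math.sqrt(2)     # home game minus road game
sd_season = sd_home_minus_road / math.sqrt(81)  # averaged over 81 games
sd_three_years = sd_season / math.sqrt(3)       # averaged over three seasons

# As a share of 9 runs per game, then halved to put it on the park factor scale.
sd_luck_points = 100 * sd_three_years / 9 / 2

print(round(sd_three_years, 3))  # 0.385
print(round(sd_luck_points, 1))  # 2.1
```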
The observed SD, from the Lahman database, is 4.8 points.
We can now calculate SD(true), since 4.8^2 = SD^2(true) + 2.1^2. It works out to 4.3 points.
SD(observed)= 4.8
-----------------
SD(true)= 4.3
SD(luck)= 2.1
So, to regress observed to true, we regress the park factor by (2.1/4.8)^2, which is about 19 percent.
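As a sketch, here's that calculation, using the standard variance split from above. The function name is made up for illustration.

```python
import math


def regression_to_mean(sd_observed, sd_luck):
    """Return (SD of true values, fraction to regress toward the mean)."""
    sd_true = math.sqrt(sd_observed ** 2 - sd_luck ** 2)
    regress_frac = (sd_luck / sd_observed) ** 2  # share of variance that is luck
    return sd_true, regress_frac


sd_true, frac = regression_to_mean(4.8, 2.1)
print(round(sd_true, 1))  # 4.3
print(round(frac, 2))     # 0.19
```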
Why isn't it 38.3 percent?
Because, in this case, the "true" value is the three-year average. For that, regress 19 percent. But, that's not what we really want when it comes to a single year's performance. For that, we want SD(true) and SD(luck) for just that one year's park factor, not the average of the three years.
It makes sense that you need to regress more for one year than for three, because there's more randomness. The first 19 percent is for the randomness in the three-year average. The next 19 percent is for the fact that the park might have been different in the other two of the three years, so the three-year BPF might not be representative of the year you're looking at.
There's no obvious relationship between the 19 percent and the 38.3 percent -- it's just coincidence that it comes out double.
But I think I've worked out how we could have calculated the 38.3 figure. I'll write that up for Part II.
(P.S. You might have noticed that the last column of the chart was fairly level, except for the extreme hitters' parks. I'll talk about that in Part III.)
Update, 3/24/20: Part II is here.
Labels: baseball, park factors, regression to the mean
2 Comments:
Is it true that we'd expect batters to be the same (true) quality regardless of their home park? I would think there might be some selection bias, where teams with high park factors believe their hitters to be better than they are, and invest more in pitching, and vice-versa for low park factors. I wonder if this effect has changed over time, as GMs have gotten more savvy about adjusting for park.
Anonymous,
That's definitely a possibility. If that were the case, they'd consistently have below-average hitting and above-average pitching.
You could find evidence for that ... just check their road batting relative to the league, and compare it to their road pitching relative to the league.