Wednesday, April 15, 2020

Regressing Park Factors (Part III)

I previously calculated that to estimate the true park factor (BPF) for a particular season, you have to take the "standard" one and regress it to the mean by 38 percent.

That's the generic estimate, for all parks combined. If you take Coors Field out of the pool of parks, you have to regress even more.

I ran the same study as in my other post, but this time I left out all the Rockies. Now, instead of 38 percent, you have to regress 50 percent. (It was actually 49-point-something, but I'm calling it 50 percent for simplicity.)

In effect, the old 38 percent estimate comes from a combination of 

1. Coors Field, which needs to be regressed virtually zero, and
2. The other parks, which need to be regressed 50 percent.
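To make the arithmetic concrete, here's a minimal sketch of what "regress X percent to the mean" does to a single park factor. The function name and the observed BPF of 110 are my own illustrative choices, not numbers from the study; only the 38 and 50 percent figures come from above.

```python
def regress_park_factor(observed_bpf, regress_fraction, mean=100.0):
    """Pull an observed park factor toward `mean` by `regress_fraction` (0 to 1)."""
    return mean + (1.0 - regress_fraction) * (observed_bpf - mean)

# Illustrative "standard" BPF for one season (made up for this example)
observed = 110.0

all_parks = regress_park_factor(observed, 0.38)  # generic estimate, Coors included
non_coors = regress_park_factor(observed, 0.50)  # estimate with the Rockies left out

print(round(all_parks, 1), round(non_coors, 1))
```

So a 110 would be treated as about 106 under the all-parks estimate, and as 105 under the non-Coors estimate.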

For the 50-percent estimate, the 95% confidence interval is (41, 58), which is very wide. But the theoretical method from my last post, which I also repeated without Colorado, gave 51 percent, right in line with the observed number.

--------

I tried this method for the Rockies only, and it turns out that the point estimate is that you have to regress slightly *away* from the mean of 100. But with so few team-seasons, the confidence interval is so huge that I'd just take the park factors at face value and not regress them at all. 

The proper method would probably be to regress the Rockies' park factor to the Coors Field mean, which is about 113. You could probably crunch the numbers and figure out how much to regress. 
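As a rough sketch of what that would look like: the only change is the target you regress toward. I haven't estimated the regression amount for Coors, so the 0.25 below is a pure placeholder, and the observed BPF of 120 is made up for illustration. The 113 is the approximate Coors mean mentioned above.

```python
COORS_MEAN_BPF = 113.0  # "about 113", per the text above

def regress_toward(observed_bpf, regress_fraction, mean):
    """Move an observed park factor `regress_fraction` of the way toward `mean`."""
    return mean + (1.0 - regress_fraction) * (observed_bpf - mean)

# PLACEHOLDER values: neither number is estimated anywhere in this post
hypothetical_fraction = 0.25
observed = 120.0

toward_coors = regress_toward(observed, hypothetical_fraction, COORS_MEAN_BPF)
toward_100 = regress_toward(observed, hypothetical_fraction, 100.0)

print(toward_coors, toward_100)
```

With the same regression amount, regressing toward the Coors mean leaves the estimate noticeably higher than regressing toward 100, which is the point of choosing the park-specific mean.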

--------

The overall non-Coors value is 50 percent, but it turns out that every decade is different. *Very* different:

1960s:   regress 15 percent
1970s:   regress 27 percent
1980s:   regress 80 percent
1990s:   regress 84 percent
2000s:   regress 28 percent
2010-16: regress 28 percent 

Why do the values jump around so much? One possibility is that it's random variation in how teams are matched to parks. The method expects batters in hitters' parks to be equal in talent, on average, to batters in pitchers' parks, but if (for instance) the Red Sox had a bad team in the 80s, this method would make the park effect appear smaller.

As soon as I wrote that, I realized I could check it. Here are the correlations between BPF and team talent in terms of RS-RA (per 162 games) for team-seasons, by decade. I'll include the regression-to-the-mean amount to make it easier to compare:

             r    RTM
---------------------
1960s:    +0.14   15% 
1970s:    +0.06   27%
1980s:    -0.14   80%
1990s:    +0.03   84%
2000s:    +0.16   28%
2010-16:  +0.23   28%
---------------------
overall:  +0.05   50%

It does seem to work out that the more positive the correlation between team talent and BPF, the less you have to regress. The two lowest correlations were the ones with the two highest levels of regression to the mean.
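As a quick sanity check on the table, here's a sketch that computes the correlation between the two columns across the six decade rows. The numbers are copied straight from the table; the `pearson` helper is my own, written out by hand so it doesn't depend on any library.

```python
from math import sqrt

# The six decade rows from the table above (the "overall" row is left out)
r_talent = [0.14, 0.06, -0.14, 0.03, 0.16, 0.23]  # correlation of BPF with RS-RA
rtm_pct = [15, 27, 80, 84, 28, 28]                # regression to the mean, percent

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

corr = pearson(r_talent, rtm_pct)
print(round(corr, 2))
```

That comes out strongly negative, roughly -0.77, although with only six data points I wouldn't lean on the exact figure.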

(The 1990s does seem a little out of whack, though. Maybe it has something to do with the fact that we're leaving out the Rockies, so the NL BPFs are deflated for 1993-99, but the RS-RA are inflated because the Rockies were mediocre that decade. With the Rockies included, the 1990s correlation would turn negative.)

The "regress 50 percent to the mean" estimate seems to be associated with an overall correlation of +.05. If we want an estimate that assumes zero correlation, we should probably bump it up a bit -- maybe to 60 percent or something.

I'd have to think about whether I wanted to do that, though. My gut seems more comfortable with the actual observed value of 50 percent, even though I can't justify that.


