Monday, March 23, 2020

Regressing Park Factors (Part II)


Note: math/stats post.

-------------

Last post, I figured the breakdown of variance for (three-year average) park effects (BPFs) from the Lahman database. It came out like this:

All Parks   [chart 1]
-------------------------
4.8 = SD(3-year observed)
-------------------------
4.3 = SD(3-year true) 
2.1 = SD(3-year luck)
-------------------------

Using the usual method, we would figure, theoretically, that you have to regress park factor by (2.1/4.8)^2, which is about 20 percent. 

But when we used empirical data to calculate the real-life amount of regression required, it turned out to be 38 percent.

Why the difference? Because the 20 percent figure is to regress the observed three-year BPF to its true three-year average. But the 38 percent is to regress the observed three-year BPF to a single-year BPF.

My first thought was: the 3-year true value is the average of three 1-year true values. If each of those were independent, we could just break the 3-year SD into three 1-year SDs by dividing by the square root of 3. 

But that wouldn't work. That's because when we split a 3-year BPF into three 1-year BPFs, those three are from the same park. So, we'd expect them to be closer to each other than if they were three random BPFs from different parks. (That fact is why we choose a three-year average instead of a single year -- we expect the three years to be very similar, which will increase our accuracy.)

Three years of the same park are similar, but not exactly the same. Parks do change a bit from year to year; more importantly, *other* parks change. (In their first season in MLB, the Rockies had a BPF of 118. All else being equal, the other 13 teams would see their BPF drop by about 1.4 points to keep the average at 100.)

So, we need to figure out the SD(true) for different seasons of the same park. 

--------

From the Lahman database, I took all ballparks (1960-2016) with the same name for at least 10 seasons. For each park, I calculated the SD of its BPF for those years. Then, I took the root mean square of those numbers. That came out to 3.1.

We already calculated that the SD of luck for the average of three seasons is 2.1. That means we can fill in SD(true)=2.3.

Same Park     [Chart 2]
------------------------------------
3.1 = SD(3-year observed, same park)
------------------------------------
2.3 = SD(3-year true, same park)
2.1 = SD(3-year luck, any parks)
------------------------------------

(That's the only number we will actually need from that chart.)
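
Here's a rough Python sketch of that Chart 2 calculation, in case you want to reproduce it. The file and column names are made up; all you really need is a table with one row per park per season and its BPF.

import numpy as np
import pandas as pd

# Hypothetical table of park-seasons: columns park, year, bpf (assumed names).
parks = pd.read_csv("park_bpf.csv")
parks = parks[(parks.year >= 1960) & (parks.year <= 2016)]

# Keep parks that show up (under the same name) for at least 10 seasons.
counts = parks.groupby("park")["year"].transform("count")
long_lived = parks[counts >= 10]

# SD of BPF within each park, then the root mean square of those SDs (~3.1).
within_sd = long_lived.groupby("park")["bpf"].std()
sd_observed_same_park = np.sqrt((within_sd ** 2).mean())

# Strip out the 2.1 points of luck to get SD(3-year true, same park) of ~2.3.
sd_true_same_park = np.sqrt(sd_observed_same_park ** 2 - 2.1 ** 2)
print(sd_observed_same_park, sd_true_same_park)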

Now, from Chart 1, we found SD(true) was 4.3 for all park-years. That 4.3 is the combination of (a) variance of different years from the same park, and (b) variance between the different parks. We now know (a) is 2.3, so we can calculate (b) is root (4.3 squared minus 2.3 squared), which equals 3.6.

So we'll break the "4.3" from Chart 1 into those two parts:

All Parks     [Chart 3]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
2.3 = SD(3-year true within park)
2.1 = SD(3-year luck)
-----------------------------------

Now, let's assume that for a given park, the annual deviations from its overall average are independent from year to year. That's not absolutely true, since some changes are more permanent, like when Coors Field joins the league. But it's probably close enough.

With the assumption of independence, we can break the 3-year SD down into three 1-year SDs.  That converts the single 2.3 into three SDs of 1.3 (obtained by dividing 2.3 by the square root of 3):

All Parks     [Chart 4]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)
1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------

What we're interested in is the SD of this year's value. That's the combination of the first two numbers in the breakdown: the SD of the difference between parks, and the SD of this year's true value for the current park.

The bottom three numbers are different kinds of "luck," for what we're trying to measure. The actual luck in run scoring, and the "luck" in how the park changed in the other two years we're using in the smoothing for the current year. 

All Parks     [Chart 4a]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)

1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------

Combining the top two and bottom three, we get

All Parks      [Chart 5]
----------------------------------------------
4.8 = SD(3-year observed)
----------------------------------------------
3.8 = SD(true values impacting observed BPF)
2.8 = SD(random values impacting observed BPF)
----------------------------------------------

So we regress by (2.8/4.8) squared, which works out to 34 percent. That's pretty close to the empirical figure of 38 percent.
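
Here's the whole chain of arithmetic in a few lines of Python, if you want to check it (using the rounded chart values, so the last digit may wobble a bit):

import math

sd_obs   = 4.8                                    # SD(3-year observed), Chart 1
sd_luck  = 2.1                                    # SD(3-year luck)
sd_true3 = math.sqrt(sd_obs**2 - sd_luck**2)      # ~4.3, SD(3-year true), Chart 1

sd_same_obs  = 3.1                                       # Chart 2, same park
sd_same_true = math.sqrt(sd_same_obs**2 - sd_luck**2)    # ~2.3, true within park
sd_between   = math.sqrt(sd_true3**2 - sd_same_true**2)  # ~3.6, true between parks

sd_one_year = sd_same_true / math.sqrt(3)                # ~1.3 per single season

sd_signal = math.sqrt(sd_between**2 + sd_one_year**2)    # ~3.8, Chart 5 top
sd_noise  = math.sqrt(2 * sd_one_year**2 + sd_luck**2)   # ~2.8, Chart 5 bottom
print(round((sd_noise / sd_obs) ** 2, 2))                # ~0.34 regression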

We can do another attempt, with a slightly different assumption. Back in Chart 2, when we figured SD(3-year observed, same park) was 3.1 ... that estimate was based on parks with at least ten years of data. If I reduce the requirement to three years of data, SD(3-year observed, same park) goes up to 3.2, and the final result is ... 36 percent regression to the mean.

So there it is.  I think this method is valid, but I'm not completely sure. The 95% confidence interval for the true value seems to be wide -- regression to the mean between 28 percent and 49 percent -- so it might just be coincidence that this calculation matches. 

If you see a problem, let me know.




Thursday, March 12, 2020

Regressing Park Factors (Part I)

I think park factors* are substantial overestimates of the effect of the park. 

At their core, park effects are basically calculated based on runs scored at home divided by runs scored on the road. But that figure is heavily subject to the effects of luck. One random 10-8 game at home can skew the park effect by more than half a point.

Because of this, most sabermetric sources calculate park effects based on more than one year of data. A three-year sample is common ... I think Fangraphs, Baseball Reference, and the Lahman database all use three years. 

That helps, but not enough. It looks like park factors are still too extreme, and need to be substantially regressed to the mean.

-------

* Here's a quick explanation of how park factors work, for those not familiar with them.

Park Factor is a number that tells us how relatively easy or difficult it is for a team to score runs because of the characteristics of its home park. The value "100" represents the average park. A number bigger than 100 means it's a hitters' park, where more runs tend to score, and smaller than 100 means it's a pitchers' park, where fewer runs tend to score.


Perhaps confusingly, the park factor averages the home park with an amalgam of road parks, in equal proportion. So if Chase Field has a park factor of 105, which is 5 percent more than average, that really means it's about 10 percent more runs at home, and about average on the road.

The point of park factor is that you can use it to adjust a hitter's stats to account for his home park. So if Eduardo Escobar creates 106 runs for the Diamondbacks, you divide by 1.05 and figure he'd have been good for about 101 runs if he had played in a neutral home park.
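
In code, that adjustment is just a division:

def park_adjust_runs(runs_created, bpf):
    # Divide by BPF/100 to put a batter's runs on a neutral-park scale.
    return runs_created / (bpf / 100.0)

print(round(park_adjust_runs(106, 105)))   # ~101, as in the Escobar example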

--------

For my large sample of batters (1960-2016, minimum 300 PA), I calculated their runs created per 500 PA (RC500), and their park-adjusted RC500. Then, I binned the players by batting park factor (BPF), and took the average for each bin. If park adjustment worked perfectly, you'd expect every bin to have the same level of performance. After all, there's no reason to think batters who play for the Red Sox or Rockies should be any better or worse overall than batters who are current Mets or former Astros.

(Because of small sample sizes, I grouped all parks 119+ into a single bin. The average BPF for those parks was 123.4.)

Here's the chart:


BPF      PA     Runs   Adj'd  Regressed
------------------------------------------
 88   8554   67.07   76.22   72.67
 89   2960   56.57   63.56   60.85
 90  43213   62.53   69.48   66.78
 91  61195   62.42   68.59   66.19
 92 121203   61.04   66.35   64.29
 93 195382   62.21   66.89   65.07
 94 241681   61.93   65.89   64.35
 95 304270   64.72   68.12   66.80
 96 325511   64.13   66.81   65.77
 97 463537   63.05   65.00   64.24
 98 520621   65.62   66.96   66.44
 99 712668   64.29   64.94   64.69
100 674090   64.86   64.86   64.86
101 589401   66.53   65.87   66.12
102 514724   66.19   64.89   65.39
103 440940   65.48   63.58   64.32
104 415243   66.07   63.53   64.51
105 319334   67.35   64.15   65.39
106 177680   66.15   62.41   63.86
107 138748   65.85   61.54   63.21
108 105850   67.58   62.57   64.51
109   25751   68.50   62.84   65.04
110   48127   65.46   59.51   61.82
111   34278   69.61   62.71   65.39
112   36977   65.12   58.14   60.85
113   23001   67.94   60.13   63.16
115   13778   74.55   64.83   68.60
116    7994   72.21   62.25   66.12
117   20901   73.62   62.92   67.07
123.4  39629   79.28   64.28   70.10
---------------------------------------
row/row diff    0.60%   0.39%   0.00%


Start with the third column, which is the raw RC500. As you'd expect, the higher the park factor, the higher the unadjusted runs. That's the effect we want BPF to remove.

So, I adjusted by BPF, and that's column 4. We now expect everything to even out, and the column to look uniform. But it doesn't -- now, it goes the other way. Batters in pitchers' parks now look like they're better hitters than the batters in hitters' parks.

That shows that we overadjusted. By how much? 

Take a look at the bottom row of the chart. Unadjusted, each row is about 0.6 percent higher than the row above it. We'd expect about 1 percent, if BPF worked perfectly. Adjusted, each row is about 0.4 percent lower. 

So BPF overestimates the true park factor by around 40 percent. Which means, if we regress park factors to the mean by 40 percent, we should remove the bias.

That's what the last column is. And the numbers look pretty level.

Actually, I didn't use 40 percent ... I used 38.8 percent. That's what gave the best flat fit. (Part of the difference is due to rounding, and the rest due to the fact that I ignored nonlinearity when I calculated the percentages.)
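
For the record, the last column matches a simple blend of the raw and adjusted columns -- which is the same thing as regressing the adjustment itself 38.8 percent of the way back toward the raw number:

def regressed_rc(raw_rc500, adjusted_rc500, r=0.388):
    # Keep 38.8 percent of the raw value and 61.2 percent of the adjusted value.
    return r * raw_rc500 + (1 - r) * adjusted_rc500

print(round(regressed_rc(67.07, 76.22), 2))    # 72.67, matching the BPF-88 row
print(round(regressed_rc(79.28, 64.28), 2))    # 70.10, matching the 123.4 row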

Just to be more rigorous and get a more accurate estimate, I ran a regression. Instead of binning the players, I just used all players separately, and did a "weighted regression" that effectively adjusts for the number of PA associated with each player-season. Because of the weights, I was able to drop the minimum from 300 PA to 10 PA. Also, I included a dummy variable for year, just in case there were a lot of pitchers' parks in 1987, or something.

The result came out almost exactly the same -- regress by 38.3 percent.
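
For anyone who wants to try something similar, the setup looks roughly like this (the file and column names are made up; statsmodels handles the PA weights and the year dummies):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per player-season (10+ PA), with the player's
# park-adjusted RC500, his park factor, his PA, and the year.
df = pd.read_csv("batter_seasons.csv")

fit = smf.wls("adj_rc500 ~ bpf + C(year)", data=df, weights=df["pa"]).fit()

# If the park adjustment worked perfectly, the bpf coefficient would be near
# zero.  A negative coefficient means the adjustment overcorrects, and its
# size tells you how much the park factor needs to be regressed.
print(fit.params["bpf"])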

-------

Could we have calculated this mathematically just from raw park factors? Yes, I think so -- but not quite in the usual way. 

I'll show the usual way here, and save the rest for the next post. If you don't care about the math, you can just stop here.

-------

If we used the usual technique for figuring how much to regress, we'd use

SD^2(observed) = SD^2(true) + SD^2(luck)

We can figure luck. The SD of team runs in a game is about 3. For two teams combined, multiply by root 2. To get the difference between a home game and a road game, multiply by root 2 again. Then, for 81 games, divide by the square root of 81, which is 9. Finally, because we're using 3 years, divide by the square root of 3. 

You get 0.385 runs. 

That figure, 0.385 runs, is 4.27 percent of the usual 9 runs per game. To convert that to a park factor, take half. That's 2.13 points. I'll round to 2.1.

The observed SD, from the Lahman database, is 4.8 points. 

We can now calculate SD(true), since 4.8^2 = SD^2(true) + 2.1^2. It works out to 4.3 points.

SD(observed)= 4.8
-----------------
    SD(true)= 4.3
    SD(luck)= 2.1

So, to regress observed to true, we regress the park factor by (2.1/4.8)^2, which is about 19 percent.
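
The arithmetic, in Python:

import math

sd_team_game = 3.0
sd_game_diff = sd_team_game * math.sqrt(2) * math.sqrt(2)  # both teams, home minus road
sd_81_games  = sd_game_diff / math.sqrt(81)
sd_3_years   = sd_81_games / math.sqrt(3)                  # ~0.385 runs

luck_points = round(sd_3_years / 9 * 100 / 2, 1)           # ~2.1 BPF points
sd_true     = math.sqrt(4.8**2 - luck_points**2)           # ~4.3 points
print(round((luck_points / 4.8) ** 2, 2))                  # ~0.19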

Why isn't it 38.3 percent?

Because, in this case, the "true" value is the three-year average. For that, regress 19 percent. But, that's not what we really want when it comes to a single year's performance. For that, we want SD(true) and SD(luck) for just that one year's park factor, not the average of the three years.

It makes sense that you need to regress more for one year than for three, because there's more randomness: the first 19 percent is for the randomness in the three-year average, and the additional regression is for the fact that, in the other two of the three years, the park might have played differently -- so the three-year BPF might not be representative of the year you're looking at.

There's no obvious relationship between the 19 percent and the 38.3 percent -- it's just coincidence that it comes out double. 

But I think I've worked out how we could have calculated the 38.3 figure. I'll write that up for Part II.  

(P.S.  You might have noticed that the last column of the chart was fairly level, except for the extreme hitters' parks.  I'll talk about that in part III.)


Update, 3/24/20: Part II is here.



Friday, February 28, 2020

Park adjusting a player's batting line has to depend on who the player is

Suppose home runs are scarce in the Astrodome, so that only half as many home runs are hit there as in any other park. One year, Astros outfielder Joe Slugger hits 15 HR at the Astrodome. How do you convert that to be park-neutral? 

It seems like you should adjust it to 30. Take the 15 HR, double it, and there you go. 

But I don't think that works. I think if you do it that way, you overestimate what Joe would have done in a normal park. I think you need to adjust Joe to something substantially less than 30.

------

One reason is that Joe might not necessarily be hurt by the park as much as other players. Maybe the park hurts weaker hitters more, the kind who hit mostly 310-foot home runs. Maybe Joe is the kind who hits most of his fly balls 430 feet, so when the indoor dead air shortens them by 15 feet, they still have enough carry to make it over the fence.

It's almost certain that some players should have different park factors than others. Many parks are asymmetrical, so lefties and righties will hit to different outfield depths. Some parks may have more impact on players who hit line-drive HRs, and less impact on those who hit towering fly balls. And so on.

I suspect that's actually a big issue, but I'm going to assume it away for now. I'll continue as if every player is affected by the park the same way, and I'll assume that Joe hit exactly 15 HR at home and 30 HR on the road, exactly in line with expectations.

Also, to keep things simple, two more assumptions. First I'll assume that the park factor is caused by distance to the outfield fence -- that the Astrodome is, say, 10 percent deeper than the average park. Second, I'll assume that in the alternative universe where Joe played in a different home park, he would have hit every ball with exactly the same trajectory and distance that he did in the Astrodome.

My argument is that with these assumptions, the Astros overall would have hit twice as many HR at home as they actually did. But Joe Slugger would have hit *fewer* than twice as many.

------

Let's start by defining two classes of deep fly balls:

A: fly balls deep enough to be HR in any park, including the Astrodome; 
B: fly balls deep enough to be HR in any park *except* the Astrodome.

We know that, overall, class A is exactly equal in size to class B, since (A+B) is exactly twice A.

That's why, when we saw 15 HR in class A, we immediately assumed that implies 15 HR in class B. And so we assumed that Joe would have hit an extra 15 HR in any other park.

That seems like it should work, but it doesn't. Here's a series of analogies that shows why.

1. You have a pair of fair dice. You expect them to come up snake eyes (1-1) exactly as often as box cars (6-6). You roll the dice 360 times, and find that 1-1 came up 15 times. 

Since 6-6 comes up as often as 1-1, should you estimate that 6-6 also came up 15 times? You should not. Since the dice are fair, you expect 6-6 to have come up 1 time in 36, or 10 times.* The fact that 1-1 got lucky, and came up more often, doesn't mean that 6-6 must have come up more often.

(*Actually, you should expect that 6-6 came up only 9.86 times, since there are 5 fewer tosses left for 6-6 after taking out the successful 1-1s. But never mind for now.)

2. You have a pair of fair dice, and an APBA card. On that card, 1-1 is a home run, and 6-6 represents a home run anywhere except the Astrodome.

You roll the dice 360 times, and 1-1 comes up 15 times. Do you also expect that 6-6 came up 15 times? Same answer: you expect it came up only 10 times. The fact that 1-1 got lucky doesn't mean that 6-6 must also have gotten lucky.

3. You have a simulation game, with some number of fair dice, and a card for Joe Slugger. You know the probability of Joe hitting an Astrodome HR is equal to the probability of Joe hitting an "anywhere but Astrodome" HR.  But that probability -- Joe's talent level -- isn't necessarily 1 in 36.

You play a season's worth of Joe's home games, and he hit 15 HR. Can you assume that he also hit 15 "anywhere but Astrodome" HR? 

Well, in one special case, you can. If the 15 HR was Joe's actual expectation, based on his talent -- that is, his card -- then, yes, you can assume 15 near-HR. 

But, in all other cases, you can't. If Joe's 15 HR total was lucky, based on his talent, you should assume fewer than 15 near-HR. And if the 15 HR was unlucky, you should assume more than 15 near-HR.

------

So I think you can't park adjust players via the standard method of multiplying their performance by their park factor. The park adjustment has to be based on their *expected* performance, not their observed performance.

Suppose Joe Slugger, at the beginning of the season, was projected by the Marcel system to hit 10 HR at home. That means that he was expected to hit 10 HR at the Astrodome, and 10 "almost HR" at the Astrodome.

Instead, he winds up hitting 15 HR there. But we still estimate that he hit only 10 "almost HR". So, instead of bumping his 15 HR total to 30, we bump it only to 25.
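
For the two specific cases in this post -- a park that halves HR and a park that doubles them -- the adjustment logic looks like this (a sketch for these factor-of-two examples, not a general formula):

def astrodome_neutral_hr(observed_home_hr, projected_home_hr):
    # Park cuts HR in half.  The "almost HRs" we credit him with come from his
    # *projection*, not his observed total, which might be lucky or unlucky.
    return observed_home_hr + projected_home_hr

def fenway_neutral_hr(observed_home_hr):
    # Park doubles HR.  On average, half the observed HR would have gone out
    # anywhere, no matter who the batter is.
    return observed_home_hr / 2

print(astrodome_neutral_hr(15, 10))   # 25, not the naive 30
print(fenway_neutral_hr(40))          # 20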

-------

I was surprised by this, that there's no way to convert the Astrodome to a normal park that doesn't require you to estimate the player's talent. 

But here's what surprised me even more, when I worked it out: you only need to know the player's talent when you're adjusting from a pitchers' park. When you adjust from a hitters' park, one formula works for everyone!

Let's take it the other way, and suppose that Fenway affords twice as many home runs as any other park. And, suppose Joe Slugger, now with the Red Sox, hits 40 at Fenway and 20 on the road.

How many would he have hit if none of his games were at Fenway?

Well, on average, half of his 40 HR would have been HR on the road. So, that's 20. End of calculation. 

It doesn't matter who the batter is, or what his talent is -- as long as we stick to the assumption that every player's expectation is twice as many HR at Fenway, the expectation is that half his Fenway HR would also have been HR on the road.

(In reality, it might have been more, or it might have been less, since the breakdown of the 40 HR is random, like 40 coin tosses. But the point is, it doesn't depend on the player.)

-------

If you're not convinced, here's a coin toss analogy that might make it clearer.

We ask MLB players to do a coin toss experiment. We give them a fair coin. We tell them, take your day of birth, multiply it by 10, toss the coin that many times, and count the heads. Then, toss the coin that many times again, but this time, count the number of tails.

For the Fenway analogy: heads are "Fenway only" HR. Tails are "any park" HR.

We ask each player to come back and tell us H+T, the total number of Fenway HR. We then try to estimate the heads, the number of "Fenway only" HR.

That's easy: we just assume half the number. Mathematically, the expectation for any player, no matter who, is that H will be half of (H+T). That's because no matter how lucky or unlucky he was, there's no reason to expect he was luckier in H than T, or vice-versa.

Now, for the Astrodome analogy. Heads are "Any park including Astrodome" HR. Tails are "other park only" HR.

We ask each player to come back and tell us only the number of heads, which is the Astrodome HR total. We'll try to estimate tails, the non-Astrodome HR total.

Rob Picciolo comes back and says he got 15 heads. Naively, we might estimate that he also tossed 15 tails, since the probabilities are equal. But that would be wrong. Because, we would check Baseball Reference, and we would see that Picciolo was born on the 4th of the month, not the 3rd. Which means he actually had 40 tosses, not 30, and was unlucky in heads.

In his 40 tosses for tails, there's no reason to expect he'd have been similarly unlucky, so we estimate that Picciolo tossed 20 tails, not 15.

On the other hand, Barry Bonds comes back and says he got 130 heads. On average, players who toss 130 heads would also have averaged about 130 tails. But Barry Bonds was born on the 24th of July. We should estimate that he tossed only 120 tails, not 130.

For Fenway, when we know the total number of heads and tails, the player's birthday doesn't factor into our estimate of tails. For the Astrodome, when we know only the total number of heads, the player's birthday *does* factor into our estimate of tails.
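
If you'd rather see it than take my word for it, here's a quick simulation of the coin-toss experiment (the specific numbers are just the examples above):

import numpy as np

rng = np.random.default_rng(1)
players = 500_000

days  = rng.integers(1, 29, size=players)   # day of birth, 1 to 28 for simplicity
n     = days * 10                           # number of tosses
heads = rng.binomial(n, 0.5)                # "Astrodome" HR
tails = rng.binomial(n, 0.5)                # "anywhere but Astrodome" HR

# Fenway case: we know heads + tails, and half of it is an unbiased estimate
# of tails for everyone, regardless of birthday.
print(np.mean(tails - (heads + tails) / 2))           # ~0

# Astrodome case: we know only heads.  Picciolo, born on the 4th (40 tosses),
# reports 15 heads -- but his tails still average about 20, not 15.
picciolo = (days == 4) & (heads == 15)
print(tails[picciolo].mean())                         # ~20

# Bonds, born on the 24th (240 tosses), reports 130 heads -- his tails
# average about 120, not 130.
bonds = (days == 24) & (heads == 130)
print(tails[bonds].mean())                            # ~120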

-------

So, when Joe Slugger plays 81 games at the Astrodome and tosses 15 home run "heads," we can't just expect him to have also tossed 15 long fly ball "tails". We have to look up his home run talent "date of birth". If he was only born on the 2nd of the month, so that we'd have only expected him to hit 10 HR "heads" and 10 near-HR "tails" in the first place, then we estimate he'd have hit only 10 neutral-park HR, not 15. 

If we don't do that -- if we don't look at his "date of birth" talent and just double his actual Astrodome HR -- our estimates will be too high for players who were lucky, and too low for players who were unlucky. 

Obviously, players who were lucky will have higher totals. That means that if we park-adjust the numbers for the Astros every year, the players who have the best seasons will tend to be the ones we overadjust the most. In other words, when a player was both good and lucky, we're going to make his good seasons look great, his great seasons look spectacular, and his spectacular seasons look like insane outliers. When a player is bad and unlucky, his bad seasons will look even worse.

But if we park-adjust the Red Sox every year ... there's no such effect, and everything should work reasonably well.

My gut still doesn't want to believe it, but my brain thinks it's correct. 

Well, my gut *didn't* want to believe it, when I wrote that sentence originally. Now, I realize that the effect is pretty small. When a player gets lucky by, say, 20 runs, with a season park factor of 95 ... well, that's only 1 run total. My gut is more comfortable with a 1-run effect.

But, suppose you're adjusting a Met superstar, trying to figure out what he'd hit in Colorado. Runs are about 60 percent more abundant in Coors Field than Citi Field, which means the park factor is around 30 percent higher. If the player was 20 runs lucky in that particular season, you'd wind up overestimating him by 6 runs, which is now worth worrying about.




-------

(Note: After writing this, but before final edit, I discovered that Tom Tango made a similar argument years ago. His analysis dealt with the specific case where the player's observed performance matches his expectation, and for that instance I have reinvented his wheel, 15 years later.)



Monday, November 18, 2019

Why you can't calculate aging trajectories with a standard regression

I found myself in a little Twitter discussion last week about using regression to analyze player aging. I argued that regression won't give you accurate results, and that the less elegant "delta method" is the better way to go.

Although I did a small example to try to make my point, Tango suggested I do a bigger simulation and a blog post. That's this.

(Some details if you want:

For the kind of regression we're talking about, each season of a career is an input row. Suppose Damaso Griffin created 2 WAR at age 23, 2.5 WAR at age 24, and 3 WAR at age 25. And Alfredo Garcia created 1, 1.5, and 1.5 WAR at age 24, 25, and 26. The file would look like:

2    23  Damaso Griffin
2.5  24  Damaso Griffin
3    25  Damaso Griffin
1    24  Alfredo Garcia
1.5  25  Alfredo Garcia
1.5  26  Alfredo Garcia

And so on, for all the players and ages you're analyzing. (The names are there so you can have dummy variables for individual player skills.)

You take that file and run a regression, and you hope to get a curve that's "representative" or an "average" or a "consolidation" of how those players truly aged.)
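
In Python, with the toy file above, the regression would look something like this (quadratic in age, with a dummy for each player):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    [(2.0, 23, "Damaso Griffin"), (2.5, 24, "Damaso Griffin"), (3.0, 25, "Damaso Griffin"),
     (1.0, 24, "Alfredo Garcia"), (1.5, 25, "Alfredo Garcia"), (1.5, 26, "Alfredo Garcia")],
    columns=["war", "age", "player"],
)
df["age2"] = df.age ** 2

# Quadratic aging curve, plus a dummy variable for each player's skill level.
fit = smf.ols("war ~ age + age2 + C(player)", data=df).fit()
print(fit.params)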

------

I simulated 200 player careers. I decided to use a quadratic (parabola), symmetric around peak age. I would have used just a linear regression, but I was worried that it might seem like the conclusions were the result of the model being too simple.

Mathematically, there are three parameters that define a parabola. For this application, they represent (a) peak age, (b) peak production (WAR), and (c) how steep or gentle the curve is.* 

(*The equation is: 

y = (x - peak age)^2 / -steepness + peak production. 

"Steepness" is related to how fast the player ages: higher steepness is higher decay. Assuming a player has a job only when his WAR is positive, his career length can be computed as twice the square root of (peak WAR * steepness). So, if steepness is 2 and peak WAR is 4, that's a 5.7 year career. If steepness is 6 and peak WAR is 7, that's a 13-year career.

You can also represent a parabola as y = ax^2+bx+c, but it's harder to get your head around what the coefficients mean. They're both the same thing ... you can use basic algebra to convert one into the other.)

For each player, I randomly gave him parameters from these distributions: (a) peak age normally distributed with mean 27 and SD 2; (b) peak WAR with mean 4 and SD 2; and (c) steepness (mean 2, SD 5; but if the result was less than 1.5, I threw it out and picked a new one).

I arbitrarily decided to throw out any careers of length three years or fewer, which reduced the sample from 200 players to 187. Also, I assumed nobody plays before age 18, no matter how good he is. I don't think either of those decisions made a difference.
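
Here's a condensed version of the simulation, if you want to play along. The details won't match my code exactly, but the logic is the same:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

rows = []
for pid in range(200):
    peak_age = rng.normal(27, 2)
    peak_war = rng.normal(4, 2)
    steep = rng.normal(2, 5)
    while steep < 1.5:                              # redraw steepness if too small
        steep = rng.normal(2, 5)
    career = [(pid, age, peak_war - (age - peak_age) ** 2 / steep)
              for age in range(18, 46)]             # nobody plays before 18
    career = [row for row in career if row[2] > 0]  # only positive-WAR seasons exist
    if len(career) > 3:                             # drop careers of 3 years or fewer
        rows.extend(career)

df = pd.DataFrame(rows, columns=["player", "age", "war"])
df["age2"] = df.age ** 2

fit = smf.ols("war ~ age + age2 + C(player)", data=df).fit()
print("fitted curvature:", fit.params["age2"])      # much closer to zero than ...
print("average true curvature:", -1 / 5.36)         # ... the average steepness implies
print("fitted peak age:", -fit.params["age"] / (2 * fit.params["age2"]))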

Here's the plot of all 187 aging curves on one graph:

[Graph: all 187 simulated aging curves]

The idea, now, is to consolidate the 187 curves into one representative curve. Intuitively, what are we expecting here? Probably, something like, the curve that belongs to the average player in the list.

The average random career turned out to be age 26.9, peak WAR 4.19, and steepness 5.36. Here's a curve that matches those parameters:

[Graph: the aging curve with the average parameters]

That seems like what we expect, when we ask a regression to find the best-fit curve. We want a "typical" aging trajectory. Eyeballing the graph, it does look pretty reasonable, although to my eye, it's just a bit small. Maybe half a year bigger left and right, and a bit higher? But close. Up to you ... feel free to draw on your monitor what you think it should look like.  

But when I ran the regression ... well, what came out wasn't close to my guess, and probably not close to your guess either:

[Graph: the regression's best-fit aging curve]

It's much, much gentler than it should be. Even if your gut told you something different than the black curve, there's no way your gut was thinking this. The regression came up with a 19-year career. A career that long happened only once in the entire 187-player sample. We expected "representative," but the regression gave us the 99.5th percentile.

What happened?

It's the same old "selective sampling"/"survivorship bias" problem.

The simulation decided that when a player's curve scores below zero, those seasons aren't included. It makes sense to code the simulation that way, to match real life. If Jerry Remy had played five years longer than he did, what would his WAR be at age 36? We have no idea.

But, with this simulation, we have a God's-eye view of how negative every player would go. So, let's include that in the plot, down to -20:

[Graph: the same curves, extended below zero down to -20 WAR]

See what's happening? The black curve is based on *all* the green data, both above and below zero, and it lands in the middle. The red curve is based only on the green data above zero, so it ignores all the green negatives at the extremes.

If you like, think of the green lines as magnets, pulling the lines towards them. The green magnets bottom-left and bottom-right pull the black curve down and make it steeper. But only the green magnets above zero affect the red line, so it's much less steep.

In fact, if you scroll back up to the other graph, the one that's above zero only, you'll see that at almost every vertical age, the red line bisects the green forest -- there's about as much green magnetism above the red line as there is below it.

In other words: survivorship bias is causing the difference.

------

What's really going on is the regression is just falling for the same classic fallacy we've been warning against for the past 30 years! It's comparing players active (above zero) at age 27 to players active (above zero) at age 35. And it doesn't find much difference. But that's because the two sets of players aren't the same. 

One more thing to make the point clearer. 

Let's suppose you find every player active last year at age 27, and average their performance (per 500PA, or whatever). And then you find every player active last year at age 35, and average their performance.

And you find there's not much difference. And you conclude, hey, players age gracefully! There's hardly any dropoff from age 27 to age 35!

Well, that's the fallacy saberists have been warning against for 30 years, right? The canonical (correct) explanation goes something like this:


"The problem with that logic is that it doesn't actually measure aging, because those two sets of players aren't the same. The players who are able to still be active at 35 are the superstars. The players who were able to be active at 27 are ... almost all of them. All this shows is that superstars at 35 are almost as good as the league average at 27. It doesn't actually tell us how players age."

Well, that logic is *exactly* what the regression is doing. It's calculating the average performance at every age, and drawing a parabola to join them. 

Here's one last graph. I've included the "average at each age" line (blue) calculated from my random data. It's almost a perfect match to the (red) regression line.

[Graph: average WAR at each age (blue) vs. the regression curve (red)]

------

Bottom line: all the aging regression does is commit the same classic fallacy we repeatedly warn about. It just winds up hiding it -- by complicating, formalizing, and blackboxing what's really going on. 






Sunday, October 13, 2019

A study on NBA home court advantage

Economist Tyler Cowen often links to NBA studies in his "Marginal Revolution" blog ... here's a recent one, from an August post. (Follow his link to download the study ... you can also find a press release by Googling the title.)

The study used a neural network to try to figure out what factors are most important for home (court) advantage (which I'll call "HCA"). The best fit model used twelve variables: two-point shots made, three-point shots made, and free throws made -- repeated for team at home, opposition on road, team on road, and opposition at home.

The authors write, 

"Networks that include shot attempts, shooting percentage, total points scored, field goals, attendance statistics, elevation and market size as predictors added no improvement in performance. ...

"Contrary to previous work, attendance, elevation and market size were not relevant to understanding home advantage, nor were shot attempts, shooting percentage, overall W-L%, and total points scored."

On reflection, it's not surprising that those other variables don't add anything ... the ones they used, shots made, are enough to actually compute points scored and allowed. Once you have that, what does it matter what the attendance was? If attendance matters at all, it would affect wins through points scored and allowed, not something independent of scoring. And "total points scored" weren't "relevant" because they were redundant, given shots made.

------

The study then proceeds to a "sensitivity analysis," where they increase the various factors, separately, to see what happens to HCA. It turns out that when you increase two-point shots made by 10 percent, you get three to four times the impact on HCA compared to when you increase three-point shots made by the same 10 percent.

The authors write,


"[This] suggests teams can maximize their advantage -- and hence their odds of winning -- by employing different shot selection strategies when home versus away. When playing at home, teams can maximize their advantage by shooting more 2P and forcing opponents to take more 2P shots. When playing away, teams can minimize an opponent's home advantage by shooting more 3P and forcing opponents to take more 3P shots."


Well, yes, but, at the same time, no. 

The reason increasing 2P by 10 percent leads to a bigger effect than increasing 3P by 10 percent is ... that 10 percent of 2P is a lot more points! Eyeball the graph of "late era" seasons the authors used (I assume it's the sixteen seasons ending with 2015-16). Per team-season, it looks like the average is maybe 2500 two-point shots made, but only 500 three-point shots.

Adding 10 percent more 2P is 250 shots for 500 points. Adding 10 percent more 3P is 50 shots for 150 points. 500 divided by 150 gives a factor of three-and-a-third -- almost exactly what the paper shows!
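
The back-of-the-envelope version:

two_pt_made   = 2500      # rough per-team-season averages, eyeballed from the graph
three_pt_made = 500

extra_2p_points = 0.10 * two_pt_made * 2     # 500 points from 10 percent more 2P
extra_3p_points = 0.10 * three_pt_made * 3   # 150 points from 10 percent more 3P
print(extra_2p_points / extra_3p_points)     # ~3.3, the study's "three to four times"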

I'd argue that what the study discovered is that points seem to affect HCA and winning percentage equally, regardless of how they are scored. 

------

Even so, the argument in the paper doesn't work. By the authors' own choice of variables, HCA is increased by *making* 2P shots, not by *taking* 2P shots. Rephrasing the above quote, what the study really shows is,

"When playing at home, teams can maximize their advantage by concentrating on *making* more 2P and on forcing opponents to *miss* more 2P. That's assuming that it's just as easy to impact 2P percentages by 10 percent than to impact 3P percentages by 10 percent."

But we could have figured that out easily, just by noticing that 10 percent of 2P is more points than 10 percent of 3P.

------

The authors found that you increase your HCA more with a 10 percent increase in road three-pointers than by a 10 percent increase in road two-pointers. 

Sure. But that's because, with the 3P, you actually wind up scoring fewer road points. Which means you win fewer road games. Which makes your HCA larger, since winning fewer road games increases the difference between home and road. 

It's because the worse you do on the road, the bigger your home court advantage!

Needless to say, you don't really want to increase your HCA by tanking road games. The authors didn't notice that's what they were suggesting.

I think the issue is that the paper assumes that increasing your HCA is always a good thing. It's not. It's actually neutral. The object isn't to increase or decrease your HCA. It's  to *win more games*. You can do that by winning more games at home, increasing your home court advantage, or by winning more games on the road, decreasing your home court advantage.

It's one of those word biases we all have if we don't think too hard. "Increasing your advantage" sounds like something we should strive for. The problem is, in this context, the home "advantage" is relative to *your own performance* on the road. So it really isn't an "advantage," in the sense of something that makes you more likely to beat the other team. 

In fact, if you rotate "Home Court Advantage" 360 degrees and call it "Road Court Disadvantage," now it feels like you want to *decrease* it -- even though it's exactly the same number!

But HCA isn't something you should want to increase or decrease for its own sake. It's just a description of how your wins are distributed.







Friday, September 06, 2019

Evidence confirming the DH "penalty"

In "The Book," Tango/Lichtman/Dolphin found that batters perform significantly worse when they play a game as DH than when they play a fielding position. Lichtman (MGL) later followed up with detailed results -- a difference of about 14 points of wOBA. That translates to about 6 runs per 500 PA.

A side effect of my new "luck" database is that I'm able to confirm MGL's result in a different way.

The way my luck algorithm works: it tries to "predict" a player's season by averaging the rest of his career -- before and after -- while adjusting for league, park, and age. Any difference between actual and predicted I ascribe to luck.

I calibrated the algorithm so the overall average luck, over thousands of player-seasons, works out to zero. For most breakdowns -- third basemen, say, or players whose first names start with "M" -- average luck stays close to zero. But, for seasons where the batter was exclusively a DH, the average luck worked out negative -- an average of -3.8 runs per 500 PA.  I'll round that to 4.

-6 R/500PA  MGL
-4 R/500PA  Phil

My results are smaller than what MGL found, but that's probably because we used different methods. I considered only players who never played in the field that year. MGL's study also included the DH games of players who did play fielding positions. 

(My method also included PH who never fielded that year. I made sure to cover the same set of seasons as MGL -- 1998 to 2012.)

MGL's study would have included players who were DHing temporarily because they were recovering from injury, and I'm guessing that's the reason for my missing 2 runs.

But, what about the 4 runs we have in common? What's going on there? Some possibilities:

1. Injury. Maybe when players spend a season DHing, they're more likely to be recovering from some longer-term problem, which also winds up impacting their hitting.

2. It's harder to bat as a DH than when playing a position. As "The Book" suggests, maybe "there is something about spending two hours sitting on the bench that hinders a player's ability to make good contact with a pitch."

3. Selective sampling. Most designated hitters played a fielding position at some time earlier in their careers. The fact that they are no longer doing so suggests that their fielding ability has declined. Whatever aspect of aging caused the fielding decline may also have affected their batting. In that case, looking at DHs might be selectively choosing players who show evidence of having aged worse than expected.

4. Something else I haven't thought of.

You could probably get a better answer by looking at the data a little closer. 

For the "harder to DH" hypothesis, you could isolate PA from the top of the first inning, when all hitters are on equal footing with the DH, since the road team hasn't been out on defense yet. And, for the "injury" hypothesis, you could maybe check batters who had DH seasons in the middle of their careers, rather than the end, and check if those came out especially unlucky. 

One test I was able to do is a breakdown of the full-season designated hitters by age:

Age     R/500PA   sample size
-----------------------------
28-32    -13.7     2,316 PA
33-37    - 6.4     4,305 PA
38-42    + 1.4     6,245 PA

(I've left out the age groups with too few PA to be meaningful.)

Young DHs underperform, and older DHs overperform. I think that's suggestive more of the injury and selective-sampling explanations than of the "it's hard to DH" hypothesis. 
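
(The breakdown itself is just a PA-weighted average of luck by age bucket -- something like this, with made-up file and column names:)

import pandas as pd

# Hypothetical table of full-season DH/PH player-seasons, 1998-2012:
# columns age, pa, luck (luck in runs per 500 PA from the method above).
dh = pd.read_csv("dh_seasons.csv")

buckets = pd.cut(dh.age, [27.5, 32.5, 37.5, 42.5], labels=["28-32", "33-37", "38-42"])
weighted = (dh.luck * dh.pa).groupby(buckets).sum() / dh.pa.groupby(buckets).sum()
print(weighted.round(1))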

----

UPDATE: This 2015 post by Jeff Zimmerman finds a similar result. Jeff found that designated hitters had a larger "penalty" for the season in cases where they normally played a fielding position, or when they spent some time on the DL.



Wednesday, August 14, 2019

Aggregate career year luck as evidence of PED use

Back in 2005, I came up with a method to try to estimate how lucky a player was in a given season (see my article in BRJ 34, here). I compared his performance to a weighted average of his two previous seasons and his two subsequent seasons, and attributed the difference to luck.

I'm working on improving that method, as I've been promising Chris Jaffe I would (for the last eight years or something). One thing I changed was that now, I use a player's entire career as the comparison set, instead of just four seasons. One reason I did that is that I realized that, the old way, a player's overall career luck was based almost completely on how well he did at the beginning and end of his career.

The method I used was to weight the four surrounding seasons in a ratio of 1/2/2/1. If the player didn't play all four of those years, the missing seasons just get left out.

So, suppose a batter played from 1981 to 1989. The sum of his luck wouldn't be zero:

(81 luck) = (81)                     - 2/3(82) - 1/3(83) 
(82 luck) = (82) - 2/5(81)           - 2/5(83) - 1/5(84) 
(83 luck) = (83) - 2/6(82) - 1/6(81) - 2/6(84) - 1/6(85) 
(84 luck) = (84) - 2/6(83) - 1/6(82) - 2/6(85) - 1/6(86) 
(85 luck) = (85) - 2/6(84) - 1/6(83) - 2/6(86) - 1/6(87) 
(86 luck) = (86) - 2/6(85) - 1/6(84) - 2/6(87) - 1/6(88) 
(87 luck) = (87) - 2/6(86) - 1/6(85) - 2/6(88) - 1/6(89)
(88 luck) = (88) - 2/5(87) - 1/5(86) - 2/5(89) 
(89 luck) = (89) - 2/3(88) - 1/3(87) 
---------------------------------------------------------
total luck = 13/30(81) - 1/6(82) - 7/30(83) - 1/30(84) - 1/30(86) - 7/30(87) - 1/6(88) + 13/30(89)

(*Year numbers not followed by the word "luck" refer to player performance level that year).

(Sorry about the small font.)
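
If you want to check that total, here's a short script that adds up the coefficients, assuming the 1/2/2/1 weights get renormalized over whichever of the four surrounding seasons exist:

from fractions import Fraction as F

career = list(range(1981, 1990))
coef = {y: F(0) for y in career}

for t in career:
    coef[t] += 1
    # 1/2/2/1 weights on t-2, t-1, t+1, t+2, dropping seasons outside the career
    w = {t - 2: F(1), t - 1: F(2), t + 1: F(2), t + 2: F(1)}
    w = {s: v for s, v in w.items() if s in coef}
    total = sum(w.values())
    for s, v in w.items():
        coef[s] -= v / total

print({y: str(c) for y, c in coef.items() if c != 0})
# {1981: '13/30', 1982: '-1/6', 1983: '-7/30', 1984: '-1/30',
#  1986: '-1/30', 1987: '-7/30', 1988: '-1/6', 1989: '13/30'}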

If a player has a good first year and a good last year, he'll score lucky. If he has good second, third, and fourth years, or second-last, third-last, and fourth-last years, he'll score unlucky. The years in the middle (in this case, 1985, but, for longer careers, any seasons other than the first four and last four) cancel out and don't affect the total.

Now, by comparing each year to the player's entire career, that problem is gone. Now, every player's luck will sum close to zero (before regressing to the mean).

It's not that big a deal, but it was still worth fixing.

--------

This meant I had to adjust for age. The old way, when a player was (say) 36, his estimate was based on his performance from age 34-38 ... reasonably close to 36. Although players decline from 34 to 38, I could probably assume that the decline from 34 to 36 was roughly equal to the decline from 36 to 38, so the age biases would cancel out.

But now, I'm comparing a 36-year-old player to his entire career ... say, from age 25 to 38. Now, we can't assume the 25-35 years, when the player was in his prime, cancel out the 37-38 years, when he's nowhere near the player he was.

---------

So ... I have to adjust for age. What adjustment should I use? I don't think there's an accepted aging scale. 

But ... I think I figured out how to calculate one.

Good luck should be exactly as prevalent as bad luck, by definition. That means that when I look at all players of any given age, the total luck should add up to zero.

So, I experimented with age adjustments until all ages had overall luck close to zero. It wasn't possible to get them to exactly zero, of course, but I got them close.

From age 20 to 36, for both batting and pitching, no single age was lucky or unlucky more than half a run per 500 PA. Outside of that range, there were sample size issues, but that's OK, because if the sample is small enough, you wouldn't expect them close to zero anyway.
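
Here's a very rough sketch of how that trial-and-error could be automated. It simplifies things a lot -- in particular, the real method compares each season to the rest of the player's career with the adjustments applied on both sides, which this ignores -- and the file and column names are invented:

import pandas as pd

# Hypothetical player-season table: age, pa, actual (RC per 500 PA), and
# baseline (rest-of-career rate, already adjusted for league and park).
df = pd.read_csv("player_seasons.csv")

ages = sorted(df.age.unique())
adj = pd.Series(1.0, index=ages)        # multiplicative age adjustments

for _ in range(100):
    expected = df.baseline * adj.reindex(df.age).values
    luck = df.actual - expected
    # PA-weighted average luck at each age
    avg = (luck * df.pa).groupby(df.age).sum() / df.pa.groupby(df.age).sum()
    # Nudge each age's adjustment up (down) if that age looks lucky (unlucky),
    # damped so the loop settles down instead of oscillating.
    adj = adj * (1 + 0.5 * avg / expected.groupby(df.age).mean())

print(adj.round(3))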

---------

Anyway, it occurred to me: maybe this is an empirical way to figure out how players age! Even if my "luck" method isn't perfect, as long as it's imperfect roughly the same way for various ages, the differences should cancel out. 

As I said, I'm still fine-tuning the adjustments, but, for what it's worth, here's what I have for age adjustments for batting, from 1950 to 2016, denominated in Runs Created per 500 PA:

      age(1-17) = 0.7
        age(18) = 0.74
        age(19) = 0.75
        age(20) = 0.775
        age(21) = 0.81
        age(22) = 0.84
        age(23) = 0.86
        age(24) = 0.89
        age(25) = 0.9
        age(26) = 0.925
        age(27) = 0.925
        age(28) = 0.925
        age(29) = 0.925
        age(30) = 0.91
        age(31) = 0.8975
        age(32) = 0.8775
        age(33) = 0.8625
        age(34) = 0.8425
        age(35) = 0.8325
        age(36) = 0.8225
        age(37) = 0.8025
        age(38) = 0.7925
     age(39-42) = 0.7
       age(43+) = 0.65

These numbers only make sense relative to each other. For instance, players created 11 percent more runs per PA at age 24 than they did at age 37 (.89 divided by .8025 equals 1.11).

(*Except ... there might be an issue with that. It's kind of subtle, but here goes.

The "24" number is based on players at age 24 compared to the rest of their careers. The "37" number is based on players at age 37 compared to the rest of their careers. It doesn't necessarily follow that the ratio is the same for those players who were active both at 24 and 37. 

If you don't see why: imagine that every active player had to retire at age 27, and was replaced by a 28-year-old who never played MLB before. Then, the 17-27 groups and the 28-43 groups would have no players in common, and the two sets of aging numbers would be mutually exclusive. (You could, for instance, triple all the numbers in one group, and everything would still work.)

In real life, there's definitely an overlap, but only a minority of players straddle both groups. So, you could have somewhat of the same situation here, I think.

I checked batters who were active at both 24 and 37, and had at least 1000 PA combined for those two seasons. On average, they showed lucky by +0.2 runs per 500 PA. 

That's fine ... but from 750 to 999 PA, there were 73 players, and they showed unlucky by -3.7 runs per 500 PA. 

You'd expect those players with fewer PA to have been unlucky, since if they were lucky, they'd have been given more playing time. (And players with more PA to have been lucky.)  But is 3.7 runs too big to be a natural effect? (And is the +0.2 runs too small?)

My gut says: maybe, by a run or two. Still, if this aging chart works for this selective sample within a couple of runs in 500 PA, that's still pretty good.

Anyway, I'm still thinking about this, and other issues.)

---------

In the process of experimenting with age adjustments, I found that aging patterns weren't constant over that 67-year period. 

For instance: for batters from 1960 to 1970, the peak ages from 27 to 31 all came out unlucky (by the standard of 1950-2015), while 22-26 and 32-34 were all lucky. That means the peak was lower that decade, which means more gentle aging. 

Still: the bias was around +1/-1 run of luck per 500 PA -- still pretty good, and maybe not enough to worry about.

---------

If the data lets us see different aging patterns for different eras, we should be able to use it to see the effects of PEDs, if any.

Here's luck per 500 PA by age group for hitters, 1995 to 2004 inclusive:

-1.75   age 17-22
-0.74   age 23-27
+0.61   age 28-32
+0.99   age 33-37
+0.45   age 38-42

That seems like it's in the range we'd expect given what we know, or think we know, about the prevalence of PEDs during that period. It's maybe 2/3 of a run better than normal for ages 28 to 42. If, say 20 percent of hitters in that group were using PEDs, that would be around 3 runs each. Is that plausible? 

Here's pitchers:

-1.22   age 17-22
-0.51   age 23-27
+1.36   age 28-32 
+1.42   age 33-37 
+1.07   age 38-42 

Now, that's pretty big (and statistically significant), all the way from 28 to 42: for a starter who faces 800 batters, it's about 2 runs. If 20 percent of pitchers are on PEDs, that's 10 runs each.

By checking the post-steroid era, we can check the opposing argument that it's not PEDs, it's just better conditioning, or some such. Here's pitchers again, but this time 2007-2013:

-0.06   age 17-22
+1.01   age 23-27
+0.30   age 28-32
-1.67   age 33-37
+0.59   age 38-42

Now, from 28 to 42, pitchers were *unlucky* on average, overall.

I'd say this is pretty good support for the idea that pitchers were aging better due to PEDs ... especially given actual knowledge and evidence that PED use was happening.






