Selective sampling and peak age
A couple of years ago, I reviewed a paper by J.C. Bradbury on aging in baseball. J.C. found that players peak offensively around age 29, rather than at age 27 as found in other studies.
I had critiqued the study on three points:
-- assuming symmetry;
-- selective sampling of long careers;
-- selective sampling of seasons.
In a blog post today, J.C. responds to my "assuming symmetry" critique. I had argued that if the aging curve in baseball has a long right tail, the median of the symmetrical best-fit curve would be at a higher age than the peak of the original curve. That would cause the estimate to be too high. But, today, J.C. says that he tried non-symmetrical curves, and he got roughly the same result.
So, I wondered, if the cause of the discrepancy isn't the poor fit of the quadratic, could selective sampling be a big enough factor? I ran a little experiment, and I think the answer is yes.
J.C. considered only players with long careers, spanning ages 24 to 35. It seems obvious that that would skew the observed peak higher than the actual peak. To see why, take an unrealistic extreme case. Suppose that half of players peak at exactly 16, and half peak at exactly 30. The average peak is 24. But what happens if you look only at players in the league continuously from age 24 to 35? Almost all those players are from the half who peak at 30, and almost none of those guys are the ones who peaked at 16. And so you observe a peak of 30, whereas the real average peak is 24.
As I said, that's an unrealistic case. But even in the real world, you'd expect early peakers to be less likely to survive until 35, so your sample is still skewed towards late peakers, and the estimate is still biased. Is the bias significant?
To test that, I did a little simulation experiment. I created a world where the average peak age is 27. I made two assumptions:
-- every player has his own personal peak age, which is normally distributed with mean 27 and variance 7.5 (for an SD of about 2.74).
-- for every year after his peak, a player has an additional 1/15 chance (about 6.7 percentage points) of dropping out of the league. So if a player peaks at 27, his chance of still being in the league at age 35 is 1 minus 8/15, since he's 8 years past his peak. That's 46.7%. If he peaks at 30, age 35 is only five years past his peak, so his chance would be 66.7% (which is 1 minus 5/15).
Then, I simulated 5,000 players. Results:
27.0 -- The average peak age for all players.
28.1 -- The average observed peak age of those players who survived until age 35.
The difference between the two is due entirely to selective sampling. So, with this model and these assumptions, J.C.'s algorithm overestimates the peak by 1.1 years.
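For readers who want to try it themselves, the whole simulation can be sketched in a few lines of Python. This is my own reconstruction from the two assumptions above, not the original code, but the survival rule is the one described: a player's chance of still being around at 35 is 1 minus 1/15 per year past his peak.

```python
import random

random.seed(1)

MEAN_PEAK = 27.0
SD_PEAK = 7.5 ** 0.5   # variance 7.5, so SD is about 2.74
DECAY = 1 / 15         # extra dropout chance per year past peak
N = 5000

def survives_to_35(peak):
    """Survival chance at 35: 1 minus DECAY per year past the peak."""
    years_past_peak = max(0.0, 35 - peak)
    p_survive = max(0.0, 1 - DECAY * years_past_peak)
    return random.random() < p_survive

peaks = [random.gauss(MEAN_PEAK, SD_PEAK) for _ in range(N)]
survivors = [p for p in peaks if survives_to_35(p)]

print(f"average peak, all players: {sum(peaks) / len(peaks):.1f}")
print(f"average peak, survivors:   {sum(survivors) / len(survivors):.1f}")
```

Running this, the all-player average lands near 27.0 and the survivor average near 28.1, give or take a little random variation from run to run.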
We can get even more extreme results if we change the assumptions. Suppose longevity decays by 1/13 per year instead of 1/15: then the average observed peak age is 28.5. If it decays by 1/12, we get 28.9. And if it decays by 1/10, the observed peak jumps to 30.9.
Of course, we can get less extreme results too: with a decay increment of only 1/20, we get an average of 27.6. And maybe the decay slows down as players get older, in which case the curve here is too steep near the end. Still, no matter how small the increment, the estimate will still be too high. The only question is: how much too high?
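To see the sensitivity for yourself, you can rerun the same kind of sketch over a range of decay increments. Again, this is my own reconstruction; the exact figures will vary with the random draws and with details of how the dropout rule is implemented, so don't expect them to match the numbers above to the decimal. The point is the direction: the faster the decay, the higher the observed peak.

```python
import random

random.seed(2)

MEAN_PEAK = 27.0
SD_PEAK = 7.5 ** 0.5   # variance 7.5, SD about 2.74
N = 5000

def observed_peak(decay, n=N):
    """Average peak age among simulated players who survive to age 35."""
    survivors = []
    for _ in range(n):
        peak = random.gauss(MEAN_PEAK, SD_PEAK)
        # survival chance drops by `decay` for each year past the peak
        p_survive = max(0.0, 1 - decay * max(0.0, 35 - peak))
        if random.random() < p_survive:
            survivors.append(peak)
    return sum(survivors) / len(survivors)

for denom in (20, 15, 13, 12, 10):
    print(f"decay 1/{denom}: observed peak {observed_peak(1 / denom):.1f}")
```

Each line of output should come in higher than the one before it, even though the true average peak is 27 in every case.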
I don't know. But given the results of this (admittedly oversimplified) simulation, it does seem like the bias could be as high as two years, which is the difference between J.C.'s study and others.
If we want to get an unbiased estimate for the peak for all players, not just the longest-lasting ones, I think we'll have to use a different method than tracking career curves.
UPDATE: Tango says it better than I did, here.