Monday, November 18, 2019

Why you can't calculate aging trajectories with a standard regression

I found myself in a little Twitter discussion last week about using regression to analyze player aging. I argued that regression won't give you accurate results, and that the less elegant "delta method" is the better way to go.

Although I did a small example to try to make my point, Tango suggested I do a bigger simulation and a blog post. That's this.

(Some details if you want:

For the kind of regression we're talking about, each season of a career is an input row. Suppose Damaso Griffin created 2 WAR at age 23, 2.5 WAR at age 24, and 3 WAR at age 25. And Alfredo Garcia created 1, 1.5, and 1.5 WAR at age 24, 25, and 26. The file would look like:

2    23  Damaso Griffin
2.5  24  Damaso Griffin
3    25  Damaso Griffin
1    24  Alfredo Garcia
1.5  25  Alfredo Garcia
1.5  26  Alfredo Garcia

And so on, for all the players and ages you're analyzing. (The names are there so you can have dummy variables for individual player skills.)

You take that file and run a regression, and you hope to get a curve that's "representative" or an "average" or a "consolidation" of how those players truly aged.)
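To make that concrete, here's a bare-bones sketch of what that regression looks like in code -- numpy least squares standing in for a real stats package, with the two hypothetical players above as the entire dataset:

```python
import numpy as np

# Hypothetical career rows, exactly as in the file above: (WAR, age, player)
rows = [
    (2.0, 23, "Damaso Griffin"),
    (2.5, 24, "Damaso Griffin"),
    (3.0, 25, "Damaso Griffin"),
    (1.0, 24, "Alfredo Garcia"),
    (1.5, 25, "Alfredo Garcia"),
    (1.5, 26, "Alfredo Garcia"),
]
players = sorted({r[2] for r in rows})

# Design matrix: age and age^2 (the shared aging curve), plus one dummy
# column per player to absorb each player's overall skill level.
X = np.array([[r[1], r[1] ** 2] + [1.0 if r[2] == p else 0.0 for p in players]
              for r in rows])
y = np.array([r[0] for r in rows])

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
b_age, b_age2 = coefs[0], coefs[1]  # fitted curve: b_age*age + b_age2*age^2 (+ player dummy)
```

The two age coefficients are what get read off as "the" aging curve.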


I simulated 200 player careers. I decided to use a quadratic (parabola), symmetric around peak age. I would have used just a linear regression, but I was worried that it might seem like the conclusions were the result of the model being too simple.

Mathematically, there are three parameters that define a parabola. For this application, they represent (a) peak age, (b) peak production (WAR), and (c) how steep or gentle the curve is.* 

(*The equation is: 

y = (x - peak age)^2 / -steepness + peak production. 

"Steepness" is related to how fast the player ages: higher steepness is higher decay. Assuming a player has a job only when his WAR is positive, his career length can be computed as twice the square root of (peak WAR * steepness). So, if steepness is 2 and peak WAR is 4, that's a 5.7 year career. If steepness is 6 and peak WAR is 7, that's a 13-year career.

You can also represent a parabola as y = ax^2+bx+c, but it's harder to get your head around what the coefficients mean. They're both the same thing ... you can use basic algebra to convert one into the other.)

For each player, I randomly gave him parameters from these distributions: (a) peak age normally distributed with mean 27 and SD 2; (b) peak WAR with mean 4 and SD 2; and (c) steepness (mean 2, SD 5; but if the result was less than 1.5, I threw it out and picked a new one).

I arbitrarily decided to throw out any careers of length three years or fewer, which reduced the sample from 200 players to 187. Also, I assumed nobody plays before age 18, no matter how good he is. I don't think either of those decisions made a difference.
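Here's roughly how the simulation goes -- a sketch in Python rather than my actual code, but the parameter draws match the distributions above:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

def draw_career(rng):
    """One player's parabola parameters, drawn per the distributions above."""
    peak_age = rng.normal(27, 2)
    peak_war = rng.normal(4, 2)
    steep = rng.normal(2, 5)
    while steep < 1.5:           # throw out and redraw low steepness
        steep = rng.normal(2, 5)
    return peak_age, peak_war, steep

def war(age, peak_age, peak_war, steep):
    return -(age - peak_age) ** 2 / steep + peak_war

careers = []
for _ in range(200):
    pk_age, pk_war, steep = draw_career(rng)
    # keep only seasons with positive WAR, and nobody plays before age 18
    seasons = [(a, war(a, pk_age, pk_war, steep)) for a in range(18, 50)
               if war(a, pk_age, pk_war, steep) > 0]
    if len(seasons) > 3:         # discard careers of three years or fewer
        careers.append(seasons)
```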

Here's the plot of all 187 aging curves on one graph:

The idea, now, is to consolidate the 187 curves into one representative curve. Intuitively, what are we expecting here? Probably, something like, the curve that belongs to the average player in the list.

The average random career turned out to be age 26.9, peak WAR 4.19, and steepness 5.36. Here's a curve that matches those parameters:

That seems like what we expect, when we ask a regression to find the best-fit curve. We want a "typical" aging trajectory. Eyeballing the graph, it does look pretty reasonable, although to my eye, it's just a bit small. Maybe half a year bigger left and right, and a bit higher? But close. Up to you ... feel free to draw on your monitor what you think it should look like.  

But when I ran the regression ... well, what came out wasn't close to my guess, and probably not close to your guess either:

It's much, much gentler than it should be. Even if your gut told you something different than the black curve, there's no way your gut was thinking this. The regression came up with a 19-year career. A career that long happened only once in the entire 187-player sample. We expected "representative," but the regression gave us the 99.5th percentile.

What happened?

It's the same old "selective sampling"/"survivorship bias" problem.

The simulation decided that when a player's curve scores below zero, those seasons aren't included. It makes sense to code the simulation that way, to match real life. If Jerry Remy had played five years longer than he did, what would his WAR be at age 36? We have no idea.

But, with this simulation, we have a God's-eye view of how negative every player would go. So, let's include that in the plot, down to -20:

See what's happening? The black curve is based on *all* the green data, both above and below zero, and it lands in the middle. The red curve is based only on the green data above zero, so it ignores all the green negatives at the extremes.

If you like, think of the green lines as magnets, pulling the lines towards them. The green magnets bottom-left and bottom-right pull the black curve down and make it steeper. But only the green magnets above zero affect the red line, so it's much less steep.

In fact, if you scroll back up to the other graph, the one that's above zero only, you'll see that at almost every vertical age, the red line bisects the green forest -- there's about as much green magnetism above the red line as there is below it.

In other words: survivorship bias is causing the difference.
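You can watch the bias happen in a few lines of code. This sketch (my assumptions, not the original simulation -- e.g., it crudely clips steepness instead of redrawing it) fits one quadratic to all the God's-eye data, and another to the surviving, above-zero seasons only:

```python
import numpy as np

rng = np.random.default_rng(1)

ages_all, wars_all = [], []   # God's-eye view: every season, even negative
ages_srv, wars_srv = [], []   # what we actually observe: WAR > 0 only
for _ in range(200):
    peak_age = rng.normal(27, 2)
    peak_war = rng.normal(4, 2)
    steep = max(rng.normal(2, 5), 1.5)   # crude stand-in for the redraw rule
    for age in range(18, 45):
        w = -(age - peak_age) ** 2 / steep + peak_war
        ages_all.append(age); wars_all.append(w)
        if w > 0:
            ages_srv.append(age); wars_srv.append(w)

a_full = np.polyfit(ages_all, wars_all, 2)[0]  # leading (curvature) coefficient
a_surv = np.polyfit(ages_srv, wars_srv, 2)[0]

# a_full is much more negative: the full data produces a steep parabola,
# while the survivors-only fit is the gentle curve the regression gave us.
```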


What's really going on is that the regression is falling for the same classic fallacy we've been warning against for the past 30 years! It's comparing players active (above zero) at age 27 to players active (above zero) at age 35, and it doesn't find much difference. But that's because the two sets of players aren't the same. 

One more thing to make the point clearer. 

Let's suppose you find every player active last year at age 27, and average their performance (per 500PA, or whatever). And then you find every player active last year at age 35, and average their performance.

And you find there's not much difference. And you conclude, hey, players age gracefully! There's hardly any dropoff from age 27 to age 35!

Well, that's the fallacy saberists have been warning against for 30 years, right? The canonical (correct) explanation goes something like this:

"The problem with that logic is that it doesn't actually measure aging, because those two sets of players aren't the same. The players who are able to still be active at 35 are the superstars. The players who were able to be active at 27 are ... almost all of them. All this shows is that superstars at 35 are almost as good as the league average at 27. It doesn't actually tell us how players age."

Well, that logic is *exactly* what the regression is doing. It's calculating the average performance at every age, and drawing a parabola to join them. 

Here's one last graph. I've included the "average at each age" line (blue) calculated from my random data. It's almost a perfect match to the (red) regression line.


Bottom line: all the aging regression does is commit the same classic fallacy we repeatedly warn about. It just winds up hiding it -- by complicating, formalizing, and blackboxing what's really going on. 


Sunday, October 13, 2019

A study on NBA home court advantage

Economist Tyler Cowen often links to NBA studies in his "Marginal Revolution" blog ... here's a recent one, from an August post. (Follow his link to download the study ... you can also find a press release by Googling the title.)

The study used a neural network to try to figure out what factors are most important for home (court) advantage (which I'll call "HCA"). The best fit model used twelve variables: two-point shots made, three-point shots made, and free throws made -- repeated for team at home, opposition on road, team on road, and opposition at home.

The authors write, 

"Networks that include shot attempts, shooting percentage, total points scored, field goals, attendance statistics, elevation and market size as predictors added no improvement in performance. ...

"Contrary to previous work, attendance, elevation and market size were not relevant to understanding home advantage, nor were shot attempts, shooting percentage, overall W-L%, and total points scored."

On reflection, it's not surprising that those other variables don't add anything ... the ones they used, shots made, are enough to actually compute points scored and allowed. Once you have that, what does it matter what the attendance was? If attendance matters at all, it would affect wins through points scored and allowed, not something independent of scoring. And "total points scored" weren't "relevant" because they were redundant, given shots made.


The study then proceeds to a "sensitivity analysis," where they increase the various factors, separately, to see what happens to HCA. It turns out that when you increase two-point shots made by 10 percent, you get three to four times the impact on HCA compared to when you increase three-point shots made by the same 10 percent.

The authors write,

"[This] suggests teams can maximize their advantage -- and hence their odds of winning -- by employing different shot selection strategies when home versus away. When playing at home, teams can maximize their advantage by shooting more 2P and forcing opponents to take more 2P shots. When playing away, teams can minimize an opponent's home advantage by shooting more 3P and forcing opponents to take more 3P shots."

Well, yes, but, at the same time, no. 

The reason increasing 2P by 10 percent leads to a bigger effect than increasing 3P by 10 percent is ... that 10 percent of 2P is a lot more points! Eyeball the graph of "late era" seasons the authors used (I assume it's the sixteen seasons ending with 2015-16). Per team-season, it looks like the average is maybe 2500 two-point shots made, but only 500 three-point shots.

Adding 10 percent more 2P is 250 shots for 500 points. Adding 10 percent more 3P is 50 shots for 150 points. 500 divided by 150 gives a factor of three-and-a-third -- almost exactly what the paper shows!
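The arithmetic is simple enough to do on the back of an envelope:

```python
two_pt_made = 2500    # eyeballed average 2P makes per team-season, late era
three_pt_made = 500   # eyeballed average 3P makes per team-season

extra_pts_2p = 0.10 * two_pt_made * 2    # 250 extra makes -> 500 points
extra_pts_3p = 0.10 * three_pt_made * 3  #  50 extra makes -> 150 points

ratio = extra_pts_2p / extra_pts_3p      # 3.33: the "three to four times"
```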

I'd argue that what the study discovered is that points seem to affect HCA and winning percentage equally, regardless of how they are scored. 


Even so, the argument in the paper doesn't work. By the authors' own choice of variables, HCA is increased by *making* 2P shots, not by *taking* 2P shots. Rephrasing the above quote, what the study really shows is,

"When playing at home, teams can maximize their advantage by concentrating on *making* more 2P and on forcing opponents to *miss* more 2P. That's assuming it's just as easy to impact 2P percentages by 10 percent as it is to impact 3P percentages by 10 percent."

But we could have figured that out easily, just by noticing that 10 percent of 2P is more points than 10 percent of 3P.


The authors found that you increase your HCA more with a 10 percent increase in road three-pointers than by a 10 percent increase in road two-pointers. 

Sure. But that's because, with the 3P, you actually wind up scoring fewer road points. Which means you win fewer road games. Which makes your HCA larger, since winning fewer road games increases the difference between home and road. 

It's because the worse you do on the road, the bigger your home court advantage!

Needless to say, you don't really want to increase your HCA by tanking road games. The authors didn't notice that's what they were suggesting.

I think the issue is that the paper assumes that increasing your HCA is always a good thing. It's not. It's actually neutral. The object isn't to increase or decrease your HCA. It's to *win more games*. You can do that by winning more games at home, increasing your home court advantage, or by winning more games on the road, decreasing your home court advantage.

It's one of those word biases we all have if we don't think too hard. "Increasing your advantage" sounds like something we should strive for. The problem is, in this context, the home "advantage" is relative to *your own performance* on the road. So it really isn't an "advantage," in the sense of something that makes you more likely to beat the other team. 

In fact, if you rotate "Home Court Advantage" 360 degrees and call it "Road Court Disadvantage," now it feels like you want to *decrease* it -- even though it's exactly the same number!

But HCA isn't something you should want to increase or decrease for its own sake. It's just a description of how your wins are distributed.


Friday, September 06, 2019

Evidence confirming the DH "penalty"

In "The Book," Tango/Lichtman/Dolphin found that batters perform significantly worse when they play a game as DH than when they play a fielding position. Lichtman (MGL) later followed up with detailed results -- a difference of about 14 points of wOBA. That translates to about 6 runs per 500 PA.

A side effect of my new "luck" database is that I'm able to confirm MGL's result in a different way.

The way my luck algorithm works: it tries to "predict" a player's season by averaging the rest of his career -- before and after -- while adjusting for league, park, and age. Any difference between actual and predicted I ascribe to luck.

I calibrated the algorithm so the overall average luck, over thousands of player-seasons, works out to zero. For most breakdowns -- third basemen, say, or players whose first names start with "M" -- average luck stays close to zero. But, for seasons where the batter was exclusively a DH, the average luck worked out negative -- an average of -3.8 runs per 500 PA.  I'll round that to 4.

-6 R/500PA  MGL
-4 R/500PA  Phil

My results are smaller than what MGL found, but that's probably because we used different methods. I considered only players who never played in the field that year. MGL's study also included the DH games of players who did play fielding positions. 

(My method also included PH who never fielded that year. I made sure to cover the same set of seasons as MGL -- 1998 to 2012.)

MGL's study would have included players who were DHing temporarily because they were recovering from injury, and I'm guessing that's the reason for my missing 2 runs.

But, what about the 4 runs we have in common? What's going on there? Some possibilities:

1. Injury. Maybe when players spend a season DHing, they're more likely to be recovering from some longer-term problem, which also winds up impacting their hitting.

2. It's harder to bat as a DH than when playing a position. As "The Book" suggests, maybe "there is something about spending two hours sitting on the bench that hinders a player's ability to make good contact with a pitch."

3. Selective sampling. Most designated hitters played a fielding position at some time earlier in their careers. The fact that they are no longer doing so suggests that their fielding ability has declined. Whatever aspect of aging caused the fielding decline may also have affected their batting. In that case, looking at DHs might be selectively choosing players who show evidence of having aged worse than expected.

4. Something else I haven't thought of.

You could probably get a better answer by looking at the data a little closer. 

For the "harder to DH" hypothesis, you could isolate PA from the top of the first inning, when all hitters are on equal footing with the DH, since the road team hasn't been out on defense yet. And, for the "injury" hypothesis, you could maybe check batters who had DH seasons in the middle of their careers, rather than the end, and check if those came out especially unlucky. 

One test I was able to do is a breakdown of the full-season designated hitters by age:

Age     R/500PA   sample size
28-32    -13.7     2,316 PA
33-37    - 6.4     4,305 PA
38-42    + 1.4     6,245 PA

(I've left out the age groups with too few PA to be meaningful.)

Young DHs underperform, and older DHs overperform. I think that's suggestive more of the injury and selective-sampling explanations than of the "it's hard to DH" hypothesis. 


UPDATE: This 2015 post by Jeff Zimmerman finds a similar result. Jeff found that designated hitters had a larger "penalty" for the season in cases where they normally played a fielding position, or when they spent some time on the DL.


Wednesday, August 14, 2019

Aggregate career year luck as evidence of PED use

Back in 2005, I came up with a method to try to estimate how lucky a player was in a given season (see my article in BRJ 34, here). I compared his performance to a weighted average of his two previous seasons and his two subsequent seasons, and attributed the difference to luck.

I'm working on improving that method, as I've been promising Chris Jaffe I would (for the last eight years or something). One thing I changed was that now, I use a player's entire career as the comparison set, instead of just four seasons. One reason I did that is that I realized that, the old way, a player's overall career luck was based almost completely on how well he did at the beginning and end of his career.

The method I used was to weight the four surrounding seasons in a ratio of 1/2/2/1. If the player didn't play all four of those years, the missing seasons just get left out.

So, suppose a batter played from 1981 to 1989. The sum of his luck wouldn't be zero:

(81 luck) = (81)                     - 2/3(82) - 1/3(83) 
(82 luck) = (82) - 2/5(81)           - 2/5(83) - 1/5(84) 
(83 luck) = (83) - 2/6(82) - 1/6(81) - 2/6(84) - 1/6(85) 
(84 luck) = (84) - 2/6(83) - 1/6(82) - 2/6(85) - 1/6(86) 
(85 luck) = (85) - 2/6(84) - 1/6(83) - 2/6(86) - 1/6(87) 
(86 luck) = (86) - 2/6(85) - 1/6(84) - 2/6(87) - 1/6(88) 
(87 luck) = (87) - 2/6(86) - 1/6(85) - 2/6(88) - 1/6(89)
(88 luck) = (88) - 2/5(87) - 1/5(86) - 2/5(89) 
(89 luck) = (89) - 2/3(88) - 1/3(87) 
total luck = 13/30(81) - 1/6(82) - 7/30(83) - 1/30(84) - 1/30(86) - 7/30(87) - 1/6(88) + 13/30(89)

(*Year numbers not followed by the word "luck" refer to player performance level that year).

(Sorry about the small font.)

If a player has a good first and last year, he'll score lucky. If he has a good second, third, or fourth year -- or second-last, third-last, or fourth-last -- he'll score unlucky. The years in the middle (in this case, 1985, but, for longer careers, any seasons other than the first four and last four) cancel out and don't affect the total.
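Coefficient sums like these are easy to botch by hand, but they're mechanical to recompute. This snippet rebuilds the total-luck coefficients for a nine-year career straight from the 1/2/2/1 weighting:

```python
import numpy as np

years = 9                      # the 1981-1989 career above
total = np.zeros(years)        # each season's coefficient in the summed luck

for i in range(years):
    total[i] += 1.0            # the season's own performance
    # surrounding seasons at offsets -2,-1,+1,+2, raw weights 1/2/2/1,
    # renormalized over whichever of those seasons actually exist
    nbrs = [(i + off, w) for off, w in ((-2, 1), (-1, 2), (1, 2), (2, 1))
            if 0 <= i + off < years]
    wsum = sum(w for _, w in nbrs)
    for j, w in nbrs:
        total[j] -= w / wsum

print(np.round(total * 30))    # coefficients in 30ths; they sum to zero
```

The middle season comes out exactly zero, and a uniformly average career nets zero total luck, as it should.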

Now, by comparing each year to the player's entire career, that problem is gone. Now, every player's luck will sum close to zero (before regressing to the mean).

It's not that big a deal, but it was still worth fixing.


This meant I had to adjust for age. The old way, when a player was (say) 36, his estimate was based on his performance from age 34-38 ... reasonably close to 36. Although players decline from 34 to 38, I could probably assume that the decline from 34 to 36 was roughly equal to the decline from 36 to 38, so the age biases would cancel out.

But now, I'm comparing a 36-year-old player to his entire career ... say, from age 25 to 38. Now, we can't assume the 25-35 years, when the player was in his prime, cancel out the 37-38 years, when he's nowhere near the player he was.


So ... I have to adjust for age. What adjustment should I use? I don't think there's an accepted aging scale. 

But ... I think I figured out how to calculate one.

Good luck should be exactly as prevalent as bad luck, by definition. That means that when I look at all players of any given age, the total luck should add up to zero.

So, I experimented with age adjustments until all ages had overall luck close to zero. It wasn't possible to get them to exactly zero, of course, but I got them close.

From age 20 to 36, for both batting and pitching, no single age came out lucky or unlucky by more than half a run per 500 PA. Outside of that range, there were sample size issues, but that's OK: with samples that small, you wouldn't expect the averages to land close to zero anyway.
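Here's the flavor of that calibration, on toy data I made up. A hidden aging curve generates the observations, and repeatedly nudging each age's adjustment -- until its average "luck" is zero -- recovers the curve's shape. (The real version compares each season to the rest of the player's career; this sketch just compares to the pool average.)

```python
import numpy as np

rng = np.random.default_rng(2)

ages = np.arange(20, 37)
true_curve = 1 - 0.004 * (ages - 27) ** 2   # hidden aging pattern, peak at 27

# 5,000 toy player-seasons: observed rate = talent * age factor + noise
age_ix = rng.integers(len(ages), size=5000)
talent = rng.normal(5.0, 1.0, size=5000)
obs = talent * true_curve[age_ix] + rng.normal(0, 0.5, size=5000)

adj = np.ones(len(ages))                    # age adjustments, to be fitted
for _ in range(50):
    adjusted = obs / adj[age_ix]
    grand_mean = adjusted.mean()
    for i in range(len(ages)):
        # nudge each age until its average "luck" is roughly zero
        adj[i] *= adjusted[age_ix == i].mean() / grand_mean

adj /= adj.max()   # normalized adjustments; shape should match true_curve
```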


Anyway, it occurred to me: maybe this is an empirical way to figure out how players age! Even if my "luck" method isn't perfect, as long as it's imperfect roughly the same way for various ages, the differences should cancel out. 

As I said, I'm still fine-tuning the adjustments, but, for what it's worth, here's what I have for age adjustments for batting, from 1950 to 2016, denominated in Runs Created per 500 PA:

      age(1-17) = 0.7
        age(18) = 0.74
        age(19) = 0.75
        age(20) = 0.775
        age(21) = 0.81
        age(22) = 0.84
        age(23) = 0.86
        age(24) = 0.89
        age(25) = 0.9
        age(26) = 0.925
        age(27) = 0.925
        age(28) = 0.925
        age(29) = 0.925
        age(30) = 0.91
        age(31) = 0.8975
        age(32) = 0.8775
        age(33) = 0.8625
        age(34) = 0.8425
        age(35) = 0.8325
        age(36) = 0.8225
        age(37) = 0.8025
        age(38) = 0.7925
     age(39-42) = 0.7
       age(43+) = 0.65

These numbers only make sense relative to each other. For instance, players created 11 percent more runs per PA at age 24 than they did at age 37 (.89 divided by .8025 equals 1.11).

(*Except ... there might be an issue with that. It's kind of subtle, but here goes.

The "24" number is based on players at age 24 compared to the rest of their careers. The "37" number is based on players at age 37 compared to the rest of their careers. It doesn't necessarily follow that the ratio is the same for those players who were active both at 24 and 37. 

If you don't see why: imagine that every active player had to retire at age 27, and was replaced by a 28-year-old who never played MLB before. Then, the 17-27 groups and the 28-43 groups would have no players in common, and the two sets of aging numbers would be mutually exclusive. (You could, for instance, triple all the numbers in one group, and everything would still work.)

In real life, there's definitely an overlap, but only a minority of players straddle both groups. So, you could have somewhat of the same situation here, I think.

I checked batters who were active at both 24 and 37, and had at least 1000 PA combined for those two seasons. On average, they showed lucky by +0.2 runs per 500 PA. 

That's fine ... but from 750 to 999 PA, there were 73 players, and they showed unlucky by -3.7 runs per 500 PA. 

You'd expect those players with fewer PA to have been unlucky, since if they were lucky, they'd have been given more playing time. (And players with more PA to have been lucky.)  But is 3.7 runs too big to be a natural effect? (And is the +0.2 runs too small?)

My gut says: maybe, by a run or two. Still, if this aging chart works for this selective sample within a couple of runs in 500 PA, that's still pretty good.

Anyway, I'm still thinking about this, and other issues.)


In the process of experimenting with age adjustments, I found that aging patterns weren't constant over that 67-year period. 

For instance: for batters from 1960 to 1970, the peak ages from 27 to 31 all came out unlucky (by the standard of 1950-2015), while 22-26 and 32-34 were all lucky. That means the peak was lower that decade, which means more gentle aging. 

Still: the bias was around +1/-1 run of luck per 500 PA -- still pretty good, and maybe not enough to worry about.


If the data lets us see different aging patterns for different eras, we should be able to use it to see the effects of PEDs, if any.

Here's luck per 500 PA by age group for hitters, 1995 to 2004 inclusive:

-1.75   age 17-22
-0.74   age 23-27
+0.61   age 28-32
+0.99   age 33-37
+0.45   age 38-42

That seems like it's in the range we'd expect given what we know, or think we know, about the prevalence of PEDs during that period. It's maybe 2/3 of a run better than normal for ages 28 to 42. If, say, 20 percent of hitters in that group were using PEDs, that would be around 3 runs each. Is that plausible? 

Here's pitchers:

-1.22   age 17-22
-0.51   age 23-27
+1.36   age 28-32 
+1.42   age 33-37 
+1.07   age 38-42 

Now, that's pretty big (and statistically significant), all the way from 28 to 42: for a starter who faces 800 batters, it's about 2 runs. If 20 percent of pitchers are on PEDs, that's 10 runs each.

By checking the post-steroid era, we can check the opposing argument that it's not PEDs, it's just better conditioning, or some such. Here's pitchers again, but this time 2007-2013:

-0.06   age 17-22
+1.01   age 23-27
+0.30   age 28-32
-1.67   age 33-37
+0.59   age 38-42

Now, from 28 to 42, pitchers were *unlucky* on average, overall.

I'd say this is pretty good support for the idea that pitchers were aging better due to PEDs ... especially given actual knowledge and evidence that PED use was happening.


Tuesday, March 26, 2019

True talent levels for individual players

(Note: Technical post about practical methods to figure MLB distribution of player talent and regression to the mean.)


For a long time, we've been using the "Palmer/Tango" method for estimating the spread of talent among MLB teams. You're probably sick of seeing it, but I'll run it again real quick for 2013:

1. Find the SD of observed team winning percentage from the standings. In 2013, SD(observed) was 0.0754.

2. Calculate the theoretical SD of luck in a team-season. Statistical theory tells us the formula is the square root of p(1-p)/162, where p is the probability of winning. Assuming teams aren't that far from .500, SD(luck) works out to around 0.039.

3. Since luck is independent of talent, we can say that SD(observed)^2 = SD(luck)^2 + SD(talent)^2 . Substituting the numbers gives our estimate that SD(talent) = 0.0643. 
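In code, the whole team-level method is three lines:

```python
import math

sd_observed = 0.0754                  # SD of 2013 team winning percentages
sd_luck = math.sqrt(0.5 * 0.5 / 162)  # binomial luck SD for ~.500 teams, 162 games

# talent variance is whatever's left after subtracting luck variance
sd_talent = math.sqrt(sd_observed ** 2 - sd_luck ** 2)
# sd_luck ~ 0.039, sd_talent ~ 0.064
```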

That works great for teams. But what about players? What's the spread of talent, in, say, on-base percentage, for individual hitters?

It would be great to use the same method, but there's a problem. Unlike team-seasons, where every team plays 162 games, every player bats a different number of times. Sure, we can calculate SD(luck) for each hitter individually, based on his playing time, but then how do we combine them all into one aggregate "SD(luck)" for step 3? 

Can we use the average number of plate appearances? I don't think that would work, actually, because the SD isn't linear. It's inversely proportional to the square root of PA, but even if we used the average of that, I still don't think it would work.

Another possibility is to consider only batters with close to some arbitrary number of plate appearances. For instance, we could just take players in the range 480-520 PA, and treat them as if they all had 500 PA. That would give a reasonable approximation.

But, that would only help us find talent for batters who make it to 500 PA. Those batters are generally the best in baseball, so the range we find will be much too narrow. Also, batters who do make it to 500 PA are probably somewhat lucky (if they started off 15-for-100, say, they probably wouldn't have been allowed to get to 500). That means our theoretical formula for binomial luck probably wouldn't hold for this sample.

So, what do we do?

I don't think there's an easy way to figure that out. Unless Tango already has a way ... maybe I've missed something and reinvented the wheel here, because after thinking about it for a while, I came up with a more complicated method. 

The thing is, we still need to have all hitters have the same number of PA. 

We take the batter with the lowest playing time, and use that. It might be 1 PA. In that case, for all the hitters who have more than 1 PA, we reduce them down to 1 PA. Now that they're all equal, we can go ahead and run the usual method. 

Well, actually, that's a bit of an exaggeration ... 1 PA doesn't work. It's too small, for reasons I'll explain later. But 20 PA does seem to work OK. So, we reduce all batters down to 20 PA.*  

*The only problem is, we'll only be finding the talent range for the subset of batters who are good (or lucky) enough to make it to 20 plate appearances. That should be reasonable enough for most practical purposes, though.  

How do we take a player with 600 PA, and reduce his batting line to 20 PA? We can't just scale down. Proportionally, there's much less randomness in large samples than small, so if we treated a player's 20 PA as an exact replica of his performance in 600 PA, we'd wind up with the "wrong" amount of luck compared to what the formulas expect, and we'd get the wrong answer.

So, what I did was: I took a random sample of 20 PA from every batter's batting line, sampling "without replacement" (which means not using the same plate appearance twice). 

Once that's done, and every hitter is down to 20 PA, we can just go ahead and use the standard method. Here it is for 2013:

1. There were 602 non-pitchers in the sample. The SD of the 602 observed batter OBP values (based on 20 PA per player) was 0.1067.

2. Those batters had an aggregate OBP of .2944. The theoretical SD(luck) in 20 PA with a .2944 expectation is 0.1019.

3. The square root of (0.1067 squared minus 0.1019 squared) equals 0.0317.

So, our estimate of SD(talent) = 0.0317. 

That implies that 95% of batters range between .247 and .373. Seems pretty reasonable.
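Here's a sketch of the downsampling method on synthetic batting lines (my toy data, not the real 2013 sample). Each batter's season gets cut to 20 PA without replacement, then the usual observed/luck decomposition is applied:

```python
import numpy as np

rng = np.random.default_rng(3)

def sd_talent_estimate(batters, n=20):
    """batters: list of (times_on_base, pa). Downsample each line to n PA
    without replacement, then use observed^2 = luck^2 + talent^2."""
    obps = []
    for on_base, pa in batters:
        outcomes = np.array([1] * on_base + [0] * (pa - on_base))
        obps.append(rng.choice(outcomes, size=n, replace=False).mean())
    obps = np.array(obps)
    p = obps.mean()                 # aggregate OBP of the downsampled lines
    var_luck = p * (1 - p) / n      # theoretical binomial luck variance in n PA
    var_talent = obps.var() - var_luck
    if var_talent < 0:
        return None                 # the "imaginary" case discussed below
    return np.sqrt(var_talent)

# toy check: 300 batters whose true OBP talent has SD around .032
talents = rng.normal(0.32, 0.032, size=300).clip(0.2, 0.45)
batters = [(int(rng.binomial(600, t)), 600) for t in talents]
estimate = sd_talent_estimate(batters)  # one random run; jumps around a lot
```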


I think this method actually works quite decently. One issue, though, is that it includes a lot of randomness. All the regulars with 500 or 600 plate appearances ... we just randomly pick 20, and ignore the rest. The result is sensitive to which random numbers are pulled. 

How sensitive? To give you an idea, here are the results of 10 different random runs:


I should explain the "imaginary" one. That happens when, just by random chance, SD(observed) is smaller than the expected SD(luck). It's more frequent when the sample size is so small -- say, 20 PA -- that luck is much larger than talent. 

In our original run, SD(observed) was 0.1067 and SD(luck) was 0.1019. Those are pretty close to each other. It doesn't take much random fluctuation to reverse their order ... in the "imaginary" run, the numbers were 0.1021 and 0.1022, respectively.

More generally, when SD(observed) and SD(luck) are so close, SD(talent) is very sensitive to small random changes in SD(observed). And so the estimates jump around a lot.

(And that's the reason I used the 20 PA minimum. With a sample size of 1 PA, there would be too much distortion from the lack of symmetry. I think. Still investigating.)

The obvious thing to do is just do a whole bunch of random runs, and take the average. That doesn't quite work, though. One problem is that you can't average the imaginary numbers that sometimes come up. Another problem -- actually, the same problem -- is that the errors aren't symmetrical. A negative random error decreases the estimate more than a positive random error increases the estimate. 

To help get around that, I didn't average the 500 estimates in the list. Instead, I averaged the 500 values of SD(observed), and 500 estimates of SD(luck). Then, I calculated SD(talent) from those.
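That is: average the two SDs across runs, then take the difference of squares once at the end. Schematically, with made-up per-run numbers standing in for the real simulation output:

```python
import numpy as np

rng = np.random.default_rng(4)

# stand-ins for 500 runs' worth of SD(observed) and SD(luck); in the real
# procedure these come from 500 independent 20-PA downsamplings
sd_obs_runs = 0.1067 + rng.normal(0, 0.003, size=500)
sd_luck_runs = 0.1019 + rng.normal(0, 0.001, size=500)

sd_obs = sd_obs_runs.mean()       # average the SDs across runs first...
sd_luck = sd_luck_runs.mean()
sd_talent = np.sqrt(sd_obs ** 2 - sd_luck ** 2)  # ...then subtract squares once
```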

The result:

SD(talent) = 0.0356

Even with this method, I suspect the estimate is still a bit off. I'm thinking about ways to improve it. I still think it's decent enough, though.


So, now we have our estimate that for 2013, SD(talent)=0.0356. 

The next step: estimating a batter's true talent based on his observed OBP.

We know, from Tango, that we can estimate any player's talent by regressing to the mean -- specifically, "diluting" his batting line by adding a certain number of PA of average performance. 

How many PA do we need to add? As Tango showed, it's the number that makes SD(luck) equal to SD(talent). 

In the 500 simulations, SD(luck) averaged 0.1023 in 20 PA. To get luck down to 0.0356, where it would equal SD(talent), we'd need 166 PA. (That's 20 multiplied by the square of (0.1023 / 0.0356)). I'll just repeat that for reference:

Regress by 166 PA
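For reference, the arithmetic behind that number:

```python
sd_luck_20 = 0.1023   # average SD(luck) in 20 PA across the 500 runs
sd_talent = 0.0356
base_pa = 20

# luck variance scales as 1/PA, so solve: sd_luck_20 * sqrt(20/PA) = sd_talent
pa_to_add = base_pa * (sd_luck_20 / sd_talent) ** 2   # about 165, i.e. ~166
```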

A value of 166 PA seems reasonable. To check, I ran every season from 1950 to 2016, and 166 was right in line. 

The average of the 67 seasons was 183 PA. The highest was 244 PA (1981); the lowest was 108 PA (1993).  


Now we know we need to add 166 PA of average performance to a batting line to go from observed performance to estimated talent. But what, exactly, is "average performance"?

There are at least four different possibilities:

1. Regress to the observed real-life OBP. In MLB in 2013, for non-pitchers with at least 20 PA, that was .3186. 

2. Regress to the observed real-life OBP weighting every batter equally. That works out to .2984. (It's smaller than the actual MLB number because, in real life, worse hitters get fewer-than-equal PA.)

3. Regress to the average *talent*, weighted by real-life PA.

4. Regress to the average *talent*, weighting every batter equally.

Which one is correct? I had never actually thought about the question before. That's because I had only ever used this method on team talent, and, for teams, all four averages are .500. Here, they're all different. 

I won't try to explain why, but I think the correct answer is number 4. We want to regress to the average talent of the players in the sample.

Except ... now we have a Catch-22. 

To regress performance to the mean, we need to know the league's average talent. But to know the league's average talent, we need to regress performance to the mean!

What's the way out of this? It took me a while, but I think I have a solution.

The Tango method has an implicit assumption that -- while some players may have been lucky in 2013, and some unlucky -- overall, luck evened out. Which means, the observed OBP in MLB in 2013 is exactly equal to the expected OBP based on player talent.

Since the actual OBP was .3186, it must be that the expected OBP, based on player talent, is also .3186. That is: if we regress every player towards X by 166 PA, the overall league OBP has to stay .3186. 

What value of X makes that happen?

I don't think there can be an easy formula for X, because it depends on the distribution of playing time -- most importantly, how much more playing time the good hitters got that year compared to the bad hitters.

So I had to figure it out by trial and error. The answer:

Mean of player talent = .30995

(If you want to check that yourself, just regress every player's OBP while keeping PA constant, and verify that the overall average (weighted by PA) remains the same. Here's the SQL I used for that:

SELECT
  sum(h+bb)/sum(ab+bb) AS actual, 
  sum((h+bb+.30995*166)/(ab+bb+166)*(ab+bb)) / sum(ab+bb) AS regressed 
FROM batting
WHERE yearid=2013 AND ab+bb>=20 AND primarypos <> "P"
The idea is that "actual" and "regressed" should come out equal.

The "primarypos" column is one I created and populated myself, but the rest should work right from the Lahman database. You can leave out the "primarypos" and just use all hitters with 20+ PA. You'll probably find that it'll be something lower than .30995 that makes it work, since including pitchers brings down the average talent.  Also, with a different population of talent, the correct number of PA to regress should be something other than 166 -- probably a little lower? -- but 166 is probably close.

While I'm here ... I should have said earlier that I used only walks, AB, and hits in my definition of OBP, all through this post.)
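For what it's worth, the trial-and-error search can be sketched as a bisection. The five player lines below are made up (the real run uses all 2013 non-pitchers with 20+ PA from the Lahman data):

```python
# Find the regression target x such that, after regressing every player
# toward x by 166 PA, the PA-weighted league OBP is unchanged.
players = [(90, 250), (160, 480), (200, 610), (45, 170), (30, 120)]  # (H+BB, PA)
REG = 166  # PA of dilution, from earlier in the post

def league_obp(players):
    # PA-weighted observed OBP
    return sum(hb for hb, pa in players) / sum(pa for hb, pa in players)

def regressed_league_obp(players, x):
    # Regress each player toward x, then re-weight by his actual PA
    total_pa = sum(pa for _, pa in players)
    return sum((hb + REG * x) / (pa + REG) * pa for hb, pa in players) / total_pa

target = league_obp(players)
lo, hi = 0.0, 1.0
for _ in range(50):  # the regressed average rises with x, so bisect
    mid = (lo + hi) / 2
    if regressed_league_obp(players, mid) < target:
        lo = mid
    else:
        hi = mid
print(round((lo + hi) / 2, 5))  # the x that keeps the league OBP unchanged
```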


So, a summary of the method:

1. For each player, take a random 20 PA subset of his batting line. Figure SD(observed) and SD(luck).

2. Repeat the above enough times to get a large sample size, and average out to get a stable estimate of SD(observed) and SD(luck).

3. Use the Tango method to calculate SD(talent).

4. Use the Tango method to calculate how many PA to regress to the mean to estimate player talent.

5. Figure what mean to regress to by trial and error, to get the playing-time-weighted average talent equal to the actual league OBP.


If I did that right, it should work for any stat, not just OBP. Eventually I'll run it for wOBA, and RC27, and BABIP, and whatever else comes to mind. 

As always, let me know if I've got any of this wrong.


Tuesday, January 15, 2019

Fun with splits

This was Frank Thomas in 1993, a year in which he was American League MVP with an OPS of 1.033.

                 PA   H 2B 3B HR  BB  K   BA   OPS 
'93 F. Thomas   676 174 36  0 41 112 54 .317 1.033  

Most of Thomas's hitting splits were fairly normal:

Home/Road:              1.113/0.950
First vs. Second Half:  0.970/1.114
Vs. RHP/LHP:            1.019/1.068
Outs in inning:         1.023/1.134/0.948
Team ahead/behind/tied: 1.016/0.988/1.096
Early/mid/late innings: 1.166/0.950/0.946
Night/day:              1.071/0.939

But I found one split that was surprisingly large:

              PA   H 2B 3B HR BB  K   BA   OPS  RC/G 
Thomas 1     352 108 22  0 33 58 34 .367 1.251 14.81 
Thomas 2     309  66 14  0  8 54 20 .259 0.796  5.45 

"Thomas 1" was an order of magnitude better than "Thomas 2," to the extent that you wouldn't recognize them as the same player. 

This is a real split ... it's not a selective-sampling trick, like "team wins vs. losses," where "team wins" were retroactively more likely to have been games in which Thomas hit better. (For the record, that particular split was 1.172/.828 -- this one is wider.)

So what is this split? The answer is ... 


The first line is games on odd-numbered days of the month. The second line is even-numbered days.

In other words, this split is random.

In terms of OPS difference -- 455 points -- it's the biggest odd/even split I found for any player in any season from 1950 to 2016 with at least 251 AB in each half. 

If we go down to a 150 AB minimum, the biggest is Ken Phelps in 1987:

1987 Phelps   PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd          204  31  3  0  8 39 33 .188 0.695  3.79 
even         208  55 10  1 19 41 42 .329 1.204 13.03 

And if we go down to 100 AB, it's Mike Stanley, again in 1987, but on the opposite days to Phelps:

1987 Stanley  PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd          134  42  6  1  6 18 23 .362 1.034 10.49 
even         113  17  2  0  0 13 25 .170 0.455  1.55 

But, from here on, I'll stick to the 251 AB standard.
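For the record, the computation behind these splits looks something like this. It's a sketch only: the four-game log is made up, and the row layout is hypothetical (a real run needs daily game logs, e.g. from Retrosheet):

```python
def ops(rows):
    """OPS from (day, AB, H, 2B, 3B, HR, BB) rows: OBP (walks-AB-hits version) + SLG."""
    ab = sum(r[1] for r in rows)
    h  = sum(r[2] for r in rows)
    tb = sum(r[2] + r[3] + 2 * r[4] + 3 * r[5] for r in rows)  # total bases
    bb = sum(r[6] for r in rows)
    return (h + bb) / (ab + bb) + tb / ab

games = [
    (3, 4, 2, 1, 0, 1, 0),   # (day of month, AB, H, 2B, 3B, HR, BB)
    (4, 4, 0, 0, 0, 0, 1),
    (5, 3, 2, 0, 0, 1, 1),
    (6, 5, 1, 0, 0, 0, 0),
]
odd  = [g for g in games if g[0] % 2 == 1]
even = [g for g in games if g[0] % 2 == 0]
print(round(ops(odd), 3), round(ops(even), 3))
```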

That 1993 Frank Thomas split was also the biggest gap in home runs, with a 25 HR difference between odd and even (33 vs. 8). Here's another I found interesting -- Dmitri Young in 2001:

2001 D Young  PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
Odd          285  68 12  2  2 18 40 .255 0.639  3.48 
Even         292  95 16  1 19 19 37 .348 1.013  9.51 

Only two of Young's 21 home runs came on odd-numbered days. The binomial probability of that happening randomly (19-2/2-19 or better) is about 1 in 4520.*  And, coincidentally, there were exactly 4516 players in the sample!

(* Actually, it must be more likely than 1 in 4520. The binomial probability assumes each opportunity is independent, and equally likely to occur on an even day as an odd day. But PA tend to happen in daily clusters of 3 to 5, and since PA cluster, so do HR. 

To see that more easily, imagine extreme clustering, where there are only two games a year (instead of 162), with 250 PA each game. Half of all players would have either all odd PA or all even PA, and you'd see lots of extreme splits.)
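Both binomial figures are easy to check directly (independence assumed, so, per the note above, the true probabilities are a bit higher):

```python
from math import comb

# Dmitri Young's 21 HR splitting 19-2 or more extreme, either direction:
p_young = 2 * sum(comb(21, k) for k in range(3)) / 2**21
print(round(1 / p_young))        # 4520

# Rod Carew's 10 triples + 5 HR all landing on one side (15-0 either way):
p_carew = 2 / 2**15
print(round(1 / p_carew - 1))    # 16383, i.e. odds of 16383 to 1
```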

For K/BB ratio, check out Derek Jeter's 2004:  

2004 Jeter   PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd         362 113 27  1 15 14 63 .325 0.888  7.12 
even        327  75 17  0  8 32 36 .254 0.720  4.40 

There were bigger differences, but I found Jeter's the most interesting. 

In 1978, all 10 of Rod Carew's triples came on even-numbered days:

1978 Carew   PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd         333  92 10  0  0 45 34 .319 0.766  5.46 
even        309  96 16 10  5 33 28 .348 0.950  8.69 

A 10-0 split is a 1-in-512 shot. I'd say again that it's actually a bit more likely than that because of PA clustering, but ... Carew actually had *fewer* PA in that situation! 

Oh, and Carew also hit all five of his HR on even days. Combining them into 15-0 is binomial odds of 16383 to 1, if you want to do that.

Strikeouts and walks aren't quite as impressive. It's Justin Upton 2013 for strikeouts:

2013 Upton     PA   H 2B 3B HR BB   K   BA  OPS  RC/G 
odd           330  71 14  1 16 31 102 .237 0.761 4.67 
even          303  76 13  1 11 44  59 .293 0.875 6.84 

And Mike Greenwell 1988 for walks:

88 Greenwell   PA   H 2B 3B HR BB   K  BA   OPS  RC/G 
odd           357  91 15  3 10 62  18 .308 0.910 7.61 
even          320 101 24  5 12 25  20 .342 0.973 8.85 

Interestingly, Greenwell was actually more productive on the even-numbered days where he took less than half as many walks.

Finally, here's batting average, Grady Sizemore in 2005:

2005 Sizemore  PA   H 2B 3B HR BB   K  BA   OPS  RC/G 
odd           344  69  9  4 12 26  79 .217 0.660 3.45 
even          348 116 28  7 10 26  53 .360 0.992 9.50 

Another anomaly -- Sizemore hit more home runs on his .217 days than on his .360 days.


Anyway, what's the point of all this? Fun, mostly. But, for me, it did give me a better idea of what kinds of splits can happen just by chance. If it's possible to have a split of 33 odd homers and 8 even homers, just by luck, then it's possible to have 33 first-half homers and 8 second-half homers, just by luck. 

Of course, you should just expect that size of effect once every 40 years or so. It might be more intuitive to go from a 40-year standard to a single-season standard, to get a better idea of what we can expect each year. 

To do that, I looked at 1977 to 2016 -- 39 full seasons, plus strike-shortened 1994. Averaging the top 39 should roughly give us the average for the year. Instead of the average, I figured I'd just (unscientifically) take the 25th biggest ... that's probably going to be close to the median MLB-leading split for the year, taking into account that some years have more than one of the top 39.

For HR, the 25th ranked is Fred McGriff's 2002. It's an impressive 22/8 split:

02 McGriff   PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         297  70 11  1 22 42  47 .275 0.961  7.74 
even        289  73 16  1  8 21  52 .272 0.754  4.89 

For OPS, it's Scott Hatteberg in 2004:

04 Hatteberg PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         312  92 19  0 10 37  23 .335 0.926  8.12 
even        310  64 11  0  5 35  25 .233 0.647  3.47

For strikeouts, it's Felipe Lopez, 2005. Not that huge a deal ... only 27 K difference.

05 F. Lopez  PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         316  78 15  2 12 19  69 .263 0.755  4.75 
even        321  91 19  3 11 38  42 .322 0.928  7.95 

For walks, it's Darryl Strawberry's 1987. The difference is only 23 BB, but to me it looks more impressive than the 27 strikeouts:

87 Strwb'ry  PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         315  77 15  2 19 37  55 .277 0.912  7.02 
even        314  74 17  3 20 60  67 .291 1.045  9.49 

For batting average, number 25 is Omar Infante, 2011, but I'll show you the 24th ranked, which is Rickey Henderson in his rookie card year. (Both players round to a .103 difference.)

1980 Rickey  PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         340 100 13  1  2 60  21 .357 0.903  8.07 
even        368  79  9  3  7 57  33 .254 0.739  4.67 


I'm going to think of this as, every year, the league-leading random split is going to look like those. Some years it'll be higher, some lower, but these will be fairly typical.

That's the league-leading split for *each category*. There'll be a random home/road split of this magnitude (in addition to actual home/road effect). There'll be a random early/late split of this magnitude (in addition to any fatigue/weather effects). There'll be a random lefty/righty split of this magnitude (in addition to actual platoon effects). And so on.

Another way I might use this is to get an intuitive grip on how much I should trust a potentially meaningful split. For instance, if a certain player hits substantially worse in the second half of the season than in the first half, how much should you worry? To figure that out, I'd list a season's biggest even/odd splits alongside the season's biggest early/late splits. If the 20th biggest real split is as big as the 10th biggest random split, then, knowing nothing else, you can start with a guess that there's a 50 percent chance the decline is real.

Sure, you could do it mathematically, by figuring out the SD of the various stats. But that's harder to appreciate. And it's not nearly as much fun as being able to say that, in 1978, Rod Carew hit every one of his 10 triples and 5 homers on even-numbered days. Especially when anyone can go to Baseball Reference and verify it.
