Sabermetric Research: Fun with splits

This was Frank Thomas in 1993, a year in which he was American League MVP with an OPS of 1.033.

                  PA    H  2B  3B  HR   BB    K    BA    OPS
------------------------------------------------------------
1993 F. Thomas   676  174  36   0  41  112   54  .317  1.033

Most of Thomas's hitting splits were fairly normal:

Home/Road: 1.113/0.950
First vs. Second Half: 0.970/1.114
Vs. RHP/LHP: 1.019/1.068
Outs in inning: 1.023/1.134/0.948
Team ahead/behind/tied: 1.016/0.988/1.096
Early/mid/late innings: 1.166/0.950/0.946
Night/day: 1.071/0.939

But I found one split that was surprisingly large:

                  PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
Thomas 1         352  108  22   0  33   58   34  .367  1.251  14.81
Thomas 2         309   66  14   0   8   54   20  .259  0.796   5.45

"Thomas 1" was so much better than "Thomas 2" that you wouldn't recognize them as the same player.

This is a real split ... it's not a selective-sampling trick, like "team wins vs. losses," where "team wins" were retroactively more likely to have been games in which Thomas hit better. (For the record, that particular split was 1.172/.828 -- this one is wider.)

So what is this split? The answer is ...

.
.
.

The first line is games on odd-numbered days of the month. The second line is even-numbered days.

In other words, this split is random.

In terms of OPS difference -- 455 points -- it's the biggest odd/even split I found for any player in any season from 1950 to 2016 with at least 251 AB in each half.
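For anyone who wants to check the arithmetic, here's a minimal sketch that rebuilds BA and OPS from the counting stats in the "Thomas 1" and "Thomas 2" lines. It assumes the PA column in the split tables is just AB + BB (so AB = PA - BB) and uses a simplified OBP with no hit-by-pitch or sacrifice flies; both are my assumptions, chosen because they reproduce the table. The function name `slash_stats` is made up for illustration.

```python
def slash_stats(pa, h, doubles, triples, hr, bb):
    """Return (BA, OPS) from counting stats.

    Assumption: PA = AB + BB, so AB = PA - BB.
    Assumption: OBP = (H + BB) / PA, ignoring HBP and sacrifice flies.
    """
    ab = pa - bb
    # Every hit is worth one base; extra-base hits add 1, 2, or 3 more.
    total_bases = h + doubles + 2 * triples + 3 * hr
    ba = h / ab
    obp = (h + bb) / pa
    slg = total_bases / ab
    return ba, obp + slg

odd_ba, odd_ops = slash_stats(352, 108, 22, 0, 33, 58)    # "Thomas 1"
even_ba, even_ops = slash_stats(309, 66, 14, 0, 8, 54)    # "Thomas 2"

print(f"odd:  BA {odd_ba:.3f}  OPS {odd_ops:.3f}")
print(f"even: BA {even_ba:.3f}  OPS {even_ops:.3f}")
print(f"gap in rounded OPS: {round(odd_ops, 3) - round(even_ops, 3):.3f}")
```

Under those assumptions the two lines come out to .367/1.251 and .259/0.796, matching the table, and the gap between the rounded OPS figures is the 455 points quoted above.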

If we go down to a 150 AB minimum, the biggest is Ken Phelps in 1987:

1987 Phelps       PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              204   31   3   0   8   39   33  .188  0.695   3.79
even             208   55  10   1  19   41   42  .329  1.204  13.03

And if we go down to 100 AB, it's Mike Stanley, again in 1987, but on the opposite days to Phelps:

1987 Stanley      PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              134   42   6   1   6   18   23  .362  1.034  10.49
even             113   17   2   0   0   13   25  .170  0.455   1.55

But, from here on, I'll stick to the 251 AB standard.

That 1993 Frank Thomas split was also the biggest gap in home runs, with a 25 HR difference between odd and even (33 vs. 8). Here's another I found interesting -- Dmitri Young in 2001:

2001 D. Young     PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              285   68  12   2   2   18   40  .255  0.639   3.48
even             292   95  16   1  19   19   37  .348  1.013   9.51

Only two of Young's 21 home runs came on odd-numbered days. The binomial probability of that happening randomly (19-2/2-19 or better) is about 1 in 4520.* And, coincidentally, there were exactly 4516 players in the sample!

(* Actually, it must be more likely than 1 in 4520. The binomial probability assumes each opportunity is independent, and equally likely to occur on an even day as an odd day. But PA tend to happen in daily clusters of 3 to 5, and every PA in a game falls on the same odd or even day. Since PA cluster that way, so do HR.

To see that more easily, imagine extreme clustering, where there are only two games a year (instead of 162), with 250 PA each game. Half of all players would have either all odd PA or all even PA, and you'd see lots of extreme splits.)
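Here's a rough sketch of both halves of that argument. The first function computes the plain binomial figure for Young's 2-19 split. The second is a toy simulation of the clustering point, not the method used for the numbers above: it assumes each game lands on an odd or even day by coin flip, gives every PA the same home run chance (0.035, roughly 21 HR in 600 PA), and compares 600 PA spread over one-PA "games," 150 four-PA games, and the extreme two-game season. All of the names and parameters are hypothetical.

```python
import random
from math import comb

def lopsided_split_prob(total, minority):
    """Chance a fair coin-flip split of `total` events is this lopsided or worse.

    Counts both directions (e.g. 2-19 and 19-2); assumes minority < total / 2.
    """
    one_tail = sum(comb(total, k) for k in range(minority + 1)) / 2 ** total
    return 2 * one_tail

young = lopsided_split_prob(21, 2)
print(f"Young, 2 of 21 HR on odd days: 1 in {1 / young:.0f}")

def hr_gap_sd(games, pa_per_game, hr_rate=0.035, seasons=1000, seed=1):
    """SD of (odd-day HR minus even-day HR) across simulated seasons.

    Toy model (an assumption): each game is odd or even by coin flip,
    and each PA is a home run with probability hr_rate.
    """
    rng = random.Random(seed)
    gaps = []
    for _season in range(seasons):
        gap = 0
        for _game in range(games):
            sign = 1 if rng.random() < 0.5 else -1
            hr = sum(rng.random() < hr_rate for _ in range(pa_per_game))
            gap += sign * hr          # all HR in a game share one parity
        gaps.append(gap)
    mean = sum(gaps) / seasons
    return (sum((g - mean) ** 2 for g in gaps) / seasons) ** 0.5

sds = {(g, pa): hr_gap_sd(g, pa) for g, pa in [(600, 1), (150, 4), (2, 300)]}
for (g, pa), sd in sds.items():
    print(f"{g} games x {pa} PA: SD of odd-even HR gap = {sd:.1f}")
```

The first line of output should agree with the 1-in-4520 figure. In the toy model the spread of the odd/even gap gets wider as the same PA are packed into fewer games, most dramatically in the two-game case, which is the sense in which the true odds are shorter than the binomial number.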

For K/BB ratio, check out Derek Jeter's 2004:

2004 Jeter        PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              362  113  27   1  15   14   63  .325  0.888   7.12
even             327   75  17   0   8   32   36  .254  0.720   4.40

There were bigger differences, but I found Jeter's the most interesting.

In 1978, all 10 of Rod Carew's triples came on even-numbered days:

1978 Carew        PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              333   92  10   0   0   45   34  .319  0.766   5.46
even             309   96  16  10   5   33   28  .348  0.950   8.69

A 10-0 split is a 1-in-512 shot. I'd say again that it's actually a bit more likely than that because of PA clustering, but ... Carew actually had *fewer* PA in that situation!

Oh, and Carew also hit all five of his HR on even days. Combining them into 15-0 is binomial odds of 16383 to 1, if you want to do that.
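Those two figures are easy to reproduce. A small sketch, treating each triple or homer as an independent coin flip between odd and even days (the same simplification as before; the function name is hypothetical):

```python
from fractions import Fraction

def shutout_split_prob(n):
    """Chance that n coin-flip events all land on the same side (either side)."""
    return 2 * Fraction(1, 2) ** n

for n in (10, 15):
    p = shutout_split_prob(n)
    print(f"{n}-0: 1 in {p.denominator}, or odds of {p.denominator - 1} to 1")
```

The factor of 2 is there because all-even and all-odd both count as a shutout.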

Strikeouts and walks aren't quite as impressive. It's Justin Upton 2013 for strikeouts:

2013 Upton        PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              330   71  14   1  16   31  102  .237  0.761   4.67
even             303   76  13   1  11   44   59  .293  0.875   6.84

And Mike Greenwell 1988 for walks:

1988 Greenwell    PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              357   91  15   3  10   62   18  .308  0.910   7.61
even             320  101  24   5  12   25   20  .342  0.973   8.85

Interestingly, Greenwell was actually more productive on the even-numbered days, when he drew fewer than half as many walks.

Finally, here's batting average, Grady Sizemore in 2005:

2005 Sizemore     PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              344   69   9   4  12   26   79  .217  0.660   3.45
even             348  116  28   7  10   26   53  .360  0.992   9.50

Another anomaly -- Sizemore hit more home runs on his .217 days than on his .360 days.

-------

Anyway, what's the point of all this? Fun, mostly. But, for me, it did give me a better idea of what kinds of splits can happen just by chance. If it's possible to have a split of 33 odd homers and 8 even homers, just by luck, then it's possible to have 33 first-half homers and 8 second-half homers, just by luck.

Of course, you should expect an effect that size only once every 40 years or so. It might be more intuitive to go from a 40-year standard to a single-season standard, to get a better idea of what we can expect each year.

To do that, I looked at 1977 to 2016 -- 39 seasons plus 1994. Averaging the top 39 should roughly give us the average for the year. Instead of the average, I figured I'd just (unscientifically) take the 25th biggest ... that's probably going to be close to the median MLB-leading split for the year, taking into account that some years have more than one of the top 39.
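To get a feel for that shortcut, here's a toy simulation that compares the two quantities directly: the median of the 39 season-leading splits, and the 25th-biggest split in the whole pool. Everything in it is made up for illustration (115 players a season, HR totals drawn uniformly from 5 to 40, every homer a fair coin flip between odd and even days), so it says nothing about the real data; it only shows how the comparison could be set up.

```python
import random
from statistics import median

rng = random.Random(2016)          # fixed seed so the run is repeatable
SEASONS, PLAYERS = 39, 115         # hypothetical sizes, not the real sample

def random_hr_gap():
    """|odd HR - even HR| for one made-up player-season."""
    total_hr = rng.randint(5, 40)                      # assumed HR range
    odd = sum(rng.random() < 0.5 for _ in range(total_hr))
    return abs(2 * odd - total_hr)                     # odd - even = 2*odd - total

seasons = [[random_hr_gap() for _ in range(PLAYERS)] for _ in range(SEASONS)]

leaders = [max(s) for s in seasons]                    # each year's biggest gap
all_gaps = sorted((g for s in seasons for g in s), reverse=True)

median_leader = median(leaders)
rank_25 = all_gaps[24]                                 # 25th biggest overall
print(f"median season leader: {median_leader}, 25th biggest overall: {rank_25}")
```

Changing the seed or the made-up sizes gives a sense of how far apart the two numbers can drift.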

For HR, the 25th ranked is Fred McGriff's 2002. It's an impressive 22/8 split:

2002 McGriff      PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              297   70  11   1  22   42   47  .275  0.961   7.74
even             289   73  16   1   8   21   52  .272  0.754   4.89

For OPS, it's Scott Hatteberg in 2004:

2004 Hatteberg    PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              312   92  19   0  10   37   23  .335  0.926   8.12
even             310   64  11   0   5   35   25  .233  0.647   3.47

For strikeouts, it's Felipe Lopez, 2005. Not that huge a deal ... only 27 K difference.

2005 F. Lopez     PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              316   78  15   2  12   19   69  .263  0.755   4.75
even             321   91  19   3  11   38   42  .322  0.928   7.95

For walks, it's Darryl Strawberry's 1987. The difference is only 23 BB, but to me it looks more impressive than the 27 strikeouts:

1987 Strawberry   PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              315   77  15   2  19   37   55  .277  0.912   7.02
even             314   74  17   3  20   60   67  .291  1.045   9.49

For batting average, number 25 is Omar Infante, 2011, but I'll show you the 24th ranked, which is Rickey Henderson in his rookie card year. (Both players round to a .103 difference.)

1980 Rickey       PA    H  2B  3B  HR   BB    K    BA    OPS   RC/G
-------------------------------------------------------------------
odd              340  100  13   1   2   60   21  .357  0.903   8.07
even             368   79   9   3   7   57   33  .254  0.739   4.67

-------

I'm going to think of it this way: every year, the league-leading random split is going to look like those. Some years it'll be higher, some lower, but these will be fairly typical.

That's the league-leading split for *each category*. There'll be a random home/road split of this magnitude (in addition to actual home/road effect). There'll be a random early/late split of this magnitude (in addition to any fatigue/weather effects). There'll be a random lefty/righty split of this magnitude (in addition to actual platoon effects). And so on.

Another way I might use this is to get an intuitive grip on how much I should trust a potentially meaningful split. For instance, if a certain player hits substantially worse in the second half of the season than in the first half, how much should you worry? To figure that out, I'd list a season's biggest even/odd splits alongside the season's biggest early/late splits. If the 20th biggest real split is as big as the 10th biggest random split, then, knowing nothing else, you can start with a guess that there's a 50 percent chance the decline is real.
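As a sketch of that comparison, under my reading of it: count how many real splits are at least as big as some threshold, count how many random (odd/even) splits are that big, and treat the excess as the share that isn't luck. The function name and the example numbers below are invented; the lists are built so that exactly 20 real and 10 random splits clear the bar, mirroring the 20th-versus-10th example.

```python
def share_real(real_splits, random_splits, threshold):
    """Rough share of real splits >= threshold not explained by chance.

    Assumes the random list shows how many splits that size luck alone produces.
    """
    n_real = sum(s >= threshold for s in real_splits)
    n_random = sum(s >= threshold for s in random_splits)
    if n_real == 0:
        return 0.0
    return max(0.0, (n_real - n_random) / n_real)

# Made-up splits in OPS points: 20 real and 10 random ones reach 200 or more.
real = list(range(200, 300, 5)) + [100] * 30
rand = list(range(200, 250, 5)) + [100] * 40

print(share_real(real, rand, threshold=200))   # 20 real vs. 10 random at this size
```

With 20 real splits and 10 random ones above the threshold, the function returns 0.5, the 50 percent starting guess.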

Sure, you could do it mathematically, by figuring out the SD of the various stats. But that's harder to appreciate. And it's not nearly as much fun as being able to say that, in 1978, Rod Carew hit every one of his 10 triples and 5 homers on even-numbered days. Especially when anyone can go to Baseball Reference and verify it.
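For the record, the mathematical version is short for home runs. If each of a player's n homers is an independent coin flip between odd and even days, the odd count has variance n/4, so the gap (odd minus even, which is twice the odd count minus n) has an SD of the square root of n. A minimal sketch under that assumption, which ignores PA clustering and unequal numbers of odd and even games:

```python
from math import sqrt

def hr_gap_z(odd_hr, even_hr):
    """Odd/even HR gap in SD units, if each HR is a fair coin flip."""
    total = odd_hr + even_hr
    return abs(odd_hr - even_hr) / sqrt(total)   # SD of (odd - even) is sqrt(total)

for name, odd, even in [("Thomas 1993", 33, 8),
                        ("Young 2001", 2, 19),
                        ("McGriff 2002", 22, 8)]:
    print(f"{name}: {hr_gap_z(odd, even):.1f} SD")
```

By this yardstick the Thomas and Young splits are both a little under 4 SD, and McGriff's "typical league-leading" split is about 2.6.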

Labels: baseball, luck, randomness, splits

2 Comments:

At Saturday, January 19, 2019 4:04:00 PM, Daniel said...: This is a wonderful post Phil.

At Wednesday, February 27, 2019 10:50:00 AM, Unknown said...: There is speculation on what causes home/road difference in performance. Some speculations are that the bias is due to travel fatigue, getting a better night's rest in your own home, greater familiarity with your own park, etc.

The authors of the book "Scorecasting," Moskowitz and Wertheim, think it's mostly due to subconscious umpire bias: the crowds influence their decisions on a subconscious level.

I see that robot umpires are going to be tested in the Atlantic League.

http://www.realclearlife.com/sports/mlb-will-test-robotic-umpires-in-independent-league-baseball/

IF Moskowitz and Wertheim are correct, home/road splits should drop significantly when the robot umpires start operating, since they won't be subconsciously influenced by the crowds. Of course OTHER factors, like hitters/pitchers parks, would still result in some bias in homers, doubles, etc.


Tuesday, January 15, 2019
