Do pitchers perform worse after a high-pitch start?
Last week, J.C. Bradbury and Sean Forman released a study to check whether throwing a lot of pitches affects a pitcher's next start.  The paper, along with a series of PowerPoint slides, can be found at JC's blog.
There were several things that the study checked, but I'm going to concentrate on one part of it, which is fairly representative of the whole.
The authors tried to predict a starting pitcher's ERA in the following game, based on how many pitches he threw this game, and a bunch of other variables.  Specifically:
-- number of pitches
-- number of days rest
-- the pitcher's ERA in this same season 
-- the pitcher's age
It turned out that, controlling for the other three factors, every additional pitch thrown this game led to a .007 increase in ERA the next game.
Except that I think there's a problem.
The authors included season ERA in their list of controls.  That's because they needed a way to control for the quality of the pitcher.  Otherwise, they'd probably find that throwing a lot of pitches today means you'll pitch well next time -- since the pitcher who throws 130 pitches today is more likely to be Nolan Ryan than Josh Towers.
So, effectively, they're comparing every pitcher to himself that season.
But if you compare a pitcher to himself that season, then it's guaranteed that an above-average game (for that pitcher) will be more likely to be followed by a below-average game (for that pitcher).  After all, the entire set of games has to add up to "average" for that pitcher.
This is easiest to see if you consider the case where the pitcher only starts two games.  If the first game is below his average, the second game absolutely must be above his average.  And if the first game is above his average, the second game must be below.
The same thing holds for pitchers with more than two starts.  Suppose a starter throws 150 innings, and gives up 75 runs, for an ERA of 4.50.  And suppose that, today, he throws a 125-pitch complete game shutout.  
For all games other than this one, his record will be 141 innings and 75 earned runs, for a 4.79 ERA.  So, in his next start, you'd expect him, in retrospect, to be significantly worse than his season average of 4.50.  That difference isn't caused by the 125 pitches.  It's just the logical consequence that if this game was above the season average, the other games combined must be below the season average.
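The arithmetic in that example can be checked with a minimal sketch like the one below. The function names are made up for illustration; the numbers are the ones from the example above (150 innings, 75 runs, and a 9-inning shutout removed).

```python
def ra_per_nine(runs, innings):
    """Runs allowed per nine innings."""
    return 9.0 * runs / innings


def ra_without_game(season_runs, season_innings, game_runs, game_innings):
    """Season rate recomputed with one start taken out of the totals."""
    return ra_per_nine(season_runs - game_runs, season_innings - game_innings)


season = ra_per_nine(75, 150)             # the full season, shutout included
others = ra_without_game(75, 150, 0, 9)   # every start except the shutout

print(f"season: {season:.2f}, all other starts: {others:.2f}")
```

This prints 4.50 for the season and 4.79 for the other starts, matching the figures in the text.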
Now, high pitch counts are associated with above-average games, and low pitch counts are associated with bad starts.  So, since a player should be below average after a good start, and a high pitch start was probably a very good start, then it follows that a player should be below his average after a high pitch start.  Similarly, he should be above his average after a low-pitch start.  That's just an artifact of the way the study was designed, and has nothing to do with the player's arm being tired or not.
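One way to see the artifact in isolation is a small simulation in which fatigue does not exist at all. The sketch below is only an illustration under made-up assumptions: every parameter (talent spread, game-to-game noise, the link between runs allowed and pitch count) is invented, innings are ignored so that every game counts equally, and nothing here comes from the study's data. Each game's RA is drawn independently of the previous game's pitch count, so any slope that shows up is produced by the choice of baseline.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs (one-variable regression)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n_pitchers=2000, n_starts=30, seed=1):
    """Simulated seasons with NO fatigue effect. All parameters are made up."""
    rng = random.Random(seed)
    pitches, vs_season_avg, vs_other_games_avg = [], [], []
    for _ in range(n_pitchers):
        talent = rng.gauss(4.5, 0.7)  # the pitcher's true RA level
        ra = [max(0.0, rng.gauss(talent, 3.0)) for _ in range(n_starts)]
        # Good games (low RA) tend to last longer, so they have more pitches.
        pc = [115 - 4 * r + rng.gauss(0, 10) for r in ra]
        total = sum(ra)
        for i in range(n_starts - 1):
            season_avg = total / n_starts                 # includes game i
            other_avg = (total - ra[i]) / (n_starts - 1)  # excludes game i
            pitches.append(pc[i])
            vs_season_avg.append(ra[i + 1] - season_avg)
            vs_other_games_avg.append(ra[i + 1] - other_avg)
    return pitches, vs_season_avg, vs_other_games_avg


pitches, vs_season, vs_others = simulate()
slope_season = slope(pitches, vs_season)  # baseline includes the current game
slope_others = slope(pitches, vs_others)  # baseline excludes the current game
print(f"next game vs. season average:      {slope_season:+.4f} per pitch")
print(f"next game vs. other-games average: {slope_others:+.4f} per pitch")
```

If the zero-sum argument is right, the first slope should come out positive even though no fatigue was simulated, and the second should sit close to zero. The exact values depend on the invented parameters and the seed.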
How big is the effect over the entire study?  I checked.  For every starter from 2000 to 2009 starting on less than 15 days rest, I computed how much higher or lower his ERA would have been had that start been eliminated completely.  Then I grouped the starts by number of pitches.  The results:
Pitches thrown: change in season ERA with that start removed
5-14: -0.09
15-24: -0.41
25-34: -0.50
35-44: -0.64
45-54: -0.50
55-64: -0.38
65-74: -0.19
75-84: -0.08
85-94: +0.01
95-104: +0.05
105-114: +0.06
115-124: +0.07
125-134: +0.07
135-144: +0.08
145-154: +0.06
(Note: even though I'm talking about ERA, I included unearned runs too.  I really should say "RA", but I'll occasionally keep on saying "ERA" anyway just to keep the discussion easier to follow.  Just remember: JC/Sean's data is really ERA, and mine is really RA.)
To read one line off the chart: if you randomly found a game in which a starter threw only 50 pitches, and eliminated that game from his record, his season ERA would drop by half a run, 0.50.  That's because a 50-pitch start is probably a bad outing, so eliminating it is a big improvement.
That's pretty big. A pitcher with an ERA of 4.00 *including* that bad outing might be 3.50 in all other games.  And so, if he actually pitches to an ERA of around 3.50 in his next start, that would be just as expected by the logic of the calculations.  
It's also interesting to note that the effect is very steep up to about 90 pitches, and then it levels off.  That's probably because, after 90, any subsequent pitches are more a consequence of the pitcher's perceived ability to handle the workload, and less the number of runs he's giving up on this particular day.
Finally, if you take the "if this game were omitted" ERA difference in every game, and regress it against the number of pitches, what do you get?  You'll get that every extra pitch causes a .006 increase in ERA next game -- very close to the .007 that JC and Sean found in their study.
-----
So, that's an argument that suggests the result might be just due to the methodology, and not to arm fatigue at all.  To be more certain, I decided to try to reproduce the result.  I ran a regression to predict next game's ERA from this game's pitches, and the pitcher's season ERA (the same variables JC and Sean used, but without age and year, which weren't found to be significant).  I used roughly the same database they did -- 1988 to 2009.  
My result: every extra pitch was worth .005 of ERA next game.  That's a bit smaller than the .007 the authors found (more so when you consider that theirs really is ERA, and mine includes unearned runs), but still consistent.  (I should mention that the original study didn't do a straight-line linear regression like I did -- the authors investigated transformations that might have wound up with a curved line as best fit.  However, their graph shows a line that's almost straight -- I had to hold a ruler to it to notice a slight curve -- so it seems to me that the results are indeed similar.)
Then, I ran the same regression, but, this time, to remove the flaw, I used the pitcher's ERA for that season but adjusted *to not include that particular game*.  So, for instance, in the 50-pitch example above, I used 3.50 instead of 4.00.
Now, the results went the other way!  In this regression, every additional pitch this game led to a .003 *decrease* in runs allowed next game.  Moreover, the result was only barely statistically significant (p=.07).
So, there appears to be a much weaker relationship between pitch count and future performance when you choose a better version of ERA, one that's independent of the other variables in the regression.
However, there's still some bias there, and there's one more correction we can make.  Let me explain.
-----
In 2002, Mike Mussina allowed 103 runs in 215.2 innings of work, for an RA of 4.30.
Suppose you took one of Mussina's 2002 starts, at random.  On average, what should his RA that game be?
The answer is NOT 4.30.  It's much higher.  It's 4.89.  That is, if you take Mussina's RA for every one of his 33 starts, and you average all those numbers out, you get 4.89.
Why?  Because the ERA calculation, the 4.30, is when you weight all Mussina's innings equally.  But, when we wonder about his average ERA in a game, we're wanting to treat all *games* equally, not innings.  The July 31 game, where he pitched only 3 innings and had an RA of 21.00, gets the same weight in the per-game-average as his 9-inning shutout of August 28, with an RA of 0.00.  
In ERA, the 0.00 gets three times the weight of the 21.00, because it covered three times as many innings.  But when we ask about ERA in a given game, we're ignoring innings, and just looking at games.  So the 0.00 gets only equal weight to the 21.00, not three times.
Since pitchers tend to pitch more innings in games where they pitch better, ERA gives a greater weight to those games.  And that's why overall ERA is lower than averaging individual games' ERAs.
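The difference between the two kinds of average is easy to see on a toy game log. In the sketch below, the first two starts have the same shape as the two games mentioned above (3 innings with 7 runs for a 21.00, and a 9-inning shutout); the third start and the function names are hypothetical, added only to round out the example.

```python
# (innings, runs) for three starts; the third one is made up.
games = [(3.0, 7), (9.0, 0), (6.0, 3)]


def season_ra(games):
    """Innings-weighted rate: total runs over total innings, per nine."""
    total_runs = sum(runs for _, runs in games)
    total_innings = sum(innings for innings, _ in games)
    return 9.0 * total_runs / total_innings


def average_game_ra(games):
    """Game-weighted rate: compute each start's RA, then average them equally."""
    per_game = [9.0 * runs / innings for innings, runs in games]
    return sum(per_game) / len(per_game)


print(f"season RA:       {season_ra(games):.2f}")
print(f"average game RA: {average_game_ra(games):.2f}")
```

For these three starts the season RA is 5.00, while the average of the individual game RAs (21.00, 0.00 and 4.50) is 8.50. The short, bad start counts for a third of the game average but only a sixth of the innings.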
The point is: The study is trying to predict ERA for the next game.  The best estimate for ERA next game is *not* the ERA for the season.  That's because, as we just saw, the overall season ERA is too low to be a proper estimate of a single game's ERA.  Rather, the best estimate of a game's ERA is the overall average of the individual game ERAs.
So, in the regression, instead of using plain ERA as one of the independent variables, why not use the player's average game ERA that season?  That would be more consistent with what we're trying to predict.  In our Mussina example, instead of using 4.30, we'll use 4.89.
With the exception, of course, that we'll subtract out the current game from the average game ERA.  So, if we're working on predicting the game after Mussina's shutout, we'll use the average game ERA from Mussina's other 32 starts, not including the shutout.  Instead of 4.89, that works out to 5.04.  
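That adjustment is just a mean with one observation taken out. A minimal sketch, using the rounded 4.89 figure and the 33 starts quoted above (the function name is made up for illustration):

```python
def mean_without(mean_all, n, value):
    """Mean of the remaining n - 1 values after removing one value
    from a set of n values whose mean is mean_all."""
    return (mean_all * n - value) / (n - 1)


# 33 starts averaging 4.89 per game; remove the shutout (game RA of 0.00).
adjusted = mean_without(4.89, 33, 0.0)
print(f"{adjusted:.2f}")
```

This prints 5.04. Because 4.89 is itself a rounded number, the last digit could differ slightly from a calculation on the full game log.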
That is, I again ran a regression, trying to predict the next game's RA based on:
-- pitches thrown this game
-- pitcher's average game ERA this season for all games excluding this one.
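As a rough sketch of how the rows for such a regression might be assembled from one pitcher's game log: the layout of the log, the function name, and the toy numbers below are all assumptions for illustration, not the actual code or data behind the results here. The sketch also assumes innings are real fractions (6 2/3, not the "6.2" box-score notation), that every start has at least one out recorded, and it skips the days-rest filter.

```python
def regression_rows(game_log):
    """game_log: list of (pitches, innings, runs), one per start, in date order.

    Returns one row per consecutive pair of starts:
    (pitches this game, average game RA excluding this game, next game's RA).
    """
    game_ra = [9.0 * runs / innings for _, innings, runs in game_log]
    total = sum(game_ra)
    n = len(game_ra)
    rows = []
    for i in range(n - 1):
        # Leave the current game out of the pitcher's own baseline.
        avg_excluding_this = (total - game_ra[i]) / (n - 1)
        rows.append((game_log[i][0], avg_excluding_this, game_ra[i + 1]))
    return rows


# Toy log with made-up numbers: a shutout, a 3-inning blowup, a so-so start.
log = [(110, 9.0, 0), (60, 3.0, 7), (95, 6.0, 3)]
rows = regression_rows(log)
for pitches, baseline, next_ra in rows:
    print(pitches, round(baseline, 2), round(next_ra, 2))
```

The rows from every pitcher-season would then be stacked and fed to whatever regression routine is on hand, with the third column as the value being predicted.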
When I did that, what happened?
The effect of pitches thrown disappeared, almost entirely.  It went down to -.0004 in ERA, and wasn't even close to significant (p=.79).  Basically, the number of pitches thrown had no effect at all on the next start.
-----
So I think what JC and Sean found is not at all related to arm fatigue.  It's just a consequence of the fact that their model retroactively required all the starts to add up to zero, relative to that pitcher's season average.  And so, when one start is positive, the other starts simply have to work out to be negative, to cancel out.  That makes it look like a good start causes a bad start, which makes it look like a high-pitch start causes a bad start.
But that's not true.  And, as it turns out, when we correct for the zero-sum situation, the entire effect disappears.  And so it doesn't look to me like pitches thrown has any connection to subsequent performance.
UPDATE: I took JC/Sean's regression and added one additional predictor variable -- ERA in the first game, the game corresponding to the number of pitches.
Once you control for ERA that game, the number of pitches became completely non-significant (p=.94), and its effect on ERA was pretty much zero (-0.00014).
That is: if you give up the same number of runs in two complete games, but one game takes you 90 pitches, and the other takes you 130 pitches ... well, there's effectively no difference in how well you'll pitch the following game.
That is strongly supportive of the theory that number of pitches is significant in the study's regression only because it acts as a proxy for runs allowed.
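The proxy idea can be illustrated on simulated data where, by construction, pitch count has no effect on the next start. Everything in the sketch below is an assumption made for illustration: the parameters are invented, innings are ignored, and the small least-squares routine is a generic one, not the software used for the regressions above. The same regression is fit twice, once with pitches and the unadjusted season average, and once with the current game's RA added.

```python
import random


def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination. Each row is a list of predictors;
    include a leading 1.0 for the intercept."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


rng = random.Random(2)
without_control, with_control, next_ra = [], [], []
for _ in range(2000):                       # simulated pitcher-seasons
    talent = rng.gauss(4.5, 0.7)            # made-up true RA level
    ra = [max(0.0, rng.gauss(talent, 3.0)) for _ in range(30)]
    # Pitch count depends only on how this game went, plus noise.
    pc = [115 - 4 * r + rng.gauss(0, 10) for r in ra]
    season_avg = sum(ra) / len(ra)          # unadjusted: includes every game
    for i in range(len(ra) - 1):
        without_control.append([1.0, pc[i], season_avg])
        with_control.append([1.0, pc[i], season_avg, ra[i]])
        next_ra.append(ra[i + 1])

b_without = ols(without_control, next_ra)
b_with = ols(with_control, next_ra)
print(f"pitches coefficient, this game's RA not included: {b_without[1]:+.4f}")
print(f"pitches coefficient, this game's RA included:     {b_with[1]:+.4f}")
```

If pitches are only standing in for runs allowed, the first coefficient should be positive and the second should shrink toward zero. The exact values depend on the invented parameters and the seed.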

16 Comments:
As I was preparing this post, JC posted that he adjusted his ERA variable, and his results didn't change much.
However, for my part, as I said above, when I did that, I got the result to flip sign.
Will have to investigate. Anybody have any ideas?
No ideas, just wanted to say I really like the way you laid out this post. Very easy to follow, and I learned quite a bit too.
Thank you!
Good post Phil. You explained the problem really clearly. Bad news: a few weeks ago when I was reading an older study by JC (on lineup protection), it had the same problem. For that one, it's surprising no one caught it in peer review. In any case, the adjustment you did, to use the mean without that observation, is sometimes used for these problems. Another simple option is to put in a pitcher-year fixed effect (dummy variable for each pitcher-year). Should give roughly similar results.
Butler Blue,
Wouldn't a pitcher-year dummy variable still have the same problem? So long as the coefficient is the same for every game that season, you're still comparing the player to his average. Aren't you?
Right, never mind about the fixed effects. Doesn't work for the same reasons; regressing on a variable from last start, not this one, with a 'small' number of starts.
But I would still be looking for a way to avoid regressing on the average ERA from this year (even taking out the one observation). You gave us one really plausible story for how last start's pitch count relates to today's ERA, via last start's ERA. I'm sure we could come up with (less severe but still relevant) stories about how ERA from two, three, four starts ago is also related... The pitcher had a bad start two games ago, so he was rested and went deeper in his last start. Etc.
It would be good to get as far from all of these as possible. Maybe looking at performance relative to average ERA from the previous season, etc. might be another step in the right direction.
Shouldn't you ignore the ERA (or RA) in both the first game and the second game?
Imagine a pitcher with a two start season. The first one he throws 150 pitches and has a 2 ERA. The second one he throws 50 pitches and has a 6 ERA. You would think that sequence would support the hypothesis, but your test would throw out the first start, say his ERA for the rest of the season is 6 and there's no effect from the high pitch start. Really, you need to say you can't use this guy until he has a 3rd start to compare with these.
Interesting thought ... you may have a point.
Except that if a pitcher has only two starts, you'd be ignoring the first one anyway, since the study ignores any start with more than 15 days rest. So either way, the pitcher with only two starts doesn't make the study.
Any other cases that could cause a problem? And are there cases where *removing* the subsequent start from the average, as Anonymous suggests, would cause a problem?
(I'm the Anonymous from earlier) I've heard another theory you could test. It's not the number of pitches, but the number of pitches while tired. It could be that the first N pitches are "free" and only after that does fatigue start to take a toll.
Mike Roca
An idea on why you end up getting a negative coefficient:
I would imagine that pitchers with higher ERA last start are expected to have high ERA this start, independent of pitch count. This could be pure streakiness, park effects of long homestands/roadtrips, Coors hangover effect, or really anything.
A pitcher gave up runs last game and gets pulled early. And this start he gives up a lot of runs, because of streakiness, park effects, etc. Low pitches last start, high ERA this start. It could hide a real effect of pitch count.
Saw the update. Is that taking JC/Sean's original setup and just adding in last start's ERA or are you doing all of your other adjustments?
Thinking beyond pitch counts, it just seems like numbers that are driven by managers' decisions are going to be difficult to use in a standard regression, right? There's just so many things going into that decision that it's hard to control for all the right things.
Yup, that was taking two of JC/Sean's original variables -- unadjusted season ERA, and pitches -- and adding the one variable for single-game ERA, the same game as the one where the pitches were counted.
Wonderful stuff, Phil. Though JC and Sean certainly deserve props for tackling such an important issue.
Butler Blue's commentary is just terrific as well, imo.
The essence of James' recent article on streakiness is relevant here, I think. It seems obvious that he asked the right question there, and its twin here is "What would have happened in a parallel universe where the manager decided to pull the pitchers earlier in the previous game?". That's easier asked than answered.
Obviously there is considerable overlap in these two studies, and every discovered factor that weakens one argument will strengthen the other. This by mathematical definition, for the most part.
Butler Blue's point about managerial decisions is well taken, and wholly commonsensical. But if managers were perfect, then the effect would be completely invisible to this math. And it isn't, there is still something showing. So this should be within our reach.
I'm at a dead end myself, though I feel that James' parallel universe mindset of recent years ... it's the right way forward on this issue, and others, or so I suspect. Time will tell.
Terrific discussion in any case, guys. Hopefully it keeps moving forward.
Thanks for this post. I learned a lot. I have other questions, and maybe someone can point me to where I can find answers.
If single-game high pitch counts are meaningless, what about high pitch counts for the season? For one game, throwing 120 pitches means nothing. But what about the long-term effects of several--say, 15--of those in a season?
Also, what about the common practice now of limiting the innings per season of young pitchers? By which I mean, the stair-stepping of, oh, 80 innings one year, then 110 innings this year, and maybe 150 innings next year. Is there data to support that?
@Vic: I haven't read a lot of Bill's recent stuff, but how you're describing his mindset is exactly where I'm coming from. I tend to think of all of this in terms of counterfactual outcomes (the alternate universe where the manager pulled the pitcher).
Just curious: have you looked into the incremental impact of higher pitch counts on pitcher performance intragame? That is, pitcher effectiveness in the 0-50 pitch count bucket, vs. 51-100, vs. 100-120, 120-140, and so on. That is, what is the additional performance risk assumed by the manager by leaving a pitcher in for additional pitches beyond 100, if any?