Friday, June 23, 2017

Juiced baseballs, part II

Last post, I showed that MGL found the ball-to-ball variation (SD) among MLB baseballs to be worth about 7 feet of distance on a typical fly ball. I wondered whether that was truly the case, or whether some of it wasn't real, just imprecision due to measurement error.

After some Twitter conversations that led me to other sources, I'm leaning to the conclusion that the variance is real.

------

Two of the three measurements in MGL's study (co-authored with Ben Lindbergh) were the circumference of the baseball and its average seam height. For both of those factors, the higher the measure, the more air resistance, and therefore the shorter the distance travelled.

It occurred to me -- why not measure distance directly, if that's what you're interested in? MGL told me, on Twitter, that that's been done. I found one study via a Google search (a study that Kevin later linked to in a comment).

That study took MLB balls -- a dozen to a box -- fired them from a cannon one by one, and observed how far each travelled. Crucially, the authors adjusted that distance for the original speed and angle, because the cannon itself produces variations in initial conditions. So, what remains is mostly about the ball.

For one of the two boxes, the balls varied (SD) by 8 feet. For the second box, the SD was only 3 feet.

It's still possible that some of that variation is due to initial conditions that weren't controlled for, like small fluctuations in temperature, or air movement within the flight path, or whatever. Fortunately, the authors repeated the procedure, but for a single ball fired multiple times. 

The SD for the single ball was 3 feet.

Using the usual method, we know

SD(different balls)^2 = SD(single ball)^2 + SD(ball differences)^2

That means for the first box, we estimate that the balls vary by 7 feet. For the second box, it's 0 feet. That's a big difference. Fortunately again, the authors repeated the procedure for different types of balls.

NCAA balls have higher seams and therefore less carry. The study found an overall SD of 11 feet, and single ball variation of 2 feet. That means different balls vary by an expected 10.8 feet, which I'll round to 11. 

For minor league balls, the study found an SD of 8 feet overall, but didn't test single balls. Taking 3 feet as a representative estimate for single-ball variation, we get that MiLB balls vary by 7 feet. (8 squared minus 3 squared equals 7 squared, roughly.)
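Here's the quadrature arithmetic as a quick Python sketch, using the SDs quoted above:

from math import sqrt

def ball_to_ball_sd(sd_overall, sd_single_ball):
    """SD attributable to ball-to-ball differences, after subtracting
    the single-ball (repeat-firing) variation in quadrature."""
    return sqrt(max(sd_overall ** 2 - sd_single_ball ** 2, 0.0))

print(ball_to_ball_sd(8, 3))    # MLB, first box:  ~7.4 ft
print(ball_to_ball_sd(3, 3))    # MLB, second box:  0.0 ft
print(ball_to_ball_sd(11, 2))   # NCAA:            ~10.8 ft
print(ball_to_ball_sd(8, 3))    # MiLB (using 3 ft as the single-ball guess): ~7.4 ft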

So we have:

-- MLB  balls (second box) vary  0 feet in air resistance
-- MLB  balls (first box)  vary  7 feet in air resistance
-- MiLB balls vary  7 feet in air resistance
-- NCAA balls vary 11 feet in air resistance

In that light, the 7 feet found in MGL's study doesn't seem out of line. Actually, that 7 feet is a bit of an overestimate. It includes variation in COR (bounciness), which doesn't factor into air resistance, as far as I can tell. Limiting only to air resistance, MGL's study found an SD of only 6 feet.

-----

One thing I noticed in the MGL data is that even for balls within the same era, the COR "bounciness" measure correlates highly with both circumference (-.46 overall) and seam height (-.35 overall). (For the 10 balls after the 2016 All-Star break, it's -.36 and -.56, respectively.)

I don't know if those measures are related on some kind of physics basis, or if it's just coincidence that they varied together that way. 

-----

One thing I wonder: are balls within the same batch (whether the definition of "batch" is a box, a case, or a day's production) more uniform than balls from different batches? I haven't found a study that tells us that. From MGL's data, and treating day of use as a "batch," my eyeballs say batches are slightly more uniform than expected, but not much. My eyeballs could be wrong.

If batches *are* more uniform, teams could get valuable information by grabbing a few balls from today's batch, and getting them tested in advance. They'd be more likely to know, then, if they were dealing with livelier or deader balls that night.

Even if balls within a batch are no more uniform than balls from different batches, the testing would still be worthwhile. I don't know if any teams actually did this, but if any of them were testing balls in 2016, they'd have had advance knowledge that the balls were getting livelier.

I have no idea what a team would do with that information, that home runs were about to jump significantly over last year ... but you'd think it would be valuable some way.

-----

MGL tweeted, and I agreed, that it doesn't take much variation in a ball to make a huge difference to home run rates. He also thinks that any change in liveliness is likely to have been inadvertent on the part of the manufacturer, since it takes so little to make balls fly farther. I agree with that too.

But, why are MLB standards so lenient? As Lindbergh quotes from an earlier report,


" ... two baseballs could meet MLB specifications for construction but one ball could be theoretically hit 49.1 feet further."

Why doesn't MLB just put tighter control on the baseballs it uses? If the manufacturers can't make baseballs that precise, just put out a net at a standard distance, fire all the balls, and discard (or save for batting practice) all the balls that land outside the net. (That can't be so hard, can it? It can't be that the cannon would damage the balls too much, since MLB reuses balls that have been hit for line drives, which is a much more violent impact.)

You could even assign the balls to different liveliness groups, and require that different batches be stored at different humidor settings to equalize their bounciness.

Even if that's not practical, couldn't MLB, at least, test the balls regularly, so as to notice the variation before it shows up so obviously in the HR totals?

-----

Finally, one last thought I had. If a ball is hit for a deep fly ball, doesn't that suggest that, at least as a matter of probability, it's juicier than average? If I were the pitching team, I might not want to pitch that ball again. It might be an expected difference of only a foot or two, but every little bit helps.

Monday, June 19, 2017

Are some of today's baseballs twice as lively as others?

Over at The Ringer, Ben Lindbergh and Mitchel Lichtman (MGL) claim to have evidence of a juiced ball in MLB.

They got the evidence in the most direct way possible -- by obtaining actual balls, and having them tested. MLB sells some of their game-used balls directly to the public, with certificates of authenticity that include the date and play in which the ball was used. MGL bought 36 of those balls, and sent them to a lab for testing.

It never once occurred to me that you could do that ... so simple an idea, and so ingenious! Kudos to MGL. I wonder why mainstream sports journalists didn't think of it? It would be trivial for Sports Illustrated or ESPN to arrange for that.

Anyway ... it turned out that the 13 more recent balls -- the ones used in 2016 -- were indeed "juicier" than the 10 older balls used before the 2015 All-Star break. Differences in COR (Coefficient of Restitution, a measure of "bounciness"), seam height, and circumference were all in the expected "juicy" direction in favor of the newer baseballs. (The difference was statistically significant at 2.6 SD.)

The article says,


"While none of these attributes in isolation could explain the increase in home runs that we saw in the summer of 2015, in combination, they can."

If I read that right, it means the magnitude of the difference in the balls matches the magnitude of the increase in home runs. The sum of the three differences translated to the equivalent of 7.1 feet in fly ball distance.

The authors posted the results of the lab tests, for each of the 36 balls in the study; you can find their spreadsheet here.

-------

One thing I noticed: there sure is a lot of variation between balls, even within the same era, even used on the same day. Consider, for instance, the balls marked "MSCC0041" and "MSCC0043," both used on June 15, 2016.

The "43" ball had a COR of .497, compared to .486 for the "41" ball. That's a difference of 8 feet (I extrapolated from the chart in the article).

The "43" ball had a seam height of .032 inches, versus .046 for the other ball. That's a difference of *17 feet*.

The "43" ball had a circumference of 9.06 inches, compared to 9.08. That's another 0.5 feet.

Add those up, and you get that one ball, used the same day as another, was twenty-five feet livelier.

If 7.1 feet (what MGL observed between seasons) is worth, say, 30 percent more home runs, then the 25-foot difference means the "43" ball is worth DOUBLE the home runs of the "41" ball. And that's for two balls that look identical, feel identical, and were used in MLB game play on exactly the same day.
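Here's that adding-up in code form. The per-unit distance factors are just backed out of the three comparisons above (which I eyeballed from the article's chart), so treat them as rough assumptions, not physics constants:

# Per-unit effects backed out of the three comparisons above -- rough assumptions.
FT_PER_COR_POINT = 8.0 / 0.011    # higher COR carries farther
FT_PER_INCH_SEAM = 17.0 / 0.014   # higher seams carry less
FT_PER_INCH_CIRC = 0.5 / 0.02     # bigger circumference carries less

def extra_distance(cor, seam, circ, ref_cor, ref_seam, ref_circ):
    """Estimated extra fly-ball distance of one ball over a reference ball, in feet."""
    return ((cor - ref_cor) * FT_PER_COR_POINT
            + (ref_seam - seam) * FT_PER_INCH_SEAM
            + (ref_circ - circ) * FT_PER_INCH_CIRC)

# The "43" ball vs. the "41" ball, both used June 15, 2016:
print(round(extra_distance(.497, .032, 9.06, .486, .046, 9.08), 1))   # ~25.5 feet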

-----

That 25-foot difference is bigger than typical, because I chose a relative outlier for the example. But the average difference is still pretty significant. Even within eras, the SD of difference between balls (adding up the three factors) is 7 or 8 feet.

Which means, if you take two random balls used on the same day in MLB, on average, one of them is *40 percent likelier* to be hit for a home run.

Of course, you don't know which one. If it were possible to somehow figure it out in real time during a game, what would that mean for strategy?


-----

UPDATE: thinking further ... could it just be that the lab tests aren't that precise, and the observed differences between same-era balls are mostly random error? 

That would explain the unintuitive result that balls vary so hugely, and it would still preserve the observation that the eras are different.

Thursday, May 25, 2017

Pete Palmer on luck vs. skill

Pete Palmer has a new article on skill and luck in baseball, in which he crams a whole lot of results into five pages. 

It's called "Calculating Skill and Luck in Major League Baseball," and appears in the new issue of SABR's "Baseball Research Journal."  It's downloadable only by SABR members at the moment, but will be made publicly available when the next issue comes out this fall.

For most of the results, Pete uses what I used to call the "Tango method," which I should call the "Palmer method," because I think Pete was actually the first to use it in the context of sabermetrics, in the 2005 book "Baseball Hacks."  (The mathematical method is very old; Wikipedia says it's the "Bienaymé formula," discovered in 1853. But its use in sabermetrics is recent, as far as I can tell.)

Anyway, to go through the method yet one more time ... 

Pete found that the standard deviation (SD) of MLB season team wins, from 1981 to 1990, was 9.98. Mathematically, you can calculate that the expected SD of luck is 6.25 wins. Since a team's wins is the total of (a) its expected wins due to talent, and (b) deviation due to luck, the 1853 formula says

SD(actual)^2 = SD(talent)^2 + SD(luck)^2

Subbing in the numbers, we get

9.98 ^ 2 = SD(talent)^2 + 6.25^2 

Which means SD(talent) = 7.78.
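In code, that arithmetic is just:

from math import sqrt

sd_actual = 9.98   # observed SD of team wins, 1981-1990
sd_luck   = 6.25   # expected SD of luck over a season

sd_talent = sqrt(sd_actual ** 2 - sd_luck ** 2)
print(round(sd_talent, 2))   # 7.78 wins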

In terms of the variation in team wins for single seasons from 1981 to 1990, we can estimate that differences in skill were only slightly more important than differences in luck -- 7.8 games to 6.3 games.

------

That 7.8 is actually the narrowest range of team talent for any decade. Team skill has been narrowing since the beginning of baseball, but seems to have widened a bit since 1990. Here's part of Pete's table:

decade 
ending   SD(talent)
-------------------
 1880     9.93
 1890    14.44
 1900    14.72
 1910    15.33
 1920    13.06
 1930    12.51
 1940    13.66
 1950    12.99
 1960    11.95
 1970    11.17
 1980     9.75
 1990     7.78
 2000     8.46
 2010     9.87
 2016     8.91

Anyway, we've seen that many times, in various forms (although perhaps not by decade). But that's just the beginning of what Pete provides. I don't want to give away his entire article, but here some of the findings I hadn't seen before, at least not in this form:

1. For players who had at least 300 PA in a season, the spread in batting average is caused roughly equally by luck and skill.

2. Switching from BA to NOPS (normalized on-base plus slugging), skill now surpasses luck, by an SD of 20 points to 15.

3. For pitchers with 150 IP or more, luck and skill are again roughly even.

In the article, these are broken down by decade. There's other stuff too, including comparisons with the NBA and NFL (OK, that's not new, but still). Check it out if you can.

-------

OK, one thing that surprised me. Pete used simulations to estimate the true talent of teams, based on their W-L record. For instance, teams who win 95-97 games are, on average, 5.6 games lucky -- they're probably 90 or 91-win talents rather than 96.
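Pete got that from simulations; as a rough cross-check (my arithmetic, not his method), the standard shrinkage factor built from the decade numbers above gives a similar answer:

sd_talent, sd_luck = 7.78, 6.25
shrink = sd_talent ** 2 / (sd_talent ** 2 + sd_luck ** 2)   # share of a deviation that's "real"

wins = 96                                  # middle of the 95-97 win group
talent = 81 + shrink * (wins - 81)
print(round(talent, 1), round(wins - talent, 1))   # ~90.1 talent, ~5.9 wins of luck

That's in the same ballpark as Pete's 5.6 games of luck.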

That makes sense, and is consistent with other studies that tried to figure out the same thing. But Pete went one step further: he found actual teams that won 95-97 games, and checked how they did next year.

For the year in question, you'd expect them to have been 91 win teams. For the following year, you'd expect them to be *worse* than 91 wins, though. Because, team talent tends to revert to .500 over the medium term, unless you're a Yankee dynasty or something.

But ... for those teams, the difference was only six-tenths of a win. Instead of being 91 wins (90.8), they finished with an average of 90.2.

I would have thought the difference would have been more than 0.6 wins. And it's not just this group. For teams who finished between 58 and 103 wins, no group regressed more than 1.8 wins beyond their luck estimate. 

I guess that makes sense, when you think about it. A 90-win team is really an 87-win talent. If they regress to 81-81 over the next five seasons, that's only about one win per year. It's my intuition that was off, and it took Pete's chart to make me see that.

Wednesday, May 17, 2017

The hot hand debate vs. the clutch hitting debate

In the "hot hand" debate between Guy Molyneux and Joshua Miller I posted about last time, I continue to accept Guy's position, that "the hot hand has a negligible impact on competitive sports outcomes."

Josh's counterargument is that some evidence for a hot hand has emerged, and it's big. That's true: after correcting for the error in the Gilovich study, Josh did find evidence for a hot hand in the shooting data of Gilovich's experiment. He also found a significant hot hand in the NBA's three-point shooting contest.

I still don't believe that those necessarily suggest a similar hot hand "in the wild" (as Guy puts it), especially considering that to my knowledge, none has been found in actual games. 

As Guy says,


"Personally, I find it easy to believe that humans may get into (and out of) a rhythm for some extremely repetitive tasks – like shooting a large number of 3-point baskets. Perhaps this kind of “muscle memory” momentum exists, and is revealed in controlled experiments."

-------

Of course, I keep an open mind: maybe players *do* get "hot" in real game situations, and maybe we'll eventually see evidence for it. 

But ... that evidence will be hard to find. As I have written before, and as Josh acknowledges himself, it's hard to pinpoint when a "hot hand" actually occurs, because streaks happen randomly without the player actually being "hot."

I think I've used this example in the past: suppose you have a 50 percent shooter when he's normal, but he turns into a 60 percent shooter when he's "hot," which is one-tenth of the time. His overall rate is 51 percent.

Suppose that player makes three consecutive shots. Does that mean he's in his "hot" state? Not necessarily. Even when he's "normal," he's going to have times where he makes three consecutive shots just by random luck. And since he's "normal" nine times as often as he's "hot," the normal streaks will outweigh the hot streaks.

Specifically, only about 16 percent of three-hit streaks will come when the player is hot. In other words, roughly five out of six streaks are false positives.

(Normally, he makes three consecutive shots one time in 8. Hot, he makes three consecutive shots one time in 4.63. In 100 sequences, he'll be "normal" 90 times, for an average of 11.25 streaks. In his 10 "hot" times, he'll make 2.16 streaks. That's about a 5:1 ratio.)

Averaging the real hotness with the fake hotness, the player will shoot about 51.6 percent after a streak. But his overall rate is 51.0 percent. It takes a huge sample size to notice the difference between 51.0 percent and 51.6 percent.
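For concreteness, here's the Bayes arithmetic behind those numbers, as a minimal sketch of the toy model above:

p_hot, p_normal = 0.10, 0.90       # how often he's hot, vs. normal
hit_hot, hit_normal = 0.60, 0.50   # shooting percentage in each state

streak_hot = hit_hot ** 3          # ~0.216, one time in 4.63
streak_normal = hit_normal ** 3    # 0.125, one time in 8

# Bayes: chance he's actually hot, given we just saw three straight makes
p_hot_given_streak = (p_hot * streak_hot) / (p_hot * streak_hot + p_normal * streak_normal)
print(round(p_hot_given_streak, 3))    # ~0.16

# Expected shooting percentage on the next shot, after a three-hit streak
next_shot = p_hot_given_streak * hit_hot + (1 - p_hot_given_streak) * hit_normal
print(round(next_shot, 3))             # ~0.516, vs. an overall rate of 0.51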

Even if you do notice a difference, does it really make an impact on game decisions? Are you really going to give the player the ball more because his expectation is 0.6 percent higher, for an indeterminate amount of time?

-------

And that's my main disagreement with Josh's argument. I do acknowledge his finding that there's evidence of a "muscle memory" hot hand, and it does seem reasonable to think that if there's a hot hand in one circumstance, there's probably one in real games. After all, *part* of basketball is muscle memory ... maybe it fades when you don't take shots in quick succession, but it still seems plausible that maybe, some days you're more calibrated than others. If your muscles and brain are slightly different each day, or even each quarter, it's easy to imagine that some days, the mean of your instinctive shooting motion is right on the money, but, other days, it's a bit short.

But the argument isn't really about the *existence* of a hot hand -- it's about the *size* of the hot hand, whether it makes a real difference in games. And I think Guy is right that the effect has to be negligible. Because, even if you have a very large change in talent, from 50 percent to 60 percent -- and a significant frequency of "hotness," 10 percent of the time -- you still only wind up with about a 0.6 percent increased expectation after a streak of three hits.

You could argue that, well, maybe 50 to 60 percent understates the true effect ... and you could get a stronger signal by looking at longer streaks.

That's true. But, to me, that argument actually *hurts* the case for the hot hand. Because, with so much data available, and so many examples of long streaks, a signal of high-enough strength should have been found by now, no? 


-------

This debate, it seems to me, echoes the clutch hitting debate almost perfectly.

For years, we framed the state of the evidence as "clutch hitting doesn't exist," because we couldn't find any evidence of signal in the noise. Then, a decade ago, Bill James published his famous "Underestimating the Fog" essay, in which he argued (and I agree) that you can't prove a negative, and the "fog" is so thick that there could, in fact, be a true clutch hitting talent, that we have been unable to notice.

That's true -- clutch hitting talent may, in fact, exist. But ... while we can't prove it doesn't exist, we CAN prove that if it does exist, it's very small. My study (.pdf) showed the spread (SD) among hitters would have to be less than 10 points of batting average (.010). "The Book" found it to be even smaller, .008 of wOBA (a metric that includes all offensive components, but is scaled to look like on-base percentage). 

In my experience, a sizable part of the fan community seizes on the "clutch hitting could be real" finding, but ignores the "clutch hitting can't be any more than tiny" finding. 

The implicit logic goes something like, 

1. Bill James thinks clutch hitting exists!
2. My favorite player came through in the clutch a lot more than normal!
3. Therefore, my favorite player is a clutch hitter who's much better than normal when it counts!

But that doesn't follow. Most strong clutch hitting performances will happen because of luck. Your great clutch hitting performance is probably a false positive. Sure, a strong clutch performance is more likely to happen given that a player is truly clutch, but, even then, with an SD of 10 points, there's no way your .250 hitter who hit .320 in the clutch is anything near a .320 clutch hitter. If you did the math, maybe you'd find that you should expect him to be .253, or something. 

Well, it's the same here, with the hot hand:

1. Miller and Sanjurjo found a real hot hand!
2. Therefore, hot hand is not a myth!
3. My favorite player just hit his last five three-point attempts!
4. Therefore, my player is hot and they should give him the ball more!

Same bad logic. Most streaks happen because of luck. The streak you just saw is probably a false positive. Sure, streaks will happen given that a player truly has a hot hand, but, even then, given how small the effect must be, there's no way your usual 50-percent-guy is anything near superstar level when hot. If you had the evidence and did the math, maybe you'd find that you should expect him to be 52 percent, or something.

-------

For some reason, fans do care about whether clutch hitting and the hot hand actually happen, but *don't* care how big the effect is. I bet psychologists have a cognitive fallacy for this, the "Zero Shades of Grey" fallacy or the "Give Them an Inch" fallacy or the "God Exists Therefore My Religion is Correct" fallacy or something, where people are unwilling to believe something into existence -- but, once given license to believe, are willing to assign it whatever properties their intuition comes up with.

So until someone shows us evidence of an observable, strong hot hand in real games, I would have to agree with Guy:

"... fans’ belief in the hot hand (in real games) is a cognitive error."  

The error is not in believing the hot hand exists, but in believing the hot hand is big enough to matter. 

Science may say there's a strong likelihood that intelligent life exists on other planets -- but it's still a cognitive error to believe every unexplained light in the sky is an alien flying saucer.

Wednesday, April 26, 2017

Guy Molyneux and Joshua Miller debate the hot hand

Here's a good "hot hand" debate between Guy Molyneux and Joshua Miller, over at Andrew Gelman's blog.

A bit of background, if you like, before you go there.

-----

In 1985, Thomas Gilovich, Robert Vallone, and Amos Tversky published a study refuting the "hot hand" hypothesis, which is the assumption that after a player has recently performed exceptionally well, he is likely to be "hot" and continue to perform exceptionally well.

The Gilovich [et al] study showed three results:

1. NBA players were actually *worse* after recent field goal successes than after recent failures;

2. NBA players showed no significant correlation between their first free throw and second free throw; and

3. In an experiment set up by Gilovich, which involved long series of repeated shots by college basketball players, there was no significant improvement after a series of hits.

-----

Then, in 2015-2016, Joshua Miller and Adam Sanjurjo found a flaw in Gilovich's reasoning. 

The most intuitive way to describe the flaw is this:

Gilovich assumed that if a player shot (say) 50 percent over the full sequence of 100 shots, you'd expect him to shoot 50 percent after a hit, and 50 percent after a miss.

But this is clearly incorrect. If a player hit 50 out of 100, then, if he made his (or her) first shot, what's left is 49 out of 99. You wouldn't expect 50%, then, but only about 49.5%. And, similarly, you'd expect 50.5% after a miss.

By assuming 50%, the Gilovich study set the benchmark too high, and would call a player cold or neutral when he was actually neutral or hot.

(That's a special case of the flaw Miller and Sanjurjo found, which applies only to the "after one hit" case. For what happens after a streak of two or more consecutive hits, it's more complicated. Coincidentally, the flaw is actually identical to one that Steven Landsburg posted for a similar problem, which I wrote about back in 2010. See my post here, or check out the Miller paper linked to above.)
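To see the bias in action, here's a small simulation (my sketch, not Miller and Sanjurjo's code) of the Gilovich-style measure -- the within-sequence hit rate on shots that immediately follow a streak -- for a shooter who is a pure 50 percent coin flip:

import numpy as np

rng = np.random.default_rng(0)

def avg_rate_after_streak(streak, n_shots=100, p=0.5, n_seqs=20_000):
    """Average, across sequences, of the within-sequence hit rate on shots
    that immediately follow `streak` consecutive hits (the Gilovich-style measure)."""
    rates = []
    for _ in range(n_seqs):
        shots = rng.random(n_shots) < p
        after = [shots[i] for i in range(streak, n_shots) if shots[i - streak:i].all()]
        if after:                      # skip sequences with no qualifying shots
            rates.append(np.mean(after))
    return float(np.mean(rates))

print(avg_rate_after_streak(1))   # ~0.495, not 0.500
print(avg_rate_after_streak(3))   # several points below 0.500

Even though every shot is an independent 50/50, the average of the per-sequence rates comes out below 50 percent, and the shortfall grows with the length of the streak you condition on. That's the benchmark error.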

------

The Miller [and Sanjurjo] paper corrected the flaw, and found that in Gilovich's experiment, there was indeed a hot hand, and a large one. In the Gilovich paper, shooters and observers were allowed to bet on whether the next shot would be made. The hit rate was actually seven percentage points higher when they decided to bet high, compared to when they decided to bet low (for example, 60 percent compared to 53 percent).

That suggests that the true hot hand effect must be higher than that -- because, if seven percentage points was what the participants observed in advance, who knows what they didn't observe? Maybe they only started betting when a streak got long, so they missed out on the part of the "hot hand" effect at the beginning of the streak.

However, there was no evidence of a hot hand in the other two parts of the Gilovich paper. In one part, players seem to hit field goals *worse* after a hit than after a miss -- but, corrected for the flaw, it seems (to my eye) that the effect is around zero. And, the "second free throw after the first" doesn't feature the flaw, so the results stand.

------

In addition, in a separate paper, Miller and Sanjurjo analyzed the results of the NBA's three-point contest, and found a hot hand there, too. I wrote about that in two posts in 2015. 

-------

From that, Miller argues that the hot hand *does* exist, and we now have evidence for it, and we need to take it seriously, and it's not a cognitive error to believe the hot hand represents something real, rather than just random occurrences in random sequences. 

Moreover, he argues that teams and players might actually benefit from taking a "hot hand" into account when formulating strategy -- not in any specific way, but, rather, that, in theory, there could be a benefit to be found somewhere.

He also uses an "absence of evidence is not evidence of absence"-type argument, pointing out that if all you have is binary data, of hits and misses, there could be a substantial hot hand effect in real life, but one that you'd be unable to find in the data unless you had a very large sample. I consider that argument a parallel to Bill James' "Underestimating the Fog" argument for clutch hitting -- that the methods we're using are too weak to find it even if it were there.

------

And that's where Guy comes in. 

Here's that link again. Be sure to check the comments ... most of the real debate resides there, where Miller and Guy engage each other's arguments directly.

Friday, March 24, 2017

Career run support for starting pitchers

For the little study I did last post, I used Retrosheet data to compile run support stats for every starting pitcher in recent history (specifically, pitchers whose starts all came in 1950 or later).

Comparing every pitcher to his teammates, and totalling up everything for a career ... the biggest "hard luck" starter, in terms of total runs, is Greg Maddux. In Maddux's 740 starts, his offense scored 238 fewer runs than it did for his teammates in those same seasons. That's a shortfall of 0.32 runs per game.

Here's the top six:

Runs   GS   R/GS  
--------------------------------
-238  740  -0.32  Greg Maddux
-199  773  -0.26  Nolan Ryan
-192  707  -0.27  Roger Clemens
-168  430  -0.39  A.J. Burnett
-167  690  -0.24  Gaylord Perry
-164  393  -0.42  Steve Rogers

Three of the top five are in the Hall of Fame. You might expect that to be the case, since, to accumulate a big deficiency in run support, you have to pitch a lot of games ... and guys who pitch a lot of games tend to be good. But, on the flip side, the "good luck" starters, whose teams scored more for them than for their teammates, aren't nearly as good:

Runs   GS   R/GS  
--------------------------------
+238  364  +0.65  Vern Law
+188  458  +0.41  Mike Torrez
+170  254  +0.67  Bryn Smith
+151  297  +0.51  Ramon Martinez
+147  355  +0.41  Mike Krukow
+143  682  +0.21  Tom Glavine

The only explanation for the difference that I can think of is that to have a long career despite bad run support, you have to be a REALLY good pitcher. To have a career of the same length with good run support, you can just be PRETTY good.

But, that assumes that teams pay a lot of attention to W-L record, which would be the biggest statistical reflection of run support. And, we're only talking about a difference of around half a run per game. 

Another possibility: pitchers who are the ace of the staff usually start on opening day, where they face the other team's ace. So, that game, against a star pitcher, they get below-average support. Maybe, because of the way rotations work, they face better pitchers more often, and that's what accounts for the difference. Did Bill James study this once?

In any event, just taking the opening day games ... if those games are one run below average for the team, and Nolan Ryan got 20 of those starts, there's 20 of his 199 runs right there.

--------

UPDATE: see the comments for suggestions from Tango and GuyM.  The biggest one: GuyM points out that good pitchers lead to more leads, which means fewer bottom-of-the-ninth runs scored when they pitch at home.  Back-of-the-envelope estimate: suppose a great pitcher means the team goes 24-8 in his starts, instead of 16-16.  That's 8 extra wins, which is 4 extra wins at home -- 4 skipped bottom-of-the-ninth innings at roughly half a run each -- or about 2 runs over a season, which is 30 runs over 15 good seasons like that.
--------

Here are the career highs and lows on a per-game basis, minimum 100 starts:

Runs   GS   R/GS  
--------------------------------
- 85  106  -0.80  Ryan Franklin
- 94  134  -0.70  Shawn Chacon
-135  203  -0.66  Ron Kline
- 72  116  -0.62  Shelby Miller
-154  249  -0.62  Denny Lemaster
- 68  115  -0.59  Trevor Wilson

Runs   GS   R/GS  
--------------------------------
+127  164  +0.77  Bill Krueger
+ 82  108  +0.76  Rob Bell
+ 89  118  +0.76  Jeff Ballard
+ 81  110  +0.73  Mike Minor
+170  254  +0.67  Bryn Smith
+106  161  +0.66  Jake Arrieta
+238  364  +0.65  Vern Law

These look fairly random to me.

-------

Here's what happens if we go down to a minimum of 10 starts:

Runs   GS   R/GS  
---------------------------------
- 29   12  -2.40  Angel Moreno
- 30   13  -2.29  Jim Converse
- 23   11  -2.25  Mike Walker
- 20   11  -1.86  Tony Mounce
- 25   14  -1.81  John Gabler

Runs   GS   R/GS  
---------------------------------
+ 32   11  +2.91  J.D. Durbin
+ 43   17  +2.56  John Strohmayer
+ 58   25  +2.30  Colin Rea
+ 61   28  +2.16  Bob Wickman
+ 23   11  +2.33  John Rauch

-------

It seems weird that, for instance, Bob Wickman would get such good run support in as many as 28 starts, his team scoring more than two extra runs a game for him. But, with 2,169 pitchers in the list, you're going to get these kinds of things happening just randomly.

The SD of team runs in a game is around 3. Over 36 starts, the SD of average support is 3 divided by the square root of 36, which works out to 0.5. Over Wickman's 28 starts, it's 0.57. So, Wickman was about 3.8 SDs from zero.

But that's not quite right ... the support his teammates got is a random variable, too. Accounting for that, I get that Wickman was 3.7 SDs from zero. Not that big a deal, but still worth correcting for.
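Here's the arithmetic as a quick sketch; the 450 teammate starts is a number I made up just to show the shape of the correction, since the real figure depends on how many games Wickman's teammates started in those seasons:

from math import sqrt

SD_GAME_RUNS = 3.0    # rough SD of a team's runs scored in a single game

def support_z(avg_diff, pitcher_starts, teammate_starts):
    """Z-score of a pitcher's run support vs. his teammates, treating the
    teammates' average as a random variable too."""
    sd = SD_GAME_RUNS * sqrt(1.0 / pitcher_starts + 1.0 / teammate_starts)
    return avg_diff / sd

# Wickman: +2.16 runs per start over 28 starts.
print(2.16 / (SD_GAME_RUNS / sqrt(28)))   # naive version: ~3.8 SDs
print(support_z(2.16, 28, 450))           # ~3.7 SDs, with a made-up teammate sample of 450 games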

I'll call that "3.7" figure the "Z-score."  Here are the top and bottom career Z-scores, minimum 72 starts:


    Z   GS   R/GS  
--------------------------------
-3.06   72  -1.16  Kevin Gausman
-2.94  203  -0.66  Ron Kline
-2.89  249  -0.62  Denny Lemaster
-2.57  134  -0.70  Shawn Chacon
-2.57  740  -0.32  Greg Maddux

    Z   GS   R/GS  
--------------------------------
+3.79  364  +0.65  Vern Law
+3.24  254  +0.67  Bryn Smith
+3.16  164  +0.77  Bill Krueger
+3.12   93  +1.02  Roy Smith
+2.73  247  +0.56  Tony Cloninger

The SD of the overall Z-scores is 1.045, pretty close to the 1.000 we'd expect if everything were just random. But, that still leaves enough room that something else could be going on.

-------

I chose a cutoff of 72 starts to include Kevin Gausman, who is still active. Last year, the Orioles starter went 9-12 despite an ERA of only 3.61. 

Not only does Gausman have the most extreme negative Z-score among pitchers with at least 72 starts, he also has the most extreme negative Z-score among pitchers with as few as 10 starts!

Of the forty-two starters more extreme than Gausman's support shortfall of 1.16 runs per game, none of them have more than 41 starts. 

Gausman is a historical outlier, in terms of poor run support -- the hardluckest starting pitcher ever.

------

I've posted the full spreadsheet at my website, here.


UPDATE, 3/31: New spreadsheet (Excel format), updated to account for innings of run support, to correct for the bottom-of-the-ninth issue (as per GuyM's suggestion). Both methods are in separate tabs.

Thursday, March 02, 2017

How much to park-adjust a performance depends on the performance itself

In 2016, the Detroit Tigers' fielders were below average -- by about 50 runs, according to Baseball Reference. Still, Justin Verlander had an excellent season, going 16-9 with a 3.04 ERA. Should we rate Verlander's season even higher than his stat line, since he had to overcome his team's poor fielding behind him?

Not necessarily, I have argued. A team's defense is better some games than others (in results, not necessarily in talent). The fact that Verlander had a good season suggests that his starts probably got the benefit of the better games. 

I used this analogy:

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most. Because, it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. 

Except that ... it WAS run support. 

It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

------

Just for fun, I decided to run a little study to see how big the effect actually is, for pitcher run support.

I found all starters from 1950 to 2015, who:

-- played for teams with below-league-average runs scored;

-- had at least 15 starts and 15 decisions, pitching no more than 5 games in relief; and

-- had a W-L record at least 10 games above .500 (e.g. 16-6).

There were 102 qualifying pitchers, mostly from the modern era. Their average record was 20-8 (19.8-7.7). 

They played in leagues where an average 4.41 runs were scored per game, but their below-average teams scored only 4.22. 

A first instinct might be to say, "hey, these pitchers should have had a W-L record even better than they did, because their teams gave them worse run support than the league average, by 0.19 runs per start!"

But, I'm arguing, you can't say that. Run support varies from game to game. Since we're doing selective sampling, concentrating on pitchers with excellent W-L records, we're more likely to have selected pitchers who got better run support than the rest of their team.

And the results show that. 

As mentioned, the pitchers' teams scored only 4.22 runs per game that season, compared with the league average 4.41. But, in the specific games those pitchers started, their teams gave them 4.54 runs of support. 

That's not just more than the team normally scored -- it's actually even more than the league average.

4.22 team
4.41 league
4.54 these pitchers

That's a pretty large effect. The size is due in part to the fact that we took pitchers with exceptionally good records.

Suppose a pitcher goes 22-8. Because run support varies, it could be that:

-- he pitched to (say) a 20-10 level, but got better run support;
-- he pitched to (say) a 24-6 level, but got worse run support.

But it's much less common to pitch at a 24-6 level than it is to pitch at a 20-10 level. So, the 22-8 guy was much more likely to be a 20-10 guy who got good run support than a 24-6 guy who got poor run support.

The same is true for lesser pitchers, to a lesser extent. It's not that much rarer to pitch at (say) a 14-10 level than at a 12-12 level. So, the effect should be there for those pitchers, too, but it should be smaller.

I reran the study, but this time, pitchers were included if they were even one game over .500. That increased the sample size to 1024 pitcher-seasons. The average pitcher in the sample was 14-10 (14.4 and 9.7).

Here are the run support numbers:

4.15 team
4.40 league
4.32 these pitchers

This time, the effect wasn't so big that the pitchers actually got more support than the league average. But it did move them two-thirds of the way there. 

And, of course, not *every* pitcher in the study got better run support than his teammates. That figure was only 62.1 percent. The point is, we should expect it to be more than half.
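Here's a toy simulation -- mine, not the Retrosheet study -- that shows the selection effect with no pitcher skill at all. Every "pitcher" is identical, runs on both sides are Poisson with a mean of 4.2, and a win is just whoever scores more:

import numpy as np

rng = np.random.default_rng(1)

N_PITCHERS, STARTS = 20_000, 33

# Identical pitchers: offense scores Poisson(4.2) runs, pitcher allows Poisson(4.2),
# ties decided by a coin flip. Any link between record and support is pure selection.
support = rng.poisson(4.2, size=(N_PITCHERS, STARTS))
allowed = rng.poisson(4.2, size=(N_PITCHERS, STARTS))
tiebreak = rng.random((N_PITCHERS, STARTS)) < 0.5
wins = (support > allowed) | ((support == allowed) & tiebreak)

records = wins.sum(axis=1)
good = records >= 22                       # roughly 22-11, ten-plus games over .500

print(support.mean())                      # ~4.2 -- everyone's true support level
print(support[good].mean())                # noticeably higher for the good-record group

Conditioning on a good W-L record is enough, by itself, to select for above-average run support.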

-------

Suppose a player has an exceptionally good result -- an extremely good W-L record, or a lot of home runs, or a high batting average, or whatever. 

Then, in any way that it's possible for him to have been lucky or unlucky -- that is, influenced by external factors that you might want to correct for -- he's more likely to have been lucky than unlucky.

If a player hits 40 home runs in an extreme pitcher's park, he probably wasn't hurt by the park as much as other players. If a player steals 80 bases and is only caught 6 times, he probably faced weaker-throwing catchers than the league average. If a shortstop rates very high for his fielding runs one year, he was probably lucky in that he got easier balls to field than normal (relative to the standards of the metric you're using).

"Probably" doesn't mean "always," of course. It just means more than a 50 percent chance. It could be anywhere from 50.0001 percent to 99.9999 percent. (As I mentioned, it was 62.1 percent for the run support study.)

The actual probability, and the size of the effect, depends on a lot of things. It depends on how you define "extreme" performances. It depends on the variances of the performances and the factor you're correcting for. It depends on how many external factors actually affect the extreme performance you're seeing.

So: for any given case, is the effect big, or is it small? You have to think about it and make an argument. Here's an argument you could make for run support, without actually having to do the study.

In most seasons, the SD of a single team's runs per game is about 3. That means that in a season of 36 starts, the SD of average run support is 0.5 runs (which is 3 divided by the square root of 36). 

In the 2015 AL, the SD of season runs scored between teams was only 0.4 runs per game.

0.5 runs of variation between pitchers on a team
0.4 runs of variation between teams

That means, that, for a given starting pitcher's W-L record, randomness in what games he starts matters *more* than his team's overall level of run support. 

That's why we should expect the effect to be large.

There are other sources of luck that might affect a pitcher's W-L record. Home/road starts, for instance. If you find a pitcher with a good record, there's better than a 50-50 shot that he started more games at home than on the road. But, the amount of overall randomness in that stat is so small -- especially since there's usually a regular rotation -- that the expectation is probably closer to, say, 50.1 percent, than to the 62.1 percent that we found for run support.

But, in theory, the effect must exist, at some magnitude. Whether it's big enough to worry about is something you have to figure out.

I've always wanted to try to study this for park effects. I've always suspected that when a player hits 40 home runs in a pitcher's park, and he gets adjusted up to 47 or something ... that that's way too high. But I haven't figured out how to figure it out.

Monday, February 06, 2017

Are women chess players intimidated by male opponents? Part III

Over at Tango's blog, commenter TomC found an error in my last post. I had meant to create a sample of male chess players with mean 2400, but I added wrong and created a mean of 2500 instead. (The distribution of females was correct, with mean 2300.)

The effect produced by my original distribution came close to the one in the paper, but, with the correction, it drops to about half.

The effect is the win probability difference between a woman facing a man, as compared to facing a woman with the same rating:

-0.021 paper
-0.020 my error
-0.010 corrected

It makes sense that the observed effect drops when the distribution of men gets closer to the distribution of women. That's because (by my hypothesis) it's caused by the fact that women and men have to be regressed to different means. The more different the means, the larger the effect. 

Suppose the distribution matches my error, 2300 for the women and 2500 for the men. When a 2400 woman plays a 2400 man, her rating of 2400 needs to be regressed down, towards 2300. But the man's rating needs to be regressed *up*, towards 2500. That means the woman was probably overrated, and the man underrated.

But, when the men's mean is only 2400, the man no longer needs to be regressed at all, because he's right at the mean. So, only the woman needs to be regressed, and the effect is smaller.

-------

The effect becomes larger when the players are more closely matched in rating. That's when it's most likely that the woman is above her group's average, and the man is below his. The original study found a larger effect in close matches, and so did my (corrected) simulation:

.0130 -- ratings within 50 points
.0144 -- ratings within 100 points
.0129 -- ratings within 200 points
.0057 -- ratings more than 200 points apart

Why is that important?  Because, in my simulation, I chose the participants at random from the distributions. In real life, tournaments try to match players to opponents with similar ratings.

In the study, the men's ratings were higher than the women's, by an average 116 points. But when a man faced a woman, the average advantage wasn't 116 -- it was much lower. As the study says, on page 18,

"However, when a female player in our sample plays a male opponent, she faces an average disadvantage of 27 Elo points ..."

Twenty-seven points is a very small disadvantage, about one quarter of the 116 points that you'd see if tournament matchups were random. The matching of players makes the effect look larger.

So, I rejigged my simulation to make all matches closer. I chose a random matchup, and then discarded it a certain percentage of the time, varying with the ratings difference. 

If the players had the same rating, I always kept that match. If the difference was more than 400 points, I always discarded that match. In between, I kept it on a sliding scale that randomly decided which matches to keep.


(Technical details: I decided whether to keep each match with a probability corresponding to the 1.33 power of the difference. So 200-point matchups, which are halfway between 0 and 400, got kept only 1/2.52 of the time -- 2.52 being 2 to the power of 1.33.

Why 1.33? I tried a few exponents, and that one happened to get the resulting distributions of men and women closest to what was in the study; the other exponents gave very similar results anyway.)
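Here's a condensed sketch of that simulation -- my reconstruction in Python, not the actual code. Ratings are noisy measures of talent, matchups are kept with a sliding-scale probability (I'm reading it as ((400 - diff)/400)^1.33), outcomes are driven by talent, and the Elo prediction only sees ratings. The group means, SDs, and sample size are rough stand-ins, not the study's data:

import numpy as np

rng = np.random.default_rng(2)

RATING_NOISE = 52     # assumed SD of (rating - true talent), as in the post
N = 200_000           # candidate pairings (arbitrary)

def elo_expected(diff):
    """Standard Elo expected score for a rating (or talent) difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def make_players(mean, sd, n):
    talent = rng.normal(mean, sd, n)
    return talent, talent + rng.normal(0, RATING_NOISE, n)   # (talent, noisy rating)

# Rough stand-ins for the distributions discussed above, not the study's data.
w_talent,  w_rating  = make_players(2300, 130, N)   # women
m_talent,  m_rating  = make_players(2400, 150, N)   # male opponents
w2_talent, w2_rating = make_players(2300, 130, N)   # female opponents

def avg_effect(opp_talent, opp_rating):
    diff = w_rating - opp_rating
    keep = rng.random(N) < np.clip((400 - np.abs(diff)) / 400, 0, 1) ** 1.33
    actual    = elo_expected(w_talent[keep] - opp_talent[keep])   # outcomes driven by talent
    predicted = elo_expected(diff[keep])                          # what the ratings predict
    return float(np.mean(actual - predicted))

print(avg_effect(w2_talent, w2_rating))   # ~0: against women, the ratings predict fine
print(avg_effect(m_talent, m_rating))     # negative: against men, results fall short of the ratings

The women-vs-women comparison comes out near zero because both players regress toward the same mean; the women-vs-men comparison comes out negative even though there's no intimidation anywhere in the model. Its exact size depends on the distributions and on how you measure the effect.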

Now, my results were back close to what the study had come up with, in its Table 2:

.021 study
.019 simulation

To verify that I didn't screw up again, I compared my summary statistics to the study.  They were reasonably close.  All numbers are Elo ratings:

Mean 2410, SD 175: men, study
Mean 2413, SD 148: men, simulation

Mean 2294, SD 141: women, study
Mean 2298, SD 131: women, simulation

Mean 27 points: M vs. W opp. difference, study
Mean 46 points: M vs. W opp. difference, simulation

The biggest difference was in the opponents faced:

Mean 2348: men's opponents, study
Mean 2408: men's opponents, simulation

Mean 2283: women's opponents, study
Mean 2321: women's opponents, simulation

The difference here is that, in real life, the players chosen by the study played opponents worse than themselves, on average. (Part of the reason is that, in the study, only the better players (rating of 2000+) were included as "players", but all their opponents were included, regardless of skill.)  In the simulation, the players were chosen from the same distribution. 

I don't think that affects the results, but I should definitely mention it.

-------

Another thing I should mention, in defense of the study: last post, I questioned what happens when you include actual ratings in the regression, instead of just win probability (which is based on the ratings). I checked, and that actually *lowers* the observed effect, even if only a little bit. From my simulation:

.0188 not included
.0187 included

And, one more: as I mentioned last post, I chose an SD of 52 points for the difference between a player's rating and his or her actual talent. I have no idea if 52 points is a reasonable estimate; my gut suggests it's too high. Reducing the SD would also reduce the size of the observed effect.

I still suspect that the study's effect is almost entirely caused by this regression-to-the-mean effect.  But, without access to the study's data, I don't know the exact distributions of the matchups, to simulate closer to real life. 

But, as a proof of concept, I think the simulation shows that the effect they found in Table 2 is of the same order of magnitude as what you'd expect for purely statistical reasons. 

So I don't think the study found any evidence at all for their hypothesis of male-to-female intimidation.




P.S.  Thanks again to TomC for finding my error, and for continued discussion at Tango's site.