Wednesday, April 26, 2017

Guy Molyneux and Joshua Miller debate the hot hand

Here's a good "hot hand" debate between Guy Molyneux and Joshua Miller, over at Andrew Gelman's blog.

A bit of background, if you like, before you go there.

-----

In 1985, Thomas Gilovich, Robert Vallone, and Amos Tversky published a study refuting the "hot hand" hypothesis, which is the assumption that after a player has recently performed exceptionally well, he is likely to be "hot" and continue to perform exceptionally well.

The Gilovich [et al] study showed three results:

1. NBA players were actually *worse* after recent field goal successes than after recent failures;

2. NBA players showed no significant correlation between their first free throw and second free throw; and

3. In an experiment set up by Gilovich, which involved long series of repeated shots by college basketball players, there was no significant improvement after a series of hits.

-----

Then, in 2015-2016, Joshua Miller and Adam Sanjurjo found a flaw in Gilovich's reasoning. 

The most intuitive way to describe the flaw is this:

Gilovich assumed that if a player shot (say) 50 percent over the full sequence of 100 shots, you'd expect him to shoot 50 percent after a hit, and 50 percent after a miss.

But this is clearly incorrect. If a player hit 50 out of 100, then, if he made his (or her) first shot, what's left is 49 out of 99. You wouldn't expect 50%, then, but only about 49.5%. And, similarly, you'd expect 50.5% after a miss.

By assuming 50%, the Gilovich study set the benchmark too high, and would call a player cold or neutral when he was actually neutral or hot.

(That's a special case of the flaw Miller and Sanjurjo found, which applies only to the "after one hit" case. For what happens after a streak of two or more consecutive hits, it's more complicated. Coincidentally, the flaw is actually identical to one that Steven Landsburg posted for a similar problem, which I wrote about back in 2010. See my post here, or check out the Miller paper linked to above.)
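For anyone who wants to see the bias for themselves, here's a minimal sketch. It is not Miller and Sanjurjo's code, and the function name and setup are my own assumptions. It takes a true 50 percent shooter, lists every equally likely hit/miss sequence of a given length, computes the hit rate on shots that immediately follow a streak of hits within each sequence, and then averages those rates across sequences. That per-sequence averaging is a related way of seeing why the "expected" rate after a hit comes in below 50 percent.

```python
from fractions import Fraction
from itertools import product


def expected_rate_after_streak(n_shots, streak=1):
    """Average, over all equally likely hit/miss sequences of a 50% shooter,
    of the within-sequence hit rate on shots that follow `streak` straight hits."""
    rates = []
    for seq in product((0, 1), repeat=n_shots):
        # Outcomes of the shots that come right after `streak` consecutive hits.
        followers = [seq[i] for i in range(streak, n_shots)
                     if all(seq[i - streak:i])]
        if followers:  # sequences with no qualifying shot are skipped
            rates.append(Fraction(sum(followers), len(followers)))
    return sum(rates) / len(rates)


print(expected_rate_after_streak(3))   # 5/12
print(expected_rate_after_streak(4))   # 17/42
print(float(expected_rate_after_streak(10, streak=2)))
```

Even with no hot or cold hand at all, the average comes out under one-half, and the shortfall gets bigger as the required streak gets longer relative to the length of the sequence.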

------

The Miller [and Sanjurjo] paper corrected the flaw, and found that in Gilovich's experiment, there was indeed a hot hand, and a large one. In the Gilovich paper, shooters and observers were allowed to bet on whether the next shot would be made. The hit rate was actually seven percentage points higher when they decided to bet high, compared to when they decided to bet low (for example, 60 percent compared to 53 percent).

That suggests that the true hot hand effect must be higher than that -- because, if seven percentage points was what the participants observed in advance, who knows what they didn't observe? Maybe they only started betting when a streak got long, so they missed out on the part of the "hot hand" effect at the beginning of the streak.

However, there was no evidence of a hot hand in the other two parts of the Gilovich paper. In one part, players seem to hit field goals *worse* after a hit than after a miss -- but, corrected for the flaw, it seems (to my eye) that the effect is around zero. And, the "second free throw after the first" doesn't feature the flaw, so the results stand.

------

In addition, in a separate paper, Miller and Sanjurjo analyzed the results of the NBA's three-point contest, and found a hot hand there, too. I wrote about that in two posts in 2015. 

-------

From that, Miller argues that the hot hand *does* exist, that we now have evidence for it and need to take it seriously, and that it's not a cognitive error to believe the hot hand represents something real, rather than just random occurrences in random sequences. 

Moreover, he argues that teams and players might actually benefit from taking a "hot hand" into account when formulating strategy -- not in any specific way, but, rather, that, in theory, there could be a benefit to be found somewhere.

He also uses an "absence of evidence is not evidence of absence"-type argument, pointing out that if all you have is binary data, of hits and misses, there could be a substantial hot hand effect in real life, but one that you'd be unable to find in the data unless you had a very large sample. I consider that argument a parallel to Bill James' "Underestimating the Fog" argument for clutch hitting -- that the methods we're using are too weak to find it even if it were there.

------

And that's where Guy comes in. 

Here's that link again. Be sure to check the comments ... most of the real debate resides there, where Miller and Guy engage each other's arguments directly.







Friday, March 24, 2017

Career run support for starting pitchers

For the little study I did last post, I used Retrosheet data to compile run support stats for every starting pitcher in recent history (specifically, pitchers whose starts all came in 1950 or later).

Comparing every pitcher to his teammates, and totalling up everything for a career ... the biggest "hard luck" starter, in terms of total runs, is Greg Maddux. In Maddux's 740 starts, his offense scored 238 fewer runs than they did for his teammates those same seasons. That's a shortfall of 0.32 runs per game.

Here's the top six:

Runs   GS   R/GS  
--------------------------------
-238  740  -0.32  Greg Maddux
-199  773  -0.26  Nolan Ryan
-192  707  -0.27  Roger Clemens
-168  430  -0.39  A.J. Burnett
-167  690  -0.24  Gaylord Perry
-164  393  -0.42  Steve Rogers

Three of the top five are in the Hall of Fame. You might expect that to be the case, since, to accumulate a big deficiency in run support, you have to pitch a lot of games ... and guys who pitch a lot of games tend to be good. But, on the flip side, the "good luck" starters, whose teams scored more for them than for their teammates, aren't nearly as good:

Runs   GS   R/GS  
--------------------------------
+238  364  +0.65  Vern Law
+188  458  +0.41  Mike Torrez
+170  254  +0.67  Bryn Smith
+151  297  +0.51  Ramon Martinez
+147  355  +0.41  Mike Krukow
+143  682  +0.21  Tom Glavine

The only explanation for the difference, that I can think of, is that to have a long career despite bad run support, you have to be a REALLY good pitcher. To have the same length career, with good run support, you can just be PRETTY good.

But, that assumes that teams pay a lot of attention to W-L record, which would be the biggest statistical reflection of run support. And, we're only talking about a difference of around half a run per game. 

Another possibility: pitchers who are the ace of the staff usually start on opening day, where they face the other team's ace. So, that game, against a star pitcher, they get below-average support. Maybe, because of the way rotations work, they face better pitchers more often, and that's what accounts for the difference. Did Bill James study this once?

In any event, just taking the opening day game ... if those games are one run below average for the team, and Nolan Ryan got 20 of those starts, there's 20 of his 199 runs right there.

--------

UPDATE: see the comments for suggestions from Tango and GuyM.  The biggest one: GuyM points out that good pitchers lead to more leads, which means fewer bottom-of-the-ninth runs when they pitch at home.  Back of the envelope estimate: suppose a great pitcher means the team goes 24-8 in his starts, instead of 16-16.  That's 8 extra wins, which is 4 extra wins at home, which is 2 runs over a season, which is 30 runs over 15 good seasons like that.
--------

Here are the career highs and lows on a per-game basis, minimum 100 starts:

Runs   GS   R/GS  
--------------------------------
- 85  106  -0.80  Ryan Franklin
- 94  134  -0.70  Shawn Chacon
-135  203  -0.66  Ron Kline
- 72  116  -0.62  Shelby Miller
-154  249  -0.62  Denny Lemaster
- 68  115  -0.59  Trevor Wilson

Runs   GS   R/GS  
--------------------------------
+127  164  +0.77  Bill Krueger
+ 82  108  +0.76  Rob Bell
+ 89  118  +0.76  Jeff Ballard
+ 81  110  +0.73  Mike Minor
+170  254  +0.67  Bryn Smith
+106  161  +0.66  Jake Arrieta
+238  364  +0.65  Vern Law

These look fairly random to me.

-------

Here's what happens if we go down to a minimum of 10 starts:

Runs   GS   R/GS  
---------------------------------
- 29   12  -2.40  Angel Moreno
- 30   13  -2.29  Jim Converse
- 23   11  -2.25  Mike Walker
- 20   11  -1.86  Tony Mounce
- 25   14  -1.81  John Gabler

Runs   GS   R/GS  
---------------------------------
+ 32   11  +2.91  J.D. Durbin
+ 43   17  +2.56  John Strohmayer
+ 58   25  +2.30  Colin Rea
+ 61   28  +2.16  Bob Wickman
+ 23   11  +2.33  John Rauch

-------

It seems weird that, for instance, Bob Wickman would get such good run support in as many as 28 starts, his team scoring more than two extra runs a game for him. But, with 2,169 pitchers in the list, you're going to get these kinds of things happening just randomly.

The SD of team runs in a game is around 3. Over 36 starts, the SD of average support is 3 divided by the square root of 36, which works out to 0.5. Over Wickman's 28 starts, it's 0.57. So, Wickman was about 3.8 SDs from zero.
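The arithmetic in that paragraph can be sketched in a few lines. The function names are my own, and the per-game SD of 3 runs is the rough figure quoted above, not something estimated here.

```python
import math


def sd_of_average(per_game_sd, starts):
    """SD of a pitcher's average run support over `starts` independent games."""
    return per_game_sd / math.sqrt(starts)


def z_score(support_diff, per_game_sd, starts):
    """How many SDs the pitcher's per-game support difference is from zero."""
    return support_diff / sd_of_average(per_game_sd, starts)


print(round(sd_of_average(3.0, 36), 2))   # 0.5
print(round(sd_of_average(3.0, 28), 2))   # 0.57
print(round(z_score(2.16, 3.0, 28), 1))   # 3.8 (Wickman, +2.16 over 28 starts)
```

This first pass treats the teammates' support as a fixed number, with no randomness of its own.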

But that's not quite right ... the support his teammates got is a random variable, too. Accounting for that, I get that Wickman was 3.7 SDs from zero. Not that big a deal, but still worth correcting for.

I'll call that "3.7" figure the "Z-score."  Here are the top and bottom career Z-scores, minimum 72 starts:


    Z   GS   R/GS  
--------------------------------
-3.06   72  -1.16  Kevin Gausman
-2.94  203  -0.66  Ron Kline
-2.89  249  -0.62  Denny Lemaster
-2.57  134  -0.70  Shawn Chacon
-2.57  740  -0.32  Greg Maddux

    Z   GS   R/GS  
--------------------------------
+3.79  364  +0.65  Vern Law
+3.24  254  +0.67  Bryn Smith
+3.16  164  +0.77  Bill Krueger
+3.12   93  +1.02  Roy Smith
+2.73  247  +0.56  Tony Cloninger

The SD of the overall Z-scores is 1.045, pretty close to the 1.000 we'd expect if everything were just random. But, that still leaves enough room that something else could be going on.

-------

I chose a cutoff of 72 starts to include Kevin Gausman, who is still active. Last year, the Orioles starter went 9-12 despite an ERA of only 3.61. 

Not only does Gausman have the lowest (most negative) Z-score of pitchers with 72 starts, he also has the lowest Z-score of pitchers with as few as 10 starts!

Of the forty-two starters with a support shortfall more extreme than Gausman's 1.16 runs per game, none has more than 41 starts. 

Gausman is a historical outlier, in terms of poor run support -- the hardluckest starting pitcher ever.

------

I've posted the full spreadsheet at my website, here.


UPDATE, 3/31: New spreadsheet (Excel format), updated to account for innings of run support, to correct for any bottom-of-the-ninth issues (as per GuyM's suggestion).  Actually, both methods are in separate tabs.



Thursday, March 02, 2017

How much to park-adjust a performance depends on the performance itself

In 2016, the Detroit Tigers' fielders were below average -- by about 50 runs, according to Baseball Reference. Still, Justin Verlander had an excellent season, going 16-9 with a 3.04 ERA. Should we rate Verlander's season even higher than his stat line, since he had to overcome his team's poor fielding behind him?

Not necessarily, I have argued. A team's defense is better some games than others (in results, not necessarily in talent). The fact that Verlander had a good season suggests that his starts probably got the benefit of the better games. 

I used this analogy:

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most. Because, it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. 

Except that ... it WAS run support. 

It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

------

Just for fun, I decided to run a little study to see how big the effect actually is, for pitcher run support.

I found all starters from 1950 to 2015, who:

-- played for teams with below-league-average runs scored;

-- had at least 15 starts and 15 decisions, pitching no more than 5 games in relief; and

-- had a W-L record at least 10 games above .500 (e.g. 16-6).

There were 102 qualifying pitchers, mostly from the modern era. Their average record was 20-8 (19.8-7.7). 

They played in leagues where an average 4.41 runs were scored per game, but their below-average teams scored only 4.22. 

A first instinct might be to say, "hey, these pitchers should have had a W-L record even better than they did, because their teams gave them worse run support than the league average, by 0.19 runs per start!"

But, I'm arguing, you can't say that. Run support varies from game to game. Since we're doing selective sampling, concentrating on pitchers with excellent W-L records, we're more likely to have selected pitchers who got better run support than the rest of their team.

And the results show that. 

As mentioned, the pitchers' teams scored only 4.22 runs per game that season, compared with the league average 4.41. But, in the specific games those pitchers started, their teams gave them 4.54 runs of support. 

That's not just more than the team normally scored -- it's actually even more than the league average.

4.22 team
4.41 league
4.54 these pitchers

That's a pretty large effect. The size is due in part to the fact that we took pitchers with exceptionally good records.

Suppose a pitcher goes 22-8. Because run support varies, it could be that:

-- he pitched to (say) a 20-10 level, but got better run support;
-- he pitched to (say) a 24-6 level, but got worse run support.

But it's much less common to pitch at a 24-6 level than it is to pitch at a 20-10 level. So, the 22-8 guy was much more likely to be a 20-10 guy who got good run support than a 24-6 guy who got poor run support.

The same is true for lesser pitchers, to a lesser extent. It's not as much rarer to (say) pitch at a 14-10 level than at a 12-12 level. So, the effect should be there, for those pitchers, too, but it should be smaller.
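Here's a toy simulation of that selection argument. It is a sketch only, not the Retrosheet study, and everything in it is an assumption made for illustration: 33 starts per pitcher with a decision in every one, per-game runs drawn from a normal distribution with SD 3 (rounded and floored at zero), and pitcher talent varying with an SD of half a run per game. Every simulated pitcher draws his support from the same offense, so any gap in support between the selected pitchers and everyone else is pure selection.

```python
import random
import statistics


def simulate_pitcher(rng, starts=33, team_mean=4.2, run_sd=3.0, talent_sd=0.5):
    """Return (wins minus losses, average run support) for one made-up season."""
    allowed_mean = rng.gauss(team_mean, talent_sd)  # this pitcher's true level
    margin = 0
    support = 0
    for _ in range(starts):
        runs_for = max(0, round(rng.gauss(team_mean, run_sd)))
        runs_against = max(0, round(rng.gauss(allowed_mean, run_sd)))
        support += runs_for
        if runs_for == runs_against:
            margin += rng.choice((1, -1))  # break ties with a coin flip
        else:
            margin += 1 if runs_for > runs_against else -1
    return margin, support / starts


def average_support(seasons, min_margin):
    """Mean run support of pitchers at least `min_margin` games over .500."""
    return statistics.mean(s for m, s in seasons if m >= min_margin)


rng = random.Random(2017)
seasons = [simulate_pitcher(rng) for _ in range(4000)]
overall = statistics.mean(s for _, s in seasons)

print("all pitchers:        ", round(overall, 2))
print("1+ games over .500:  ", round(average_support(seasons, 1), 2))
print("10+ games over .500: ", round(average_support(seasons, 10), 2))
```

Under these assumptions, the pitchers selected for a winning record show above-average support, and the group selected for an extreme record shows more of it.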

I reran the study, but this time, pitchers were included if they were even one game over .500. That increased the sample size to 1024 team-seasons. The average pitcher in the sample was 14-10 (14.4 and 9.7).

Here are the run support numbers:

4.15 team
4.40 league
4.32 these pitchers

This time, the effect wasn't so big that the pitchers actually got more support than the league average. But it did move them two-thirds of the way there. 

And, of course, not *every* pitcher in the study got better run support than his teammates. That figure was only 62.1 percent. The point is, we should expect it to be more than half.

-------

Suppose a player has an exceptionally good result -- an extremely good W-L record, or a lot of home runs, or a high batting average, or whatever. 

Then, in any way that it's possible for him to have been lucky or unlucky -- that is, influenced by external factors that you might want to correct for -- he's more likely to have been lucky than unlucky.

If a player hits 40 home runs in an extreme pitcher's park, he probably wasn't hurt by the park as much as other players. If a player steals 80 bases and is only caught 6 times, he probably faced weaker-throwing catchers than the league average. If a shortstop rates very high for his fielding runs one year, he was probably lucky in that he got easier balls to field than normal (relative to the standards of the metric you're using).

"Probably" doesn't mean "always," of course. It just means more than a 50 percent chance. It could be anywhere from 50.0001 percent to 99.9999 percent. (As I mentioned, it was 62.1 percent for the run support study.)

The actual probability, and the size of the effect, depends on a lot of things. It depends on how you define "extreme" performances. It depends on the variances of the performances and the factor you're correcting for. It depends on how many external factors actually affect the extreme performance you're seeing.

So: for any given case, is the effect big, or is it small? You have to think about it and make an argument. Here's an argument you could make for run support, without actually having to do the study.

In most seasons, the SD of a single team's runs per game is about 3. That means that in a season of 36 starts, the SD of average run support is 0.5 runs (which is 3 divided by the square root of 36). 

In the 2015 AL, the SD of season runs scored between teams was only 0.4 runs per game.

0.5 runs of variation between pitchers on a team
0.4 runs of variation between teams

That means that, for a given starting pitcher's W-L record, randomness in which games he starts matters *more* than his team's overall level of run support. 

That's why we should expect the effect to be large.

There are other sources of luck that might affect a pitcher's W-L record. Home/road starts, for instance. If you find a pitcher with a good record, there's better than a 50-50 shot that he started more games at home than on the road. But, the amount of overall randomness in that stat is so small -- especially since there's usually a regular rotation -- that the expectation is probably closer to, say, 50.1 percent, than to the 62.1 percent that we found for run support.

But, in theory, the effect must exist, at some magnitude. Whether it's big enough that you have to worry about, is something that you have to figure out.

I've always wanted to try to study this for park effects. I've always suspected that when a player hits 40 home runs in a pitcher's park, and he gets adjusted up to 47 or something ... that that's way too high. But I haven't figured out how to figure it out.








Monday, February 06, 2017

Are women chess players intimidated by male opponents? Part III

Over at Tango's blog, commenter TomC found an error in my last post. I had meant to create a sample of male chess players with mean 2400, but I added wrong and created a mean of 2500 instead. (The distribution of females was correct, with mean 2300.)

The effect produced by my original distribution had come close to the one in the paper, but, with the correction, it drops to about half.

The effect is the win probability difference between a woman facing a man, as compared to facing a woman with the same rating:

-0.021 paper
-0.020 my error
-0.010 corrected

It makes sense that the observed effect drops when the distribution of men gets closer to the distribution of women. That's because (by my hypothesis) it's caused by the fact that women and men have to be regressed to different means. The more different the means, the larger the effect. 

Suppose the distribution matches my error, 2300 for the women and 2500 for the men. When a 2400 woman plays a 2400 man, her rating of 2400 needs to be regressed down, towards 2300. But the man's rating needs to be regressed *up*, towards 2500. That means the woman was probably overrated, and the man underrated.

But, when the men's mean is only 2400, the man no longer needs to be regressed at all, because he's right at the mean. So, only the woman needs to be regressed, and the effect is smaller.
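A quick way to see the size of that, as a sketch under stated assumptions: regress each rating toward its own group mean using a simple linear shrinkage. The 52-point SD for rating noise is the figure I used in the simulation. The 140-point SD for observed ratings is a round number assumed here for illustration, and the function names are made up.

```python
def regress_rating(rating, group_mean, rating_sd=140.0, noise_sd=52.0):
    """Estimate talent by shrinking a rating toward its group mean.

    The share of the rating's spread that is real talent is
    1 - noise variance / rating variance (assumed values, see above).
    """
    reliability = 1.0 - (noise_sd / rating_sd) ** 2
    return group_mean + reliability * (rating - group_mean)


def hidden_gap(rating, women_mean, men_mean):
    """Man's estimated talent minus woman's, when both carry the same rating."""
    return regress_rating(rating, men_mean) - regress_rating(rating, women_mean)


print(round(hidden_gap(2400, 2300, 2500), 1))  # men's mean 2500 (my error)
print(round(hidden_gap(2400, 2300, 2400), 1))  # men's mean 2400 (corrected)
```

With these inputs, moving the men's mean from 2500 to 2400 cuts the hidden gap exactly in half, because only the woman's rating is still being regressed.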

-------

The effect becomes larger when the players are more closely matched in rating. That's when it's most likely that the woman is above average, and the man is below average. The original study found a larger effect in close matches, and so did my (corrected) simulation:

.0130 -- ratings within 50 points
.0144 -- ratings within 100 points
.0129 -- ratings within 200 points
.0057 -- ratings more than 200 points apart

Why is that important?  Because, in my simulation, I chose the participants at random from the distributions. In real life, tournaments try to match players to opponents with similar ratings.

In the study, the men's ratings were higher than the women's, by an average 116 points. But when a man faced a woman, the average advantage wasn't 116 -- it was much lower. As the study says, on page 18,

"However, when a female player in our sample plays a male opponent, she faces an average disadvantage of 27 Elo points ..."

Twenty-seven points is a very small disadvantage, about one quarter of the 116 points that you'd see if tournament matchups were random. The matching of players makes the effect look larger.

So, I rejigged my simulation to make all matches closer. I chose a random matchup, and then discarded it a certain percentage of the time, varying with the ratings difference. 

If the players had the same rating, I always kept that match. If the difference was more than 400 points, I always discarded that match. In between, I kept it on a sliding scale that randomly decided which matches to keep.


(Technical details: I kept each game with a probability corresponding to the 1.33 power of the difference. So 200-point games, which are halfway between 0 and 400, got kept only 1/2.52 of the time (2.52 being 2 to the power of 1.33). 
Why 1.33?  That exponent happened to get the resulting distributions of men and women close to what was in the study. But other ways would have worked too.  For what it's worth, I tried other exponents, and the results were very similar.)
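One reading of that sliding scale, sketched in code. The formula below is an assumption that matches the description (always keep at a difference of 0, never keep at 400 or more, keep 1/2.52 of the time at 200). The function names are hypothetical.

```python
import random


def keep_probability(rating_diff, cutoff=400.0, exponent=1.333):
    """Chance of keeping a matchup, given the rating difference between players."""
    closeness = max(0.0, 1.0 - abs(rating_diff) / cutoff)
    return closeness ** exponent


def keep_match(rating_a, rating_b, rng=random):
    """Randomly decide whether to keep a candidate matchup."""
    return rng.random() < keep_probability(rating_a - rating_b)


print(keep_probability(0))               # 1.0
print(round(keep_probability(200), 3))   # 0.397, about 1/2.52
print(keep_probability(400))             # 0.0
```

Candidate matchups get drawn at random as before, and `keep_match` filters them, so close games end up overrepresented relative to a purely random draw.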

Now, my results were back close to what the study had come up with, in its Table 2:

.021 study
.019 simulation

To verify that I didn't screw up again, I compared my summary statistics to the study.  They were reasonably close.  All numbers are Elo ratings:

Mean 2410, SD 175: men, study
Mean 2413, SD 148: men, simulation

Mean 2294, SD 141: women, study
Mean 2298, SD 131: women, simulation

Mean 27 points: M vs. W opp. difference, study
Mean 46 points: M vs. W opp. difference, simulation

The biggest difference was in the opponents faced:

Mean 2348: men's opponents, study
Mean 2408: men's opponents, simulation

Mean 2283: women's opponents, study
Mean 2321: women's opponents, simulation

The difference here is that in real life, the players chosen by the study played opponents worse than themselves, on average. (Part of the reason is that, in the study, only the better players (rating of 2000+) were included as "players", but all their opponents were included, regardless of skill.)  In the simulation, the players were chosen from the same distribution. 

I don't think that affects the results, but I should definitely mention it.

-------

Another thing I should mention, in defense of the study: last post, I questioned what happens when you include actual ratings in the regression, instead of just win probability (which is based on the ratings). I checked, and that actually *lowers* the observed effect, even if only a little bit. From my simulation:

.0188 not included
.0187 included

And, one more: as I mentioned last post, I chose an SD of 52 points for the difference between a player's rating and his or her actual talent. I have no idea if 52 points is a reasonable estimate; my gut suggests it's too high. Reducing the SD would also reduce the size of the observed effect.

I still suspect that the study's effect is almost entirely caused by this regression-to-the-mean effect.  But, without access to the study's data, I don't know the exact distributions of the matchups, to simulate closer to real life. 

But, as a proof of concept, I think the simulation shows that the effect they found in Table 2 is of the same order of magnitude as what you'd expect for purely statistical reasons. 

So I don't think the study found any evidence at all for their hypothesis of male-to-female intimidation.




P.S.  Thanks again to TomC for finding my error, and for continued discussion at Tango's site.




Thursday, January 26, 2017

Are women chess players intimidated by male opponents? Part II

UPDATE, 1/29/2017: Oops!  Due to an arithmetic error, I had the pools of men and women apart by 200 instead of 116, in my simulation.  Will rerun everything and post again.  

UPDATE, 2/6/2017: New post is here after rerunning stuff.  Read this first if you haven't already.

-----

Do women chess players perform worse against male opponents because they find the men intimidating and threatening? Last post, I talked about a paper that makes that claim. I disagreed with their conclusions, thinking the effect was actually the result of women and men being compared to different averages. It turns out I was wrong -- there are enough inter-sex matches that the ratings would clean themselves up over time.

So, I'll have a second argument -- this time, with a simulation I ran to make sure it actually holds up. But, before I do that, I should talk about the regressions in the paper itself. 

------

The first regression comes in Table 2. It tries to estimate the number of points the player earns (equivalent to the probability of a player winning, where a tie counts as half a win). To get that, it regresses on:

-- the expected result, based on the Elo formula applied to the two players' ratings;
-- whether the player plays the white or black pieces (in chess, white moves first, and so has an advantage); 
-- the ages of the two competitors; and, of course,
-- the sexes of the two competitors.

I wonder about why the authors chose to include age in this regression. Is Elo biased by age? If so, could whatever biased it by age also have biased it by sex?

Furthermore, why would the bias be linear on age, such that the difference between a 42- and 22-year old would be ten times as large as the difference between a 24- and 22-year old? That doesn't seem plausible to me at all.

Anyway, maybe I'm just nitpicky here. This might not actually matter much.

(Well, OK, if you want to get technical -- the effect of playing white can't be linear either, can it? Suppose playing white lifts your chances of winning from .47 to .53, if you're playing someone of equal talent. But if you're a much better player, does it really lift your chances from .90 to .96? Probably not.

In fairness, the authors did run a version of one regression that included the square of the expected win probability -- but they didn't interact it with the white/black variable, or any of the others, I think.)

In any case, the coefficient of interest comes out to .021. That means that when a player faces a woman as opposed to a man -- after controlling for age, ratings differential, and color -- his or her chances of winning are 2.1 percentage points higher. For a .700 favorite moving to .721, that's the equivalent of about 18 Elo points.
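The conversion from win probability to Elo points can be checked with the standard Elo expected-score formula (the logistic curve with a 400-point scale). This is a sketch, and the function names are my own.

```python
import math


def expected_score(rating_gap):
    """Standard Elo expected score for a player `rating_gap` points ahead."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def gap_for_score(score):
    """Inverse of expected_score: the rating gap that produces `score`."""
    return 400.0 * math.log10(score / (1.0 - score))


shift = gap_for_score(0.721) - gap_for_score(0.700)
print(round(gap_for_score(0.700)))  # 147
print(round(shift))                 # 18
```

The same 2.1 percentage points is worth a different number of Elo points at other starting probabilities, since the curve isn't linear. The 18 is specific to a .700 favorite.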

-------

The second regression (Table 4) still tries to predict winning percentage, but based on a larger set of variables:

-- the sex of each competitor (again);
-- the expected winning percentage, based on the Elo differential (again);
-- the ages of both players (again);
-- the Elo rating of both players;
-- the country (national chess federation) to which the player belongs; and
-- the proportion of other players at the event who are female.

The first thing that strikes me is that the study uses two separate measures of talent -- expected win probability, and Elo ratings. This seems like duplication. I guess, though, that the two are non-linear in different ways, so maybe the second corrects for the errors of the first. But, if neither alone is unbiased across different types of players, who's to say that both together are unbiased? 

Also, the expected winning percentage will be highly correlated with the two Elo ratings ... wouldn't that cause weird effects? We can't tell, because the authors don't show ANY of the coefficient estimates except the male/female one. (See Table 4.)

The authors also include player fixed effects -- in other words, they assume every player has his or her own idiosyncratic arithmetic "bump" in expected winning percentage. This would make sense in other contexts, but seems weird in a situation where every player has a specific rating that's supposed to cover all that. But, Ted Turocy assured me that shouldn't affect the estimates of the other coefficients, so I'll just go with it for now.

In any case, there's so much going on in this regression that I have no idea how to interpret what comes out of it, especially without coefficient estimates.

I don't even have a gut feel. The estimate of the "intimidation" effect *doubles* when these new variables are introduced. How does that happen? Is the formula to produce winning percentage from ratings so badly wrong that the ratings themselves double the effect? Are ratings that biased by country? Are women twice as intimidated by male opponents in the absence of fellow female players?

It doesn't make sense to my gut. So I'm not going to try to figure out this second regression at all. I'll just go with the first one, the one that found an effect of 18 Elo points.


-------

OK, now my argument: I think there's a specific reason for the effect the authors found, one intrinsic to the structure of chess ratings, and having nothing to do with what happens when women compete against men. It has to do with regression to the mean.

Chess (Elo) ratings aren't perfectly accurate. They change after every game, depending on the results. So, a player who has been lucky lately will have a rating higher than his or her talent, and vice-versa. This is the same idea as in baseball, or any other sport. (In MLB, after 162 games, you have to regress a team's record to the mean by about 40 percent to get a true estimate of its talent, so a team that went 100-62 was, in expectation, a 92-70 talent that got lucky.)

Suppose a man faces another male opponent, but has a 50-point advantage in rating. Both ratings have to be regressed to the mean, so the talent gap is smaller than 50 points. Maybe, then, the better player should only be a 45-point favorite, or something.

Now, suppose a man faces a *female* opponent, with the same 50-point advantage. In this case, I would argue, after regressing to the mean, the man actually has MORE than a 50-point advantage.

Why? Because, in general, the women have lower ratings than the men, by 116 points. So when there's only a 50-point gap between them, we're probably looking at a woman who's above average for a woman, and a man who's below average for a man.

That means the woman has to be regressed DOWN towards the women's mean, and the man has to be regressed UP towards the men's mean.

In other words: when a man faces a woman, the true talent gap is probably larger than when a man faces a man, even when the Elo ratings are exactly the same.

Here's an easier example:

Suppose an NBA team beats another NBA team by 20 points. They were probably substantially lucky, right? If they faced each other again, you'd expect the next game to be a lot closer than that.

But, suppose an NBA team beats *an NCAA team* by 20 points. In that case, the NCAA team must have played great, to get to within 20 points of the pros. In that case, you'd expect the next game to be a much bigger blowout than 20 points.

-----

Well, this time, I figured I'd better test out the logic before posting. So I ran a simulation. I created random men and women with a talent distribution similar to what was in the original study. My distributions were bell-shaped -- I don't know what the real-life distributions were.

(UPDATE: As I mentioned, the men were about 200 points more talented than the women, instead of 100.  Oops.  Spoiler alert: divide the effect in half, roughly, for now.)

Then, I ran a random simulation of games, with proportions of MM, MW, WM, and WW similar to those in the study. 

For the first simulation, I assumed that the ratings were perfect, representing the players' talents exactly. As expected, the regression found no real difference between the women and men. The coefficients were close to zero and not statistically significant. 

So far, so good.

Then, I added the random noise. For every player, I randomly adjusted his or her rating to vary from talent, with mean zero and SD of about 52 points.

I ran the simulation again. 

The data matched my hypothesis this time: the man/woman games have to be regressed differently than the woman/woman games.

When a man played another man, and the difference between their ratings was less than 50 points ... on average, both men matched their ratings. (In other words, each man was as likely to be too high as too low.)  But when a man played a woman, and the difference between their ratings was less than 50 points ... big difference. In that case, the woman, on average, was 10 points worse than her rating, and the man was 10 points better than his rating. 

In other words, when a man faced a similarly-rated woman, his true advantage was 20 points more than the ratings suggested.
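Here's a minimal sketch of that selection effect, if you want to try it yourself. It is not my original simulation: the group means (2410 and 2294) are the study's averages, the 52-point noise SD is the one described above, and the 100-point talent SD within each sex is purely an assumption for illustration. It draws one man and one woman at a time, keeps only the pairs whose ratings land within 50 points of each other, and averages how far each side's true talent sits from the rating.

```python
import random

random.seed(0)

MEN_MEAN, WOMEN_MEAN = 2410, 2294   # average ratings reported in the study
TALENT_SD = 100                     # assumption: spread of true talent within each sex
NOISE_SD = 52                       # rating error (rating minus talent), as in the post

def make_player(mean_talent):
    """Return (talent, rating), where rating = talent + random noise."""
    talent = random.gauss(mean_talent, TALENT_SD)
    rating = talent + random.gauss(0, NOISE_SD)
    return talent, rating

man_errors, woman_errors = [], []
for _ in range(20000):
    man_talent, man_rating = make_player(MEN_MEAN)
    woman_talent, woman_rating = make_player(WOMEN_MEAN)
    # Keep only closely-rated man/woman pairings.
    if abs(man_rating - woman_rating) < 50:
        man_errors.append(man_talent - man_rating)
        woman_errors.append(woman_talent - woman_rating)

mean_man_error = sum(man_errors) / len(man_errors)
mean_woman_error = sum(woman_errors) / len(woman_errors)
print(len(man_errors), round(mean_man_error, 1), round(mean_woman_error, 1))
```

Under these assumptions, the men in the selected pairs should come out better than their ratings, on average, and the women worse. The exact size depends on the talent SD you assume.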

-------

For a direct comparison, I recreated the authors' first regression (without the age factor). Here's what I got, compared to the study:

                         Me   Study
-----------------------------------
Player is female      -.024   -.021
Opponent is female    +.020   +.021
Female/Female         +.002   +.001

Almost exact! One thing that's a little different is that, in real life, the player and opponent effects were equal. In the simulation, they weren't. That's just random: in my case, the females designated "opponent" were a little luckier than the ones designated "player."  In the real study, they happened to be very close to equal. In neither case is the difference statistically significant.

(The only differences between the two categories: the "player" must have a rating of at least 2000, and have had at least one game against each sex. The study didn't require the "opponent" to have either criterion.)

In order to make things easier to read, in subsequent runs of the simulation, I included both sides of every game -- I just switched the player and opponent, and treated it as a second observation.

This equalizes the two coefficients while keeping the expected value of the effects the same. (It does invalidate the standard errors, but I don't care about significance here, so that doesn't matter.)  It also eliminates the need for the "female/female" interaction term.

Here's that next run of the simulation:

All                      Me    Study
------------------------------------
Player is female      -.020    -.021
Opponent is female    +.020    +.021

Exactly the same! Well, that's not coincidence. By trial and error, I figured out that an SD of 52 points between talent and rating is what made it work, so that's what I used. (I should really have raised it to 53 or 54 to get it to come out exactly, but for some reason I thought the target was .020 instead of .021, and by the time I realized it was .021, the sim was done and I was too lazy to go back and rerun it.)

In any case, I think this proves that (a) there MUST be a "women appear to play worse against men" effect appearing as a consequence of the way the data is structured, and (b) the effect size the authors found is associated with a reasonable ratings accuracy of 54 points or so.

If the authors want to search for an effect in addition to this regression artifact, they have to figure out what the real ratings accuracy is, and adjust for it in their calculations. I'm not sure how easy it would be to do that. 

-------

I could stop here, but I'll try reproducing one more of the study's checks, just in case you're not yet convinced.

In Table 6, the authors showed the results of regressions separating the data by how well-matched the players were. They did this only for the regression with all the extra variables that doubled their coefficients, so their numbers are higher. Here's what they got:

.040 Overall
.057 Players within 50 points
.052 Players within 100 points
.033 Players within 200 points

I did the same, and got

.020 Overall
.031 Players within 50 points
.025 Players within 100 points
.027 Players within 200 points
.013 Players 200+ points apart

The general pattern is similar: a larger effect for closely-matched players, and a lower effect for mismatches. 

The biggest difference is that in my simulation, the dropoff comes after 200 points, whereas in the original study, it seems to come somewhere before 200 points.

Part of the difference might be that I chose completely random matchups between players, but, in real life, the tournaments paired up players with similar rankings. Overall, the women were 116 points lower than the men, but in actual tournaments, they were only 37 points worse.

I'm guessing that if my simulation chose similar matchups to real life, the numbers would come out much closer.

-------

I could try to reproduce more of the study's regression results, but I'll stop here. I think this is enough for a strong basis for the conclusion that what the authors found is just caused by the fact that (a) Elo ratings aren't perfect, and (b) as a group, the women have much lower ratings than the men.

In other words, it's just a statistical artifact.





Thursday, January 19, 2017

Are women chess players intimidated by male opponents?

The "Elo" rating system is a method most famous for ranking chess players, but which has now spread to many other sports and games.

How Elo works is like this: when you start out in competitive chess, the federation assigns you an arbitrary rating -- either a standard starting rating (which I think is 1200), or one based on an estimate of your skill. Your rating then changes as you play.

What I gather from Wikipedia is that "master" starts at a rating of about 2300, and "grandmaster" around 2500. To get from the original 1200 up to the 2300 level, you just start winning games. Every game you play, your rating is adjusted up or down, depending on whether you win, lose, or draw. The amount of the adjustment depends on the difference in skill between you and your opponent. Elo calculates an estimate of the odds of winning, calculated from your rating and your opponent's rating, and the loser "pays" points to the winner. So, the better your opponents, the more points you get for defeating them.

The rating is an estimate of your skill, a "true talent level" for chess. It's calibrated so that every 400-point difference between players is an odds ratio of 10. So, when a 1900-rated player, "Ann", faces a 1500-rated player, "Bob," her odds of winning are 10:1 (.909). That means that if the underdog, Bob, wins, he'll get 10 times as many points as Ann will get if she wins.

How many points, exactly? That's set by the chess federation in an attempt to get the ratings to converge on talent, and the "400-point rule," as quickly and accurately as possible. The idea is that the less information you have about the players, the more points you adjust by, because the result carries more weight towards your best estimate of talent. 

For players below "expert," the adjustment is 32 times the difference from expectation. For expert players, the adjustment is only 24 points per win, and, at the master level and above, it's 16 points per win.

If Bob happens to beat Ann, he won 1.00 games when the expectation was that he'd win only 0.09. So, Bob exceeded expectations by 0.91 wins. Multiply by 32, and you get 29 points. That means Bob's rating jumps from 1500 to 1529, while Ann drops from 1900 to 1871.

If Ann had won, she'd claim 3 points from Bob, so she'd be at 1903 and Bob would wind up at 1497.
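The Ann/Bob arithmetic can be sketched in a few lines of Python. The function names are mine, for illustration only, and the sketch ignores draws and assumes the below-expert factor of 32 for both players.

```python
def expected_score(rating, opponent_rating):
    """Expected score (win probability, ignoring draws) under the 400-point rule."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

def rating_change(rating, opponent_rating, score, k=32):
    """Points gained (or lost) from one game; score is 1 for a win, 0 for a loss."""
    return k * (score - expected_score(rating, opponent_rating))

ann, bob = 1900, 1500
ann_expected = expected_score(ann, bob)           # Ann's chance of beating Bob
bob_upset_gain = rating_change(bob, ann, score=1) # what Bob collects if he wins
ann_win_gain = rating_change(ann, bob, score=1)   # what Ann collects if she wins

print(round(ann_expected, 3), round(bob_upset_gain), round(ann_win_gain))
# prints: 0.909 29 3
```

Whatever the winner gains, the loser gives up, so the two ratings always sum to the same total before and after the game.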

FiveThirtyEight recently started using Elo for their NFL and NBA ratings. It's also used by my Scrabble app, and the world pinball rankings, and other such things. I haven't looked it up, but I'd be surprised if it weren't used for other games, too, like Backgammon and Go.

-------

For the record, I'm not an expert on Elo, by any means ... I got most of my understanding from Wikipedia, and other internet sources. And, a couple of days ago, Tango posted a link to an excellent article by Adam Dorhauer that explains it very well.

Despite my lack of expertise, it seems to me that these properties of Elo are clearly the case:

1. Elo ratings are only applicable to the particular game that they're calculated from. If you're a 1800 at Chess, and I'm a 1600 at Scrabble, we have no idea which one of us would win at either game. 

2. The range of Elo ratings varies between games, depending on the range of talent of the competitors, but also on the amount of luck inherent to the sport. If the best team in the NBA is (say) an 8:1 favorite against the worst team in the league, it must be rated 361 Elo points better. (That's because 10 to the power of (361/400) equals 8.)  But if the best team in MLB is only a 2:1 favorite, it has to be rated only 120 points better.

Elo is an estimate of odds of winning. It doesn't follow, then, that a 1800 rating in one sport is comparable to a 1800 rating in another sport. I'm a better pinball player than I am a Scrabble player, but my Scrabble rating is higher than my pinball rating. That's because underdogs are more likely to win at pinball. I have a chance of beating the best pinball player in the world, in a single game, but I'd have no chance at all against a world-class Scrabble player.

In other words: the more luck inherent in the game, the tighter the range (smaller the standard deviation) of Elo ratings. 

3. Elo ratings are only applicable within the particular group that they're applied to. 

Last March, before the NCAA basketball tournament, FiveThirtyEight had Villanova with an Elo rating of 2045. Right now, they have the NBA's Golden State Warriors with a rating of 1761.

Does that mean that Villanova was actually a better basketball team than Golden State? No, of course not. Villanova's rating is relative to its NCAA competition, and Golden State's rating is relative to its NBA competition.

If you took the ratings at face value, without realizing that, you'd be projecting Villanova as 5:1 favorites over Golden State. In reality, of course, if they faced each other, Villanova would get annihilated.
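The conversions between odds and rating gaps used in points 2 and 3 come straight from the 400-point rule. A small sketch, with helper names that are mine rather than anything standard:

```python
import math

def gap_from_odds(odds):
    """Rating gap implied by the favorite's odds (e.g. 8 for 8:1)."""
    return 400 * math.log10(odds)

def odds_from_gap(gap):
    """Favorite's odds implied by a rating gap."""
    return 10 ** (gap / 400)

nba_gap = gap_from_odds(8)                    # the 8:1 favorite in the NBA example
mlb_gap = gap_from_odds(2)                    # the 2:1 favorite in the MLB example
face_value_odds = odds_from_gap(2045 - 1761)  # Villanova vs. Golden State, taken at face value

print(round(nba_gap), round(mlb_gap), round(face_value_odds, 1))
# prints: 361 120 5.1
```

The last number is only what the two ratings would imply if they were on the same scale, which is exactly what point 3 says they are not.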

--------

OK, this brings me to a study I found on the web (hat tip here). It claims that women do worse in chess games that they play against men rather than against women of equal skill. The hypothesis is, women's play suffers because they find men intimidating and threatening. 

(For instance: "Girls just don’t have the brains to play chess," (male) grandmaster Nigel Short said in 2015.)

In an article about the paper, co-author Maria Cubel writes:


"These results are thus compatible with the theory of stereotype threat, which argues that when a group suffers from a negative stereotype, the anxiety experienced trying to avoid that stereotype, or just being aware of it, increases the probability of confirming the stereotype.

"As indicated above, expert chess is a strongly male-stereotyped environment. "... expert women chess players are highly professional. They have reached a high level of mastery and they have selected themselves into a clearly male-dominated field. If we find gender interaction effects in this very selective sample, it seems reasonable to expect larger gender differences in the whole population."

Well, "stereotype threat" might be real, but I would argue that you don't actually have evidence of it in this chess data. I don't think the results actually mean what the authors claim they mean. 

-------

The authors examined a large database of chess results, and selected all players with a rating of at least 2000 (expert level) who played at least one game against an opponent of each of the two sexes.

After their regressions, the authors report,
"These results indicate that players earn, on average and ceteris paribus, about 0.04 fewer points [4 percentage points of win probability] when playing against a man as compared to when their opponent is a woman. Or conversely, men earn 0.04 points more when facing a female opponent than when facing a male opponent. This is a sizable effect, comparable to women playing with a 30 Elo point handicap when facing male opponents."

The authors did control for Elo rating, of course. That was especially important because the women were, on average, less skilled than the men. The average male player in the study was rated at 2410, while the average female was only 2294. That's a huge difference: if the average man played the average woman, the 116-point spread suggests the man would have a .661 winning percentage -- roughly, 2:1 odds in favor of the man.

Also, there were many more same-sex matches in the database than intersex matches. There are two reasons for that. First, many tournaments are organized by ranking; since there are many more men, proportionally, in the higher ranks, they wind up playing each other more often. Second, and probably more important, there are many women's tournaments and women's-only competitions.

-------

UPDATE: my hypothesis, described in the remainder of this post, turns out to be wrong.  Keep that in mind while reading, and then turn to Part II after. Eventually, I'll merge the two posts to avoid confusion.

-------

So, now we see the obvious problem with the study, why it doesn't show what the authors think it shows. 

It's the Villanova/Golden State situation, just better hidden.

The men and women have different levels of ability -- and, for the most part, their ratings are based on play within their own group. 

That means the men's and women's Elo ratings aren't comparable, for exactly the same reason an NCAA Elo rating isn't comparable to an NBA Elo rating. The women's ratings are based more on their performance relative to the [less strong] women, and the men's ratings more on their performance relative to the [stronger] men.

Of course, the bias isn't as severe in the chess case as the basketball case, because the women do play matches against men (while Villanova, of course, never plays against NBA teams). Still, both groups played predominantly within their sex -- the women 61 percent against other women, and the men 87 percent against other men.

So, clearly, there's still substantial bias. The Elo ratings are only perfectly commensurable if the entire pool can be assumed to have faced a roughly equal caliber of competition. A smattering of intersex play isn't enough.

Villanova and Golden State would still have incompatible Elos even if they played, say, one out of every five games against each other. Because, then, for the rest of their games, Villanova would go play teams that are 1500 against NCAA competition, and Golden State would go play teams that are 1500 against NBA competition, and Villanova would have a much easier time of it.

------

Having said that ... if you have enough inter-sex games, the ratings should still work themselves out. 

Because, the way Elo works, points can neither be created nor destroyed.  If women play only women, and men play only men, on average, they'll keep all the ratings points they started with, as a group. But if the men play even occasional games against the women, they'll slowly scoop up ratings points from the women's side to the men's side. All that matters is *how many* of those games are played, not *what proportion*.  The male-male and female-female games don't make a huge difference, no matter how many there are.

The way Elo works, overrated players "leak" points to underrated players. No matter how wrong the ratings are to start, play enough games, and you'll have enough "leaks" for the ratings to all converge on accuracy.

Even if 99% of women's games are against other women, eventually, with enough games played, that 1% can add up to as many points as necessary, transferred from the women to the men, to make things work out.
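Here's a toy illustration of the "leak," under assumptions that are entirely mine: 50 men with true talent 1250 and 50 women with true talent 1150, everyone starting at a rating of 1200, only man/woman games, no draws, and an adjustment factor of 32. Each game's result is decided by true talent, but the points change hands according to the current ratings.

```python
import random

random.seed(1)

K = 32
N_PLAYERS = 50                           # hypothetical group size
MEN_TALENT, WOMEN_TALENT = 1250, 1150    # hypothetical true talent levels

def expected_score(rating, opponent_rating):
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

men = [1200.0] * N_PLAYERS               # everyone starts with the same rating
women = [1200.0] * N_PLAYERS
total_before = sum(men) + sum(women)

p_man_wins = expected_score(MEN_TALENT, WOMEN_TALENT)   # based on true talent
for _ in range(5000):
    i = random.randrange(N_PLAYERS)
    j = random.randrange(N_PLAYERS)
    man_score = 1.0 if random.random() < p_man_wins else 0.0
    # Elo's expectation uses the current ratings, not the true talent.
    delta = K * (man_score - expected_score(men[i], women[j]))
    men[i] += delta                      # whatever the man gains ...
    women[j] -= delta                    # ... the woman loses

total_after = sum(men) + sum(women)
men_avg = sum(men) / N_PLAYERS
women_avg = sum(women) / N_PLAYERS
print(round(total_after - total_before, 6), round(men_avg), round(women_avg))
```

The total number of points never changes, but the men's average drifts up and the women's drifts down until the gap in ratings is roughly the gap in talent.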

------

So, do we have enough games, enough "leaks", to get rid of the bias?

Suppose both groups, the men and the women, started out at 1200. But the men were better. They should have been 1250, and the women should have been 1150.  The woman/woman games and man/man games will keep both averages at 1200, so we can ignore those.  But the woman/man games will start "leaking" ratings points to the men's side.

Are there enough woman/man games in the database that the men could unbias the women's ratings by capturing enough of their ratings points?

In the sample, there were 5,695 games by those woman experts (rating 2000+) who played at least one man.  Of those games, 61 percent were women against women.  That leaves 2,221 games where expert women played (expert or inexpert) men. 

By a similar calculation, there were 2,800 games where expert men played (expert or inexpert) women.  

There's probably lots of overlap in those two sets of games, where an expert man played an expert woman. Let's assume the overlap is 1,500 games, so we'll reduce the total to 3,500.

How much leakage do we get in 3,500 games?  

Suppose the men really are exactly 116 points better in talent than the women, like their ratings indicate -- which would be the case if the leakage did, in fact, take care of all the bias. 

Now, consider what would have happened if there were no leakage. If each sex played only within its own group, the women would be overrated by 116 points (since the two groups would have equal average ratings, but the men would be 116 points more talented).

Now, introduce intersex games. The first time a woman played a man, she'd be the true underdog by 116 points. Her male opponent would have a .661 true win probability, but treated by Elo as if he only had .500. So, the male group would gain .161 wins in expectation on that game.  At 24 points per win, that's 3.9 points.

After that game, the sum of ratings on the woman's side drops by 3.9 points, so now, the women won't be quite as overrated, and the advantage to the men will drop.  But, to be conservative, let's just keep it at 3.9 points all the way through the set of 3,500 games.  Let's even round it to 4 points.

Four points of leakage, multiplied by 3,500 games, is 14,000 ratings points moving from the women's side to the men's side.

There were about 2,000 male players in the study, and 500 female players. Let's ignore their non-expert opponents, and assume all the leakage came from these 2,500 players.

That means the average female player would have (legitimately) lost 28 points due to leakage (14,000 divided by 500).  The average male player would gain 7 points (14,000 divided by 2000).

So, that much leakage would have cut the male/female ratings bias by 35 points.

But, since we started the process with a known 116 points of bias, we're left with 81 points still remaining! Even with such a large database of games, there aren't enough male/female games to get rid of more than 30 percent of the Elo bias caused by unbalanced opponents.
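For anyone who wants to check the arithmetic, here it is as a short script. The inputs are the rough figures used above (3,500 cross-sex games, 500 women, 2,000 men, 24 points per win), and the rounding to 4 points per game is the same deliberate overestimate.

```python
def expected_score(gap):
    """Win probability for a player who is `gap` points better (400-point rule)."""
    return 1 / (1 + 10 ** (-gap / 400))

TRUE_GAP = 116        # men's talent advantage, in Elo points
K = 24                # points per win at the expert level
CROSS_SEX_GAMES = 3500
N_WOMEN, N_MEN = 500, 2000

# The man truly wins .661, but equal ratings make Elo expect .500.
edge = expected_score(TRUE_GAP) - 0.5
leak_per_game = K * edge                        # about 3.9 points

total_leak = 4 * CROSS_SEX_GAMES                # rounded up to 4 points per game
women_loss = total_leak / N_WOMEN               # points lost per woman
men_gain = total_leak / N_MEN                   # points gained per man
bias_removed = women_loss + men_gain
bias_remaining = TRUE_GAP - bias_removed

print(round(edge, 3), round(leak_per_game, 1), total_leak,
      women_loss, men_gain, bias_remaining)
# prints: 0.161 3.9 14000 28.0 7.0 81.0
```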

If the true bias should be 81 points, why did the study find only 30?

Because the sample of games in the study isn't a complete set of all games that went into every player's rating.  For one thing, it's just the results of major tournaments, the ones that were significant enough to appear in "The Week in Chess," the publication from which the authors compiled their data.  For another thing, the authors used only 18 months' worth of data, but most of these expert players have been playing chess for years.

If we included all the games that all the players ever played, would that be enough to get rid of the bias?  We can't tell, because we don't know the number of intersex games in the players' full careers.  

We can say hypothetically, though.  If the average expert played three times as many games as logged in this 18-month sample, that still wouldn't be enough -- it would only cover 105 of the 116 points.  Actually, it would be a lot less, because once the ratings start to become accurate, the rate of correction decelerates.  By the time half the bias is covered, the remaining bias corrects at only 2 points per between-sex game, rather than 4.  

Maybe we can do this with a geometric argument.  The data in the sample reduced the bias from 116 to 81, which is 70 percent of the original.  So, a second set of data would reduce the bias to 57 points.  A third set would reduce it to 40 points.  And a fourth set would reduce it to 28 points, which is about what the study found.
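The geometric argument is just repeated multiplication by 0.7. A quick sketch, assuming each sample-sized batch of games leaves 70 percent of whatever bias it started with:

```python
INITIAL_BIAS = 116    # Elo points of bias before any cross-sex games
RETAINED = 0.7        # assumption: each batch of games leaves 70 percent of the bias

# Bias remaining after 1, 2, 3, and 4 sample-sized batches of games.
remaining_bias = [INITIAL_BIAS * RETAINED ** n for n in range(1, 5)]

print([round(b) for b in remaining_bias])
# prints: [81, 57, 40, 28]
```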

So, if every player in this study actually had four times as many man vs. woman games as were in this database, that would *still* not be enough to reduce the bias below what was found in the study.

And, again, that's conservative.  It assumes the same players in all four samples.  In real life, new players come in all the time, and if the new males tend to be better than the new females, that would start the bias all over again.

-------

So, I can't prove, mathematically, that the 30-point discrepancy the authors found is an expected artifact of the way the rating system works.  I can only show why it should be strongly suspected.

It's a combination of the fact that, for whatever reason, the men are stronger players than the women, and, again for whatever reason, there are many fewer male-female games than you need for the kind of balanced schedule that would make the ratings comparable.

And while we can't say for sure that this is the cause, we CAN say -- almost prove -- that this is exactly the kind of bias that happens, mathematically, unless you have enough male-female games to wash it out.  

I think the burden is on the authors of the study to show that there's enough data outside their sample to wash out that inherent bias, before introducing alternative hypotheses. Because, we *know* this specific effect exists, has positive sign, depends on data that's not given in the study, and could plausibly be exactly the size of the observed effect!  

(Assuming I got all this right. As always, I might have screwed up.)

-------

So I think there's a very strong case that what this study found is just a form of this "NCAA vs. NBA" bias. It's an effect that must exist -- it's just the size that we don't know. But intuitive arguments suggest the size is plausibly pretty close to what the study found.

So it's probably not that women perform worse against men of equal talent. It's that women perform worse against men of equal ratings.


UPDATE: In a discussion at Tango's site, commenter TomC convinced me that there is enough "leakage" to unbias the ratings. I found an alternative explanation that I think works -- this time, I verified it with a simulation.  Will post that as soon as I can.


UPDATE: Here it is, Part II.


Peter Backus, Maria Cubel, Matej Guid, Santiago Sánchez-Pagés, Enrique López Mañas: Gender, Competition and Performance: Evidence from Real Tournaments.  

