UPDATE, 1/29/2017: Oops! Due to an arithmetic error, I had the pools of men and women apart by 200 instead of 116, in my simulation. Will rerun everything and post again.
UPDATE, 2/6/2017: New post is here after rerunning stuff. Read this first if you haven't already.
Do women chess players perform worse against male opponents because they find the men intimidating and threatening? Last post, I talked about a paper that makes that claim. I disagreed with their conclusions, thinking the effect was actually the result of women and men being compared to different averages. It turns out I was wrong -- there are enough inter-sex matches that the ratings would clean themselves up over time.
So, I'll have a second argument -- this time, with a simulation I ran to make sure it actually holds up. But, before I do that, I should talk about the regressions in the paper itself.
The first regression comes in Table 2. It tries to estimate the number of points the player earns (equivalent to the probability of a player winning, where a tie counts as half a win). To get that, it regresses on:
-- the expected result, based on the Elo formula applied to the two players' ratings;
-- whether the player plays the white or black pieces (in chess, white moves first, and so has an advantage);
-- the ages of the two competitors; and, of course,
-- the sexes of the two competitors.
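For reference, the "expected result" in the first bullet comes from the standard Elo formula, which converts a ratings difference into an expected score (where a draw counts as half a win). A minimal sketch:

```python
# Standard Elo expected-score formula: the expected result for player A
# against player B, given their ratings. A draw counts as half a win.
def elo_expected(rating_a, rating_b):
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(elo_expected(2100, 2100))             # evenly matched: 0.5
print(round(elo_expected(2200, 2100), 2))   # 100-point favorite: about 0.64
```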
I wonder about why the authors chose to include age in this regression. Is Elo biased by age? If so, could whatever biased it by age also have biased it by sex?
Furthermore, why would the bias be linear in age, such that the difference between a 42- and a 22-year-old would be ten times as large as the difference between a 24- and a 22-year-old? That doesn't seem plausible to me at all.
Anyway, maybe I'm just nitpicky here. This might not actually matter much.
(Well, OK, if you want to get technical -- the effect of playing white can't be linear either, can it? Suppose playing white lifts your chances of winning from .47 to .53, if you're playing someone of equal talent. But if you're a much better player, does it really lift your chances from .90 to .96? Probably not.
In fairness, the authors did run a version of one regression that included the square of the expected win probability -- but they didn't interact it with the white/black variable, or any of the others, I think.)
In any case, the coefficient of interest comes out to .021. That means that when a player faces a woman as opposed to a man -- after controlling for age, ratings differential, and color -- his or her chances of winning are 2.1 percentage points higher. For a .700 favorite moving to .721, that's the equivalent of about 18 Elo points.
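That 18-point conversion can be checked by inverting the Elo formula -- find the rating gap implied by each win probability, and take the difference:

```python
import math

# Invert the Elo expected-score formula: the rating gap implied by a given
# win probability p is 400 * log10(p / (1 - p)).
def implied_gap(p):
    return 400.0 * math.log10(p / (1.0 - p))

# The 2.1-percentage-point shift, expressed in Elo points:
print(round(implied_gap(0.721) - implied_gap(0.700), 1))  # roughly 18
```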
The second regression (Table 4) still tries to predict winning percentage, but based on a larger set of variables:
-- the sex of each competitor (again);
-- the expected winning percentage, based on the Elo differential (again);
-- the ages of both players (again);
-- the Elo rating of both players;
-- the country (national chess federation) to which the player belongs; and
-- the proportion of other players at the event who are female.
The first thing that strikes me is that the study uses two separate measures of talent -- expected win probability, and Elo ratings. This seems like duplication. I guess, though, that the two are non-linear in different ways, so maybe the second corrects for the errors of the first. But, if neither alone is unbiased across different types of players, who's to say that both together are unbiased?
Also, the expected winning percentage will be highly correlated with the two Elo ratings ... wouldn't that cause weird effects? We can't tell, because the authors don't show ANY of the coefficient estimates except the male/female one. (See Table 4.)
The authors also include player fixed effects -- in other words, they assume every player has his or her own idiosyncratic arithmetic "bump" in expected winning percentage. This would make sense in other contexts, but seems weird in a situation where every player has a specific rating that's supposed to cover all that. But, Ted Turocy assured me that shouldn't affect the estimates of the other coefficients, so I'll just go with it for now.
In any case, there's so much going on in this regression that I have no idea how to interpret what comes out of it, especially without coefficient estimates.
I don't even have a gut feel. The estimate of the "intimidation" effect *doubles* when these new variables are introduced. How does that happen? Is the formula to produce winning percentage from ratings so badly wrong that the ratings themselves double the effect? Are ratings that biased by country? Are women twice as intimidated by male opponents in the absence of fellow female players?
It doesn't make sense to my gut. So I'm not going to try to figure out this second regression at all. I'll just go with the first one, the one that found an effect of 18 Elo points.
OK, now my argument: I think there's a specific reason for the effect the authors found, one intrinsic to the structure of chess ratings, and having nothing to do with what happens when women compete against men. It has to do with regression to the mean.
Chess (Elo) ratings aren't perfectly accurate. They change after every game, depending on the results. So, a player who has been lucky lately will have a rating higher than his or her talent, and vice-versa. This is the same idea as in baseball, or any other sport. (In MLB, after 162 games, you have to regress a team's record to the mean by about 40 percent to get a true estimate of its talent, so a team that went 100-62 was, in expectation, a 92-70 talent that got lucky.)
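The baseball arithmetic can be spelled out explicitly -- keep 60 percent of the distance from .500, discard the other 40 percent:

```python
# Regress a 100-62 MLB record 40 percent of the way to the .500 mean
# to estimate the team's true talent.
games, wins = 162, 100
observed = wins / games                     # .617
talent = 0.500 + 0.6 * (observed - 0.500)   # keep 60%, regress 40%
print(round(talent * games))                # about 92 wins
```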
Suppose a man faces another male opponent, but has a 50-point advantage in rating. Both ratings have to be regressed to the mean, so the talent gap is smaller than 50 points. Maybe, then, the better player should only be a 45-point favorite, or something.
Now, suppose a man faces a *female* opponent, with the same 50-point advantage. In this case, I would argue, after regressing to the mean, the man actually has MORE than a 50-point advantage.
Why? Because, in general, the women have lower ratings than the men, by 116 points. So when there's only a 50-point gap between them, we're probably looking at a woman who's above average for a woman, and a man who's below average for a man.
That means the woman has to be regressed DOWN towards the women's mean, and the man has to be regressed UP towards the men's mean.
In other words: when a man faces a woman, the true talent gap is probably larger than when a man faces a man, even when the Elo ratings are exactly the same.
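Under a simple normal model, this can be quantified: each rating gets shrunk toward its own group's mean, by a factor that depends on how noisy the ratings are. The group means and the 150-point talent SD below are illustrative assumptions -- only the 116-point gap between the groups comes from the study:

```python
# Shrink a rating toward its own group's mean. The talent SD (150) and
# noise SD (52) are assumed for illustration.
def shrunk(rating, group_mean, talent_sd=150.0, noise_sd=52.0):
    k = talent_sd**2 / (talent_sd**2 + noise_sd**2)  # shrinkage factor
    return group_mean + k * (rating - group_mean)

men_mean, women_mean = 2316.0, 2200.0  # hypothetical means, 116 points apart
man, woman = 2250.0, 2200.0            # man rated 50 points higher

# Nominal gap is 50 points; after shrinking toward two DIFFERENT means,
# the estimated true-talent gap comes out larger than 50.
gap = shrunk(man, men_mean) - shrunk(woman, women_mean)
print(round(gap, 1))
```

The man, below his group's mean, gets shrunk upward; the woman, above hers, stays put or gets shrunk downward -- so the estimated gap widens.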
Here's an easier example:
Suppose an NBA team beats another NBA team by 20 points. They were probably substantially lucky, right? If they faced each other again, you'd expect the next game to be a lot closer than that.
But, suppose an NBA team beats *an NCAA team* by 20 points. In that case, the NCAA team must have played great, to get to within 20 points of the pros. In that case, you'd expect the next game to be a much bigger blowout than 20 points.
Well, this time, I figured I'd better test out the logic before posting. So I ran a simulation. I created random men and women with a talent distribution similar to what was in the original study. My distributions were bell-shaped -- I don't know what the real-life distributions were.
(UPDATE: As I mentioned, the men were about 200 points more talented than the women, instead of 116. Oops. Spoiler alert: divide the effect in half, roughly, for now.)
Then, I ran a random simulation of games, with proportions of MM, MW, WM, and WW similar to those in the study.
For the first simulation, I assumed that the ratings were perfect, representing the players' talents exactly. As expected, the regression found no real difference between the women and men. The coefficients were close to zero and not statistically significant.
So far, so good.
Then, I added the random noise. For every player, I randomly adjusted his or her rating to vary from talent, with mean zero and SD of about 52 points.
I ran the simulation again.
The data matched my hypothesis this time: the man/woman games have to be regressed differently than the woman/woman games.
When a man played another man, and the difference between their ratings was less than 50 points ... on average, both men matched their ratings. (In other words, each man was as likely to be too high as too low.) But when a man played a woman, and the difference between their ratings was less than 50 points ... big difference. In that case, the woman, on average, was 10 points worse than her rating, and the man was 10 points better than his rating.
In other words, when a man faced a similarly-rated woman, his true advantage was 20 points more than the ratings suggested.
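A rough sketch of that kind of simulation is below. The pool sizes, group means, and the 150-point talent SD are illustrative assumptions; the 52-point noise SD and the 116-point gap are the figures from the post (the post's own run accidentally used a gap of about 200):

```python
import random

random.seed(1)

SD_NOISE = 52.0  # rating error around true talent, as in the post

# Bell-shaped talent pools, means 116 points apart (means and 150-point
# talent SD are illustrative assumptions).
men_talent   = [random.gauss(2316, 150) for _ in range(20000)]
women_talent = [random.gauss(2200, 150) for _ in range(20000)]

# Each player's observed rating = true talent + noise.
men   = [(t + random.gauss(0, SD_NOISE), t) for t in men_talent]
women = [(t + random.gauss(0, SD_NOISE), t) for t in women_talent]

pools = {'M': men, 'W': women}

def avg_luck(player_sex, opp_sex, trials=100000):
    """Average (rating - talent) for the player, over random pairings
    where the two ratings are within 50 points of each other."""
    diffs = []
    for _ in range(trials):
        ra, ta = random.choice(pools[player_sex])
        rb, _tb = random.choice(pools[opp_sex])
        if abs(ra - rb) < 50:
            diffs.append(ra - ta)
    return sum(diffs) / len(diffs)

mm = avg_luck('M', 'M')  # man vs. man: luck washes out, near zero
mw = avg_luck('M', 'W')  # man vs. close-rated woman: rating UNDERstates him
wm = avg_luck('W', 'M')  # woman vs. close-rated man: rating OVERstates her
print(round(mm, 1), round(mw, 1), round(wm, 1))
```

With these assumed SDs the shifts come out a few points either side of zero rather than exactly 10, but the signs match the post's result: in close cross-sex matchups, the man is better than his rating and the woman is worse than hers.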
For a direct comparison, I recreated the authors' first regression (without the age factor). Here's what I got, compared to the study:
                     Me     Study
Player is female    -.024   -.021
Opponent is female  +.020   +.021
Female/Female       +.002   +.001
Almost exact! One thing that's a little different is that, in real life, the player and opponent effects were equal. In the simulation, they weren't. That's just random: in my case, the females designated "opponent" were a little luckier than the ones designated "player." In the real study, they happened to be very close to equal. In neither case is the difference statistically significant.
(The only differences between the two categories: the "player" must have a rating of at least 2000, and must have had at least one game against each sex. The study didn't require the "opponent" to meet either criterion.)
In order to make things easier to read, in subsequent runs of the simulation, I included both sides of every game -- I just switched the player and opponent, and treated it as a second observation.
This equalizes the two coefficients while keeping the expected value of the effects the same. (It does invalidate the standard errors, but I don't care about significance here, so that doesn't matter.) It also eliminates the need for the "female/female" interaction term.
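The doubling trick can be sketched with a toy list of games (the tuples below are hypothetical; a result of 1 is a win for the "player," 0.5 a draw, 0 a loss):

```python
# Hypothetical rows of (player_sex, opponent_sex, result).
games = [('M', 'W', 1.0), ('W', 'W', 0.5), ('M', 'M', 0.0)]

# Each game appears twice: once from each side, with the result flipped.
doubled = []
for player, opp, result in games:
    doubled.append((player, opp, result))
    doubled.append((opp, player, 1.0 - result))  # mirrored observation

print(len(doubled))  # twice as many rows as games
```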
Here's that next run of the simulation:
                     Me     Study
Player is female    -.020   -.021
Opponent is female  +.020   +.021
Exactly the same! Well, that's no coincidence. By trial and error, I figured out that an SD of 52 points between talent and rating is what made it work, so that's what I used. (I should really have raised it to 53 or 54 to get it to come out exactly, but for some reason I thought the target was .020 instead of .021, and by the time I realized it was .021, the sim was done and I was too lazy to go back and rerun it.)
In any case, I think this proves that (a) there MUST be a "women appear to play worse against men" effect arising purely from the way the data is structured, and (b) the effect size the authors found is consistent with a reasonable ratings accuracy of 52 to 54 points or so.
If the authors want to search for an effect in addition to this regression artifact, they have to figure out what the real ratings accuracy is, and adjust for it in their calculations. I'm not sure how easy it would be to do that.
I could stop here, but I'll try reproducing one more of the study's checks, just in case you're not yet convinced.
In Table 6, the authors showed the results of regressions separating the data by how well-matched the players were. They did this only for the regression with all the extra variables that doubled their coefficients, so their numbers are higher. Here's what they got:
.057 Players within 50 points
.052 Players within 100 points
.033 Players within 200 points
I did the same, and got
.031 Players within 50 points
.025 Players within 100 points
.027 Players within 200 points
.013 Players 200+ points apart
The general pattern is similar: a larger effect for closely-matched players, and a smaller effect for mismatches.
The biggest difference is that in my simulation, the dropoff comes after 200 points, whereas in the original study, it seems to come somewhere before 200 points.
Part of the difference might be that I chose completely random matchups between players, but, in real life, the tournaments paired up players with similar ratings. Overall, the women were 116 points lower than the men, but in actual tournaments, they were only 37 points worse.
I'm guessing that if my simulation chose similar matchups to real life, the numbers would come out much closer.
I could try to reproduce more of the study's regression results, but I'll stop here. I think this is enough for a strong basis for the conclusion that what the authors found is just caused by the fact that (a) Elo ratings aren't perfect, and (b) as a group, the women have much lower ratings than the men.
In other words, it's just a statistical artifact.
Labels: chess, Elo, gender, ratings