Monday, February 06, 2017

Are women chess players intimidated by male opponents? Part III

Over at Tango's blog, commenter TomC found an error in my last post. I had meant to create a sample of male chess players with mean 2400, but I added wrong and created a mean of 2500 instead. (The distribution of females was correct, with mean 2300.)

The effect produced by my original distribution came close to the one in the paper, but, with the correction, it drops to about half.

The effect is the difference in a woman's win probability when she faces a man, compared to when she faces a woman with the same rating:

-0.021 paper
-0.020 my error
-0.010 corrected
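
For reference, the win probabilities throughout come from the standard Elo expectation formula. Here's a minimal Python version, just to pin down the calculation; the later sketches reuse this function:

def elo_win_prob(rating_a, rating_b):
    """Expected score for player A under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(elo_win_prob(2400, 2300))  # a 100-point favorite scores about 0.64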

It makes sense that the observed effect drops when the distribution of men gets closer to the distribution of women. That's because (by my hypothesis) it's caused by the fact that women and men have to be regressed to different means. The more different the means, the larger the effect. 

Suppose the distributions match my error: mean 2300 for the women and 2500 for the men. When a 2400 woman plays a 2400 man, her rating of 2400 needs to be regressed down, towards 2300. But the man's rating needs to be regressed *up*, towards 2500. That means the woman was probably overrated, and the man underrated.

But, when the men's mean is only 2400, the man no longer needs to be regressed at all, because he's right at the mean. So, only the woman needs to be regressed, and the effect is smaller.
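
Here's a minimal sketch of that shrinkage logic in Python. The 120-point talent SD is my own back-of-envelope assumption (a 131-point rating SD less 52 points of rating noise, in quadrature), not a number from the study. With normal talent and normal noise, the best estimate of talent regresses the observed rating toward the group mean by a fixed factor:

TALENT_SD = 120.0  # assumed spread of true talent within a group
NOISE_SD = 52.0    # assumed SD of (rating - true talent), as discussed below

# Shrinkage factor: the share of a rating's deviation from the group
# mean that reflects real talent rather than rating noise.
K = TALENT_SD ** 2 / (TALENT_SD ** 2 + NOISE_SD ** 2)  # about 0.84

def expected_talent(rating, group_mean):
    """Best estimate of true talent, given an observed rating."""
    return group_mean + K * (rating - group_mean)

print(expected_talent(2400, 2300))  # the woman: regressed down, to about 2384
print(expected_talent(2400, 2500))  # the man: regressed up, to about 2416

Under those assumptions, the nominally even 2400-vs-2400 matchup is really about a 32-point edge for the man, which the Elo formula turns into roughly a .455 chance for the woman instead of .500.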

-------

The effect becomes larger when the players are closely matched in rating. That's when it's most likely that the woman is above average, and the man below average. The original study found a larger effect in close matches, and so did my (corrected) simulation (a sketch of the bucketing follows the list):

.0130 -- ratings within 50 points
.0144 -- ratings within 100 points
.0129 -- ratings within 200 points
.0057 -- ratings more than 200 points apart
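
Here's the shape of that computation, reusing elo_win_prob, TALENT_SD, and NOISE_SD from the sketches above. Two caveats: it measures the per-game effect crudely, as true-talent win probability minus rating-implied win probability, rather than via the study's regression; and the buckets are disjoint bands, since I'm not showing the overlapping "within X" cuts:

import random

def one_game():
    """One simulated woman-vs-man pairing, using the corrected means.

    Returns the absolute rating gap and a crude per-game 'effect':
    the woman's true-talent win probability minus the win probability
    implied by the two ratings alone.
    """
    w_talent = random.gauss(2300, TALENT_SD)
    m_talent = random.gauss(2400, TALENT_SD)
    w_rating = w_talent + random.gauss(0, NOISE_SD)
    m_rating = m_talent + random.gauss(0, NOISE_SD)
    effect = (elo_win_prob(w_talent, m_talent)
              - elo_win_prob(w_rating, m_rating))
    return abs(w_rating - m_rating), effect

# Disjoint bands: 0-50, 50-100, 100-200, and 200+ points apart.
bands = {50: [], 100: [], 200: [], float("inf"): []}
for _ in range(200_000):
    gap, effect = one_game()
    for cutoff in sorted(bands):
        if gap <= cutoff:
            bands[cutoff].append(effect)
            break

for cutoff, effects in sorted(bands.items()):
    print(cutoff, sum(effects) / len(effects))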

Why is that important?  Because, in my simulation, I chose the participants at random from the distributions. In real life, tournaments try to match players to opponents with similar ratings.

In the study, the men's ratings were higher than the women's, by an average of 116 points. But when a man faced a woman, the average advantage wasn't 116 -- it was much lower. As the study says, on page 18,

"However, when a female player in our sample plays a male opponent, she faces an average disadvantage of 27 Elo points ..."

Twenty-seven points is a very small disadvantage, about one quarter of the 116 points you'd see if tournament matchups were random. The matching of players makes the effect look larger.

So, I rejigged my simulation to make the matches closer. I chose a random matchup, then discarded it with a probability that varied with the ratings difference.

If the players had the same rating, I always kept the match. If the difference was more than 400 points, I always discarded it. In between, the probability of keeping the match fell on a sliding scale.


(Technical details: I kept each game with a probability based on the 1.333 power of the ratings difference. So 200-point games, which are halfway between 0 and 400, got kept only 1/2.52 of the time, 2.52 being 2 to the power of 1.333. Why 1.333? I tried a few exponents, and that one happened to get the resulting distributions of men and women closest to what was in the study. Other choices would have worked too; for what it's worth, the results with other exponents were very similar.)
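
In code, that keep/discard rule is short. This is my reading of the sliding scale -- keep probability equals (1 - gap/400) raised to the 1.333 power -- reusing the constants from the earlier sketches:

import random

def keep_match(rating_a, rating_b, cutoff=400.0, power=1.333):
    """Rejection-sample matchups so closer matches survive more often.

    The keep probability slides from 1 (equal ratings) down to 0
    (400+ points apart); at a 200-point gap it's 0.5 ** 1.333,
    about 1 in 2.52.
    """
    gap = abs(rating_a - rating_b)
    if gap >= cutoff:
        return False
    return random.random() < (1.0 - gap / cutoff) ** power

# Usage: draw random pairings, keep only what the filter accepts.
pairings = []
while len(pairings) < 100_000:
    w = random.gauss(2300, TALENT_SD) + random.gauss(0, NOISE_SD)
    m = random.gauss(2400, TALENT_SD) + random.gauss(0, NOISE_SD)
    if keep_match(w, m):
        pairings.append((w, m))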

Now, my results were back close to what the study had come up with, in its Table 2:

.021 study
.019 simulation

To verify that I didn't screw up again, I compared my summary statistics to the study.  They were reasonably close.  All numbers are Elo ratings:

Mean 2410, SD 175: men, study
Mean 2413, SD 148: men, simulation

Mean 2294, SD 141: women, study
Mean 2298, SD 131: women, simulation

Mean 27 points: man's rating edge in M-vs-W games, study
Mean 46 points: man's rating edge in M-vs-W games, simulation

The biggest difference was in the opponents faced:

Mean 2348: men's opponents, study
Mean 2408: men's opponents, simulation

Mean 2283: women's opponents, study
Mean 2321: women's opponents, simulation

The difference here is that, in real life, the players chosen by the study faced opponents worse than themselves, on average. (Part of the reason is that the study included only the better players (rated 2000+) as "players", but included all their opponents, regardless of skill.) In my simulation, players and opponents were drawn from the same distributions.

I don't think that affects the results, but I should definitely mention it.

-------

Another thing I should mention, in defense of the study: last post, I questioned what happens when you include the actual ratings in the regression, instead of just the win probability (which is based on the ratings). I checked, and that actually *lowers* the observed effect, if only by a little. From my simulation:

.0188 -- ratings not included
.0187 -- ratings included
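
For the record, here's the shape of that check, with plain least squares standing in for the study's regression. The specification -- an opponent-is-male dummy plus the Elo expected score, optionally adding the two raw ratings -- is my guess at the setup, not the study's actual code:

import numpy as np

def opp_male_coefficient(result, exp_score, opp_male,
                         own_rating=None, opp_rating=None):
    """OLS coefficient on the opponent-is-male dummy.

    result:    1/0 outcome of each game, from the woman's side
    exp_score: Elo expected score computed from the two ratings
    opp_male:  1 if the opponent is a man, 0 if a woman
    All inputs are numpy arrays; optionally also control for the
    two raw ratings themselves.
    """
    cols = [np.ones_like(exp_score), exp_score, opp_male]
    if own_rating is not None:
        cols += [own_rating, opp_rating]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, result, rcond=None)
    return beta[2]  # the dummy's coefficient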

And, one more: as I mentioned last post, I chose an SD of 52 points for the difference between a player's rating and his or her actual talent. I have no idea if 52 points is a reasonable estimate; my gut suggests it's too high. Reducing the SD would also reduce the size of the observed effect.
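
That direction falls straight out of the shrinkage factor from the earlier sketch: less rating noise means less regression, which leaves a smaller hidden bias in a nominally even matchup. With my assumed 120-point talent SD and the corrected 100-point gap in group means:

# Leftover Elo bias in an even-rated woman-vs-man matchup: the
# unregressed (1 - k) share of the 100-point gap in group means.
for noise_sd in (26.0, 52.0, 104.0):
    k = TALENT_SD ** 2 / (TALENT_SD ** 2 + noise_sd ** 2)
    print(noise_sd, round((1 - k) * 100, 1))  # 4.5, 15.8, 42.9 points

Halving the noise SD cuts the hidden bias to less than a third of its size, so the observed effect would shrink accordingly.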

I still suspect that the study's effect is almost entirely caused by this regression-to-the-mean effect. But, without access to the study's data, I don't know the exact distribution of the matchups, so I can't simulate closer to real life.

But, as a proof of concept, I think the simulation shows that the effect they found in Table 2 is of the same order of magnitude as what you'd expect for purely statistical reasons. 

So I don't think the study found any evidence at all for its hypothesis of male-to-female intimidation.




P.S.  Thanks again to TomC for finding my error, and for continued discussion at Tango's site.


