Monday, February 06, 2017

Are women chess players intimidated by male opponents? Part III

Over at Tango's blog, commenter TomC found an error in my last post. I had meant to create a sample of male chess players with mean 2400, but I added wrong and created a mean of 2500 instead. (The distribution of females was correct, with mean 2300.)

The effect produced by my original distribution came close to the one in the paper, but, with the correction, it drops to about half.

The effect is the difference in a woman's win probability when she faces a man, compared to when she faces a woman with the same rating:

-0.021 paper
-0.020 my error
-0.010 corrected

It makes sense that the observed effect drops when the distribution of men gets closer to the distribution of women. That's because (by my hypothesis) it's caused by the fact that women and men have to be regressed to different means. The more different the means, the larger the effect. 

Suppose the distribution matches my error, 2300 for the women and 2500 for the men. When a 2400 woman plays a 2400 man, her rating of 2400 needs to be regressed down, towards 2300. But the man's rating needs to be regressed *up*, towards 2500. That means the woman was probably overrated, and the man underrated.

But, when the men's mean is only 2400, the man no longer needs to be regressed at all, because he's right at the mean. So, only the woman needs to be regressed, and the effect is smaller.
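
Here's that shrinkage arithmetic as a small Python sketch. The 52-point rating noise is the SD from my simulation; the 150-point talent SD within each pool is just an illustrative guess:

```python
# Toy illustration: equal ratings don't imply equal talent when the players
# come from pools with different means. The pool SDs here are assumptions.

def expected_talent(rating, pool_mean, sd_talent=150, sd_noise=52):
    """Regress an observed rating toward its pool mean.

    Ratings are modeled as talent plus noise, so the best guess of talent
    shrinks the rating toward the pool mean by the reliability factor
    sd_talent^2 / (sd_talent^2 + sd_noise^2).
    """
    reliability = sd_talent ** 2 / (sd_talent ** 2 + sd_noise ** 2)
    return pool_mean + reliability * (rating - pool_mean)

# The "my error" scenario: a 2400 woman (pool mean 2300) vs. a 2400 man (pool mean 2500)
print(expected_talent(2400, 2300))  # ~2389 -- regressed down
print(expected_talent(2400, 2500))  # ~2411 -- regressed up

# The corrected scenario: the man's pool mean is 2400, so he isn't regressed at all
print(expected_talent(2400, 2400))  # 2400
```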

-------

The effect becomes larger when the players are more closely matched in rating. That's when it's most likely that the woman is above average, and the man is below average. The original study found a larger effect in close matches, and so did my (corrected) simulation:

.0130 -- ratings within 50 points
.0144 -- ratings within 100 points
.0129 -- ratings within 200 points
.0057 -- ratings more than 200 points apart

Why is that important?  Because, in my simulation, I chose the participants at random from the distributions. In real life, tournaments try to match players to opponents with similar ratings.

In the study, the men's ratings were higher than the women's, by an average 116 points. But when a man faced a woman, the average advantage wasn't 116 -- it was much lower. As the study says, on page 18,

"However, when a female player in our sample plays a male opponent, she faces an average disadvantage of 27 Elo points ..."

Twenty-seven points is a very small disadvantage, about one quarter of the 116 points you'd see if tournament matchups were random. The matching of players makes the effect look larger.

So, I rejigged my simulation to make all matches closer. I chose a random matchup, and then discarded it a certain percentage of the time, varying with the ratings difference. 

If the players had the same rating, I always kept that match. If the difference was more than 400 points, I always discarded it. In between, I kept it with a probability that slid between those two extremes.


(Technical details: I decided whether to keep each game with a probability based on the 1.33 power of the difference. So 200-point games, which are halfway between 0 and 400, got kept only 1/2.52 of the time (2.52 being 2 to the power of 1.33).
Why 1.33? I tried a few exponents, and that one happened to get the resulting distributions of men and women closest to what was in the study. Other choices would have worked too; for what it's worth, the results with other exponents were very similar.)
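
In case it helps, here's roughly what that filter looks like in Python. The exact formula isn't important; this version reproduces the 1/2.52 keep rate for 200-point games described above:

```python
import random

def keep_matchup(rating_a, rating_b, exponent=1.33):
    """Decide whether to keep a randomly drawn matchup.

    Always keep equal-rated pairs, always discard pairs 400+ points apart,
    and in between keep with probability ((400 - diff) / 400) ** exponent,
    so a 200-point gap survives about 1 / 2**1.33 = 1/2.52 of the time.
    """
    diff = abs(rating_a - rating_b)
    if diff >= 400:
        return False
    return random.random() < ((400 - diff) / 400) ** exponent
```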

Now, my results were back close to what the study had come up with, in its Table 2:

.021 study
.019 simulation

To verify that I didn't screw up again, I compared my summary statistics to the study.  They were reasonably close.  All numbers are Elo ratings:

Mean 2410, SD 175: men, study
Mean 2413, SD 148: men, simulation

Mean 2294, SD 141: women, study
Mean 2298, SD 131: women, simulation

Mean 27 points: M vs. W opp. difference, study
Mean 46 points: M vs. W opp. difference, simulation

The biggest difference was in the opponents faced:

Mean 2348: men's opponents, study
Mean 2408: men's opponents, simulation

Mean 2283: women's opponents, study
Mean 2321: women's opponents, simulation

The difference here is that, in real life, the players chosen by the study faced opponents worse than themselves, on average. (Part of the reason is that the study included only the better players (rating of 2000+) as "players", but included all their opponents, regardless of skill.)  In the simulation, the opponents were drawn from the same distributions as the players.

I don't think that affects the results, but I should definitely mention it.

-------

Another thing I should mention, in defense of the study: last post, I questioned what happens when you include actual ratings in the regression, instead of just win probability (which is based on the ratings). I checked, and that actually *lowers* the observed effect, even if only a little bit. From my simulation:

.0188 not included
.0187 included

And, one more: as I mentioned last post, I chose an SD of 52 points for the difference between a player's rating and his or her actual talent. I have no idea if 52 points is a reasonable estimate; my gut suggests it's too high. Reducing the SD would also reduce the size of the observed effect.

I still suspect that the study's effect is almost entirely caused by this regression-to-the-mean effect.  But, without access to the study's data, I don't know the exact distributions of the matchups, to simulate closer to real life. 

But, as a proof of concept, I think the simulation shows that the effect they found in Table 2 is of the same order of magnitude as what you'd expect for purely statistical reasons. 

So I don't think the study found any evidence at all for their hypothesis of male-to-female intimidation.




P.S.  Thanks again to TomC for finding my error, and for continued discussion at Tango's site.




Thursday, January 26, 2017

Are women chess players intimidated by male opponents? Part II

UPDATE, 1/29/2017: Oops!  Due to an arithmetic error, I had the pools of men and women apart by 200 instead of 116, in my simulation.  Will rerun everything and post again.  

UPDATE, 2/6/2017: New post is here, after rerunning everything.  Read this one first if you haven't already.

-----

Do women chess players perform worse against male opponents because they find the men intimidating and threatening? Last post, I talked about a paper that makes that claim. I disagreed with their conclusions, thinking the effect was actually the result of women and men being compared to different averages. It turns out I was wrong -- there are enough inter-sex matches that the ratings would clean themselves up over time.

So, here's a second argument -- this time, backed by a simulation I ran to make sure it actually holds up. But, before I get to that, I should talk about the regressions in the paper itself. 

------

The first regression comes in Table 2. It tries to estimate the number of points the player earns (equivalent to the probability of a player winning, where a tie counts as half a win). To get that, it regresses on:

-- the expected result, based on the Elo formula applied to the two players' ratings;
-- whether the player plays the white or black pieces (in chess, white moves first, and so has an advantage); 
-- the ages of the two competitors; and, of course,
-- the sexes of the two competitors.

I wonder why the authors chose to include age in this regression. Is Elo biased by age? If so, could whatever biased it by age also have biased it by sex?

Furthermore, why would the bias be linear on age, such that the difference between a 42- and 22-year old would be ten times as large as the difference between a 24- and 22-year old? That doesn't seem plausible to me at all.

Anyway, maybe I'm just nitpicky here. This might not actually matter much.

(Well, OK, if you want to get technical -- the effect of playing white can't be linear either, can it? Suppose playing white lifts your chances of winning from .47 to .53, if you're playing someone of equal talent. But if you're a much better player, does it really lift your chances from .90 to .96? Probably not.

In fairness, the authors did run a version of one regression that included the square of the expected win probability -- but they didn't interact it with the white/black variable, or any of the others, I think.)

In any case, the coefficient of interest comes out to .021. That means that when a player faces a woman as opposed to a man -- after controlling for age, ratings differential, and color -- his or her chances of winning are 2.1 percentage points higher. For a .700 favorite moving to .721, that's the equivalent of about 18 Elo points.
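
To see where the 18 points comes from: the Elo expectancy formula is E = 1 / (1 + 10^(-gap/400)), so you can invert it and ask how big a rating gap corresponds to a given win probability. A quick Python check:

```python
import math

def elo_gap_from_prob(p):
    """Invert the Elo expectancy formula: the rating gap implied by win probability p."""
    return 400 * math.log10(p / (1 - p))

# The 2.1-percentage-point effect, expressed in Elo points for a .700 favorite:
print(elo_gap_from_prob(0.721) - elo_gap_from_prob(0.700))  # about 18
```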

-------

The second regression (Table 4) still tries to predict winning percentage, but based on a larger set of variables:

-- the sex of each competitor (again);
-- the expected winning percentage, based on the Elo differential (again);
-- the ages of both players (again);
-- the Elo rating of both players;
-- the country (national chess federation) to which the player belongs; and
-- the proportion of other players at the event who are female.

The first thing that strikes me is that the study uses two separate measures of talent -- expected win probability, and Elo ratings. This seems like duplication. I guess, though, that the two are non-linear in different ways, so maybe the second corrects for the errors of the first. But, if neither alone is unbiased across different types of players, who's to say that both together are unbiased? 

Also, the expected winning percentage will be highly correlated with the two Elo ratings ... wouldn't that cause weird effects? We can't tell, because the authors don't show ANY of the coefficient estimates except the male/female one. (See Table 4.)

The authors also include player fixed effects -- in other words, they assume every player has his or her own idiosyncratic arithmetic "bump" in expected winning percentage. This would make sense in other contexts, but seems weird in a situation where every player has a specific rating that's supposed to cover all that. But, Ted Turocy assured me that shouldn't affect the estimates of the other coefficients, so I'll just go with it for now.

In any case, there's so much going on in this regression that I have no idea how to interpret what comes out of it, especially without coefficient estimates.

I don't even have a gut feel. The estimate of the "intimidation" effect *doubles* when these new variables are introduced. How does that happen? Is the formula to produce winning percentage from ratings so badly wrong that the ratings themselves double the effect? Are ratings that biased by country? Are women twice as intimidated by male opponents in the absence of fellow female players?

It doesn't make sense to my gut. So I'm not going to try to figure out this second regression at all. I'll just go with the first one, the one that found an effect of 18 Elo points.


-------

OK, now my argument: I think there's a specific reason for the effect the authors found, one intrinsic to the structure of chess ratings, and having nothing to do with what happens when women compete against men. It has to do with regression to the mean.

Chess (Elo) ratings aren't perfectly accurate. They change after every game, depending on the results. So, a player who has been lucky lately will have a rating higher than his or her talent, and vice-versa. This is the same idea as in baseball, or any other sport. (In MLB, after 162 games, you have to regress a team's record to the mean by about 40 percent to get a true estimate of its talent, so a team that went 100-62 was, in expectation, a 92-70 talent that got lucky.)

Suppose a man faces another male opponent, but has a 50-point advantage in rating. Both ratings have to be regressed to the mean, so the talent gap is smaller than 50 points. Maybe, then, the better player should only be a 45-point favorite, or something.

Now, suppose a man faces a *female* opponent, with the same 50-point advantage. In this case, I would argue, after regressing to the mean, the man actually has MORE than a 50-point advantage.

Why? Because, in general, the women have lower ratings than the men, by 116 points. So when there's only a 50-point gap between them, we're probably looking at a woman who's above average for a woman, and a man who's below average for a man.

That means the woman has to be regressed DOWN towards the women's mean, and the man has to be regressed UP towards the men's mean.

In other words: when a man faces a woman, the true talent gap is probably larger than when a man faces a man, even when the Elo ratings are exactly the same.

Here's an easier example:

Suppose an NBA team beats another NBA team by 20 points. They were probably substantially lucky, right? If they faced each other again, you'd expect the next game to be a lot closer than that.

But, suppose an NBA team beats *an NCAA team* by 20 points. In that case, the NCAA team must have played great, to get to within 20 points of the pros. In that case, you'd expect the next game to be a much bigger blowout than 20 points.

-----

Well, this time, I figured I'd better test out the logic before posting. So I ran a simulation. I created random men and women with a talent distribution similar to what was in the original study. My distributions were bell-shaped -- I don't know what the real-life distributions were.

(UPDATE: As I mentioned, the men were about 200 points more talented than the women, instead of 100.  Oops.  Spoiler alert: divide the effect in half, roughly, for now.)

Then, I ran a random simulation of games, with proportions of MM, MW, WM, and WW similar to those in the study. 

For the first simulation, I assumed that the ratings were perfect, representing the players' talents exactly. As expected, the regression found no real difference between the women and men. The coefficients were close to zero and not statistically significant. 

So far, so good.

Then, I added the random noise. For every player, I adjusted his or her rating to vary randomly from talent, with mean zero and an SD of about 52 points.
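
Roughly, the setup looked like this (Python). The pool SDs and player counts here are just illustrative; the 52-point noise SD is the one described above:

```python
import random

def make_player(pool_mean, pool_sd=150, noise_sd=52):
    """Draw a player's talent from a bell curve, then add rating noise."""
    talent = random.gauss(pool_mean, pool_sd)
    rating = talent + random.gauss(0, noise_sd)
    return talent, rating

# The pools as simulated here (the 200-point gap noted in the update);
# the corrected rerun puts the men at 2400 instead.
men = [make_player(2500) for _ in range(2000)]
women = [make_player(2300) for _ in range(500)]
```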

I ran the simulation again. 

The data matched my hypothesis this time: the man/woman games have to be regressed differently than the woman/woman games.

When a man played another man, and the difference between their ratings was less than 50 points ... on average, both men matched their ratings. (In other words, each man was as likely to be too high as too low.)  But when a man played a woman, and the difference between their ratings was less than 50 points ... big difference. In that case, the woman, on average, was 10 points worse than her rating, and the man was 10 points better than his rating. 

In other words, when a man faced a similarly-rated woman, his true advantage was 20 points more than the ratings suggested.

-------

For a direct comparison, I recreated the authors' first regression (without the age factor). Here's what I got, compared to the study:

                         Me   Study
-----------------------------------
Player is female      -.024   -.021
Opponent is female    +.020   +.021
Female/Female         +.002   +.001

Almost exact! One thing that's a little different is that, in real life, the player and opponent effects were equal. In the simulation, they weren't. That's just random: in my case, the females designated "opponent" were a little luckier than the ones designated "player."  In the real study, they happened to be very close to equal. In neither case is the difference statistically significant.

(The only differences between the two categories: the "player" must have a rating of at least 2000, and have had at least one game against each sex. The study didn't require the "opponent" to have either criterion.)

In order to make things easier to read, in subsequent runs of the simulation, I included both sides of every game -- I just switched the player and opponent, and treated it as a second observation.

This equalizes the two coefficients while keeping the expected value of the effects the same. (It does invalidate the standard errors, but I don't care about significance here, so that doesn't matter.)  It also eliminates the need for the "female/female" interaction term.
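
In code, that doubling is just this (the tuple layout here is arbitrary; "result" is the player's points from the game -- 1, 0.5, or 0):

```python
def mirror_games(games):
    """Turn each game into two observations, one from each player's point of view.

    This forces the "player is female" and "opponent is female" coefficients
    to come out symmetric, and makes the female/female interaction unnecessary.
    """
    doubled = []
    for p_rating, p_female, o_rating, o_female, result in games:
        doubled.append((p_rating, p_female, o_rating, o_female, result))
        doubled.append((o_rating, o_female, p_rating, p_female, 1 - result))
    return doubled
```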

Here's that next run of the simulation:

All                      Me    Study
------------------------------------
Player is female      -.020    -.021
Opponent is female    +.020    +.021

Exactly the same! Well, that's not coincidence. By trial and error, I figured out that an SD of 52 points between talent and rating is what made it work, so that's what I used. (I should really have raised it to 53 or 54 to get it to come out exactly, but for some reason I thought the target was .020 instead of .021, and by the time I realized it was .021, the sim was done and I was too lazy to go back and rerun it.)

In any case, I think this proves that (a) there MUST be a "women appear to play worse against men" effect appearing as a consequence of the way the data is structured, and (b) the effect size the authors found is associated with a reasonable ratings accuracy of 54 points or so.

If the authors want to search for an effect in addition to this regression artifact, they have to figure out what the real ratings accuracy is, and adjust for it in their calculations. I'm not sure how easy it would be to do that. 

-------

I could stop here, but I'll try reproducing one more of the study's checks, just in case you're not yet convinced.

In Table 6, the authors showed the results of regressions separating the data by how well-matched the players were. They did this only for the regression with all the extra variables that doubled their coefficients, so their numbers are higher. Here's what they got:

.040 Overall
.057 Players within 50 points
.052 Players within 100 points
.033 Players within 200 points

I did the same, and got

.020 Overall
.031 Players within 50 points
.025 Players within 100 points
.027 Players within 200 points
.013 Players 200+ points apart

The general pattern is similar: a larger effect for closely-matched players, and a lower effect for mismatches. 

The biggest difference is that in my simulation, the dropoff comes after 200 points, whereas in the original study, it seems to come somewhere before 200 points.

Part of the difference might be that I chose completely random matchups between players, but, in real life, the tournaments paired up players with similar rankings. Overall, the women were 116 points lower than the men, but in actual tournaments, they were only 37 points worse.

I'm guessing that if my simulation chose similar matchups to real life, the numbers would come out much closer.

-------

I could try to reproduce more of the study's regression results, but I'll stop here. I think this is enough for a strong basis for the conclusion that what the authors found is just caused by the fact that (a) Elo ratings aren't perfect, and (b) as a group, the women have much lower ratings than the men.

In other words, it's just a statistical artifact.





Thursday, January 19, 2017

Are women chess players intimidated by male opponents?

The "Elo" rating system is a method most famous for ranking chess players, but which has now spread to many other sports and games.

How Elo works is like this: when you start out in competitive chess, the federation assigns you an arbitrary rating -- either a standard starting rating (which I think is 1200), or one based on an estimate of your skill. Your rating then changes as you play.

What I gather from Wikipedia is that "master" starts at a rating of about 2300, and "grandmaster" around 2500. To get from the original 1200 up to the 2300 level, you just start winning games. Every game you play, your rating is adjusted up or down, depending on whether you win, lose, or draw. The amount of the adjustment depends on the difference in skill between you and your opponent. Elo estimates your odds of winning from your rating and your opponent's rating, and the loser "pays" points to the winner. So, the better your opponents, the more points you get for defeating them.

The rating is an estimate of your skill, a "true talent level" for chess. It's calibrated so that every 400-point difference between players is an odds ratio of 10. So, when a 1900-rated player, "Ann", faces a 1500-rated player, "Bob," her odds of winning are 10:1 (.909). That means that if the underdog, Bob, wins, he'll get 10 times as many points as Ann will get if she wins.

How many points, exactly? That's set by the chess federation in an attempt to get the ratings to converge on talent (and on the "400-point rule") as quickly and accurately as possible. The idea is that the less information you have about the players, the more points you adjust by, because the result carries more weight towards your best estimate of talent. 

For players below "expert," the adjustment is 32 times the difference from expectation. For expert players, the multiplier is only 24, and, at the master level and above, it's 16.

If Bob happens to beat Ann, he won 1.00 games when the expectation was that he'd win only 0.09. So, Bob exceeded expectations by 0.91 wins. Multiply by 32, and you get 29 points. That means Bob's rating jumps from 1500 to 1529, while Ann drops from 1900 to 1871.

If Ann had won, she'd claim 3 points from Bob, so she'd be at 1903 and Bob would wind up at 1497.
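
Here's that arithmetic as a small Python sketch, using the standard Elo expectancy formula and the 32-point multiplier from above:

```python
def elo_expected(r_player, r_opponent):
    """Expected score under the 400-point rule (10:1 odds per 400-point gap)."""
    return 1 / (1 + 10 ** ((r_opponent - r_player) / 400))

def elo_update(r_player, r_opponent, score, k=32):
    """New rating after a game: adjust by k times (actual score minus expected score)."""
    return r_player + k * (score - elo_expected(r_player, r_opponent))

# Ann (1900) vs. Bob (1500):
print(elo_expected(1900, 1500))   # about .909 -- Ann is a 10:1 favorite
print(elo_update(1500, 1900, 1))  # Bob wins:  1500 -> about 1529
print(elo_update(1900, 1500, 0))  # Ann loses: 1900 -> about 1871
print(elo_update(1900, 1500, 1))  # if Ann had won instead: 1900 -> about 1903
```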

FiveThirtyEight recently started using Elo for their NFL and NBA ratings. It's also used by my Scrabble app, and the world pinball rankings, and other such things. I haven't looked it up, but I'd be surprised if it weren't used for other games, too, like Backgammon and Go.

-------

For the record, I'm not an expert on Elo, by any means ... I got most of my understanding from Wikipedia, and other internet sources. And, a couple of days ago, Tango posted a link to an excellent article by Adam Dorhauer that explains it very well.

Despite my lack of expertise, it seems to me that these properties of Elo are clearly the case:

1. Elo ratings are only applicable to the particular game that they're calculated from. If you're an 1800 at chess, and I'm a 1600 at Scrabble, we have no idea which one of us would win at either game. 

2. The range of Elo ratings varies between games, depending on the range of talent of the competitors, but also on the amount of luck inherent to the sport. If the best team in the NBA is (say) an 8:1 favorite against the worst team in the league, it must be rated 361 Elo points better. (That's because 10 to the power of (361/400) equals 8.)  But if the best team in MLB is only a 2:1 favorite, it has to be rated only 120 points better.

Elo is an estimate of odds of winning. It doesn't follow, then, that an 1800 rating in one sport is comparable to an 1800 rating in another sport. I'm a better pinball player than I am a Scrabble player, but my Scrabble rating is higher than my pinball rating. That's because underdogs are more likely to win at pinball. I have a chance of beating the best pinball player in the world, in a single game, but I'd have no chance at all against a world-class Scrabble player.

In other words: the more luck inherent in the game, the tighter the range (smaller the standard deviation) of Elo ratings. 

3. Elo ratings are only applicable within the particular group that they're applied to. 

Last March, before the NCAA basketball tournament, FiveThirtyEight had Villanova with an Elo rating of 2045. Right now, they have the NBA's Golden State Warriors with a rating of 1761.

Does that mean that Villanova was actually a better basketball team than Golden State? No, of course not. Villanova's rating is relative to its NCAA competition, and Golden State's rating is relative to its NBA competition.

If you took the ratings at face value, without realizing that, you'd be projecting Villanova as 5:1 favorites over Golden State. In reality, of course, if they faced each other, Villanova would get annihilated.
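
For anyone who wants to check those conversions, here's the 400-points-per-factor-of-10 arithmetic as a quick Python sketch, using the FiveThirtyEight ratings quoted above:

```python
import math

def odds_from_gap(gap):
    """Odds ratio implied by an Elo gap (a factor of 10 for every 400 points)."""
    return 10 ** (gap / 400)

def gap_from_odds(odds):
    """Elo gap implied by an odds ratio."""
    return 400 * math.log10(odds)

print(gap_from_odds(8))            # an 8:1 favorite -> about 361 points
print(gap_from_odds(2))            # a 2:1 favorite -> about 120 points
print(odds_from_gap(2045 - 1761))  # Villanova vs. Golden State at face value -> about 5:1
```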

--------

OK, this brings me to a study I found on the web (hat tip here). It claims that women do worse in chess games that they play against men rather than against women of equal skill. The hypothesis is, women's play suffers because they find men intimidating and threatening. 

(For instance: "Girls just don’t have the brains to play chess," (male) grandmaster Nigel Short said in 2015.)

In an article about the paper, co-author Maria Cubel writes:


"These results are thus compatible with the theory of stereotype threat, which argues that when a group suffers from a negative stereotype, the anxiety experienced trying to avoid that stereotype, or just being aware of it, increases the probability of confirming the stereotype.

"As indicated above, expert chess is a strongly male-stereotyped environment. "... expert women chess players are highly professional. They have reached a high level of mastery and they have selected themselves into a clearly male-dominated field. If we find gender interaction effects in this very selective sample, it seems reasonable to expect larger gender differences in the whole population."

Well, "stereotype threat" might be real, but I would argue that you don't actually have evidence of it in this chess data. I don't think the results actually mean what the authors claim they mean. 

-------

The authors examined a large database of chess results, and selected all players with a rating of at least 2000 (expert level) who played at least one game against an opponent of each of the two sexes.

After their regressions, the authors report,
"These results indicate that players earn, on average and ceteris paribus, about 0.04 fewer points [4 percentage points of win probability] when playing against a man as compared to when their opponent is a woman. Or conversely, men earn 0.04 points more when facing a female opponent than when facing a male opponent. This is a sizable effect, comparable to women playing with a 30 Elo point handicap when facing male opponents."

The authors did control for Elo rating, of course. That was especially important because the women were, on average, less skilled than the men. The average male player in the study was rated at 2410, while the average female was only 2294. That's a huge difference: if the average man played the average woman, the 116-point spread suggests the man would have a .661 winning percentage -- roughly, 2:1 odds in favor of the man.

Also, there were many more same-sex matches in the database than intersex matches. There are two reasons for that. First, many tournaments are organized by ranking; since there are many more men, proportionally, in the higher ranks, they wind up playing each other more often. Second, and probably more important, there are many women-only tournaments and competitions.

-------

UPDATE: my hypothesis, described in the remainder of this post, turns out to be wrong.  Keep that in mind while reading, and then turn to Part II after. Eventually, I'll merge the two posts to avoid confusion.

-------

So, now we see the obvious problem with the study, why it doesn't show what the authors think it shows. 

It's the Villanova/Golden State situation, just better hidden.

The men and women have different levels of ability -- and, for the most part, their ratings are based on play within their own group. 

That means the men's and women's Elo ratings aren't comparable, for exactly the same reason an NCAA Elo rating isn't comparable to an NBA Elo rating. The women's ratings are based more on their performance relative to the [less strong] women, and the men's ratings more on their performance relative to the [stronger] men.

Of course, the bias isn't as severe in the chess case as the basketball case, because the women do play matches against men (while Villanova, of course, never plays against NBA teams). Still, both groups played predominantly within their sex -- the women 61 percent against other women, and the men 87 percent against other men.

So, clearly, there's still substantial bias. The Elo ratings are only perfectly commensurable if the entire pool can be assumed to have faced a roughly equal caliber of competition. A smattering of intersex play isn't enough.

Villanova and Golden State would still have incompatible Elos even if they played, say, one out of every five games against each other. Because, then, for the rest of their games, Villanova would go play teams that are 1500 against NCAA competition, and Golden State would go play teams that are 1500 against NBA competition, and Villanova would have a much easier time of it.

------

Having said that ... if you have enough inter-sex games, the ratings should still work themselves out. 

Because, the way Elo works, points can neither be created nor destroyed.  If women play only women, and men play only men, on average, they'll keep all the ratings points they started with, as a group. But if the men play even occasional games against the women, they'll slowly scoop up ratings points from the women's side to the men's side. All that matters is *how many* of those games are played, not *what proportion*.  The male-male and female-female games don't make a huge difference, no matter how many there are.

The way Elo works, overrated players "leak" points to underrated players. No matter how wrong the ratings are to start, play enough games, and you'll have enough "leaks" for the ratings to all converge on accuracy.

Even if 99% of women's games are against other women, eventually, with enough games played, that 1% can add up to as many points as necessary, transferred from the women to the men, to make things work out.

------

So, do we have enough games, enough "leaks", to get rid of the bias?

Suppose both groups, the men and the women, started out at 1200. But the men were better. They should have been 1250, and the women should have been 1150.  The woman/woman games and man/man games will keep both averages at 1200, so we can ignore those.  But the woman/man games will start "leaking" ratings points to the men's side.

Are there enough woman/man games in the database that the men could unbias the women's ratings by capturing enough of their ratings points?

In the sample, there were 5,695 games by those women experts (rating 2000+) who played at least one man.  Of those games, 61 percent were woman against woman.  That leaves 2,221 games where expert women played (expert or inexpert) men. 

By a similar calculation, there were 2,800 games where expert men played (expert or inexpert) women.  

There's probably lots of overlap in those two sets of games, where an expert man played an expert woman. Let's assume the overlap is 1,500 games, so we'll reduce the total to 3,500.

How much leakage do we get in 3,500 games?  

Suppose the men really are exactly 116 points better in talent than the women, like their ratings indicate -- which would be the case if the leakage did, in fact, take care of all the bias. 

Now, consider what would have happened if there were no leakage. If the sexes played only each other, the women would be overrated by 116 points (since they'd have equal average ratings, but the men would be 116 points more talented).

Now, introduce intersex games. The first time a woman played a man, she'd be the true underdog by 116 points. Her male opponent would have a .661 true win probability, but Elo would treat him as if he had only .500. So, the male group would gain .161 wins in expectation on that game.  At 24 points per win, that's 3.9 points.

After that game, the sum of ratings on the woman's side drops by 3.9 points, so now, the women won't be quite as overrated, and the advantage to the men will drop.  But, to be conservative, let's just keep it at 3.9 points all the way through the set of 3,500 games.  Let's even round it to 4 points.

Four points of leakage, multiplied by 3,500 games, is 14,000 ratings points moving from the women's side to the men's side.

There were about 2,000 male players in the study, and 500 female players. Let's ignore their non-expert opponents, and assume all the leakage came from these 2,500 players.

That means the average female player would have (legitimately) lost 28 points due to leakage (14,000 divided by 500).  The average male player would gain 7 points (14,000 divided by 2000).

So, that much leakage would have cut the male/female ratings bias by 35 points.

But, since we started the process with a known 116 points of bias, we're left with 81 points still remaining! Even with such a large database of games, there aren't enough male/female games to get rid of more than 30 percent of the Elo bias caused by unbalanced opponents.
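
Here's the whole back-of-the-envelope calculation in one place, as a Python sketch; all the inputs are the numbers from the preceding paragraphs:

```python
true_gap = 116                                   # men's true talent edge, in Elo points
p_man_wins = 1 / (1 + 10 ** (-true_gap / 400))   # about .661
leak_estimate = 24 * (p_man_wins - 0.500)        # about 3.9 points per intersex game
leak_per_game = 4                                # rounded, to keep the arithmetic simple

total_leak = leak_per_game * 3500                # 14,000 points move from women to men
women_drop = total_leak / 500                    # 28 points per female player
men_gain = total_leak / 2000                     # 7 points per male player

print(women_drop + men_gain)                     # bias removed: 35 points
print(true_gap - (women_drop + men_gain))        # bias remaining: 81 points
```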

If the true bias should be 81 points, why did the study find only 30?

Because the sample of games in the study isn't a complete set of all games that went into every player's rating.  For one thing, it's just the results of major tournaments, the ones that were significant enough to appear in "The Week in Chess," the publication from which the authors compiled their data.  For another thing, the authors used only 18 months' worth of data, but most of these expert players have been playing chess for years.

If we included all the games that all the players ever played, would that be enough to get rid of the bias?  We can't tell, because we don't know the number of intersex games in the players' full careers.  

We can say hypothetically, though.  If the average expert played three times as many games as logged in this 18-month sample, that still wouldn't be enough -- it would cover only 105 of the 116 points.  Actually, it would be a lot less, because once the ratings start to become accurate, the rate of correction decelerates.  By the time half the bias is covered, the remaining bias corrects at only 2 points per between-sex game, rather than 4.  

Maybe we can do this with a geometric argument.  The data in the sample reduced the bias from 116 to 81, which is 70 percent of the original.  So, a second set of data would reduce the bias to 57 points.  A third set would reduce it to 40 points.  And a fourth set would reduce it to 28 points, which is about what the study found.
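
Here's that geometric argument as a two-line Python sketch, under the assumption that each additional sample of the same size removes the same 30 percent of whatever bias remains:

```python
bias = 116.0
for sample in range(1, 5):
    bias *= 0.70                 # each sample this size removes about 30% of the remaining bias
    print(sample, round(bias))   # 1: 81, 2: 57, 3: 40, 4: 28
```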

So, if every player in this study actually had four times as many man vs. woman games as were in this database, that would *still* not be enough to reduce the bias below what was found in the study.

And, again, that's conservative.  It assumes the same players in all four samples.  In real life, new players come in all the time, and if the new males tend to be better than the new females, that would start the bias all over again.

-------

So, I can't prove, mathematically, that the 30-point discrepancy the authors found is an expected artifact of the way the rating system works.  I can only show why it should be strongly suspected.

It's a combination of the fact that, for whatever reason, the men are stronger players than the women, and, again for whatever reason, there are many fewer male-female games than you need for the kind of balanced schedule that would make the ratings comparable.

And while we can't say for sure that this is the cause, we CAN say -- almost prove -- that this is exactly the kind of bias that happens, mathematically, unless you have enough male-female games to wash it out.  

I think the burden is on the authors of the study to show that there's enough data outside their sample to wash out that inherent bias, before introducing alternative hypotheses. Because, we *know* this specific effect exists, has positive sign, depends on data that's not given in the study, and could plausibly be exactly the size of the observed effect!  

(Assuming I got all this right. As always, I might have screwed up.)

-------

So I think there's a very strong case that what this study found is just a form of this "NCAA vs. NBA" bias. It's an effect that must exist -- it's just the size that we don't know. But intuitive arguments suggest the size is plausibly pretty close to what the study found.

So it's probably not that women perform worse against men of equal talent. It's that women perform worse against men of equal ratings.


UPDATE: In a discussion at Tango's site, commenter TomC convinced me that there is enough "leakage" to unbias the ratings. I found an alternative explanation that I think works -- this time, I verified it with a simulation.  Will post that as soon as I can.


UPDATE: Here it is, Part II.


Peter Backus, Maria Cubel, Matej Guid, Santiago Sánchez-Pagés, Enrique López Mañas: Gender, Competition and Performance: Evidence from Real Tournaments.  



Monday, February 26, 2007

Do women choke in pressure situations?

Of top corporate executives, only one in 40 is a woman. That's perhaps because women have a higher tendency to choke under pressure.

That's according to a study by Israeli researcher M. Daniele Paserman. Paserman analyzed the results of men's and women's tennis tournaments. He found that men scored roughly the same percentage of "unforced errors" regardless of the clutchness of the situation, but women's unforced error rate went up significantly when the chips were down.

(Hat tip to Slate and Steven E. Landsburg, whose summary of the study is here.)

(Tennis scoring is described here. Basically, four points make a game and six games make a set. A match is best of three sets for women, or best of five sets for men. There are other rules, like you have to win by two – see the link for details.)

Generally, points are divided, after the fact, into one of three types. "Winners" are shots that the other player can't return. "Unforced errors" occur when a player "has time to prepare and position himself" but makes a point-ending mistake. And "forced errors" are mistakes that were in part caused by a skilled return from the opponent. These are just descriptions of what happened – they don't affect scoring or ranking or anything.

Men and women have different percentages of the error types. According to the study, the numbers look like this:

Men .... 31% winners, 30% unforced errors, 35% forced errors
Women .. 30% winners, 37% unforced errors, 30% forced errors


The study doesn’t go into detail about why the overall numbers are different, but my guess is that it has to do with the relative strength of the sexes. Men, being stronger, have faster serves and returns than women; faster volleys would lead to opponents having less time to react to a shot, which would increase the incidence of forced errors. The more forced errors, the fewer unforced errors (since the three types sum roughly to 100%).

That may not be the correct explanation, but, in any case, the point is that women's higher percentage of unforced errors doesn't necessarily mean they're chokers, or more careless players in general. (And the study makes no such claim.)

Paserman starts by comparing unforced error rates in two situations – one more important (the last set of the match – 5th for men, 3rd for women), and the other less important (all other sets). After adjusting for a whole bunch of things – player quality, location, etc. -- he finds that men make about 1.4 percentage points more unforced errors in the important situations, but women make 2.9 percentage points more:

Men .......... +1.4 percentage points (1.9 standard deviations)
Women ........ +2.9 percentage points (3.5 standard deviations)
Difference: .. +1.5 percentage points (1.4 standard deviations)


The results moderate slightly when, instead of using a binary "yes/no" variable corresponding to the last set, he uses a sliding scale for how "important" the set is. In that situation, the women's number falls to +2.4 (1.9 standard deviations).

Paserman then switches from set-level data to play-by-play data, which is where it gets more interesting. First, he figures out the "importance" of any given point, in terms of the magnitude in which it affects the outcome of the set. In a tiebreaker, importance would be high, but when one player already leads 5 games to zero, it would be very small. (In baseball, Tangotiger calls this "leverage".)

Then, he divides all the situations into quartiles, and calculates unforced error percentage in each.

Lowest importance ... Men 31% ... Women 34%
Quartile 2 .......... Men 30% ... Women 37%
Quartile 3 .......... Men 31% ... Women 39%
Highest importance .. Men 31% ... Women 40%


As you can see, women make almost 15% more unforced errors in high-pressure situations than in low-pressure situations. And there is no such effect for men.

These are controversial results, and Paserman ran a few checks to see if other factors might account for the effect.

-- he added variables for ability, which set in the match it is, which round of the tournament it is, and others. The effect remained.
-- he checked whether changing the definition of "importance" would change the results. The effect remained.
-- he replaced the "importance" variable with a series of dummy variables representing the actual game score. The effect remained. In fact, that "explains more than 40 percent of the variation in ... unforced errors."

Paserman then hypothesized that the errors might be the result of risk aversion. If women play safer with the match on the line, that will make things easier for opponents, thus reduce the number of winning shots and forced errors. (As Steven Landsburg writes, "If both players just keep lobbing the ball back and forth, there can't be any forced errors, so all errors are recorded as unforced.")

To check for risk aversion, Paserman looked at first serves. (In tennis, a player is given two chances to serve: if the first one fails, the server gets one more opportunity.) If women are risk averse in the clutch, this should lead to a higher percentage of successful first serves.

And it does:

Lowest importance .......... baseline
Quartile 2 .......... Men +10% ... Women +11%
Quartile 3 .......... Men + 7% ... Women +25%
Highest importance .. Men + 7% ... Women +28%


If I've interpreted the results of the study properly, women are a full 28% more likely to hit a legal first serve in clutch situations than in meaningless situations. That can be considered a "conservative" strategy because there's a tradeoff between effectiveness and legalness – a risky shot is more likely to be out, but also more likely to win the point if it's in. And, in this case, the strategy of more legal serves was costly – women servers were 11% less likely to win the clutch points, even with (or perhaps "because of") the greater proportion of legal first serves.

Also, men hit harder first serves in more important situations (+1.5 mph difference between the two extremes of importance), while women hit softer ones (-1.5 mph). In second serves, men's speeds dropped by 1.7 mph, but women's dropped by a much larger 3.5 mph. Again, this is evidence that women are more conservative.

Finally, conservative play is evident in the number of strokes per rally. Men's strokes per point (both players combined) increased by 1.4, and women's by 1.2. But by this measure, it looks like it's the men who are more conservative.

In his conclusions, Paserman does argue that it would be dangerous to extrapolate from poor judgement in tennis results to poor judgement in corporate management. However, he does say that

"... this sample is more representative of the extreme right tail of the talent distribution that is of interest for understanding the large under-representation of women in top corporate jobs ..."


The impression you get is that Paserman does think the results have some value as evidence on the question of women's achievement in other fields.

----------------------

Well, I'm not so sure.

First, here are the percentages for the three types of women's shots for all four quartiles:

Lowest importance ... 29% winners, 34% unforced, 31% forced
Quartile 2 .......... 30% winners, 37% unforced, 31% forced
Quartile 3 .......... 29% winners, 39% unforced, 30% forced
Highest importance .. 30% winners, 40% unforced, 28% forced


Notice that the percentage of winners is constant, and the percentage of forced errors is also pretty steady. It's the percentage of unforced errors that varies a bit. But even then, and as the commenter at the end of Landsburg's article argues, it's mostly the 34% figure that's responsible for the trend. Change the top-middle cell to, say, 38%, and the effect pretty much disappears.

And that's not so farfetched. The first line adds up to only 94%, but the other three lines add up to 98%. Where did the other 4% go? Paserman doesn't tell us. If you add it on to unforced errors, you get to 38%, and there's barely any clutch effect left at all! Is it possible that some unforced errors are being misclassified for the low importance case?

So that's an important unanswered question.

Here's another: is it possible that the differences between men's and women's tennis might explain some of the differences?

For instance, there's the difference in rules -- five sets for men against three sets for women. And there are strategic differences. Just a few days ago, an article in the Globe and Mail talked about " ... the significance of strategy in the women's game -- as opposed to the men's game which is often dictated by power serves."

Could these be part of the explanation? Here's one attempt:

Suppose that on any given shot, you can try a strategy varying between two extremes. You can hit the ball very hard, hoping to catch your opponent off-guard and win the point right there. Alternatively, you can lob the ball over, hoping your opponent makes an unforced error and gives you the point.

Now, consider the "least important" quartile. Those are situations when it's obvious who's going to win the set. Maybe it's 5-0 in games. In that case, it would take a miracle for the other player to come back. Therefore, winning a point in this situation isn't that important. What's more important is wearing the other player out. If one player can make the opponent run around and get tired out, s/he gains an advantage that way even if the results of that set aren't affected.

What's the best way to tire the opponent out while not tiring yourself out? Maybe, instead of lobbing the ball, hit it hard. That way, the unwary opponent has to start and stop, change direction quickly, and perhaps run all over the place. And there's a good chance the power shot will end the point, so maybe you yourself won't have to run any more at all.

Of course, the opponent is playing the same strategy, and will exert less effort to return those hard shots (since it doesn't matter much who wins that point). This leads to shorter rallies (as observed).

Now, suppose the harder you hit the ball, the less control you have. Men hit the ball harder than women, and so more of their hard shots fail. That's an unforced error. And so, in the least important situations, men show up as having (relatively) more unforced errors than women.

Implausible, perhaps. But isn't the idea that women intrinsically have trouble under pressure also implausible?

Or another possibility: Suppose that judges tend to call an error "forced" if it came from a volley above X mph, but "unforced" if it came from a volley below X mph. If X is chosen such that it's easy for a man to achieve, but harder for a woman to achieve, that would lead to more unforced errors called against women overall (which is what the data show). In the lowest quartile, both men and women might try to hit hard shots to wear their opponents out. Even so, men would have roughly the same number of Xs as in important situations, since they hit Xs as a matter of course. But, if women only hit Xs by putting out extra effort, those shots will disproportionately appear in the first quartile, which will move that category's errors from unforced to forced. And again, that's what we see in the data.

Finally, a third theory: when the set is 5-0 and the outcome is all but certain, men have more incentive to slack off than women do. That's because the men have five sets to play, rather than three, and have a greater need to conserve energy. This leads, somehow, to a different pattern of play, one that leads to more "unimportant" unforced errors than for the women. I don't know what that pattern might be, but there is good reason to think it should be different.

Anyway, as unlikely as these hypotheses may seem, they do seem more reasonable than the "successful women choke under pressure" theory. Are there any other plausible theories that I didn't think of?
