Monday, February 06, 2017

Are women chess players intimidated by male opponents? Part III

Over at Tango's blog, commenter TomC found an error in my last post. I had meant to create a sample of male chess players with mean 2400, but I added wrong and created a mean of 2500 instead. (The distribution of females was correct, with mean 2300.)

The effect produced by my original distribution had come close to the one in the paper, but, with the correction, it drops to about half.

The effect is the win probability difference between a woman facing a man, as compared to facing a woman with the same rating:

-0.021 paper
-0.020 my error
-0.010 corrected

It makes sense that the observed effect drops when the distribution of men gets closer to the distribution of women. That's because (by my hypothesis) it's caused by the fact that women and men have to be regressed to different means. The more different the means, the larger the effect. 

Suppose the distribution matches my error, 2300 for the women and 2500 for the men. When a 2400 woman plays a 2400 man, her rating of 2400 needs to be regressed down, towards 2300. But the man's rating needs to be regressed *up*, towards 2500. That means the woman was probably overrated, and the man underrated.

But, when the men's mean is only 2400, the man no longer needs to be regressed at all, because he's right at the mean. So, only the woman needs to be regressed, and the effect is smaller.
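
Here's a rough sketch of that logic in code. It is only an illustration: the `RELIABILITY` figure (the fraction of a player's distance from the group mean that is real talent) is a number I made up for the example, not something from the study or from my simulation, and the function names are hypothetical.

```python
def regress(rating, group_mean, reliability):
    """Shrink an observed rating toward the mean of the player's group."""
    return group_mean + reliability * (rating - group_mean)

# Assumed for illustration only: 85% of the distance from the mean is talent.
RELIABILITY = 0.85

def hidden_gap(men_mean, women_mean=2300, rating=2400):
    """Estimated talent edge for the man when both players carry the same rating."""
    woman_talent = regress(rating, women_mean, RELIABILITY)
    man_talent = regress(rating, men_mean, RELIABILITY)
    return man_talent - woman_talent

gap_error = hidden_gap(men_mean=2500)      # the distribution I used by mistake
gap_corrected = hidden_gap(men_mean=2400)  # the distribution I meant to use
print(gap_error, gap_corrected)
```

Under that assumed reliability, the hidden gap for two players both rated 2400 is about twice as large with the mistaken 2500 mean as with the corrected 2400 mean, which is the same direction and rough size as the drop in the observed effect.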


The effect becomes larger when the players are closer matched in rating. That's when it's most likely that the woman is above average, and the man is below average. The original study found a larger effect in close matches, and so did my (corrected) simulation:

.0130 -- ratings within 50 points
.0144 -- ratings within 100 points
.0129 -- ratings within 200 points
.0057 -- ratings more than 200 points apart

Why is that important?  Because, in my simulation, I chose the participants at random from the distributions. In real life, tournaments try to match players to opponents with similar ratings.

In the study, the men's ratings were higher than the women's, by an average of 116 points. But when a man faced a woman, the average advantage wasn't 116 -- it was much lower. As the study says, on page 18,

"However, when a female player in our sample plays a male opponent, she faces an average disadvantage of 27 Elo points ..."

Twenty-seven points is a very small disadvantage, about one quarter of the 116 points that you'd see if tournament matchups were random. The matching of players makes the effects look larger.

So, I rejigged my simulation to make all matches closer. I chose a random matchup, and then discarded it a certain percentage of the time, varying with the ratings difference. 

If the players had the same rating, I always kept that match. If the difference was more than 400 points, I always discarded that match. In between, I kept it on a sliding scale that randomly decided which matches to keep.

(Technical details: I kept each game with a probability that falls off as the 1.33 power of the ratings difference, scaled so that a difference of 0 is always kept and a difference of 400 or more never is. So 200-point games, which are halfway between 0 and 400, got kept only 1/2.52 of the time (2.52 being 2 to the power of 1.33). 
Why 1.33?  I tried a few other exponents, and that one happened to get the resulting distributions of men and women close to what was in the study. But other ways would have worked too: with the other exponents I tried, the results were very similar.)
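
For concreteness, here's one way to write a keep rule that fits that description. It's a reconstruction, not my actual simulation code: I'm assuming the probability is (1 - difference/400) raised to the 4/3 power, which is the form that reproduces the 1/2.52 figure at 200 points. The function name is hypothetical.

```python
def keep_probability(rating_diff, cutoff=400.0, exponent=4.0 / 3.0):
    """Chance of keeping a random matchup, given the gap between the two ratings.

    Assumed form: 1.0 at a gap of zero, 0.0 at `cutoff` or beyond,
    and a power-curve sliding scale in between.
    """
    closeness = max(0.0, 1.0 - abs(rating_diff) / cutoff)
    return closeness ** exponent

for diff in (0, 100, 200, 300, 400):
    print(diff, round(keep_probability(diff), 3))
```

To use it in a simulation, you'd draw a uniform random number for each candidate matchup and keep the game only when the draw is below `keep_probability(diff)`.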

Now, my results were back close to what the study had come up with, in its Table 2:

.021 study
.019 simulation

To verify that I didn't screw up again, I compared my summary statistics to the study.  They were reasonably close.  All numbers are Elo ratings:

Mean 2410, SD 175: men, study
Mean 2413, SD 148: men, simulation

Mean 2294, SD 141: women, study
Mean 2298, SD 131: women, simulation

Mean 27 points: M vs. W opp. difference, study
Mean 46 points: M vs. W opp. difference, simulation

The biggest difference was in the opponents faced:

Mean 2348: men's opponents, study
Mean 2408: men's opponents, simulation

Mean 2283: women's opponents, study
Mean 2321: women's opponents, simulation

The difference here is that, in real life, the players chosen by the study played opponents worse than themselves, on average. (Part of the reason is that, in the study, only the better players (rating of 2000+) were included as "players", but all their opponents were included, regardless of skill.)  In the simulation, the players and opponents were chosen from the same distributions. 

I don't think that affects the results, but I should definitely mention it.


Another thing I should mention, in defense of the study: last post, I questioned what happens when you include actual ratings in the regression, instead of just win probability (which is based on the ratings). I checked, and that actually *lowers* the observed effect, even if only a little bit. From my simulation:

.0188 not included
.0187 included

And, one more: as I mentioned last post, I chose an SD of 52 points for the difference between a player's rating and his or her actual talent. I have no idea if 52 points is a reasonable estimate; my gut suggests it's too high. Reducing the SD would also reduce the size of the observed effect.

I still suspect that the study's effect is almost entirely caused by this regression-to-the-mean effect.  But, without access to the study's data, I don't know the exact distributions of the matchups, to simulate closer to real life. 

But, as a proof of concept, I think the simulation shows that the effect they found in Table 2 is of the same order of magnitude as what you'd expect for purely statistical reasons. 

So I don't think the study found any evidence at all for their hypothesis of male-to-female intimidation.

P.S.  Thanks again to TomC for finding my error, and for continued discussion at Tango's site.


Thursday, January 26, 2017

Are women chess players intimidated by male opponents? Part II

UPDATE, 1/29/2017: Oops!  Due to an arithmetic error, I had the pools of men and women apart by 200 instead of 116, in my simulation.  Will rerun everything and post again.  

UPDATE, 2/6/2017: New post is here after rerunning stuff.  Read this first if you haven't already.


Do women chess players perform worse against male opponents because they find the men intimidating and threatening? Last post, I talked about a paper that makes that claim. I disagreed with their conclusions, thinking the effect was actually the result of women and men being compared to different averages. It turns out I was wrong -- there are enough inter-sex matches that the ratings would clean themselves up over time.

So, I'll have a second argument -- this time, with a simulation I ran to make sure it actually holds up. But, before I do that, I should talk about the regressions in the paper itself. 


The first regression comes in Table 2. It tries to estimate the number of points the player earns (equivalent to the probability of a player winning, where a tie counts as half a win). To get that, it regresses on:

-- the expected result, based on the Elo formula applied to the two players' ratings;
-- whether the player plays the white or black pieces (in chess, white moves first, and so has an advantage); 
-- the ages of the two competitors; and, of course,
-- the sexes of the two competitors.

I wonder why the authors chose to include age in this regression. Is Elo biased by age? If so, could whatever biased it by age also have biased it by sex?

Furthermore, why would the bias be linear on age, such that the difference between a 42- and 22-year old would be ten times as large as the difference between a 24- and 22-year old? That doesn't seem plausible to me at all.

Anyway, maybe I'm just nitpicky here. This might not actually matter much.

(Well, OK, if you want to get technical -- the effect of playing white can't be linear either, can it? Suppose playing white lifts your chances of winning from .47 to .53, if you're playing someone of equal talent. But if you're a much better player, does it really lift your chances from .90 to .96? Probably not.

In fairness, the authors did run a version of one regression that included the square of the expected win probability -- but they didn't interact it with the white/black variable, or any of the others, I think.)
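
To put a rough number on that, here's a sketch that assumes the white advantage is constant in Elo points rather than in win probability. That's my assumption for the example, not anything the paper says, and it ignores the complication of draws. The helper names are hypothetical.

```python
import math

def to_elo(p):
    """Elo gap implied by an expected score p (400 points = 10:1 odds)."""
    return 400.0 * math.log10(p / (1.0 - p))

def to_prob(elo_gap):
    """Expected score implied by an Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

# Elo-point value of switching from black to white in an even matchup.
white_swing = to_elo(0.53) - to_elo(0.47)

# Apply that same swing to an even matchup and to a .900 favorite.
lift_even = to_prob(to_elo(0.47) + white_swing) - 0.47
lift_favorite = to_prob(to_elo(0.90) + white_swing) - 0.90

print(round(white_swing, 1), round(lift_even, 3), round(lift_favorite, 3))
```

Under that assumption, the lift for the heavy favorite comes out well under the six points you get in the even matchup, which is why a single linear coefficient for color looks suspect to me.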

In any case, the coefficient of interest comes out to .021. That means that when a player faces a woman as opposed to a man -- after controlling for age, ratings differential, and color -- his or her chances of winning are 2.1 percentage points higher. For a .700 favorite moving to .721, that's the equivalent of about 18 Elo points.
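
The conversion from percentage points to Elo points is just the standard Elo curve run backwards. A minimal sketch, with a hypothetical helper name:

```python
import math

def elo_gap(expected_score):
    """Rating gap that gives the stronger player this expected score."""
    return 400.0 * math.log10(expected_score / (1.0 - expected_score))

before = elo_gap(0.700)
after = elo_gap(0.721)
print(round(before, 1), round(after, 1), round(after - before, 1))
```

The answer depends on where you start: the same 2.1 percentage points is worth fewer Elo points near .500 and more for a heavy favorite, so the 18 is specific to the .700 example.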


The second regression (Table 4) still tries to predict winning percentage, but based on a larger set of variables:

-- the sex of each competitor (again);
-- the expected winning percentage, based on the Elo differential (again);
-- the ages of both players (again);
-- the Elo rating of both players;
-- the country (national chess federation) to which the player belongs; and
-- the proportion of other players at the event who are female.

The first thing that strikes me is that the study uses two separate measures of talent -- expected win probability, and Elo ratings. This seems like duplication. I guess, though, that the two are non-linear in different ways, so maybe the second corrects for the errors of the first. But, if neither alone is unbiased across different types of players, who's to say that both together are unbiased? 

Also, the expected winning percentage will be highly correlated with the two Elo ratings ... wouldn't that cause weird effects? We can't tell, because the authors don't show ANY of the coefficient estimates except the male/female one. (See Table 4.)

The authors also include player fixed effects -- in other words, they assume every player has his or her own idiosyncratic arithmetic "bump" in expected winning percentage. This would make sense in other contexts, but seems weird in a situation where every player has a specific rating that's supposed to cover all that. But, Ted Turocy assured me that shouldn't affect the estimates of the other coefficients, so I'll just go with it for now.

In any case, there's so much going on in this regression that I have no idea how to interpret what comes out of it, especially without coefficient estimates.

I don't even have a gut feel. The estimate of the "intimidation" effect *doubles* when these new variables are introduced. How does that happen? Is the formula to produce winning percentage from ratings so badly wrong that the ratings themselves double the effect? Are ratings that biased by country? Are women twice as intimidated by male opponents in the absence of fellow female players?

It doesn't make sense to my gut. So I'm not going to try to figure out this second regression at all. I'll just go with the first one, the one that found an effect of 18 Elo points.


OK, now my argument: I think there's a specific reason for the effect the authors found, one intrinsic to the structure of chess ratings, and having nothing to do with what happens when women compete against men. It has to do with regression to the mean.

Chess (Elo) ratings aren't perfectly accurate. They change after every game, depending on the results. So, a player who has been lucky lately will have a rating higher than his or her talent, and vice-versa. This is the same idea as in baseball, or any other sport. (In MLB, after 162 games, you have to regress a team's record to the mean by about 40 percent to get a true estimate of its talent, so a team that went 100-62 was, in expectation, a 92-70 talent that got lucky.)

Suppose a man faces another male opponent, but has a 50-point advantage in rating. Both ratings have to be regressed to the mean, so the talent gap is smaller than 50 points. Maybe, then, the better player should only be a 45-point favorite, or something.

Now, suppose a man faces a *female* opponent, with the same 50-point advantage. In this case, I would argue, after regressing to the mean, the man actually has MORE than a 50-point advantage.

Why? Because, in general, the women have lower ratings than the men, by 116 points. So when there's only a 50-point gap between them, we're probably looking at a woman who's above average for a woman, and a man who's below average for a man.

That means the woman has to be regressed DOWN towards the women's mean, and the man has to be regressed UP towards the men's mean.

In other words: when a man faces a woman, the true talent gap is probably larger than when a man faces a man, even when the Elo ratings are exactly the same.

Here's an easier example:

Suppose an NBA team beats another NBA team by 20 points. They were probably substantially lucky, right? If they faced each other again, you'd expect the next game to be a lot closer than that.

But, suppose an NBA team beats *an NCAA team* by 20 points. In that case, the NCAA team must have played great, to get to within 20 points of the pros. In that case, you'd expect the next game to be a much bigger blowout than 20 points.


Well, this time, I figured I'd better test out the logic before posting. So I ran a simulation. I created random men and women with a talent distribution similar to what was in the original study. My distributions were bell-shaped -- I don't know what the real-life distributions were.

(UPDATE: As I mentioned, the men were about 200 points more talented than the women, instead of 100.  Oops.  Spoiler alert: divide the effect in half, roughly, for now.)

Then, I ran a random simulation of games, with proportions of MM, MW, WM, and WW similar to those in the study. 

For the first simulation, I assumed that the ratings were perfect, representing the players' talents exactly. As expected, the regression found no real difference between the women and men. The coefficients were close to zero and not statistically significant. 

So far, so good.

Then, I added the random noise. For every player, I randomly adjusted his or her rating to vary from talent, with mean zero and SD of about 52 points.

I ran the simulation again. 

The data matched my hypothesis this time: the man/woman games have to be regressed differently than the woman/woman games.

When a man played another man, and the difference between their ratings was less than 50 points ... on average, both men matched their ratings. (In other words, each man was as likely to be too high as too low.)  But when a man played a woman, and the difference between their ratings was less than 50 points ... big difference. In that case, the woman, on average, was 10 points worse than her rating, and the man was 10 points better than his rating. 

In other words, when a man faced a similarly-rated woman, his true advantage was 20 points more than the ratings suggested.
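
If you want to see the effect for yourself, here's a stripped-down sketch of that check. It is not my original simulation. The 52-point noise SD and the 200-point gap between the groups are what I used in this post, but the 150-point talent SD, the seed, and all the names are assumptions for the example, and it pairs every man with a woman rather than mimicking the study's mix of games.

```python
import random

def simulate(n_pairs=100_000, men_mean=2500.0, women_mean=2300.0,
             talent_sd=150.0, noise_sd=52.0, window=50.0, seed=1):
    """Average (talent - rating) for each sex in closely rated man/woman pairings."""
    rng = random.Random(seed)
    man_total = woman_total = 0.0
    kept = 0
    for _ in range(n_pairs):
        man_talent = rng.gauss(men_mean, talent_sd)
        woman_talent = rng.gauss(women_mean, talent_sd)
        # Rating = talent plus independent noise.
        man_rating = man_talent + rng.gauss(0.0, noise_sd)
        woman_rating = woman_talent + rng.gauss(0.0, noise_sd)
        # Only look at pairings where the two ratings are close.
        if abs(man_rating - woman_rating) < window:
            man_total += man_talent - man_rating
            woman_total += woman_talent - woman_rating
            kept += 1
    return man_total / kept, woman_total / kept, kept

man_err, woman_err, kept = simulate()
print(f"pairs kept: {kept}")
print(f"man talent minus rating:   {man_err:+.1f}")
print(f"woman talent minus rating: {woman_err:+.1f}")
```

With these assumed inputs, the man's average comes out positive (better than his rating) and the woman's negative (worse than her rating). The exact size depends on the talent SD you assume.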


For a direct comparison, I recreated the authors' first regression (without the age factor). Here's what I got, compared to the study:

                         Me   Study
Player is female      -.024   -.021
Opponent is female    +.020   +.021
Female/Female         +.002   +.001

Almost exact! One thing that's a little different is that, in real life, the player and opponent effects were equal. In the simulation, they weren't. That's just random: in my case, the females designated "opponent" were a little luckier than the ones designated "player."  In the real study, they happened to be very close to equal. In neither case is the difference statistically significant.

(The only differences between the two categories: the "player" must have a rating of at least 2000, and have had at least one game against each sex. The study didn't require the "opponent" to have either criterion.)

In order to make things easier to read, in subsequent runs of the simulation, I included both sides of every game -- I just switched the player and opponent, and treated it as a second observation.

This equalizes the two coefficients while keeping the expected value of the effects the same. (It does invalidate the standard errors, but I don't care about significance here, so that doesn't matter.)  It also eliminates the need for the "female/female" interaction term.

Here's that next run of the simulation:

All                      Me    Study
Player is female      -.020    -.021
Opponent is female    +.020    +.021

Exactly the same! Well, that's not coincidence. By trial and error, I figured out that an SD of 52 points between talent and rating is what made it work, so that's what I used. (I should really have raised it to 53 or 54 to get it to come out exactly, but for some reason I thought the target was .020 instead of .021, and by the time I realized it was .021, the sim was done and I was too lazy to go back and rerun it.)

In any case, I think this proves that (a) there MUST be a "women appear to play worse against men" effect appearing as a consequence of the way the data is structured, and (b) the effect size the authors found is associated with a reasonable ratings accuracy of 54 points or so.

If the authors want to search for an effect in addition to this regression artifact, they have to figure out what the real ratings accuracy is, and adjust for it in their calculations. I'm not sure how easy it would be to do that. 


I could stop here, but I'll try reproducing one more of the study's checks, just in case you're not yet convinced.

In Table 6, the authors showed the results of regressions separating the data by how well-matched the players were. They did this only for the regression with all the extra variables that doubled their coefficients, so their numbers are higher. Here's what they got:

.040 Overall
.057 Players within 50 points
.052 Players within 100 points
.033 Players within 200 points

I did the same, and got

.020 Overall
.031 Players within 50 points
.025 Players within 100 points
.027 Players within 200 points
.013 Players 200+ points apart

The general pattern is similar: a larger effect for closely-matched players, and a lower effect for mismatches. 

The biggest difference is that in my simulation, the dropoff comes after 200 points, whereas in the original study, it seems to come somewhere before 200 points.

Part of the difference might be that I chose completely random matchups between players, but, in real life, the tournaments paired up players with similar ratings. Overall, the women were 116 points lower than the men, but in actual tournament matchups against men, they were only 27 points worse.

I'm guessing that if my simulation chose similar matchups to real life, the numbers would come out much closer.


I could try to reproduce more of the study's regression results, but I'll stop here. I think this is enough for a strong basis for the conclusion that what the authors found is just caused by the fact that (a) Elo ratings aren't perfect, and (b) as a group, the women have much lower ratings than the men.

In other words, it's just a statistical artifact.


Thursday, January 19, 2017

Are women chess players intimidated by male opponents?

The "Elo" rating system is a method most famous for ranking chess players, but which has now spread to many other sports and games.

How Elo works is like this: when you start out in competitive chess, the federation assigns you an arbitrary rating -- either a standard starting rating (which I think is 1200), or one based on an estimate of your skill. Your rating then changes as you play.

What I gather from Wikipedia is that "master" starts at a rating of about 2300, and "grandmaster" around 2500. To get from the original 1200 up to the 2300 level, you just start winning games. Every game you play, your rating is adjusted up or down, depending on whether you win, lose, or draw. The amount of the adjustment depends on the difference in skill between you and your opponent. Elo calculates an estimate of the odds of winning, calculated from your rating and your opponent's rating, and the loser "pays" points to the winner. So, the better your opponents, the more points you get for defeating them.

The rating is an estimate of your skill, a "true talent level" for chess. It's calibrated so that every 400-point difference between players is an odds ratio of 10. So, when a 1900-rated player, "Ann", faces a 1500-rated player, "Bob," her odds of winning are 10:1 (.909). That means that if the underdog, Bob, wins, he'll get 10 times as many points as Ann will get if she wins.
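
As I understand it from Wikipedia, the formula behind that is the standard Elo expected-score curve. A minimal sketch (the function name is mine):

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expectation: every 400 points of gap multiplies the odds by 10."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))

ann = expected_score(1900, 1500)
bob = expected_score(1500, 1900)
print(round(ann, 3), round(bob, 3), round(ann / bob, 1))
```

The two expectations always add up to 1, and the ratio between them is the odds.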

How many points, exactly? That's set by the chess federation in an attempt to get the ratings to converge on talent, and the "400-point rule," as quickly and accurately as possible. The idea is that the less information you have about the players, the more points you adjust by, because the result carries more weight towards your best estimate of talent. 

For players below "expert," the adjustment is 32 times the difference from expectation. For expert players, the adjustment is only 24 points per win, and, at the master level and above, it's 16 points per win.

If Bob happens to beat Ann, he won 1.00 games when the expectation was that he'd win only 0.09. So, Bob exceeded expectations by 0.91 wins. Multiply by 32, and you get 29 points. That means Bob's rating jumps from 1500 to 1529, while Ann drops from 1900 to 1871.

If Ann had won, she'd claim 3 points from Bob, so she'd be at 1903 and Bob would wind up at 1497.
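
Here are both of those outcomes as a short sketch. It assumes a decisive game (no draw) and the 32-point multiplier from the example; the function name is hypothetical.

```python
def update_after_win(winner_rating, loser_rating, k=32):
    """Return (new winner rating, new loser rating) after a decisive game."""
    winner_expected = 1.0 / (1.0 + 10.0 ** ((loser_rating - winner_rating) / 400.0))
    # The winner scored 1.0, so the surprise is 1 minus the expectation.
    points = k * (1.0 - winner_expected)
    # The loser pays exactly what the winner collects.
    return winner_rating + points, loser_rating - points

bob_upset, ann_upset = update_after_win(1500, 1900)   # Bob beats Ann
ann_normal, bob_normal = update_after_win(1900, 1500)  # Ann beats Bob
print(round(bob_upset), round(ann_upset))
print(round(ann_normal), round(bob_normal))
```

Either way, the two ratings still add up to 3,400 afterwards: the points only change hands.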

FiveThirtyEight recently started using Elo for their NFL and NBA ratings. It's also used by my Scrabble app, and the world pinball rankings, and other such things. I haven't looked it up, but I'd be surprised if it weren't used for other games, too, like Backgammon and Go.


For the record, I'm not an expert on Elo, by any means ... I got most of my understanding from Wikipedia, and other internet sources. And, a couple of days ago, Tango posted a link to an excellent article by Adam Dorhauer that explains it very well.

Despite my lack of expertise, it seems to me that these properties of Elo are clearly the case:

1. Elo ratings are only applicable to the particular game that they're calculated from. If you're a 1800 at Chess, and I'm a 1600 at Scrabble, we have no idea which one of us would win at either game. 

2. The range of Elo ratings varies between games, depending on the range of talent of the competitors, but also on the amount of luck inherent to the sport. If the best team in the NBA is (say) an 8:1 favorite against the worst team in the league, it must be rated 361 Elo points better. (That's because 10 to the power of (361/400) equals 8.)  But if the best team in MLB is only a 2:1 favorite, it has to be rated only 120 points better.

Elo is an estimate of odds of winning. It doesn't follow, then, that a 1800 rating in one sport is comparable to a 1800 rating in another sport. I'm a better pinball player than I am a Scrabble player, but my Scrabble rating is higher than my pinball rating. That's because underdogs are more likely to win at pinball. I have a chance of beating the best pinball player in the world, in a single game, but I'd have no chance at all against a world-class Scrabble player.

In other words: the more luck inherent in the game, the tighter the range (smaller the standard deviation) of Elo ratings. 

3. Elo ratings are only applicable within the particular group that they're applied to. 

Last March, before the NCAA basketball tournament, FiveThirtyEight had Villanova with an Elo rating of 2045. Right now, they have the NBA's Golden State Warriors with a rating of 1761.

Does that mean that Villanova was actually a better basketball team than Golden State? No, of course not. Villanova's rating is relative to its NCAA competition, and Golden State's rating is relative to its NBA competition.

If you took the ratings at face value, without realizing that, you'd be projecting Villanova as 5:1 favorites over Golden State. In reality, of course, if they faced each other, Villanova would get annihilated.
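
The odds-to-points conversions in properties 2 and 3 are the same formula run in both directions. A quick sketch, with hypothetical function names:

```python
import math

def gap_from_odds(odds):
    """Elo gap implied by odds of `odds`:1 in favor of the stronger side."""
    return 400.0 * math.log10(odds)

def odds_from_gap(gap):
    """Odds (x:1) implied by an Elo gap."""
    return 10.0 ** (gap / 400.0)

print(round(gap_from_odds(8)))               # the 8:1 NBA favorite
print(round(gap_from_odds(2)))               # the 2:1 MLB favorite
print(round(odds_from_gap(2045 - 1761), 1))  # Villanova vs. Golden State, at face value
```

The last line is the nonsense projection you get from mixing ratings across two separate pools.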


OK, this brings me to a study I found on the web (hat tip here). It claims that women do worse in chess games that they play against men rather than against women of equal skill. The hypothesis is, women's play suffers because they find men intimidating and threatening. 

(For instance: "Girls just don’t have the brains to play chess," (male) grandmaster Nigel Short said in 2015.)

In an article about the paper, co-author Maria Cubel writes:

"These results are thus compatible with the theory of stereotype threat, which argues that when a group suffers from a negative stereotype, the anxiety experienced trying to avoid that stereotype, or just being aware of it, increases the probability of confirming the stereotype.

"As indicated above, expert chess is a strongly male-stereotyped environment. "... expert women chess players are highly professional. They have reached a high level of mastery and they have selected themselves into a clearly male-dominated field. If we find gender interaction effects in this very selective sample, it seems reasonable to expect larger gender differences in the whole population."

Well, "stereotype threat" might be real, but I would argue that you don't actually have evidence of it in this chess data. I don't think the results actually mean what the authors claim they mean. 


The authors examined a large database of chess results, and selected all players with a rating of at least 2000 (expert level) who played at least one game against an opponent of each of the two sexes.

After their regressions, the authors report,
"These results indicate that players earn, on average and ceteris paribus, about 0.04 fewer points [4 percentage points of win probability] when playing against a man as compared to when their opponent is a woman. Or conversely, men earn 0.04 points more when facing a female opponent than when facing a male opponent. This is a sizable effect, comparable to women playing with a 30 Elo point handicap when facing male opponents."

The authors did control for Elo rating, of course. That was especially important because the women were, on average, less skilled than the men. The average male player in the study was rated at 2410, while the average female was only 2294. That's a huge difference: if the average man played the average woman, the 116-point spread suggests the man would have a .661 winning percentage -- roughly, 2:1 odds in favor of the man.

Also, there were many more same-sex matches in the database than intersex matches. There are two reasons for that. First, many tournaments are organized by ranking; since there are many more men, proportionally, in the higher ranks, they wind up playing each other more often. Second, and probably more important, there are many women's tournaments and women's-only competitions.


UPDATE: my hypothesis, described in the remainder of this post, turns out to be wrong.  Keep that in mind while reading, and then turn to Part II after. Eventually, I'll merge the two posts to avoid confusion.


So, now we see the obvious problem with the study, why it doesn't show what the authors think it shows. 

It's the Villanova/Golden State situation, just better hidden.

The men and women have different levels of ability -- and, for the most part, their ratings are based on play within their own group. 

That means the men's and women's Elo ratings aren't comparable, for exactly the same reason an NCAA Elo rating isn't comparable to an NBA Elo rating. The women's ratings are based more on their performance relative to the [less strong] women, and the men's ratings more on their performance relative to the [stronger] men.

Of course, the bias isn't as severe in the chess case as the basketball case, because the women do play matches against men (while Villanova, of course, never plays against NBA teams). Still, both groups played predominantly within their sex -- the women 61 percent against other women, and the men 87 percent against other men.

So, clearly, there's still substantial bias. The Elo ratings are only perfectly commensurable if the entire pool can be assumed to have faced a roughly equal caliber of competition. A smattering of intersex play isn't enough.

Villanova and Golden State would still have incompatible Elos even if they played, say, one out of every five games against each other. Because, then, for the rest of their games, Villanova would go play teams that are 1500 against NCAA competition, and Golden State would go play teams that are 1500 against NBA competition, and Villanova would have a much easier time of it.


Having said that ... if you have enough inter-sex games, the ratings should still work themselves out. 

Because, the way Elo works, points can neither be created nor destroyed.  If women play only women, and men play only men, on average, they'll keep all the ratings points they started with, as a group. But if the men play even occasional games against the women, they'll slowly scoop up ratings points from the women's side to the men's side. All that matters is *how many* of those games are played, not *what proportion*.  The male-male and female-female games don't make a huge difference, no matter how many there are.

The way Elo works, overrated players "leak" points to underrated players. No matter how wrong the ratings are to start, play enough games, and you'll have enough "leaks" for the ratings to all converge on accuracy.

Even if 99% of women's games are against other women, eventually, with enough games played, that 1% can add up to as many points as necessary, transferred from the women to the men, to make things work out.


So, do we have enough games, enough "leaks", to get rid of the bias?

Suppose both groups, the men and the women, started out at 1200. But the men were better. They should have been 1250, and the women should have been 1150.  The woman/woman games and man/man games will keep both averages at 1200, so we can ignore those.  But the woman/man games will start "leaking" ratings points to the men's side.

Are there enough woman/man games in the database that the men could unbias the women's ratings by capturing enough of their ratings points?

In the sample, there were 5,695 games by those women experts (rating 2000+) who played at least one man.  Of those games, 61 percent were women against women.  That leaves 2,221 games where expert women played (expert or inexpert) men. 

By a similar calculation, there were 2,800 games where expert men played (expert or inexpert) women.  

There's probably lots of overlap in those two sets of games, where an expert man played an expert woman. Let's assume the overlap is 1,500 games, so we'll reduce the total to 3,500.

How much leakage do we get in 3,500 games?  

Suppose the men really are exactly 116 points better in talent than the women, like their ratings indicate -- which would be the case if the leakage did, in fact, take care of all the bias. 

Now, consider what would have happened if there were no leakage. If each sex played only within its own group, the women would be overrated by 116 points (since the two groups would have equal average ratings, but the men would be 116 points more talented).

Now, introduce intersex games. The first time a woman played a man, she'd be the true underdog by 116 points. Her male opponent would have a .661 true win probability, but treated by Elo as if he only had .500. So, the male group would gain .161 wins in expectation on that game.  At 24 points per win, that's 3.9 points.

After that game, the sum of ratings on the woman's side drops by 3.9 points, so now, the women won't be quite as overrated, and the advantage to the men will drop.  But, to be conservative, let's just keep it at 3.9 points all the way through the set of 3,500 games.  Let's even round it to 4 points.

Four points of leakage, multiplied by 3,500 games, is 14,000 ratings points moving from the women's side to the men's side.

There were about 2,000 male players in the study, and 500 female players. Let's ignore their non-expert opponents, and assume all the leakage came from these 2,500 players.

That means the average female player would have (legitimately) lost 28 points due to leakage (14,000 divided by 500).  The average male player would gain 7 points (14,000 divided by 2000).

So, that much leakage would have cut the male/female ratings bias by 35 points.

But, since we started the process with a known 116 points of bias, we're left with 81 points still remaining! Even with such a large database of games, there aren't enough male/female games to get rid of more than 30 percent of the Elo bias caused by unbalanced opponents.
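The whole back-of-the-envelope calculation fits in a few lines of Python. This is only a sketch of the arithmetic above, using the same rounded inputs; the variable names are illustrative.

```python
leak_per_game = 4    # rounded up from about 3.9
mixed_games = 3500   # estimated man-vs-woman games in the sample
women = 500          # female players in the study
men = 2000           # male players in the study
initial_bias = 116   # assumed starting bias, in rating points

total_leak = leak_per_game * mixed_games  # points moved from women to men
women_drop = total_leak / women           # average points lost per woman
men_gain = total_leak / men               # average points gained per man

bias_removed = women_drop + men_gain
bias_left = initial_bias - bias_removed
share_removed = bias_removed / initial_bias

print(total_leak, women_drop, men_gain, bias_removed, bias_left)
print(f"share of bias removed: {share_removed:.0%}")
```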

If the true bias should be 81 points, why did the study find only 30?

Because the sample of games in the study isn't a complete set of all games that went into every player's rating.  For one thing, it's just the results of major tournaments, the ones that were significant enough to appear in "The Week in Chess," the publication from which the authors compiled their data.  For another thing, the authors used only 18 months' worth of data, but most of these expert players have been playing chess for years.

If we included all the games that all the players ever played, would that be enough to get rid of the bias?  We can't tell, because we don't know the number of intersex games in the players' full careers.  

We can say hypothetically, though.  If the average expert played three times as many games as logged in this 18-month sample, that still wouldn't be enough -- it would only cover 105 of the 116 points.  Actually, it would be a lot less, because once the ratings start to become accurate, the rate of correction decelerates.  By the time half the bias is covered, the remaining bias corrects at only 2 points per between-sex game, rather than 4.  

Maybe we can do this with a geometric argument.  The data in the sample reduced the bias from 116 to 81, which is 70 percent of the original.  So, a second set of data would reduce the bias to 57 points.  A third set would reduce it to 40 points.  And a fourth set would reduce it to 28 points, which is about what the study found.

So, if every player in this study actually had four times as many man vs. woman games as were in this database, that would *still* not be enough to reduce the bias below what was found in the study.
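The geometric argument is easy to tabulate. A rough sketch, assuming each sample-sized batch of games leaves 70 percent of whatever bias remained before it (the rounded figure from the text):

```python
initial_bias = 116.0
retention = 0.70  # each batch of games leaves about 70% of the bias

remaining = []
bias = initial_bias
for batch in range(1, 5):
    bias *= retention
    remaining.append(round(bias))
    print(f"after batch {batch}: about {round(bias)} points of bias left")
```

Under that assumption the remaining bias after one to four batches is 81, 57, 40 and 28 points.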

And, again, that's conservative.  It assumes the same players in all four samples.  In real life, new players come in all the time, and if the new males tend to be better than the new females, that would start the bias all over again.


So, I can't prove, mathematically, that the 30-point discrepancy the authors found is an expected artifact of the way the rating system works.  I can only show why it should be strongly suspected.

It's a combination of the fact that, for whatever reason, the men are stronger players than the women, and, again for whatever reason, there are many fewer male-female games than you need for the kind of balanced schedule that would make the ratings comparable.

And while we can't say for sure that this is the cause, we CAN say -- almost prove -- that this is exactly the kind of bias that happens, mathematically, unless you have enough male-female games to wash it out.  

I think the burden is on the authors of the study to show that there's enough data outside their sample to wash out that inherent bias, before introducing alternative hypotheses. Because, we *know* this specific effect exists, has positive sign, depends on data that's not given in the study, and could plausibly be exactly the size of the observed effect!  

(Assuming I got all this right. As always, I might have screwed up.)


So I think there's a very strong case that what this study found is just a form of this "NCAA vs. NBA" bias. It's an effect that must exist -- it's just the size that we don't know. But intuitive arguments suggest the size is plausibly pretty close to what the study found.

So it's probably not that women perform worse against men of equal talent. It's that women perform worse against men of equal ratings.

UPDATE: In a discussion at Tango's site, commenter TomC convinced me that there is enough "leakage" to unbias the ratings. I found an alternative explanation that I think works -- this time, I verified it with a simulation.  Will post that as soon as I can.

UPDATE: Here it is, Part II.

Peter Backus, Maria Cubel, Matej Guid, Santiago Sánchez-Pagés, Enrique López Mañas: Gender, Competition and Performance: Evidence from Real Tournaments.  


Monday, January 09, 2017

Apportioning team wins among individual players

In one of my favorite posts of 2016, Tango talks about Win Shares, and about forcing individual player totals to add up to the team's actual wins, and what that kind of accounting actually implies.

I was going to add my agreement to what Tango said, but then I got sidetracked about the idea of assigning team wins to individual players, even without the "forcing" aspect. 

I thought, even if the player wins exactly add up to the team wins without forcing them to do that ... well, even then, does the concept make sense?


The idea goes like this: you know the Blue Jays won 89 games in 2016. How many of those 89 wins is each individual Blue Jay responsible for, in the sense that if you add them all up, you get a total of 89?

One problem is that the criterion "responsible for" is too vague -- like the "most valuable" in "most valuable player."  It can mean whatever you want it to mean. But even if you're flexible and accept any reasonable definition, I'm not sure it still necessarily makes sense.

You drive your car 89 miles on three gallons of gas. How many of those 89 miles was the engine responsible for? How many of those 89 miles was the steering wheel responsible for? Or the tires, or the radiator? 

Well, you can't go anywhere without an engine, and you can't go anywhere without tires -- but they can't both get full credit for the 89 miles, can they? 

Well, it's the same thing for baseball. You can't win without a pitcher, and you can't win without a catcher -- you'd forfeit every game either way.

If you say that Troy Tulowitzki was responsible for (say) 4.0 of the team's 89 wins, what does that actually mean? It sounds like it means something like, "the team won 4.0 more games with Tulo at short than if the rules let them leave the shortstop position open."  

But that can't be it. Even if the rules allowed it ... well, with only eight players on the field, and an automatic out every time Tulo's spot came up to bat ... well, that would have cost the Blue Jays a lot more than four additional games, wouldn't it?


So, my initial thought was, assigning team wins to players makes no sense. But, then, I saw what I think is a way to make it work. 

When you say Troy Tulowitzki was responsible for 4.0 wins, you're implying that he's four wins better than nothing. But "nothing," taken literally, defaults every game. What if you say, instead, that he's four wins better than a zero-level player?

Taking a page from "Wins Above Replacement," let's redefine "nothing" to mean, a player from the best possible team that would still win zero games against MLB opponents. Or, maybe, to make it clearer, a team that would go 1-161. (You could probably use 0.1 - 161.9, or something, but I'll stick to 1-161.)

(I'm curious what kind of team that would be in real life. For what it's worth, I think I once ran a simulation of nine Danny Ainges (1981) versus nine Babe Ruths, and the Ainge team did go exactly 1-161.)

If Pythagoras works at those extremes ... the square root of 161 is about 13, so we're talking a team that would be outscored by MLB teams by a factor of 13 or more. I have no idea what that is. A good high school team?
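Here is a small Python sketch of that calculation. It assumes the basic Pythagorean expectation with an exponent of 2 and simply inverts it; the function names are mine, not from any standard library.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float, exponent: float = 2.0) -> float:
    """Expected winning percentage from runs scored and allowed."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)


def implied_run_ratio(wins: float, losses: float, exponent: float = 2.0) -> float:
    """Runs-allowed-to-runs-scored ratio implied by a won-lost record."""
    return (losses / wins) ** (1.0 / exponent)


ratio = implied_run_ratio(1, 161)
print(f"outscored by a factor of about {ratio:.1f}")

# Sanity check: feeding the ratio back in should recover 1 win in 162.
print(f"implied winning percentage: {pythagorean_win_pct(1, ratio):.4f}")
```

Whether the exponent of 2 still holds that far from .500 is exactly the "if" in the paragraph above.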

Anyway ... if you define it that way, then, I think, it works. Win Shares is just Wins Above Replacement, with a team replacement level at a .006 winning percentage instead of the usual .333 or .250 or whatever. 

Maybe you could call it WAZ, for "Wins Above Zero."


But I'm still uneasy, even though it kind of works. I'm uneasy because I still don't buy the concept. I don't accept the idea that you can start with the 89 wins, and break it down by player, and it has to add up, and the job is just to figure out how. 

Because, that's not how it works. If you didn't like the car analogy, try this:

You have three players on your team. Each takes ten free throws. You get the team score by multiplying the individual scores together. If the players get 5, 6, and 8, respectively, the team gets 240.

Of those 240 points, which players are responsible for how many points? If you replaced player A by a guy who can't shoot at all, the team would score zero -- the product of 0, 6, and 8. So, A's "with or without you" contribution is worth 240. But, so are B's and C's! In this non-baseball sport, the sum of the individual players adds up to *triple* the team total.

In this specific case, because the score is straight multiplication, you might be able to make this work by taking the logarithm of everything, and switching to addition. But baseball isn't that easy. It's somewhere between addition and multiplication; the value of your single depends on the chance your teammates will reach base ahead of you and behind you. A home run is still dependent on context, but less so, since you always at least score yourself.
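To make the free-throw game concrete, here is a short Python sketch. It computes each player's "with or without you" value by swapping in a zero shooter, then shows one possible log-based split that does add up; that particular split is my illustration of the logarithm idea, not an established method.

```python
import math

scores = {"A": 5, "B": 6, "C": 8}
team_score = math.prod(scores.values())  # 5 * 6 * 8

# "With or without you": replace one player with someone who scores zero.
wowy = {}
for name in scores:
    without = {**scores, name: 0}
    wowy[name] = team_score - math.prod(without.values())

print(team_score, wowy, sum(wowy.values()))

# On a log scale the product becomes a sum, so shares of log(team score)
# add up to exactly 1.
log_shares = {name: math.log(s) / math.log(team_score) for name, s in scores.items()}
print({name: round(share, 3) for name, share in log_shares.items()})
```

The log shares sum to one, but only because the scoring rule is a pure product; nothing that clean is available when the rule sits somewhere between addition and multiplication.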

As I argued, baseball is "almost" linear, so you can get all this to work, kind of. But the fact that it works, kind of, doesn't mean the question makes sense. It just happens that the roundish peg fits into the squarish slot, because the peg is kind of squarish too, and the slot is kind of roundish.


Even before Win Shares or other win statistics, we used to do team breakdowns all the time, but for runs rather than wins. 

For instance, the 1986 Chicago White Sox scored 644 runs. We've always been willing to split those up by figuring Runs Created. For instance, I'm OK saying that of those 644 runs, Harold Baines was responsible for 87 of them, Daryl Boston another 28, Carlton Fisk 39, and so on. 

So why do I have a problem doing the same for wins?

Well, this is just me, and your gut will differ. But, personally, it's that when you throw pitching into the mix, it makes it obvious that the splitting exercise is contrived. 

With runs, you have an actual, visible pile of runs, that actually scored, and you can see the players involved, and it seems reasonable to divide the spoils.

But what about pitchers? What do you have a pile of to split? Maybe runs prevented, rather than runs created. But what's in the pile? How many runs did the 1986 White Sox prevent? Infinity, is my first reaction.

With Win Shares, Bill James got around this problem by defining a "zero line" to be twice the league average -- the "pile" is the runs between that and the actual number. For 1986, the zero line is 1492, so the White Sox wind up with a pile of 793 prevented runs to split among them. That's fine and reasonable, but it's still arbitrary, and, for me, it shatters the illusion that when you split team wins, you're doing something real. 

Here's another weird analogy.

You earn $52,000 a year, and at the end of the year, after all your expenses, you have $1,040 saved. How do you split the credit for that $1,040 among your 52 paycheques (batters)? Easily: just divide. Each pay is responsible for $20 of that $1,040.

But ... it's not just your deposits that are involved. It's your withdrawals, too. You could easily have spent all your money, and even gone into debt, if not for your spending prevention skills (pitchers). How much of the $1,040 is due to your dollars deposited being high, and how much is due to your dollars withdrawn being low?

To model that, you have to figure out "spending prevented."  Maybe, under zero willpower, you would have spent double what you earned -- you would have borrowed another $52,000 and blown it on crap. Since you actually spent only about $51,000 of that possible $104,000, it turns out your willpower prevented $53,000 in spending.

Your paycheques are responsible for $52,000 deposited, and your willpower for $53,000 not withdrawn. Maybe we'll divide that proportionally. So, of the $20 saved per paycheque, your job skills were worth $9.90, and your thrift skills were worth $10.10.
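For anyone who wants to see the split worked out, here is a minimal Python sketch. It assumes the "zero willpower" baseline is spending double your income, as above, and uses unrounded figures, so the spending prevented comes out to $53,040 rather than a round $53,000. Variable names are illustrative.

```python
income = 52_000
saved = 1_040
paycheques = 52
zero_willpower_spending = 2 * income  # arbitrary "zero line" for spending

actual_spending = income - saved
spending_prevented = zero_willpower_spending - actual_spending

saved_per_cheque = saved / paycheques

# Split each cheque's savings in proportion to dollars deposited
# versus dollars not withdrawn.
job_share = income / (income + spending_prevented)
job_credit = saved_per_cheque * job_share
thrift_credit = saved_per_cheque - job_credit

print(f"job skills: ${job_credit:.2f}, thrift skills: ${thrift_credit:.2f}")
```

Change the zero line to triple your income and the split moves with it, which is the arbitrariness the analogy is getting at.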

Does that sound like a real thing? It doesn't to me.


This is not to say that I don't like Win Shares ... I do, actually. But I like them in the same way I like Bill's other "intuitive" stats, like Approximate Value, and Speed Score, and Trade Value. I like those as rough evaluations, not measurements. In fact, Win Shares is almost like Approximate Value, except that because they're roughly denominated in team wins, I find Win Shares easier to process intuitively. 

It's not the stat that bugs me, or the process. It's just the idea that it's a real, legitimate thing, demanding that team wins be broken up and credited arithmetically to the individual players. Because, I don't think it is. It just comes close in the baseball case.

Maybe I'm just old and cranky.

Anyway, in Tango's post, which I haven't actually talked about yet, there are better reasons to resist the idea of splitting team wins, based on the idea that they don't actually add up, and that when you force them to, you have to do things that you really can't justify. That's a much better argument, and it was what I was going to talk about before I started getting sidetracked with this conceptual stuff.

Next post.


Wednesday, December 07, 2016

Charlie Pavitt: Steroids and the Hall of Fame

This guest post is from occasional contributor Charlie Pavitt. Here's a link to some of Charlie's previous posts.


I am writing today about a much-discussed topic, performance enhancing drugs and Baseball Hall of Fame enshrinement. My goal is not to defend a particular opinion about it, but rather to attempt to lay out five possible positions and some strengths and weaknesses each has. In fact, one reason why I will not defend a particular opinion is that, given these strengths and weaknesses, I am torn among several of the options.

But before I start, a few preliminaries. First, research of which I am aware provides strong evidence that steroid use significantly increases offensive performance, whereas there is little if any evidence that human growth hormone has any impact.

Second, none of this is new. Ancient Greek athletes took then-known stimulants before competitions, and nobody back then batted an eye.

Third, one must be careful throwing rocks when one’s own house could potentially, in a different context, be made of glass. When I was in graduate school, if someone had come to me and whispered, "Hey man, I have this pill you can take every day that will make you read, write, and think more quickly and efficiently," I would have been sorely tempted to partake.  In fact, one of my grad school cohort-mates imagined a situation in which you took a pill that provided you with the information you are supposed to learn from assigned reading, with lighter doses for undergraduate students and heavier doses for us grad students. Mighty tempting fantasy.

Fourth, and this is critical: Before throwing rocks, one needs to defend the claim that there is something wrong with taking performance enhancing drugs.  The fact that it may be illegal is, in my view, irrelevant, as many illegal items are not only harmless but helpful. For example, without getting into the marijuana debate, it is the case that any use of hemp has been illegal in some places, despite its many, many positive applications. And taking something into the body to improve athletic performance is often a good thing. After all, a person can improve athletic performance by eating better, and perhaps taking supplements of necessary vitamins and minerals in moderation. So what’s the difference?  Here’s an argument for that difference: eating better and taking supplements in moderation promotes overall health, whereas taking steroids (and overdosing on vitamins/minerals) does quite the opposite. The early deaths of many professional rasslers (I reserve the word "wrestlers" for the real sport), perhaps some football players (Lyle Alzado?), and two well-known baseball players (more on this later) have been linked with steroid use. 

One could then make the claim that it is the use of a substance that causes bodily harm that warrants rejection from the HOF. After all, the criteria for entry include "Integrity, sportsmanship, and character" along with "record, playing ability," and "contributions to the team(s) on which the player played."  

So, the argument continues, PED use is contrary to the former three criteria.  I think the best angle for this argument is that it sets the wrong example for others, particularly young people, whereas eating well and getting one’s vitamins/minerals sets the right example. Fair enough. But: Lots of HOF players were smokers or used chewing tobacco, and Babe Ruth certainly did not set a good dietary example by reportedly eating multiple hot dogs just before games.  And speaking of setting bad examples, if there is anybody enshrined who does not deserve it for absence of integrity etc., it is Adrian "Cap" Anson, who was proactive in the successful attempt to get Moses Fleetwood Walker, the first African-American major league baseball player, banned for the color of his skin.

So this argument leads to a slippery slope. But let us assume that we accept it.  Here are five possible responses, ordered from most lenient to most strict.

Position Number One: Let everyone in. The argument here is that great performers deserve entry no matter why they performed greatly. Buttressing this position is the seeming fact that until the public response to Jose Canseco’s confession among other events forced action, the powers-that-be in MLB’s establishment knew what was going on and intentionally turned a blind eye to it. After all, fans like offense, particularly home runs, and attendance was swelling, so all seemed right with the world. So if that was what baseball was in those days, goes the argument for this position, one must accept it and its great performers no matter what.  One strength of this argument is that one need not make always-problematic non-performance-related judgments about players, as we must for the other positions discussed in turn below.  The problem with this argument is that it is contrary to the "integrity, sportsmanship, and character" criteria, condones unhealthy behavior, and as such if anything encourages youthful copycats.

Position Number Two: Let everyone in who deserves enshrinement independently of PED use. This implies that one likely accepts Alex Rodriguez, Roger Clemens, and Barry Bonds, because if one mentally subtracts the PED-fed "value-added" part of their performance, they are still HOF material. But one rejects those who would not have reached performance criteria otherwise: Mark McGwire and Rafael Palmeiro among others come to mind.  Perhaps this makes some sense, but one is still condoning bad behavior by allowing in known users while making questionable judgments about whose performance would have been "good enough" without PEDs.

Position Number Three: Ban known users. So Bonds, Clemens, McGwire, Palmeiro, Sammy Sosa, Manny Ramirez, and some others who reached supposed HOF performance levels are out. Also some who approached HOF levels and might otherwise deserve consideration (Miguel Tejada, Jason Giambi) get none. In so doing, we clear the deck of those guilty of poor integrity etc.  Also, it might allow us to consider those whose performance would have reached criteria in another era; think Fred McGriff, who hit as many homers as Lou Gehrig.  But what about those suspected of use? Take Jeff Bagwell as an example. Although there is no clear evidence of his use and he has steadfastly denied it, he did get a lot bigger fairly quickly, hit way more homers than anyone originally expected, and associated closely with the first known user alluded to earlier, Ken Caminiti, whose early death has been partly linked with use. If we lower our performance criteria to allow for McGriff, it also allows for Bagwell.  So now we’ve admitted someone who may have been as guilty as Bonds et al. but whose usage (if any) has not been publicly verified. Setting the in versus out boundary is a pretty questionable judgment call.

Position Number Four: Ban everyone either known or rumored to be a user.  So Bagwell and Gary Sheffield and perhaps Juan Gonzalez if you think he reached HOF performance levels and maybe even David Ortiz are out, and Mike Piazza should not have been recently admitted. Now we know we’ve kept the HOF free of those with poor integrity etc. But, at least in a U.S. court of law, one is considered innocent until proven guilty. Take Jeff Bagwell.  Although he got a lot bigger fairly quickly, hit way more homers than anyone originally expected, and associated with a known user, there is no clear evidence of his use and he has steadfastly denied it. Again, setting the in versus out boundary is a pretty questionable judgment call.

Position Number Five: Not only ban everyone suspected, but kick out anyone currently in who is suspected. Now we are sure everyone enshrined has the proper integrity etc.  Out goes Mike Piazza. Further, and this is the second player I alluded to at the beginning of this essay, out goes Kirby Puckett.  Jose Canseco fingered him, and the physical problems that ended his career, those that ended his life, and his violent post-baseball behavior sure seem to be signals of steroid use. In addition, in a 2002 article, statistician Scott Berry calculated that Puckett’s jump from no home runs his rookie year (1984) to four his sophomore year to 31 his junior year was the most unlikely performance increase in the history of MLB, with odds of one in 100 million, far longer than those for similar jumps made by other known or suspected users. But this is all indirect evidence, and there are other explanations for all of it (maybe Kirby started taking his vitamins or radically changed his swing between the 1985 and 1986 seasons). And if we kick out Kirby, should we kick out Adrian Anson? (Actually, I think we should, but that’s a side issue here.)  How about Ty Cobb for his racism (to be expected, natural attitudes for a Georgian in his time)?  Babe Ruth for eating all those hot dogs? I do not believe I have heard anyone support this position, but I suppose someone could.

So – as I noted at the top, I am frankly torn among several of these options. If I had a vote, my heart would point me toward Position Three, but my head would tell me that it would be hard to rationally defend relative to some of the others (particularly Four). Anyway, I hope that I’ve laid out at least some of the arguments on either side well enough that readers can have an intelligent discussion about it and maybe even add some arguments to my list, and that those who are SURE that their position, whichever it is, is obviously correct think twice about its weaknesses along with its strengths.

-- Charlie Pavitt
