Thursday, January 19, 2017

Are women chess players intimidated by male opponents?

The "Elo" rating system is a method most famous for ranking chess players, but which has now spread to many other sports and games.

How Elo works is like this: when you start out in competitive chess, the federation assigns you an arbitrary rating -- either a standard starting rating (which I think is 1200), or one based on an estimate of your skill. Your rating then changes as you play.

What I gather from Wikipedia is that "master" starts at a rating of about 2300, and "grandmaster" around 2500. To get from the original 1200 up to the 2300 level, you just start winning games. Every game you play, your rating is adjusted up or down, depending on whether you win, lose, or draw. The amount of the adjustment depends on the difference in skill between you and your opponent. Elo estimates the odds of winning from your rating and your opponent's rating, and the loser "pays" points to the winner. So, the better your opponents, the more points you get for defeating them.

The rating is an estimate of your skill, a "true talent level" for chess. It's calibrated so that every 400-point difference between players is an odds ratio of 10. So, when a 1900-rated player, "Ann", faces a 1500-rated player, "Bob," her odds of winning are 10:1 (.909). That means that if the underdog, Bob, wins, he'll get 10 times as many points as Ann will get if she wins.

How many points, exactly? That's set by the chess federation in an attempt to get the ratings to converge on talent, and the "400-point rule," as quickly and accurately as possible. The idea is that the less information you have about the players, the more points you adjust by, because the result carries more weight towards your best estimate of talent. 

For players below "expert," the adjustment is 32 times the difference from expectation. For expert players, the adjustment is only 24 points per win, and, at the master level and above, it's 16 points per win.

If Bob happens to beat Ann, he won 1.00 games when the expectation was that he'd win only 0.09. So, Bob exceeded expectations by 0.91 wins. Multiply by 32, and you get 29 points. That means Bob's rating jumps from 1500 to 1529, while Ann drops from 1900 to 1871.

If Ann had won, she'd claim 3 points from Bob, so she'd be at 1903 and Bob would wind up at 1497.
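
To make that arithmetic concrete, here's a minimal sketch of the update in Python. The function names (`expected_score`, `update`) are my own, and the formula is just the standard logistic curve implied by the 400-point rule. Real federations add wrinkles (provisional ratings, rounding rules, rating floors) that this sketch ignores.

```python
def expected_score(rating, opponent_rating):
    """Expected score (win probability, ignoring draws) under the 400-point rule."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update(rating, opponent_rating, score, k=32):
    """New rating after one game; score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (score - expected_score(rating, opponent_rating))


ann, bob = 1900, 1500

# Ann's chance of beating Bob
print(round(expected_score(ann, bob), 3))

# Bob pulls off the upset: (Bob's new rating, Ann's new rating)
print(round(update(bob, ann, 1)), round(update(ann, bob, 0)))

# Ann wins, as expected: (Ann's new rating, Bob's new rating)
print(round(update(ann, bob, 1)), round(update(bob, ann, 0)))
```

With K=32 on both sides, whatever one player gains is exactly what the other loses.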

FiveThirtyEight recently started using Elo for their NFL and NBA ratings. It's also used by my Scrabble app, and the world pinball rankings, and other such things. I haven't looked it up, but I'd be surprised if it weren't used for other games, too, like Backgammon and Go.


For the record, I'm not an expert on Elo, by any means ... I got most of my understanding from Wikipedia, and other internet sources. And, a couple of days ago, Tango posted a link to an excellent article by Adam Dorhauer that explains it very well.

Despite my lack of expertise, it seems to me that these properties of Elo are clearly the case:

1. Elo ratings are only applicable to the particular game that they're calculated from. If you're an 1800 at Chess, and I'm a 1600 at Scrabble, we have no idea which one of us would win at either game. 

2. The range of Elo ratings varies between games, depending on the range of talent of the competitors, but also on the amount of luck inherent to the sport. If the best team in the NBA is (say) an 8:1 favorite against the worst team in the league, it must be rated 361 Elo points better. (That's because 10 to the power of (361/400) equals 8.)  But if the best team in MLB is only a 2:1 favorite, it has to be rated only 120 points better.

Elo is an estimate of odds of winning. It doesn't follow, then, that an 1800 rating in one sport is comparable to an 1800 rating in another sport. I'm a better pinball player than I am a Scrabble player, but my Scrabble rating is higher than my pinball rating. That's because underdogs are more likely to win at pinball. I have a chance of beating the best pinball player in the world, in a single game, but I'd have no chance at all against a world-class Scrabble player.

In other words: the more luck inherent in the game, the tighter the range (smaller the standard deviation) of Elo ratings. 

3. Elo ratings are only applicable within the particular group that they're applied to. 

Last March, before the NCAA basketball tournament, FiveThirtyEight had Villanova with an Elo rating of 2045. Right now, they have the NBA's Golden State Warriors with a rating of 1761.

Does that mean that Villanova was actually a better basketball team than Golden State? No, of course not. Villanova's rating is relative to its NCAA competition, and Golden State's rating is relative to its NBA competition.

If you took the ratings at face value, without realizing that, you'd be projecting Villanova as 5:1 favorites over Golden State. In reality, of course, if they faced each other, Villanova would get annihilated.
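
The conversions between odds and rating gaps used in points 2 and 3 can be sketched like this. The helper names `rating_gap` and `odds_from_gap` are made up for illustration; the only thing taken from the text is the rule that 400 points means 10:1 odds.

```python
import math


def rating_gap(odds):
    """Elo gap at which the favorite's odds of winning are odds:1."""
    return 400 * math.log10(odds)


def odds_from_gap(gap):
    """Favorite's odds (as odds:1) implied by an Elo gap."""
    return 10 ** (gap / 400)


print(round(rating_gap(8)))  # gap for the hypothetical 8:1 NBA favorite
print(round(rating_gap(2)))  # gap for the hypothetical 2:1 MLB favorite

# Villanova (2045) vs. Golden State (1761), if you took the ratings at face value
print(round(odds_from_gap(2045 - 1761), 1))
```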


OK, this brings me to a study I found on the web (hat tip here). It claims that women do worse in chess games that they play against men rather than against women of equal skill. The hypothesis is, women's play suffers because they find men intimidating and threatening. 

(For instance: "Girls just don’t have the brains to play chess," (male) grandmaster Nigel Short said in 2015.)

In an article about the paper, co-author Maria Cubel writes:

"These results are thus compatible with the theory of stereotype threat, which argues that when a group suffers from a negative stereotype, the anxiety experienced trying to avoid that stereotype, or just being aware of it, increases the probability of confirming the stereotype.

"As indicated above, expert chess is a strongly male-stereotyped environment. "... expert women chess players are highly professional. They have reached a high level of mastery and they have selected themselves into a clearly male-dominated field. If we find gender interaction effects in this very selective sample, it seems reasonable to expect larger gender differences in the whole population."

Well, "stereotype threat" might be real, but I would argue that you don't actually have evidence of it in this chess data. I don't think the results actually mean what the authors claim they mean. 


The authors examined a large database of chess results, and selected all players with a rating of at least 2000 (expert level) who played at least one game against an opponent of each of the two sexes.

After their regressions, the authors report,

"These results indicate that players earn, on average and ceteris paribus, about 0.04 fewer points [4 percentage points of win probability] when playing against a man as compared to when their opponent is a woman. Or conversely, men earn 0.04 points more when facing a female opponent than when facing a male opponent. This is a sizable effect, comparable to women playing with a 30 Elo point handicap when facing male opponents."

The authors did control for Elo rating, of course. That was especially important because the women were, on average, less skilled than the men. The average male player in the study was rated at 2410, while the average female was only 2294. That's a huge difference: if the average man played the average woman, the 116-point spread suggests the man would have a .661 winning percentage -- roughly, 2:1 odds in favor of the man.

Also, there were many more same-sex matches in the database than intersex matches. There are two reasons for that. First, many tournaments are organized by ranking; since there are many more men, proportionally, in the higher ranks, they wind up playing each other more often. Second, and probably more important, there are many women's tournaments and women's-only competitions.


So, now we see the obvious problem with the study, why it doesn't show what the authors think it shows. 

It's the Villanova/Golden State situation, just better hidden.

The men and women have different levels of ability -- and, for the most part, their ratings are based on play within their own group. 

That means the men's and women's Elo ratings aren't comparable, for exactly the same reason an NCAA Elo rating isn't comparable to an NBA Elo rating. The women's ratings are based more on their performance relative to the [less strong] women, and the men's ratings more on their performance relative to the [stronger] men.

Of course, the bias isn't as severe in the chess case as the basketball case, because the women do play matches against men (while Villanova, of course, never plays against NBA teams). Still, both groups played predominantly within their sex -- the women 61 percent against other women, and the men 87 percent against other men.

So, clearly, there's still substantial bias. The Elo ratings are only perfectly commensurable if the entire pool can be assumed to have faced a roughly equal caliber of competition. A smattering of intersex play isn't enough.

Villanova and Golden State would still have incompatible Elos even if they played, say, one out of every five games against each other. Because, then, for the rest of their games, Villanova would go play teams that are 1500 against NCAA competition, and Golden State would go play teams that are 1500 against NBA competition, and Villanova would have a much easier time of it.


Having said that ... if you have enough inter-sex games, the ratings should still work themselves out. 

Because, the way Elo works, points can neither be created nor destroyed.  If women play only women, and men play only men, on average, they'll keep all the ratings points they started with, as a group. But if the men play even occasional games against the women, they'll slowly scoop up ratings points from the women's side to the men's side. All that matters is *how many* of those games are played, not *what proportion*.  The male-male and female-female games don't make a huge difference, no matter how many there are.

The way Elo works, overrated players "leak" points to underrated players. No matter how wrong the ratings are to start, play enough games, and you'll have enough "leaks" for the ratings to all converge on accuracy.

Even if 99% of women's games are against other women, eventually, with enough games played, that 1% can add up to as many points as necessary, transferred from the women to the men, to make things work out.
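
Here's a small sketch of the "points can neither be created nor destroyed" idea. It assumes every player uses the same K (24 here), which is what makes each game exactly zero-sum; with different K values for the two players, that wouldn't hold exactly. The four players and the sequence of results are invented purely for illustration.

```python
def expected_score(rating, opponent_rating):
    """Expected score under the 400-point rule."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def play(ratings, winner, loser, k=24):
    """Move points from loser to winner in place (same K for both: an assumption)."""
    transfer = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += transfer
    ratings[loser] -= transfer


# Two women and two men, everybody starting at 1200 (hypothetical players)
ratings = {"w1": 1200.0, "w2": 1200.0, "m1": 1200.0, "m2": 1200.0}
total_before = sum(ratings.values())

# Two same-sex games, then two games in which a man beats a woman
for winner, loser in [("w1", "w2"), ("m1", "m2"), ("m1", "w1"), ("m2", "w2")]:
    play(ratings, winner, loser)

women_total = ratings["w1"] + ratings["w2"]
men_total = ratings["m1"] + ratings["m2"]

print(total_before, sum(ratings.values()))  # the overall pool is unchanged
print(women_total, men_total)               # but points have moved between groups
```

The same-sex games only shuffle points within a group. Only the man-vs-woman games move points across the line.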


So, do we have enough games, enough "leaks", to get rid of the bias?

Suppose both groups, the men and the women, started out at 1200. But the men were better. They should have been 1250, and the women should have been 1150.  The woman/woman games and man/man games will keep both averages at 1200, so we can ignore those.  But the woman/man games will start "leaking" ratings points to the men's side.

Are there enough woman/man games in the database that the men could unbias the women's ratings by capturing enough of their ratings points?

In the sample, there were 5,695 games by those women experts (rating 2000+) who played at least one man.  Of those games, 61 percent were woman against woman.  That leaves 2,221 games where expert women played (expert or inexpert) men. 

By a similar calculation, there were 2,800 games where expert men played (expert or inexpert) women.  

There's probably lots of overlap in those two sets of games, where an expert man played an expert woman. Let's assume the overlap is 1,500 games, so we'll reduce the total to 3,500.

How much leakage do we get in 3,500 games?  

Suppose the men really are exactly 116 points better in talent than the women, like their ratings indicate -- which would be the case if the leakage did, in fact, take care of all the bias. 

Now, consider what would have happened if there were no leakage. If each sex played only within its own group, the women would be overrated by 116 points (since they'd have equal average ratings, but the men would be 116 points more talented).

Now, introduce intersex games. The first time a woman played a man, she'd be the true underdog by 116 points. Her male opponent would have a .661 true win probability, but treated by Elo as if he only had .500. So, the male group would gain .161 wins in expectation on that game.  At 24 points per win, that's 3.9 points.

After that game, the sum of ratings on the woman's side drops by 3.9 points, so now, the women won't be quite as overrated, and the advantage to the men will drop.  But, to be conservative, let's just keep it at 3.9 points all the way through the set of 3,500 games.  Let's even round it to 4 points.

Four points of leakage, multiplied by 3,500 games, is 14,000 ratings points moving from the women's side to the men's side.

There were about 2,000 male players in the study, and 500 female players. Let's ignore their non-expert opponents, and assume all the leakage came from these 2,500 players.

That means the average female player would have (legitimately) lost 28 points due to leakage (14,000 divided by 500).  The average male player would gain 7 points (14,000 divided by 2000).

So, that much leakage would have cut the male/female ratings bias by 35 points.

But, since we started the process with a known 116 points of bias, we're left with 81 points still remaining! Even with such a large database of games, there aren't enough male/female games to get rid of more than 30 percent of the Elo bias caused by unbalanced opponents.
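
For anyone who wants to check the back-of-the-envelope numbers, here's the whole leakage calculation as a sketch. Every input is a figure from the paragraphs above, including my guessed 3,500 games and the rounding of 3.9 points up to 4. The variable names are my own.

```python
def expected_score(rating, opponent_rating):
    """Expected score under the 400-point rule."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


TRUE_GAP = 116      # assumed true talent gap, in Elo points
K = 24              # points per win at expert level
GAMES = 3500        # guessed number of man-vs-woman games in the sample
MEN, WOMEN = 2000, 500

p_man = expected_score(TRUE_GAP, 0)     # man's true win probability, about .661
leak_per_game = K * (p_man - 0.5)       # Elo thinks it's .500, so about 3.9 points leak
leak_rounded = round(leak_per_game)     # rounded to 4, as in the text

total_leak = leak_rounded * GAMES       # points moved from women to men
women_drop = total_leak / WOMEN         # average points lost per woman
men_gain = total_leak / MEN             # average points gained per man
bias_removed = women_drop + men_gain
bias_left = TRUE_GAP - bias_removed

print(round(p_man, 3), round(leak_per_game, 1))
print(total_leak, women_drop, men_gain)
print(bias_removed, bias_left)
```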

If the true bias should be 81 points, why did the study find only 30?

Because the sample of games in the study isn't a complete set of all games that went into every player's rating.  For one thing, it's just the results of major tournaments, the ones that were significant enough to appear in "The Week in Chess," the publication from which the authors compiled their data.  For another thing, the authors used only 18 months' worth of data, but most of these expert players have been playing chess for years.

If we included all the games that all the players ever played, would that be enough to get rid of the bias?  We can't tell, because we don't know the number of intersex games in the players' full careers.  

We can say hypothetically, though.  If the average expert played three times as many games as logged in this 18-month sample, that still wouldn't be enough -- it would only cover 105 of the 116 points.  Actually, it would be a lot less, because once the ratings start to become accurate, the rate of correction decelerates.  By the time half the bias is covered, the remaining bias corrects at only 2 points per between-sex game, rather than 4.  

Maybe we can do this with a geometric argument.  The data in the sample reduced the bias from 116 to 81, which is 70 percent of the original.  So, a second set of data would reduce the bias to 57 points.  A third set would reduce it to 40 points.  And a fourth set would reduce it to 28 points, which is about what the study found.

So, if every player in this study actually had four times as many man vs. woman games as were in this database, that would *still* not be enough to reduce the bias below what was found in the study.
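
The geometric argument is easy to sketch. This assumes, as above, that each additional sample the size of the study's cuts the remaining bias to 70 percent of what it was (the rounded figure from the text), starting from the 81 points left after the first sample. The function name is hypothetical.

```python
def remaining_bias(extra_samples, after_first_sample=81.0, shrink=0.70):
    """Bias left after the observed sample plus extra_samples more like it."""
    return after_first_sample * shrink ** extra_samples


# Total number of study-sized samples, and the bias (in Elo points) still left
for n in range(4):
    print(n + 1, round(remaining_bias(n)))
```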

And, again, that's conservative.  It assumes the same players in all four samples.  In real life, new players come in all the time, and if the new males tend to be better than the new females, that would start the bias all over again.


So, I can't prove, mathematically, that the 30-point discrepancy the authors found is an expected artifact of the way the rating system works.  I can only show why it should be strongly suspected.

It's a combination of the fact that, for whatever reason, the men are stronger players than the women, and, again for whatever reason, there are many fewer male-female games than you need for the kind of balanced schedule that would make the ratings comparable.

And while we can't say for sure that this is the cause, we CAN say -- almost prove -- that this is exactly the kind of bias that happens, mathematically, unless you have enough male-female games to wash it out.  

I think the burden is on the authors of the study to show that there's enough data outside their sample to wash out that inherent bias, before introducing alternative hypotheses. Because, we *know* this specific effect exists, has positive sign, depends on data that's not given in the study, and could plausibly be exactly the size of the observed effect!  

(Assuming I got all this right. As always, I might have screwed up.)


So I think there's a very strong case that what this study found is just a form of this "NCAA vs. NBA" bias. It's an effect that must exist -- it's just the size that we don't know. But intuitive arguments suggest the size is plausibly pretty close to what the study found.

So it's probably not that women perform worse against men of equal talent. It's that women perform worse against men of equal ratings.

UPDATE: In a discussion at Tango's site, commenter TomC convinced me that there is enough "leakage" to unbias the ratings. I found an alternative explanation that I think works -- this time, I verified it with a simulation.  Will post that as soon as I can.

Peter Backus, Maria Cubel, Matej Guid, Santiago Sánchez-Pagés, Enrique López Mañas: Gender, Competition and Performance: Evidence from Real Tournaments.  


Monday, January 09, 2017

Apportioning team wins among individual players

In one of my favorite posts of 2016, Tango talks about Win Shares, and about how individual player totals are forced to add up to the team's actual wins, and what that kind of accounting actually implies.

I was going to add my agreement to what Tango said, but then I got sidetracked about the idea of assigning team wins to individual players, even without the "forcing" aspect. 

I thought, even if the player wins exactly add up to the team wins without forcing them to do that ... well, even then, does the concept make sense?


The idea goes like this: you know the Blue Jays won 89 games in 2016. How many of those 89 wins is each individual Blue Jay responsible for, in the sense that if you add them all up, you get a total of 89?

One problem is that the criterion "responsible for" is too vague -- like the "most valuable" in "most valuable player."  It can mean whatever you want it to mean. But even if you're flexible and accept any reasonable definition, I'm not sure it still necessarily makes sense.

You drive your car 89 miles on three gallons of gas. How many of those 89 miles was the engine responsible for? How many of those 89 miles was the steering wheel responsible for? Or the tires, or the radiator? 

Well, you can't go anywhere without an engine, and you can't go anywhere without tires -- but they can't both get full credit for the 89 miles, can they? 

Well, it's the same thing for baseball. You can't win without a pitcher, and you can't win without a catcher -- you'd forfeit every game either way.

If you say that Troy Tulowitzki was responsible for (say) 4.0 of the team's 89 wins, what does that actually mean? It sounds like it means something like, "the team won 4.0 more games with Tulo at short than if the rules let them leave the shortstop position open."  

But that can't be it. Even if the rules allowed it ... well, with only eight players on the field, and an automatic out every time Tulo's spot came up to bat ... well, that would have cost the Blue Jays a lot more than four additional games, wouldn't it?


So, my initial thought was, assigning team wins to players makes no sense. But, then, I saw what I think is a way to make it work. 

When you say Troy Tulowitzki was responsible for 4.0 wins, you're implying that he's four wins better than nothing. But "nothing," taken literally, defaults every game. What if you say, instead, that he's four wins better than a zero-level player?

Taking a page from "Wins Above Replacement," let's redefine "nothing" to mean, a player from the best possible team that would still win zero games against MLB opponents. Or, maybe, to make it clearer, a team that would go 1-161. (You could probably use 0.1 - 161.9, or something, but I'll stick to 1-161.)

(I'm curious what kind of team that would be in real life. For what it's worth, I think I once ran a simulation of nine Danny Ainges (1981) versus nine Babe Ruths, and the Ainge team did go exactly 1-161.)

If Pythagoras works at those extremes ... the square root of 161 is about 13, so we're talking a team that would be outscored by MLB teams by a factor of 13 or more. I have no idea what that is. A good high school team?
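
As a rough check on that, here's a sketch using the basic Pythagorean formula with an exponent of 2. Whether the formula holds at such extremes is exactly the "if" in the paragraph above, so take it as illustration only. The function names are my own.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2):
    """Pythagorean winning percentage."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)


def run_ratio_for_record(wins, losses, exponent=2):
    """Factor by which a team must be outscored to project to a wins-losses record."""
    return (losses / wins) ** (1 / exponent)


ratio = run_ratio_for_record(1, 161)
print(round(ratio, 1))                             # runs allowed per run scored
print(round(162 * pythagorean_pct(1, ratio), 2))   # projected wins in 162 games
```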

Anyway ... if you define it that way, then, I think, it works. Win Shares is just Wins Above Replacement, with a team replacement level at an .005 winning percentage instead of the usual .333 or .250 or whatever. 

Maybe you could call it WAZ, for "Wins Above Zero."


But I'm still uneasy, even though it kind of works. I'm uneasy because I still don't buy the concept. I don't accept the idea that you can start with the 89 wins, and break it down by player, and it has to add up, and the job is just to figure out how. 

Because, that's not how it works. If you didn't like the car analogy, try this:

You have three players on your team. Each takes ten free throws. You get the team score by multiplying the individual scores together. If the players get 5, 6, and 8, respectively, the team gets 240.

Of those 240 points, which players are responsible for how many points? If you replaced player A by a guy who can't shoot at all, the team would score zero -- the product of 0, 6, and 8. So, A's "with or without you" contribution is worth 240. But, so is B's and C's! In this non-baseball sport, the sum of the individual players adds up to *triple* the team total.

In this specific case, because the score is straight multiplication, you might be able to make this work by taking the logarithm of everything, and switching to addition. But baseball isn't that easy. It's somewhere between addition and multiplication; the value of your single depends on the chance your teammates will reach base ahead of you and behind you. A home run is still dependent on context, but less so, since you always at least score yourself.
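
Here's the free-throw game as a sketch, for anyone who wants to see the triple-counting. The "with or without you" number replaces one player with a zero-point shooter and leaves the others alone. The log-based split at the end is only one arbitrary way of forcing the credit to add up, not something I'm claiming is the right answer.

```python
import math

scores = {"A": 5, "B": 6, "C": 8}
team = math.prod(scores.values())   # team score is the product of the three

# "With or without you": team score lost if that one player scores zero instead
wowy = {}
for name in scores:
    without = math.prod(0 if other == name else s for other, s in scores.items())
    wowy[name] = team - without

print(team, wowy, sum(wowy.values()))

# Taking logs turns the product into a sum, so the shares now add up to 1
log_shares = {name: math.log(s) / math.log(team) for name, s in scores.items()}
print({name: round(share, 3) for name, share in log_shares.items()})
```

Note that the log shares only add up because this made-up game is pure multiplication. Baseball doesn't give you that luxury.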

As I argued, baseball is "almost" linear, so you can get all this to work, kind of. But the fact that it works, kind of, doesn't mean the question makes sense. It just happens that the roundish peg fits into the squarish slot, because the peg is kind of squarish too, and the slot is kind of roundish.


Even before Win Shares or other win statistics, we used to do team breakdowns all the time, but for runs rather than wins. 

For instance, the 1986 Chicago White Sox scored 644 runs. We've always been willing to split those up by figuring Runs Created. For instance, I'm OK saying that of those 644 runs, Harold Baines was responsible for 87 of them, Daryl Boston another 28, Carlton Fisk 39, and so on. 

So why do I have a problem doing the same for wins?

Well, this is just me, and your gut will differ. But, personally, it's that when you throw pitching into the mix, it makes it obvious that the splitting exercise is contrived. 

With runs, you have an actual, visible pile of runs, that actually scored, and you can see the players involved, and it seems reasonable to divide the spoils.

But what about pitchers? What do you have a pile of to split? Maybe runs prevented, rather than runs created. But what's in the pile? How many runs did the 1986 White Sox prevent? Infinity, is my first reaction.

With Win Shares, Bill James got around this problem by defining a "zero line" to be twice the league average -- the "pile" is the runs between that and the actual number. For 1986, the zero line is 1492, so the White Sox wind up with a pile of 793 prevented runs to split among them. That's fine and reasonable, but it's still arbitrary, and, for me, it shatters the illusion that when you split team wins, you're doing something real. 

Here's another weird analogy.

You earn $52,000 a year, and at the end of the year, after all your expenses, you have $1,040 saved. How do you split the credit for that $1,040 among your 52 paycheques (batters)? Easily: just divide. Each pay is responsible for $20 of that $1,040.

But ... it's not just your deposits that are involved. It's your withdrawals, too. You could easily have spent all your money, and even gone into debt, if not for your spending prevention skills (pitchers). How much of the $1,040 is due to your dollars deposited being high, and how much is due to your dollars withdrawn being low?

To model that, you have to figure out "spending prevented."  Maybe, under zero willpower, you would have spent double what you earned -- you would have borrowed another $52,000 and blown it on crap. So, it turns out, your willpower prevented $53,000 in spending.

Your paycheques are responsible for $52,000 deposited, and your willpower for $53,000 not withdrawn. Maybe we'll divide that proportionally. So, for each paycheque, maybe your job skills were worth $9.90, and your thrift skills were worth $10.10.
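
In case the proportional split isn't obvious, here it is as a sketch, using the round figures above ($52,000 deposited, $53,000 in spending prevented). The variable names are mine.

```python
saved = 1040
paycheques = 52
deposited = 52000
spending_prevented = 53000

per_cheque = saved / paycheques   # savings credited to each paycheque

# Split each paycheque's credit in proportion to dollars deposited vs. dollars not spent
job_share = deposited / (deposited + spending_prevented)
job_credit = per_cheque * job_share
thrift_credit = per_cheque - job_credit

print(per_cheque, round(job_credit, 2), round(thrift_credit, 2))
```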

Does that sound like a real thing? It doesn't to me.


This is not to say that I don't like Win Shares ... I do, actually. But I like them in the same way I like Bill's other "intuitive" stats, like Approximate Value, and Speed Score, and Trade Value. I like those as rough evaluations, not measurements. In fact, Win Shares is almost like Approximate Value, except that because they're roughly denominated in team wins, I find Win Shares easier to process intuitively.

It's not the stat that bugs me, or the process. It's just the idea that it's a real, legitimate thing, demanding that team wins be broken up and credited arithmetically to the individual players. Because, I don't think it is. It just comes close in the baseball case.

Maybe I'm just old and cranky.

Anyway, in Tango's post, which I haven't actually talked about yet, there are better reasons to resist the idea of splitting team wins, based on the idea that they don't actually add up, and that when you force them to, you have to do things that you really can't justify. That's a much better argument, and it was what I was going to talk about before I started getting sidetracked with this conceptual stuff.

Next post.


Wednesday, December 07, 2016

Charlie Pavitt: Steroids and the Hall of Fame

This guest post is from occasional contributor Charlie Pavitt. Here's a link to some of Charlie's previous posts.


I am writing today about a much-discussed topic, performance enhancing drugs and Baseball Hall of Fame enshrinement. My goal is not to defend a particular opinion about it, but rather to attempt to lay out five possible positions and some strengths and weaknesses each has. In fact, one reason why I will not defend a particular opinion is that, given these strengths and weaknesses, I am torn among several of the options.

But before I start, a few preliminaries. First, research of which I am aware provides strong evidence that steroid use significantly increases offensive performance, whereas there is little if any evidence that human growth hormone has any impact.

Second, none of this is new. Ancient Greek athletes took then-known stimulants before competitions, and nobody back then batted an eye.

Third, one must be careful throwing rocks when one’s own house could potentially, in a different context, be made of glass. When I was in graduate school, if someone had come to me and whispered, "Hey man, I have this pill you can take every day that will make you read, write, and think more quickly and efficiently," I would have been sorely tempted to partake.  In fact, one of my grad school cohort-mates imagined a situation in which you took a pill that provided you with the information you are supposed to learn from assigned reading, with lighter doses for undergraduate students and heavier doses for us grad students. Mighty tempting fantasy.

Fourth, and this is critical: Before throwing rocks, one needs to defend the claim that there is something wrong with taking performance enhancing drugs.  The fact that it may be illegal is, in my view, irrelevant, as many illegal items are not only harmless but helpful. For example, without getting into the marijuana debate, it is the case that any use of hemp has been illegal in some places, despite its many many positive applications. And taking something into the body to improve athletic performance is often a good thing. After all, a person can improve athletic performance by eating better, and perhaps taking supplements of necessary vitamins and minerals in moderation. So what’s the difference?  Here’s an argument for that difference: eating better and taking supplements in moderation promotes overall health whereas taking steroids (and overdosing on vitamins/minerals) does quite the opposite. The early deaths of many professional rasslers (I reserve the word "wrestlers" for the real sport), perhaps some football players (Lyle Alzado?), and two well-known baseball players (more on this later) have been linked with steroid use. 

One could then make the claim that it is the use of a substance that causes bodily harm that warrants rejection from the HOF. After all, the criteria for entry include "Integrity, sportsmanship, and character" along with "record, playing ability," and "contributions to the team(s) on which the player played."  

So, the argument continues, PED use is contrary to the former three criteria.  I think the best angle for this argument is that it sets the wrong example for others, particularly young people, whereas eating well and getting one’s vitamins/minerals sets the right example. Fair enough. But: Lots of HOF players were smokers or used chewing tobacco, and Babe Ruth certainly did not set a good dietary example by reportedly eating multiple hot dogs just before games.  And speaking of setting bad examples, if there is anybody enshrined who does not deserve it for absence of integrity etc., it is Adrian "Cap" Anson, who was proactive in the successful attempt to get Moses Fleetwood Walker, the first African-American major league baseball player, banned for the color of his skin.

So this argument leads to a slippery slope. But let us assume that we accept it.  Here are five possible responses, ordered from most lenient to most strict.

Position Number One: Let everyone in. The argument here is that great performers deserve entry no matter why they performed greatly. Buttressing this position is the seeming fact that until the public response to Jose Canseco’s confession among other events forced action, the powers-that-be in MLB’s establishment knew what was going on and intentionally turned a blind eye to it. After all, fans like offense, particularly home runs, and attendance was swelling, so all seemed right with the world. So if that was what baseball was in those days, goes the argument for this position, one must accept it and its great performers no matter what.  ne strength of this argument is that one must not make always-problematic non-performance-related judgments about players, as we will must for the other positions on this issue to be discussed in turn below.  The problem with this argument is that it is contrary to the "integrity, sportsmanship, and character" criteria, condones unhealthy behavior, and as such if anything encourages youthful copy cats.
Position Number Two: Let everyone in who deserves enshrinement independently of PED use. This implies that one likely accepts Alex Rodriguez, Roger Clemens, and Barry Bonds, because if one mentally subtracts the PED-fed "value-added" part of their performance, they are still HOF material. But one rejects those who would not have reached performance criteria otherwise: Mark McGwire and Rafael Palmeiro among others come to mind.  Perhaps this makes some sense, but one is still condoning bad behavior by allowing in known users while making questionable judgments about whose performance would have been "good enough" without PEDs.
Position Number Three: Ban known users. So Bonds, Clemens, McGwire, Palmeiro, Sammy Sosa, Manny Ramirez, and some others who reached supposed HOF performance levels are out. Also some who approached HOF levels and might otherwise deserve consideration (Miguel Tejada, Jason Giambi) get none. In so doing, we clear the deck of those guilty of poor integrity etc.  Also, it might allow us to consider those whose performance would have reached criteria in another era; think Fred McGriff, who hit as many homers as Lou Gehrig.  But what about those suspected of use? Take Jeff Bagwell for an example. Although there is no clear evidence of his use and he has steadfastly denied it, he did get a lot bigger fairly quickly, hit way more homers than anyone originally expected, and associated closely with the first known user alluded to earlier, Ken Caminiti, whose early death has been partly linked with use. If we lower our performance criteria to allow for McGriff, it also allows for Bagwell.  So now we’ve admitted someone who may have been as guilty as Bonds et al. but whose usage (if any) has not been publicly verified. Setting the in versus out boundary is a pretty questionable judgment call.
Position Number Four: Ban everyone either known or rumored to be users.  So Bagwell and Gary Sheffield and perhaps Juan Gonzalez if you think he reached HOF performance levels and maybe even David Ortiz are out, and Mike Piazza should not have been recently admitted. Now we know we’ve kept the HOF free of those with poor integrity etc. But at least in the U.S. court of law one is considered innocent until proven guilty. Take Jeff Bagwell.  Although he got a lot bigger fairly quickly, hit way more homers than anyone originally expected, and associated with a known user, there is no clear evidence of his use and he has steadfastly denied it. Again, setting the in versus out boundary is a pretty questionable judgment call.
Position Number Five: Not only ban everyone suspected, but kick out anyone currently in who is suspected. Now we are sure everyone in baseball had the proper integrity etc.  Out goes Mike Piazza. Further, and this is the second player I alluded to at the beginning of this essay, out goes Kirby Puckett.  Jose Canseco fingered him, plus the physical problems that ended his career along with those that ended his life along with his violent post-baseball behavior sure seem to be signals of steroid use. In addition, in a 2002 article, statistician Scott Berry calculated that Puckett’s jump from no home runs his rookie year (1984) to four his sophomore year to 31 his junior year was the most unlikely performance increase in the history of MLB, with odds of one in 100 million, much greater than similar jumps made by other known or suspected users. But this is all indirect evidence; there are other explanations for all of it (maybe Kirby started taking his vitamins or radically changed his swing between the 1985 and 1986 seasons). And if we kick out Kirby, should we kick out Adrian Anson? (Actually, I think we should, but that’s a side issue here.)  How about Ty Cobb for his racism (to be expected, natural attitudes for a Georgian in his time)?  Babe Ruth for eating all those hot dogs? I do not believe I have heard anyone support this position, but I suppose someone could.
So – as I noted at the top, I am frankly torn among several of these options. If I had a vote, my heart would point me toward Position Three, but my head would tell me that it would be hard to rationally defend relative to some of the others (particularly Four). Anyway, I hope that I’ve laid out at least some of the arguments on either side well enough that readers can have an intelligent discussion about it and maybe even add some arguments to my list, and that those who are SURE that their position, whichever it is, is obviously correct think twice about its weaknesses along with its strengths.

-- Charlie Pavitt


Monday, November 28, 2016

How should we evaluate Detroit's defense behind Verlander?

Privately and publicly, Bill James, Tom Tango, and Joe Posnanski have been arguing about Baseball Reference's version of Wins Above Replacement. Specifically, they're questioning the 2016 WAR totals for Justin Verlander and Rick Porcello:

Verlander +6.6
Porcello  +5.0

Verlander is evaluated to have created 1.6 more wins than Porcello. But their stat lines aren't that much different:

            W-L   IP   K   BB   ERA
Verlander  16-9  227  254  57  3.04
Porcello   22-4  223  189  32  3.15

So why does Verlander finish so far ahead of Porcello?


Baseball Reference credits Verlander with an extra 13 runs, compared to Porcello, to adjust for team fielding. 13 runs corresponds to 1.3 WAR -- roughly, a half-run per nine innings pitched. 

Why so big an adjustment? Because the Red Sox fielders were much better than the Tigers'. Baseball Info Solutions (who evaluate fielding performance from ball trajectory data) had Boston at 108 runs better than Detroit for the season. The 13-run difference between Porcello and Verlander is their share of that difference.

It all seems to make sense, except ... it doesn't. Posnanski, backed by the stats, thinks that even though Detroit's defense was worse than Boston's, the difference didn't affect those two particular pitchers that much. Posnanski argues, plausibly, that even though Detroit's fielders didn't play well over the season as a whole, they DID play well when Verlander was on the mound:

"For one thing, I think it’s quite likely that Detroit played EXCELLENT defense behind Verlander, even if they were shaky behind everyone else. I’m not sure how you can expect a defense to allow less than a .256 batting average on balls in play (the second-lowest of Verlander’s career and second lowest in the American League in 2016) or allow just three runners to reach on error all year (the lowest total of Verlander’s career).

"For another, the biggest difference in the two defenses was in right and centerfield. The Red Sox centerfielder and rightfielder saved 44 runs, because Jackie Bradley and Mookie Betts are awesome. The Tigers centerfield and rightfielder cost 49 runs because Cameron Maybin, J.D. Martinez and a cast of thousands are not awesome.

"But the Tigers outfield certainly didn’t cost Verlander. He allowed 216 fly balls in play, and only 16 were hits. Heck, the .568 average he allowed on line drives was the lowest in the American League. I find it almost impossible to believe that the Boston outfield would have done better than that."


So, that's the debate. Accepting that the Tigers' fielding, overall, was 49 runs worse than average for the season, can we simultaneously accept that the reverse was true on those days when Verlander was pitching? Could the crappy Detroit fielders have turned good -- or at least average -- one day out of every five?

Here's an analogy that says yes.

In 2015, Mark Buehrle and R.A. Dickey had very similar seasons for the Blue Jays. They had comparable workloads and ERAs (3.91 for Dickey, 3.81 for Buehrle). 

But in terms of W-L records ... Buehrle was 15-8, while Dickey went 11-11.

How could Dickey win only 11 games with an ERA below four? One conclusion is that he must have pitched worse when it mattered most. Because, it would be hard to argue that it was run support. In 2015, the Blue Jays were by far the best-hitting team in baseball, scoring 5.5 runs per game. They were farther ahead of the second-place Yankees than the Yankees were above the 26th place Reds. 

Unless, of course, Toronto's powerhouse offense just happened to sputter on those 29 days when Dickey was on the mound. Is that possible?


It turns out that Dickey got only 4.6 runs of support in his starts, almost a full run less than the Jays' 5.5-run average. Buehrle, on the other hand, got 6.9 runs from the offense, a benefit of a full 1.4 runs per game.

Of course, it's not really that the Blue Jays turned into a bad-hitting team, that their skill level actually changed. It's just randomness. Some days, even great-hitting teams have trouble scoring, and, by dumb luck, there happened to be more of those days when Dickey pitched than when Buehrle pitched.

Generally, runs per game has a standard deviation of about 3, so the SD over 29 games is around 0.56. Dickey's bad luck was only around 1.6 SDs from zero, not even statistically significant.
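
The arithmetic behind that estimate is easy to check. Here is a minimal sketch, assuming (as above) an SD of about 3 runs for a single game and treating the 29 starts as independent; the variable names are mine.

```python
import math

sd_per_game = 3.0      # assumed SD of runs scored in a single game
starts = 29            # Dickey's starts in 2015
team_average = 5.5     # Blue Jays' runs per game over the season
dickey_support = 4.6   # runs per game in Dickey's starts

# The SD of an average over n independent games shrinks by sqrt(n)
sd_of_average = sd_per_game / math.sqrt(starts)

# How many SDs below the team average Dickey's run support fell
z = (team_average - dickey_support) / sd_of_average

print(round(sd_of_average, 2), round(z, 1))   # 0.56 1.6
```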

(* Note: As I was writing this post, Posnanski posted a followup using a similar run support analogy.)


Just as we only have season fielding stats for evaluating Verlander's defense, imagine that we only had season batting stats for evaluating Dickey's run support.

In that case, when we evaluated Dickey's record, we'd say, "Dickey looks like an average pitcher, at 11-11. But his team scored a lot more runs than average. If he could only finish with a .500 record with such a good offense, he's worse than his 11-11 record shows. So, we have to adjust him down, maybe to 9-13 or something, to accurately compare him to pitchers on average-hitting teams."

And that wouldn't be right, in light of the further information we have: that the Jays did NOT score that many runs on days that Dickey pitched. 

Well, the same is true for the Verlander/Porcello case, right? It's quite possible that even though the Tigers were a bad defensive team, they happened to play good defense during Verlander's starts, just because the sample size is small enough that that kind of thing can happen. In that light, Posnanski's analysis is crucial -- it's evidence that, yes, the Tigers fielders DID play well (or at least, appear to play well) behind Verlander, even if they didn't play well behind the Tigers' other pitchers.

Because, fielding is subject to variation just like hitting is. Some games, an outfielder makes a great diving catch, and, other days, the same outfielder just misses the catch on an identical ball. More importantly, some days the balls in play are just easier to field than others, and even the BIS data doesn't fully capture that fact, and the fielders look better than they actually played. 

(In fact, I suspect that the errors from misclassifying the difficulty of balls in play are much bigger than the effect of actual randomness in how balls are fielded. But that's not important for this argument.)


What if we don't have evidence, either way, on whether Detroit's fielders were better or worse with Verlander on the mound? In that case, it's OK to use the season numbers, right?

No, I don't think so. If the pitcher had better results than expected, you have to assume that the defense played better as well. Otherwise, you'll consistently overrate the pitchers who performed well on bad-fielding teams, and underrate the pitchers who performed poorly on good-fielding teams.

The argument is pretty simple -- it's the usual "regression to the mean" argument to adjust for luck.

When a pitcher does well, he was probably lucky. Not just lucky in how well he himself played, but in EVERY possible area where he could be lucky -- parks, defense, umpire calls, weather ... everything. (See MGL's comment here.)  If a pitcher pitched well, he was probably lucky in how he pitched, and he was probably lucky in how his team fielded.

You roll ten dice, and wind up with a total of 45. You were lucky to get such a high sum, because the expected total was only 35.

Since the overall total was lucky, each individual roll, taken independently, is more likely to have been lucky than unlucky. Because, obviously, you can't be lucky with the total without being lucky with the numbers that sum to the total. We don't know which of the ten were lucky and which were not, but, for each die, we should retrospectively be willing to bet that it was higher than 3.5.

It would be wrong to say something like: "Overall for each of these dice, the expectation was 3.5. That means the first six tosses probably averaged 21. That means that the last four tosses probably scored 24. Wow! Your last four tosses were 6-6-6-6! They were REALLY lucky!"

It's wrong, of course, because you can't arbitrarily attribute all the luck to the last four tosses. All ten are equally likely to have been lucky ones.
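
If you'd rather see the numbers, here is a small sketch that counts the possibilities exactly. It assumes ten fair six-sided dice; the helper name `ways` is just something I made up for the illustration.

```python
from fractions import Fraction

def ways(n_dice, total):
    """Number of ordered ways n_dice fair six-sided dice can sum to total."""
    counts = {0: 1}
    for _ in range(n_dice):
        nxt = {}
        for s, c in counts.items():
            for face in range(1, 7):
                nxt[s + face] = nxt.get(s + face, 0) + c
        counts = nxt
    return counts.get(total, 0)

n_dice, total = 10, 45
all_ways = ways(n_dice, total)

# P(first die = face | total) = ways the other nine make up the rest / all ways
posterior = {face: Fraction(ways(n_dice - 1, total - face), all_ways)
             for face in range(1, 7)}

expected_face = sum(face * p for face, p in posterior.items())
prob_above_average = sum(p for face, p in posterior.items() if face > 3.5)

print(expected_face)               # 9/2: each die is expected to have shown 4.5
print(float(prob_above_average))   # chance that any one die beat 3.5
```

The same answer holds for every one of the ten dice, which is the point: the luck gets spread evenly, not piled onto the last four.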

And the same is true for Verlander. His excellent ERA is the sum of pitching and fielding. You can't arbitrarily assume all the good luck came in the rolls of his pitching dice, and he had exactly zero luck in the rolls of his team's fielding dice.


If that isn't obvious, try this. 

The WAR method works like this: it's taking a single game Verlander started, assigning the results to Verlander, and adjusting for the average of what the fielders did in ALL the games they played, not just this one.

Imagine that we reverse it: we take a single game Verlander started, assign the results to the FIELDERS, and adjust for the average of what Verlander did in ALL the games he pitched, not just this one.

One day, Verlander and the Tigers give up 7 runs, and the argument goes something like this:

"Wow! Normally, the Tigers fielders give up only 5 runs, so today they were -2. But wait!  Justin Verlander was on the mound, and he's a great pitcher, and saves an average of two runs a game! If they gave up 7 runs despite Verlander's stellar pitching, the fielders must have been exceptionally bad, and we need to give them a -4 instead of a -2!"

Verlander's stats aren't just a measure of Verlander's own performance. As Tango would say, they're a measure of *what happened when Verlander was on the mound*. That encompasses Verlander's pitching AND his teammates' fielding. 

So, if the results with Verlander on the mound are better than expected, chances are that BOTH of those things were better than expected. 


I should probably leave it there, but if you're still not convinced, here's an explicit model.

There's a wheel of fortune with ten slots. You spin the wheel to represent a ball in play. Normally, slots 1, 2, and 3 represent a hit, and 4 through 10 represent an out. But because the Tigers fielders are so bad, number 4 is changed to a hit instead of an out.

In the long term, you expect that the Tigers' defense, compared to average, will cost Verlander one hit for every 10 balls in play. 

But: your expectation of how many hits it actually cost depends on the specific pitcher's results.

(1) Suppose Verlander's results were better than expected. Out of 10 balls in play, he gets 8 outs. How many hits did the defense cost him?

Eight of Verlander's spins must have landed somewhere in slots 5 through 10. Out of those spins, the defense didn't cost him anything, since the defense is only at fault when the wheel stops at slot 4. 

For hits, we expect that one in four came from slot 4. For the two spins that wound up a hit, that works out to half a hit.

So, with the Tigers having given up few hits, we estimate his defense cost Verlander only 0.5 hits, not 1.0 hits.

(2) Suppose Verlander's results were below average -- he gave up 6 hits. Slot 4 hits, which are the fielders' fault, are a quarter of the 6 hits allowed. So, the defense here cost him 1.5 hits, not 1.0 hits.

(3) Suppose Verlander's results were exactly as predicted -- he gave up four hits. On average, one out of those four hits is from slot 4. So, in this case, yes, the defense would have cost him one hit per ten balls in play, exactly the average rate for the team. 

Which means, the better Verlander's stat line, the more likely the fielders played better than their overall average.
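
The three cases above can be checked exactly. This is a sketch of the wheel model only, assuming ten independent spins with slots 1-3 as ordinary hits, slot 4 as the "bad defense" hit, and slots 5-10 as outs; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

P_NORMAL_HIT = Fraction(3, 10)   # slots 1-3: a hit behind any defense
P_DEFENSE_HIT = Fraction(1, 10)  # slot 4: a hit only because of the bad defense
P_OUT = Fraction(6, 10)          # slots 5-10: an out

def expected_defense_hits(total_hits, balls_in_play=10):
    """Expected number of slot-4 hits, given the total hits allowed."""
    outs = balls_in_play - total_hits
    weighted, norm = Fraction(0), Fraction(0)
    for d in range(total_hits + 1):   # d = hits that came from slot 4
        n = total_hits - d            # n = hits that came from slots 1-3
        # multinomial probability of exactly (n, d, outs)
        p = (comb(balls_in_play, total_hits) * comb(total_hits, d)
             * P_NORMAL_HIT ** n * P_DEFENSE_HIT ** d * P_OUT ** outs)
        weighted += d * p
        norm += p
    return weighted / norm

for hits in (2, 4, 6):
    print(hits, expected_defense_hits(hits))
# 2 1/2
# 4 1
# 6 3/2
```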


Monday, October 24, 2016

Why the 2016 AL was harder to predict than the 2016 NL

In 2016, team forecasts for the National League turned out more accurate than they had any right to be, with FiveThirtyEight's predictions coming in with a standard error (SD) of only 4.5 wins. The forecasts for the American League, however, weren't nearly as accurate ... FiveThirtyEight came in at 8.9, and Bovada at 8.8. 

That isn't all that great. You could have hit 11.1 just by predicting each team to duplicate their 2015 record. And, 11 wins is about what you'd get most years if you just forecasted every team at 81-81.

Which is kind of what the forecasters did! Well, not every team at 81-81 exactly, but every team *close* to 81-81. If you look at FiveThirtyEight's actual predictions, you'll see that they had a standard deviation of only 3.4 wins. No team was predicted to win or lose more than 87 games.

Generally, team talent has an SD of around 9 wins. If you were a perfect evaluator of talent, your forecasts would also have an SD of 9. If, however, you acknowledge that there are things that you don't know (and many that can't be known, like injuries and suspensions), you'll forecast with an SD somewhat less than 9 -- maybe 6 or 7.

But, 3.4? That seems way too narrow. 

Why so narrow? I think it was because, last year, the AL standings were themselves exceptionally narrow. In 2015, no American League team won or lost more than 95 games. Only three teams were at 89 or more. 

The SD of team wins in the 2015 AL was 7.2. That's much lower than the usual figure of around 11, and it's the lowest for either league since 1961. In fact, I checked, and it's the lowest for any league in baseball history! (Second narrowest: the 1974 American League, at 7.3.)

Why were the standings so compressed? There are three possibilities:

1. The talent was compressed;

2. There was less luck than normal;

3. The bad teams had good luck and the good teams had bad luck, moving both sets closer to .500.

I don't think it was #1. In 2016, the SD of standings wins was back near normal, at 10.2. The year before, 2014, it was 9.6. It doesn't really make sense that team talent regressed so far to the mean between 2014 and 2015, and then suddenly jumped back to normal in 2016. (I could be wrong -- if you can find trades and signings those years that showed good teams got significantly worse in 2015 and then significantly better in 2016, that would change my mind.)

And I don't think it was #2, based on Pythagorean luck. The SD of the discrepancy in "first-order wins" was 4.3, which is larger than the usual 4.0. 

So, that leaves #3 -- and I think that's what it was. In the 2015 AL, the correlation between first-order-wins and Pythagorean luck was -0.54 instead of the expected 0.00. So, yes, the good teams had bad luck and the bad teams had good luck. (The NL figure was -0.16.)


When that happens, when luck compresses the standings, it definitely makes forecasting harder. Because, there's not as much information on how teams differ. To see that, consider the extreme case. If, by some weird fluke, every team wound up 81-81, how would you know which teams were talented but unlucky, and which were less skilled but lucky? You wouldn't, and so you wouldn't know what to expect next season.

Of course, that's only a problem if there *is* a wide spread of talent, one that got overcompressed by luck. If the spread of talent actually *is* narrow, then forecasting works OK. 

That's what many forecasting methods assume, that if the standings are narrow, the talent must be narrow. If you do the usual "just take the standings and regress to the mean" operation, you'll wind up implicitly assuming that the spread of talent shrank at the same time as the spread in the standings shrank.

Which is fine, if that's what you think happened ... but, do you really think that's plausible? The AL talent distribution was pretty close to average in 2014. It makes more sense to me to guess that the difference between 2014 and 2015 was luck, not wholesale changes in personnel that made the bad teams better and the good teams worse.

Of course, I have the benefit of hindsight, knowing that the AL standings returned to near-normal in 2016 (with an SD of 10.2). But it's happened before -- the previous record-low 7.3 figure for the 1974 AL jumped back to an above-average 11.9 in 1975.

I'd think that if I were forecasting the 2016 standings, I'd want to make an effort to figure out which teams were lucky and which ones weren't, in order to be able to forecast a more realistic talent SD than 3.4 wins.

Besides, you have more than the raw standings. If you adjust for Pythagoras, the SD jumps from 7.2 to 8.6. And, according to Baseball Prospectus, when you additionally adjust for cluster luck, the SD rises to 9.4. (As I wrote in the P.S. to the last post, I'm not confident in that number, but never mind for now.)

An SD of 9.4 is still smaller than 11, but it should be workable.

Anyway, my gut says that you should be able to differentiate the good teams from the bad with a spread higher than 3.4 games ... but I could be wrong. Especially since Bovada's spread was even smaller, at 3.3.


It's a bad idea to second-guess the bookies, but let's proceed anyway.

Suppose you thought that the standings compression of 2015 was a luck anomaly, and the distribution of talent for 2016 should still be as wide as ever. So, you took FiveThirtyEight's projections, and you expanded them, by regressing them away from the mean, by a factor of 1.5. Since FiveThirtyEight predicted the Red Sox at four games above .500 (85-77), you bump that up to six games (87-75).

If you did that, the SD of your actual predictions is now a more reasonable 5.1. And those predictions, it turns out, would have been better. The accuracy of your new predictions would have been an SD of 8.4. You would have beat FiveThirtyEight and Bovada.
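
The "expanding" step is just a stretch around 81 wins. A minimal sketch follows; the function name is mine, and the list of projections is made up purely to show that the SD scales by the same factor (it is NOT FiveThirtyEight's actual list).

```python
import statistics

def expand_forecast(wins, factor=1.5, mean=81.0):
    """Stretch a projected win total away from the league mean."""
    return mean + factor * (wins - mean)

print(expand_forecast(85))   # Red Sox example: 85 wins becomes 87.0

# Hypothetical projections, for illustration only
projections = [85, 84, 83, 82, 81, 80, 79, 78, 77]
expanded = [expand_forecast(w) for w in projections]

# The SD of the expanded forecasts is 1.5 times the original SD,
# which is how a 3.4 spread turns into roughly 5.1
print(statistics.pstdev(expanded) / statistics.pstdev(projections))
```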

If that's too complicated, try this. If you had tried to take advantage of Bovada's compressed projections by betting the "over" on their top seven teams, and the "under" on their bottom seven teams, you would have gone 9-5 on those bets.

Now, I'm not going so far as to say this is a workable strategy ... bookmakers are very, very good at what they do. Maybe that strategy just turned out to be lucky. But it's something I noticed, and something to think about.


If compressed standings make predicting more difficult, then a larger spread in the standings should make it easier.

Remember how the 2016 NL predictions were much more accurate than expected, with an SD of 4.5 (FiveThirtyEight) and 5.5 (Bovada)? As it turns out, last year, the SD of the 2015 NL standings was higher than normal, at 12.65 wins. That's the highest of the past three years:

2014  AL= 9.59, NL= 9.20
2015  AL= 6.98, NL=12.65
2016  AL=10.15, NL=10.71

It's not historically high, though. I looked at 1961 to 2011 ... if the 2015 NL were included, it would be well above average, but only 70th percentile.*

(* If you care: of the 10 most extreme of the 102 league-seasons in that timespan, most were expansion years, or years following expansion. But the 2001, 2002, and 2003 AL made the list, with SDs of 15.9, 17.1, and 15.8, respectively. The 1962 National League was the most extreme, at 20.1, and the 2002 AL was second.)

A high SD won't necessarily make your predictions beat the speed of light, and a low SD won't necessarily make them awful. But both contribute. As an analogy: just because you're at home doesn't mean you're going to pitch a no-hitter. But if you *do* pitch a no-hitter, odds are, you had the help of home-field advantage.

So, given how accurate the 2016 NL forecasts were, I'm not surprised that the SD of the 2015 NL standings was higher than normal.


Can we quantify how much compressed standings hurt next year's forecasts? I was curious, so I ran a little simulation. 

First, I gave every team a random 2015 talent, so that the SD of team talent came out between 8.9 and 9.1 games. Then, I ran a simulated 2015 season. (I ran each team with 162 independent games, instead of having them play each other, so the results aren't perfect.)

Then, I regressed each team's 2015 record to the mean, to get an estimate of their talent. I assumed that I "knew" that the SD of talent was around 9, so I "unregressed" each regressed estimate away from the mean by the exact amount that gets the SD of talent to exactly 9.00. That became the official forecast for 2016. 

Finally, I ran a simulation of 2016 (with team talent being the same as 2015). I compared the actual to the forecast, and calculated the SD of the forecast errors.
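
For anyone who wants to try something similar, here is a rough sketch of the steps just described. It is not the code I actually ran. It assumes a 15-team league, normally distributed talent, and root-mean-square error as the accuracy measure, and it collapses the "regress, then unregress" step into a single rescaling so the forecasts have an SD of exactly 9. All the names are made up for the illustration.

```python
import math
import random
import statistics

GAMES, TEAMS, TALENT_SD = 162, 15, 9.0   # TEAMS = 15 is an assumption

def draw_talent(rng):
    """Team talents (expected wins) whose SD lands between 8.9 and 9.1."""
    while True:
        talent = [rng.gauss(81.0, TALENT_SD) for _ in range(TEAMS)]
        if 8.9 <= statistics.pstdev(talent) <= 9.1:
            return talent

def play_season(talent, rng):
    """Each team plays 162 independent games at its own win probability."""
    return [sum(rng.random() < t / GAMES for _ in range(GAMES)) for t in talent]

def forecast_from(standings):
    """Rescale deviations from the mean so the forecasts have an SD of exactly 9."""
    scale = TALENT_SD / statistics.pstdev(standings)
    mean = statistics.fmean(standings)
    return [81.0 + scale * (w - mean) for w in standings]

def one_trial(rng):
    """Return (SD of the 2015 standings, RMS error of the 2016 forecast)."""
    talent = draw_talent(rng)
    standings_2015 = play_season(talent, rng)
    forecast = forecast_from(standings_2015)
    actual_2016 = play_season(talent, rng)   # same talent as 2015
    rms_error = math.sqrt(statistics.fmean(
        (a - f) ** 2 for a, f in zip(actual_2016, forecast)))
    return statistics.pstdev(standings_2015), rms_error

rng = random.Random(2016)
results = [one_trial(rng) for _ in range(200)]
average_error = statistics.fmean(err for _, err in results)
print(round(average_error, 2))
```

Grouping `results` by the first element of each pair (the SD of the simulated 2015 standings) is how you'd get at the relationship between spread and accuracy.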

The results came out, I think, very reasonable.

Over 4,000 simulated seasons, the average accuracy was an SD of 7.9. But, the higher the SD of last year's standings, the better the accuracy:

SD Standings    SD next year's forecast
7.0             8.48 (2015 AL)
8.0             8.31
9.0             8.14
10.0            7.98
11.0            7.81
12.0            7.64
12.6            7.54 (2015 NL)
13.0            7.47
14.0            7.31
20.1            6.29 (1962 NL)

So, by this reckoning, you'd expect the 2016 NL predictions to have been one win more accurate than the AL predictions. 
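
The one-win figure can be recovered from the table itself. Here is a quick sketch that fits a straight line through the ten rows above by ordinary least squares and reads off the gap between the 2015 AL (7.0) and the 2015 NL (12.6). The fit is only an illustration of the table, and the names are mine.

```python
# (SD of last year's standings, SD of next year's forecast error), from the table
table = [(7.0, 8.48), (8.0, 8.31), (9.0, 8.14), (10.0, 7.98), (11.0, 7.81),
         (12.0, 7.64), (12.6, 7.54), (13.0, 7.47), (14.0, 7.31), (20.1, 6.29)]

n = len(table)
mean_x = sum(x for x, _ in table) / n
mean_y = sum(y for _, y in table) / n

# Ordinary least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in table)
         / sum((x - mean_x) ** 2 for x, _ in table))
intercept = mean_y - slope * mean_x

def predicted_error(standings_sd):
    """Forecast-error SD implied by the fitted line."""
    return intercept + slope * standings_sd

gap = predicted_error(7.0) - predicted_error(12.6)
print(round(slope, 3), round(gap, 2))
```

Each extra win of spread in last year's standings is worth about a sixth of a win of accuracy, so the 5.6-win difference between the two leagues works out to a little under one win.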

They were "much more accurater" than that, of course, by 3.4 or 4.5. The main reason, of course, is that there's a lot of luck involved. Less importantly, this simulation is very rough. The model is oversimplified, and there's no assurance that the relationship is actually linear. (In fact, the relationship *can't* be linear, since the "speed of light" limit is 6.4, and the model says the 1962 NL would beat that, at 6.3). 

It's just a very rough regression to get a very rough estimate. 

But the results seem reasonable to me. In 2016, we had (a) the narrowest standings in baseball history in the 2015 AL, and (b) a wider-than-average, 70th percentile spread in the 2015 NL. In that light, an expected difference of 1 win, in terms of forecasting accuracy, seems very plausible. 


So that's my explanation of why this year's NL forecasts were so accurate, while this year's AL forecasts were mediocre. A large dose of luck -- assisted by a small (but significant) dose of extra information content in the standings.
