Wednesday, March 25, 2015

Does WAR undervalue injured superstars?

In an article last month, "Ain't Gonna Study WAR No More" (subscription required), Bill James points out a flaw in WAR (Wins Above Replacement), when used as a one-dimensional measure of player value. 

Bill gives an example of two hypothetical players with equal WAR, but not equal value to their teams. One team has Player A, an everyday starter who created 2.0 WAR over 162 games. Another team has Player B, a star who normally produces at the rate of 4.0 WAR, but one year created only 2.0 WAR because he was injured for half the season.

Which player's team will do better? It's B's team. He creates 2.0 WAR, but leaves half the season for someone from the bench to add more. And, since bench players create wins at a rate higher than 0.0 -- by definition, since 0.0 is the level of player that can be had from AAA for free -- you'd rather have the half-time player than the full-time player.

This seems right to me, that playing time matters when comparing players of equal WAR. I think we can tweak WAR to come up with something better. And, even if we don't, I think the inaccuracy that Bill identified is small enough that we can ignore it in most cases.


First: you have to keep in mind what "replacement" actually means in the context of WAR. It's the level of a player just barely not good enough to make a Major League Roster. It is NOT the level of performance you can get off the bench.

Yes, when your superstar is injured, you often do find his replacement on the bench. That causes confusion, because that kind of replacement isn't what we really mean when we talk about WAR.

You might think -- *shouldn't* it be what we mean? After all, part of the reason teams keep reasonable bench players is specifically in case one of the regulars gets injured. There is probably no team in baseball that, when its 4.0 WAR player goes down the first day of the season, can't replace at least a portion of those wins with an available player. So if your centerfielder normally creates 4.0 WAR, but you have a guy on the bench who can create 1.0 WAR, isn't the regular really only worth 3.0 wins in a real-life sense?

Perhaps. But then you wind up with some weird paradoxes.

You lease a blue Honda Accord for a year. It has a "VAP" (Value Above taking Public Transit) of, say, $10,000. But, just in case the Accord won't start one morning, you have a ten-year-old Sentra in the garage, which you like about half as much.

Does that mean the Accord is only worth $5,000? If it disappeared, you'd lose its $10,000 contribution, but you'd gain back $5,000 of that from the Sentra. If you *do* think it's only worth $5,000 ... what happens if your neighbor has an identical Accord, but no Sentra? Do you really want to decide that his car is twice as valuable as yours?

It's true that your Accord is worth $5,000 more than what you would replace it with, and your neighbor's is worth $10,000 more than what he would replace it with. But that doesn't seem reasonable as a general way to value the cars. Do you really want to say that Willie McCovey has almost no value just because Hank Aaron is available on the bench?


There's also another accounting problem, one that commenter "Guy123" pointed out on Bill's site. I'll use cars again to illustrate it.

Your Accord breaks down halfway through the year, for a VAP of $5,000. Your mother has only an old Sentra, which she drives all year, for an identical VAP of $5,000.

Bill James' thought experiment says, your Accord, at $5,000, is actually worth more than your mother's Sentra, at $5,000 -- because your Accord leaves room for your own Sentra to add value later. In fact, you get $7,500 in VAP -- $5,000 from half a year of the Accord, and $2,500 from half a year of the Sentra.

Except that ... how do you credit the Accord for the value added by the Sentra? You earned a total of $7,500 in VAP for the year. Normal accounting says $5,000 for the Accord, and $2,500 for the Sentra. But if you want to give the Accord "extra credit," you have to take that credit away from the Sentra! Because, the two still have to add up to $7,500.

So what do you do?


I think what you do, first, is not base the calculation on the specific alternatives for a particular team. You want to base the calculation on the *average* alternative, for a generic team. That way, your Accord winds up worth the same as your neighbor's.

You can call that, "Wins Above Average Bench." If only 1 in 10 households has a backup Sentra, then the average alternative is one tenth of $5,000, or $500. So the Accord has a WAAB of $9,500.
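In code, the WAAB arithmetic from the car analogy (all numbers hypothetical, straight from the example) is just:

```python
# Hypothetical numbers from the car analogy: the Accord's value above
# public transit (VAP) is $10,000 a year; a backup Sentra is worth
# $5,000 a year, but only 1 household in 10 has one.
accord_vap = 10_000
sentra_vap = 5_000

# The *average* alternative, across all households, is one tenth of
# the Sentra's value
average_alternative = sentra_vap / 10      # $500

# Wins (dollars) Above Average Bench
accord_waab = accord_vap - average_alternative   # $9,500
print(accord_waab)
```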

All this needs to happen because of a specific property of the bench -- it has better-than-replacement resources sitting idle.

When Jesse Barfield has the flu, you can substitute Hosken Powell for "free" -- he would just be sitting on the bench anyway. (It's not like using the same starting pitcher two days in a row, which has a heavy cost in injury risk.)

That wouldn't be the case if teams didn't keep extra players on the bench, like if the roster size for batters were fixed at nine. Suppose that when Jesse Barfield has the flu, you have to call Hosken Powell up from AAA. In that case, you DO want Wins Above Replacement. It's the same Hosken Powell, but, now, Powell *is* replacement, because replacement is AAA by definition.

Still, you won't go too wrong if you just stick to WAR. In terms of just the raw numbers, "Wins Above Replacement" is very close to "Wins Above Average Bench," because the bottom of the roster, the players that don't get used much, is close to 0.0 WAR anyway.

For player-seasons between 1982 and 1991, inclusive, I calculated the average offensive expectation (based on a weighted average of surrounding seasons) for regulars vs. bench players. Here are the results, in Runs Created per 405 outs (roughly a full-time player-season), broken down by "benchiness" as measured by actual AB that year:

500+ AB: 75
401-500: 69
301-400: 65
201-300: 62
151-200: 60
101-150: 59
 76-100: 45
 51- 75: 33

A non-superstar everyday player, by this chart, would probably come in at around 70 runs. A rule of thumb is that everyday players are worth about 2.0 WAR. So, 0.0 WAR -- replacement level -- would be about 50 runs.

The marginal bench AB, the ones that replace the injured guy, would probably come from the bottom four rows of the chart -- maybe around 55. That's 5 runs above replacement, or 0.5 wins. 

So, the bench guys are 0.5 WAR. That means when the 4.0 guy plays half a season, and gets replaced by the 0.5 guy for the other half, the combination is worth 2.25 WAR, rather than 2.0 WAR. But, as Bill pointed out, the WAR accounting credits the injured star with only 2.0, so he comes out looking no better than the full-time guy.

But if we switch to WAAB ... now, the full-time guy is 1.5 WAAB (2.0 minus 0.5). The half-time star is 1.75 WAAB (4.0 minus 0.5, all divided by 2). That's what we expected: the star shows more value.
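The WAR-vs-WAAB comparison is easy to verify with a few lines of arithmetic, using the numbers from the example:

```python
# Numbers from the text: a 2.0-WAR full-time regular, vs. a star who
# produces at a 4.0-WAR rate but plays only half the season, with the
# bench worth 0.5 WAR over a full season.
bench = 0.5

# Conventional WAR credits each player only for what he produced
regular_war = 2.0
star_war = 4.0 * 0.5              # 2.0 -- looks equal to the regular

# But the star's team also gets half a season of the bench guy
team_with_star = star_war + bench * 0.5   # 2.25 wins total

# Wins Above Average Bench separates the two players
regular_waab = regular_war - bench        # 1.5
star_waab = (4.0 - bench) * 0.5           # 1.75 -- the star shows more value
```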

But: not by much. 0.25 wins is 2.5 runs, which is a small discrepancy compared to the randomness of performance in general. And even that discrepancy is rare, since something as large as a quarter of a win only shows up when a superstar loses half the season to injury. The only time it's large and not rare is probably star platoon players -- but there aren't too many of those.

(The biggest benefit to accounting for the bench might be when evaluating pitchers, who, unlike hitters, vary quite a bit in how much they're physically capable of playing.)

I don't see it as that big a deal at all. I'd say, if you want, when you're comparing two batters, give the less-used player a bonus of 0.1 WAR for each 100 AB of playing time. 

Of course, that estimate is very rough ... the 0.1 wins could easily be 0.05, or 0.2, or something. Still, it's going to be fairly small -- small enough that I'd bet it wouldn't change too many conclusions you'd reach if you just stuck to WAR.



Friday, March 06, 2015

Why is the bell curve so pervasive?

Why do so many things seem to be normally distributed (bell curved)? That's something that bothered me for a long time. Human heights are (roughly) normally distributed. So are weights of (presumably identical) bags of potato chips, basketball scores, blood pressure, and a million other things, seemingly unrelated.

Well, I was finally able to "figure it out," in the sense of, finding a good intuitive explanation that satisfied my inner "why". Here's the explanation I gave myself. It might not work for you -- but you might already have your own.


Imagine a million people each flipping a coin 100 times, and reporting the number of heads they got. The distribution of those million results will have a bell-shaped curve with a mean around 50. (Yes, the number of heads is discrete and the bell curve is continuous, but never mind.)  
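If you want to see it for yourself, the setup takes a few lines of Python (a smaller run than a million people, to keep it quick):

```python
import random
from collections import Counter

random.seed(1)
n_people, n_flips = 20_000, 100

# Each "person" flips 100 fair coins; record their head counts
heads = [sum(random.getrandbits(1) for _ in range(n_flips))
         for _ in range(n_people)]

counts = Counter(heads)
mean = sum(heads) / n_people   # close to 50

# A crude text histogram: the counts peak at 50 and fall off
# symmetrically on both sides, tracing out the bell
for k in range(40, 61, 2):
    print(f"{k:3d} {'#' * (counts[k] // 40)}")
```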

In fact, you can prove, mathematically, that you should get something very close to the normal distribution. But is there an intuitive explanation that doesn't need that much math?

My explanation is: the curve HAS to be bell-shaped. There's no alternative based on what we already know about it.

-- First: you probably want the distribution to be curved, and not straight lines. I guess you could expect something kind of triangular, but that would be weird.

-- Second: the curve can never go below the horizontal axis, since probabilities can't be negative.

-- Third: the curve has to be highest at 50, and always go lower when you move farther from the center -- which means, at the extremes, it gets very, very close to zero without ever touching it.

That means we're stuck with this:

How do you fill that in without making something that looks like a bell? You can't. 

This line of thinking -- call it the "fill in the graph" argument -- doesn't prove it's the normal distribution specifically. It just explains why it's a bell curve. But, I didn't have a mental image of the normal distribution as different from other bell-shaped curves, so it's close enough for my gut. In fact, I'm just going to take it as a given that it's the normal distribution, and carry on.

(By the way, if you want to see the normal distribution arise magically from the equivalent of coin flips, see the video here.) 


That's fine for coin flips. But what about all those other things? Say, human height? We still know it's a bell-shaped curve from the same "fill in the graph" argument, but how do we know it's the same one as coins? After all, a single human's height isn't the same thing as flipping a quarter 100 times. 

My gut explanation is ... it probably *is* something like coin flips. Imagine that the average adult male is 5' 9". But there may be (say) a million genes that move that up or down. Suppose that for each of those genes, if it shows "heads," you get to be 1/360 of an inch taller. If the gene shows "tails," you're 1/360 of an inch shorter.

If that's how it works, and each gene is independent and random, the population winds up following a normal distribution with a standard deviation of around 2.8 inches, which is roughly the real-world number.
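You can check that claim both ways. The standard deviation of a sum of n fair plus-or-minus-e coin flips is sqrt(n) times e; note that landing on the 2.8-inch figure requires a per-gene effect of 1/360 of an inch each way (my assumption for the sketch). A smaller simulation confirms the formula:

```python
import math
import random

random.seed(0)

# Analytic check: a million genes, each shifting height 1/360 of an
# inch up (heads) or down (tails).  SD of the sum = sqrt(n) * e.
n_genes, effect = 1_000_000, 1 / 360
analytic_sd = math.sqrt(n_genes) * effect   # ~2.78 inches

# Small-scale simulation (fewer "genes" so it runs quickly): the
# empirical SD should match sqrt(n) * e, here sqrt(2500)/360 ~ 0.139.
n, e, people = 2_500, 1 / 360, 1_000
heights = []
for _ in range(people):
    heads = sum(random.getrandbits(1) for _ in range(n))
    heights.append((2 * heads - n) * e)   # net deviation from the mean

mean = sum(heights) / people
sd = math.sqrt(sum((h - mean) ** 2 for h in heights) / people)
print(analytic_sd, sd)
```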

It seems reasonable to me, intuitively, to think that the genetics of height probably do work something like this oversimplified example. 


How does this apply to weights of bags of chips? Same idea. The chips are processed on complicated machinery, with a million moving parts. They aren't precise down to the last decimal place. If there are 1,000 places on the production line where the bag might get a fraction heavier or lighter, the coin-flip model works fine.


But for test scores, the coin-flip model doesn't seem to work very well. People have different levels of skill with which they pick up the material, and different study habits, and different reactions to the pressure of an exam, and different speeds at which they write. There's no obvious "coin flipping" involved in the fact that some students work hard, and some don't bother too much.

But there can be coin flipping involved in some of those other things. Different levels of skill could be somewhat genetic, and therefore normally distributed. 

And, most of those other things have to be *roughly* bell-shaped, too, by the "fill in the graph" argument: the extremes can't go below zero, and the curve needs to drop consistently on both sides of the peak. 

So to get the final test result, you're adding the bell-shaped curve for ability, plus the bell-shaped curve for speed, plus the bell-shaped curve for industriousness, and so on.

When you add variables that are normally distributed, the sum is also normally distributed. Why? Well, suppose ability is the equivalent of the sum of 1000 coin flips. And industriousness is the equivalent of the sum of 350 coin flips. Then, "ability plus industriousness" is just the sum of 1350 coin flips -- which is still a bell curve.
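You can watch the flips pool together in a quick simulation (the 1,000 and 350 are the made-up numbers from the paragraph above):

```python
import math
import random

random.seed(2)

def flips(n):
    """Number of heads in n fair coin flips."""
    return sum(random.getrandbits(1) for _ in range(n))

# "Ability" as 1,000 flips plus "industriousness" as 350 more: the sum
# behaves exactly like 1,350 flips of a single coin
totals = [flips(1000) + flips(350) for _ in range(5_000)]

mean = sum(totals) / len(totals)
sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / len(totals))
# mean lands near 675 (half of 1,350); sd near sqrt(1350)/2, about 18.4
print(mean, sd)
```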

My guess is that there are a lot of things in the universe that work this way, and that's why they come out normally distributed. 

If you want to go beyond genetics ... well, there are probably a million environmental factors, too. Going back to height ... maybe, the more you exercise, the taller you get, by some tiny fraction. (Maybe exercise burns more calories, which makes you hungrier, and it's the nutrition that helps you get taller. Whatever.)  

Exercise could be normally distributed, too, or at least many of its factors might be. For instance, how much exercise you get might partly depend on, say, how far you had to walk to school. That, itself, has to be roughly a bell curve, by the same old "fill in the graph" argument.


What makes bell curves even more ubiquitous is that you get bell curves even if you start with something other than coin flips.

Take, for instance, the length of a winning streak in sports. That isn't a coin flip, and it isn't, itself, normally distributed. The most frequent streak is 0 wins, then 1, then 2, and so on. The graph would look something like this (stolen randomly from the web):

But, the thing is: the distribution of one winning streak doesn't look normal at all. But if you add up, say, a million winning streaks -- the result WILL be normally distributed. That's the most famous result in statistics, the "Central Limit Theorem," which says that if you add up enough identical, independent random variables, you always get a normal curve.
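A quick simulation shows the effect. Here I assume a 50 percent win probability, so a "streak" is the number of wins before the first loss -- heavily skewed toward zero on its own, bell-shaped once you add streaks up:

```python
import random

random.seed(3)

def streak(p=0.5):
    """Length of one winning streak: wins before the first loss."""
    k = 0
    while random.random() < p:
        k += 1
    return k

# One streak is most often 0, then 1, then 2 -- nothing like a bell.
# But a SUM of 100 streaks piles up symmetrically around its mean of
# 100 (each streak averages p/(1-p) = 1 win), per the Central Limit
# Theorem.
sums = [sum(streak() for _ in range(100)) for _ in range(10_000)]
mean = sum(sums) / len(sums)
print(mean)
```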

My intuitive explanation is: the winning streak totals reflect, roughly, the same underlying logic as the coin flips.

Suppose you're figuring out how to get 50 heads out of 100 coins. You say, "well, all the odd flips might be heads. All the even flips might be heads. The first 50 might be heads, and the last 50 might be tails ... " and so on.

For winning streaks: Suppose you're trying to figure out how to get a total of (say) 67 wins out of 100 streaks. You say, "well, maybe all the odd streaks are 0, and all the low even streaks are 1, and streak number 100 is a 9-gamer, and streak number 98 is a 7-gamer, and so on. Or, maybe the EVEN streaks are zero, and the high ODD streaks are the big ones. Or maybe it's the LOW odd streaks that are the big ones ... " and so on.

In both cases, you calculate the probabilities by choosing combinations that add up. It's the fact that the probabilities are based on combinations that makes things come out normal. 


Why is that? Why does the involvement of combinations lead to a normal distribution? For that, the intuitive argument involves some formulas (but no complicated math). 

This is the actual equation for the normal distribution:

f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi} } e^{ -\frac{(x-\mu)^2}{2\sigma^2} }

It looks complicated. It's got pi in it, and e (the transcendental number 2.71828...), and exponents. How does all that show up, when we're just flipping coins and counting heads?

It comes from the combinations -- specifically, the factorials they contain. 

The binomial probability of getting exactly 50 heads in 100 coin tosses is:

\binom{100}{50} \left( \frac{1}{2} \right)^{100} = \frac{100!}{50! \; 50!} \cdot \frac{1}{2^{100}}

There is a formula, "Stirling's Approximation," that lets you substitute for the factorials. It turns out that you can rewrite n! this way:

n! \sim \sqrt{2 \pi n}\left(\frac{n}{e}\right)^n

The two sides are never exactly equal -- the approximation only becomes exact in the limit as n approaches infinity -- but it's very, very close even for moderate values of n. 

Stick that in where the factorial would go, and do some algebra manipulation, and the "e" winds up flipping from the denominator to the numerator, and the "square root of 2 pi" flips from the numerator to the denominator ... and you get something that looks really close to the normal distribution. Well, I'm pretty sure you do; I haven't tried it myself. 
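If you'd rather check numerically than push through the algebra, you can compare the exact binomial probabilities against the normal density directly (mean n/2, SD sqrt(n)/2 for n fair coins):

```python
import math

# Exact binomial probability of k heads in n = 100 fair tosses,
# vs. the normal density at k with mean 50 and SD 5
n = 100
mu, sigma = n / 2, math.sqrt(n) / 2

for k in (50, 55, 60):
    binom = math.comb(n, k) / 2 ** n
    normal = (math.exp(-(k - mu) ** 2 / (2 * sigma ** 2))
              / (sigma * math.sqrt(2 * math.pi)))
    # At k = 50, for instance: ~0.0796 exact vs. ~0.0798 normal
    print(k, round(binom, 4), round(normal, 4))
```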

I don't have to ... at this point, my gut is happy. My sense of "I still don't understand why" is satisfied by seeing the Stirling formula, and seeing how the pi and e come out of the factorials in roughly the right place. 

(UPDATE, 3/8/2015: I had originally said, in the first paragraph, that test scores are normally distributed.  In a tweet, Rodney Fort pointed out that they're actually *engineered* to be normally distributed. So, not the best example, and I've removed it.)


Sunday, March 01, 2015

Two nice statistical analogies

I'm always trying to find good analogies to help explain statistical topics. Here are a couple of good ones I've come across lately, which I'll add to the working list I keep in my brain.


Here's Paul Bruno, explaining why r-squared is not necessarily a good indicator of whether or not something is actually important in real life:

"Consider 'access to breathable oxygen'. If you crunched the numbers, you would likely find that access to breathable oxygen accounts for very little – if any – of the variation in students' tests scores. This is because all students have roughly similar access to breathable oxygen. If all students have the same access to breathable oxygen, then access to breathable oxygen cannot 'explain' or 'account for' the differences in their test scores.

"Does this mean that access to breathable oxygen is unimportant for test scores? Obviously not. On the contrary: access to breathable oxygen is very important for kids’ test scores, and this is true even though access to breathable oxygen explains ≈ 0% of their variation."

Great way to explain it, and an easy way to understand why, if you want to see if a factor is important in a "breathable oxygen" kind of way, you need to look at the regression coefficient, not the r-squared.
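Here's a toy simulation of the oxygen argument (all the numbers are made up): a factor with a huge coefficient but almost no variation contributes almost nothing to r-squared.

```python
import random

random.seed(4)

n = 10_000
# "Oxygen": hugely important, but everyone gets roughly the same amount
oxygen = [1.0 + random.gauss(0, 0.001) for _ in range(n)]
# "Effort": varies a lot from student to student
effort = [random.gauss(0, 1) for _ in range(n)]

# Scores depend enormously on oxygen (coefficient 100, twenty times
# effort's coefficient of 5)
scores = [100 * o + 5 * e for o, e in zip(oxygen, effort)]

def r_squared(x, y):
    """Squared correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

# Oxygen "explains" ~0% of the variance; effort explains nearly all of it
print(r_squared(oxygen, scores))
print(r_squared(effort, scores))
```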


This sentence comes from Jordan Ellenberg's mathematics book, "How Not To Be Wrong," which I talked a bit about last post:

"The significance test is the detective, not the judge."

I like that analogy so much I wanted to start by putting it by itself ... but I should add the previous sentence for context:

"A statistically significant finding [only] gives you a clue, suggesting a promising place to focus your research energy. The significance test is the detective, not the judge."         [page 161, emphasis in original]

(By the way, Ellenberg doesn't put a hyphen in the phrase "statistically significant finding," but I normally do. Is the non-hyphen a standard one-off convention, like "Major League Baseball"?)

(UPDATE: this question now answered in the comments.)

The point is: at the usual 5 percent threshold, one in twenty experiments testing a false hypothesis would still produce a statistically-significant result just by random chance. So, statistical significance doesn't mean you can just leap to the conclusion that your hypothesis is true. You might be one of that "lucky" five percent. To be really confident, you need to wait for replications, or find other ways to explore further.
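You can watch the one-in-twenty happen by simulating experiments where the hypothesis is false. (The two-group z-test here is my own choice for the sketch, with the group standard deviation known to be 1.)

```python
import math
import random

random.seed(5)

def null_experiment(n=50):
    """Compare two groups drawn from the SAME distribution, and call
    the result "significant" when a two-sided z-test on the difference
    of means gives p < 0.05."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)          # standard error, sd known to be 1
    return abs(diff / se) > 1.96   # |z| > 1.96 is p < 0.05

trials = 2_000
rate = sum(null_experiment() for _ in range(trials)) / trials
print(rate)   # hovers around 0.05: about one experiment in twenty
```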

I had previously written an analogy that's kind of similar:

"Peer review is like the police deciding there's enough evidence to lay charges. Post-publication debate is like two lawyers arguing the case before a jury."

Well, it's really the district attorney who has the final say on whether to lay charges, right?  In that light, I like Ellenberg's detective for the investigating role better than my police. Adding that in, here's the new version:  

"Statistical significance is the detective confirming a connection between the suspect and the crime. Peer review is the district attorney deciding there's enough evidence to lay charges.  Post-publication debate is the two lawyers arguing the case before a jury."

Much better, I think, with Ellenberg's formulation in there too.
