Saturday, April 18, 2015

Making the NHL draft lottery more fun

The NHL draft lottery is to be held today in Toronto. Indeed, it might have already happened; I can't find any reference to what time the lottery actually takes place. The NHL will be announcing the results just before 8:00 pm tonight.

(Post-lottery update: someone on Twitter said the lottery took place immediately before the announcement.  Which makes sense; the shorter wait time minimizes the chance of a leak.  Also: the Oilers won, but you probably know that by now.)

The way it works is actually kind of boring, and I was thinking about ways to make it more interesting ... so you could televise it, and it would hold viewers' interest.

(Post-lottery update: OK, they did televise the reveal, and made a medium-sized big deal of it.  But, since the winner was already known, it was more frustrating than suspenseful. This post is about a way to make it legitimately exciting while it's still in progress, before the final result is known.)

Let me start by describing how the lottery works now. You can skip this part if you're already familiar with it.

------

The lottery is for the first draft choice only. The 14 teams that missed the playoffs are eligible. Whoever wins jumps to the number one pick, and the remaining 13 teams keep their relative ordering. 

The lower a team's position in the standings, the higher the probability it wins the lottery. The NHL set the probabilities like this:

 1. Buffalo Sabres       20.0%
 2. Arizona Coyotes      13.5%
 3. Edmonton Oilers      11.5%
 4. Toronto Maple Leafs   9.5%
 5. Carolina Hurricanes   8.5%
 6. New Jersey Devils     7.5%
 7. Philadelphia Flyers   6.5%
 8. Columbus Blue Jackets 6.0%
 9. San Jose Sharks       5.0%
10. Colorado Avalanche    3.5%
11. Florida Panthers      3.0%
12. Dallas Stars          2.5%
13. Los Angeles Kings     2.0%
14. Boston Bruins         1.0%

It's kind of interesting how they manage to get those probabilities in practice.

The NHL will put fourteen balls in a hopper, numbered 1 to 14. It will then randomly draw four of those balls.

There are exactly 1,001 combinations of 4 balls out of 14 -- that is, 1,001 distinct "lottery tickets". The 1001st ticket -- the combination "11, 12, 13, 14"  -- is designated a "draw again." The other 1,000 tickets are assigned to the teams in numbers corresponding to their probabilities. So, the Sabres get 200 tickets, the Hurricanes get 85 tickets, and so on. (The tickets are assigned randomly -- here's the NHL's .pdf file listing all 1,001 combinations, and which go to which team.)

It's just coincidence that the number of teams is the same as the number of balls. The choice of 14 balls is to get a number of tickets that's really close to a round number, to make the tickets divide easily.
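
You can check the ticket arithmetic in a couple of lines of Python (a throwaway sketch, nothing official):

from itertools import combinations

tickets = list(combinations(range(1, 15), 4))  # every way to draw 4 balls from 14
print(len(tickets))   # 1001
print(tickets[-1])    # (11, 12, 13, 14) -- the "draw again" combination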

------

So this works like a standard lottery, like Powerball or whatever: there's just one drawing, and you have the winner. That's kind of boring ... but it works for the NHL, which isn't interested in televising the lottery live.

But it occurred to me ... if you DID want to make it interesting, how would you do it?

Well, I figured ... you could structure it like a race. Put a bunch of balls in the machine, each with a single team's logo. Draw one ball, and that team gets a point. Put the ball back, and repeat. The first team to get to 10 points, wins.

You can't have the same number of balls for every team, because you want them all to have different odds of winning. So you need fourteen different quantities. The smallest sum of fourteen different positive integers is 105 (1 + 2 + 3 ... + 14). That particular case won't work: you want the Bruins to still have a 1 percent chance of winning, but, with only 1 ball to Buffalo's 14, it'll be much, much less than that.

What combinations work? I experimented a bit, and wrote a simulation, and I came up with a set of 746 balls that seems to give the desired result. The fourteen quantities go from 70 balls (Buffalo), down to 39 (Boston). 

In 500,000 runs of my simulation, Buffalo won 20.4 percent of the time, and Boston 1.1 percent. Here are the full numbers. First, the number of balls; second, the percentage of lotteries won; and, third, the percentage the NHL wants.

--------------------------------------------------
                                    result  target
-------------------------------------------------- 
 1. Buffalo Sabres        70 balls   20.4%   20.0%
 2. Arizona Coyotes       63 balls   13.5%   13.5%
 3. Edmonton Oilers       61 balls   11.5%   11.5%
 4. Toronto Maple Leafs   58 balls    9.3%    9.5%
 5. Carolina Hurricanes   57 balls    8.5%    8.5%
 6. New Jersey Devils     56 balls    7.3%    7.5%
 7. Philadelphia Flyers   54 balls    6.5%    6.5%
 8. Columbus Blue Jackets 53 balls    5.9%    6.0%
 9. San Jose Sharks       51 balls    4.7%    5.0%
10. Colorado Avalanche    49 balls    3.7%    3.5%
11. Florida Panthers      47 balls    3.2%    3.0%
12. Dallas Stars          45 balls    2.5%    2.5%
13. Los Angeles Kings     43 balls    1.9%    2.0%
14. Boston Bruins         39 balls    1.1%    1.0%
--------------------------------------------------

The probabilities are all pretty close. They're not perfect, but they're probably good enough. In other words, the NHL could probably live with awarding the Bruins a 1.1% chance instead of a 1% chance.
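
If you want to play along at home, here's a rough Python sketch of this kind of simulation -- not necessarily the exact way I did it, but the same idea, using the ball counts from the table above (100,000 runs here, to keep it quick):

import random

balls = {
    "Sabres": 70, "Coyotes": 63, "Oilers": 61, "Maple Leafs": 58,
    "Hurricanes": 57, "Devils": 56, "Flyers": 54, "Blue Jackets": 53,
    "Sharks": 51, "Avalanche": 49, "Panthers": 47, "Stars": 45,
    "Kings": 43, "Bruins": 39,
}
teams = list(balls)
weights = list(balls.values())

def run_race(target=10):
    # draw one ball at a time, with replacement; first team to `target` wins
    points = dict.fromkeys(teams, 0)
    draws = 0
    while True:
        draws += 1
        team = random.choices(teams, weights=weights)[0]
        points[team] += 1
        if points[team] == target:
            return team, draws

trials = 100_000   # the numbers in the table came from 500,000 runs
wins = dict.fromkeys(teams, 0)
total_draws = 0
for _ in range(trials):
    winner, draws = run_race()
    wins[winner] += 1
    total_draws += draws

for team in teams:
    print(f"{team:12s} {100 * wins[team] / trials:5.1f}%")
print(f"average draws per lottery: {total_draws / trials:.1f}")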

If you did the lottery this way, would it be more fun? I think it would. You'd be watching a race to 10 points. It would have a plot, and you could see who's winning, and the odds would change every ball. 

Every team would have something to cheer about, because they'd probably all have at least a few balls drawn. The ball ratio between first and last is only around 1.8 (70/39), so for every 9 points the Sabres got, the Bruins would get 5. 

The average number of simulated balls it took to find a winner was 72.4. If you draw one ball every 30 seconds ... along with filler and commercials, that's a one-hour show. Of course, it could go long, or short. The minimum is 10; the maximum is 127 (after a fourteen-way tie for 9). But I suspect the distribution is tight enough around 72 that it would be reasonably manageable.

Another thing, too, is ... every team would have a reasonable chance of being close, and an underdog would almost always challenge. Here's how often each team would finish with 7 or more points (including the times they won):

 1. Buffalo Sabres         53.3 percent
 2. Arizona Coyotes        43.4
 3. Edmonton Oilers        40.2
 4. Toronto Maple Leafs    35.7
 5. Carolina Hurricanes    33.9
 6. New Jersey Devils      32.4
 7. Philadelphia Flyers    29.5 
 8. Columbus Blue Jackets  27.7  
 9. San Jose Sharks        25.2 
10. Colorado Avalanche     22.1 
11. Florida Panthers       19.5 
12. Dallas Stars           16.7 
13. Los Angeles Kings      14.5 
14. Boston Bruins          10.6 


And here's the average score, by final position after it's over:

10.0  Winner
 8.0  Second
 7.3  Third
 6.6   ...
 6.1
 5.6
 5.2
 4.8
 4.3
 3.9
 3.5
 3.0
 2.4
 1.6

That's actually closer than it looks, because you don't know in advance which team will wind up at the bottom. Also, just before the winning ball was drawn, the first-place team would have been at 9.0, which means that, at that point, the second-place team would, on average, have been only one point back. 

------

The problem is ... it still takes 746 balls to make the odds work out that closely. That's a lot of balls to have to put in the machine. Of course, that's just what I found by trial and error; you might be able to do better. Or, you could use a smaller number of balls, and accept a probability distribution that's different from the NHL's, but still reasonable.

Or, you could add a twist. You could give every ball a different number of points. Maybe the Sabres' balls are worth from 5 points down to 1, and the Bruins' balls are only 3 down to 1, and the first team to 20 points wins. I don't think it would be that hard to find some combination that works. 

That's kind of a math nerd thing. I'd bet you can find a system that comes as close as I got with fewer than 100 balls, and I bet you'd be able to get to it pretty quickly by trial and error. 

At least, the NHL could, if it wanted to.








Saturday, April 11, 2015

MLB forecasters are more conservative this year

Every April, sabermetricians, bookies and sportswriters issue their won-lost predictions for each of the 30 teams in MLB. And, every year, some of them are overconfident, and essentially wind up trying to predict coin flips.

As I've written before, there's a mathematical "speed of light" limit to how much of a team's record can be predicted. That's the part that's determined by player talent. Any spread that's wider than the spread of talent must be just random luck, and, by definition, unpredictable.

Based on the historical record, we can estimate that the spread of team talent in MLB is somewhere around an SD of 9 games. Not all of that talent can be predicted beforehand, because some of it hasn't happened yet -- trades, injuries, players unexpectedly blossoming or declining, and so on. My estimate is that if you were the most omniscient baseball insider in the universe, maybe you could predict an SD of 8 games.

Last year, many pundits exceeded that "speed of light" limit anyway. I thought that there would be fewer this year, that the 2015 forecasts would project a narrower range of team outcomes. That's because last year's standings were particularly tight, and there's been talk about how we may be entering a new era of parity.

And that did happen, to some extent.

I'll show you the 2015s and the 2014s together for easy comparison. A blank space is a forecast I don't have for that year. (For 2015, I think Jonah Keri and the ESPN website predicted only the order of finish, and not the actual W-L record.) 

Like last year, I've included the "speed of light" limits, the naive "last year regressed to the mean" forecast, and the "every team will finish .500" forecast. Links are for 2015 ... for 2014, see last year's post.


 2015  2014
--------------------------------------------------
 9.32 11.50  Sports Illustrated
 9.07  8.76  Jeff Passan (Yahoo!)
 9.00  9.00  Speed of Light (theoretical est.)
 8.79        Bruce Bukiet
       8.53  Jonah Keri (Grantland)
       8.51  Sports Illustrated (runs/10)
 8.00  8.00  Speed of Light (practical est.)
 7.92  7.78  Mike Oz (Yahoo!)
       7.79  ESPN website
 6.99        Chris Cwik (Yahoo!)
 6.38  6.38  Naive previous year method (est.)
 6.34  9.23  Mark Townsend (Yahoo!)
 6.10  6.90  Tim Brown (Yahoo!)
 6.03  7.16  Vegas Betting Line (Bovada)
 5.46  5.55  Fangraphs 
 4.93  8.72  ESPN The Magazine 
 0.00  0.00  Predict 81-81 for all teams
--------------------------------------------------

Of those who predicted both seasons, only two out of eight increased the spread of their forecasts from last year. And those two, Jeff Passan and Mike Oz, increased only a little bit. 

On the other hand, some of the other forecasters see *dramatically* more equality in team talent. Yahoo's Mark Townsend dropped from 9.23 to a very reasonable 6.34. And ESPN dropped from one of the highest spreads to the lowest -- by far -- at 4.93. 

Which is strange, because ESPN's words are so much more optimistic than their numbers. About the Washington Nationals, they write:


"It's the Nationals' time."
"They're here to stay."
"Anything less than an NL East crown will qualify as a big disappointment."

But their W-L prediction for the Nationals, whom they projected higher than any other team?  A modest 91-71, only ten games above .500.

------

In any case ... I wonder how much of the difference between 2014 and 2015 is due to (a) new methodologies, (b) the same methodologies reflecting a real change in team parity, and (c) just a gut reaction to the 2014 standings having been tighter than normal.

My guess is that it's mostly (b). I'd bet on Bovada's forecasts being the most accurate of the group. If that's true, then maybe teams really *are* tighter in talent than last year, by around 1 win of SD. Which is, roughly, in line with the rest of the forecasts.

I guess we'll know more next year. 




Wednesday, April 08, 2015

How much has parity increased in MLB?

The MLB standings were tighter than usual in 2014. No team won, or lost, more than 98 games. That's only happened a couple of times in baseball history.

You can measure the spread in the standings by calculating the standard deviation (SD) of team wins. Normally, it's around 11. Two years ago, it was 12.0. Last year, it was down substantially, to 9.4.

Historically, 9.4 is not an unprecedented low. In 1984, the SD was 9.0; that's the most parity of any season since the sixties. More recently, the 2007 season came in at 9.3, with a team range of 96 wins to 96 losses.

But this time, people are noticing. A couple of weeks ago, Ben Lindbergh showed that this year's preseason forecasts have been more conservative than usual, suggesting that the pundits think last year's compressed standings reflect increased parity of talent. They've also noted another anomaly: in 2014, payroll seemed to be less important than usual in predicting team success. These days, the rich teams don't seem to be spending as much, and the others seem to be spending a little more.

So, have we actually entered into an era of higher parity, where we should learn to expect tighter playoff races, more competitive balance, and fewer 100-win and 100-loss teams? 

My best guess is ... maybe just a little bit. I don't think the instant, single-season drop from 12 games to 9.4 games could possibly be the result of real changes. I think it was mostly luck. 

-----

Here's the usual statistical theory. You can break down the observed spread in the standings into talent and luck, like this: 

SD(observed)^2 = SD(talent)^2 + SD(luck)^2

Statistical theory tells us that SD(luck) equals 6.4 games, for a 162-game season. With SD(observed) equal to 12.0 for 2013, and 9.4 for 2014, we can solve the equation twice, and get

2013: 10.2 games estimated SD of talent 
2014:  7.0 games estimated SD of talent
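
Here's that arithmetic as a quick Python sketch, for anyone who wants to check it (the 6.4 is the binomial square root of 162 x .5 x .5):

import math

luck = math.sqrt(162 * 0.5 * 0.5)   # binomial luck over 162 games: about 6.4
for year, observed in ((2013, 12.0), (2014, 9.4)):
    talent = math.sqrt(observed**2 - luck**2)
    print(year, round(talent, 1))   # 2013 -> 10.2; 2014 -> 6.9, call it 7.0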

That's a huge one-season drop, from 10.2 to 7.0 ... too big, I think, to really happen in a single offseason. 

Being generous, suppose that between the 2013 and 2014 seasons, teams changed a third of their personnel. That's a very large amount of turnover. Would even that be enough to cause the drop?

Nope. At least, not if that one-third of redistributed "talent wealth" was spread equally among teams. In that case, the SD of the "new" one-third of talent would be zero. But the SD of the remaining two-thirds of team talent would be 8.3 (the 2013 figure of 10.2, multiplied by the square root of 2/3).

That 8.3 is still higher than our 7.0 estimate for 2014! So, for the SD of talent to drop that much, we'd need the one-third of talent to be redistributed, not equally, but preferentially to the bad teams. 

Is that plausible? To how large an extent would that need to happen?

We have a situation like this: 

2014 talent = original two thirds of 2013 talent 
            + new one third of 2013 talent 
            + redistribution favoring the worse teams

Statistical theory says the relationship between the SDs is this:

SD(2014 talent)^2 = 

SD(2013 talent teams kept)^2 +
SD(2013 talent teams gained)^2 + 
2 * SD(2013 talent teams kept) * SD(2013 talent teams gained) * correlation between kept and gained

It's the same equation as before, but with an extra term (the last line). That term shows up because we're allowing a non-zero correlation between talent kept and talent gained -- the more "talent kept," the less "talent gained." When we just did "talent" and "luck," we were assuming there was no correlation, so we didn't need that extra part. (We could have left it in, but it would have worked out to zero anyway.)

The equation is easy to fill in. We saw that SD(2014 talent) was estimated at 7.0. We saw that SD(talent teams kept) was 8.3. And we can estimate that SD(talent teams gained) is 10.2 times the square root of 1/3, which is 5.9.

If you solve, you get 

Correlation between kept and gained = -0.57
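
Here's that calculation as a few lines of Python (a sketch that carries unrounded intermediate values, which is why it lands on -0.57 rather than the -0.56 you'd get from the rounded figures):

import math

talent_2013 = math.sqrt(12.0**2 - 6.4**2)   # about 10.2
talent_2014 = math.sqrt(9.4**2 - 6.4**2)    # about 7.0
kept = talent_2013 * math.sqrt(2 / 3)       # about 8.3
gained = talent_2013 * math.sqrt(1 / 3)     # about 5.9
r = (talent_2014**2 - kept**2 - gained**2) / (2 * kept * gained)
print(round(r, 2))                          # -0.57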

That's a very strong correlation we need, in order for this to work out. The -0.57 means that, on average, if a team's "kept" players were, say, 5th best in MLB (that is, 10 teams above average), its "gained" players must have been 9th worst in MLB (5.7 teams below average). 

That's not just the good teams getting worse by signing players that aren't as good as the above-average players they lost -- it's the good teams signing players that are legitimately mediocre. And, vice-versa. At -0.57, the worst teams in baseball would have had to replace one-third of their lineup, and those new players would have to have been collectively as good as those typically found on a 90-win team.

Did that actually happen? It's possible ... but it's something that you'd easily be able to have seen at the time. I think we can say that nobody noticed -- going into last season, it didn't seem like any of the mainstream projections were more conservative than normal. (Well, with the exception of FanGraphs. Maybe they saw some of this actually happen? Or maybe they just used a conservative methodology.)

One thing that WAS noticed before 2014 is that the 51-111 Houston Astros had improved substantially. So that's at least something.

And, for what it's worth: the probability of randomly getting a correlation coefficient as extreme as 0.57, in either direction, is 0.001 -- that is, one in a thousand. On that basis, I think we can reject the hypothesis that team talent grew tighter just randomly.

(Technical note: all these calculations have assumed that every team lost *exactly* one-third of its talent, and that those one-thirds were kept whole and distributed to other teams. If you were to use more realistic assumptions, the chances would improve a little bit. I'm not going to bother, because, as we'll see, there are other possibilities that are more likely anyway.)
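
(And if you want to check the p-values in this post, the standard t-test for a correlation coefficient does the job. A sketch, using scipy; the 0.43 figure will turn up in the next section:)

import math
from scipy import stats

def p_value(r, n=30):
    """Two-tailed p-value for a correlation r among n teams."""
    t = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    return 2 * stats.t.sf(t, df=n - 2)

print(p_value(0.57))   # about 0.001
print(p_value(0.43))   # about 0.018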

------

So, if it isn't the case that the spread in talent narrowed ... what else could it be? 

Here's one possibility: instead of SD(talent) dropping in 2014, SD(luck) dropped. We were holding binomial luck constant at 6.4 games, but that's just the average. It varies randomly from season to season, perhaps substantially. 

It's even possible -- though only infinitesimally likely -- that, in 2014, every team played exactly to its talent, and SD(luck) was actually zero!

Except that ... again, that wouldn't have been enough. Even with zero luck, and talent at 10.2, we would have observed an SD of 10.2. But we didn't. We observed only 9.4. 

So, maybe we have another "poor get richer" story, where, in 2014, the bad teams happened to have good luck, and the good teams happened to have bad luck.

We can check that, in part, by looking at the 2014 Pythagorean projections. Did the bad teams beat Pythagoras more than the good teams did?

Not really. Well, there is one obvious case -- Oakland. The 2014 A's were the best team in MLB in run differential, but won only 88 games instead of 99 because of 11 games of bad "cluster luck". 

Eleven games is unusually large. But, the rest of the top half of MLB had a combined eighteen games of *good* luck, which seems like it would roughly cancel things out.

Still ... Pythagoras is only a part of overall luck, so there's still lots of opportunity for the "good teams having bad luck" to have manifested itself. 

-----

Let's do what we did before, and see what the correlation would have to be between talent and luck, in order to get the SD down to 9.4. The relationship, again, is:

SD(2014 standings)^2 = 

SD(2014 talent)^2 + 
SD(2014 luck)^2 + 
2 * SD(2014 talent) * SD(2014 luck) * correlation between 2014 talent and 2014 luck

Leaving SD(2014 talent) at the 2013 estimate of 10.2, and leaving SD(2014 luck) at 6.4, we can solve -- the correlation is (9.4^2 - 10.2^2 - 6.4^2) divided by (2 * 10.2 * 6.4) -- and we get

Correlation between 2014 talent and luck = -0.43

The chance of a correlation that big (either direction) happening just by random luck is about 0.018 -- call it 1 in 55. That seems like a big enough chance that it's plausible that's what actually happened. 

Sure, 1 in 55 seems low, and is statistically significantly unlikely in the classical "1 in 20" sense. But that doesn't matter. We're being Bayesian here. We know something unlikely happened, and so the reason it happened is probably also something unlikely. And the 1/55 estimate for "bad teams randomly got lucky" is a lot more plausible than the 1/1000 estimate for "bad teams randomly got good talent."  It might also be more plausible than "bad teams deliberately got good talent," considering that nobody noticed any unusual changes in talent at the time.

------

Having got this far, I have to backtrack and point out that these odds and correlations are actually too extreme. We've been assuming that all the unusualness happened after 2013 -- either in the offseason, or in the 2014 season. But 2013 might have also been lucky/unlucky itself, in the opposite direction.

Actually, it probably was. As I said, the historical SD of actual team wins is around 11. In the 2013 season, it was 12. We would have done better by comparing the "too equal" 2014 to the historical norm, rather than to a "too unequal" 2013. 

For instance, we toyed with the idea that there was less luck than normal in 2014. Maybe there was also more luck than normal in 2013. 

Instead of 6.4 both years, what if SD(luck) had actually been 8.4 in 2013, and 4.4 in 2014?

In that case, our estimates would be

SD(2013 talent) = 8.6
SD(2014 talent) = 8.3

That would be a change of just 0.3 wins in competitive balance, much more plausible than our previous estimate of a 3.2-win swing (10.2 to 7.0).

------

Still: no matter which of all these assumptions and calculations you decide you like, it seems like most of the difference must be luck. It might be luck in terms of the bad teams winding up with the good players for random reasons, or it might be that 2013 had the good teams having good luck, or it might be that 2014 had the good teams having bad luck.

Whichever kind of luck it is, you should expect a bounceback to historical norms -- a regression to the mean -- in 2015. 

The only way you can argue for 2015 being like 2014, is if you think the entire move from historical norms was due to changes in the distribution of talent between teams, due to economic forces rather than temporary random ones. 

But, again, if that's the case, show us! Personnel changes between 2013 and 2014 are public information. If they did favor the bad teams, show us the evidence, with estimates. I mean that seriously ... I haven't checked at all, and it's possible that it's obvious, in retrospect, that something real was going on.

------

Here's one piece of evidence that might be relevant -- betting odds. In 2014, the SD of Bovada's "over/under" team predictions was 7.16. This year, it's only 6.03.*

(* Bovada's talent spread is tighter than what we expect the true distribution to be, because some of team talent is as yet unpredictable -- injuries, trades, prospects, etc.)

Some of that might be a reaction to bettor expectations, but probably not much. I'm comfortable assuming that Bovada thinks the talent distribution has compressed by around one win.

Maybe, then, we should expect a talent SD of 8.0 wins, rather than the historical norm of 9.0. That's more reasonable than expecting the 2013 value of 10.2, or the 2014 value of 7.0. 

If the SD of talent is 8, and the SD of luck is 6.4 as usual, that means we should expect the SD of this year's standings to be 10.2. That seems reasonable. 

------

Anyway, this is all kind of confusing. Let me try to summarize everything more understandably.

------

The distribution of team wins was much tighter in 2014 than in 2013. As I see it, there are six different factors that could have contributed to the increase in standings parity:

-- 1. Player movement from 2013 to 2014 brought better players to the worse teams (to a larger extent than normal), due to changes in the economics of MLB.

-- 2. Player movement from 2013 to 2014 brought better players to the worse teams (to a larger extent than normal), due to "random" reasons -- for instance, prospects maturing and improving faster for the worse teams.

-- 3. There was more randomness than usual in 2013, which caused us to overestimate disparities in team talent.

-- 4. There was less randomness than usual in 2014, which caused us to underestimate disparities in team talent.

-- 5. Randomness in 2013 favored the better teams, which caused us to overestimate disparities in team talent.

-- 6. Randomness in 2014 favored the worse teams, which caused us to underestimate disparities in team talent.

Of these six possibilities, only #1 would suggest that the increase in parity is real, and should be expected to repeat in 2015. 

#3 through #6 suggest that the 2013 and/or 2014 standings were random aberrations, and would suggest that 2015 will look more like the historical norm (SD of 11 games) than like 2013 (SD of 12 games) or 2014 (SD of 9.4). 

Finally, #2 is a hybrid -- a one-time random "shock to the system," but with hangover effects into future seasons. If, for instance, the bad teams just happened to have great prospects arrive in 2014, those players will continue to perform well into 2015 and beyond. Eventually, the economics of the game will push everything back to equilibrium, but that won't happen immediately, so much of the 2014 increase in parity could remain.

------

Here's my "gut" breakdown of the contribution each of those six factors:

25% -- #1, changes in talent for economic reasons
 5% -- #2, random changes in talent
10% -- #3, "too much" luck in 2013
20% -- #4, "too little" luck in 2014
10% -- #5, luck favoring the good teams in 2013
30% -- #6, luck favoring the bad teams in 2014

Caveats: (1) This is just my gut; (2) the percentages don't have any actual meaning; and (3) I could easily be wrong.

If you don't care about the reasons, just the bottom line, that breakdown won't mean anything to you. 

As I said, my gut for the bottom line is that it seems reasonable to expect 2015 to end with a standings SD of 10.2 wins ... based on the changes in the Vegas odds.

But ... if there were an over/under on that 10.2, my gut would say to take the over. Even after all these arguments -- which I do think make sense -- I still have this nagging worry that I might just be getting fooled by randomness.




Wednesday, March 25, 2015

Does WAR undervalue injured superstars?

In an article last month, "Ain't Gonna Study WAR No More" (subscription required), Bill James points out a flaw in WAR (Wins Above Replacement), when used as a one-dimensional measure of player value. 

Bill gives an example of two hypothetical players with equal WAR, but not equal value to their teams. One team has Player A, an everyday starter who created 2.0 WAR over 162 games. Another team has Player B, a star who normally produces at the rate of 4.0 WAR, but one year created only 2.0 WAR because he was injured for half the season.

Which player's team will do better? It's B's team. He creates 2.0 WAR, but leaves half the season for someone from the bench to add more. And, since bench players create wins at a rate higher than 0.0 -- by definition, since 0.0 is the level of player that can be had from AAA for free -- you'd rather have the half-time player than the full-time player.

This seems right to me, that playing time matters when comparing players of equal WAR. I think we can tweak WAR to come up with something better. And, even if we don't, I think the inaccuracy that Bill identified is small enough that we can ignore it in most cases.

------

First: you have to keep in mind what "replacement" actually means in the context of WAR. It's the level of a player just barely not good enough to make a Major League roster. It is NOT the level of performance you can get off the bench.

Yes, when your superstar is injured, you often do find his replacement on the bench. That causes confusion, because that kind of replacement isn't what we really mean when we talk about WAR.

You might think -- *shouldn't* it be what we mean? After all, part of the reason teams keep reasonable bench players is specifically in case one of the regulars gets injured. There is probably no team in baseball that, when its 4.0 WAR player goes down the first day of the season, can't replace at least a portion of those wins with an available player. So if your centerfielder normally creates 4.0 WAR, but you have a guy on the bench who can create 1.0 WAR, isn't the regular really only worth 3.0 wins in a real-life sense?

Perhaps. But then you wind up with some weird paradoxes.

You lease a blue Honda Accord for a year. It has a "VAP" (Value Above taking Public Transit) of, say, $10,000. But, just in case the Accord won't start one morning, you have a ten-year-old Sentra in the garage, which you like about half as much.

Does that mean the Accord is only worth $5,000? If it disappeared, you'd lose its $10,000 contribution, but you'd gain back $5,000 of that from the Sentra. If you *do* think it's only worth $5,000 ... what happens if your neighbor has an identical Accord, but no Sentra? Do you really want to decide that his car is twice as valuable as yours?

It's true that your Accord is worth $5,000 more than what you would replace it with, and your neighbor's is worth $10,000 more than what he would replace it with. But that doesn't seem reasonable as a general way to value the cars. Do you really want to say that Willie McCovey has almost no value just because Hank Aaron is available on the bench?

------

There's also another accounting problem, one that commenter "Guy123" pointed out on Bill's site. I'll use cars again to illustrate it.

Your Accord breaks down halfway through the year, for a VAP of $5,000. Your mother has only an old Sentra, which she drives all year, for an identical VAP of $5,000.

Bill James' thought experiment says, your Accord, at $5,000, is actually worth more than your mother's Sentra, at $5,000 -- because your Accord leaves room for your own Sentra to add value later. In fact, you get $7,500 in VAP -- $5,000 from half a year of the Accord, and $2,500 from half a year of the Sentra.

Except that ... how do you credit the Accord for the value added by the Sentra? You earned a total of $7,500 in VAP for the year. Normal accounting says $5,000 for the Accord, and $2,500 for the Sentra. But if you want to give the Accord "extra credit," you have to take that credit away from the Sentra! Because, the two still have to add up to $7,500.

So what do you do?

------

I think what you do, first, is not base the calculation on the specific alternatives for a particular team. You want to base the calculation on the *average* alternative, for a generic team. That way, your Accord winds up worth the same as your neighbor's.

You can call that, "Wins Above Average Bench." If only 1 in 10 households has a backup Sentra, then the average alternative is one tenth of $5,000, or $500. So the Accord has a WAAB of $9,500.

All this needs to happen because of a specific property of the bench -- it has better-than-replacement resources sitting idle.

When Jesse Barfield has the flu, you can substitute Hosken Powell for "free" -- he would just be sitting on the bench anyway. (It's not like using the same starting pitcher two days in a row, which has a heavy cost in injury risk.)

That wouldn't be the case if teams didn't keep extra players on the bench, like if the roster size for batters were fixed at nine. Suppose that when Jesse Barfield has the flu, you have to call Hosken Powell up from AAA. In that case, you DO want Wins Above Replacement. It's the same Hosken Powell, but, now, Powell *is* replacement, because replacement is AAA by definition.

Still, you won't go too wrong if you just stick to WAR. In terms of just the raw numbers, "Wins Above Replacement" is very close to "Wins Above Average Bench," because the bottom of the roster, the players that don't get used much, is close to 0.0 WAR anyway.

For player-seasons between 1982 and 1991, inclusive, I calculated the average offensive expectation (based on a weighted average of surrounding seasons) for regulars vs. bench players. Here are the results, in Runs Created per 405 outs (roughly a full-time player-season), broken down by "benchiness" as measured by actual AB that year:

500+ AB: 75
401-500: 69
301-400: 65
201-300: 62
151-200: 60
101-150: 59
 76-100: 45
 51- 75: 33

A non-superstar everyday player, by this chart, would probably come in at around 70 runs. A rule of thumb is that everyday players are worth about 2.0 WAR. So, 0.0 WAR -- replacement level -- would be about 50 runs.

The marginal bench AB, the ones that replace the injured guy, would probably come from the bottom four rows of the chart -- maybe around 55. That's 5 runs above replacement, or 0.5 wins. 

So, the bench guys are 0.5 WAR. That means when the 4.0 guy plays half a season, and gets replaced by the 0.5 guy for the other half, the combination is worth 2.25 WAR, rather than 2.0 WAR. As Bill pointed out, the WAR accounting credits the injured star with only 2.0, so he comes out looking merely equal to the full-time guy.

But if we switch to WAAB ... now, the full-time guy is 1.5 WAAB (2.0 minus 0.5). The half-time star is 1.75 WAAB (4.0 minus 0.5, all divided by 2). That's what we expected: the star shows more value.
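
Here's that bookkeeping as a few lines of Python, if it helps (a sketch; the 0.5 bench rate is the estimate from the chart above):

def spot_war(player_rate, fraction_played, bench_rate=0.5):
    """Total WAR from the roster spot: the player, plus bench fill-in."""
    return player_rate * fraction_played + bench_rate * (1 - fraction_played)

def waab(player_rate, fraction_played, bench_rate=0.5):
    """Wins Above Average Bench for the player alone."""
    return (player_rate - bench_rate) * fraction_played

print(spot_war(2.0, 1.0), spot_war(4.0, 0.5))   # 2.0 vs. 2.25
print(waab(2.0, 1.0), waab(4.0, 0.5))           # 1.5 vs. 1.75 -- the star comes out ahead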

But: not by much. 0.25 wins is 2.5 runs, which is a small discrepancy compared to the randomness of performance in general. And even that discrepancy is rare, since something as large as a quarter of a win only shows up when a superstar loses half the season to injury. The only time when it's large and not rare is probably star platoon players -- but there aren't too many of those.

(The biggest benefit to accounting for the bench might be when evaluating pitchers, who, unlike hitters, vary quite a bit in how much they're physically capable of playing.)

I don't see it as that big a deal at all. I'd say, if you want, when you're comparing two batters, give the less-used player a bonus of 0.1 WAR for each 100 AB less playing time. 

Of course, that estimate is very rough ... the 0.1 wins could easily be 0.05, or 0.2, or something. Still, it's going to be fairly small -- small enough that I'd bet it wouldn't change too many conclusions that you'd reach if you just stuck to WAR.





        


Friday, March 06, 2015

Why is the bell curve so pervasive?

Why do so many things seem to be normally distributed (bell curved)? That's something that bothered me for a long time. Human heights are (roughly) normally distributed. So are weights of (presumably identical) bags of potato chips, basketball scores, blood pressure, and a million other things, seemingly unrelated.

Well, I was finally able to "figure it out," in the sense of, finding a good intuitive explanation that satisfied my inner "why". Here's the explanation I gave myself. It might not work for you -- but you might already have your own.

------

Imagine a million people each flipping a coin 100 times, and reporting the number of heads they got. The distribution of those million results will have a bell-shaped curve with a mean around 50. (Yes, the number of heads is discrete and the bell curve is continuous, but never mind.)  

In fact, you can prove, mathematically, that you should get something very close to the normal distribution. But is there an intuitive explanation that doesn't need that much math?

My explanation is: the curve HAS to be bell-shaped. There's no alternative based on what we already know about it.

-- First: you probably want the distribution to be curved, and not straight lines. I guess you could expect something kind of triangular, but that would be weird.

-- Second: the curve can never go below the horizontal axis, since probabilities can't be negative.

-- Third: the curve has to be highest at 50, and always go lower when you move farther from the center -- which means, at the extremes, it gets very, very close to zero without ever touching it.

That means we're stuck with this:

[Graph: blank axes with the constraints penciled in -- a single peak at 50, and tails forced down toward zero at both extremes.]

How do you fill that in without making something that looks like a bell? You can't. 

This line of thinking -- call it the "fill in the graph" argument -- doesn't prove it's the normal distribution specifically. It just explains why it's a bell curve. But, I didn't have a mental image of the normal distribution as different from other bell-shaped curves, so it's close enough for my gut. In fact, I'm just going to take it as a given that it's the normal distribution, and carry on.

(By the way, if you want to see the normal distribution arise magically from the equivalent of coin flips, see the video here.) 

-----

That's fine for coin flips. But what about all those other things? Say, human height? We still know it's a bell-shaped curve from the same "fill in the graph" argument, but how do we know it's the same one as coins? After all, a single human's height isn't the same thing as flipping a quarter 100 times. 

My gut explanation is ... it probably *is* something like coin flips. Imagine that the average adult male is 5' 9". But there may be (say) a million genes that move that up or down. Suppose that for each of those genes, if it shows "heads," you get to be 1/360 of an inch taller. If the gene shows "tails," you're 1/360 of an inch shorter.

If that's how it works, and each gene is independent and random, the population winds up following a normal distribution with a standard deviation of around 2.8 inches, which is roughly the real-world number.

It seems reasonable to me, intuitively, to think that the genetics of height probably do work something like this oversimplified example. 
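
If you want to see it work, here's the toy model as a quick simulation. (I've scaled it down so it runs fast: 400 genes at 0.14 inches each gives the same SD -- the square root of the number of genes, times the effect -- as a million genes at 1/360 of an inch.)

import random
import statistics

N_PEOPLE, N_GENES, EFFECT = 10_000, 400, 0.14   # sqrt(400) * 0.14 = 2.8

heights = [
    69 + sum(random.choice((+EFFECT, -EFFECT)) for _ in range(N_GENES))
    for _ in range(N_PEOPLE)
]
print(round(statistics.mean(heights), 1))    # about 69.0 inches -- 5' 9"
print(round(statistics.stdev(heights), 1))   # about 2.8 inches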

-----

How does this apply to weights of bags of chips? Same idea. The chips are processed on complicated machinery, with a million moving parts. They aren't precise down to the last decimal place. If there are 1,000 places on the production line where the bag might get a fraction heavier or lighter, the coin-flip model works fine.

-----

But for test scores, the coin-flip model doesn't seem to work very well. People have different levels of skill with which they pick up the material, and different study habits, and different reactions to the pressure of an exam, and different speeds at which they write. There's no obvious "coin flipping" involved in the fact that some students work hard, and some don't bother too much.

But there can be coin flipping involved in some of those other things. Different levels of skill could be somewhat genetic, and therefore normally distributed. 

And, most of those other things have to be *roughly* bell-shaped, too, by the "fill in the graph" argument: the extremes can't go below zero, and the curve needs to drop consistently on both sides of the peak. 

So to get the final test result, you're adding the bell-shaped curve for ability, plus the bell-shaped curve for speed, plus the bell-shaped curve for industriousness, and so on.

When you add variables that are normally distributed, the sum is also normally distributed. Why? Well, suppose ability is the equivalent of the sum of 1000 coin flips. And industriousness is the equivalent of the sum of 350 coin flips. Then, "ability plus industriousness" is just the sum of 1350 coin flips -- which is still a bell curve.

My guess is that there are a lot of things in the universe that work this way, and that's why they come out normally distributed. 

If you want to go beyond genetics ... well, there are probably a million environmental factors, too. Going back to height ... maybe, the more you exercise, the taller you get, by some tiny fraction. (Maybe exercise burns more calories, which makes you hungrier, and it's the nutrition that helps you get taller. Whatever.)  

Exercise could be normally distributed, too, or at least many of its factors might. For instance, how much exercise you get might partly depend on, say, how far you had to walk to school. That, itself, has to roughly be a bell curve, by the same old "fill in the graph" argument.

------

What makes bell curves even more ubiquitous is that you get bell curves even if you start with something other than coin flips.

Take, for instance, the length of a winning streak in sports. That isn't a coin flip, and it isn't, itself, normally distributed. The most frequent streak is 0 wins, then 1, then 2, and so on. The graph would look something like this (stolen randomly from the web):

[Graph: frequency of streak lengths -- the tallest bar at zero wins, with each longer streak roughly half as common as the one before.]

But, the thing is: the distribution of one winning streak doesn't look normal at all -- yet if you add up, say, a million winning streaks, the result WILL be normally distributed. That's the most famous result in statistics, the "Central Limit Theorem," which says that if you add up enough identical, independent random variables (with finite variance), you always get a normal curve.
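
You can watch that happen with a quick simulation -- a sketch, assuming a .500 team, so a streak of k wins has probability 1 in 2^(k+1):

import random
from collections import Counter

def streak():
    """Length of one winning streak for a coin-flip team."""
    n = 0
    while random.random() < 0.5:
        n += 1
    return n

# one streak is geometric, not bell-shaped; a sum of 1,000 streaks is a bell
sums = [sum(streak() for _ in range(1000)) for _ in range(10_000)]
histogram = Counter(s // 20 * 20 for s in sums)
for bucket in sorted(histogram):
    print(f"{bucket:5d} {'*' * (histogram[bucket] // 40)}")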

Why? My intuitive explanation is: the winning streak totals reflect, roughly, the same underlying logic as the coin flips.

Suppose you're figuring out how to get 50 heads out of 100 coins. You say, "well, all the odd flips might be heads. All the even flips might be heads. The first 50 might be heads, and the last 50 might be tails ... " and so on.

For winning streaks: Suppose you're trying to figure out how to get a total of (say) 67 wins out of 100 streaks. You say, "well, maybe all the odd streaks are 0, and all the low even streaks are 1, and streak number 100 is a 9-gamer, and streak number 98 is a 7-gamer, and so on. Or, maybe the EVEN streaks are zero, and the high ODD streaks are the big ones. Or maybe it's the LOW odd streaks that are the big ones ... " and so on.

In both cases, you calculate the probabilities by choosing combinations that add up. It's the fact that the probabilities are based on combinations that makes things come out normal. 

------

Why is that? Why does the involvement of combinations lead to a normal distribution? For that, the intuitive argument involves some formulas (but no complicated math). 

This is the actual equation for the normal distribution:


f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi} } e^{ -\frac{(x-\mu)^2}{2\sigma^2} }

It looks complicated. It's got pi in it, and e (the transcendental number 2.71828...), and exponents. How does all that show up, when we're just flipping coins and counting heads?

It comes from the combinations -- specifically, the factorials they contain. 

The binomial probability of getting exactly 50 heads in 100 coin tosses is:

\binom{100}{50} \left( \frac{1}{2} \right)^{100} = \frac{100!}{50! \; 50!} \cdot \frac{1}{2^{100}}

It turns out that there is a formula, "Stirling's Approximation," which lets you substitute for the factorials. You can rewrite n! this way:

n! \sim \sqrt{2 \pi n}\left(\frac{n}{e}\right)^n

It's only strictly equal as n approaches infinity, but it's very, very close for any value of n. 

Stick that in where the factorial would go, and do some algebra manipulation, and the "e" winds up flipping from the denominator to the numerator, and the "square root of 2 pi" flips from the numerator to the denominator ... and you get something that looks really close to the normal distribution. Well, I'm pretty sure you do; I haven't tried it myself. 
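
For what it's worth, here's a quick numerical check that the pieces line up -- Stirling is very close, and the exact binomial probability does match the normal density (mean 50, SD 5) at the peak:

import math

def stirling(n):
    return math.sqrt(2 * math.pi * n) * (n / math.e)**n

print(stirling(50) / math.factorial(50))   # 0.9983... -- very close

exact = math.comb(100, 50) / 2**100        # exact binomial probability
normal = 1 / (5 * math.sqrt(2 * math.pi))  # normal density at the peak, x = 50
print(round(exact, 4), round(normal, 4))   # 0.0796 vs. 0.0798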

I don't have to ... at this point, my gut is happy. My sense of "I still don't understand why" is satisfied by seeing the Stirling formula, and seeing how the pi and e come out of the factorials in roughly the right place. 



(UPDATE, 3/8/2015: I had originally said, in the first paragraph, that test scores are normally distributed.  In a tweet, Rodney Fort pointed out that they're actually *engineered* to be normally distributed. So, not the best example, and I've removed it.)


Sunday, March 01, 2015

Two nice statistical analogies

I'm always trying to find good analogies to help explain statistical topics. Here are a couple of good ones I've come across lately, which I'll add to the working list I keep in my brain.


1.

Here's Paul Bruno, explaining why r-squared is not necessarily a good indicator of whether or not something is actually important in real life:

"Consider 'access to breathable oxygen'. If you crunched the numbers, you would likely find that access to breathable oxygen accounts for very little – if any – of the variation in students' tests scores. This is because all students have roughly similar access to breathable oxygen. If all students have the same access to breathable oxygen, then access to breathable oxygen cannot 'explain' or 'account for' the differences in their test scores.

"Does this mean that access to breathable oxygen is unimportant for test scores? Obviously not. On the contrary: access to breathable oxygen is very important for kids’ test scores, and this is true even though access to breathable oxygen explains ≈ 0% of their variation."

Great way to explain it, and an easy way to understand why, if you want to see if a factor is important in a "breathable oxygen" kind of way, you need to look at the regression coefficient, not the r-squared.


2.

This sentence comes from Jordan Ellenberg's mathematics book, "How Not To Be Wrong," which I talked a bit about last post:


"The significance test is the detective, not the judge."

I like that analogy so much I wanted to start by putting it by itself ... but I should add the previous sentence for context:


"A statistically significant finding [only] gives you a clue, suggesting a promising place to focus your research energy. The significance test is the detective, not the judge."         [page 161, emphasis in original]

(By the way, Ellenberg doesn't put a hyphen in the phrase "statistically-significant finding," but I normally do. Is the non-hyphen a standard one-off convention, like "Major League Baseball?")

(UPDATE: this question now answered in the comments.)

The point is: even when there's no real effect, one experiment in twenty will produce a statistically-significant result just by random chance. So, statistical significance doesn't mean you can just leap to the conclusion that your hypothesis is true. You might be one of that "lucky" five percent. To be really confident, you need to wait for replications, or find other ways to explore further.

I had previously written an analogy that's kind of similar:


"Peer review is like the police deciding there's enough evidence to lay charges. Post-publication debate is like two lawyers arguing the case before a jury."

Well, it's really the district attorney who has the final say on whether to lay charges, right?  In that light, I like Ellenberg's description of the police better than mine. Adding that in, here's the new version:  


"Statistical significance is the detective confirming a connection between the suspect and the crime. Peer review is the district attorney deciding there's enough evidence to lay charges.  Post-publication debate is the two lawyers arguing the case before a jury."

Much better, I think, with Ellenberg's formulation in there too.







Friday, February 20, 2015

Replacing "statistically significant"

In his recent book, "How Not To Be Wrong," mathematician Jordan Ellenberg writes about how the word "significant" means something completely different in statistics than it does in real life:

"In common language it means something like 'important' or 'meaningful.' But the significance test scientists use doesn't measure importance ... [it's used] merely to make a judgment that the effect is not zero. But the effect could still be very small -- so small that the drug isn't effective in any sense that an ordinary non-mathematical Anglophone would call significant. ...

"If only we could go back in time to the dawn of statistical nomenclature and declare ... 'statistically noticeable' or 'statistically detectable' instead of 'statistically significant!'"

I absolutely agree.

In fact, in my view, the problem is even more serious the other way, when there is *no* statistical significance. Researchers will say, "we found no statistically-significant effect," which basically means, "we don't have enough evidence to say either way." But readers will take that as meaning, "we found at best a very small effect." That's not necessarily the case. Studies often find values that would be very significant in the real world, but reject them because the confidence interval is wide enough to include zero. 

-----

Tom Tango will often challenge readers to put aside "inertial reasoning" and consider how we would redesign baseball rules if we were starting from scratch. In that tradition, how would we redo the language of statistical significance?

I actually spent a fair bit of time on this a year or so ago. I went to a bunch of online thesauruses, and wrote down every adjective that had some kind of overlap with "significant." Looking at my list ... I notice I actually didn't include Ellenberg's suggestions, "noticeable" or "detectable." Those are very good candidates. I'll add those now, along with a few of their synonyms.

OK, done. Here's my list of possible candidates:

convincing, decisive, unambiguous, probable cause, suspicious, definite, definitive, adequate, upholdable, qualifying, sufficing, signalling, salient, sufficient, defensible, sustainable, marked, rigorous, determinate, permissible, accreditable, attestable, credentialed, credence-ive, credible, threshold, reliable, presumptive, persuasive, confident, ratifiable, legal, licit, sanctionable, admittable, acknowledgeable, endorsable, affirmative, affirmable, warrantable, conclusive, valid, assertable, clear, ordainable, non-spurious, dependable, veritable, creditable, avowable, vouchable, substantive, noticeable, detectable, perceivable, discernable, observable, appreciable, ascertainable, perceptible

You can probably divide these into classes, based on shades of meaning:

1. Words that mean "enough to be persuasive." Some of those are overkill, some are underkill. "Unambiguous," for instance, would be an obvious oversell; you can have a low p-value and still be pretty ambiguous. On the other hand, "defensible" might be a bit too weak. Maybe "definite" is the best of those, suggesting precision but not necessarily absolute truth.

2. Words that mean "big enough to be observed." Those are the ones that Ellenberg suggested, "noticeable" and "detectable." Those seem fine when you actually find significance, but not so much when you don't. "We find no relationship that is statistically detectable" does seem to imply that there's nothing there, rather than that you just don't have enough data in your sample.

3. Words that mean "enough evidence." That's exactly what we want, except I can't think of any that work. The ones in the list aren't quite right. "Probable cause" is roughly the idea we're going for, but it's awkward and sounds too Bayesian. "Suspicious" has the wrong flavor. "Credential" has a nice ring to it -- as an adjective, not a noun, meaning "to have credence." You could say, for instance, "We didn't have enough evidence to get a credential estimate."  Still a bit awkward, though. "Determinate" is pretty good, but maybe a bit overconfident.

Am I missing some? I tried to think, what's the word we use when we say an accused was acquitted because there wasn't enough evidence? "Insufficient" is the only one I can think of. Everything else is a phrase -- "beyond a reasonable doubt," or "not meeting the burden of proof."

4. Words that mean "passing an objective level," as in meeting a threshold. Actually, "threshold" as an adjective would be awkward, but workable -- "the coefficient was not statistically threshold." There's also "adequate," and "qualifying," and "sufficient," and "sufficing." 

5. Finally, there are words that mean "legal," in the sense of, "now the peer reviewers will permit us to treat the effect as legitimate." Those are words like "sanctionable," "admittable," "acknowledgeable," "permissible," "ratifiable," and so on. My favorite of these is "affirmable." You could write, "The coefficient had a p-value of .06, which falls short of statistical affirmability." The reader now gets the idea that the problem isn't that the effect is small -- but, rather, that there's something else going on that doesn't allow the researcher to "affirm" it as a real effect.

What we'd like is a word that has a flavor matching all these shades of meaning, without giving the wrong idea about any of them. 

So, here's what I think is the best candidate, which I left off the list until now:

Dispositive. 

"Dispositive" is a legal term that means "sufficient on its own to decide the answer." If a fact is dispositive, it's enough to "dispose" of the question.

Here's a perfect example:

"Whether he blew a .08 or higher on the breathalyzer is dispositive as to whether he will be found guilty of DUI."

It's almost exact, isn't it? .08 for a conviction, .05 for statistical significance.

I think "dispositive" really captures how statistical significance is used in practice -- as an arbitrary standard, a "bright line" between Yes and No. We don't allow authors to argue that their study is so awesome that p=.07 should really be allowed to be considered significant, any more than we allow defendants to argue that should be acquitted at a blood alcohol level of .09 because they're especially good drivers. 

Moreover, the word works right out of the box in its normal English definition. Unlike "significant," the statistical version of "dispositive" has the same meaning as the usual one. If you say to a non-statistician, "the evidence was not statistically dispositive," he'll get the right idea -- that an effect was maybe found, but there's not quite enough there for a decision to be made about whether it's real or not. In effect, the question is not yet decided. 

That's the same as in law. "Not dispositive" means the evidence or argument is a valid one, but it's not enough on its own to decide the case. With further evidence or argument, either side could still win. That's exactly right for statistical studies. A "non-significant" p-value is certainly relevant, but it's not dispositive evidence of presence, and it's not dispositive evidence of absence. 

Another nice feature is that the word still kind of works when you use it to describe the effect or the estimate, rather than the evidence: 

"The coefficient was not statistically dispositive."

It's not a wonderful way to put it, but it's reasonable. Most of the other candidate words don't work well both ways at all -- some are well-suited only to describing the evidence, others only to describing the estimates. These don't really make sense:

"The evidence was not statistically detectable."  
"The effect was not statistically reliable."
"The coefficient was not statistically accreditable."

Another advantage of "dispositive" is that unlike "significant," you can leave out the word "statistical" without ambiguity:

"The evidence was not dispositive."
"The coefficient was not dispositively different from zero."

Those read fine, don't they? I bet they'd almost always read fine. I'd bet that if you were to pick up a random study, and do a global replace of "statistically significant" with "dispositive," the paper wouldn't suffer at all. (It might even be improved, if the change highlighted cases where "significant" was used in ways it shouldn't have been.)

-----

When I'm finally made Global Despotic Emperor of Academic Standards, the change of terminology will be my first official decree.

Unless someone has a better suggestion. 


