Thursday, April 25, 2013

A breakdown of the luck in MLB season records

If you take a .500 baseball team and "flip" it 162 times, like a coin, you should expect it to come up "win" 81 times.  But that will vary -- sometimes it'll win fewer, and sometimes it'll win more.  You can calculate that the distribution of wins is approximately normal, with a mean of 81 and a standard deviation of 6.36.

Using the rule of thumb that 95% of observations fall within two standard deviations of the mean, you can figure that, around one time in 20, that team will win 94 or more games, or 68 or fewer, just by luck alone.
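
Those two numbers are easy to verify with a few lines of Python:

```python
# Quick check of the two numbers above: the SD of wins for a true .500
# team over 162 games, and the chance of a 94-plus or 68-or-fewer season.
from math import comb, sqrt

sd = sqrt(162 * 0.5 * 0.5)                                      # ~6.36 wins
p_high = sum(comb(162, k) for k in range(94, 163)) * 0.5 ** 162
p_low = sum(comb(162, k) for k in range(0, 69)) * 0.5 ** 162
print(round(sd, 2), round(p_high + p_low, 3))                   # 6.36, ~0.05
```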

Where does that luck show up?

The way I see it, you can break it up into five mutually exclusive components (as I described in a previous post):

1.  The team's hitters could have better or worse performances than their talent expectation -- that is, "career years" in either direction -- in terms of their basic batting line.

2.  The team's pitchers could do the same (that is, the opposing team's batters could have "career years").

3.  The team could score more (or fewer) runs than expected from its composite batting line.  In other words, it could beat its Runs Created (or Linear Weights, or Base Runs) estimate.  That usually happens if the team hits better (or worse) than expected in high-leverage situations in terms of runs scoring -- such as, for instance, bases loaded and two outs.

4.  The opposition could do the same.

5.  The team could win more (or fewer) games than expected from the number of runs it scored and allowed.  In other words, it could beat its Pythagorean projection (or, alternatively, the "10 runs equals a win" rule of thumb).  That usually happens when a team scores more runs in high-leverage situations in terms of game outcome -- such as, for instance, with the score tied in the ninth inning.

If my logic is right, those five calculations cover all the binomial luck, and none of them overlap (that is, no luck is counted twice).

The question I spent the last couple of days looking at is: how does the overall variation break down into those five parts?  That is, which of the five types is most important?  It wasn't that hard to figure out; I'm not sure why I didn't do it years ago.

------

First, the "career years" thing.  How do you figure that out?  Well, it's pretty easy to get an estimate.  I took the overall MLB stats for the 1984 season (1984 chosen arbitrarily), and divided by 26 to get a team average.  It worked out that there were 25.27 hitless at-bats per game, per team.  (It's not 27 because of bottom-of-the-ninth issues, outs made on base, double plays, and so on.)

So, I ran a little simulation.  I created random plate appearances, with league-average probabilities, until I got to 25.27 times 162 batting outs.  Then, I calculated the Linear Weights runs for that simulated batting line.  (I used weights of .47/.85/1.02/1.4/0.33 for 1B/2B/3B/HR/BB.  The value assigned to the out didn't matter here, because every season had the same number of outs.)
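
Here's a rough sketch of that simulation.  The per-PA event probabilities below are just illustrative guesses at league-average rates -- not the exact 1984 figures -- but the structure (draw plate appearances until the outs run out, then apply the weights) is the same:

```python
# Sketch of the "career year" simulation.  The event probabilities are
# illustrative guesses, not the actual 1984 league rates.
import random

WEIGHTS = {"1B": 0.47, "2B": 0.85, "3B": 1.02, "HR": 1.40, "BB": 0.33}
ON_BASE = [("1B", 0.160), ("2B", 0.043), ("3B", 0.005),
           ("HR", 0.022), ("BB", 0.085)]          # everything else is an out
OUTS_PER_SEASON = round(25.27 * 162)              # ~4094 batting outs

def sim_season(rng):
    """One team-season of random PAs; returns its Linear Weights runs."""
    runs, outs = 0.0, 0
    while outs < OUTS_PER_SEASON:
        r, cum, event = rng.random(), 0.0, None
        for name, p in ON_BASE:
            cum += p
            if r < cum:
                event = name
                break
        if event is None:
            outs += 1                             # a batting out
        else:
            runs += WEIGHTS[event]
    return runs

rng = random.Random(1984)
seasons = [sim_season(rng) for _ in range(2000)]
mean = sum(seasons) / len(seasons)
sd = (sum((x - mean) ** 2 for x in seasons) / len(seasons)) ** 0.5
print(round(sd, 1))    # lands in the low 30s with these made-up rates
```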

In that simulation, the standard deviation of runs was 31.9.  So, that's my estimate of how much random variation there is in terms of "career years".  It's 31.9 for the team's hitters, and 31.9 for the team's pitchers.

------

For beating the "runs created" estimate, I looked at real-life data.  I actually used Linear Weights instead, because I think it's more accurate.  I used the above weights for the basic events, and I calculated the value of the batting out separately for each season.  (I probably should have done it by league-season, but I don't feel like redoing it.)

I think I did 1960 to 2001, omitting strike seasons.

The standard deviation of Linear Weights luck was 23.9 runs.  I did only batting, because I didn't have detailed statistics handy for opposition batting.  But, I'm going to assume that would have worked out about the same.

------

Finally, for Pythagoras, I looked at real-life teams from 1973 to 2001 (again omitting strike seasons).  The standard error was 3.91 wins, which I converted to 39.1 runs using the 10-runs-per-win rule of thumb.
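
For reference, the per-team calculation looks something like this (the standard exponent-2 Pythagorean formula; the example team is hypothetical):

```python
# Pythagoras luck for one team: actual wins minus the Pythagorean
# expectation, using the standard exponent of 2.
def pythag_luck(wins, runs_scored, runs_allowed, games=162):
    expected_pct = runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)
    return wins - expected_pct * games

# Hypothetical example: 88 wins on 750 runs scored, 700 allowed.
print(round(pythag_luck(88, 750, 700), 1))   # about +1.4 wins of luck
```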

------

So, here are the results:

31.9 -- career years by hitters
31.9 -- career years by pitchers
23.9 -- linear weights luck for hitters
23.9 -- linear weights luck for opposition hitters
39.1 -- Pythagoras luck

To get the overall SD, you square the five numbers, add them up, and take the square root.  If you do that, you get 68.6.  That's somewhat higher than what we expected, which was 63.6 runs -- the 6.36-win SD from the top of the post, times 10 runs per win.

68.6 -- five categories combined
63.6 -- theoretical expectation
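
In code, that combination is just a root-sum-of-squares:

```python
# Combine the five (assumed independent) components in quadrature.
from math import sqrt
components = [31.9, 31.9, 23.9, 23.9, 39.1]
print(round(sqrt(sum(c * c for c in components)), 1))   # 68.6
```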

Why the difference?  I'm not sure.  Some possibilities:

1.  Linear Weights is known to overestimate luck for very bad and very good teams, and underestimate it for medium teams.  That would inflate those two SDs, a bit.

2.  Teams that have good Linear Weights Luck score more runs.  Teams that score more runs play fewer bottom-of-the-ninths on offense, and more bottom-of-the-ninths on defense.  That would compress their run differential, which would make them look like they had more Pythagoras luck than they did.  That is: we *are* double counting a little bit of luck in this case, due to the fact that Pythagoras doesn't take innings into account, just games.  

3.  I used real-life teams for Pythagoras luck.  But, in real life, there are things that make a team beat its Pythagoras that have nothing to do with luck.  For instance, a team with much better relief pitchers will be able to hold on to small leads, and win more games than expected.  Also, managers who make blowouts worse by using their worst pitchers to mop-up will show more apparent Pythagoras luck, by giving up more runs that don't affect who wins.

4.  The same thing could be true for Linear Weights luck.  Linear Weights assumes that each individual event -- a single, say -- occurs in a league-average context of runners on base.  But teams that score primarily by the home run hit in a below-average context (to see why, imagine that a team hits ONLY home runs.  Each will be worth only 1.0 runs, instead of the 1.4 the formula thinks it's worth).  And teams that score primarily by singles probably hit in an above-average context.  So, that would tend to magnify the errors, in either direction.

I suspect the Pythagoras is the biggest thing ... maybe what I'll do, eventually, is build artificial seasons out of real-life games chosen randomly from all the different teams, and calculate the Pythagoras error that way.  And maybe I'll do the same thing for Linear Weights.  I'm guessing that will bring the numbers down at least a bit.  Whether the total will drop from 68.6 all the way to 63.6 ... well, I doubt it, but you never know.

------

If you like talking in terms of variances -- that is, r-squareds -- or you like it when things add up to 100 percent, here are the five variances as a percentage of the total.  (I'll also change "linear weights luck" to "cluster luck", in honor of Joe Peta, since Linear Weights luck is the result of clutch hitting, which means clustering of offensive events.)

22% -- career years by hitters
22% -- career years by pitchers
12% -- cluster luck by offense
12% -- cluster luck by opposition offense
32% -- Pythagoras luck

And, combining offense/defense:

44% -- career years by team's players
24% -- overall cluster luck
32% -- Pythagoras luck

I suspect that after that other simulation I plan on doing, the career years numbers will wind up a bit higher, and the others a bit lower.  But I think this will still be pretty close.  



Monday, April 22, 2013

Do athletes have shorter lifespans?

According to this article in Pacific Standard magazine, athletes have shorter lifespans than those in other occupations.

The article cites a recent academic study (.pdf) that looked at 1,000 consecutive obituaries in the New York Times.  That study 


" ...found the youngest average age of death was among athletes (77.4 years), performers (77.1 years), and non-performers who worked in creative fields, such as authors, composers, and artists (78.5 years). The oldest average age of death was found among people famous for their work in politics (82.1 years), business (83.3 years), and the military (84.7 years)."

The authors of the study say,


"... our data raise the intriguing speculation that young people contemplating certain careers (e.g. performing arts and professional sports) may be faced, consciously or otherwise, with a faustian choice: namely, 1. to maximize their career potential and  competitiveness even though the required psychological and physical costs may be expected to shorten their longevity, or 2. to fall short of their career potential so as to balance their lives and permit a normal lifespan."

But: isn't there a selective sampling problem here?

To appear in a New York Times obituary, you have to be relatively famous, or, at least, have passed a certain standard of fame or accomplishment in your chosen field.

If your chosen field is athletics, you reach that threshold early in your life -- in your 30s, say.  Wayne Gretzky, Ken Griffey Jr., Bjorn Borg.  If your field is business, you probably have to reach the level of CEO to make the Times.  The median age of a CEO in the S&P 500 is mid-50s ... so the median age for an *accomplished* CEO is probably around 60. 

Same for politics: the median age of a US senator is almost 62 years.  For a US congressman, the mean is around 57.

So, of course looking at obituaries will make you think there's a difference!  Your sample includes athletes who died at 40, but not politicians who died at 40.  Politicians who died at 40 either haven't become famous yet -- or, more likely, haven't even become politicians yet!

And, quickly checking out the US mortality table ... a 35-year-old male is expected to live to about 77.5.  A 60-year-old male is expected to live to about 80.9.

Seems about right.

----

If you want a two-line analogy, try this: 

No US president has ever died before the age of 35.  That doesn't mean that if you want to make sure you don't die in childhood, you should become a US president.



Sunday, April 21, 2013

Pythagorean good luck associated with Runs Created bad luck


I noticed recently that there's a negative correlation between certain measures that we think are random and independent.  For instance, overshooting Pythagoras tends to be associated with undershooting Runs Created.  I don't know why, and I'm looking for ideas.

----

Let me give you some background on the luck numbers I'm working with here.

Back in 2005, I did a study to estimate real teams' historical talent levels from their stats.  I figured that there were five mutually exclusive ways a team could perform differently from its talent:

1.  Its batters could have lucky or unlucky years, in terms of raw batting line.

2.  Its pitchers could have lucky or unlucky years, in terms of the opposition's raw batting line.

3.  It could create more or fewer runs than expected from its batting line (runs created).

4.  Its opponents could create more or fewer runs than expected from their batting line (runs created).

5.  It could over- or undershoot its Pythagorean projection.

The last three were easy -- I just compared them to their estimates.  The first two were harder.  How can you tell whether a player is having a career year?  What I did, for that, is I took the weighted average of the four surrounding seasons, and regressed that to the mean.  The results for players came out fairly reasonable.  
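
Here's the shape of that calculation in code.  The surrounding-season weights and the amount of regression below are placeholders, just to show the structure -- they're not the actual values from the study:

```python
# Sketch of the "career year" estimate: a weighted average of the four
# surrounding seasons, regressed toward the league mean.  The weights and
# the regression fraction are placeholders, not the study's actual values.
def talent_estimate(prev2, prev1, next1, next2, league_mean,
                    weights=(1, 2, 2, 1), regress=0.30):
    w_total = sum(weights)
    surrounding = (weights[0] * prev2 + weights[1] * prev1 +
                   weights[2] * next1 + weights[3] * next2) / w_total
    return surrounding + regress * (league_mean - surrounding)

# "Career year" luck is then the actual season minus the estimate,
# in whatever units you like (say, batting runs).
luck = 95 - talent_estimate(80, 85, 84, 78, league_mean=75)
print(round(luck, 1))   # this hypothetical player beat his estimate by ~14.6
```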

The results for teams came out reasonable too, IMO.  The luckiest team from 1960-2001 was the 2001 Mariners (who the study said "should have" won 89 games instead of 116), and the unluckiest was the 1962 Mets ("should have" won 61 instead of 40).  

[If you want more details, see my web page (search for "1994 Expos").  You can actually download the spreadsheet there that I'm using.  Also, I wrote up the findings for SABR's "Baseball Research Journal," and I found a repost here (.pdf).]

The "career year" estimates for teams seemed pretty good.  I had tweaked the formulas to make them close to unbiased.  For 1973 to 2001 -- the subset of seasons I'm using for this, less strike years -- the mean batting luck was +1.8 runs, and the mean pitching luck was -0.1 runs.  

So, I was pretty happy with the overall results.

-----

OK, so ... while I was working with the data yesterday, I noticed some correlations I didn't expect.  

First: it turns out there's a strong correlation between "Pythagoras luck" and "career year luck" (batting plus pitching).  That correlation is negative 0.1.  Why would that happen?

The only theory I can think of -- when a team plays well, it wins a lot of games.  That means it plays fewer ninth innings on offense, and more ninth innings on defense.  That artificially makes it look lucky in Pythagoras (which is based on run differential).  

But that should create a *positive* correlation with player performance luck, not negative!

Pythagoras luck had an SD of around 40 runs per season.  Career year luck was around 65.  So, every four extra Pythagoras wins is related to around negative 6 runs of "career year" effect.  Not a whole lot, but I still don't know what's going on.  
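
That last sentence is just the regression slope implied by the correlation and the two SDs:

```python
# Runs of "career year" effect per run of Pythagoras luck:
# slope = r * SD(career-year luck) / SD(Pythagoras luck).
r, sd_pythag, sd_career = -0.1, 40.0, 65.0
slope = r * sd_career / sd_pythag
print(round(slope * 40, 1))   # four Pythagoras wins = 40 runs -> about -6.5
```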

----

And, worse: there's a strong correlation between "Pythagoras luck" and "Runs Created luck".  This time, negative 0.15.  

So: for every win by which a team beats its Pythagoras, it gives up about one-tenth of a win in Runs Created luck.  How would that happen?  The only thing I can think of is walkoff wins with runners on base: every one of those might lead RC to believe you were unlucky by ... what, half a run?  That's not really enough to explain it.

-----

Finally ... there's a huge correlation (minus 0.2) between "career year luck" and "Pythagoras + RC luck".  For every four wins a team gained due to Pythagoras/RC luck, it lost one back to player underperformance.

For that, I have a hypothesis.  Runs Created is known for overestimating the best offenses.  So, when a team beats its RC estimate, it's less likely to be having a great year.  That means its batters are more likely to be underperforming.  

Here's something to support that idea: when I checked, I found almost all the correlation comes from comparing batting career years with batting RC luck, and from comparing pitching career years with pitching RC luck.  Comparing pitching to batting gives almost zero correlations.

I'm not sure if that explanation is enough to explain the -0.2, but it's something.

-----

So what's going on?  Shouldn't clutch hitting (which is what RC luck is) be uncorrelated with, say, scoring runs when you need them the most (which is what Pythagoras luck is)?  Shouldn't whether you get a few extra hits one season (which is career year luck) be uncorrelated with *when* those hits happen (which is RC luck)?

Why are these things associated?  It must be something about the way I'm measuring them, rather than being lucky one way actually causing a team to be lucky in another.  Right?


Any ideas?





Thursday, April 11, 2013

Can we tell simulation from real life?

I was a participant on the "randomness" panel at the Sloan Conference last month.  One of the questions was, "How can fans get a feel for how much luck there is in sports?"

My answer went something like this: Play simulation games, like APBA or Strat-O-Matic for baseball.  You'll find that, one game, team A will win 11-1, and, the next, they might lose 8-2 to the same opponents.  Even with exactly the same talent, as determined by the game, the results will vary widely just because of random variation.

What I wanted to add at the time, but trailed off because I lost my train of thought, was: if you're skeptical, you might think that those games are "over"-random, given that they use dice rolls and all.  But ... it turns out that random APBA outcomes are very, very close to real-life outcomes.  For instance, I'd bet that pairs of "11-1 then 2-8" games are almost exactly as common in baseball history as they would be in APBA-simulated baseball history.

Now, I have no actual evidence for that, but I think it's true.  Still, I got to thinking ... what are the ways where real life and APBA *would* be different?  That is, suppose I handed you a bunch of actual game box scores, and a bunch of APBA box scores.  Would you be able to tell which pile was which?

We need to add some assumptions.  Let's suppose that the simulation is as "perfect" as sabermetric knowledge permits -- that is, it uses proper log5, the best park effects, the best guess at how DIPS should work, the proper understanding that batters hit better with runners being held on first base, and so on.  Let's suppose, too, that we clone the team's managers, and let them make game decisions the same way as real life (when to change pitchers, put in a pinch hitter, call for a hit-and-run, etc.).
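
For reference, the "proper log5" mentioned there is the standard head-to-head probability:

```python
# The standard log5 formula: the probability that a team with true winning
# percentage p_a beats a team with true winning percentage p_b.
def log5(p_a, p_b):
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

print(round(log5(0.600, 0.450), 3))   # a .600 team vs. a .450 team: ~0.647
```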

And, let's also assume that we're going to weed out games with the really weird things, the ones that no simulation could be smart enough to include with the right probabilities, like Derek Jeter's famous "flip" throw home, or the time the ball bounced off Jose Canseco's head for a home run.   Or, if you prefer, assume that the simulation IS smart enough, if that doesn't bend your brain too much.

Really, what we're trying to do here is assume that the simulation has every probability perfect: it's just that the outcomes are independent and randomly determined, by dice rolls based on the probabilities, instead of by actual play of the game by flesh-and-blood humans.

If we did all that, could anyone tell the difference?  

My gut answer: it would be hard.  There are some things we could look for.  Injuries, for instance, mean that batters would be a tiny bit "streaky", in that bad performance would be clustered more than randomly, during those times when the player is hurt.  You might find that, in real life, rookies start out well and peter out, as opponents figure out their weaknesses, whereas in APBA, the cards are fixed.  

But, overall, I think even the most knowledgeable experts would have trouble telling the pile of real box scores from the pile of simulated box scores.

Think about this in concrete terms -- what would you, personally, do?  Suppose I took one of those computer games, Pursue the Pennant, or something, and simulated a bunch of games from the 1978 schedule.  And I print off the box scores, and put them alongside the real ones, and I hand them to you in person.

Assuming you don't actually remember a lot of details from 1978 games -- like actual scores, or player performances -- what would you do to figure out which was which?



Thursday, April 04, 2013

Accurate prediction and the speed of light II

Last post, I argued that there's a natural limit to how good a prediction can be.  If you try to forecast an MLB team's season record, the best you can hope for, in the long run, is a standard error of 6.4 games.

But ... even if that's true, can't you at least tell the good forecasters from the bad?

Suppose you checked 20 forecasts, and you ranked them at the end of the year.  The guy who finishes first should still get some credit, right?

Well ... not necessarily.

Suppose that sportswriters are typically pretty good estimators of talent.  Maybe they're within 3 games, so if the God's-eye view is that the Brewers should go 81-81, most of the forecasters will guess between 78 and 84.  (The true spread of talent is roughly 9 games.  So we're assuming experts can spot 2/3 of the differences between teams, in a certain sense.)

However, some forecasters are particularly astute, and they're within 2 games.  Others aren't very good, and they're within, say, 5 games.

What happens?  Well, by my simulation, the good forecaster (every team off by exactly two games) should be expected to finish with an average discrepancy (standard error) of 6.7 wins.  The lousy forecaster (every team off by 5) will finish with a discrepancy of 8.1.

Not much difference, right?  One forecaster was actually two and a half times worse than the other, in the only part of the task under his control (estimating the talent).  But, his results were only about 20 percent worse.

And, he might still "win" the competition!  Again by the simulation, the inferior forecaster will have a lower error more than 35 percent of the time.  Remember, that's when the one guy was two and a half times worse than the other!

-----

That's with two forecasters.  I reran the simulation with nine forecasters, ranging from 0 games off to 8 games off in talent appraisal.  The best forecaster won 17.5 percent of the time.  But the worst forecaster -- who misestimated every team's talent by eight games -- still won 6.5 percent of the seasons!  

And eight games ... well, if you estimate every team's talent will be 81-81, you'll be off by an average of around 9 games.  So, the guy who almost can't tell one team from another ... he still finished first one season in 16.  

Why does that happen?  Because the luck differences overwhelm most of the skill differences.  The standard error from luck is 6.4 games.  The worst predictor adds on an error of 8.  The square root of (6.4 squared plus 8 squared) equals 10.2.
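
That calculation -- and the 6.7 and 8.1 figures from the two-forecaster simulation earlier -- is the same root-sum-of-squares:

```python
# Forecast standard error = sqrt(luck error squared + talent-misjudgment error squared).
from math import sqrt
for miss in (0, 2, 5, 8):
    print(miss, "games off ->", round(sqrt(6.4 ** 2 + miss ** 2), 1))
# 0 -> 6.4, 2 -> 6.7, 5 -> 8.1, 8 -> 10.2
```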

So: the best predictor -- in fact, the best possible predictor -- winds up at 6.4.  The worst in the simulation winds up at 10.2.  That's not that much worse.  In fact, it's "not that much worse" enough that he still wins 6.5 percent of the seasons.

-----

And ... what are you going to do if the winner winds up under the natural limit of 6.4?  It seems weird that you'd award a prediction trophy for something that MUST have been luck-aided.  Yes, the winner was likeliest the best judge of talent (in the Bayesian sense), but ... it would still be weird.

It's like ... suppose you have high-school track tryouts, and you time the athletes independently with stopwatches.  The judges aren't trained, so they're a bit random.  Sometimes they start the time too early, or stop too early.  Sometimes, their view of the finish line is obscured and they have to guess a bit.  

Every runner takes his turn.  When it's over, you discover that the fastest guy ran the 100 metres in ... 9.4 seconds.

Since the world record is 9.58 ... well, you KNOW this high-school kid didn't get 9.4.  He was obviously just lucky, lucky that the judge's stopwatch deficiencies worked in his favor that time.

That's what it's like when someone makes an almost perfect prediction of the MLB season.  It's not possible to truly be that skilled.  



-----

(Correction above: "team" changed to "team's talent", 4pm.)


Monday, April 01, 2013

Accurate prediction and the speed of light

This is the time of year when you see lots of baseball prediction stuff ... how many games teams will win, who will finish in first place, how the post-season brackets will go, and so on.

And I hate them when they're taken seriously, because predicting outcomes with a high degree of accuracy is impossible.  All you can do is guess at the basic probabilities.  After that, it's all luck.

Suppose that you're able to figure that a certain team -- Milwaukee, say -- is actually a .500 team in terms of talent.  Obviously, there's going to be a certain amount of error in your assessment, since it's impossible to know for sure -- but, for the sake of argument, let's say you just know.

Then, subject to certain nitpicks (which I'll leave in the comments), you can consider the Brewers season like 162 coin tosses.  The most likely outcome is 81 heads and 81 tails, but it's probably going to be different just because of luck.  Statistically, you can calculate that the standard error is around 6.4 wins.  That means that, around 1/3 of the time, your estimate will be off by more than around 6.4 wins either way.  And, around 1/20 of the time, your estimate will be off by more than 12.8 wins.  

Suppose that, being rational, you predict 81-81.  And, at the end of the season, the Brewers indeed wound up 81-81.  You're a hero!  But, you were lucky.  The chance that an average team will go exactly 81-81 is ... well, I'm too lazy to calculate it, so I simulated it, and it's around 6.3 percent.  You hit a 15:1 longshot.
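
If you don't feel like simulating it, the exact figure is a one-liner:

```python
# Exact chance that a true .500 team goes exactly 81-81 over 162 games.
from math import comb
print(round(comb(162, 81) * 0.5 ** 162, 3))   # ~0.063
```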

-------

Basically, it's like a law of nature that it is impossible to regularly forecast team records with a margin of error of fewer than around 6.4 wins.  Not difficult, but *impossible*.  It's impossible in the same sense as constructing a perpetual motion machine is impossible, or turning lead into gold on your kitchen stove is impossible, or accurately determining the temperature 100 years from today at 4:33 pm is impossible.  No matter how much you know about the team, and the players, and the second baseman's diet, and the third baseman's mental state, and whether the right fielder is on PEDs ... the best you can do, in the long run, is a standard error of around 6.4 wins. 

When forecasters have a contest, and after the season, one of them has "won" with a standard error of, say, 4.9 wins ... well, you may be impressed.  But he was certainly at least partly lucky.  He beat the natural limit of 6.4.  He was better than perfect.  You may think you're praising his forecasting acumen, but, really, you're implicitly praising his ability to influence coin tosses.

-------

As far as I'm concerned, this feature of randomness -- the existence of a "speed of light" limit to accuracy -- is so fundamental that it should be called "The First Law of Forecasting," or something.  There is a natural limit that cannot be breached, and it usually comes much sooner than we expect.

The newspapers are full of writers and pundits who ignore this law, not just in sports, but in everything.  They assume that if you're smart enough, and expert enough, you can accurately predict who's going to win tomorrow's game, or what the Dow-Jones average will be next year, or what's going to happen in North Korea.

But you can't.  

