Thursday, May 23, 2013

The OBP/SLG regression puzzle

In the second "puzzle" last post, I noticed that, when you run a regression to predict winning percentage from on-base percentage and slugging percentage, you get that a point of OBP is worth between two and three times as much as a point of SLG.  That's different from the consensus value of 1.7 (which Tango derives here).  Why the difference?

When I wrote that post, I thought I knew the answer ... but I actually didn't.  So I spent the last few days trying to figure it out.  I jumped around among a whole bunch of possibilities, and hit a lot of dead ends ... but I think I finally got somewhere.  Here's the current state of my thinking.  

As usual, I could be wrong.  I was wrong a few times in the course of working on all this ... 

------

I think the issue is one of non-linearity.  The regression assumes that runs/wins are linear in OBP and SLG, but they might not be.  In fact, in a bit, I'll show they're not.

Why does non-linearity matter?  Because, the "1.7" comes from adding events to an average team, and looking at the marginal impact.  If there's linearity, then we know that impact must be the same for all teams.  But, if there's not, that doesn't necessarily work.

To see why: consider a relationship that's actually cubic:

0, 1, 8, 27, 64, 125, 216

If you do a linear regression to predict x-cubed from x, for those seven values, you get

y = 34x - 39

The regression says that if you increase X by 1, Y increases by 34.

But ... that's not true for the *average* value of X.  The average value is 3.  The difference between 3.5 cubed and 2.5 cubed isn't 34 -- it's 27.25.  (To be more precise, we can take the first derivative of x-cubed at 3, and get 27.)

So the average coefficient is higher than the coefficient at the average.  
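
Those two numbers are easy to check by fitting the line yourself.  Here's a minimal sketch in plain Python (the helper name `fit_line` is just for illustration): it runs the least-squares regression of x-cubed on x for 0 through 6, then compares the slope to the derivative at the mean.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]          # 0, 1, 8, 27, 64, 125, 216

slope, intercept = fit_line(xs, ys)
mean_x = sum(xs) / len(xs)
slope_at_mean = 3 * mean_x ** 2    # derivative of x^3, evaluated at the mean

print(slope, intercept)            # 34.0 -39.0
print(slope_at_mean)               # 27.0
```

The regression slope is an average over the whole range of X, so the steep right-hand end of the curve pulls it above the slope at the middle.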

Sometimes the coefficient at the average will be "too high", and sometimes, like now, it will be "too low".  I'm not completely sure of the exact conditions for each.  The point is, though, that if you have non-linearity, the two coefficients will probably be different.

And that means if you have two variables, the ratio will be different.  Suppose you have Y = a^3 + b.  The regression will give you coefficients of 34 and 1.  But the values at the average will be 27 and 1.  So the ratio is 34 overall, but 27 at the average.

That might be what's happening here.  OBP and SLG are non-linear in separate ways, and that changes the ratio from 2.3 overall, to 1.7 at the average.

-------

OBP and SLG are indeed non-linear in a certain way.  In the way I'm going to show you, you don't need any baseball knowledge.  I'm going to show you that they're non-linear, not in terms of *runs*, but in terms of *raw events*.  

Suppose you have a batting line, and you want to add walks to raise the OBP by one point (.001).  How many walks do you have to add?  Well, it depends on your original OBP.

Suppose you're at .333 -- you have 333 "OBs" (walks or hits) in 1000 PA.  How do you get to .334?  You can't just add one walk, because that only brings you to .333666 (334/1001).  It turns out you have to add approximately 1.5015 walks.  That brings you to 334.5015 OBs in 1001.5015 PA, which brings you to .334.

But, now, suppose you're at .400, and you want to get to .401.  How many walks do you have to add now?  This time, it's 1.66945.  401.66945 divided by 1001.66945 equals .401.

(I did a little algebra to get the formula: for every 1000 PA you start with, the number of additional walks you need is 1 divided by (.999 minus OBP), which is the same as 1000 divided by (999 minus OBP in points).  That's where those two numbers came from.)
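
The algebra is short enough to check in code.  This is a sketch, with `walks_needed` as an illustrative name: adding w walks to a line with OBP × PA times on base gives (OBP × PA + w) / (PA + w), and setting that equal to OBP + .001 and solving for w gives the expression in the function.

```python
def walks_needed(obp, pa=1000.0, step=0.001):
    """Walks to add so that OBP rises by `step`, starting from `pa` plate appearances.

    Solves (obp * pa + w) / (pa + w) = obp + step for w.
    """
    return pa * step / (1.0 - step - obp)


print(round(walks_needed(0.333), 4))   # 1.5015
print(round(walks_needed(0.400), 5))   # 1.66945
```

The denominator shrinks as OBP rises, which is the increasing-returns effect in one line.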

That is: points of OBPs give you increasing returns *in terms of number of walks*.  So the more OBPs you get, the more each additional one is worth.  Or, if you want to put it another way, the more OBPs you have, the harder it is to get another one, because you need more walks to get it.

Again, this is not a baseball observation.  The same thing applies, to, say, games of gin rummy.  If you're at .333 and you want to get to .334, you only need to win your next 1.5015 games.  If you're at .400 and you want to get to .401, though, you have to win your next 1.66945 games.  

-------

So: as we saw, a point of OBP offers a higher return when OBP is already high.  That, by itself, is enough to make the coefficient of OBP different from the marginal value *for an average team*, which is where the 1.7 came from.  

But ... what about SLG?  If SLG also offers increasing returns, its coefficient will vary, too.  If it varies the same way, we should still get 1.7!  

Yes, indeed.  But, who knows if SLG *does* have increasing returns?  And who knows if it does, if it's exactly equal to the increasing returns of OBP?  That would be quite a coincidence, wouldn't it?

Since we have no reason to expect OBP and SLG to offer the exact same distortion caused by increasing returns ... we have no reason to expect the ratio OBP/SLG to be exactly 1.7.

This doesn't explain why it's at the level it's at -- "slightly higher than 1.7," we could call it.  From the logic we've seen so far, it could be anything: lower than 1.7, higher than 1.7, much different, a little different, whatever.

But: that's why, in theory, it won't be exactly 1.7.  If that's all you're looking for, an explanation of why it *could* be different, there it is.  I'm going to keep going, but it gets boring and technical and long for the next bit.

------

OK, so we talked about adding a point of OBP by adding walks.  Now, let's talk about adding a point of SLG by adding an extra base.

Adding extra bases doesn't change the denominator of SLG (which is at-bats).  So, if you want to add one point of SLG where there's 1,000 AB, you can just add one extra base.  Change a double to a triple, or something.

But: the denominator, the number of AB, is not the same for every team.  The more AB you have, the more valuable a point of SLG.  At 1,000 AB, you need only 1 extra base.  At 1,020 AB, you need 1.02 extra bases, which is 2% more valuable.

AB is hits plus outs.  In our regression, every team has roughly the same number of outs (since we did full seasons only), so the only difference is hits.  So, the more hits a team has, the more valuable a point of SLG from extra bases.  And hits correlates highly with OPS.

So: the more OPS a team has, the more valuable a point of SLG.  But ... well, it's a weak increase, compared to the OBP increase.  I'm almost willing to call this one linear.

------

What about adding a point of SLG by adding a single?  That's different, because a single affects both SLG and OBP.  So, we need to do this in two steps: we add enough singles to raise SLG by a point, and then subtract enough walks to lower OBP back to where it was before.

How many singles do we have to add to SLG?  That's the same formula as for how many walks we had to add to OBP.  For 1,000 AB, it's

1000 divided by (.999 minus SLG)

That increases OBP by that many "events", so we subtract that exact number of walks, and OBP is back to where it was before.  (Effectively, we've just converted walks to singles at the exact rate that OBP stays the same, but SLG goes up a point.)

The increase in runs is, therefore,

[1000 / (.999 - SLG)] * [value of single - value of walk]

We're assuming singles and walks have constant value -- .47 and .33, say -- so we get that adding one point of SLG adds

+.14 * [1000 / (.999 - SLG)] runs.

That's a higher number when SLG is higher, so we see that a point of SLG also has increasing returns.  (I'm not going to try to figure out by how much.)

------

The last case is adding a point of OBP by singles (and leaving SLG alone).

How many singles do we need to add?  Same formula: 

[1000 / (.999 - OBP)]

But, that will also increase SLG, so we have to subtract enough "extra bases" from SLG to bring it back to where it was before.

Adding hits increased total bases by the same number as it increased AB.  But, to keep slugging the same when adding AB, we need to increase total bases by only SLG times the number of AB.  So, we need to subtract (1-SLG) total bases, for each single added.  

That is, we need to subtract

[1000 / (.999 - OBP)] * (1-SLG) bases in total.

Combining the two steps gives a batting line change of 

[1000 / (.999 - OBP)] cases of "add one single, and subtract (1-SLG) bases".

Assigning run values here -- say, .47 for a single, and .26 for a base -- gives a run increase of 

[1000 / (.999 - OBP)] * (.47 - .26 (1-SLG))

That gives increasing returns in OBP, and also increasing returns in SLG.  Again, I'm not going to try and quantify which is bigger.
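
To see both effects at once, here's a rough sketch of that last formula in Python.  The function names are illustrative, and the count of singles is scaled per 1000 PA so that it matches the 1.5015 figure from the walk example (that is, 1 / (.999 - OBP) with OBP as a decimal).  The .47 and .26 run values are the ones assumed in the text.

```python
RUN_VALUE_SINGLE = 0.47   # run value of a single, as assumed in the text
RUN_VALUE_BASE = 0.26     # run value of one extra base, as assumed in the text


def singles_needed(obp, pa=1000.0, step=0.001):
    """Singles to add so that OBP rises by `step`, starting from `pa` plate appearances."""
    return pa * step / (1.0 - step - obp)


def runs_per_obp_point_via_singles(obp, slg):
    """Run gain from one point of OBP added by singles, holding SLG fixed.

    Each added single is paired with removing (1 - slg) total bases,
    which is what keeps SLG where it started.
    """
    per_single = RUN_VALUE_SINGLE - RUN_VALUE_BASE * (1.0 - slg)
    return singles_needed(obp) * per_single


for obp, slg in [(0.320, 0.400), (0.340, 0.400), (0.320, 0.420)]:
    print(obp, slg, round(runs_per_obp_point_via_singles(obp, slg), 4))
```

Raising either input raises the output: a higher OBP means more singles are needed for the point, and a higher SLG means fewer bases get subtracted per single.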

------

Those are the only four cases I see of how to increase one of OBP and SLG at the margin.  (For extra-base hits, you just add the two cases -- add singles, and then add extra bases.  The math works out the same.)

That means, in terms of increasing returns, we have:

Increase SLG by bases -- roughly linear
Increase SLG by hits  -- increasing in SLG
Increase OBP by walks -- increasing in OBP
Increase OBP by hits  -- increasing in OBP and SLG

So, some ways are increasing in OBP, and some in SLG, and ... it looks like OBP and SLG are represented roughly equally.  It looks like we should expect a ratio that's not too far from 1.7.  It might not be *exactly* 1.7, but our gut says it should be not too different.  Which is about right -- it's in the 2s.

This is all theory.  Is there any evidence we can look at?

Well, it looks like teams with lots of walks should be different from teams with lots of hits.  The walking teams should see lots of increasing returns in OBP, and so a higher ratio.  And the hitting teams should see lots of increasing returns in SLG, and so a lower ratio.

So, I repeated the regression, but included only teams who were at least two percentage points higher than normal in their BB/H ratio.  This is the regression for those teams:

wpct = 2.69 OBP + 1.05 SLG - .845 (ratio: 2.5)

And for teams who walked two percentage points *less* than normal:

wpct = 1.62 OBP + 1.03 SLG - .491 (ratio: 1.6)

So, that seems to support the theory!  More walks = higher ratio, as hypothesized.

The results are similar if I use other point thresholds for higher/lower than average:

0 points: 2.6 low, 3.7 high
1 point : 2.0 low, 4.6 high
2 points: 1.6 low, 2.5 high
3 points: 0.8 low, 1.5 high
4 points: 7.1 low, 3.6 high

(The theory seems to fail in the extreme case ... but it's probably sample size.  If you up the SLG coefficient by 2 SDs, the ratio drops from 7.1 all the way to 1.6.)

Overall, I'd say, the test seems to support the theory.

-----

OK, now the bad news.  I don't think this is the real answer.  Yes, I think it's all correct, but I wonder if the effect is much too small to make such a big difference, from 1.7 to 2.3. 

Also, another explanation occurred to me, one that seems bigger: 

Walks get lumped in with singles in OBP.  Extra bases get lumped in with singles in SLG.  Which is worth more: a single, or the exact number of walks and extra bases that have the same impact on OBP and SLG?  Whichever is worth more, if the good teams get more of that one relative to the other, that will show up in a higher coefficient.  If the good teams get fewer of that one, the coefficient would be lower.  

This last explanation seems to me like the effect would be bigger.  Further research required, I guess.



Thursday, May 16, 2013

Two regression puzzles

Here's a couple of interesting sports regression problems I ran into in the past week, if you're into that kind of thing.  What struck me about them is how simple the actual regressions are, but how hard you have to think to figure out what they really mean.

----

The first one comes from Brian Burke.

Brian ran a regression to link an NFL quarterback's performance to his salary.  He got a decent relationship, with a correlation of .46.  Based on that regression, it looked like Aaron Rodgers should be worth around $25 million a year.  

So far so good.

Then, Brian ran exactly the same regression, but switched the X and Y axes.  He got the same correlation, of course.  And the points on the graph were exactly the same, just sideways.  But, this time, it looked like Rodgers should be worth only about $11 million!

How is that possible?

Here's the post where Brian lays out both arguments -- along with pictures -- and asks which is right.  It took me a couple of hours of pondering, but I think I figured it out.  

My answer is in the comments to Brian's post.  I think it's correct, but I'm not completely sure ... and I don't think I even convinced Brian.

----

The second one you can understand, probably, without pictures.  I'll elaborate in the next post, but I'll just lay it out for now.

It's an established result, in baseball analysis, that a point of on-base percentage is worth about 1.7 times as much as a point of slugging percentage.  (Here's a discussion at Tango's old blog; you can probably Google and find others.)

But ... if you do a regression, that's not what you get.

I ran a regression to predict team winning percentage from OBP and SLG, using seasons from 1960 to 2011.  My equation was:

wpct = (2.52 OBP) + (0.71 SLG) - 0.62

By this regression, it looks like a point of OBP is worth 3.5 times a point of SLG -- about twice as much as the true value of 1.7.  Also, the 2.52 and the 0.71 aren't right either, individually.

It's not just random error ... even if you move the two coefficients together by 2 standard errors each, the ratio still won't reach 1.7.  Also, if you break this down into subsets, you get roughly similar results for each (as long as you keep enough seasons to reduce the randomness enough).

What's going on?

It took me a while -- again -- but I think I figured this one out too.  I'll explain in the next post.




UPDATE, Friday 5/17:  Upon further reflection, I *haven't* figured out the second one yet.  But I'm working on it!



Monday, May 06, 2013

How extreme are simulation game results?

Last post, I tried to figure out the theoretical breakdown of luck in team records.  In standard errors, I got:

31.9 runs from career years batting
31.9 runs from career years pitching
23.9 runs from event clustering
23.9 runs from opposing team's event clustering 
39.1 runs from Pythagoras

A couple of posts before that, I had done the same thing, but for my "luck" study.  There, the "career year" luck estimates were higher.  (The clustering and Pythagoras estimates were roughly the same, but that's because I used pretty much the same method).

42.7 runs from career years batting
48.5 runs from career years pitching
23.8 runs from event clustering
25.8 runs from opposing team's event clustering 
39.1 runs from Pythagoras

There are good reasons for my "career year" estimates being farther off -- mostly, because I had to interpret changes in talent (aging, injuries, learning to hit a breaking ball) as luck.  (There are also selective sampling issues.)

Anyway, part of the reason I did all this was because of a comment from Ted Turocy in my post on simulation games:


"Do you also assume that the player "cards" have been suitably regressed to the mean? If not -- which is the case with all standard season sets/disks -- then the simulation will tend to have more extreme totals on the player leaderboards."

I hadn't thought of that.  Ted is right ... most simulations don't regress to the mean.  If Mark McGwire hit 70 home runs, the game will be calibrated so that McGwire's expectation is 70.  Which means, around half the time, he'll actually hit *more* than 70.  In fact, the binomial SD for McGwire's 1998 is around 8 homers, so that, in a good proportion of simulated seasons, he might hit 80 or more!

The *team* totals will be more extreme too, then, and so will the team standings.  So I wondered: how much more extreme?  I seem to recall, some time ago, APBA putting out a promotional flyer with the results of a full simulated season (sent in by a customer), and showing how similar it was to the actual standings.  You'd think, though, that without regressing to the mean, you'd have too much of a spread.

Well, now we can figure it out.

From 1973 to 2001 (omitting strike seasons), the SD of team records (normalized to 162 games) was almost exactly 11 wins.  The theoretical SD of luck is 6.36 wins, which means the SD from talent is almost exactly 9 wins.  (9 squared plus 6.36 squared equals 11 squared.)  

So, that's 90 runs.

But, now, the simulation is increasing that, by taking the "career years" luck, and making it part of the player's "talent" (which is what the card represents).  For the team as a whole, it works out to 31.9 runs pitching, and 31.9 runs hitting.

So the SD of talent is now 101 runs -- the square root of (90 squared plus 31.9 squared plus 31.9 squared).

Which means the SD of observed W-L records is now 119 runs, or 11.9 wins -- the square root of (101 squared + 63.6 squared).

11.0 wins -- real life
11.9 wins -- APBA

Not much difference -- around a single win.   
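
The arithmetic above is all "add the variances, take the square root."  Here's a sketch that retraces it in Python; `combine_sds` is an illustrative helper, and the 10-runs-per-win conversion is the same rule of thumb the post uses.

```python
import math

RUNS_PER_WIN = 10.0


def combine_sds(*sds):
    """SD of a sum of independent components: square, add, take the square root."""
    return math.sqrt(sum(sd ** 2 for sd in sds))


observed_sd_wins = 11.0    # real-life SD of team records, from the text
luck_sd_wins = 6.36        # theoretical binomial luck, from the text

# Talent is what's left after removing luck from the observed spread.
talent_sd_wins = math.sqrt(observed_sd_wins ** 2 - luck_sd_wins ** 2)

# The simulation folds both kinds of "career year" luck into talent.
talent_sd_runs = 90.0      # the rounded 9-win figure used in the text
sim_talent_runs = combine_sds(talent_sd_runs, 31.9, 31.9)
sim_observed_runs = combine_sds(sim_talent_runs, luck_sd_wins * RUNS_PER_WIN)

print(round(talent_sd_wins, 1))                      # 9.0
print(round(sim_talent_runs, 1))                     # 100.7
print(round(sim_observed_runs / RUNS_PER_WIN, 1))    # 11.9
```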

So, if you figure the most extreme team in a year is maybe 2 SD from the mean ... the top and bottom teams should be two wins more extreme than real life.  (They won't necessarily be the *same* teams.)

That's not a big deal ... you probably wouldn't even notice.  Real life often bumps the extremes well beyond that.

The real-life SD of wins is 11.0, but it varies quite a bit by season.   Here are the 36 season SDs:

10.3,  9.9, 11.7, 11.8, 14.4, 12.3, 12.9, 11.6, 10.5, 9.8, 8.9, 12.6, 10.3, 9.8, 12.0, 10.0, 9.1, 9.7, 10.2, 12.2, 10.0, 9.6, 13.5, 12.5, 10.0, 13.0, 14.8, 13.4, 13.5, 10.8, 10.1, 9.3, 11.1, 11.4, 11.0, 11.4

11.9 fits right in.  The extremes are much farther out than that.  In 1984, the SD of wins was less than 9.  In 2002, it was 14.75.  I've never seen anyone mention either, that there was so much parity in 1984 but so little in 2002.  It seems that we don't even notice *big* changes.

The overall SD of the SDs was more than 1.5 ... so a change from 11.0 to 11.9 is only 0.6 SDs from the previous mean.  For statistical significance, you'd need to more than triple that ... which means you'd need roughly ten times as many seasons.  It would take 360 years of Major League Baseball to get a 50 percent chance of statistical significance for a single unregressed APBA season.
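
For anyone who wants to check the "SD of the SDs," here's a small sketch using the season values listed above.  It uses the sample standard deviation from Python's `statistics` module; whether the original figure was a sample or population SD isn't stated, so treat the exact decimals as approximate.

```python
import statistics

# Season-by-season SDs of wins, copied from the list above.
season_sds = [
    10.3, 9.9, 11.7, 11.8, 14.4, 12.3, 12.9, 11.6, 10.5, 9.8, 8.9, 12.6,
    10.3, 9.8, 12.0, 10.0, 9.1, 9.7, 10.2, 12.2, 10.0, 9.6, 13.5, 12.5,
    10.0, 13.0, 14.8, 13.4, 13.5, 10.8, 10.1, 9.3, 11.1, 11.4, 11.0, 11.4,
]

n_seasons = len(season_sds)
sd_of_sds = statistics.stdev(season_sds)

# How big is the 11.0 -> 11.9 shift, measured in units of that spread?
shift_in_sds = (11.9 - 11.0) / sd_of_sds

print(n_seasons)                 # 36
print(round(sd_of_sds, 2))
print(round(shift_in_sds, 2))
```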

It's a smaller effect than I had been thinking it would be.

------

In any case, that just applies to team records.  You'll be able to notice the spread more easily in individual results.

For a season of 600 AB, the binomial standard deviation of batting average is 18.7 points.  

In 2012, Buster Posey led the majors in batting average, at .336.  In a simulation, he'd have a 50% chance of beating that.

Miguel Cabrera, who hit .330, would have probably a 40% chance.  Andrew McCutchen, half an SD behind, would have a 30% chance.  Mike Trout, also around 30%.  Adrian Beltre (.321), 20%.  So far, that adds up to 170%, or an average of 1.7 simulated hitters beating .336.  And that's only after looking at five players!

So, it's virtually assured that the simulated leader will outhit the actual leader.  That's probably true in any of the major statistics where players get similar numbers of opportunities.
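
Here's a rough sketch of where those percentages come from.  It assumes a .300 hitter for the binomial SD (which is what reproduces the 18.7 points), 600 AB for everyone, and a normal approximation; the function names are illustrative.

```python
import math


def binomial_sd(p, n):
    """Binomial standard deviation of a rate (e.g. batting average) over n trials."""
    return math.sqrt(p * (1.0 - p) / n)


def chance_of_beating(target, expected, sd):
    """Normal-approximation chance that a simulated average comes out above `target`."""
    z = (target - expected) / sd
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


sd = binomial_sd(0.300, 600)       # assumption: a .300 hitter with 600 AB
print(round(sd * 1000, 1))         # 18.7 (points of batting average)

LEADER = 0.336                     # Posey's league-leading average, from the text
for name, avg in [("Posey", 0.336), ("Cabrera", 0.330), ("Beltre", 0.321)]:
    print(name, round(chance_of_beating(LEADER, avg, sd), 2))
```

The results land close to the rounded 50%, 40% and 20% quoted above.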

------

Bottom line: in an APBA season, you'll notice players are more extreme than real life -- but you probably won't notice that teams are.






Thursday, April 25, 2013

A breakdown of the luck in MLB season records

If you take a .500 baseball team, and flip it 162 times, you should expect it to come up "win" 81 times.  But that will vary -- sometimes it'll win fewer, and sometimes it'll win more.  You can calculate that the distribution of wins should follow a normal distribution, with a mean of 81, and a standard deviation of 6.36.

Using the rule of thumb that 95% of observations are within two standard deviations of the mean, you can figure that, around one time in 20, that team will win 94 or more games, or 68 or fewer, just by luck alone.
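
Both numbers come straight from the binomial formula.  A minimal sketch:

```python
import math

games = 162
p_win = 0.5

mean_wins = games * p_win
sd_wins = math.sqrt(games * p_win * (1 - p_win))   # binomial SD of the win total

# Two-SD band around the mean (the "95%" rule of thumb)
low = mean_wins - 2 * sd_wins
high = mean_wins + 2 * sd_wins

print(round(sd_wins, 2))               # 6.36
print(round(low, 1), round(high, 1))   # 68.3 93.7
```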

Where does that luck show up?

The way I see it, you can break it up into five mutually-exclusive observations (as I described in a previous post):

1.  The team's hitters could have better or worse performances than their talent expectation -- that is, "career years" in either direction -- in terms of their basic batting line.

2.  The team's pitchers could do the same (that is, the opposing team's batters could have "career years").

3.  The team could score more (or fewer) runs than expected from its composite batting line.  In other words, it could beat its Runs Created (or Linear Weights, or Base Runs) estimate.  That usually happens if the team hits better (or worse) than expected in high-leverage situations in terms of runs scoring -- such as, for instance, bases loaded and two outs.

4.  The opposition could do the same.

5.  The team could win more (or fewer) games than expected from the number of runs it scored and allowed.  In other words, it could beat its Pythagorean projection (or, alternatively, the "10 runs equals a win" rule of thumb).  That usually happens when a team scores more runs in high-leverage situations in terms of game outcome -- such as, for instance, with the score tied in the ninth inning.

If my logic is right, those five calculations cover all the binomial luck, and none of them overlap (that is, no luck is counted twice).

The question I spent the last couple of days looking at, is: how does the overall variation break down into the five parts?  That is, which of the five types is most important?  It wasn't that hard to figure out; I'm not sure why I didn't do it years ago.

------

First, the "career years" thing.  How do you figure that out?  Well, it's pretty easy to get an estimate.  I took the overall MLB stats for the 1984 season (1984 chosen arbitrarily), and divided by 26 to get a team average.  It worked out that there were 25.27 hitless at-bats per game, per team.  (It's not 27 because of bottom-of-the-ninth issues, outs made on base, double plays, and so on.)

So, I ran a little simulation.  I created random plate appearances, with league-average probabilities, until I got to 25.27 times 162 batting outs.  Then, I calculated the Linear Weights runs for that simulated batting line.  (I used weights of .47/.85/1.02/1.4/0.33 for 1B/2B/3B/HR/BB.  The value assigned to the out didn't matter here, because every season had the same number of outs.)

In that simulation, the standard deviation of runs was 31.9.  So, that's my estimate of how much random variation there is in terms of "career years".  It's 31.9 for the team's hitters, and 31.9 for the team's pitchers.
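
Here's a sketch of what that kind of simulation might look like.  It is not the original code.  The event probabilities below are hypothetical placeholders, not the actual 1984 league averages, so the result will only be in the same neighborhood as 31.9.  The run weights and the 25.27 outs per game are from the text.

```python
import random
import statistics

# HYPOTHETICAL per-plate-appearance probabilities (placeholders, not 1984 data).
EVENT_PROBS = {"1B": 0.155, "2B": 0.040, "3B": 0.006,
               "HR": 0.022, "BB": 0.085, "OUT": 0.692}

# Linear Weights values from the text; the out is left at zero because
# every simulated season has the same number of outs.
RUN_VALUES = {"1B": 0.47, "2B": 0.85, "3B": 1.02, "HR": 1.40, "BB": 0.33}

OUTS_PER_SEASON = round(25.27 * 162)

EVENTS = list(EVENT_PROBS)
WEIGHTS = list(EVENT_PROBS.values())


def simulate_season(rng):
    """Generate random plate appearances until the season's outs are used up."""
    outs = 0
    runs = 0.0
    while outs < OUTS_PER_SEASON:
        # Draw plate appearances in batches for speed; stop at the final out.
        for event in rng.choices(EVENTS, WEIGHTS, k=1000):
            if event == "OUT":
                outs += 1
                if outs == OUTS_PER_SEASON:
                    break
            else:
                runs += RUN_VALUES[event]
    return runs


def simulate_runs_sd(n_seasons, seed=0):
    """Standard deviation of Linear Weights runs across simulated seasons."""
    rng = random.Random(seed)
    return statistics.stdev(simulate_season(rng) for _ in range(n_seasons))


if __name__ == "__main__":
    print(round(simulate_runs_sd(200, seed=1), 1))
```

Stopping at a fixed number of outs, rather than a fixed number of plate appearances, matters: a lucky team gets more hits *and* more plate appearances, and both add to the spread.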

------

For beating the "runs created" estimate, I looked at real life data.  I actually used Linear Weights, instead, because I think it's more accurate.  I used the above weights for the basic events, and I calculated the value of the batting out for each season (I probably should have used league-season, but I don't feel like redoing it.)

I think I did 1960 to 2001, omitting strike seasons.

The standard deviation of Linear Weights luck was 23.9 runs.  I did only batting, because I didn't have detailed statistics handy for opposition batting.  But, I'm going to assume that would have worked out about the same.

------

Finally, for Pythagoras, I looked at real-life teams from 1973 to 2001 (again omitting strike seasons).  The standard error was 3.91 wins, which I converted to 39.1 runs.   

------

So, here are the results:

31.9 -- career years by hitters
31.9 -- career years by pitchers
23.9 -- linear weights luck for hitters
23.9 -- linear weights luck for opposition hitters
39.1 -- Pythagoras luck

To get the overall SD, you square the five numbers, add them up, and take the square root.  If you do that, you get 68.6.  That's somewhat higher than what we expected, which was 63.6.

68.6 -- five categories combined
63.6 -- theoretical expectation
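
The same calculation in a few lines of Python, as a sketch (the dictionary keys are just labels):

```python
import math

# The five luck components, in runs, from the list above.
luck_sds = {
    "career years by hitters": 31.9,
    "career years by pitchers": 31.9,
    "linear weights luck for hitters": 23.9,
    "linear weights luck for opposition hitters": 23.9,
    "Pythagoras luck": 39.1,
}

# Independent components add in variance, not in SD.
combined = math.sqrt(sum(sd ** 2 for sd in luck_sds.values()))

# Binomial SD for a .500 team over 162 games, at 10 runs per win.
theoretical = math.sqrt(162 * 0.5 * 0.5) * 10

print(round(combined, 1))      # 68.6
print(round(theoretical, 1))   # 63.6
```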

Why the difference?  I'm not sure.  Some possibilities:

1.  Linear Weights is known to overestimate luck for very bad and very good teams, and underestimate it for medium teams.  That would inflate those two SDs, a bit.

2.  Teams that have good Linear Weights Luck score more runs.  Teams that score more runs play fewer bottom-of-the-ninths on offense, and more bottom-of-the-ninths on defense.  That would compress their run differential, which would make them look like they had more Pythagoras luck than they did.  That is: we *are* double counting a little bit of luck in this case, due to the fact that Pythagoras doesn't take innings into account, just games.  

3.  I used real-life teams for Pythagoras luck.  But, in real life, there are things that make a team beat its Pythagoras that have nothing to do with luck.  For instance, a team with much better relief pitchers will be able to hold on to small leads, and win more games than expected.  Also, managers who make blowouts worse by using their worst pitchers to mop-up will show more apparent Pythagoras luck, by giving up more runs that don't affect who wins.

4.  The same thing could be true for Linear Weights luck.  Linear Weights assumes that each individual event -- a single, say -- occurs in a league-average context of runners on base.  But teams that score primarily by the home run hit in a below-average context (to see why, imagine that a team hits ONLY home runs.  Each will be worth only 1.0 runs, instead of the 1.4 the formula thinks it's worth).  And teams that score primarily by singles probably hit in an above-average context.  So, that would tend to magnify the errors, in either direction.

I suspect the Pythagoras is the biggest thing ... maybe what I'll do, eventually, is pick real-life games randomly broken among all different teams, and calculate Pythagoras error that way.  And maybe I'll do the same thing for Linear Weights.  I'm guessing that will bring the numbers down at least a bit.  Whether the total will drop from 68.6 all the way to 63.6 ... well, I doubt it, but you never know.

------

If you like talking in terms of variances -- that is, r-squareds -- or you like it when things add up to 100 percent, here are the five variances as a percentage of the total.  (I'll also change "linear weights luck" to "cluster luck", in honor of Joe Peta, since Linear Weights luck is the result of clutch hitting, which means clustering of offensive events.)

22% -- career years by hitters
22% -- career years by pitchers
12% -- cluster luck by offense
12% -- cluster luck by opposition offense
32% -- Pythagoras luck

And, combining offense/defense:

44% -- career years by team's players
24% -- overall cluster luck
32% -- Pythagoras luck
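
Those percentages are each component's squared SD divided by the total of the squared SDs.  A minimal sketch, using the five run figures from earlier in the post:

```python
# The five luck components, in runs.
luck_sds = {
    "career years by hitters": 31.9,
    "career years by pitchers": 31.9,
    "cluster luck by offense": 23.9,
    "cluster luck by opposition offense": 23.9,
    "Pythagoras luck": 39.1,
}

total_variance = sum(sd ** 2 for sd in luck_sds.values())

# Share of the total variance, as a percentage.
shares = {name: 100 * sd ** 2 / total_variance for name, sd in luck_sds.items()}

for name, share in shares.items():
    print(f"{share:.0f}% -- {name}")
```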

I suspect that after that other simulation I plan on doing, the career years numbers will wind up a bit higher, and the others a bit lower.  But I think this will still be pretty close.  



Monday, April 22, 2013

Do athletes have shorter lifespans?

According to this article in Pacific Standard magazine, athletes have shorter lifespans than those in other occupations.

The article cites a recent academic study (.pdf) that looked at 1,000 consecutive obituaries in the New York Times.  That study 


" ...found the youngest average age of death was among athletes (77.4 years), performers (77.1 years), and non-performers who worked in creative fields, such as authors, composers, and artists (78.5 years). The oldest average age of death was found among people famous for their work in politics (82.1 years), business (83.3 years), and the military (84.7 years)."

The authors of the study say,


"... our data raise the intriguing speculation that young people contemplating certain careers (e.g. performing arts and professional sports) may be faced, consciously or otherwise, with a faustian choice: namely, 1. to maximize their career potential and  competitiveness even though the required psychological and physical costs may be expected to shorten their longevity, or 2. to fall short of their career potential so as to balance their lives and permit a normal lifespan."

But: isn't there a selective sampling problem here?

To appear in a New York Times obituary, you have to be relatively famous, or, at least, have passed a certain standard of fame or accomplishment in your chosen field.

If your chosen field is athletics, you reach that threshold early in your life -- in your 30s, say.  Wayne Gretzky, Ken Griffey Jr., Bjorn Borg.  If your field is business, you probably have to reach the level of CEO to make the Times.  The median age of a CEO in the S&P 500 is mid-50s ... so the median age for an *accomplished* CEO is probably around 60. 

Same for politics: the median age of a US senator is almost 62 years.  For a US congressman, the mean is around 57.

So, of course looking at obituaries will make you think there's a difference!  Your sample includes athletes who died at 40, but not politicians who died at 40.  Politicians who died at 40 either haven't become famous yet -- or, more likely, haven't even become politicians yet!

And, quickly checking out the US mortality table ... a 35-year-old male is expected to live to about 77.5.  A 60-year-old male is expected to live to about 80.9.

Seems about right.

----

If you want a two-line analogy, try this: 

No US president has ever died before the age of 35.  That doesn't mean that if you want to make sure you don't die in childhood, you should become a US president.



Sunday, April 21, 2013

Pythagorean good luck associated with Runs Created bad luck


I noticed recently that there's a negative correlation between certain measures that we think are random and independent.  For instance, outshooting Pythagoras tends to be associated with undershooting Runs Created.  I don't know why, and I'm looking for ideas.

----

Let me give you some background to what luck numbers I'm doing here.

Back in 2005, I did a study to estimate real teams' historical talent levels from their stats.  I figured that there were five mutually exclusive ways a team could perform differently from its talent:

1.  Its batters could have lucky or unlucky years, in terms of raw batting line.

2.  Its pitchers could have lucky or unlucky years, in terms of the opposition's raw batting line.

3.  It could create more or fewer runs than expected from its batting line (runs created).

4.  Its opponents could create more or fewer runs than expected from their batting line (runs created).

5.  It could over- or undershoot its Pythagorean projection.

The last three were easy -- I just compared them to their estimates.  The first two were harder.  How can you tell whether a player is having a career year?  What I did, for that, is I took the weighted average of the four surrounding seasons, and regressed that to the mean.  The results for players came out fairly reasonable.  

The results for teams came out reasonable too, IMO.  The luckiest team from 1960-2001 was the 2001 Mariners (who the study said "should have" won 89 games instead of 116), and the unluckiest was the 1962 Mets ("should have" won 61 instead of 40).  

[If you want more details, see my web page (search for "1994 Expos").  You can actually download the spreadsheet there that I'm using.  Also, I wrote up the findings for SABR's "Baseball Research Journal," and I found a repost here (.pdf).]

The "career year" estimates for teams seemed pretty good.  I had tweaked the formulas to make them close to unbiased.  For 1973 to 2001 -- the subset of seasons I'm using for this, less strike years -- the mean batting luck was +1.8 runs, and the mean pitching luck was -0.1 runs.  

So, I was pretty happy with the overall results.

-----

OK, so ... while I was working with the data yesterday, I noticed some correlations I didn't expect.  

First: it turns out there's a strong correlation between "Pythagoras luck" and "career year luck" (batting plus pitching).  That correlation is negative 0.1.  Why would that happen?

The only theory I can think of -- when a team plays well, it wins a lot of games.  That means it plays fewer ninth innings on offense, and more ninth innings on defense.  That artificially makes it look lucky in Pythagoras (which is based on run differential).  

But that should create a *positive* correlation with player performance luck, not negative!

Pythagoras luck had an SD of around 40 runs per season.  Career year luck was around 65.  So, every four extra Pythagoras wins is related to around negative 6 runs of "career year" effect.  Not a whole lot, but I still don't know what's going on.  
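The arithmetic there is just the usual conversion from a correlation to a regression slope: slope = r times (SD of y) divided by (SD of x).  A quick sketch, where the function name is my own and the inputs are the rounded figures from the paragraph above:

```python
def expected_change(correlation, sd_x, sd_y, change_in_x):
    """Expected change in y for a given change in x, from r and the two SDs."""
    slope = correlation * sd_y / sd_x
    return slope * change_in_x


# 40 runs of Pythagoras luck (about four wins), r = -0.1, SDs of 40 and 65 runs.
print(expected_change(-0.1, sd_x=40, sd_y=65, change_in_x=40))
```

That works out to roughly negative six and a half runs of "career year" effect per four Pythagoras wins.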

----

And, worse: there's a strong correlation between "Pythagoras luck" and "Runs Created luck".  This time, negative 0.15.  

So: for every win by which a team beats its Pythagoras, it's given up one-tenth of a win in Runs Created luck.  How would that happen?  The only thing I can think of is walkoff wins with runners on base: every one of those might lead RC to believe you were unlucky by ... what, half a run?  So that's not really enough.

-----

Finally ... there's a huge correlation (minus 0.2) between "career year luck" and "pythagoras + RC luck".  For every four wins a team gained due to Pythagoras/RC luck, they lost one back to player underperformance.  

For that, I have a hypothesis.  Runs Created is known for overestimating the best offenses.  So, when a team beats its RC estimate, it's less likely to be having a great year.  That means its batters are more likely to be underperforming.  

Here's something to support that idea: when I checked, I found almost all the correlation comes from comparing batting career years with batting RC luck, and from comparing pitching career years with pitching RC luck.  Comparing pitching to batting gives almost zero correlations.

I'm not sure if that explanation is enough to explain the -0.2, but it's something.

-----

So what's going on?  Shouldn't clutch hitting (which is what RC luck is) be uncorrelated with, say, scoring runs when you need them the most (which is what Pythagoras luck is)?  Shouldn't whether you get a few extra hits one season (which is career year luck) be uncorrelated with *when* those hits happen (which is RC luck)?

Why are these things associated?  It must be something about the way I'm measuring them, as opposed to, being lucky one way causes you to be lucky another way.  Right?


Any ideas?





Thursday, April 11, 2013

Can we tell simulation from real life?

I was a participant on the "randomness" panel at the Sloan Conference last month.  One of the questions was, "How can fans get a feel for how much luck there is in sports?"

My answer went something like this: Play simulation games, like APBA or Strat-O-Matic for baseball.  You'll find that, one game, team A will win 11-1, and, the next, they might lose 8-2 to the same opponents.  Even with exactly the same talent, as determined by the game, the results will vary widely just because of random variation.

What I wanted to add at the time, but trailed off because I lost my train of thought, was: if you're skeptical, you might think that those games are "over"-random, given that they use dice rolls and all.  But ... it turns out that random APBA outcomes are very, very close to real-life outcomes.  For instance, I'd bet that pairs of "11-1 then 2-8" games are almost exactly as common in baseball history as they would be in APBA-simulated baseball history.

Now, I have no actual evidence for that, but I think it's true.  Still, I got to thinking ... what are the ways where real life and APBA *would* be different?  That is, suppose I handed you a bunch of actual game box scores, and a bunch of APBA box scores.  Would you be able to tell which pile was which?

We need to add some assumptions.  Let's suppose that the simulation is as "perfect" as sabermetric knowledge permits -- that is, it uses proper log5, the best park effects, the best guess at how DIPS should work, the proper understanding that batters hit better with runners being held on first base, and so on.  Let's suppose, too, that we clone the team's managers, and let them make game decisions the same way as real life (when to change pitchers, put in a pinch hitter, call for a hit-and-run, etc.).

And, let's also assume that we're going to weed out games with the really weird things, the ones that no simulation could be smart enough to include with the right probabilities, like Derek Jeter's famous "flip" throw home, or the time the ball bounced off Jose Canseco's head for a home run.   Or, if you prefer, assume that the simulation IS smart enough, if that doesn't bend your brain too much.

Really, what we're trying to do here is assume that the simulation has every probability perfect: it's just that the outcomes are independent and randomly determined, by dice rolls based on the probabilities, instead of by actual play of the game by flesh-and-blood humans.

If we did all that, could anyone tell the difference?  

My gut answer: it would be hard.  There are some things we could look for.  Injuries, for instance, mean that batters would be a tiny bit "streaky", in that bad performance would be clustered more than randomly, during those times when the player is hurt.  You might find that, in real life, rookies start out well and peter out, as opponents figure out their weaknesses, whereas in APBA, the cards are fixed.  

But, overall, I think even the most knowledgeable experts would have trouble telling the pile of real box scores from the pile of simulated box scores.

Think about this in concrete terms, of what you, personally would do.  Suppose I took one of those computer games, Pursue the Pennant, or something, and simulated a bunch of games from the 1978 schedule.  And I print off the box scores, and put them alongside the real ones, and I hand them to you in person.

Assuming you don't actually remember a lot of details from 1978 games -- like actual scores, or player performances -- what would you do to figure out which was which?



Thursday, April 04, 2013

Accurate prediction and the speed of light II

Last post, I argued that there's a natural limit to how good a prediction can be.  If you try to forecast an MLB team's season record, the best you can hope for, in the long run, is a standard error of 6.4 games.

But ... even if that's true, can't you at least tell the good forecasters from the bad?

Suppose you checked 20 forecasts, and you ranked them at the end of the year.  The guy who finishes first should still get some credit, right?

Well ... not necessarily.

Suppose that sportswriters are typically pretty good estimators of talent.  Maybe they're within 3 games, so if the God's-eye view is that the Brewers should go 81-81, most of the forecasters will guess between 78 and 84.  (The true spread of talent is roughly 9 games.  So we're assuming experts can spot 2/3 of the differences between teams, in a certain sense.)

However, some forecasters are particularly astute, and they're within 2 games.  Others aren't very good, and they're within, say, 5 games.

What happens?  Well, by my simulation, the good forecaster (every team off by exactly two games) should be expected to finish with an average discrepancy (standard error) of 6.7 wins.  The lousy forecaster (every team off by 5) will finish with a discrepancy of 8.1.

Not much difference, right?  One forecaster was actually two and a half times worse than the other, in the only part of the task under his control (estimating the talent).  But, his results were only about 21 percent worse.

And, he might still "win" the competition!  Again by the simulation, the inferior forecaster will have a lower error more than 35 percent of the time.  Remember, that's when the one guy was 250 percent worse than the other!
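Here's a rough sketch of that kind of simulation.  It is not the original code, and the setup is an assumption: each trial is one .500 team playing 162 coin-flip games, and each forecaster misses the true talent by a fixed number of wins in a random direction.  The function name and defaults are made up.

```python
import random


def simulate(n_trials=20000, good_miss=2, bad_miss=5, games=162, seed=1):
    """Compare two forecasters on one .500 team per trial (assumed setup)."""
    rng = random.Random(seed)
    talent_wins = games // 2
    good_sq = bad_sq = 0.0
    bad_closer = 0
    for _ in range(n_trials):
        # Fair-coin season: count the 1 bits among `games` random bits.
        actual = bin(rng.getrandbits(games)).count("1")
        # Each forecast is the true talent plus or minus a fixed miss.
        good_err = talent_wins + rng.choice((-1, 1)) * good_miss - actual
        bad_err = talent_wins + rng.choice((-1, 1)) * bad_miss - actual
        good_sq += good_err ** 2
        bad_sq += bad_err ** 2
        if abs(bad_err) < abs(good_err):
            bad_closer += 1
    rmse_good = (good_sq / n_trials) ** 0.5
    rmse_bad = (bad_sq / n_trials) ** 0.5
    return rmse_good, rmse_bad, bad_closer / n_trials


rmse_good, rmse_bad, upset_rate = simulate()
print(round(rmse_good, 1), round(rmse_bad, 1), round(upset_rate, 2))
```

Under those assumptions, the two error figures should land near the ones quoted above, and the worse forecaster should come out closer in roughly a third of the trials.  Scoring a full 30-team season at once, instead of one team at a time, would make upsets much rarer.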

-----

That's with two forecasters.  I reran the simulation with nine forecasters, ranging from 0 games off to 8 games off in talent appraisal.  The best forecaster won 17.5 percent of the time.  But the worst forecaster -- who misestimated every team's talent by eight games -- still won 6.5 percent of the seasons!  

And eight games ... well, if you estimate that every team's talent will be 81-81, you'll be off by an average of around 9 games.  So, the guy who almost can't tell one team from another ... he still finished first one season in 16.  

Why does that happen?  Because the luck differences overwhelm most of the skill differences.  The standard error from luck is 6.4 games.  The worst predictor adds on an error of 8.  The square root of (6.4 squared plus 8 squared) equals 10.2.

So: the best predictor -- in fact, the best possible predictor -- winds up at 6.4.  The worst in the simulation winds up at 10.2.  That's not that much worse.  In fact, it's "not that much worse" enough that he still wins 6.5 percent of the seasons.
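The square-root arithmetic works because independent errors add in quadrature.  A small sketch, assuming the luck and the talent misjudgment are independent (the function name is made up):

```python
import math

LUCK_SD = 6.4  # standard error from luck over 162 games, from the post


def total_error(talent_miss, luck_sd=LUCK_SD):
    """Combined standard error when luck and talent error are independent."""
    return math.hypot(luck_sd, talent_miss)


# Forecasters who misjudge talent by 0 through 8 games.
for miss in range(9):
    print(miss, round(total_error(miss), 1))
```

Because the 6.4 is squared along with everything else, the first few games of talent error barely move the total.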

-----

And ... what are you going to do if the winner winds up under the natural limit of 6.4?  It seems weird that you'd award a prediction trophy for something that MUST have been luck-aided.  Yes, the winner was likeliest the best judge of talent (in the Bayesian sense), but ... it would still be weird.

It's like ... suppose you have high-school track tryouts, and you time the athletes independently with stopwatches.  The judges aren't trained, so they're a bit random.  Sometimes they start the time too early, or stop too early.  Sometimes, their view of the finish line is obscured and they have to guess a bit.  

Every runner takes his turn.  When it's over, you discover that the fastest guy ran the 100 metres in ... 9.4 seconds.

Since the world record is 9.58 ... well, you KNOW this high-school kid didn't get 9.4.  He was obviously just lucky, lucky that the judge's stopwatch deficiencies worked in his favor that time.

That's what it's like when someone makes an almost perfect prediction of the MLB season.  It's not possible to truly be that skilled.  



-----

(Correction above: "team" changed to "team's talent", 4pm.)


Monday, April 01, 2013

Accurate prediction and the speed of light

This is the time of year when you see lots of baseball prediction stuff ... how many games teams will win, who will finish in first place, how the post-season brackets will go, and so on.

And I hate them, when they're taken seriously.  Because, predicting outcomes with a high degree of accuracy is impossible.  All you can do is guess at the basic probabilities.  After that, it's all luck.

Suppose that you're able to figure that a certain team -- Milwaukee, say -- is actually a .500 team in terms of talent.  Obviously, there's going to be a certain amount of error in your assessment, since it's impossible to know for sure -- but, for the sake of argument, let's say you just know.

Then, subject to certain nitpicks (which I'll leave in the comments), you can consider the Brewers season like 162 coin tosses.  The most likely outcome is 81 heads and 81 tails, but it's probably going to be different just because of luck.  Statistically, you can calculate that the standard error is around 6.4 wins.  That means that, around 1/3 of the time, your estimate will be off by more than around 6.4 wins either way.  And, around 1/20 of the time, your estimate will be off by more than 12.8 wins.  

Suppose that, being rational, you predict 81-81.  And, at the end of the season, the Brewers indeed wound up 81-81.  You're a hero!  But, you were lucky.  The chance that an average team will go exactly 81-81 is ... well, I'm too lazy to calculate it, so I simulated it, and it's around 6.3 percent.  You hit a 15:1 longshot.
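For anyone less lazy than I am, both numbers can be calculated directly from the binomial distribution, treating the season as 162 independent fair coin tosses.  A minimal sketch (the variable names are mine):

```python
import math

GAMES = 162
P_WIN = 0.5  # a true .500 team

# Standard deviation of wins for a binomial season.
sd_wins = math.sqrt(GAMES * P_WIN * (1 - P_WIN))

# Exact chance of finishing exactly 81-81.
p_exactly_81 = math.comb(GAMES, 81) * P_WIN ** GAMES

print(round(sd_wins, 2), round(100 * p_exactly_81, 1))
```

The exact calculation agrees with the simulated figure to within rounding.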

-------

Basically, it's like a law of nature that it is impossible to regularly forecast team records with a margin of error of fewer than around 6.4 wins.  Not difficult, but *impossible*.  It's impossible in the same sense as constructing a perpetual motion machine is impossible, or turning lead into gold on your kitchen stove is impossible, or accurately determining the temperature 100 years from today at 4:33 pm is impossible.  No matter how much you know about the team, and the players, and the second baseman's diet, and the third baseman's mental state, and whether the right fielder is on PEDs ... the best you can do, in the long run, is a standard error of around 6.4 wins. 

When forecasters have a contest, and after the season, one of them has "won" with a standard error of, say, 4.9 wins ... well, you may be impressed.  But he was certainly at least partly lucky.  He beat the natural limit of 6.4.  He was better than perfect.  You may think you're praising his forecasting acumen, but, really, you're implicitly praising his ability to influence coin tosses.

-------

As far as I'm concerned, this feature of randomness -- the existence of a "speed of light" limit to accuracy -- is so fundamental that it should be called "The First Law of Forecasting," or something.  There is a natural limit that cannot be breached, and it usually comes much sooner than we expect.

The newspapers are full of writers, and pundits, that ignore this law, not just in sports, but in everything.  They assume that if you're smart enough, and expert enough, you can accurately predict who's going to win tomorrow's game, or what the Dow-Jones average will be next year, or what's going to happen in North Korea.

But you can't.  



Tuesday, March 19, 2013

NFL coaching decisions cost 0.73 wins per team


By making bad decisions on fourth down, NFL coaches are sacrificing almost three-quarters of a win per season.  That's from Matt Meiselman, who crunched some numbers with Brian Burke and posted on Brian's site.  

In 2012, the Cleveland Browns were the "worst", sacrificing a probabilistic 1.02 wins by making 42 "wrong" decisions.  The Packers were the least "worst", giving up only around half an expected win.  I would have expected New England to represent well in this measure, since Bill Belichick has often been touted as a sabermetrically-savvy coach, but the Patriots were only a bit better than average, at 0.6.

Those numbers are based on expectations for an average team.  It's quite likely that they overstate the cost, if the probabilities vary a lot based on quality of team.  My suspicion is that the quality effect is pretty small, because the spread of "wrongness" is so narrow.  In fact, the spread suggests to me that coaches are generally following the same "book" of conventional wisdom, with individual differences being pretty minor.  

The article implies that the losses are due to coaches generally being risk-averse, but doesn't give the numbers.  Is *every* bad decision caused by playing it too safe?  95 percent?  50 percent?  I don't know the answer.  My gut says ... I dunno, I'll guess 92 percent of cases are when the coach should have gone for it and didn't, instead of when he shouldn't have and did.  Matt/Brian, if you're reading this, am I close?  

-------

I'm shocked at how high the numbers are.  Losing 0.73 wins is huge, considering that the difference between a playoff team and an average team is only, what, two games out of sixteen?  

I'd bet that's by far the biggest in-game coaching factor in any major sport (leaving out the decision of who plays).  In baseball, it's the equivalent of 4.6 games per 162, which is about the same percentage of distance to the playoffs.  But I can't see that MLB managers would have anything near that much influence.

-------

At the Sloan convention, there was a lot of talk about how analytics people can increase their influence ... like, what to do or say to get coaches and management to listen to us numbers geeks.  

But, in this case, I think there's an easier path.  Any time there's a fourth-down decision, the TV broadcast could put the probabilities on the screen.  Like, for instance, "teams that go for it should be expected to win 48% of the time, while teams that punt should win only 30% of the time."  That's simple enough for viewers to understand ... which means, fans will be second-guessing the coach based on the numbers, rather than random feelings.  It would still be fun to discuss ... the ESPN guys could argue about why the percentages don't apply in this particular case, because the offense is poor, or the defense has momentum, or whatever.

In any case, it would change the nature of the second-guessing.  Right now, a coach may attract 1 pound of criticism when he plays it by the book, and 5 pounds when he goes for it.  With the probabilities on the screen, maybe the 1:5 ratio will immediately change to 1:3 or something, and then, over time, as the stats gain acceptance, all the way to 1:1.  Then, you've reached the tipping point where the coaches' incentives change.  Now, they take more flak, and sacrifice more job security, when they *don't* go by the percentages.  It wouldn't take long, I suspect, for things to change after that.




Monday, March 18, 2013

Voice of Fire

(Warning: non-sports post.)


This painting is "Voice of Fire," by Barnett Newman.  It's around 18 feet tall and 7 feet wide.  

[Image: "Voice of Fire," by Barnett Newman]

In 1990, the National Gallery of Canada purchased it for $1,800,000.  An uproar ensued.  That's because the painting is ... well, it's pretty simple.  It consists of three solid vertical stripes of equal width, blue/red/blue.  

Now, I'm not much into art, so I'm one of those "uproar" people.  I just don't get the idea of spending almost two million taxpayer dollars on this.  But ... well, maybe it's my own ignorance.  Maybe I just don't get art, the same way Joe Morgan doesn't get sabermetrics. 

So, I've put together some possible reasons that someone might think this painting is worth the cost.  I'm hoping you guys can help me out, maybe let me know if I've hit on something.


1.  It's beautiful

Maybe this is just a beautiful work of art, and I don't appreciate its beauty enough.  If that's the case, though ... why did it take until 1967 for someone to discover how well those vertical stripes go together?  Why didn't the fashion industry figure this out and put it on ties or blouses long before that?

And is there some subtlety that I don't get?  Does the blue have to be exactly the right shade, or it looks ugly and amateurish?  Would it not work in other colors completely? 

Even if that's the case, that it's so aesthetically striking ... why not put up a reproduction?  You wouldn't have to copy it exactly, just hire some guy with a canvas and a roller.  Seriously ... I've never heard anyone say that it's the detail, the brushstrokes, that make it so nice.  It really just seems to be stripes.  

I don't mean to suggest that it's ugly or anything ... I honestly like it.  I just don't see how it's so good that the public needs to be able to see it, or how it's worth $2 million.  It's not a question of "is it nice."  There's lots of nice out there.  It's a question of, "is it $2 million National Gallery nice?"  What is it that makes this a "major league" piece of art?


2.  It needs context

Bobby Fischer moves his rook.  Weeks later, after analyzing the game, critics realize what a great move it was -- it won the game for him.  But ... there's nothing special about the WAY he lifted the piece, or even the beauty of the board.  It's that the move was a brilliant answer to the question, "what's the best response to that other move?"

Or, think of a running gag.  "Don't call me Shirley."  It's funny in conversation, only in the right context, and only if the other person understands the reference.  (One of my favorite variations: "Frankly, my dear, I don't give a damn."  "Well, I don't give a damn either, and stop calling me Frankly."  Most people wouldn't get that at all.)

Maybe that's it.  Maybe "Voice of Fire" is a culmination of an abstract art conversation.  One guy does green splotches, and another guy paints yellow cubes, and the connoisseurs see it as a stunning follow-up, and then this guy says, "I have the perfect response!"  And he does 17-foot stripes, and the art people appreciate the subtle nuances of the riposte, because that's the way art works.  

As this article says,


" ... there’s a level of importance to the work, which has social value to many people ... it isn’t enough to simply look at a painting and evaluate it on the basis of how elaborate it is. In order to enjoy the work and see its value, you have to learn more about it."

Still: why not a reproduction?  And an explanation?


3.  Mood affiliation

Tyler Cowen writes about "mood affiliation."  That's when people adopt a certain feeling, position or attitude, and then reject anything not consistent with that mood.

For instance, just try to get any rabid Republican to say anything good about Barack Obama.  Try to get an environmentalist to say something bad about an Al Gore policy prescription.  You can't.  They just identify so strongly with one side, and it feels so good, that they can't help opposing anything on the other, or they'll break their emotional bubble.  

Maybe in this case ... well, some people like art, and they feel that government should fund art, and the public should see art, and art is subtle, and art is best appreciated by experts.  And, so, any view that challenges their good feeling about their views, even if it's reasonable and not really that threatening, doesn't penetrate.  The idea that some paintings aren't worth it -- or that "simple" abstracts are less worthy than, say, portraits -- creates an "urgent feeling that the idea needs to be countered".  

They all love art, and none of their fellow art-lovers would even think of breaking the taboo by suggesting that the painting isn't worth $2 million.


4.  Collectibility

I own a few game-worn NHL jerseys.  They were expensive.  Looking at them, you can't tell them apart from the official jerseys you can buy at stores or online, the ones you have custom made by the same companies that make the real ones.  But ... these were worn by actual players, in actual, real NHL games!  That makes them worth substantially more.  There must be some invisible aura ... or maybe it's Hal Gill's actual, real-life sweat molecules that get me to reach deeper into my wallet.

Is that what's happening?  This is a real, original, Barnett Newman painting.  Who cares how good it is?  He painted it!  It's like the jersey Don Awrey wore in Game 1 of the Canada-Russia series.  Who cares if the sportswriters said he played very poorly that game?  It's an important artifact!

I actually have some sympathy for this view, as you may guess.  It would explain why a reproduction just wouldn't do.  (Well, it would suggest why the price is much higher for the original, anyway.)  And it would explain why galleries don't really seem to offer a lot of reproductions.  You'd think they'd want to show the best art, to the best of their ability ... but, maybe it's about having a nice collection of originals.  Nobody lines up to see a photocopy of the Honus Wagner card.


5.  Investment

Suppose the painting appreciates at the rate of inflation.  Then, the only cost is the foregone interest on the money used to buy it.  Suppose that's 5 percent a year, or $90,000 on the $1.8 million purchase price.  

Is it possible that "Voice of Fire" attracts an extra $90,000 in revenue, through Gallery admissions or sales?  Recent admission and parking revenues were around $2 million for the year (.pdf, see page 31).  Most of that is for the special exhibitions, I would think.  It doesn't seem to me like 5 or 10 percent of patrons wouldn't bother coming if "Voice of Fire" weren't there.


6.  Moral Rights

Maybe, legally, you can't exhibit a reproduction, so you have to buy the original.  Or, maybe you can, legally, but, morally, there's an unwritten rule in the art industry that it's wrong to reproduce the idea, even when it's so simple that it can't be copyrighted, out of respect to the artist.

But that only explains why you don't hire a guy with a roller.  It doesn't explain why you should think this canvas is worth $1.8 million, instead of, say, a few thousand.  


7.  BS

Finally, maybe the critics are right ... and it's all just bullsh*t.  It really IS just an 18-foot canvas with three stripes.

In support of this theory: 


'Shirley Thompson, who was the National Gallery’s director at the time, says that with no outrageous content to concentrate on, the clash over Voice kept circling back to how individuals responded to it. “You have to look at yourself,” says Thompson, who turns 80 next month and remains a presence in Ottawa art circles. “You have to look at your understanding of the metaphysical dimension of life.”'  (source)

"You have to look at your understanding of the metaphysical dimension of life."  Geez, I couldn't make up something BSier if I tried.  It's almost self-parody.

And, honestly, I hadn't seen that quote until I started writing this and started Googling.  (Also, I didn't notice her name was Shirley until I was almost finished editing.)

------

So, what is it?  I think it's all of these, in some proportion.  My hypothesis is that there's a certain amount of context involved, but it's still mostly mood affiliation, bullsh*t, and groupthink.  I hypothesize that, just like intelligent Republicans can convince each other that Obama wasn't born in the USA, intelligent art fans can get jobs at the Gallery and convince each other that three stripes is a good way to spend two million dollars.  Nobody who works in the artistic community can, or wants to, say the emperor has no clothes.  And those who do, being outside the art world, are seen as uncomprehending philistines to be dismissed.

Am I wrong?  I know there are art fans reading this.  If you disagree, tell me why this painting is a piece of important art that's worth $2 million.  I truly want to know, and I mean that in the sense that I really am keeping an open mind.








