Friday, October 21, 2016

National League forecasts were too accurate in 2016

FiveThirtyEight predicted the National League surprisingly accurately this year.

The standard error of their predictions -- that is, the SD of the difference between their team forecasts and what actually happened -- was only 4.5 games.* (Here's the link to their forecast -- go to the bottom and choose "April 2".)

(* The SD is the square root of the average squared error. If you prefer just the average error, in this case, it was three-and-a-third games. But I'll be using just the SD in the rest of this post. In most cases, to estimate average error when you only have the SD, you can multiply by the square root of 2/pi (approximately 0.80).)
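Here is a minimal sketch of the two error measures in that footnote. The function names are mine, and the conversion assumes the errors are roughly normally distributed, in which case the average absolute error is the SD times the square root of 2/pi. Real forecast errors won't be exactly normal, which is one reason the estimate won't land exactly on the observed three-and-a-third games.

```python
import math

def rms_error(errors):
    """SD-style error: square root of the average squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mean_abs_error(errors):
    """Plain average error: mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Conversion factor for normally distributed errors (an assumption).
NORMAL_FACTOR = math.sqrt(2 / math.pi)  # roughly 0.80

def approx_mean_abs_from_sd(sd):
    """Estimate the average absolute error when only the SD is known."""
    return sd * NORMAL_FACTOR

# Applied to the 4.5-game SD quoted in the post.
print(round(approx_mean_abs_from_sd(4.5), 2))
```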

4.5 games is very, very good. In fact, it's so good it can't possibly be all skill. The "speed of light" limit on forecasting MLB is about 6.4 games. That is, even if you knew absolutely everything about the talent of a team and its opposition, every game, an SD of 6.4 is the very best you could expect to do.

Of course, you can get lucky, and beat 6.4 games. You could even get to zero, if fortune smiles on you and every team hits your projection exactly. But, 6.4 is the best you can do by skill.**  

(** Actually, it might be a bit less, 6.3 or something, because 6.4 is what you get when teams are evenly matched ... mismatches are somewhat easier to predict. But never mind.)
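For anyone who wants to check the arithmetic, the limit comes from the binomial formula: the SD of wins is the square root of (win probability) * (loss probability) * (number of games). The sketch below is just that formula. The function name is mine, and it assumes every game on the schedule has the same win probability, which is a simplification.

```python
import math

def luck_sd(win_prob, games=162):
    """Binomial SD of season wins when every game has the same win probability."""
    return math.sqrt(win_prob * (1 - win_prob) * games)

# Evenly matched games, then progressively bigger mismatches.
for p in (0.5, 0.6, 0.7, 0.9):
    print(f"{p:.0%} favorite: {luck_sd(p):.2f} games")
```

The more lopsided the games, the lower the limit, which is why real-life mismatches nudge the 6.4 down slightly.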

How unusual is an SD of 4.5? Well, not *that* unusual. By my estimate, the SD of the observed SD -- sorry if that's a little confusing -- is somewhere around 1.7, for a league of 15 teams. So, FiveThirtyEight was a little over one standard deviation lucky, which isn't really a big deal. Even taking into account that FiveThirtyEight couldn't have been perfectly accurate in their talent assessments, it's still not that big a deal. If they were off, on talent, by around 3 games per team, that would bring them to only about 1.5 SDs of luck.
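The arithmetic behind those two figures can be sketched like this. It uses only the numbers quoted above (6.4, 1.7, 4.5, and the 3-game talent error), and it assumes that talent-assessment error is independent of game luck, so that the two variances add. That is my reading of the calculation, not a spelled-out method.

```python
import math

LUCK_SD = 6.4            # "speed of light" limit from the post
SD_OF_OBSERVED_SD = 1.7  # the post's estimate for a 15-team league
OBSERVED_SD = 4.5        # FiveThirtyEight's actual result

def expected_sd(talent_error_sd):
    """Expected forecast SD, assuming independent talent error and luck."""
    return math.sqrt(LUCK_SD ** 2 + talent_error_sd ** 2)

def sds_of_luck(talent_error_sd):
    """How many SDs better than expected the observed 4.5 was."""
    return (expected_sd(talent_error_sd) - OBSERVED_SD) / SD_OF_OBSERVED_SD

print(round(sds_of_luck(0.0), 2))  # perfect talent estimates
print(round(sds_of_luck(3.0), 2))  # off by about 3 games per team
```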

Still not a huge deal, but interesting nonetheless.

------

It wasn't just FiveThirtyEight whose projections did well ... the Vegas bookmakers did OK too. Well, at least the one I looked at, Bovada. (I assume the others would be pretty close.)  They had an SD of 5.5 games, which is also better than the "speed of light."  (I can't find the page I got them from, but this one, from a month earlier, is close.)

That suggests that it probably wasn't any particular flash of brilliance from either FiveThirtyEight or Bovada ... it must have been something about the way the season unfolded. 

Maybe, in 2016, there was less random deviation than usual? One type of random variation is whether a team exceeds their Pythagorean Projection -- that is, whether they win more (or fewer) games than you'd expect from their runs scored and allowed. To check that, I used Baseball Prospectus's numbers -- specifically, the difference between actual and "first-order wins."***

(*** Why didn't I use second-order wins? See the P.S. at the bottom of the post.)

In the National League in 2016, the SD of Pythagorean error was 3.55. That is indeed a little smaller than the average of around 4.0. But that small difference isn't nearly enough to explain why the projections were so good.

Here's what I think is the bigger factor -- actually, a combination of two factors.

First, by random chance, the better teams happened to undershoot their Pythagorean expectation, and the worse teams happened to exceed it. 

The Cubs were the best team in the league, and also the team with the most bad luck, -4.8 games. The Phillies were the worst team in the league with luck removed; you'd expect them to have won only 61.4 games, but they played +9.6 games above their Pythagorean projection to go 71-91.

Those two were the most obvious examples, but the pattern continued through the league. Overall, the correlation between first-order wins (which is an approximation of talent) and Pythagorean error was huge: -0.61. Normally, you'd expect it to be close to zero. (In the American League, it was, indeed, close to zero, at -0.06.)
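If you want to run this kind of check yourself, here is a rough sketch. Pythagorean error is defined as actual wins minus first-order wins, so when strong teams undershoot and weak teams overshoot, the correlation comes out negative. Only the Cubs and Phillies rows reflect numbers from this post (the Cubs' 103 wins are from the 2016 standings); the other three rows are made-up placeholders just to make the example run, so the result is not the league figure.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (first-order wins, actual wins)
teams = [
    (107.8, 103),  # Cubs: 103 wins, -4.8 games of luck
    (61.4, 71),    # Phillies: +9.6 games of luck
    (90.0, 88),    # hypothetical placeholder
    (80.0, 81),    # hypothetical placeholder
    (70.0, 73),    # hypothetical placeholder
]

first_order = [f for f, _ in teams]
pyth_error = [actual - f for f, actual in teams]

print(round(pearson(first_order, pyth_error), 2))
```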

Second, there was a similar, offsetting relationship in the predictions themselves. 

It turns out that the forecast errors had a strong pattern this year.  Instead of being random, they came out too "conservative" -- they underestimated the talent of the better teams, and overestimated the talent of the worse teams. Here's the distribution of FiveThirtyEight's forecast errors, with the teams sorted by their forecast:

Top 5 teams: average error -4 wins (underestimate)
Mid 5 teams: average error +4 wins (overestimate)
Btm 5 teams: average error +1  win (overestimate)

So, in summary:

-- FiveThirtyEight predicted teams too close to the mean
-- Teams' Pythagorean luck moved them closer to the mean

Those two things cancelled each other out to a significant extent. And that's why FiveThirtyEight was so accurate.

-------

Next post: The American League, which is interesting for completely different reasons.

-------

P.S. Baseball Prospectus also produces "second-order wins," which attempt to remove a second kind of luck, what I call "Runs Created luck" (and others call "cluster luck"), which is teams scoring more or fewer runs than would be expected by their batting line. I started to do that, but ... I stopped, because I found something weird.

When you remove luck from the standings, you expect to make them tighter, to bring teams closer together. (To see that better, imagine removing luck from coin tosses. Every team reverts to .500.)
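One way to put numbers on that intuition is the usual variance breakdown: observed variance equals talent variance plus luck variance. The sketch below assumes luck is independent of talent, and it uses the binomial luck SD for evenly matched games over 162 games. The function name is mine. The 10.7 is the 2016 "Actual" SD from the table further down in this post.

```python
import math

def talent_sd(observed_sd, luck_sd):
    """SD left over after subtracting luck variance (assumes independence)."""
    return math.sqrt(max(observed_sd ** 2 - luck_sd ** 2, 0.0))

# Binomial luck for 162 coin-flip games.
COIN_LUCK_SD = math.sqrt(0.5 * 0.5 * 162)

# A league of pure coin-tossers: remove the luck and nothing is left.
print(talent_sd(COIN_LUCK_SD, COIN_LUCK_SD))

# The 2016 actual standings, with all binomial luck removed.
print(round(talent_sd(10.7, COIN_LUCK_SD), 1))
```

Either way, taking luck out can only shrink the spread under these assumptions, which is why a wider spread after removing luck looks odd.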

Removing first-order (Pythagorean) luck does seem to reduce the SD of the standings. But, removing second-order (Cluster) luck seems to do the *opposite*.

I checked four seasons of BP data, and, in every case, the SD of second-order wins (for the set of all 30 teams) was higher than the SD of first-order wins:  

         Actual  First-order  Second-order
------------------------------------------
2016      10.7        10.8        13.1
2015      10.4        10.1        11.8
2014       9.6         8.9         9.6
2013      12.2        12.2        12.8

So, either the good teams got lucky all four years, or there's something weird about how BP is computing second-order luck.

2 Comments:

At Monday, October 24, 2016 10:01:00 AM, Blogger Zach said...

Hey Phil,

I agree there is some upper limit to how good a forecast can be, I'm just not sure that 6.4 is definitely it.

You said in the comments of your first limit post:
"All you need for the limit to hold is for it to be impossible to predict the outcome of any single game. With that assumption, the rest of it follows mathematically. That is, if you can't predict one game better than 50/50, it MUST be true that you can't predict 162 games better than with an SD of 6.4."

But, how certain do you have to be on individual games for that 50/50 to not hold? Does a polarized field change this? If half the teams win 10 games and half win 150, you might be able to get significantly below 6.4.

 
At Monday, October 24, 2016 11:21:00 PM, Blogger Phil Birnbaum said...

Hi, Zach,

The SD is the square root of [(win prob) * (lose prob) * (number of games)]. When the games are 50/50, that works out to 6.36. When the games are 90/10, that works out to 3.8.

So, yes, you can go lower than 6.4 if the games are unbalanced. But even if every game was 70/30 -- which is way too extreme for MLB -- you're still at a "speed of light" of 5.8 games.

At 60/40, which is reasonable, you're at 6.2.

 
