How extreme are simulation game results?
Last post, I tried to figure out the theoretical breakdown of luck in team records. In standard errors, I got:
31.9 runs from career years batting
31.9 runs from career years pitching
23.9 runs from event clustering
23.9 runs from opposing team's event clustering
39.1 runs from Pythagoras
A couple of posts before that, I had done the same thing, but for my "luck" study. There, the "career year" luck estimates were higher. (The clustering and Pythagoras estimates were roughly the same, but that's because I used pretty much the same method.)
42.7 runs from career years batting
48.5 runs from career years pitching
23.8 runs from event clustering
25.8 runs from opposing team's event clustering
39.1 runs from Pythagoras
There are good reasons for my "career year" estimates being farther off -- mostly, because I had to interpret changes in talent (aging, injuries, learning to hit a breaking ball) as luck. (There are also selective sampling issues.)
Anyway, part of the reason I did all this was because of a comment from Ted Turocy in my post on simulation games:
"Do you also assume that the player "cards" have been suitably regressed to the mean? If not -- which is the case with all standard season sets/disks -- then the simulation will tend to have more extreme totals on the player leaderboards."
I hadn't thought of that. Ted is right ... most simulations don't regress to the mean. If Mark McGwire hit 70 home runs, the game will be calibrated so that McGwire's expectation is 70. Which means, around half the time, he'll actually hit *more* than 70. In fact, the binomial SD for McGwire's 1998 is around 8 homers, so in a good proportion of simulated seasons, he might hit 80 or more!
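Here's a quick sketch of that binomial calculation. The 509 at-bats is my assumption for McGwire's 1998 total (only the 70 homers is given above), and treating every at-bat as an independent trial with the same home run chance is a simplification of how a game card really works.

```python
from math import comb, sqrt

# Assumption: 509 at-bats for McGwire's 1998; only the 70 HR appears in the post.
AT_BATS = 509
HOME_RUNS = 70

p = HOME_RUNS / AT_BATS               # per-AB home run chance on the unregressed "card"
sd = sqrt(AT_BATS * p * (1 - p))      # binomial SD of the simulated season HR total

# Exact binomial tail: chance of 80 or more homers in one simulated season.
p_80_plus = sum(
    comb(AT_BATS, k) * p**k * (1 - p) ** (AT_BATS - k)
    for k in range(80, AT_BATS + 1)
)

print(f"SD of simulated HR total: {sd:.1f}")
print(f"P(80 or more HR): {p_80_plus:.3f}")
```

Under those assumptions the SD comes out a little under 8, and 80-plus homers shows up in something like one simulated season in ten.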
The *team* totals will be more extreme too, then, and so will the team standings. So I wondered: how much more extreme? I seem to recall, some time ago, APBA putting out a promotional flyer with the results of a full simulated season (sent in by a customer), and showing how similar it was to the actual standings. You'd think, though, that without regressing to the mean, you'd have too much of a spread.
Well, now we can figure it out.
From 1973 to 2011 (omitting strike seasons), the SD of team records (normalized to 162 games) was almost exactly 11 wins. The theoretical SD of luck is 6.36 wins, which means the SD from talent is almost exactly 9 wins. (9 squared plus 6.36 squared equals 11 squared.)
So, at 10 runs per win, that's 90 runs.
But, now, the simulation is increasing that, by taking the "career years" luck, and making it part of the player's "talent" (which is what the card represents). For the team as a whole, it works out to 31.9 runs pitching, and 31.9 runs hitting.
So the SD of talent is now 101 runs -- the square root of (90 squared plus 31.9 squared plus 31.9 squared).
Which means the SD of observed W-L records is now 119 runs, or 11.9 wins -- the square root of (101 squared + 63.6 squared).
11.0 wins -- real life
11.9 wins -- APBA
Not much difference -- around a single win.
So, if you figure the most extreme team in a year is maybe 2 SD from the mean ... the top and bottom teams should be two wins more extreme than real life. (They won't necessarily be the *same* teams.)
That's not a big deal ... you probably wouldn't even notice. Real life often bumps the extremes well beyond that.
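The arithmetic behind the 11.0 versus 11.9 comparison can be retraced in a few lines. This is only a sketch: the 10 runs per win conversion is my reading of how the figures above move between runs and wins, and the variable names are mine.

```python
from math import sqrt

RUNS_PER_WIN = 10        # assumption: the conversion implied by the post's figures

observed_sd = 11.0       # wins, real-life SD of team records
luck_sd = 6.36           # wins, theoretical SD of luck
career_year_sd = 31.9    # runs, once for batting and once for pitching

# Independent sources of variation add in squares, so talent is what's left over.
talent_sd = sqrt(observed_sd**2 - luck_sd**2) * RUNS_PER_WIN          # runs

# The simulation folds both "career year" components into talent ...
sim_talent_sd = sqrt(talent_sd**2 + 2 * career_year_sd**2)            # runs

# ... and then the usual luck gets added back on top.
sim_observed_sd = sqrt(sim_talent_sd**2 + (luck_sd * RUNS_PER_WIN) ** 2)

extra_at_2sd = 2 * (sim_observed_sd / RUNS_PER_WIN - observed_sd)     # wins

print(f"talent SD:          {talent_sd:.1f} runs")
print(f"simulated talent:   {sim_talent_sd:.1f} runs")
print(f"simulated observed: {sim_observed_sd / RUNS_PER_WIN:.1f} wins")
print(f"extra spread at 2 SD: {extra_at_2sd:.1f} wins")
```

Without the intermediate rounding, the run figures come out a shade under 90, 101 and 119, but the simulated SD still rounds to 11.9 wins.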
The real-life SD of wins is 11.0, but it varies quite a bit by season. Here are the 36 season SDs:
10.3, 9.9, 11.7, 11.8, 14.4, 12.3, 12.9, 11.6, 10.5, 9.8, 8.9, 12.6, 10.3, 9.8, 12.0, 10.0, 9.1, 9.7, 10.2, 12.2, 10.0, 9.6, 13.5, 12.5, 10.0, 13.0, 14.8, 13.4, 13.5, 10.8, 10.1, 9.3, 11.1, 11.4, 11.0, 11.4
11.9 fits right in. The real-life extremes stray much farther from 11.0 than that. In 1984, the SD of wins was less than 9. In 2002, it was 14.75. I've never seen anyone mention either one: that there was so much parity in 1984, or so little in 2002. It seems that we don't even notice *big* changes.
The overall SD of the SDs was more than 1.5 ... so a change from 11.0 to 11.9 is only 0.6 SDs from the previous mean. For statistical significance, you'd need to more than triple that ... which means you'd need roughly ten times as many seasons. It would take 360 years of Major League Baseball to get a 50 percent chance of statistical significance for a single unregressed APBA season.
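For anyone who wants to check the "SD of the SDs" from the list, here's a minimal sketch. The year labels are an assumption on my part: I'm supposing the list runs chronologically from 1973 and skips the strike seasons 1981, 1994 and 1995, which is at least consistent with the 1984 and 2002 values quoted above.

```python
from statistics import pstdev

season_sds = [
    10.3, 9.9, 11.7, 11.8, 14.4, 12.3, 12.9, 11.6, 10.5, 9.8, 8.9, 12.6,
    10.3, 9.8, 12.0, 10.0, 9.1, 9.7, 10.2, 12.2, 10.0, 9.6, 13.5, 12.5,
    10.0, 13.0, 14.8, 13.4, 13.5, 10.8, 10.1, 9.3, 11.1, 11.4, 11.0, 11.4,
]

# Assumption: chronological order from 1973, strike seasons left out.
STRIKE_YEARS = {1981, 1994, 1995}
years = [y for y in range(1973, 2012) if y not in STRIKE_YEARS]
by_year = dict(zip(years, season_sds))

sd_of_sds = pstdev(season_sds)               # population SD of the season SDs
shift_in_sds = (11.9 - 11.0) / sd_of_sds     # the APBA bump, in those units

print(f"{len(season_sds)} seasons, SD of SDs = {sd_of_sds:.2f}")
print(f"1984: {by_year[1984]}, 2002: {by_year[2002]}")
print(f"11.0 -> 11.9 is {shift_in_sds:.2f} SDs")
```

Using the sample SD (`statistics.stdev`) instead of the population SD moves the result only slightly, and the shift still rounds to 0.6.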
It's a smaller effect than I had been thinking it would be.
In any case, that just applies to team records. You'll be able to notice the spread more easily in individual results.
For a season of 600 AB, the binomial standard deviation of batting average is 18.7 points (that's the figure for a .300 hitter).
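The 18.7 matches what the binomial formula gives for a .300 hitter, so that's the average I'm assuming went into it. A small sketch (the function name is just for illustration):

```python
from math import sqrt

def ba_sd_points(true_avg: float, at_bats: int = 600) -> float:
    """Binomial SD of a season batting average, in points (thousandths)."""
    return 1000 * sqrt(true_avg * (1 - true_avg) / at_bats)

# The SD changes only slowly with the hitter's underlying average.
for avg in (0.250, 0.300, 0.336):
    print(f"{avg:.3f} hitter: {ba_sd_points(avg):.1f} points")
```

Since the SD barely moves across the range of realistic averages, using a single figure for every hitter is a reasonable shortcut.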
In 2012, Buster Posey led the majors in batting average, at .336. In a simulation, he'd have a 50% chance of beating that.
Miguel Cabrera, who hit .330, would probably have about a 40% chance. Andrew McCutchen, half an SD behind, would have a 30% chance. Mike Trout, also around 30%. Adrian Beltre (.321), 20%. So far, that adds up to 170%, or an average of 1.7 simulated hitters beating .336. And that's only after looking at five players!
So, it's virtually assured that the simulated leader will outhit the actual leader. That's probably true in any of the major statistics where players get similar numbers of opportunities.
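Those percentages can be roughed out with a normal approximation. In this sketch, the averages for McCutchen (.327) and Trout (.326) are my own fill-ins, chosen to match "half an SD behind"; the other three come from the text. It also assumes every hitter gets the same 18.7-point SD and that the five simulated seasons are independent.

```python
from math import prod
from statistics import NormalDist

SD = 0.0187       # binomial SD of batting average over 600 AB, from the post
TARGET = 0.336    # Posey's actual league-leading average

# Posey, Cabrera and Beltre are from the post; McCutchen and Trout are assumed.
card_avgs = {
    "Posey": 0.336,
    "Cabrera": 0.330,
    "McCutchen": 0.327,
    "Trout": 0.326,
    "Beltre": 0.321,
}

# Chance each simulated hitter finishes above the real-life leader's .336.
p_beat = {
    name: 1 - NormalDist(mu=avg, sigma=SD).cdf(TARGET)
    for name, avg in card_avgs.items()
}

expected_beaters = sum(p_beat.values())
p_at_least_one = 1 - prod(1 - p for p in p_beat.values())

for name, p in p_beat.items():
    print(f"{name:10s} {p:.0%}")
print(f"expected hitters above .336: {expected_beaters:.2f}")
print(f"chance at least one does:    {p_at_least_one:.0%}")
```

Even with just these five hitters, the chance that at least one of them tops .336 is already well above 85 percent, and every additional contender pushes it higher.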
Bottom line: in an APBA season, you'll notice players are more extreme than real life -- but you probably won't notice that teams are.