Tuesday, March 26, 2019

True talent levels for individual players

(Note: Technical post about practical methods to figure MLB distribution of player talent and regression to the mean.)


For a long time, we've been using the "Palmer/Tango" method for estimating the spread of talent among MLB teams. You're probably sick of seeing it, but I'll run it again real quick for 2013:

1. Find the SD of observed team winning percentage from the standings. In 2013, SD(observed) was 0.0754.

2. Calculate the theoretical SD of luck in a team-season. Statistical theory tells us the formula is the square root of p(1-p)/162, where p is the probability of winning. Assuming teams aren't that far from .500, SD(luck) works out to around 0.039.

3. Since luck is independent of talent, we can say that SD(observed)^2 = SD(luck)^2 + SD(talent)^2 . Substituting the numbers gives our estimate that SD(talent) = 0.0643. 
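
For anyone who wants to play with the numbers, here's a minimal Python sketch of those three steps. The function name is my own invention, and it assumes every team is a .500 team playing 162 games; it's just the arithmetic, nothing more.

```python
import math

def palmer_tango(sd_observed, games=162, p=0.5):
    """Return (SD(luck), SD(talent)) from an observed SD of winning percentage.

    Assumes binomial luck with win probability p in every game, and that
    luck and talent are independent, so their variances add.
    """
    sd_luck = math.sqrt(p * (1 - p) / games)
    var_talent = sd_observed ** 2 - sd_luck ** 2
    if var_talent < 0:
        raise ValueError("SD(observed) is smaller than SD(luck)")
    return sd_luck, math.sqrt(var_talent)

# 2013 MLB standings, using the SD(observed) quoted in step 1
sd_luck, sd_talent = palmer_tango(0.0754)
print(round(sd_luck, 3), round(sd_talent, 3))
```

With the rounded inputs, that comes out at roughly 0.039 for luck and 0.064 for talent, in line with the figures in the steps.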

That works great for teams. But what about players? What's the spread of talent, in, say, on-base percentage, for individual hitters?

It would be great to use the same method, but there's a problem. Unlike team-seasons, where every team plays 162 games, every player bats a different number of times. Sure, we can calculate SD(luck) for each hitter individually, based on his playing time, but then how do we combine them all into one aggregate "SD(luck)" for step 3? 

Can we use the average number of plate appearances? I don't think that would work, actually, because the SD isn't linear. It's inversely proportional to the square root of PA, but even if we used the average of that, I still don't think it would work.

Another possibility is to consider only batters with close to some arbitrary number of plate appearances. For instance, we could just take players in the range 480-520 PA, and treat them as if they all had 500 PA. That would give a reasonable approximation.

But, that would only help us find talent for batters who make it to 500 PA. Those batters are generally the best in baseball, so the range we find will be much too narrow. Also, batters who do make it to 500 PA are probably somewhat lucky (if they started off 15-for-100, say, they probably wouldn't have been allowed to get to 500). That means our theoretical formula for binomial luck probably wouldn't hold for this sample.

So, what do we do?

I don't think there's an easy way to figure that out. Unless Tango already has a way ... maybe I've missed something and reinvented the wheel here, because after thinking about it for a while, I came up with a more complicated method. 

The thing is, we still need to have all hitters have the same number of PA. 

We take the batter with the lowest playing time, and use that. It might be 1 PA. In that case, for all the hitters who have more than 1 PA, we reduce them down to 1 PA. Now that they're all equal, we can go ahead and run the usual method. 

Well, actually, that's a bit of an exaggeration ... 1 PA doesn't work. It's too small, for reasons I'll explain later. But 20 PA does seem to work OK. So, we reduce all batters down to 20 PA.*  

*The only problem is, we'll only be finding the talent range for the subset of batters who are good (or lucky) enough to make it to 20 plate appearances. That should be reasonable enough for most practical purposes, though.  

How do we take a player with 600 PA, and reduce his batting line to 20 PA? We can't just scale down. Proportionally, there's much less randomness in large samples than small, so if we treated a player's 20 PA as an exact replica of his performance in 600 PA, we'd wind up with the "wrong" amount of luck compared to what the formulas expect, and we'd get the wrong answer.

So, what I did was: I took a random sample of 20 PA from every batter's batting line, sampling "without replacement" (which means not using the same plate appearance twice). 

Once that's done, and every hitter is down to 20 PA, we can just go ahead and use the standard method. Here it is for 2013:

1. There were 602 non-pitchers in the sample. The SD of the 602 observed batter OBP values (based on 20 PA per player) was 0.1067.

2. Those batters had an aggregate OBP of .2944. The theoretical SD(luck) in 20 PA with a .2944 expectation is 0.1019.

3. The square root of (0.1067 squared - 0.1019 squared) equals 0.0317.

So, our estimate of SD(talent) = 0.0317. 
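
Here's a rough Python sketch of one random run, for illustration only. It assumes each batting line has been boiled down to a pair (times on base, PA), it uses the population SD (I'm not claiming that's the only reasonable choice), and the four batters at the bottom are made up just to show the call. With so few players the output is pure noise.

```python
import math
import random
import statistics

def one_random_run(batters, n=20, rng=None):
    """One run of the method: cut every batter down to n PA, then apply Palmer/Tango.

    batters: list of (times_on_base, plate_appearances), each with PA >= n.
    Returns (sd_observed, sd_luck, sd_talent); sd_talent is None when
    SD(observed) comes out no bigger than SD(luck).
    """
    rng = rng or random.Random()
    sampled_obp = []
    for on_base, pa in batters:
        # The batting line as 1s (reached base) and 0s (out) ...
        line = [1] * on_base + [0] * (pa - on_base)
        # ... from which we draw n PA without replacement.
        sampled_obp.append(sum(rng.sample(line, n)) / n)
    # Everyone now has n PA, so the aggregate OBP is the plain mean.
    p = sum(sampled_obp) / len(sampled_obp)
    sd_observed = statistics.pstdev(sampled_obp)
    sd_luck = math.sqrt(p * (1 - p) / n)
    if sd_observed > sd_luck:
        sd_talent = math.sqrt(sd_observed ** 2 - sd_luck ** 2)
    else:
        sd_talent = None
    return sd_observed, sd_luck, sd_talent

# Hypothetical batting lines, NOT real 2013 data
demo = [(210, 600), (150, 520), (35, 140), (6, 25)]
print(one_random_run(demo, rng=random.Random(1)))
```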

That implies that 95% of batters range between .247 and .373. Seems pretty reasonable.


I think this method actually works quite decently. One issue, though, is that it includes a lot of randomness. All the regulars with 500 or 600 plate appearances ... we just randomly pick 20, and ignore the rest. The result is sensitive to which random numbers are pulled. 

How sensitive? To give you an idea, here are the results of 10 different random runs:


I should explain the "imaginary" one. That happens when, just by random chance, SD(observed) is smaller than the expected SD(luck). It's more frequent when the sample size is so small -- say, 20 PA -- that luck is much larger than talent. 

In our original run, SD(observed) was 0.107 and SD(luck) was 0.102.  Those are pretty close to each other. It doesn't take much random fluctuation to reverse their order ... in the "imaginary" run, the numbers were 0.1021 and 0.1022, respectively.

More generally, when SD(observed) and SD(luck) are so close, SD(talent) is very sensitive to small random changes in SD(observed). And so the estimates jump around a lot.

(And that's the reason I used the 20 PA minimum. With a sample size of 1 PA, there would be too much distortion from the lack of symmetry. I think. Still investigating.)

The obvious thing to do is just do a whole bunch of random runs, and take the average. That doesn't quite work, though. One problem is that you can't average the imaginary numbers that sometimes come up. Another problem -- actually, the same problem -- is that the errors aren't symmetrical. A negative random error decreases the estimate more than a positive random error increases the estimate. 

To help get around that, I didn't average the 500 estimates in the list. Instead, I averaged the 500 values of SD(observed), and 500 estimates of SD(luck). Then, I calculated SD(talent) from those.

The result:

SD(talent) = 0.0356
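
In code, the averaging step might look something like this. It's a sketch under my own naming, and the two (observed, luck) pairs at the bottom are invented to show what happens when one run on its own would have been "imaginary."

```python
import math

def talent_from_runs(runs):
    """Average SD(observed) and SD(luck) across runs, THEN solve for SD(talent).

    runs: list of (sd_observed, sd_luck) pairs, one per random run.
    This sidesteps runs where sd_observed < sd_luck, which have no real
    square root when handled one at a time.
    """
    mean_observed = sum(obs for obs, _ in runs) / len(runs)
    mean_luck = sum(luck for _, luck in runs) / len(runs)
    return math.sqrt(mean_observed ** 2 - mean_luck ** 2)

# Hypothetical runs: the second would be "imaginary" by itself
runs = [(0.1067, 0.1019), (0.1010, 0.1020)]
print(round(talent_from_runs(runs), 4))
```

If the averages themselves came out with luck bigger than observed, this would still blow up, but with hundreds of runs that shouldn't happen unless talent really is close to zero.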

Even with this method, I suspect the estimate is still a bit off. I'm thinking about ways to improve it. I still think it's decent enough, though.


So, now we have our estimate that for 2013, SD(talent)=0.0356. 

The next step: estimating a batter's true talent based on his observed OBP.

We know, from Tango, that we can estimate any player's talent by regressing to the mean -- specifically, "diluting" his batting line by adding a certain number of PA of average performance. 

How many PA do we need to add? As Tango showed, it's the number that makes SD(luck) equal to SD(talent). 

In the 500 simulations, SD(luck) averaged 0.1023 in 20 PA. To get luck down to 0.0356, where it would equal SD(talent), we'd need 166 PA. (That's 20 multiplied by the square of (0.1023 / 0.0356)). I'll just repeat that for reference:

Regress by 166 PA
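
As a quick check on that arithmetic, here's the calculation in Python, along with the "dilution" step itself. The function names are mine, and the .320 league mean in the example is only a placeholder, not the mean this post winds up recommending.

```python
def pa_to_regress(sd_luck, sd_talent, pa=20):
    """Number of PA at which SD(luck) shrinks to equal SD(talent).

    SD(luck) is proportional to 1/sqrt(PA), so scale PA by the squared ratio.
    """
    return pa * (sd_luck / sd_talent) ** 2

def regress(on_base, pa, league_mean, k=166):
    """Estimate talent by adding k PA of league-mean performance."""
    return (on_base + league_mean * k) / (pa + k)

print(round(pa_to_regress(0.1023, 0.0356), 1))

# Hypothetical batter: 80 times on base in 200 PA (.400), placeholder mean of .320
print(round(regress(80, 200, 0.320), 3))
```

From the rounded inputs, the first number comes out at about 165, a hair under the 166 I quoted, which presumably just reflects rounding in the SDs.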

A value of 166 PA seems reasonable. To check, I ran every season from 1960 to 2016, and 166 was right in line. 

The average of the 57 seasons was 183 PA. The highest was 244 PA (1981); the lowest was 108 PA (1993).  


Now we know we need to add 166 PA of average performance to a batting line to go from observed performance to estimated talent. But what, exactly, is "average performance"?

There are at least four different possibilities:

1. Regress to the observed real-life OBP. In MLB in 2013, for non-pitchers with at least 20 PA, that was .3186. 

2. Regress to the observed real-life OBP weighting every batter equally. That works out to .2984. (It's smaller than the actual MLB number because, in real life, worse hitters get fewer-than-equal PA.)

3. Regress to the average *talent*, weighted by real-life PA.

4. Regress to the average *talent*, weighting every batter equally.

Which one is correct? I had never actually thought about the question before. That's because I had only ever used this method on team talent, and, for teams, all four averages are .500.  Here, they're all different. 

I won't try to explain why, but I think the correct answer is number 4. We want to regress to the average talent of the players in the sample.

Except ... now we have a Catch-22. 

To regress performance to the mean, we need to know the league's average talent. But to know the league's average talent, we need to regress performance to the mean!

What's the way out of this? It took me a while, but I think I have a solution.

The Tango method has an implicit assumption that -- while some players may have been lucky in 2013, and some unlucky -- overall, luck evened out. Which means, the observed OBP in MLB in 2013 is exactly equal to the expected OBP based on player talent.

Since the actual OBP was .3186, it must be that the expected OBP, based on player talent, is also .3186. That is: if we regress every player towards X by 166 PA, the overall league OBP has to stay .3186. 

What value of X makes that happen?

I don't think there can be an easy formula for X, because it depends on the distribution of playing time -- most importantly, how much more playing time the good hitters got that year compared to the bad hitters.

So I had to figure it out by trial and error. The answer:

Mean of player talent = .30995

(If you want to check that yourself, just regress every player's OBP while keeping PA constant, and verify that the overall average (weighted by PA) remains the same. Here's the SQL I used for that:

SELECT
  sum(h+bb)/sum(ab+bb) AS actual,
  sum((h+bb+.30995*166)/(ab+bb+166)*(ab+bb)) / sum(ab+bb) AS regressed
FROM batting
WHERE yearid=2013 AND ab+bb>=20 AND primarypos <> 'P'
The idea is that "actual" and "regressed" should come out equal.

The "primarypos" column is one I created and populated myself, but the rest should work right from the Lahman database. You can leave out the "primarypos" and just use all hitters with 20+ PA. You'll probably find that it'll be something lower than .30995 that makes it work, since including pitchers brings down the average talent.  Also, with a different population of talent, the correct number of PA to regress should be something other than 166 -- probably a little lower? -- but 166 is probably close.

While I'm here ... I should have said earlier that I used only walks, AB, and hits in my definition of OBP, all through this post.)
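
If you'd rather not do the trial and error by hand, here's a sketch of how it could be automated with a simple bisection search. This is not the code I actually used; the function names are made up, and the three batters at the bottom are hypothetical stand-ins for a real season.

```python
def regressed_league_obp(batters, x, k=166):
    """PA-weighted league OBP after regressing every batter toward x by k PA.

    batters: list of (times_on_base, plate_appearances).
    """
    total_pa = sum(pa for _, pa in batters)
    return sum((ob + x * k) / (pa + k) * pa for ob, pa in batters) / total_pa

def find_mean_talent(batters, k=166, tol=1e-10):
    """Find the x that leaves the PA-weighted league OBP unchanged."""
    actual = sum(ob for ob, _ in batters) / sum(pa for _, pa in batters)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The regressed league OBP rises with x, so bisection works.
        if regressed_league_obp(batters, mid, k) < actual:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical league: a regular, a part-timer, and a call-up
demo = [(200, 600), (30, 100), (5, 20)]
x = find_mean_talent(demo)
print(round(x, 4), round(regressed_league_obp(demo, x), 4))
```

In the toy example, the good hitter has most of the playing time, so the mean to regress to lands below the PA-weighted league OBP, the same direction as the .30995 versus .3186 gap above.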


So, a summary of the method:

1. For each player, take a random 20 PA subset of his batting line. Figure SD(observed) and SD(luck).

2. Repeat the above enough times to get a large sample size, and average out to get a stable estimate of SD(observed) and SD(luck).

3. Use the Tango method to calculate SD(talent).

4. Use the Tango method to calculate how many PA to regress to the mean to estimate player talent.

5. Figure what mean to regress to by trial and error, to get the playing-time-weighted average talent equal to the actual league OBP.


If I did that right, it should work for any stat, not just OBP. Eventually I'll run it for wOBA, and RC27, and BABIP, and whatever else comes to mind. 

As always, let me know if I've got any of this wrong.


Tuesday, January 15, 2019

Fun with splits

This was Frank Thomas in 1993, a year in which he was American League MVP with an OPS of 1.033.

                 PA   H 2B 3B HR  BB  K   BA   OPS 
'93 F. Thomas   676 174 36  0 41 112 54 .317 1.033  

Most of Thomas's hitting splits were fairly normal:

Home/Road:              1.113/0.950
First vs. Second Half:  0.970/1.114
Vs. RHP/LHP:            1.019/1.068
Outs in inning:         1.023/1.134/0.948
Team ahead/behind/tied: 1.016/0.988/1.096
Early/mid/late innings: 1.166/0.950/0.946
Night/day:              1.071/0.939

But I found one split that was surprisingly large:

              PA   H 2B 3B HR BB  K   BA   OPS  RC/G 
Thomas 1     352 108 22  0 33 58 34 .367 1.251 14.81 
Thomas 2     309  66 14  0  8 54 20 .259 0.796  5.45 

"Thomas 1" was an order of magnitude better than "Thomas 2," to the extent that you wouldn't recognize them as the same player. 

This is a real split ... it's not a selective-sampling trick, like "team wins vs. losses," where "team wins" were retroactively more likely to have been games in which Thomas hit better. (For the record, that particular split was 1.172/.828 -- this one is wider.)

So what is this split? The answer is ... 


The first line is games on odd-numbered days of the month. The second line is even-numbered days.

In other words, this split is random.

In terms of OPS difference -- 455 points -- it's the biggest odd/even split I found for any player in any season from 1950 to 2016 with at least 251 PA each half. 

If we go down to a 201 PA minimum, the biggest is Ken Phelps in 1987:

1987 Phelps   PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd          204  31  3  0  8 39 33 .188 0.695  3.79 
even         208  55 10  1 19 41 42 .329 1.204 13.03 

And if we go down to 101 PA, it's Mike Stanley, again in 1987, but on the opposite days to Phelps:

1987 Stanley  PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd          134  42  6  1  6 18 23 .362 1.034 10.49 
even         113  17  2  0  0 13 25 .170 0.455  1.55 

But, from here on, I'll stick to the 251 PA standard.

That 1993 Frank Thomas split was also the biggest gap in home runs, with a 25 HR difference between odd and even (33 vs. 8). Here's another I found interesting -- Dmitri Young in 2001:

2001 D Young  PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
Odd          285  68 12  2  2 18 40 .255 0.639  3.48 
Even         292  95 16  1 19 19 37 .348 1.013  9.51 

Only two of Young's 21 home runs came on odd-numbered days. The binomial probability of that happening randomly (19-2/2-19 or better) is about 1 in 4520.*  And, coincidentally, there were exactly 4516 players in the sample!

(* Actually, it must be more likely than 1 in 4520. The binomial probability assumes each opportunity is independent, and equally likely to occur on an even day as an odd day. But, PA tend to happen in daily clusters of 3 to 5. Since PAs are more likely to cluster, so are HR. 

To see that more easily, imagine extreme clustering, where there are only two games a year (instead of 162), with 250 PA each game. Half of all players would have either all odd PA or all even PA, and you'd see lots of extreme splits.)

For K/BB ratio, check out Derek Jeter's 2004:  

2004 Jeter   PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd         362 113 27  1 15 14 63 .325 0.888  7.12 
even        327  75 17  0  8 32 36 .254 0.720  4.40 

There were bigger differences, but I found Jeter's the most interesting. 

In 1978, all 10 of Rod Carew's triples came on even-numbered days:

1978 Carew   PA   H 2B 3B HR BB  K  BA   OPS   RC/G 
odd         333  92 10  0  0 45 34 .319 0.766  5.46 
even        309  96 16 10  5 33 28 .348 0.950  8.69 

A 10-0 split is a 1-in-512 shot. I'd say again that it's actually a bit more likely than that because of PA clustering, but ... Carew actually had *fewer* PA in that situation! 

Oh, and Carew also hit all five of his HR on even days. Combining them into 15-0 is binomial odds of 16383 to 1, if you want to do that.
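
The binomial figures quoted for Young and Carew are easy to reproduce. Here's a small Python sketch (my own function name), which makes the same simplifying assumption as the text: every event is an independent coin flip between odd and even days, with no adjustment for PA clustering or for unequal playing time.

```python
from math import comb

def split_probability(total, smaller):
    """Two-sided chance of a split at least as lopsided as smaller/(total - smaller).

    Assumes each event independently lands on an odd or even day with
    probability 1/2. Only meant for smaller < total / 2.
    """
    one_tail = sum(comb(total, k) for k in range(smaller + 1))
    return 2 * one_tail / 2 ** total

print(round(1 / split_probability(21, 2)))   # Young: 19-2 or 2-19, or more extreme
print(round(1 / split_probability(10, 0)))   # Carew: 10-0 triples
print(round(1 / split_probability(15, 0)))   # Carew: 15-0 triples plus homers
```

The last one prints the "1 in N" form, so it's one higher than the "N to 1" odds in the text.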

Strikeouts and walks aren't quite as impressive. It's Justin Upton 2013 for strikeouts:

2013 Upton     PA   H 2B 3B HR BB   K   BA  OPS  RC/G 
odd           330  71 14  1 16 31 102 .237 0.761 4.67 
even          303  76 13  1 11 44  59 .293 0.875 6.84 

And Mike Greenwell 1988 for walks:

88 Greenwell   PA   H 2B 3B HR BB   K  BA   OPS  RC/G 
odd           357  91 15  3 10 62  18 .308 0.910 7.61 
even          320 101 24  5 12 25  20 .342 0.973 8.85 

Interestingly, Greenwell was actually more productive on the even-numbered days where he took less than half as many walks.

Finally, here's batting average, Grady Sizemore in 2005:

2005 Sizemore  PA   H 2B 3B HR BB   K  BA   OPS  RC/G 
odd           344  69  9  4 12 26  79 .217 0.660 3.45 
even          348 116 28  7 10 26  53 .360 0.992 9.50 

Another anomaly -- Sizemore hit more home runs on his .217 days than on his .360 days.


Anyway, what's the point of all this? Fun, mostly. But, for me, it did give me a better idea of what kinds of splits can happen just by chance. If it's possible to have a split of 33 odd homers and 8 even homers, just by luck, then it's possible to have 33 first-half homers and 8 second-half homers, just by luck. 

Of course, you should only expect that size of effect once every 40 years or so. It might be more intuitive to go from a 40-year standard to a single-season standard, to get a better idea of what we can expect each year. 

To do that, I looked at 1977 to 2016 -- 39 seasons plus 1994. Averaging the top 39 should roughly give us the average for the year. Instead of the average, I figured I'd just (unscientifically) take the 25th biggest ... that's probably going to be close to the median MLB-leading split for the year, taking into account that some years have more than one of the top 39.

For HR, the 25th ranked is Fred McGriff's 2002. It's an impressive 22/8 split:

02 McGriff   PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         297  70 11  1 22 42  47 .275 0.961  7.74 
even        289  73 16  1  8 21  52 .272 0.754  4.89 

For OPS, it's Scott Hatteberg in 2004:

04 Hatteberg PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         312  92 19  0 10 37  23 .335 0.926  8.12 
even        310  64 11  0  5 35  25 .233 0.647  3.47

For strikeouts, it's Felipe Lopez, 2005. Not that huge a deal ... only 27 K difference.

05 F. Lopez  PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         316  78 15  2 12 19  69 .263 0.755  4.75 
even        321  91 19  3 11 38  42 .322 0.928  7.95 

For walks, it's Darryl Strawberry's 1987. The difference is only 23 BB, but to me it looks more impressive than the 27 strikeouts:

87 Strwb'ry  PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         315  77 15  2 19 37  55 .277 0.912  7.02 
even        314  74 17  3 20 60  67 .291 1.045  9.49 

For batting average, number 25 is Omar Infante, 2011, but I'll show you the 24th ranked, which is Rickey Henderson in his rookie card year. (Both players round to a .103 difference.)

1980 Rickey  PA   H 2B 3B HR BB   K  BA   OPS   RC/G 
odd         340 100 13  1  2 60  21 .357 0.903  8.07 
even        368  79  9  3  7 57  33 .254 0.739  4.67 


I'm going to think of this as, every year, the league-leading random split is going to look like those. Some years it'll be higher, some lower, but these will be fairly typical.

That's the league-leading split for *each category*. There'll be a random home/road split of this magnitude (in addition to actual home/road effect). There'll be a random early/late split of this magnitude (in addition to any fatigue/weather effects). There'll be a random lefty/righty split of this magnitude (in addition to actual platoon effects). And so on.

Another way I might use this is to get an intuitive grip on how much I should trust a potentially meaningful split. For instance, if a certain player hits substantially worse in the second half of the season than in the first half, how much should you worry? To figure that out, I'd list a season's biggest even/odd splits alongside the season's biggest early/late splits. If the 20th biggest real split is as big as the 10th biggest random split, then, knowing nothing else, you can start with a guess that there's a 50 percent chance the decline is real.

Sure, you could do it mathematically, by figuring out the SD of the various stats. But that's harder to appreciate. And it's not nearly as much fun as being able to say that, in 1978, Rod Carew hit every one of his 10 triples and 5 homers on even-numbered days. Especially when anyone can go to Baseball Reference and verify it.


Tuesday, December 18, 2018

Does the NHL's "loser point" help weaker teams?

Back when I calculated that it took 73 NHL games for skill to catch up with luck in the standings, I was surprised it was so high. That's almost a whole season. In MLB, it was less than half a season, and in the NBA, Tango found it was only 14 games, less than one-fifth of the full schedule.

Seventy-three games seemed like a lot of luck. Why so much? As it turns out, it was an anomaly -- the NHL was just having an era where differences in team talent were small. Now, it's back under 40 games.

But I didn't know that at the time, so I had a different explanation: it must be the extra point the NHL started giving out for overtime losses. The "loser point," I reasoned, was reducing the importance of team talent, by giving the worse teams more of a chance to catch up to the better teams.

My line of thinking was something like this: 

1. Loser points go disproportionately to worse teams. For team-seasons, there's a correlation of around .4 between negative goal differential (a proxy for team quality) and OTL. So, the loser point helps the worse teams gain ground on the better teams.

2. Adding loser points adds more randomness. When you lose by one goal, whether that goal comes early in the game, or after the third period, is largely a matter of random chance. That adds "when the goals were" luck to the "how many goals there were" luck, which should help mix up the standings more. In fact, as I write this, the Los Angeles Kings have two more wins and three fewer losses than the Chicago Blackhawks. But, because Chicago has five OTL to the Kings' one, they're actually tied in the standings.

But ... now I realize that argument is wrong. And, the conclusion is wrong. It turns out the loser point actually does NOT help competitive balance in the NHL. 

So, what's the flaw in my old argument? 


I think the answer is: the loser point does affect how compressed the standings get in terms of actual points, but it doesn't have much effect on the *order* of teams. The bottom teams wind up still at the bottom, but (for instance) instead of having only half as many points as the top teams, they have two-thirds as many points.

Here's one way to see that. 

Suppose there's no loser point, so the winner always gets two points and the loser always gets none (even if it was an overtime or shootout loss). 

Now, make a change so the losing team gets a point, but *every time*. In that case, the difference between any two teams gets cut in half, in terms of points -- but the order of teams stays exactly the same. 

The old way, if you won W games, your point total was 2W. Now, it's W+82. Either way, the order of standings stays the same -- it's just that the differences between teams are cut in half, numerically.

It's still true that the "loser point" goes disproportionately to the worse teams -- the 50-32 team gets only 32 loser points, while the 32-50 team gets 50 of them. But that doesn't matter, because those points are never enough to catch up to any other team. 

If you ran the luck vs. skill numbers for the new system compared to the old system, it would work out exactly the same.
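
Here's a tiny Python illustration of that, using five made-up win totals over an 82-game schedule. It only covers the thought experiment where every loss earns a point, not the real NHL rule.

```python
GAMES = 82

# Hypothetical win totals for five teams
wins = {"A": 50, "B": 45, "C": 41, "D": 37, "E": 32}

# Old way: two points per win, nothing for a loss
old_points = {team: 2 * w for team, w in wins.items()}

# New way: two per win plus one per loss, i.e. 2W + (82 - W) = W + 82
new_points = {team: 2 * w + (GAMES - w) for team, w in wins.items()}

old_order = sorted(wins, key=old_points.get, reverse=True)
new_order = sorted(wins, key=new_points.get, reverse=True)

print(old_order == new_order)                  # same order either way
print(old_points["A"] - old_points["E"],       # gap from top to bottom, old way
      new_points["A"] - new_points["E"])       # same gap, new way
```

The order is identical, and the top-to-bottom gap goes from 36 points to 18.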


In real life, of course, the losing team doesn't get a point every time: only when it loses in overtime. Last season, that happened in about 11.6 percent of games, league-wide, or about 23.3 percent of losses.

If the loser point happened in *exactly* 23.3 percent of losses, for every team, with no variation, the situation would be the same as before -- the standings would get compressed, but the order wouldn't change. It would be as if, every loss, the loser got an extra 0.233 points. No team could pass any other team, since for every two points it was behind, it only gets 0.233 points to catch up. 

But: what if you assume that it's completely random which losses become overtime losses?  Now, the order can change. A 40-42 team can catch up to a 41-41 team if its losses had randomly included two more overtime losses than its rival. The chance of that happening is helped by the fact that the 40-42 team has one extra loss to try to randomly convert. It needs two random points to catch up, but it starts with a positive expectation of a 0.233 point head start.

If losses became overtime losses in a random way, then, yes, the OTL would make luck more important, and my argument would be correct. But they don't. It turns out that better teams turn losses into OTL much more frequently than worse teams, on a loss-for-loss basis.

Which makes sense. Worse teams' losses are more likely to be blowouts, which means they're less likely to be close losses. That means fewer one-goal losses, proportionately. 

In other words: 

(a) bad teams have more losses, but 
(b) those losses are less likely to result in an OTL. 

Those two forces work in opposite directions. Which is stronger?

Let's run the numbers from last year to find out.

If we just gave two points for a win, and zero for a loss, we'd have: 

SD(luck)    = 9.06
SD(talent)  =13.76

But in real life, which includes the OTL, the numbers are

SD(luck)    = 8.48
SD(talent)  =12.90

Converting so we can compare luck to talent:

35.5 games until talent=luck (no OTL point)
35.4 games until talent=luck (with OTL point)

It turns out, the two factors almost exactly cancel out! Bad teams have more chances for an OTL point because they lose more -- but those losses are less likely to be OTL almost in exact proportion.
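
The conversion from the two SDs to "games until talent=luck" is just a squared ratio times the schedule length. A quick Python sketch, with a function name of my own choosing:

```python
def games_until_even(sd_luck, sd_talent, games=82):
    """Games at which SD(luck) equals SD(talent).

    Both SDs are full-season figures in standings points. Relative to
    talent, luck shrinks with the square root of games played, so the
    break-even point is games * (sd_luck / sd_talent) ** 2.
    """
    return games * (sd_luck / sd_talent) ** 2

print(round(games_until_even(9.06, 13.76), 1))   # no OTL point
print(round(games_until_even(8.48, 12.90), 1))   # with OTL point
```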

And that's why I was wrong -- why the OTL point doesn't increase competitive balance, or make the standings less predictable. It just makes the NHL *look* more competitive, by making the point differences smaller.


Wednesday, December 12, 2018

2007-12 was an era of competitive balance in the NHL

Five years ago, I calculated that in the NHL, it took 73 games until talent was as important as luck in determining the standings. But in a previous study, Tango found that it took only 36 games. 

Why the difference?

I think it's because the years for which I ran the study -- 2006-07 to 2011-12 -- were seasons in which the NHL was much more balanced than usual. 

For each of those six seasons, I went to hockey-reference to find the SD of team standings points:

2006-07  16.14
2007-08  10.43
2008-09  13.82
2009-10  12.95
2010-11  13.27
2011-12  11.73
average  13.18  (root mean square)

Tango's study was written in August, 2006. The previous season had a higher spread:

2005-06  16.52

So, I think that's the answer. It just happened that the seasons I looked at had more competitive balance than the season or seasons Tango looked at.

But what's the right answer for today's NHL? Well, it looks like the standings spread in recent seasons has moved back closer to Tango's numbers:

2013-14  14.26
2014-15  15.91
2015-16  12.86
2016-17  15.14
2017-18  15.44
average  14.76

What does that mean for the "number of games" estimate? I'll do the calculation for last season, 2017-18.

From the chart, SD(observed) is 15.44 points. SD(luck) is roughly the same for all years of the shootout era (although it varies very slightly with the number of overtime losses), so I'll use the old study's number of 8.44 points. 

As usual, 

SD(talent)^2 = SD(observed)^2 - SD(luck)^2
SD(talent)^2 = 15.44^2 - 8.44^2
SD(talent)   = 12.93

So last year, SD(talent) was 12.93. For the six seasons I looked at, it was 8.95. 

2006-12   8.95
2017-18  12.93

Now, let's convert to games.*  

*Specifically, "luck as important as talent" means SD(luck)=SD(talent). Formula: using the numbers for a full season, divide SD(luck) by SD(talent), square it, and multiply by the number of games (82).

When SD(talent) is 8.95, like the seasons I looked at, it takes 73 games for luck and talent to even out. When SD(talent) = 12.93, like it was last year, it takes only ... 36 games.

Coincidentally, 36 games is exactly what Tango found in his own sample.

talent=luck, after
2006-12  73 games
2017-18  36 games

Two things we can conclude from this:

1. Actual competitive balance (in terms of talent) does seem to change over time in non-random ways. The NHL from 2006-12 does actually seem to have been a more competitive league than from 2013-18. 

2. The "number of games" way of expressing the luck/talent balance is very sensitive to moderate changes in the observed range of the standings.


To expand a bit on #2 ... 

There must be significant random fluctuations in observed league balance.  We mention that sometimes in passing, but I think we don't fully appreciate how big those random fluctuations can be.

Here, again, is the SD(observed) for the seasons 2014-17:

2014-15  15.91
2015-16  12.86
2016-17  15.14

It seems unlikely that 2015-16 really had that much tighter a talent distribution than the surrounding seasons. What probably happened, in 2015-16, was just a fluke -- the lucky teams happened to be lower-talent, and the unlucky teams happened to be higher-talent. 

In other words, the difference was probably mostly luck. 

A different kind of luck, though -- luck in how each individual team's "regular" luck correlated, league-wide, with their talent. When the better teams (in talent) are luckier than the worse teams, the standings spread goes up. When the worse teams are luckier, the standings get compressed.

Anyway ... the drop in the chart from 15.91 to 12.86 doesn't seem that big. But it winds up looking bigger once you subtract out luck to get to talent:

2014-15  13.49
2015-16   9.70
2016-17  12.57

The difference is more pronounced now. But, check out what happens when we convert to how many games it takes for luck and talent to even out:

Talent=luck, after
2014-15  32 games
2015-16  62 games
2016-17  37 games

Now, the differences are too large to ignore. From 2014-15 to 2015-16, SD(observed) went down only 19 percent, but the "number of games" figure nearly doubled.

And that's what I mean by #2 -- the "number of games" estimate is very sensitive to what seem like mild changes in standings variation. 
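
To see the sensitivity for yourself, here's a short Python sketch that goes straight from SD(observed) to the "number of games" figure. The function name is mine, and it assumes SD(luck) is fixed at the 8.44 points I used above for every season.

```python
import math

def talent_and_games(sd_observed, sd_luck=8.44, games=82):
    """Return (SD(talent), games until talent=luck) from a season's SD(observed).

    Assumes a constant SD(luck) of 8.44 standings points over 82 games.
    """
    var_talent = sd_observed ** 2 - sd_luck ** 2
    return math.sqrt(var_talent), games * sd_luck ** 2 / var_talent

for season, sd_obs in [("2014-15", 15.91), ("2015-16", 12.86), ("2016-17", 15.14)]:
    sd_talent, n_games = talent_and_games(sd_obs)
    print(season, round(sd_talent, 2), round(n_games))
```

The reason it blows up is that the variance of talent sits in the denominator: as SD(observed) drifts down toward SD(luck), that denominator heads for zero, so small changes in the standings spread turn into big changes in the games figure.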


Just for fun, let's compare 2006-07, one of the most unbalanced seasons, to 2007-08, one of the most balanced. Just looking at the standings, there's already a big difference:

2006-07  16.14
2007-08  10.43

But it becomes *huge* when you express it in games: 

Talent=luck, after
2006-07   31 games
2007-08  156 games

In one year, our best estimate of how many games it takes for talent to exceed luck changed by a factor of *five*. And, I think, almost all that difference is itself just random luck.


Monday, December 03, 2018

Answer to: a flawed argument that marginal offense and defense have equal value

The puzzle from last post was this:  What's wrong with this argument, that a run scored has to be worth exactly the same as a run prevented?

Imagine giving a team an extra run of offense over a season.  You pick a random game, and add on a run, and see if that changes the result.  Maybe it turns an extra-inning loss into a nine-inning win, or turns a one-run loss into an extra-inning game.  Or, maybe it just turns an 8-3 blowout into a 9-3 blowout.

But, it will always be the same as giving them an extra run of defense, right?  Because, it doesn't matter if you turn a 5-4 loss into a 5-5 tie, or into a 4-4 tie.  And it doesn't matter if you turn an 8-3 blowout into a 9-3 blowout, or into a 8-2 blowout.  

Any time one more run scored will change the result of a game, one less run allowed will change it in exactly the same way!  So, how can the value of the run scored possibly be different from the value of the run allowed?

The answer is hinted at by a comment from Matthew Hunt:

"Is it the zero lower bound for runs? You can always increase the number of offensive runs, but you can't hold an opponent to -1 runs."

It's not specifically the zero lower bound -- the argument is wrong even if shutouts are rare -- but it does have to do with the issue of runs prevented.


(Note: for this post, I'm going to treat runs as if they have a Poisson distribution, to make the argument smoother. In reality, runs in baseball come in bunches, and aren't Poisson at all. If that bothers you, just transfer the argument to hockey or soccer, where goals are much closer to Poisson.)


The answer, I think, is this:  If you want to properly remove a single opponent's run from a season, you don't do it by choosing a random game. You have to do it by choosing a random *run*.

When you *add* runs, it's OK to do it by choosing a game first, because all games have roughly equal opportunities to score more runs. But when you *remove* runs, you have to remove a run that's already there ... and you have to weight them all equally when deciding which one to remove.

If you don't weight the runs equally ... well, suppose you have game A with ten runs, and game B with two runs. If you choose a random game first, each B run has five times the chance of being chosen as each of the A runs. 
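To make the "five times" concrete, here's a minimal Python sketch. It assumes the season consists of just the two games from the example, and the function name is made up for illustration.

```python
from fractions import Fraction

def chance_per_run_game_first(runs_by_game):
    """Chance that each individual run is the one removed, if we pick a game
    uniformly at random and then a run uniformly within that game.
    Assumes every game has at least one run (no shutouts)."""
    n_games = len(runs_by_game)
    return {game: Fraction(1, n_games) * Fraction(1, runs)
            for game, runs in runs_by_game.items()}

games = {"A": 10, "B": 2}  # the two games from the example
game_first = chance_per_run_game_first(games)

# Picking a run directly weights all twelve runs equally.
run_first = Fraction(1, sum(games.values()))

print(game_first["A"], game_first["B"], game_first["B"] / game_first["A"])  # 1/20 1/4 5
print(run_first)  # 1/12
```

Choosing the game first gives each B run a 1/4 chance and each A run a 1/20 chance. Choosing the run first gives every run the same 1/12.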

Here's another way of looking at it. Suppose you randomly allocate 700 runs among 162 games, and then you realize you made a mistake: you only meant to allocate 699 runs. You'd look up the 700th run you added, and reverse it. 

But, that 700th run is more likely to come from a high-scoring game than a low-scoring game. Why? Because, before you added the last run, the game you were about to add it to was as average as the 161 other games. But after you add the run, that game must now be expected to be one run more than average. (Actually, 699/700 more, but close enough).

So, if you removed a 700th run by choosing a random game first, you'd be choosing it from an expected average game, not an expected above-average game. And so your distribution would be more bunched up than it should be, and it would no longer be the same as the distribution you'd get if you just stopped at 699 runs.

And, of course, you might randomly choose a shutout, which brings that game's runs to -1, proving more obviously that your distribution is wrong.

You don't actually have to reverse the 700th run ... there's nothing special about that one compared to the other 699. You can pick the first run, or the 167th run, or a random run. But you have to choose a particular run without regard to the game it's in, or any other context.
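Here's a rough simulation sketch of that argument in Python. It assumes each of the 700 runs lands in a uniformly random game (which is not how real baseball scoring works), and the function name, trial count, and seed are arbitrary choices for illustration. It compares the average number of runs in the game you end up in when you choose a game first versus when you choose a run first.

```python
import random

def average_runs_in_chosen_game(total_runs=700, games=162, trials=1000, seed=2018):
    """Return (average runs in a randomly chosen game,
               average runs in the game containing a randomly chosen run)."""
    rng = random.Random(seed)
    sum_game_first = 0
    sum_run_first = 0
    for _ in range(trials):
        # Drop each run into a uniformly random game, remembering where it went.
        game_of_run = [rng.randrange(games) for _ in range(total_runs)]
        runs_in_game = [0] * games
        for g in game_of_run:
            runs_in_game[g] += 1
        # Method 1: choose a game first.
        sum_game_first += runs_in_game[rng.randrange(games)]
        # Method 2: choose a run first, then look at the game it belongs to.
        sum_run_first += runs_in_game[rng.choice(game_of_run)]
    return sum_game_first / trials, sum_run_first / trials

if __name__ == "__main__":
    game_first, run_first = average_runs_in_chosen_game()
    print(f"game first: {game_first:.2f} runs, run first: {run_first:.2f} runs")
```

Under these assumptions, the run-first average should come out roughly one run higher than the game-first average, which is the gap the argument describes.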


Why does a random run have a different value from a run from a random game? 

Because the probabilities change. 

For one thing, you're now much less likely to choose a game where you only allowed one run. You probably won those games anyway, so those runs are less valuable than average. Since you choose less valuable runs less often than before, the value of the run goes up.

But, for another thing, you're now much more likely to choose a game where you gave up a lot of runs. You probably lost those games anyway, so the saved run again probably wouldn't help; you'd just lose 8-3 instead of 9-3. Since you're more likely to choose these less-valuable runs than before, the value of the run goes down.

So some runs where the value is low, you're more likely to choose. Others, you're less likely to choose. Which effect dominates? I don't think we can decide easily from this line of thinking alone. We'd have to do some number crunching.

If we did, we'd find out (as the other argument proved) that "choose a run instead of a game" makes runs prevented more valuable when you already score more than you allow, but less valuable when you allow more than you score. 

But, I don't see a way to prove that from this argument. If you do, let me know!


Finally, let me make one part of the argument clearer. Specifically, why is it OK to pick a random game when adding a run *scored*, but not when subtracting a run *allowed*? Shouldn't it be symmetrical?

Actually, it *is* symmetrical.

When you add a run, you're taking a non-run and changing it to a run. Well, there are so many occurrences of non-runs that they're roughly equal in every game. If you think about changing an out to a run, every game has roughly 27 outs, so every game is already equal.

If you think about hockey ... say, every 15-second interval has a chance of a goal. That's 240 segments per game. In a two-goal game, there are 238 non-goal segments that can be converted into a goal. In a 10-goal game, there are only 230 segments. But 230 is so close to 238 that you can treat them as equal.*

(* In a true Poisson distribution, they're exactly equal, because you model the game as an infinite number of intervals. Infinity minus 2 is equal to infinity minus 10.)

When you subtract a run ... the process is symmetrical, but the numbers are different. A two-goal game has only two chances to convert a goal to a non-goal, while a ten-goal game has ten -- five times as many. Instead of a 230:238 ratio, you have a 2:10 ratio. The 2 and 10 aren't close enough to treat as equal.

In theory, the two cases are symmetrical in the sense that both are wrong. But, in practice, choosing goals scored by game is wrong but close enough to treat as right. Choosing goals allowed by game is NOT close enough to treat as right.
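The two ratios can be worked out in a few lines of Python. This is only an illustration of the arithmetic above; it assumes at most one goal per 15-second segment, and the helper name is made up.

```python
SEGMENTS = 240  # a 60-minute game split into 15-second intervals

def slots(goals):
    """Return (non-goal segments that could become a goal,
               goals that could become a non-goal)."""
    return SEGMENTS - goals, goals

low_add, low_remove = slots(2)     # two-goal game
high_add, high_remove = slots(10)  # ten-goal game

# Adding a goal: the two games offer almost the same number of chances.
print(low_add, high_add, round(low_add / high_add, 3))

# Removing a goal: the high-scoring game offers five times as many chances.
print(low_remove, high_remove, high_remove / low_remove)
```

The first ratio is within a few percent of 1, so choosing by game is a decent shortcut when adding. The second is 5, so it isn't when subtracting.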

The fact that goals are rare compared to non-goals is what makes the difference. That difference is why the statistics textbooks say that Poisson is used for the distribution of "rare events."  

Goals are rare events. Non-goals are not.


Tuesday, November 27, 2018

A flawed argument that marginal offense and defense have equal value

Last post, I argued that a defensive run saved isn't necessarily as valuable as an extra offensive run scored.  

But I didn't realize that was true right away.  Originally, I thought that they had to be equal.  My internal monologue went like this:

Imagine giving a team an extra run of offense over a season.  You pick a random game, and add on a run, and see if that changes the result.  Maybe it turns an extra-inning loss into a nine-inning win, or turns a one-run loss into an extra-inning game.  Or, maybe it just turns an 8-3 blowout into a 9-3 blowout.

(It turns out that every ten games, that run will turn a loss into a win ... see here.  But that's not important right now.)

But, it will always be the same as giving them an extra run of defense, right?  Because, it doesn't matter if you turn a 5-4 loss into a 5-5 tie, or into a 4-4 tie.  And it doesn't matter if you turn an 8-3 blowout into a 9-3 blowout, or into a 8-2 blowout.  

Any time one more run scored will change the result of a game, one less run allowed will change it in exactly the same way!  So, how can the value of the run scored possibly be different from the value of the run allowed?

That argument is wrong.  It's obvious to me now why, but it took me a long time to find the flaw.

Maybe you're faster than I was, and maybe you have an easier explanation than I do.  Can you figure out what's wrong with this argument?  

(I'll answer next post if nobody gets it.  Also, it helps to think of runs (or goals, or points) as Poisson, even if they're not.)


Monday, October 01, 2018

When is defense more valuable than offense?

Is it possible, as a general rule, for a run prevented to be worth more than a run scored?

I don't think so. 

Suppose every team in the league scored one fewer run, and allowed one fewer run. If runs prevented were more valuable than runs scored, every team would improve. But, then, the league would no longer balance out to .500.

But the values of offensive and defensive runs *are* different for individual teams.

Suppose a team scores 700 runs and allows 600. That's an expected winning percentage of .57647 (Pythagoras, exponent 2). 

Suppose it gains a run of offense, so it scores 701 instead of 700. At 701-600, its expectation becomes .57717, an improvement of .00070.

Now, instead, suppose its extra run comes on defense, and it goes 700-599. Now, its expectation is .57728, an improvement of .00081.

So, for that team, the run saved is more valuable than the run scored.
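For anyone who wants to check the arithmetic, here's a short Python sketch of the Pythagorean calculation with exponent 2. The function and variable names are just for illustration.

```python
def pythag_wpct(runs_scored, runs_allowed, exponent=2):
    """Pythagorean expected winning percentage."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

base = pythag_wpct(700, 600)
gain_offense = pythag_wpct(701, 600) - base  # one extra run scored
gain_defense = pythag_wpct(700, 599) - base  # one fewer run allowed

print(f"{base:.5f} {gain_offense:.5f} {gain_defense:.5f}")  # 0.57647 0.00070 0.00081
```

Swapping the inputs, so the team scores 600 and allows 700, lets you check the opposite case the same way.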

It turns out that if a team scores more than it allows, a run on defense is more valuable than a run on offense. If a team allows more than it scores, the opposite is true. 


Just recently, I figured out an intuitive way to show why that happens, without having to use Pythagoras at all. I'm going to switch from baseball to hockey, because if you assume that goals scored have a Poisson distribution, the explanation works out easier.

Suppose the Edmonton Oilers score 5 goals per game, and allow 4. If they improve their offense by a goal a game, the 5-4 advantage becomes 6-4. If they improve their defense by a goal, the 5-4 becomes 5-3.

Which is better? 

Even though both scenarios have the Oilers scoring an average two more goals than the opposition, that doesn't happen every game, because there's random variation in how the goals are distributed among the games. With zero variation, the Oilers win every game 5-3 or 6-4. But, with the kind of variation that actually occurs, there's a good chance that the Oilers will lose some games. 

For instance, Edmonton might win one game 7-1, but lose the next 5-3. Over those two games, the Oilers do indeed outscore their opponents by two goals a game, on average, but they lose one of the two games.

The average is "Oilers finish the game +2". The Oilers fail to win when the result comes in at least two goals below that average. In other words, when the result varies from expectation by -2 goals or more: exactly -2 leaves the game tied, and anything beyond that is a loss.

The more variation around the mean of +2, the greater the chance the Oilers lose. Which means the team with the advantage wants less variation in scores, and the underdog wants more variation.

Now, let's go to the assumption that goals follow a Poisson distribution.*  

(*Poisson is the distribution you get if you assume that in any given moment, each team has its own fixed probability of scoring, independent of what happened before. In hockey, that's a reasonable approximation -- not perfect, but close enough to be useful.)

For Poisson-distributed goals, the SD of the goal difference is exactly the square root of the expected total goals: the two teams' averages added together.

In the 5-3 case, the SD of goal differential is the square root of 8. In the 6-4 case, the SD is the square root of 10. Since root-10 is higher than root-8, the underdog should prefer 6-4, but the favored Oilers should prefer 5-3.

Which means, for the favorite, a goal of defense is more valuable than a goal of offense.
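You can check this directly with a few lines of Python. The sketch below assumes each team's goals are independent Poisson with the stated per-game averages, and it counts a level score as a tie rather than modelling overtime. The function names and the 60-goal cutoff are my own choices for illustration.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals are Poisson with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def outcome_probs(mean_for, mean_against, max_goals=60):
    """Return (win, tie, loss) probabilities for two independent Poisson teams.
    Scores above max_goals are ignored; at these averages they are negligible."""
    win = tie = loss = 0.0
    for gf in range(max_goals + 1):
        p_for = poisson_pmf(gf, mean_for)
        for ga in range(max_goals + 1):
            p = p_for * poisson_pmf(ga, mean_against)
            if gf > ga:
                win += p
            elif gf == ga:
                tie += p
            else:
                loss += p
    return win, tie, loss

for goals_for, goals_against in ((5, 3), (6, 4)):
    win, tie, loss = outcome_probs(goals_for, goals_against)
    sd = math.sqrt(goals_for + goals_against)  # SD of the goal difference
    print(f"{goals_for}-{goals_against}: SD {sd:.2f}, "
          f"win {win:.3f}, tie {tie:.3f}, loss {loss:.3f}")
```

Under those assumptions, the 5-3 team should show both a higher chance of winning and a lower chance of losing than the 6-4 team, even though the average margin is +2 either way.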

This "proof" is only for Poisson, but, for the other sports, the same logic holds. In baseball, football, soccer, and basketball, the more goals/runs/points per game, the more variation around the expectation.

Think about what a two goal/point/run spread means in the various sports leagues. In the NBA, where 200 points are scored per game, a 2-point spread is almost nothing. In the NFL, it means more. In MLB, it means a lot more. In the NHL, more still. And, in soccer, where the average is fewer than three goals per game, a two-goal advantage is almost insurmountable.
