Tuesday, December 18, 2018

Does the NHL's "loser point" help weaker teams?

Back when I calculated that it took 73 NHL games for skill to catch up with luck in the standings, I was surprised it was so high. That's almost a whole season. In MLB, it was less than half a season, and in the NBA, Tango found it was only 14 games, less than one-fifth of the full schedule.

Seventy-three games seemed like a lot of luck. Why so much? As it turns out, it was an anomaly -- the NHL was just having an era where differences in team talent were small. Now, it's back under 40 games.

But I didn't know that at the time, so I had a different explanation: it must be the extra point the NHL started giving out for overtime losses. The "loser point," I reasoned, was reducing the importance of team talent, by giving the worse teams more of a chance to catch up to the better teams.

My line of thinking was something like this: 


1. Loser points go disproportionately to worse teams. For team-seasons, there's a correlation of around .4 between negative goal differential (a proxy for team quality) and OTL. So, the loser point helps the worse teams gain ground on the better teams.

2. Adding loser points adds more randomness. When you lose by one goal, whether that goal comes early in the game, or after the third period, is largely a matter of random chance. That adds "when the goals were" luck to the "how many goals there were" luck, which should help mix up the standings more. In fact, as I write this, the Los Angeles Kings have two more wins and three fewer losses than the Chicago Blackhawks. But, because Chicago has five OTL to the Kings' one, they're actually tied in the standings.

But ... now I realize that argument is wrong. And, the conclusion is wrong. It turns out the loser point actually does NOT help competitive balance in the NHL. 

So, what's the flaw in my old argument? 

-------

I think the answer is: the loser point does affect how compressed the standings get in terms of actual points, but it doesn't have much effect on the *order* of teams. The bottom teams wind up still at the bottom, but (for instance) instead of having only half as many points as the top teams, they have two-thirds as many points.

Here's one way to see that. 

Suppose there's no loser point, so the winner always gets two points and the loser always gets none (even if it was an overtime or shootout loss). 

Now, make a change so the losing team gets a point, but *every time*. In that case, the difference between any two teams gets cut in half, in terms of points -- but the order of teams stays exactly the same. 

The old way, if you won W games, your point total was 2W. Now, it's W+82. Either way, the order of standings stays the same -- it's just that the differences between teams are cut in half, numerically.

It's still true that the "loser point" goes disproportionately to the worse teams -- the 50-32 team gets only 32 loser points, while the 32-50 team gets 50 of them. But that doesn't matter, because those points are never enough to catch up to any other team. 

If you ran the luck vs. skill numbers for the new system compared to the old system, it would work out exactly the same.
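A quick way to convince yourself of this is to score a handful of records both ways. The sketch below uses made-up win totals (hypothetical teams A through E, not real ones) and checks that 2W and W+82 put the teams in the same order, with the gaps cut in half.

```python
# Hypothetical win totals over an 82-game season (illustrative, not real teams).
GAMES = 82
wins = {"A": 50, "B": 45, "C": 41, "D": 37, "E": 32}

def points_no_loser_point(w):
    # Two points per win, nothing for a loss.
    return 2 * w

def points_always_loser_point(w):
    # Two points per win plus one point for every loss: 2w + (82 - w) = w + 82.
    return 2 * w + (GAMES - w)

def ranking(points_fn):
    # Team names sorted from most points to fewest.
    return sorted(wins, key=lambda team: points_fn(wins[team]), reverse=True)

old_order = ranking(points_no_loser_point)
new_order = ranking(points_always_loser_point)

# Gap between the best and worst team under each system.
old_gap = points_no_loser_point(wins["A"]) - points_no_loser_point(wins["E"])
new_gap = points_always_loser_point(wins["A"]) - points_always_loser_point(wins["E"])

print(old_order == new_order)  # the order is identical
print(old_gap, new_gap)        # the second gap is half the first
```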

-------

In real life, of course, the losing team doesn't get a point every time: only when it loses in overtime. Last season, that happened in about 11.6 percent of games, league-wide, or about 23.3 percent of losses.

If the loser point happened in *exactly* 23.3 percent of losses, for every team, with no variation, the situation would be the same as before -- the standings would get compressed, but the order wouldn't change. It would be as if, every loss, the loser got an extra 0.233 points. No team could pass any other team, since for every two points it was behind, it only gets 0.233 points to catch up. 

But: what if you assume that it's completely random which losses become overtime losses? Now, the order can change. A 40-42 team can catch up to a 41-41 team if its losses happen to include two more overtime losses than its rival's. The chance of that happening is helped by the fact that the 40-42 team has one extra loss to try to randomly convert. It needs two random points to catch up, but it starts with a positive expectation of a 0.233-point head start.
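To put a rough number on that, here's a minimal sketch under the purely random assumption: each loss independently becomes an overtime loss with probability 0.233, so each team's OTL count is binomial. The function names are mine, and "catch up" here means at least tying in points.

```python
from math import comb

P_OTL = 0.233  # share of losses that become overtime losses (figure from the text)

def binom_pmf(n, k, p):
    # Probability of exactly k successes in n independent tries.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_catch_up(losses_trailing, losses_leading, deficit, p=P_OTL):
    # P(X - Y >= deficit), where X and Y are the two teams' OTL counts,
    # assumed independent: X ~ Bin(losses_trailing, p), Y ~ Bin(losses_leading, p).
    total = 0.0
    for y in range(losses_leading + 1):
        p_y = binom_pmf(losses_leading, y, p)
        for x in range(y + deficit, losses_trailing + 1):
            total += p_y * binom_pmf(losses_trailing, x, p)
    return total

# The 40-42 team trails the 41-41 team by two points before loser points.
p_tie_or_pass = prob_catch_up(42, 41, deficit=2)
print(round(p_tie_or_pass, 3))
```

The point isn't the exact figure, just that under pure randomness the chance is far from negligible, which is why the random model would make luck matter more.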

If losses became overtime losses in a random way, then, yes, the OTL would make luck more important, and my argument would be correct. But they don't. It turns out that better teams turn losses into OTL much more frequently than worse teams, on a loss-for-loss basis.

Which makes sense. Worse teams' losses are more likely to be blowouts, which means they're less likely to be close losses. That means fewer one-goal losses, proportionately. 

In other words: 

(a) bad teams have more losses, but 
(b) those losses are less likely to result in an OTL. 

Those two forces work in opposite directions. Which is stronger?

Let's run the numbers from last year to find out.

If we just gave two points for a win, and zero for a loss, we'd have: 

SD(observed)=16.47
SD(luck)    = 9.06
SD(talent)  =13.76

But in real life, which includes the OTL, the numbers are

SD(observed)=15.44
SD(luck)    = 8.48
SD(talent)  =12.90

Converting so we can compare luck to talent:

35.5 games until talent=luck (no OTL point)
35.4 games until talent=luck (with OTL point)

It turns out, the two factors almost exactly cancel out! Bad teams have more chances for an OTL point because they lose more -- but those losses are less likely to be OTL almost in exact proportion.
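For anyone who wants to check the arithmetic, here's a small sketch of the conversion: subtract the luck variance from the observed variance to get talent, then square the ratio of SD(luck) to SD(talent) and multiply by 82 games. The function names are my own. Because the inputs are the rounded SDs shown above, the results can land a tenth of a game or so away from the figures quoted.

```python
from math import sqrt

GAMES = 82

def talent_sd(sd_observed, sd_luck):
    # SD(talent)^2 = SD(observed)^2 - SD(luck)^2
    return sqrt(sd_observed**2 - sd_luck**2)

def games_until_even(sd_observed, sd_luck, games=GAMES):
    # Number of games at which SD(talent) equals SD(luck).
    sd_talent = talent_sd(sd_observed, sd_luck)
    return games * (sd_luck / sd_talent) ** 2

no_otl = games_until_even(16.47, 9.06)    # two points for a win, zero for any loss
with_otl = games_until_even(15.44, 8.48)  # actual standings, loser point included

print(round(no_otl, 1), round(with_otl, 1))
```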

And that's why I was wrong -- why the OTL point doesn't increase competitive balance, or make the standings less predictable. It just makes the NHL *look* more competitive, by making the point differences smaller.



Wednesday, December 12, 2018

2007-12 was an era of competitive balance in the NHL

Five years ago, I calculated that in the NHL, it took 73 games until talent was as important as luck in determining the standings. But in a previous study, Tango found that it took only 36 games. 

Why the difference?

I think it's because the years for which I ran the study -- 2006-07 to 2011-12 -- were seasons in which the NHL was much more balanced than usual. 

For each of those six seasons, I went to hockey-reference to find the SD of team standings points:

SD(observed)
--------------
2006-07  16.14
2007-08  10.43
2008-09  13.82
2009-10  12.95
2010-11  13.27
2011-12  11.73
--------------
average  13.18  (root mean square)

Tango's study was written in August, 2006. The previous season had a higher spread:

SD(observed)
--------------
2005-06  16.52

So, I think that's the answer. It just happened that the seasons I looked at had more competitive balance than the season or seasons Tango looked at.

But what's the right answer for today's NHL? Well, it looks like the standings spread in recent seasons has moved back closer to Tango's numbers:

SD(observed)
--------------
2013-14  14.26
2014-15  15.91
2015-16  12.86
2016-17  15.14
2017-18  15.44
--------------
average  14.76

What does that mean for the "number of games" estimate? I'll do the calculation for last season, 2017-18.

From the chart, SD(observed) is 15.44 points. SD(luck) is roughly the same for all years of the shootout era (although it varies very slightly with the number of overtime losses), so I'll use the old study's number of 8.44 points. 

As usual, 

SD(talent)^2 = SD(observed)^2 - SD(luck)^2
SD(talent)^2 = 15.44^2 - 8.44^2
SD(talent)   = 12.93

So last year, SD(talent) was 12.93. For the six seasons I looked at, it was 8.95. 

SD(talent)
--------------
2006-12   8.95
2017-18  12.93

Now, let's convert to games.*  

*Specifically, "luck as important as talent" means SD(luck)=SD(talent). Formula: using the numbers for a full season, divide SD(luck) by SD(talent), square it, and multiply by the number of games (82).
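The formula follows from how the two SDs grow with the number of games, assuming games are independent: talent's contribution to the standings grows in proportion to n, while luck's grows only with the square root of n. Here's a rough sketch (the function names are just illustrative) that finds the crossover by brute force and compares it to the closed form. With rounded SDs as inputs, the answers can come out a game or so away from the figures in the text.

```python
from math import sqrt

GAMES = 82

def sd_after(n, sd_talent_full, sd_luck_full, games=GAMES):
    # Scale full-season SDs (in points) down to the first n games.
    talent_n = sd_talent_full * n / games    # talent accumulates linearly
    luck_n = sd_luck_full * sqrt(n / games)  # independent luck grows like sqrt(n)
    return talent_n, luck_n

def crossover_by_search(sd_talent_full, sd_luck_full, games=GAMES):
    # First whole number of games at which talent's SD reaches luck's SD.
    if sd_talent_full <= 0:
        raise ValueError("talent SD must be positive")
    n = 1
    while True:
        talent_n, luck_n = sd_after(n, sd_talent_full, sd_luck_full, games)
        if talent_n >= luck_n:
            return n
        n += 1

def crossover_closed_form(sd_talent_full, sd_luck_full, games=GAMES):
    # Setting talent_n equal to luck_n and solving for n gives the footnote's formula.
    return games * (sd_luck_full / sd_talent_full) ** 2

searched = crossover_by_search(12.93, 8.44)
closed = crossover_closed_form(12.93, 8.44)
print(searched, round(closed, 1))
```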

When SD(talent) is 8.95, like the seasons I looked at, it takes 77 games for luck and talent to even out. When SD(talent) = 12.93, like it was last year, it takes only ... 36 games.

Coincidentally, 36 games is exactly what Tango found in his own sample.

talent=luck, after
------------------
2006-12  77 games
2017-18  36 games

Two things we can conclude from this:

1. Actual competitive balance (in terms of talent) does seem to change over time in non-random ways. The NHL from 2006-12 does actually seem to have been a more competitive league than from 2013-18. 

2. The "number of games" way of expressing the luck/talent balance is very sensitive to moderate changes in the observed range of the standings.

--------

To expand a bit on #2 ... 

There must be significant random fluctuations in observed league balance.  We mention that sometimes in passing, but I think we don't fully appreciate how big those random fluctuations can be.

Here, again, is the SD(observed) for the seasons 2014-17:

SD(observed)
--------------
2014-15  15.91
2015-16  12.86
2016-17  15.14

It seems unlikely that 2015-16 really had that much tighter a talent distribution than the surrounding seasons. What probably happened, in 2015-16, was just a fluke -- the lucky teams happened to be lower-talent, and the unlucky teams happened to be higher-talent. 

In other words, the difference was probably mostly luck. 

A different kind of luck, though -- luck in how each individual team's "regular" luck correlated, league-wide, with their talent. When the better teams (in talent) are luckier than the worse teams, the standings spread goes up. When the worse teams are luckier, the standings get compressed.
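The size of that effect is easy to sketch with the variance-of-a-sum formula. The talent SD of 12.5 and the correlations of plus or minus 0.2 below are assumptions I picked for illustration, not estimates from any season.

```python
from math import sqrt

def observed_sd(sd_talent, sd_luck, corr):
    # Var(talent + luck) = Var(talent) + Var(luck) + 2 * corr * SD(talent) * SD(luck)
    return sqrt(sd_talent**2 + sd_luck**2 + 2 * corr * sd_talent * sd_luck)

SD_TALENT = 12.5  # assumed "true" talent spread, for illustration only
SD_LUCK = 8.44    # the luck SD used in this post

# Hypothetical league-wide correlations between a team's talent and its luck.
for corr in (-0.2, 0.0, 0.2):
    print(corr, round(observed_sd(SD_TALENT, SD_LUCK, corr), 2))
```

With the same underlying talent, a modest chance correlation in either direction moves the observed SD by more than a point, which is the same order of magnitude as the season-to-season swings in the chart.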

Anyway ... the drop in the chart from 15.91 to 12.86 doesn't seem that big. But it winds up looking bigger once you subtract out luck to get to talent:

SD(talent)
--------------
2014-15  13.49
2015-16   9.70
2016-17  12.57

The difference is more pronounced now. But, check out what happens when we convert to how many games it takes for luck and talent to even out:

Talent=luck, after
------------------
2014-15  32 games
2015-16  62 games
2016-17  37 games

Now, the differences are too large to ignore. From 2014-15 to 2015-16, SD(observed) went down only 19 percent, but the "number of games" figure nearly doubled.

And that's what I mean by #2 -- the "number of games" estimate is very sensitive to what seem like mild changes in standings variation. 
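Here's a short sketch that reproduces the three-season comparison from the observed SDs, using the 8.44 luck SD and the conversion described earlier in the post. The function name is mine. It also shows where the sensitivity comes from: the denominator is a difference of squares, which shrinks quickly as SD(observed) gets closer to SD(luck).

```python
GAMES = 82
SD_LUCK = 8.44  # luck SD assumed constant across seasons, as in the post

sd_observed = {"2014-15": 15.91, "2015-16": 12.86, "2016-17": 15.14}

def games_until_even(sd_obs, sd_luck=SD_LUCK, games=GAMES):
    # games * SD(luck)^2 / SD(talent)^2, with SD(talent)^2 = SD(obs)^2 - SD(luck)^2
    talent_var = sd_obs**2 - sd_luck**2
    return games * sd_luck**2 / talent_var

results = {season: round(games_until_even(sd)) for season, sd in sd_observed.items()}
print(results)  # {'2014-15': 32, '2015-16': 62, '2016-17': 37}

# Relative change from 2014-15 to 2015-16 in each measure.
drop_in_sd = 1 - sd_observed["2015-16"] / sd_observed["2014-15"]
rise_in_games = games_until_even(12.86) / games_until_even(15.91)
print(round(drop_in_sd, 2), round(rise_in_games, 2))
```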

-------

Just for fun, let's compare 2006-07, one of the most unbalanced seasons, to 2007-08, one of the most balanced. Just looking at the standings, there's already a big difference:

SD(observed)
--------------
2006-07  16.14
2007-08  10.43

But it becomes *huge* when you express it in games: 

Talent=luck, after
------------------
2006-07   31 games
2007-08  156 games

In one year, our best estimate of how many games it takes for talent to exceed luck changed by a factor of *five*. And, I think, almost all that difference is itself just random luck.








Thursday, May 03, 2018

NHL referees balance penalty calls between teams


That finding, from Michael Lopez, shows that the next penalty in an NHL game is significantly less likely to go to the team that's had more penalties so far in the game.

That was a new finding to me. A few years ago, I found that the next penalty is more likely to go to the team that had the (one) most recent penalty -- but I hadn't realized that quantity matters, too.

(My previous research can be found here: part one, two, three.)

So, I dug out my old hockey database to see if I could extend Michael's results. All the findings here are based on the same data as my other study -- regular season NHL games from 1953-54 to 1984-85, as provided by the Hockey Summary Project as at the end of 2011.

-------

Quickly revisiting the old finding: referees do appear to call "make-up" penalties. The team that got the benefit of the most recent power play is almost 50 percent more likely to have the next call go against them. That team got the next penalty 59.7% of the time, versus only 40.3% for the previously penalized team.

39599/98167 .403 -- team last penalized
58568/98167 .597 -- other team
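The "almost 50 percent more likely" figure comes straight from those two counts. A tiny sketch, with variable names of my own:

```python
last_penalized = 39599  # next penalty went to the team penalized most recently
other_team = 58568      # next penalty went to the other team
total = last_penalized + other_team

share_other = other_team / total                   # share of next calls going to the other team
relative_likelihood = other_team / last_penalized  # other team vs. team last penalized

print(total, round(share_other, 3), round(relative_likelihood, 2))
```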

Now, let's look at total numbers of penalties instead. I've split the data into home and road teams, because road teams do get more penalties -- 52 percent vs. 48 percent overall.  (That difference is mitigated by the fact that referees balance out the calls. The first penalty of the game goes to the road team 54 percent of the time. The drop from 54 percent for the first call, down to 52 percent overall, is due to the referees balancing out the next call or calls.)

So far, nothing exciting. But here's something. It turns out that the *second* call of the game is much more likely than average to be a makeup call:

.703 -- visiting penalty after home penalty
.297 -- home penalty after home penalty

.653 -- home penalty after visiting penalty
.347 -- visiting penalty after visiting penalty

Those numbers are huge. Overall, there are more than twice as many "make up" calls as "same team" calls.

In this case, quantity and recency are the same thing. Let's move on to the third penalty of the game, where they can be different.  From now on, I'll show the results in chart form:

.705 0-2 
.462 1-1
.243 2-0

Here's how to read the chart: when the home team has gone "0-2" in penalties -- that is, both previous penalties to the visiting team -- it gets 70.5% of the third penalties. When the previous two penalties were split, the home team gets 46.2%, similar to the overall average. When the home team got both previous penalties, though, it draws the third only 24.3% of the time (in other words, the visiting team drew 75.7%).

Here's the fourth penalty. I've added sample sizes, in parentheses.

.701 0-3 (755)
.559 1-2 (6951)
.373 2-1 (5845)
.261 3-0 (468)

It's a very smooth progression, from .701 down to .261, exactly what you would expect given that make-up calls are so common. 

Here's the fifth penalty:

.677 0-4 ( 195)
.619 1-3 (3244)
.465 2-2 (6950)
.351 3-1 (2306)
.316 4-0 ( 117)

That's the chart that corresponds to Michael Lopez's tweet, and if you scroll back up you'll see that these numbers are pretty close to his.

Sixth penalty:

.667 0-5 (  48)
.637 1-4 (1182)
.520 2-3 (4930)
.413 3-2 (4134)
.323 4-1 ( 773)
.226 5-0 (  31)

Again, the percentages drop every step ("monotonically," as they say in math).

Seventh penalty:

.692 0-6 (  13)
.585 1-5 ( 369)
.577 2-4 (2528)
.489 3-3 (4140)
.399 4-2 (1798)
.379 5-1 ( 219)
.200 6-0 (  13)

Eighth penalty:

.667 0-7 (   3)
.607 1-6 ( 122)
.588 2-5 ( 969)
.527 3-4 (2721)
.422 4-3 (2414)
.374 5-2 ( 652)
.412 6-1 (  68)
.000 7-0 (   1)

Still a nearly perfect pattern; the only step that goes the wrong way is the 6-1 row, which has just 68 cases. It breaks up a little more here, for the ninth penalty, but that's probably just small sample size.

.000 0-8 (   1)
.553 1-7 (  38)
.586 2-6 ( 348)
.566 3-5 (1358)
.484 4-4 (2063)
.392 5-3 (1037)
.340 6-2 ( 191)
.333 7-1 (  21)
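One way to check the "drops every step" claim without eyeballing it is to flag any row whose rate is higher than the row above it. This is a rough sketch using the eighth- and ninth-penalty charts; the 100-case cutoff for "large enough sample" is my own arbitrary assumption.

```python
# (home team's share of the next penalty, row label, sample size)
EIGHTH = [(0.667, "0-7", 3), (0.607, "1-6", 122), (0.588, "2-5", 969),
          (0.527, "3-4", 2721), (0.422, "4-3", 2414), (0.374, "5-2", 652),
          (0.412, "6-1", 68), (0.000, "7-0", 1)]
NINTH = [(0.000, "0-8", 1), (0.553, "1-7", 38), (0.586, "2-6", 348),
         (0.566, "3-5", 1358), (0.484, "4-4", 2063), (0.392, "5-3", 1037),
         (0.340, "6-2", 191), (0.333, "7-1", 21)]

def wrong_way_rows(chart, min_n=0):
    # Keep rows with at least min_n cases, then list the rows whose rate
    # is higher than the previous kept row (a break in the downward pattern).
    rows = [row for row in chart if row[2] >= min_n]
    return [cur[1] for prev, cur in zip(rows, rows[1:]) if cur[0] > prev[0]]

print(wrong_way_rows(EIGHTH), wrong_way_rows(NINTH))
print(wrong_way_rows(EIGHTH, min_n=100), wrong_way_rows(NINTH, min_n=100))
```

In both charts, every break involves a row with fewer than 100 cases, which fits the small-sample explanation.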

(This is getting boring, so here's a technical note to break the monotony. I included all penalties, including misconducts. I omitted all cases where both teams took a penalty at the same time, even if one team took more penalties than the other. In fact, I treated those as if they never happened, so they don't break the string. This may cause the results to be incorrect in some cases: for instance, maybe Boston takes a minor, then there's a fight and Montreal gets a major and a minor while Boston gets only a major. Then, Montreal takes a minor. In that case, the study will treat the Montreal minor as a make-up call, when it's really not. I think this happens infrequently enough that the results are still valid.)

I'll give two more cases. Here's the twelfth penalty:

.692 2-9 ( 13)
.623 3-8 ( 61)
.532 4-7 (250)
.506 5-6 (478)
.488 6-5 (459)
.449 7-4 (198)
.457 8-3 ( 35)
.200 9-2 (  5)

Almost perfect.  But ... the pattern does seem to break down later on, at the 14th to 16th penalty (I stopped at 16), probably due to sample size issues. Here's the fourteenth, which I think is the most random-looking of the bunch. You could almost argue that it goes the "wrong way":

.000  2-11 (  1)
.375  3-10 (  8)
.333  4- 9 ( 27)
.516  5- 8 ( 95)
.438  6- 7 (169)
.480  7- 6 (148)
.465  8- 5 ( 71)
.577  9- 4 ( 26)
.600 10- 3 (  5)

Still, I think the overall conclusion isn't threatened, that quantity is a factor in make-up calls.

------

OK, so now we know that quantity matters. But couldn't that mean that recency doesn't matter? We did find that the team with the most recent penalty was less likely to get the next one -- but that might just be because that team is also more likely to have a higher quantity at that point. After all, when a team takes three of the first four penalties, there's a 75 percent chance* it also took the most recent one. 

(* It's actually not 75 percent, because make-up calls make the sequence non-random. But the point remains.)

So, maybe the recency effect is just an illusion, created by the quantity effect. Or vice versa.

So, here's what I did: I broke down every row in every table by who got the more recent call. It turns out: recency does matter.

Let's take that 3-for-4 example I just used:

.619 home team overall     (3244)
---------------------------------
.508 after VVVH            ( 486)
.639 after other sequences (2758)

From this, it looks like both effects are at work. When the home team is "up 3-1" in penalty advantage, it gets only 51 percent of the next penalties if its one penalty was the most recent of the four, versus 64 percent otherwise. That 51 percent is still more than the 46.1 percent it gets to start the game, or the 46.5 percent it would get if it had been 2-2 instead of 3-1.

This seems to be true for most of the breakdowns -- maybe even all the ones with large enough sample sizes. I'll just arbitrarily pick one to show you ... the ninth penalty, with the home team at 5-3 in the chart's notation (five penalties against it, three against the visitors).

.392 home team overall     (1037)
---------------------------------
.362 when most recent was H (743)
.469 when most recent was V (294)
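One sanity check on splits like these: the overall rate should equal the sample-weighted average of the two sub-rows. Here's a small sketch using the rates and counts from the two breakdowns above; since the inputs are rounded to three decimals, the recombined figures are approximate. The first works out to about .619, the 1-3 row of the fifth-penalty chart, and the second to about .392.

```python
def combine(splits):
    # splits: list of (rate, sample_size); returns the sample-weighted average rate.
    total = sum(n for _, n in splits)
    return sum(rate * n for rate, n in splits) / total

# Fifth penalty, visitors penalized three times out of four (3,244 cases).
three_of_four = combine([(0.508, 486), (0.639, 2758)])

# Ninth penalty, the row with 1,037 cases.
ninth_penalty = combine([(0.362, 743), (0.469, 294)])

print(round(three_of_four, 3), round(ninth_penalty, 3))
```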

Even better: here's the entire chart for the eighth penalty: overall vs. last penalty went to home team ("last H") vs. last penalty went to visiting team ("last V").

overall   last H    last V
----------------------------------
 .607      .750      .596      1-6 
 .588      .477      .609      2-5 
 .527      .446      .584      3-4 
 .422      .372      .518      4-3 
 .374      .357      .466      5-2 
 .412      .406      .500      6-1 

Clearly, both recency and quantity matter. Holding one constant, the other still follows the "make-up penalty" pattern. 
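A crude way to see both effects in that chart is to compare across columns for recency, and down the rows for quantity. This is only a sketch: the post doesn't give sample sizes for the individual splits, so nothing here is weighted, and the choice of the 2-5 and 5-2 rows as the "quantity" comparison is mine.

```python
# Eighth penalty: (row, home share when last call was H, home share when last call was V)
EIGHTH_SPLIT = [("1-6", 0.750, 0.596), ("2-5", 0.477, 0.609), ("3-4", 0.446, 0.584),
                ("4-3", 0.372, 0.518), ("5-2", 0.357, 0.466), ("6-1", 0.406, 0.500)]

# Recency, holding quantity constant: how much more often the home team is
# penalized next when the previous call went to the visitors.
recency_gap = {row: round(last_v - last_h, 3) for row, last_h, last_v in EIGHTH_SPLIT}

# Quantity, holding recency constant: drop from the 2-5 row to the 5-2 row.
by_row = {row: (last_h, last_v) for row, last_h, last_v in EIGHTH_SPLIT}
quantity_drop_last_h = by_row["2-5"][0] - by_row["5-2"][0]
quantity_drop_last_v = by_row["2-5"][1] - by_row["5-2"][1]

print(recency_gap)
print(round(quantity_drop_last_h, 3), round(quantity_drop_last_v, 3))
```

The recency gap is positive in every row except 1-6, where the "last H" figure is presumably based on very few games.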

Can we figure out *how much* is recency and *how much* is quantity? It's probably pretty easy to get a rough estimate with a regression. I'm about to leave for the weekend, but I'll look at that next week. Or you can download the results (spreadsheet here) and do it yourself.





Monday, June 27, 2016

NHL teams strategize when to play for overtime

Here's an article I found a year ago in the Journal of Sports Economics, but didn't get around to writing about until now.

It's by Michael Lopez, and it's called "Inefficiencies in the National Hockey League Points System and the Teams That Take Advantage."

As is well-known, NHL teams have an incentive to get games to go to overtime. If a game is settled in regulation, the winning team gets two standings points, while the loser gets none. However, if a game goes to overtime or a shootout, the winning team gets the same two points -- but the losing team gets one point too.

An overtime game is better for teams, in general, because they get to split three points between them instead of just two. So it's no surprise that NHL teams respond to the incentive. In the first thirteen seasons after the "loser point" rule was adopted, the frequency of overtime games jumped from 20.2 percent to 23.6 percent. (Coincidentally, it's the same 23.6 percent before and after the shootout was adopted.)

Lopez's paper was able to quantify two new additional findings:

1. Games are more likely to go to overtime later in the season than earlier; and

2. Games are more likely to go into overtime when teams are not in the same conference.

These make sense, intuitively. Later in the season, some teams are fighting desperately for a playoff spot, and the extra standings point is much more important for them in terms of leverage. And, whether a team makes the playoffs depends only on the other teams in its own conference, so sharing an extra point with an other-conference opponent doesn't cost anything at all. (Well, maybe it might, rarely, cost home-ice advantage in the finals, but that's highly unlikely.)

As mentioned, 23.6 percent of games went to overtime in the shootout era. But the overtime percentage varies substantially by situation:

25.4% Inter-conference games
23.2% Within-conference games 

22.0% September-December games
23.8% January-February games
25.6% March games
29.3% April games

The conference difference is only significant at p=.08, but the month difference is significant at p=.001.

-------

But, Lopez found, the differences are actually larger than those raw percentages, because the two situations aren't independent. As it turns out, the NHL tends to schedule within-conference games late in the season. That's for drama, so that the most meaningful, high-leverage games are likely to be against historical rivals.

Because of that, the two effects partially cancel each other out. The late-season effect tends to increase overtimes, but those games tend to be within-conference, which decreases them.

Lopez separated out those factors with a regression. Calculating from his coefficients, and assuming teams of equal talent, I get:

23.5% within conference, early in season
26.2% different conference, early in season
31.8% within conference, April
35.0% different conference, April

So, the differences are a lot bigger than the raw numbers show.

---------

Something else that's interesting: the "within conference" effect is very recent.

The overall conference effect was 2.7 percentage points (26.2 versus 23.5). But, almost the entire effect came from the last two seasons in the study. For the study's first twelve years, there was almost no difference at all, on average. But in 2010-11 and 2011-12, the conference effects were 4.7 and 5.8 percentage points, respectively.

It's like teams suddenly caught on to the idea that they don't want to give points away to conference opponents.

But ... well, it seems to me that strategy doesn't really make a whole lot of sense.

Yes, it's true that you don't get an advantage against your rival by playing for three points instead of two. It almost seems like it's worse -- if you win in overtime, you only gain one point on your opponent (you get two, they get one). But if you win in regulation, you gain two points!  Except that it's symmetrical ... if *they* beat *you* in overtime, they only gain one point on you. 

The disadvantage comes not from any negative expectation -- it's symmetrical, after all -- but from the fact that the other-conference games come with a *positive* expectation. You share three points instead of two, but your opponent's gain is not your loss, so the more points to split, the better.

So, against that particular opponent, the inter-conference overtime game is much better for you, with 50 percent more points up for grabs, and no penalty for the points the other team takes, beyond your disappointment at not getting them yourself.

The problem, though, is: that's only true for the one team you're playing against. But, you're not just competing in the standings against this one particular opponent. You're also competing against the other 12 (West) or 14 (East) teams in the conference! If you can raise the expected payoff to 1.5 points each instead of 1.0, you break even against the one same-conference opponent, but gain an expected half point against at least 12 other teams!

Sure, there's *a bit* less incentive within conference, because you stand to gain on only 12 teams, instead of 13 teams when you win an inter-conference game. But, that's so small a drop in incentive that you shouldn't even see it. 
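The expected-points arithmetic can be sketched in a few lines, assuming two evenly matched teams (a 50 percent chance of winning, however the game ends). The function name and the "ground gained" bookkeeping are my own framing of the argument, using the 12-versus-13 rival counts from the paragraph above.

```python
def expected_points(p_overtime, p_win=0.5):
    # Expected standings points for one team in one game.
    regulation = (1 - p_overtime) * (2 * p_win)              # 2 for a win, 0 for a loss
    overtime = p_overtime * (2 * p_win + 1 * (1 - p_win))    # 2 for a win, 1 for a loss
    return regulation + overtime

always_regulation = expected_points(0.0)  # 1.0 points each
always_overtime = expected_points(1.0)    # 1.5 points each

# Both teams in the game gain the same amount, so neither gains on the other.
# Each gains on every conference rival who isn't in the game.
gain_per_bystander = always_overtime - always_regulation
ground_within_conference = gain_per_bystander * 12  # opponent is one of your rivals
ground_inter_conference = gain_per_bystander * 13   # opponent isn't a rival at all

print(always_regulation, always_overtime)
print(ground_within_conference, ground_inter_conference)
```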

To repeat an analogy I've used in the past: If you see a $2 coin in the street, you'll pick it up. If it's only a $1 coin, sure, you're less likely to pick it up, in theory. But, in practice? You'll still pick it up so often that nobody will be able to tell the difference. It's like a 99.99% chance compared to a 99.98% chance, or something.

-------

Also: why should there be a March/April effect?  Every game counts equally in the standings. A November game is just as important for making the playoffs, on average, as an April game.

Of course, in April you *know* how important the game is, whereas for a November game, it might, in retrospect, turn out to have been meaningless. But, since games count equally, the overall leverages have to be the same. If that's the case, then for every absolutely crucial April game, there must be an offsetting meaningless one, in order for the April average to equal the November average.

I wonder if the April effect applies only to the most important games. Maybe teams are thinking, "well, we feel a bit weird lowering our intensity to play for the regulation tie, so we're only going to do it when it's really, really important."  In other words, the probability of overtime doesn't increase smoothly with leverage -- instead, it takes a big jump when the pressure to gain points is exceptionally high. 

Maybe I'm 100% willing to steal food if I'm on the brink of starvation, but I'm not 50% willing to steal food if I'm only halfway to starving. In the latter case, the risk isn't worth it.

It could be the same thing here. Maybe teams aren't willing to play a less intense strategy (or whatever they do to play for overtime) when it's an ordinary, early-season game. But, when it's *really* important, that's when it's worth the trouble.



Wednesday, January 07, 2015

Predicting team SH% from player talent

For NHL teams, shooting percentage (SH%) doesn't seem to carry over all that well from year to year. Here, repeated from the last post, are the respective correlations:

-0.19  2014-15 vs. 2013-14
+0.30  2013-14 vs. 2012-13
+0.33  2012-13 vs. 2011-12
+0.03  2011-12 vs. 2010-11
-0.10  2010-11 vs. 2009-10
-0.27  2009-10 vs. 2008-09
+0.04  2008-09 vs. 2007-08

(All data is for 5-on-5 tied situations. Huge thanks to puckalytics.com for making the raw data available on their website.)

They're small. Are they real? It's hard to know, because of the small sample sizes. With only 30 teams, even if SH% were totally random, you'd still get coefficients of this size -- the SD of a random 30-team correlation is 0.19.  
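That 0.19 is easy to check by simulation. Here's a rough sketch that correlates two unrelated sets of 30 random numbers, over and over, and takes the SD of the results. Drawing from a normal distribution and running 4,000 trials are my choices for illustration; the usual approximation is 1 divided by the square root of (n - 1).

```python
import random
from math import sqrt

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def sd_of_random_correlation(n_teams=30, trials=4000, seed=1):
    # SD of the correlation between two independent random "seasons".
    rng = random.Random(seed)
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_teams)]
        ys = [rng.gauss(0, 1) for _ in range(n_teams)]
        rs.append(pearson(xs, ys))
    mean_r = sum(rs) / trials
    return sqrt(sum((r - mean_r) ** 2 for r in rs) / (trials - 1))

simulated = sd_of_random_correlation()
rule_of_thumb = 1 / sqrt(30 - 1)
print(round(simulated, 2), round(rule_of_thumb, 2))
```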

That means there's a lot of noise, too much noise in which to discern a small signal. To reduce that noise, I thought I'd look at the individual players on the teams.  (UPDATE: Rob Vollman did this too, see note at bottom of post.)

Start with last season, 2013-14. I found every player who had at least 20 career shots in the other six seasons in the study. Then, I projected his 2013-14 "X-axis" shooting percentage as his actual SH% in those other seasons.  

For every team, I calculated its "X-axis" shooting percentage as the average of the individual player estimates.  

(Notes: I weighted the players by actual shots, except that if a player had more shots in 2013-14 than the other years, I used the "other years" lower shot total instead of the current one. Also, the puckalytics data didn't post splits for players who spent a year with multiple teams -- it listed them only with their last team. To deal with that, when I calculated "actual" for a team, I calculated it for the Puckalytics set of players.  So the team "actual" numbers I used didn't exactly match the official ones.)

If shooting percentage is truly (or mostly) random, the correlation between team expected and team actual should be low.  

It wasn't that low. It was +0.38.  

I don't want to get too excited about that +0.38, because most other years didn't show that strong an effect. Here are the correlations for those other years:

+0.38  2013-14
+0.45  2012-13
+0.13  2011-12
-0.07  2010-11
-0.34  2009-10
-0.01  2008-09
+0.16  2007-08

They're very similar to the season-by-season correlations at the top of the post ... which, I guess, is to be expected, because they're roughly measuring the same thing.  

If we combine all the years into one dataset, so we have 210 points instead of 30, we get 

--------------
+0.13  7 years

That could easily be random luck.  A correlation of +0.13 would be on the edge of statistical significance if the 210 datapoints were independent. But they're not, since every player-year appears up to six different times as part of the "X-axis" variable.

It's "hockey significant," though. The coefficient is +0.30. So, for instance, at the beginning of 2013-14, when the Leafs' players historically had outshot the Panthers' players by 2.96 percentage points ... you'd forecast the actual difference to be 0.89.  (The actual difference came out to be 4.23 points, but never mind.)
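The forecast is just the regression coefficient times the historical gap. A minimal sketch, with a function name of my own:

```python
COEFFICIENT = 0.30  # regression coefficient from the pooled seven-year sample

def forecast_gap(historical_gap, coefficient=COEFFICIENT):
    # Expected SH% gap this season, given the gap in the players' other seasons.
    return coefficient * historical_gap

leafs_vs_panthers = forecast_gap(2.96)
print(round(leafs_vs_panthers, 2))  # 0.89 percentage points
```

Put another way, a coefficient of 0.30 means regressing the historical gap 70 percent of the way back to zero.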

-----

The most recent three seasons appear to have higher correlations than the previous four. Again at the risk of cherry-picking ... what happens if we just consider those three?

+0.38  2013-14
+0.45  2012-13
+0.13  2011-12
--------------
+0.34  3 years

The +0.34 looks modest, but the coefficient is quite high -- 0.60. That means you have to regress out-of-sample performance only 40% back to the mean.  

Is it OK to use these three years instead of all seven? Not if the difference is just luck; only if there's something that actually makes the 2011-12 to 2013-14 more reliable.  

For instance ... it could be that the older seasons do worse because of selective sampling. If players improve slowly over their careers, then drop off a cliff ... the older seasons will be more likely comparing the player to his post-cliff performance. I have no idea if that's a relevant explanation or not, but that's the kind of argument you'd need to help justify looking at only the three seasons.

Well, at least we can check statistical significance. I created a simulation of seven 30-team seasons, where each identical team had an 8 percent chance of scoring on each of 600 identical shots. Then, I ran a correlation for only three of those seven seasons, like here.

The SD of that correlation coefficient was 0.12. So, the +0.34 in the real-life data was almost three SDs above random.
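Here's a sketch of that kind of simulation, not my original code. It works at the team level rather than the player level, and it uses a normal approximation to the binomial for each team-season's SH% to keep things fast; both of those are simplifying assumptions, as are the seed and the number of trials.

```python
import random
from math import sqrt

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def one_pooled_correlation(rng, n_teams=30, n_seasons=7, n_used=3, p=0.08, shots=600):
    # Every team is identical: pure luck around an 8 percent SH% on 600 shots.
    sd = sqrt(p * (1 - p) / shots)  # binomial SD of a season SH% (normal approximation)
    xs, ys = [], []
    for _ in range(n_teams):
        seasons = [rng.gauss(p, sd) for _ in range(n_seasons)]
        total = sum(seasons)
        for s in range(n_used):
            ys.append(seasons[s])                              # "actual" SH% that season
            xs.append((total - seasons[s]) / (n_seasons - 1))  # average of the other six
    return pearson(xs, ys)  # 90 pooled points: 30 teams times 3 seasons

def sd_of_pooled_correlation(trials=1000, seed=1):
    rng = random.Random(seed)
    rs = [one_pooled_correlation(rng) for _ in range(trials)]
    mean_r = sum(rs) / trials
    return sqrt(sum((r - mean_r) ** 2 for r in rs) / (trials - 1))

sd_random = sd_of_pooled_correlation()
print(round(sd_random, 2))
```

The SD comes out larger than it would for 90 independent points, because each season's "actual" figure also feeds into the other seasons' X-axis averages.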

Still: we did cherry-pick our three seasons, so the raw probability is very misleading.  If it had been 8 SD or something, we would have been pretty sure that we found a real relationship, even after taking the cherry-pick into account. At 3 SD ... not so sure.

-----

Well, suppose we split the difference ... but on the conservative side. The 7-year coefficient is 0.30. The 3-year coefficient is 0.60.  Let's try a coefficient of 0.40, which is only 1/3 of the way between 0.30 and 0.60.

If we do that, we get that the predictive ability of SH% is: one extra goal per X shots in the six surrounding seasons forecasts 0.4 extra goals per X shots this season.

For an average team, 0.4 extra goals is around 5 extra shots, or 9 extra Corsis.

In his study last month, Tango found a goal was only 4 extra Corsis. Why the difference? Because our studies aren't measuring the same thing. We were asking the same general question -- if you combine "goals" and "shots," does that give you a better prediction than "shots" alone? -- but doing so by asking different specific questions.

Tango asked how well half a team's games predict the other half. I was asking how to predict a team's year from its players' six surrounding years. It's possible that the "half-year" method has more luck in it ... or that other differences factor in, also.

My gut says that the answers we found are still fairly consistent.

---



UPDATE: Rob Vollman, of "Hockey Abstract" fame, did a similar study last summer (which I read, but had forgotten about).  Slightly different methodology, I think, but the results seem consistent.  Sorry, Rob!




Thursday, December 18, 2014

True talent levels for NHL team shooting percentage, part II

(Part I is here)

Team shooting percentage is considered an unreliable indicator of talent, because its season-to-season correlation is low. 

Here are those correlations from the past few seasons, for 5-on-5 tied situations.

-0.19  2014-15 vs. 2013-14
+0.30  2013-14 vs. 2012-13
+0.33  2012-13 vs. 2011-12
+0.03  2011-12 vs. 2010-11
-0.10  2010-11 vs. 2009-10
-0.27  2009-10 vs. 2008-09
+0.04  2008-09 vs. 2007-08

The simple average of all those numbers is +0.02, which, of course, is almost indistinguishable from zero. Even if you remove the first pair -- the 2014-15 stats are based on a small, season-to-date sample size -- it's only an average of +0.055.

(A better way to average them might be to square them (keeping the sign), average the squares, then take the square root. That gives +0.11 and +0.14, respectively. But I'll just use simple averages in this post, even though they're probably not right, because I'm not looking for exact answers.)
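Here's a minimal sketch of the two ways of averaging, using the seven one-year correlations listed above. The function names are mine, and the "signed root mean square" is my reading of the parenthetical: square each correlation keeping its sign, average, then take the root with the sign restored.

```python
import math

# One-year-apart correlations from the list above, most recent pair first.
one_year = [-0.19, 0.30, 0.33, 0.03, -0.10, -0.27, 0.04]


def simple_mean(rs):
    """Plain arithmetic average of the correlations."""
    return sum(rs) / len(rs)


def signed_rms(rs):
    """Square each r (keeping its sign), average, then take a signed root."""
    mean_sq = sum(math.copysign(r * r, r) for r in rs) / len(rs)
    return math.copysign(math.sqrt(abs(mean_sq)), mean_sq)


# All seven pairs, then the six pairs that leave out the partial 2014-15 season.
for label, rs in (("all seven", one_year), ("without 2014-15", one_year[1:])):
    print(label, round(simple_mean(rs), 3), round(signed_rms(rs), 2))
```

The same two functions can be pointed at any other list of correlations in this post.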

That does indeed suggest that SH% isn't that reliable -- after all, there were more negative seasons than strong positive ones.

But: what if we expand our sample size, by looking at the correlation between pairs that are TWO seasons apart? Different story, now:

+0.35   2014-15 vs. 2012-13
+0.12   2013-14 vs. 2011-12
+0.27   2012-13 vs. 2010-11
+0.12   2011-12 vs. 2009-10
+0.41   2010-11 vs. 2008-09
-0.03   2009-10 vs. 2007-08

These six pairs average +0.21, which ain't bad.

------

Part of the reason that the two-year correlations are high might be that team talent didn't change all that much in the seasons of the study. I checked the correlation between overall team talent, as measured by hockey-reference.com's "SRS" rating. For 2008-09 vs. 2013-14, the correlation was +0.50.

And that's for FIVE seasons apart. So far, we've only looked at two seasons apart.

I chose 2008-09 because you'll notice the correlations that include 2007-08 are nothing special. That, I think, is because team talent changed significantly between 07-08 and 08-09. If I rerun the SRS correlation for 2007-08 vs. 2013-14 -- that is, going back only one additional year -- it drops from +0.50 to only +0.25.

On that basis, I'm arbitrarily deciding to drop 2007-08 from the rest of this post, since the SH% discussion is based on an assumption that team talent stays roughly level.

------

But even if team talent changed little since 2008-09, it still changed *some*. So, wouldn't you still expect the two-year correlations to be lower than the one-year correlations? There's still twice the change in talent, albeit twice a *small* change.

You can look at it a different way -- if A isn't strongly related to B, and B isn't strongly related to C, then how can A be strongly related to C?

Well, I think it's the other way around. It's not just that A *can* be strongly related to C. It's that, if there's really a signal within the noise, you should *expect* A to be strongly related to C.

Consider 2009-10. In that year, every team had a certain SH% talent. Because of randomness, the set of 30 observed team SH% numbers varied from the true talent. The same would be true, of course, for the two surrounding seasons, 2008-09, and 2010-11.

But both those surrounding seasons had a substantial negative correlation with the middle season. That suggests that for each of those surrounding seasons, their luck varied from the middle season in the "opposite" way. Otherwise, the correlation wouldn't be negative.

For instance, maybe in the middle season, the Original Six teams were lucky, and the other 24 teams were unlucky. The two negative correlations with the surrounding seasons suggest that in each of those seasons, maybe it was the other way around, that the Original Six were unlucky, and the rest lucky.

Since the surrounding seasons both had opposite luck to the middle season, they're likely to have had similar luck to each other. 

In this case, they are. The A-to-B correlation is -0.10. The B-to-C correlation is -0.27. But the A-to-C correlation is +0.41. Positive, and quite large.

-0.10   2010-11 (A) vs. 2009-10 (B)
-0.27   2009-10 (B) vs. 2008-09 (C)
-----------------------------------
+0.41   2010-11 (A) vs. 2008-09 (C)


------

This should be true even if SH% is all random -- that is, even if all teams have the same talent. The logic still holds: if A correlates to B the same way C correlates to B ... that means A and C are likely to be somewhat similar. 

I ran a series of three-season simulations, where all 30 teams were equal in talent. When both A and C had a similar, significant correlation to B (same sign, both above +/- 0.20), their correlation with each other averaged +0.06. 
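A rough sketch of that kind of simulation is below. It is not the code I actually ran. To keep it short, each equal-talent team's seasonal SH% is drawn as pure Gaussian noise rather than shot by shot (an assumption; correlations don't depend on the scale of the noise, and a binomial over hundreds of shots is close to normal). The function names, the trial count, and the seed are all illustrative.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def conditional_ac(trials=20000, teams=30, threshold=0.20, seed=1):
    """Average A-C correlation over trials where A-B and B-C share a sign
    and both exceed the threshold in absolute value.

    Returns (average A-C correlation, number of qualifying trials).
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(trials):
        # Three seasons of pure luck: every team has identical talent.
        a, b, c = ([rng.gauss(0, 1) for _ in range(teams)] for _ in range(3))
        r_ab = pearson(a, b)
        if abs(r_ab) <= threshold:
            continue
        r_bc = pearson(b, c)
        if abs(r_bc) > threshold and (r_ab > 0) == (r_bc > 0):
            kept.append(pearson(a, c))
    return sum(kept) / len(kept), len(kept)


if __name__ == "__main__":
    avg, n = conditional_ac()
    print(f"{n} qualifying trials, average A-C correlation {avg:+.3f}")
```

Raising `threshold` makes the filter stricter, which is the lever to pull if you want to see how the average A-C correlation responds to stronger A-B and B-C effects.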

In our case, we didn't get +0.06. We got something much bigger: +0.41. That's because the underlying real-life talent correlation isn't actually zero, as it was in the simulation. A couple of studies suggested it was around +0.15. 

So, the A-B was actually -0.25 "correlation points", relative to the trend: -0.10 relative to zero, plus -0.15 below typical. (I'm sure that isn't the way to do it statistically -- it's not perfectly additive like that -- but I'm just illustrating the point.)  Similarly, the B-C was actually -0.42 points.

Those are much larger effects when you correct them that way, so they have a stronger result. When I limited the simulation sample so both A-B and B-C had to be bigger than +/- 0.25, the average A-C correlation almost tripled, to +0.16. 

Add that +0.16 to the underlying +0.15, and you get +0.31. Still not the +0.41 from real life, but close enough, considering the assumptions I made and shortcuts I took.

------

Since we have seven seasons with stable team talent, we don't have to stop at two-season gaps ... we can go all the way to six-season gaps, and pair every season with every other season. Here are the results:


         14-15  13-14  12-13  11-12  10-11  09-10  08-09  
--------------------------------------------------------
14-15           -0.19  +0.35  +0.20  +0.15  +0.46  -0.07  
13-14    -0.19         +0.30  +0.12  +0.27  -0.07  +0.42  
12-13    +0.35  +0.30         +0.33  +0.27  +0.24  +0.26  
11-12    +0.20  +0.12  +0.33         +0.03  +0.12  -0.08  
10-11    +0.15  +0.27  +0.27  +0.03         -0.10  +0.41  
09-10    +0.46  -0.07  +0.24  +0.12  -0.10         -0.27  
08-09    -0.07  +0.42  +0.26  -0.08  +0.41  -0.27


The average of all these numbers is ... +0.15, which is exactly what the other studies averaged out to. That's coincidence ... they used a different set of pairs, they didn't limit the sample to tie scores, and 14-15 hadn't existed yet. (Besides, I think if you did the math, you'd find you wanted the root of the average r-squared, which would be significantly higher than  +0.15.)
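For anyone who wants to check the averaging, here's a small sketch that takes each pair from the table once (the upper triangle) and computes both the simple average and the root of the average r-squared. The variable names are mine; the numbers are copied from the table.

```python
import math

# Upper triangle of the table: each row lists one season's correlations
# with the seasons in the columns to its right (14-15 row first).
upper = [
    [-0.19, 0.35, 0.20, 0.15, 0.46, -0.07],  # 14-15
    [0.30, 0.12, 0.27, -0.07, 0.42],         # 13-14
    [0.33, 0.27, 0.24, 0.26],                # 12-13
    [0.03, 0.12, -0.08],                     # 11-12
    [-0.10, 0.41],                           # 10-11
    [-0.27],                                 # 09-10
]

pairs = [r for row in upper for r in row]

mean_r = sum(pairs) / len(pairs)
root_mean_r2 = math.sqrt(sum(r * r for r in pairs) / len(pairs))

print(len(pairs), "pairs")
print("simple average:", round(mean_r, 2))
print("root of average r-squared:", round(root_mean_r2, 2))
```

Note that the root of the average r-squared throws away the signs, so it treats a negative correlation as if it were evidence of signal. That's one reason it comes out higher than the simple average.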

Going back to the A-B-C thing ... you'll find it still holds. If you look for cases where A-B and B-C are both significantly below the 0.15 average, A-C will be high. (Look in the same row or column for two low numbers.)  

For instance, in the 14-15 row, 13-14 and 08-09 are both negative. Look for the intersection of 13-14 and 08-09. As predicted, the correlation there is very high -- +0.42. 

By similar logic, if you find cases where A-B and B-C go in different directions -- one much higher than 0.15, the other much lower -- then, A-C should be low.

For instance, in the second row, 09-10 is -0.07, but 08-09 is +0.42. The prediction is that the intersection of 09-10 and 08-09 should be low -- and it is, -0.27.

------

Look at 2012-13. It has a strong positive correlation with every other season in the sample. Because of that, I originally guessed that 2012-13 is the most "normal" of all the seasons, the one where teams most played to their overall talent. In other words, I guessed that 2012-13 was the one with the least luck.

But, when I calculated the SDs of the 30 teams for each season ... 2012-13 was the *highest*, not the lowest. By far! And that's even adjusting for the short season. In fact, all the full seasons had a team SD of 1.00 percentage points or lower -- except that one, which was at the adjusted equivalent of 1.23.

What's going on?

Well, I think it's this: in 2012-13, instead of luck mixing up the differences in team talent, it exaggerated them. In other words: that year, the good teams got lucky, and the bad teams got unlucky. In 2012-13, the millionaires won most of the lotteries.

That kept the *order* of the teams the same -- which means that 2012-13 wound up the most exaggeratedly representative of teams' true talent.

Whether that's right or not, it seems that two things should be true:

-- With all the high correlations, 2012-13 should be a good indicator of actual talent over the seven-year span; and

-- Since we found that talent was stable, we should get good results if we add up all six years for each team, as if it was one season with six times as many games.* 

*Actually, about five times, since there are two short seasons in the sample -- 2012-13, and 2014-15, which is less than half over as I write this.

Well, I checked, and ... both guesses were correct.

I checked the correlation between 2012-13 vs. the sum of the other five seasons (not including the current 2014-15). It was roughly +0.54. That's really big. But, there's actually no value in that ... it was cherry-picked in retrospect. Still, it's just something I found interesting, that for a statistic that is said to have so little signal, a shortened season can still have a +0.54 correlation with the average of five other years!

As for the six-season averages ... those DO have value. Last post, when we tried to get an estimate of the SD of team talent in SH% ... we got imaginary numbers! Now, we can get a better answer. Here's the Palmer/Tango method for the 30 teams' six-year totals:

SD(observed) = 0.543 percentage points
SD(luck)     = 0.463 percentage points
--------------------------------------
SD(talent)   = 0.284 percentage points

That 0.28 percentage points has to be an underestimate. As explained in the previous post, the "all shots are the same" binomial luck estimate is necessarily too high. If we drop it by 9 percent, as we did earlier, we get

SD(observed) = 0.543 percentage points
SD(luck)     = 0.421 percentage points
--------------------------------------
SD(talent)   = 0.343 percentage points
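The arithmetic behind both of those tables is just subtracting variances: observed variance equals talent variance plus luck variance, so SD(talent) is the square root of SD(observed) squared minus SD(luck) squared. A minimal sketch, with a function name of my own choosing:

```python
import math


def sd_talent(sd_observed, sd_luck):
    """SD of talent, assuming var(observed) = var(talent) + var(luck).

    math.sqrt raises ValueError when sd_luck exceeds sd_observed, which is
    the "imaginary numbers" situation.
    """
    return math.sqrt(sd_observed ** 2 - sd_luck ** 2)


# Straight binomial luck estimate.
print(round(sd_talent(0.543, 0.463), 3))

# Luck estimate dropped by 9 percent: 0.463 * 0.91 rounds to 0.421.
print(round(0.463 * 0.91, 3))
print(round(sd_talent(0.543, 0.421), 3))
```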

We also need to bump it for the fact that this is the talent distribution for a six-season span -- which is necessarily tighter than a one-season distribution (since teams tend to regress to the mean over time, even slightly). But I don't know how much to bump, so I'll just leave it where it is.

That 0.34 points is almost exactly what we got last post. Which makes sense -- all we did was multiply our sample size by five. 

The real difference, though, is the credibility of the estimate. Last post, it was completely dependent on our guess that the binomial SD(luck) was 9 percent too high. The difference between guessing and not guessing was huge -- 0.34 points, versus zero points.  In effect, without guessing, we couldn't prove there was any talent at all!

But now, we do have evidence of talent ... and guessing adds only around 0.06 points. If you refuse to allow a guess of how shots vary in quality ... well, you still have evidence, without guessing at all, that teams must vary in talent with an SD of at least 0.284 percentage points.



