Thursday, December 18, 2014

True talent levels for NHL team shooting percentage, part II

(Part I is here)

Team shooting percentage is considered an unreliable indicator of talent, because its season-to-season correlation is low. 

Here are those correlations from the past few seasons, for 5-on-5 tied situations.

-0.19  2014-15 vs. 2013-14
+0.30  2013-14 vs. 2012-13
+0.33  2012-13 vs. 2011-12
+0.03  2011-12 vs. 2010-11
-0.10  2010-11 vs. 2009-10
-0.27  2009-10 vs. 2008-09
+0.04  2008-09 vs. 2007-08

The simple average of all those numbers is +0.02, which, of course, is almost indistinguishable from zero. Even if you remove the first pair -- the 2014-15 stats are based on a small, season-to-date sample size -- it's only an average of +0.055.

(A better way to average them might be to square them (keeping the sign), and then take the root mean square. That gives +0.11 and +0.14, respectively. But I'll just use simple averages in this post, even though they're probably not right, because I'm not looking for exact answers.)
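In Python, the two averaging methods look something like this (a quick sketch, using the seven correlations from the table above):

```python
# Year-to-year SH% correlations from the table above.
rs = [-0.19, 0.30, 0.33, 0.03, -0.10, -0.27, 0.04]

def simple_avg(values):
    return sum(values) / len(values)

def signed_rms(values):
    # Square each r, keeping its sign; average; then take the square root.
    m = sum(r * abs(r) for r in values) / len(values)
    return abs(m) ** 0.5 * (1 if m >= 0 else -1)

print(round(simple_avg(rs), 2))      # 0.02  -- all seven pairs
print(round(simple_avg(rs[1:]), 3))  # 0.055 -- dropping the 2014-15 pair
print(round(signed_rms(rs), 2))      # 0.11
print(round(signed_rms(rs[1:]), 2))  # 0.14
```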

That does indeed suggest that SH% isn't that reliable -- after all, there were more negative pairs than strongly positive ones.

But: what if we expand our sample size, by looking at the correlation between pairs that are TWO seasons apart? Different story, now:

+0.35   2014-15 vs. 2012-13
+0.12   2013-14 vs. 2011-12
+0.27   2012-13 vs. 2010-11
+0.12   2011-12 vs. 2009-10
+0.41   2010-11 vs. 2008-09
-0.03   2009-10 vs. 2007-08

These six pairs average +0.21, which ain't bad.


Part of the reason the two-year correlations are high might be that team talent didn't change all that much over the seasons of the study. I checked the correlation of overall team talent, as measured by the "SRS" rating. For 2008-09 vs. 2013-14, the correlation was +0.50.

And that's for FIVE seasons apart. So far, we've only looked at two seasons apart.

I chose 2008-09 because you'll notice the correlations that include 2007-08 are nothing special. That, I think, is because team talent changed significantly between 07-08 and 08-09. If I rerun the SRS correlation for 2007-08 vs. 2013-14 -- that is, going back only one additional year -- it drops from +0.50 to only +0.25.

On that basis, I'm arbitrarily deciding to drop 2007-08 from the rest of this post, since the SH% discussion is based on an assumption that team talent stays roughly level.


But even if team talent changed little since 2008-09, it still changed *some*. So, wouldn't you still expect the two-year correlations to be lower than the one-year correlations? There's still twice the change in talent, albeit twice a *small* change.

You can look at it a different way -- if A isn't strongly related to B, and B isn't strongly related to C, then how can A be strongly related to C?

Well, I think it's the other way around. It's not just that A *can* be strongly related to C. It's that, if there's really a signal within the noise, you should *expect* A to be strongly related to C.

Consider 2009-10. In that year, every team had a certain SH% talent. Because of randomness, the set of 30 observed team SH% numbers varied from the true talent. The same would be true, of course, for the two surrounding seasons, 2008-09, and 2010-11.

But both those surrounding seasons had a substantial negative correlation with the middle season. That suggests that for each of those surrounding seasons, their luck varied from the middle season in the "opposite" way. Otherwise, the correlation wouldn't be negative.

For instance, maybe in the middle season, the Original Six teams were lucky, and the other 24 teams were unlucky. The two negative correlations with the surrounding seasons suggest that in each of those seasons, maybe it was the other way around, that the Original Six were unlucky, and the rest lucky.

Since the surrounding seasons both had opposite luck to the middle season, they're likely to have had similar luck to each other. 

In this case, they did. The A-to-B correlation is -0.10. The B-to-C correlation is -0.27. But the A-to-C correlation is +0.41. Positive, and quite large.

-0.10   2010-11 (A) vs. 2009-10 (B)
-0.27   2009-10 (B) vs. 2008-09 (C)
+0.41   2010-11 (A) vs. 2008-09 (C)


This should be true even if SH% is all random -- that is, even if all teams have the same talent. The logic still holds: if A correlates to B the same way C correlates to B ... that means A and C are likely to be somewhat similar. 

I ran a series of three-season simulations, where all 30 teams were equal in talent. When both A and C had a similar, significant correlation to B (same sign, both above +/- 0.20), their correlation with each other averaged +0.06. 
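Here's a rough sketch of that simulation -- approximating binomial shooting luck on ~700 shots at 8 percent as normal noise with an SD of one percentage point, which is close enough for illustration:

```python
import random
import statistics

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

random.seed(2014)
a_to_c = []
for _ in range(10000):
    # Three seasons (A, B, C) of 30 equal-talent teams: SH% of 8.00,
    # plus luck with an SD of one percentage point.
    a, b, c = ([random.gauss(8.0, 1.0) for _ in range(30)] for _ in range(3))
    r_ab, r_bc = pearson(a, b), pearson(b, c)
    # Keep only triples where A and C each correlate with B the same
    # way, above +/- 0.20.
    if abs(r_ab) > 0.20 and abs(r_bc) > 0.20 and r_ab * r_bc > 0:
        a_to_c.append(pearson(a, c))

print(len(a_to_c), round(statistics.mean(a_to_c), 2))
```

The conditional average of the A-to-C correlations should come out small but positive, in the neighborhood of that +0.06.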

In our case, we didn't get +0.06. We got something much bigger: +0.41. That's because the underlying real-life talent correlation isn't actually zero, as it was in the simulation. A couple of studies suggested it was around +0.15. 

So, the A-B was actually -0.25 "correlation points," relative to the trend: -0.10 relative to zero, plus -0.15 below typical. (I'm sure that isn't the right way to do it statistically -- it's not perfectly additive like that -- but I'm just illustrating the point.)  Similarly, the B-C was actually -0.42 points.

Corrected that way, those are much larger effects, so they produce a stronger result. When I limited the simulation sample so that both A-B and B-C had to be bigger than +/- 0.25, the average A-C correlation almost tripled, to +0.16. 

Add that +0.16 to the underlying +0.15, and you get +0.31. Still not the +0.41 from real life, but close enough, considering the assumptions I made and shortcuts I took.


Since we have six seasons with stable team talent, we don't have to stop at two-season gaps ... we can go all the way to five-season gaps, and pair every season with every other season. Here are the results:

         14-15  13-14  12-13  11-12  10-11  09-10  08-09  
14-15           -0.19  +0.35  +0.20  +0.15  +0.46  -0.07  
13-14    -0.19         +0.30  +0.12  +0.27  -0.07  +0.42  
12-13    +0.35  +0.30         +0.33  +0.27  +0.24  +0.26  
11-12    +0.20  +0.12  +0.33         +0.03  +0.12  -0.08  
10-11    +0.15  +0.27  +0.27  +0.03         -0.10  +0.41  
09-10    +0.46  -0.07  +0.24  +0.12  -0.10         -0.27  
08-09    -0.07  +0.42  +0.26  -0.08  +0.41  -0.27

The average of all these numbers is ... +0.15, which is exactly what the other studies averaged out to. That's coincidence ... they used a different set of pairs, they didn't limit the sample to tie scores, and 14-15 hadn't existed yet. (Besides, I think if you did the math, you'd find you wanted the root of the average r-squared, which would be significantly higher than  +0.15.)
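You can check that average directly from the table, counting each pair once:

```python
# Upper triangle of the correlation table above -- each pair counted once.
pairs = [
    -0.19, 0.35, 0.20, 0.15, 0.46, -0.07,  # 14-15 vs. the six others
    0.30, 0.12, 0.27, -0.07, 0.42,         # 13-14 vs. the five below it
    0.33, 0.27, 0.24, 0.26,                # 12-13
    0.03, 0.12, -0.08,                     # 11-12
    -0.10, 0.41,                           # 10-11
    -0.27,                                 # 09-10 vs. 08-09
]
print(round(sum(pairs) / len(pairs), 2))  # 0.15
```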

Going back to the A-B-C thing ... you'll find it still holds. If you look for cases where A-B and B-C are both significantly below the 0.15 average, A-C will be high. (Look in the same row or column for two low numbers.)  

For instance, in the 14-15 row, 13-14 and 08-09 are both negative. Look for the intersection of 13-14 and 08-09. As predicted, the correlation there is very high -- +0.42. 

By similar logic, if you find cases where A-B and B-C go in different directions -- one much higher than 0.15, the other much lower -- then, A-C should be low.

For instance, in the second row, 09-10 is -0.07, but 08-09 is +0.42. The prediction is that the intersection of 09-10 and 08-09 should be low -- and it is, -0.27.


Look at 2012-13. It has a strong positive correlation with every other season in the sample. Because of that, I originally guessed that 2012-13 is the most "normal" of all the seasons, the one where teams most played to their overall talent. In other words, I guessed that 2012-13 was the one with the least luck.

But, when I calculated the SDs of the 30 teams for each season ... 2012-13 was the *highest*, not the lowest. By far! And that's even adjusting for the short season. In fact, all the full seasons had a team SD of 1.00 percentage points or lower -- except that one, which was at the adjusted equivalent of 1.23.

What's going on?

Well, I think it's this: in 2012-13, instead of luck mixing up the differences in team talent, it exaggerated them. In other words: that year, the good teams got lucky, and the bad teams got unlucky. In 2012-13, the millionaires won most of the lotteries.

That kept the *order* of the teams the same -- which means that 2012-13 wound up the most exaggeratedly representative of teams' true talent.

Whether that's right or not, it seems that two things should be true:

-- With all the high correlations, 2012-13 should be a good indicator of actual talent over the seven-year span; and

-- Since we found that talent was stable, we should get good results if we add up all six years for each team, as if it was one season with six times as many games.* 

*Actually, about five times, since there are two short seasons in the sample -- 2012-13, and 2014-15, which is less than half over as I write this.

Well, I checked, and ... both guesses were correct.

I checked the correlation between 2012-13 vs. the sum of the other five seasons (not including the current 2014-15). It was roughly +0.54. That's really big. But, there's actually no value in that ... it was cherry-picked in retrospect. Still, it's just something I found interesting, that for a statistic that is said to have so little signal, a shortened season can still have a +0.54 correlation with the average of five other years!

As for the six-season averages ... those DO have value. Last post, when we tried to get an estimate of the SD of team talent in SH% ... we got imaginary numbers! Now, we can get a better answer. Here's the Palmer/Tango method for the 30 teams' six-year totals:

SD(observed) = 0.543 percentage points
SD(luck)     = 0.463 percentage points
SD(talent)   = 0.284 percentage points

That 0.28 percentage points has to be an underestimate. As explained in the previous post, the "all shots are the same" binomial luck estimate is necessarily too high. If we drop it by 9 percent, as we did earlier, we get

SD(observed) = 0.543 percentage points
SD(luck)     = 0.421 percentage points
SD(talent)   = 0.343 percentage points
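That subtraction is just the usual variance decomposition, which is easy to sketch:

```python
def sd_talent(sd_observed, sd_luck):
    # var(observed) = var(talent) + var(luck), so:
    # SD(talent) = sqrt(SD(observed)^2 - SD(luck)^2)
    return (sd_observed ** 2 - sd_luck ** 2) ** 0.5

print(round(sd_talent(0.543, 0.463), 3))  # 0.284 -- binomial luck as-is
print(round(sd_talent(0.543, 0.421), 3))  # 0.343 -- luck shaved by 9 percent
```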

We also need to bump it up a bit, to account for the fact that this is the talent distribution for a six-season span -- which is necessarily tighter than a one-season distribution (since teams tend to regress toward the mean over time, even slightly). But I don't know how much to bump it, so I'll just leave it where it is.

That 0.34 points is almost exactly what we got last post. Which makes sense -- all we did was multiply our sample size by five. 

The real difference, though, is the credibility of the estimate. Last post, it was completely dependent on our guess that the binomial SD(luck) was 9 percent too high. The difference between guessing and not guessing was huge -- 0.34 points, versus zero points.  In effect, without guessing, we couldn't prove there was any talent at all!

But now, we do have evidence of talent ... and guessing adds only around 0.06 points. If you refuse to allow a guess of how shots vary in quality ... well, you still have evidence, without guessing at all, that teams must vary in talent with an SD of at least 0.284 percentage points.


Saturday, December 13, 2014

True talent levels for NHL team shooting percentage

How much of the difference in team shooting percentage (SH%) is luck, and how much is talent? That seems like it should be pretty easy to figure out, using the usual Palmer/Tango method.


Let's start with the binomial randomness inherent in shooting. 

In 5-on-5 tied situations in 2013-14 (the dataset I'm using for this entire post), the average team took 721 shots. At an 8 percent SH%, one SD of binomial luck is

The square root of (0.08 * 0.92 / 721)

... which is almost exactly one percentage point.
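That calculation, spelled out:

```python
shots = 721
p = 0.08

# SD of binomial luck in shooting percentage, in percentage points.
sd_luck = 100 * (p * (1 - p) / shots) ** 0.5
print(round(sd_luck, 2))  # 1.01
```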

That's a lot. It would move an average team about 10 positions up or down in the standings -- say, from 7.50 (16th) to 8.50 (4th). 

If you want to compare that to Corsi ... for CF% (the percentage of all shot attempts in the game that are taken by the team in question), the SD due to binomial luck is also (coincidentally) about one percentage point. That would take a 50.0% team to 51.0%, which is only maybe three places in the standings.

That's one reason that SH% isn't as reliable an indicator as Corsi: a run of luck can make you look like the best or worst in the league in that category, instead of just moving you a few spots.


If we just go to the data and observe the SD of actual team SH%, it also comes out to about one percentage point. 


Using the usual relationship

SD(talent)^2 = SD(observed)^2 - SD(luck)^2

we get

SD(talent)^2 = 1.0 - 1.0 

Which equals zero. And so it appears there's no variance in talent at all -- that SH% is, indeed, completely random!

But ... not necessarily. For two reasons.


First, SD(observed) is itself random, based on what happened in the 2013-14 season. We got a value of around 1.00, but it could be that the "true" value, the average we'd get if we re-ran the season an infinite number of times, is different. 

How much different could it be? I wrote a simulation to check. I ran 5,000 seasons of 30 teams, each with 700 shots and a shooting percentage of 8.00 percent. 

As expected, the average of those 5,000 SDs was around 1.00. But the 5,000 values varied with an SD of 0.133 percentage points. (Yes, that's the SD of a set of 5,000 SDs.)  So the standard 95% confidence interval gives us a range of (0.73, 1.27). 
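That simulation is simple enough to sketch -- here, approximating the binomial luck as normal noise with an SD of 1.01 percentage points:

```python
import random
import statistics

random.seed(0)
SD_LUCK = 1.01  # binomial SD for ~700 shots at 8 percent, in percentage points

sds = []
for _ in range(5000):
    # One season: 30 identical teams, each shooting 8.00 percent plus luck.
    season = [random.gauss(8.00, SD_LUCK) for _ in range(30)]
    sds.append(statistics.stdev(season))

print(round(statistics.mean(sds), 2))   # close to 1.00
print(round(statistics.stdev(sds), 2))  # close to 0.13
```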

That doesn't look like it would make a whole lot of difference in our talent estimate ... but it does. 

At the top end of the confidence interval, an observed SD of 1.27, we'd get

SD(talent) squared  = 1.27 squared - 1.00 squared 
                    = 0.52 squared

That would put the SD of talent at 0.52 percentage points, instead of zero. That's a huge difference numerically, and a huge difference in how we think of SH% talent. Without the confidence interval, it looks like SH% talent doesn't exist at all. With the confidence interval, not only does it appear to exist, but we see it could be substantial.

Why is the range so wide? It's because the observed spread isn't much different from the binomial luck. In this case, they're identical, at 1.00 each. In other situations or other sports, they're farther apart. In MLB team wins, the SD of actual wins is almost double the theoretical SD from luck. In the NHL, it's about one-and-a-half times as big. In the NBA ... not sure; it's probably triple, or more. 

If you have a sport where the range of talent is at least as big as the range of luck, your observed SD will be at least 1.4 times as big as luck alone would produce -- and 1.4 times is a significant enough signal to not be buried in the noise. But if the range of talent is only, say, 40% as large as the range of luck, your expected SD will be only 1.077 times as big -- that is, only eight percent larger. And that's easy to miss in all the random noise.
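The arithmetic behind those ratios: if talent and luck are independent, SD(observed) is the square root of SD(talent)^2 + SD(luck)^2. A quick sketch:

```python
def observed_to_luck_ratio(talent_to_luck):
    # If talent and luck are independent:
    # var(observed) = var(talent) + var(luck)
    return (1 + talent_to_luck ** 2) ** 0.5

print(round(observed_to_luck_ratio(1.0), 2))   # 1.41 -- talent as big as luck
print(round(observed_to_luck_ratio(0.4), 3))   # 1.077 -- talent only 40% as big
```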


Can we narrow down the estimate with more seasons of data? 

For 2011-12, SD(observed) was 0.966, which actually gives an imaginary number for SD(talent) -- the square root of a negative estimate of var(talent). In other words, the teams were closer than we'd expect them to be even if they were all identical! 

For 2010-11, SD(observed) was 0.88, which is even worse. In 2009-10, it was 1.105. Well, that works: it suggests SD(talent) = 0.47 percentage points. For 2008-09, it's back to imaginary numbers, with SD(observed) = 0.93. (Actually, even 2013-14 gave a negative estimate ... I've been saying SD(luck) and SD(observed) were both 1.00, but they were really 1.01 and 0.99, respectively.)

Out of five seasons, we get four impossible situations: the teams were closer together than we'd expect even if they were identical!

That might be random. It might be something wrong with our assumption that talent and luck are independent. Or, it might be that there's something else wrong. 

I think it's that "something else". I think we're not using a good enough assumption about shot types.


Our binomial luck calculation assumed that all the shots were the same, that every shot had an identical 8% chance of becoming a goal. If you use a more realistic assumption, the effects of luck come out lower.

The typical team in the dataset scored about 56 goals. If that's 700 shots at 8 percent, the luck SD is 1 percent, as we found. But suppose those 56 goals come from a combination of high-probability shots and low-probability shots, like this:


 5 goals =   5 shots at 100% 
15 goals =  30 shots at  50%
30 goals = 300 shots at  10%
 6 goals = 365 shots at   1.64%
56 goals = 700 shots at   8%

If you do it that way, the luck SD drops from 1.0% to 0.91%.
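Checking that arithmetic against the hypothetical mixture above:

```python
# Hypothetical shot mix from the table above: (shots, chance of a goal).
mix = [(5, 1.00), (30, 0.50), (300, 0.10), (365, 0.0164)]

shots = sum(n for n, _ in mix)                    # 700
goals = sum(n * p for n, p in mix)                # about 56
var_goals = sum(n * p * (1 - p) for n, p in mix)  # binomial variance per bucket
sd_shpct = 100 * var_goals ** 0.5 / shots         # in percentage points

print(round(sd_shpct, 2))  # 0.91
```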

And that makes a big difference. 1.00 squared minus 0.91 squared is around 0.4 squared. Which means: if that pattern of shots is correct, then the SD of team SH% talent is 0.4 points. 

That's pretty meaningful, about five places in the standings.

I'm not saying that shot pattern is accurate... it's a drastic oversimplification. But "all shots the same" is also an oversimplification, and the one that gives you the most luck. Any other pattern will have less randomness. 

What is actually the right pattern? I have no idea. But if you find one that splits the difference, where the luck SD drops only to 0.95% or something ... you'll still get SD(talent) around 0.35 percentage points, which is still meaningfully different from zero.

(UPDATE: Tango did something similar to this for baseball defense, to avoid a too-high estimate for variation in how teams convert balls-in-play to outs.  He describes it here.)


What's right? Zero? 0.35? 0.53? We could use some other kinds of evidence. Here's some other data that could help, from the hockey research community.

These two studies, which I pointed to in an earlier post, found year-to-year SH% correlations in the neighborhood of 0.15. Since the observed SD is about 1.0, that would put the talent SD in the range of 0.15. That seems reasonable, and consistent with the confidence intervals we just saw and the guesses we just made.

Var(talent) for Corsi doesn't have these problems, so it's easy to figure. If you assume a game's number of shots is constant, and binomial luck applies to whether those shots are for or against -- not a perfect model, but close enough -- the estimate of SD(talent) is around 4 percentage points.

Converting that to goals:

-- one talent SD in SH% =  1 goal
-- one talent SD in CF% = 10 goals

So, Corsi is 10 times as useful to know as SH%! Well, that might be a bit misleading: CF% reflects both offense and defense, while SH% is offense only. So the intuitive take on the ratio is probably more like 5 times. 

Still: Corsi talent dwarfs SH% talent when it comes to predicting future performance, by a weighting of 5 to 1. No wonder Corsi is so much more predictive!

Either way, it doesn't mean that SH% is meaningless. This analysis suggests that teams who have a very high SH% are demonstrating a couple of 5-on-5 tied goals worth of talent. (And, of course, a proportionate number of other goals in other situations.)


And, if I'm not mistaken ... again coincidentally, one point of CF% is worth the same, in terms of what it tells you about a team's talent, as one point of SH%. (Of course, a point of SH% is much harder to achieve -- only a few teams are as much as 1 point of SH% above or below average, while almost every team is at least 1 point of CF% above or below 50.0%.)

So, instead of using Corsi alone ... just add CF% and SH%! That only works in 5-on-5 tied situations -- otherwise, it's ruined by score effects. But I wouldn't put too much trust in any shot study that doesn't adjust for score effects, anyway.


I started thinking about this after the shortened 2012-13 season, when the Toronto Maple Leafs had an absurdly high SH% in 5-on-5 tied situations (10.82, best in the league), but an absurdly low CF% (43.8%, second worst to Edmonton).

My argument is: if you're trying to project the Leafs' scoring talent, you can't just use the Corsi and ignore the SH%. If the Leafs are 2 points above average in SH%, that tells you as much about their talent as two Corsi points. Instead of projecting the Leafs to score like a 43.8% Corsi team, you have to project them to score like, maybe, a 45.8% team. Which means that instead of second worst, they're probably only fifth or sixth worst.

That's almost exactly what I estimated a year ago, based on a completely different method and set of assumptions. Neither analysis is perfect, and there's still lots of randomness in the data and uncertainty in the assumptions ... but, still, it's nice to see the results kind of confirmed.


Thursday, December 11, 2014

The best NHL players create higher quality shots

A couple of months ago, I pointed to data showing that team shot quality is a real characteristic of a team, and not just the random noise the hockey analytics consensus believes it to be. 

That had to do with opposition shot distance. In 2013-14, the Wild allowed only 32 percent of opposing shots from in close, as compared to the Islanders, who allowed 61 percent. Those differences are far too large to be explained by luck.

Here's one more argument, that -- it seems to me -- is almost undeniable evidence that SH% must be a real team skill.


It's conventional wisdom that some players shoot better than others, right? In a 1991 skills competition, Ray Bourque shot out four targets in four shots. The Hull brothers (and son) were well-known for their ability to shoot. Right now, Alexander Ovechkin is considered the best pure scorer in the league.*

(*Update: yeah, that's not quite right, as a reader points out on Twitter.)

In 1990-91, Brett Hull scored 86 goals with a 22.1 percent SH%. Nobody would argue that was just luck, right? You probably do have to regress that to the mean -- his career average was only 15.7 percent -- but you have to recognize that Brett Hull was a much better shooter than average.

Well, if that's true for players, it's true for teams, right? Teams are just collections of players. The only way out is to take the position that Hull just cherry-picked his team's easiest shots, and he was really just stealing shooting percentage from his teammates. 

That's logically possible. In fact, I think it's actually true in another context, NBA players' rebounds. It doesn't seem likely for hockey, but, still, I figured, I have to check.


I looked up the top 10 players in "goals created" in 2008-09. (I limited the list to one player per team.)  

For each of those players, I checked his team's shooting percentage with and without him on the ice, in even-strength situations in road games, the following season. (Thanks to "Super Shot Search," as usual, for the data.)  

As expected, their teams generally shot better with them than without them:

With  Without
10.1   5.8   Ovechkin
 5.5   9.3   Malkin
 7.1   5.3   Parise
 6.0   8.2   Carter 
10.3   9.1   Kovalchuk
 8.9   5.7   Datsyuk
10.8   7.5   Iginla
10.7   7.2   Nash
 9.8   7.5   Staal
 9.0   9.1   Getzlaf
 9.4   7.5   Average

Seven of the ten players improved their team's SH%. Weighting all players equally, the average increase came out to +1.9 percentage points, which is substantial. 

It would be hard to argue, I think, that this could be anything other than player influence. 

It looks way too big to be random. It can't be score effects, because these guys probably play roughly the same number of minutes per game regardless of the score. It can't be bias on the part of home-team shot counters, because these are road games only. 

And, it can't be players stealing from teammates, because the numbers are for all teammates on the ice at the same time. You can't steal quality shots from players on the bench.


I should mention that there was also an effect for defense, but it was so small you might as well call it zero. The opposition had a shooting percentage of 7.7 percent with one of those ten players on the ice, and 7.8 percent without. 

That kind of makes sense -- the players in that list are known for their scoring, not their defense. I wonder if we'd find a real effect if we chose the players by some defensive measure instead? Maybe Selke Trophy voting?

Also ... what's with Malkin? On offense, the Penguins shot 3.8 percentage points worse on his shifts. On defense, the Penguins opponents shot 3.8 percent better. Part of the issue is that his "off" shifts are Sidney Crosby's "on" shifts. But even his raw numbers are unusually low/high.


Speaking of Crosby ... if you don't believe that the good players have consistently high shot quality, Crosby's record should help convince you. Every year of his career, the Penguins had higher quality shots with Crosby on the ice than without:

With  Without 
11.7   9.4   2008-09 Crosby
10.1   7.1   2009-10 
 9.6   6.9   2010-11
13.9   7.3   2011-12
13.2   8.8   2012-13
10.4   6.5   2013-14
 7.1   7.0   2014-15 (to 12/9)
10.9   7.6   Average

Sidney Crosby shifts show a consistent increase of 3.3 percentage points -- even including the first third of the current season at full weight. 

You could argue that's just random, but it's a tough sell.


Now, for team SH%, you could still make an argument that goes something like this:

"Of course, superstars like Sidney Crosby create better quality shots. Everyone always acknowledged that, and this blog post is attacking a straw man. The real point is ... there aren't that many Sidney Crosbys, and, averaged over a team's full roster, their effects are diluted to the point where team differences are too small to matter."

But are they really too small to matter?  How much do the Crosby numbers affect the Penguins' totals?

Suppose we regress Crosby's +3.3 to the mean a bit, and say that the effect is really more like 2.0 points. In 2013-14, about 38 percent of the Penguins' 865 (road, even-strength) shots came with Crosby on the ice. That means that the Crosby shifts raised the team's overall road SH% by about 0.76 percentage points. 
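The arithmetic, spelled out -- the on-ice effect, diluted by Crosby's share of the team's shots:

```python
crosby_effect = 2.0        # regressed on-ice SH% boost, in percentage points
crosby_shot_share = 0.38   # share of Penguins road even-strength shots

# The boost on Crosby's shifts, spread over the whole team's shots:
team_lift = crosby_effect * crosby_shot_share
print(round(team_lift, 2))  # 0.76
```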

That's not diluted at all. Looking at the overall team 5-on-5 road shooting percentages, 0.76 points would move an average team up or down about 8 positions in the rankings. 


Based on all this, I think it would be very, very difficult to continue arguing that team shooting percentage is just random.

Admittedly, that still doesn't mean it's important. Because, even if it's not just random, how is it that all these hockey sabermetric studies have found it so ineffective in projecting future performance? 

The simple answer, I think, is: weak signal, lots of noise. 

I have some ideas about the details, and will try to get them straight in my mind for a future post. Those details, I think, might help explain the Leafs -- which is one of the issues that got this debate started in the first place.


Tuesday, December 09, 2014

Corsi vs. Tango

Tom Tango is questioning the conventional sabermetric wisdom on Corsi. In a few recent posts, he presents evidence that Corsi can be improved upon as a predictor of future NHL team performance. Specifically: goals are important, too.

That has always seemed reasonable to me. In fact, it seems so reasonable that I wonder why it's disputed. But it is. 

Goals are just shots multiplied by shooting percentage (SH%). The consensus among the hockey research community is that their studies show SH% is not a skill that carries over from year to year. And, therefore, goals can't matter once you have shots.

I've been disputing that for a while now -- at least seven posts' worth (here's the first). But I've been doing it by argument. Tango jumps right to the equations. He split seasons in half randomly, and ran a regression to try to predict one half from the other half. Goals proved to be very significant. In fact, when you try to predict, you have to weight goals *four times as heavily* as Corsis. (Five times as heavily as unsuccessful Corsis.)

In a tongue-in-cheek jab at statistics named after their inventors, he called that new statistic the "Tango." 

Despite Tango's regression results, the hockey analysts who commented still disagree. I'm still surprised by that ... the hockey sabermetrics community are pretty smart guys, very good at what they do, and a lot of them have been hired by NHL teams. I've had times when I've wondered if I'm missing something ... I mean, when it's 1 dabbler against 20 analysts who do this every day, it's probably the 20 who are right. Well, now it's two against the world instead of one ... and the second is Tango, which makes me a little more confident. 

Also ... Tango jumps right to the actual question, and proves goals significantly improve the prediction. That's hard to argue with; at least, it's harder to argue with than what I'm doing, which is attacking the assumption that shooting percentage is all random. 

You can see one response here, and Tango's reply here.  

Tango got his data from a source who agreed to make it available to all (thank you!!!). I was planning to do some work with the data myself, but ... I guess Tango and I think about things from different angles, because, the more I thought about it, the more I came up with my "usual," less direct arguments. So, there'll be another post coming soon. I'll play with the data when my thoughts wind up somewhere that I need to look at it.

For this post, a few of my observations from Tango's posts and the discussion that followed.


In one of his posts, Tango wrote,

"One of the first things that (many) people did was to run a correlation of Corsi v Tango, come up with an r=.98 or some number close to 1 and then conclude: “see, it adds almost nothing”. If only that were true. "

Tango is absolutely right (you should read the whole thing). It's just another case of jumping to conclusions from a high or low correlation coefficient.

Sabermetrics is pretty good at figuring out good and bad. It has to be -- I mean, even fans and sportswriters are pretty good at it, and the whole point of sabermetrics is to do better. We're already in "high correlation" territory, able to separate the good teams and players from the bad teams and players pretty easily. 

Find a 10-year-old kid who's a serious sports fan -- any sport -- and get him to rank the players from best to worst -- whether by formula, or by raw statistics. Then, find the best sabermetric statistic you can, and rank the players that way.

I bet the correlation would be over 0.9. Just by gut.

We're already well into the .9s, when it comes to understanding hockey. Any improvements are going to be marginal, at least if you measure them by correlation. And so, it follows that *of course* Tango and Corsi are going to correlate highly. 

Also, as Tom again points out, if Corsi already has a high correlation with something, at first glance, Tango can appear to increase it only slightly. If you start with, say, 0.92, and Tango improves it to 0.93 ... well, that doesn't look like much, intuitively. But it IS much. If you look at it another way -- still intuitively -- it was 8% uncorrelated before, and now it's only 7% uncorrelated. You've improved your predictive ability by 12%!

The point is, you have to think about what the numbers actually mean, instead of just reacting with "0.93 isn't much bigger than 0.92, so who cares?"

Tom illustrates the point by noting that, even though Tango and Corsi appear to be highly correlated to each other, Tango improves a related correlation from .44 to .50. There must be some significant differences there.


There's a more important argument, though, than to not underestimate how much better .93 is than .92. And that argument is: *it's not about the correlation*. 

Yes, it's nice to have a higher and higher r-squared, and be able to reduce your error more and more. But it's not really error reduction we're after. It's *knowledge*. 

It's quite possible that a model that gives you a low correlation actually gives you MORE of an understanding of hockey than a model that gives you a high correlation. Here's an easy example: the correlation between points in the standings and whether or not you make the playoffs is very high, close to 1.0. The correlation between your Corsi and whether or not you make the playoffs is lower, maybe 0.7 or something (depending on how you do it -- which is another reason not to rely on the correlation alone). 

Which tells you more about hockey that you didn't already know? Obviously, it's the Corsi one. Everyone already knows that points determine whether you qualify for the playoffs. When you find out that shots seem to be important, that's new knowledge -- the knowledge that outshooting your opponents means something. (Of course, *what* it means is something you have to figure out yourself -- correlation doesn't tell you what type of relationship you've found.)

And that's what's going on here. Corsi has a high correlation with future winning, but Corsi *and goals* has an even higher correlation (to a significant extent). What does that tell us? 

Goals matter, not just shots. 

That's an important finding!  You can't dismiss it just because the predictions don't improve that much. If you do, you're missing the point completely. 

You wouldn't do that in other aspects of life, would you? Those faulty airbags in the news recently, the ones that kill people with shrapnel ... those constitute a small, small fraction of collision deaths. If you looked only at predicting fatalities, knowing about those airbags is a rounding error. 

But the point is not just to predict fatality rates!  Well, not for most of us. If you're an insurance company, then, sure, maybe it doesn't make that much difference to you, a couple of cents on each policy you write. But that doesn't mean the information isn't important. It's just important for different questions. Like, how can we reduce fatalities? We can reduce fatalities by replacing the defective air bags!

Also, the information means it is false to state that faulty airbags don't matter. You can still argue that they don't matter MUCH, relative to the overall total of collision deaths; that might be true. But for that, you can't argue from correlation coefficients. You have to argue from ... well, you can use the regression equation. You can say, "only 1 person in a million dies from the airbag, but 1000 in a million die from other causes."

In this case, Tango found that a goal matters four times as much as a shot. If, roughly speaking, there are 12 shots for every goal, then every 12 shots, you get 12 "points" of predictive value from the shots, and 4 "points" of predictive value from the goals. 

The ratio isn't 1000:1 like for the airbags. It's 3:1. How can you dismiss that? Not only is it important knowledge about hockey, but the magnitude of the effect is really going to affect your predictions.
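The arithmetic behind that 3:1 ratio, spelled out (the 12-shots-per-goal and 4x weights are the rough figures quoted above):

```python
# Rough predictive "points" per 12 shots, using the figures in the text:
# a goal is weighted 4x a shot, and there are roughly 12 shots per goal.
shots_per_goal = 12
goal_weight = 4                    # one goal counts like 4 shots

shot_points = shots_per_goal * 1   # 12 shots at weight 1 each
goal_points = 1 * goal_weight      # the 1 goal those shots produce

print(shot_points, goal_points, shot_points / goal_points)  # 12 4 3.0
```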


Why does the conventional wisdom dispute the relevance of goals? Because the consensus is that shooting percentage is random -- just like clutch hitting in baseball is random.

Why do they think that? Because the year-to-year team correlation for shooting percentage is very low.

I think it's the "low number means no effect" fallacy. Here are two studies I Googled. This one found an r-squared of .015, and here's another at .03. 

If you take the square root of those r-squareds, to give you correlations, you get 0.12 and 0.17, respectively.

Those aren't that small. Especially if you take into account how much randomness there is in shooting percentage. A signal of 0.17 in all that noise might be pretty significant.


It's well-known that both Corsi and shooting percentage change with the score of the game. When you're up by one goal, your SH% goes up by almost a whole percentage point -- and your Corsi goes down by four points. When you're up two or more, the differences are even bigger. 

That's probably because when teams are ahead, they play more defensively. Their opponents, who are trailing, play more aggressively -- they press in the offensive zone more, and get more shots. 

So, teams in the lead see their shots go down. But their expected SH% goes up, because they get a lot of good scoring chances when the opposition takes more chances -- more breakaways, odd-man rushes, and so on.

It occurred to me that these score effects could, in theory, explain Tango's finding. 

Suppose Team A has 30 Corsis and 5 goals in a game, and Team B has 30 Corsis and no goals. 

Even if shooting percentage is indeed completely random, team A is probably better than team B. Why? Because, with 5 goals, team A probably had a lead most of the game. If it had 30 Corsis despite spending so much time in the lead -- when shot rates get suppressed -- it must be a much better Corsi team to overcome that "handicap." So, when it's behind, it'll probably kick the other team's butt.

I don't think Tango's finding is *all* score effects. And, even if it were, all that would mean is that if you didn't explicitly control for score, "Tango" would be a more accurate statistic than Corsi. And most of the hockey studies I've seen *don't* control for score.


Here's one empirical result that might help -- or maybe won't help at all, I'm actually not sure. (Tell me what you think.)

My hypothesis has been that some teams have better shooting percentages, along with lower Corsis, because they choose to set up for higher quality shots. Instead of taking a 5% shot from the point, they take a 50/50 chance on setting up a 10% shot from closer in. 

As I have written, I think the evidence shows some support for that hypothesis. In 5-on-5 tied situations, there's a negative correlation between Corsi rate and SH%. In 2013-14 (raw stats here), it was -0.16. In the shortened 2012-13 season, it was -0.04. In 2011-12, -0.14. 

Translating the -0.14 into hockey: for every additional goal due to shot quality, teams lowered their Corsi by 2.1 shots.

That's a tradeoff of around 2 Corsis per goal. Tango found 4 Corsis per goal. Does that mean that two of Tango's four Corsis come from shot quality, and the other two come from score effects?

Not sure. There's probably too much randomness in there anyway to be confident, and I'm not completely sure that they're measuring the same thing. But, there it is, and I'll think about it some more.


UPDATE, 30 minutes later: Colby Cosh pointed out, on Twitter, that Tango's regression used Corsi and goal *differential* -- offense minus defense.  That means the "goals against" factor is partially a function of save percentage, which partially reflects goalie talent, which, of course, carries over into the future. So, goalie talent absolutely has to be part of the explanation of the "goals" term.

So now we have two factors: score effects and goalie effects. Could that fully explain the predictive value of goals, without resort to shot quality?  I'll have to think about the actual numbers, whether the magnitudes could be high enough to cover the full "4 shots" coefficient.


Tuesday, December 02, 2014

Players being "clutch" when targeting 20 wins -- a follow-up

In his 2007 essay, "The Targeting Phenomenon" (subscription required), Bill James discussed how there are more single-season 20-game winners than 19-game winners. It's the only place in the distribution where that happens -- where a higher win total shows up more frequently than the number just below it. 

This is obviously a case of pitchers targeting the 20-win milestone, but Bill didn't speculate on the actual mechanisms for how the target gets hit. In 2008, I tried to figure it out. But, this past June, Bill pointed out that my conclusion didn't fit with the evidence:

"... the Birnbaum thesis is that the effect was caused one-half by pitchers with 19 wins getting extra starts, and one-half by poor offensive support by pitchers going for their 21st win, thus leaving them stuck at 20. But that argument doesn't explain the real life data. 

"[If you look closely at the pattern in the numbers,] the bulge in the data is exactly what it should be if 20 is borrowing from 19 -- and is NOT what it should be if 20 is borrowing both from 19 and 21."

(Here's the link.  Scroll down to OldBackstop's comment on 6/6/2014.)

So, I rechecked the data, and rethought the analysis, and ... Bill is right, as usual. The basic data was correct, but I didn't do the adjustments properly.


My original study covered 1940 to 2007. This study, though, will cover only 1956 to 2000. That's because I couldn't find my original code and data. The "1956" is what I happened to have handy, and I decided to stop at 2000 because Bill did. 

First, here are the raw numbers of seasons with X wins:

17 wins: 159
18 wins: 132
19 wins:  92
20 wins: 113
21 wins:  56
22 wins:  35
23 wins:  20
24 wins:  20

You can see the bulge we're dealing with: there are way too many 20-win pitchers. And the excess can't have come entirely from the 21-win bucket: shifting pitchers from 21 to 20 leaves the average of those two buckets unchanged, at 84.5 -- barely below the 92 in the 19-win bucket, when a smoothly declining distribution says it should sit well below. That can't be right. And, as Bill pointed out, even if only *half* the excess came from the 21 bucket, 20 would still be too big relative to 19.

So, let me try fixing the problem.

In the other study, I checked four ways in which 20 wins could get targeted:

1. Extra starts for pitchers getting close
2. Starters left in the game longer when getting close
3. Extra relief appearances for pitchers getting close
4. Better performance or luck when shooting for 20 than when shooting for 21.

I'll take those one at a time.


1. Extra starts

The old study found that pitchers who eventually wound up at 19 or 20 wins did, in fact, get more late-season starts than others -- about 23 more overall. In this smaller study (1956-2000 instead of 1940-2007), that translates down to maybe 18 extra starts. 

That's about 9 extra wins. Let's allocate four of them to pitchers who wound up at 19 instead of 18, and the other five to pitchers who wound up at 20 instead of 19. If we back that out of the actual data, we get:

18 wins: 132 136
19 wins:  92  93
20 wins: 113 108
21 wins:  56  56

(If you're reading this on a newsfeed that doesn't support font variations: the first column is the old values, which should be struck out.)

What happens is: the 18 bucket gets back the four pitchers who won 19 instead. The 19 bucket loses those four pitchers, but gains back the five pitchers who won 20 instead of 19. The 20 bucket loses those five pitchers.
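The bookkeeping is easy to get wrong (I did, the first time), so here's the back-out step as a tiny sketch -- the bucket counts and the 4/5 split are the figures from above:

```python
# Back out "extra start" wins by moving pitchers between win buckets.
# A pitcher who reached 19 only because of an extra start really
# belongs in the 18 bucket, and so on up the chain.
buckets = {18: 132, 19: 92, 20: 113, 21: 56}

def back_out(buckets, wins, count):
    """Move `count` pitchers from the `wins` bucket down to `wins - 1`."""
    buckets[wins] -= count
    buckets[wins - 1] += count

back_out(buckets, 19, 4)   # 4 pitchers won 19 instead of 18
back_out(buckets, 20, 5)   # 5 pitchers won 20 instead of 19

print(buckets)  # {18: 136, 19: 93, 20: 108, 21: 56}
```

Note that the 19 bucket both loses (to 18) and gains (from 20), which is exactly the step the original study skipped.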

(In the other study, I didn't back out the effects as I found them, so I wound up taking some of them from the wrong place -- which caused the problem Bill found.)

So, we've closed the gap from 21 down to 15.


2. Starters left in the game longer

After I had posted the original study, Dan Rosenheck commented,
"You didn't look at innings per start. I bet managers leave guys with 19 W's in longer if they are tied or trailing in the hope that the lineup will get them a lead before they depart."

I checked, and Dan was right. In a subsequent comment, I figured Dan's explanation accounted for about 10 extra twenty-game winners. Those are all taken from the 19-game bucket, because the effect occurred only for starters currently pitching with 19 wins.

For this smaller dataset, I'll reduce the effect from 10 seasons to 7. 


18 wins: 136 136
19 wins:  93 100
20 wins: 108 101
21 wins:  56  56

Now, the bulge is down to 1.  We still have a ways to go, if the 19 is to be significantly higher than the 20, but we're getting there.


3. Extra relief appearances

The other study listed every pitcher who got a win in relief while nearing 20 wins. Counting only the ones from 1956 to 2000, we get:

3 pitchers winding up at 19
5 pitchers winding up at 20
2 pitchers winding up at 21

Backing those out:

18 wins: 136 139
19 wins: 100 102
20 wins: 101  98
21 wins:  56  54

The gap now goes the proper direction, but only slightly.


4. Luck

This was the most surprising finding, and the one responsible for the "getting stuck at 20" phenomenon. Pitchers who already had 20 wins were unusually unlikely to get to 21 in a subsequent start. Not because they pitched any worse, but because they got poor run support from their offense.

When Bill pointed out the problem, I wondered if the run-support finding was just a programming mistake. It wasn't -- or, at least, when I rewrote the program, from scratch, I got the same result.

For every current starter win level, here are the pitchers' W-L records in those starts, along with the team's average runs scored and allowed:

17 wins:   483-311 .608   4.30-3.61
18 wins:   350-250 .583   4.30-3.61
19 wins:   260-182 .588   4.24-3.56
20 wins:   150-136 .524   3.81-3.54
21 wins:    94- 61 .606   4.49-3.44
22 wins:    59- 23 .720   4.26-2.80

The run support numbers are remarkably consistent -- except at 20 wins. Absent any other explanation, I assume that's just a random fluke.

If we assume that the 20-win starters "should have" gone 171-115 (.598) instead of 150-136 (.524), that makes a difference of 21 wins.
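The arithmetic for that "should have" record, using the .598 figure and the 286 decisions from the table:

```python
# Starters going for win #21 actually went 150-136.  At a "normal"
# winning percentage of .598 (in line with the other rows of the table),
# those same 286 decisions would have produced about 171 wins.
actual_wins, actual_losses = 150, 136
decisions = actual_wins + actual_losses        # 286 starts with a decision
normal_pct = 0.598

expected_wins = round(normal_pct * decisions)  # 171
lost_to_bad_luck = expected_wins - actual_wins # 21 wins lost

print(expected_wins, lost_to_bad_luck)  # 171 21
```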

The mistake I made in the previous study was to assume that those wins were all stolen from the "21-win" bucket. Some were, but not all. Some of the unlucky pitchers eventually got past the 20-win mark; a few, for instance, went on to post 23 wins. In their case, it becomes the 23-win bucket stealing a player from the 24-win bucket.

I checked the breakdown. For every starter who tried for his 21st win but didn't achieve it that game, I calculated where he eventually finished the season. From there, I scaled the totals down to 21, the number of wins lost to bad luck. The result:

20  wins:  9 pitchers
21  wins:  5 pitchers
22  wins:  3 pitchers
23  wins:  1 pitcher
24  wins:  2 pitchers
25+ wins:  less than 1 pitcher

So: the 20-win bucket stole 9 pitchers from the 21-win bucket. The 21-win bucket stole 5 pitchers from the 22-win bucket. And so on. 

Adjusting the overall numbers gives this:

18 wins: 139 139
19 wins: 102 102
20 wins:  98  89
21 wins:  54  50
22 wins:  35  33


And that's where we wind up. It's still not quite enough, to judge by Bill's formula and even just the eyeball test. It still looks like there's a little bulge at 20, by maybe five pitchers. If 20 could steal five more pitchers from 19, we'd be at 107/84, which would look about right.

But, we've done OK. We started with a difference of +21 -- that is, 21 more twenty-game winners than nineteen-game winners -- and finished with a difference of -13. That means we found an explanation for 34 games, out of what looks like a 39-game discrepancy.
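Just to verify that bookkeeping, using the before-and-after bucket counts from above:

```python
# Difference between the 20-win and 19-win buckets, before and after
# the four adjustments.
before_19, before_20 = 92, 113
after_19, after_20 = 102, 89

diff_before = before_20 - before_19   # +21
diff_after = after_20 - after_19      # -13
explained = diff_before - diff_after  # 34 games accounted for

print(diff_before, diff_after, explained)  # 21 -13 34
```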

Where would the other five come from? I don't know. It could be luck and rounding errors. It could also be that the years 1956-2000 aren't a representative sample of the original study, so we lost a bit of accuracy when I scaled down.  Or, it could be some fifth real factor I haven't thought of.

In any case, here's the final breakdown of the number of "excess" 20-game winners:

-- 5 from getting extra starts;
-- 7 from being left in games longer than usual;
-- 3 from getting extra relief appearances;
-- 9 from bad run support getting them stuck at 20;
-- 5 from luck/rounding/sources unknown.

By the way, one important finding still stands through both studies. Starters didn't seem to pitch any better than normal with their 20th win on the line, so you can't accuse them of trying harder in the service of a selfish personal goal.
