### "Stumbling on Wins": is there really little difference between goalies?

My copy of "Stumbling on Wins," the new book by David Berri and Martin Schmidt, arrived on Friday. It's a quicker read than their first book, "The Wages of Wins"; for one thing, it's shorter, at 140 pages (before appendices and endnotes). For another thing, the writing style is a bit breezier and less technical, more suited to the non-academic (but serious) sports fan.

The theme of this book is how decision-makers in sports make bad decisions because they don't know how to properly evaluate the information they have. Irrationality in decision making is a subject that's been popularized quite a bit lately. In the last few years, you've got "Predictably Irrational" by Dan Ariely, "Nudge" by Richard Thaler and Cass Sunstein, "Sway" by Ori and Rom Brafman, "Priceless" by William Poundstone, and others. The authors of this book acknowledge the trend, and note that they chose their title in tribute to Daniel Gilbert's "Stumbling on Happiness."

I disagree with many (but not all) of the conclusions the authors reach ... it seems like, too often, the authors will do a quick study, look at the results superficially, jump to conclusions that I don't think are justified, and argue from those conclusions that decision-makers are doing it wrong.

For now, I'll just give you one example. In Chapter 3, they argue that NHL goalies are overpaid. Why? Because

"... there simply is little difference in the performance of most NHL goalies."

What evidence do they give for this?

First, they ran a correlation between a goalie's save percentage (SV%) in consecutive seasons. They got an r-squared of .06, or 6%. That's a small number. So goalies are inconsistent, and what is being observed is not really the goalie's talent.

That's not correct at all.

As I wrote before, and Tango has repeatedly said on his own blog, you can't conclude, just because the r-squared is a small number, that the relationship between two variables is weak. Indeed, the same relationship can give you very different r-squareds, depending on other factors in your data, the most obvious of which, here, is sample size.

The r-squared, by definition, is the variance of talent as a percentage of total variance. But the smaller your sample, the more total variance you have just because of luck. And so, the smaller the sample, the lower the r-squared, regardless of whether the talent is low or high.

A low r-squared might mean a small needle -- or it might mean a large haystack.

Unless you take a few seconds to figure out which it is, your r-squared doesn't tell you much of anything about the relationship between the two variables.
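To see the haystack effect concretely, here is a minimal simulation sketch. It assumes a fixed spread of true talent (an SD of .008 around a .904 average, chosen only for illustration), models luck with a normal approximation to the binomial, and computes the season-to-season r-squared for different numbers of shots faced. The function names and parameter values are hypothetical, not anything taken from the book.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulated_r_squared(shots, talent_sd=0.008, mean_sv=0.904,
                        goalies=5000, seed=1):
    """Season-to-season r-squared of SV% for simulated goalies.

    With the same seed, the simulated talents are identical in every call;
    only the number of shots per season (and so the amount of luck) changes.
    """
    rng = random.Random(seed)
    talents = [rng.gauss(mean_sv, talent_sd) for _ in range(goalies)]

    def one_season(talent):
        # Normal approximation to binomial luck (an assumption of this sketch).
        luck_sd = math.sqrt(talent * (1 - talent) / shots)
        return rng.gauss(talent, luck_sd)

    year1 = [one_season(t) for t in talents]
    year2 = [one_season(t) for t in talents]
    return pearson_r(year1, year2) ** 2


if __name__ == "__main__":
    for shots in (200, 800, 1700):
        print(shots, round(simulated_r_squared(shots), 3))
```

The talent spread never changes between runs, yet the r-squared climbs as the number of shots grows. Same needle, smaller haystack.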

What *does* that .06 mean? Well, if the r-squared is .06, then the r is about .25. Roughly speaking, that means you can expect 25% of a goalie's difference from the mean to be repeated next year. Put another way, you have to regress the goalie 75% towards the mean.

Yes, that's not as much as you'd expect. By that calculation, if the average save percentage is .904, and goalie X comes in one season at .924, you'd expect next year he'd be at .909 -- one quarter of the distance between .904 and .924. That's still something: it's .005 above average, which is one goal every 200 shots, or about 10 goals a season.
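The regression arithmetic above can be sketched in a few lines. The function name is hypothetical, and the 2,000-shot season is an assumption inferred from "one goal every 200 shots, or about 10 goals a season".

```python
def regress_to_mean(observed, league_mean, r):
    """Expected next-season value: keep a fraction r of the gap from the mean."""
    return league_mean + r * (observed - league_mean)


league_mean = 0.904
expected_sv = regress_to_mean(observed=0.924, league_mean=league_mean, r=0.25)

# Assumed workload: 2,000 shots in a season (not stated outright in the post).
shots_per_season = 2000
goals_saved = (expected_sv - league_mean) * shots_per_season

print(round(expected_sv, 3), round(goals_saved, 1))
```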

What do you think -- the idea that a .924 goalie is really .909, does that mean "there's little difference between goalies?" That's more a matter of opinion ... but at least now you have the numbers you need to get a grip on what's going on. The "r-squared equals only .06" doesn't really help you decide.

----

Anyway, that's one problem, that the .06 isn't as small as it looks. A bigger problem is that I don't think the .06 is accurate.

I repeated the same correlation for two sets of two consecutive years, 2005-06 to 2006-07, and 2007-08 to 2008-09. I looked at only the 20 goalies with the most minutes played. I got r-squareds of .30 and .25, respectively, both much higher than the authors' .06.

Why? I think it's because the authors included goalies with many fewer shots against. They don't say exactly what their criteria were, except that they "adjusted for time on the ice" (whatever that means: SV% doesn't depend on time played). In other studies in the same chapter, they used 1000 minutes as a criterion, so maybe that's what they did here.

Now, to simplify, suppose the variance of SV% consists of only talent and luck. A full-time goalie plays about 3,500 minutes. In my regression, it turns out that you get 1 part talent to three parts luck (that's where the .25 comes from: 25% of the total is talent). Now, suppose Berri and Schmidt's average goalie played only half that, or 1,750 minutes. Then the luck variance would be twice as high, and they'd get one part talent to *six* parts luck. That would drop the r-squared down from .25 to .14.
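As a rough sketch of that dilution, following the simplification above (variance is only talent plus luck, and luck variance scales inversely with playing time), the calculation looks like this. The function name and the "parts" framing are just for illustration.

```python
def talent_share(talent_parts, luck_parts, playing_time_ratio):
    """Talent's share of total variance when playing time changes.

    playing_time_ratio is the new playing time divided by the original.
    Luck variance scales inversely with playing time; talent variance
    does not change.
    """
    scaled_luck = luck_parts / playing_time_ratio
    return talent_parts / (talent_parts + scaled_luck)


full_time = talent_share(1, 3, playing_time_ratio=1.0)  # 3,500 minutes
half_time = talent_share(1, 3, playing_time_ratio=0.5)  # 1,750 minutes

print(round(full_time, 2), round(half_time, 2))
```

Halving the minutes turns one part talent to three parts luck into one part talent to six parts luck, with no change at all in how good the goalies are.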

I don't know how the authors got .06 when my analysis shows .14 ... maybe their cutoff was lower than 1,000 minutes. Maybe there's some selection bias in my sample of top goalies only. Maybe my four seasons just happened to be not quite representative. Regardless, the fact that the r-squared varies so much with your selection criterion shows that you can't take it at face value without doing a bit of work to interpret it.

In any case, going back to my r-squared of .25 ... the square root of .25 is .50. That means that exactly half a full-time goalie's observed difference from the mean is real, and will be repeated next season; if a goalie is .020 better than average this year, expect him to be .010 better than average next year. That's pretty reasonable. In that light, I don't think you can say "there's little difference between goalies" at all.

----

And, in fact, we should be able to figure out the spread in goalie talent directly, by a method I learned from Tango a few years ago.

Suppose a goalie faces 1,700 shots, and is expected to save 90% of them. By random chance, he'll sometimes save more than 90%, and sometimes less. By the normal approximation to the binomial distribution, the standard deviation of his save percentage due to luck will be .0073.

Now, for the five seasons I checked, the top 20 goalies that year had an actual SD between .007 and .013 ... let's call it about .011.

That's higher than .0073, as you'd expect. The .0073 is what you'd get if all goalies were identical. But there's also extra variance from the fact that some goalies are better than others. Since

(Observed SD)^2 = (Non-luck SD)^2 + (Luck SD)^2

we can say

.011^2 = (Non-luck SD)^2 + .0073^2

So the non-luck SD should be about .0082. If we consider everything that's not binomial luck to be talent, then we can say that the SD of top-20 goalie talent is .008. (I dropped the last decimal because our numbers are very rough here.)
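For anyone who wants to redo the calculation with different inputs, here is a small sketch of the two steps: the binomial luck SD, then the non-luck SD backed out of the observed SD. The function names are hypothetical; the inputs are the rough figures used above.

```python
import math


def luck_sd(save_rate, shots):
    """SD of save percentage from binomial luck alone."""
    return math.sqrt(save_rate * (1 - save_rate) / shots)


def non_luck_sd(observed_sd, luck):
    """Back out the non-luck SD from observed^2 = non_luck^2 + luck^2."""
    return math.sqrt(observed_sd ** 2 - luck ** 2)


luck = luck_sd(save_rate=0.90, shots=1700)
talent = non_luck_sd(observed_sd=0.011, luck=luck)

print(round(luck, 4), round(talent, 4))
```

Note that `non_luck_sd` only works when the observed SD is larger than the luck SD. If a small sample happens to show less spread than luck alone predicts, the square root has no real answer, and the estimate of talent is effectively zero.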

If everything that's "non-luck" should repeat next year, we should get an r-squared of about (.008/.011)^2, which is .53. I only got .25 or .30. Why? Well, there could be more luck involved than just binomial. Not all shots are created equal; maybe some goalies got easier shots, and some harder (search for "Shot Quality" here). Maybe there's some variation in talent because of injury or age. There's definitely the quality of the goalie's defense, and that varies a bit from year to year.

Still, there's quite a bit of evidence of talent here. The theoretical value for r-squared was .53, which means the theoretical value for r is .73. That means that if a goaltender was absolutely perfectly consistent, and every shot gave him the same chance of stopping it, each and every year ... then, 73% of his observed talent would be real. That's what it means to be absolutely consistent.

I didn't find .73, but I found about .50. That's a pretty good proportion of the theoretical maximum. I think we can say that a good part of what we see of a goalie's performance is real.

But, does all this mean that "there's little difference between goalies?" Well, let's check. We got an r-squared of .25, which means that 25% of the variance is talent. The variance observed is .011^2, so the variance due to talent is a quarter of that, which is .0055^2.

That means that a goalie who's one SD above average will have a save percentage .0055 better than average. A goalie who's two SDs above average will be .011 better than the mean.

In the context of 1700 shots, one SD is about 9 goals. Two SDs is about 18 goals. And that's from only the 20 goalies with the most playing time. You'd imagine that if you included backup goalies, the variance would be larger. But, to be conservative, I'll leave the SD at 9 goals for now.
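The conversion from r-squared to goals can be sketched as below. It follows the reading used above, in which the r-squared is treated as talent's share of the observed variance; the variable names are just for illustration.

```python
import math

observed_sd = 0.011   # observed SD of SV% among the top-20 goalies
r_squared = 0.25      # treated above as talent's share of the variance
shots = 1700          # shots faced in a season

# Talent variance is r_squared times the observed variance.
talent_sd = math.sqrt(r_squared * observed_sd ** 2)

# One SD of talent, expressed in goals over a season's worth of shots.
goals_per_sd = talent_sd * shots

print(round(talent_sd, 4), round(goals_per_sd, 1))
```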

Berri and Schmidt looked at Martin Brodeur's career and found he saved an average of 13.6 goals per year, compared to an average goalie. That's consistent with a 9 goal SD; it implies that Brodeur is about one and a half SDs above average, which seems very reasonable. The authors also point out that, in terms of wins, an advantage of 13.6 goals a year is very small compared to what an NBA superstar can provide. That's true, but it doesn't mean that goalies don't matter in the context of hockey. To address that point, you need to look at the 9 goal SD. Is that a lot?

Well ... I'm not sure. I think it's more than it looks. Let's compare goalies to skaters.

Looking at the plus-minus statistics from 2008-09, a bunch of Bruins come up near the top, with numbers scattered around +30. That means that, when those players were on the ice in non-power-play situations, the Bruins scored 30 more goals than they gave up. Along with Detroit, that seems to be the highest bunch in the league.

Since five players are on the ice, you could give each of them credit for 6 extra goals. But they're not all equal -- some are better than others. Let's say that instead of 6/6/6/6/6, they might be 10/8/6/4/2.

That means that the best player on the Bruins might be worth 10 goals. Regressing that to the mean, let's call it 8 goals. Adding power plays, which weren't included in plus/minus, let's move it back to 10 goals.

That's the best player on the best team. But maybe the best player in the league wasn't on the Bruins -- he might have been on a mediocre team, and his teammates caused his plus/minus to drop. How do we adjust for that? I don't know, but let's bump it up 4 goals, and estimate that the best player in the NHL was worth 14 goals last year.

Now, figure the best goalie is about 2 SD above average, for 18 goals. So, the best goalie in the league is better than the best skater! That doesn't suggest, at all, that there's little difference between goalies.

Except ... last year's top plus/minus figure of +37 (David Krejci) is low by historical standards. In 1981-82, the top five players had plus-minuses above 66, almost twice what the Bruins had last year (although in a higher-scoring offensive environment). And, in 1970-71, Bobby Orr had a plus-minus of +124. Back then, you could certainly argue that goalies were more homogeneous than skaters, and the best skater (Gretzky, Orr, or Lemieux) was easily better than the best goalie. And I think that coincides with the intuition that people had back then, that a good goalie could help, but would never be a factor like a Gretzky would.

Still, maybe we should bump the 14 goal estimate for the best skater up a little bit, closer to the 18 goals we found for the best goalie.

I may be wrong in my logic somewhere, but, if I've done everything right, it seems that top goalies in this era are very similar in importance to top skaters. So when Berri and Schmidt accuse GMs of signing goalies to big contracts because "the people that write the checks" don't "understand [the] story" that goalies don't matter much ... well, I think they underestimate the capabilities of those hockey executives. Their judgment might not be perfect, but I think they understand the variation of talent at least as well as Berri and Schmidt seem to.

-----

So I think Berri and Schmidt got into trouble by just looking at the number .06 without thinking about what it meant. They do this again, a bit later, when they run a correlation between SV% in the regular season, and SV% in the playoffs. That's just doomed to fail, because the playoff sample is so small. That makes the variance due to luck very large, which, in turn, brings the r-squared very close to zero.

Actually, they find an r-squared of .07, which is larger than the .06 they found over two consecutive regular seasons. You'd think it would be smaller, since playoff samples are so much smaller. I wonder if the .06 came about because they used very small samples over the regular season, including goalies with only a couple of games played?

Anyway, after that, they try the correlation between two consecutive playoff appearances. They found "none" of the performance was predictable, which suggests an r-squared of .00 (or maybe they assume it's .00 because it wasn't statistically significant). But that's probably just a sample size issue. If their intention was to show that playoff performance by goalies has a lot of random luck in it, well, yes, of course it does. But if their intent is to conclude that goalie performance is completely unpredictable, that one r-squared isn't enough evidence of that. And I'd bet that if they looked a little closer, they'd find that goalies perform in the playoffs exactly as you'd expect them to, subject to a substantial amount of binomial random luck. Or maybe not -- maybe playoff hockey is so different from regular season that different goalies excel at it. But if you want to check that, you have to do more than just run a single regression and look at a single r-squared.

----

Finally, another non sequitur arises where they write,

"Looking at ... goalies ... one sees an average save percentage of [.895]. The standard deviation of that percentage, though, is only .018. Hence the coefficient of variation of save percentage [the SD divided by the mean] is only 0.02. Hence, there simply is very little difference in the performance of most NHL goalies."

Now, I don't get this at all. How does the coefficient of variation tell you whether or not there's a qualitative difference in performance? It just doesn't. The fact that the SD is a small fraction of the mean doesn't have anything to do with how important the statistic is.

Intuitively, I can see how you might jump to that conclusion, if you don't think about it much. But if you do, it makes no sense. The proportion doesn't matter. When it comes to goals, it's the absolute number that matters. If you let in 10 more goals than average over a season, you cost your team 10 goals. It doesn't matter if you and the other goalies get 100 shots, 1000 shots, 10,000 shots, or 100,000 shots -- ten goals in a season is ten goals in a season.

Another way to look at it is that the SV% statistic is arbitrary, which means the coefficient of variation is arbitrary. Suppose the NHL had decided to use "goal percentage" instead of "save percentage", counting up the percentage of shots that went in, instead of the percentage that did not. In that case, the SD would be exactly the same, .018. But the average is now the opposite of what it was -- if 89.5% of shots are stopped, then 10.5% of shots are NOT stopped. And so now your coefficient of variation is .17.

One way, you get .02. Another way, you get .17. So how can the size of the arbitrary coefficient of variation possibly have anything to do with how important goaltending is?

I'm sure the coefficient of variation has its uses, but this isn't one of them.
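The two ways of counting are easy to check. This sketch uses the mean and SD quoted from the book; the function name is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation divided by the mean."""
    return sd / mean


sd = 0.018          # the same spread either way
save_rate = 0.895   # share of shots stopped

cv_saves = coefficient_of_variation(sd, save_rate)      # "save percentage"
cv_goals = coefficient_of_variation(sd, 1 - save_rate)  # "goal percentage"

print(round(cv_saves, 2), round(cv_goals, 2))
```

The spread in goals allowed is identical in both cases. Only the denominator changed.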

-----

In summary: as I read it, Berri and Schmidt's argument goes something like this:

-- The r-squared of SV% in consecutive seasons is .06.

-- The r-squared of SV% between a season and the playoffs is .07.

-- The r-squared of SV% between two consecutive playoffs is .00.

-- The coefficient of variation for SV% is .02.

--> These are all small numbers. Therefore, goalies' performances aren't consistent. That means there's not much difference between them, and GMs don't seem to realize this.

As I wrote, I don't think that logic makes sense. I think the evidence shows that, in the current era, good goalies are about as valuable as good skaters. I haven't looked, but I bet that salary data would show that to be roughly consistent with what GMs think.

Labels: goalies, hockey, statistics, Stumbling on Wins, The Wages of Wins

## 12 Comments:

In fairness, I should say that Berri and Schmidt are correct when they say that a goalie has much less impact than a star basketball player.

They say that a great season by a goalie might be worth six to nine wins above average, but the top eight players in the NBA last year all produced more than ten wins above average (and Chris Paul led with 22 wins).

I agree with that, and I'm not disputing that part of the argument at all.

Phil:

Great post. Small point: You say that "If everything that's "non-luck" should repeat next year, we should get an r-squared of about (.008/.011)^2, which is .53. I only got .25 or .30. Why?"

I think .008/.011 implies an r of about .53, and thus an r-squared of about .28, which is entirely consistent with your data. And if so, then your estimated SD for top-20 goalies would be .008, more like 13 goals per season.

And presumably this would be still larger if you included non-elite goalies, as we should if we want to really understand their value. The authors' reported SD of "only" .018 would support that notion (although smaller samples probably contribute a lot to their higher SD).

Hey Phil,

I agree with your general point about R-squared (and the later Coef. of Var), but I'm a bit confused about the idea that a goalie, 2 SDs (18 goals) above the average goalie, is worth more than a skater who is more like 1.5 SDs above the average goalie (14 goals above). In absolute terms, that would be correct, but with all the 'replacement' value discussions and positional adjustments, is it really useful to compare the two in that way?

For example, let's say that the ability to convert a skater to a goalie is similar to converting a position player to a pitcher in baseball, so there's no question about changing positions. If we have the best pitcher around, at 20 runs above average with an SD of 10, and the best position player at 40 runs above average, why would we use the same SD of the run distribution for both? Wouldn't we want to use separate SDs for position players vs. pitchers? Or goalies vs. skaters?

With your example, we find it's obvious that the position player saves more runs than the average pitcher. But what if the SD for position players is 40, while the SD for pitchers is still 10? In that case, the large supply of position players at 40 above average (a 1 SD talent) vs. the much lower supply of pitchers at 20 above average would make me lean toward paying for the pitcher (a 2 SD talent).

Now, I have no clue about the distribution of talent among goalies. But is it possible that it's the case that their talents, as a whole, don't differ from one another very much, while that 14 goal 'skater' is much more rare given the SD of 'skater' goals? You very well may come up with the same conclusion, but I thought this was the entire point behind many sabermetricians adjusting for the talent at the position. I believe this may be what Berri and Schmidt are getting at with their coefficient of variation (though you make a good point at looking at the absolute value of that number, and this may have been better informed by looking also at the spread for skaters, though I don't have the book so I'm taking your word for it that they didn't bother to do so).

"But what if the SD for position players is 40, while the SD for pitchers is still 10? In that case, the large supply of position players at 40 above average (a 1 SD talent) vs. the much low supply of pitchers at 20 above average would make me lean toward paying for the pitcher (a 2 SD talent)."

Millsy: you are confusing scarcity with value here. The +2SD pitcher is indeed a more scarce talent than the +1SD hitter (by definition). But if we assume replacement level is say -2SDs, then the +1 hitter is worth 120 runs above replacement in your example, while the +2 pitcher is worth just 40 runs above replacement. This hitter is 3x as valuable, even though he's less scarce.

I see this point made all the time: "talent X may be worth more because it's very scarce." Scarcity has no intrinsic value. A skill has value because it creates wins/runs, and the scarcity of a given talent level has value because of its distance from the mean. But scarcity itself has zero value.

The world you're describing is basically one in which all pitchers are virtually the same, as compared to the huge talent spread among hitters. And in that world pitchers just wouldn't be very valuable.

My biggest issue is that both Phil as well as the authors are using Overall Save Percentage as a measure of goalie talent without controlling for a couple things that might change the results significantly. (As is the case with any statistical study, having good knowledge of the subject as well as previous research of the subject is vital.)

First, man-strength (i.e. Even Strength, Power Play, Penalty Kill) is far and away the biggest contributor to save percentage. League average save percentage is WAY different at 4-on-5 than it is at 5-on-5. Further, teams differ greatly in their ability to take and draw penalties, so goaltenders (even in the span of many seasons) can spend quite a different amount of time at different man-advantages, which will unfairly affect their overall save percentage.

Second, home arenas in the NHL have been empirically shown to display bias in the recording of shots. For example, New Jersey tends to way undercount shots relative to average, and Anaheim tends to overcount. This has a significant effect on save percentage.

Third, while shot quality differences are mostly observed via differences in man-advantages, there are still some effects at even strength. The biggest determining factor seems to be "game score." I.e. the distribution of shot differentials and shooting percentages gets wacky when teams are leading/trailing.

So, imo if you wanted to do this type of study more thoroughly, you should look at the distribution of save percentages only at even strength when the score is tied during road games.

Yes, I agree you have to go deeper than just SV% to get a better idea of how goalies compare ... somewhere in the post, there's a link to some shot quality studies that do that.

I used SV% because that's what Berri and Schmidt did, and I didn't particularly want the discussion to be about what statistics they used, but, rather, what techniques they used.

Eek. Guy, thank you for correcting me on that. I was going one way with my point and completely went off in the wrong direction.

I think your correction still supports the idea that the distributional concerns over Goalie talent could be different than that of 'skater' talent. In fact, I went in the complete opposite direction than I intended.

If our goalies are all very similar in talent, while the skaters are not, then we wouldn't want to pay a lot for them. So, if the majority are very close together, then we should be indifferent for the most part to who our goalie is (as you say with the pitchers in your last paragraph).

On the other hand, if 'skaters' bottom out at a pretty crappy level, it may be more instructive to compare them to one another, rather than lumping them in with goalies. I still think they should at least be separated to see IF there is a difference in the distribution (especially the left tail, like the low replacement level of a batter in your example).

I assume this was the point that Schmidt and Berri were trying to make: that all goalies are pretty much the same, so paying so much for them isn't necessary. If the worst you can do is -10 among goalies, when the worst skater is more of a -30, then you might be better off spending for the skater at +14 than the goalie at +18, no?

Whether or not that is true is for someone that knows something about hockey (not me).

Millsy,

I don't think that was the point that Berri and Schmidt were trying to make ... there's nothing in the text that suggests to me that they were thinking on those lines.

Also, if you look at the bottom of the +/- list, the worst players are less minus than the top players are plus. Which makes sense ... it's the same thing in baseball, there are more guys 50 points over .265 than there are 50 points under .265. The .205 guys just don't get to play.

The worst +/- guys are in the -20s. That's -4 goals each, on average, since there are five guys on the ice. Maybe the worst guy is -10 goals on the season? The worst goalies with 10+ games appear to be around .890, which is 20 goals for a full-time goalie.

I'd say there's no evidence that the worst goalie is less bad than the worst skater.

Thanks, Phil. Like I said, I have no clue about the talent in NHL really (perhaps living in Michigan, I should learn something about hockey). I would expect the right skew you mention. Was mainly curious if, of those who play, it's easier to field (ice?) one position relative to the other, and that was what they were implying. Or, in other words, the left tail is different for the positions.

Hi, Millsy,

Good points ... I don't know how the skew breaks down for skaters as opposed to goalies. Something that could be looked into, I'm sure.

Damn, you've picked the toughest nut to crack in hockey. Roger Neilson never figured out goaltending, so it's really an unknown, relatively speaking.

All of Sunny Mehta's comments are terrifically valid. As well there are survival bias issues even greater than in baseball. Plus the variability in track records and shots for each goalie leads to wide likelihood spreads for new goalies ... and there are a lot of GMs willing to gamble on a great goalie, it can be a job saver. So you get a left shoulder in talent distribution (i.e. there are too many bad goalies in the NHL, we're just not sure which ones they are).

And if that's not bad enough, the overall talent distribution still appears to be slightly right skewed with low kurtosis. So any help that simple Bayesian methods provide in terms of frame size ... it all gets given back because the Beta form will be heavily left skewed here.

And another fundamental problem is that the sample and frame sizes aren't anywhere near large enough to yield luck distributions that are truly binomial. So these covariance/variance ratio analyses are doubly nonsensical.

So if anyone is thinking that Berri and Schmidt are getting closer to solving the goaltender evaluation problem in the NHL ... it strikes me as equivalent to saying that monkeys are getting closer to travelling to the moon because they climbed a taller tree yesterday.
