## Wednesday, March 31, 2010

### Stumbling on Wins: Do coaches not understand how players age?

On page 118 of "Stumbling on Wins," authors David Berri and Martin Schmidt argue that NBA coaches don't understand how players age. That's because, according to Berri and Schmidt, coaches give players more and more minutes until age 28. But, they report, player productivity actually peaks at age 24. Therefore,

"... the allocation of minutes suggests the age profile in basketball is not well understood by NBA coaches."

Geez, that doesn't follow at all.

First, I don't understand how the authors figure that minutes played peak at 28. If you look at actual minutes played by age, the peak appears to be earlier. These are minutes by age for the current 2009-10 season, on the day I'm writing this:

19: 1512
20: 10932
21: 38198
22: 37283
23: 52626
24: 52653
25: 47297
26: 34481
27: 43339
28: 29843
29: 48955
30: 37756
31: 27852
32: 14336
33: 20376
34: 11677
35: 12976
36: 5333
37: 4516
38: 0
39: 122

The curve appears to reach its high point at 23 and 24, then diminishes irregularly down to age 39. There are a couple of blips, notably at 29, but you certainly wouldn't put the minutes peak at anything other than 23-24.

So why do the authors say 28 is the peak? I'm not sure. In a footnote, they say the details can be found on their website, but there's nothing posted yet for that chapter (seven).

I suspect the issue is selective sampling. If you look at only players who had long careers, you could very well come up with a peak of 28. As has been discussed repeatedly here and at Tango's site in the context of baseball aging, when you look only at players with long careers, you're sampling only those who aged more gracefully than others. And so your peak will be biased high.

Also, a player with a long career is probably a full-time player for most of it. Suppose someone comes up at 23 and plays until 33. His first couple of seasons and last couple of seasons, he might be a part-time player; the middle seasons, he's full-time, with only minor variations in minutes. So his minutes curve looks like: low horizontal line, high horizontal line, low horizontal line. If you try to draw a smooth curve to that, it'll peak right in the middle, which, for our example, is age 28.

The idea is: there's only so much playing time you can give to a good player. You might give him 40 minutes a game at age 28, when he's still very, very good ... but you can't give him 50 minutes a game when he's 24 and brilliant. So the curve is roughly flat in a good player's prime, and the off-years at the beginning and the end will artificially make it look like there's a peak in the middle.
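Here's a rough sketch of that "low line, high line, low line" argument. The season totals (1,500 and 3,000 minutes) are made up for illustration, and I'm assuming "smooth curve" means an ordinary least-squares quadratic, which may or may not be what Berri and Schmidt actually fit.

```python
# Hypothetical career: part-time at 23-24 and 32-33, full-time in between.
# The minute totals are invented purely for illustration.
ages = list(range(23, 34))
minutes = [1500, 1500] + [3000] * 7 + [1500, 1500]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c using the normal equations."""
    s0 = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, s0]]
    rhs = [t2, t1, t0]
    d = det3(matrix)
    coeffs = []
    for col in range(3):  # Cramer's rule: swap each column for the right-hand side
        swapped = [row[:] for row in matrix]
        for r in range(3):
            swapped[r][col] = rhs[r]
        coeffs.append(det3(swapped) / d)
    return tuple(coeffs)


# Center the ages on 28 to keep the arithmetic well-behaved.
center = 28
a, b, c = fit_quadratic([age - center for age in ages], minutes)
peak_age = center - b / (2 * a)  # vertex of the fitted parabola
print(f"fitted peak age: {peak_age:.1f}")
```

The player's actual minutes are dead flat from 25 to 31, but the fitted parabola still has to put its high point somewhere, and it lands in the middle of the career.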

Anyway, this is all speculation until Berri and Schmidt post the study.

The average minute in the above table occurs at age 26.6 -- below the 28 that Berri and Schmidt talk about, but above the 24 that they say it should be. It makes sense that it should be well above 24. A good player might still be in the league ten years after the peak, at age 34 -- but there's no way he'd be in the league ten years before the peak, at age 14. If a player can play when he's old, but not when he's young, that, obviously, will skew the mean above the peak of 23-24.
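For anyone who wants to check the arithmetic, here's a minimal sketch of the minutes-weighted average age, using the table above and treating each age as a whole number (that's my assumption; counting ages differently would shift the answer slightly).

```python
# Minutes played by age, 2009-10 season, copied from the table above.
minutes_by_age = {
    19: 1512, 20: 10932, 21: 38198, 22: 37283, 23: 52626, 24: 52653,
    25: 47297, 26: 34481, 27: 43339, 28: 29843, 29: 48955, 30: 37756,
    31: 27852, 32: 14336, 33: 20376, 34: 11677, 35: 12976, 36: 5333,
    37: 4516, 38: 0, 39: 122,
}

total_minutes = sum(minutes_by_age.values())

# Age of the "average minute": each age weighted by the minutes played at it.
mean_age = sum(age * m for age, m in minutes_by_age.items()) / total_minutes

# The single age with the most minutes.
busiest_age = max(minutes_by_age, key=minutes_by_age.get)

print(f"average minute occurs at age {mean_age:.2f}")
print(f"age with the most minutes: {busiest_age}")
```

The weighted mean lands in the mid-26s, well above the age with the most minutes, which is the skew described above.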

There are probably other reasons, too, but I think that's the main one.

Berri and Schmidt think that NBA minutes peak later than 24 because coaches don't understand how players age. It seems obvious that there's a more plausible explanation -- that it's because players like Shaquille O'Neal are able to play NBA basketball at age 37, but not at age 9.


## Sunday, March 28, 2010

### "Stumbling on Wins:" is there really little difference between goalies?

My copy of "Stumbling on Wins," the new book by David Berri and Martin Schmidt, arrived on Friday. It's a quicker read than their first book, "The Wages of Wins"; for one thing, it's shorter, at 140 pages (before appendices and endnotes). For another thing, the writing style is a bit breezier and less technical, more suited to the non-academic (but serious) sports fan.

The theme of this book is how decision-makers in sports make bad decisions because they don't know how to properly evaluate the information they have. Irrationality in decision making is a subject that's been popularized quite a bit lately. In the last few years, you've got "Predictably Irrational" by Dan Ariely, "Nudge" by Richard Thaler and Cass Sunstein, "Sway" by Ori and Rom Brafman, "Priceless" by William Poundstone, and others. The authors of this book acknowledge the trend, and that they chose their title in tribute to Daniel Gilbert's "Stumbling on Happiness."

I disagree with many (but not all) of the conclusions the authors reach ... it seems like, too often, the authors will do a quick study, look at the results superficially, jump to conclusions that I don't think are justified, and argue from those conclusions that decision-makers are doing it wrong.

For now, I'll just give you one example. In Chapter 3, they argue that NHL goalies are overpaid. Why? Because

"... there simply is little difference in the performance of most NHL goalies."

What evidence do they give for this?

First, they ran a correlation between a goalie's save percentage (SV%) in consecutive seasons. They got an r-squared of .06, or 6%. That's a small number. So goalies are inconsistent, and what is being observed is not really the goalie's talent.

That's not correct at all.

As I wrote before, and Tango has repeatedly said on his own blog, you can't just conclude that, because the r-squared is a small number, the relationship between two variables is weak. Indeed, the same relationship can give you very different r-squareds, depending on other factors in your data, the most obvious of which, here, is sample size.

The r-squared, by definition, is the variance of talent as a percentage of total variance. But the smaller your sample, the more total variance you have just because of luck. And so, the smaller the sample, the lower the r-squared, regardless of whether the talent is low or high.

A low r-squared might mean a small needle -- or it might mean a large haystack.

Unless you take a few seconds to figure out which it is, your r-squared doesn't tell you much of anything about the relationship between the two variables.

What *does* that .06 mean? Well, if the r-squared is .06, then the r is about .25. Roughly speaking, that means you can expect 25% of a goalie's difference from the mean to be repeated next year. Put another way, you have to regress the goalie 75% towards the mean.

Yes, that's not as much as you'd expect. By that calculation, if the average save percentage is .904, and goalie X comes in one season at .924, you'd expect next year he'd be at .909 -- one quarter of the distance between .904 and .924. That's still something: it's .005 above average, which is one goal every 200 shots, or about 10 goals a season.
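As a quick sketch of that calculation: the only number below that isn't stated above is the 2,000 shots per season, which is my assumption, backed out from "one goal every 200 shots, or about 10 goals a season."

```python
import math

r_squared = 0.06
r = math.sqrt(r_squared)          # about .245, rounded to .25 in the text

league_mean = 0.904
observed = 0.924

# Keep only the fraction r of the observed gap; regress the rest to the mean.
projected = league_mean + r * (observed - league_mean)

shots_per_season = 2000           # assumed, see the note above
goals_saved = (projected - league_mean) * shots_per_season

print(f"r = {r:.3f}")
print(f"projected SV% next year = {projected:.3f}")
print(f"goals saved vs. average = {goals_saved:.1f}")
```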

What do you think -- the idea that a .924 goalie is really .909, does that mean "there's little difference between goalies?" That's more a matter of opinion ... but at least now you have the numbers you need to get a grip on what's going on. The "r-squared equals only .06" doesn't really help you decide.

----

Anyway, that's one problem, that the .06 isn't as small as it looks. A bigger problem is that I don't think the .06 is accurate.

I repeated the same correlation for two sets of two consecutive years, 2005-06 to 2006-07, and 2007-08 to 2008-09. I looked at only the 20 goalies with the most minutes played. I got r-squareds of .30 and .25, respectively, both much higher than the authors' .06.

Why? I think it's because the authors included goalies with many fewer shots against. They don't say exactly what their criteria were, except that they "adjusted for time on the ice" (whatever that means: SV% doesn't depend on time played). In other studies in the same chapter, they used 1000 minutes as a criterion, so maybe that's what they did here.

Now, to simplify, suppose the variance of SV% consists of only talent and luck. A full-time goalie plays about 3,500 minutes. In my regression, it turns out that you get 1 part talent to three parts luck (that's where the .25 comes from: 25% of the total is talent). Now, suppose Berri and Schmidt's average goalie played only half that, or 1,750 minutes. Then the luck variance would be twice as high, and they'd get one part talent to *six* parts luck. That would drop the r-squared down from .25 to .14.
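Here's that talent-and-luck arithmetic as a small sketch. It assumes, as the paragraph does, that talent variance stays fixed while luck variance scales inversely with minutes played; the function name and parameters are just mine.

```python
def r_squared_at(minutes, full_minutes=3500, talent_parts=1.0, luck_parts_full=3.0):
    """Share of total variance that is talent, for a given playing-time level.

    Luck variance is assumed to scale inversely with minutes played,
    starting from 1 part talent to 3 parts luck at a full-time 3,500 minutes.
    """
    luck_parts = luck_parts_full * full_minutes / minutes
    return talent_parts / (talent_parts + luck_parts)


for minutes in (3500, 1750, 1000):
    print(f"{minutes} minutes: r-squared = {r_squared_at(minutes):.2f}")
```

Same goalies, same talent, and the r-squared keeps falling as the cutoff drops. That's the haystack getting bigger, not the needle getting smaller.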

I don't know how the authors got .06 when my analysis shows .14 ... maybe their cutoff was lower than 1,000 minutes. Maybe there's some selection bias in my sample of top goalies only. Maybe my four seasons just happened to be not quite representative. Regardless, the fact that the r-squared varies so much with your selection criterion shows that you can't take it at face value without doing a bit of work to interpret it.

In any case, going back to my r-squared of .25 ... the square root of .25 is .50. That means that exactly half a full-time goalie's observed difference from the mean is real, and will be repeated next season; if a goalie is .020 better than average this year, expect him to be .010 better than average next year. That's pretty reasonable. In that light, I don't think you can say "there's little difference between goalies" at all.

----

And, in fact, we should be able to figure out the spread in goalie talent directly, by a method I learned from Tango a few years ago.

Suppose a goalie faces 1,700 shots, and is expected to save 90% of them. By random chance, he'll sometimes save more than 90%, and sometimes less. By the normal approximation to the binomial distribution, the standard deviation of his save percentage due to luck will be .0073.

Now, for the five seasons I checked, the top 20 goalies each year had an actual SD of save percentage between .007 and .013 ... let's call it about .011.

That's higher than .0073, as you'd expect. The .0073 is what you'd get if all goalies were identical. But there's also extra variance from the fact that some goalies are better than others. Since

(Observed SD)^2 = (Non-luck SD)^2 + (Luck SD)^2

we can say

.011 ^2 = (Non-luck SD) ^2 + .0073 ^2

So the non-luck SD should be about .0082. If we consider everything that's not binomial luck to be talent, then we can say that the SD of top-20 goalie talent is .008. (I dropped the last decimal because our numbers are very rough here.)
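The two steps above, the binomial luck SD and the subtraction of variances, can be sketched like this. All the inputs come from the text; the only assumption is that luck and non-luck are independent, so their variances add.

```python
import math

shots = 1700
save_rate = 0.90

# SD of save percentage from binomial luck alone: sqrt(p * (1 - p) / n).
luck_sd = math.sqrt(save_rate * (1 - save_rate) / shots)

observed_sd = 0.011   # rough observed spread among the top 20 goalies

# Observed variance = non-luck variance + luck variance, so subtract and take the root.
non_luck_sd = math.sqrt(observed_sd ** 2 - luck_sd ** 2)

print(f"luck SD     = {luck_sd:.4f}")
print(f"non-luck SD = {non_luck_sd:.4f}")
```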

If everything that's "non-luck" should repeat next year, we should get an r-squared of about (.008/.011)^2, which is .53. I only got .25 or .30. Why? Well, there could be more luck involved than just binomial. Not all shots are created equal; maybe some goalies got easier shots, and some harder (search for "Shot Quality" here). Maybe there's some variation in talent because of injury or age. There's definitely the quality of the goalie's defense, and that varies a bit from year to year.

Still, there's quite a bit of evidence of talent here. The theoretical value for r-squared was .53, which means the theoretical value for r is .73. That means that if a goaltender was absolutely perfectly consistent, and every shot gave him the same chance of stopping it, each and every year ... then, 73% of his observed talent would be real. That's what it means to be absolutely consistent.

I didn't find .73, but I found about .50. That's a pretty good proportion of the theoretical maximum. I think we can say that a good part of what we see of a goalie's performance is real.

But, does all this mean that "there's little difference between goalies?" Well, let's check. We got an r-squared of .25, which means that 25% of the variance is talent. The variance observed is .011^2, so the variance due to talent is a quarter of that, which is .0055^2.

That means that a goalie who's one SD above average will have a save percentage .0055 better than average. A goalie who's two SDs above average will be .011 better than the mean.

In the context of 1700 shots, one SD is about 9 goals. Two SDs is about 18 goals. And that's from only the 20 goalies with the most playing time. You'd imagine that if you included backup goalies, the variance would be larger. But, to be conservative, I'll leave the SD at 9 goals for now.
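A short sketch of how the r-squared turns into goals, using only the numbers already given (.25, .011, and 1,700 shots):

```python
import math

observed_sd = 0.011   # observed SD of save percentage
r_squared = 0.25      # share of observed variance attributed to talent
shots = 1700

# Talent variance is r_squared * observed variance, so talent SD is sqrt(r_squared) * observed SD.
talent_sd = math.sqrt(r_squared) * observed_sd

# Convert a save-percentage edge into goals over a season's worth of shots.
goals_per_sd = talent_sd * shots

print(f"talent SD = {talent_sd:.4f}")
print(f"one SD    = {goals_per_sd:.1f} goals")
print(f"two SDs   = {2 * goals_per_sd:.1f} goals")
```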

Berri and Schmidt looked at Martin Brodeur's career and found he saved an average of 13.6 goals per year, compared to an average goalie. That's consistent with a 9 goal SD; it implies that Brodeur is about one and a half SDs above average, which seems very reasonable. The authors also point out that, in terms of wins, an advantage of 13.6 goals a year is very small compared to what an NBA superstar can provide. That's true, but it doesn't mean that goalies don't matter in the context of hockey. To address that point, you need to look at the 9 goal SD. Is that a lot?

Well ... I'm not sure. I think it's more than it looks. Let's compare goalies to skaters.

Looking at the plus-minus statistics from 2008-09, a bunch of Bruins come up near the top, with numbers scattered around +30. That means that, when those players were on the ice in non-power-play situations, the Bruins scored 30 more goals than they gave up. Along with Detroit, that seems to be the highest bunch in the league.

Since five players are on the ice, you could give each of them credit for 6 extra goals. But they're not all equal -- some are better than others. Let's say that instead of 6/6/6/6/6, they might be 10/8/6/4/2.

That means that the best player on the Bruins might be worth 10 goals. Regressing that to the mean, let's call it 8 goals. Adding power plays, which weren't included in plus/minus, let's move it back to 10 goals.

That's the best player on the best team. But maybe the best player in the league wasn't on the Bruins -- he might have been on a mediocre team, and his teammates caused his plus/minus to drop. How do we adjust for that? I don't know, but let's bump it up 4 goals, and estimate that the best player in the NHL was worth 14 goals last year.

Now, figure the best goalie is about 2 SD above average, for 18 goals. So, the best goalie in the league is better than the best skater! That doesn't suggest, at all, that there's little difference between goalies.

Except ... last year's top plus/minus figure of +37 (David Krejci) is low by historical standards. In 1981-82, the top five players had plus-minuses above 66, almost twice what the Bruins had last year (although in a higher-scoring offensive environment). And, in 1970-71, Bobby Orr had a plus-minus of +124. Back then, you could certainly argue that goalies were more homogeneous than skaters, and the best skater (Gretzky, Orr, or Lemieux) was easily better than the best goalie. And I think that coincides with the intuition that people had back then, that a good goalie could help, but would never be a factor like a Gretzky would.

Still, maybe we should bump the 14 goal estimate for the best skater up a little bit, closer to the 18 goals we found for the best goalie.

I may be wrong in my logic somewhere, but, if I've done everything right, it seems that top goalies in this era are very similar in importance to top skaters. So when Berri and Schmidt accuse GMs of signing goalies to big contracts because "the people that write the checks" don't "understand [the] story" that goalies don't matter much ... well, I think they underestimate the capabilities of those hockey executives. Their judgment might not be perfect, but I think they understand the variation of talent at least as well as Berri and Schmidt seem to.

-----

So I think Berri and Schmidt got into trouble by just looking at the number .06 without thinking about what it meant. They do this again, a bit later, when they run a correlation between SV% in the regular season, and SV% in the playoffs. That's just doomed to fail, because the playoff sample is so small. That makes the variance due to luck very large, which, in turn, brings the r-squared very close to zero.

Actually, they find an r-squared of .07, which is larger than the .06 they found over two consecutive regular seasons. You'd think it would be smaller, since playoff samples are so much smaller. I wonder if the .06 came about because they used very small samples over the regular season, including goalies with only a couple of games played?

Anyway, after that, they try the correlation between two consecutive playoff appearances. They found "none" of the performance was predictable, which suggests an r-squared of .00 (or maybe they assume it's .00 because it wasn't statistically significant). But that's probably just a sample size issue. If their intention was to show that playoff performance by goalies has a lot of random luck in it, well, yes, of course it does. But if their intent is to conclude that goalie performance is completely unpredictable, that one r-squared isn't enough evidence of that. And I'd bet that if they looked a little closer, they'd find that goalies perform in the playoffs exactly as you'd expect them to, subject to a substantial amount of binomial random luck. Or maybe not -- maybe playoff hockey is so different from regular season that different goalies excel at it. But if you want to check that, you have to do more than just run a single regression and look at a single r-squared.

----

Finally, another non sequitur arises where they write,

"Looking at ... goalies ... one sees an average save percentage of [.895]. The standard deviation of that percentage, though, is only .018. Hence the coefficient of variation of save percentage [the SD divided by the mean] is only 0.02. Hence, there simply is very little difference in the performance of most NHL goalies."

Now, I don't get this at all. How does the coefficient of variation tell you whether or not there's a qualitative difference in performance? It just doesn't. The fact that the SD is a small fraction of the mean doesn't have anything to do with how important the statistic is.

Intuitively, I can see how you might jump to that conclusion, if you don't think about it much. But if you do, it makes no sense. The proportion doesn't matter. When it comes to goals, it's the absolute number that matters. If you let in 10 more goals than average over a season, you cost your team 10 goals. It doesn't matter if you and the other goalies get 100 shots, 1000 shots, 10,000 shots, or 100,000 shots -- ten goals in a season is ten goals in a season.

Another way to look at it is that the SV% statistic is arbitrary, which means the coefficient of variation is arbitrary. Suppose the NHL had decided to use "goal percentage" instead of "save percentage", counting up the percentage of shots that went in, instead of the percentage that did not. In that case, the SD would be exactly the same, .018. But the average is now the opposite of what it was -- if 89.5% of shots are stopped, then 10.5% of shots are NOT stopped. And so now your coefficient of variation is .17.

One way, you get .02. Another way, you get .17. So how can the size of the arbitrary coefficient of variation possibly have anything to do with how important goaltending is?
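The two coefficients of variation are easy to reproduce from the figures in the quote:

```python
sd = 0.018                # SD quoted by the authors; identical under either framing
save_pct = 0.895          # average share of shots stopped
goal_pct = 1 - save_pct   # average share of shots that go in

cv_save = sd / save_pct   # coefficient of variation on the "save" scale
cv_goal = sd / goal_pct   # same data, same SD, on the "goal" scale

print(f"CV of save percentage: {cv_save:.2f}")
print(f"CV of goal percentage: {cv_goal:.2f}")
```

Nothing about the goalies changed between those two lines. Only the choice of which side of the ledger to count did.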

I'm sure the coefficient of variation has its uses, but this isn't one of them.

-----

In summary: as I read it, Berri and Schmidt's argument goes something like this:

-- The r-squared of SV% in consecutive seasons is .06.
-- The r-squared of SV% between a season and the playoffs is .07.
-- The r-squared of SV% between two consecutive playoffs is .00.
-- The coefficient of variation for SV% is .02.

--> These are all small numbers. Therefore, goalies' performances aren't consistent. That means there's not much difference between them, and GMs don't seem to realize this.

As I wrote, I don't think that logic makes sense. I think the evidence shows that, in the current era, good goalies are about as valuable as good skaters. I haven't looked, but I bet that salary data would show that to be roughly consistent with what GMs think.

## Thursday, March 25, 2010

### Regression to the likely

In the previous post, I gave an example of a statistical test on clutch hitting. It went like this:

"Joe Blow hits .300 in the clutch, and .230 non-clutch. Someone does a standard statistical test on the difference, and finds a significance level of .05. That means that, if the player were actually exactly the same in the clutch as the non-clutch, there would have been only a 5% chance of him having a 70-point-or-more difference just by luck."

Typically, when you get a statistically significant result like this, you use the observed effect as your estimate of the real effect -- in this case, 70 points. Previously, Tango had argued that you shouldn't do that. All you've shown by "statistical significance" is that the result is significantly different from zero. It could be 40 points, it could be 20 points, it could be anything non-zero. You shouldn't just assume it's 70.

I agreed. The point is that you have to take this result, and combine it with everything else you know about clutch hitting, before making a "baseball" estimate of what that observed 70 point difference really tells you.

To clarify what I meant in the previous post, let me give you an example that makes it more obvious. Suppose that you decide to study how good a hitter Albert Pujols is. You randomly pick 8 of his at-bats, and it turns out that in those AB, he went 7-for-8. And suppose your null hypothesis is that Pujols is average, just a .270 hitter.

If you were to do a traditional binomial test, you would find that Pujols' observed .875 batting average is high enough that you would easily reject the null hypothesis that he's .270.
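For the curious, here's a sketch of that binomial test. I'm assuming a one-sided test (the chance of 7 or more hits in 8 at-bats if Pujols really were a .270 hitter); a two-sided version would give a slightly different number but the same verdict.

```python
from math import comb


def binomial_upper_tail(hits, at_bats, p):
    """Probability of getting at least `hits` successes in `at_bats` trials."""
    return sum(
        comb(at_bats, k) * p ** k * (1 - p) ** (at_bats - k)
        for k in range(hits, at_bats + 1)
    )


p_value = binomial_upper_tail(7, 8, 0.270)
print(f"P(7 or more hits in 8 AB | .270 hitter) = {p_value:.5f}")
```

That's far below .05, so the null hypothesis gets rejected easily.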

But even though the sample showed .875, would anyone seriously argue that the evidence shows that Albert Pujols is an .875 hitter? That only makes sense if you're willing to ignore everything that you know about baseball, and if you're also willing to ignore everything that *everybody else* knows about baseball -- that there's no such thing as an .875 hitter.

There is nothing wrong with the statistical calculations and statistical test that came up with the .875 estimate. It's just that a naive statistical test doesn't know anything about baseball. And if you want to make an argument about baseball, you have to use baseball knowledge. The fact that you did a statistical test, and it came up significantly different from conventional wisdom, does NOT mean that conventional wisdom is wrong. When you have piles and piles of evidence that says that Pujols is not an .875 hitter, and one statistical test that estimates he is, based only on 8 AB, then if you consider only the 8 AB and ignore everything else, you're not being a sabermetrician -- you're doing a first-year STAT 101 class assignment.

Anyway, I think the ".875 Pujols" example makes the point clearer than the ".070 clutch" example, because it's more obvious that a .875 hitter is absurd than that a .070 clutch talent is absurd. Less so to Tango, of course, who has studied the clutch issue, and probably reacted the same way to ".070 clutch" as I did to ".875 batting average."

I think the example also makes the arguments of some of the other post's commenters more understandable. A couple of them were arguing that, OK, if you know something about baseball, you might discount the .070. But *in the absence of any other information*, you can take the .070 as the best estimate. To which I say, yes, that's true (subject to serious caveats I'll get to in a minute), but not all that relevant. Because, if you rephrase it as, "in the absence of any other information about Albert Pujols, you can take .875 as the best estimate of his batting average talent" ... well, that's still true, but completely irrelevant to any study that's trying to learn something about baseball.

------

So, we ran the Joe Blow statistical test, and we found that the .070 was statistically significant at .05. And we're also willing to say, "in the absence of any other information, we can take the .070 as the best unbiased estimate of Joe Blow's clutch ability."

Now, I'm going to create a simulated teammate for Joe -- call him David Doe. I'm going to play God, and I'm going to create David Doe to be a clutch hitter. I'm going to go to my computer, and ask for a random number between 0 and 1, and divide it by 10 to get David's random clutch talent. I'm not going to tell you that true random clutch talent, because I'm God and you're the sabermetrician, so it's up to you to figure that out.

However, I'm going to simulate a bunch of non-clutch AB for David, and a bunch of clutch AB. I'll tell you the results and let you do the statistical test. Actually, I'll even do the test for you and tell the result and the significance level.

OK, let me run my randomizer ... done. And, hey, it turns out, coincidentally, the random numbers came up exactly the same for David as for Joe -- he hit exactly 70 points better in the clutch. And, coincidentally, the significance level is the same, .05.

As we said back when we were talking about Joe, in the absence of any other information than that given, we can say the best estimate for Joe's clutch ability is .070. Similarly, in the absence of any other information than that given, what's the best estimate for David's clutch ability?

For David, the estimate of .070 is biased too high. Why? Because we know something in the David case that we didn't know in the Joe case: that because of the way I randomly chose David's clutch ability, it can never be more than .100.

David's true clutch ability might be between .040 and .070, and he just got lucky. It might be between .070 and .100, and he just got unlucky. Those two possibilities are symmetrical, so if we consider only those possibilities, .070 *is* our best estimate.

But there is another possibility: that David's true clutch ability is between .000 and .040, and he got even more lucky. That's not balanced by anything on the "unlucky" side, because there is no way David was actually a .100 to .140 hitter who got unlucky -- that case is impossible.

So David is more likely to have been lucky than unlucky, and the best estimate of his clutch ability is less than .070.

A way to make this more obvious is to give the standard confidence interval around .070. For both Joe and David, it might be the interval (.005, .135). For Joe, that makes sense. For David, it becomes clear that it doesn't make sense: everything above .100 is impossible, and so you know David's confidence interval is wrong.

This looks like a trick, but it's not, not really. It's just a special case of the principle that the estimate and confidence interval are not correct if some possible values of the parameter were less likely to be true than others. "Impossible" is just a special case of "less likely".
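To put a rough number on David, here's a sketch that weights every possible talent value by how well it explains the observed .070. It rests on assumptions of mine: the sampling error is treated as normal, and its standard error is backed out of the hypothetical (.005, .135) interval above. For Joe, "no other information" is modeled as a very wide flat range.

```python
import math

observed = 0.070
ci_low, ci_high = 0.005, 0.135
se = (ci_high - ci_low) / (2 * 1.96)   # standard error implied by a 95% interval


def likelihood(talent):
    """Relative chance of observing .070 if true clutch talent is `talent` (normal error assumed)."""
    return math.exp(-0.5 * ((observed - talent) / se) ** 2)


def best_estimate(lowest, highest, steps=10000):
    """Average talent over a flat prior on [lowest, highest], weighted by the likelihood."""
    width = (highest - lowest) / steps
    grid = [lowest + (i + 0.5) * width for i in range(steps)]
    weights = [likelihood(t) for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


joe_estimate = best_estimate(-1.0, 1.0)      # effectively "anything is possible"
david_estimate = best_estimate(0.0, 0.100)   # God told us talent is between 0 and .100

print(f"Joe:   {joe_estimate:.3f}")
print(f"David: {david_estimate:.3f}")
```

Joe's estimate stays at the observed figure, while David's comes out lower, because the "unlucky" explanations above .100 have been ruled out and nothing balances the "lucky" ones.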

The real God does these things too. He's made it nearly impossible for anyone to be an .875 hitter. And, evidence shows, he's made it very unlikely for anyone to be a .070 clutch hitter. If you ignore those facts, you'll come up with implausible predictions. If you really believe that a 95% confidence interval around Joe Blow is centered at .070 clutch, I'll be happy to bet you on Joe's performance next year. You should be willing to give me 19:1 odds that Joe will hit within his confidence interval. I'll be very happy to take 10:1. If you're not willing to take my bet, then you don't really believe in your results, do you?

-----

When we said that you can trust the .070 estimate "in the absence of any other information," that phrase, "in the absence of any other information," is a bit of a fuzzy way of saying what the true condition is. There's a mathematical Bayesian way of phrasing the condition, but I'll just use a rough approximation:

-- You can accept the point estimate and the confidence interval if, before the fact, you could say that every value was equally likely to be true.

That's not always the case. In my simulation, it was explicitly not the case, since I told you in advance that clutch couldn't be less than .000 or higher than .100.

It's not the case for clutch, either. Even if you didn't know anything about clutch hitting, it would be obvious, wouldn't it, that a clutch talent of .900 was impossible? And so, technically, not every value is equally likely to be true -- .070 is plausible, .900 is not. So, technically speaking, citing the .070 is invalid. It is NOT accurate to give a confidence interval for your parameter unless you are willing to assume everything is equally likely, from minus infinity to plus infinity.

That's a technicality, of course ... .900 is so far away from the .070 that the study showed, that you don't lose much accuracy from the fact that it's impossible. We don't actually have to go to infinity -- we can actually just say (and, again, this is a paraphrase of the math),

-- You can accept the point estimate and the confidence interval for all practical purposes if, before the fact, you could say that every value within a reasonable distance of the estimate was roughly equally likely to be true.

In my opinion, this is NOT strictly the same as saying "in the absence of any other information." It's an explicit assumption that has to be made, one that just happens to be reasonable in the absence of other information. Normally, it's just ignored or taken for granted. But that doesn't mean it's not really lurking there.

So, in the case of Joe Blow, is every value within a reasonable distance of the confidence interval equally likely to be true? I don't think so. Our hypothetical confidence interval is (.005, .135). That is within a reasonable distance of .000, and, you'd think, values around .000 are more likely to be true than any other value.

Why are values around .000 more likely to be true than other values? You could argue that from the evidence of previous studies. But you don't need to. You can argue it on first principles.

Most human attributes are normally distributed, or, at least, distributed in a "normal-like" way where there are more people near average than far from average. Assuming that we have already normalized Joe's clutch stats to the league average clutch stats (as most studies do), the league average is .000, and so we should have expected that Joe's clutch talent would be more likely found around .000 than .070. Therefore, the condition "every value within a reasonable distance is roughly equally likely to be true" does not hold.

(Notice that you don't need to know whether clutch hitting exists or how big it is -- or even know anything about baseball -- to know that you can't take the point estimate for it at face value! All you need to know is that the distribution is higher in the middle. I think that's kind of cool, even though it's really just the same argument as regression to the mean.)

And so, you can't just take the .070 as your estimate without further argument.
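Here's a sketch of what happens once you admit that talent clusters around .000. The size of that cluster is NOT something I've estimated: the .010 spread of true clutch talent below is a purely hypothetical number, picked only to show the mechanics, and the standard error is again backed out of the hypothetical (.005, .135) interval. Both distributions are assumed normal.

```python
def regress_to_likely(observed, se, prior_mean, prior_sd):
    """Best estimate when true talent is normal around prior_mean and the
    observation has normal error with standard deviation se."""
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)   # share of the observation to keep
    return prior_mean + weight * (observed - prior_mean)


observed = 0.070
se = (0.135 - 0.005) / (2 * 1.96)   # implied by the hypothetical confidence interval

# HYPOTHETICAL: suppose true clutch talent has an SD of .010 around .000.
clustered = regress_to_likely(observed, se, prior_mean=0.0, prior_sd=0.010)

# An enormous prior spread mimics "every value equally likely."
flat = regress_to_likely(observed, se, prior_mean=0.0, prior_sd=1e6)

print(f"talent clustered near .000: {clustered:.3f}")
print(f"everything equally likely:  {flat:.3f}")
```

Only the "everything equally likely" version hands you back the .070 untouched. Any clustering at all pulls the estimate toward .000, and how far depends on a spread you'd have to argue for.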

If you disagree with me, you can still try the "without further information" argument. You might reply, "well, Phil, I agree with you, but you have to admit that *in the absence of any other information*, the .070 is a good unbiased estimate."

To which my first reaction is,

"if you're not willing to even consider that human tendencies are clustered near average, if you're willing to go that far to ignore that other information, then you're not studying baseball, you're just doing mathematical exercises."

And then I say,

"It's not even technically true that the .070 is correct 'in the absence of additional information.' That's just a fuzzy way of phrasing it. What is *really* true is that the .070 is correct only if you are willing to assume that all values were, a priori, equally likely. That's an assumption that you have to make, even though you avoid making it, and even though you may not even realize you're making it. And your assumption is false. It's not just a case of ignoring information -- it's a case of ignoring information *and then assuming the opposite*."

-----

I mentioned "regression to the mean," which, in sabermetrics, is the idea that when you try to estimate a player's or team's talent, you have to take the performance and "regress it" (move it closer) to the mean. If you find ten players who hit .350, you'll find that next season they may only hit .310 or so. And if you find a bunch of teams who go 30-51 in the first half of the season, you'll find they do better in the second half.

This happens because there are more players with talent below .350 than above, so that when a player does hit .350, he's more likely to have been a lower-than-.350 player who got lucky than a higher-than-.350 player who got unlucky.
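A quick simulation can illustrate that selection effect. This is only a sketch: the talent distribution (normal, mean .270, SD .025), the 500 at-bats, and the normal approximation to luck are all assumptions made for the example, not figures from any study.

```python
import random
import statistics


def simulate_hot_hitters(n_players=50_000, at_bats=500, cutoff=0.350, seed=1):
    """Return (mean observed average, mean true talent) for hitters at or above the cutoff.

    Talent is drawn from an assumed normal distribution (mean .270, SD .025), and
    luck is approximated as normal noise with the binomial standard deviation.
    """
    rng = random.Random(seed)
    observed_hot, talent_hot = [], []
    for _ in range(n_players):
        talent = rng.gauss(0.270, 0.025)
        luck_sd = (talent * (1 - talent) / at_bats) ** 0.5
        observed = talent + rng.gauss(0.0, luck_sd)
        if observed >= cutoff:
            observed_hot.append(observed)
            talent_hot.append(talent)
    return statistics.mean(observed_hot), statistics.mean(talent_hot)


observed_avg, talent_avg = simulate_hot_hitters()
print(f"observed {observed_avg:.3f}, true talent {talent_avg:.3f}")
```

Under these assumptions, the hitters who showed .350 or better have an average true talent well below what they showed, which is the gap you regress away.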

Regression to the mean is actually a special case of what I'm describing here. You might call this Bayesian stuff "regression to the likely." It's likely the .350 hitter is actually a lower-than-.350 player, so that's the way you regress.

"Regression to the likely" is just another way of saying, "take all the other evidence and arguments into account," because it's all those other things that made you come up with what you thought was likely in the first place.

If you accept a .070 clutch number at face value, you are implicitly saying "when I regressed to the likely, I made no change to my belief in the .070, because there was nothing more likely to regress to." If that's what you think, you have to argue it. You can't just ignore or deny that you're implying it.

-----

It may sound like I'm arguing that everyone who has ever created a confidence interval in an academic study is wrong. But, not really -- much of the time, in academic studies, the hidden assumption is true.

Suppose you do a study that tries to figure out how much it costs in free agent salary to get an extra win. And you wind up with an estimate of \$5,500,000, with a standard deviation of \$240,000. Is there any reason to believe that any number within a reasonable distance of \$5.5MM is likelier than any other? I can't think of one. I mean, it was easy to think of a reason why more players should be average clutch than extreme clutch, but, before doing this study, was there any reason to think that wins should be more likely to cost \$4,851,334 than, say, \$6,033,762? I can't think of any.

Much of the time things are like this: a study will try to estimate some parameter for which there was no strong logical reason to expect a certain result over any other result. And, in those cases, the confidence interval is just fine.

But other times it's not. Clutch is one of them.

If you take the study's observed point estimate at face value -- and, as Tango observes, most studies do -- you're making that hidden assumption that (in the case of clutch) .070 is just as likely as .000. You're making that assumption whether you realize it or not; and the fact that the assumption is hidden does not mean you're entitled to assume it's true. In the clutch case, it seems obvious that it's not. And so, the .070 estimate, and accompanying confidence interval, are simply not correct in any predictive baseball sense.


## Friday, March 19, 2010

### Statistical significance is only one piece of evidence

Joe Blow hits .300 in the clutch, and .230 non-clutch. Someone does a standard statistical test on the difference, and finds a significance level of .05. That means that, if the player were actually exactly the same in the clutch as the non-clutch, there would have been only a 5% chance of him having a 70-point-or-more difference just by luck.
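For readers who want to see what such a test looks like, here is a sketch using a pooled two-proportion z-test, which is one common choice and not necessarily the test anyone actually ran. The at-bat counts are hypothetical, picked so that the averages match Joe's and the result lands near .05; the two-sided p-value is also an assumption.

```python
import math


def two_proportion_p_value(hits_a, ab_a, hits_b, ab_b):
    """Two-sided p-value for a difference in batting averages (pooled z-test)."""
    avg_a, avg_b = hits_a / ab_a, hits_b / ab_b
    pooled = (hits_a + hits_b) / (ab_a + ab_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / ab_a + 1 / ab_b))
    z = (avg_a - avg_b) / std_err
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 45-for-150 in the clutch (.300), 460-for-2000 otherwise (.230).
p_value = two_proportion_p_value(45, 150, 460, 2000)
print(f"p = {p_value:.3f}")
```

Note that the function returns only a p-value under the "no difference" assumption; nothing in it produces an estimate of the true difference.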

That 5% is something of a standard in research: at .05 or below, you conclude that the observation is unlikely enough -- "statistically significant" enough -- to justify a conclusion that you're seeing something real. So, you "reject" the idea that the differences between his clutch and non-clutch performances were caused by luck. You conclude that there is something going on that leads to Joe doing better in the clutch.

The above is my paraphrase of something Tango posted yesterday. He cites the above, and then makes an important caveat: the fact that you reject the null hypothesis -- that you're rejecting the idea that there's no difference between clutch and non-clutch -- does NOT mean that you should conclude that the actual difference is 70 points. All you can conclude, Tango argues, is that the difference isn't zero. For all you know, it might be 30 points, or 10 points, or even 1 point. You are NOT entitled to assume that it's 70 points, just because that's what the actual sample showed.

He's absolutely right.

Then, later in the comments, he says (and I'm paraphrasing again) that the actual difference should be taken to be greater than zero, but less than 70. Well, I agree with him on this baseball point, that greater than 0 and less than 70 is correct. But the point I want to make here is that you *can't* conclude that *from just the statistical test*. The test itself doesn't let you say that. You have to use common sense and logic, and make an argument.

First, and as commenter Sunny Mehta points out, the standard "frequentist" statistical method described here does NOT allow you to say anything at all about what the actual difference might be. It says that if you make the assumption that the parameter is 0 (no difference between clutch and non-clutch), the observation will only happen 5% of the time. If you choose to cite the rarity of the 5%, and reject the hypothesis, the statistical method doesn't say anything about how to adjust your guess as to what the real parameter is. All you can say is "non-zero."

If you want to do better than that, and argue about what the difference actually is, you have to use "Bayesian" methods. You can do that formally, or you can do that informally. Informally is easier: it basically means, "now that you've done the test and got a confidence interval, make a common sense argument about what's really going on, based on what you know about the real world."

That's what Tango is doing when he rejects the 70 points. At the risk of trying to read his mind, what he's saying is: "first, the test only lets you reject the zero. It doesn't tell you the answer is 70 points, or, for that matter, anything else in particular. And second, from what I know about baseball, 70 points is ridiculous."

The "from what I know about baseball" part is critical. Because, let's suppose instead of clutch and non-clutch, you compared Mark Belanger to Tony Gwynn. And let's suppose you got the same 70 point difference, and the same significance level. I'm guessing that you'd no longer argue that the "real" difference is between 0 and 70 points. You'd argue that it's probably MORE than 70 points. Why? Because your baseball knowledge tells you that Mark Belanger normally hits .220, and Tony Gwynn normally hits .330, and if the difference you observed was only 70 points, that's probably a bit too low. There's still some randomness going on -- it could be that Belanger learned something over the off-season, and Gwynn is injured -- but it's more likely that the talent difference is higher, and it was just luck that made them hit only .070 apart.

For clutch, 70 points is ridiculously high. For Belanger/Gwynn, 70 points is somewhat low. But the statistical test might wind up exactly the same. It's only because of your pre-existing baseball knowledge -- your "prior," as it's called in Bayesian statistics -- that you argue the two cases differently.

When would your best guess be that the difference is *exactly* 70 points? When you know nothing about the subject, and 70 points seems as good a guess as anything else. Suppose there are two tribes with two systems of measurement. One measures in "blorps," and one measures in "glimps". You ask the tribes, independently, to estimate certain identical distances. You do the calculations, and it turns out that a "blorp" seems to be about .230 miles, and a "glimp" seems to be about .300 miles. The difference is significant at .05, so you conclude that a blorp is not equal to a glimp. What's your best estimate of the difference between them? .070. You have no "prior" reason to think it should be more, or less.

And there are even times where, despite the significance level, you can use common sense to call BS, and believe that the difference is zero DESPITE the statistical significance. Suppose Player X hits .230 when the first decimal place of the Dow Jones average is even, and he hits .300 when it's odd. Even though the difference is statistically significant at p=.05, you're not going to actually believe there's a connection, are you? It would take a lot more than that ... a significance level of 5% is only one time in 20. It's a lot less likely than 1 in 20 that there's actually some connection between the digit and the player's hitting, isn't there? I mean, if you get enough researchers to come up with some pseudo-random split, one of them is likely going to come up with something significant. You have to use common sense to accept the "nothing going on" hypothesis in practical terms, even though formally you "reject" the hypothesis in formal frequentist statistical terms.
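Both halves of that argument can be put into rough numbers. The sketch below assumes the splits are independent, and the prior (1 in 10,000 that the Dow digit matters) and the 50% power are invented for illustration only.

```python
def chance_of_false_alarm(n_splits, alpha=0.05):
    """Probability that at least one of n independent null splits reaches significance."""
    return 1 - (1 - alpha) ** n_splits


def prob_effect_is_real(prior, power, alpha=0.05):
    """Bayes' rule: probability the effect is real, given a significant result."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)


# Twenty researchers each trying one meaningless split.
print(f"{chance_of_false_alarm(20):.2f}")
# Assumed inputs: a 1-in-10,000 prior that the Dow digit matters, and 50% power.
print(f"{prob_effect_is_real(prior=1e-4, power=0.5):.4f}")
```

With twenty meaningless splits, a "significant" finding somewhere is more likely than not, and with a prior that small, a single p=.05 result still leaves the connection very unlikely.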

So four different cases, four different conclusions, even when the identical statistical test shows a 70 point difference at p=.05:

-- Clutch hitter: real-life difference likely much less than 70 points, but greater than 0
-- Gwynn/Belanger: real-life difference likely more than 70 points
-- Glimps/Blorps: real-life difference likely around 70 points
-- Odd/Even digits: real life difference likely 0 points

A good way to think of it: the result of the statistical test adds to the pile of evidence on whatever issue it's testing. If you have no evidence, take the 70 points (or whatever) at face value. If you DO have evidence, use the 70 point difference to add to the pile, and it will move your conclusion one way or the other.

As Tango says, you are not entitled to assume that the difference is really 70 points just because the 70 is statistically significant. You have to make an argument. And, sure, your argument can be, "I don't know anything about baseball, so 70 points is the most likely difference." But it's perfectly legitimate for Tango to turn around and say, "I DO know something about baseball, and 70 points makes no sense."

Unfortunately, a lot of researchers don't understand that. Or, they believe that the fact that they're not using explicit formal Bayesian methods means that they're allowed to assume the 70 points without further comment. Tango writes, "[that] is pretty much how I see conclusions being made."

And I agree with him that that's just not right.

Your one statistical test is just one piece of evidence. It does not entitle you to ignore any other evidence, and it does not allow you to make conclusions about baseball without making a logical argument about what your finding means, in the light of other evidence that might be available.


## Thursday, March 11, 2010

### Doesn't it make financial sense for teams to hire more sabermetricians?

Suppose you have two hitters, each with the same home run rate. The first guy hit mostly long home runs, but the second guy hit quite a few that just barely cleared the fence. For next year, you’d expect that the first guy would hit more HR than the second guy, right?

Right. Greg Rybarczyk did the work, and found that the more “just enough” (JE) home runs in 2007, the bigger the HR drop in 2008.

-- If the player had fewer than 25% JE home runs in 2007, his 2008 HR dropped by 13%.

-- If he had between 26 and 39% JE home runs, he dropped by 12%.

-- But if he had 40% or more JE home runs, his 2008 total dropped by 22%.

In table form:

00-25% JE ---> drop of 13%
26-39% JE ---> drop of 12%
40-99% JE ---> drop of 22%

The study is ongoing following commenters’ suggestions … I’m suggesting a regression to smooth out the categories, so that we can have a single result to predict the amount by which a player will drop.

----

Anyway, Greg's study got me thinking about another of Tango's posts today, one where he links to Sal Baxamusa's summary of a panel at this year's MIT Sloan Sports Analytics Conference. One of the panelists was Kevin Kelley, the high-school football coach who became a celebrity when, after studying the issue, he decided never to have his team punt on fourth down. Baxamusa writes,

"Kelley, barely managing to get a word in edgewise, said, "It's not just the method in which it's said, it's who says it."

"Well, fellow statheads, that's just it, isn't it? We can bang the WAR drum all day. We can refine our PITCHf/x studies until we find the one pitch that even Pujols can't hit. We can play all the fancy analytical tricks we want. But when it comes to using these analytics, teams have to do more than just hire a few quants to sit under the stadium and code all day. ...

"John Dewan, of Baseball Info Solutions and pioneer of fielding statistics, said that his organization meets with lots of baseball teams, not just the sabermetric-friendly ones. He was candid in his assessment that some teams that he speaks with don't understand how to effectively use the defensive data that's available. John Abbamondi, the assistant GM of the St. Louis Cardinals, gave an example of a colleague who wanted platoon splits for relief pitchers over a one-week period—data with such small sample sizes so as to be rendered irrelevant as a predictive tool."

The two-part summary seems to be:

-- a lot of teams don't get it, and
-- even when the teams get it, sabermetricians are low status and not listened to much.

Which I don't get. I mean, suppose a team had hired Greg Rybarczyk, and, instead of revealing to the whole world what he found, he told the team. Now, when a team is shopping for a free agent, Greg's study should give them a slightly more accurate estimate of a player's HR potential.

It might not be a lot: even if you find that a 40 HR hitter with lots of "just enough" home runs should drop to (say) 32, you probably suspected that already, because he probably had an unusually good season. Also, many of the JE home runs would, under other circumstances, have been doubles, not outs, so the drop isn't as significant as it looks.

But still. Suppose your estimate was 1 HR better than before. One HR is about 1.5 runs, which is about 1/7 of a win, which is at least \$500,000. One good sabermetrician like Greg, who you could probably hire for, at a guess, \$70,000 including benefits, would give you, with this one study, information that had the potential to save you \$500,000.
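Here is that back-of-the-envelope arithmetic as a short sketch. The 10-runs-per-win rule of thumb and the \$3.5 million per win are assumptions (the latter is simply what "1/7 of a win is at least \$500,000" implies); the other figures come from the paragraph above.

```python
RUNS_PER_HR = 1.5            # from the post: "One HR is about 1.5 runs"
RUNS_PER_WIN = 10.0          # assumed rule of thumb; gives 0.15 wins, close to 1/7
DOLLARS_PER_WIN = 3_500_000  # assumed: implied by 1/7 of a win costing $500,000
ANALYST_COST = 70_000        # the post's guess at salary plus benefits


def value_of_extra_hr(extra_hr):
    """Dollar value of forecasting a player's home runs more accurately by extra_hr."""
    wins = extra_hr * RUNS_PER_HR / RUNS_PER_WIN
    return wins * DOLLARS_PER_WIN


value = value_of_extra_hr(1)
print(f"${value:,.0f} potential value vs. ${ANALYST_COST:,} cost")
```

Changing any of the constants shifts the result proportionally, so the comparison is only as good as those rough inputs.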

So why don't they hire them and at least listen to them? A few possibilities:

1. Just a prejudice against them and their low status. They're young smart guys with no baseball experience, and don't fit into the culture.

2. Even though this one study has the *potential* to save \$500,000, it probably won't. There might be one or two guys who fit the extreme-JE profile, and the team may not be signing those two guys this year. Also, there's only a small chance that they'd be outbid by exactly \$500K, and the information would make a difference.

3. It's hard for the GM to tell if the stud sabermetrician's results are valid or not. Peer review, before and after publication, makes sure the results make sense. The GM isn't in any position to do that himself (and, indeed, nobody is as good as the community).

4. In the past few years, so much intelligence is out there for free, on the various websites, that the team has enough trouble keeping current with that stuff, never mind creating new knowledge. If Greg comes up with this study on a public site, the team has to run in place just to keep from *losing* the potential \$500K to teams that know about it.

5. The team doesn't know who to hire. For every good sabermetrician, there are ten mediocrities that won't help you much.

6. Even if you find a few runs, they're runs you earn *on average*. If you find a guy who's expected to hit 2 more HR, he might wind up having a bad year, and your good move looks like it was a bad move. So, in that light, it's hard for teams to actually see and believe that your study saved them \$500K.

To me, none of these reasons seem to be enough. For #5, for instance, you could just ask Tom Tango. I'd hire anyone Tango recommended ... although, I guess I needed to know in the first place that Tango was someone to trust.

Anyway ... for, say, \$200,000 a year, you could hire three intelligent, inquisitive sabermetricians, and, among all three, they'd only have to give you ONE RUN A YEAR in extra intelligence to make the hiring worthwhile. Is there something wrong with my logic? Why doesn't every team have two or three Gregs working away?

## Friday, March 05, 2010

### Improving on Pythagoras

A recent press release from Iowa State University promotes a recent study from physics professor Kerry Whisnant, who has discovered an extension to baseball's Pythagorean Projection that makes it more accurate.

The Pythagorean formula ("Pythagoras") says that you can predict team winning percentage as

wpct = RS^2 / (RS^2 + RA^2)

Where RS means "runs scored," RA means "runs allowed", and "^2" means "to the power of 2." The formula was discovered by Bill James some 30 years ago.

But there are cases where the formula doesn't give you the whole story. Your actual win total doesn't just depend on runs scored and allowed -- it also depends on the consistency of each. If your scoring is more consistent than average, you should outperform Pythagoras in terms of wins. For instance, if you score exactly the same number of runs as you allow, you should wind up a .500 team. But if you win more blowouts than average -- by scores of 15-2 and 16-6, for instance -- you'll finish at less than .500, because you've "wasted" your runs when you don't need them. And if you *lose* more blowouts than average, you'll win *more* games than 50 percent, because your opponents are "wasting" their runs.

Whisnant generalized the "blowout" argument to use the standard deviation of runs scored and allowed, instead. The SD is a measure of how spread out the run totals are, so that a team with lots of blowouts will have a higher SD than a team with fewer. But then he went one step further -- he noted that, if you have two teams with the same runs scored, the team with the higher slugging percentage will have a lower SD. Why? Because home runs more consistently contribute to scoring than do singles. Generally, a home run is worth about three singles, but the effects are more certain. If you hit only five home runs in a game, you're going to score exactly five runs. But if you hit 15 singles, you're going to score anywhere from 0 to 15 runs, depending on how they're bunched and how often your outs advance your baserunners.

Whisnant found that if you include the effects of SLG, you can come up with a formula that gives you a better estimate than just "vanilla" Pythagoras. (The full formula is available at the above links.)

So how big is the effect? If you have two teams with exactly the same runs scored and allowed, you would normally expect them to win the same number of games. But not if they have different SLGs. For every .080 by which one team exceeds the other in slugging percentage, it should win one extra game. Since a run is one-tenth of a win, that means every .008 is worth a run. And, since there are nine batters, every .072 of excess slugging by a single, individual batter makes him worth the equivalent of an extra run. If you have two players, both creating X runs as best you can measure them, but one's SLG is .400 and the other's is .472, then the second player should be thought of as if he's worth one run more than the other guy.

If you use the new formula, Whisnant found you can reduce the error of the Pythagorean formula roughly by half.

I like the study a lot. It's been known for a long time that *how* you score your runs affects how many games you win, but this is the first time, to my knowledge, that anyone has tried to quantify the effect in terms of batting-line statistics.

----

But ... I'm not 100% convinced about the exact numerical results.

Whisnant calibrated his formula on samples of head-to-head matchups between teams over a season. That's about 10-13 games. But a Pythagorean formula that evaluates winning percentages over 12 games (say) is not necessarily as accurate as one that works on 162 games.

Consider a sample of one game only. The traditional Pythagorean formula (with exponent 2) isn't the best fit: it will say, for instance, that a team that scores four runs and allows two should win 80 percent of the time. But, of course, that team will really win 100 percent of the time. For one-game samples, the best exponent is *infinity*. (Or, if you don't mind rounding, an exponent of 15 will be more than good enough.)
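The one-game case is easy to check with a sketch of the formula that takes the exponent as a parameter (the function name and signature are mine, for illustration):

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean winning percentage with an adjustable exponent."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# A single game won 4-2: the estimate approaches 1.0 as the exponent grows.
for exponent in (2, 5, 15):
    print(exponent, round(pythagorean_wpct(4, 2, exponent), 5))
```

At exponent 2 the 4-2 winner is credited with .800; by exponent 15 the figure is indistinguishable from 1.000 for practical purposes.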

That is: Pythagoras works best when the score of every game is independent of the total scores. That's never true, because the score of the game *contributes* to the total. But, for 162 games, it's *almost* true -- one game out of 162 is barely anything. But for one game, it's absolutely not true at all -- not only is the one game not independent of the total, it actually *is* the total!

So what about for 12 games? It's probably closer to the 162-game case than the 1-game case, but I suspect it might still be somewhat off. Whisnant gave his formula coefficients to three decimal places; my guess is that if he recalibrated to 162-game seasons, the third decimal place would definitely change, and probably the second decimal too.

Regardless, even with the existing formula, you could check, by using it for actual team seasons, and comparing its accuracy to Pythagoras on those. I bet you'd find that Whisnant's average error (or average square error) was a lot better, but that it would be very, very slightly biased, as a bit too extreme.

That might be a quibble ... but, combined with the fact that even the best simulations may not be accurate to the third decimal place ... well, I think testing with real MLB data is a must-do.

----

Also, while I think the new formula is theoretically very interesting, I'm not sure it has any practical use to a GM. Remember, to get the equivalent of one extra run, when you're evaluating two equal players, you have to take the one with a slugging percentage .072 higher. What does that mean in real-life terms? To keep things simple, I'll look only at singles and home runs.

Replacing a single with an out costs about .75 runs. Replacing an out with a home run is worth about 1.75 runs. If a player has 500 AB in a season, .072 in slugging represents 36 extra bases.

A bit of math will show that if you want to keep run production the same, while increasing slugging by .072, you have to turn 49 singles into 21 home runs. So if you start with a .290 hitter with 20 home runs, to gain the equivalent of one single Pythagorean run, you have to trade him for a .234 hitter with 41 home runs.
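The "bit of math" is two equations in two unknowns, sketched below using the run values and the 500 AB given above. The function and variable names are mine. The unrounded solution comes out slightly different from the whole numbers used in the text (49 singles and 21 home runs, which gain 35 bases rather than 36).

```python
def singles_for_homers(at_bats=500, slg_gain=0.072, single_vs_out=0.75, hr_vs_out=1.75):
    """Solve for the run-neutral trade of singles for home runs.

    Unknowns: s singles given up, h home runs gained (the other s - h become outs).
      runs:  single_vs_out * s = hr_vs_out * h
      bases: 4 * h - s = at_bats * slg_gain
    """
    extra_bases = at_bats * slg_gain
    ratio = hr_vs_out / single_vs_out  # singles given up per home run gained
    homers = extra_bases / (4 - ratio)
    singles = ratio * homers
    return singles, homers


singles, homers = singles_for_homers()
# Batting average for the .290 hitter after the trade, over the same 500 AB.
new_avg = (0.290 * 500 - singles + homers) / 500
print(f"{singles:.1f} singles -> {homers:.1f} HR, new average {new_avg:.3f}")
```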

That's a big difference. And it's not that easy to do: there aren't all that many .234 hitters with 41 home runs, and it might not be worth pursuing one just to gain one-tenth of a win.

More importantly, the run estimation techniques we have simply aren't good enough to be accurate within anything close to a single run. Linear Weights data tells us the relative value of a single and a home run *on average*. But no team is average. A hit creates more runs the more men you already have on base. If you're the Yankees, scoring 5.6 runs a game, instead of the league average 4.8 ... well, is the .234 hitter with more power *really* worth the same as the .290 hitter with less power? My gut says that he'll be worth somewhat less -- even though he'll have more men on base to drive home, his extra outs are more costly on a better team. And I suspect that, on a better team, the value of a single increases more than the value of a home run -- the single has both more men to drive in, and better hitters coming up to drive him home. The HR has more men on base, but gains no benefit from the batters following him.

Moreover, doesn't it make sense that a team scores the most runs when it has some optimum combination of singles hitters and power? A team of nine Mark McGwires would score a lot fewer runs than Linear Weights would suggest, because LW has a built-in assumption that the McGwires would have a typical number of runners on base. If you hire nine McGwires to save nine-tenths of a win in Pythagoras, you're going to lose a lot more than nine-tenths of a win in runs scored, even if Linear Weights tells you otherwise.

Generally, no matter what the runs formulas said, I wouldn't be sure that I knew how to evaluate player productivity to the point where I could be sure to gain that minimal one-run Pythagorean advantage without doing at least one run's worth of damage in the attempt.

-----

On the subject of Pythagoras in general ... there seems to be an implied argument that when a team finishes ahead of its Pythagorean projection, it's a positive thing, and when it falls short, it's a negative thing. For instance, a commenter to Whisnant's original article said that you could "squeeze a couple of extra wins" out of a certain number of runs, which suggests it's something to shoot for.

I don't think that's the way to look at it. It seems to imply that God sends a team a certain number of runs, and it's up to the players (or even the manager) to distribute those runs appropriately. And so a team that beats its Pythagorean projection by a win or two has been more successful than if they won fewer games than expected.

It doesn't seem to me that that should be true at all. It's important to win ballgames, and you do that by scoring runs, but the efficiency with which you happened to convert runs to wins is not something that really matters.

Look at it this way. Suppose you're leading 2-1 late in the game, and then you score three insurance runs in the top of the ninth. Your closer strikes out the side, and you wind up winning 5-1. Were those three runs a good thing? Of course they were -- they increased your probability of winning significantly (I could look it up, but maybe it's a tenth of a win or something).

But those runs actually made you *do worse* in terms of your Pythagorean projection. Without the top of the ninth, you would have won the game with a one-run advantage. Now, you win the game with a four-run advantage. You're actually *less efficient* in terms of Pythagoras -- you've "wasted" three runs. But so what? Those runs were a good thing: they insured the victory.

If you hadn't scored in the ninth inning, your season might have (for instance) resulted in you scoring 800 runs, allowing 800 runs, and finishing 81-81. Now, with those three runs, you scored 803 runs, allowed 800, and still finished 81-81. Why is the second scenario worse than the first?
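To see how small (and how meaningless) the "inefficiency" is, here is a sketch that compares the two seasons using the exponent-2 formula from the top of the post; the 162-game schedule is the only added assumption.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162):
    """Expected wins from the exponent-2 Pythagorean formula."""
    wpct = runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)
    return wpct * games


for runs_scored in (800, 803):
    expected = pythagorean_wins(runs_scored, 800)
    print(f"{runs_scored} RS: expected {expected:.2f} wins, actual 81, gap {81 - expected:+.2f}")
```

The 803-run team "underperforms" its projection by a fraction of a win, even though it did nothing worse than the 800-run team and won exactly as many games.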

It's not. What I think is, that since you have control over runs more than you have control over Pythagoras (which is still mostly random, despite Whisnant's findings here), you should evaluate your team by runs. It's not "runs are fixed and let's distribute them efficiently." It's, "let's score as many runs as we can and hope they're efficiently distributed, even though we have less control over that."

Take an analogy to taxes. All things considered, we'd all like to pay a lower percentage of our income in taxes -- we'd like our take-home pay to be "efficient" compared to our nominal salary. But would you rather make \$100K and pay \$40K in taxes (for a 40% average tax rate), or make \$40K and pay \$10K in taxes (for a 25% tax rate)? Concentrating on the *rate* is the wrong thing to do: you'll be cheating yourself out of \$30,000 in cash. That's because the situation is NOT that you have a fixed amount of income and want to minimize taxes -- it's that the income AFFECTS the taxes, and in ways that you can't control much.

The ninth-inning analogy: if you score three runs, they'll be taxed at a higher rate than the rest of your runs (in the sense that, since you already had the lead, they won't contribute much to victory). But the rate is LESS than 100%. You're much better off accepting those runs, even if most of their value is taxed away. Having a higher "run" tax rate may sound like a bad thing when you isolate it from the rest of reality. But when you realize that a higher tax rate means that you've scored more runs and perhaps won more games ... it's no longer a bad thing.

But if you don't agree with me, and you think beating your Pythagoras is still something to shoot for ... well, when you're behind 3-1 in the ninth, let the other team score an extra 10 runs or so. You'll still lose the game, but, boy, will your Pythagorean projection make you look efficient!
