Friday, February 20, 2015

Replacing "statistically significant"

In his recent book, "How Not to Be Wrong," mathematician Jordan Ellenberg writes about how the word "significant" means something completely different in statistics than it does in real life:

"In common language it means something like 'important' or 'meaningful.' But the significance test scientists use doesn't measure importance ... [it's used] merely to make a judgment that the effect is not zero. But the effect could still be very small -- so small that the drug isn't effective in any sense that an ordinary non-mathematical Anglophone would call significant. ...

"If only we could go back in time to the dawn of statistical nomenclature and declare ... 'statistically noticeable' or 'statistically detectable' instead of 'statistically significant!'"

I absolutely agree.

In fact, in my view, the problem is even more serious the other way, when there is *no* statistical significance. Researchers will say, "we found no statistically significant effect," which basically means, "we don't have enough evidence to say either way." But readers will take that as meaning, "we find at best a very small effect." That's not necessarily the case. Studies often find values that would be very significant in the real world, but reject them because the confidence interval is wide enough to include zero.
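To make the distinction concrete, here's a minimal sketch with made-up numbers: the point estimate is large, but the confidence interval is wide enough to include zero, so the result gets reported as "no statistically significant effect."

```python
# Hypothetical study: the estimated effect is large (5 points), but the
# standard error (3) makes the 95% confidence interval straddle zero.
estimate, se = 5.0, 3.0
lo, hi = estimate - 1.96 * se, estimate + 1.96 * se

print(f"95% CI: ({lo:.2f}, {hi:.2f})")   # 95% CI: (-0.88, 10.88)

# "Significant" only means the interval excludes zero -- it says nothing
# about whether the estimate itself is big or small.
significant = not (lo <= 0.0 <= hi)
print("statistically significant:", significant)   # False
```

A reader who hears "not significant" pictures a tiny effect; the number the study actually found could be anywhere in that wide interval.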


Tom Tango will often challenge readers to put aside "inertial reasoning" and consider how we would redesign baseball rules if we were starting from scratch. In that tradition, how would we redo the language of statistical significance?

I actually spent a fair bit of time on this a year or so ago. I went to a bunch of online thesauruses, and wrote down every adjective that had some kind of overlap with "significant." Looking at my list ... I notice I actually didn't include Ellenberg's suggestions, "noticeable" or "detectable." Those are very good candidates. I'll add those now, along with a few of their synonyms.

OK, done. Here's my list of possible candidates:

convincing, decisive, unambiguous, probable cause, suspicious, definite, definitive, adequate, upholdable, qualifying, sufficing, signalling, salient, sufficient, defensible, sustainable, marked, rigorous, determinate, permissible, accreditable, attestable, credentialed, credence-ive, credible, threshold, reliable, presumptive, persuasive, confident, ratifiable, legal, licit, sanctionable, admittable, acknowledgeable, endorsable, affirmative, affirmable, warrantable, conclusive, valid, assertable, clear, ordainable, non-spurious, dependable, veritable, creditable, avowable, vouchable, substantive, noticeable, detectable, perceivable, discernable, observable, appreciable, ascertainable, perceptible

You can probably divide these into classes, based on shades of meaning:

1. Words that mean "enough to be persuasive." Some of those are overkill, some are underkill. "Unambiguous," for instance, would be an obvious oversell; you can have a low p-value and still be pretty ambiguous. On the other hand, "defensible" might be a bit too weak. Maybe "definite" is the best of those, suggesting precision but not necessarily absolute truth.

2. Words that mean "big enough to be observed." Those are the ones that Ellenberg suggested, "noticeable" and "detectable." Those seem fine when you actually find significance, but not so much when you don't. "We find no relationship that is statistically detectable" does seem to imply that there's nothing there, rather than that you just don't have enough data in your sample.

3. Words that mean "enough evidence." That's exactly what we want, except I can't think of any that work. The ones in the list aren't quite right. "Probable cause" is roughly the idea we're going for, but it's awkward and sounds too Bayesian. "Suspicious" has the wrong flavor. "Credential" has a nice ring to it -- as an adjective, not a noun, meaning "to have credence." You could say, for instance, "We didn't have enough evidence to get a credential estimate."  Still a bit awkward, though. "Determinate" is pretty good, but maybe a bit overconfident.

Am I missing some? I tried to think, what's the word we use when we say an accused was acquitted because there wasn't enough evidence? "Insufficient" is the only one I can think of. Everything else is a phrase -- "beyond a reasonable doubt," or "not meeting the burden of proof."

4. Words that mean "passing an objective level," as in meeting a threshold. Actually, "threshold" as an adjective would be awkward, but workable -- "the coefficient was not statistically threshold." There's also "adequate," and "qualifying," and "sufficient," and "sufficing."

5. Finally, there are words that mean "legal," in the sense of, "now the peer reviewers will permit us to treat the effect as legitimate." Those are words like "sanctionable," "admittable," "acknowledgeable," "permissible," "ratifiable," and so on. My favorite of these is "affirmable." You could write, "The coefficient had a p-value of .06, which falls short of statistical affirmability." The reader now gets the idea that the problem isn't that the effect is small -- but, rather, that there's something else going on that doesn't allow the researcher to "affirm" it as a real effect.

What we'd like is a word that has a flavor matching all these shades of meaning, without giving the wrong idea about any of them. 

So, here's what I think is the best candidate, which I left off the list until now:


"Dispositive" is a legal term that means "sufficient on its own to decide the answer." If a fact is dispositive, it's enough to "dispose" of the question.

Here's a perfect example:

"Whether he blew a .08 or higher on the breathalyzer is dispositive as to whether he will be found guilty of DUI."

It's almost exact, isn't it? .08 for a conviction, .05 for statistical significance.

I think "dispositive" really captures how statistical significance is used in practice -- as an arbitrary standard, a "bright line" between Yes and No. We don't allow authors to argue that their study is so awesome that p=.07 should really be allowed to be considered significant, any more than we allow defendants to argue that they should be acquitted at a blood alcohol level of .09 because they're especially good drivers.

Moreover, the word works right out of the box in its normal English definition. Unlike "significant," the statistical version of "dispositive" has the same meaning as the usual one. If you say to a non-statistician, "the evidence was not statistically dispositive," he'll get the right idea -- that an effect was maybe found, but there's not quite enough there for a decision to be made about whether it's real or not. In effect, the question is not yet decided. 

That's the same as in law. "Not dispositive" means the evidence or argument is a valid one, but it's not enough on its own to decide the case. With further evidence or argument, either side could still win. That's exactly right for statistical studies. A "non-significant" p-value is certainly relevant, but it's not dispositive evidence of presence, and it's not dispositive evidence of absence. 

Another nice feature is that the word still kind of works when you use it to describe the effect or the estimate, rather than the evidence: 

"The coefficient was not statistically dispositive."

It's not a wonderful way to put it, but it's reasonable. Most of the other candidate words don't work well both ways at all -- some are well-suited only to describing the evidence, others only to describing the estimates. These don't really make sense:

"The evidence was not statistically detectable."  
"The effect was not statistically reliable."
"The coefficient was not statistically accreditable."

Another advantage of "dispositive" is that unlike "significant," you can leave out the word "statistical" without ambiguity:

"The evidence was not dispositive."
"The coefficient was not dispositively different from zero."

Those read fine, don't they? I bet they'd almost always read fine. I'd bet that if you were to pick up a random study, and do a global replace of "statistically significant" with "dispositive," the paper wouldn't suffer at all. (It might even be improved, if the change highlighted cases where "significant" was used in ways it shouldn't have been.)


When I'm finally made Global Despotic Emperor of Academic Standards, the change of terminology will be my first official decree.

Unless someone has a better suggestion. 


Friday, February 06, 2015

Rating battery life on a 100-point scale

I've written before about how Consumer Reports (CR) uses a 100-point system for its product ratings. In their latest issue, they use that same system to rate AA batteries, and I suspect the ratings turned out so misleading that CR wound up fooling its own editorial staff!


CR rated 13 brands of alkaline batteries, and two brands of lithium batteries.  In the alkaline category, Kirkland Signature (Costco's house brand) was rated "best buy." It was the third-best alkaline, and, at 27 cents a battery, the least expensive by far. Most of the others were between $0.75 and $1.00 (although they would have been cheaper if CR had priced them in a bulk pack, like the Kirkland).

The two lithium batteries rated the highest of all, but they cost more than $2 each.

Now, suppose I'm not near a Costco, and need batteries. My choice is between the high-rated Duracell alkaline, at $1.20, and the Energizer Ultimate Lithium, at $2.50. Which should I buy?

There's no way to tell from the article. Why? Because all we have is that 100-point scale. That doesn't help much. Why doesn't CR just tell us how long each battery lasted, so we can do our own cost/benefit calculation?

It's not quite that simple, you could argue. Batteries perform differently in "high drain" and "low drain" applications. CR tested both -- it used a flashlight for its low-drain test, and a toy for its high-drain test. Then it combined the two, somehow, to get the rating. But, couldn't they have combined them in such a way that the ratings are roughly proportional to how long the batteries last? 


I found a 2012 "battery showdown", from BitBox, that gives you actual data. Here's their graph of how much power you get from different brands of battery at high-drain (before the voltage drops below 0.8V).  The lithiums are the two at the top, the alkalines are the large cluster in the middle, and the cheap carbon-zinc batteries (which CR didn't test) are the poor performers at the bottom.

Looking at their chart of numbers ... the Energizer Ultimate Lithium, it appears, lasts around 3.1 times as long as the Costco alkaline in high-drain applications. At low-drain, the lithium lasts 1.7 times as long.

That's consistent with what I had previously understood -- that lithium batteries are by far the best, but shine more in high-drain applications than low-drain applications. 

Strangely, the CR chart might lead you to expect exactly the opposite! CR rated the lithium batteries "excellent" (their maximum rating) in both applications. That "tied" eight of the 13 alkalines in the high-drain test, but only one in the low-drain test. Based on those ratings, a reader would be forgiven for concluding that lithium batteries give you more of an advantage in low-drain uses. (In fairness, the text of the article does give the correct advice, although it doesn't explain why the chart seems to imply otherwise.)


Anyway, combining the two factors, 3.1 and 1.7, we might choose to conclude that the lithiums last maybe two-and-a-half times as long as the alkalines.

But CR's ratings give no clue that the difference is that large. All they tell us is that the lithium grades a 96/100, and the Costco alkaline grades an 84/100. In other words: CR gives the lithium 14% more points for 150% more performance. 

Which, I guess, has to be the case, given the rating system. If you give the lithium a perfect score of 100, you'd have to give the alkalines 40 or less. And they can't do that, since, to CR, 40/100 can only mean "poor."
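For what it's worth, the arithmetic works out like this (the 3.1 and 1.7 ratios are the BitBox figures quoted above; averaging the two tests equally is my own guess at how to combine them):

```python
# BitBox ratios: lithium vs. alkaline battery life, by test type
high_drain, low_drain = 3.1, 1.7
combined = (high_drain + low_drain) / 2
print(round(combined, 1))        # 2.4 -- roughly "two-and-a-half times"

# CR's scale gives the lithium 96 points and the Costco alkaline 84
print(round(96 / 84 - 1, 2))     # 0.14 -- 14% more points...
print(round(combined - 1, 1))    # 1.4 -- ...for ~140-150% more performance

# If the scale were proportional to performance, a 100-point lithium
# would put the alkaline at about:
print(round(100 / combined))     # 42 -- i.e., "40 or less"
```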


The article goes on to say, 

"The top-scoring [91] alkaline battery -- Duracell Quantum -- was not significantly different from the high-scoring [94 and 96] lithium models ..."

That, I believe, is just plain false. A quick Google search of bloggers who tested the Quantums suggests that, at best, they're a bit better than other alkalines, but nowhere near as good as lithiums. So, CR winds up telling us that lasting twice as long doesn't make a battery "significantly different."

What happened? 

Well, it might have been a misapplication of the normal criteria for "significantly different."  In their longer ratings articles, CR includes a disclaimer in their ratings: "differences of fewer than X points aren't meaningful."  

For the batteries ... sure, lower in the rankings, five points isn't significant. I fully agree that the Rayovac at 78/100 isn't significantly different from the CVS at 82/100. But it's absolutely not true that the Quantum at 91/100 is anywhere near as good as the lithium at 94/100. The rating system might work in the middle, but it fails at the top.

That's how, I think, CR wound up fooling itself. The writers looked at the ratings, and thought, "hey, it's only three points!"


Monday, January 26, 2015

Are umpires biased in favor of star pitchers? Part II

Last post, I talked about the study (.pdf) that found umpires grant more favorable calls to All-Stars because the umps unconsciously defer to their "high status." I suggested alternative explanations that seemed more plausible than "status bias."

Here are a few more possibilities, based on the actual coefficient estimates from the regression itself.

(For this post, I'll mostly be talking about the "balls mistakenly called as strikes" coefficients, the ones in Table 3 of the paper.)


1. The coefficient for "right-handed batter" seems way too high: -0.532. That's so big, I wondered whether it was a typo, but apparently it's not.  

How big? Well, a right-handed batter would have to be facing a pitcher with 11 All-Star appearances before he'd suffer as many bad calls as his left-handed teammate facing a pitcher with none.

The likely explanation seems to be: umpires don't call strikes by the PITCHf/x (rulebook) standard, and the differences are bigger for lefty batters than righties. Mike Fast wrote, in 2010,

"Many analysts have shown that the average strike zone called by umpires extends a couple of inches outside the rulebook zone to right-handed hitters and several more inches beyond that to left-handed hitters." 

That's consistent with the study's findings in a couple of ways. First, in the other regression, for "strikes mistakenly called as balls", the equivalent coefficient is less than a tenth the size, at -0.047. Which makes sense: if the umpires' strike zone is "too big", it will affect undeserved strikes more than undeserved balls. 

Second: the two coefficients go in the same direction. You wouldn't expect that, right? You'd expect that if lefty batters get more undeserved strikes, they'd also get fewer undeserved balls. But this coefficient is negative in both cases. That suggests something external and constant, like the PITCHf/x strike zone overestimating the real one.

And, of course, if the problem is umpires not matching the rulebook, the entire effect could just be that control pitchers are more often hitting the "illicit" part of the zone.  Which is plausible, since that's the part that's closest to the real zone.


2. The "All-Star" coefficient drops when it's interacted with control. Moreover, it drops further for pitchers with poor control than pitchers with good control. 

Perhaps, if there *is* a "status" effect, it's only for the very best pitchers, the ones with the best control. Otherwise, you have to believe that umpires are very sensitive to "status" differences between marginal pitchers' control rates. 

For instance, going into the 2009 season, say, J.C. Romero had a career 12.5% BB/PA rate, while Warner Madrigal's was 9.1%. According to the regression model, you'd expect umpires to credit Madrigal with 37% more undeserved strikes than Romero. Are umpires really that well calibrated?

Suppose I'm right, and all the differences in error rates really accrue to only the very best control pitchers. Since the model assumes the effect is linear all the way down the line, the regression will underestimate the best and worst control pitchers, and overestimate the average ones. (That's what happens when you fit a straight line to a curve; you can see an example in the pictures here.) 
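Here's a toy illustration of that point, with made-up convex data: the least-squares line overestimates the middle of the curve and underestimates both ends.

```python
import numpy as np

# Made-up convex relationship (a parabola stands in for the real curve)
x = np.arange(11, dtype=float)
y = (x - 5) ** 2

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Positive residuals at the ends (line too low), negative in the middle
print(residuals[0] > 0, residuals[5] < 0, residuals[10] > 0)   # True True True
```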

Since the best control pitchers are underestimated, the regression tries to compensate by jiggling one of the other coefficients, something that correlates with only those pitchers with the very best control. The candidate it settles on: All-Star appearances. 

Which would explain why the All-Star coefficient is high, and why it's high mostly for pitchers with good control. 


3. The pitch's location, as you would expect, makes a big difference. The further outside the strike zone, the lower the chance that it will be mistakenly called a strike. 

The "decay rate" is huge. A pitch that's 0.1 feet outside the zone (1.2 inches) has only 43 percent the odds of being called a strike as one that's right on the border (0 feet).  A pitch 0.2 feet outside has only 18 percent the odds (43 percent squared).  And so on.*

(* Actually, the authors used a quadratic to estimate the effect -- which makes sense, since you'd expect the decay rate to increase. If the error rate at 0.1 feet is, say, 10 percent, you wouldn't expect the rate for 1 foot to be 1 percent. It would be much closer to zero. But the quadratic term isn't that big, it turns out, so I'll ignore it for simplicity. That just renders this argument more conservative.) 

The regression coefficient, per foot outside, was 8.292. The coefficient for a single All-Star appearance was 0.047. 

So an All-Star appearance is worth 1/176 of a foot -- which is a bit more than 1/15 of an inch.

That's the main regression. For the one with the lower value for All-Star appearances, it's only an eighteenth of an inch. 
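Those figures follow directly from the two quoted coefficients; a quick check:

```python
import math

beta_distance = 8.292   # log-odds per foot outside the zone
beta_allstar = 0.047    # log-odds per All-Star appearance (main regression)

# Odds ratio for a pitch 0.1 feet (1.2 inches) farther outside
print(round(math.exp(-beta_distance * 0.1), 2))   # 0.44 -- the "43 percent" above

# One All-Star appearance, expressed as distance
feet = beta_allstar / beta_distance
print(round(1 / feet))           # 176 -- "1/176 of a foot"
print(round(feet * 12, 3))       # 0.068 inches -- a bit more than 1/15
```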

Isn't it more plausible to think that the good pitchers are deceptive enough to fool the umpire by 1/15 of an inch per pitch, rather than that the umpire is responding to their status?

Or, isn't it more likely that the good pitchers are hitting the "extra" parts of the umpires' inflated strike zone, at an increased rate of one inch per 15 balls? 


4. The distance from the edge of the strike zone is, I assume, "as the crow flies." So, a high pitch down the middle of the plate is treated as the same distance as a high pitch that's just on the inside edge. 

But, you'd think that the "down the middle" pitch has a better chance of being mistakenly called a strike than the "almost outside" pitch. And isn't it also plausible that control pitchers will have a different ratio of the two types than those with poor control? 

Also, a pitch that's 1 inch high and 1 inch outside registers as the same distance as a pitch over the plate that's 1.4 inches high. Might umpires not be evaluating two-dimensional balls differently than one-dimensional balls?

And, of course: umpires might be calling low balls differently than high balls, and outside pitches differently from inside pitches. If pitchers with poor control throw to the inside part of the plate more than All-Stars (say), and the umpires seldom err on balls inside because of the batter's reaction, that alone could explain the results.


All these explanations may strike you as speculative. But, are they really more speculative than the "status bias" explanation? They're all based on exactly the same data, and the study's authors don't provide any additional evidence other than citations that status bias exists.

I'd say that there are several different possibilities, all consistent with the data:

1.  Good pitchers get the benefit of umpires' "status bias" in their favor.

2.  Good pitchers hit the catcher's glove better, and that's what biases the umpires.

3.  Good pitchers have more deceptive movement, and the umpire gets fooled just as the batter does.

4.  Different umpires have different strike zones, and good pitchers are better able to exploit the differences.

5.  The strike zone umpires actually call is significantly bigger than the PITCHf/x (rulebook) zone. Since good pitchers are closer to the strike zone more often, they wind up with more umpire strikes that are PITCHf/x balls. The difference only has to be the equivalent of one-fifteenth of an inch per ball.

6.  Umpires are "deliberately" biased. They know that when they're not sure about a pitch, considering the identity of the pitcher gives them a better chance of getting the call right. So that's what they do.

7.  All-Star pitchers have a positive coefficient to compensate for real-life non-linearity in the linear regression model.

8.  Not all pitches the same distance from the strike zone are the same. Better pitchers might err mostly (say) high or outside, and worse pitchers high *and* outside. If umpires are less likely to be fooled in two dimensions than one, that would explain the results.


To my gut, #1, unconscious status bias, is the least plausible of the eight. I'd be willing to bet that each of the remaining seven is contributing to the results to some extent (possibly negatively).

But I'd bet on #5 being the biggest factor, at least if the differences between umpires and the rulebook really *are* as big as reported.  

As always, your gut may be more accurate than mine.  


Sunday, January 18, 2015

Are umpires biased in favor of star pitchers?

Are MLB umpires biased in favor of All-Star pitchers? An academic study, released last spring, says they are. Authored by business professors Braden King and Jerry Kim, it's called "Seeing Stars: Matthew Effects and Status Bias in Major League Baseball Umpiring."

"What Umpires Get Wrong" is the title of an Op-Ed piece in the New York Times where the authors summarize their study. Umps, they write, favor "higher status" pitchers when making ball/strike calls:

"Umpires tend to make errors in ways that favor players who have established themselves at the top of the game's status hierarchy."

But there's nothing special about umpires, the authors say. In deferring to pitchers with high status, umps are just exhibiting an inherent unconscious bias that affects everyone: 

" ... our findings are also suggestive of the way that people in any sort of evaluative role — not just umpires — are unconsciously biased by simple 'status characteristics.' Even constant monitoring and incentives can fail to train such biases out of us."

Well ... as sympathetic as I am to the authors' argument about status bias in regular life, I have to disagree that the study supports their conclusion in any meaningful way.


The authors looked at PITCHf/x data for the 2008 and 2009 seasons, and found all instances where the umpire miscalled a ball or strike, based on the true, measured x/y coordinates of the pitch. After a large multiple regression, they found that umpire errors tend to be more favorable for "high status" pitchers -- defined as those with more All-Star appearances, and those who give up fewer walks per game. 

For instance, in one of their regressions, the odds of a favorable miscall -- the umpire calling a strike on a pitch that was actually out of the strike zone -- increased by 0.047 for every previous All-Star appearance by the pitcher. (It was a logit regression, but for low-probability events like these, the number itself is a close approximation of the geometric difference. So you can think of 0.047 as a 5 percent increase.)
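The reason the raw coefficient reads as a percentage: exp(b) is the exact odds multiplier, and exp(b) is approximately 1 + b when b is small.

```python
import math

b = 0.047   # the paper's coefficient per All-Star appearance

print(round(math.exp(b), 4))       # 1.0481 -- odds multiply by about 1.048
print(round(math.exp(b) - 1, 3))   # 0.048 -- i.e., roughly a 5 percent increase
```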

The pitcher's odds also increased 1.4 percent for each year of service, and another 2.5 percent for each percentage point improvement in career BB/PA.

For unfavorable miscalls -- balls called on pitches that should have been strikes -- the effects were smaller, but still in favor of the better pitchers.

I have some issues with the regression, but will get to those in a future post. For now ... well, it seems to me that even if you accept that these results are correct, couldn't there be other, much more plausible explanations than status bias?

1. Maybe umpires significantly base their decisions on how well the pitcher hits the target the catcher sets up. Good pitchers come close to the target, and the umpire thinks, "good control" and calls it a strike. Bad pitchers vary, and the catcher moves the glove, and the umpire thinks, "not what was intended," and calls it a ball.

The authors talk about this, but they consider it an attribute of catcher skill, or "pitch framing," which they adjust for in their regression. I always thought of pitch framing as the catcher's ability to make it appear that he's not moving the glove as much as he actually is. That's separate from the pitcher's ability to hit the target.

2. Every umpire has a different strike zone. If a particular ump is calling a strike on a low pitch that day, a control pitcher is more able to exploit that opportunity by hitting the spot. That shows up as an umpire error in the control pitcher's favor, but it's actually just a change in the definition of the strike zone, applied equally to both pitchers.

3. The study controlled for the pitch's distance from the strike zone, but there's more to pitching than location. Better pitchers probably have better movement on their pitches, making them more deceptive. Those might deceive the umpire as well as the batter. 

Perhaps umpires give deceptive pitches the benefit of the doubt -- when the pitch has unusual movement, and it's close, they tend to call it a strike, either way. That would explain why the good pitchers get favorable miscalls. It's not their status, or anything about their identity -- just the trajectory of the balls they throw. 

4. And what I think is the most important possibility: the umpires are Bayesian, trying to maximize their accuracy. 

Start with this. Suppose that umpires are completely unbiased based on status -- in fact, they don't even know who the pitcher is. In that case, would an All-Star have the same chance of a favorable or unfavorable call as a bad pitcher? Would the data show them as equal?

I don't think so. 

There are times when an umpire isn't really sure about whether a pitch is a ball or a strike, but has to make a quick judgment anyway. It's a given that "high-status" control pitchers throw more strikes overall; that's probably also true in those "umpire not sure" situations. 

Let's suppose a borderline pitch is a strike 60% of the time when it's from an All-Star, but only 40% of the time when it's from a mediocre pitcher.

If the umpire is completely unbiased, what should he do? Maybe call it a strike 50% of the time, since that's the overall rate. 

But then: the good pitcher will get only five strike calls when he deserves six, and the bad pitcher will get five strike calls when he only deserves four. The good pitcher suffers, and the bad pitcher benefits.

So, unbiased umpires benefit mediocre pitchers. Even if umpires were completely free of bias, the authors' methodology would nonetheless conclude that umpires are unfairly favoring low-status pitchers!


Of course, that's not what's happening, since in real life, it's the better pitchers who seem to be benefiting. (But, actually, that does lead to a fifth (perhaps implausible) possibility for what the authors observed: umpires are unbiased, but the *worse* pitchers throw more deceptive pitches for strikes.)

So, there's something else happening. And, it might just be the umpires trying to improve their accuracy.

Our hypothetical unbiased umpire will have miscalled 5 out of 10 pitches for each player. To reduce his miscall rate, he might change his strategy to a Bayesian one. 

Since he understands that the star pitcher has a 60% true strike rate in these difficult cases, he might call *all* strikes in those situations. And, since he knows the bad pitcher's strike rate is only 40%, he might call *all balls* on those pitches. 

That is: the umpire chooses the call most likely to be correct. 60% beats 40%.

With that strategy, the umpire's overall accuracy rate improves to 60%. Even if he has no desire, conscious or unconscious, to favor the ace for the specific reason of "high status", it looks like he does -- but that's just a side-effect of a deliberate attempt to increase overall accuracy.

In other words: it could be that umpires *consciously* take the history of the pitcher into account, because they believe it's more important to minimize the number of wrong calls than to spread them evenly among different skills of pitcher. 

That could just as plausibly be what the authors are observing.

How can the ump improve his accuracy without winding up advantaging or disadvantaging any particular "status" of pitcher? By calling strikes in exactly the proportion he expects from each. For the good pitcher, he calls strikes 60% of the time when he's in doubt. For the bad pitcher, he calls 40% strikes. 

That strategy increases his accuracy rate only marginally -- from 50 percent to 52 percent (60% squared plus 40% squared). But, now, at least, neither pitcher can claim he's being hurt by umpire bias. 
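Putting the three strategies side by side, using the made-up borderline rates from above (60 percent true strikes for the ace, 40 percent for the mediocre pitcher):

```python
def accuracy(q, p):
    """Expected accuracy when calling "strike" with probability q on
    pitches that are truly strikes with probability p."""
    return q * p + (1 - q) * (1 - p)

p_ace, p_bad = 0.60, 0.40

# Ignore the pitcher: call strikes at the overall 50% rate
print(round(accuracy(0.5, p_ace), 2), round(accuracy(0.5, p_bad), 2))      # 0.5 0.5

# Bayesian: always make the more likely call
print(round(accuracy(1.0, p_ace), 2), round(accuracy(0.0, p_bad), 2))      # 0.6 0.6

# Proportional: call strikes in each pitcher's true proportion
print(round(accuracy(p_ace, p_ace), 2), round(accuracy(p_bad, p_bad), 2))  # 0.52 0.52
```

Only the proportional strategy treats neither pitcher unfairly, and it buys just two points of accuracy over ignoring the pitcher entirely.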

But, even though the result is equitable, it's only because the umpire DOES have a "status bias." He's treating the two pitchers differently, on the basis of their historical performance. But King and Kim's study won't be able to tell there's a bias, because neither pitcher is hurt. The bias is at exactly the right level.

Is that what we should want umpires to do, bias just enough to balance the advantage with the disadvantage? That's a moral question, rather than an empirical one. 

Which are the most ethical instructions to give to the umpires? 


1. Make what you think is the correct call, on a "more likely than not" basis, *without* taking the pitcher's identity into account.

Advantages: No "status bias."  Every pitcher is treated the same.

Disadvantages: The good pitchers wind up being disadvantaged, and the bad pitchers advantaged. Also, overall accuracy suffers.


2. Make what you think is the correct call, on a "more likely than not" basis, but *do* take the pitcher's identity into account.

Advantages: Maximizes overall accuracy.

Disadvantages: The bad pitchers wind up being disadvantaged, and the good pitchers advantaged.


3. Make what you think is the most likely correct call, but adjust only slightly for the pitcher's identity, just enough that, overall, no type of pitcher is either advantaged or disadvantaged.

Advantages: No pitcher has an inherent advantage just because he's better or worse.

Disadvantages: Hard for an umpire to calibrate his brain to get it just right. Also, overall accuracy not as good as it could be. And, how do you explain this strategy to umpires and players and fans?

Which of the three is the right answer, morally? I don't know. Actually, I don't think there necessarily is one -- I think any of the three is fair, if understood by all parties, and applied consistently. Your opinion may vary, and I may be wrong. But, that's a side issue.


Getting back to the study: the fact that umpires make more favorable mistakes for good pitchers than bad pitchers is not, by any means, evidence that they are unconsciously biased against pitchers based on "status." It could just as easily be one of several other, more plausible reasons. 

So that's why I don't accept the study's conclusions. 

There's also another reason -- the regression itself. I'll talk about that next post.

(Hat tip: Charlie Pavitt)


Wednesday, January 07, 2015

Predicting team SH% from player talent

For NHL teams, shooting percentage (SH%) doesn't seem to carry over all that well from year to year. Here, repeated from last post, are the respective correlations:

-0.19  2014-15 vs. 2013-14
+0.30  2013-14 vs. 2012-13
+0.33  2012-13 vs. 2011-12
+0.03  2011-12 vs. 2010-11
-0.10  2010-11 vs. 2009-10
-0.27  2009-10 vs. 2008-09
+0.04  2008-09 vs. 2007-08

(All data is for 5-on-5 tied situations. Huge thanks to Puckalytics for making the raw data available on their website.)

They're small. Are they real? It's hard to know, because of the small sample sizes. With only 30 teams, even if SH% were totally random, you'd still get coefficients of this size -- the SD of a random 30-team correlation is 0.19.  
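That 0.19 figure is easy to sanity-check. Here's a quick Python sketch (mine, not from the post): correlate two sets of 30 pure-noise "team" values many times over, and look at the spread of the resulting coefficients.

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
# Correlate two independent sets of 30 random "team" values, many times.
rs = []
for _ in range(20000):
    xs = [random.gauss(0, 1) for _ in range(30)]
    ys = [random.gauss(0, 1) for _ in range(30)]
    rs.append(corr(xs, ys))

sd = statistics.pstdev(rs)
print(round(sd, 2))  # ≈ 0.19, roughly 1/sqrt(30 - 3)
```

So correlations of ±0.2 or ±0.3 are exactly what pure randomness produces with only 30 teams.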

That means there's a lot of noise, too much noise in which to discern a small signal. To reduce that noise, I thought I'd look at the individual players on the teams.  (UPDATE: Rob Vollman did this too, see note at bottom of post.)

Start with last season, 2013-14. I found every player who had at least 20 career shots in the other six seasons in the study. Then, I projected his 2013-14 "X-axis" shooting percentage as his actual SH% in those other seasons.  

For every team, I calculated its "X-axis" shooting percentage as the average of the individual player estimates.  

(Notes: I weighted the players by actual shots, except that if a player had more shots in 2013-14 than the other years, I used the "other years" lower shot total instead of the current one. Also, the puckalytics data didn't post splits for players who spent a year with multiple teams -- it listed them only with their last team. To deal with that, when I calculated "actual" for a team, I calculated it for the Puckalytics set of players.  So the team "actual" numbers I used didn't exactly match the official ones.)

If shooting percentage is truly (or mostly) random, the correlation between team expected and team actual should be low.  

It wasn't that low. It was +0.38.  

I don't want to get too excited about that +0.38, because most other years didn't show that strong an effect. Here are the correlations for those other years:

+0.38  2013-14
+0.45  2012-13
+0.13  2011-12
-0.07  2010-11
-0.34  2009-10
-0.01  2008-09
+0.16  2007-08

They're very similar to the season-by-season correlations at the top of the post ... which, I guess, is to be expected, because they're roughly measuring the same thing.  

If we combine all the years into one dataset, so we have 210 points instead of 30, we get 

+0.13  7 years

That could easily be random luck.  A correlation of +0.13 would be on the edge of statistical significance if the 210 datapoints were independent. But they're not, since every player-year appears up to six different times as part of the "X-axis" variable.

It's "hockey significant," though. The coefficient is +0.30. So, for instance, at the beginning of 2013-14, when the Leafs' players' historical SH% was 2.96 percentage points higher than the Panthers' players' ... you'd forecast the actual difference to be 0.89 points.  (The actual difference came out to be 4.23 points, but never mind.)
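That forecast is just regression to the mean with the fitted coefficient (numbers taken from the paragraph above):

```python
# Regress an observed out-of-sample gap toward the mean, using the
# coefficient fitted on the 7-year dataset (+0.30).
coefficient = 0.30
historical_gap = 2.96   # Leafs' players vs. Panthers' players, in SH% points
forecast_gap = coefficient * historical_gap
print(round(forecast_gap, 2))  # 0.89
```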


The most recent three seasons appear to have higher correlations than the previous four. Again at the risk of cherry-picking ... what happens if we just consider those three?

+0.38  2013-14
+0.45  2012-13
+0.13  2011-12
+0.34  3 years

The +0.34 looks modest, but the coefficient is quite high -- 0.60. That means you have to regress out-of-sample performance only 40% back to the mean.  

Is it OK to use these three years instead of all seven? Not if the difference is just luck; only if there's something that actually makes the 2011-12 to 2013-14 seasons more reliable.  

For instance ... it could be that the older seasons do worse because of selective sampling. If players improve slowly over their careers, then drop off a cliff ... the older pairings are more likely to be comparing a player to his post-cliff performance. I have no idea if that's a relevant explanation or not, but that's the kind of argument you'd need to help justify looking at only the three seasons.

Well, at least we can check statistical significance. I created a simulation of seven 30-team seasons, where each identical team had an 8 percent chance of scoring on each of 600 identical shots. Then, I ran the same three-season correlation as above on the simulated data.

The SD of that correlation coefficient was 0.12. So, the +0.34 in the real-life data was almost three SDs above random.

Still: we did cherry-pick our three seasons, so the raw probability is very misleading.  If it had been 8 SD or something, we would have been pretty sure that we found a real relationship, even after taking the cherry-pick into account. At 3 SD ... not so sure.


Well, suppose we split the difference ... but on the conservative side. The 7-year coefficient is 0.30. The 3-year coefficient is 0.60.  Let's try a coefficient of 0.40, which is only 1/3 of the way between 0.30 and 0.60.

If we do that, we get that the predictive ability of SH% is: one extra goal per X shots in the six surrounding seasons forecasts 0.4 extra goals per X shots this season.

For an average team, 0.4 extra goals is around 5 extra shots, or 9 extra Corsis.

In his study last month, Tango found a goal was only 4 extra Corsis.  Why the difference? Because our studies aren't measuring the same thing.  We were asking the same general question -- "if you combine 'goals' and 'shots,' does that give you a better prediction than 'shots' alone?" -- but doing so by asking different specific questions.  

Tango asked how well half a team's games predict the other half. I asked how well a team's players' six surrounding seasons predict its current one. It's possible that the "half-year" method has more luck in it ... or that other differences factor in, also.

My gut says that the answers we found are still fairly consistent.


UPDATE: Rob Vollman, of "Hockey Abstract" fame, did a similar study last summer (which I read, but had forgotten about).  Slightly different methodology, I think, but the results seem consistent.  Sorry, Rob!


Thursday, December 18, 2014

True talent levels for NHL team shooting percentage, part II

(Part I is here)

Team shooting percentage is considered an unreliable indicator of talent, because its season-to-season correlation is low. 

Here are those correlations from the past few seasons, for 5-on-5 tied situations.

-0.19  2014-15 vs. 2013-14
+0.30  2013-14 vs. 2012-13
+0.33  2012-13 vs. 2011-12
+0.03  2011-12 vs. 2010-11
-0.10  2010-11 vs. 2009-10
-0.27  2009-10 vs. 2008-09
+0.04  2008-09 vs. 2007-08

The simple average of all those numbers is +0.02, which, of course, is almost indistinguishable from zero. Even if you remove the first pair -- the 2014-15 stats are based on a small, season-to-date sample size -- it's only an average of +0.055.

(A better way to average them might be to square them (keeping the sign), average the squares, then take the square root. That gives +0.11 and +0.14, respectively. But I'll just use simple averages in this post, even though they're probably not right, because I'm not looking for exact answers.)
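Both averages can be checked directly. A short Python sketch, using the correlations listed above:

```python
rs = [-0.19, 0.30, 0.33, 0.03, -0.10, -0.27, 0.04]  # year-to-year correlations

def simple_avg(vals):
    return sum(vals) / len(vals)

def signed_rms(vals):
    # Square each r but keep its sign, average, then take the square root.
    m = sum(r * abs(r) for r in vals) / len(vals)
    return (abs(m) ** 0.5) * (1 if m >= 0 else -1)

print(round(simple_avg(rs), 2))      # 0.02  -- all seven pairs
print(round(simple_avg(rs[1:]), 3))  # 0.055 -- dropping 2014-15 vs 2013-14
print(round(signed_rms(rs), 2))      # 0.11
print(round(signed_rms(rs[1:]), 2))  # 0.14
```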

That does indeed suggest that SH% isn't that reliable -- after all, there were more negative seasons than strong positive ones.

But: what if we expand our sample size, by looking at the correlation between pairs that are TWO seasons apart? Different story, now:

+0.35   2014-15 vs. 2012-13
+0.12   2013-14 vs. 2011-12
+0.27   2012-13 vs. 2010-11
+0.12   2011-12 vs. 2009-10
+0.41   2010-11 vs. 2008-09
-0.03   2009-10 vs. 2007-08

These six seasons average +0.21, which ain't bad.


Part of the reason that the two-year correlations are high might be that team talent didn't change all that much in the seasons of the study. I checked the correlation between overall team talent in different seasons, as measured by Hockey-Reference's "SRS" rating. For 2008-09 vs. 2013-14, the correlation was +0.50.

And that's for FIVE seasons apart. So far, we've only looked at two seasons apart.

I chose 2008-09 because you'll notice the correlations that include 2007-08 are nothing special. That, I think, is because team talent changed significantly between 07-08 and 08-09. If I rerun the SRS correlation for 2007-08 vs. 2013-14 -- that is, going back only one additional year -- it drops from +0.50 to only +0.25.

On that basis, I'm arbitrarily deciding to drop 2007-08 from the rest of this post, since the SH% discussion is based on an assumption that team talent stays roughly level.


But even if team talent changed little since 2008-09, it still changed *some*. So, wouldn't you still expect the two-year correlations to be lower than the one-year correlations? There's still twice the change in talent, albeit twice a *small* change.

You can look at it a different way -- if A isn't strongly related to B, and B isn't strongly related to C, then how can A be strongly related to C?

Well, I think it's the other way around. It's not just that A *can* be strongly related to C. It's that, if there's really a signal within the noise, you should *expect* A to be strongly related to C.

Consider 2009-10. In that year, every team had a certain SH% talent. Because of randomness, the set of 30 observed team SH% numbers varied from the true talent. The same would be true, of course, for the two surrounding seasons, 2008-09, and 2010-11.

But both those surrounding seasons had a substantial negative correlation with the middle season. That suggests that for each of those surrounding seasons, their luck varied from the middle season in the "opposite" way. Otherwise, the correlation wouldn't be negative.

For instance, maybe in the middle season, the Original Six teams were lucky, and the other 24 teams were unlucky. The two negative correlations with the surrounding seasons suggest that in each of those seasons, maybe it was the other way around, that the Original Six were unlucky, and the rest lucky.

Since the surrounding seasons both had opposite luck to the middle season, they're likely to have had similar luck to each other. 

In this case, they are. The A-to-B correlation is -0.27. The B-to-C correlation is -0.10. But the A-to-C correlation is +0.41. Positive, and quite large.

-0.10   2010-11 (A) vs. 2009-10 (B)
-0.27   2009-10 (B) vs. 2008-09 (C)
+0.41   2010-11 (A) vs. 2008-09 (C)


This should be true even if SH% is all random -- that is, even if all teams have the same talent. The logic still holds: if A correlates to B the same way C correlates to B ... that means A and C are likely to be somewhat similar. 

I ran a series of three-season simulations, where all 30 teams were equal in talent. When both A and C had a similar, significant correlation to B (same sign, both above +/- 0.20), their correlation with each other averaged +0.06. 
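Here's a rough re-creation of that kind of simulation -- a Python sketch using Gaussian noise instead of binomial shooting, so the exact conditional average can differ somewhat from the +0.06 in my original run:

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
ac_vals = []
while len(ac_vals) < 1000:
    # Three independent "seasons" for 30 equal-talent teams: pure noise.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    c = [random.gauss(0, 1) for _ in range(30)]
    r_ab, r_bc = corr(a, b), corr(b, c)
    # Keep only cases where A-B and B-C are both "large" and same-signed.
    if abs(r_ab) > 0.20 and abs(r_bc) > 0.20 and r_ab * r_bc > 0:
        ac_vals.append(corr(a, c))

mean_ac = sum(ac_vals) / len(ac_vals)
print(round(mean_ac, 2))  # clearly positive, roughly +0.05 to +0.10
```

The point is qualitative: conditioning on A-B and B-C being strong and same-signed forces the A-C correlation to be clearly positive on average, even though all three seasons are pure noise.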

In our case, we didn't get +0.06. We got something much bigger: +0.41. That's because the underlying real-life talent correlation isn't actually zero, as it was in the simulation. A couple of studies suggested it was around +0.15. 

So, the A-B was actually -0.25 "correlation points", relative to the trend: -0.10 relative to zero, plus -0.15 below typical. (I'm sure that isn't the way to do it statistically -- it's not perfectly additive like that -- but I'm just illustrating the point.)  Similarly, the B-C was actually -0.42 points.

Those corrected values are much larger, so the effect is stronger. When I limited the simulation sample so both A-B and B-C had to be bigger than +/- 0.25, the average A-C correlation almost tripled, to +0.16. 

Add that +0.16 to the underlying +0.15, and you get +0.31. Still not the +0.41 from real life, but close enough, considering the assumptions I made and shortcuts I took.


Since we have six seasons with stable team talent, we don't have to stop at two-season gaps ... we can go all the way to five-season gaps, and pair every season with every other season. Here are the results:

         14-15  13-14  12-13  11-12  10-11  09-10  08-09  
14-15           -0.19  +0.35  +0.20  +0.15  +0.46  -0.07  
13-14    -0.19         +0.30  +0.12  +0.27  -0.07  +0.42  
12-13    +0.35  +0.30         +0.33  +0.27  +0.24  +0.26  
11-12    +0.20  +0.12  +0.33         +0.03  +0.12  -0.08  
10-11    +0.15  +0.27  +0.27  +0.03         -0.10  +0.41  
09-10    +0.46  -0.07  +0.24  +0.12  -0.10         -0.27  
08-09    -0.07  +0.42  +0.26  -0.08  +0.41  -0.27

The average of all these numbers is ... +0.15, which is exactly what the other studies averaged out to. That's coincidence ... they used a different set of pairs, they didn't limit the sample to tie scores, and 14-15 hadn't existed yet. (Besides, I think if you did the math, you'd find you wanted the root of the average r-squared, which would be significantly higher than  +0.15.)

Going back to the A-B-C thing ... you'll find it still holds. If you look for cases where A-B and B-C are both significantly below the 0.15 average, A-C will be high. (Look in the same row or column for two low numbers.)  

For instance, in the 14-15 row, 13-14 and 08-09 are both negative. Look for the intersection of 13-14 and 08-09. As predicted, the correlation there is very high -- +0.42. 

By similar logic, if you find cases where A-B and B-C go in different directions -- one much higher than 0.15, the other much lower -- then, A-C should be low.

For instance, in the second row, 09-10 is -0.07, but 08-09 is +0.42. The prediction is that the intersection of 09-10 and 08-09 should be low -- and it is, -0.27.


Look at 2012-13. It has a strong positive correlation with every other season in the sample. Because of that, I originally guessed that 2012-13 is the most "normal" of all the seasons, the one where teams most played to their overall talent. In other words, I guessed that 2012-13 was the one with the least luck.

But, when I calculated the SDs of the 30 teams for each season ... 2012-13 was the *highest*, not the lowest. By far! And that's even adjusting for the short season. In fact, all the full seasons had a team SD of 1.00 percentage points or lower -- except that one, which was at the adjusted equivalent of 1.23.

What's going on?

Well, I think it's this: in 2012-13, instead of luck mixing up the differences in team talent, it exaggerated them. In other words: that year, the good teams got lucky, and the bad teams got unlucky. In 2012-13, the millionaires won most of the lotteries.

That kept the *order* of the teams the same -- which means that 2012-13 wound up the most exaggeratedly representative of teams' true talent.

Whether that's right or not, it seems that two things should be true:

-- With all the high correlations, 2012-13 should be a good indicator of actual talent over the seven-year span; and

-- Since we found that talent was stable, we should get good results if we add up all six years for each team, as if it was one season with six times as many games.* 

*Actually, about five times, since there are two short seasons in the sample -- 2012-13, and 2014-15, which is less than half over as I write this.

Well, I checked, and ... both guesses were correct.

I checked the correlation between 2012-13 vs. the sum of the other five seasons (not including the current 2014-15). It was roughly +0.54. That's really big. But, there's actually no value in that ... it was cherry-picked in retrospect. Still, it's just something I found interesting, that for a statistic that is said to have so little signal, a shortened season can still have a +0.54 correlation with the average of five other years!

As for the six-season averages ... those DO have value. Last post, when we tried to get an estimate of the SD of team talent in SH% ... we got imaginary numbers! Now, we can get a better answer. Here's the Palmer/Tango method for the 30 teams' six-year totals:

SD(observed) = 0.543 percentage points
SD(luck)     = 0.463 percentage points
SD(talent)   = 0.284 percentage points

That 0.28 percentage points has to be an underestimate. As explained in the previous post, the "all shots are the same" binomial luck estimate is necessarily too high. If we drop it by 9 percent, as we did earlier, we get

SD(observed) = 0.543 percentage points
SD(luck)     = 0.421 percentage points
SD(talent)   = 0.343 percentage points
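Both tables come from the usual variance decomposition, SD(talent)^2 = SD(observed)^2 - SD(luck)^2. A quick Python check:

```python
def sd_talent(sd_observed, sd_luck):
    # var(talent) = var(observed) - var(luck), assuming talent and luck
    # are independent.
    return (sd_observed ** 2 - sd_luck ** 2) ** 0.5

print(round(sd_talent(0.543, 0.463), 3))  # 0.284 -- binomial luck estimate
print(round(sd_talent(0.543, 0.421), 3))  # 0.343 -- luck reduced 9 percent
```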

We also need to bump it for the fact that this is the talent distribution for a six-season span -- which is necessarily tighter than a one-season distribution (since teams tend to regress to the mean over time, even slightly). But I don't know how much to bump, so I'll just leave it where it is.

That 0.34 points is almost exactly what we got last post. Which makes sense -- all we did was multiply our sample size by five. 

The real difference, though, is the credibility of the estimate. Last post, it was completely dependent on our guess that the binomial SD(luck) was 9 percent too high. The difference between guessing and not guessing was huge -- 0.34 points, versus zero points.  In effect, without guessing, we couldn't prove there was any talent at all!

But now, we do have evidence of talent ... and guessing adds only around 0.06 points. If you refuse to allow a guess of how shots vary in quality ... well, you still have evidence, without guessing at all, that teams must vary in talent with an SD of at least 0.284 percentage points.


Saturday, December 13, 2014

True talent levels for NHL team shooting percentage

How much of the difference in team shooting percentage (SH%) is luck, and how much is talent? That seems like it should be pretty easy to figure out, using the usual Palmer/Tango method.


Let's start with the binomial randomness inherent in shooting. 

In 5-on-5 tied situations in 2013-14 (the dataset I'm using for this entire post), the average team took 721 shots. At an 8 percent SH%, one SD of binomial luck is

The square root of (0.08 * 0.92 / 721)

... which is almost exactly one percentage point.
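Spelled out, as a trivial Python check:

```python
shots = 721
sh_pct = 0.08

# Binomial SD of a proportion: sqrt(p * (1 - p) / n).
sd_luck = (sh_pct * (1 - sh_pct) / shots) ** 0.5
print(round(100 * sd_luck, 2))  # ≈ 1.01 percentage points
```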

That's a lot. It would move an average team about 10 positions up or down in the standings -- say, from 7.50 (16th) to 8.50 (4th). 

If you want to compare that to Corsi ... for CF% (defined as a team's share of all the shot attempts, for and against, in its games), the SD due to binomial luck is also (coincidentally) about one percentage point. That would take a 50.0% team to 51.0%, which is only maybe three places in the standings.

That's one reason that SH% isn't as reliable an indicator as Corsi: a run of luck can make you look like the best or worst in the league in that category, instead of just moving you a few spots.


If we just go to the data and observe the SD of actual team SH%, it also comes out to about one percentage point. 


So, plugging into

SD(talent)^2 = SD(observed)^2 - SD(luck)^2

we get

SD(talent)^2 = 1.0 - 1.0 

Which equals zero. And so it appears there's no variance in talent at all -- that SH% is, indeed, completely random!

But ... not necessarily. For two reasons.


First, SD(observed) is itself random, based on what happened in the 2013-14 season. We got a value of around 1.00, but it could be that the "true" value, the average we'd get if we re-ran the season an infinite number of times, is different. 

How much different could it be? I wrote a simulation to check. I ran 5,000 seasons of 30 teams, each with 700 shots and a shooting percentage of 8.00 percent. 

As expected, the average of those 5,000 SDs was around 1.00. But the 5,000 values varied with an SD of 0.133 percentage points. (Yes, that's the SD of a set of 5,000 SDs.)  So the standard 95% confidence interval gives us a range of (0.83, 1.27). 
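A version of that simulation -- a Python/NumPy sketch, not my original code -- looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# 5,000 simulated seasons: 30 equal teams, 700 shots each at 8 percent.
goals = rng.binomial(n=700, p=0.08, size=(5000, 30))
sh_pct = 100 * goals / 700                # team SH%, in percentage points

# The observed spread of the 30 teams, for each simulated season.
season_sds = sh_pct.std(axis=1, ddof=1)

print(round(season_sds.mean(), 2))  # ≈ 1.02 -- the average observed SD
print(round(season_sds.std(), 2))   # ≈ 0.13 -- the "SD of the SDs"
```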

That doesn't look like it would make a whole lot of difference in our talent estimate ... but it does. 

At the top end of the confidence interval, an observed SD of 1.27, we'd get

SD(talent) squared  = 1.27 squared - 1.00 squared 
                    = 0.78 squared

That would put the SD of talent at 0.78 percentage points, instead of zero. That's a huge difference numerically, and a huge difference in how we think of SH% talent. Without the confidence interval, it looks like SH% talent doesn't exist at all. With the confidence interval, not only does it appear to exist, but we see it could be substantial.

Why is the range so wide? It's because the observed spread isn't much different from the binomial luck. In this case, they're identical, at 1.00 each. In other situations or other sports, they're farther apart. In MLB team wins, the SD of actual wins is almost double the theoretical SD from luck. In the NHL, it's about one-and-a-half times as big. In the NBA ... not sure; it's probably triple, or more. 

If you have a sport where the range of talent is bigger than the range of luck, your SD will be at least 1.4 times as big as you'd see otherwise -- and 1.4 times is a significant enough signal to not be buried in the noise. But if the range of talent is only, say, 40% as large as the range of luck, your expected SD will be only 1.077 times as big -- that is, only eight percent larger. And that's easy to miss in all the random noise.
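The two numbers in that paragraph are just this square-root arithmetic:

```python
def observed_inflation(talent_to_luck_ratio):
    # SD(observed) / SD(luck), assuming independent talent and luck:
    # sqrt(var(luck) + var(talent)) / sqrt(var(luck))
    return (1 + talent_to_luck_ratio ** 2) ** 0.5

print(round(observed_inflation(1.0), 2))  # 1.41  -- talent as big as luck
print(round(observed_inflation(0.4), 3))  # 1.077 -- talent only 40% as big
```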


Can we narrow down the estimate with more seasons of data? 

For 2011-12, SD(observed) was 0.966, which actually gives an imaginary number for SD(talent) -- the square root of a negative estimate of var(talent). In other words, the teams were closer than we'd expect them to be even if they were all identical! 

For 2010-11, SD(observed) was 0.88, which is even worse. In 2009-10, it was 1.105. Well, that works: it suggests SD(talent) = 0.47 percentage points. For 2008-09, it's back to imaginary numbers, with SD(observed) = 0.93. (Actually, even 2013-14 gave a negative estimate ... I've been saying SD(luck) and SD(observed) were both 1.00, but they were really 1.01 and 0.99, respectively.)

Out of five seasons, we get four impossible situations: the teams were closer together than we'd expect even if they were identical!

That might be random. It might be something wrong with our assumption that talent and luck are independent. Or, it might be that there's something else wrong. 

I think it's that "something else". I think we're not using a good enough assumption about shot types.


Our binomial luck calculation assumed that all the shots were the same, that every shot had an identical 8% chance of becoming a goal. If you use a more realistic assumption, the effects of luck come out lower.

The typical team in the dataset scored about 56 goals. If that's 700 shots at 8 percent, the luck SD is 1 percent, as we found. But suppose those 56 goals come from a combination of high-probability shots and low-probability shots, like this:

 5 goals =   5 shots at 100% 
15 goals =  30 shots at  50%
30 goals = 300 shots at  10%
 6 goals = 365 shots at   1.64%
56 goals = 700 shots at   8%

If you do it that way, the luck SD drops from 1.0% to 0.91%.
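Checking that 0.91 from the shot mix above (a Python sketch; the last row of the table is the total, so it's excluded):

```python
# Shot mix from the table: (number of shots, scoring probability).
shot_mix = [(5, 1.00), (30, 0.50), (300, 0.10), (365, 0.0164)]
total_shots = sum(n for n, _ in shot_mix)  # 700

# Goals are a sum of independent Bernoulli trials, so variances add:
# var = sum of n * p * (1 - p) over the shot types.
var_goals = sum(n * p * (1 - p) for n, p in shot_mix)
sd_sh_pct = 100 * var_goals ** 0.5 / total_shots

print(round(sd_sh_pct, 2))  # ≈ 0.91 percentage points
```

The sure-thing shots contribute no variance at all, which is why the mixed assumption always produces less luck than "all shots the same."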

And that makes a big difference. 1.00 squared minus 0.91 squared is around 0.4 squared. Which means: if that pattern of shots is correct, then the SD of team SH% talent is 0.4 points. 

That's pretty meaningful, about five places in the standings.

I'm not saying that shot pattern is accurate... it's a drastic oversimplification. But "all shots the same" is also an oversimplification, and the one that gives you the most luck. Any other pattern will have less randomness. 

What is actually the right pattern? I have no idea. But if you find one that splits the difference, where the luck SD drops only to 0.95% or something ... you'll still get SD(talent) around 0.35 percentage points, which is still meaningfully different from zero.

(UPDATE: Tango did something similar to this for baseball defense, to avoid a too-high estimate for variation in how teams convert balls-in-play to outs.  He describes it here.)


What's right? Zero? 0.35? 0.78? We could use some other kinds of evidence. Here's some other data that could help, from the hockey research community.

These two studies, that I pointed to in an earlier post, found year-to-year SH% correlations in the neighborhood of 0.15. Since the observed SD is about 1.0, that would put the talent SD in the range of 0.15. That seems reasonable, and consistent with the confidence intervals we just saw and the guesses we just made.

Var(talent) for Corsi doesn't have these problems, so it's easy to figure. If you assume a game's number of shots is constant, and binomial luck applies to whether those shots are for or against -- not a perfect model, but close enough -- the estimate of SD(talent) is around 4 percentage points.

Converting that to goals:

-- one talent SD in SH% =  1 goal
-- one talent SD in CF% = 10 goals

So, Corsi is 10 times as useful to know as SH%! Well, that might be a bit misleading: CF% is based on both offense and defense, while SH% is offense only. So the apples-to-apples ratio is probably more like 5 times. 

Still: Corsi talent dwarfs SH% talent when it comes to predicting future performance, by a weighting of 5 to 1. No wonder Corsi is so much more predictive!

Either way, it doesn't mean that SH% is meaningless. This analysis suggests that teams who have a very high SH% are demonstrating a couple of 5-on-5 tied goals worth of talent. (And, of course, a proportionate number of other goals in other situations.)


And, if I'm not mistaken ... again coincidentally, one point of CF% is worth the same, in terms of what it tells you about a team's talent, as one point of SH%. (Of course, a point of SH% is much harder to achieve -- only a few teams are as much as 1 point of SH% above or below average, while almost every team is more than a point of CF% away from 50.0%.)

So, instead of using Corsi alone ... just add CF% and SH%! That only works in 5-on-5 tied situations -- otherwise, it's ruined by score effects. But I wouldn't put too much trust in any shot study that doesn't adjust for score effects, anyway.


I started thinking about this after the shortened 2012-13 season, when the Toronto Maple Leafs had an absurdly high SH% in 5-on-5 tied situations (10.82, best in the league), but an absurdly low CF% (43.8%, second worst to Edmonton).

My argument is: if you're trying to project the Leafs' scoring talent, you can't just use the Corsi and ignore the SH%. If the Leafs are 2 points above average in SH%, that tells you as much about their talent as two Corsi points. Instead of projecting the Leafs to score like a 43.8% Corsi team, you have to project them to score like, maybe, a 45.8% team. Which means that instead of second worst, maybe they're only fifth- or sixth-worst.

That's almost exactly what I estimated a year ago, based on a completely different method and set of assumptions. Neither analysis is perfect, and there's still lots of randomness in the data and uncertainty in the assumptions ... but, still, it's nice to see the results kind of confirmed.
