## Monday, January 30, 2012

### Do NHL teams get a boost after killing a two-man advantage?

In an OHL game I was watching the other day, one of the teams had a two-man advantage and didn't score. The announcer expected the shorthanded team to get a boost from having killed off the penalties, as conventional wisdom says they should.

Is conventional wisdom right? Now that I have access to a database of NHL games (thanks again to the Hockey Summary Project), I was able to check.

This study is basically the same format as the study I did on fights a few weeks back. I found all games from 1967-68 to 1984-85 where one team killed off a two-man advantage (of any length). Then, for each one, I found a random control game that matched the score differential and the relative quality of the home and road teams. When I was done, I had two pools, each consisting of 1,703 games.

The teams that killed the penalties scored an average 0.26 more goals than their opponents from that point to the end of the game (actually, to the 17:00 mark of the third period). On the other hand, the control team scored only 0.12 more goals than their opponents.

That's statistically significant, at almost exactly 2 SDs.

I'll put that in chart form to make it easier to read, along with the SD. I use the term "killing teams" to mean the ones that actually killed off the two-man advantage.

Killing teams .... +0.26 goals (+/- 0.05)
Control teams .... +0.12 goals (+/- 0.05)
------------------------------------------
Difference ....... +0.14 goals (+/- 0.07)
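
For anyone who wants to check the bottom line of that chart, here's a minimal Python sketch. It assumes the two pool averages are independent, so their SDs combine as the square root of the sum of squares, and it uses the rounded figures from the chart. The function name is made up for illustration.

```python
import math

def sd_of_difference(sd_a, sd_b):
    """SD of (a - b), assuming a and b are independent estimates."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2)

killing_mean, killing_sd = 0.26, 0.05
control_mean, control_sd = 0.12, 0.05

diff = killing_mean - control_mean
diff_sd = sd_of_difference(killing_sd, control_sd)
z = diff / diff_sd  # how many SDs the difference sits from zero

print(f"difference: {diff:+.2f} goals, SD {diff_sd:.2f}, about {z:.1f} SDs from zero")
```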

At six goals per win, you'd have expected the extra goals to have resulted in around 40 extra wins. They actually resulted in the equivalent of 26 extra wins: 30 extra wins, minus half a win for each of the 8 fewer ties:

Killing teams .... 836-604-263
Control teams .... 806-626-271
------------------------------------
Difference ....... +30 wins, -8 ties
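
To redo the goals-to-wins conversion, here's a small Python sketch. It takes the six-goals-per-win rule of thumb at face value and assumes a tie counts as half a win. The helper name is just for illustration, and the records are the W-L-T lines from the chart.

```python
GOALS_PER_WIN = 6  # rule of thumb used in the post

def win_equivalents(record_a, record_b):
    """Compare two (W, L, T) records, counting each tie as half a win."""
    wins = record_a[0] - record_b[0]
    ties = record_a[2] - record_b[2]
    return wins, ties, wins + 0.5 * ties

games = 1703
goal_edge_per_game = 0.14

expected_wins = goal_edge_per_game * games / GOALS_PER_WIN
wins, ties, equivalent = win_equivalents((836, 604, 263), (806, 626, 271))

print(f"expected from goals: {expected_wins:.1f} wins")
print(f"observed: {wins:+d} wins, {ties:+d} ties, {equivalent:+.1f} win-equivalents")
```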

So, should we conclude that killing off a two-man advantage causes a psychological boost? Well, not so fast. Because, after you take two consecutive penalties, the referee is very likely to try to even things up by giving future penalties to the other team.

The difference of +0.14 goals is almost exactly what you'd get from a single power play. So, if the result of surviving a two-man advantage is that you get one extra "free" power play in the remainder of the game, that would explain the results exactly.

As it turns out, it's not quite that high. It's only half that high. On average, the teams that survived being shorthanded two men got about half an extra power play in the remainder of the game:

Killing teams ... +.346 power plays rest of game
Control teams ... -.130 power plays rest of game
------------------------------------------------
Difference ...... +.476 power plays rest of game

That leaves about 0.07 goals per game as the unexplained difference. It's only 1 SD, which is no longer statistically significant. It's about the effect of half a power play. Or, with an average save percentage of .900, it works out to 7/10 of an additional shot on goal.
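
The arithmetic behind that paragraph, as a rough Python sketch. It assumes a power play is worth about 0.14 goals (the figure used above for "a single power play") and that every shot has the same chance of going in, one minus the save percentage. The variable names are mine.

```python
GOALS_PER_POWER_PLAY = 0.14  # the post's value for one power play
SAVE_PERCENTAGE = 0.900      # average save percentage quoted in the post

observed_goal_edge = 0.14
extra_power_plays = 0.346 - (-0.130)  # killing teams minus control teams

explained = extra_power_plays * GOALS_PER_POWER_PLAY
unexplained = observed_goal_edge - explained
shots_equivalent = unexplained / (1 - SAVE_PERCENTAGE)

print(f"explained by power plays: {explained:.2f} goals")
print(f"unexplained: {unexplained:.2f} goals, or {shots_equivalent:.1f} shots on goal")
```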

--------

We can also handle the penalty issue another way. We can insist that when we choose a control game for the real game, we make sure the control team was the one that took the last penalty. That way, we'd expect some of the referee "evening up" difference to disappear. Perhaps not all of it, because a two-man advantage isn't the same as a one-man advantage -- but at least part of it.

The additional restriction reduced the sample size to 1,662 games; for the remaining 41 games, I couldn't find a suitable control.

As it turns out, the goal difference stays about the same, even though the penalty difference is significantly reduced:

Killing teams ... +0.25 goals (+/- 0.05)
Control teams ... +0.08 goals (+/- 0.05)
----------------------------------------
Difference ...... +0.17 goals (+/- 0.07)

Killing teams ... +.340 power plays rest of game
Control teams ... +.032 power plays rest of game
------------------------------------------------
Difference ...... +.308 power plays rest of game

The difference of .308 power plays accounts for around .04 goals of the observed .17 difference. That leaves .13, which is a little less than 2 SD from zero. Not statistically significant, but close. (Technically, it's even less than that, because the control games aren't completely independent. Also, when I ran the study a second time, I got +0.10 goals instead of +0.08, which lowers the difference. So think of the 1.9 SD as probably a bit too high.)

Strangely, though, there wasn't as much difference in game results; only the equivalent of 13.5 wins:

Killing teams ... 815-591-256
Control teams ... 807-610-245
------------------------------
Difference: +8 wins, +11 ties

Again at six goals per win, you'd expect 47 wins, not 13.5. What happened?

Well, it turns out that the "killing" teams spent a lot of their goals winning blowouts. For instance, in games won by six goals or more, they were 81-34. The control group was only 73-51.

In those games, the difference was 12.5 wins. That normally "costs" 75 goals, but, for these games, the difference was really around 150 goals. So, that accounts for 75 of the 282 goal difference right there.
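
Here's the blowout arithmetic as a short Python sketch. It's only a check on the numbers above: the win edge is taken as half the change in wins minus losses, and the 150-goal figure is the approximate value quoted in the text, not something recomputed from game data.

```python
GOALS_PER_WIN = 6

killing_w, killing_l = 81, 34  # killing teams, games decided by 6+ goals
control_w, control_l = 73, 51  # control teams, same kind of game

# Half the difference in (wins - losses) between the two groups.
win_edge = ((killing_w - killing_l) - (control_w - control_l)) / 2

usual_goal_cost = win_edge * GOALS_PER_WIN
actual_goal_cost = 150  # approximate figure from the post
surplus_goals = actual_goal_cost - usual_goal_cost

print(f"win edge: {win_edge} wins, normally worth {usual_goal_cost:.0f} goals")
print(f"goals spent beyond that in blowouts: about {surplus_goals:.0f}")
```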

The "killing" group also "wasted" goals in the 3- and 4-goal games. That was offset by the opposite effect in five-goal games, but not by much.

------

If you recall, we found the same effect when we looked at fighting: teams that started a fight appeared to score more goals, but not necessarily win more games.

What connects the two studies is ... penalties. It could be that teams that get penalized a lot win a lot of blowouts. Not necessarily because of cause-and-effect, but because, between 1967 and 1984, certain teams just happened to be high in both categories.

Or, it could be coincidence. Or, it could be something else.

------

For my bottom line, I'd say: after killing off a two-man advantage, teams did appear to benefit by about 1/7 of a goal. Half of that can be traced to referees calling fewer penalties against them in the remainder of the game.

The other half is unknown. It's not statistically significant, so you have to give serious consideration to the idea that it's just coincidence ... but the teams *did* appear to benefit, by around 0.07 goals.

Historically, the average size of the "boost" in a team's play after a two-man kill has been small: the equivalent of less than a single shot on goal over the remainder of the game.


## Friday, January 20, 2012

### GiveWell: Overcomplicating research studies can cost lives

"GiveWell" is an organization that evaluates charities. Not just the usual things -- how well they're run, or how much money goes to administrative expenses -- but also how much good they do for the money they receive.

The idea is: if you have \$100 to give to try to make the world a better place, shouldn't you give that \$100 where it would give the most benefit? Not just to whoever shows up at your door that day, or whatever organization makes you feel guiltiest, or whoever's suffering kids look the cutest ... but, seriously, to where you can do the most good.

That might not appeal to everyone. If you donate to maximize your own good feelings, instead of the good your donation actually does, GiveWell's evaluations won't make much difference to you. Some people hate to say "no", and so they prefer to give \$5 to each of the twenty charities that ask for money. Some people prefer to give to diseases that killed their loved ones, or diseases associated with heroes like Terry Fox. Some people give to causes that signal their political views. Most people prefer to give to help people in their own city or country, even when their dollars will save many more lives abroad.

(I've done all these things, and I'm a bit embarrassed about some of them. But I'm not alone. I mean, people give money to the Children's Wish Foundation to send a terminally ill kid to Disneyland ... which is nice, but, that same amount of money might actually save ten lives if they sent it to Africa where kids are actually dying of things that are easily preventable. I'm not sure what's up with me, and my fellow humans, sometimes. But I digress.)

So, in at least one sense, GiveWell is to donors what sabermetrics is to Joe Morgan. It does analysis to reach conclusions that some might find uncomfortable.

However, in another sense, what GiveWell does is *unlike* sabermetrics, in that it usually doesn't try to get down to the third decimal place. It argues that it can evaluate charities heuristically, that the differences are big enough that they can figure out which charities are the best, using the charities' own reports. As I interpret what they're saying, GiveWell can very easily tell you whether a charity is a Danny Ainge or an Albert Pujols, and it can even tell you more subtle things, like whether a charity is a Joe Carter or an Albert Pujols. But it doesn't try to figure out if a charity is a Ryan Braun or an Albert Pujols. It will just tell you that both are recommended.

That is, GiveWell argues that its goals are better met by the transparency of its recommendations than by any detailed, opaque analyses.

Which is almost exactly what I argued in one of my recent posts -- that, in research, simplicity and transparency are more important than rigor. Simple studies make it much easier to understand the results and catch the inevitable errors. A gentleman from GiveWell, Elie Hassenfeld, read that post, and pointed me to a particular example of a serious error that his organization uncovered.

(Disclaimer: I don't really know much about GiveWell. However, I've been impressed by what I've seen, and at least two of the blogs I read and respect (here's one) say very good things about them. So my Bayesian evaluation of them is quite high.)

-----

As I said, GiveWell doesn't believe they need detailed statistical cost/benefit studies to decide which charities to recommend. However, charities themselves often use such analyses to decide where the money should be spent. There's a whole bunch of organizations and academics devoted to figuring out how to save the most lives for the fewest dollars.

With that objective, the Bill and Melinda Gates Foundation donated \$3.5 million to fund a study, "Disease Control Priorities in Developing Countries". They published a report ranking various interventions on cost-effectiveness. The Gates Foundation didn't do that itself -- it was done jointly by The World Bank, the National Institutes of Health, the World Health Organization, and the Population Reference Bureau. Those sound like heavyweights in the world health field.

The results found that -- unsurprisingly to me -- hygiene promotion was the cheapest way to reduce death and disease. The second cheapest, though, was deworming. Specifically, "soil-transmitted helminth" (STH) deworming treatments.

After the report was released, the Gates Foundation provided another \$4.4 million to promote the findings. And the findings did indeed attract serious attention. GiveWell writes,

The DCP2’s cost-effectiveness estimates for deworming have been cited widely to advocate a greater focus on treating STH infections, including in:

-- an article in The Lancet

-- a report by REACH, a consortium of large international NGOs and other organizations working to end child hunger, which labeled deworming one of 11 “promoted interventions”

-- the most-cited paper published in the journal International Health

-- an editorial by Peter Hotez, a co-founder of the Global Network for Neglected Tropical Diseases, which has received more than \$40 million in funding from the Gates Foundation

-- work by charity evaluators, such as GiveWell, Giving What We Can, and the University of Pennsylvania’s Center for High Impact Philanthropy.

But, as GiveWell later discovered, it turns out the STH estimate was wrong.

That doesn't sound too serious, but here's the thing: it's not just that the estimate was wrong. It was wrong by a factor of almost ONE HUNDRED. The study said that you could save one "disability-adjusted life year" by spending \$3.41 on deworming treatments. But, after correcting for the (acknowledged) errors in the study, the actual number was \$326.43.

All these well-respected organizations, with serious researchers and serious money, wound up promoting a conclusion that was about as wrong as it could have been. Until the error was caught, then, effectively, 99% of the money devoted to STH treatment was wasted.
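
The "factor of almost one hundred" and the "99%" are two views of the same ratio. A quick Python sketch, assuming "wasted" means the shortfall against the benefit the original estimate promised per dollar:

```python
claimed_cost = 3.41      # dollars per DALY, as originally published
corrected_cost = 326.43  # dollars per DALY, after the corrections

overstatement = corrected_cost / claimed_cost
share_delivered = claimed_cost / corrected_cost  # benefit per dollar vs. the claim

print(f"cost-effectiveness overstated by a factor of about {overstatement:.0f}")
print(f"benefit delivered: about {share_delivered:.0%} of what the estimate implied")
```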

How did GiveWell catch the error? Subject matter expertise, mostly. In reading the report, they noticed that the STH estimate was much, much lower than other estimates they had seen. Instead of just assuming that this research was somehow better than the previous studies, they investigated.

That seems like just common sense, right? If you see a study that says an iPod can be bought for \$3, when you know it usually costs \$300, you should look again, shouldn't you? But that didn't happen until someone at GiveWell decided to figure out what was going on.

So they wrote to one researcher, who sent them to other researchers, who sent them complicated spreadsheets. They tried to figure those out, but they couldn't, so they wrote back and forth with questions and explanations. They were referred to still another researcher, who sent them a copy of yet another study that was the source of some of the data.

Eventually, they figured out where the issues were ... if you want a full explanation, it's in their post. It was a lot of detailed, technical effort to figure out what went wrong, and which parameters were in error.

GiveWell's conclusions:

We believe that the errors we’ve found in the estimate would have been caught by a helminth expert independently examining the estimate. Therefore, the presence of these errors implies to us that there has been no such examination. If this is the case, it would argue against the reliability of the DCP2’s estimates in general.

We’ve previously argued for a limited role for cost-effectiveness estimates; we now think that the appropriate role may be even more limited, at least for opaque estimates (e.g., estimates published without the details necessary for others to independently examine them) like the DCP2’s.

More generally, we see this case as a general argument for expecting transparency, rather than taking recommendations on trust - no matter how pedigreed the people making the recommendations. Note that the DCP2 was published by the Disease Control Priorities Project, a joint enterprise of The World Bank, the National Institutes of Health, the World Health Organization, and the Population Reference Bureau, which was funded primarily by a \$3.5 million grant from the Gates Foundation. The DCP2 chapter on helminth infections, which contains the \$3.41/DALY estimate, has 18 authors, including many of the world’s foremost experts on soil-transmitted helminths.

Absolutely right. You can't substitute credentials for subject matter expertise, and you can't substitute complexity for transparency.

And, one thing I would add: when a study appears to discover that you can get benefits at 99% off the original, well-accepted price ... you have to be suspicious about accepting that conclusion, even if you have no other reason to believe there was any mistake.

-----

P.S. GiveWell expands on the theme here.


## Sunday, January 15, 2012

### Are more NHL penalties called in back-to-back games?

In a comment to one of the posts on "make-up" penalties, J.-P. Martel wrote,

"... blow-outs can easily lead to situations that get out of hand, so referees may call penalties on the leading team so that the trailing team still thinks it has a chance to come back, rather than resort to fighting to "prepare" the next game between the two teams.

Actually, you may want to check penalties in the second half of the third period when the teams' next game is (or may be, depending on outcome) against each other (particularly in the playoffs), as opposed to when it's not."

So I did. And, J.-P. is right, it looks like there's something there.

I found all cases from 1967-68 to 1984-85 where teams played back-to-back games (regular season only). Then, I formed three groups:

-- first game of back-to-back games
-- second game of back-to-back games
-- other games that year between those two teams

It turns out that, overall, there are more penalties than usual in the first game, and fewer penalties than usual in the second game:

First game .... 12.36
Second game ... 10.87
Other games ... 11.77

Broken down by periods:

-------------- Gm 1 --- Gm 2 --- Other
--------------------------------------
Period 1 ..... 4.78 ... 3.98 ... 4.37
Period 2 ..... 4.25 ... 3.75 ... 4.12
Period 3 ..... 3.32 ... 3.13 ... 3.26
--------------------------------------
Total ....... 12.36 .. 10.87 .. 11.77

So: there's 0.6 extra penalties in the first game, and 0.9 fewer penalties in the second game.

I thought the second game would be dirtier because the players are holding recent grudges from the previous game, but the numbers show the opposite. The players seem to be more aggressive early, rather than late. In fact, more than half the "first game" effect happens in the first period. By contrast, a large "second game" effect seems to last two periods rather than one.

Most of the differences are statistically significant, which suggests that they're all real. For those scoring at home, here are the standard errors:

-------------- Gm 1 --- Gm 2 --- Other
--------------------------------------
Period 1 ..... 0.16 ... 0.12 ... 0.06
Period 2 ..... 0.13 ... 0.11 ... 0.06
Period 3 ..... 0.15 ... 0.12 ... 0.07
--------------------------------------
Total ........ 0.30 ... 0.24 ... 0.13
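
If you'd rather see those as "how many standard errors apart," here's a rough Python sketch that compares each back-to-back column to the "other games" column. It assumes the three groups are independent samples, and it works from the rounded figures in the two charts, so treat the output as approximate.

```python
import math

means = {
    "Period 1": (4.78, 3.98, 4.37),
    "Period 2": (4.25, 3.75, 4.12),
    "Period 3": (3.32, 3.13, 3.26),
    "Total":    (12.36, 10.87, 11.77),
}
errors = {
    "Period 1": (0.16, 0.12, 0.06),
    "Period 2": (0.13, 0.11, 0.06),
    "Period 3": (0.15, 0.12, 0.07),
    "Total":    (0.30, 0.24, 0.13),
}

def z_score(mean_a, se_a, mean_b, se_b):
    """Difference of two independent means, in units of its standard error."""
    return (mean_a - mean_b) / math.hypot(se_a, se_b)

z_scores = {}
for row, (gm1, gm2, other) in means.items():
    se1, se2, se_other = errors[row]
    z_scores[row] = (
        z_score(gm1, se1, other, se_other),  # first game vs. other games
        z_score(gm2, se2, other, se_other),  # second game vs. other games
    )
    print(f"{row:9s} Gm 1: {z_scores[row][0]:+.1f}   Gm 2: {z_scores[row][1]:+.1f}")
```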

Finally, coming back to J.-P.'s hypothesis about the second half of the third period of the first game, here are the numbers:

First game .... 1.76
Second game ... 1.63
Other games ... 1.67

So, yes, there's a small effect where, when the teams are going to meet again next game, the referee calls more penalties than normal in the last ten minutes of the third period. Whether that's because of the referee, or the players, we can't tell.

Taken alone, these differences aren't statistically significant. But, considering they match the pattern, and the broader picture is statistically significant, we can be fairly confident that this is a real effect we're seeing.

That's actually why I saved J.-P.'s scenario for last, so I could first show that the effect is probably real and not just random.

-----

UPDATE, 1/15/2012:

Technical note: the "other games" rows and columns are weighted by games, rather than matchups. Suppose teams A and B had back-to-back games, and so did C and D. But A and B met only 2 other times that year, while C and D met 4 other times. That means that C/D will be overrepresented in the "other games" column.

If I reweight that column so A/B and C/D get equal weight, the results change just a little bit. These are the revised "other" columns:

Overall ...... 11.48 (was 11.77)
1st period .... 4.24 (was 4.37)
2nd period .... 4.07 (was 4.12)
3rd period .... 3.15 (was 3.26)
Last 10 min ... 1.59 (was 1.67)
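
To make the weighting issue concrete, here's a toy Python example. The penalty counts are invented purely for illustration (they are not from the database); only the "2 other games versus 4 other games" setup comes from the note above.

```python
# Hypothetical penalties per "other" game, keyed by matchup (made-up numbers).
other_games = {
    ("A", "B"): [10, 12],          # A and B met 2 other times
    ("C", "D"): [14, 13, 15, 14],  # C and D met 4 other times
}

# Weighted by games: pool every game, so C/D counts twice as much as A/B.
pooled = [p for games in other_games.values() for p in games]
by_games = sum(pooled) / len(pooled)

# Weighted by matchups: average each matchup first, then average the averages.
matchup_means = [sum(games) / len(games) for games in other_games.values()]
by_matchups = sum(matchup_means) / len(matchup_means)

print(f"weighted by games:    {by_games:.2f}")
print(f"weighted by matchups: {by_matchups:.2f}")
```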


## Friday, January 13, 2012

### Do hockey fights lift a team's performance? Part II

The previous post was a study on NHL fights. It found that, generally, a fight doesn't help the team that it's sometimes said to help (the team that's behind in the game, for instance), but in one particular case, MAYBE it did. That was the case where:

(a) one team was behind in the game
(b) that team fought more regularly than the other team, and
(c) the player fighting also fought more often than the other team's fighter.

In that situation, that team appeared to benefit by around 0.13 goals, as compared to a similar team that didn't fight. That was about the same as one extra power play.

However, the result was not statistically significant, being only 1 SD away from zero. Still, I left it at least a little bit open whether the effect *might* be real.

Tom Tango is more skeptical than that:

It’s not monkeys at a typewriter creating Shakespeare, but it’s close.

Well, I have some more evidence that supports that point of view.

I repeated the study 27 times, to get a larger sample of random control games. (I didn't pick the number 27 beforehand; I just ran the thing over and over until I got sick of it.) Here's the average of those 27 runs:

Actual teams .... -0.18 goals
Control teams ... -0.29 goals
------------------------------
Difference ...... +0.11 goals

To remind you what this means: the fighting team meeting the conditions was outscored by its opponent by 0.18 goals over the rest of the game. On the other hand, the control teams, which were selected randomly from games which matched as closely as possible (except for the fight), got outscored by 0.29 goals.

So, it looks like the team that fought gained 0.11 goals per game. As I said, that result is not statistically significant.

But now, here's the new thing. Even though the fighting team gained 0.11 goals, it actually lost more games. Here are the records, in W-L-T format:

Actual teams .... 52-274-38
Control teams ... 49-267-48
----------------------------
Difference ...... -2 wins

So, even though the fighting teams did better on the scoreboard, they did worse in terms of winning games. Actually, they won three extra games, but they lost seven more and tied 10 fewer. That adds up to minus four points in the standings, which is why I write "-2 wins". (I'm ignoring the "pity point" for an overtime loss.)

You wouldn't expect this to happen, that you score more goals but lose more games. The better your goal differential, the better your outcomes should be. I think I saw Gabriel Desjardins write, somewhere, that six goals equals one win. The observed difference of +0.11 goals per game, over 364 games, equals around 40 goals, which is almost seven wins.

But instead of winning seven extra games, the fighting teams *lost* two extra games.
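
Both halves of that comparison can be reproduced in a few lines of Python. This sketch assumes the old points system (two points for a win, one for a tie, no overtime-loss point) and the six-goals-per-win rule of thumb. The function name is mine.

```python
GOALS_PER_WIN = 6

def standings_points(record):
    """Points for a (W, L, T) record: 2 per win, 1 per tie."""
    wins, _losses, ties = record
    return 2 * wins + ties

actual = (52, 274, 38)
control = (49, 267, 48)

point_diff = standings_points(actual) - standings_points(control)
win_equivalents = point_diff / 2

games = sum(actual)
expected_goals = 0.11 * games
expected_wins = expected_goals / GOALS_PER_WIN

print(f"standings: {point_diff:+d} points, or {win_equivalents:+.0f} wins")
print(f"scoreboard: {expected_goals:.0f} goals, or about {expected_wins:.1f} wins")
```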

Why did this happen? I think it's just luck, well within the bounds of random error. I think the +0.11 goals per game is random chance, I think the -2 wins is random chance, and I think the discrepancy between the two results is also random chance.

In any case, if you don't like all this talk of significance levels and randomness, you can just summarize like this: overall, the teams that fought wound up very slightly better on the scoreboard, but very slightly worse in the standings.


## Tuesday, January 10, 2012

### Do hockey fights lift a team's performance?

It's been said that when an NHL team needs a lift, a fight can jolt it out of its complacency and make it better. And, just a few days ago, the media cited a study by researcher Terry Appleby, of powerscouthockey.com, showing that momentum (in terms of shots on goal) usually increases for at least one team after a fight.

But, if *either* team can benefit from a fight, what's the point? You want to know if *your* team can benefit from a fight, at least more than the other team does.

The problem is: how can you know that? A fight involves both teams, so if it helps one, it hurts the other by the same amount. If you look at both teams, you'll always find the total effect to be zero.

So, the "fighting helps a team" theory has to say *which* team is helped. The most logical interpretation would be that the fight helps the team that instigated it.

If you're going to study that, you need to know which team is the instigating team. That's tough to figure out from historical data. But, one shortcut would be to assume that the team that generally gets involved in more fights is the team that's more likely to have instigated. The 1974-75 Philadelphia Flyers took 76 fighting penalties (actually, 76 offsetting majors, which I used as a proxy for fights). That same season, the expansion Kansas City Scouts took only 19. It seems fair to assume that if a fight broke out at a Flyers/Scouts game, it was the Flyers who were likely responsible.

On that assumption, I decided to check.

Using data from the Hockey Summary Project, I looked at fights between 1967-68 and 1984-85, and checked to see how the more-likely-to-fight team did in the remainder of the game. Then, I found a control game to match it with. The result was two large groups of games, which could then be compared.

I'll give you an example of how the controls were found.

On February 16, 1969, the Bruins played the Black Hawks at Chicago Stadium. Just as the first period ended, with the score 2-0 Chicago, the Bruins' Don Awrey got into a fight against Stan Mikita of the Hawks.

I looked for a game to serve as the control for that Boston/Chicago game. What I wanted was:

1. A game in the same season, the season before, or the season after;
2. ... where the home team had the same size lead at that same time of the game;
3. ... and where the two teams were of roughly similar relative quality.

#1 and #2 were non-negotiable (except that all differences of 4 or more goals were considered the same). But, for #3, the quality only had to be close, within two goals (which I'll explain in a minute).

I started pulling random games until I found one that matched all three requirements. In this particular case, the control wound up being the Bruins vs. Rangers game of February 23, 1969.

That game qualifies under the rules because

1. 1968-69 is in the same season as the original;
2. That game had the home team also leading by two goals at 20:00 of the first period, and
3. The two sets of teams are of similar relative quality.

Now, let me explain #3.

In 1968-69, the Bruins were +82 in goal differential (303 goals for, 221 against). The Black Hawks were +34 (280-246). So, for the original game, the home team was 48 goals worse than the visiting team.

Since the control game was the same year, the Bruins were still +82. The Rangers were +35 (231-196). So, in the control game, the home team was 47 goals worse than the visiting team.

Since "47 goals worse" is within two goals of "48 goals worse," that's close enough for the Bruins/Rangers game to serve as a control. If it hadn't been within two goals -- which is most of the time -- that game wouldn't have qualified under #3, and I would have tried another random game. (If there were absolutely no games that qualified under #3, I would have taken the one where the team quality was closest in goals. If none of the random games had qualified under #1 and #2, I would have thrown the original game out of the study -- but that never happened.)
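
The three matching rules can be sketched in Python roughly like this. It's an illustration of the procedure as described, not the code used for the study: the `Snapshot` type, its field names, and `find_control` are all hypothetical, and it assumes every candidate has already been evaluated at the same game clock as the real game. Apart from the Boston games, the entries in the pool are made up.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    season: int       # first year of the season, e.g. 1968 for 1968-69
    home_lead: int    # home goals minus road goals at the chosen game time
    quality_gap: int  # home team's season goal differential minus the road team's

def capped(lead, cap=4):
    """Leads of `cap` or more goals are all treated as the same size."""
    return max(-cap, min(cap, lead))

def find_control(target, pool, rng, tolerance=2):
    """Pick a random control for `target`; None if rules 1 and 2 can't be met."""
    eligible = [g for g in pool
                if abs(g.season - target.season) <= 1
                and capped(g.home_lead) == capped(target.home_lead)]
    if not eligible:
        return None
    rng.shuffle(eligible)  # random order stands in for "pulling random games"
    for game in eligible:
        if abs(game.quality_gap - target.quality_gap) <= tolerance:
            return game
    # Nothing within tolerance: fall back to the closest quality match.
    return min(eligible, key=lambda g: abs(g.quality_gap - target.quality_gap))

# Boston at Chicago: home team up 2-0, and 48 goals worse on the season.
target = Snapshot(season=1968, home_lead=2, quality_gap=-48)
pool = [
    Snapshot(1968, 2, -47),  # Boston at New York, the control named in the post
    Snapshot(1968, 2, 10),   # made-up game: right score, wrong quality gap
    Snapshot(1972, 2, -48),  # made-up game: too many seasons away
    Snapshot(1968, 1, -48),  # made-up game: wrong lead
]
control = find_control(target, pool, random.Random(0))
print(control)
```

Shuffling the eligible games and taking the first one within tolerance is equivalent to drawing random games until one fits, without ever drawing the same game twice.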

OK, so now we have our real game, and our control game.

Which team in our real game are we going to expect to have gotten the "lift" from the fight? In 1968-69, the Bruins had 41 fights, but the Black Hawks had only 20. So, the assumption is that the fight was more the work of the Bruins, and they should be the ones expected to benefit.

How did the Bruins do in the rest of the game relative to the Black Hawks?

Well, the final score was 5-1 Hawks. Since it was 2-0 at the time of the fight, that means the Black Hawks outscored the Bruins 3-1 in the remainder of the game. In other words, a "minus 2" goal differential for the visiting Bruins. (I excluded any goals in the last three minutes of the third period, to make sure empty-net goals didn't screw things up.)

What about the control game? That game actually wound up 9-0 Rangers, which means 7-0 Rangers from the fight to the end of the game. Since the "real" game was relative to the Bruins, the visiting team, we also want to express the control game from the standpoint of the visiting team. So that's "minus 7".

So, our score so far is:

Actual games: -2.0 goal differential for the fighting team

Control group: -7.0 goal differential for the control team

So far, it looks like fighting helps, by five goals a fight!
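
Here's that one-game comparison as a tiny Python sketch. The function name is hypothetical, and it assumes the final scores passed in already leave out any goals from the last three minutes of the third period.

```python
def rest_of_game_margin(final_for, final_against, then_for, then_against):
    """Goal differential from the moment of the fight to the end of the game."""
    return (final_for - then_for) - (final_against - then_against)

# Both games from the visiting Bruins' point of view, trailing 2-0 at the time.
real_game = rest_of_game_margin(final_for=1, final_against=5, then_for=0, then_against=2)
control_game = rest_of_game_margin(final_for=0, final_against=9, then_for=0, then_against=2)

apparent_lift = real_game - control_game

print(f"real game: {real_game:+d}, control game: {control_game:+d}, lift: {apparent_lift:+d}")
```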

Of course, that's only one game. I repeated this process for every fight from 1967-68 to 1984-85. Actually, not *every* fight. First, I included only fights where one team appeared to be significantly more aggressive than the other (specifically, where the two teams were 10 or more fighting penalties apart for the season). Second, I included only first- or second-period fights, to increase the amount of time for the "lift" effect to make itself felt.

Even with those restrictions, there were 2,834 fights total. The results:

Fighting teams ... -0.04 goals
Control group .... -0.02 goals

The team with more fights was 0.04 goals worse than the other team over the remainder of the game. It "should have" been 0.02 goals worse. (Both numbers are negative probably because the teams that got in more fights were slightly worse teams overall than their opponents.)

So, there seems to be a small, negative effect: a team loses one additional goal for every 50 fights. But, that difference isn't even close to statistically significant. It's less than one SD from zero. (The two individual SDs are about 0.04 each, so the SD of the difference is around 0.06.)

Conclusion: it doesn't appear that fighting helps a team.

-----

Maybe a difference of 10 fights a year isn't enough to separate the two teams? I redid the study, but required the teams to be 20 fighting penalties apart. That reduced the sample size to 1,581 in each group. The results were about the same (the +/- in parentheses is the standard error):

Fighting teams .... 0.00 goals (+/- 0.05)
Control group .... -0.03 goals (+/- 0.06)

-----

Looking at the entire database, I found that the average fight starts with a goal differential of 1.617. The average goal differential in all other games, weighted by the times of fights, is 1.421 goals. So, it seems like fights start when the game is a little more lopsided than usual.

So, maybe it's the team that's *trailing* that starts the fight, in an effort to wake itself up. Maybe we should look at trailing teams, not goonier teams.

I tried that. I threw away all situations where the score was tied when the fight happened, and looked at all the rest. The results:

2,941 datapoints
----------------
Trailing teams ... -0.20 goals (+/- 0.04)

Control group .... -0.19 goals (+/- 0.04)

Again, no real difference.

-----

Trying again, but looking only at fights where one team was trailing by at least three goals:

591 datapoints
--------------
Trailing teams ... -0.25 goals (+/- 0.08)

Control group .... -0.29 goals (+/- 0.08)

Nothing there, either.

-----

Is it possible that the benefit accrues only to GOOD teams trailing by three goals? Those are the teams playing the worst relative to their abilities, so the "wake up" effect should be strongest. Here are teams trailing by 3 goals that were at least +30 in goal differential for the season:

122 datapoints
--------------
Trailing teams ... +0.14 goals (+/- 0.17)

Control group .... +0.17 goals (+/- 0.16)

Nope. What if we look at good teams trailing by *any* number of goals?

841 datapoints
--------------
Trailing teams ... +0.40 goals (+/- 0.07)

Control group .... +0.26 goals (+/- 0.07)

Aha! This time, there's a small "lift" effect, at about 1.4 SD. But, why would there be an effect for teams trailing by 1 goal, but not for teams trailing by 3 goals?

I got curious and ran the same study again, and this time the random control group came in at +0.33, bringing the difference down to 1.0 SD. (Of course, it's not appropriate to dismiss the first result just because the second one came out less extreme.)

-----

At this point, you might reasonably argue that the rules "team with more majors that year" and "team trailing in the game" are not precise enough in selecting teams that started the fight. So, this time, I assumed the fight was started by the *player* with the most majors that season, rather than the *team* with the most majors that season. So when the goon of a pacifist team starts a fight with a pacifist of a goon team, you go with the goon player on the pacifist team. The results:

4,185 datapoints
----------------
Goonier Player ... +0.01 goals (+/- 0.03)

Control Group .... -0.02 goals (+/- 0.03)

Again, less than 1 SD difference. There's not much difference between this "goon player" breakdown and the previous "goon team" breakdown, probably because most of the goonier players also played on goonier teams. But it was worth a try.

-----

Finally, one last try. For this run, I combined all three criteria. To be included in the study:

(a) one team had to have at least 20 more majors for the season than its opponent;
(b) that same team's fighter had to have more majors that year than his opponent; and
(c) that same team had to be trailing in the game.
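The three criteria amount to a simple filter. The sketch below is only an illustration: the `Fight` record and its field names are hypothetical, not the layout of the actual database.

```python
from dataclasses import dataclass

@dataclass
class Fight:
    # Hypothetical fields, all from the point of view of the candidate team
    team_majors: int         # team's majors for the season
    opp_team_majors: int     # opponent's majors for the season
    fighter_majors: int      # majors that year for the team's fighter
    opp_fighter_majors: int  # majors that year for the opposing fighter
    goal_diff: int           # team's goals minus opponent's goals at fight time

def qualifies(f: Fight) -> bool:
    """True when the fight meets criteria (a), (b) and (c)."""
    return (
        f.team_majors - f.opp_team_majors >= 20      # (a) much goonier team
        and f.fighter_majors > f.opp_fighter_majors  # (b) goonier fighter
        and f.goal_diff < 0                          # (c) trailing in the game
    )

print(qualifies(Fight(80, 55, 12, 3, -1)))
```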

This *has* to work, right? I mean, that pushes all the right buttons: a truculent team, with a fighter selected for that purpose, behind in the game and likely to be needing a lift. If *those* teams don't benefit from the fight, then who would?

I expected the same non-result, but, this time, we get the biggest effect so far:

364 datapoints
--------------
Teams qualifying ... -0.18 goals (+/- 0.10)

Control group ...... -0.38 goals (+/- 0.11)

There's a difference of .20 goals -- almost a fifth of a goal per fight! Taken at face value, that means that when a team like that starts a fight, it benefits by even more than a power play (which has a 15 to 20 percent success rate).

That difference is still only about 1.4 SD from zero. Still, I hate to just dismiss it. I've always thought that if you get a result that's significant in the real-world (hockey) sense, but it's not statistically significant, that's a problem with your study -- it's just that you haven't used enough data to be able to prove anything. We should still be open to the possibility that the effect might be real.

I ran it a few more times, to check if maybe the control group was just a random outlier. The extra results:

Control group: -0.26 goals
Control group: -0.21 goals
Control group: -0.27 goals
Control group: -0.38 goals (again)
Control group: -0.38 goals (again)
Control group: -0.29 goals

So, the original run was a little extreme, but not much.

There are, however, some mitigating factors. First, the control group numbers aren't all independent, since there's a limited number of control games to choose randomly from. Second, we obviously can't do extra runs to reduce random chance in the *real* games, but it's still possible those teams scored more goals for random reasons having nothing to do with any lift they got from the fight. Third, the SDs of both groups are a bit understated: I calculated them based on the assumption that games are independent, but they're not -- a real game appears in the study multiple times, once for each fight, and a control game could get randomly selected more than once, too.

If you average the seven control groups in the seven repetitions of the study, you get -0.31 goals. That's 0.13 goals worse than the actual games. Taking into account the fact that we ran the control group seven times, the 0.13 difference is now around 1 SD.
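The averaging is easy to reproduce. This sketch takes the original control run (-0.38) plus the six extra runs listed above, and compares their mean with the -0.18 for the qualifying teams:

```python
# Original control run followed by the six reruns
control_runs = [-0.38, -0.26, -0.21, -0.27, -0.38, -0.38, -0.29]
actual = -0.18  # teams qualifying under (a), (b) and (c)

control_mean = sum(control_runs) / len(control_runs)
gap = actual - control_mean

print(round(control_mean, 2), round(gap, 2))
```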

Oh, and this is as good a time as any to emphasize that I could also have screwed up somewhere ... I've already had to rerun everything once when I found a misplaced parenthesis in my code.

-----

So, I guess, our overall conclusion from this study isn't completely certain. We wind up with a summary like:

1. The effect doesn't seem to exist for run-of-the-mill fights.
2. When a goon fighter on a goon team fights when his team is down, it seems to benefit that team by 1/8 of a goal, or a bit less than a normal power play.
3. But, that effect isn't statistically significant, so we have some doubts that it's real.
4. And, with only 364 such datapoints qualifying out of around 5,000, only a small percentage of fights match the criterion for that kind of boost.

If you had to reduce that to one line, it might be:

At best, there might be a small effect in certain specific circumstances ... but much, much less than sportscasters make it out to be.

UPDATE: Part 2 is here.


## Monday, January 09, 2012

### Do NHL referees call "make up" penalties? Part IV

A couple of links to other similar studies on penalty-calling:

1. Commenter Jack linked to this article with some basketball foul-calling data. Turns out the more consecutive fouls against one team, the more likely the next will go to the other team.

2. Another reader pointed me to a 2009 hockey study (web version here, PDF here) by Jack Brimberg and William J. Hurley. They looked at the first three penalties of every game, and found results similar to what I found.


## Saturday, January 07, 2012

### Ken Dryden tracers from "The Game"

In my review of Ken Dryden's book "The Game," I listed seven of the details that I tried to trace. I now have an eighth one, and then a retrace of the second one.

These two updates were originally posted to the SIHR mailing list. The original seven are here. Thanks again to the Hockey Summary Project for the data making these tracers possible.

----

8. Here's Dryden, from page 121 of my edition:

"A few months ago, we played the Colorado Rockies at the Forum. Early in the game, I missed an easy shot from the blueline, and a little unnerved, for the next fifty minutes I juggled long shots, and allowed big rebounds for three additional goals. After each Rockies goal, the team would put on a brief spurt and score quickly, and so with only minutes remaining, the game was tied. Then the Rockies scored again, this time a long, sharp-angled shot that squirted through my legs. The game had seemed finally lost. But in the last three minutes, Lapointe scored, then Lafleur, and we won 6-5. Alone in the dressing room later .... I just sat there, unable to understand why I felt the way I did. Only slowly did it come to me: I had been irrelevant; I couldn't even lose the game."

In Dryden's career, I found 12 Canadiens games against Colorado. Montreal won 11 of them and tied one. But none of them was by a score of 6-5.

Montreal did not have any 6-5 wins at all in 1978-79 (when the book is set), or in the previous two seasons. In Dryden's entire career, I found only two 6-5 Montreal wins where he was in net.

One was February 12, 1972, against the Kings. The narrative doesn't match. In that game, the Habs led 6-3, and then the Kings scored two late goals.

The other was November 16, 1972. In that game, Dryden was replaced by Michel Plasse after the first period, so that doesn't match either.

So, I extended the search to look for all games where Dryden gave up 5+ goals, but the Canadiens won anyway. There were five such games:

February 12, 1972, 6-5 against LA (described above)
November 22, 1974, 7-6 against Kansas City
February 18, 1976, 7-5 against Toronto
November 21, 1976, 9-5 against Toronto
December 23, 1977, 7-5 against New York Islanders.

None of the games match exactly, but the Kansas City game is the best candidate:

-- It was against the team that eventually became the Rockies;

-- The opposition scored late (14:49 of the third), and the Habs won it later (17:53);

-- It was a one-goal game;

-- Dryden probably didn't play great (6 goals on 21 shots, seven per period);

-- The last five goals alternated by team.

But other things don't match:

-- The Scouts tied it late, not took the lead late;

-- The Habs scored one goal to win, not two;

-- The goal was scored by Doug Risebrough, not Lapointe or Lafleur -- in fact, neither Lapointe nor Lafleur scored at all that game;

-- The early goals didn't alternate (The goal sequence was kMMMMkkkMkMkM);

-- The game was in Kansas City, not Montreal;

-- The game happened several years previously, rather than months.

I checked the Globe and Mail recap for that game ... it was just a couple of paragraphs long, and didn't mention Dryden at all, or how the Kansas City goals went in. I don't have online access to any Montreal newspapers to get a more detailed game story.

-----

2. Number 2 in my blog post listed a game against Toronto. Dryden writes that the Leafs tie the game early, get confident that they can keep up with the Canadiens, and begin to take over the play. But Mark Napier and Pat Hughes score two quick goals for the Habs. The Canadiens score two more, and then the Leafs get two late. The next day, the players wonder why coach Scotty Bowman didn't give them hell for allowing those two late goals.

It all adds up to 6-4. There was no 6-4 win in Toronto in 1978-79.

So, I looked for other games that might match.

There were only two games during Dryden's career where Napier and Hughes both scored.

One was November 15, 1978, a 6-1 win over Colorado. That doesn't match.

But the game of January 17, 1979 is probably it. It matches, though not exactly:

-- It's in 1978-79, the season Dryden was writing about.

-- It was against Los Angeles, not Toronto, and home, not road.

-- Although the Kings scored to make it 2-1 at 8:43 of the second period, they never tied it up.

-- After the Kings' goal to make it 2-1, Napier and Hughes scored to make it 4-1.

-- After that, the Habs scored three more goals (not two): Houle, then Napier and Hughes again.

-- After that, the Kings got their two late goals.

-- The final was 7-3, not 6-4.

But, I think, pretty close anyway.


## Friday, January 06, 2012

### Do NHL referees call "make up" penalties? Part III

The last two posts talked about how NHL referees are more likely to "even-up" their calls, issuing the next penalty to the opposing team 60% of the time.

This post, I'll show a regression I ran to quantify the effect a bit better. If you're not interested, just skip the technical parts (smaller font). If you're *really* not interested, you can probably just skip this entire post, since the results are pretty much the same as shown in the previous posts.

---

Technical notes 1:

Even though we're interested in whether the referee called the next penalty to the "other" team, I set up the regression to predict whether the referee called the next penalty to the *home* team. That just makes everything easier to interpret, but, as I'll describe, it still lets us estimate the "even-up" effect.

In the study, I ignored all misconduct penalties, all first penalties of the game, and all penalties where the other team had a player called at the same time. (I treated those penalties as if they didn't exist, so they didn't interrupt "consecutiveness" of the two surrounding penalties.)

Non-dummy variables I used: Time gone in game. Time since last penalty.

Dummy variables I used: PP goal on last penalty. SH goal on last penalty. Home team lead, from -3 to +3, except 0 (that is, six dummy variables), where anything more or less than 3 goals was coded as 3. All eight of the previous variables interacted with "whether the last penalty was to the home team." And, of course, the dummy for "whether the last penalty was to the home team" itself.

The regression shows the home team percentage diminishes during the game, by about 1 percentage point per period. In all the numbers in this post, I just used the beginning of the game. If you want to adjust to the middle of the game (the middle of the second period), subtract about 1.5 percent from each "home team" percentage (or add 1.5 percent to each "visiting team" percentage).

Also, the regression says you have to subtract about 1 percentage point for every 20 minutes since the last penalty. I didn't bother for this post. If you assume penalties are usually around 5 minutes apart, feel free to subtract 0.25 percentage points from each of the "home team" percentages.

Those two time adjustments won't affect the "even-up" numbers, just the raw percentages of home team penalties.
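As a rough check on those two time adjustments, the sketch below multiplies the per-second coefficients from Technical notes 2 by the elapsed seconds. The helper name `pct_points` is invented for illustration.

```python
# Per-second coefficients, as listed in Technical notes 2
PER_GAME_SECOND = -8.15e-06        # each second elapsed in the game
PER_SECOND_SINCE_LAST = -7.39e-06  # each second since the last penalty

def pct_points(coef_per_second, seconds):
    """Convert a per-second coefficient into percentage points."""
    return coef_per_second * seconds * 100

# Middle of the second period is 30 minutes into the game
mid_second_period = pct_points(PER_GAME_SECOND, 30 * 60)
# A typical gap of about 5 minutes between penalties
five_minute_gap = pct_points(PER_SECOND_SINCE_LAST, 5 * 60)

print(round(mid_second_period, 1), round(five_minute_gap, 2))
```

The first comes out near the 1.5 points quoted above; the second is a little under the 0.25 you get from the rounded "1 point per 20 minutes" rule.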

-----

OK, first, let me show you the percentage of penalties taken by the home team, by game score. Clearly, teams are more likely to take penalties when they're ahead in the game.

After the visiting team took the last penalty, the home team took:

46.3% when down by 3+
49.2% when down by 2
53.1% when down by 1
61.2% when tied
64.3% when up by 1
67.9% when up by 2
66.8% when up by 3+

And after the home team took the last penalty, the home team took:

34.5% when down by 3+
32.8% when down by 2
32.3% when down by 1
35.0% when tied
42.6% when up by 1
47.6% when up by 2
52.9% when up by 3+

Obviously, you can do this for visiting teams just by subtracting all the percentages from 100.

-------

Now, we can calculate the "even-up" effect as the difference between the lines of the two tables. When the score was tied, the home team took:

61.2 percent after visiting team penalty
35.0 percent after home team penalty
-------------------------------------------------
26.2 percent difference

You can convert to visiting team just by subtracting the first two numbers from 100%. The difference has to come out the same. I'll do that anyway. When the score was tied, the visiting team took:

38.8 percent after visiting team penalty
65.0 percent after home team penalty
-----------------------------------------------------
26.2 percent difference

It turns out the "even-up" difference is highest for tie games. Here's the full breakdown:

11.8 percent difference down by 3+
16.4 percent difference down by 2
20.7 percent difference down by 1
26.2 percent difference tied
21.7 percent difference up by 1
20.3 percent difference up by 2
13.9 percent difference up by 3+
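The breakdown can be rebuilt by subtracting the two tables row by row. This is just arithmetic on the published percentages, so a row can be off by 0.1 from the list above because the inputs are already rounded (the "down by 1" row comes out as 20.8 here rather than 20.7).

```python
# Home-team share of the next penalty, by score, from the two tables above
after_visitor_penalty = {
    "down 3+": 46.3, "down 2": 49.2, "down 1": 53.1, "tied": 61.2,
    "up 1": 64.3, "up 2": 67.9, "up 3+": 66.8,
}
after_home_penalty = {
    "down 3+": 34.5, "down 2": 32.8, "down 1": 32.3, "tied": 35.0,
    "up 1": 42.6, "up 2": 47.6, "up 3+": 52.9,
}

even_up = {
    score: round(after_visitor_penalty[score] - after_home_penalty[score], 1)
    for score in after_visitor_penalty
}

for score, diff in even_up.items():
    print(f"{diff:4.1f} percent difference {score}")
```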

------

Tango suggested there might be an extra "compassion" effect when the team scores a PP or SH goal. He seems to have been right. The effect is small, relative to the overall effect, but still enough to affect the games:

-- If the home team scored a PPG on the previous penalty, add 1.2 percentage points to the above differences.

-- If the visiting team scored a PPG on the previous penalty, subtract 3.2 percentage points from the above differences.

-- If the home team scored a SHG on the previous penalty, add 2.8 percentage points to the above differences.

-- If the visiting team scored a SHG on the previous penalty, subtract 0.7 percentage points from the above differences.

The PPG numbers are statistically significant. The SHG numbers aren't, but they go in the right direction and are about the right magnitude, so I think it's reasonable to consider them as decent estimates.

------

So, as I promised in the second paragraph: the results of the regression seem to match what we found in the previous posts.

------

Technical notes 2:

For full disclosure, here are the coefficients for all the variables in the regression. I'll present them in "here's how to calculate the percentage of home-team penalties" format. (If you prefer a table, the full computer output is here (pdf). You'll be able to tell what the variables represent by matching the coefficients to what's below.)

Add -0.2619 if the home team took the last penalty.

Add -8.15E-06 for each second that's passed in the game. (About -.01 per period.)
Add -7.39E-06 for each second that's passed since the last penalty (not significant, p=.104, but magnitude is reasonable and has the right sign).

Add -0.1483 if the home team is down by 3 or more goals.
Add +0.1436 if the home team is down by 3+ and also took the last penalty.

Add -0.1196 if the home team is down by exactly 2 goals.
Add +0.0983 if the home team is down 2 and also took the last penalty.

Add -0.0807 if the home team is down by 1 goal.
Add +0.0545 if the home team is down 1 and also took the last penalty.

Add +0.0309 if the home team is up by 1 goal.
Add +0.0450 if the home team is up 1 and also took the last penalty.

Add +0.0668 if the home team is up by 2 goals.
Add +0.0591 if the home team is up 2 and also took the last penalty.

Add +0.0559 if the home team is up by 3 or more goals.
Add +0.1228 if the home team is up 3+ and also took the last penalty.

Add +0.0115 if a PP goal was scored on the last penalty
Add -0.0438 if a PP goal was scored on the last penalty and that penalty was to the home team.

Add -0.0072 if a SH goal was scored on the last penalty
Add +0.0350 if a SH goal was scored on the last penalty and that penalty was to the home team.
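Here is one way to turn that list into a small calculator. It is a sketch, not the original regression code: it reads the "Add ..." lines as a simple additive model, and the intercept is an assumption, since it isn't listed above. The value 0.612 is inferred from the 61.2% figure for a tied game after a visiting-team penalty. The function names are made up.

```python
# ASSUMPTION: intercept is not listed in the post; inferred from the
# 61.2% tied-game figure after a visiting-team penalty.
INTERCEPT = 0.612
LAST_TO_HOME = -0.2619
PER_GAME_SECOND = -8.15e-06
PER_SECOND_SINCE_LAST = -7.39e-06

# Home lead (clamped to -3..+3) -> (main effect, extra if last penalty was to home)
LEAD = {
    -3: (-0.1483, 0.1436), -2: (-0.1196, 0.0983), -1: (-0.0807, 0.0545),
    0: (0.0, 0.0),
    1: (0.0309, 0.0450), 2: (0.0668, 0.0591), 3: (0.0559, 0.1228),
}
PP_GOAL = (0.0115, -0.0438)  # (main effect, extra if last penalty was to home)
SH_GOAL = (-0.0072, 0.0350)

def home_penalty_prob(lead, last_to_home, game_seconds=0,
                      seconds_since_last=0, pp_goal=False, sh_goal=False):
    """Estimated chance that the next penalty goes to the home team."""
    main, interaction = LEAD[max(-3, min(3, lead))]
    p = INTERCEPT + main
    p += PER_GAME_SECOND * game_seconds
    p += PER_SECOND_SINCE_LAST * seconds_since_last
    if last_to_home:
        p += LAST_TO_HOME + interaction
    if pp_goal:
        p += PP_GOAL[0] + (PP_GOAL[1] if last_to_home else 0.0)
    if sh_goal:
        p += SH_GOAL[0] + (SH_GOAL[1] if last_to_home else 0.0)
    return p

def even_up_gap(lead):
    """Even-up effect: home share after a visitor penalty minus after a home penalty."""
    return home_penalty_prob(lead, False) - home_penalty_prob(lead, True)

print(round(home_penalty_prob(0, True) * 100, 1))  # tied, last penalty to home
print(round(even_up_gap(-1) * 100, 1))             # home team down by 1
```

With the assumed intercept, this reproduces the two percentage tables and the even-up breakdown earlier in the post to within rounding.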

-----


## Tuesday, January 03, 2012

### Do NHL referees call "make up" penalties? Part II

Last post, I found that referees are likely to "even up" their penalty calls: they're around 50% more likely to give the other team the next power play than to give the same team two power plays in a row.

I wasn't convinced this is because of referee bias, or what Tango calls the "compassionate referee."

Tango suggested this experiment: check to see if a power play goal was scored on the first penalty. If the referee is indeed "compassionate" towards the other team, he should be more compassionate if the penalty actually cost them a goal, less so if there was no goal, and even less so if the penalized team *benefited* from the penalty by scoring shorthanded.

So I checked. I looked at all cases where there was a power play goal (PPG) on a first penalty, and then no more scoring until the next penalty was called. Indeed, that does appear to make the ref more compassionate.

After a penalty resulting in a PPG, the next penalty was of the "even-up" variety 65.9% of the time. That's higher than the overall rate of 59.7%. Repeating that in a better font:

65.9% after a PPG
59.7% overall rate

And, the same effect appears for shorthanded goals (SHG):

52.5% after an SHG
59.7% overall rate

It's a large effect, and exactly in the direction Tango predicted.
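One way to put those rates on the same footing as the "around 50% more likely" figure is to convert each even-up percentage into a ratio: how many even-up calls there are for every same-team call. A small sketch (the function name is invented):

```python
def even_up_ratio(pct_even_up):
    """Even-up calls per same-team call, given the even-up percentage."""
    return pct_even_up / (100.0 - pct_even_up)

overall = even_up_ratio(59.7)
after_ppg = even_up_ratio(65.9)
after_shg = even_up_ratio(52.5)

print(round(overall, 2), round(after_ppg, 2), round(after_shg, 2))
```

So the overall rate works out to roughly one and a half even-up calls per same-team call, closer to two after a PPG, and close to even after an SHG.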

-----

But wait! It might not be referee bias at all. Because, it turns out that teams with a lead take significantly more penalties than teams who are behind. For instance, when a penalty is called while you have a one-goal lead, there's a 55.2% chance the penalty goes against you (and so a 44.8% chance the penalty goes against the other team). Full chart:

55.2% of penalties to team leading by 1
58.2% of penalties to team leading by 2
59.0% of penalties to team leading by 3
59.4% of penalties to team leading by 4
59.7% of penalties to team leading by 5

So, the score effect could explain what we're seeing. After a power play goal, the team has a bigger lead (or smaller deficit) than before. That would make it likely to take more penalties in future, even if the referee wasn't compassionate at all.

(Of course, the score effect might itself be due to referee "compassion," but that's a whole other argument.)

Specifically: a power play goal makes the team 6 percentage points more likely to take the next penalty. But scoring ANY tiebreaking goal in the first period makes a team 5 percentage points more likely to take the next penalty. So how can we be sure there's a separate power-play effect, or how big it is?

-----

What might also complicate things is there's a "time of game" effect:

42,721 PPs came in the first period.
38,060 PPs came in the second period.
26,705 PPs came in the third period.

There are fewer penalties in the third period than in the first. Is that a separate period effect? It might be.
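To see the size of the drop, here is a quick sketch that turns the three counts into shares of all power plays. The second-period count is taken as 38,060.

```python
# Power plays by period (second-period count read as 38,060)
pp_by_period = {"first": 42721, "second": 38060, "third": 26705}

total = sum(pp_by_period.values())
share = {period: 100 * n / total for period, n in pp_by_period.items()}

# How many fewer power plays in the third period than in the first
third_vs_first_drop = 1 - pp_by_period["third"] / pp_by_period["first"]

for period, pct in share.items():
    print(f"{period}: {pct:.1f}%")
print(f"third period is down {third_vs_first_drop:.0%} from the first")
```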

Here's the score effect chart, again, but this time only for first-period penalties. The effect is more extreme than for the entire game:

55.6% of penalties to team leading by 1
60.1% of penalties to team leading by 2
61.0% of penalties to team leading by 3
66.5% of penalties to team leading by 4
58.7% of penalties to team leading by 5 (only 46 datapoints)

-----

It almost looks like we need a regression to sort all this out. But, wait! One more try before we turn to the dark side. Let's engineer a comparison where score and period won't screw things up.

I took every situation where:

1. It was the first period.

2. The game was tied at the time of the first penalty, and exactly one additional goal was scored before the second penalty.

3. The one extra goal was scored by the team that had the power play on the first penalty.

Then, I divided those situations into two groups.

The "Highest compassion" group is where the team scored the goal *on the power play*, presumably making the referee feel extra bad that he caused the goal. The "Average compassion" group is where the team scored the goal *after* the power play, and the referee's call wasn't the cause.

What percentage of the second penalties went to the other team?

Highest compassion: 71.6% (2163 datapoints)
Average compassion: 69.7% (1051 datapoints).

There's a small effect there, in the expected direction, of 1.9 percentage points. (That's less than 1 SD, so not statistically significant.)

Here's the same result, but the other way, where it's the originally-penalized team that scored before the next penalty. When that goal was scored shorthanded, we can call that "Lowest compassion". When it wasn't, it's again "Average compassion."

Again, what percentage of the time did the second penalty even things out?

Lowest compassion: 62.5% (253 datapoints)
Average compassion: 58.2% (1006 datapoints).

This time the effect goes the "wrong" way, but there's too little data to draw any conclusions.

Doing the same thing for the second period instead of the first, we find a larger difference, but still not statistically significant (1.4 SD):

Highest compassion: 70.4% (568 datapoints)
Average compassion: 65.7% (271 datapoints).

And the shorthanded case, which really has too small a sample to take seriously:

Lowest compassion: 51.8% (83 datapoints)
Average compassion: 48.9% (268 datapoints).
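The 1.4 SD figure for the second-period power-play comparison can be approximated with a standard two-proportion calculation. This is a sketch under the usual binomial assumption (each second penalty treated as an independent trial), which may not be exactly how the figure in the post was computed; the function name is made up.

```python
import math

def proportion_gap_in_sd(p1, n1, p2, n2):
    """Gap between two proportions, in units of the SD of that gap.

    Uses the unpooled binomial variance p * (1 - p) / n for each group.
    """
    var1 = p1 * (1 - p1) / n1
    var2 = p2 * (1 - p2) / n2
    return (p1 - p2) / math.sqrt(var1 + var2)

# Second period: highest compassion 70.4% (568) vs average compassion 65.7% (271)
z_second_period = proportion_gap_in_sd(0.704, 568, 0.657, 271)
print(round(z_second_period, 1))
```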

-----

So, in summary: yes, there appears to be weak evidence for a small "compassion effect."

In the previous post, I considered three hypotheses:

1. Referee bias
2. Penalized teams play more carefully after the penalty
3. Power play teams play more aggressively after the penalty

Here's a fourth one, a variation of one suggested by commenter Wexler in the previous post:

4. Referees like to let the players play, and dislike calling penalties. But, sometimes they have to assert themselves to make sure the game doesn't get out of hand. Sometimes they're a bit too late, and they have to call a penalty on something that wasn't a penalty two minutes ago. This sends a message to the players, "OK, enough."

That might be necessary, but is obviously unfair to the penalized team. And, so, the referees know they have to call a "make up" penalty on those particular calls. Both teams understand what's happening, and won't object to either that call or the subsequent call.

I don't know if #4 is plausible or not ... but one of my co-workers is a soccer referee, and it's consistent with what he says about having to keep the game under control before it's too late.