Monday, May 30, 2011

A new basketball free throw choking study

"Performance Under Pressure in the NBA," by Zheng Cao, Joseph Price, and Daniel F. Stone; Journal of Sports Economics, 12(3). Ungated copy here (pdf).


There's a new paper presenting evidence that NBA players choke when shooting free throws under pressure. The link to the paper is above; here's a newspaper article discussing some of the claims.

Using play-by-play data for eight seasons -- 2002-03 to 2009-10 -- the authors find that players' percentage on foul shots goes down significantly late in the game when they're behind. They present their results through a regression, but it's more obvious just by using their summary statistics. Let me show you the trend, which I got by doing some arithmetic on the cells in the paper's Table 1:

In the last 15 seconds of a game, foul shooters hit

.726 when down 1-4 points (922 attempts)
.784 when tied or up 0-4 points (5505)
.776 when up or down 5+ points (4510)

With 16-30 seconds left, foul shooters hit

.748 when down 1-4 points (727 attempts)
.775 when tied or up 0-4 points (2652)
.779 when up or down 5+ points (6174)

With 31-60 seconds left, foul shooters hit

.752 when down 1-4 points (922 attempts)
.742 when tied or up 0-4 points (1634)
.767 when up or down 5+ points (8969)

In all other situations, foul shooters hit about .751 regardless of score differential (400,000+ attempts).


Take a look at the first set of numbers, the "last 15 seconds" group. When down 1 to 4 points, it appears that shooters do indeed "choke," shooting almost 2.5 percentage points (.025) worse than normal. In 5+ point "blowout" situations late in the game, they shoot more than 2.5 percentage points *better* than normal.

But neither of these numbers is statistically significantly different from the overall average (which I'm guessing is about .751). The difference of .025 is about 1.7 SDs.

The real statistical significance comes when you compare the "down by 1-4" group, not to the average, but to the "5+ points" group. In that case, the difference is double: the "down 1-4" is .025 below average, and the "5+" group is .025 *above* average. The difference of .050 is now significant at about 3 SDs.
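Those SD figures are easy to reproduce by hand, treating each group's foul shots as independent binomial trials (a back-of-envelope sketch; the .751 baseline is the overall average from above):

```python
from math import sqrt

def se(p, n):
    # standard error of a binomial proportion
    return sqrt(p * (1 - p) / n)

# "down 1-4" group vs. the overall average of roughly .751
p_down, n_down = 0.726, 922
print((0.751 - p_down) / se(p_down, n_down))   # ~1.7 SDs

# "down 1-4" vs. the "5+ points" group (.776 on 4510 attempts)
p_blow, n_blow = 0.776, 4510
diff_se = sqrt(se(p_down, n_down) ** 2 + se(p_blow, n_blow) ** 2)
print((p_blow - p_down) / diff_se)             # ~3.1 SDs
```

The second comparison doubles the gap but only modestly increases the standard error (the "5+" group is much bigger), which is why it clears 3 SDs.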

UPDATE: The above paragraphs are incorrect in one respect. Dan Stone, one of the paper's authors, corrected me in the comments. What I didn't notice was that Table 1 provides the overall free throw percentage of each group. Those percentages are .779 (down 1-4 group), .795 (up 0-4 group), and .782 (5+ group). So the average for those particular players is more like .787 than .751. All three groups shot below expected, but the "down 1-4" group shot WAY below expected.

So the "down 1-4" group is, on its own, statistically significantly different from expected, without regard to the other two groups. My apologies for not noticing that earlier.

So, if you look only at the last 15 seconds of games, it looks like players down by 1-4 points choke significantly compared to players who are up or down by at least five points.

There are similar (but lesser) differences in the 16-30 seconds group, and the 31-60 seconds group. I haven't done the calculation, but I'm pretty sure you also get statistical significance if you combine the three groups, and compare the "down 1-4" to the "5+" group.


So that's what we're dealing with: when you compare "down 1-4 late in the game" to "up or down 5+ late in the game", the difference is big enough to constitute evidence of choking. The most obvious explanation is that the foul shooters in the two groups might be different. However, that can't be the case, because the authors controlled for who the shooter was, and the results were roughly the same. Indeed, they controlled for a lot of other stuff, too: whether the previous shot was made or not, which quarter it is, whether it's a single foul shot or multiple, and so on. But even after all those controls, the results are pretty much the way I described them above.

Again, I repeat: the authors (and the data) do NOT say that the "choke" group shoots significantly worse than average. They can only say that the "choke" group is significantly worse than one specific group of players: the "don't care" group, shooting late in the game when the result is pretty much already decided, with a gap of 5+ points.

But this fits in with the authors' thesis: that the higher the leverage of the situation, the more choking you see. They later break down the "5+" group into "5-10" and "11+", and they find that even that breakdown is consistent -- the 11+ group shoots better than the (slightly) higher leverage 5-10 group. Indeed, for most of the study, they compare to "11+" instead of "5+". For some of the regressions, they post two sets of results, one relative to the "5-10 points" group, and one relative to the "11+" group. The "11+" results are more extreme, of course.


As I said, the authors don't present the results the way I did above ... they have a big regression with lots of tables and results and such. The result that comes closest to what I did is the first column of their Table 5. There, they say something like,

"In the last 15 seconds of a game, a player down 1-4 points will shoot 5.8 percentage points (.058) worse than if the game were an 11+ point blowout. The SD of that is 2.1 points, so that's statistically significant at p=.01."


Oh, and I should mention that the authors did try to eliminate deliberate misses, by omitting the last foul shot of a set with 5 seconds or less to go. Also, they omitted all foul shots with less than 5 minutes to go in the game (except those in the last 15/30/60 seconds that they're dealing with). I have absolutely no idea why they did that.


Although the authors do mention the "down 1-4" effect above, it's almost in passing -- they spend most of their time breaking the effect down in a bunch of different ways.

The biggest effect they find is for this situation:

-- shooting a foul shot that's not the last of the set (that is, the first of two, or the first or second of three);
-- in the last 15 seconds of the game;
-- team down exactly one point.

compared to

-- shooting a foul shot that's not the last of the set (that is, the first of two, or the first or second of three);
-- in the last 15 seconds of the game;
-- score difference of 11+ points in either direction.

For that particular situation, the difference is a huge 10.8 percentage points (.108), significant at 2.5 SDs.

Also: change "down by one point" to "down by two points", and it's a 6.0 percentage point choke. Change "not the last of the set" to "IS the last of the set," and the choke effect is 6.6 points when down by 1, and 6.0 points when down by 2.

This highly specific stuff doesn't impress me that much ... if you look at enough individual cases, you're bound to find some effects that are bigger and some that are smaller. My guess is that the differences between the individual cases and the overall "down 1-4" case are probably random. However, the authors could counter with the argument that the biggest sub-effects were the ones they predicted -- the "down by 1" and "down by 2" case. On the other hand, late performance is actually *better* than blowouts when the score is tied (by around 0.2 points), a finding the authors say they didn't expect.

So my view is that maybe the "1-4 points" result is real, but I'm wary of the individual breakdowns. Especially this one. In this situation:

-- last 15 seconds of the game
-- for a visiting team
-- where the most recent foul shot was missed
-- down by 1-4 points

the player is 9.6 percentage points (.096) less likely to make the shot than

-- last 15 seconds of the game
-- for a visiting team
-- where the most recent foul shot was missed
-- score 11+ points in favor of either team.

Despite the large difference in basketball terms, this one's only significant at .05.


Another thing about the main finding is ... we actually already knew it. Last year, I wrote about a similar study (which the authors reference) that found roughly the same thing. Here, copied from that other post, are the numbers those researchers found, for all foul shots in the last minute of games, broken down by score differential:

-5 points: -3% [percentage points]
-4 points: -1%
-3 points: -1%
-2 points: -5% (significant at .05)
-1 points: -7% (significant at .01)
+0 points: +2%
+1 points: -5% (significant at .05)
+2 points: +0%
+3 points: -1% ("also significant")
+4 points: +1%
+5 points: -1%

There are some differences in the two studies. The older study controlled for player career percentages, instead of player season percentages. It didn't control for quarter (which is why commenters suspected it might just be late-game fatigue). It didn't control for a bunch of other stuff that this newer study does. And it used only three seasons of data instead of eight.

But the important thing is: the newer study's eight seasons *include* the older study's three seasons. And so, you'd expect the results to be somewhat similar. It's possible that the three significant seasons are enough to make all eight seasons look significant, even if the other five seasons were just average.

Let's try, in a very rough way, to see if we can tease out the new study's result for those other five seasons.

In the older study, if we average the -1, -2, -3, and -4 effects, we get -3.5. So, in the last minute, down by 1-4 points, shooters choked by 3.5 percentage points.

How do we get the same number for the newer study? Well, in the top-left cell of Table 5, we get that, in the last 30 seconds and down by 1-4 points, shooters choked by 3.8 percentage points.

That's our starting point. But the new study's selection criteria are a little different from the old study's, so we need to adjust.

First, the "-3.8" in the new study is from comparing to games in which the point differential is 11 or more. The "5-10" is probably a more reasonable comparison to the previous study. The difference between "11+" and "5-10", at 30 seconds, appears to be about one percentage point (compare the second columns of Tables 3 and 4). So we'll adjust that 3.8 down to 2.8.

Second, the new study is for the last 30 seconds, while the old study is for the last minute. From earlier in this post, we see that the 31-60 difference between the "down 1-4 group" and the "5+" group is only about -1.5 percentage points. Averaging that with the -2.8 from the above step (but giving a bit more weight to the -2.8 because there were more shots there), we get to about -2.4.
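Stringing the two adjustments together, and using the attempt counts quoted earlier in the post as the weights (my guess at what "a bit more weight" should mean):

```python
# start: last 30 seconds, down 1-4, vs. 11+ blowouts (Table 5)
effect_vs_11plus = -3.8

# adjustment 1: compare to the "5-10" group instead (about 1 point)
effect_vs_5to10 = effect_vs_11plus + 1.0      # -2.8

# adjustment 2: blend in the 31-60 second effect (about -1.5),
# weighting by attempts: 922 + 727 shots in the last 30 seconds,
# 922 shots in the 31-60 second window
w30, w60 = 922 + 727, 922
blended = (w30 * effect_vs_5to10 + w60 * -1.5) / (w30 + w60)
print(round(blended, 1))   # -2.3, in the ballpark of the -2.4 above
```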

So we can estimate, very roughly, that for the same calculation,

Old study (three seasons): -3.5
New study (eight seasons): -2.4

Let's assume that if the new study had confined itself to only the same three seasons as the older study, it would have come up with the same result (-3.5). In that case, to get an overall average of -2.4, the other five seasons must have averaged -1.74. That's because, if you take five seasons of -1.74, and three seasons of -3.5, you get -2.4.

So, as a rough guess, the new study found:

-3.5 -- same three seasons as the old study;
-1.7 -- five seasons the old study didn't cover;
-2.4 -- all eight seasons combined.
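The back-out is just a weighted average solved in reverse:

```python
# eight-season average is a weighted average of the old three seasons
# and the new five; solve for the new five's average
all_eight = -2.4
first_three = -3.5
next_five = (8 * all_eight - 3 * first_three) / 5
print(round(next_five, 2))   # -1.74
```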

So, in the new data, this study finds only half the choke effect that the other study did. Moreover, I estimate it's only 1 SD from zero.


That's for "down 1-4 points." Here's the same calculation, broken down by individual score. Here "%" means percentage point difference:

-2 points: First three: -5%. Next five: -2.3%. All eight: -3.3%.
-1 points: First three: -7%. Next five: -1.7%. All eight: -3.7%.
+0 points: First three: +2%. Next five: -2.8%. All eight: -1.0%.
+1 points: First three: -5%. Next five: -0.5%. All eight: -2.2%.
+2 points: First three: +0%. Next five: -2.1%. All eight: -1.3%.

Generally, five new seasons are closer to zero than the three original seasons. That's what you would expect if the original numbers were mostly luck.


So, in summary:

-- The study finds that in the last seconds of games, players behind in close games shoot significantly worse than in blowouts.

-- In the last 30 seconds, they're maybe about 2.8 percentage points worse. In the last 15 seconds, they're maybe about 4.8 percentage points worse.

-- The effect is biggest when down by 1 in the last fifteen seconds.

-- However, they are not statistically significantly better or worse than *average,* just statistically significantly worse than blowouts (although they certainly are "basketball significantly" worse than average).

-- The effect is mostly driven by the three seasons covered in the earlier study. If you look at the other five seasons, the effect is not statistically significant (but still has the same sign).

What do you think? I'm not absolutely convinced there's a real effect overall, but yeah, it seems like it's at least possible.

However, I do think the most extreme individual breakdowns are overstated. For instance, the newspaper article says that in the last 15 seconds, down by 1, players will shoot "5 to 10 percentage points worse than normal." (They really mean "worse than 11+ blowouts," but never mind.) Given that that's the most extreme result the study found, I think it's probably a significant overestimate. I'd absolutely be willing to bet that, over the next five seasons, the observed effect will be less than five percentage points.


P.S. One last side point: the newspaper article says,

"Shooters who average 90 percent from the line performed slightly better than that under pressure, while 60 percent shooters had a choking effect twice as great as 75 percent shooters. That suggests that a lack of confidence begets less confidence, and vice versa."

This is a correct summary of what the authors say in their discussion, but I think it's wrong. The regressions that this comes from (Table 5, columns 2 and 6) don't include an adjustment for the player. So what it really means is that the 60 percent shooter will be *twice as far below the average player* as the 75 percent shooter. That makes sense -- because he's a worse shooter to begin with, even before any choke effect.


UPDATE: After posting this, I realized that I may have missed one aspect of the regression ... but I think my analysis here is correct if I make one additional assumption that's probably true (or close to true). I'll clarify in a future post.


Friday, May 27, 2011

A fun Markov lineup study

"Reconsideration of the Best Batting Order in Baseball: Is the Order to Maximize the Expected Number of Runs Really the Best?" by Nobuyoshi Hirotsu. JQAS, 7.2.13


When you compare two baseball teams, the one expected to score the most runs isn't necessarily the one that will win more often. For example, suppose team A scores 4 runs per game, every time. Team B scores 3 runs eight games out of ten, and scores 10 runs two games out of ten.

Even though team A scores 4 runs per game, and team B scores an average 4.4 runs per game, A is going to have an .800 winning percentage against B.
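Here's that example worked out explicitly (nothing fancy, just enumerating team B's two possible scores):

```python
# team A always scores 4; team B scores 3 with prob .8, 10 with prob .2
b_dist = {3: 0.8, 10: 0.2}
a_runs = 4
mean_b = sum(r * p for r, p in b_dist.items())            # B's average runs
p_a_wins = sum(p for r, p in b_dist.items() if r < a_runs)  # A wins iff B scores 3
print(round(mean_b, 1), p_a_wins)   # 4.4 0.8
```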

In a recent study in JQAS, Nobuyoshi Hirotsu asks the question: does this happen in real life, with batting orders? For the same nine players on a team, is there a batting order that produces fewer runs than another, but still wins 50% of games or more?

To answer the question, you need a lot of computing power. Unlike the contrived example above, any effects using a real batting order are likely to be very, very small. And there are 362,880 different ways of arranging a batting order.

But, Hirotsu was able to do it, thanks to a supercomputer at the University of Tokyo. For all 30 MLB teams in 2007, he took their most frequently used lineups, and did a Markov chain calculation to figure out how many runs they would score.

(He used a very simplified model of a baseball game -- hits, walks, and outs only. Outs do not advance baserunners. No double plays. Runners advance 1-2 and 2-H on a single, 1-3 on a double. Because this is a Markov study and not a simulation, all figures quoted in the study are *exact* -- subject, of course, to the caveat that the model is much simpler than real baseball.)
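For the curious, here's a rough Monte Carlo sketch of that simplified model. To be clear: the paper solves the model *exactly* as a Markov chain, and the event probabilities below, plus a couple of the advancement rules (walks forcing runners along, a runner on third scoring on any hit, home runs as a hit type), are my own guesses, not the paper's:

```python
import random

EVENTS = ["out", "walk", "single", "double", "homer"]

def play_inning(lineup, spot, rng):
    runs, outs, bases = 0, 0, [False, False, False]
    while outs < 3:
        probs = lineup[spot % 9]
        ev = rng.choices(EVENTS, weights=probs)[0]
        spot += 1
        if ev == "out":                          # outs never advance runners
            outs += 1
        elif ev == "walk":                       # advance forced runners only
            if bases[0] and bases[1] and bases[2]:
                runs += 1
            elif bases[0] and bases[1]:
                bases[2] = True
            elif bases[0]:
                bases[1] = True
            bases[0] = True
        elif ev == "single":                     # 1->2, 2->H, 3->H (assumed)
            runs += bases[1] + bases[2]
            bases = [True, bases[0], False]
        elif ev == "double":                     # 1->3, 2->H, 3->H (assumed)
            runs += bases[1] + bases[2]
            bases = [False, True, bases[0]]
        else:                                    # homer clears the bases
            runs += 1 + sum(bases)
            bases = [False, False, False]
    return runs, spot

def runs_per_game(lineup, games=5000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(games):
        spot = 0
        for _ in range(9):                       # nine innings per game
            r, spot = play_inning(lineup, spot, rng)
            total += r
    return total / games

# nine identical hypothetical hitters: P(out, walk, single, double, homer)
lineup = [(0.68, 0.08, 0.16, 0.05, 0.03)] * 9
print(round(runs_per_game(lineup), 2))           # simulated runs per game
```

With nine different real hitters you'd try all 362,880 orderings of the `lineup` list; the exact Markov version just replaces the simulation with a transition-matrix calculation over the 24 base-out states.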

After calculating all 362,880 results, he took the single top-scoring lineup, and compared single-game results to the best of the other lineups (10,000 to 30,000 of them). If he found any lineups that beat the best one more than 50% of the time, despite scoring fewer runs, he marked it down.

So, about 600,000 pairs of lineups were compared (say, 20,000 per team times 30 teams). After all that work, how many sets of lineups do you think he found that met the criteria?


There were 13 sets of lineups where the team with more runs scored was not the team with the better record. Those 13 were distributed among only 7 different MLB teams.

For instance, the White Sox. The most frequent lineup actually used by the manager was: Owens, Iguchi, Thome, Konerko, Pierzynski, Dye, Mackowiak, Fields, Uribe. Call that one "123456789".

The lineup that maximized runs scored was "347865921," with 4.85966 runs per game. However, the lineup "348675921" beat it with a .50014 winning percentage, despite scoring only 4.85931 runs per game.

As a general rule, it seems, the lower the standard deviation of runs per game, the more likely a lower-scoring lineup can beat a higher-scoring lineup. In the White Sox case, the SDs were 3.2649 and 3.25817, respectively.

If you studied it, you could probably look at the 13 cases and try to figure out what it is about the players and lineups that makes this possible -- that is, how to create a lineup that's almost as good as the best, but has an SD lower enough to compensate. Do you need lots of power hitters, few, or a mix? Do you have to cluster them, split them up, or go half/half? I have no idea.


Anyway, even though the study really has no practical significance, I really like it. Recreational sabermetrics, I guess you could call it.


Sunday, May 22, 2011

More evidence for referee bias in soccer

Commenter "Millsy" was kind enough to send me a couple of academic studies on soccer refereeing bias. Both of them have pretty solid evidence that referees favor the home team.

The first one is "Favoritism Under Social Pressure," (.pdf) by Luis Garicano, Ignacio Palacios-Huerta, and Canice Prendergast. The authors looked at how much extra time the referees added to the end of soccer games (to compensate for time lost to injuries, substitutions, and so on). Looking at games in the Primera Division in Spain over two specific seasons (1994-95 and 1998-99), they found that, in games where the score difference was exactly one goal, referees awarded almost twice as much extra time when the home team was trailing as when it was leading. More time, of course, benefits whichever team is behind, as it gives them a better chance to tie the game.

The difference was about 1.8 minutes, even after controlling for yellow cards, red cards, substitutions, and several other things. In round numbers, home teams got four extra minutes to tie the score, but visiting teams got only two minutes. The difference was very statistically significant (at least 15 SDs).

Those two situations -- home team ahead by one goal, and visiting team ahead by one goal -- were the two most significant deviations from the mean of about three minutes. The full data, as read off the authors' chart, arranged by home team lead:

4+ goals ... 3.0 minutes
3 goals .... 3.0 minutes
2 goals .... 2.5 minutes
1 goal ..... 2.1 minutes
0 goals .... 3.3 minutes
-1 goal .... 4.0 minutes
-2 goals ... 2.8 minutes
-3+ goals .. 3.0 minutes

This seems like pretty solid evidence that something is going on ... I can't think of anything that would cause this other than referee bias, but if you can think of anything, let us know.

By the way, the authors estimate that the bias changes the result of about 2.5% of games -- presumably from a loss to a tie. That means it changes home winning percentage by 0.0125. That's less than 10% of the overall home field advantage in soccer, but it's still pretty significant.


However, there's one thing that's a bit strange -- you'd expect that if it's referee bias, some refs would be aware of it and consciously try to avoid it. The authors checked individual referees, though, and they all seemed to be about the same:

" .... we found that most referees appear to be equally biased. Only 3 of the 35 referees in the sample show statistically significant individual effects at the 10% level."

That, to me, seems very unusual, that all the refs would be breaking their vows of fairness in exactly the same way. So I still have some reservations that it's all just unconscious bias, even though I honestly can't think of any other explanation.

Also, I find it interesting that the paper I reviewed last week argued that differences among referees confirm the hypothesis of bias. This paper found *no* differences among referees -- but does not consider this to *refute* the hypothesis of bias.


The second paper is called "Favoritism of agents -- The case of referees' home bias," by Matthias Sutter and Martin G. Kocher (.pdf). The authors looked at the 2000-2001 season in Germany, and found a similar, though less extreme, result:

1 goal ..... 2.2 minutes
0 goals .... 1.8 minutes
-1 goal .... 2.7 minutes

The +1/-1 difference is just 30 seconds, instead of the 100 seconds the Spanish study found. It's still significant at just over 2 SDs. (The difference between -1 and 0 is 3 SDs.)

This study checked a couple of other things that were interesting. In the first half, the pattern was reversed: there was more extra time when the home team was ahead (20 seconds, 2 SDs). The authors think that's because, in the first half, it's to the home team's benefit to have play stop earlier, so they can regroup for the second half.

Also, the authors note that a German magazine, "Kicker Sportmagazin," reviews all games and posts an opinion on which penalty calls were correct and which were incorrect (both actual and missed calls). It turns out that for penalties called in favor of the home team, 5 out of 55 were illegitimate. But for visiting teams, it was only 1 out of 21. So referees awarded illegitimate penalties to the home team at about twice the rate they did to the visiting team.

False negatives also favored the home team. There were 12 cases where home team should have been awarded a penalty, but wasn't; there were 19 such cases for the visiting team.

Overall, if you add those up, the visiting team was "cheated" out of about 10 penalty kicks.

Assume 9 of those would have been goals. Nine goals out of 306 games equals 0.03 goals per game. The overall goal difference between home and road was 0.62 goals. So, this particular manifestation of referee bias equals about 5 percent of home field advantage.
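The arithmetic, spelled out (the nine-goals-per-ten-kicks conversion assumption is the one above, and the "about 10" kicks is really 11):

```python
# net penalty kicks the visiting side was "cheated" out of
wrong_for_home, wrong_for_road = 5, 1      # illegitimate penalties awarded
missed_for_home, missed_for_road = 12, 19  # legitimate penalties not given
net_vs_road = (wrong_for_home + missed_for_road) - (wrong_for_road + missed_for_home)
print(net_vs_road)                         # 11

goals = 9                                  # assume ~90% of kicks convert
per_game = goals / 306
print(round(per_game / 0.62, 2))           # 0.05 -- about 5 percent of HFA
```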

That seems plausible to me.


Tuesday, May 17, 2011

Winning causes payroll: study

Suppose you're a Martian who has just immigrated to North America, and you have no idea how baseball works. All you've got is a database full of statistics, and a black-box graph theory algorithm to try to figure out cause-and-effect relationships. What would you conclude?

The answer can be found in a new JQAS paper, "Dependence Relationships between On Field Performance, Wins, and Payroll in Major League Baseball."

The author applies the algorithm to a bunch of stats, and comes up with a bunch of dependencies for what causes what. Most of them don't make a lot of sense. For instance:

-- Walks depend on: OBP, Runs, and SO.
-- Total bases depend on: AB, 2B, HR, IP, and Runs.
-- Earned runs depend on: ERA, hits against, and HR against.

... and so on.

Be that as it may, the real purpose of the paper is to look at payroll and wins. What does the graph theory algorithm say about those? This:

--Winning percentage depends on: Fielding percentage, On-base percentage, and Saves.

--Payroll depends on: Fielding percentage, Pitcher strikeouts, and Winning percentage.

Who knew?

In any case, after that, the author looks at how an increase in payroll or wins affects the future. He finds:

-- If you bump a team's payroll 10%, it wins an extra 2.5 games this year, but returns to normal afterwards.

-- If you bump a team's payroll 10%, payroll drops slowly from +10% to +2% over the next 10 years.

-- If you bump a team's winning percentage by 10%, it returns to baseline in subsequent years.

These are no big deal. However:

-- If you bump a team's winning percentage by 10%, it bumps the team's payroll by 10% immediately. Then payroll rises to +25% over the next three years, settling back down to +10% by year 10.

So, according to the algorithm, payroll doesn't seem to have long-lasting effects on winning. But winning appears to have long-lasting effects on payroll! Therefore:

"... while we found some evidence that winning affects payroll and payroll affects winning, the evidence suggests the effect of winning on payroll is the more direct, larger, and more lasting in magnitude one."

In summary: winning causes payroll.

That's what the black-box algorithm says. But, to any member of the Martian-American community who may be reading this, I would respectfully suggest: you're better off not putting too much faith in the results of this particular paper.


Saturday, May 14, 2011

Are soccer referees biased for the home team?

A little while ago, one of the economics blogs I read (I forget which one) posted a link to a recent (2007) home field advantage paper. The paper is "Referee bias contributes to home advantage in English premiership football," by Ryan H. Boyko, Adam R. Boyko, and Mark G. Boyko. It's a free download.

It's actually pretty clever what they did. What they figured is this: if home field advantage (HFA) is caused by referee bias, it stands to reason that different referees would have different levels of such bias. So they checked the individual HFAs of different referees, to see if the distribution matched what would be expected if they were the same, and any differences were random error.

At least I *think* that's how they did it. They used an "ordinal multinomial regression model," and, unfortunately, I can't explain that because I don't know how it works. The results look pretty much like a normal regression. They tried to predict goal differential for every game. To do that, they used crowd size and percentage of available seats filled. They had dummy variables for year. Most importantly, they also included expected goals for and against for the home and visiting teams, where "expected" means average for the games of that season, not including that game (but not adjusting for the fact that both averages will be one game biased for home/road). And, of course, they used the identity of the referee for the game.

From all that, they got that referees were collectively statistically significant, at p=0.0263. But that was the result of a Chi-squared test on the entire group of 50 referees, so there's no coefficient we can look at. So, we know the referees have statistically significant differences in HFA, but we can't figure out *by how much*.

It turns out, however, that the significance goes away if you omit one outlier referee from the study. That referee's HFA is a huge 1.2 goals. The mean was 0.412, and no other referee was higher than 0.7. When the authors exclude that one referee from the study, the p-value jumps to .18 -- no longer significant.

The authors provide a chart (Figure 1) with the HFAs of all fifty referees. From that chart, you can't really tell if the referees are all the same or not -- you need the significance test. To the naked eye, the differences look pretty consistent with what you'd expect if the differences are just random (except, of course, for the one outlier).

Only two out of fifty referees have negative HFAs (that is, they refereed games where the visiting team outscored the home team, on average). However, it does appear that the referees with lower HFAs are a little farther from the average than the referees with higher HFAs, for what that's worth.

So the question remains: *how much* is the difference in referees, as compared to HFA? We don't know. It would have been nice if the authors had given us some variances: how much would the variance be if there were no bias? Then we could subtract the theoretical from the observed, and conclude something like, "the variance of HFA bias amongst referees is X".
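Here's the shape of the calculation I have in mind, with made-up placeholder numbers, since the paper doesn't report the ingredients (games per referee, single-game SD of goal differential, observed spread of per-referee HFAs):

```python
from math import sqrt

# Hypothetical numbers for illustration only -- not from the paper.
n_games = 60     # games worked per referee (assumed)
sd_game = 1.8    # SD of single-game goal differential (assumed)
s_obs = 0.28     # observed SD of per-referee HFAs (assumed)

sampling_var = sd_game ** 2 / n_games      # variance expected if all refs identical
true_var = s_obs ** 2 - sampling_var       # leftover = real ref-to-ref spread
if true_var > 0:
    print(round(sqrt(true_var), 3))        # 0.156: estimated SD of true referee bias
else:
    print("observed spread is consistent with pure sampling noise")
```

That last number is the "X" I'm asking for: with the paper's real inputs, it would tell you how much referees actually differ, not just whether they differ.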

But, as I said before, the authors were so concerned about showing there IS bias that they didn't calculate HOW MUCH bias they actually observed.


One thing to keep in mind is that while it's possible to estimate the *differences* among referees, there's no way to know the actual level of bias. It could be that all referees are biased, but a bunch just happen to be a little less biased. Or it could be that almost all referees are unbiased, but a select few were able to overcome that, and it's those unbiased referees that are causing the statistical significance.

It's like, suppose one interviewer wants to hire all three of the black candidates interviewed for a position, and another wants to hire none of them. You can tell one or both of them is biased. But is it that one interviewer doesn't like blacks? Is it that the other interviewer is practicing affirmative action? Or is it a combination of both? You can't tell unless you know enough about the candidates to figure out which of them "should" have been hired by an unbiased interviewer.

Same thing here. We need to know what the HFA "really" would be if all referees were unbiased. But that, we don't know, and there's no real way to know from this study.

The authors of the paper acknowledge that, but nonetheless argue for the position that HFA is all refereeing, and that most referees are biased:

"Certainly, the [many referees biased option] seems more reasonable, especially given the floor near gD = 0 (no home advantage)."

I don't really understand that, and I don't really agree with it ... I think the most plausible assumption is to assume bias among the fewest number of referees that seems reasonable. Actually, I think the most plausible assumption is to note the huge outlier, and the lack of significance of the distribution of the others, and reach the tentative conclusion that (a) there's no evidence of bias in general, but (b) we should really look closely at the outlier referee, to see if we can figure out what's going on there.


Finally, even if there IS a difference in referees, it might not be bias in favor of the home team. It could just be style of refereeing. According to Table I of the paper, home teams score a lot more goals on penalty kicks than visiting teams do. Overall, the difference was 0.044 goals per game.

Suppose that certain referees are just less likely to call penalties -- say, 1/3 less likely. That would reduce the HFA on penalties from 0.044 goals, down to 0.03 goals -- a difference of 0.015 goals.

It's not much, but add in differences in yellow cards, red cards, free kicks, and so on, and see what you get. It could turn out that a significant part of variability in referee HFA could be referee characteristics that have nothing to do with the home team at all.


Hang on -- maybe we *can* get at least an upper limit for how much HFA the referees could cause. In Table 1 of the paper, the authors give home and road stats for yellow cards, red cards, and penalty kick goals.

-- Yellow cards: road teams get 0.45 more per game than home teams.

-- Red cards: road teams get 0.038 more per game than home teams.

-- Penalty goals: home teams get 0.044 more per game than road teams.

A red card sends the player off for the rest of the game (and the team plays a man short). I remember reading somewhere what that's worth, but I don't remember where I saw it. Let's say it's an entire goal.

A yellow card is a warning. It doesn't cost the team anything (other than a free kick), but, since a second yellow card leads to a red card, the player affected might play with a bit more caution. It looks like there are about 20 yellow cards for one red card. So, let's suppose the player with the yellow card would get the red card one time in 10 if he didn't adjust his play. That means the first yellow card gives him a 10% chance of costing his team a goal. If he "spends" the entire 0.1 goals on more cautious play, we could say a yellow card is worth 0.1 goals.

A penalty goal, obviously, is a goal. I think I read somewhere that there's very little HFA on penalty kicks, so we can assume that the difference is the number of penalty kicks awarded.

So, let's add this up:

0.45 yellow cards times 0.1 goals equals 0.045 goals;
0.038 red cards times 1 goal equals 0.038 goals;
0.044 penalty successes times 1 goal equals 0.044 goals.

The total: 0.127 goals.
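The tally above is just arithmetic. Here's a quick sketch of it -- keeping in mind that the per-event goal values (0.1 for a yellow, 1.0 for a red) are the rough guesses from the text, not measured quantities:

```python
# Rough upper bound on referee-driven HFA, in goals per game.
# The goal value per card (0.1 for a yellow, 1.0 for a red) is the
# back-of-the-envelope assumption from the text, not a measured quantity.
yellow_diff = 0.45     # extra yellow cards per game against road teams
red_diff = 0.038       # extra red cards per game against road teams
pk_goal_diff = 0.044   # extra penalty goals per game for home teams

total = yellow_diff * 0.1 + red_diff * 1.0 + pk_goal_diff * 1.0
print(round(total, 3))  # 0.127
```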

What else could the referees be doing to influence the outcome? Well, there's free kicks. And there's extra time -- some studies have suggested that the refs allow more injury time when the home team needs it. But those seem like they'd be much smaller factors than the ones above. Let's bump up the total from 0.127 to 0.15.

Also, it could be that visiting teams have to play more cautiously because of referee bias, and those numbers are artificially low because they don't include the effects of that. We included the effects of caution in the yellow card calculation, but not in the others. I don't know how to estimate that; it could be anything, really, from 0 to 1,000 percent. However, if it were seriously high, someone would have noticed how much more aggressively teams play at home than on the road. It would have to be mostly a conscious decision, so players would talk about it all the time -- how they have to play so much more timidly on the road to avoid provoking the referee.

Since that doesn't happen much (or does it?), it seems reasonable to assume that there's not much of that going on. Still, I'm going to set it high, and assume the effect of unpenalized cautious play is 50 percent of the total. That unrealistic assumption brings us up to about 0.22 goals.

We're still at just a little over half of observed HFA.

And, to get to half, we had to make some seriously unrealistic assumptions -- that ALL of the difference in yellow cards, red cards, and penalty kicks was due to referee bias against the visiting team, and that players are compensating with another 50 percent on top of that.

So, Table 1 of the paper is the strongest evidence I've seen that referees can't be causing much of HFA. And no regression is required -- it's just simple arithmetic!


Saturday, May 07, 2011

Clutch hitting and getting killed by a puck

There was something someone said about clutch hitting a few months ago -- I think it was Tango -- that took a while to sink in for me.

It went something like this: saying that clutch hitting ability "does not exist" is silly. Humans are different, and different people react to pressure in different ways. So, *of course* there must be differences in clutch hitting ability. The question isn't whether or not clutch hitting exists, because we know it must, but *to what extent* it exists.

It turns out that extent is small. The best we can say is that clutch hitting studies have found that the SD of individual clutch tendencies is about 3 percent of the mean. So for players who are normally .250 hitters, two out of three of them will be between .242 and .258 in the clutch, and 19 out of 20 of them will be between .235 and .265. ("The Book" study on the topic used wOBA, rather than batting average, but this is probably still roughly true.)
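The arithmetic behind those intervals is simple -- here's a sketch, assuming (as the studies suggest) an SD of clutch talent equal to 3 percent of the mean, applied to batting average for illustration:

```python
# Spread of true clutch performance around a .250 baseline, assuming the
# SD of individual clutch talent is 3 percent of the mean (a rough figure
# from the studies cited; "The Book" actually used wOBA, not average).
mean = 0.250
sd = 0.03 * mean                       # 0.0075
print(mean - sd, mean + sd)            # ~2/3 of players: about .242 to .258
print(mean - 2 * sd, mean + 2 * sd)    # ~19/20 of players: .235 to .265
```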

That's pretty minor, especially compared to other factors like platoon advantage, and so on. More importantly, there's not nearly enough data to know which are the real clutch hitters and which are the real clutch chokers. When your favorite pundit pronounces player X as "clutch," that's still completely pulled out of his butt.

The conclusion remains that calling specific players "clutch" is silly, but the stronger statement that "clutch hitting does not exist" is unjustified.


As I said, that took a while to sink in for me. I'm willing to agree with it. But I still have reservations. Because, you can take this "humans are different" stuff to extremes.

Suppose you found that a certain bench player hits much better than usual on the first day of the month. Some announcer notices and says the manager should always play him on those days. The sabermetricians step in and say, "that's silly: there's no reason to believe it's a real effect, and the guy's not very good on all the other days."

But, by the same logic ... all humans are different. For me personally, every day on the first of the month, I'm amazed how fast the past month flew by. It seems to me that time is going faster as I get older, and it makes me a little sad. On the other hand, other people might be happier on the first of the month. Maybe that's when they get paid, and they're feeling rich.

So, since humans are different, and circumstances are different, why couldn't there be a *real* "first of month" effect? The same logic says there *has* to be.

But, the thing is ... any such effect is probably very, very small. Too small to measure. There's no way it'll affect player performance (in my estimation) even one one-hundredth as much as clutch.

It's real, but it's too small to matter.

In cases like that, is it OK to say that "there's no such thing" as "first of month hitting talent"? Maybe it's not technically true, but I don't think that'll always stop me from saying it anyway. But, if I remember, I'll say "if it exists, it's probably infinitesimal."

For clutch, the Andy Dolphin study mentioned in "The Book" came up with an SD of clutch talent estimated at .008 in wOBA. I'm not 100% willing to accept that, mostly because, as Guy points out in a recent clutch thread on Tango's blog, there might be a natural clutch difference due to batter handedness or batting style that doesn't really reflect "clutch talent" as it's normally understood.

But, what I might choose to say is that there's weak evidence of a small clutch effect, and add that there isn't nearly enough evidence to know who's weakly clutch and who's weakly choke.

Or, I might say that there's "no real evidence of a meaningful clutch effect," which says the same thing, but with a more understandable spin.


Anyway, it occurred to me, while thinking about this, that we like to think about things as "yes/no" when what's really important is "how much". By that, I don't mean the moral argument about "black and white" versus "shades of grey." What I mean is something more quantitative -- "zero or not zero" versus "how much?"

Take, for example, Omega 3 fatty acids. I hear they're good for you. You read about them in the paper, on milk cartons, and on fish products.

But, *how* good for you are they? Isn't it important to quantify that? The media doesn't seem to think so.

Here, for instance, is an article from the CBC. It talks about which Omega 3s are better than others, and ends with a recommendation that people eat 150 grams of fish a week. But ... how big are the benefits? The CBC article doesn't tell us. You'd expect better from the Wikipedia entry, but even that doesn't tell us.

So, how are we supposed to decide whether it's worth it?

I mean, suppose you really, really don't like fish. Won't your decision on whether or not to eat fish anyway depend on how big the benefit is?

And, don't other foods have benefits too? If I eat fish for lunch, that means I won't be eating oat bran instead. How can I decide which is better for me without the numbers?

Or, suppose I have a choice ... go to a fish restaurant for lunch, or go for a jog around the block. Both might help my heart. Which one will help more? Or, suppose a nice piece of salmon in my favorite restaurant is $5 more than a piece of chicken. Could I do better with the $5? Maybe I could save up and use the money to switch to a more expensive gym, where I'll go more often. Or, I could put it in a travel fund, to buy better travel health insurance next trip I go abroad. Which option for my $5 is best for my long-term health?

There are lots of people who hear "fish is good for your heart," so they start buying fish. What they're really buying is security and good feelings. Because if you ask them to quantify the benefits, they just look at you blankly, or they quote some authority saying it's good for you. They treat it as a yes/no question -- is it good for you? -- when the important question is "how much is it good for you?"

Maybe I'm just cynical, but when someone tells me a food is good for me, without quantifying the benefits, I just assume the benefits are tiny, like clutch hitting. After all, for a study to make a journal, all you need is 5 percent significance, which means that most of the claims are probably not true. And, even if they *are* true, the study probably found that the benefits are small (though possibly statistically significant).

In his book "Innumeracy," back in the 80s, John Allen Paulos suggested there be a logarithmic scale for risk, so when the media tell you something is risky, they could also use that number to tell you how much. I'd like to see that for everything, not just risk. And, I'd like to see the scale changed. Because, if it's just a logarithm, people will still ignore it. If someone tells me I should have had the salmon, I might say, "Why? It's only a 0.3 on the Paulos scale, which is almost nothing." And they'll say, "0.3 is better than nothing. It all adds up. You should look out for your health."

But, what if you expressed the benefits in terms of something real, like exercise? Everyone intuitively understands the benefits of exercise, so the scale would make sense. And, it would make the insignificance of small numbers harder to rationalize away. So when someone insists I eat the salmon, I can say, "look, it's $5, and it's only the equivalent of 30 meters of jogging." They can still say, "30 meters is better than nothing!" And I'll say, "look, I'll have the chicken, and when we get outside the restaurant, I'll jog the 30 meters to the car, and save the $5."

They could use that scale in the supermarket, too. I buy milk with Omega 3 (not for the health benefits, but because it stays fresh longer). Wouldn't it be great if it said on the carton, "Each glass gives you the benefits of 0.3 push-ups"? That would be awesome.

I'm not sure if exercise is the absolute best reference point. Maybe "days of life" is better. ("Each floret of broccoli adds 3.2 seconds to your lifespan!") But, still, it should be possible to come up with *something* that'll work.


For risk, a good unit of measure would be "miles of driving". That works well because it's widely recognized that driving is dangerous, and we all know people who died in car accidents, so we have an idea of the risk. But, there's no moral stigma associated with it (unlike, say, smoking), so we can be fairly rational about it.

In a comment thread yesterday on Tango's blog, there was a discussion about putting a protective barrier down the lines of baseball stadiums, to prevent people from getting hurt by foul balls. That would be similar to what the NHL did, when they installed a mesh partition behind each net after the death of a spectator in 2002.

Suppose the hockey netting wasn't there. What would the risk be?

From 1917 to the end of the 2002 season, there were 37,480 regular season NHL games. Assuming 15,000 fans per game, that's 562 million fans. Suppose one-quarter of those fans are sitting in high-risk seats; that's 140 million fans. Finally, suppose that players shoot a lot harder now, so today's risk is double the historical average. That means it only takes 70 million of today's fans to shoulder the same risk as throughout the NHL's history. (We could add a bit for playoffs and pre-season, but never mind.)

So, that's one death out of 70 million people, or a risk of 1 / 70,000,000 of dying at any given game. Is that big or small? It's hard to say. What is it in terms of driving?

In 2010, there were 1.09 deaths per 100 million miles travelled. Let's round that down to 1.00, just to make the calculation easier. So there's 1 / 100,000,000 of a death per mile.

That means the hockey game is the equivalent of 1.43 miles.

So that's how I'd say it: putting the mesh up at hockey games makes each fan behind the net safer by 1.43 miles of driving.
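Here's a sketch of that calculation, using the same rough guesses as above:

```python
# Risk of a fatal puck strike per fan per game, in miles of driving.
# All inputs are the rough guesses from the text.
total_fans = 37_480 * 15_000          # ~562 million regular-season fans, 1917-2002
high_risk_fans = total_fans / 4       # quarter of fans in high-risk seats
equivalent_fans = high_risk_fans / 2  # doubled modern shot risk -> ~70 million

death_risk_per_game = 1 / equivalent_fans  # one death over that whole history
deaths_per_mile = 1 / 100_000_000          # 1.09 per 100M miles, rounded down

# 1.42 here; the text's 1.43 comes from rounding to a flat 70 million fans.
print(round(death_risk_per_game / deaths_per_mile, 2))
```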

Doesn't that give you a really good intuitive idea of the risk involved?

Of course, that's death only, and not injury. But I'm sure you could find injury data for car accidents, and for puck injuries, and come up with some kind of reasonable scale. I'm guessing that if you did, you'd probably find it's still the same order of magnitude -- a couple of miles. But I don't know for sure.

And, hey, now that I think about it, you could treat *healthy* things as "miles of driving saved." If something saves you one minute of lifespan, that's easily converted to driving. 100 million miles, at an average of 30 miles an hour, is about 380 years of driving. (At six hours of driving (or passenging) a week, that's about 10,000 person-years of driving per fatal accident. That would mean that about one American in 200 eventually winds up dying in a car accident. Sound about right?)

Suppose the average driver has 40 years of life left. Then every 380 years of driving wipes out 40 years of life. That works out to 9.5 years of driving per year of life lost. Round that to 10. That means every hour you drive -- 30 miles -- costs you six minutes of life. So five miles of driving costs you a minute of life.
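The chain of conversions looks like this -- everything here is the rough assumption from the text (30 mph average speed, 40 years of remaining life, deaths only):

```python
# Converting miles of driving into minutes of life lost.
# All figures are the rough assumptions from the text, deaths only.
miles_per_fatality = 100_000_000
hours_per_fatality = miles_per_fatality / 30           # at 30 mph average
years_per_fatality = hours_per_fatality / (24 * 365)   # ~380 years of driving
life_lost = 40                                         # years of life per fatality
driving_years_per_life_year = years_per_fatality / life_lost  # ~9.5; call it 10
minutes_per_driving_hour = 60 / 10                     # 6 minutes of life per hour
miles_per_life_minute = 30 / minutes_per_driving_hour  # 5 miles per minute of life
print(years_per_fatality, driving_years_per_life_year, miles_per_life_minute)
```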

(Again, that's death only, and not injury. In terms of quality of life, you'd probably want to bump up the "5 miles" figure, because bad health usually makes you miserable before you die, but car accidents often kill you instantly. But let's stick with five miles for now.)

So if eating salmon for a week saves you one minute of lifespan, eating salmon for a week is like cutting 5 miles off your commute one day. I made that "one minute" number up; if anyone knows how to figure out what the real number is, let me know.


Let's do cigarettes. Actually, let's just do lung cancer, to make it easier.

According to Wikipedia, 22.1% of male smokers will die of lung cancer before age 85. Let's assume that entire amount is from cigarettes.

Since this is a back-of-the-envelope calculation, let's just make some reasonable guesses. I'll assume the average male smoker starts at age 18 and smokes a pack a day. I'll also assume that when a smoker dies of lung cancer, it's at age 70 on average, and it cuts 15 years off his lifespan.

So: 52 years times 365 days times 20 cigarettes equals ... 379,600 cigarettes. 15 years of lost life equals 7,884,000 minutes. So each cigarette equals 21 minutes.

Multiply the 21 minutes by 22.1% and you get 4.6 minutes.

Google "cigarette minutes of life" and you get figures ranging from 3 to 11 minutes ... and that's of *all causes*, not just lung cancer. So, we're in the right range.

4.6 minutes equals 23 miles.

If you're a pack-a-day smoker, your risk is the same as if you drove from New York to Los Angeles every week or so.
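The cigarette arithmetic, step by step -- the smoking-career and lifespan-cost figures are the guesses from the text:

```python
# Minutes of life lost per cigarette, lung cancer only.
# The smoking career (ages 18-70, a pack a day) and the 15-year lifespan
# cost are the rough guesses from the text.
cigarettes = 52 * 365 * 20           # 379,600 cigarettes over a career
minutes_lost = 15 * 365 * 24 * 60    # 15 years of lost life: 7,884,000 minutes
per_cigarette = minutes_lost / cigarettes  # ~21 minutes if cancer were certain
expected = per_cigarette * 0.221     # weight by the 22.1% chance: ~4.6 minutes
miles = expected * 5                 # at 5 miles of driving per minute of life
print(round(expected, 1), round(miles))  # 4.6 23
```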


If I were made evil dictator of the world, I would insist that every media report on risks and benefits tell you *how much*. Every report of a health scare, every quote from a safety group, every recommendation from a nutritionist, would need to include a number. Because, really, when someone tells you "vegetables are healthy," that's useless. Even if it's true, how true? Is it true like "the platoon advantage exists?" Is it true like "clutch hitting exists"? Or is it true like "first of month hitting exists?"

The difference matters. We need the numbers.
