### An NFL field goal "choking" study

In a comment to last week's post on choking in basketball, commenter Jim A. posted a link to this analysis of choking in football. It comes from a 1998 issue of "Chance" magazine, a publication of the American Statistical Association.

The paper comes to the conclusion that field-goal kickers do indeed choke under pressure.

Authors Christopher R. Bilder and Thomas M. Loughin looked at every place kick (field goal or extra point) in the 1995 NFL season. They ran a (logit) regression to predict the probability of making the field goal, based on a bunch of criteria, like distance, altitude, wind, and so on. They designated as "clutch" all those attempts that, if successful, would have resulted in a change of lead.

I assume that kicks starting or resulting in a tie count as "change of lead" -- if so, then clutch kicks are those where the kicking team is behind by 0 to 3 points.

The authors narrowed their model down by eliminating variables that didn't appear to explain the results much. The final model had only four variables:

-- clutch

-- whether it was an extra point or a field goal

-- distance

-- distance * (dummy variable for wind > 15mph)
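In equation form, a logit model with those four terms would look roughly like this. The coefficient symbols are my own labels, not the paper's notation, and I'm not implying any particular values for them:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{clutch} + \beta_2\,\text{PAT} + \beta_3\,\text{distance} + \beta_4\,(\text{distance}\times\text{wind})
$$

Here $p$ is the probability the kick is good, and clutch, PAT, and wind are 0/1 indicators. Since the left side is the log of the odds, each coefficient $\beta$ acts on the odds as a multiplicative factor $e^{\beta}$.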

It turned out that clutch kicks were significantly less successful than non-clutch kicks, by an odds factor of 0.72. If, in a non-clutch situation, your odds of making a field goal were 5.45:1 (which works out to 84.5%, the overall 2008 NFL average), then, to get your clutch odds, you multiply 5.45 by 0.72. And so your corresponding odds in a clutch situation would be 3.93:1 (80%).

It's a small drop -- less than five percentage points overall -- but statistically significant nonetheless.
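For anyone who wants to check the odds arithmetic, here's a minimal Python sketch. The function names are mine; the only inputs are the 84.5% baseline and the 0.72 odds factor quoted above.

```python
def prob_to_odds(p):
    """Convert a success probability into odds (successes per failure)."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back into a success probability."""
    return odds / (1.0 + odds)


def apply_odds_factor(p, factor):
    """Multiply the odds implied by p by `factor`, and return the new probability."""
    return odds_to_prob(prob_to_odds(p) * factor)


baseline = 0.845      # non-clutch success rate used in the post
clutch_factor = 0.72  # odds factor for clutch kicks, as reported

clutch = apply_odds_factor(baseline, clutch_factor)
print(f"non-clutch odds: {prob_to_odds(baseline):.2f}:1")
print(f"clutch odds:     {prob_to_odds(clutch):.2f}:1")
print(f"clutch success:  {clutch:.1%}")
print(f"drop:            {(baseline - clutch) * 100:.1f} percentage points")
```

The same odds factor implies a different percentage-point drop at a different baseline, which is why the 0.72 can't be read directly as "72% as likely".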

----

Now, to ask the usual question (albeit one the paper doesn't ask): could there be something going on other than choking? Some possibilities:

1. All attempts the study considers "clutch" are, by definition, made by a team that's either tied or behind in the score. Wouldn't that be selection bias, since the "clutch" sample would be disproportionately composed of teams who are, overall, a bit worse than average? Worse teams would have worse field goal kickers, which might explain the dropoff.

The paper ignores that possibility, explicitly assuming that all FG kickers are alike:

"One difference there is no difference between placekickers is that NFL-caliber placekickers are often thought of as "interchangeable parts" by teams. NFL teams regularly allow their free-agent placekickers, who are demanding more money, to leave for other teams because other placekickers are available."

That makes no sense: free-agent quarterbacks leave for other teams too, but that doesn't mean all are equal. Besides, if all placekickers were the same, those "other teams" wouldn't sign them either.

So I wonder if what's really going on is that the kickers in "clutch" situations are simply not as good as the kickers in other situations. The discrepancy seems pretty large, though, so I wonder if that effect would be enough to explain the five percentage point difference.

2. One of the other factors the authors considered was time left on the clock. It turned out to be significant, originally, but, for some reason, it was left out of the final regression.

But clutch kicks would be more likely to occur with less time on the clock. Behind by 3 points with two seconds remaining, a team would try the field goal. Behind by 5 points with two seconds remaining, the team would try for a touchdown instead.

Why does that matter? Maybe because, if there's lots of time on the clock and the team isn't forced to kick, they might not try it if conditions are unfavorable (into the wind, for instance). But with time running out, they'd have to give it a shot even if conditions were less favorable. So time-constrained kicks would have a lower success rate for reasons other than "choking".

3. The assumption in the regression is that all the coefficients are multiplicative. Perhaps that's not completely correct.

In low-wind conditions, the regression found that every yard closer to the goalposts changes your success odds to 108% of their original. And clutch changes your odds to 72% of the original. So, according to the model, going one yard closer but in a clutch situation should change your odds to 108% of 72%, or 78%.

But what if that's not the case? What if multiplying isn't strictly correct? Suppose that "clutch" makes the holder more likely to fumble the snap, by a fixed amount, and there's also an effect on the kicker that's proportional to the final probability. In that case, multiplying the two effects wouldn't be strictly correct -- only an approximation. And, therefore, the regression would give biased estimates for the coefficients. If the "distance" coefficient is biased too high, but "clutch" kicks happen to be for longer distances, that would explain a higher-than-expected failure rate.
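Here's a rough sketch of that scenario in Python. The numbers are entirely made up for illustration: I assume clutch costs a fixed 1 percentage point (the fumbled snap) and also scales the kicker's success probability by 0.97. Neither figure comes from the paper. The point is only that the implied odds factor is no longer a single constant.

```python
def prob_to_odds(p):
    """Convert a success probability into odds."""
    return p / (1.0 - p)


def hypothetical_clutch_prob(p_base, fumble_penalty=0.01, kicker_factor=0.97):
    """Made-up clutch model: a proportional hit on the kicker plus a fixed loss."""
    return kicker_factor * p_base - fumble_penalty


def implied_odds_factor(p_base):
    """Odds factor a regression would need at this baseline to match the model."""
    return prob_to_odds(hypothetical_clutch_prob(p_base)) / prob_to_odds(p_base)


# Baseline (non-clutch) success rates, from a near-certain kick to a long one.
baselines = [0.99, 0.90, 0.70, 0.50]
factors = [implied_odds_factor(p) for p in baselines]

for p, f in zip(baselines, factors):
    print(f"baseline {p:.0%}: implied clutch odds factor {f:.2f}")
```

Under these assumptions the implied factor is far from constant across kick difficulty, so a regression that forces one multiplicative "clutch" coefficient would be fitting a compromise value, and the other coefficients could shift to absorb the rest.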

4. The paper included kicks for extra points (PATs), which are made some 99% of the time. And there were lots of PAT attempts in the sample, even more than field goal attempts. At first I thought those could confuse the results. If there were no clutch factor, you'd expect exactly one clutch PAT to be missed. What if, by random chance, there were two instead? That would imply a large odds ratio factor for the PATs, based only on one extra miss, which wouldn't be statistically significant at all.

Could that screw up the overall results? I did a little check, and I don't think it could. I think the near-100% conversion rate for PATs is pretty much ignored by the logistic regression. But I'm not totally certain of that, so I thought I'd mention it here anyway.

5. The authors found that the odds of making a PAT were very much higher than the odds of making a field goal of exactly the same distance -- an odds ratio of 3.52. That means that if the odds of making a PAT are 100:1, the odds of making the same field goal are only 28:1.
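In probability terms, that gap looks like this. It's a small sketch: the 100:1 starting odds are just the illustrative figure above, not a measured rate, and the names are mine.

```python
PAT_VS_FG_ODDS_RATIO = 3.52  # odds ratio reported in the paper


def odds_to_prob(odds):
    """Convert odds into a success probability."""
    return odds / (1.0 + odds)


pat_odds = 100.0                           # illustrative PAT odds from the text
fg_odds = pat_odds / PAT_VS_FG_ODDS_RATIO  # same-distance field goal

print(f"PAT: {pat_odds:.0f}:1 -> {odds_to_prob(pat_odds):.1%}")
print(f"FG:  {fg_odds:.0f}:1 -> {odds_to_prob(fg_odds):.1%}")
```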

What could be causing that difference? It could be a problem with the model, or it could be that there is indeed something different about a PAT attempt.

What could be different about a PAT attempt? Well, perhaps for an FG attempt, both teams are trying harder to avoid taking a penalty. For the defensive team, a penalty on fourth down could give the kicking team enough yards for a first down, which could turn the FG into a TD. For the offensive team, a penalty might move them out of field goal range completely. Those situations don't apply when kicking a PAT.

In clutch situations, the incentives would be different still. Suppose it's a tie game with one second left on the clock, and a 25-yard attempt coming. An offensive 10-yard penalty would hurt a fair bit: it would turn a 90% kick into an 80% kick, say. A defensive penalty wouldn't hurt as much, though: it might only turn the 90% kick into a 95% kick.

Normally, a defensive penalty hurts more than an offensive penalty: it could create a first down, rather than just a more difficult kick. But in late-game situations, an offensive penalty hurts more than a defensive penalty: it lowers the success rate by more than a defensive penalty raises it.

Therefore, in a clutch situation, could it be that FGs are intrinsically more difficult, just because the offense has to play more conservatively, but the defense can play more aggressively?

## 19 Comments:

According to their definition, the first FG attempt of a 0-0 game in the first quarter is considered "high pressure." I don't understand how papers like this get published.

From 2000-2008, defining clutch as FG attempts within 3 pts in the last 3 minutes of a game, there is no detectable difference in FG success. link

Besides, as you point out, even if there were a difference, it could be explained by normal effects other than choking. FG kickers are fairly indistinguishable, except in terms of range. In true clutch situations, kickers with shorter ranges are often asked to attempt kicks from distances they would not normally attempt. So we might expect non-clutch attempts to be biased at longer ranges with better kickers.

It's interesting that kickers are indistinguishable except for range. You'd think there'd be some human beings with high accuracy who never miss from 35 ... like foul shooters who shoot 95%.

But since kickers are being selected only on their kicking, maybe they ALL shoot 95% ... from 20 yards.

Brian, in your opinion, are they really indistinguishable, or is it just that the sample size is too small? If you needed a 30 yard kick, would you really not care which of the 30 kickers you happened to have that day?

Brian: Have you ever studied the impact of opposing teams calling a timeout (or multiple timeouts) prior to a kick, intended to increase the "choke" tendency of kickers? Is there any evidence that the strategy works?

To be fair, the Bilder-Loughin paper didn't really set out to measure clutch or choke kicking, it merely tested for many factors that affect FG conversions. Lead-changing and time happened to be two variables that turned out significant.

As to point 1, there's evidence to suggest that there aren't significant skill differences among NFL kickers. That was the conclusion of one of the references cited in the Bilder-Loughin study, another article in Chance titled "The Best NFL field goal kickers: are they lucky or good?" I couldn't find that paper online, but if you're interested I could provide it to you.

My own research shows that from 1993-2009, lead-changing FG attempts in the last two minutes or overtime are converted at a 73% rate (606/828). Although these kicks do tend to be slightly longer, the success rate seems quite low compared to overall averages and doesn't appear to be consistent with Brian's findings.

Guy, research on "icing" the kicker is mixed.

http://en.wikipedia.org/wiki/Icing_the_kicker

http://www.maa.org/mathland/mathtrek_11_15_04.html

http://sportsillustrated.cnn.com/2005/writers/dr_z/01/21/mailbag.z/index.html

Sorry, that last URL got cut off, so I linked it on my name.

A few years ago, a researcher published a study that claimed icing the kicker did work, but Paul Zimmerman at SI found that the effect disappeared with a larger (and slightly redefined) data set.

I don't think either of them really had the correct data set to measure icing, but I suspect if you did the sample size would be too small to find significance. Nobody ever calls time out with two minutes left; they would rather save it for their own offense, and that likely dilutes both studies with spurious data.

Also not mentioned--on a (kicking) PAT, the ball is centered. On a field goal, it is not, unless the kicking team centers it with a play. Trying to kick the FG from the left or right hash increases the difficulty slightly. It would be interesting to know what the conversion percentages of FGs are depending on whether the ball is centered or on the hash. (IMO, right hash is slightly more difficult, because of the larger # of right-footed kickers.)

Also--"clutch" kicks AT THE END OF THE FIRST HALF. At the end of the game, a coach will try the play that gives him the best chance to win. At the end of the FIRST half, he may try a long FG, knowing that a miss won't hurt him--I mean, "Hail Marys" aren't exactly high percentage plays. If you get a long FG right before the half, everybody feels better about the 2nd half. If you miss the 55 yarder, nobody feels bad because it wasn't a high-percentage kick anyway.

Jim, point taken about the authors not specifically setting out to measure clutch kicking. They do, however, say, "This result may indicate a decrease in performance for placekickers in "clutch" situations."

Anonymous: Thanks, I hadn't thought of the position of the ball being different for PATs. That probably explains it.

They don't call it clutch, but they do call it high pressure, which is still incorrect. A single referee (the academic kind, not a game official) with any football knowledge would kick the paper back and say, "This is fine, but how about you add an analysis that more narrowly defines high pressure." It would be an easy fix.

I think part of the problem is a faith in the magical powers of multivariate regression. The authors did consider time in their model, but as a strictly linear function, which they ultimately discarded.

To answer Phil's question, it's the low sample size that prevents us from telling truly good kickers from bad. That's what I meant by "indistinguishable," instead of saying they're all the same. It would be like trying to compare batters who have only 30 ABs per yr, except it's even harder due to the varying attempt distances FG kickers face.

Thanks, Brian.

I did wonder at one point why teams didn't bother trying to identify the better kickers, even by just making them kick a couple of hundred times in practice.

It seems strange that placekicker is a position where nobody really knows who's good and who's not. It would cost teams so little to work on trying to figure it out, at a huge benefit if they're successful.

And, Brian, good point about time being linear. What their model was saying, if I understand it, is if kicking with 1 minute left had X percent better odds than kicking with 30 seconds left, then kicking with 14 minutes left would have (X to the power of 27) better odds.

That doesn't really make sense. They should have used a dummy variable for "late in the game", or something like that.
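A quick sketch of how that compounds, in Python. The per-30-second factors are hypothetical values I picked for illustration; the paper's actual time coefficient isn't quoted here.

```python
def compounded_odds_factor(per_step_factor, seconds_a, seconds_b, step_seconds=30):
    """Odds factor between two clock times when time enters the logit linearly.

    A linear term means every `step_seconds` of extra time multiplies the
    odds by the same `per_step_factor`, so the factors compound.
    """
    steps = (seconds_b - seconds_a) / step_seconds
    return per_step_factor ** steps


# From 30 seconds left to 14 minutes (840 seconds) left is 27 half-minute steps.
steps = (840 - 30) / 30

for x in (1.02, 1.05):  # hypothetical odds factors per 30 seconds
    total = compounded_odds_factor(x, 30, 840)
    print(f"factor {x:.2f} per 30 seconds -> {total:.2f} over {steps:.0f} steps")
```

Even a small per-step effect grows large over a full quarter, which is why a "late in the game" dummy seems like the more sensible choice.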

I'm guessing that at the time they did the research, it didn't occur to the authors that a finding of clutch/choking would be so enlightening (or controversial) so they didn't bother to explore the issue further. That's why it was only briefly mentioned and was not a central part of the study.

I've often thought a team should try the idea of evaluating kickers by having them kick hundreds of times in practice, but then I wondered whether fatigue would set in fairly quickly, making it a not very realistic estimation of kicking 2 or 3 times a game with fresh legs.

It seems that in sports study after sports study, the researchers use a much smaller sample size than is readily available. Why did they just stick to the 1995 season? If they found something interesting, wouldn't it have piqued their interest to look at other years around 1995? Maybe a 5-year stretch, thus increasing their sample size and making their argument more robust? Or (as I truly suspect) is it that the small sample size is the main reason the results look interesting, and any extra work that increases sample size actually screws them since now they won't have anything interesting to present?

Anon, were you around in 1995? Getting that kind of data back then was extremely difficult and may have cost the authors a small fortune via a third party service. It wasn't until 1996-97 that the NFL started publishing Gamebooks on their site, and even then it was very incomplete and yanked off the site a few months later. Only by around 2000 did the league finally start publishing their data with any consistency and actually leaving it on their site.

There isn't a difference between an extra point attempt and a 20 yard field goal. Over the last 10 years 99.4% of all extra point kicks were successful, 99.2% of 20 yard field goals were successful. Only 2 20 yard field goals were missed so I think based on the (much) smaller sample size of 20 yard field goals that it's the distance of the kick that's important, not hash mark or being a field goal instead of an extra point attempt.

Penalties are rare on short field goal attempts. In the last 10 years only two successful 20 yard field goals were overturned by penalty, 1 penalty by the offense and 1 by the defense.

------

Place kickers have gotten much more accurate. A study looking at 1995 games is not at all representative of how kickers in today's games perform, just as 1995 kickers were much better than 1975 kickers.

Anonymous,

Thanks! That 20-yard stat, is that online somewhere? Is it FG attempts from *exactly* 20 yards, or is it 18-22 combined or something like that?

Football-reference.com only has 10-19 and 20-29, which makes comparisons difficult.

Jim, sorry for my skepticism, but methinks if they got 1995 data, it couldn't have been that much worse to expand to a couple of years. I have no idea how much that data cost them as far as time and money is concerned.

Or better yet, if the results were similar for the 2007-2009 seasons, and assuming the data is much easier to gather now, do you think these guys would be trumpeting their paper and writing a new one? You betcha!

By my observation, academics generally don't obsess over sports statistics in great detail the way we amateurs do. By the time this paper was published, I'm sure the authors had long moved on to something else related to their fields, and probably nothing to do with sports. Their failure to obtain a larger and/or more recent data set likely has more to do with their own interest and the lack of demand than anything more sinister.

I'd love to see someone publish a more comprehensive study on field goal kicking that addresses some of the weaknesses as well as the questions brought up here and elsewhere. That was part of the motivation for bringing this study to people's attention. If I had done so five or ten years ago when I first read the paper, I doubt anyone in the blogosphere/web would have noticed or cared. The fact that we're even discussing it represents enormous progress, IMHO.

Phil- I looked at FGs of exactly 20 yards. I don't have a link for where you can look up field goals by a specific distance.
