Long tee shots: how much do they improve PGA golf scores?
PGA golfers are hitting for distance better than ever before. Is that contributing to an improvement in their scores? Or, by going for the long drive, are they losing so much accuracy that the increased distance doesn't help?
An article (fortunately available online) in the new issue of Chance Magazine (published by the American Statistical Association) tries to answer that question. It's called "Today's PGA Tour Pro: Long but Not so Straight," by Erik L. Heiny.
Heiny starts by showing us that average PGA driving distance has increased substantially in recent years. Between 1992 and 2003 (these seasons are the ones used throughout the paper), Figure 1 shows an increasing trend from 260 yards to 287. Driving accuracy, though, is flatter; Figure 19 shows a relatively stable trend up to 2001, at which point accuracy suddenly drops from 68% in 2001 to 66% in 2003.
Then, Heiny runs some straight regressions between various aspects of golf performance, and gives us the year-to-year correlations. Unfortunately, he doesn't give us the regression equations, which is where most of the knowledge is. For instance, driving distance is positively correlated with score. But what's the size of the effect? If Phil Mickelson increases his distance by 5 yards, what can he expect as an improvement? Half a stroke? One stroke? Two strokes? This is important and useful information, but Heiny doesn't tell us.
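To see why the correlation alone can't answer the "how many strokes?" question, here is a small sketch using made-up data (none of these numbers come from Heiny's article; the 280-yard mean, the 71.0 baseline and the two slopes are all assumptions for illustration). Two simulated tours have about the same distance-vs-score correlation, but the payoff per extra yard differs by a factor of ten.

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_and_slope(true_slope, n=5000):
    # Synthetic data only: x stands in for driving distance (yards),
    # y for scoring average. The noise is scaled with the slope, so the
    # correlation stays about the same while the effect size changes.
    x = rng.normal(280.0, 8.0, n)
    noise = rng.normal(0.0, abs(true_slope) * 8.0 * 3.0, n)
    y = 71.0 + true_slope * (x - 280.0) + noise
    r = np.corrcoef(x, y)[0, 1]       # correlation coefficient
    slope = np.polyfit(x, y, 1)[0]    # regression slope, strokes per yard
    return r, slope

r_small, slope_small = corr_and_slope(-0.01)  # hypothetical small effect
r_big, slope_big = corr_and_slope(-0.10)      # hypothetical large effect

print(f"small effect: r = {r_small:.2f}, slope = {slope_small:.3f} strokes/yard")
print(f"big effect:   r = {r_big:.2f}, slope = {slope_big:.3f} strokes/yard")
```

The two correlations come out nearly identical, while the slopes, which are what the regression equation would report, are very different. That's the information the article leaves out.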
I found the correlations for the different variables much more interesting than the year-to-year trends, which are mostly flat, with occasional exceptions in 2003. Given that 2003 is also the year driving accuracy dropped, you've got to wonder if something specific happened that year to make all these things happen at once.
In any case, the most interesting correlations (Figures 2-15) are:
Driving distance vs. scoring
Driving accuracy vs. scoring
Greens in regulation vs. scoring
Putts per round vs. scoring
Sand saves vs. scoring
Scrambling vs. scoring
Bounce back vs. scoring
(I should give you the definitions for some of the less obvious variables. "Driving accuracy" is percentage of drives (excluding par 3s) that landed on the fairway. "Greens in regulation" is percentage of holes in which the green was reached in (par – 2) strokes or fewer. "Sand saves" is percentage of balls in sand traps holed in two shots or fewer (I think). "Scrambling" is how often a par (or better) was made, as a percentage of holes where the green was *not* reached in regulation. And "bounce back" is the percentage of times a player gets a birdie or better after a hole of bogey or worse.)
I was surprised at some of the correlations. Which of the seven factors above would you think had the biggest influence on score? I would have thought "greens in regulation" and "putts per round." I was half right. Here are the approximate numbers, as I eyeballed them from the graphs:
0.2 Driving distance vs. scoring
0.3 Driving accuracy vs. scoring
0.7 Greens in regulation vs. scoring
0.3 Putts per round vs. scoring
0.3 Sand saves vs. scoring
0.6 Scrambling vs. scoring
0.5 Bounce back vs. scoring
(The author also repeats these correlations for money instead of scoring, but, for the most part, the conclusions are the same, so I won't mention them further.)
From these correlations, Heiny draws some tentative conclusions about how driving distance has affected play. For instance, in 2003, the correlation between driving accuracy and scoring decreased from 0.2 to 0.0. Heiny writes that this suggests that
"... with the driver going so far, it just didn’t matter whether the player was in the fairway. With longer drives, players could get close enough to the green to play short irons or wedges. Even from rough, they could control the shot into the green."
I'm not sure I'd agree with that. Driving distance went up only 6 yards that year, and I'd be more inclined to view the big drop in correlation as random. Indeed, in the first ten years of the study, distance increased 20 yards, with little change in accuracy or correlation.
In any case, the author proceeds to multiple regressions, where he predicts a player's score based on the seven variables above. He runs one regression for each year.
Again, we don't get any equations, just significance levels. Summarizing the 12 regressions, here's what the article shows:
Highly significant: driving distance
Highly significant: greens in regulation (GIR)
Highly significant: putts per round
Highly significant: scrambling
Occasionally significant (3 years out of 12): driving accuracy
Occasionally significant (5/12): bounce back
Not significant (0/12): sand saves
Strangely enough (to me), a couple of results were slightly different when predicting (the logarithm of) money winnings instead of score: driving accuracy went from 3/12 to 7/12, and scrambling went from 12/12 (all of which were < .0001) to 2/12 (many of which were greater than 0.5, which means *negative* correlation!) I can think of a few reasons why this might occur (high money might be correlated with a longer course, which means higher scores; high money might mean better-caliber golfers, who might have different characteristics; and so on). In the money regressions, driving distance was extremely significant (p < .0001) up to 2000, when it started becoming less important, hitting p = .0545 (not significant at 5%) in 2003. Also, driving accuracy seemed to be more significant in the early years than the later years. This prompts Heiny to say that
"it seems to be evident that accuracy off the tee is becoming less important as driving distances increase."
But it seems to me that you can't really draw any conclusions like that from the regression, because of the choice of variables.
For instance: how would an increase in driving accuracy improve scoring? It would do so by increasing the chances of landing on the green in regulation. But "greens in regulation" is another variable being used in the regression! And so even if driving accuracy doesn't come out as significant, that might be because most of its influence is on the "greens in regulation" variable.
We know that if you're a pitcher, giving up a lot of hits increases your chances of losing the game, right? But if we run a regression on losses, and include hits *and runs*, hits will come out as completely insignificant. Why? Because your chance of winning depends on how many runs you give up – but, if you give up four runs, *it doesn't matter* how many hits you allowed in producing those four runs. Hits and runs don't independently cause wins: hits cause runs, and runs cause wins. Hits are irrelevant if you know runs.
Hits -----> Runs -----> Wins
The same kind of thing holds for driving accuracy:
Accuracy -----> Greens in Regulation -----> Score
If Vijay Singh hits 72% of greens in regulation, it doesn't matter much how he got to that 72%: his score will be roughly the same if he got it by inaccurate drives with good recoveries from the rough, or accurate drives with average approach shots. And so, just like hits are irrelevant if you know runs, accuracy is irrelevant if you know GIR.
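Here's a quick simulation of that chain, as a sketch only. The data is synthetic, the variables are standardized, and the 0.6 and -0.5 effect sizes are numbers I made up; nothing here is taken from the article. Accuracy affects score only through GIR, and the regression is run both without and with GIR as a predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000  # hypothetical player-seasons

# Synthetic causal chain: accuracy -> GIR -> score (all standardized units).
accuracy = rng.normal(0.0, 1.0, n)
gir = 0.6 * accuracy + rng.normal(0.0, 1.0, n)   # GIR depends partly on accuracy
score = -0.5 * gir + rng.normal(0.0, 1.0, n)     # score depends only on GIR

def ols(y, *predictors):
    # Ordinary least squares with an intercept; returns the slopes only.
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

# Regression on accuracy alone: recovers the total effect (about 0.6 * -0.5).
(total_effect,) = ols(score, accuracy)

# Regression on accuracy and GIR together: accuracy's coefficient collapses.
direct_effect, gir_effect = ols(score, accuracy, gir)

print(f"accuracy alone:    {total_effect:+.3f}")
print(f"accuracy with GIR: {direct_effect:+.3f} (GIR: {gir_effect:+.3f})")
```

In this toy setup accuracy really does lower scores, yet its coefficient sits near zero as soon as GIR is in the regression, which is the hits-and-runs pattern again.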
The analogy isn't perfect: accurate tee shots affect more than just GIR. They may lead to a better position on the green, which will affect putts per round (which is also a variable in this regression). In cases where you miss the green, a more accurate tee shot might lead to an easier scramble (also a variable), or a better lie in the sand (again a variable).
Since all those other things are accounted for in the regression, the surprise is that accuracy would ever be significant at all! There must be other ways in which accuracy improves scores. What might those be?
We can start by noting that score can be computed *exactly* from these four variables:
A = Percentage of GIR
B = Percentage of greens in one stroke less than regulation
C = Number of putts taken per hole
D = On missed GIRs, average number of extra strokes taken to get on to the green
Then, the average hole's score relative to par simply equals (C – 2) - B + D(1-A). Simplifying gives
Exact (average) Score = C - B + D - DA - 2
So if you ran a regression on B, C, D, and DA, you would predict score perfectly -- you'd get a correlation of 1, and the above regression equation. (If you ran it on A, B, C and D, you'd come close to 1, but wouldn't hit it exactly, because there's an interaction between D and A that you wouldn't capture.) And adding other variables to the regression – such as accuracy – would do almost nothing, because accuracy "works" by causing a change in one or more of A, B, C, or D.
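That claim is easy to check numerically. The sketch below draws A, B, C and D at random for 200 hypothetical players (the ranges are my own guesses at plausible values, not figures from the article), builds score from the formula, and then fits both regressions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200  # hypothetical players

# Assumed, made-up ranges for the four variables defined above.
A = rng.uniform(0.55, 0.75, n)   # fraction of greens hit in regulation
B = rng.uniform(0.00, 0.05, n)   # fraction of greens hit in one under regulation
C = rng.uniform(1.55, 1.75, n)   # putts per hole
D = rng.uniform(1.00, 1.40, n)   # extra strokes to reach the green on missed GIRs

score = C - B + D - D * A - 2    # average score per hole, relative to par

def fit(y, *predictors):
    # Least-squares fit with an intercept; returns (R squared, coefficients).
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var(), coef

r2_exact, coef_exact = fit(score, B, C, D, D * A)   # includes the D*A term
r2_linear, _ = fit(score, A, B, C, D)               # no interaction term

print("R^2 with B, C, D, DA:", r2_exact)
print("coefficients (intercept, B, C, D, DA):", np.round(coef_exact, 6))
print("R^2 with A, B, C, D: ", r2_linear)
```

With the D*A term the fit is perfect and the coefficients are just the ones in the formula (-2, -1, 1, 1, -1). With A, B, C and D it comes very close to 1 but falls short, because a straight-line fit can't capture the product of D and A.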
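Now, Heiny didn't actually include all of A, B, C and D in his regression. But he came close. His regression did include:
Greens in regulation, which is A.
Putts per round, which is C (multiplied by the number of holes).
Scrambling, which is significantly correlated with D and C.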
So what's left that he didn't include? B, and part of D. So when one of the other variables, such as driving accuracy, comes out significant, it must be because of its effect on B or D:
-- it increases the chances of getting to the green in less than regulation (B); and
-- it increases the chance that if you miss the green in regulation, you'll get back on in few strokes (D).
Looked at in that light, it's easy to see why driving distance is significant: to get yourself a B, you need to hit a par-5 green in two strokes, so you better be hitting long. And it's easy to see why accuracy is significant – landing on the fairway means you avoided losing your ball or hitting in the water, saving yourself a lot of D.
(By the way, could it be that "sand saves" comes out insignificant because all sand saves are actually scrambles? Or does the definition of "scramble" explicitly exclude bunker shots?)
But those minor results are not what the study was looking for. The point of the study was to see if distance and accuracy caused scores to drop in general, not to see if it caused scores to drop only because of "greens in better than regulation" and "water hazards avoided".
And if you want the overall effects of distance and accuracy, you can't include any variables that are *also caused* by distance and accuracy. Run your regression on distance and accuracy only, and see what you get.
P.S. I wrote about this kind of flaw earlier, probably more understandably.
P.S. A PGA study once showed that "driving the green" -- trying for a "B" by going for the green in less than regulation -- was associated with lower scores.