### Long tee shots: how much do they improve PGA golf scores?

PGA golfers are hitting for distance better than ever before. Is that contributing to an improvement in their scores? Or, by going for the long drive, are they losing so much accuracy that the increased distance doesn't help?

An article (fortunately available online) in the new issue of Chance Magazine (published by the American Statistical Association) tries to answer that question. It's called "Today's PGA Tour Pro: Long but Not so Straight," by Erik L. Heiny.

Heiny starts by showing us that average PGA driving distance has increased substantially in recent years. Between 1992 and 2003 (these seasons are the ones used throughout the paper), Figure 1 shows an increasing trend from 260 yards to 287. Driving accuracy, though, is flatter; Figure 19 shows a relatively stable trend up to 2001, at which point accuracy suddenly drops from 68% in 2001 to 66% in 2003.

Then, Heiny runs some straight regressions between various aspects of golf performance, and gives us the year-to-year correlations. Unfortunately, he doesn't give us the regression equations, which is where most of the knowledge is. For instance, driving distance is positively correlated with score. But what's the size of the effect? If Phil Mickelson increases his distance by 5 yards, what can he expect as an improvement? Half a stroke? One stroke? Two strokes? This is important and useful information, but Heiny doesn't tell us.
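To see why a correlation alone can't answer the "how many strokes?" question, here's a quick sketch using made-up numbers (not Heiny's data). The names `score_small`, `score_large` and `corr_and_slope` are mine, and the two "worlds" are pure assumptions: in one, an extra yard is worth 0.01 strokes, in the other 0.10. Since a lower score is better, the sign comes out negative here; it's the size I care about.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000  # synthetic players, not PGA data

distance = rng.normal(280.0, 8.0, n)  # average drive, in yards
noise = rng.normal(0.0, 1.0, n)

# Two hypothetical worlds: the strokes saved per extra yard differ tenfold,
# but the noise scales along with the effect, so the correlation is the same.
score_small = 71.0 - 0.01 * (distance - 280.0) + 0.4 * noise
score_large = 71.0 - 0.10 * (distance - 280.0) + 4.0 * noise


def corr_and_slope(x, y):
    """Return (correlation, regression slope of y on x)."""
    r = np.corrcoef(x, y)[0, 1]
    slope = np.polyfit(x, y, 1)[0]  # strokes per yard
    return r, slope


r_small, slope_small = corr_and_slope(distance, score_small)
r_large, slope_large = corr_and_slope(distance, score_large)

print("small-effect world: r =", round(r_small, 3), " slope =", round(slope_small, 4))
print("large-effect world: r =", round(r_large, 3), " slope =", round(slope_large, 4))
```

Both worlds should report the same correlation, yet five extra yards is worth about a twentieth of a stroke in one and about half a stroke in the other. That difference lives in the regression equation, which is the part the article leaves out.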

I found the correlations for the different variables much more interesting than the year-to-year trends, which are mostly flat, with occasional exceptions in 2003. Given that 2003 is also the year driving accuracy dropped, you've got to wonder if something specific happened that year to make all these things happen at once.

In any case, the most interesting correlations (Figures 2-15) are:

Driving distance vs. scoring

Driving accuracy vs. scoring

Greens in regulation vs. scoring

Putts per round vs. scoring

Sand saves vs. scoring

Scrambling vs. scoring

Bounce back vs. scoring

(I should give you the definitions for some of the less obvious variables. "Driving accuracy" is percentage of drives (excluding par 3s) that landed on the fairway. "Greens in regulation" is percentage of holes in which the green was reached in (par – 2) strokes or fewer. "Sand saves" is percentage of balls in sand traps holed in two shots or fewer (I think). "Scrambling" is how often a par (or better) was made, as a percentage of holes where the green was *not* reached in regulation. And "bounce back" is the percentage of times a player gets a birdie or better after a hole of bogey or worse.)

I was surprised at some of the correlations. Which of the seven factors above would you think had the biggest influence on score? I would have thought "greens in regulation" and "putts per round." I was half right. Here are the approximate numbers, as I eyeballed them from the graphs:

0.2 Driving distance vs. scoring

0.3 Driving accuracy vs. scoring

0.7 Greens in regulation vs. scoring

0.3 Putts per round vs. scoring

0.3 Sand saves vs. scoring

0.6 Scrambling vs. scoring

0.5 Bounce back vs. scoring

(The author also repeats these correlations for money instead of scoring, but, for the most part, the conclusions are the same, so I won't mention them further.)

From these correlations, Heiny draws some tentative conclusions about how driving distance has affected play. For instance, in 2003, the correlation between driving accuracy and scoring decreased from 0.2 to 0.0. Heiny writes that this suggests that

"... with the driver going so far, it just didn’t matter whether the player was in the fairway. With longer drives, players could get close enough to the green to play short irons or wedges. Even from rough, they could control the shot into the green."

I'm not sure I'd agree with that. Driving distance went up only 6 yards that year, and I'd be more inclined to view the big drop in correlation as random. Indeed, in the first ten years of the study, distance increased 20 yards, with little change in accuracy or correlation.

In any case, the author proceeds to multiple regressions, where he predicts a player's score based on the seven variables above. He runs one regression for each year.

Again, we don't get any equations, just significance levels. Summarizing the 12 regressions, here's what the article shows:

Highly significant: driving distance

Highly significant: greens in regulation (GIR)

Highly significant: putts per round

Highly significant: scrambling

Occasionally significant (3 years out of 12): driving accuracy

Occasionally significant (5/12): bounce back

Not significant (0/12): sand saves

Strangely enough (to me), a couple of results were slightly different when predicting (the logarithm of) money winnings instead of score: driving accuracy went from 3/12 to 7/12, and scrambling went from 12/12 (all of which were < .0001) to 2/12 (many of which were greater than 0.5, which means *negative* correlation!). I can think of a few reasons why this might occur (high money might be correlated with a longer course, which means higher scores; high money might mean better caliber golfers, which might have different characteristics; and so on). In the money regressions, driving distance was extremely significant (p < .0001) up to 2000, when it started becoming less important, hitting p = .0545 (not significant at 5%) in 2003. Also, driving accuracy seemed to be more significant in the early years than the later years. This prompts Heiny to say that

"it seems to be evident that accuracy off the tee is becoming less important as driving distances increase."

But it seems to me that you can't really draw any conclusions like that from the regression, because of the choice of variables.

For instance: how would an increase in driving accuracy improve scoring? It would do so by increasing the chances of landing on the green in regulation. But "greens in regulation" is another variable being used in the regression! And so even if driving accuracy doesn't come out as significant, that might be because most of its influence is on the "greens in regulation" variable.

We know that if you're a pitcher, giving up a lot of hits increases your chances of losing the game, right? But if we run a regression on losses, and include hits *and runs*, hits will come out as completely insignificant. Why? Because your chance of winning depends on how many runs you give up – but, if you give up four runs, *it doesn't matter* how many hits you allowed in producing those four runs. Hits and runs don't independently cause wins: hits cause runs, and runs cause wins. Hits are irrelevant if you know runs.

Hits -----> Runs -----> Wins
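Here's a minimal simulation of that chain, just to show the mechanism. Everything in it is an assumption of mine: the data is synthetic, `loss_index` is a made-up continuous stand-in for "tendency to lose," and the coefficients (0.5 runs per hit, 0.1 loss units per run) are arbitrary. The only thing built in on purpose is that hits affect losing *only* through runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # synthetic pitcher-games, not real data

# Causal chain: hits -> runs -> loss_index (no direct hits -> loss_index path).
hits = rng.normal(9.0, 3.0, n)
runs = 0.5 * hits + rng.normal(0.0, 1.0, n)
loss_index = 0.1 * runs + rng.normal(0.0, 0.3, n)


def ols(y, *predictors):
    """Least-squares coefficients; the intercept comes first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


hits_only = ols(loss_index, hits)        # total effect of hits
with_runs = ols(loss_index, hits, runs)  # effect of hits, holding runs fixed

print("hits alone:       ", round(hits_only[1], 3))
print("hits, given runs: ", round(with_runs[1], 3))
print("runs, given hits: ", round(with_runs[2], 3))
```

By construction, the coefficient on hits should land near 0.05 (that is, 0.5 × 0.1) when hits is the only predictor, and near zero once runs is in the regression. Hits didn't stop mattering; the regression just credited everything to the variable sitting closer to the outcome.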

The same kind of thing holds for driving accuracy:

Accuracy -----> Greens in Regulation -----> Score

If Vijay Singh hits 72% of greens in regulation, it doesn't matter much how he got to that 72%: his score will be roughly the same if he got it by inaccurate drives with good recoveries from the rough, or accurate drives with average approach shots. And so, just like hits are irrelevant if you know runs, accuracy is irrelevant if you know GIR.

The analogy isn't perfect: accurate tee shots affect more than just GIR. They may lead to a better position on the green, which will affect putts per round (which is also a variable in this regression). In cases where you miss the green, a more accurate tee shot might lead to an easier scramble (also a variable), or a better lie in the sand (again a variable).

Since all those other things are accounted for in the regression, the surprise is that accuracy would ever be significant at all! There must be other ways in which accuracy improves scores. What might those be?

We can start by noting that score can be computed *exactly* from these four variables:

A = Percentage of GIR

B = Percentage of greens in one stroke less than regulation

C = Number of putts taken per hole

D = On missed GIRs, average number of extra strokes taken to get on to the green

Then, the average hole's score relative to par simply equals (C - 2) - B + D(1 - A). Simplifying gives

Exact (average) Score = C - B + D - DA - 2

So if you ran a regression on B, C, D, and DA, you would predict score perfectly -- you'd get a correlation of 1, and the above regression equation. (If you ran it on A, B, C and D, you'd come close to 1, but wouldn't hit it exactly, because there's an interaction between D and A that you wouldn't capture.) And adding other variables to the regression – such as accuracy – would do almost nothing, because accuracy "works" by causing a change in one or more of A, B, C, or D.
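That claim is easy to check with a sketch. The "players" below are synthetic, and the ranges I picked for A, B, C and D are guesses for illustration, not PGA figures; `fit` is just a small helper of mine around NumPy's least-squares routine. The score is computed from the identity above, so the only question is what each regression recovers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # synthetic "players"; the ranges below are made up

A = rng.uniform(0.55, 0.75, n)  # share of greens hit in regulation
B = rng.uniform(0.00, 0.05, n)  # share of greens hit one stroke early
C = rng.uniform(1.60, 1.90, n)  # putts per hole
D = rng.uniform(1.00, 1.40, n)  # extra strokes to reach a missed green

score = (C - 2) - B + D * (1 - A)  # average score per hole, relative to par


def fit(y, *predictors):
    """Return (R-squared, coefficients with the intercept first)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var(), coef


r2_exact, coef_exact = fit(score, B, C, D, D * A)  # includes the D*A term
r2_close, _ = fit(score, A, B, C, D)               # leaves the interaction out

print("B, C, D, DA:", round(r2_exact, 6), np.round(coef_exact, 3))
print("A, B, C, D: ", round(r2_close, 6))
```

The first regression should hand back the equation itself (intercept -2, then -1, 1, 1 and -1 on B, C, D and DA) with a perfect fit. The second should fall just short of a perfect fit, because a straight-line model can't represent the D-times-A product.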

Now, Heiny didn't actually include all of A, B, C and D in his regression. But he came close. His regression did include:

A

C

Scrambles, which is significantly correlated with D and C.

So what's left that he didn't include? B, and part of D. So when one of the other variables, such as driving accuracy, comes out significant, it must be because of its effect on B or D:

-- it increases the chances of getting to the green in less than regulation (B); and

-- it increases the chance that if you miss the green in regulation, you'll get back on in few strokes (D).

Looked at in that light, it's easy to see why driving distance is significant: to get yourself a B, you need to hit a par-5 green in two strokes, so you better be hitting long. And it's easy to see why accuracy is significant – landing on the fairway means you avoided losing your ball or hitting in the water, saving yourself a lot of D.

(By the way, could it be that "sand saves" comes out insignificant because all sand saves are actually scrambles? Or does the definition of "scramble" explicitly exclude bunker shots?)

But those minor results are not what the study was looking for. The point of the study was to see if distance and accuracy caused scores to drop in general, not to see if it caused scores to drop only because of "greens in better than regulation" and "water hazards avoided".

And if you want the overall effects of distance and accuracy, you can't include any variables that are *also caused* by distance and accuracy. Run your regression on distance and accuracy only, and see what you get.

P.S. I wrote about this kind of flaw earlier, probably more understandably.

P.P.S. A PGA study once showed that "driving the green" -- trying for a "B" by going for the green in less than regulation -- was associated with lower scores.


## 6 Comments:

Great write-up, Phil. Your point on the intercorrelation between the variables is absolutely spot on.

The other one to watch is scrambling and putts per round.

Players who scramble more are likely to have fewer putts per round.

Given all these inter-relationships the value of any multiple regression on the variables is close to zero.

Still, nice work delving into how "accuracy", in this definition, results in an overall lower score.

-beamer

Right, scrambling leads to lower putts per round, since a scramble, by definition, means only one putt.

BTW, Beamer, maybe you can answer this ... the point I'm trying to make is more than just correlation between the variables -- it's more like *causation* between the variables.

For instance, using hits and runs in a regression to find wins makes no sense, because hits *cause* runs. It's not because hits and runs are correlated (although they are).

There is probably a strong correlation between doubles and home runs, but it's perfectly legitimate to include both in a regression, because they are independent: doubles don't cause HRs, and HRs don't cause doubles. Rather, it's a third factor, "power," that causes both.

So is there something a little stronger than "intercorrelation" to describe the situation? A term that can differentiate between "A causes B" (so don't use both) and "a third factor causes both A and B" (so it's OK to use both)?

Phil -- the old correlation/ causation debate!

The reason why the hits -> runs -> wins regression doesn't work is, as you point out, that hits = runs = wins. Therefore runs is effectively a redundant variable and should be excluded.

I'm not sure I 100% buy your causation argument. Runs cause wins, but hits also cause wins. The more hits you have, the more wins you are likely to have. It is just that the correlation between runs and wins is much stronger than the one between hits and wins. This means that hits becomes a redundant variable in this model, as the coefficient is irrelevant. If you have doubles, home runs, etc. in the same regression, then runs will swamp those coefficients too. It IS all about the intercorrelation of the factors.

This example is just very extreme. Intercorrelations between the independent variables in the models aren't a disaster; they just increase the standard error of the coefficients. That is why, when you are trying to predict runs from hits, doubles, etc., the regression works. That is also why some of the coefficients can look off unless you have a huge sample size (e.g., triples).

By the way the word you may be looking for is "interdependence". If you ask yourself whether you'd use any of your independent variables to predict other independent variables then you are in trouble ....

-beamer

The classic "controlling for intermediate outcomes" fallacy.

In football research, I see this a lot. Models typically include measures of running, passing, and often first down conversion rates. But first down conversions are generally functions of running and passing. They're intermediate outcomes between running and passing and scoring (or winning or whatever the ultimate outcome of interest is.)

According to this amusing example, the fallacy is alive and well in even the most scrutinized peer-reviewed scientific research.

"Controlling for intermediate outcomes." I think that's it, thanks!

Not much of a catchy name though. :)

How about "Kitchen Sink" Regression?
