Does pace impact defensive efficiency? Don't use r-squared, use the regression equation
When someone runs a regression, they will wind up reporting a value for r or r-squared. If the value is small, they'll argue that the two variables don't really have much of a relationship. But that's not necessarily true.
Before I talk about why, I should say that if the result goes the other way – the r or r-squared is high, and statistically significant -- that *does* mean there's a strong relationship. If the correlation between cigarettes smoked and lung cancer is, say, 0.7, that's a pretty big number, and we can conclude that lung cancer and smoking are strongly related.
But a low value doesn't necessarily mean the opposite.
For instance: inspired by the smoking example, I looked at another lifestyle choice. Then, I ran a regression on expected remaining lifespan, based on that lifestyle choice. The results:
r = -.17, r-squared = .03
What is the effect of that lifestyle choice on lifespan? It looks like it should be small. After all, it "explains" only 3% of the variance in years lived.
But that wouldn't be correct. The lifestyle choice really does have a large effect on lifespan. In fact, the lifestyle choice I used in the equation is (literally) suicide.
Here's what I did. I took 999 random 40-year-olds, and assumed their expected remaining lifespan was 40 years, with an SD of about 8. Then, I assumed the 1,000th person jumped in front of a moving subway train, with an expected remaining lifespan of zero. (These numbers are made up, by the way.)
The results were what I showed above: an r of only –.17.
Why does this happen? It happens because the r, and the r-squared, do NOT measure whether suicide and lifespan are related. Rather, they measure something subtly different: whether suicide is a big factor in how long people live.
And suicide is NOT that big a factor in how long people live. Most people don't commit suicide; in my model, only 1 in 1000. The r-squared shows how much effect suicide has *as a percentage of all other factors*. Because there are so many other factors – in real life, heart disease and cancer are about 40 times as common as suicide -- the r-squared comes out small.
If you want to know the strength of the relationship between A and B, don't look at the r or the r-squared. Instead, look at the regression equation. In my suicide experiment, the equation turned out to be
Lifespan = 40.0 – 40.0 (suicide)
That is, exactly what you would expect: the lifespan is 40 years, but subtract 40 from that (giving zero) if you commit suicide.
And, even though the r-squared was only 0.03, that r-squared is statistically significant, at better than the 99.99% confidence level.
Again: the r-squared is heavily dependent on how "frequent" the lifestyle choice is in the population. But the significance level and the regression equation are not.
To prove it, let me rerun my experiment a few times, with different percentages of suicide in the population:
1 in 10,000: r-squared = .003; lifespan = 40.1 – 40.1 (suicide); significant at > 99.99%
1 in 1,000: r-squared = .03; lifespan = 40.0 – 40.0 (suicide); significant at > 99.99%
1 in 100: r-squared = .26; lifespan = 40.3 – 40.3 (suicide); significant at > 99.99%
1 in 10: r-squared = .71; lifespan = 37.3 – 37.3 (suicide); significant at > 99.7%
The r-squared varies a lot – but all the experiments tell you that suicide costs you 40 years of life, and that the result is statistically significant.
The moral of the story:
The r-squared (or r) does NOT tell you the extent to which A causes B, or even the strength of the relationship between A and B. It tells you the extent to which A explains B relative to all the other explanations of B.
If you want to quantify the effect a change in A has on B, do not look at the r or r-squared. Instead, look at the regression equation.

------
Which brings us to today's post at "The Wages of Wins." There, David Berri checks whether teams who play a fast-paced brand of basketball (as measured by possessions per game) wind up playing worse defense (as measured by points allowed per possession) because of it. Berri quotes Matthew Yglesias:
"For example, there’s a popular conception of a link between pace and defensive orientation — specifically the idea that teams that choose to play at a fast pace are sacrificing something in the defense department. On the most naive level, that’s simply because a high pace leads to more points being given up. But I think it’s generally assumed that it holds up in efficiency terms as well. The 2006-2007 Phoenix Suns, for example, were first in offensive efficiency, third in pace, and fourteenth in defense. But is this really true? If you look at the data season-by-season is there a correlation between pace and defense?"
Berri runs a regression for 34 years of team data. So, is there a relationship? He writes,
"The correlation coefficient between relative possessions and defensive efficiency is 0.17. Regressing defensive efficiency on relative possession reveals that there is a statistically significant relationship. The more possessions a team has per game - again, relative to the league average - the more points the team’s opponents will score per possession. But relative possessions only explains 2.8% of defensive efficiency. In sum, pace doesn’t tell us much about defensive efficiency ... " [emphasis mine]
But I don't think that's right. As we saw, the r-squared of 2.8% (or the r of 0.17) means only that, historically, pace is small *compared to other explanations of defensive efficiency.* And that makes sense: even if pace has a significant impact on defense, we'd expect other factors to be even more important. The players on the team, for instance, are a big factor. Luck is also a big factor. The coach's strategy probably has a large impact on defensive efficiency. Compared to all those things, pace is pretty minor. And we probably knew before we started that personnel matters more than pace.
And so I would guess that's not really what Yglesias wants to know. What I bet he's interested in, and what teams would be interested in, and what I'd be interested in, is this: if a team speeds up the pace by (say) 2 possessions per team per game, how much will its defense suffer? That's an important question: if you're evaluating how good a team (or player) is on defense, you want to know if you can take the stats at face value, or if you have to bump them up to compensate for fast play, or if you have to discount them for teams who play a little slower. It's like a park factor, but for defensive efficiency. The regression should be able to tell you just how big that park factor is.
That's the real question, and the r-squared doesn't answer it at all. Given the data Berri gives us, the effect of pace on defensive efficiency could be small, or it could be large. After all, the effect of suicide on lifespan was huge, even though the r-squared was small. And just like in the suicide case, if a lot more teams suddenly decide to start playing at a different pace, the r and r-squared will go up – but the relationship between pace and defense will likely not change.
To really understand what effect pace has on defense, we need the regression equation. Berri doesn't give it to us. He does tell us the result is statistically significant, so we do know there *is* some kind of non-zero effect. But without the equation, we don't know how big it is (or even whether it's positive or negative). All we know is that pace *does* significantly affect a team's defensive stats, and that the effect (as judged by statistical significance) appears to be real.