Rebounding results were not caused by position leakage
On the subject of negative correlations between rebounders, one thing occurs to me. It's possible that much of the effect could be accounted for by certain errors in the data. Specifically, if sometimes players are classified in the wrong position, that will cause a negative correlation even if there wouldn't otherwise be one.
Why? Well, suppose there's no correlation between sales of watercolor paints and eyeshadow. But they look the same, so sometimes eyeshadow sales are carelessly recorded as watercolor sales. Since the error takes from one column and adds to the other column *at the same time*, a negative in one column is always paired with a positive in the other, and a negative correlation results.
For a more extreme example, suppose that the recorder on odd-numbered days always records eyeshadow as watercolor, and the recorder on even-numbered days always records watercolor as eyeshadow. Then, the correlation will be very close to minus 1. That's because, if eyeshadow is positive, watercolor is always zero. And if eyeshadow is zero, watercolor is always positive.
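Here's a rough simulation of both versions of the paint example. Everything in it is made up for illustration: the sales figures (independent draws averaging 100 a day), the 30% chance of a careless day, and the assumption that half the eyeshadow sales get moved on such a day. It's a sketch of the mechanism, not a model of any real data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
days = 1000

# Hypothetical true daily sales, drawn independently (no real correlation).
watercolor = [rng.gauss(100, 10) for _ in range(days)]
eyeshadow = [rng.gauss(100, 10) for _ in range(days)]

# Careless recording (assumed rates): on about 30% of days, half the
# eyeshadow sales are booked as watercolor.
careless_w, careless_e = [], []
for w, e in zip(watercolor, eyeshadow):
    moved = 0.5 * e if rng.random() < 0.3 else 0.0
    careless_w.append(w + moved)
    careless_e.append(e - moved)

# Extreme recording: odd-numbered days book everything as watercolor,
# even-numbered days book everything as eyeshadow.
extreme_w, extreme_e = [], []
for day, (w, e) in enumerate(zip(watercolor, eyeshadow), start=1):
    if day % 2 == 1:
        extreme_w.append(w + e)
        extreme_e.append(0.0)
    else:
        extreme_w.append(0.0)
        extreme_e.append(w + e)

true_r = pearson(watercolor, eyeshadow)
careless_r = pearson(careless_w, careless_e)
extreme_r = pearson(extreme_w, extreme_e)
print(f"true: {true_r:+.2f}  careless: {careless_r:+.2f}  extreme: {extreme_r:+.2f}")
```

With these made-up settings, the true correlation comes out near zero, the careless version is clearly negative, and the extreme version is close to minus 1.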
Let me do a basketball example. Suppose that in a certain league, there are four teams, and the PFs and SFs have rebounding percentages as follows:
If you run a regression on those four rows, you will find there's no relationship at all -- your correlation coefficient will be exactly zero.
But, now, suppose that next year, everything is the same, except that on one of the teams, the PF and SF often play each other's position. In fact, they play each other's position 1/4 of the time. That team's PF now gets a "12" in three-quarters of the games and an "8" in the remaining one quarter, for an average of 11. The SF gets an "8" three quarters of the time, and a "12" the rest, for an average of 9.
Now, suppose the dataset isn't smart enough to know to create a blended PF and SF from the two players' stats, because it doesn't know exactly how many minutes or possessions each played at each position. It should be 3/4 and 1/4, but the data compiler doesn't know that. So he shrugs and says, let's just call the main SF guy an SF, and call the main PF guy a PF.
So now the data for the four teams looks like this:
Now, we get a negative relationship. In this example, it's mild: every additional rebound by an SF is linked to a .27 decline in rebounds by the PF. But that's because our example is pretty mild: we have only 1/16 blending (25% blending in 25% of the teams). If we add more blending, we get a more negative relationship. For instance, if I were to add 50% blending for another team -- changing the "12 10" to an "11 11" -- now the "diminishing returns" go from .27 to .6.
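To make the arithmetic concrete, here is a small sketch. The four-team table isn't reproduced in the text here, so the `baseline` numbers below are an assumption on my part: a hypothetical set of (PF, SF) percentages chosen because it contains the "12 8" and "12 10" teams mentioned above and reproduces the quoted slopes of zero, -.27, and -.6. The helper names `blend` and `slope` are also just illustrative.

```python
def blend(own, other, share_swapped):
    """Average rate for a player who spends `share_swapped` of his
    time playing the other position."""
    return (1 - share_swapped) * own + share_swapped * other

def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def pf_on_sf(teams):
    """Slope of PF rebounding regressed on SF rebounding."""
    pf = [t[0] for t in teams]
    sf = [t[1] for t in teams]
    return slope(sf, pf)

# HYPOTHETICAL (PF, SF) rebounding percentages for the four teams.
baseline = [(12, 8), (12, 10), (14, 8), (14, 10)]

# First team: PF and SF swap positions 1/4 of the time -> (11, 9).
team1 = (blend(12, 8, 0.25), blend(8, 12, 0.25))
one_blended = [team1, (12, 10), (14, 8), (14, 10)]

# Second team: 50% blending turns "12 10" into "11 11".
team2 = (blend(12, 10, 0.5), blend(10, 12, 0.5))
two_blended = [team1, team2, (14, 8), (14, 10)]

for label, teams in [("no blending", baseline),
                     ("one team blended", one_blended),
                     ("two teams blended", two_blended)]:
    print(f"{label}: slope = {pf_on_sf(teams):.2f}")
```

Under those assumed numbers, the slope moves from zero to about -.27 and then to -.6, even though no player's actual rebounding ability has changed.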
Now, I'm not saying this is actually happening. I don't know much about the process by which 82games.com or DSMok1 create their datasets (which I used in the two previous posts). If they are, indeed, classifying a player's position every minute or every possession, then there shouldn't be a problem.
But if there is a problem created by misclassification, does that mean that statistics like Dave Berri's might be right after all? Well, those statistics need to use correct position data too. If they don't, they suffer from the same problem, just hidden more.
Look again at the breakdown for rebounding percentage by position:
13.8% Power Forward
5.9% Point Guard
Players are evaluated relative to their position. So if you misclassify a PG as an SG for a game's worth of playing time, you've given him three extra rebounds that he doesn't "deserve". If you misclassify a PF as an SF, you give him almost five extra rebounds a game. This is really almost the same as my "positioning" or "scheme" argument. In the previous case, I say that perhaps some SFs take some of a PF's rebounding opportunities. In this case, I say an SF might actually be playing the PF's position, taking *all* of the PF's opportunities (but relinquishing his own).
So, it depends on the data.
How can we check? Well, one way is to compare the results for the "position" breakdown to another breakdown that we know is always correct. As I was writing this, I discovered that Eli Witus did that almost three years ago. Instead of classifying by position, he classified by height -- each player got a number indicating where he ranked among the five players on the court, from shortest all the time (1.0) to tallest all the time (5.0). Obviously, a player's rank would vary depending on who was on the court with him, so his final number would include a fraction. For instance, a "4.5" might mean he was the tallest player half the time, and the second-tallest player half the time.
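As a rough illustration of how such a number could be computed (I don't know the details of Witus's actual procedure, so this is only a guess at the idea), here is a minutes-weighted average of each player's height rank across lineups. The players, heights, and minutes are invented, and the function name `average_height_rank` is hypothetical.

```python
from collections import defaultdict

def average_height_rank(stints):
    """Minutes-weighted average height rank for each player.

    `stints` is a list of (minutes, {player: height}) pairs, one per
    five-man unit. Rank 1 is the shortest player on the court, 5 the
    tallest. Ties in height are broken arbitrarily by sort order,
    which is a simplification.
    """
    weighted = defaultdict(float)
    minutes_played = defaultdict(float)
    for minutes, heights in stints:
        ordered = sorted(heights, key=heights.get)  # shortest first
        for rank, player in enumerate(ordered, start=1):
            weighted[player] += rank * minutes
            minutes_played[player] += minutes
    return {p: weighted[p] / minutes_played[p] for p in weighted}

# Invented example: heights in inches, two lineups of 10 minutes each.
stints = [
    (10, {"A": 73, "B": 76, "C": 79, "D": 81, "E": 82}),
    (10, {"A": 73, "B": 76, "C": 79, "D": 81, "F": 80}),
]
ranks = average_height_rank(stints)
print(ranks["D"])  # second-tallest half the time, tallest the other half
```

Player D is second-tallest when E is on the court and tallest when F replaces E, so he ends up with the "4.5" described above.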
Witus then broke the players into five approximately equal-sized groups, and checked for diminishing returns on rebounds among the groups. However, instead of running a regression on each group against the other four groups, he ran a regression on each group against the entire team (including that group). (That makes most of the correlations positive instead of negative.) In order that we have a comparison, I ran the same regressions for the half-season 2010-11 data provided to me by DSMok1.
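Why does including the group itself flip the sign? If the team total is just the group plus everyone else, the slope against the team is exactly 1 plus the slope against everyone else. Here is a quick numerical check with invented numbers. It assumes a simple additive total, which may hold only approximately for rate statistics like rebounding percentage.

```python
def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented rebound totals for one group and for the rest of the team.
group = [8, 10, 9, 12, 11]
others = [34, 31, 33, 30, 32]
team = [g + o for g, o in zip(group, others)]  # assumed additive total

vs_others = slope(group, others)  # group against the other four groups
vs_team = slope(group, team)      # group against the whole team

print(f"vs others: {vs_others:+.2f}, vs team: {vs_team:+.2f}")
```

In this made-up case, a strongly negative slope against the others shows up as a small positive slope against the team. Read that way, a team coefficient well below 1 still indicates diminishing returns.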
Here are the comparisons. The position numbers are mine; the height-group numbers are Witus's. The numbers are the regression coefficients, with standard errors in parentheses. So, for instance, "+0.27 (0.12)" means that one extra rebound for that position resulted in 0.27 extra rebounds for the team, with a standard error of 0.12.
First, defensive rebounds:
PG: +0.21 (0.35). Height Group 1: +0.27 (0.12).
SG: +0.08 (0.21). Height Group 2: +0.10 (0.11).
SF: +0.48 (0.20). Height Group 3: -0.01 (0.17).
PF: +0.06 (0.13). Height Group 4: +0.12 (0.07).
C: +0.23 (0.14). Height Group 5: +0.02 (0.07).
From this, we can see that the numbers are different, but not statistically significantly different in any of the five cases. Moreover, the comparisons split about evenly, 3-2, over which estimate is higher.
Here are offensive rebounds:
PG: -0.18 (0.57). Height Group 1: +0.56 (0.49).
SG: -0.06 (0.46). Height Group 2: +0.10 (0.24).
SF: +0.85 (0.39). Height Group 3: +0.56 (0.17).
PF: +0.87 (0.16). Height Group 4: +0.47 (0.13).
C: +0.62 (0.27). Height Group 5: +0.49 (0.15).
Again, the numbers are pretty similar, except for PG. But the standard errors in the PG case are so large that the difference between the two estimates is still less than one SE away from zero.
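One way to put a number on "similar" is to divide each difference by its combined standard error. The sketch below does that for both tables. Two things in it are my assumptions rather than anything in the original analysis: that the position and height-group estimates are independent (so the SEs combine as the square root of the sum of squares), and that 1.96 is the cutoff for significance.

```python
import math

def z_diff(b1, se1, b2, se2):
    """Difference between two coefficients in units of its combined
    standard error, assuming the two estimates are independent."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# (position coef, position SE, height-group coef, height-group SE),
# copied from the two tables above.
defensive = {
    "PG": (0.21, 0.35, 0.27, 0.12),
    "SG": (0.08, 0.21, 0.10, 0.11),
    "SF": (0.48, 0.20, -0.01, 0.17),
    "PF": (0.06, 0.13, 0.12, 0.07),
    "C": (0.23, 0.14, 0.02, 0.07),
}
offensive = {
    "PG": (-0.18, 0.57, 0.56, 0.49),
    "SG": (-0.06, 0.46, 0.10, 0.24),
    "SF": (0.85, 0.39, 0.56, 0.17),
    "PF": (0.87, 0.16, 0.47, 0.13),
    "C": (0.62, 0.27, 0.49, 0.15),
}

z_def = {pos: z_diff(*row) for pos, row in defensive.items()}
z_off = {pos: z_diff(*row) for pos, row in offensive.items()}

for label, zs in [("defensive", z_def), ("offensive", z_off)]:
    for pos, z in zs.items():
        print(f"{label} {pos}: z = {z:+.2f}")
```

Under those assumptions, none of the ten differences reaches 1.96, although SF on defense and PF on offense come close.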
So, we can conclude: misclassification by position could certainly cause the negative correlations we saw. However, that was probably not a major cause, because, when we use height as a control group, we get approximately the same results. And we know that there is little error in classifying players by height.
However, two caveats: "We get approximately the same results" is just a gut reaction to the two sets of numbers. And, second, Witus' groups are not perfect -- there is still a bit of error. Any given set of five players could include two from the first group, so there is still at least a little bit of misclassification.
But still, I think these results are close enough. We wanted to make sure what we saw for rebounding was real, and not just caused by errors in the data. The evidence suggests that it was indeed real.
Executive Summary: False alarm. Sorry to have bothered you.