Monday, August 07, 2006

A flawed competitive balance hockey study

Pitching has nothing to do with winning baseball games. I've proved it!

Here’s what I did: I ran a regression to predict team wins. I had ten independent variables: team home runs, batting average, OPS, winning percentage, sacrifice hits, ERA, manager experience, total average, total payroll, and pitcher strikeouts.

After running the regression, only one variable was significant: team winning percentage. The others weren’t significant at all. And, so, obviously, ERA and strikeouts have nothing to do with winning. I’ve proven it!

Well, of course, I haven’t, and the flaw is kind of obvious: the regression includes “winning percentage” as one of the independent variables. Winning percentage is almost exactly the same as the “wins” I’m trying to predict. In fact, the regression equation would work out close to:

Wins = (162 * winning percentage) + (0 * ERA) + (0 * batting average) + (0 * OPS) …

ERA has a lot of effect on wins, but it does so by changing winning percentage. In this case, winning percentage “absorbs” all the effects of ERA. That is, once you know winning percentage, knowing ERA doesn’t help you predict wins any better. A .600 team wins about 97 games over a 162-game season, regardless of how good its pitching staff was.
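To see the absorption in action, here is a minimal sketch on a made-up league. Everything in it is an assumption for illustration: the 300 simulated team-seasons, the ERA spread, the size of ERA's effect on winning percentage, and the small `ols` helper, which is just a bare-bones least-squares solver and not anything from a real study.

```python
import random

def ols(columns, y):
    """Return [intercept, b1, b2, ...] from ordinary least squares."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations: [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        a[c] = [v / pivot for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [a[i][k] for i in range(k)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical league: a lower ERA raises winning percentage, plus noise.
era = [rng.gauss(4.2, 0.5) for _ in range(300)]
wpct = [0.5 - 0.08 * (e - 4.2) + rng.gauss(0, 0.04) for e in era]
wins = [round(162 * w) for w in wpct]

# Regression 1: wins on ERA alone.
_, era_alone = ols([era], wins)
# Regression 2: wins on winning percentage and ERA together.
_, wpct_coef, era_with_wpct = ols([wpct, era], wins)

print(f"ERA coefficient, ERA alone:         {era_alone:8.2f}")
print(f"ERA coefficient, win pct included:  {era_with_wpct:8.2f}")
print(f"Winning percentage coefficient:     {wpct_coef:8.2f}")
```

By itself, ERA gets a large negative coefficient. Put winning percentage in the same regression and the ERA coefficient collapses toward zero while winning percentage lands near 162, even though ERA is what drives winning percentage in this simulated data.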

(I’m sure there’s a statistical term for this effect, when massive cross-correlations among your independent variables cause otherwise-significant variables to be absorbed by others. But if there is, I don’t know it.)

Suppose we try to cure the problem by getting rid of “winning percentage” and using “expected wins” (pythagoras) instead. Our correlation would still be very high, because pythagoras predicts wins very well. And, again, we’d wind up with ERA being insignificant, for the same reason – all of the information ERA gives you is already included in the information in "expected wins". A team that scores 500 runs and allows 450 will win about 90 games, regardless of its staff’s ERA.

One more try: let’s remove "expected wins", and add separate variables for “runs scored” and “runs allowed”. The flaw is more subtle, but it’s still there. Our correlation will drop a bit because, while you can predict winning percentage by a combination of runs scored and allowed (at about 10 runs equals one win), it’s not as accurate as pythagoras. Even so, the other variables will wind up not significant. And again, that’s because once you know a team’s runs scored and allowed, the ERA does not give you any more information.
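For reference, here is a rough sketch of the two estimators being compared. The function names are mine, the exponent of 2 is the classic form of the pythagorean formula (other exponents are sometimes used), and the 10-runs-per-win figure is the rule of thumb quoted above. Note that neither function takes ERA as an input: runs scored and runs allowed are all they need.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from the pythagorean formula (exponent 2 assumed)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)

def linear_wins(runs_scored, runs_allowed, games=162, runs_per_win=10.0):
    """Expected wins from a .500 baseline plus one win per 10 runs of differential."""
    return games / 2 + (runs_scored - runs_allowed) / runs_per_win

# A few hypothetical teams: (runs scored, runs allowed).
for rs, ra in [(500, 450), (800, 700), (700, 800)]:
    print(rs, ra,
          round(pythagorean_wins(rs, ra), 1),
          round(linear_wins(rs, ra), 1))
```

The two estimates usually land within a few wins of each other, which is why swapping one for the other doesn't rescue the regression.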

This last situation is the flaw in this hockey study by Tom Preissing and Aju J. Fenn.

Preissing (who is now an active NHL player) and Fenn tried to figure out what factors are predictive of competitive balance in the NHL (as measured by single-season clustering around .500). They included variables like free agency, the draft, the availability of European players, the existence of the WHA, and so on. Unfortunately, they also included variables for competitive balance in goals scored and allowed, which, as we saw, is a fatal flaw.

Since goals scored (GS) and goals allowed (GA) directly cause winning percentage, most of the study’s other variables show up as insignificant. For instance, the amateur draft may have increased competitive balance measured in wins, but if it did, it would have done so by the mechanism of increasing competitive balance in goals scored, or by the mechanism of increasing competitive balance in goals allowed.

That is, a league that has lots of variation in goals scored and lots of variation in goals allowed will have lots of variation in wins, regardless of whether there was an amateur draft or not.

Sadly, the flaw means we don’t really get any reliable information from the study. But it would sure be interesting to run it again, without those two variables.

(Thanks to Tangotiger for the pointer.)


At Monday, August 07, 2006 7:05:00 PM, Blogger JavaGeek said...

This is often referred to as multicollinearity. When one runs a regression, one has to be careful of these sorts of things. There are often tools to simplify these things, such as looking at Variance Inflation Factors, as mentioned at Multivariate Statistics - Multicollinearity and Singularity.
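As a rough illustration of what a Variance Inflation Factor measures, here is a small sketch for the two-predictor case, where the VIF reduces to 1 / (1 - r^2) with r the correlation between the two predictors. The data is simulated and the numbers (ERA spread, unearned runs, sacrifice hits) are assumptions picked for the example; a real check would regress each predictor on all of the others.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF for either predictor when the model contains exactly these two."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical teams: runs allowed is mostly ERA times 162 games,
# plus some unearned runs; sacrifice hits are unrelated to ERA.
era = [rng.gauss(4.2, 0.5) for _ in range(300)]
runs_allowed = [162 * e + rng.gauss(60, 20) for e in era]
sac_hits = [rng.gauss(50, 10) for _ in range(300)]

vif_overlap = vif_two_predictors(era, runs_allowed)
vif_unrelated = vif_two_predictors(era, sac_hits)

print(f"VIF, ERA vs runs allowed:   {vif_overlap:.1f}")
print(f"VIF, ERA vs sacrifice hits: {vif_unrelated:.2f}")
```

A VIF near 1 means the predictor carries its own information. Values above roughly 5 or 10 are the usual rule-of-thumb warning that another variable in the model is already telling most of the same story.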

Also, I love table 4.1 "equation 3". P-value: 99.9% F-stat: 5800 (someone made a mistake there...)

