Monday, August 07, 2006

A flawed competitive balance hockey study

Pitching has nothing to do with winning baseball games. I've proved it!

Here’s what I did: I ran a regression to predict team wins. I had ten independent variables: team home runs, batting average, OPS, winning percentage, sacrifice hits, ERA, manager experience, total average, total payroll, and pitcher strikeouts.

After running the regression, only one variable was significant: team winning percentage. The others weren’t significant at all. And, so, obviously, ERA and strikeouts have nothing to do with winning. I’ve proven it!

Well, of course, I haven’t, and the flaw is kind of obvious: the regression includes “winning percentage” as one of the independent variables. Winning percentage is almost exactly the same as the “wins” I’m trying to predict. In fact, the regression equation would work out close to:

Wins = (162 * winning percentage) + (0 * ERA) + (0 * batting average) + (0 * OPS) …

ERA has a lot of effect on wins, but it does so by changing winning percentage. In this case, winning percentage “absorbs” all the effects of ERA. That is, once you know winning percentage, knowing ERA doesn’t help you predict wins any better. A .600 team wins about 97 games (162 * .600 = 97.2), regardless of how good its pitching staff was.
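To see the absorption in action, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the league of 300 team-seasons, the coefficients linking ERA and offense to winning percentage, and the small `ols` helper (plain Python, so it runs without any extra libraries). Wins are set to exactly 162 times winning percentage, which is a little cleaner than real life.

```python
import math
import random

def ols(y, *columns):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0, *values] for values in zip(*columns)]
    k = len(rows[0])
    # Augmented system (X'X | X'y).
    a = [[math.fsum(row[i] * row[j] for row in rows) for j in range(k)]
         + [math.fsum(row[i] * yi for row, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(0)
n_teams = 300

# Made-up league: ERA and offense drive winning percentage, plus some luck.
era = [rng.gauss(4.20, 0.50) for _ in range(n_teams)]
runs_per_game = [rng.gauss(4.60, 0.40) for _ in range(n_teams)]
win_pct = [0.500 + 0.10 * (r - 4.60) - 0.10 * (e - 4.20) + rng.gauss(0, 0.02)
           for r, e in zip(runs_per_game, era)]
wins = [162 * p for p in win_pct]  # wins are just winning percentage rescaled

era_only = ols(wins, era)               # [intercept, ERA slope]
with_win_pct = ols(wins, win_pct, era)  # [intercept, win_pct slope, ERA slope]

print("ERA alone:     ERA slope = %.2f" % era_only[1])
print("ERA + win pct: win_pct slope = %.2f, ERA slope = %.2f"
      % (with_win_pct[1], with_win_pct[2]))
```

On its own, ERA gets a large negative slope, because in this toy league better pitching really does produce more wins. Add winning percentage to the regression and its slope comes out at 162 while ERA's slope collapses to zero, even though nothing about how the data was generated has changed.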

(I’m sure there’s a statistical term for this effect, where strong correlations among your independent variables cause otherwise-significant variables to be absorbed by others. But if there is, I don’t know it.)

Suppose we try to cure the problem by getting rid of “winning percentage” and using “expected wins” (pythagoras) instead. Our correlation would still be very high, because pythagoras predicts wins very well. And, again, we’d wind up with ERA being insignificant, for the same reason – all of the information ERA gives you is already included in the information in "expected wins". A team that scores 500 runs and allows 450 will win about 90 games, regardless of its staff’s ERA.
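The arithmetic behind that last example can be sketched in a few lines. This assumes the classic form of the Pythagorean projection with an exponent of 2 and a 162-game season; the function name is just for illustration.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from runs scored and allowed (classic exponent of 2)."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return games * scored / (scored + allowed)

expected = pythagorean_wins(500, 450)
print(round(expected, 1))  # 89.5
```

Note that ERA is not an input at all. Whatever the pitching staff did is already baked into the runs-allowed total.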

One more try: let’s remove "expected wins", and add separate variables for “runs scored” and “runs allowed”. The flaw is more subtle, but it’s still there. Our correlation will drop a bit because, while you can predict winning percentage from a combination of runs scored and runs allowed (at about 10 runs per win), that’s not as accurate as pythagoras. But the other variables will still wind up not significant. And again, that’s because once you know a team’s runs scored and allowed, ERA does not give you any more information.
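As a rough illustration of the two estimators, here is a sketch that puts the linear rule next to pythagoras for a few hypothetical teams. The 10-runs-per-win figure comes from the paragraph above; the exponent of 2, the 162-game season, and the sample run totals are assumptions.

```python
def linear_wins(runs_scored, runs_allowed, games=162, runs_per_win=10.0):
    """A .500 record plus one win for every 10 runs of run differential."""
    return games / 2 + (runs_scored - runs_allowed) / runs_per_win

def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from runs scored and allowed (classic exponent of 2)."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return games * scored / (scored + allowed)

# Hypothetical (runs scored, runs allowed) pairs.
for rs, ra in [(500, 450), (800, 700), (650, 750)]:
    print(rs, ra, round(linear_wins(rs, ra), 1), round(pythagorean_wins(rs, ra), 1))
```

The two estimates land near each other but not on top of each other, which is why swapping one for the other changes the fit a little. Either way, both take only runs scored and runs allowed, so ERA has nothing left to add.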

This last situation is the flaw in this hockey study by Tom Preissing and Aju J. Fenn.

Preissing (who is now an active NHL player) and Fenn tried to figure out what factors are predictive of competitive balance in the NHL (as measured by single-season clustering around .500). They included variables like free agency, the draft, the availability of European players, the existence of the WHA, and so on. Unfortunately, they also included variables for competitive balance in goals scored and allowed – which, as we saw, is a fatal flaw.

Since goals scored (GS) and goals allowed (GA) directly cause winning percentage, most of the study’s other variables show up as insignificant. For instance, the amateur draft may have increased competitive balance measured in wins – but if it did, it would have done so by increasing competitive balance in goals scored, or by increasing competitive balance in goals allowed.


That is, a league that has lots of variation in goals scored and lots of variation in goals allowed will have lots of variation in wins, regardless of whether there was an amateur draft or not.

Sadly, the flaw means we don’t really get any reliable information from the study. But it would sure be interesting to run it again, without those two variables.

(Thanks to Tangotiger for the pointer.)

Comments:

At Monday, August 07, 2006 7:05:00 PM, Blogger JavaGeek said...

This is often referred to as multicollinearity. When one runs a regression, one has to be careful about these sorts of things. There are tools to help with this, such as looking at Variance Inflation Factors, as mentioned at Multivariate Statistics - Multicollinearity and Singularity.

Also, I love table 4.1 "equation 3". P-value: 99.9% F-stat: 5800 (someone made a mistake there...)

 