Sunday, November 25, 2007

The "K" study: now reproducible

OK, here's another update on the "Dave Kingman strikes out a lot because his name starts with K" study.

Last post, I mentioned that I reran the study and couldn't duplicate the results. In a comment there, Joe Arthur pointed out that the study actually used first names, while I used last names. (Thanks, Joe!) So I reran the study for players whose first name begins with K.

There, I was able to duplicate the difference in strikeout rates. The "K" players did indeed strike out more often than the non-Ks, by about 1.8 percentage points:

14.7% for K players (37096/252439)
12.9% for the others (1365946/10607440)
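
As a quick check on the arithmetic, here's a small Python sketch that recomputes the two rates and the gap from the totals above. The variable names are mine, and I'm assuming the denominators are at-bats, which the figures themselves don't say.

```python
# Totals quoted above: (strikeouts, denominator) for each group.
# Assumption: the denominators are at-bats.
k_so, k_ab = 37096, 252439
other_so, other_ab = 1365946, 10607440

# Strikeout rate for each group, in percent
k_rate = 100 * k_so / k_ab
other_rate = 100 * other_so / other_ab

# Difference in percentage points
gap = k_rate - other_rate

print(f"K players: {k_rate:.1f}%")
print(f"Others:    {other_rate:.1f}%")
print(f"Gap:       {gap:.1f} percentage points")
```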

I didn't recheck significance levels, but my guess is that the difference is about 3 SDs.

However, the difference can be fully explained by the fact that first names starting with K are more popular now than they were 50 years ago. So proportionally more of the "K" hitters played in high-strikeout eras.

Go to the "Baby Name Voyager," choose "boys" only, and enter "K". You'll see a consistent rise in K names from the 19th century to the end of WWII -- about 10 times as many "K" boys at the end than at the beginning. But then, they accelerate upward even faster, doubling again by the late 1960s before dropping a little bit after that. (Most of the post-war effect, by the way, seems to be concentrated in "Kevin." Which makes sense; I can't think of any really old guys named Kevin. Or Kyle, for that matter.)

If you average out the calendar seasons played by Ks, you get 1977. If you average the seasons played by non-Ks, it's 1963.

I think this accounts for the entire effect. Here are the strikeout rates for Ks vs. non-Ks by decade (players with 100+ AB):

1910s: Ks 9.2%, non-Ks 8.5% (starting 1913)
1920s: Ks 7.0%, non-Ks 6.3%
1930s: Ks 8.7%, non-Ks 7.5%
1940s: Ks 8.0%, non-Ks 8.2%
1950s: Ks 12.3%, non-Ks 10.2%
1960s: Ks 14.3%, non-Ks 13.6%
1970s: Ks 12.5%, non-Ks 12.8%
1980s: Ks 13.0%, non-Ks 13.6%
1990s: Ks 15.5%, non-Ks 15.5%
2000s: Ks 17.1%, non-Ks 16.3% (up to 2003)

1910s to 1950s: Ks 8.8%, non-Ks 8.1%
1960s to 2000s: Ks 14.2%, non-Ks 14.2%

Once you normalize by decade, the effect all but disappears. From 1960 to 2003, the rates are exactly the same. There does appear to be a small "K" effect from 1914 to 1959, but it almost certainly is not statistically significant.
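
To see how an era mix alone can open up a gap, here's a rough Python sketch. It uses the per-decade rates from the table, but the at-bat shares for each group are hypothetical numbers I made up to mimic Ks playing disproportionately in recent decades, so the output only illustrates the mechanism and doesn't reproduce the real figures. The "standardized" gap weights both groups by the same (non-K) decade mix.

```python
# Per-decade strikeout rates (%) from the table above, 1910s through 2000s
k_rates     = [9.2, 7.0, 8.7, 8.0, 12.3, 14.3, 12.5, 13.0, 15.5, 17.1]
non_k_rates = [8.5, 6.3, 7.5, 8.2, 10.2, 13.6, 12.8, 13.6, 15.5, 16.3]

# HYPOTHETICAL share of each group's at-bats in each decade (each sums to 100).
# These are invented for illustration; the K shares lean toward recent decades.
k_share     = [1, 1, 2, 3, 5, 10, 15, 20, 25, 18]
non_k_share = [8, 10, 10, 9, 10, 11, 12, 12, 12, 6]


def weighted_rate(rates, shares):
    """Average the decade rates, weighting each decade by its share."""
    return sum(r * s for r, s in zip(rates, shares)) / sum(shares)


# Crude comparison: each group weighted by its own decade mix
crude_gap = weighted_rate(k_rates, k_share) - weighted_rate(non_k_rates, non_k_share)

# Standardized comparison: both groups weighted by the same (non-K) decade mix
standardized_gap = (
    weighted_rate(k_rates, non_k_share) - weighted_rate(non_k_rates, non_k_share)
)

print(f"Crude gap:        {crude_gap:.2f} points")
print(f"Standardized gap: {standardized_gap:.2f} points")
```

With these made-up shares, most of the crude gap comes from the K group's later decade mix rather than from any within-decade difference.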

But maybe the authors did correct for this, or did something different. We can check for sure when the study comes out.

At Sunday, November 25, 2007 6:41:00 PM, Anonymous said...

Seems like the easiest way to test that is to run a fixed effects panel regression with an additional dummy for K names. If the K dummy is significant, you have a K effect.
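
A rough sketch of that idea in Python, on simulated data only (every number below is made up, and the true K effect is set to zero). I'm assuming season fixed effects here, since a player-level fixed effect would absorb a name dummy that never changes within a player. The standard error is plain OLS and ignores within-player correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# SIMULATED data: all values are invented for illustration.
n_players, n_years, career = 2000, 30, 5
is_k = rng.random(n_players) < 0.10              # about 10% "K" first names
league_rate = np.linspace(8.0, 16.0, n_years)    # league strikeout % rising over time

years, k_flags, rates = [], [], []
for p in range(n_players):
    # "K" players debut later on average; there is no true K effect
    first = rng.integers(10 if is_k[p] else 0, n_years - career + 1)
    for y in range(first, first + career):
        years.append(y)
        k_flags.append(is_k[p])
        rates.append(league_rate[y] + rng.normal(0.0, 3.0))

years = np.array(years)
k_flags = np.array(k_flags, dtype=float)
rates = np.array(rates)

# Naive comparison that ignores the season
naive_gap = rates[k_flags == 1].mean() - rates[k_flags == 0].mean()

# Design matrix: one dummy per observed season (the fixed effects) plus the K dummy
_, year_idx = np.unique(years, return_inverse=True)
n_obs = len(rates)
X = np.zeros((n_obs, year_idx.max() + 2))
X[np.arange(n_obs), year_idx] = 1.0
X[:, -1] = k_flags

# Ordinary least squares and the usual standard error for the K coefficient
beta, *_ = np.linalg.lstsq(X, rates, rcond=None)
resid = rates - X @ beta
sigma2 = resid @ resid / (n_obs - X.shape[1])
se_k = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
k_coef = beta[-1]

print(f"Naive gap:      {naive_gap:.2f} points")
print(f"K coefficient:  {k_coef:.2f} (SE {se_k:.2f})")
```

In this simulation the naive gap comes entirely from K players appearing in later, higher-strikeout seasons, and the season dummies are what remove it.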

At Monday, November 26, 2007 12:12:00 AM, Anonymous said...

You can download the paper here. Pages 4 and 5 are the baseball part.

At Monday, November 26, 2007 12:49:00 AM, Blogger Phil Birnbaum said...

Thanks, joe p ... unfortunately, when I try to download the paper, I get a "not found" error.

At Monday, November 26, 2007 9:32:00 AM, Anonymous said...

It seems part of the link was cut off. It should have an abstract_id of 946249 instead of 94, making the link (when you put the two lines together with no space)

According to the paper, the researchers acknowledged an increase in both the frequency of strikeouts and the number of K-initialed players. It would be interesting to see how they controlled for average year of play as they say they did.

At Monday, November 26, 2007 9:36:00 AM, Blogger Phil Birnbaum said...

Thanks, got it now. I had the whole abstract ID, but I guess the system was down last night or something.

At Friday, November 30, 2007 5:30:00 PM, Blogger Oilman said...

If anything good can come from this, it's that Roger Clemens's arrogance has doomed his kids (Koby, Kory, Kacy, and Kody) to be marginal players at best... finally, karma strikes a small blow against that ass.


