The "K" study: now reproducible
OK, here's another update on the "Dave Kingman strikes out a lot because his name starts with K" study.
In my last post, I mentioned that I reran the study and couldn't duplicate the results. In a comment there, Joe Arthur pointed out that the study actually used first names, while I had used last names. (Thanks, Joe!) So I reran the study for players whose first name begins with K.
There, I was able to duplicate the difference in strikeout rates. The "K" players did indeed strike out more often than the non-Ks, by about 1.8 percentage points:
14.7% for K players (37096/252439)
12.9% for the others (1365946/10607440)
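As a quick check on the arithmetic, the two rates and the gap between them can be recomputed from the counts above. This is only a sketch; the variable names are mine, and I'm simply dividing the first number in each pair by the second.

```python
# Counts copied from the two lines above: (strikeouts, opportunities).
k_so, k_n = 37096, 252439
other_so, other_n = 1365946, 10607440

k_rate = k_so / k_n
other_rate = other_so / other_n

# Gap expressed in percentage points, not as a ratio.
gap_points = (k_rate - other_rate) * 100

print(f"K players: {k_rate:.1%}")
print(f"Others:    {other_rate:.1%}")
print(f"Gap:       {gap_points:.1f} percentage points")
```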
I didn't recheck significance levels, but my guess is that the difference is about 3 SDs.
However, the difference can be fully explained by the fact that first names starting with K are more popular now than they were 50 years ago. So proportionally more of the "K" hitters played in high-strikeout eras.
Go to the "Baby Name Voyager," choose "boys" only, and enter "K". You'll see a consistent rise in K names from the 19th century to the end of WWII -- about 10 times as many "K" boys at the end as at the beginning. After that, the rise accelerates, with the number doubling again by the late 1960s before dropping a little bit. (Most of the post-war effect, by the way, seems to be concentrated in "Kevin." Which makes sense; I can't think of any really old guys named Kevin. Or Kyle, for that matter.)
If you average out the calendar seasons played by Ks, you get 1977. If you average the seasons played by non-Ks, it's 1963.
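To see how a difference in era mix alone can produce a gap, here is a toy illustration. Every number in it is made up: both groups are given identical strikeout rates within each era, and only the share of at-bats falling in each era differs.

```python
# Made-up illustration: both groups strike out at the SAME rate within each era.
era_rate = {"early": 0.08, "late": 0.14}

# Hypothetical share of each group's at-bats that falls in each era.
era_share = {
    "K":     {"early": 0.10, "late": 0.90},
    "non-K": {"early": 0.30, "late": 0.70},
}

def overall_rate(shares):
    """Weighted average of the era rates, weighted by share of at-bats."""
    return sum(shares[era] * era_rate[era] for era in era_rate)

overall = {group: overall_rate(shares) for group, shares in era_share.items()}
for group, rate in overall.items():
    print(f"{group}: {rate:.1%}")
```

With these invented shares, the K group comes out higher overall even though no individual era shows any difference, which is the kind of composition effect I'm describing.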
I think this accounts for the entire effect. Here are the strikeout rates for Ks vs. the non-Ks by decade (players with 100+ AB):
1910s: Ks 9.2%, non-Ks 8.5% (starting 1913)
1920s: Ks 7.0%, non-Ks 6.3%
1930s: Ks 8.7%, non-Ks 7.5%
1940s: Ks 8.0%, non-Ks 8.2%
1950s: Ks 12.3%, non-Ks 10.2%
1960s: Ks 14.3%, non-Ks 13.6%
1970s: Ks 12.5%, non-Ks 12.8%
1980s: Ks 13.0%, non-Ks 13.6%
1990s: Ks 15.5%, non-Ks 15.5%
2000s: Ks 17.1%, non-Ks 16.3% (up to 2003)
1910s to 1950s: Ks 8.8%, non-Ks 8.1%
1960s to 2000s: Ks 14.2%, non-Ks 14.2%
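For anyone who wants to build a breakdown like this themselves, here's a minimal sketch of the grouping. The rows are invented just to exercise the code, the row layout is hypothetical, and I'm assuming the 100 AB cutoff applies per player-season, which may not match how the table above was built.

```python
from collections import defaultdict

# Hypothetical player-season rows: (year, first_name, at_bats, strikeouts).
# All numbers are invented; they are not real player data.
rows = [
    (1925, "Karl", 500, 40),
    (1925, "Mike", 400, 60),
    (1977, "Kevin", 450, 70),
    (1977, "Dave", 480, 140),
    (1978, "Ken", 90, 30),  # under 100 AB, so it is filtered out
]

def rates_by_decade(rows, min_ab=100):
    """Return {(decade, group): strikeout rate}, pooling SO and AB per cell."""
    totals = defaultdict(lambda: [0, 0])  # (decade, group) -> [SO, AB]
    for year, first_name, ab, so in rows:
        if ab < min_ab:
            continue
        decade = year // 10 * 10
        group = "K" if first_name.upper().startswith("K") else "non-K"
        totals[(decade, group)][0] += so
        totals[(decade, group)][1] += ab
    return {key: so / ab for key, (so, ab) in totals.items()}

rates = rates_by_decade(rows)
for (decade, group), rate in sorted(rates.items()):
    print(f"{decade}s {group}: {rate:.1%}")
```

Pooling strikeouts and at-bats within each cell, rather than averaging player rates, weights every at-bat equally; that's a choice, not necessarily the one the study's authors made.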
Once you normalize by decade, the effect all but disappears. From 1960 to 2003, the pooled rates are identical to one decimal place (14.2% for both groups). There does appear to be a small "K" effect before 1960 (8.8% vs. 8.1%), but it almost certainly is not statistically significant.
But maybe the authors did correct for this, or did something different. We can check for sure when the study comes out.