"K" study results can't be duplicated
The "Dave Kingman strikes out a lot because his name starts with K" study found that "K" batters struck out 18.8% of the time, versus 17.2% for everyone else. It also found the difference to be statistically significant.
I can't duplicate either result.
First, the numbers are very high. The authors found at least a 17.2% strikeout rate. But between 1913 and 2003, the years in the study, only four seasons had an overall strikeout rate at least that high. And the highest overall was only 17.8%, while some of the early years had rates below 10%, and the middle years were in the low teens. (And all those numbers come from considering only AB and BB in the definition of PA.)
The authors did limit their dataset to players with 100 PA or more, but that should *lower* the strikeout rates, by eliminating lots of pitchers. So how did they get 17.2%?
Maybe they used AB as the denominator instead of PA. But even that gives an overall rate of only about 14% through 2003.
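To make the two possible denominators concrete, here is a minimal sketch. The function name `strikeout_rate` and the season line are made up for illustration and are not taken from any dataset; PA is taken as AB + BB, as in the numbers above.

```python
def strikeout_rate(so, ab, bb, per="PA"):
    """Strikeouts per PA (defined here as AB + BB) or per AB."""
    denom = ab + bb if per == "PA" else ab
    return so / denom

# Made-up season line, purely illustrative (not real data).
so, ab, bb = 100, 600, 60

print(f"SO/PA: {100 * strikeout_rate(so, ab, bb, 'PA'):.1f}%")
print(f"SO/AB: {100 * strikeout_rate(so, ab, bb, 'AB'):.1f}%")
```

Dropping walks from the denominator can only raise the rate, which is why an AB denominator is a natural suspect for numbers that look too high.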
Second, the difference between the K players and everyone else isn't as big as the authors say. The authors found a 1.6 percentage-point difference. David Gassko's study (using a different dataset) found a difference of only 0.5 percentage points (15.5% versus 15.0%). I checked all players from 1913 to 2003, and found a difference of only 0.2 points:
13.1% for "K" players (50761 out of 387611)
12.9% for everyone else (1352281 out of 10472268)
(I used all players, though, not just players with enough PA.)
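The arithmetic behind those two rates can be checked directly from the counts quoted above. This is only a sketch; the variable names are mine.

```python
# Counts quoted above: (strikeouts, plate appearances), with PA = AB + BB.
k_so, k_pa = 50761, 387611
other_so, other_pa = 1352281, 10472268

k_rate = k_so / k_pa
other_rate = other_so / other_pa
diff_points = 100 * (k_rate - other_rate)

print(f"K players:  {100 * k_rate:.1f}%")
print(f"Others:     {100 * other_rate:.1f}%")
print(f"Difference: {diff_points:.1f} percentage points")
```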
Finally, to check statistical significance, I ran a simulation. I pulled random players who were born after 1934 (to roughly match Gassko's dataset), and arbitrarily decided their last names started with "K". I kept pulling until the total PA went past 464664, to come close to Gassko's number (so each sample's total ended up a bit higher than 464664). Then I computed their overall strikeout rate.
I repeated that 100 times. The results:
Mean 14.94%, SD 0.44%
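The procedure described above could be sketched roughly as follows. This is not the actual code behind the numbers: the function name `simulate_k_rates`, the `(pa, so)` input format, and the choice to draw players without replacement within a trial are all assumptions, and the demo data is synthetic.

```python
import random
import statistics

def simulate_k_rates(players, target_pa=464664, trials=100, seed=None):
    """players: list of (pa, so) career totals for the eligible pool.

    Each trial draws players in random order (without replacement, an
    assumption) until total PA passes target_pa, then records SO / PA.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        pool = players[:]
        rng.shuffle(pool)
        total_pa = total_so = 0
        for pa, so in pool:
            total_pa += pa
            total_so += so
            if total_pa > target_pa:
                break
        if total_pa == 0:
            raise ValueError("player pool has no plate appearances")
        rates.append(total_so / total_pa)
    return rates

# Synthetic pool, only to show the call; these are NOT real players.
demo_rng = random.Random(0)
demo_players = []
for _ in range(3000):
    pa = demo_rng.randint(1, 8000)
    so = int(pa * demo_rng.uniform(0.05, 0.25))
    demo_players.append((pa, so))

rates = simulate_k_rates(demo_players, trials=100, seed=1)
print(f"Mean {100 * statistics.mean(rates):.2f}%, "
      f"SD {100 * statistics.stdev(rates):.2f}%")
```

Sampling whole players rather than individual plate appearances matters: a handful of high-strikeout or low-strikeout careers can move the group rate, so the spread comes out wider than a simple binomial calculation on PA would suggest.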
With a mean of 14.94% and an SD of 0.44%, Gassko's result, a 0.5-point difference, is only about 1 SD above the mean. And my result is only about half an SD above the mean. So, no statistical significance.
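Here is that comparison as arithmetic, expressing each observed difference in units of the simulated SD. The 1.96 cutoff is my assumption: it is the conventional two-sided 5% threshold and presumes the simulated rates are roughly normal.

```python
sd = 0.44            # simulated SD quoted above, in percentage points
gassko_diff = 0.5    # Gassko's "K" minus others, in percentage points
my_diff = 0.2        # my 1913-2003 difference, in percentage points

z_gassko = gassko_diff / sd
z_mine = my_diff / sd

# Assumed conventional two-sided 5% cutoff for a normal distribution.
CUTOFF = 1.96

for label, z in (("Gassko", z_gassko), ("Mine", z_mine)):
    verdict = "significant" if abs(z) > CUTOFF else "not significant"
    print(f"{label}: {z:.2f} SD ({verdict})")
```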
So what's going on? I guess we have to wait for the original study to be released before we find out.
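P.S. From a quick glance, it looks like only one letter in Gassko's study is statistically significant: the "O". And exactly one significant result out of 26 is itself not significant, since with 26 letters you'd expect about one to look significant by chance alone at the usual 5% level.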
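A quick back-of-the-envelope check of that last point, assuming a 5% significance level and treating the 26 letters as independent tests (both are my assumptions; the study's actual threshold isn't stated here):

```python
alpha = 0.05       # assumed significance level
n_letters = 26     # one test per initial letter

expected_hits = alpha * n_letters
p_at_least_one = 1 - (1 - alpha) ** n_letters

print(f"Expected false positives: {expected_hits:.1f}")
print(f"Chance of at least one:   {100 * p_at_least_one:.0f}%")
```

Under those assumptions, getting one "significant" letter is close to what pure chance would produce.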