Thursday, November 22, 2007

"K" study results can't be duplicated

The "Dave Kingman strikes out a lot because his name starts with K" study found that "K" batters struck out 18.8% of the time, versus 17.2% for everyone else. It also found the difference to be statistically significant.

I can't duplicate either result.

First, the numbers are very high. The authors found a strikeout rate of at least 17.2%. But between 1913 and 2003, the years in the study, only four seasons had an overall strikeout rate at least that high. The highest overall was only 17.8%, while some of the early years had rates below 10%, and the middle years were in the low teens. (All of those numbers define PA as AB plus BB only.)

The authors did limit their dataset to players with 100 PA or more, but that should *lower* the strikeout rates, by eliminating lots of pitchers. So how did they get 17.2%?

Maybe they used AB as the denominator instead of PA. But that gives an overall rate of only about 14% up to 2003, still well short of 17.2%.

Second, the difference between the K players and everyone else isn't as big as the authors say. The authors found a 1.6 percentage-point difference. David Gassko's study (using a different dataset) found a difference of only 0.5 percentage points (15.5% versus 15.0%). I checked all players from 1913 to 2003, and found a difference of only 0.2 points:

13.1% for "K" players (50761 out of 387611)
12.9% for everyone else (1352281 out of 10472268)

(I used all players, though, not just players with enough PA.)
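For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the two rates from the counts quoted above. The variable names are mine, and the counts are assumed to be strikeouts over PA, with PA defined as AB plus BB.

```python
# Counts quoted in the post: (strikeouts, PA), assuming PA = AB + BB.
k_so, k_pa = 50761, 387611              # last name starts with "K"
other_so, other_pa = 1352281, 10472268  # everyone else

k_rate = k_so / k_pa
other_rate = other_so / other_pa
# Difference expressed in percentage points.
diff_points = (k_rate - other_rate) * 100

print(f"K players:     {k_rate:.1%}")
print(f"Everyone else: {other_rate:.1%}")
print(f"Difference:    {diff_points:.1f} percentage points")
```

The rates round to 13.1% and 12.9%, a gap of about 0.2 points.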

Finally, to check statistical significance, I ran a simulation. I pulled random players who were born after 1934 (to roughly match Gassko's dataset), and arbitrarily decided their last names started with "K". I kept pulling until the total PA went past 464664, to come close to Gassko's number (so every sample's total ends up a bit higher than 464664). Then I computed their overall strikeout rate.

I repeated that 100 times. The results:

Mean 14.94%, SD 0.44%

That means that Gassko's result, a 0.5-point difference, is only about 1 SD higher than the mean. And my result is only half an SD higher than the mean. So, no statistical significance.
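The procedure can be sketched roughly as follows. This is not the code I actually ran: the player pool here is made-up data (hypothetical strikeout and PA totals), the function name `simulate_k_rate` is just for illustration, and I'm assuming players are drawn without replacement. The last few lines redo the SD comparison using the figures reported above.

```python
import random
import statistics

def simulate_k_rate(players, target_pa, rng):
    """Draw players in random order (without replacement) until total PA
    passes target_pa, then return the pooled strikeout rate of that group."""
    order = list(players)
    rng.shuffle(order)
    total_so = total_pa = 0
    for so, pa in order:
        total_so += so
        total_pa += pa
        if total_pa > target_pa:
            break
    return total_so / total_pa

# Hypothetical pool of (career strikeouts, career PA); NOT real player data.
rng = random.Random(2007)
pool = []
for _ in range(5000):
    pa = rng.randint(1, 8000)
    rate = min(max(rng.gauss(0.15, 0.05), 0.02), 0.40)
    pool.append((round(pa * rate), pa))

# Repeat the draw 100 times, as in the post.
rates = [simulate_k_rate(pool, 464664, rng) for _ in range(100)]
print(f"Mean {statistics.mean(rates):.2%}, SD {statistics.stdev(rates):.2%}")

# SD comparison using the figures from the post (in percentage points).
post_sd = 0.44
gassko_sds = 0.5 / post_sd   # Gassko's 0.5-point difference
mine_sds = 0.2 / post_sd     # my 0.2-point difference
print(f"Gassko: {gassko_sds:.1f} SD, mine: {mine_sds:.1f} SD")
```

With the real data in place of the made-up pool, the mean and SD printed by the first part are what get compared against the observed "K" rate.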

So what's going on? I guess we have to wait for the original study to be released before we find out.

P.S. From a quick glance, it looks like only one letter in Gassko's study is statistically significant: the "O". And exactly one significant result out of 26 is itself not significant.
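To put a rough number on that last point, here is a small sketch. It assumes the usual 5% significance level and treats the 26 letters as independent tests, neither of which is stated in Gassko's study, so take it as a back-of-the-envelope figure.

```python
# Assumptions (mine): 5% significance level, 26 independent tests.
alpha = 0.05
n_letters = 26

# Expected number of letters that look "significant" by chance alone.
expected_false_positives = n_letters * alpha

# Chance that at least one letter looks "significant" by chance alone.
p_at_least_one = 1 - (1 - alpha) ** n_letters

print(f"Expected chance hits: {expected_false_positives:.1f}")
print(f"P(at least one):      {p_at_least_one:.1%}")
```

Under those assumptions you'd expect about 1.3 letters to come up significant by luck, and there's roughly a 74% chance of seeing at least one.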



At Sunday, November 25, 2007 4:12:00 PM, Anonymous said...

I can't fully replicate the Nelson/Simmons results either, but I can get a somewhat similar result, when the initial used is the initial of the FIRST name.

As you mention, the authors appear to have used a definition of PA=AB+BB. Using this, I do match the number of players mentioned in the Newsweek article as meeting the plate appearance criterion. For K/PA, 1st initial K strikes out 14.69% versus 12.90% for everyone else. Using K/AB, 1st initial K strikes out 16.15% compared to 14.15% for everyone else. Using the initial of the last name, there is virtually no difference between "K" names and everyone else (13.00/12.94 for K/PA and 14.26/14.19 for K/AB).

This is probably an illustration of the danger of relying on a second hand source's summary of the original. When I read the Newsweek article (by Sharon Begley) I saw a weird mix of examples of the effect in action for hitting, school grades and law school admissions, sometimes using first names and sometimes using last names.

As you say, we'll have to wait for the study itself to fully understand what the researchers did ...

At Sunday, November 25, 2007 5:27:00 PM, Blogger Phil Birnbaum said...

Aha ... FIRST names! Thanks! I should have thought of that. When the SI article mentioned Kingman, I assumed it was last names. As you say, secondhand sources can be unreliable. Off I go to check first names!

At Sunday, November 25, 2007 6:15:00 PM, Blogger Phil Birnbaum said...

OK, addressed in my new post, here.

