(Note: please read Part I and Part II to understand what I'm talking about here.)
OK, after rereading the Sulloway/Zweigenhaft paper, and digesting some comments from the previous posts, I think I now have a pretty good idea of what the authors did.
In the previous post, I compared all brothers to their siblings. I found that when the older brother was called up first, which was most of the time, the younger brother attempted more steals 54% of the time. When the *younger* brother was called up first, which was seldom, he attempted more steals 87% of the time.
That's because when the younger brother was called up first, he must have been quite a bit better than his older sibling, in order to make the major leagues at a much younger age. So 87% is reasonable for younger brothers who are so good that they get called up even before their older brothers. On the other hand, 54% is normal for the more common case where the older brother gets called up first.
Now, since the authors explicitly say they ran a regression and controlled for who got called up first, let's do that ourselves.
Simplifying the numbers, let's say
60% outsteal their brothers when called up last and younger.
40% outsteal their brothers when called up first and older.
90% outsteal their brothers when called up first and younger.
10% outsteal their brothers when called up last and older.
Now, let's set that up as a regression. We'll create dummy variables for "called up first" and "older". And so each line of our regression, dependent variable followed by independent variables, is:
0.6 ... 0 0
0.4 ... 1 1
0.9 ... 1 0
0.1 ... 0 1
However, the paper used odds ratios instead of probabilities, so let's do that.
A probability of 0.6 corresponds to 0.6 successes to 0.4 failures. That's an odds ratio of (0.6 divided by 0.4), which is 1.5.
Similarly, a probability of 0.4 gives an odds ratio of 0.667. A probability of 0.9 gives an odds ratio of 9 (9 successes to 1 failure). And a probability of 0.1 gives an odds ratio of 0.111.
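If you want to check those conversions yourself, here's a quick Python sketch (mine, not the paper's):

```python
# Convert each probability of outstealing one's brother into odds
# (successes divided by failures), in the order listed above.
probs = [0.6, 0.4, 0.9, 0.1]
odds = [round(p / (1 - p), 3) for p in probs]
print(odds)  # [1.5, 0.667, 9.0, 0.111]
```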
Now we have
1.50 ... 0 0
0.67 ... 1 1
9.00 ... 1 0
0.11 ... 0 1
We'll do one more thing: the authors say they did a logarithmic transformation, so we'll take the logarithm of the odds ratios. That makes sense: you'd expect odds to be multiplicative, not additive, and the log of the odds ratio is standard in this kind of regression. Now we have:
+0.405 ... 0 0
-0.405 ... 1 1
+2.197 ... 1 0
-2.197 ... 0 1
Because of the symmetry, we have to tweak the numbers just a little bit to avoid singularity. (I think I changed one of the "405s" to a "407", or something, and a "187" to a "195" or something.)
If we now run that regression, what happens? We get
log of the odds ratio = (1.79 * dummy for called up first) - (2.60 * dummy for being older) + 0.404
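Here's a numpy sketch of that regression (my reconstruction, not the authors' code -- as it happens, plain least squares fits the four rows exactly, so no tweak was needed in this version):

```python
import numpy as np

# Log odds ratios for the four (called up first, older) combinations,
# in the same order as the table above.
log_odds = np.log([1.5, 2/3, 9.0, 1/9])
first = np.array([0, 1, 1, 0])   # dummy: called up first
older = np.array([0, 1, 0, 1])   # dummy: older brother

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(4), first, older])
intercept, b_first, b_older = np.linalg.lstsq(X, log_odds, rcond=None)[0]

print(round(intercept, 3), round(b_first, 2), round(b_older, 2))
# 0.405 1.79 -2.6
print(round(np.exp(-b_older), 1))  # 13.5 -- the odds multiplier for being younger
```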
And that means that being older subtracts 2.60 from the log of the odds ratio. The antilog of 2.60 equals 13.5. And so, being younger means your odds ratio goes up by 13.5 times!
The authors of the study came up with 10.6 times, but their data was different from mine. I think that what I've done and what the study did are pretty much the same thing.
(One quick note: the odds ratio of 13.5 does NOT mean that your chance is 13.5 times higher. It means that your ODDS are 13.5 times higher. Suppose your original odds were 2:1 in your favor, which is a probability of .667. Now, your odds multiply by 13.5 times, which is 27:1. That's a probability of .964. Not 13.5 times the chance -- 13.5 times the odds. Obviously, a .964 shot is not 13.5 times the chance of a .667 shot.
Where the original study and NYT article say "10 times as likely", they really mean "10 times the odds" in this narrow sense. The authors use "X times more likely" throughout the paper, when I think they mean "has X times the odds ratio".)
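To make that concrete, here's the arithmetic in Python:

```python
def odds_to_prob(odds):
    # Odds of k:1 in your favor correspond to a probability of k / (k + 1).
    return odds / (1 + odds)

original_odds = 2.0              # 2:1 in your favor
new_odds = original_odds * 13.5  # 27:1 after multiplying the odds
print(round(odds_to_prob(original_odds), 3))  # 0.667
print(round(odds_to_prob(new_odds), 3))       # 0.964 -- not 13.5 times 0.667
```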
Going back to the 13.5 odds ratio: why is it wrong? Well, it isn't. It's right!
Suppose that you're an older player who got called up first. Your odds of attempting more steals are 0.667 to 1 (40%). Now, suppose you're magically turned into the younger player, holding everything else equal (so you still got called up first). Now your odds of attempting more steals are multiplied by 13.5. 0.667 times 13.5 equals 9. Your odds are 9 to 1, or 90%, just as we assumed at the beginning!
And, suppose that you're the older player who got called up last. Your odds of attempting more steals are 0.111 to 1 (10%). Now, suppose you're magically turned into the younger player, holding everything else equal (so you still got called up last). Now your odds of attempting more steals are multiplied by 13.5. 0.111 times 13.5 equals 1.5. Your odds are 1.5 to 1, or 60%, just as we assumed at the beginning!
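You can verify both round trips in a few lines of Python:

```python
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(odds):
    return odds / (1 + odds)

multiplier = 13.5
for p_older in (0.4, 0.1):  # older brother, called up first / called up last
    p_younger = odds_to_prob(prob_to_odds(p_older) * multiplier)
    print(round(p_younger, 2))
# 0.9 and 0.6, the younger-brother probabilities we started with
```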
It actually works out!
So does that really mean that younger players DO have 13.5 times (or 10.6 times, as the real study found) the odds of outstealing their older brothers? Yes, in the regression, but not in real life.
In real life, whether you get called up first doesn't really affect your steals directly. Being called up first matters because it's a proxy for ability. If you're younger and called up first, you're likely to be a great ballplayer. Medium-skill slow ballplayers might get called up at 23 or 24, but not at 20. Early callups are reserved for the most excellent players, who are also likely to be fast.
So, if you change old to young, and "keep everything constant," you're really NOT keeping everything constant. By keeping "called up first" constant, but changing older to younger, you're actually CHANGING "of roughly average ability" (old player called up first) to "of much better than average ability" (young player called up first). It's a little sleight of hand, using a proxy variable that means different things depending on the other variable. Younger brothers steal at 13.5 times the odds of old players only if, at the same time, they are much, much better players than the old players.
Analogy again: suppose I run this new regression
Amount of money = number of $5 bills in wallet + dummy for whether number of $1000 bills equals number of $5 bills
Suppose you have no $5 bills and the dummy is "yes", meaning you have no $1000 bills either. Your estimated wealth is therefore $0.
But now, if your inventory of $5 bills increases by 1, *holding everything else constant*, the regression will tell you that $5 bill is worth $1005! Because "holding everything else constant" means you "still" have the same number of thousands as of fives, which means you now have another $1000 bill! The "constant" refers only to holding the regression variable constant. It isn't holding the *real-life* variable constant, which is the number of $1000 bills. What's hidden is that in order to hold the regression variable constant, I have to give you $1000.
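Here's the analogy in code, with made-up wallets where the number of $1000 bills always equals the number of $5 bills (so the dummy is "yes" everywhere and drops out of the fit):

```python
import numpy as np

# Hypothetical wallets: the count of $1000 bills always equals the
# count of $5 bills, so the hidden $1000s ride along with the $5s.
fives = np.array([0, 1, 2, 3, 4])
money = 5 * fives + 1000 * fives

# Regress money on the number of $5 bills.
slope, intercept = np.polyfit(fives, money, 1)
print(round(slope))  # 1005 -- each extra $5 bill appears to be "worth" $1005
```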
Same thing happening here. If you hold "called up first" constant, while changing "older brother" to "younger brother", you're not really holding things constant in the real life sense. To hold the regression variable "called up first" constant, you have to give the younger brother a hidden $1000 worth of talent. And that's why the odds ratio turns out so high.
Want another analogy? Suppose you predict whether a person is likely to be diagnosed a dwarf based on their height and age. Suppose you find that if you hold height constant at 4 feet tall, but increase the person's age from 8 to 28, he's now much more likely to be diagnosed a dwarf. Does that mean age causes dwarfism?
So, anyway, that's what's going on. I believe their regression does indeed find an odds ratio of 10.6, as they report, but when you use a bit of common logic, you see that what they found is absolutely consistent with about 50 to 60% of younger brothers outstealing their older brothers -- not 90%.
Labels: baseball, baserunning, psychology, regression, siblings