Friday, June 18, 2010

Do younger brothers steal more bases than older brothers? Part III

(Note: please read Part I and Part II to understand what I'm talking about here.)

OK, after rereading the Sulloway/Zweigenhaft paper, and digesting some comments from the previous posts, I think I now have a pretty good idea of what the authors did.

In the previous post, I compared all brothers to their siblings. I found that when the older brother was called up first, which was most of the time, the younger brother attempted more steals 54% of the time. When the *younger* brother was called up first, which was seldom, he attempted more steals 87% of the time.

That's because when the younger brother was called up first, he must have been quite a bit better than his older sibling, in order to make the major leagues at a much younger age. So 87% is reasonable for younger brothers who are so good that they get called up even before their older brothers. On the other hand, 54% is normal for the more common case where the older brother gets called up first.

Now, since the authors explicitly say they ran a regression and controlled for who got called up first, let's do that ourselves.

Simplifying the numbers, let's say

60% outsteal their brothers when called up last and younger.
40% outsteal their brothers when called up first and older.
90% outsteal their brothers when called up first and younger.
10% outsteal their brothers when called up last and older.

Now, let's set that up as a regression. We'll create dummy variables for "called up first" and "older". And so each line of our regression, dependent variable followed by the two independent variables ("called up first", then "older"), is:

0.6 ... 0 0
0.4 ... 1 1
0.9 ... 1 0
0.1 ... 0 1

However, the paper used odds ratios instead of probabilities, so let's do that.

A probability of 0.6 corresponds to 0.6 successes to 0.4 failures. That's an odds ratio of (0.6 divided by 0.4), which is 1.5.

Similarly, the odds ratio of 0.4 is 0.667. The odds ratio of 0.9 is 9 (9 successes to 1 failure). And the odds ratio of 0.1 is 0.111.

Now we have

1.50 ... 0 0
0.67 ... 1 1
9.00 ... 1 0
0.11 ... 0 1

We'll do one more thing: The authors say they did a logarithmic transformation, so we'll take the logarithm of the odds ratios. That makes sense: you'd expect odds to be multiplicative, not additive, and the log of the odds ratio is standard in this kind of regression. Now we have:

+0.405 ... 0 0
-0.405 ... 1 1
+2.197 ... 1 0
-2.197 ... 0 1
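
For anyone who wants to check the arithmetic, here is a minimal sketch of the three steps above (probability, then odds, then log of the odds) in Python. The function and variable names are my own for illustration; the only inputs are the four simplified percentages from this post.

```python
import math

# Simplified probabilities from the post, keyed by
# (called_up_first dummy, older dummy).
probs = {
    (0, 0): 0.6,  # called up last and younger
    (1, 1): 0.4,  # called up first and older
    (1, 0): 0.9,  # called up first and younger
    (0, 1): 0.1,  # called up last and older
}

def odds(p):
    """Convert a probability to odds: successes per failure."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds."""
    return math.log(odds(p))

# One regression row per case: log odds, then the two dummies.
rows = [(log_odds(p), first, older) for (first, older), p in probs.items()]

for y, first, older in rows:
    print(f"{y:+.3f} ... {first} {older}")
```

The printed rows should match the table just above, to three decimals.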

Because of the symmetry, we have to tweak the numbers just a little bit to avoid singularity. (I think I changed one of the "405s" to a "407", or something, and a "187" to a "195" or something.)

If we now run that regression, what happens? We get

log of the odds ratio = (1.79 * dummy for called up first) - (2.60 * dummy for being older) + 0.404

And that means that being older subtracts 2.60 from the log of the odds ratio. The antilog of 2.60 equals 13.5. And so, being younger means your odds ratio goes up by 13.5 times!
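
Here is a rough sketch of that regression in plain Python, in case you want to reproduce the coefficients. It is not the authors' method or software, and not exactly what I ran either: it solves ordinary least squares by hand through the normal equations, and because it only solves for the coefficients it uses the untweaked log odds. The helper `ols` and all variable names are made up for this example.

```python
import math

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting.
    Only meant for tiny, well-conditioned problems like this one."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# (probability, called_up_first, older) for the four simplified cases.
cases = [(0.6, 0, 0), (0.4, 1, 1), (0.9, 1, 0), (0.1, 0, 1)]

y = [math.log(p / (1.0 - p)) for p, _, _ in cases]   # log of the odds
X = [[1.0, first, older] for _, first, older in cases]  # intercept + dummies

intercept, b_first, b_older = ols(X, y)
odds_multiplier_younger = math.exp(-b_older)  # antilog of the "older" penalty

print(f"intercept={intercept:.3f} first={b_first:.3f} older={b_older:.3f}")
print(f"odds multiplier for being younger: {odds_multiplier_younger:.1f}")
```

The coefficients should land very close to the 1.79, -2.60 and 0.404 quoted above, with the small differences coming from my tweaking of the inputs.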

The authors of the study came up with 10.6 times, but their data was different from mine. I think that what I've done and what the study did are pretty much the same thing.


(One quick note: the odds ratio of 13.5 does NOT mean that your chance is 13.5 times higher. It means that your ODDS are 13.5 times higher. Suppose your original odds were 2:1 in your favor, which is a probability of .667. Now, your odds multiply by 13.5 times, which is 27:1. That's a probability of .964. Not 13.5 times the chance -- 13.5 times the odds. Obviously, a .964 shot is not 13.5 times the chance of a .667 shot.

Where the original study and NYT article say "10 times as likely", they really mean "10 times the odds" in this narrow sense. The authors use "X times more likely" throughout the paper, when I think they mean "has X times the odds ratio".)
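
To make the odds-versus-probability distinction concrete, here is a small sketch (function names are mine, purely illustrative) that multiplies the odds by 13.5 and then converts back to a probability, using the 2:1 example from the note above.

```python
def prob_to_odds(p):
    """Probability -> odds (successes per failure)."""
    return p / (1.0 - p)

def odds_to_prob(o):
    """Odds -> probability."""
    return o / (1.0 + o)

def apply_odds_multiplier(p, multiplier):
    """Multiply the odds (not the probability) by `multiplier`."""
    return odds_to_prob(prob_to_odds(p) * multiplier)

p_before = 2 / 3                       # odds of 2:1
p_after = apply_odds_multiplier(p_before, 13.5)  # odds of 27:1

print(round(p_after, 3))               # 0.964
print(round(p_after / p_before, 2))    # 1.45
```

So 13.5 times the odds is, in this example, only about 1.45 times the chance.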


Going back to the 13.5 odds ratio: why is it wrong? Well, it isn't. It's right!

Suppose that you're an older player who got called up first. Your odds of attempting more steals are 0.667 to 1 (40%). Now, suppose you're magically turned into the younger player, holding everything else equal (so you still got called up first). Now your odds of attempting more steals are multiplied by 13.5. 0.667 times 13.5 equals 9. Your odds are 9 to 1, or 90%, just as we assumed at the beginning!

And, suppose that you're the older player who got called up last. Your odds of attempting more steals are 0.111 to 1 (10%). Now, suppose you're magically turned into the younger player, holding everything else equal (so you still got called up last). Now your odds of attempting more steals are multiplied by 13.5. 0.111 times 13.5 equals 1.5. Your odds are 1.5 to 1, or 60%, just as we assumed at the beginning!

It actually works out!
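
If you'd rather check all four cases at once, this sketch plugs the rounded coefficients from the regression equation above (0.404, 1.79, -2.60) back in and converts the predicted log odds to probabilities. The function names are hypothetical; nothing here comes from the paper.

```python
import math

# Rounded coefficients from the regression equation in this post.
INTERCEPT = 0.404
B_FIRST = 1.79    # dummy for called up first
B_OLDER = -2.60   # dummy for being older

def predicted_probability(called_up_first, older):
    """Predicted chance of out-stealing your brother, from the fitted log odds."""
    log_odds = INTERCEPT + B_FIRST * called_up_first + B_OLDER * older
    return 1.0 / (1.0 + math.exp(-log_odds))  # inverse of log(p / (1 - p))

for first, older in [(0, 0), (1, 1), (1, 0), (0, 1)]:
    p = predicted_probability(first, older)
    print(f"first={first} older={older}: {p:.2f}")
```

To two decimals, the four predictions come back as the 60%, 40%, 90% and 10% we started with.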

So does that really mean that younger players DO have 13.5 times (or 10.6 times, as the real study found) the odds of outstealing their older brothers? Yes, in the regression, but not in real life.

In real life, whether you get called up first doesn't really affect your steals directly. Being called up first matters because it's a proxy for ability. If you're younger and called up first, you're likely to be a great ballplayer. Medium-skill slow ballplayers might get called up at 23 or 24, but not at 20. Early callups are reserved for the most excellent players, who are also likely to be fast.

So, if you change old to young, and "keep everything constant," you're really NOT keeping everything constant. By keeping "called up first" constant, but changing older to younger, you're actually CHANGING "of roughly average ability" (old player called up first) to "of much better than average ability" (young player called up first). It's a little sleight of hand, using a proxy variable that means different things depending on the other variable. Younger brothers steal at 13.5 times the odds of old players only if, at the same time, they are much, much better players than the old players.

Analogy again: suppose I run this new regression

Amount of money = number of $5 bills in wallet + dummy for whether number of $1000 bills equals number of $5 bills

Suppose you have no $5 bills and the dummy is "yes", meaning you have no $1000 bills either. Your estimated wealth is therefore $0.

But now, if your inventory of $5 bills increases by 1, *holding everything else constant*, the regression will tell you that $5 bill is worth $1005! Because "holding everything else constant" means you "still" have the same number of thousands as of fives, which means you now have another $1000 bill! The "constant" refers only to holding the regression variable constant. It isn't holding the *real-life* variable constant, which is the number of $1000 bills. What's hidden is that in order to hold the regression variable constant, I have to give you $1000.
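
Here is the wallet example as a few lines of Python, just to show the two different meanings of "holding everything else constant". All the names are invented for this illustration.

```python
def wealth(n_fives, n_thousands):
    """Actual money in the wallet."""
    return 5 * n_fives + 1000 * n_thousands

def same_count_dummy(n_fives, n_thousands):
    """The regression's dummy: 1 if the two bill counts are equal, else 0."""
    return int(n_fives == n_thousands)

start = (0, 0)  # no $5 bills, no $1000 bills, dummy = 1

# Add one $5 bill while holding the REGRESSION variable (the dummy) at 1:
# the only way to do that is to add a $1000 bill too.
dummy_held = (1, 1)

# Add one $5 bill while holding the REAL-LIFE variable ($1000 bills) constant.
real_life_held = (1, 0)

gain_dummy_held = wealth(*dummy_held) - wealth(*start)
gain_real_life_held = wealth(*real_life_held) - wealth(*start)

print(same_count_dummy(*start), same_count_dummy(*dummy_held))  # 1 1
print(gain_dummy_held)      # 1005
print(gain_real_life_held)  # 5
```

Same extra $5 bill, but the "value" you read off depends entirely on which thing you quietly held constant.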

Same thing happening here. If you hold "called up first" constant, while changing "older brother" to "younger brother", you're not really holding things constant in the real life sense. To hold the regression variable "called up first" constant, you have to give the younger brother a hidden $1000 worth of talent. And that's why the odds ratio turns out so high.

Want another analogy? Suppose you predict whether a person is likely to be diagnosed a dwarf based on their height and age. Suppose you find that if you hold height constant at 4 feet tall, but increase the person's age from 8 to 28, he's now much more likely to be diagnosed a dwarf. Does that mean age causes dwarfism?


So, anyway, that's what's going on. I believe their regression does indeed find an odds ratio of 10.6, as they report, but when you use a bit of common logic, you see that what they found is absolutely consistent with about 50 to 60% of younger brothers outstealing their older brothers -- not 90%.



At Friday, June 18, 2010 1:00:00 PM, Blogger BMMillsy said...

I found using base stealing attempts to be an odd way of showing risk-taking to begin with.

While on the surface they certainly could indicate risk taking, the choice is often made by someone else (the coaching staff). While that doesn't necessarily make the conclusion as to the number of attempts different, it certainly makes me question the idea that the study is evaluating the choices made by the younger brother himself.

A corollary would be that two brothers join the same branch and specialty area of the military. The younger one exogenously gets stationed in Iraq, while the other is stationed in San Diego. Then to see who's the bigger risk taker, you count up the number of younger vs. older brothers. It could be that the military chooses to send younger guys overseas (just like there's some advantage to having them steal more because they're faster). The study declares the younger brother to be the bigger risk taker, when in fact they're simply judging the preferences of the military.

At Wednesday, June 23, 2010 5:19:00 PM, Blogger bradluen said...

Hi Phil,
I'm not sure if the authors got their odds ratio through a regression -- I don't think they did, but I don't have any better idea as to what they actually did -- but I think your idea is right. Whatever method they're using, they're comparing younger brothers called up first to elder brothers called up first (slanted in favour of the exceptional younger players), youngers called up at same time to elders called up at same time (also slanted in favour of the exceptional younger players), and youngers called up later to elders called up later (slanted against the weak elders). Even if they're at first comparing brothers in the same family, the stratification is breaking the within-family structure. Combine this with some tiny subsample sizes, and the huge odds ratio isn't so surprising.

Also, come to think of it, Millsy is probably right that the connection between stealing and risk-taking (as opposed to, you know, being fast) is not entirely evident.

At Wednesday, June 23, 2010 5:44:00 PM, Blogger Phil Birnbaum said...

Hi, Brad,

I think it might even be worse than that. They say they're holding "called up first" constant and comparing young to old. But I think they're not comparing young STEALS to old -- they're comparing young "odds of beating his sibling" to old "odds of beating his sibling".

Hold "called up first" constant at "yes". Then they're comparing young players called up first (exceptionally better than siblings) to old players called up first (average against siblings). Therefore, high odds ratio.

Hold "called up first" constant at "no". Then they're comparing young players called up last (average against siblings) to old players called up last (exceptionally worse than siblings). Therefore, high odds ratio again.

I'm not 100% sure, but I think that's what's going on. It's entirely consistent with what they say on page 7.

