## Sunday, October 31, 2010

### Can estimates of wheat production improve our evaluation of baseball talent?

Suppose that in a certain amount of playing time, a baseball player has 10 home runs. What's your best estimate of his true talent? One easy answer is: 10 home runs.

We know we can get a better estimate if we do a regression to the mean for all similar players. However, "Stein's Paradox" has a stronger result, one that sounds like it's just plain nuts. According to Stein, you don't have to use home run numbers to get your mean. You'll still get a more accurate estimate if you average the number "10" with a bunch of completely unrelated other samples, and then regress to the mean of all those numbers.

"Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements [and regressing each one to the mean of all three]."

As I said, it sounds nuts. But it's an established result. A longer explanation, with baseball examples, is here (.pdf). It's a 1977 piece from Scientific American, by Bradley Efron and Carl Morris. (Hat tip to Craig Heldreth of "Less Wrong".)

After a bit of thought and reading, I believe I may understand why it's true -- true, but not useful. I'm still a little fuzzy on it, but I'm going to write out what I think is the explanation. If there's anyone reading this who understands this stuff well, please let me know if I got it right.

----

Suppose a baseball batter hits .750 in a single game. What's your estimate of his true talent? It's going to be a lot less than .750. You're going to have to regress your estimate towards the mean of all players, and you might wind up estimating that he's really a .300 hitter who got lucky. (There are formal ways to make a more precise estimate, but we'll just keep it intuitive for now.)

We were able to make reasonable estimates because we already knew what mean of the population we were regressing him to: maybe about .260 or something. But what if we didn't know that?

For instance, I just made up a new offensive stat. I won't tell you what it is. But I selected a player at random, and he had a 13.43 last season. What do you estimate his talent is?

So far, your best estimate of his talent is that very same 13.43. Why? Because you don't know how good or how bad 13.43 is -- you don't know the mean to compare it to. If 13.43 is the equivalent of a .350 average, you'd regress down. If 13.43 is the equivalent of a .200 average, you'd regress up. But without an idea of where the center is, you don't know what to do with the 13.43, and your best estimate is to just leave it alone.

But now, what if I give you *three* random players? The first one is still 13.43. The second one is 9.31, and the third is 18.80. Now things are different. You now have some idea of where the mean is -- maybe around 14 -- and so you can regress all three guys accordingly. The first one regresses up a tiny bit, the second one moves up substantially, and the third one drops down. Maybe your intuitive estimates are now 13.6, 11, and 16. Or something like that.

What "Stein's Paradox" tells you is this: even though you want to estimate all three of these players separately, you get better estimates if you estimate each player based on a formula based on the observations for *all three* players.
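For concreteness, here's a small sketch of the classical (positive-part) James-Stein estimator, which shrinks each measurement toward a fixed target point. The stat values are the made-up ones from later in this post; the target of 14 and the measurement variance sigma2=4 are invented for illustration. (The variant that shrinks toward the sample mean replaces the k-2 factor with k-3 and needs four or more measurements.)

```python
def james_stein(x, sigma2, target=0.0):
    """Positive-part James-Stein estimator: shrink each of k >= 3
    independent Gaussian measurements (known variance sigma2)
    toward a fixed target point."""
    k = len(x)
    centered = [xi - target for xi in x]
    s = sum(c * c for c in centered)
    shrink = max(0.0, 1 - (k - 2) * sigma2 / s)  # shrinkage factor in [0, 1]
    return [target + shrink * c for c in centered]

# the three made-up stat values from later in the post, pulled toward 14
print(james_stein([13.43, 9.31, 18.80], sigma2=4.0, target=14.0))
```

Each estimate moves toward the target: the low one comes up, the high one comes down, exactly the "regress everything toward the middle" behavior described above.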

That doesn't seem all that paradoxical ... sabermetrics has known that for a long time. (However, in 1977, when the Scientific American article was written, this might have been a surprising result.)

Where the *real* paradox appears to arise is when the three things you're looking at have nothing to do with each other. Because, as Charles Stein proved in 1955, you *still* get a better estimate when you regress to the mean, even when it looks like there's no reason to, as in the wheat/Wimbledon/candy example at the top of the post.

Suppose our formal statistical estimates are 10,000,000,000 bushels for US wheat, 200,000 spectators for Wimbledon, and 50g for the candy bar. The mean is around 3,000,000,000, and we'll get a more accurate set of three estimates if we regress all three numbers a bit closer to 3,000,000,000.

That's just weird. If we estimate the weight of a candy bar at 50g, why should we increase our estimate just because the US wheat harvest is in the millions?

It doesn't seem right, and I think that's why it's called a "paradox".

------

How to explain it, then? Well, for one thing, the regression to the mean for the wheat/Wimbledon/candy case is going to be infinitesimal, and so almost useless. As the article puts it,

" ... the James-Stein estimator does substantially better than the [non-regressed] averages only if the true means lie near each other ... what is surprising is that the estimator does at least marginally better no matter what the true means are."

My translation:

"... regression to the mean works really well if the talent levels are close together relative to the size of errors of the estimates ... but if they're not, then it doesn't work that great."

Or, translated into baseball:

"... regression to the mean works well in batting averages, because the SD of talent in batting average is not that much different from the amount of luck in a season's totals ... but regression to the mean doesn't work well in the wheat/Wimbledon case, because the spread of the three true values (10,000,000,000; 200,000; and 50) is much larger than the SD of the measurement errors."

So now we're getting closer to our intuition, which says that regressing to the mean makes sense for batting averages compared to other batting averages, but not for wheat estimates compared to candy bar estimates.

------

Still, why does the wheat/Wimbledon/candy regression to the mean work even "at least marginally better" than no regression to the mean? It seems the effect should be zero, since those three means appear to have nothing to do with each other at all.

That is, suppose you weigh the candy bar and get an estimate of 50g. And then someone comes up to you and says, "I estimate there were 10,000,000,000 bushels of wheat produced in 1993."

Why, then, should you suddenly say, "Ha! If you change your estimate to 9,999,999,999.999997, and I change my estimate to 50.000000001, we'll reduce our combined error!" That doesn't seem like it should be true, no matter how many decimals I go to.

But I think it does work.

To see why, let me start with a baseball example where I have only one player. I pick a player A at random, look at his season stats, and compute his "blorgs created" (BC) as 375. Should I regress to the mean? It doesn't matter, because, with only one player in the sample, his score IS the mean. That means: without knowing the properties of "blorgs created", the chance of his talent being less than 375 is the same as the chance of his talent being more than 375.

Now, introduce another random player B, and observe that he had a BC of 300. It looks like I don't have any more information about A -- but I do. Before, I only knew he was at 375. Now, I know he was at 375, and ALSO that 375 happens to be the highest value of everyone in the sample. So 375 is now more likely to be a good number, which means it was likely the beneficiary of good luck, which means my estimate of A's talent should be less than 375.

If I look at another random player, C, and he's at 322 ... well, now I have even *more* reason to believe that A's talent is less than 375. And so on.

We can generalize that to:

If you look at a bunch of different outcomes that have some randomness in them, the highest outcome will likely be higher than its talent, rather than lower.

Suppose you have ten players of varying batting talents, and each one takes 500 AB. What is the chance that the highest batting average will turn out to have been higher than the talent of the player who hit it?

It's higher than 50 percent, a lot higher. Here's why.

Consider the player with the highest talent -- call him Bob. Bob has a 50 percent chance of doing better than his talent (and 50 percent of doing worse). If he does better, then it must be true that the highest BA exceeded the talent. So there's 50% right there.

Now, if he does *worse* than his talent, consider the second-best guy, Ted. Maybe Ted is only a little bit worse than Bob.

What's the chance that Ted will finish ahead of Bob's talent? Well, there's a 50% chance that Ted will finish ahead of Ted's talent. And we decided that Bob is only a little bit more talented than Ted. So the chance of Ted beating Bob's talent is maybe 49%.

If Ted beats Bob's talent, then it must be true that the highest batting average is higher than the talent of the player hitting it -- either Ted finished first and beat his own talent (and Bob's), or one of the other eight guys did.

So now our probability is 74.5% -- 50% that Bob beat his talent, plus 49% of 50% that Bob didn't beat his talent but Ted did. And we haven't even got to the other eight guys yet! (Not only that, but we haven't considered the chance that Ted may have beat his own talent but not Bob's, but still finished first in batting.)

If we do add the others in, we'll probably wind up pretty close to 100%. In real life, we don't have ten guys at the top all so close to each other, but I'd still bet that the chance is up past 99%.
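That "pretty close to 100%" claim is easy to check with a quick Monte Carlo sketch. The ten talent levels (.290 to .308) are my assumption, and for speed I'm using the normal approximation to 500-AB binomial luck rather than simulating individual at-bats:

```python
import math
import random

def p_leader_beats_talent(talents, ab=500, trials=100_000, seed=1):
    """Monte Carlo estimate of the chance that the highest observed
    batting average exceeds the true talent of the player who hit it.
    Observed average ~ Normal(talent, talent*(1-talent)/ab)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # (observed average, true talent) for each player
        obs = [(rng.gauss(t, math.sqrt(t * (1 - t) / ab)), t) for t in talents]
        top_avg, top_talent = max(obs)
        if top_avg > top_talent:
            wins += 1
    return wins / trials

# ten talents packed tightly between .290 and .308
print(p_leader_beats_talent([0.290 + 0.002 * i for i in range(10)]))
```

With talents packed that closely (the spread is smaller than one season's worth of luck), the leader's average beats his own talent almost every time.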

Again, what this means is: the highest result is probably higher than the talent, and needs to be regressed to the mean.

This holds even if the measurements are of different things. Suppose you take all the NHL goal scorers and combine them with the MLB home run hitters. The highest value, whether MLB or NHL, still needs to be regressed to the mean, for exactly the same reasons. And you can even combine those with other things, like, say, your estimate of the number of candies in a pack. The highest value, whether Jose Bautista, Alex Ovechkin, or M&Ms, probably needs to be regressed to the mean. It doesn't make a difference that all the measurements aren't of the same thing.

However: it DOES matter that the talent levels be close together. Why? Because, if not, then the argument a few paragraphs above will fail. Suppose you have a sample consisting of Babe Ruth, Rod Carew, Jose Oquendo, and Ozzie Smith. For that sample, what are the odds that the highest home run value will be higher than its talent? It's not 90% any more -- now, it's almost exactly 50%. Why? Because the winner is almost always going to be Babe Ruth. So the chance that the highest home run total exceeds the player's talent is exactly the chance that Babe Ruth exceeds his talent, which is 50%. In the previous example, there were at least 10 guys with a legitimate chance to exceed their talent and also lead the league. Now, there's only one.

That is, suppose the Babe's talent is 50 HR. If you've got nine guys behind him with talent of 48, then it's very likely, well over 90%, that at least one of the 10 guys will hit over 50 HR and exceed his talent. But if the other guys are only at 5 HR, then nobody has a chance of hitting 50+ home runs except the Babe himself, and the chance that Babe exceeds his talent is only 50%.
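The tight-versus-wide contrast can be simulated directly. Here observed HR totals are modeled as talent plus Gaussian luck; the luck SD of 8 homers is an invented number, just to make the point:

```python
import random

def p_leader_over_talent(talents, luck_sd=8.0, trials=100_000, seed=7):
    """Chance that the league leader's HR total exceeds his own talent,
    with observed HR = talent + Gaussian luck."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        top, top_talent = max((rng.gauss(t, luck_sd), t) for t in talents)
        if top > top_talent:
            wins += 1
    return wins / trials

tight = [50] + [48] * 9   # nine hitters right behind the Babe
wide  = [50] + [5] * 9    # nobody else remotely close
print(p_leader_over_talent(tight))  # well over 90 percent
print(p_leader_over_talent(wide))   # very close to 50 percent
```

In the tight case, lots of players have a real shot at leading the league while beating their own talent; in the wide case, the leader is essentially always the Babe, and his coin flip is 50/50.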

That means that when you have a tight distribution of HR talent, you have to regress to the mean. If you have a wide spread of HR talent, then you don't.

Well, you *almost* don't. There is a small but non-zero chance that Rod Carew could, just by luck, hit more home runs than the Babe and surpass his talent. Because the chance is small but positive, the amount of regressing to the mean is also small but positive.

Suppose there's a 1 in 100,000 chance that Rod Carew will lead the league instead, even though his talent level is only 5 HR. Then, if the league leader hits X homers, your estimate of the talent of the league leader is (.99999 * X) + (.00001*5). That is, you DO have to regress to the mean, a tiny, tiny bit.

And that brings us to bushels of wheat vs. Wimbledon fans vs. candy bars. If you are absolutely sure that there is no way the actual candy bar weight could be higher than the number of bushels of wheat, then you don't have to regress to the mean, and Stein's Paradox doesn't apply.

But you can never be absolutely sure. There's always the infinitesimal probability that, because of measurement error, the bushels of wheat got lucky and hit 10,000,000,000 home runs instead of 50. Which means there's an infinitesimal probability that the 10 billion should actually be just 50. Which means there's an infinitesimal amount of regressing to the mean you have to do to account for that infinitesimal probability.

My guess is that for this example, the amount of regression you have to do is so small, so far down the tail of the probability distribution, that you couldn't even calculate it if you wanted to. But as small as it is, it does exist, as Charles Stein proved.

-----

But what if the numbers are fairly close, but still unrelated, like MLB home runs per season, and NBA points per game? Couldn't we *still* get a very slight benefit by taking the home run numbers and adding in some irrelevant data? If we're trying to figure out how good a home run hitter Alex Rodriguez is, based on his 2008 record, won't it help, even just a little bit, to add in NBA players' game scoring numbers and regress to those, too?

I don't think so. I think there's one hidden assumption in Stein's Paradox that doesn't apply here: the assumption that we know *nothing else* except the individual numbers themselves.

If we were to add in NBA numbers, we'd be violating that assumption: we'd know which numbers were NBA, and which were MLB. Stein's Paradox would help us only if we chose to "forget" that information, and the "forgetting" would lead to a big loss of accuracy.

Suppose I mix a bunch of NBA and MLB numbers, 100 in total, and don't tell you which are which. Here are eight of them:

2, 5, 10, 10, 17, 32, 50, 50

And I want to estimate the talent of each of those players.

Stein's Paradox tells me that if I want to estimate one of the "50" guys, I shouldn't just guess "50" -- I should regress him to the mean. There is a mathematical formula to tell me how much to regress him. Because I can't tell the NBA guys from the MLB guys, the regression will be the same in both cases.

But, now, suppose I know which of those guys are NBA guys:

10, 10, 32, 50

And which ones are MLB guys:

2, 5, 17, 50

Now, by running Stein's formula on each of those two groups separately, I can do better. Specifically, the MLB "50" guy is going to be regressed some (maybe to 35 home runs of talent) and the NBA "50" guy is going to be regressed more (maybe to 25 PPG of talent).

So, having the unrelated data made things worse, not better, because the Stein formula ignores that the new data has different properties (a different prior) than the old data.
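As a sketch of the point: below I use positive-part James-Stein shrinkage toward the sample mean as a stand-in for "Stein's formula" (the post doesn't give the exact formula), with an invented measurement variance of 64. The shrunk estimate you get for a "50" guy depends on which group you run him with:

```python
def shrink_toward_mean(x, sigma2):
    """Positive-part James-Stein shrinkage toward the sample mean.
    The (k - 3) factor needs at least four observations to bite."""
    k = len(x)
    m = sum(x) / k
    s = sum((xi - m) ** 2 for xi in x)
    c = max(0.0, 1 - (k - 3) * sigma2 / s)  # shrinkage factor in [0, 1]
    return [m + c * (xi - m) for xi in x]

# sigma2=64 is an invented measurement variance, just for illustration
pooled = shrink_toward_mean([2, 5, 10, 10, 17, 32, 50, 50], sigma2=64)
nba = shrink_toward_mean([10, 10, 32, 50], sigma2=64)
mlb = shrink_toward_mean([2, 5, 17, 50], sigma2=64)
print(pooled[-1], nba[-1], mlb[-1])  # three different estimates for "50"
```

All three runs pull the "50" down, but by different amounts and toward different means -- which is exactly why pooling numbers with different priors throws away information.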

-----

What Stein's Paradox says is this:

1. You can do better by properly regressing to the mean than by not regressing to the mean.

2. If you have no other information about the data that informs your analysis, here's a formula for the amount to regress that will be an improvement.

Fine, but, for practical purposes, not very useful. First, we *already* regress to the mean. When looking at an observation, we NEVER accept it as a direct estimate of the talent behind it. We never say, "well, the Red Sox are 2-0, so their talent must be 162-0". We never say, "George Bell hit three home runs on opening day, so we estimate his talent at 486 home runs per season." We never say, "Jose Bautista hit 54 home runs last year, so we should expect the same next year."

Second, we *do* have lots of other information. For instance, we know something about the "prior distribution" of home runs and points per game. We know that we should avoid estimating a talent level of 35 PPG, because NBA players just don't have that expected level of performance. But we *can* estimate a talent level of 35 HR, because there are many players who *are* that good.

Stein's Paradox is a very interesting mathematical result, but it's not all that applicable in real life. It's like saying, "You'll get from Boston to Los Angeles faster if you jog instead of walking." You'd be right, but who was planning on going on foot anyway?

----

At Wednesday, November 03, 2010 12:55:00 PM,  Vic Ferrari said...

The linked article is a heavy read, I started skimming part of the way through. Love the 1977 calculator advertisement at the bottom.

As I understand it (perhaps wrongly), and in Bayesian terms, it distills down to using a multivariate normal as the prior, with randomly assigned parameters. This instead of using a Gaussian distribution as the prior, which is the method of sabermetrics.

So with your usual math, and the example used in the article, you're taking the naive binomial likelihood distribution for Clemente (looks like he had 18 hits in 45 AB) ... so g(p) ∝ p^18 * (1-p)^27.

Then approximating that as a Gaussian form.

Then you're approximating the distribution of talent in the population as a Gaussian distribution, with the standard deviation determined by autocorrelation.

Plot those two out as histograms with the same bin widths, then multiply the heights of all overlapping bins.

Adjust that final histogram (technically a multivariate normal with the assumption that ABs and batting average don't covary) to make it Gaussian ... voila!

Stein appears to be doing the same thing, except instead of arbitrarily assuming that hitting talent for the selected population is normally distributed, he creates a multivariate normal form, with the parameters defined by the weighted joint probabilities of other (presumed) normal distributions that are selected at random.

Similarly, you could play a Warren Zevon song on your computer's MP3 player, and do a random screen capture. Then use the equalizer bars as a histogram for your ability distribution in the population. Scale it so it is in the same general range as the Gaussian approximation. Multiply the bins (brute force empirical Bayes) and see how Warren Zevon did.

It shouldn't take too many tries to find a Zevon form that consistently outperforms the Gaussian assumption for batting average in MLB. This in terms of predictive value.

Of course that isn't really a paradox. Nor does it imply that Warren Zevon controls the universe, or even MLB. Though it would be cool if he controlled the latter.

I think that's it, no? Correct me if I'm wrong, I have a thick skin.