Sabermetric Research: Regression to the likely

Thursday, March 25, 2010

Regression to the likely

In the previous post, I gave an example of a statistical test on clutch hitting. It went like this:

"Joe Blow hits .300 in the clutch, and .230 non-clutch. Someone does a standard statistical test on the difference, and finds a significance level of .05. That means that, if the player were actually exactly the same in the clutch as the non-clutch, there would have been only a 5% chance of him having a 70-point-or-more difference just by luck."

Typically, when you get a statistically significant result like this, you use the observed effect as your estimate of the real effect -- in this case, 70 points. Previously, Tango had argued that you shouldn't do that. All you've shown by "statistical significance" is that the result is significantly different from zero. It could be 40 points, it could be 20 points, it could be anything non-zero. You shouldn't just assume it's 70.

I agreed. The point is that you have to take this result, and combine it with everything else you know about clutch hitting, before making a "baseball" estimate of what that observed 70 point difference really tells you.

To clarify what I meant in the previous post, let me give you an example that makes it more obvious. Suppose that you decide to study how good a hitter Albert Pujols is. You randomly pick 8 of his at-bats, and it turns out that in those AB, he went 7-for-8. And suppose your null hypothesis is that Pujols is average, just a .270 hitter.

If you were to do a traditional binomial test, you would find that Pujols' observed .875 batting average is high enough that you would easily reject the null hypothesis that he's .270.
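For the curious, here is a minimal sketch of that binomial test in Python. It assumes a one-sided test (the chance of 7 or more hits in 8 AB if the null is true); the function name is just for illustration.

```python
from math import comb

def binomial_tail(hits, at_bats, p):
    """P(X >= hits) for X ~ Binomial(at_bats, p)."""
    return sum(
        comb(at_bats, k) * p**k * (1 - p) ** (at_bats - k)
        for k in range(hits, at_bats + 1)
    )

# Null hypothesis: Pujols is a .270 hitter. Observed: 7-for-8.
p_value = binomial_tail(7, 8, 0.270)
print(f"P(at least 7 hits in 8 AB | .270 hitter) = {p_value:.5f}")
```

The tail probability comes out well under one in a thousand, so the null is rejected easily at any conventional significance level.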

But even though the sample showed .875, would anyone seriously argue that the evidence shows that Albert Pujols is an .875 hitter? That only makes sense if you're willing to ignore everything that you know about baseball, and if you're also willing to ignore everything that *everybody else* knows about baseball -- that there's no such thing as an .875 hitter.

There is nothing wrong with the statistical calculations and statistical test that came up with the .875 estimate. It's just that a naive statistical test doesn't know anything about baseball. And if you want to make an argument about baseball, you have to use baseball knowledge. The fact that you did a statistical test, and it came up significantly different from conventional wisdom, does NOT mean that conventional wisdom is wrong. When you have piles and piles of evidence that says that Pujols is not an .875 hitter, and one statistical test that estimates he is, based only on 8 AB, then if you consider only the 8 AB and ignore everything else, you're not being a sabermetrician -- you're doing a first-year STAT 101 class assignment.

Anyway, I think the ".875 Pujols" example makes the point clearer than the ".070 clutch" example, because it's more obvious that a .875 hitter is absurd than that a .070 clutch talent is absurd. Less so to Tango, of course, who has studied the clutch issue, and probably reacted the same way to ".070 clutch" as I did to ".875 batting average."

I think the example also makes the arguments of some of the other post's commenters more understandable. A couple of them were arguing that, OK, if you know something about baseball, you might discount the .070. But *in the absence of any other information*, you can take the .070 as the best estimate. To which I say, yes, that's true (subject to serious caveats I'll get to in a minute), but not all that relevant. Because, if you rephrase it as, "in the absence of any other information about Albert Pujols, you can take .875 as the best estimate of his batting average talent" ... well, that's still true, but completely irrelevant to any study that's trying to learn something about baseball.

------

So, we ran the Joe Blow statistical test, and we found that the .070 was statistically significant at .05. And we're also willing to say, "in the absence of any other information, we can take the .070 as the best unbiased estimate of Joe Blow's clutch ability."

Now, I'm going to create a simulated teammate for Joe -- call him David Doe. I'm going to play God, and I'm going to create David Doe to be a clutch hitter. I'm going to go to my computer, and ask for a random number between 0 and 1, and divide it by 10 to get David's random clutch talent. I'm not going to tell you that true random clutch talent, because I'm God and you're the sabermetrician, so it's up to you to figure that out.

However, I'm going to simulate a bunch of non-clutch AB for David, and a bunch of clutch AB. I'll tell you the results and let you do the statistical test. Actually, I'll even do the test for you and tell the result and the significance level.

OK, let me run my randomizer ... done. And, hey, it turns out, coincidentally, the random numbers came up exactly the same for David as for Joe -- he hit exactly 70 points better in the clutch. And, coincidentally, the significance level is the same, .05.

As we said back when we were talking about Joe, in the absence of any other information than that given, we can say the best estimate for Joe's clutch ability is .070. Similarly, in the absence of any other information than that given, what's the best estimate for David's clutch ability?

The answer is not .070.

For David, the estimate of .070 is biased too high. Why? Because we know something in the David case that we didn't know in the Joe case: that because of the way I randomly chose David's clutch ability, it can never be more than .100.

David's true clutch ability might be between .040 and .070, and he just got lucky. It might be between .070 and .100, and he just got unlucky. Those two possibilities are symmetrical, so if we consider only those possibilities, .070 *is* our best estimate.

But there is another possibility: that David's true clutch ability is between .000 and .040, and he got even more lucky. That's not balanced by anything on the "unlucky" side, because there is no way David was actually a .100 to .140 hitter who got unlucky -- that case is impossible.

So David is more likely to have been lucky than unlucky, and the best estimate of his clutch ability is less than .070.

A way to make this more obvious is to give the standard confidence interval around .070. For both Joe and David, it might be the interval (.005, .135). For Joe, that makes sense. For David, it becomes clear that it doesn't make sense: everything above .100 is impossible, and so you know David's confidence interval is wrong.

This looks like a trick, but it's not, not really. It's just a special case of the principle that the estimate and confidence interval are not correct if some possible values of the parameter were less likely to be true than others. "Impossible" is just a special case of "less likely".
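Here is a rough sketch of the David Doe case, under some assumptions of mine: the luck around the true talent is treated as normal, and the standard error is backed out of the hypothetical (.005, .135) interval. The uniform 0-to-.100 talent range is the one from the simulation setup.

```python
from math import exp

OBSERVED = 0.070       # observed clutch difference
SE = 0.065 / 1.96      # assumed standard error, backed out of the (.005, .135) interval
MAX_TALENT = 0.100     # David's talent was drawn uniformly between 0 and .100

def likelihood(talent):
    """Normal-approximation likelihood of the observed difference given a true talent."""
    z = (OBSERVED - talent) / SE
    return exp(-0.5 * z * z)

# Grid of possible talents. The uniform prior gives every grid point equal weight,
# so the posterior weight of each point is just its likelihood.
steps = 1000
grid = [MAX_TALENT * (i + 0.5) / steps for i in range(steps)]
weights = [likelihood(t) for t in grid]
total = sum(weights)

posterior_mean = sum(t * w for t, w in zip(grid, weights)) / total
prob_lucky = sum(w for t, w in zip(grid, weights) if t < OBSERVED) / total

print(f"best estimate of David's talent: {posterior_mean:.3f}")
print(f"chance his true talent is below .070: {prob_lucky:.2f}")
```

Under these assumptions the estimate lands below .070, and "lucky" is more probable than "unlucky," which is the asymmetry described above.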

The real God does these things too. He's made it nearly impossible for anyone to be an .875 hitter. And, evidence shows, he's made it very unlikely for anyone to be a .070 clutch hitter. If you ignore those facts, you'll come up with implausible predictions. If you really believe that a 95% confidence interval around Joe Blow is centered at .070 clutch, I'll be happy to bet you on Joe's performance next year. You should be willing to give me 19:1 odds that Joe will hit within his confidence interval. I'll be very happy to take 10:1. If you're not willing to take my bet, then you don't really believe in your results, do you?

-----

When we said that you can trust the .070 estimate "in the absence of any other information," that phrase, "in the absence of any other information," is a bit of a fuzzy way of saying what the true condition is. There's a mathematical Bayesian way of phrasing the condition, but I'll just use a rough approximation:

-- You can accept the point estimate and the confidence interval if, before the fact, you could say that every value was equally likely to be true.

That's not always the case. In my simulation, it was explicitly not the case, since I told you in advance that clutch couldn't be less than .000 or higher than .100.

It's not the case for clutch, either. Even if you didn't know anything about clutch hitting, it would be obvious, wouldn't it, that a clutch talent of .900 was impossible? And so, technically, not every value is equally likely to be true -- .070 is plausible, .900 is not. So, technically speaking, citing the .070 is invalid. It is NOT accurate to give a confidence interval for your parameter unless you are willing to assume everything is equally likely, from minus infinity to plus infinity.

That's a technicality, of course ... .900 is so far away from the .070 that the study showed, that you don't lose much accuracy from the fact that it's impossible. We don't actually have to go to infinity -- we can actually just say (and, again, this is a paraphrase of the math),

-- You can accept the point estimate and the confidence interval for all practical purposes if, before the fact, you could say that every value within a reasonable distance of the estimate was roughly equally likely to be true.

In my opinion, this is NOT strictly the same as saying "in the absence of any other information." It's an explicit assumption that has to be made, one that just happens to be reasonable in the absence of other information. Normally, it's just ignored or taken for granted. But that doesn't mean it's not really lurking there.

So, in the case of Joe Blow, is every value within a reasonable distance of the confidence interval equally likely to be true? I don't think so. Our hypothetical confidence interval is (.005, .135). That is within a reasonable distance of .000, and, you'd think, values around .000 are more likely to be true than any other value.

Why are values around .000 more likely to be true than other values? You could argue that from the evidence of previous studies. But you don't need to. You can argue it on first principles.

Most human attributes are normally distributed, or, at least, distributed in a "normal-like" way where there are more people near average than far from average. Assuming that we have already normalized Joe's clutch stats to the league average clutch stats (as most studies do), the league average is .000, and so we should have expected that Joe's clutch talent would be more likely found around .000 than .070. Therefore, the condition "every value within a reasonable distance is roughly equally likely to be true" does not hold.

(Notice that you don't need to know whether clutch hitting exists or how big it is -- or even know anything about baseball -- to know that you can't take the point estimate for it at face value! All you need to know is that the distribution is higher in the middle. I think that's kind of cool, even though it's really just the same argument as regression to the mean.)

And so, you can't just take the .070 as your estimate without further argument.
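To get a feel for how much a "higher in the middle" distribution matters, here is a small sketch. It assumes a normal distribution of clutch talent centered at .000 and a normal error on the observed .070, with the standard error backed out of the hypothetical (.005, .135) interval. The talent spreads in the loop are made-up values for illustration, not estimates from any study.

```python
OBSERVED = 0.070
SE = 0.065 / 1.96  # assumed, from the hypothetical (.005, .135) interval

def posterior_mean(prior_sd, observed=OBSERVED, se=SE, prior_mean=0.0):
    """Posterior mean for a normal prior combined with a normal likelihood."""
    weight_on_data = prior_sd**2 / (prior_sd**2 + se**2)
    return prior_mean + weight_on_data * (observed - prior_mean)

# Hypothetical spreads (standard deviations) of true clutch talent in the league.
for prior_sd in (0.005, 0.010, 0.020, 0.050):
    print(f"talent SD {prior_sd:.3f} -> estimate {posterior_mean(prior_sd):.3f}")
```

Whatever spread you pick, the estimate is pulled from .070 toward .000, and the tighter the talent distribution, the harder the pull.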

If you disagree with me, you can still try the "without further information" argument. You might reply, "well, Phil, I agree with you, but you have to admit that *in the absence of any other information*, the .070 is a good unbiased estimate."

To which my first reaction is,

"if you're not willing to even consider that human tendencies are clustered near average, if you're willing to go that far to ignore that other information, then you're not studying baseball, you're just doing mathematical exercises."

And then I say,

"It's not even technically true that the .070 is correct 'in the absence of additional information.' That's just a fuzzy way of phrasing it. What is *really* true is that the .070 is correct only if you are willing to assume that all values were, a priori, equally likely. That's an assumption that you have to make, even though you avoid making it, and even though you may not even realize you're making it. And your assumption is false. It's not just a case of ignoring information -- it's a case of ignoring information *and then assuming the opposite*."

-----

I mentioned "regression to the mean," which, in sabermetrics, is the idea that when you try to estimate a player's or team's talent, you have to take the performance and "regress it" (move it closer) to the mean. If you find ten players who hit .350, you'll find that next season they may only hit .310 or so. And if you find a bunch of teams who go 30-51 in the first half of the season, you'll find they do better in the second half.

This happens because there are more players with talent below .350 than above, so that when a player does hit .350, he's more likely to have been a lower-than-.350 player who got lucky than a higher-than-.350 player who got unlucky.
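A quick simulation shows the effect. Everything here is an assumption for illustration: a league of 5,000 hitters whose true talent is normally distributed around .270 with a spread of .025, 500 AB per season, and a "hot" cutoff of .330 so the group is big enough to measure.

```python
import random

random.seed(2010)

LEAGUE_MEAN = 0.270   # assumed average batting talent
TALENT_SD = 0.025     # assumed spread of true talent
AT_BATS = 500
PLAYERS = 5000
CUTOFF = 0.330        # first-season threshold for the "hot" group

def season_average(talent, at_bats=AT_BATS):
    """Simulate one season of independent at-bats and return the batting average."""
    hits = sum(random.random() < talent for _ in range(at_bats))
    return hits / at_bats

hot = []  # (true talent, first-season average) for players at or above the cutoff
for _ in range(PLAYERS):
    talent = random.gauss(LEAGUE_MEAN, TALENT_SD)
    first = season_average(talent)
    if first >= CUTOFF:
        hot.append((talent, first))

first_mean = sum(f for _, f in hot) / len(hot)
talent_mean = sum(t for t, _ in hot) / len(hot)
second_mean = sum(season_average(t) for t, _ in hot) / len(hot)

print(f"{len(hot)} players hit {CUTOFF:.3f} or better in season one")
print(f"their season-one average: {first_mean:.3f}")
print(f"their true talent:        {talent_mean:.3f}")
print(f"their season-two average: {second_mean:.3f}")
```

The hot group's true talent sits well below its season-one average, and the season-two average follows the talent, not the hot season.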

Regression to the mean is actually a special case of what I'm describing here. You might call this Bayesian stuff "regression to the likely." It's likely the .350 hitter is actually a lower-than-.350 player, so that's the way you regress.

"Regression to the likely" is just another way of saying, "take all the other evidence and arguments into account," because it's all those other things that made you come up with what you thought was likely in the first place.

If you accept a .070 clutch number at face value, you are implicitly saying "when I regressed to the likely, I made no change to my belief in the .070, because there was nothing more likely to regress to." If that's what you think, you have to argue it. You can't just ignore or deny that you're implying it.

-----

It may sound like I'm arguing that everyone who has ever created a confidence interval in an academic study is wrong. But, not really -- much of the time, in academic studies, the hidden assumption is true.

Suppose you do a study that tries to figure out how much it costs in free agent salary to get an extra win. And you wind up with an estimate of $5,500,000, with a standard deviation of $240,000. Is there any reason to believe that any number within a reasonable distance of $5.5MM is likelier than any other? I can't think of one. I mean, it was easy to think of a reason why more players should be average clutch than extreme clutch, but, before doing this study, was there any reason to think that wins should be more likely to cost $4,851,334 than, say, $6,033,762? I can't think of any.

Much of the time things are like this: a study will try to estimate some parameter for which there was no strong logical reason to expect a certain result over any other result. And, in those cases, the confidence interval is just fine.

But other times it's not. Clutch is one of them.

If you take the study's observed point estimate at face value -- and, as Tango observes, most studies do -- you're making that hidden assumption that (in the case of clutch) .070 is just as likely as .000. You're making that assumption whether you realize it or not; and the fact that the assumption is hidden does not mean you're entitled to assume it's true. In the clutch case, it seems obvious that it's not. And so, the .070 estimate, and accompanying confidence interval, are simply not correct in any predictive baseball sense.

Labels: baseball, statistics

11 Comments:

At Thursday, March 25, 2010 9:45:00 PM, Vic Ferrari said...: Well written, Phil. That looked like a daunting essay when it first popped up on my feed reader; it was an easy read, though.

We're just looking at likelihood here, no? That "regression to the likely" turn of phrase has puzzled me a bit. Might just be semantics.

So, in the naive form, just the continuous form of the binomial equation you mention, from the player's POV. Or a discrete form (histogram) if you choose.

So for Pujols in your example ... what are the chances of his true batting average ability being between .150 and .155 and him going 7/8? What about between .155 and .160? ... and on and on.

Plot that out and it's the discrete likelihood distribution for Pujols' ability. And it's going to have a mode of .875 but will be left skewed I think.

Or you could just plot out the continuous distribution.

Now, as you say, that's the fairest assumption if Pujols is the only player you have information about. But if you have a population of players, and you know the element of chance variation at play (implied with the use of the binomial eq'n) ... then the problem becomes apparent pretty quickly.

So the new likelihood distribution for Pujols is the joint probability distribution of the population and the naive (calced above). Makes sense, no?

We don't know the population ability distribution, so we take an educated guess.

If the sum of these brings you back to your original guess for the ability distribution, and the joint probability of the new player likelihoods and the binomials brings you right back where you started in terms of actual results distribution (actual will be squigglier, of course, depending on how many players and ABs you're dealing with) ... then you're done! And you would be a helluva guesser, as well.

Failing that, take what you learned from the first guess and try again.

We're talking about the same methodology, yes?
At Thursday, March 25, 2010 10:25:00 PM, Phil Birnbaum said...: Hmmm ... let me think about what you're saying.

The mathematical way to figure out how far to "regress" is this: you figure that the probability of Pujols talent being X is proportional to the product of the "prior" multiplied by the "likelihood". The "prior" is what you have reason to believe the probability was before the experiment, and the "likelihood" is the probability of getting the observed result if X were true.

It's hard to do for a continuous variable like here, but here's a simpler example:

Suppose we know for sure, somehow, that Pujols had a 30% probability of being a .300 hitter, and 70% of .350.

If he WERE a .300 hitter, the chance of him going 7-for-8 would be about 1 in 817 (.00122). If he were a .350 hitter, the chance would be about 1 in 299 (.00335).

So for .300, the "prior times the likelihood" is 30% of .00122, which is .0003674. For .350, the "prior times the likelihood" is 70% of .00335, or .002342.

So the probability of .300 equals

.0003674 divided by (.0003674 + .002342)

which equals .136.

So if before the experiment there was a 30% chance that Pujols was a .300 hitter, now, after the experiment, there's only a 13.6% chance.

Before the experiment, your expectation was that Pujols was a .335 hitter (30% of .300 + 70% of .350). After the experiment, your expectation is that he's a .343 hitter (13.6% of .300 + 86.4% of .350).
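The same arithmetic as a short Python sketch, using the two talent levels from the setup above (.300 with a 30% prior, .350 with a 70% prior). The variable names are just for illustration.

```python
from math import comb

def likelihood(talent, hits=7, at_bats=8):
    """Binomial probability of exactly `hits` hits in `at_bats` at-bats."""
    return comb(at_bats, hits) * talent**hits * (1 - talent) ** (at_bats - hits)

# Prior: 30% chance he's a .300 hitter, 70% chance he's a .350 hitter.
prior = {0.300: 0.30, 0.350: 0.70}

# Posterior is proportional to prior times likelihood, then normalized to sum to 1.
unnormalized = {t: p * likelihood(t) for t, p in prior.items()}
total = sum(unnormalized.values())
posterior = {t: w / total for t, w in unnormalized.items()}

prior_expectation = sum(t * p for t, p in prior.items())
posterior_expectation = sum(t * p for t, p in posterior.items())

for talent in prior:
    print(f"talent {talent:.3f}: prior {prior[talent]:.3f}, posterior {posterior[talent]:.3f}")
print(f"expected talent before: {prior_expectation:.4f}")
print(f"expected talent after:  {posterior_expectation:.4f}")
```

With only two candidate talents the sums are trivial, but the same prior-times-likelihood step works for any number of candidates, which is how you'd approximate the continuous case.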

Does that help? Is that approximately what you're describing? I'm not sure ...
At Friday, March 26, 2010 10:44:00 AM, Vic Ferrari said...: I think we're on a different wavelength right now, Phil. I'm struggling to follow you.

We would never make assumptions about Pujols' true ability. We would only make estimates of the distribution of talent in the population. Then rinse and repeat until we found a distribution that yielded the best postdictive and (much more importantly) predictive result.

The only prior knowledge we're applying here is the notion that batting, over significant frame sizes (100+ PAs), is binomial. And there have been some compelling streakiness studies that strongly suggest this is the case (most notably Albert in the Journal for Statistics in Sport (IIRC) and James in a contributed article at Alan Reifman's site).

And you've agreed with that implicitly in your original post, anyways, no?

So, another example.

If Doug Flynn had 15 hits in 100 ABs in 1982 ... what is his hitting ability?

In isolation it is the binomial likelihood function. y ∝ x^15 * (1-x)^85.

If you plot that out you get a bell shaped curve that is right skewed a bit.
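As a rough sketch, that curve can be tabulated on a grid in Python instead of plotted. The grid resolution is an arbitrary choice, and the curve is normalized to sum to 1 so that a mean can be read off it.

```python
# Likelihood of a true batting ability x given 15 hits in 100 AB,
# up to a constant factor: x^15 * (1 - x)^85.
steps = 1000
grid = [i / steps for i in range(steps + 1)]
likelihood = [x**15 * (1 - x) ** 85 for x in grid]

# Normalize so the grid values sum to 1 and can be treated as a distribution.
total = sum(likelihood)
density = [v / total for v in likelihood]

mode = grid[likelihood.index(max(likelihood))]
mean = sum(x * d for x, d in zip(grid, density))

print(f"mode: {mode:.3f}")
print(f"mean: {mean:.4f}")
```

The peak is at 15/100, and the mean sits a little above it, which is the slight right skew mentioned above.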

On the same paper, draw a freehand curve that represents your estimation of how batting average ability is distributed in the league. Parlay those through as you've done in the comment above ... voila, the new Birnbaum Mach I likelihood distribution for Doug Flynn.

Now check how Flynn did in his next 100 ABs. If he got 25 hits, check that against his new Birnbaum likelihood distribution and see where it ranks. Let's say it's at the 95th percentile ... make note of that rank.

Do the same for every player in the population ... plot out the ranks as a histogram. If it isn't flat ... it's time for Birnbaum Mach II. If it is concave you're going to want to widen out your ability distribution.

As you get closer you can start messing with the shape. The kurtosis and skew especially, and if you've included NL pitchers you'll end with a distribution with twin peaks ... and a computational nightmare.

Sample and frame sizes are significant issues, so my advice would be that you don't want to start painting with too fine a brush until you've checked similar samples from nearby MLB seasons.

It's as simple as that. If you go on to become sabermetric's first Bayesian ... my advice, admittedly not worth much, would be to avoid Beta priors (or Dirchlet in the case of multinomial, such as a BABIP falling for a single, double, triple or home run). Conjugate prior is mathspeak for "I know what is going to happen with the math in a few minutes, and this makes it miles simpler".

And I think we'd both agree, convenient assumptions are the reason that sabermetrics has some real problems. Exchanging them for better convenient assumptions would seem to be going half way, at least to my mind.

Both Null and Albert are frank with this. The latter expressly states that the ability distribution form is chosen "for convenience". Gotta love the honesty, I always feel insulted when someone rationalizes their assumptions, and I stop reading when I realize that they are oblivious to them.

Brute force methods likely won't get you published in a math journal, but if your goal is high predictive value, and you're not docking points for inelegance ... it's the way to go imo. At least for someone who has mad programming skills and access to very fast machines, I suspect that you fit that bill.

Two more points:

1. Sorry for rambling.

2. Where DO you get such terrific data? I've turned the internet upside down and shaken it vigorously ... nothing. Surely to God I don't have to learn how to strip it from the retrosheet PBP files? Say it ain't so.
At Friday, March 26, 2010 10:49:00 AM, Vic Ferrari said...: Edit for above:

"twin peaks" should read "two peaks"

"Dirchlet" should read "Dirichlet".

And just assume that the other spelling, typographical and grammatical errors are intentional. :)
At Friday, March 26, 2010 11:01:00 AM, Phil Birnbaum said...: >"We would never make assumptions about Pujols' true ability. We would only make estimates of the distribution of talent in the population."

And that's your assumption about Pujols' true ability: that it comes from that distribution. Once you know more about Pujols, you adjust that. My assumption about Pujols this year is that he's very likely to be near the mean of what he's done in previous years. That's my "prior".

Now, if this year he starts off going 0-for-20, I adjust my assumption -- not to the mean of the population, but to my most likely prior assumption of Pujols' ability, which is maybe .333. I won't do the math (mostly because it's too complicated), but maybe I now adjust my guess for Pujols down to (guessing) .300.

I'm not sure about how your method would work ...

BTW, what terrific data?
At Friday, March 26, 2010 1:12:00 PM, Brian Burke said...: It's touched on in the main post above, but I think you're calling for Bayesian inferences rather than 'frequentist' inferences (i.e. Central Limit Theorem, p-values, etc.).

I know very little about Bayes, but I think of it like this--you start with a prior distribution of outcomes (say the distribution of all MLB batting averages). Then you get some new data which influences the prior distribution in some way. A sample of going 7 for 8 would slide Pujols' distribution upward slightly, but not much.
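A minimal sketch of that idea in Python, with the prior loudly assumed: league batting talent is stood in for by a normal curve centered at .270 with a spread of .025. Those two numbers are placeholders, not fitted values.

```python
from math import exp

PRIOR_MEAN = 0.270   # assumed league-average batting talent
PRIOR_SD = 0.025     # assumed spread of batting talent across hitters
HITS, AT_BATS = 7, 8

def prior(talent):
    """Unnormalized normal prior standing in for the league talent distribution."""
    z = (talent - PRIOR_MEAN) / PRIOR_SD
    return exp(-0.5 * z * z)

def likelihood(talent):
    """Binomial likelihood of going 7-for-8, up to a constant factor."""
    return talent**HITS * (1 - talent) ** (AT_BATS - HITS)

# Multiply prior by likelihood at each grid point, then normalize.
steps = 1000
grid = [(i + 0.5) / steps for i in range(steps)]
weights = [prior(t) * likelihood(t) for t in grid]
total = sum(weights)
posterior_mean = sum(t * w for t, w in zip(grid, weights)) / total

print(f"prior mean:     {PRIOR_MEAN:.3f}")
print(f"posterior mean: {posterior_mean:.3f}")
```

Eight at-bats carry very little weight against a prior that tight, so the estimate moves up by only a handful of points and stays nowhere near .875.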
At Friday, March 26, 2010 1:23:00 PM, Phil Birnbaum said...: Brian: yes, that's exactly what I'm calling for. Well, I'm not exactly demanding that you do formal Bayesian calculations. I'm just saying that the frequentist method requires you to make an assumption that you don't want to make.

Most of the time, people ignore that assumption ... at least until they wind up having to claim that Pujols is most likely an .875 hitter. Then they realize that the frequentist results have to be interpreted in the context of other evidence (the prior).

My point is that the same thing applies when you come up with a clutch estimate of .070. It's not appropriate to dismiss the .875 on grounds that it doesn't make sense, but not dismiss the .070 on the same grounds.

The only difference is: the .875 is obvious, the .070 isn't. But people tend to resist qualifying their estimates and confidence intervals when the results don't look obviously absurd. My argument is that you ALWAYS have to adjust if appropriate, whether you call yourself a Bayesian or not.
At Wednesday, March 31, 2010 1:49:00 PM, Vic Ferrari said...: Phil said:

And that's your assumption about Pujols' true ability: that it comes from that distribution. Once you know more about Pujols, you adjust that. My assumption about Pujols this year is that he's very likely to be near the mean of what he's done in previous years. That's my "prior".

What would you do if that was a legitimate question posed to you, Phil? That being "what is Pujols true BA ability right now?"

And not just taking a stab at the most likely (the mode), but also what are the chances that he's 10 points worse than that? And what are the chances that he's 20 points better than that? And what are the chances that he's actually 30 points better than that and has just hit some bad luck in the past? ad infinitum.

That's an honest question, I really hope you answer. I'm sure your estimate will be better than mine, you know more about MLB than me.

And for now neither of us worrying about what flavour of math is being used, just knowledge of the game. We can argue about what "the prior" was after the fact :)

If the language used to express your idea for Pujols is translatable to math ... we can apply it to everyone else in the league. And if the trees don't add up to make the forest we expected ... we'll apply common sense and have another go.

Within a few hours of work I'd be shocked if we hadn't equalled or bettered PECOTA marginally by average absolute and 2nd moment deviation (or whatever efficacy measures you'd prefer) and substantially improved on likelihood distributions for player BA. Perhaps I'll regret making such a bold prediction. We'll see.

I have no special interest in player forecasts, I don't play fantasy pools. So, in the event that you indulge me, this particular exercise would be an academic one from my point of view. I'm going somewhere with it, though.
At Wednesday, March 31, 2010 1:59:00 PM, Vic Ferrari said...: Phil said:

BTW, what terrific data?

The play by play data, Phil.

The sort of information that you and several others use in countless articles. This article from last autumn is the example that springs to mind (which was an outstanding post IMO), but short of improving my programming skills (which are poor) I can't find a way to get the play by play information I'm after.

How did you get this specific data?
At Wednesday, March 31, 2010 2:30:00 PM, Phil Birnbaum said...: Vic,

>legitimate question ..."what is Pujols true BA ability right now?"

I'd probably just go with the Marcel prediction, or one of the other forecasting systems. They regress to the mean appropriately, AFAIK, which is the equivalent of Bayesian.

The data for the other study came from the Lahman database, www.baseball1.com. Other data comes from Retrosheet, www.retrosheet.org .
At Wednesday, March 31, 2010 3:30:00 PM, Vic Ferrari said...: Phil said:

I'd probably just go with the Marcel prediction, or one of the other forecasting systems. They regress to the mean appropriately, AFAIK, which is the equivalent of Bayesian.

That's not what I mean, Phil. I was thinking you'd just trust your instincts. As if you were talking to a bartender, and explaining what you thought Pujols would do this year. Just honest, first principles stuff, even if it's not easy to articulate with math.

I mean for crying out loud we could use Albert's embarrassingly crude beta prior from your "By The Numbers" issue of February 2006 ... it would beat Marcel soundly, and absolutely wallop it in terms of distribution. On the latter, I don't know what the odds would be ... 100,000 to 1 for the three low event items there, maybe more. Something in that range, in any case, closer on the fourth I suspect.

Humour me; just for a minute, let go of most everything you've read from other sabermetricians ... what do you think about Pujols?

There is no shame in having knowledge of the game, Phil. And I'm sure we'd both agree that there is no earthly reason to make sound reason fit conventional mathematical models, other than convenience.
