Sabermetric Research: Research conclusions *have* to be bayesian

The last couple of posts here have been about interpreting the results of statistical studies. I argued that the statistical method itself might be just fine, but the *interpretation* of what it means, the conclusions you draw about real life, require an argument. That is, you can get the regression right, but the conclusions wrong, because the conclusions call for argument and judgment.

Or, as some commenters have substituted, "intuition" and "subjectivity". Those are negative things, in academic circles. Objectivity is the ideal, and the idea that the reliability of a work of scholarship depends on a subjective evaluation of the author's judgment doesn't seem to be something that people like.

But, I think it absolutely has to follow. If you find a connection between A and B, how do you know if it's A that causes B, or B that causes A, or if it's all just random? That's something no statistical analysis can tell you. By definition, it calls for judgment, doesn't it? At least a little bit. Recall the recent (contrived) study that showed that listening to kids' music is linked to being physically older. Nobody would conclude that the music MAKES you older, right? But that's not a result of the statistical analysis -- it's a judgment based on outside knowledge. An easy, obvious judgment, but a judgment nonetheless.

It occurred to me that this judgment, which takes you from regression results to conclusions, is really an informal Bayesian inference. I don't think this is a particularly novel insight, but it helps to make the issue clearer. My argument is this: first, even if you do a completely normal ("frequentist") experiment, the step from the results to the conclusions HAS to be Bayesian. Second, and more importantly, because Bayesian techniques sometimes require judgment, and are therefore not completely objective, the convention has been to avoid such judgment in academic papers. Therefore, these studies have locked themselves into a situation in which they have to suspend judgment, and use strict rules, which sometimes lead to wrong -- or seemingly absurd -- answers.

OK, let me start by explaining Bayesianism, as I understand it, first intuitively, then in a baseball context. As always, real statisticians should correct me where I got it wrong.

----------

Generally, Bayesian inference is a process by which you refine your probability estimate. You start out with whatever evidence you have, which leads you to a "prior" estimate for how things are. Then, you get more evidence. You add that to the pile, and refine your estimate by combining the evidence. That gives you a new, "posterior" estimate for how things are.

You're a juror at a trial. At the beginning of the trial, you have no idea whether the guy is guilty or not. You might think it's 50/50 -- not necessarily explicitly, but just intuitively. Then, a witness comes up who says he saw the crime happen, and he's "pretty sure" this is the guy. Combining that with the 50/50, you might now think it's 80/20.

Then, the defense calls the guy's boss, who said he was at work when the crime happened. Hmmm, you say, that sounds like he couldn't have done it. But there's still the eyewitness. Maybe, then, it's now 40/60.

And so on, as the other evidence unfolds.

That's how Bayesian inference works. You start out with your "prior" estimate, based on all the evidence to date: 50/50. Then, you see some new evidence: there's an eyewitness, but the boss provides an alibi. You combine that new evidence with the prior, and you adjust your estimate accordingly. So your new best estimate, your "posterior," is now 40/60.

---------

That's an intuitive example, but there is a formal mathematical way this works. There's one famous example, which goes like this:

People are routinely tested for disease X, which 1 in 1000 people have overall. It is known that if the person has the disease, the test is correct 100% of the time. If the person does not have the disease, the test is correct 99% of the time.

A patient goes to his doctor for the test. It comes out positive. What is the probability that the patient has the disease?

If you've never seen this problem before, you might think the chance is pretty high. After all, the test is correct at least 99% of the time! But that's not right, because you're ignoring all the "prior" evidence, which is that only 1 in 1000 people have the disease to begin with. Therefore, there's still a strong chance that the test is a false positive, despite the 99 percent accuracy.

The answer turns out to be about 1 in 11. The (non-rigorous) explanation goes like this: 1000 people see the doctor. One has the disease and tests positive. Of the other 999 who don't have the disease, about 10 test positive. So the positive tests comprise 10 people who don't have the disease, and 1 person who does. So the chance of having the disease is 1 in 11.

Phrasing the answer in terms of Bayesian analysis: The "prior" estimate, before the evidence of the test, is 0.001 (1 in 1000). The new evidence, though, is very significant, which means it changes things a fair bit. So, when we combine the new evidence with the prior, we get a "posterior" of 0.091 (1 in 11).
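For readers who want to check the arithmetic, here is a minimal Python sketch of the same calculation using Bayes' Theorem directly. The function and parameter names are mine, chosen for illustration; the numbers are the ones from the example.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Chance of having the disease given a positive test, by Bayes' Theorem."""
    # People who have the disease and test positive
    true_positives = prior * sensitivity
    # People who don't have the disease but test positive anyway
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# 1 in 1000 have the disease; the test never misses a real case,
# and it is wrong 1% of the time for people without the disease.
posterior = posterior_given_positive(
    prior=0.001, sensitivity=1.0, false_positive_rate=0.01
)
print(round(posterior, 3))  # 0.091, about 1 in 11
```

Try changing the prior to 0.1 and the same positive test becomes far more convincing, which is the whole point.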

If that still seems counterintuitive to you, think of it this way: if the test is 99% accurate, that's a 1 in 100 chance that it's wrong. Those are low odds, which makes you think the test is probably right! But ... the original chance of having the disease is only 1 in 1000. Those are even lower odds. The prior of 1/1000 competes with the new evidence of 1/100. Because the new number (test being wrong) is more likely than the old number (having the disease), the odds are skewed to the test being wrong: odds of 10:1 that the test is wrong, compared to the patient having the disease.

Another way to put it: the less likely the disease was to start with, the more evidence you need to overcome those low odds. 1/100 isn't enough to completely overcome 1/1000.

(Perhaps you can see where this will be going, which is: if a research study's hypothesis is extremely unlikely in the first place, even a .01 significance level shouldn't be enough to overcome your skepticism. But I'm getting ahead of myself here.)

---------

Let's do an oversimplified baseball example. At the beginning of the 2011 baseball season, you (unrealistically) know there's a 50% chance that Albert Pujols' batting average talent will be .300 for the season, and a 50% chance that his batting average talent will be .330. Then, in April, he goes 26 for 106 (.245). What is your revised estimate of the chance that he's actually a .300 hitter?

You start with your "prior" -- a 50% chance he's a .300 hitter. Then, you add the new evidence: 26 for 106. Doing some calculation, you get your "posterior." I won't do the math here, but if I've got it right, the answer is that now the chance is 80% that Pujols is actually a .300 hitter and not a .330 hitter.

That should be in line with your intuition. Before, you thought there was a good chance he was a .330 hitter. After, you think there's still a chance, but less of a chance.
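One way to do the math is with the exact binomial likelihood of 26 hits in 106 at-bats under each talent level. This is a sketch under that assumption, not necessarily the calculation behind the 80% figure, and the function name is made up for illustration. Done this way, the posterior comes out a little lower, at roughly 74%, but the direction is the same.

```python
from math import comb

def binomial_likelihood(hits, at_bats, talent):
    """Probability of exactly `hits` hits in `at_bats` tries at a given talent level."""
    return comb(at_bats, hits) * talent**hits * (1 - talent)**(at_bats - hits)

hits, at_bats = 26, 106
prior_300, prior_330 = 0.5, 0.5  # the (unrealistic) 50/50 prior from the example

like_300 = binomial_likelihood(hits, at_bats, 0.300)
like_330 = binomial_likelihood(hits, at_bats, 0.330)

# Bayes' Theorem: weight each likelihood by its prior, then normalize.
posterior_300 = prior_300 * like_300 / (prior_300 * like_300 + prior_330 * like_330)
print(round(posterior_300, 2))  # roughly 0.74
```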

We started thinking Pujols was still awesome. Then he hit .245 in April. We thought, "Geez, he probably isn't really a .245 hitter, because we have many years of prior evidence that he's great! But, on the other hand, maybe there's something wrong, because he just hit .245. Or maybe it's just luck, but still ... he's probably not as good as I thought."

That's how Bayesian thinking works. We start with an estimate based on previous evidence, and we update that estimate based on the new evidence we add to the pile.

--------

Now, for the good part, where we talk about academic regression studies.

You want to figure out whether using product X causes cancer. You do a study, and you find statistical significance at p=0.02, and the coefficient says that using product X is linked with a 1% increased chance of cancer. You are excited about your new discovery. What do you put in the "conclusions" section of your paper?

Well, maybe you say "this study has found evidence consistent with X causing cancer." But that isn't helpful, is it? I mean, you also found evidence that's consistent with X *not* causing cancer -- because, after all, it could have just been random luck. (A significance level of .02 would happen by chance 1 out of 50 times.)

Can you say, "this is strong evidence that X causes cancer?" Well, if you do, it's subjective. "Strong" is an imprecise, subjective word. And what makes the evidence "strong"? You'd better have a good argument about why it's strong and not weak, or moderate. The .02 isn't enough. As we saw in the disease example, a positive test -- which is equivalent to a significance level of .01, since a false positive happens only 1 in 100 times -- was absolutely NOT strong evidence of having the disease. (It meant only a 1 in 11 chance.)

Similarly, you can't say "weak" evidence, because how do you know? You can't say anything, can you?

It turns out that ANY conclusion about what this study means in real life has to be Bayesian, based not just on the result, but on your PRIOR information about the link between cancer and X. There is no conclusion you can draw otherwise.

Why? Well, it's because the study has it backwards.

What we want to know is, "assuming the data came up the way it did, what is the chance that X causes cancer?"

But the study only tells us the converse: "assuming X does not cause cancer, what is the chance that the data would come up the way it did?"

The p=0.02 is the answer to the second question only. It is NOT the answer to the first question, which is what we really want to know. There is a step of logic required to go from the second question to the first question. In fact, Bayes' Theorem gives us the equation for finding the answer to the first question given the second. That equation requires us to know the prior.

What the study is asking is, "given that we got p=0.02 in this experiment, what's the chance that X causes cancer?" Bayes' Theorem tells us the question is unanswerable. All we can answer is, "given that we got p=0.02 in this experiment, what is the chance that X causes cancer, given our prior estimate before this experiment?"

That is: you CANNOT draw a conclusion about the likelihood of "X causes cancer" after the experiment, unless you had a reliable estimate of the likelihood of "X causes cancer" BEFORE the experiment. (In mathematical terms, to calculate P(A|B) from P(B|A), you need to know P(B) and P(A).)
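Written out, with A standing for the hypothesis and B for the observed data (labels I'm assuming here for illustration), Bayes' Theorem is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)$$

The prior is P(A). It appears in both the numerator and the denominator, so without it the left-hand side can't be computed. Loosely speaking, the study's p-value only tells you about how the data behave when the hypothesis is false, which is the P(B | not A) piece.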

Does this sound wrong? Do you think you can get a good intuitive estimate just from this experiment alone? Do you feel like the .02 we got is enough to be convincing?

Well, then, let me ask you this: what's your answer? What do you think the chance is that X causes cancer?

If you don't agree with me that there's no answer, then figure out what you think the answer is. You may assume the experiment is perfectly designed, the sample size is adequate, and so on.

If you don't have a number -- you probably don't -- think of a description, at least. Like, "X probably causes cancer." Or, "I doubt that X causes cancer." Or, "by the precautionary principle, I think everyone should avoid X." Or, "I don't know, but I'd sure keep my kids away from X until there's evidence that it's safe!"

Go ahead. I'll leave some white space for you. Get a good intuitive idea of what your answer is.

(Link to Jeopardy music while you think (optional))

OK. Now, I'm going to tell you: product X is a bible.

Does that change your mind?

It should. Your conclusion about the dangers of X should absolutely depend on what X is -- more specifically, what you knew about X before. That is, your PRIOR. Your prior, I hope, had a probability of close to 0% that a Bible can cause cancer. That's not just a wild-ass intuition. There are very good, rational, objective reasons to believe it. Indeed, there is no evidence that the information content of a book can cause cancer, and there is no evidence or logic that would lead you to believe that bibles are more carcinogenic than, say, copies of the 1983 Bill James Baseball Abstract.

Call this "intuition" or "subjectivity" if you want. But if you decide not to use your own subjective judgment, what are you going to do? Are you going to argue that bibles cause cancer just to avoid having to take a stand?

I suppose you can stop at saying, "this study shows a statistically significant relationship between bible use and cancer." That's objectively true, but not very useful. Because the whole point of the study is: do bibles cause cancer? What good is the study if you can't apply the evidence to the question?

--------

You could take the Bayesian approach more formally. That's what researchers usually mean when they talk about "Bayesian methods" -- they mean formal statistical algorithms.

To do a Bayesian analysis, you need a prior. You could just arbitrarily take something you think is reasonable. "Well, we don't believe there's much of a chance bibles cause cancer, so we're going to assume a prior 99.9999% probability that there's no effect, and we'll split up the last remaining .0001% in a range between -2% and +2%." Now, you do the study, and recalculate your posterior distribution, to see if you now have enough evidence to conclude there's a danger.

If you did that, you'd find that your posterior distribution -- your conclusion -- was that the probability of no effect went down, but only from 99.9999% to 99.995%, or something. That would make your conclusion easy: "the evidence should increase our worry that bibles cause cancer, but only from 1 in a million to 1 in 20,000."
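In odds form, the update is a multiplication, which makes it easy to see how much work the evidence is doing. Taking the 1 in a million and 1 in 20,000 figures at face value (they are "or something" numbers, not the output of a real analysis), the implied strength of the evidence, usually called the Bayes factor, works out to about 50:

$$\text{posterior odds} = \text{prior odds} \times \text{Bayes factor}, \qquad \text{Bayes factor} \approx \frac{1/20{,}000}{1/1{,}000{,}000} = 50$$

Evidence that is fifty times likelier if there is an effect than if there isn't sounds strong, yet it still leaves the hypothesis at 1 in 20,000.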

But, that Bayesian technique is not really welcome in academic studies. Why? Because that prior distribution is SUBJECTIVE. The author can choose any distribution he wants, really. I chose 99.9999%, but why not 99.99999% (which is probably more realistic)? The rule is that academic papers are required to be objective. If you allow the author to choose any prior he wants, based on his own intuition or judgment, then, first, the paper is no longer objective, and second, there is the fear that the author could get any conclusion he wanted just by choosing the appropriate prior.

So papers don't want to assume a prior. So instead of arguing about the chance the effect is real, the paper just assumes it's real, and takes it at face value. If X appears to increase cancer by 1%, and it's statistically significant, then the conclusion will assume that X actually *does* increase cancer by 1%.

That sounds like it's not Bayesian. But, in a sense, it is. It's exactly the result you'd get from a Bayesian analysis with a prior that assumes every result is equally likely. Yes, it's objective, because you're always using the same prior. But it's the *wrong* prior. You're using a fixed assumption, instead of the best assumption you can, just because the best assumption is a matter of discretion. You're saying, "Look, I don't want to make any subjective assumptions, because then I'm not an objective scientist. So I'm going to assume that bibles are just as likely to cause 1% more cancers as they are to cause 0% more cancers."

That's obviously silly in the bible case, and, when it's that obvious, it looks "objective" enough that the study can acknowledge it. But most of the time, it's not obvious. In those cases, the studies will just take their results at face value, *as if theirs is the only evidence*. That way, they don't have to decide if their result is plausible or not, in terms of real-life considerations.

Suppose you have two baseball studies. One says that certain batters can hit .375 when the pitcher throws lots of curve balls. Another says that batters gain 100 points on their batting average after the manager yells at them in the dugout. Both studies find exactly the same size effect, with exactly the same significance level of, say, .04.

Of the two conclusions, which one is more likely to be true? The curve ball study, of course. We know that some batters hit curve balls better than others, and we know some batters hit well over .300 in talent. It's fairly plausible that someone might actually have .375 talent against curve balls.

But the "manager yells at them" study? No way. We have a strong reason to believe it's very, very unlikely that batters would improve by that much just because they were yelled at. We have clutch hitting studies, that barely find an effect even when the game is on the line. We have lots of other studies that, even when they do find an effect, like platooning, find it to be much, much less than 100 points. Our prior for the "manager yelling is worth 100 points" hypothesis is so low that a .04 will barely move it.
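To put rough numbers on this, here is a small Python sketch that applies the same strength of evidence to two different priors. Every number in it is hypothetical: the priors (30% for the curve ball claim, 1 in 10,000 for the yelling claim) and the assumed Bayes factor of 10 are made up for illustration, and a p-value of .04 does not by itself determine a Bayes factor.

```python
def update(prior, bayes_factor):
    """Posterior probability from a prior probability and a Bayes factor (odds form)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# Hypothetical: the same evidence strength for both studies.
ASSUMED_BAYES_FACTOR = 10

# Hypothetical priors: plausible claim vs. wildly implausible claim.
curve_ball = update(prior=0.30, bayes_factor=ASSUMED_BAYES_FACTOR)
manager_yelling = update(prior=0.0001, bayes_factor=ASSUMED_BAYES_FACTOR)

print(f"curve ball study:      {curve_ball:.3f}")
print(f"manager yelling study: {manager_yelling:.4f}")
```

With these made-up inputs, the curve ball claim ends up at about 81%, while the yelling claim only gets to about 0.1%. Same study result, very different conclusions.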

Still ... I guarantee you that if these two studies were published, the two "conclusions" sections would not give the reader any indication of the relative real-life likelihood of the conclusions being correct, except by reference to the .04. In their desire to be objective, the two studies would not only fail to give shadings of their hypotheses' overall plausibility, but they'd probably just treat both conclusions as if they were almost certainly true. That's the general academic standard: if you have statistical significance, you're entitled to just go ahead and assume the null hypothesis is false. To do anything else would be "subjective."

But while that eliminates subjectivity, it also eliminates truth, doesn't it? What you're doing, when you use a significance level instead of an argument, is that you're choosing what's most objective, instead of what's most likely to be right. You're saying, "I refuse to make a judgment, and so I'm going to go by rote and not consider that I might be wrong." That's something that sounds silly in all other aspects of life. Doesn't it also sound silly here?

--------

So, am I arguing that academics need to start doing explicit Bayesian analysis, with formal mathematical priors? No, absolutely not. I disagree with that approach for the same reasons other critics do: it's too subjective, and too subject to manipulation. As opponents argue, how do you know you have the right prior? And how can you trust the conclusions if you don't?

So, that's why I actually prefer the informal, "casual Bayesian" approach, where you use common sense and make an informal argument. You take everything you already know about the subject -- which is your prior -- and discuss it informally. Then, you add the new evidence from your study. Then, finally, you conclude about your evaluation of the real-life implications of what you found.

You say, "Well, the study found that reading the bible is associated with a 1% increase in cancer. But, that just sounds so implausible, based on our existing [prior] knowledge of how cancer works, that it would be silly to believe it."

Or, you say, "Yes, the study found that batters hit 100 points better after being yelled at by their manager. But, if that were true, it would be very surprising, given the hundreds of other [prior] studies that never found any kind of psychological effect even 1/20 that big. So, take it with a grain of salt, and wait for more studies."

Or, you say, "We found that using this new artificial sweetener is linked to one extra case of cancer per 1,000,000 users. That's not much different from what was found in [prior] studies with chemicals in the same family. So, we think there's a good chance the effect is real, and advise caution until other evidence makes the answer clearer."

That's what I meant, two posts ago, where I said "you have to make an argument." If you want to go from "I found a statistically significant 4% connection between cancer and X," to "There is a good chance X causes cancer," you can't do that, logically or mathematically, without a prior. The p value is NEVER enough information.

The argument is where you informally think about your prior, even if you don't use that word explicitly. The argument is where you say that it's implausible that bibles cause cancer, but more plausible that artificial sweeteners cause cancer. It's where you say that it's implausible that songs make you older, but not that the effect is just random. It's where you say that there's so much existing evidence that triples are a good thing, that the fact that this one correlation is negative is not enough to change your mind about that, and there must be some other explanation.

You always, always, have to make that argument. If you disagree, fine. But don't blame me. Blame Bayes' Theorem.

Labels: academics, bayes, statistics

17 Comments:

At Wednesday, November 23, 2011 10:48:00 AM, Phil Birnbaum said...: 1. Sorry for the long post. I need an editor.

2. Bayesian techniques usually involve distributions, rather than point estimates, but I wanted to keep it simple.

3. I feel like this point MUST have been made before, but I didn't find it online. Please LMK if there's a better explanation than this one.

4. This is only one aspect of my "make an argument" point. You also should argue that the regression "works" properly, in addition to this Bayesian argument about what it means GIVEN that the regression is good.
At Wednesday, November 23, 2011 11:30:00 AM, DSMok1 said...: Excellent article, Phil. This needed to be said clearly like you have done here.

The lack of understanding of this facet of coming to conclusions plagues both the scientific community and practical research like FDA approvals.
At Wednesday, November 23, 2011 2:12:00 PM, Mike said...: Really enjoyed this post, coming from someone with a minor stats background (minor enough to never hear of Bayes in class). Enlightening.
At Wednesday, November 23, 2011 5:21:00 PM, Ryan J. Parker said...: I feel obligated to at least link to this for the interested reader:

http://ba.stat.cmu.edu/vol03is03.php
At Wednesday, November 23, 2011 5:31:00 PM, Phil Birnbaum said...: Ryan: was there a particular article in there that you wanted to point out? Or just in general, in case anyone is interested?
At Wednesday, November 23, 2011 5:47:00 PM, Phil Birnbaum said...: Ryan: Oh, I get it! You're saying that you disagree that explicit Bayesian methods are too subjective.

Want to elaborate? Are Bayesian methods always appropriate? Or only in certain cases? I don't know much about the controversy. It does seem to me that formal Bayesian analyses wouldn't be that useful in the examples I used, but ... I'm willing to be educated.
At Wednesday, November 23, 2011 5:51:00 PM, Ryan J. Parker said...: I forgot to specify the section "Objections to Bayesian statistics". It has a lot of good arguments on both sides.

I don't like the philosophy debate, but I'd just like to point out that you assume something regardless of the method you use.

I think it's easier to see how someone could use a prior to influence results, but it's certainly not the only way to lie with statistics!
At Wednesday, November 23, 2011 10:02:00 PM, Anonymous said...: Phil, great stuff as always. We've talked about this on our blog (The Book), but it always bears repeating. And yes, WAY too long and repetitive!

MGL
At Wednesday, November 23, 2011 10:26:00 PM, Phil Birnbaum said...: MGL, thanks! If you have some links to the previous discussion, I'll add them to the post.
At Wednesday, November 23, 2011 10:28:00 PM, bradluen said...: Hi Phil,
Not sure if you're saying anything other than "you should consider all relevant evidence when evaluating hypotheses" (which is a pretty uncontroversial statement, even among journal editors). Perhaps I just disagree with your definition of Bayesianism, which to me is something quantitative. If we consider any argument that goes outside the data Bayesian, the term seems too broad to be useful.

(If the point is that science can't be entirely objective, well yeah, philosophers have been pointing that out for centuries. It's necessary, however, for scientists to make clear what's objective and what isn't.)
At Wednesday, November 23, 2011 10:35:00 PM, Phil Birnbaum said...: Hi, bradluen,

Here's a brief explanation of what "Bayesianism" means non-mathematically:

http://lesswrong.com/lw/1to/what_is_bayesianism/

I guess "you should consider all evidence when evaluating hypotheses" is a pretty good summary. My main point is that "all evidence" includes informal evidence and logic.

For instance, the part about not believing that yelling at a player can add 100 points to his average ... there's good reason for that, but it's not the traditional scientific definition of "evidence". But you still have to consider it.
At Thursday, November 24, 2011 12:58:00 AM, bradluen said...: Oh, I'm totally on board with broadening the definition of "evidence", though, as you say, informal evidence should be used informally.

One thing that may or may not be relevant is it doesn't matter what order you do the conditioning in. That is, in theory summarising all available evidence in a prior and then adjusting for the result of a new experiment gives the same posterior as starting with the experiment result then adjusting for all other evidence. Since there's rarely an objective prior, you should post all the data and let anyone who wants to update their posterior do so.

(Footnote: In practice, humans have all kinds of cognitive biases, not to mention they're generally not great at integration. You should post the data, but you should help your readers out by providing informative and honest summaries of the data. Hypothesis tests can be nice, but graphs are often more useful.)

~brad.
At Thursday, November 24, 2011 10:55:00 PM, Anonymous said...: Would you please show the math for the Pujols example? I follow the logic but the computation escapes me?
At Thursday, November 24, 2011 11:01:00 PM, Phil Birnbaum said...: OK, hope I've got it right.

The SD of batting average over 106 AB is about 44 or 45 points, using the normal approximation to the binomial.

If Pujols' expected was .300, he was 1.23 SDs below expected. If Pujols' expected was .330, he was 1.91 SDs below expected.

The chance of being 1.23 SD below expected (or worse) is .109. The chance of being 1.91 SDs below expected (or worse) is .028.

So the odds he was .330 vs. the odds he was .300 were .028 to .109, which is about 1 to 4, which is about 20%.
At Monday, November 28, 2011 12:08:00 PM, Percy P Tron said...: That is probably correct, but I don't think any Bayesian would do an analysis like that.

Let's give Pujols' batting average a beta(31.5,68.5) prior. This is equivalent to saying we are 95% sure his batting average is between .230 and .410 with a mean about .315. Furthermore, .300 and .330 are about the same probability.

If Pujols then bats 26 for 106, the posterior is a beta(57.5, 148.5). This says we know his true batting average is between .220 and .340, with an expected value of about .280. Furthermore, a .300 batting average is about 3 times more likely than .330. So we would say that Pujols is about 75% likely to bat .300 rather than .330.

It seems to me that this article, and the discussion over at the book blog, are focused more on Bayesian ideas while treating Bayesian computation as slapping a prior on frequentist methods. Sorry if I am completely wrong here.
At Monday, November 28, 2011 12:10:00 PM, Phil Birnbaum said...: Percy,

I said in the first comment (point 2) that I was using point estimates instead of distributions to keep the explanation simpler.

But, even though it's a point estimate, it's still Bayesian. The prior is just 50/50 between two points.
At Monday, November 28, 2011 12:47:00 PM, Percy P Tron said...: Ah, I see, sorry about that. Of course, mentioning that you did point estimates vs distributions is a misnomer. Both are distributions, you just defined a different likelihood and prior than I did.

Plus, there is never any reason (besides computational speed) to ever use approximations, like you did. In your example, it didn't hurt at all, but in more complex problems, it could, especially with small sample sizes.

Wednesday, November 23, 2011
