Sabermetric Research: R-squared and "twenty questions"

(This is a follow-up to two previous posts on r-squared.)

------

You've probably played the game "Twenty Questions." Here's how it works. I choose a subject, which can be anything I want -- "baseball glove," or "Hillary Clinton". Then, you have twenty "yes/no" questions to try to figure out what it is.

To win the game, you try to narrow it down as fast as you can, and as much as possible. To start, you might ask the traditional question: "is it bigger than a bread box?" That one question won't tell you what it is immediately, but it starts narrowing it down. According to Wikipedia, other good questions are, "can I put it in my mouth?" and "does it involve technology for communications, entertainment, or work?"

Some questions are obviously bad. Starting off by asking, "is it a DVD of a Sylvester Stallone movie?" is a waste of a question. The answer is probably "no," in which case you're left pretty much where you started. Of course, if it's a "yes," you're almost certain to get it, but the chances of that "yes" are pretty slim.

-----

So, now, here's a variation of the game. This time, I'm going to pick a random American person. Your job is to guess his or her 2011 income, and come as close as you can.

Instead of twenty questions, I'm only going to give you one, for now. However, it doesn't have to be a "yes/no" question -- if you want, it can be any question that can be answered by a number. (You can't ask specifically about the salary, though.) Once you get the answer, you take your guess at the income.

What kind of question do you ask?

Well, one good question might be, "how many years of education does the person have?" You can then go by the general rule that the more education, the more likely they were to have a higher salary.

Or, you can ask, "what's the person's IQ?" Again, you can assume that the higher a person's intelligence, the more likely they are to have a higher income.

But if you ask, "did they win a lottery jackpot last year?", that's a waste. It's like the Stallone DVD question. Most of the time, the answer is no, and you get barely any useful information. It's just not worth asking, just on the off-chance that you get a yes.

-----

That all makes sense, right? Well, if you understand the strategy of the game of "Twenty Questions," you understand r-squared. Because, r-squared is really just a measure of how good your question is. Seriously -- the correspondence between the two is almost perfect. The better the question, the higher the r-squared; and, the higher the r-squared, the better the question.

If you were to run a regression of income on IQ, you'd probably wind up with a decent r-squared -- maybe, I don't know, .15 or something. That means that if you know a random person's IQ, you can knock 15 percent off your average squared error. Maybe if you were completely ignorant, you would guess $30,000 for everyone. But if you know IQ, you can guess $20,000 for low, $30,000 for average, and $40,000 for high, and your guesses would be closer.

But, the bad question, the lottery question: the r-squared of that might only be .001. Originally, you guessed $30,000. Now, if you find out they didn't win the lottery, you guess $29,999, and are a tiny bit closer, on average. If you find out they *did* win the lottery, then you guess, say, $5 million, and you come a lot closer than you would have before. But that happens very infrequently -- so infrequently that your squared error is still going to be about 99.9 percent of what it was without the question.
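To make the comparison concrete, here's a minimal sketch in Python (using numpy) that scores both questions on a simulated population. Everything in it is an assumption for illustration: the million people, the $700 per IQ point, the size of the noise, and especially the prize, which is scaled down to $250,000 with 10 winners so that a million simulated people is enough to put the lottery r-squared in the neighborhood of .001.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # simulated "population" (all numbers here are invented)

# Question 1: IQ. Assume each IQ point is worth $700 of income, plus noise.
iq = rng.normal(100, 15, n)
income = 30_000 + 700 * (iq - 100) + rng.normal(0, 25_000, n)

# Question 2: lottery. Assume 10 winners, each getting a $250,000 prize.
won = np.zeros(n)
won[rng.choice(n, size=10, replace=False)] = 1
income = income + 250_000 * won


def r_squared(x, y):
    """Squared correlation between one question's answers and income."""
    return np.corrcoef(x, y)[0, 1] ** 2


r2_iq = r_squared(iq, income)
r2_lottery = r_squared(won, income)
print(f"IQ question:      r-squared = {r2_iq:.4f}")
print(f"lottery question: r-squared = {r2_lottery:.4f}")
```

With these made-up inputs, the IQ question should come out far ahead of the lottery question, even though a "yes" on the lottery moves the guess much more than any IQ answer does.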

-----

I said the analogy between the game and r-squared was *almost* perfect. If you care, here's how to make it exact:

After you ask the IQ question, you're given a table of all 300,000,000 people in the US, with their income and answer to your question (IQ). Then, before I tell you the random person's IQ, you have to decide in advance what you're going to answer for each possible IQ, and your decision has to have each point of IQ worth the same amount of income (that is, it has to be linear, since it's a linear regression).

Once you've decided, I give you the IQ, and we figure your answer, and your negative score is the square of how much you missed it by.

Under those rules, the analogy is exact: the r-squared exactly corresponds to how good a question you asked.
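Those rules can be sketched in a few lines of Python. This is only an illustration on invented data (a made-up relationship of $700 per IQ point plus noise, and 100,000 people standing in for the full table), with numpy's `polyfit` playing the part of "decide in advance on the best straight-line rule."

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # stand-in for the full table of people (invented data)

iq = rng.normal(100, 15, n)
income = 30_000 + 700 * (iq - 100) + rng.normal(0, 25_000, n)

# Step 1: commit in advance to a straight-line rule, guess = a + b * IQ.
# Least squares picks the line with the smallest average squared miss.
b, a = np.polyfit(iq, income, 1)

# Step 2: the score is the average squared miss over everyone who could be drawn.
score_with_question = np.mean((income - (a + b * iq)) ** 2)

# Baseline: no question at all, so the guess is the overall average for everyone.
score_without = np.mean((income - income.mean()) ** 2)

r2 = np.corrcoef(iq, income)[0, 1] ** 2
share_removed = 1 - score_with_question / score_without
print(f"r-squared:                      {r2:.4f}")
print(f"share of squared error removed: {share_removed:.4f}")
```

The two printed numbers agree: under these rules, r-squared is the fraction of the no-question squared error that the question removes.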

(Oh, and if you want to actually ask twenty questions instead of one ... that's just a multiple regression with 20 variables.)
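As a rough sketch of the twenty-question version, here's least squares with 20 columns in Python. The answers and their dollar weights are invented, and `r_squared` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50_000, 20  # people, questions (invented data)

answers = rng.normal(size=(n, k))        # one column per question
weights = rng.uniform(1_000, 5_000, k)   # hypothetical dollar value of each answer
income = 30_000 + answers @ weights + rng.normal(0, 25_000, n)


def r_squared(questions, y):
    """In-sample r-squared of a least-squares fit of y on the given columns."""
    design = np.column_stack([np.ones(len(y)), questions])  # add an intercept
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    return 1 - residuals.var() / y.var()


r2_one = r_squared(answers[:, :1], income)   # first question only
r2_all = r_squared(answers, income)          # all twenty
print(f"1 question:   r-squared = {r2_one:.3f}")
print(f"20 questions: r-squared = {r2_all:.3f}")
```

On the same data, extra questions can't lower the in-sample r-squared, so the twenty-question figure is at least as high as the one-question figure.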

-----

In the past, I've been critical of analyses that find a low r-squared, and assume that, therefore, there's only a weak relationship. For instance, I've written about the study that found, in MLB, an r-squared of .18 for team payroll vs. team wins. The authors of the study then said something like, "The r-squared is low. Therefore, there's not much of a relationship. Therefore, salary doesn't lead to wins."

Well, that's not right. It's like saying, for the lottery example, "The r-squared is low. Therefore, winning the lottery doesn't lead to more money."

That's obviously incorrect.

The r-squared does NOT measure the direct relationship between the variables. It just measures how good a question it is to ask about the one variable.

But, the thing is, if what you really want is the relationship between winning the lottery and getting rich ... well, that's easy. Just look at the regression equation!

If you do that regression, the one on lottery winnings that gives you an r-squared of .001, you'll wind up with an equation like

Expected salary = $30,000 + $5,000,000 if they won the lottery.

It gives you exactly what you want -- winning the lottery is worth $5 million. Why would you focus on the r-squared, when the exact answer is right there? In fact, the r-squared is completely irrelevant!
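Here's a minimal sketch of that kind of regression in Python, on simulated data. The inputs are assumptions: a million people, 10 winners, noise of $30,000, and a prize scaled down to $250,000 so that a simulation this size still produces a tiny r-squared.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000  # invented population

won = np.zeros(n)
won[rng.choice(n, size=10, replace=False)] = 1  # 10 hypothetical winners
income = 30_000 + rng.normal(0, 30_000, n) + 250_000 * won  # assumed prize

# Regress income on the 0/1 lottery variable.
slope, intercept = np.polyfit(won, income, 1)
r2 = np.corrcoef(won, income)[0, 1] ** 2

print(f"expected income = {intercept:,.0f} + {slope:,.0f} if they won the lottery")
print(f"r-squared = {r2:.5f}")

# With a 0/1 variable, the slope is just the gap between the two group averages.
gap = income[won == 1].mean() - income[won == 0].mean()
print(f"winners' average minus non-winners' average = {gap:,.0f}")
```

The coefficient lands near the assumed prize (give or take the luck of 10 winners), while the r-squared stays tiny. The two numbers are answering different questions.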

I think the reason we sometimes focus on the r-squared, though, is that we make a false assumption. It is true that (a) if you have a high r-squared, you have a strong relationship. But it is NOT necessarily true that (b) if you *don't* have high r-squared, you *don't* have a strong relationship. I think that maybe we just assume because (a) is true, (b) is also true. But it's not.

------

So, in summary, three different ways to think about it:

--- One

The regression equation answers, "how much does winning the lottery affect income?" [lots.]

The r-squared answers, "is asking about the lottery a good "twenty questions" way to help estimate income?" [not very.]

--- Two

The regression equation answers, "how much does winning the lottery affect income?" [lots.]

The r-squared answers, "when people differ in income, how much of that is because some of them won the lottery?" [not much.]

--- Three

The regression equation answers, "if you change the value of the lottery variable from "no" to "yes," how much does income change?" [lots.]

The r-squared answers, "if you change the value of the lottery variable from one random person's to another random person's, how much does income change?" [not much -- two random people are probably both "no", so the change is usually zero.]
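If it helps, all three framings fall out of one formula under a simple model. The model is an assumption made for this sketch: a fraction p of people win, winning adds Δ dollars, and everything else about income is noise with variance σ², independent of winning.

```latex
% Assumed model: x = 1 with probability p (won), else 0; noise independent of x
y = a + \Delta\,x + \varepsilon, \qquad \operatorname{Var}(\varepsilon) = \sigma^2

% The regression coefficient is the effect of winning
\text{slope} = \Delta

% Variance of income splits into a lottery part and everything else
\operatorname{Var}(y) = p(1-p)\,\Delta^2 + \sigma^2

% r-squared is the lottery part's share of the total
r^2 = \frac{p(1-p)\,\Delta^2}{p(1-p)\,\Delta^2 + \sigma^2}
```

Hold Δ fixed and let p shrink toward zero, and the r-squared shrinks toward zero with it, even though the coefficient hasn't moved at all.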

------

If you've got any more good ones, let me know, and I'll add them in.

Labels: r-squared, regression, statistics

10 Comments:

At Wednesday, August 22, 2012 3:55:00 PM, Alex said...: I would quibble with one of the characterizations a little bit. In your lottery example, the r-squared isn't "irrelevant". It's still a measure of the *quality* of the relationship between salary and winning the lottery. Let's say that the r-squared was even worse than the .001 you mention and say that it's .00000001 (it could be 0, but you would never get 0 in practice). It would, in my opinion, be more fair to look at the regression and say "winning the lottery tends to increase salary by 5 million, but there's so much noise and inconsistency as to make the information useless". I would personally make the same claim at .001; it becomes an issue of how large an r-squared you find to be meaningful.

And to be even more picky, r-squared is analogous to the quality of the question assuming a linear relationship. If I asked you to guess a person's memory ability (or athletic ability) and you wanted to ask their age that would be a fantastic question, but the r-squared would be low because there's more of a quadratic relationship across the lifespan.

At Wednesday, August 22, 2012 4:28:00 PM, Phil Birnbaum said...: Hi, Alex,

I think I disagree with you about noise and inconsistency.

I suspect that if you took 1,000,000 random Z-scores that didn't win the lottery, and 10 scores of +5 that did win the lottery, you'd find

(a) a low r-squared, and
(b) a statistically significant value for lottery winning, around +5.

On the linear relationship, I agree. That's what I caveated in the small print in the middle of the post.

At Wednesday, August 22, 2012 9:21:00 PM, Alex said...: Sure, I don't disagree with that (and a quick simulation verifies it, r-squared for mine was .0002). With a large enough sample any correlation will be significant. It's more a question of being meaningful; how much do you gain from your knowledge? Like you said, how good a question is it? If the r-squared is low, it's telling you that you're asking a poor question. That would be true even if the regression is significant or if it gives you a reasonable regression equation.

Another way to put it would be to ask, how much worse off am I if I ignore the regression? In the simulation I ran from your suggestion, the mean absolute error for the regression (fitted values subtracted from actual values) is .79774425. If I just guess a z-score of 0 for everyone, I get an error of .797793. That is, a difference in the fifth decimal place, or on the order of thousandths of a percent difference. The relationship exists and is significant with all the data I have, but I would barely notice the difference if I didn't know it existed. That, to me, suggests that the information is pretty useless. Your lotto example is basically the exact analogue of your Stallone DVD example. You gain a tiny bit of information, so tiny it's basically not worth knowing.

At Wednesday, August 22, 2012 9:27:00 PM, Phil Birnbaum said...: Hi, Alex,

I'd argue that you gain a lot -- you find out that winning the lottery is worth +5. That's valuable information.

But, if your goal is just to reduce error across the entire population, then, sure, knowing about the lottery doesn't help much, as your numbers confirm.

But that's not necessarily the goal. If you're researching whether (say) power lines cause cancer, you don't care about reducing error across the population. You just want to find out whether power lines cause cancer. And that's the coefficient, regardless of whether the r-squared is .00001 or .5.

At Wednesday, August 22, 2012 9:38:00 PM, Phil Birnbaum said...: Another way to put it: a low r-squared means you don't have a very good grip on what causes income to vary, in general.

But you can still have a good grip (via the coefficient) on one small part of the picture -- how winning the lottery causes income to vary.

At Friday, August 24, 2012 1:21:00 AM, Alex said...: The cancer example is an interesting one. Say it turned out exactly that way: living under power lines increased the chance of getting cancer by a statistically significant amount with a tiny R-squared. Would you call a press conference and tell everyone to move away from power lines? That's very explosive information for a result that basically says a lot of people who live near lines get cancer and a lot don't, and obviously a lot of people who don't live near lines still get cancer.

Or to go back to your lotto example - how valuable is it, exactly, to know that winning is worth +5 if my salary guesses are only .001% better?

At Friday, August 24, 2012 9:00:00 AM, Phil Birnbaum said...: You could have a tiny r-squared even if EVERYONE who lives near power lines gets cancer. That would happen if very few people lived near power lines.

At Wednesday, August 29, 2012 4:29:00 PM, David said...: Just got around to reading this.

First, great post. I also totally agree with your response to Alex. If you're interested in marginal effects, R-squared is close to irrelevant.

Second, I would quibble with the statement "If you have a high R-squared, then you have a strong relationship." If there's very little variation in the outcome variable, this may not be true. (Suppose I have a sample where everyone has income of 20k except one guy who has income of 20,000.01 who won a 1 cent lottery today. With a lottery variable, I get an r-squared of 1 but the 1 cent effect of the lottery seems small.)

Third, I happen to have data on income, test scores, education, gender, and race handy. Test scores and years of schooling are equally good predictors. The r-squared values for univariate regressions are:

Test Score: 0.07
Years of schooling: 0.07
Gender: 0.02
Race dummies: 0.03
Height: 0.03

At Wednesday, August 29, 2012 4:38:00 PM, David said...: Oh, and for the record, the best variable for your twenty questions I know of is:

"What was your parents' income?"

R-squared between income and parents' income in the first paper I looked at was 0.18

At Wednesday, August 29, 2012 4:55:00 PM, Phil Birnbaum said...: Thanks, David!

Let me quibble with your quibble: if you're willing to do a regression after seeing the summary statistics and noting the very small variance, then perhaps 1 cent is big to you. :)

But point otherwise well taken.

The correlations were a bit lower than I expected, but reasonable ... except parents' income was a bit higher than I expected.

Appreciate it!

Wednesday, August 22, 2012
