Wednesday, August 29, 2012

Do NHL teams draw more fans when they're more likely to win?

When you do a regression, you've got to figure out not just what the coefficients are, but also what they *mean*.  Sometimes, there's no obvious answer.

I came across a recent hockey study (which will remain nameless) that tries to figure out if teams sell more tickets when they're more likely to win the game.

They did a regression, to estimate attendance (actually, the log of attendance) based on these variables:

-- previous season attendance
-- new arena
-- first game of season
-- home previous season win %
-- home current season win % to date
-- home goals scored per game to date
-- home goals allowed per game to date
-- visitor win % to date
-- visitor goals scored per game to date
-- visitor goals allowed per game to date
-- visitor penalty minutes allowed per game to date
-- home team game number (actual and squared)
-- franchise and season dummies
-- probability of home team winning (Vegas odds).

The last coefficient, "probability of home team winning," was significant and positive.  The authors concluded that fans are attracted to the games where the home team is more likely to win.

But ... that's not necessarily the case. 

What the regression might suggest is that more fans show up to the high-probability games *holding all the other variables constant*.  That changes everything.

Let's hold those other variables constant.  Suppose the home team was .600 last year, is .600 this year, and has scored 2.5 goals and allowed 2.0 goals per game so far this year.  And the visiting team is .470 this year, and has scored 3.0 goals and allowed 3.25 goals per game so far this year. 

Holding all those things constant, more fans show up when the odds of winning are better for the home team.

But what does that actually mean? 

Don't the odds of winning depend almost completely on those numbers, the ones we're holding constant?  A .600 team playing a .470 team should be expected to win, say, 70% of its home games (making that number up).  
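(If you'd rather compute a number like that than make it up, one quick-and-dirty option -- my choice for illustration, not anything the study used -- is Bill James's "log5" formula.  The .040 bump for home ice below is also an invented number:)

```python
# Bill James's "log5" estimate of the probability that one team beats
# another, given their true winning percentages.  The .040 home-ice
# adjustment is a made-up, illustrative figure.

def log5(p_home, p_visitor):
    """P(home team wins), from the two teams' winning percentages."""
    num = p_home * (1 - p_visitor)
    return num / (num + p_visitor * (1 - p_home))

neutral = log5(0.600, 0.470)        # .600 home team vs. .470 visitor
with_home_ice = log5(0.640, 0.470)  # crude bump: .600 + .040 for home ice
print(round(neutral, 3), round(with_home_ice, 3))
```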

If the Vegas odds are higher than 70%, what does that mean?  It means that all those performance numbers aren't an adequate indication of the team's quality.  Why might that be?  Maybe an opposing star player is injured.  Maybe Sidney Crosby has just come back from injury.  Maybe the home team's rookies have started to blossom, or it made a good trade recently.  Maybe ... well, there are probably other things. 

In any case, it's hard to see how any of that stuff would lead to the conclusion that the home fans care about the probability of winning.  Couldn't they just be responding to Sidney Crosby being back?  Might they not just want to see the new superstar they just acquired?  Or, maybe fans respond more to a team that's gotten better recently, rather than a team that's been the same quality for a couple of years now.

Or, maybe they're more likely to come to the game if the *opposition's* star player was injured.  But that doesn't sound right.  Aren't visiting teams' stars supposed to be an attraction?

In any case, the conclusion "fans respond to probability of winning" doesn't necessarily follow from the regression -- because the regression coefficients have to be interpreted while keeping everything constant. 

-------

By the way, here's what I think is really going on.  The study includes factors for the quality of the teams, but they're not that accurate.  The teams' records on a "so far this year" basis are almost meaningless early in the season.  For the first few games played, they're mostly random. 

Therefore, many of the games have only an imperfect measure of how good the teams are.  Therefore, the Vegas odds pick up the slack.  And, therefore, what that coefficient might be telling you is that a team draws more fans *when they're a better team*.  That's not news.

But ... looking again at the list of variables, there's a measure of the home team's record last year, but not the visiting team's.  So, it could also be that the results hold because fans come out more for bad visiting teams than good ones.  Except that the regression appears to show higher attendance for better visiting teams ... but, again, to interpret the coefficients that way, you have to hold everything else constant, including win probability. 

There's no real way to tell from this regression whether it's the good home team or the bad visiting team causing the effect.  My gut says it's all the home team, but I could be wrong.

-------

More importantly, what exactly is it that the authors are trying to figure out? 

They talk about whether fans come out more when there's a better chance of winning.  Well, the chance of winning depends almost completely on two things: the quality of the home team, and the quality of the visiting team. 

We already know, don't we, that a winning home team draws more fans?  If the authors agree that we already know that, then they must be wondering whether more fans come out for a bad visiting team. 

In that case, why include the visiting team quality variables, which eliminate the possibility of answering that question? 





Wednesday, August 22, 2012

R-squared and "twenty questions"

(This is a follow-up to two previous posts on r-squared.)

------
 
You've probably played the game "Twenty Questions." Here's how it works.  I choose a subject, which can be anything I want -- "baseball glove," or "Hillary Clinton".  Then, you have twenty "yes/no" questions to try to figure out what it is.

To win the game, you try to narrow it down as fast as you can, and as much as possible.  To start, you might ask the traditional question: "is it bigger than a bread box?"  That one question won't tell you what it is immediately, but it starts narrowing it down.  According to Wikipedia, other good questions are, "can I put it in my mouth?" and "does it involve technology for communications, entertainment, or work?"

Some questions are obviously bad.  Starting off by asking, "is it a DVD of a Sylvester Stallone movie?" is a waste of a question.  The answer is probably "no," in which case you're left pretty much where you started.  Of course, if it's a "yes," you're almost certain to get it, but the chances of that "yes" are pretty slim.

-----

So, now, here's a variation of the game.  This time, I'm going to pick a random American person.  Your job is to guess his or her 2011 income, and come as close as you can. 

Instead of twenty questions, I'm only going to give you one, for now.  However, it doesn't have to be a "yes/no" question -- if you want, it can be any question that can be answered by a number.  (You can't ask specifically about the salary, though.)  Once you get the answer, you take your guess at the income. 

What kind of question do you ask? 

Well, one good question might be, "how many years of education does the person have?"  You can then go by the general rule that the more education, the more likely they were to have a higher salary.

Or, you can ask, "what's the person's IQ?"  Again, you can assume that the higher a person's intelligence, the more likely they are to have a higher income.

But if you ask, "did they win a lottery jackpot last year?", that's a waste.  It's like the Stallone DVD question.  Most of the time, the answer is no, and you get barely any useful information.  It's just not worth asking, just on the off-chance that you get a yes.

-----

That all makes sense, right?  Well, if you understand the strategy of the game of "Twenty Questions," you understand r-squared.  Because, r-squared is really just a measure of how good your question is.  Seriously -- the correspondence between the two is almost perfect.  The better the question, the higher the r-squared; and, the higher the r-squared, the better the question.

If you were to run a regression of IQ against income, you'd probably wind up with a decent r-squared -- maybe, I don't know, .15 or something.  That means that if you know a random person's IQ, you can knock 15 percent off your average squared error.  Maybe if you were completely ignorant, you would guess $30,000 for everyone.  But if you know IQ, you can guess $20,000 for low, $30,000 for average, and $40,000 for high, and your guesses would be closer. 

But, the bad question, the lottery question: the r-squared of that might only be .001.  Originally, you guessed $30,000.  Now, if you find out they didn't win the lottery, you guess $29,999, and are a tiny bit closer, on average.  If you find out they *did* win the lottery, then you guess, say, $5 million, and you come a lot closer than you would have before.  But that happens very infrequently -- so infrequently that your average squared error is still going to be well over 99.99 percent of what it was without the question.

-----

I said the analogy between the game and r-squared was *almost* perfect.  If you care, here's how to make it exact:

After you ask the IQ question, you're given a table of all 300,000,000 people in the US, with their income and answer to your question (IQ).  Then, before I tell you the random person's IQ, you have to decide in advance what you're going to answer for each possible IQ, and your decision has to have each point of IQ worth the same amount of income (that is, it has to be linear, since it's a linear regression). 

Once you've decided, I give you the IQ, and we figure your answer, and your negative score is the square of how much you missed it by.

Under those rules, the analogy is exact: the r-squared exactly corresponds to how good a question you asked.

(Oh, and if you want to actually ask twenty questions instead of one ... that's just a multiple regression with 20 variables.)


-----

In the past, I've been critical of analyses that find a low r-squared, and assume that, therefore, there's only a weak relationship.  For instance, I've written about the study that found, in MLB, an r-squared of .18 for team payroll vs. team wins.  The authors of the study then said something like, "The r-squared is low.  Therefore, there's not much of a relationship.  Therefore, salary doesn't lead to wins."

Well, that's not right.  It's like saying, for the lottery example, "The r-squared is low.  Therefore, winning the lottery doesn't lead to more money."

That's obviously incorrect. 

The r-squared does NOT measure the direct relationship between the variables.  It just measures how good a question it is to ask about the one variable.

But, the thing is, if what you really want is the relationship between winning the lottery and getting rich ... well, that's easy.  Just look at the regression equation!

If you do that regression, the one on lottery winnings that gives you an r-squared of .001, you'll wind up with an equation like

Expected salary = $30,000 + $5,000,000 if they won the lottery.


It gives you exactly what you want -- winning the lottery is worth $5 million.  Why would you focus on the r-squared, when the exact answer is right there?  In fact, the r-squared is completely irrelevant!
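Here's a concrete, deterministic version of that, with made-up numbers.  I've scaled the jackpot down to $50,000 so a small fake population shows the effect, but the contrast is the same: the slope recovers the full prize, while the r-squared stays near zero because winners are so rare:

```python
# Made-up population of 100,000: incomes alternate $20,000 / $40,000, and 10
# people also won a $50,000 prize.  The one-variable regression slope
# recovers the full prize; the r-squared stays tiny anyway.
from statistics import mean

won, income = [], []
for i in range(100_000):
    w = 1 if i < 10 else 0                    # 10 winners in 100,000
    base = 20_000 if i % 2 == 0 else 40_000   # everyone's income varies
    won.append(w)
    income.append(base + 50_000 * w)

xbar, ybar = mean(won), mean(income)
sxx = sum((x - xbar) ** 2 for x in won)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(won, income))
syy = sum((y - ybar) ** 2 for y in income)

slope = sxy / sxx                    # ~50,000: the full prize
r_squared = sxy ** 2 / (sxx * syy)   # ~0.0025: still a "bad question"
print(round(slope), round(r_squared, 4))
```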

I think the reason we sometimes focus on the r-squared, though, is that we make a false assumption.  It is true that (a) if you have a high r-squared, you have a strong relationship.  But it is NOT necessarily true that (b) if you *don't* have high r-squared, you *don't* have a strong relationship.  I think that maybe we just assume because (a) is true, (b) is also true.  But it's not.

------

So, in summary, three different ways to think about it:

--- One

The regression equation answers, "how much does winning the lottery affect income?"  [lots.]

The r-squared answers, "is asking about the lottery a good 'twenty questions' way to help estimate income?"  [not very.]

--- Two

The regression equation answers, "how much does winning the lottery affect income?"  [lots.]

The r-squared answers, "when people differ in income, how much of that is because some of them won the lottery?" [not much.]

--- Three

The regression equation answers, "if you change the value of the lottery variable from 'no' to 'yes,' how much does income change?"  [lots.]

The r-squared answers, "if you change the value of the lottery variable from one random person's to another random person's, how much does income change?"  [not much -- two random people are probably both "no", so the change is usually zero.]

------

If you've got any more good ones, let me know, and I'll add them in.






Friday, August 17, 2012

A benefit of r-squared

(This is a sequel to a previous post on r-squared.)

-----

Bob and Sam are arguing about how raffles are won.  Bob says, "People tend to win because they buy lots of tickets.  That's the biggest factor."  And Sam replies, "Well, you can buy a lot of tickets, but you still have to get lucky.  People win more because they get lucky, rather than because of how many tickets they buy."

You do a regression to predict prizes won, based on how many tickets were bought.  And the r-squared winds up at .52. 

Well, Bob has won the argument, hasn't he?  The number of tickets explains 52 percent of the variance in prizes won.  If Sam were right, then luck would have to explain more than 52 percent.  Since tickets and luck are presumably independent, you can just add the variances (by the Pythagorean theorem of statistics).  That adds up to 104 percent, which is impossible.
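Here's that variance-adding in simulated form.  The components and their spreads are made up (a tickets effect with SD 3, luck with SD 2); the point is just that each independent component's share of the total variance is the r-squared its one-variable regression would report:

```python
# For independent components, variances add, and each component's share of
# the total variance is its r-squared.  The SDs of 3 and 2 are arbitrary.
import random
from statistics import pvariance

random.seed(7)
n = 200_000
tickets_effect = [random.gauss(0, 3) for _ in range(n)]
luck = [random.gauss(0, 2) for _ in range(n)]
prizes = [t + l for t, l in zip(tickets_effect, luck)]

share_tickets = pvariance(tickets_effect) / pvariance(prizes)
share_luck = pvariance(luck) / pvariance(prizes)
print(round(share_tickets, 2), round(share_luck, 2))  # shares sum to ~1
```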

Anything above 50 percent must be the biggest factor -- at least, of all other factors independent of that one.

----

And that's where it gets tricky.  Because, it's hard to find factors that are legitimately independent.

Suppose you're looking to "explain" differences in salary.  Why does Chris make $30,000, while Pat makes $50,000?  What factors contribute to a higher or lower salary?

Someone does a regression, to predict salary based on intelligence (as imperfectly measured by an IQ test, say).  He winds up with an r-squared of, say, .41.  That's pretty impressive!  He concludes that how smart you are is a big predictor of how much money you'll make.

But, then, a colleague comes along.  She thinks it's education that leads to higher salaries.  She does her own regression, to predict salary based on the number of years of schooling.  The r-squared is .36.  Again, impressive!  She writes a study claiming that schooling increases salaries.

Finally, a third colleague thinks it's family culture.  He does a third regression, this time using parental income.  There's again a high r-squared, this time .43.  The conclusion, this time, is that high salary is something that parents influence you to achieve.

(These r-squareds are all made up; they're probably way too high to be realistic.)

Now, if you just add up the three r-squareds, you might conclude that those three factors, taken together, explain 120% of the variation in salary!  Obviously, that can't be right.

And it's not right -- because you can only add variances when the variables are independent. 

These aren't.  They're highly correlated.  If you have a high IQ, you're more likely to stay in school longer.  If your parents are academics, they probably had a high IQ, and therefore, probably, so would you.

So you can't just add up the r-squareds, like you could for the lottery example, or the dice example from the last post. 

All you can do, if you choose, is perhaps to say that the r-squared for parental income is the highest, so that's the most plausible theory right now.  But, really, it's probably some combination of all three factors, which overlap quite a bit.  The r-squareds don't help a whole lot, here, in supporting one hypothesis over another.

What we *can* do, to help a little bit, is run a multiple regression, using all three variables.  Multiple regression is smart enough to adjust for the fact that the variables aren't independent, by taking them all at once.

Let's suppose we did that, and we got an r-squared of 0.6. That's half of the total sum, which says that exactly half the variance is shared by the three factors.  That means that in the aggregate, our three researchers are "half right".  It doesn't mean they're all half right; one of them might be 100% right, one might be 40%, and one might be 10%.  But, on average, they're half.  That doesn't really do us a lot of good.  We still don't know which of the three factors are important in what proportion. 
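Here's a sketch of that overlap, with made-up data: "iq" and "schooling" both built from a shared component, both feeding "salary".  Each one-variable r-squared looks respectable, but the combined R-squared (using the standard closed form for two predictors) comes out well below their sum:

```python
# Two correlated, made-up predictors: their individual r-squareds add to
# more than the combined two-variable R-squared, because they overlap.
import random
from statistics import mean, pstdev

random.seed(42)
n = 100_000
common = [random.gauss(0, 1) for _ in range(n)]
iq = [c + random.gauss(0, 0.6) for c in common]
schooling = [c + random.gauss(0, 0.6) for c in common]
salary = [c + random.gauss(0, 1) for c in common]

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = mean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (pstdev(a) * pstdev(b))

r1, r2 = corr(iq, salary), corr(schooling, salary)
r12 = corr(iq, schooling)
# Closed-form R-squared for a two-predictor regression:
combined = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
print(round(r1**2 + r2**2, 2), round(combined, 2))  # sum vs. combined
```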

Or, even, if there are other factors that just happen to correlate with these. 

------

It's obvious when I spell it out like this, with all three studies.  But, suppose only the schooling study had been done.  You might just read that schooling explains 36 percent of the variation in salary, and, without thinking, conclude that getting more formal education makes you richer.  But, you'd probably be wrong.  It might be IQ.  It might be culture.  It might be other things, that correlate with schooling, that we haven't thought of yet (how much you study, for instance, or what kind of degree you have).

This is why, I say, you always have to make an argument.  Sure, you got a high r-squared when you looked at schooling.  But why do you assume cause and effect?  How do you know it's not something else instead, something that correlates with schooling?  No statistical test can tell you.  You have to argue for it.

-------

Let me give you a baseball example. 

I took every Major League Baseball team since 1970 (excluding 1981 and 1994), and ran a regression to predict their runs scored from their doubles hit.

The r-squared came out to .462.

I was surprised how high that was.  Knowing 2B, you can reduce the variance by almost half. 

Can we conclude that 46 percent of baseball is doubles?  No, of course not.  It's not the doubles -- it's a confounding factor, something that correlates with doubles.  What I think is actually happening is that teams that hit a lot of doubles also hit a lot of home runs.  (The correlation between HR and 2B was .559.)

Let's get rid of doubles and substitute home runs.  Now, the r-squared is .594.

Can we conclude that 59 percent of baseball is HR?  Again, of course not.  Again, what's likely happening is that teams that hit a lot of home runs also hit a lot of doubles (and probably other things).

All we can say is that knowing HR lets us reduce our mean squared error by 59 percent.  If we want to argue *why*, that's fine, but the r-squared alone doesn't tell us.

--------

Another thing we can do is a multiple regression: predict runs based on both HR and 2B.  I did that, and I got an r-squared of .684.

So, if you start with HR (r-squared .594), and then add doubles, you gain an extra .090.  If you start with doubles (r-squared .462), and then add home runs, you gain an extra .222. 

So, HR and 2B have .372 in common.  HR adds .222 to that.  2B adds .090 to that.  You could draw a Venn diagram to illustrate the overlap, if you wanted to.
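Recomputing that bookkeeping directly from the three r-squareds:

```python
# Venn-diagram arithmetic from the three reported r-squareds.
r2_hr, r2_2b, r2_both = 0.594, 0.462, 0.684

unique_hr = r2_both - r2_2b       # what HR adds once 2B is already in
unique_2b = r2_both - r2_hr       # what 2B adds once HR is already in
shared = r2_hr + r2_2b - r2_both  # the overlap both can claim

print(round(unique_hr, 3), round(unique_2b, 3), round(shared, 3))  # 0.222 0.09 0.372
```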

But ... those numbers don't mean a whole lot to me, in real life terms. 

-------

So what good is the r-squared, then?

For me, there's one particular task that r-squared is great for: helping figure out how much luck is embedded in performance.  For instance, how much of the variation in team W-L records is based on clutch performance, as opposed to just scoring and preventing runs?

Most sabermetricians say, not much.  My impression is that most mainstream baseball people would say, quite a bit.

Well, here's what I did.  I ran a regression to predict team wins based on runs scored and runs allowed, for 1973-2011 (omitting 1981 and 1994).  The r-squared came out to .87.

That leaves only .13 remaining for clutch performance.  That is, assuming clutch is uncorrelated with RS and RA.  It isn't, quite, but it's probably close.  In any case, you can argue that the *independent* portion has only .13 remaining to claim.  That should satisfy the clutch advocates, since they usually argue that clutch matters in a way *not measured* by raw runs.  (As in, "sure, team A and B scored 750 runs each, but team B scored them when they counted most.")

That's what I like r-squared for.  It lets you estimate the "explanation space" available for the unusual theories, the ones that don't correlate with other variables.  Effectively, it takes some of the air out of the weird hypotheses. 

Suppose I have a theory that having a job interview on a good biorhythm day is an important explanation for differences in salary.  If I run a regression to account for the other, mainstream variables -- IQ, education, study habits, height, sex, race, and so on -- and I get an r-squared of .90, that means there's only .10 left for unrelated factors.  So, I have to be aware that those factors are *at least* nine times more important than my biorhythm theory.

So that's the thing I like about r-squared: it gives you a mental pie chart of the strength of competing explanations -- or, at least, competing *independent* explanations.







Thursday, August 09, 2012

The standard error vs. the mean absolute error


Every day, a restaurant owner estimates how many lobsters she'll need to order.  If she orders too many, she'll have to throw some out in the evening, at a cost of $10 each.  If she orders too few, she may have to turn away customers, again at a cost of $10 each.

A statistician comes along, and analyzes her estimates over the past 1000 days.  It turns out that the mean of her errors is zero.  That's good, because it means her guesses are unbiased -- she's as likely to overestimate as underestimate.  (For instance, some days she's +5, and other days she's -5.)

The statistician also calculates that the standard deviation of her errors -- her "standard error" -- is 10, and that the errors are normally distributed.

Over the 1,000 days, then, how much money have the errors cost her?

If you asked me that question a few days ago, I would have said, well, the standard deviation is 10 ... so the typical error is 10 lobsters either way, or $100.  That means, over 1,000 days, the total cost would be $100,000.

But now, I think that's not quite right.

To figure out what the errors cost, we need to know the mean [absolute] error.  But all we have is the standard deviation, which is the square root of the average square error.  Those are two different things.  And, as it turns out, the SD is always larger than (or rarely, equal to) the mean error.

For instance, suppose the mean is zero, and we have three errors, 0, +5, and -5.  The mean error is 3.33.  But the SD is 4.08.
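Computed directly:

```python
# The three-error example: the SD comes out larger than the mean
# absolute error.
from statistics import pstdev

errors = [0, 5, -5]
mean_abs = sum(abs(e) for e in errors) / len(errors)
sd = pstdev(errors)
print(round(mean_abs, 2), round(sd, 2))  # 3.33 4.08
```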

So, my guess of $100,000, based on the SD, is too high.   Can we get a better estimate?

I think we can.

Since the normal distribution is symmetrical, the average error of the entire bell curve is the same as the average error for the right half of the bell curve. 

That right side is the "half normal distribution," and, conveniently, Wikipedia tells us the mean of that distribution is
 
sigma * sqrt(2/pi)

Where "sigma" is the SD of the full bell curve.

The square root of 2/pi is approximately equal to 0.7979.  Call it 0.8, for short.

That means that for a normal distribution, the average [absolute] error is 20% smaller than the SD.  Which means that the cost of the lobster errors isn't $100,000 -- it's only $80,000.
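You can check the 0.7979 factor by simulation, if you don't trust Wikipedia.  The seed and the 500,000 draws are arbitrary:

```python
# For normal errors with SD 10, the mean absolute error should land near
# 10 * sqrt(2/pi), about 7.98 -- roughly 20% below the SD.
import math
import random

random.seed(2012)
draws = [random.gauss(0, 10) for _ in range(500_000)]
mean_abs = sum(abs(d) for d in draws) / len(draws)
print(round(10 * math.sqrt(2 / math.pi), 3))  # theoretical: 7.979
print(round(mean_abs, 1))                     # the simulated version
```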

That's something I didn't realize until just a couple of days ago.

------

What difference does it make?  Well, maybe not a lot, unless you're actually trying to estimate your average error (as our hypothetical restaurateur did).  But, if you know you're dealing with a normal distribution, why not just throw in the 20% discount when it's appropriate?

For instance, a .500 baseball team will average 81 wins out of 162 games, and the normal approximation to the binomial puts the standard deviation at 6.36.  That suggests that if you have to guess a team's final W-L record, your typical error will be 6.36. 

But, now we know that the mean error is only 80% of that.  Therefore, you should expect to be off by 5 wins, on average, not 6.4.
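A quick simulation check (the 20,000 made-up seasons of coin-flip games are arbitrary):

```python
# A .500 team over 162 games: the SD of wins is sqrt(162 * 0.25) = 6.36, and
# the mean absolute miss from 81 wins should land near 0.8 * 6.36, about 5.1.
import math
import random

random.seed(81)
seasons = [sum(random.random() < 0.5 for _ in range(162))
           for _ in range(20_000)]
mean_miss = sum(abs(w - 81) for w in seasons) / len(seasons)
print(round(math.sqrt(162 * 0.25), 2))  # 6.36
print(round(mean_miss, 2))              # near 5.1
```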

------ 

Along the same lines, I've always wondered why, when a regression looks for the best-fit straight line, it looks to minimize the sum of squared errors.  Shouldn't it be better to minimize the sum of absolute errors, even if that's not as mathematically elegant? 

That is, suppose the restaurant owner hires you to try to reduce her losses.  You run a regression based on month, day of the week, whether there's a convention in town, and so on, in order to help estimate how many lobster-eating customers will arrive that day.

In that regression, wouldn't it be better to work to minimize the errors, rather than the squared errors?  It's more complicated mathematically, but it might give better estimates, in terms of lobster money saved.

I was never sure about that.  But, now that I know the "80%" relationship between SD and mean error, I realize it should lead to the same results.  When you find the line that minimizes the sum of squared errors, you must also be minimizing the sum of absolute errors.

Why? 

If you minimize the sum of squared errors, you must also be minimizing the average of square errors.  And so you must also be minimizing the square root of that average (which is the SD).

If you minimize the SD, you must also be minimizing 80% of the SD.  Since regressions assume errors are normal, 80% of the SD is the mean error.

Therefore, if you minimize the sum of squared errors, you must simultaneously be minimizing the mean error.  QED, kind of.

------

There's one extra advantage, though, of minimizing sum of squared errors instead of just sum of absolute errors: using squared errors breaks ties nicely.

Suppose your series consists of 1, 3, 1, 3, 1, 3 ... repeated alternately.  What line best fits the data?

The obvious answer is: a horizontal line at 2.  It bisects the 1s and 3s perfectly. 

But: if all you want to do is minimize the *absolute* errors, you can use a horizontal line at 1, or at 3, or at any value in between.  At 1, the errors will alternate between 0 and 2, for an average of 1.  At 3, the errors will alternate between 2 and 0, again for an average of 1.  At 2, the errors will all be 1, which is still an average of 1.

But ... the horizontal line at 2 seems "righter" than a line at 1, or 3, or another value.  It seems that "right in the middle" should somehow come up as the default answer.

And that's the advantage of sum of *squared* errors.  It breaks the tie, in favor of the "2". 

A horizontal line at 1 will alternate squared errors between 0 and 4, for an average of 2.  But a horizontal line at 2 will have an average squared error of 1. 
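Computed directly, for horizontal lines at 1, 2, and 3:

```python
# The 1,3,1,3,... series: every line between 1 and 3 ties on mean absolute
# error, but mean squared error singles out the line at 2.
series = [1, 3] * 50

def mean_abs_err(level):
    return sum(abs(y - level) for y in series) / len(series)

def mean_sq_err(level):
    return sum((y - level) ** 2 for y in series) / len(series)

print([mean_abs_err(v) for v in (1, 2, 3)])  # [1.0, 1.0, 1.0]
print([mean_sq_err(v) for v in (1, 2, 3)])   # [2.0, 1.0, 2.0]
```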

------

The moral, as I see it:  Regressions find the best fit line based on minimizing the sum of squared errors.  That works very well.  However, meaningful real-life conclusions often are better expressed in actual errors.  Fortunately, for standard regressions, the mean error is easy to estimate -- it's just 80% of the standard error that the regression reports.






Monday, August 06, 2012

Why r-squared works

My most visited post on this blog, by far, is this one, which is about regression, rather than sports.  Most readers got there by googling "difference between r and r-squared."  

I've been meaning to do a long followup post explaining the differences more formally ... but, I guess, that was too ambitious a quest, because I never got around to it.  So, what I figured I'd do is just write about certain aspects as I thought about them.  So here's the first one, which is only about r-squared, for now.

------

You roll a single, fair die.  I try to predict what number you'll roll.  I'm trying to be as accurate as possible.  That means, I want my average error to be as small as possible. 

If the die is fair, my best strategy is to guess 3.5, which is the average roll.  If I do that, and you roll a 1 or a 6, I'll be off by 2.5.  If you roll a 2 or a 5, I'll be off by 1.5.  And if you roll a 3 or a 4, I'll be off by 0.5.

My average error, then, will be 1.5. 

What if you roll two dice instead of one?  My best guess will be 7, which again is the average.  One time out of 36, you'll roll a 2, and I'll be off by five.  Two times out of 36, you'll roll a 3, and I'll be off by four.  And so on.  If you average out all 36 rolls, you'll see my mean error will be 1.94.

With three dice, my best guess is again the average, or 10.5.  This time, there are 216 possible rolls.  I did the calculation ... if I've got it right, my mean error is now 2.42.

Let me repeat these for you:

1.50 -- mean error with one die
1.94 -- mean error with two dice
2.42 -- mean error with three dice
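If you want to check those numbers, here's the brute-force enumeration over every equally likely roll:

```python
# Mean absolute error of the best guess (the average total), for one, two,
# and three dice.
from itertools import product

def mean_error(num_dice):
    rolls = list(product(range(1, 7), repeat=num_dice))
    guess = 3.5 * num_dice  # best guess with no information
    return sum(abs(sum(r) - guess) for r in rolls) / len(rolls)

print([round(mean_error(n), 2) for n in (1, 2, 3)])  # [1.5, 1.94, 2.42]
```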

As I said, these are the mean errors when I have no information at all.  If you give me some additional information, I can reduce the mean error.

Suppose we go back to the three dice case, where I guess 10.5 and my mean error is 2.42.  But, this time, let's say you tell me the number showing on the first die (designated in advance -- let's suppose it's the blue one).

That is: you roll the dice out of my sight, and come back and say, "the blue die shows a 6."  How can I reduce my error?

Well, I should guess that the total of all three dice is 13.  That's because the first die is 6, and the expectation is that the other two will add up to 7.  The obvious strategy is just to add 7 to the number on the blue die, and guess that number.

If I do that, what's my average error?  Well, it's obviously 1.94, because it's exactly the same as trying to predict two dice.

Now, what if you tell me the numbers on two dice, the blue one and the red one?  Then, my best guess is that total, plus 3.5.  My mean error will now be 1.50, because this is exactly the same as the case of guessing one die.

And, what if you tell me the numbers on all three dice?  Now, my best guess is the total, and it's going to be exactly right, for an error of zero.

Here's another summary, this time of what my mean error is depending on what information you give me:

2.42 -- no information
1.94 -- given the number on the blue die
1.50 -- also given the number on the red die
0.00 -- also given the number on the green die

Since the blue die reduces my mean error from 2.42 to 1.94, it's "worth" 0.48.  Since the red die reduces my mean error from 1.94 to 1.50, it's "worth" 0.44.  Since the green die reduces my mean error from 1.50 to 0.00, it's "worth" 1.50:

0.48 -- value of blue die
0.44 -- value of red die
1.50 -- value of green die

Now, in a way, this doesn't make sense, does it?  The dice are identical except for their colors.  How can one be worth more than another?

What's really happening is that the difference isn't because of the color, but because of the *order*.  The last die reduces the error much more than the first two dice -- even by more than the first two dice combined. 

It's like, if you have a car with no tires.  Adding the first tire does you no good.  Adding the second tire does you no good.  Adding the third tire does you no good.  But, adding the fourth tire ... that fixes everything, and now you can drive away.  Even though the tires are identical, they have different values because of the different situations.

-----

But ... let's do the same thing again, but, this time, instead of the mean error, let's look at the mean SQUARE error.  That is, let's square all the errors before taking the average.

For one die, my best strategy is still to guess the average, 3.5.  One third of the time, the roll will be 1 or 6, which means my error will be 2.5, which means my *square* error will be 6.25.  One third of the time, the roll will be 2 or 5, so my error will be 1.5, and my *square* error will be 2.25.  Finally, one third of the time the roll will be 3 or 4, which means my error will be 0.5, which means my *square* error will be 0.25.

Average out all the square errors, and you get a mean square error of 2.92.

If you repeat this exercise for two dice, when I guess 7, you get a mean square error of 5.83.  And for three dice, when I guess 10.5, the mean square error is 8.75.

Again repeating the logic: if I have to guess three dice with no information, my mean square error will be 8.75.  If you tell me the number on the blue die, we now have the equivalent of the two-dice case, and the mean square error drops to 5.83.  If you tell me the values on the blue and red dice, we're now in the one-die case and the mean square error drops further, to 2.92.  And, finally, if you tell me all three dice, the error drops all the way to zero.

8.75 -- no information
5.83 -- given the number on the blue die
2.92 -- also given the number on the red die
0.00 -- also given the number on the green die

So if we again subtract to figure out how much each die is "worth", we get:

2.92 -- value of blue die
2.92 -- value of red die
2.92 -- value of green die

Ha!  Now, we have something that makes sense: each die is worth exactly the same value, regardless of where it came in the order.  When we didn't square the errors, the last die was worth much more than the other dice.  But, when we *do* square the errors, each die is worth the same!

How come?

It's the way God designed the math.  It just so happens that the errors of independent variables behave like the sides of a right triangle.  Just like, in a right triangle, "red squared" plus "blue squared" equals "hypotenuse squared," it's also true that "mean squared red error" plus "mean squared blue error" equals "mean squared red+blue error".

It's not something statisticians made up, any more than the right triangle Pythagorean Theorem is something geometers made up.  It's just the way the universe is.

------

So, the second way -- the one using squared errors -- is "nicer" in at least two ways.  First, the order in which we consider the dice doesn't matter.  Second, identical dice wind up with identical values.

This "niceness" is the reason that statisticians use the mean *square* error, instead of just the mean error.  The mean square error, as you probably know, is a common statistical term, called the "variance".  The mean error -- without squaring -- is formally the "mean absolute error".  It doesn't get used much in statistics at all, because it doesn't have those two nice properties (and others).

Instead of the mean error, statisticians will use the "standard deviation," or "SD", which is just the square root of the variance.  That is, it's the square root of the mean square error.  The SD is usually not too much different from the mean error.  For instance, in the two dice case, the mean error was 1.94.  The SD, on the other hand, is the square root of 5.83, which is 2.41.  They're roughly similar, though not equal.  (The SD is always equal to or larger than the mean error, but usually not too far off.  For a normal distribution, the SD is exactly 25.3 percent higher than the mean error.)
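That 25.3 percent figure checks out exactly: for a normal distribution, the mean absolute error is the SD times the square root of 2/pi, so the SD exceeds it by a factor of the square root of pi/2.  A one-liner confirms it:

```python
# For a normal distribution, mean absolute error = SD * sqrt(2/pi),
# so SD / (mean absolute error) = sqrt(pi/2).
import math

ratio = math.sqrt(math.pi / 2)
print(round((ratio - 1) * 100, 1))  # 25.3 -- percent by which SD exceeds mean error
```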

The statistical Pythagorean Theorem, then, says that if you have two independent variables, A and B, then

SD(A+B)^2  = SD(A)^2 + SD(B)^2
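You can watch it hold exactly for two dice: the variance of a single die is 2.92, and the variance of the two-dice total is 5.83 -- the sum.  A quick Python check:

```python
# Check the statistical Pythagorean Theorem for two independent dice:
# SD(A+B)^2 should equal SD(A)^2 + SD(B)^2.
from itertools import product
from statistics import pstdev

faces = range(1, 7)
var_one = pstdev(faces) ** 2                                    # variance of one die
var_sum = pstdev(a + b for a, b in product(faces, faces)) ** 2  # variance of the total

print(round(var_one + var_one, 2))  # 5.83
print(round(var_sum, 2))            # 5.83
```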

------

This is really important ... I learned it in the first statistics course I took, but I finished my entire statistics degree without realizing just how important it was, and how it makes certain regression results work.  I searched the internet, and I only found one page that talks about it: it's a professor from Cornell calling it "the second most important theorem" in statistics.  He's right ... it really is amazing, and useful, that it all works out. 

In the three dice case, the Pythagorean relationship meant that each die reduced the variance by the same amount, 2.92 out of 8.75 total.  That allows statisticians to say, "The blue die explains one-third of the variance.  The red die explains one-third of the variance.  And the green die explains one-third of the variance.  Therefore, together, the three dice explain 100% of the total variance."

Because it works out so neatly, it means statisticians are going to like to talk in variances, in squares of errors instead of just errors.  That's because they can just add them up, which they couldn't do if they didn't square them first.

And that's what r-squared is all about.

Let's suppose you actually ran an experiment.  You got someone to run 1,000 rolls of three dice each, and report to you (a) the total of the three dice, and (b) the number on the blue die.  He reports back to you with 1,000 lines like this:

Total was 13; blue die was 4
Total was 11; blue die was 6
Total was 08; blue die was 1
Total was 17; blue die was 6
...

With those 1,000 lines, you run a univariate regression, to try to predict the total from just the blue die.  The regression equation might wind up looking like this (which I made up):

Predicted total = 6.97 + (1.012 * blue die)

(The coefficients should "really" be 7 and 1.000, but they won't be exact because of random variation in the rolls of the green and red dice.)

And the regression software will spit out that the r-squared is approximately 0.33 -- meaning, the mean square error was reduced by approximately 1/3, when you change your guess from the average (around 10.5), to "6.97 + (1.012* blue die)".

That is: "the blue die explains 33% of the variance." 
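If you don't want to recruit a volunteer for 1,000 rolls, a simulation does fine.  Here's a sketch in plain Python -- the seed is arbitrary, and the coefficients and r-squared wobble a little from run to run, but they stay close to 7, 1, and one-third:

```python
# Simulate 1,000 rolls of three dice, then regress the total on the
# blue die by ordinary least squares, and compute the r-squared.
import random

random.seed(2012)  # arbitrary seed, just for a repeatable run

blue, total = [], []
for _ in range(1000):
    b, r, g = (random.randint(1, 6) for _ in range(3))
    blue.append(b)
    total.append(b + r + g)

n = len(blue)
mx = sum(blue) / n
my = sum(total) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(blue, total))
         / sum((x - mx) ** 2 for x in blue))
intercept = my - slope * mx

# r-squared: fraction of the variance in the total explained by the blue die
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(blue, total))
sst = sum((y - my) ** 2 for y in total)
r_squared = 1 - sse / sst

print(round(intercept, 2), round(slope, 2), round(r_squared, 2))
# intercept near 7, slope near 1, r-squared near 0.33
```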

------

This makes perfect sense.  But, as I argued in my 2006 post -- and others -- the figure 33 percent is limited in its practical application.  Because, in real life, you're normally not concerned about squared errors -- you're concerned about actual errors. 

Suppose you can't perfectly estimate how many customers your restaurant will have.  Sometimes, you order too many lobsters, and you have to throw out the extras, at a cost of $10 each.  Sometimes, you order too few, and you have to turn away customers, which again costs you $10 each.  Suppose the SD of the discrepancy is 10 lobsters a day. 

If a statistician comes along and says he can reduce your error by 50% of the variance, what does that really mean to you?  Well, your squared error was 100 square lobsters.  The statistician can reduce it to 50 square lobsters.  The square root of that is about 7 lobsters.  So, even though the statistician is claiming to reduce your error by half ... in practical terms, he's only reducing it from 10 lobsters to 7.  That's an improvement of 30 percent, not 50 percent, since your monetary loss is proportional to the number of lobsters, not the number of square lobsters.
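The arithmetic in that example, spelled out:

```python
# Halving the VARIANCE only shrinks the day-to-day error -- and the
# dollar cost, which is proportional to lobsters, not square lobsters --
# by about 30 percent.
import math

sd_before = 10                  # SD of the daily discrepancy, in lobsters
var_before = sd_before ** 2     # 100 "square lobsters"
var_after = var_before / 2      # the statistician halves the variance
sd_after = math.sqrt(var_after)

print(round(sd_after, 1))                  # 7.1 lobsters
print(round(1 - sd_after / sd_before, 3))  # 0.293 -- about a 30 percent improvement
```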

The r-squared of 50% is something that's meaningful to statisticians in some ways, but not always that meaningful to people in real life. 

Statements like "the model explains 50% of the variance" are useful in certain specific ways, that I'll talk about in a future post ... but it usually makes more sense, in my opinion, to think about things in lobsters, rather than in square lobsters.


