## Thursday, August 09, 2012

### The standard error vs. the mean absolute error

Every day, a restaurant owner estimates how many lobsters she'll need to order.  If she orders too many, she'll have to throw some out in the evening, at a cost of $10 each. If she orders too few, she may have to turn away customers, again at a cost of $10 each.

A statistician comes along, and analyzes her estimates over the past 1000 days.  It turns out that the mean of her errors is zero.  That's good, because it means her guesses are unbiased -- she's as likely to overestimate as underestimate.  (For instance, some days she's +5, and other days she's -5.)

The statistician also calculates that the standard deviation of her errors -- her "standard error" -- is 10, and that the errors are normally distributed.

Over the 1,000 days, then, how much money have the errors cost her?

If you asked me that question a few days ago, I would have said, well, the standard deviation is 10 ... so the typical error is 10 lobsters either way, or $100. That means, over 1,000 days, the total cost would be $100,000.

But now, I think that's not quite right.

To figure out what the errors cost, we need to know the mean [absolute] error.  But all we have is the standard deviation, which is the square root of the average square error.  Those are two different things.  And, as it turns out, the SD is always larger than (or rarely, equal to) the mean error.

For instance, suppose the mean is zero, and we have three errors, 0, +5, and -5.  The mean error is 3.33.  But the SD is 4.08.
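Those two numbers are easy to check. Here's a minimal Python sketch (the variable names are mine, just for illustration) that computes both the mean absolute error and the SD for the three errors:

```python
import math

errors = [0, 5, -5]  # the three example errors; their mean is zero

# Mean absolute error: the average size of the errors, ignoring sign
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

# SD: square root of the average squared error (the mean is zero here)
sd = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(round(mean_abs_error, 2), round(sd, 2))  # 3.33 4.08
```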

So, my guess of $100,000, based on the SD, is too high.

Can we get a better estimate? I think we can.

Since the normal distribution is symmetrical, the average error of the entire bell curve is the same as the average error for the right half of the bell curve. That right side is the "half normal distribution," and, conveniently, Wikipedia tells us the mean of that distribution is

sigma * sqrt(2/pi)

where "sigma" is the SD of the full bell curve.

The square root of 2/pi is approximately equal to 0.7979. Call it 0.8, for short.

That means that for a normal distribution, the average [absolute] error is 20% smaller than the SD. Which means that the cost of the lobster errors isn't $100,000 -- it's only $80,000.
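As a rough check on the arithmetic, here is a small Python sketch. It assumes the errors are exactly normal with an SD of 10, and that every lobster over or under costs $10; the variable names are hypothetical.

```python
import math

sd = 10                 # SD of the daily errors, in lobsters
cost_per_lobster = 10   # dollars per lobster over or under
days = 1000

# Mean of the half-normal distribution: sigma * sqrt(2/pi)
mean_abs_error = sd * math.sqrt(2 / math.pi)

naive_cost = sd * cost_per_lobster * days               # treats the SD as the typical error
expected_cost = mean_abs_error * cost_per_lobster * days

print(round(math.sqrt(2 / math.pi), 4))   # 0.7979
print(naive_cost, round(expected_cost))   # 100000 79788
```

The unrounded figure comes out a little under $80,000, because 0.7979 was rounded up to 0.8.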

That's something I didn't realize until just a couple of days ago.

------

What difference does it make?  Well, maybe not a lot, unless you're actually trying to estimate your average error (as our hypothetical restaurateur did).  But, if you know you're dealing with a normal distribution, why not just throw in the 20% discount when it's appropriate?

For instance, the normal approximation to binomial says that a .500 baseball team will average 81 wins out of 162 games, with a standard deviation of 6.36.  That suggests that if you have to guess a team's final W-L record, your typical error will be 6.36.

But, now we know that the mean error is only 80% of that.  Therefore, you should expect to be off by 5 wins, on average, not 6.4.
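Since the binomial is only approximately normal, it's fair to ask how good the 80% shortcut is here. The sketch below (Python, with names of my own choosing) compares the shortcut against the exact average of |wins - 81| over the binomial distribution, assuming a true .500 team and independent games:

```python
import math

games = 162
p = 0.5

# Normal approximation: SD of wins, then the 80% rule of thumb
sd = math.sqrt(games * p * (1 - p))
approx_mean_abs_error = sd * math.sqrt(2 / math.pi)

# Exact check: average |wins - 81|, weighted by the binomial probabilities
expected_wins = games * p
exact_mean_abs_error = sum(
    math.comb(games, w) * p ** w * (1 - p) ** (games - w) * abs(w - expected_wins)
    for w in range(games + 1)
)

print(round(sd, 2), round(approx_mean_abs_error, 2), round(exact_mean_abs_error, 2))
```

Both versions should land just above 5 wins, so the shortcut loses very little at this sample size.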

------

Along the same lines, I've always wondered why, when a regression looks for the best-fit straight line, it looks to minimize the sum of squared errors.  Shouldn't it be better to minimize the sum of absolute errors, even if that's not as mathematically elegant?

That is, suppose the restaurant owner hires you to try to reduce her losses.  You run a regression based on month, day of the week, whether there's a convention in town, and so on, in order to help estimate how many lobster-eating customers will arrive that day.

In that regression, wouldn't it be better to work to minimize the errors, rather than the squared errors?  It's more complicated mathematically, but it might give better estimates, in terms of lobster money saved.

I was never sure about that.  But, now that I know the "80%" relationship between SD and mean error, I realize it should lead to the same results.  When you find the line that minimizes the sum of squared errors, you must also be minimizing the sum of absolute errors.

Why?

If you minimize the sum of squared errors, you must also be minimizing the average of square errors.  And so you must also be minimizing the square root of that average (which is the SD).

If you minimize the SD, you must also be minimizing 80% of the SD.  Since regressions assume errors are normal, 80% of the SD is the mean error.

Therefore, if you minimize the sum of squared errors, you must simultaneously be minimizing the mean error.  QED, kind of.

------

There's one extra advantage, though, of minimizing sum of squared errors instead of just sum of absolute errors: using squared errors breaks ties nicely.

Suppose your series consists of 1, 3, 1, 3, 1, 3 ... repeated alternately.  What line best fits the data?

The obvious answer is: a horizontal line at 2.  It bisects the 1s and 3s perfectly.

But: if all you want to do is minimize the *absolute* errors, you can use a horizontal line at 1, or at 3, or at any value in between.  At 1, the errors will alternate between 0 and 2, for an average of 1.  At 3, the errors will alternate between 2 and 0, again for an average of 1.  At 2, the errors will all be 1, which is still an average of 1.

But ... the horizontal line at 2 seems "righter" than a line at 1, or 3, or another value.  It seems that "right in the middle" should somehow come up as the default answer.

And that's the advantage of sum of *squared* errors.  It breaks the tie, in favor of the "2".

A horizontal line at 1 will alternate squared errors between 0 and 4, for an average of 2.  But a horizontal line at 2 will have an average squared error of 1.
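The tie, and the tie-break, can be seen in a few lines of Python. This is just an illustration using six points of the series and three candidate horizontal lines; the function names are mine.

```python
series = [1, 3] * 3  # 1, 3, 1, 3, 1, 3

def mean_abs_error(level):
    """Average absolute error of a horizontal line at `level`."""
    return sum(abs(y - level) for y in series) / len(series)

def mean_sq_error(level):
    """Average squared error of a horizontal line at `level`."""
    return sum((y - level) ** 2 for y in series) / len(series)

for level in (1, 2, 3):
    print(level, mean_abs_error(level), mean_sq_error(level))
# 1 1.0 2.0
# 2 1.0 1.0
# 3 1.0 2.0
```

The absolute errors can't tell the three lines apart, but the squared errors pick out the line at 2.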

------

The moral, as I see it:  Regressions find the best fit line based on minimizing the sum of squared errors.  That works very well.  However, meaningful real-life conclusions often are better expressed in actual errors.  Fortunately, for standard regressions, the mean error is easy to estimate -- it's just 80% of the standard error that the regression reports.

---

UPDATE: As commenter David explained, and eventually got through my thick skull (see the comments), the minimum sum of squared errors is unbiased for the mean, while the minimum sum of absolute errors is unbiased for the median.  So it's not just that the "square" breaks ties -- it also targets the mean, which is usually what you're interested in.


At Thursday, August 09, 2012 1:00:00 PM,  David said...

Minimizing the sum of absolute errors gives you an estimate of the conditional MEDIAN, whereas minimizing sum of squared errors gives you a conditional MEAN.

They are not the same, as shown by your example. If the data are 1,3,1,3,... and you regress on only a constant, minimizing sum of squared deviations gives you the mean (also zero slope, but the slope just complicates my point so I'm leaving it out).

Now, think about minimizing absolute deviations. The reason you get an indeterminate answer for the sum of absolute deviations is that the median is undefined due to the equal numbers of 1's and 3's.

Now imagine that the data are 2.5, 1, 3, 1, 3, etc. Then the value that minimizes the sum of absolute deviations is exactly 2.5 (or whatever number is in the first position, as long as it's between 1 and 3) for precisely the reasons you gave before. Moving between 1 and 3 has no effect on the sum of absolute deviations except for how close it is to 2.5. It's not a far jump to see how this still happens even without the data being almost all 1's and 3's. So, the estimate is exactly the median.
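A quick way to see this is a brute-force search. The sketch below is only an illustration: it assumes the data are 2.5 followed by two 1's and two 3's, and it tries constants from 1.0 to 3.0 in steps of 0.1 rather than solving anything exactly.

```python
import statistics

data = [2.5, 1, 3, 1, 3]
candidates = [k / 10 for k in range(10, 31)]  # 1.0, 1.1, ..., 3.0

# Constant that minimizes the sum of absolute deviations
best_abs = min(candidates, key=lambda c: sum(abs(y - c) for y in data))

# Constant that minimizes the sum of squared deviations
best_sq = min(candidates, key=lambda c: sum((y - c) ** 2 for y in data))

print(best_abs, statistics.median(data))
print(best_sq, statistics.mean(data))
```

The absolute-deviation search should land on the median (2.5) and the squared-deviation search on the mean (2.1).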

Things get a little more complicated when you add a slope coefficient. But just like regular regression gives you a conditional mean, min abs dev gives you a conditional median.

Long response with a short moral: sum of squares and sum of abs dev do not lead to the same thing.

(For a more formal description, look up quantile regression on Wikipedia. Wherever they put \tau replace it with 0.5 for the median)

Then, the big question becomes: why do we use a regression that gives conditional means instead of conditional medians?

At Thursday, August 09, 2012 1:14:00 PM,  David said...

One more thought: given what I said above, I was trying to figure out what is wrong in your proof sketch. Turns out, nothing! Everything you said is, I think, correct ASSUMING the distribution is normal. This is because the conditional mean and the conditional median are the same (bell curve is symmetric). But if it's not normal (more generally, symmetric), then you're getting a median.

At Thursday, August 09, 2012 1:57:00 PM,  Phil Birnbaum said...

Yes, my little proof was meant for the normal case only.

What if you have equal numbers of 1s and 3s, and a single 2.1? Then, the best fit horizontal line will no longer be the median. The median will be 2.1, but the best fit will be lower than that.

At Thursday, August 09, 2012 2:32:00 PM,  David said...

I'm imagining a situation where we're only fitting a constant, not a line. If we're only fitting a constant, then we'll get the median, whether it is 2.1, 2.5, or 2.9. Agree?

Now, suppose I want to fit a line. I need to know what’s going on with X in addition to Y. To keep me from tying my head in knots, let’s go with something simple. Suppose, I have six observations:

| X | Y |
|---|---|
| 0 | 1 |
| 0 | 3 |
| 0 | 3 |
| 1 | 1 |
| 1 | 1 |
| 1 | 3 |

The least squares line will go through the mean at each value of X, (0,7/3) and (1,5/3). That gives constant 7/3 and slope -2/3.

To minimize the sum of absolute deviations, I will pick a line that goes through the medians, hitting the points (0,3) and (1,1). Why? Suppose I didn’t. Suppose at X=0 the line went through the point on the least squares line, which was (0,7/3). Well, then we could reduce the sum of absolute deviations by increasing it. When we increase by a small amount, say 0.1, it gets 0.1 closer to the two data points at 3 and 0.1 further away from the one at 1. Since we care about absolute distances, that’s a gain of 2*0.1 – 0.1 = +0.1. We can get that gain until we reach the median. At that point, the number of data points we leave behind is the same as we’re getting closer to, and the line stops.

So, the least absolute deviations line has to go through (0,3) and (1,1) for a constant of 3 and slope of -2.

(The reason least squares goes exactly through the mean and absolute deviations goes exactly through the median in this case is because the line has two parameters to fit only two X values in the data. With more X values, you start getting a linear approximation.)
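For anyone who wants to check the two fits on the six observations, here is a hedged Python sketch. The least squares part uses the usual closed-form formulas. The least absolute deviations part is a brute-force search of my own, not a standard library routine: it relies on the assumption that some best line passes through two data points, so it only tries lines through pairs of points with different X values.

```python
points = [(0, 1), (0, 3), (0, 3), (1, 1), (1, 1), (1, 3)]

# Least squares: closed-form slope and intercept
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
ls_slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in points)
    / sum((x - mean_x) ** 2 for x, _ in points)
)
ls_intercept = mean_y - ls_slope * mean_x

# Least absolute deviations: try every line through two data points
# with different X values, and keep the one with the smallest total.
def sum_abs_dev(intercept, slope):
    return sum(abs(y - (intercept + slope * x)) for x, y in points)

candidates = set()
for x1, y1 in points:
    for x2, y2 in points:
        if x1 != x2:
            slope = (y2 - y1) / (x2 - x1)
            candidates.add((y1 - slope * x1, slope))

lad_intercept, lad_slope = min(candidates, key=lambda line: sum_abs_dev(*line))

print(ls_intercept, ls_slope)    # roughly 7/3 and -2/3
print(lad_intercept, lad_slope)  # the line through (0,3) and (1,1)
```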

You could also do this in such a way that you get a least squares line with slope zero and a least absolute deviations line with non-zero slope:

| X | Y |
|---|---|
| 0 | 1 |
| 0 | 2.1 |
| 0 | 3.9 |
| 1 | 1 |
| 1 | 2.9 |
| 1 | 3.1 |

At Thursday, August 09, 2012 2:39:00 PM,  David said...

Maybe more to the point, I'm getting increasingly convinced that your rule of thumb for errors (the main point) is correct. In big samples, the Central Limit Theorem will give us a normal distribution for our regression estimate. So, then your approximation works fine in practice, regardless of whether the data has a normal distribution. It's just the other claim, about minimizing absolute vs. squared deviations, that requires a stronger assumption.

The median regression stuff may be somewhat off topic then...I just find it useful to know what difference it would make to minimize absolute deviations.

At Thursday, August 09, 2012 2:59:00 PM,  Phil Birnbaum said...

Sorry, yes, I meant a constant.

Suppose you have fifty 0s, fifty 100s, and a 2. The median is 2. But the point that minimizes the mean error is close to 50.

At Thursday, August 09, 2012 3:35:00 PM,  David said...

In that case:

| Estimate | Sum of absolute errors |
|---|---|
| 2 | (2-0)*50 + (100-2)*50 = 100*50 = 5,000 |
| 50 | (50-0)*50 + (100-50)*50 + (50-2)*1 = 100*50 + 48*1 = 5,048 |

The best estimate for absolute deviations is 2.

As long as your estimate is between 0 and 100, you'll always get that 5,000 of error. Any distance you move toward 100 is moving away from 0. So, the only gain you get is by eliminating the deviation from the median.

At Thursday, August 09, 2012 3:36:00 PM,  David said...

Sorry, unclear before. Hopefully this is better:

| Estimate | Sum of absolute errors |
|---|---|
| 2 | (2-0)*50 + (100-2)*50 = 100*50 = 5,000 |
| 50 | (50-0)*50 + (100-50)*50 + (50-2)*1 = 100*50 + 48*1 = 5,048 |

At Thursday, August 09, 2012 3:37:00 PM,  David said...

One more time:

If the estimate is 2, sum of absolute errors is: (2-0)*50 + (100-2)*50 = 100*50 = 5,000

If the estimate is 50, sum of absolute errors is: (50-0)*50 + (100-50)*50 + (50-2)*1 = 100*50 + 48*1 = 5,048
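The same two totals can be reproduced with a short Python sketch (the function name is just for illustration):

```python
import statistics

data = [0] * 50 + [100] * 50 + [2]  # fifty 0s, fifty 100s, and a 2

def sum_abs_errors(estimate):
    """Total absolute error if every value is guessed to be `estimate`."""
    return sum(abs(y - estimate) for y in data)

print(statistics.median(data))                   # 2
print(sum_abs_errors(2), sum_abs_errors(50))     # 5000 5048
```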

At Thursday, August 09, 2012 4:29:00 PM,  Phil Birnbaum said...

You're right. Sorry for being so dense!

At Thursday, August 09, 2012 4:34:00 PM,  David said...

Not at all! I have been and will continue to be wrong in your comments frequently...

At Wednesday, October 02, 2013 3:28:00 PM,  Anonymous said...

You're not alone in making your initial mistake; one study found that around 95% of financial professionals made exactly the same mistake:

http://www-stat.wharton.upenn.edu/~steele/Courses/434/434Context/Volatility/ConfusedVolatility.pdf

One wacky place it shows up is when you've got a Cauchy distribution instead of a normal distribution; the Cauchy distribution has no mean, but it does have a median. It ~looks~ like it should have a mean, but it doesn't.

At Thursday, March 19, 2015 5:18:00 AM,  Rune Nielsen said...

Have you considered that an overestimated number of customers is more expensive than an underestimated number??