The standard error vs. the mean absolute error
Every day, a restaurant owner estimates how many lobsters she'll need to order. If she orders too many, she'll have to throw some out in the evening, at a cost of $10 each. If she orders too few, she may have to turn away customers, again at a cost of $10 each.
A statistician comes along, and analyzes her estimates over the past 1,000 days. It turns out that the mean of her errors is zero. That's good, because it means her guesses are unbiased -- her overestimates and underestimates cancel out on average. (For instance, some days she's +5, and other days she's -5.)
The statistician also calculates that the standard deviation of her errors -- her "standard error" -- is 10, and that the errors are normally distributed.
Over the 1,000 days, then, how much money have the errors cost her?
If you asked me that question a few days ago, I would have said, well, the standard deviation is 10 ... so the typical error is 10 lobsters either way, or $100. That means, over 1,000 days, the total cost would be $100,000.
But now, I think that's not quite right.
To figure out what the errors cost, we need to know the mean [absolute] error. But all we have is the standard deviation, which is the square root of the average squared error. Those are two different things. And, as it turns out, the SD is never smaller than the mean absolute error; the two are equal only when every error has the same magnitude.
For instance, suppose the mean is zero, and we have three errors, 0, +5, and -5. The mean absolute error is 10/3, or 3.33. But the SD is the square root of 50/3, or 4.08.
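Here's a quick way to check that arithmetic. It's just a sketch; the variable names are mine, and the SD is taken around the known mean of zero (dividing by 3, not 2).

```python
import math

errors = [0, 5, -5]

# Mean absolute error: the average of the error magnitudes.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

# SD around the known mean of zero: the square root of the
# average squared error (population form, dividing by n).
sd = math.sqrt(sum(e * e for e in errors) / len(errors))

print(round(mean_abs_error, 2))  # 3.33
print(round(sd, 2))              # 4.08
```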
So, my guess of $100,000, based on the SD, is too high. Can we get a better estimate?
I think we can.
Since the normal distribution is symmetrical, the average error of the entire bell curve is the same as the average error for the right half of the bell curve.
That right side is the "half normal distribution," and, conveniently, Wikipedia tells us the mean of that distribution is
sigma * sqrt(2/pi)
where "sigma" is the SD of the full bell curve.
The square root of 2/pi is approximately equal to 0.7979. Call it 0.8, for short.
That means that for a normal distribution, the average [absolute] error is 20% smaller than the SD. Which means that the cost of the lobster errors isn't $100,000 -- it's only $80,000.
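If you'd rather see the numbers than take Wikipedia's word for it, here's a rough sketch. It computes the exact figure from the formula and then checks it against simulated normal errors. The seed, the number of simulated draws, and the variable names are arbitrary choices of mine.

```python
import math
import random

SIGMA = 10             # SD of the daily error, in lobsters
COST_PER_LOBSTER = 10  # dollars per lobster over or under
DAYS = 1000

# Mean absolute error of a zero-mean normal: sigma * sqrt(2/pi).
exact_mae = SIGMA * math.sqrt(2 / math.pi)
exact_cost = exact_mae * COST_PER_LOBSTER * DAYS

# Simulation check: average the magnitudes of many normal draws.
rng = random.Random(0)
draws = [rng.gauss(0, SIGMA) for _ in range(200_000)]
simulated_mae = sum(abs(d) for d in draws) / len(draws)

print(round(exact_mae, 2))  # 7.98
print(round(exact_cost))    # 79788
print(round(simulated_mae, 2))
```

The simulated figure should land very close to the exact one; it won't match to the last decimal, because it's a random sample.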
That's something I didn't realize until just a couple of days ago.
What difference does it make? Well, maybe not a lot, unless you're actually trying to estimate your average error (as our hypothetical restaurateur did). But, if you know you're dealing with a normal distribution, why not just throw in the 20% discount when it's appropriate?
For instance, the normal approximation to binomial says that a .500 baseball team will average 81 wins out of 162 games, with a standard deviation of 6.36. That suggests that if you have to guess a team's final W-L record, your typical error will be 6.36.
But now we know that the mean absolute error is only about 80% of that. Therefore, you should expect to be off by about 5 wins, on average, not 6.4.
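For the baseball case, you don't even need the normal approximation: the binomial is small enough to add up directly. Here's a sketch that compares the 80% shortcut to the exact binomial answer, assuming 162 independent games at a 50% win probability. The names are mine.

```python
import math

GAMES = 162
P_WIN = 0.5

mean_wins = GAMES * P_WIN
sd = math.sqrt(GAMES * P_WIN * (1 - P_WIN))

# The shortcut: SD times sqrt(2/pi), roughly 80% of the SD.
shortcut_mae = sd * math.sqrt(2 / math.pi)

# The exact mean absolute error, summed over every possible win total.
exact_mae = sum(
    math.comb(GAMES, w) * P_WIN**w * (1 - P_WIN) ** (GAMES - w) * abs(w - mean_wins)
    for w in range(GAMES + 1)
)

print(round(sd, 2))            # 6.36
print(round(shortcut_mae, 2))  # 5.08
print(round(exact_mae, 2))
```

The exact figure comes out within a few hundredths of a win of the shortcut, so the "80%" rule holds up fine here.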
Along the same lines, I've always wondered why, when a regression looks for the best-fit straight line, it looks to minimize the sum of squared errors. Shouldn't it be better to minimize the sum of absolute errors, even if that's not as mathematically elegant?
That is, suppose the restaurant owner hires you to try to reduce her losses. You run a regression based on month, day of the week, whether there's a convention in town, and so on, in order to help estimate how many lobster-eating customers will arrive that day.
In that regression, wouldn't it be better to work to minimize the errors, rather than the squared errors? It's more complicated mathematically, but it might give better estimates, in terms of lobster money saved.
I was never sure about that. But, now that I know the "80%" relationship between SD and mean error, I realize it should lead to the same results. When you find the line that minimizes the sum of squared errors, you must also be minimizing the sum of absolute errors.
If you minimize the sum of squared errors, you must also be minimizing the average of the squared errors. And so you must also be minimizing the square root of that average (which is the SD).
If you minimize the SD, you must also be minimizing 80% of the SD. Since regressions assume errors are normal, 80% of the SD is the mean absolute error.
Therefore, if you minimize the sum of squared errors, you must simultaneously be minimizing the mean error. QED, kind of.
There's one extra advantage, though, of minimizing sum of squared errors instead of just sum of absolute errors: using squared errors breaks ties nicely.
Suppose your series consists of 1, 3, 1, 3, 1, 3 ... alternating like that all the way along. What line best fits the data?
The obvious answer is: a horizontal line at 2. It bisects the 1s and 3s perfectly.
But: if all you want to do is minimize the *absolute* errors, you can use a horizontal line at 1, or at 3, or at any value in between. At 1, the errors will alternate between 0 and 2, for an average of 1. At 3, the errors will alternate between 2 and 0, again for an average of 1. At 2, the errors will all be 1, which is still an average of 1.
But ... the horizontal line at 2 seems "righter" than a line at 1, or 3, or another value. It seems that "right in the middle" should somehow come up as the default answer.
And that's the advantage of sum of *squared* errors. It breaks the tie, in favor of the "2".
A horizontal line at 1 will alternate squared errors between 0 and 4, for an average of 2. But a horizontal line at 2 will have an average squared error of 1.
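Here's a small sketch of that tie-breaking, trying horizontal lines at a few levels. The series length of 100 and the function names are arbitrary choices of mine.

```python
series = [1, 3] * 50  # 1, 3, 1, 3, ...

def mean_abs_error(level):
    """Average absolute error of a horizontal line at `level`."""
    return sum(abs(y - level) for y in series) / len(series)

def mean_sq_error(level):
    """Average squared error of a horizontal line at `level`."""
    return sum((y - level) ** 2 for y in series) / len(series)

# Absolute error is tied everywhere from 1 to 3;
# squared error is smallest at 2.
for level in (1, 1.5, 2, 2.5, 3):
    print(level, mean_abs_error(level), mean_sq_error(level))
```

Every level prints an average absolute error of 1.0, while the average squared error runs 2.0, 1.25, 1.0, 1.25, 2.0, with the minimum right in the middle.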
The moral, as I see it: Regressions find the best-fit line by minimizing the sum of squared errors. That works very well. However, meaningful real-life conclusions are often better expressed in actual (absolute) errors. Fortunately, for standard regressions with normal errors, the mean absolute error is easy to estimate -- it's just about 80% of the standard error that the regression reports.
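To see the 80% rule at work in a regression, here's a sketch using made-up data. The true line (y = 3 + 2x), the noise SD of 10, the sample size, and the seed are all hypothetical choices of mine. It fits a one-variable least-squares line by the usual closed-form formulas, then compares the mean absolute residual to the residual standard error.

```python
import math
import random

rng = random.Random(1)  # arbitrary seed

# Hypothetical data: a straight line plus normal noise with SD 10.
xs = [rng.uniform(0, 100) for _ in range(5000)]
ys = [3 + 2 * x + rng.gauss(0, 10) for x in xs]

# Ordinary least squares with one predictor, closed form.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Residual standard error (n - 2 degrees of freedom) and mean absolute residual.
std_error = math.sqrt(sum(r * r for r in residuals) / (n - 2))
mean_abs_error = sum(abs(r) for r in residuals) / n

ratio = mean_abs_error / std_error
print(round(ratio, 2))
```

With normal noise, the ratio should come out close to sqrt(2/pi), or about 0.8. With non-normal noise, there's no reason to expect that.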
UPDATE: As commenter David explained, and eventually got through my thick skull (see the comments), the value that minimizes the sum of squared errors is the mean, while the value that minimizes the sum of absolute errors is the median. So it's not just that the "square" breaks ties -- it also targets the mean, which is usually what you're interested in.
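Here's a sketch of David's point, using a small lopsided sample I made up so that the mean and median differ. It tries horizontal lines on a grid of candidate levels (0.0 to 12.0 in steps of 0.1, also my choice) and picks the best one under each criterion.

```python
import statistics

sample = [1, 2, 2, 3, 12]  # hypothetical, deliberately skewed

def sum_sq_error(level):
    return sum((y - level) ** 2 for y in sample)

def sum_abs_error(level):
    return sum(abs(y - level) for y in sample)

# Brute-force search over candidate levels.
candidates = [i / 10 for i in range(0, 121)]
best_by_squares = min(candidates, key=sum_sq_error)
best_by_absolutes = min(candidates, key=sum_abs_error)

# Squared errors land on the mean; absolute errors land on the median.
print(best_by_squares, statistics.mean(sample))
print(best_by_absolutes, statistics.median(sample))
```

For this sample, the squared-error winner is 4 (the mean) and the absolute-error winner is 2 (the median). With symmetric errors like the normal, the mean and median coincide, which is why the two criteria agree there.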