## Wednesday, May 06, 2020

### Regression to higher ground

We know that if an MLB team wins 76 games in a particular season, it's probably a better team than its record indicates. To get its talent from its record, we have to regress to the mean.

Tango has often said that straightforward regression to the mean sometimes isn't right -- you have to regress to the *specific* mean you're concerned with. If Wade Boggs hits .280, you shouldn't regress him towards the league average of .260. You should regress him towards his own particular mean, which is more like .310 or something.

This came up when I was figuring regression to the mean for park factors. To oversimplify for purposes of this discussion: the distribution of hitters' parks in MLB is bimodal. There's Coors Field, and then everyone else. Roughly like this random pic I stole from the internet:

Now, suppose you have a season of Coors Field that comes in at 110. If you didn't know the distribution was bimodal, you might regress it back to the mean of 100, by moving it to the left. But if you *do* know that the distribution is bimodal, and you can see the 110 belongs to the hump on the right, you'd regress it to the Coors mean of 113, by moving it to the right.

But there are times when there is no obvious mean to regress to.

--------

You have a pair of perfectly fair 9-sided dice. You want to count the number of rolls it takes before you roll your first snake eyes (which has a 1 in 81 chance each roll). On average, you expect it to take 81 rolls, but that can vary a lot.

You don't have a perfect count of how many rolls it took, though. Your counter is randomly inaccurate with an SD of 6.4 rolls (coincidentally the same as the SD of luck for team wins).

You start rolling. Eventually you get snake eyes, and your counter estimates that it took 76 rolls. The mean is 81. What's your best estimate of the actual number?

This time, it should be LOWER than 76. You actually have to regress AWAY from the mean.

-------

Let's go back to the usual case for a second, where a team wins 76 games. Why do we expect its talent to be higher than 76? Because there are two possibilities:

(a) its talent was lower than 76, and it got lucky; or
(b) its talent was higher than 76, and it got unlucky.

But (b) is more likely than (a), because the team's true talent will be higher than 76 more often than it'll be lower than 76.

You can see that from this graph that represents distribution of team talent:

The blue bars are the times that talent was less than 76, and the team got lucky.  The pink bars are the times the talent was more than 76, and the team got unlucky.

The blue bars around 76 are shorter than the pink bars around 76. That means better teams getting unlucky are more common than worse teams getting lucky, so the average talent must be higher than 76.
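To put a rough number on that, here's a minimal sketch of the usual normal-prior shortcut. The 6.4-win SD of luck and the mean of 81 come from this post; the talent SD of 9 wins is purely an assumption I'm plugging in for illustration, and `regress_to_mean` is just a name made up for this example.

```python
def regress_to_mean(observed, mean, sd_true, sd_error):
    """Estimate the true value from a noisy observation.

    Assumes the true value is normally distributed around `mean` with
    SD `sd_true`, and the observation adds independent normal noise
    with SD `sd_error`.
    """
    # Share of the observed variance that comes from real differences.
    reliability = sd_true**2 / (sd_true**2 + sd_error**2)
    # Keep that share of the gap from the mean; the rest is treated as luck.
    return mean + reliability * (observed - mean)


# sd_true=9.0 is an assumed talent SD, not a figure from the post.
estimate = regress_to_mean(observed=76, mean=81, sd_true=9.0, sd_error=6.4)
print(round(estimate, 2))
```

With those inputs, the estimate lands a bit under 78: higher than the 76 the team actually won, and part of the way back to 81. A different assumed talent SD would change how far it moves, but not the direction.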

But the dice case is different. Here's the distribution of when the first snake-eyes (1 in 81 chance) appears:

The mean is still 81, but, this time, the curve slopes down at 76, not up.

Which means: it's more likely that you rolled fewer than 76 times and counted too high, than that you rolled more than 76 times and counted too low.

Which means that to estimate the actual number of rolls, you have to regress *down* from 76, which is *away* from the mean of 81.
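You can check that by brute force. The sketch below weights every possible true count by how likely it is in the first place (the 1-in-81 curve) and by how likely the counter would be to read 76 from there. One assumption to flag: I'm treating the counter error as normal with SD 6.4, which the setup above doesn't actually specify. The function name and the 5,000-roll cutoff are also just choices for this example.

```python
import math

P_SNAKE_EYES = 1 / 81   # chance of snake eyes on each roll
SD_COUNTER = 6.4        # SD of the counting error


def posterior_mean_rolls(observed, p=P_SNAKE_EYES, sd=SD_COUNTER, max_rolls=5000):
    """Best estimate (posterior mean) of the true roll count.

    Prior: first success on roll n has probability (1-p)^(n-1) * p.
    Likelihood: counter reading is Normal(n, sd) -- an assumed error model.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for n in range(1, max_rolls + 1):
        prior = (1 - p) ** (n - 1) * p
        likelihood = math.exp(-0.5 * ((observed - n) / sd) ** 2)
        weight = prior * likelihood
        total_weight += weight
        weighted_sum += n * weight
    return weighted_sum / total_weight


best_guess = posterior_mean_rolls(76)
print(round(best_guess, 2))
```

Under that normal-error assumption, the best guess comes out around 75.5, about half a roll below the 76 on the counter, and so a little farther from 81.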

--------

That logic -- let's call it the "Dice Method" -- seems completely correct, right?

But, the standard "Tango Method" contradicts it.

The SD of the distribution of the dice graph is around 80.5. The SD of the counting error is 6.4. So we can calculate:

SD(true)     = 80.5
SD(error)    =  6.4
--------------------
SD(observed) = 80.75

By the Tango method, we have to regress by (6.4/80.75)^2, which is less than 1% of the way to the mean. Small, but still towards the mean!
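Here's that arithmetic as a short script, in case you want to plug in other numbers. The SD of the dice distribution uses the standard formula for "number of tries until first success," sqrt(1-p)/p; the variable names are just mine.

```python
import math

p = 1 / 81                                  # chance of snake eyes per roll
sd_true = math.sqrt(1 - p) / p              # SD of rolls until first snake eyes
sd_error = 6.4                              # SD of the counting error
sd_observed = math.sqrt(sd_true**2 + sd_error**2)

# Fraction of the way to regress toward the mean.
fraction = (sd_error / sd_observed) ** 2

# Applied to a count of 76 with a mean of 81.
tango_estimate = 76 + fraction * (81 - 76)

print(round(sd_true, 2), round(sd_observed, 2))
print(round(fraction, 4), round(tango_estimate, 2))
```

The fraction works out to a bit over half a percent, so a count of 76 moves only a few hundredths of a roll toward 81.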

-- Dice Method: regress away from the mean
-- Tango Method: regress towards the mean

Which is correct?

They both are.

The Tango Method is correct on average. The Dice Method is correct in this particular case.

If you don't know how many rolls you counted, you use the Tango Method.

If you DO know that the count was 76 rolls, you use the Dice Method.

------

Side note:

The Tango Method's regression to the mean looked wrong to me, but I think I figured out where it comes from.

Looking at the graph at a quick glance, it looks like you should always regress to the left, because the left side of every point is always higher than the right side of every point. That means that if you're below the mean of 81, you regress away from the mean (left). If you're above the mean of 81, you regress toward the mean (still left).

But, there are a lot more datapoints to the left of 81 than to the right of 81 -- by a ratio of about 64 percent to 36 percent. So, overall, it looks like the average should be regressing away from the mean.

Except ... it's not true that the left is always higher than the right. Suppose your counter said "1". You know the correct count couldn't possibly have been zero or less, so you have to regress to the right.

Even if your counter said "2" ... sure, a true count of 1 is more likely than a true count of 3, but 4, 5, and 6 are more likely than 0, -1, or -2. So again you have to regress to the right.

Maybe the zero/negative logic is a factor when you have, say, 8 rolls or fewer, just to give a gut estimate. Those might constitute, say, 10 percent of all snake eyes rolled.

So, the overall "regress less than 1 percent towards the mean of 81" is the average of:

-- 36% regress left  towards the mean a bit (>81)
-- 54% regress left  away from the mean a bit (9-81)
-- 10% regress right towards the mean a lot (<= 8)
-----------------------------------------------------
-- Overall average: regress towards the mean a tiny bit.
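Here's a rough check on that breakdown. It computes how often the true count falls in each of the three ranges, then looks at which way the best estimate moves for a few sample counter readings. As before, the normal shape of the counter error is my assumption, and the function names and sample readings are just picked for illustration.

```python
import math

P = 1 / 81    # chance of snake eyes per roll
SD = 6.4      # SD of the counting error (assumed normal)


def prob_at_most(n, p=P):
    """Chance the first snake eyes shows up on or before roll n."""
    return 1 - (1 - p) ** n


def posterior_shift(observed, p=P, sd=SD, max_rolls=5000):
    """Best estimate of the true count minus the counter reading.

    Positive means regress right (up); negative means regress left (down).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for n in range(1, max_rolls + 1):   # true counts below 1 are impossible
        weight = (1 - p) ** (n - 1) * math.exp(-0.5 * ((observed - n) / sd) ** 2)
        weighted_sum += n * weight
        total_weight += weight
    return weighted_sum / total_weight - observed


share_low = prob_at_most(8)                       # 8 rolls or fewer
share_mid = prob_at_most(81) - prob_at_most(8)    # 9 through 81
share_high = 1 - prob_at_most(81)                 # more than 81
print(round(share_low, 3), round(share_mid, 3), round(share_high, 3))

for reading in (1, 2, 8, 40, 76, 120):
    print(reading, round(posterior_shift(reading), 2))
```

The three shares come out close to the 10/54/36 split above. The readings near the bottom shift to the right, and the larger ones shift left by roughly the same small amount whether they're below or above 81. Exactly where the direction flips depends on the assumed error model, so treat the cutoff of 8 as the gut estimate it was.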

--------

The "Tango Method" and the "Dice Method" are just consequences of Bayes' Theorem that are easier to implement than doing all the Bayesian calculations every time. The Tango Method is a mathematically precise consequence of Bayes' Theorem, and the Dice Method is a heuristic from eyeballing. Tango's "regress to the specific mean" is another Bayes heuristic.

We can reduce the three methods into one by noting what they have in common -- they all move the estimate from lower on the curve to higher on the curve. So, instead of "regress to the mean," maybe we can say

"regress to higher ground."

That's sometimes how I think of Bayes' Theorem in my own mind. In fact, I think you can explain Bayes exactly, as a more formal method of figuring where the higher ground is, by explicitly calculating how much to weight the closer ground relative to the distant ground.