Wednesday, August 05, 2020

The NEJM hydroxychloroquine study fails to notice its largest effect

Before hydroxychloroquine was a Donald Trump joke, the drug was considered a promising possibility for prevention and treatment of Covid-19. It had been previously shown to work against respiratory viruses in the lab, and, for decades, it was safely and routinely given to travellers before departing to malaria-infested regions. A doctor friend of mine (who, I am hoping, will have reviewed this post for medical soundness before I post it) recalls having taken it before a trip to India.

Travellers start on hydroxychloroquine two weeks before departure; this gives the drug time to build up in the body. Large doses at once can cause gastrointestinal side effects, but since hydroxychloroquine has a very long half-life in the body -- three weeks or so -- you build it up gradually.

For malaria, hydroxychloroquine can also be used for treatment. However, several recent studies have found it to be ineffective in treating advanced Covid-19.

That leaves prevention. Can hydroxychloroquine be used to prevent Covid-19 infections? The "gold standard" would be a randomized double-blind placebo study, and we got one a couple of months ago, in the New England Journal of Medicine (NEJM). 

It found no statistically significant difference between the treatment and placebo groups, and concluded:

"After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19 or confirmed infection when used as postexposure prophylaxis within 4 days after exposure."

But ... after looking at the paper in more detail, I'm not so sure.


The study reported on 821 subjects who had been exposed, within the previous four days, to a patient who tested positive for Covid-19. They received doses of either hydroxychloroquine or placebo over the next five days (the first day's dose was a higher "loading dose"), and were followed over the next couple of weeks to see if they contracted the virus.

The results:

49 of 414 treatment subjects (11.8%) became infected
58 of 407   placebo subjects (14.3%) became infected.

That's about 17 percent fewer cases in patients who got the real drug. 

But that wasn't a large enough difference to show statistical significance, with only about 400 subjects in each group. The paper acknowledges this, stating that the study was designed with only enough power to detect a reduction of at least 50 percent, not the 17 percent reduction that actually appeared. Still, by the usual academic standards for this sort of thing, the authors were able to declare that "hydroxychloroquine did not prevent illness."

At this point I would normally rant about statistical significance and how "absence of evidence is not evidence of absence."  But I'll skip that, because there's something more interesting going on.


Recall that the study tested hydroxychloroquine on subjects who feared they had already been exposed to the virus. That's not really testing prevention ... it's testing treatment, albeit early treatment. It does have elements of prevention, since some subjects may not have been infected yet at that point, but would become infected later. (The study doesn't say explicitly, but I would assume some of the exposures were to family members, so repeated exposures over the next two weeks would be likely.)

Also: it took five days of dosing before the full course of hydroxychloroquine had been taken. That means a subject didn't get a full dose until up to nine days after exposure to the virus.

So this is where it gets interesting. Here's Figure 2 from the paper:

These lines are cumulative infections during the course of the study. As of day 5, there were actually more infections in the group that took hydroxychloroquine than in the group that got the placebo ... which is perhaps not that surprising, since the subjects hadn't finished their full doses until that fifth day. By day 10, the placebo group has caught up, and the two groups are about equal.

But now ... look what happens from Day 10 to Day 14. The group that got the hydroxychloroquine doesn't move much ... but the placebo group shoots up.

What's the difference in new cases? The study doesn't give the exact numbers that correspond to the graph, so I used a pixel ruler to measure the distances between points of the graph. It turns out that from Day 10 to Day 14, they found:

-- 11 new infections in the placebo group
--  2 new infections in the hydroxychloroquine group.

What is the chance that of 13 new infections, they would get split 11:2? 

About 1.12 percent one-tailed, 2.24 percent two-tailed.
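That's a simple binomial tail calculation. Here's a quick sketch to verify the arithmetic (assuming, as an approximation, that each new infection was equally likely to land in either group, since the two groups are nearly the same size):

```python
from math import comb

# 13 new infections; under the null, each lands in either group with p = 0.5.
n, p = 13, 0.5

# One-tailed: probability of an 11-or-worse split against placebo (11, 12, or 13).
one_tailed = sum(comb(n, k) for k in range(11, n + 1)) * p**n
two_tailed = 2 * one_tailed   # about 2.2 percent

print(round(one_tailed * 100, 2))  # 1.12 (percent)
```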

Now, I know that it's usually not legitimate to pick specific findings out of a study ... with 100 findings, you're bound to find one or two random ones that fall into that significance level. But this is not an arbitrary random pattern -- it's exactly what we would have expected to find if hydroxychloroquine worked as a preventative. 

It takes, on average, about a week for symptoms to appear after COVID-19 infection. So for those subjects in the "1-5" group, most were probably infected *before* the start of their hydroxychloroquine regimen (up to four days before, as the study notes). So those don't necessarily provide evidence of prevention. 

In the "6-10" group, we'd expect most of them to have already been infected before the drugs were administered; the reason they were admitted to the study in the first place was that they feared they had been exposed. So probably many of those who didn't experience symptoms until, say, Day 9 were already infected but had a longer incubation period. Also, most of the subsequently-infected subjects in that group probably got infected in the first five days, while they didn't yet have a full dose of the drug.

But in the last group, the "11-14" group, that's when you'd expect the largest preventative effect -- they'd have had a full dose of the drug for at least six days, and they were the most likely to have become infected only after the start of the trial.

And that's when the hydroxychloroquine group had an 84 percent lower infection rate than the placebo group.


In everything I've been reading about hydroxychloroquine and this study, I have not seen anyone notice this anomaly, that beyond ten days, there were almost seven times as many infections among those who didn't get the hydroxychloroquine. In fact, even the authors of the study didn't notice. They stopped the study on the basis of "futility" once they realized they were not going to achieve statistical significance (or, in other words, once they realized the reduction in infections was much less than the 50% minimum they would endorse). In other words: they stopped the study just as the results were starting to show up! 

And then the FDA, noting the lack of statistical significance, revoked authorization to use hydroxychloroquine.

I'm not trying to push hydroxychloroquine here ... and I'm certainly not saying that I think it will definitely work. If I had to give a gut estimate, based on this data and everything else I've seen, I'd say ... I dunno, maybe a 15 percent chance. Your guess may be lower. Even if your gut says there's only a one-in-a-hundred chance that this 84 percent reduction is real and not a random artifact ... in the midst of this kind of pandemic, isn't even 1 percent enough to say, hey, maybe it's worth another trial?

I know hydroxychloroquine is considered politically unpopular, and it's fun to make a mockery of it. But these results are strongly suggestive that there might be something there. If we all agree that Trump is an idiot, and even a stopped clock is right twice a day, can we try evaluating the results of this trial on what the evidence actually shows? Can we elevate common sense over the politics of Trump and the straitjacket of statistical significance, and actually do some proper science?


Sunday, May 17, 2020

Herd immunity comes faster when some people are more infectious

By now, we all know about "R0" and how it needs to drop below 1.0 for us to achieve "herd immunity" to the COVID virus. 

The estimates I've seen are that the "R0" (or "R") for the COVID-19 virus is around 4. That means, in a susceptible population with no interventions like social distancing, the average infected person will pass the virus on to 4 other people. Each of those four passes it on to four others, and each of those 16 newly-infected people pass it on to four others, and the incidence grows exponentially.

But, suppose that 75 percent of the population is immune, perhaps because they've already been infected. Then, each infected person can pass the virus on to only one other person (since the other three who would otherwise be infected, are immune). That means R0 has dropped from 4 to 1. With R0=1, the number of infected people will stay level. As more people become immune, R drops further, and the disease eventually dies out in the population.
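That 75 percent is just the standard herd-immunity threshold for a homogeneous population, 1 - 1/R0. A minimal sketch:

```python
# Classic herd-immunity threshold for a homogeneous population: 1 - 1/R0.
# With R0 = 4, each case would infect 4 others in a fully susceptible
# population; immunity in 3 of every 4 contacts brings that down to 1.
def herd_immunity_threshold(r0: float) -> float:
    return 1 - 1 / r0

print(herd_immunity_threshold(4))  # 0.75
print(herd_immunity_threshold(2))  # 0.5
```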

That's the argument most experts have been making, so far -- that we can't count on herd immunity any time soon, because we'd need 75 percent of the population to get infected first.

But that can't be right, as "Zvi" points out in a post on LessWrong*. 

(*I recommend LessWrong as an excellent place to look for good reasoning on coronavirus issues, with arguments that make the most sense.)

That's because not everyone is equal in terms of how much they're likely to spread the virus. In other words, everyone has his or her own personal R0. Those with a higher R0 -- people who don't wash their hands much, or shake hands with a lot of people, or just encounter more people for face-to-face interactions -- are also likely to become infected sooner. When they become immune, they drop the overall "societal" R0 more than if they were average.

If you want to reduce home runs in baseball by 75 percent, you don't have to eliminate 75 percent of plate appearances. You can probably do it by eliminating as little as, say, 25 percent, if you get rid of the top power hitters only.

As Zvi writes,

"Seriously, stop thinking it takes 75% infected to get herd immunity...

"... shame on anyone who doesn’t realize that you get partial immunity much bigger than the percent of people infected. 

"General reminder that people’s behavior and exposure to the virus, and probably also their vulnerability to it, follow power laws. When half the population is infected and half isn’t, the halves aren’t chosen at random. They’re based on people’s behaviors.

"Thus, expect much bigger herd immunity effects than the default percentages."


But to what extent does the variance in individual behavior affect the spread of the virus?  Is it just a minimal difference, or is it big enough that, for instance, New York City (with some 20 percent of people having been exposed to the virus) is appreciably closer to herd immunity than we think?

To check, I wrote a simulation. It is probably in no way actually realistic in terms of how well it models the actual spread of COVID, but I think we can learn something from the differences in what the model shows for different assumptions about individual R0.

I created 100,000 simulated people, and gave them each a "spreader rating" to represent their R0. The actual values of the ratings don't matter, except relative to the rest of the population. I created a fixed number of "face-to-face interactions" each day, and the chance of being involved in one is directly proportional to the rating. So, people with a rating of "8" are four times as likely to have a chance to spread/catch the virus as people with a rating of "2". 

For each of those interactions, if one person was infected and the other wasn't, there was a fixed probability of the infection spreading to the uninfected person.

For every simulation, I jigged the numbers to get the R0 to be around 4 for the first part of the pandemic, from the start until the point where around three percent of the population was infected. 

The simulation started with 10 people newly infected. I assumed that infected people could spread the virus only for the first 10 days after infection. 


The four simulations were:

1. Everyone has the same rating.

2. Everyone rolls a die until "1" or "2" comes up, and their spreader rating is the number of rolls it took. (On average, that's 3 rolls. But in a hundred thousand trials, you get some pretty big outliers -- I think there was typically a rating of 27 or higher, since 26 consecutive high rolls happens one time in 37,877.)

3. Same as #2, except that 1 percent of the population is a superspreader, with a spreader rating of 30. The first nine infected people were chosen randomly, but the tenth was always set to "superspreader."

4. Same as #3, but the superspreaders got an 80 instead of a 30.
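Roughly, the simulation looks like this -- a minimal sketch, not the actual code I ran. The population size, contacts per day, and transmission probability here are invented parameters, and I haven't jigged them to an R0 of 4:

```python
import random
from itertools import accumulate

def simulate(ratings, n_days=60, contacts_per_day=4000,
             p_transmit=0.10, infectious_days=10, n_seed=10, seed=0):
    """Cumulative infections by day; the chance of taking part in any
    face-to-face interaction is proportional to one's spreader rating."""
    rng = random.Random(seed)
    n = len(ratings)
    cum_w = list(accumulate(ratings))    # cumulative weights for fast sampling
    infected_on = [None] * n             # day each person got infected
    for i in rng.sample(range(n), n_seed):
        infected_on[i] = 0
    totals = []
    for day in range(1, n_days + 1):
        for _ in range(contacts_per_day):
            # draw both participants proportionally to their ratings
            a, b = rng.choices(range(n), cum_weights=cum_w, k=2)
            for x, y in ((a, b), (b, a)):
                if (infected_on[x] is not None
                        and day - infected_on[x] <= infectious_days
                        and infected_on[y] is None
                        and rng.random() < p_transmit):
                    infected_on[y] = day
        totals.append(sum(d is not None for d in infected_on))
    return totals

def die_rating(rng):
    """Simulation 2's rating: roll a die until a 1 or 2 comes up."""
    rolls = 1
    while rng.randint(1, 6) > 2:
        rolls += 1
    return rolls

n = 2000
rater = random.Random(1)
uniform = simulate([1] * n)                               # simulation 1
varied = simulate([die_rating(rater) for _ in range(n)])  # simulation 2
print(uniform[-1] / n, varied[-1] / n)                    # fraction ever infected
```

Because these parameters aren't calibrated, the varied-ratings run takes off faster early (its effective R0 is higher when high-rating people dominate the contacts); in the real runs, I jigged each scenario to start at R0 of around 4 so the early curves match.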


In the first simulation, everyone got the same rating. With an initial R0 of around 4, it did, indeed, take around 75 percent of the population to get infected before R0 dropped below 1.0. 

Overall, around 97 percent of the population wound up being infected before the virus disappeared completely.

Here's the graph:

The point where R0 drops below 1.0 is where the next day's increase is smaller than the previous day's increase. It's hard to eyeball that on the curve, but it's around day 32, where the total crosses the 75,000 mark.


As I mentioned, I jigged the other three curves so that for the first days, they had about the same R0 of around 4, so as to match the "everyone the same" graph.

Here's the graph of all four simulations for those first 22 days:

Aside from the scale, they're pretty similar to the curves we've seen in real life. Which means that, based on the data we've seen so far, we can't really tell from the numbers which simulation is closest to our true situation.

But ... after that point, as Zvi explained, the four curves do diverge. Here they are in full:

Big differences, in the direction that Zvi explained. The bigger the variance in individual R0, the more attenuated the progression of the virus.

Which makes sense. All four curves had an R0 of around 4.0 at the beginning. But the bottom curve was 99 percent with an average of 3 encounters, and 1 percent superspreaders with an average of 80 encounters. Once those superspreaders are no longer superspreading, the R0 plummets. 

In other words: herd immunity brings the curve under control by reducing opportunity for infection. In the bottom curve, eliminating the top 1% of the population reduces opportunity by 40%. In the top curve, eliminating 1% of the population reduces opportunity only by 2%.


For all four curves, here's where R0 dropped below 1.0:

75% -- all people the same
58% -- all different, no superspreaders
44% -- all different, superspreaders 10x average
20% -- all different, superspreaders 26x average

And here's the total number of people who ever got infected:

97% -- all people the same
81% -- all different, no superspreaders
65% -- all different, superspreaders 10x average
33% -- all different, superspreaders 26x average


Does it seem counterintuitive that the more superspreaders, the better the result?  How can more infecting make things better?

It doesn't. More *initial* infecting makes things better *only holding the initial R0 constant.*  

If the aggregate R0 is still only 4.0 after including superspreaders, it must mean that the non-superspreaders have an R0 significantly less than 4.0. You can think of a "R=4.0 with superspreaders" society like maybe a "R=2.0" society that's been infected by 1% gregarious handshaking huggers and church-coughers.

In other words, the good news is: if everyone were at the median, the overall R0 would be less than 4. It just looks like R0=4 because we're infested by dangerous superspreaders. Those superspreaders will more quickly turn benign and lower our R0 faster.


So, the shape of the distribution of spreaders matters a great deal. Of course, we don't know the shape of our distribution, so it's hard to estimate which line in the chart we're closest to. 

But we *do* know that we have at least a certain amount of variance -- some people shake a lot of hands, some people won't wear masks, some people are probably still going to hidden dance parties. So I think we can conclude that we'll need significantly less than 75 percent to get to herd immunity.

How much less?  I guess you could study data sources and try to estimate. I've seen at least one non-wacko argument that says New York City, with an estimated infection rate of at least 20 percent, might be getting close. Roughly speaking, that would be something like the fourth line on the graph, the one on the bottom.  

Which line is closest, if not that one?  My gut says ... given that we know the top line is wrong, and from what we know about human nature ... the second line from the top is a reasonable conservative assumption. Changing my default from 75% to 58% seems about right to me. But I'm pulling that out of my gut. The very end part of my gut, to be more precise. 

What we do know for sure is that the 75 percent, the top line of the graph, must be too pessimistic. To estimate how far too pessimistic, we need more data and more arguments. 


Wednesday, May 06, 2020

Regression to higher ground

We know that if an MLB team wins 76 games in a particular season, it's probably a better team than its record indicates. To get its talent from its record, we have to regress to the mean.

Tango has often said that straightforward regression to the mean sometimes isn't right -- you have to regress to the *specific* mean you're concerned with. If Wade Boggs hits .280, you shouldn't regress him towards the league average of .260. You should regress him towards his own particular mean, which is more like .310 or something.

This came up when I was figuring regression to the mean for park factors. To oversimplify for purposes of this discussion: the distribution of hitters' parks in MLB is bimodal. There's Coors Field, and then everyone else. Roughly like this random pic I stole from the internet:

Now, suppose you have a season of Coors Field that comes in at 110. If you didn't know the distribution was bimodal, you might regress it back to the mean of 100, by moving it to the left. But if you *do* know that the distribution is bimodal, and you can see the 110 belongs to the hump on the right, you'd regress it to the Coors mean of 113, by moving it to the right.

But there are times when there is no obvious mean to regress to.


You have a pair of perfectly fair 9-sided dice. You want to count the number of rolls it takes before you roll your first snake eyes (which has a 1 in 81 chance each roll). On average, you expect it to take 81 rolls, but that can vary a lot. 

You don't have a perfect count of how many rolls it took, though. Your counter is randomly inaccurate with an SD of 6.4 rolls (coincidentally the same as the SD of luck for team wins).

You start rolling. Eventually you get snake eyes, and your counter estimates that it took 76 rolls. The mean is 81. What's your best estimate of the actual number? 

This time, it should be LOWER than 76. You actually have to regress AWAY from the mean.


Let's go back to the usual case for a second, where a team wins 76 games. Why do we expect its talent to be higher than 76? Because there are two possibilities:

(a) its talent was lower than 76, and it got lucky; or
(b) its talent was higher than 76, and it got unlucky.

But (b) is more likely than (a), because the true number will be higher than 76 more often than it'll be lower than 76. 

You can see that from this graph, which represents the distribution of team talent:

The blue bars are the times that talent was less than 76, and the team got lucky.  The pink bars are the times the talent was more than 76, and the team got unlucky.

The blue bars around 76 are shorter than the pink bars around 76. That means better teams getting unlucky are more common than worse teams getting lucky, so the average talent must be higher than 76.

But the dice case is different. Here's the distribution of when the first snake-eyes (1 in 81 chance) appears:

The mean is still 81, but, this time, the curve slopes down at 76, not up.

Which means: it's more likely that you rolled less than 76 times and counted too high, than that you rolled more than 76 times and counted too low. 

Which means that to estimate the actual number of rolls, you have to regress *down* from 76, which is *away* from the mean of 81.
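You can check that conclusion directly with Bayes' Theorem: multiply the geometric prior for the true roll count by the likelihood of the counter reading 76, and take the posterior mean. Here's a sketch (assuming the counting error is roughly normally distributed, which the setup doesn't state explicitly):

```python
from math import exp

p = 1 / 81        # chance of snake eyes on any given roll of the two dice
sd = 6.4          # SD of the counting error
observed = 76

def prior(n):
    """Geometric prior: first snake eyes comes on roll n."""
    return (1 - p) ** (n - 1) * p

def likelihood(n):
    """Normal density (up to a constant) of the counter reading 76."""
    return exp(-0.5 * ((observed - n) / sd) ** 2)

weights = {n: prior(n) * likelihood(n) for n in range(1, 2001)}
total = sum(weights.values())
post_mean = sum(n * w for n, w in weights.items()) / total
print(round(post_mean, 1))  # a bit below 76 -- regressed *away* from the mean of 81
```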


That logic -- let's call it the "Dice Method" -- seems completely correct, right? 

But, the standard "Tango Method" contradicts it.

The SD of the distribution of the dice graph is around 80.5. The SD of the counting error is 6.4. So we can calculate:

SD(true)     = 80.5
SD(error)    =  6.4
SD(observed) = 80.75

By the Tango method, we have to regress by (6.4/80.75)^2, which is less than 1% of the way to the mean. Small, but still towards the mean!
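Those SDs follow directly from the geometric distribution of the first snake eyes; a quick sketch to verify the arithmetic:

```python
from math import sqrt

p = 1 / 81
sd_true = sqrt(1 - p) / p            # SD of a geometric(1/81) distribution
sd_error = 6.4                       # SD of the counting error
sd_observed = sqrt(sd_true**2 + sd_error**2)

regress = (sd_error / sd_observed) ** 2
print(round(sd_true, 1))             # 80.5
print(round(sd_observed, 2))         # 80.75
print(round(regress * 100, 2))       # less than 1 percent, toward the mean
```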

So we have two answers, that appear to contradict each other:

-- Dice Method: regress away from the mean
-- Tango Method: regress towards the mean

Which is correct?

They both are.

The Tango Method is correct on average. The Dice Method is correct in this particular case.

If you don't know how many rolls you counted, you use the Tango Method.

If you DO know that the count was 76 rolls, you use the Dice Method.


Side note:

The Tango Method's regression to the mean looked wrong to me, but I think I figured out where it comes from.

At a quick glance, it looks like you should always regress to the left, because at every point, the curve is higher on the left side than on the right. That means that if you're below the mean of 81, you regress away from the mean (left). If you're above the mean of 81, you regress toward the mean (still left).

But, there are a lot more datapoints to the left of 81 than to the right of 81 -- by a ratio of about 64 percent to 36 percent. So, overall, it looks like the average should be regressing away from the mean.

Except ... it's not true that the left is always higher than the right. Suppose your counter said "1". You know the correct count couldn't possibly have been zero or less, so you have to regress to the right. 

Even if your counter said "2" ... sure, a true count of 1 is more likely than a true count of 3. But 4, 5, and 6 are more likely than 0, -1, or -2. So again you have to regress to the right.

Maybe the zero/negative logic is a factor when you have, say, 8 tosses or less, just to give a gut estimate. Those might constitute, say, 10 percent of all snake eyes rolled. 

So, the overall "regress less than 1 percent towards the mean of 81" is the average of:

-- 36% regress left  towards the mean a bit (>81)
-- 54% regress left  away from the mean a bit (9-81)
-- 10% regress right towards the mean a lot (< 8)
-- Overall average: regress towards the mean a tiny bit.


The "Tango Method" and the "Dice Method" are just consequences of Bayes' Theorem that are easier to implement than doing all the Bayesian calculations every time. The Tango Method is a mathematically precise consequence of Bayes' Theorem, and the Dice Method is a heuristic from eyeballing. Tango's "regress to the specific mean" is another Bayes heuristic.

We can reduce the three methods into one by noting what they have in common -- they all move the estimate from lower on the curve to higher on the curve. So, instead of "regress to the mean," maybe we can say

"regress to higher ground."

That's sometimes how I think of Bayes' Theorem in my own mind. In fact, I think you can explain Bayes exactly, as a more formal method of figuring where the higher ground is, by explicitly calculating how much to weight the closer ground relative to the distant ground. 


Wednesday, April 15, 2020

Regressing Park Factors (Part III)

I previously calculated that to estimate the true park factor (BPF) for a particular season, you have to take the "standard" one and regress it to the mean by 38 percent.

That's the generic estimate, for all parks combined. If you take Coors Field out of the pool of parks, you have to regress even more.

I ran the same study as in my other post, but this time I left out all the Rockies. Now, instead of 38 percent, you have to regress 50 percent. (It was actually 49-point-something, but I'm calling it 50 percent for simplicity.)

In effect, the old 38 percent estimate comes from a combination of 

1. Coors Field, which needs to be regressed virtually zero, and
2. The other parks, which need to be regressed 50 percent.

For the 50-percent estimate, the 93% confidence interval is (41, 58), which is very wide. But the theoretical method from last post, which I also repeated without Colorado, gave 51 percent, right in line with the observed number.


I tried this method for the Rockies only, and it turns out that the point estimate is that you have to regress slightly *away* from the mean of 100. But with so few team-seasons, the confidence interval is so huge that I'd just take the park factors at face value and not regress them at all. 

The proper method would probably be to regress the Rockies' park factor to the Coors Field mean, which is about 113. You could probably crunch the numbers and figure out how much to regress. 


The overall non-Coors value is 50 percent, but it turns out that every decade is different. *Very* different:

1960s:   regress 15 percent
1970s:   regress 27 percent
1980s:   regress 80 percent
1990s:   regress 84 percent
2000s:   regress 28 percent
2010-16: regress 28 percent 

Why do the values jump around so much? One possibility is that it's random variation in how teams are matched to parks. The method expects batters in hitters' parks to be equal to batters in pitchers' parks, but if (for instance) the Red Sox had a bad team in the 80s, this method would make the park effect appear smaller.

As soon as I wrote that, I realized I could check it. Here are the correlations between BPF and team talent in terms of RS-RA (per 162 games) for team-seasons, by decade. I'll include the regression-to-the-mean amount to make it easier to compare:

             r    RTM
1960s:    +0.14   15% 
1970s:    +0.06   27%
1980s:    -0.14   80%
1990s:    +0.03   84%
2000s:    +0.16   28%
2010s:    +0.23   28%
overall:  +0.05   50%

It does seem to work out that the more positive the correlation between hitting and BPF, the more you have to regress. The two lowest correlations were the ones with the two highest levels of regression to the mean.

(The 1990s does seem a little out of whack, though. Maybe it has something to do with the fact that we're leaving out the Rockies, so the NL BPFs are deflated for 1993-99, but the RS-RA are inflated because the Rockies were mediocre that decade. With the Rockies included, the 1990s correlation would turn negative.)

The "regress 50 percent to the mean" estimate seems to be associated with an overall correlation of +.05. If we want an estimate that assumes zero correlation, we should probably bump it up a bit -- maybe to 60 percent or something.

I'd have to think about whether I wanted to do that, though. My gut seems more comfortable with the actual observed value of 50 percent. I can't justify that.


Monday, March 23, 2020

Regressing Park Factors (Part II)

Note: math/stats post.


Last post, I figured the breakdown of variance for (three-year average) park effects (BPFs) from the Lahman database. It came out like this:

All Parks   [chart 1]
4.8 = SD(3-year observed)
4.3 = SD(3-year true) 
2.1 = SD(3-year luck)

Using the usual method, we would figure, theoretically, that you have to regress park factor by (2.1/4.8)^2, which is about 20 percent. 

But when we used empirical data to calculate the real-life amount of regression required, it turned out to be 38 percent.

Why the difference? Because the 20 percent figure is to regress the observed three-year BPF to its true three-year average. But the 38 percent is to regress the observed three-year BPF to a single-year BPF.

My first thought was: the 3-year true value is the average of three 1-year true values. If each of those were independent, we could just break the 3-year SD into three 1-year SDs by dividing by the square root of 3. 

But that wouldn't work. That's because when we split a 3-year BPF into three 1-year BPFs, those three are from the same park. So, we'd expect them to be closer to each other than if they were three random BPFs from different parks. (That fact is why we choose a three-year average instead of a single year -- we expect the three years to be very similar, which will increase our accuracy.)

Three years of the same park are similar, but not exactly the same. Parks do change a bit from year to year; more importantly, *other* parks change. (In their first season in MLB, the Rockies had a BPF of 118. All else being equal, the other 13 teams would see their BPF drop by about 1.4 points to keep the average at 100.)

So, we need to figure out the SD(true) for different seasons of the same park. 


From the Lahman database, I took all ballparks (1960-2016) with the same name for at least 10 seasons. For each park, I calculated the SD of its BPF for those years. Then, I took the root mean square of those numbers. That came out to 3.1.

We already calculated that the SD of luck for the average of three seasons is 2.1. That means we can fill in SD(true)=2.3.

Same Park     [Chart 2]
3.1 = SD(3-year observed, same park)
2.3 = SD(3-year true, same park)
2.1 = SD(3-year luck, any parks)

(That's the only number we will actually need from that chart.)

Now, from Chart 1, we found SD(true) was 4.3 for all park-years. That 4.3 is the combination of (a) variance of different years from the same park, and (b) variance between the different parks. We now know (a) is 2.3, so we can calculate (b) is root (4.3 squared minus 2.3 squared), which equals 3.6.

So we'll break the "4.3" from Chart 1 into those two parts:

All Parks     [Chart 3]
4.8 = SD(3-year observed)
3.6 = SD(3-year true between parks)
2.3 = SD(3-year true within park)
2.1 = SD(3-year luck)

Now, let's assume that for a given park, the annual deviations from its overall average are independent from year to year. That's not absolutely true, since some changes are more permanent, like when Coors Field joins the league. But it's probably close enough.

With the assumption of independence, we can break the 3-year SD down into three 1-year SDs.  That converts the single 2.3 into three SDs of 1.3 (obtained by dividing 2.3 by the square root of 3):

All Parks     [Chart 4]
4.8 = SD(3-year observed)
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)
1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)

What we're interested in is the SD of this year's value. That's the combination of the first two numbers in the breakdown: the SD of the difference between parks, and the SD of this year's true value for the current park.

The bottom three numbers are different kinds of "luck," for what we're trying to measure. The actual luck in run scoring, and the "luck" in how the park changed in the other two years we're using in the smoothing for the current year. 

All Parks     [Chart 4a]
4.8 = SD(3-year observed)
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)

1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)

Combining the top three and bottom three, we get

All Parks      [Chart 5]
4.8 = SD(3-year observed)
3.8 = SD(true values impacting observed BPF)
2.8 = SD(random values impacting observed BPF)

So we regress by (2.8/4.8) squared, which works out to 34 percent. That's pretty close to the actual figure of 38 percent.
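The whole chain of charts reduces to a few lines of arithmetic. Here's a sketch that reproduces the 34 percent figure from the chart values:

```python
from math import sqrt

# Chart 1 inputs
sd_obs_3yr  = 4.8    # SD(3-year observed BPF)
sd_true_3yr = 4.3    # SD(3-year true BPF)
sd_luck_3yr = 2.1    # SD(3-year luck)

# Chart 2: true within-park variation, backed out of same-park seasons
sd_obs_same  = 3.1
sd_true_same = sqrt(sd_obs_same**2 - sd_luck_3yr**2)   # ~2.3

# Chart 3: split total true SD into between-park and within-park parts
sd_between = sqrt(sd_true_3yr**2 - sd_true_same**2)    # ~3.6

# Chart 4: split the within-park SD across three independent years
sd_one_year = sd_true_same / sqrt(3)                   # ~1.3

# Chart 5: signal = between-park + this year's true value;
#          noise  = the other two years + the run-scoring luck
sd_signal = sqrt(sd_between**2 + sd_one_year**2)
sd_noise  = sqrt(2 * sd_one_year**2 + sd_luck_3yr**2)  # ~2.8

regress = (sd_noise / sd_obs_3yr) ** 2
print(round(regress * 100))   # 34 percent
```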

We can do another attempt, with a slightly different assumption. Back in Chart 2, when we figured SD(three year true, same park) was 3.1 ... that estimate was based on parks with at least ten years of data. If I reduce the requirement to three years of data, SD(three year true, same park) goes up to 3.2, and the final result is ... 36 percent regression to the mean.

So there it is.  I think this method is valid, but I'm not completely sure. The 95% confidence interval for the true value seems wide -- regression to the mean between 28 percent and 49 percent -- so it might just be coincidence that this calculation matches. 

If you see a problem, let me know.


Thursday, March 12, 2020

Regressing Park Factors (Part I)

I think park factors* are substantial overestimates of the effect of the park. 

At their core, park effects are basically calculated based on runs scored at home divided by runs scored on the road. But that figure is heavily subject to the effects of luck. One random 10-8 game at home can skew the park effect by more than half a point.

Because of this, most sabermetric sources calculate park effects based on more than one year of data. A three-year sample is common ... I think Fangraphs, Baseball Reference, and the Lahman database all use three years. 

That helps, but not enough. It looks like park factors are still too extreme, and need to be substantially regressed to the mean.


* Here's a quick explanation of how park factors work, for those not familiar with them.

Park Factor is a number that tells us how relatively easy or difficult it is for a team to score runs because of the characteristics of its home park. The value "100" represents the average park. A number bigger than 100 means it's a hitters' park, where more runs tend to score, and smaller than 100 means it's a pitchers' park, where fewer runs tend to score.

Perhaps confusingly, the park factor averages the home park with an amalgam of road parks, in equal proportion. So if Chase Field has a park factor of 105, which is 5 percent more than average, that really means it's about 10 percent more runs at home, and about average on the road.

The point of park factor is that you can use it to adjust a hitter's stats to account for his home park. So if Eduardo Escobar creates 106 runs for the Diamondbacks, you divide by 1.05 and figure he'd have been good for about 101 runs if he had played in a neutral home park.
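In code, that adjustment is a one-liner (the function name is just for illustration):

```python
def park_adjust(runs_created, park_factor):
    """Convert raw runs created to a neutral-park estimate.
    park_factor is on the 100-based scale (105 = 5 percent above average)."""
    return runs_created / (park_factor / 100.0)

# Escobar's 106 runs created in a 105 park:
print(round(park_adjust(106, 105)))  # about 101
```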


For my large sample of batters (1960-2016, minimum 300 PA), I calculated their runs created per 500 PA (RC500), and their park-adjusted RC500. Then, I binned the players by batting park factor (BPF), and took the average for each bin. If park adjustment worked perfectly, you'd expect every bin to have the same level of performance. After all, there's no reason to think batters who play for the Red Sox or Rockies should be any better or worse overall than batters who are current Mets or former Astros.

(Because of small sample sizes, I grouped all parks 119+ into a single bin. The average BPF for those parks was 123.4.)

Here's the chart:

BPF      PA     Runs   Adj'd  Regressed
 88   8554   67.07   76.22   72.67
 89   2960   56.57   63.56   60.85
 90  43213   62.53   69.48   66.78
 91  61195   62.42   68.59   66.19
 92 121203   61.04   66.35   64.29
 93 195382   62.21   66.89   65.07
 94 241681   61.93   65.89   64.35
 95 304270   64.72   68.12   66.80
 96 325511   64.13   66.81   65.77
 97 463537   63.05   65.00   64.24
 98 520621   65.62   66.96   66.44
 99 712668   64.29   64.94   64.69
100 674090   64.86   64.86   64.86
101 589401   66.53   65.87   66.12
102 514724   66.19   64.89   65.39
103 440940   65.48   63.58   64.32
104 415243   66.07   63.53   64.51
105 319334   67.35   64.15   65.39
106 177680   66.15   62.41   63.86
107 138748   65.85   61.54   63.21
108 105850   67.58   62.57   64.51
109   25751   68.50   62.84   65.04
110   48127   65.46   59.51   61.82
111   34278   69.61   62.71   65.39
112   36977   65.12   58.14   60.85
113   23001   67.94   60.13   63.16
115   13778   74.55   64.83   68.60
116    7994   72.21   62.25   66.12
117   20901   73.62   62.92   67.07
123.4  39629   79.28   64.28   70.10
row/row diff    0.60%   0.39%   0.00%

Start with the third column, which is the raw RC500. As you'd expect, the higher the park factor, the higher the unadjusted runs. That's the effect we want BPF to remove.

So, I adjusted by BPF, and that's column 4. We now expect everything to even out, and the column to look uniform. But it doesn't -- now, it goes the other way. Batters in pitchers' parks now look like they're better hitters than the batters in hitters' parks.

That shows that we overadjusted. By how much? 

Take a look at the bottom row of the chart. Unadjusted, each row is about 0.6 percent higher than the row above it. We'd expect about 1 percent, if BPF worked perfectly. Adjusted, each row is about 0.4 percent lower. 

So BPF overestimates the true park factor by around 40 percent. Which means, if we regress park factors to the mean by 40 percent, we should remove the bias.

That's what the last column is. And the numbers look pretty level.

Actually, I didn't use 40 percent ... I used 38.8 percent. That's what gave the best flat fit. (Part of the difference is due to rounding, and the rest due to the fact that I ignored nonlinearity when I calculated the percentages.)
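In case it's not clear how that last column was computed: it pulls the fully adjusted figure 38.8 percent of the way back toward the raw figure. A quick sketch (function name is mine), checked against the first and last rows of the chart:

```python
def regress_adjustment(raw, adjusted, r=0.388):
    """Regress the park adjustment to the mean by r: move the
    park-adjusted figure r of the way back toward the raw figure."""
    return adjusted + r * (raw - adjusted)

# First and last rows of the chart:
print(round(regress_adjustment(67.07, 76.22), 2))  # 72.67
print(round(regress_adjustment(79.28, 64.28), 2))  # 70.1
```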

Just to be more rigorous and get a more accurate estimate, I ran a regression. Instead of binning the players, I just used all players separately, and did a "weighted regression" that effectively adjusts for the number of PA associated with each player. Because of the weights, I was able to drop the minimum from 300 PA to 10 PA. Also, I included a dummy variable for year, just in case there were a lot of pitchers' parks in 1987, or something.

The result came out almost exactly the same -- regress by 38.3 percent.


Could we have calculated this mathematically just from raw park factors? Yes, I think so -- but not quite in the usual way. 

I'll show the usual way here, and save the rest for the next post. If you don't care about the math, you can just stop here.


If we used the usual technique for figuring how much to regress, we'd use

SD^2(observed) = SD^2(true) + SD^2(luck)

We can figure luck. The SD of team runs in a game is about 3. For two teams combined, multiply by root 2. Calculating the difference from a road game, multiply by root 2 again. Then, for 81 games, divide by the square root of 81, which is 9. Finally, because we're using 3 years, divide by the square root of 3. 

You get 0.385 runs. 

That figure, 0.385 runs, is 4.27 percent of the usual 9 runs per game. To convert that to a park factor, take half. That's 2.13 points. I'll round to 2.1.

The observed SD, from the Lahman database, is 4.8 points. 

We can now calculate SD(true), since 4.8^2 = SD^2(true) + 2.1^2. It works out to 4.3 points.

SD(observed)= 4.8
    SD(true)= 4.3
    SD(luck)= 2.1

So, to regress observed to true, we regress the park factor by (2.1/4.8)^2, which is about 19 percent.
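The whole chain of square roots, in a few lines of Python:

```python
import math

sd_game   = 3.0                          # SD of one team's runs in a game
sd_both   = sd_game * math.sqrt(2)       # two teams combined
sd_diff   = sd_both * math.sqrt(2)       # difference between a home and a road game
sd_season = sd_diff / math.sqrt(81)      # averaged over 81 home games
sd_3yr    = sd_season / math.sqrt(3)     # averaged over three seasons
print(round(sd_3yr, 3))                  # 0.385

# Convert to park-factor points: percent of the usual 9 runs per game, then halved
luck_points = round(sd_3yr / 9 * 100 / 2, 1)           # 2.1

sd_observed = 4.8
sd_true = math.sqrt(sd_observed**2 - luck_points**2)   # about 4.3
regress = (luck_points / sd_observed) ** 2             # about 0.19
print(round(sd_true, 1), round(regress, 2))
```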

Why isn't it 38.3 percent?

Because, in this case, the "true" value is the three-year average. For that, regress 19 percent. But, that's not what we really want when it comes to a single year's performance. For that, we want SD(true) and SD(luck) for just that one year's park factor, not the average of the three years.

It makes sense that you need to regress more for one year than for three, because there's more randomness: beyond the randomness in the three-year average, there's the randomness from the fact that, in the other two of the three years, the park's true effect might have been different, so the three-year BPF might not be representative of the year you're looking at.

There's no obvious relationship between the 19 percent and the 38.3 percent -- it's just coincidence that it comes out double. 

But I think I've worked out how we could have calculated the 38.3 figure. I'll write that up for Part II.  

(P.S.  You might have noticed that the last column of the chart was fairly level, except for the extreme hitters' parks.  I'll talk about that in part III.)

Update, 3/24/20: Part II is here.


Friday, February 28, 2020

Park adjusting a player's batting line has to depend on who the player is

Suppose home runs are scarce in the Astrodome, so that only half as many home runs are hit there than in any other park. One year, Astros outfielder Joe Slugger hits 15 HR at the Astrodome. How do you convert that to be park-neutral? 

It seems like you should adjust it to 30. Take the 15 HR, double it, and there you go. 

But I don't think that works. I think if you do it that way, you overestimate what Joe would have done in a normal park. I think you need to adjust Joe to something substantially less than 30.


One reason is that Joe might not necessarily be hurt by the park as much as other players. Maybe the park hurts weaker hitters more, the kind who hit mostly 310-foot home runs. Maybe Joe is the kind who hits most of his fly balls 430 feet, so when the indoor dead air shortens them by 15 feet, they still have enough carry to make it over the fence.

It's almost certain that some players should have different park factors than others. Many parks are asymmetrical, so lefties and righties will hit to different outfield depths. Some parks may have more impact on players who hit line-drive HRs, and less on those who hit towering fly balls. And so on.

I suspect that's actually a big issue, but I'm going to assume it away for now. I'll continue as if every player is affected by the park the same way, and I'll assume that Joe hit exactly 15 HR at home and 30 HR on the road, exactly in line with expectations.

Also, to keep things simple, two more assumptions. First I'll assume that the park factor is caused by distance to the outfield fence -- that the Astrodome is, say, 10 percent deeper than the average park. Second, I'll assume that in the alternative universe where Joe played in a different home park, he would have hit every ball with exactly the same trajectory and distance that he did in the Astrodome.

My argument is that with these assumptions, the Astros overall would have hit twice as many HR at home as they actually did. But Joe Slugger would have hit *fewer* than twice as many.


Let's start by defining two classes of deep fly balls:

A: fly balls deep enough to be HR in any park, including the Astrodome; 
B: fly balls deep enough to be HR in any park *except* the Astrodome.

We know that, overall, class A is exactly equal in size to class B, since (A+B) is exactly twice A.

That's why, when we saw 15 HR in class A, we immediately assumed that implies 15 HR in class B. And so we assumed that Joe would have hit an extra 15 HR in any other park.

That seems like it should work, but it doesn't. Here's a series of analogies that shows why.

1. You have a pair of fair dice. You expect them to come up snake eyes (1-1) exactly as often as box cars (6-6). You roll the dice 360 times, and find that 1-1 came up 15 times. 

Since 6-6 comes up as often as 1-1, should you estimate that 6-6 also came up 15 times? You should not. Since the dice are fair, you expect 6-6 to have come up 1 time in 36, or 10 times.* The fact that 1-1 got lucky, and came up more often, doesn't mean that 6-6 must have come up more often.

(*Actually, you should expect that 6-6 came up only 9.86 times: the 15 successful 1-1 rolls leave 345 tosses that could have been 6-6, instead of the 350 that the expected ten 1-1s would have left, and 345/35 is 9.86. But never mind for now.)
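For the footnote's 9.86 figure, the conditional-expectation arithmetic is short enough to show:

```python
# Expected 6-6 count, given that 15 of 360 fair-dice rolls were 1-1:
# each of the other 345 rolls is 6-6 with probability 1 in 35.
rolls, snake_eyes = 360, 15
expected_boxcars = (rolls - snake_eyes) / 35
print(round(expected_boxcars, 2))  # 9.86
```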

2. You have a pair of fair dice, and an APBA card. On that card, 1-1 is a home run, and 6-6 represents a home run anywhere except the Astrodome.

You roll the dice 360 times, and 1-1 comes up 15 times. Do you also expect that 6-6 came up 15 times? Same answer: you expect it came up only 10 times. The fact that 1-1 got lucky doesn't mean that 6-6 must also have gotten lucky.

3. You have a simulation game, with some number of fair dice, and a card for Joe Slugger. You know the probability of Joe hitting an Astrodome HR is equal to the probability of Joe hitting an "anywhere but Astrodome" HR.  But that probability -- Joe's talent level -- isn't necessarily 1 in 36.

You play a season's worth of Joe's home games, and he hit 15 HR. Can you assume that he also hit 15 "anywhere but Astrodome" HR? 

Well, in one special case, you can. If the 15 HR was Joe's actual expectation, based on his talent -- that is, his card -- then, yes, you can assume 15 near-HR. 

But, in all other cases, you can't. If Joe's 15 HR total was lucky, based on his talent, you should assume fewer than 15 near-HR. And if the 15 HR was unlucky, you should assume more than 15 near-HR.


So I think you can't park adjust players via the standard method of multiplying their performance by their park factor. The park adjustment has to be based on their *expected* performance, not their observed performance.

Suppose Joe Slugger, at the beginning of the season, was projected by the Marcel system to hit 10 HR at home. That means that he was expected to hit 10 HR at the Astrodome, and 10 "almost HR" at the Astrodome.

Instead, he winds up hitting 15 HR there. But we still estimate that he hit only 10 "almost HR". So, instead of bumping his 15 HR total to 30, we bump it only to 25.
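In code, the adjustment looks like this (function and parameter names are mine; the 10-HR projection is from the example, and would come from a system like Marcel):

```python
def neutralize_astrodome_hr(observed_hr, projected_hr):
    """Neutral-park HR estimate for a park that allows half the usual HR.
    The observed HR all carry over; the "almost HR" (class B) count is
    estimated from the player's projection, not his observed total."""
    expected_near_hr = projected_hr      # classes A and B are equal in expectation
    return observed_hr + expected_near_hr

# Joe was projected for 10 Astrodome HR, but hit 15:
print(neutralize_astrodome_hr(15, 10))  # 25, not the naive 30
```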


I was surprised by this, that there's no way to convert the Astrodome to a normal park that doesn't require you to estimate the player's talent. 

But here's what surprised me even more, when I worked it out: you only need to know the player's talent when you're adjusting from a pitchers' park. When you adjust from a hitters' park, one formula works for everyone!

Let's take it the other way, and suppose that Fenway affords twice as many home runs as any other park. And, suppose Joe Slugger, now with the Red Sox, hits 40 at Fenway and 20 on the road.

How many would he have hit if none of his games were at Fenway?

Well, on average, half of his 40 HR would have been HR on the road. So, that's 20. End of calculation. 

It doesn't matter who the batter is, or what his talent is -- as long as we stick to the assumption that every player's expectation is twice as many HR at Fenway, the expectation is that half his Fenway HR would also have been HR on the road.

(In reality, it might have been more, or it might have been less, since the breakdown of the 40 HR is random, like 40 coin tosses. But the point is, it doesn't depend on the player.)


If you're not convinced, here's a coin toss analogy that might make it clearer.

We ask MLB players to do a coin toss experiment. We give them a fair coin. We tell them, take your day of birth, multiply it by 10, toss the coin that many times, and count the heads. Then, toss the coin that many times again, but this time, count the number of tails.

For the Fenway analogy: heads are "Fenway only" HR. Tails are "any park" HR.

We ask each player to come back and tell us H+T, the total number of Fenway HR. We then try to estimate the heads, the number of "Fenway only" HR.

That's easy: we just assume half the number. Mathematically, the expectation for any player, no matter who, is that H will be half of (H+T). That's because no matter how lucky or unlucky he was, there's no reason to expect he was luckier in H than T, or vice-versa.

Now, for the Astrodome analogy. Heads are "Any park including Astrodome" HR. Tails are "other park only" HR.

We ask each player to come back and tell us only the number of heads, which is the Astrodome HR total. We'll try to estimate tails, the non-Astrodome HR total.

Rob Picciolo comes back and says he got 15 heads. Naively, we might estimate that he also tossed 15 tails, since the probabilities are equal. But that would be wrong. Because, we would check Baseball Reference, and we would see that Picciolo was born on the 4th of the month, not the 3rd. Which means he actually had 40 tosses, not 30, and was unlucky in heads.

In his 40 tosses for tails, there's no reason to expect he'd have been similarly unlucky, so we estimate that Picciolo tossed 20 tails, not 15.

On the other hand, Barry Bonds comes back and says he got 130 heads. On average, players who toss 130 heads would also have averaged about 130 tails. But Barry Bonds was born on the 24th of July. We should estimate that he tossed only 120 tails, not 130.

For Fenway, when we know the total number of heads and tails, the player's birthday doesn't factor into our estimate of tails. For the Astrodome, when we know only the total number of heads, the player's birthday *does* factor into our estimate of tails.
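If you'd rather see it than take my word for it, here's a quick simulation of the Astrodome case (toss counts of 30 and 40 stand in for players born on the 3rd and 4th; both report 15 heads):

```python
import random

random.seed(1)

def trial(n):
    """Toss n times for heads, then n more times for tails."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    tails = sum(random.random() < 0.5 for _ in range(n))
    return heads, tails

# Among trials that produced exactly 15 heads, average the tails.
# The best estimate of tails comes out to n/2, regardless of the 15.
for n in (30, 40):
    tails = [t for h, t in (trial(n) for _ in range(100_000)) if h == 15]
    print(n, round(sum(tails) / len(tails), 1))   # about 15.0 and 20.0

# The Fenway case needs no simulation: given heads + tails, the
# estimate of heads is just half the total, no matter what n is.
```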


So, when Joe Slugger plays 81 games at the Astrodome and tosses 15 home run "heads," we can't just expect him to have also tossed 15 long fly ball "tails". We have to look up his home run talent "date of birth". If he was only born on the 2nd of the month, so that we'd have only expected him to hit 10 HR "heads" and 10 near-HR "tails" in the first place, then we estimate he'd have hit only 10 neutral-park HR, not 15. 

If we don't do that -- if we don't look at his "date of birth" talent and just double his actual Astrodome HR -- our estimates will be too high for players who were lucky, and too low for players who were unlucky. 

Obviously, players who were lucky will have higher totals. That means that if we park-adjust the numbers for the Astros every year, the players who have the best seasons will tend to be the ones we overadjust the most. In other words, when a player was both good and lucky, we're going to make his good seasons look great, his great seasons look spectacular, and his spectacular seasons look like insane outliers. When a player is bad and unlucky, his bad seasons will look even worse.

But if we park-adjust the Red Sox every year ... there's no such effect, and everything should work reasonably well.

My gut still doesn't want to believe it, but my brain thinks it's correct. 

Well, my gut *didn't* want to believe it, when I wrote that sentence originally. Now, I realize that the effect is pretty small. When a player gets lucky by, say, 20 runs, with a season park factor of 95 ... well, that's only 1 run total. My gut is more comfortable with a 1-run effect.

But, suppose you're adjusting a Met superstar, trying to figure out what he'd hit in Colorado. Runs are about 60 percent more abundant in Coors Field than Citi Field, which means the park factor is around 30 percent higher. If the player was 20 runs lucky in that particular season, you'd wind up overestimating him by 6 runs, which is now worth worrying about.


(Note: After writing this, but before final edit, I discovered that Tom Tango made a similar argument years ago. His analysis dealt with the specific case where the player's observed performance matches his expectation, and for that instance I have reinvented his wheel, 15 years later.)
