Saturday, October 31, 2020

Calculating park factors from batting lines instead of runs

I missed a post Tango wrote back in 2019 about park factors. In the comments, he said,

"That’s one place where we failed with our park factors, using actual runs instead of "component" runs. They should be based on Linear Weights or RC or wOBA, something like that.

"Using actual runs means introducing unnecessary random variation in the mix."

Yup. One of those bits of brilliance that's obvious in retrospect.

The idea is, there's a certain amount of luck involved in turning batting events into runs, which depends on the sequence -- in other words, "clutch hitting," which is thought to be mostly random. If teams wind up scoring, say, 20 runs above average in a certain park, it could be that the park lends itself to higher offense. But, it could also be that the park is neutral, and those 20 runs just came from better clutch hitting.

So if we calculate park factors from raw batting lines, instead of actual runs, we eliminate that luck, and should get better estimates. We can still convert to expected runs afterwards.

Let's do it. I'll start with using runs as usual. Then, I'll do it for wOBA, and we'll compare.

-------

I used team-seasons from 2000-2019, except Coors Field (because it's so extreme an outlier). I included only parks that were used at least 16 of the 20 seasons. 

To get the observed park effects, I just took home scoring (both teams combined) and subtracted road scoring (both teams combined). 

For those 444 datapoints, I got

SD(observed) = 81.6 runs

To estimate luck, I used the rule of thumb that SD(runs) for a single team's games is about 3. (Tango uses the square root of total runs for both teams, but I didn't bother.)  

If SD(1 game) = 3, then SD(81 games) = 27. But we want both teams combined, so multiply by root 2. Then, we want (home - road), so multiply by root 2 again. That gives us 54.

SD(luck) = 54 runs
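
Here's that luck arithmetic as a quick Python sketch (just restating the rule-of-thumb numbers above):

from math import sqrt

SD_PER_GAME = 3                  # rule-of-thumb SD of one team's runs in one game
GAMES = 81                       # home (or road) games in a season

sd_one_team = SD_PER_GAME * sqrt(GAMES)        # 27 runs over 81 games
sd_both_teams = sd_one_team * sqrt(2)          # both teams combined
sd_home_minus_road = sd_both_teams * sqrt(2)   # difference of two independent totals
print(round(sd_home_minus_road))               # 54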

Since var(observed) = var(luck) + var(non-luck), we get*

SD(non-luck) = 61.2 runs

*"var" is variance, the square of SD. I'm using it instead of "SD^2" because it makes it much easier to read.

Now, what's this thing I called "non-luck"? It's a combination of the differences between parks, and season-to-season differences within the same park -- weather, how well the players are suited to the park, the parks used by other teams in the division (because of the unbalanced schedule), the parks used by interleague opponents, the somewhat-random distribution of opposing pitchers ... stuff like that.

var(non-luck) = var(between parks) + var(within park)

To estimate SD(within park), I just looked at the observed SDs of the same park across the 16-20 seasons in the dataset. There were 23 parks in the sample, and I took the root-mean-square of those 23 individual SDs. I got

SD(different seasons of park) = 64.1

But ... that 64.1 includes luck, and we want only the non-luck portion. So let's remove luck:

var(diff. seas. of park)= var(luck) + var(within park)
64.1 squared = 54 squared  + var(within park)
SD(within park) = 34.5 runs

And now we can estimate SD(between parks):

var(non-luck) = var(between parks) + var(within park)
61.2 squared = var(between parks) + 34.5 squared
SD(between parks) = 50.5 runs

Summarizing:

81.6  runs total
---------------------------------
54    luck
50.5  between parks
34.5  within park between seasons

The "between parks" SD squared is only 38 percent of the total squared. That means that only 38 percent of the observed park effect is real, and you have to regress to the mean by 62 percent to get an unbiased estimate.

That's a lot. And it's one reason that most sites publish park factors based on more than one season, to give luck a chance to even out.
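
Here's the whole chain of arithmetic as a quick Python sketch (a restatement of the numbers above, nothing more):

from math import sqrt

def decompose(sd_observed, sd_luck, sd_same_park_seasons):
    """Split the observed park-effect SD into luck, within-park, and
    between-park pieces, the way the summary above does (all in runs)."""
    sd_non_luck = sqrt(sd_observed**2 - sd_luck**2)
    sd_within   = sqrt(sd_same_park_seasons**2 - sd_luck**2)
    sd_between  = sqrt(sd_non_luck**2 - sd_within**2)
    regress     = 1 - (sd_between / sd_observed)**2
    return sd_non_luck, sd_within, sd_between, regress

print(decompose(81.6, 54, 64.1))
# roughly (61.2, 34.5, 50.5, 0.62) -- regress 62 percent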

-------

Now, let's try Tango's suggestion to use wOBA instead, and see how much luck that squeezes out.

For the same individual parks, I calculated every year's observed park difference the same way as for runs -- home wOBA minus road wOBA, both teams combined.

For the sample, SD(observed) was 0.01524, against an average wOBA of .3248. That's a ratio of 4.7%. I did a regression and found runs-per-PA increases 1.8x as fast as wOBA (probably proportional to the 1.77th power, or something), so 4.7% in wOBA is 8.45% in runs.

In the full sample, there were .118875 runs per PA, and an average 6207 PA for each home park-season. That's about 738 runs. Taking 8.45 percent of that works out to an SD of 67.3 runs.

SD(observed) = 67.3 runs

The luck SD for wOBA for a single PA is .532 (as calculated from an average batting line APBA card). I'll spare you repeating the percentage calculations, but for 6207 PA,

SD(luck) = 41.9 runs

As before, var(observed) = var(luck) + var(non-luck), so

SD(non-luck) = 52.7 runs

Looking at the RMS between-season SD of the 23 parks in the sample, 

SD(different seasons of park) = 51.2 runs

Eliminating luck to get true season-to-season differences:

var(diff. seas. of park)= var(luck) + var(within park)
51.2 squared = 41.9 squared  + var(within park)
SD(within park) = 29.4 runs

And, finally,

var(non-luck) = var(between parks) + var(within park)
52.7 squared  = var(between parks) + 29.4 squared
SD(between parks) = 43.7 runs

The summary:

67.3  runs total
---------------------------------
41.9  luck
43.7  between park
29.4  within park between seasons

Here the "between park" variance is 42 percent of the total, up from 38 percent when we used runs. So we have, in fact, gotten more accurate estimates.

------

But wait! The two methods really should give us the same estimate of the SD of the "between" and "within" park factors, since they're trying to measure the same thing. But they don't:

runs  wOBA
-----------------------------------------
81.6  67.3   runs total
-----------------------------------------
54    41.9   luck
50.5  43.7   between park
34.5  29.4   within park between seasons

(The "luck" SD is supposed to be different, since that was the whole purpose of using wOBA, to eliminate some of the random noise.)

I think the difference is due to the fact that the wOBA variances were all based on averages per PA, while the runs variances were based on averages per game (roughly, per 27 outs).

On average, the more runs you score, the more PA you'll have. So changing the denominator to PA reduces the high-scoring games relative to the low-scoring games, which compresses the differences, which reduces the SD. 

Although the differences in PA look small, they actually indicate large differences in scoring. Because, per season, every park gets roughly the same number of outs, which means roughly the same number of PA that are outs. So any "extra PA" are mostly baserunners, and very valuable in terms of runs.

If you switch from "observed runs per game" to "observed runs per 6207 PA," the observed SD drops from 81.6 to 72.7 runs.  That's an 11 percent drop. When I did the same for wOBA, the observed SD dropped by 13 percent. So, let's estimate that the difference between "per game" and "per PA" is 12 percent, and reduce everything in the runs column by 12 percent:

runs  wOBA
--------------------------------------------
71.8  67.3   runs total
--------------------------------------------
47.5  41.9   luck
44.4  43.7   between park
30.4  29.4   within park between seasons
--------------------------------------------
62%   58%    regression to long-term mean 

I'm not 100% sure this is legitimate, but it's probably pretty close. One thing I want to do to make the comparison better is to use the same value for "between park" and "within park," since we expect the methods to produce the same estimate, and we expect that any difference is random (in things like the wOBA-to-runs conversion, or how PA vary between games, or the fact that the wOBA calculation omits factors like baserunning).

So after my manual adjustment, we have:

runs  wOBA
--------------------------------------------
71.4  67.8   runs total
--------------------------------------------
47.5  41.9   luck
44    44     between park
30    30     within park between seasons
--------------------------------------------
62%   58%    regression to long-term mean 

-------

That's still a fair bit you have to regress either way -- more than half -- but that would be reduced if you used more than one season in your sample. If we go to the average of four seasons, "luck" and "within park" both get cut in half (the square root of 1/4). 

I'll divide both of those by 2, and recalculate the top and bottom line:

runs  wOBA
--------------------------------------------
52.3  51.0   runs total
--------------------------------------------
24    21     luck
44    44     between park
15    15     within park between seasons
--------------------------------------------
29%   26%    regression to long-term mean 

So if we use a four-year park average, we should only have to regress 29 percent (for runs) or 26 percent (for wOBA). 
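
As a sketch, here's that adjustment as a small function, using the adjusted numbers from the table above (luck and within-park shrink with the square root of the number of seasons; the real between-park differences don't):

from math import sqrt

def regress_after_n_seasons(n, sd_luck, sd_between, sd_within):
    luck   = sd_luck / sqrt(n)
    within = sd_within / sqrt(n)
    total  = sqrt(luck**2 + sd_between**2 + within**2)
    return (luck**2 + within**2) / total**2

print(round(regress_after_n_seasons(4, 47.5, 44, 30), 2))  # runs: 0.29
print(round(regress_after_n_seasons(4, 41.9, 44, 30), 2))  # wOBA: 0.26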

-------

Thanks to Tango for the wOBA data that made this possible, and for other observations I'm saving for a future post.

My three previous posts on park factors are here:  one two three




Thursday, August 27, 2020

Charlie Pavitt: Open the Hall of Fame to sabermetric pioneers

This guest post is from occasional contributor Charlie Pavitt. Here's a link to some of Charlie's previous posts.

-----

Induction into the National Baseball Hall of Fame (HOF) is of course the highest honor available to those associated with the game.  When one thinks of the HOF, one first thinks of the greatest players, such as the first five inductees in 1936 (Cobb, Johnson, Mathewson, Ruth, and Wagner). But other categories of contributors were added almost immediately: league presidents (Morgan Bulkeley, Ban Johnson) and managers (Mack, McGraw) plus George Wright in 1937, pioneers (Alexander Cartwright and Henry Chadwick) in 1938, owners (Charles Comiskey) in 1939, umpires (Bill Klem) and what would now be considered general managers (Ed Barrow) in 1953, and even union leaders (Marvin Miller, this year for induction next year). There is an additional type of honor associated with the HOF for contributions to the game: the J. G. Taylor Spink Award (given by the Baseball Writers Association of America) annually since 1962, the Ford C. Frick Award for broadcasters annually since 1978, and thus far five Buck O'Neil Lifetime Achievement Awards given every three years since 2008.  Even songs get honored ("Centerfield", 2010; "Talkin' Baseball", 2011).

But what about sabermetricians? Are they not having a major influence on the game?  Are there not some who are deserving of an honor of this magnitude?

I am proposing that an honor analogous to the Spink, Frick, and O’Neill awards be given to sabermetricians who have made significant and influential contributions to the analytic study of baseball. I would have called it the Henry Chadwick Award to pay tribute to the inventor of the box score, batting average, and earned run average, but SABR has already reserved that title for its award for research contributions, a few of which have gone to sabermetricians but most to other contributors. So instead I will call it the F. C. Lane award, not in reference to Frank C. Lane (general manager of several teams in the 1950s and 1960s) but rather Ferdinand C. Lane, editor of the Baseball Magazine between 1911 and 1937. Lane wrote two articles for the publication ("Why the System of Batting Should Be Reformed," January 1917, pages 52-60; "The Base on Balls," March 1917, pages 93-95) in which he proposed linear weight formulas for evaluating batting performance, the second of which is remarkably accurate.

I shall now list those whom I think have made "significant and influential contributions to the analytic study of baseball" (that phrase was purposely worded in order to delineate the intent of the award). The HOF began inductions with five players, so I will propose who I think should be the first five recipients:


George Lindsay

Between 1959 and 1963, based on data from a few hundred games either he or his father had scored, George Lindsay published three academic articles in which he examined issues such as the stability of the batting average, average run expectancies for each number of outs during an inning and for different innings, the length of extra-inning games, the distribution of total runs for each team in a game, the odds of winning games with various leads in each inning, and the value of intentional walks and base stealing. It was revolutionary work, and opened up areas of study that have been built upon by generations of sabermetricians since.

 

Bill James

Starting with his first, self-published Baseball Abstract back in 1977, James built up an audience that resulted in the Abstract becoming a conventionally-published best seller between 1982 and 1988.  During those years, he proposed numerous concepts -- to name just three, Runs Created, the Pythagorean Equation, and the Defensive Spectrum -- that have influenced sabermetric work ever since.  But at least as important, if not more so, were his other contributions.  He proposed and got off the ground Project Scoresheet, the first volunteer effort to compile pitch-by-pitch data for games to be made freely available to researchers; this was the forerunner and inspiration for Retrosheet. During the same years as the Abstract was conventionally published, he oversaw a sabermetric newsletter/journal, the Baseball Analyst, which provided a pre-Internet outlet for amateur sabermetricians (including myself) who had few if any other opportunities to get their work out to the public.  Perhaps most importantly, his work was the first serious sabermetric (a term he coined) analysis many of us saw, and served as an inspiration for us to try our hand at it too. I might add that calls for James to be inducted into the Hall itself can be found in a New York Times article from January 20, 2019 by Jamie Malinowski and on the Last Word on Baseball website by its editor Evan Thompson.


Pete Palmer

George Lindsay’s work was not readily available. The Hidden Game of Baseball, written by Palmer and John Thorn, was, and included both a history of previous quantitative work and advancement on that work in the spirit of Lindsay’s. Palmer’s use of linear-weight equations to measure offensive performance and of run expectancies to evaluate strategy options was not entirely new, as Lane and Lindsay had respectively been first, but it was Palmer’s presentation that served to familiarize those who followed with these possibilities, and as with James, these were inspirations to many of us to try our hands at baseball analytics ourselves.  Probably the most important of Palmer’s contributions has been On-base Plus Slugging (OPS), one of the few sabermetric concepts to have become commonplace on baseball broadcasts.

 

David Smith

I’ve already mentioned Project Scoresheet, which lasted as a volunteer organization from 1984 through 1989. I do not wish to go into its fiery ending, a product of a fight about conflict of interest and, in the end, money.  Out of its ashes like the proverbial phoenix rose Retrosheet, the go-to online source for data describing what occurred during all games dating back to 1973, most games back to 1920, and some from before then. Since its beginning, those involved with Retrosheet have known not to repeat the Project’s errors and have made data freely available to everyone even if the intended use for that data is personal financial profit. Dave Smith was the last director of Project Scoresheet, the motivator behind the beginning of Retrosheet, and the latter’s president ever since. Although it is primed to continue when Dave is gone, Retrosheet’s existence would be inconceivable without him.  Baseball Prospectus’s analyst Russell Carleton, whose work relies on Retrosheet, has made it clear in print that he thinks that Dave should be inducted into the Hall itself.

 

Sean Forman

It is true that Forman copied from other sources, but no matter; it took a lot of work to begin what is now the go-to online source for data on seasonal performance. Baseball Reference began as a one-man sideline for an academic, and has become home to information about all American major team sports plus world-wide info on “real” football. 

 

-----

Here are two others that I believe should eventually be recipients.

Sherri Nichols

Only two women have received HOF-related awards; Claire Smith is a past winner of the Spink Award and Rachel Robinson is a recipient of the O’Neil Award.  Sherri Nichols would become the third. I became convinced that she deserved it after reading Ben Lindbergh’s tribute, and recommend it to all interested in learning about the "founding mother" of sabermetrics. I remember when the late Pete DeCoursey (I was scoring Project Scoresheet Phillies games and he was our team captain) proposed the concept of Defensive Average, for which (as Lindbergh’s article noted) Nichols did the computations. This was revolutionary work at that time, and laid the groundwork for all of the advanced fielding information we now have at our disposal.

 

Tom Tango

Tango has had significant influence on many areas of sabermetric work, two of which have joined Palmer’s OPS as commonplaces on baseball-related broadcasts. Wins Above Replacement (WAR) was actually Bill James’s idea, but James never tried to implement it. Tango has helped define it, and his offensive index wOBA is at the basis of the two most prominent instantiations, those from Baseball Reference (alternatively referred to as bWAR and rWAR) and FanGraphs (fWAR).  Leverage was an idea whose time had come, as our blogmaster Phil Birnbaum came up with the same concept at about the same time, but it was Tango’s usage that became definitive. His Fielding Independent Pitching (FIP) corrective to weaknesses in ERA is also well-known and often used. Tango currently oversees data collection for MLB Advanced Media, and has done definitive work on MLBAM’s measurement of fielding (click here for a magisterial discussion of that topic).

There are some historical figures that might be deserving; Craig Wright, Dick Cramer, and Allan Roth come to mind as possibilities. Maybe even Earnshaw Cook, as wrong as he was about just about everything, because of what he was attempting to do without the data he needed to do it right (see his Percentage Baseball book for a historically significant document). Perhaps the Award could also go to organizations as a whole, such as Baseball Prospectus and FanGraphs; if so, SABR should get it first.



Wednesday, August 05, 2020

The NEJM hydroxychloroquine study fails to notice its largest effect

Before hydroxychloroquine was a Donald Trump joke, the drug was considered a promising possibility for prevention and treatment of Covid-19. It had been previously shown to work against respiratory viruses in the lab, and, for decades, it was safely and routinely given to travellers before departing to malaria-infested regions. A doctor friend of mine (who, I am hoping, will have reviewed this post for medical soundness before I post it) recalls having taken it before a trip to India.

Travellers start on hydroxychloroquine two weeks before departure; this gives the drug time to build up in the body. Large doses at once can cause gastrointestinal side effects, but since hydroxychloroquine has a very long half-life in the body -- three weeks or so -- you build it up gradually.

For malaria, hydroxychloroquine can also be used for treatment. However, several recent studies have found it to be ineffective in treating advanced Covid-19.

That leaves prevention. Can hydroxychloroquine be used to prevent Covid-19 infections? The "gold standard" would be a randomized double-blind placebo study, and we got one a couple of months ago, in the New England Journal of Medicine (NEJM). 

It found no statistically significant difference between the treatment and placebo groups, and concluded

"After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19 or confirmed infection when used as postexposure prophylaxis within 4 days after exposure."

But ... after looking at the paper in more detail, I'm not so sure.

-------

The study reported on 821 subjects who had been exposed, within the past four days, to a patient testing positive for Covid-19. They received doses of either hydroxychloroquine or placebo over the next five days (the first day was a higher "loading dose"), and were followed over the next couple of weeks to see if they contracted the virus.

The results:

49 of 414 treatment subjects (11.8%) became infected
58 of 407   placebo subjects (14.3%) became infected.

That's about 17 percent fewer cases in patients who got the real drug. 

But that wasn't a large enough difference to show statistical significance, with only about 400 subjects in each group. The paper acknowledges that, stating the study was designed with sufficient power only to find a reduction of at least 50 percent, not the 17 percent reduction that actually appeared. Still, by the usual academic standards for this sort of thing, the authors were able to declare that "hydroxychloroquine did not prevent illness."

At this point I would normally rant about statistical significance and how "absence of evidence is not evidence of absence."  But I'll skip that, because there's something more interesting going on.

------

Recall that the study tested hydroxychloroquine on subjects who feared they were already exposed to the virus. That's not really testing prevention ... it's testing treatment, albeit early treatment. It does have elements of prevention in it, as perhaps the subjects may not have been infected at that point, but would be infected later. (The study doesn't say explicitly, but I would assume some of the exposures were to family members, so repeated exposures over the next two weeks would be likely.)

Also: it did take five days of dosing until the full dose of hydroxychloroquine was taken. That means the subject didn't get a full dose until up to nine days after exposure to the virus.

So this is where it gets interesting. Here's Figure 2 from the paper:



These lines are cumulative infections during the course of the study. As of day 5, there were actually more infections in the group that took hydroxychloroquine than in the group that got the placebo ... which is perhaps not that surprising, since the subjects hadn't finished their full doses until that fifth day. By day 10, the placebo group has caught up, but the groups are still about equal.

But now ... look what happens from Day 10 to Day 14. The group that got the hydroxychloroquine doesn't move much ... but the placebo group shoots up.

What's the difference in new cases? The study doesn't give the exact numbers that correspond to the graph, so I used a pixel ruler to measure the distances between points of the graph. It turns out that from Day 10 to Day 14, they found:

-- 11 new infections in the placebo group
--  2 new infections in the hydroxychloroquine group.

What is the chance that of 13 new infections, they would get split 11:2? 

About 1.12 percent one-tailed, 2.24 percent two-tailed.
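
That's just a binomial tail: 13 new infections, each roughly a coin flip between two groups of almost equal size (414 vs. 407), and we need 11 or more to land on one side. A quick check:

from math import comb

n, k = 13, 11
one_tailed = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(one_tailed, 2 * one_tailed)   # about 0.011 and 0.022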

Now, I know that it's usually not legitimate to pick specific findings out of a study ... with 100 findings, you're bound to find one or two random ones that fall into that significance level. But this is not an arbitrary random pattern -- it's exactly what we would have expected to find if hydroxychloroquine worked as a preventative. 

It takes, on average, about a week for symptoms to appear after COVID-19 infection. So for those subjects in the "1-5" group, most were probably infected *before* the start of their hydroxychloroquine regimen (up to four days before, as the study notes). So those don't necessarily provide evidence of prevention. 

In the "6-10" group, we'd expect most of them to have been already infected before the drugs were administered; the reason they were admitted to the study in the first place was because they feared they had been exposed. So probably many of those who didn't experience symptoms until, say, Day 9, were already infected but had a longer incubation period. Also, most of the subsequently-infected subjects in that group probably got infected in the first five days, while they didn't have a full dose of the drug yet.

But in the last group, the "11-14" group, that's when you'd expect the largest preventative effect -- they'd have had a full dose of the drug for at least six days, and they were the most likely to have become infected only after the start of the trial.

And that's when the hydroxychloroquine group had an 84 percent lower infection rate than the placebo group.

------

In everything I've been reading about hydroxychloroquine and this study, I have not seen anyone notice this anomaly, that beyond ten days, there were almost seven times as many infections among those who didn't get the hydroxychloroquine. In fact, even the authors of the study didn't notice. They stopped the study on the basis of "futility" once they realized they were not going to achieve statistical significance (or, in other words, once they realized the reduction in infections was much less than the 50% minimum they would endorse). In other words: they stopped the study just as the results were starting to show up! 

And then the FDA, noting the lack of statistical significance, revoked authorization to use hydroxychloroquine.

I'm not trying to push hydroxychloroquine here ... and I'm certainly not saying that I think it will definitely work. If I had to give a gut estimate, based on this data and everything else I've seen, I'd say ... I dunno, maybe a 15 percent chance. Your guess may be lower. Even if your gut says there's only a one-in-a-hundred chance that this 84 percent reduction is real and not a random artifact ... in the midst of this kind of pandemic, isn't even 1 percent enough to say, hey, maybe it's worth another trial?

I know hydroxychloroquine is considered politically unpopular, and it's fun to make a mockery of it. But these results are strongly suggestive that there might be something there. If we all agree that Trump is an idiot, and even a stopped clock is right twice a day, can we try evaluating the results of this trial on what the evidence actually shows? Can we not elevate common sense over the politics of Trump, and the straitjacket of statistical significance, and actually do some proper science?





Sunday, May 17, 2020

Herd immunity comes faster when some people are more infectious

By now, we all know about "R0" and how it needs to drop below 1.0 for us to achieve "herd immunity" to the COVID virus. 

The estimates I've seen are that the "R0" (or "R") for the COVID-19 virus is around 4. That means, in a susceptible population with no interventions like social distancing, the average infected person will pass the virus on to 4 other people. Each of those four passes it on to four others, and each of those 16 newly-infected people passes it on to four others, and the incidence grows exponentially.

But, suppose that 75 percent of the population is immune, perhaps because they've already been infected. Then, each infected person can pass the virus on to only one other person (since the other three who would otherwise be infected, are immune). That means R0 has dropped from 4 to 1. With R0=1, the number of infected people will stay level. As more people become immune, R drops further, and the disease eventually dies out in the population.
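
That 75 percent is just the standard herd-immunity threshold for a homogeneous population, 1 minus 1/R0:

R0 = 4
threshold = 1 - 1 / R0    # each case then finds only 1/R0 of its contacts still susceptible
print(threshold)          # 0.75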

That's the argument most experts have been making, so far -- that we can't count on herd immunity any time soon, because we'd need 75 percent of the population to get infected first.

But that can't be right, as "Zvi" points out in a post on LessWrong*. 

(*I recommend LessWrong as an excellent place to look to for good reasoning on coronavirus issues, with arguments that make the most sense.)

That's because not everyone is equal in terms of how much they're likely to spread the virus. In other words, everyone has his or her own personal R0. Those with a higher R0 -- people who don't wash their hands much, or shake hands with a lot of people, or just encounter more people for face-to-face interactions -- are also likely to become infected sooner. When they become immune, they drop the overall "societal" R0 more than if they were average.

If you want to reduce home runs in baseball by 75 percent, you don't have to eliminate 75 percent of plate appearances. You can probably do it by eliminating as little as, say, 25 percent, if you get rid of the top power hitters only.

As Zvi writes,


"Seriously, stop thinking it takes 75% infected to get herd immunity...

"... shame on anyone who doesn’t realize that you get partial immunity much bigger than the percent of people infected. 

"General reminder that people’s behavior and exposure to the virus, and probably also their vulnerability to it, follow power laws. When half the population is infected and half isn’t, the halves aren’t chosen at random. They’re based on people’s behaviors.

"Thus, expect much bigger herd immunity effects than the default percentages."

-------

But to what extent does the variance in individual behavior affect the spread of the virus?  Is it just a minimal difference, or is it big enough that, for instance, New York City (with some 20 percent of people having been exposed to the virus) is appreciably closer to herd immunity than we think?

To check, I wrote a simulation. It is probably in no way actually realistic in terms of how well it models the actual spread of COVID, but I think we can learn something from the differences in what the model shows for different assumptions about individual R0.

I created 100,000 simulated people, and gave them each a "spreader rating" to represent their R0. The actual values of the ratings don't matter, except relative to the rest of the population. I created a fixed number of "face-to-face interactions" each day, and a person's chance of being involved in one is directly proportional to his or her rating. So, people with a rating of "8" are four times as likely to have a chance to spread/catch the virus as people with a rating of "2". 

For each of those interactions, if one person was infected and the other wasn't, there was a fixed probability of the infection spreading.

For every simulation, I jigged the numbers to get the R0 to be around 4 for the first part of the pandemic, from the start until the point where around three percent of the population was infected. 

The simulation started with 10 people newly infected. I assumed that infected people could spread the virus only for the first 10 days after infection. 

--------

The four simulations were:

1. Everyone has the same rating.

2. Everyone rolls a die until "1" or "2" comes up, and their spreader rating is the number of rolls it took. (On average, that's 3 rolls. But in a hundred thousand trials, you get some pretty big outliers. I think there was typically a 26 or higher -- 26 consecutive high rolls happens one time in 37,877.)

3. Same as #2, except that 1 percent of the population is a superspreader, with a spreader rating of 30. The first nine infected people were chosen randomly, but the tenth was always set to "superspreader."

4. Same as #3, but the superspreaders got an 80 instead of a 30.
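
Something like this minimal sketch captures the setup (the contact count, transmission probability, and length of the run are placeholders; they'd need jigging, scenario by scenario, to hold the early R0 near 4):

import numpy as np

rng = np.random.default_rng(0)

N = 100_000           # simulated people
DAYS = 120            # days to simulate
CONTACTS = 150_000    # face-to-face interactions per day (placeholder -- tune for R0 near 4)
P_TRANSMIT = 0.13     # chance an infectious/susceptible pair transmits (placeholder)
INFECTIOUS_DAYS = 10  # infected people spread only for their first 10 days

def spreader_ratings(scenario):
    # scenario 1: everyone identical
    if scenario == 1:
        return np.ones(N)
    # scenarios 2-4: die rolls until a 1 or 2 comes up (mean 3, with big outliers)
    r = rng.geometric(1/3, size=N).astype(float)
    if scenario >= 3:
        # 1 percent superspreaders, rated 30 (scenario 3) or 80 (scenario 4)
        r[rng.choice(N, N // 100, replace=False)] = 30 if scenario == 3 else 80
    return r

def simulate(scenario):
    r = spreader_ratings(scenario)
    p_pick = r / r.sum()              # chance of filling any given slot in an interaction
    day_infected = np.full(N, -1)     # -1 means never infected
    seeds = rng.choice(N, 10, replace=False)
    if scenario >= 3:
        seeds[-1] = np.argmax(r)      # make sure one seed is a superspreader
    day_infected[seeds] = 0
    for day in range(1, DAYS + 1):
        a = rng.choice(N, CONTACTS, p=p_pick)   # one side of each interaction
        b = rng.choice(N, CONTACTS, p=p_pick)   # the other side
        can_spread_a = (day_infected[a] >= 0) & (day - day_infected[a] <= INFECTIOUS_DAYS)
        can_spread_b = (day_infected[b] >= 0) & (day - day_infected[b] <= INFECTIOUS_DAYS)
        hit = rng.random(CONTACTS) < P_TRANSMIT
        new = np.concatenate([b[can_spread_a & (day_infected[b] < 0) & hit],
                              a[can_spread_b & (day_infected[a] < 0) & hit]])
        day_infected[np.unique(new)] = day
    return (day_infected >= 0).mean()

for s in (1, 2, 3, 4):
    print(f"scenario {s}: {simulate(s):.0%} ever infected")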

--------

In the first simulation, everyone got the same rating. With an initial R0 of around 4, it did, indeed, take around 75 percent of the population to get infected before R0 dropped below 1.0. 

Overall, around 97 percent of the population wound up being infected before the virus disappeared completely.

Here's the graph:





The point where R0 drops below 1.0 is where the next day's increase is smaller than the previous day's increase. It's hard to eyeball that on the curve, but it's around day 32, where the total crosses the 75,000 mark.

-------

As I mentioned, I jigged the other three curves so that for the first days, they had about the same R0 of around 4, so as to match the "everyone the same" graph.

Here's the graph of all four simulations for those first 22 days:





Aside from the scale, they're pretty similar to the curves we've seen in real life. Which means that, based on the data we've seen so far, we can't really tell from the numbers which simulation is closest to our true situation.

But ... after that point, as Zvi explained, the four curves do diverge. Here they are in full:





Big differences, in the direction that Zvi explained. The bigger the variance in individual R0, the more attenuated the progression of the virus.

Which makes sense. All four curves had an R0 of around 4.0 at the beginning. But the bottom curve was 99 percent with an average of 3 encounters, and 1 percent superspreaders with an average of 80 encounters. Once those superspreaders are no longer superspreading, the R0 plummets. 

In other words: herd immunity brings the curve under control by reducing opportunity for infection. In the bottom curve, eliminating the top 1% of the population reduces opportunity by 40%. In the top curve, eliminating 1% of the population reduces opportunity only by 2%.

-------

For all four curves, here's where R0 dropped below 1.0:

75% -- all people the same
58% -- all different, no superspreaders
44% -- all different, superspreaders 10x average
20% -- all different, superspreaders 26x average

And here's the total number of people who ever got infected:

97% -- all people the same
81% -- all different, no superspreaders
65% -- all different, superspreaders 10x average
33% -- all different, superspreaders 26x average

--------

Does it seem counterintuitive that the more superspreaders, the better the result?  How can more infecting make things better?

It doesn't. More *initial* infecting makes things better *only when holding the initial R0 constant.*  

If the aggregate R0 is still only 4.0 after including superspreaders, it must mean that the non-superspreaders have an R0 significantly less than 4.0. You can think of an "R=4.0 with superspreaders" society as something like an "R=2.0" society that's been infected by 1% gregarious handshaking huggers and church-coughers.

In other words, the good news is: if everyone were at the median, the overall R0 would be less than 4. It just looks like R0=4 because we're infested by dangerous superspreaders. Those superspreaders will more quickly turn benign and lower our R0 faster.

---------

So, the shape of the distribution of spreaders matters a great deal. Of course, we don't know the shape of our distribution, so it's hard to estimate which line in the chart we're closest to. 

But we *do* know that we have at least a certain amount of variance -- some people shake a lot of hands, some people won't wear masks, some people are probably still going to hidden dance parties. So I think we can conclude that we'll need significantly less than 75 percent to get to herd immunity.

How much less?  I guess you could study data sources and try to estimate. I've seen at least one non-wacko argument that says New York City, with an estimated infection rate of at least 20 percent, might be getting close. Roughly speaking, that would be something like the fourth line on the graph, the one on the bottom.  

Which line is closest, if not that one?  My gut says ... given that we know the top line is wrong, and from what we know about human nature ... the second line from the top is a reasonable conservative assumption. Changing my default from 75% to 58% seems about right to me. But I'm pulling that out of my gut. The very end part of my gut, to be more precise. 

What we do know for sure is that 75 percent, the top line of the graph, must be too pessimistic. To estimate how much too pessimistic, we need more data and more arguments. 




Wednesday, May 06, 2020

Regression to higher ground


We know that if an MLB team wins 76 games in a particular season, it's probably a better team than its record indicates. To get its talent from its record, we have to regress to the mean.

Tango has often said that straightforward regression to the mean sometimes isn't right -- you have to regress to the *specific* mean you're concerned with. If Wade Boggs hits .280, you shouldn't regress him towards the league average of .260. You should regress him towards his own particular mean, which is more like .310 or something.

This came up when I was figuring regression to the mean for park factors. To oversimplify for purposes of this discussion: the distribution of hitters' parks in MLB is bimodal. There's Coors Field, and then everyone else. Roughly like this random pic I stole from the internet:




Now, suppose you have a season of Coors Field that comes in at 110. If you didn't know the distribution was bimodal, you might regress it back to the mean of 100, by moving it to the left. But if you *do* know that the distribution is bimodal, and you can see the 110 belongs to the hump on the right, you'd regress it to the Coors mean of 113, by moving it to the right.

But there are times when there is no obvious mean to regress to.

--------

You have a pair of perfectly fair 9-sided dice. You want to count the number of rolls it takes before you roll your first snake eyes (which has a 1 in 81 chance each roll). On average, you expect it to take 81 rolls, but that can vary a lot. 

You don't have a perfect count of how many rolls it took, though. Your counter is randomly inaccurate with an SD of 6.4 rolls (coincidentally the same as the SD of luck for team wins).

You start rolling. Eventually you get snake eyes, and your counter estimates that it took 76 rolls. The mean is 81. What's your best estimate of the actual number? 

This time, it should be LOWER than 76. You actually have to regress AWAY from the mean.

-------

Let's go back to the usual case for a second, where a team wins 76 games. Why do we expect its talent to be higher than 76? Because there are two possibilities:

(a) its talent was lower than 76, and it got lucky; or
(b) its talent was higher than 76, and it got unlucky.

But (b) is more likely than (a), because the true number will be higher than 76 more often than it'll be lower than 76. 

You can see that from this graph that represents distribution of team talent:



The blue bars are the times that talent was less than 76, and the team got lucky.  The pink bars are the times the talent was more than 76, and the team got unlucky.

The blue bars around 76 are shorter than the pink bars around 76. That means better teams getting unlucky are more common than worse teams getting lucky, so the average talent must be higher than 76.

But the dice case is different. Here's the distribution of when the first snake-eyes (1 in 81 chance) appears:




The mean is still 81, but, this time, the curve slopes down at 76, not up.

Which means: it's more likely that you rolled less than 76 times and counted too high, than that you rolled more than 76 times and counted too low. 

Which means that to estimate the actual number of rolls, you have to regress *down* from 76, which is *away* from the mean of 81.

--------

That logic -- let's call it the "Dice Method" -- seems completely correct, right? 

But, the standard "Tango Method" contradicts it.

The SD of the distribution of the dice graph is around 80.5. The SD of the counting error is 6.4. So we can calculate:

SD(true)     = 80.5
SD(error)    =  6.4
--------------------
SD(observed) = 80.75

By the Tango method, we have to regress by (6.4/80.75)^2, which is less than 1% of the way to the mean. Small, but still towards the mean!

So we have two answers, that appear to contradict each other:

-- Dice Method: regress away from the mean
-- Tango Method: regress towards the mean

Which is correct?

They both are.

The Tango Method is correct on average. The Dice Method is correct in this particular case.

If you don't know how many rolls you counted, you use the Tango Method.

If you DO know that the count was 76 rolls, you use the Dice Method.
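
If you don't trust the logic, a quick Monte Carlo sketch shows both things at once -- the tiny overall regression toward the mean, and the regression away from it once you condition on an observed count near 76:

import numpy as np

rng = np.random.default_rng(1)

P = 1 / 81          # chance of snake eyes with a pair of fair 9-sided dice
SD_ERROR = 6.4      # SD of the counting error
TRIALS = 2_000_000  # number of simulated counts (an arbitrary large number)

true = rng.geometric(P, size=TRIALS)                 # actual rolls until first snake eyes
observed = true + rng.normal(0, SD_ERROR, TRIALS)    # the noisy count

# Tango Method: on average, regress the observed count toward the mean
print((SD_ERROR / observed.std()) ** 2)              # well under 1 percent

# Dice Method: but *given* an observed count near 76, the true count
# averages a bit below 76 -- you regress away from the mean of 81
near_76 = np.abs(observed - 76) < 1
print(true[near_76].mean())                          # around 75.5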

------

Side note:

The Tango Method's regression to the mean looked wrong to me, but I think I figured out where it comes from.

Looking at the graph at a quick glance, it looks like you should always regress to the left, because the left side of every point is always higher than the right side of every point. That means that if you're below the mean of 81, you regress away from the mean (left). If you're above the mean of 81, you regress toward the mean (still left).

But, there are a lot more datapoints to the left of 81 than to the right of 81 -- by a ratio of about 64 percent to 36 percent. So, overall, it looks like the average should be regressing away from the mean.

Except ... it's not true that the left is always higher than the right. Suppose your counter said "1". You know the correct count couldn't possibly have been zero or less, so you have to regress to the right. 

Even if your counter said "2" ... sure, a true count of 1 is more likely than a true count of 3, but 4, 5, and 6 are more likely than 0, -1, or -2. So again you have to regress to the right.

Maybe the zero/negative logic is a factor when you have, say, 8 tosses or less, just to give a gut estimate. Those might constitute, say, 10 percent of all snake eyes rolled. 

So, the overall "regress less than 1 percent towards the mean of 81" is the average of:

-- 36% regress left  towards the mean a bit (>81)
-- 54% regress left  away from the mean a bit (9-81)
-- 10% regress right towards the mean a lot (8 or less)
   -----------------------------------------------------
-- Overall average: regress towards the mean a tiny bit.

--------

The "Tango Method" and the "Dice Method" are just consequences of Bayes' Theorem that are easier to implement than doing all the Bayesian calculations every time. The Tango Method is a mathematically precise consequence of Bayes Theorem, and the Dice Method is an heuristic from eyeballing. Tango's "regress to the specific mean" is another Bayes heuristic.

We can reduce the three methods into one by noting what they have in common -- they all move the estimate from lower on the curve to higher on the curve. So, instead of "regress to the mean," maybe we can say

"regress to higher ground."

That's sometimes how I think of Bayes' Theorem in my own mind. In fact, I think you can explain Bayes exactly, as a more formal method of figuring where the higher ground is, by explicitly calculating how much to weight the closer ground relative to the distant ground. 



Wednesday, April 15, 2020

Regressing Park Factors (Part III)

I previously calculated that to estimate the true park factor (BPF) for a particular season, you have to take the "standard" one and regress it to the mean by 38 percent.

That's the generic estimate, for all parks combined. If you take Coors Field out of the pool of parks, you have to regress even more.

I ran the same study as in my other post, but this time I left out all the Rockies. Now, instead of 38 percent, you have to regress 50 percent. (It was actually 49-point-something, but I'm calling it 50 percent for simplicity.)

In effect, the old 38 percent estimate comes from a combination of 

1. Coors Field, which needs to be regressed virtually zero, and
2. The other parks, which need to be regressed 50 percent.

For the 50-percent estimate, the 93% confidence interval is (41, 58), which is very wide. But the theoretical method from last post, which I also repeated without Colorado, gave 51 percent, right in line with the observed number.

--------

I tried this method for the Rockies only, and it turns out that the point estimate is that you have to regress slightly *away* from the mean of 100. But with so few team-seasons, the confidence interval is so huge that I'd just take the park factors at face value and not regress them at all. 

The proper method would probably be to regress the Rockies' park factor to the Coors Field mean, which is about 113. You could probably crunch the numbers and figure out how much to regress. 

--------

The overall non-Coors value is 50 percent, but it turns out that every decade is different. *Very* different:

1960s:   regress 15 percent
1970s:   regress 27 percent
1980s:   regress 80 percent
1990s:   regress 84 percent
2000s:   regress 28 percent
2010-16: regress 28 percent 

Why do the values jump around so much? One possibility is that it's random variation in how teams are matched to parks. The method expects batters in hitters' parks to be equal to batters in pitchers' parks, but if (for instance) the Red Sox had a bad team in the 80s, this method would make the park effect appear smaller.

As soon as I wrote that, I realized I could check it. Here are the correlations between BPF and team talent in terms of RS-RA (per 162 games) for team-seasons, by decade. I'll include the regression-to-the-mean amount to make it easier to compare:

             r    RTM
---------------------
1960s:    +0.14   15% 
1970s:    +0.06   27%
1980s:    -0.14   80%
1990s:    +0.03   84%
2000s:    +0.16   28%
2010s:    +0.23   28%
---------------------
overall:  +0.05   50%

It does seem to work out that the lower the correlation between hitting and BPF, the more you have to regress. The two lowest correlations were the ones with the two highest levels of regression to the mean.

(The 1990s does seem a little out of whack, though. Maybe it has something to do with the fact that we're leaving out the Rockies, so the NL BPFs are deflated for 1993-99, but the RS-RA are inflated because the Rockies were mediocre that decade. With the Rockies included, the 1990s correlation would turn negative.)

The "regress 50 percent to the mean" estimate seems to be associated with an overall correlation of +.05. If we want an estimate that assumes zero correlation, we should probably bump it up a bit -- maybe to 60 percent or something.

I'd have to think about whether I wanted to do that, though. My gut seems more comfortable with the actual observed value of 50 percent. I can't justify that.




Monday, March 23, 2020

Regressing Park Factors (Part II)


Note: math/stats post.

-------------

Last post, I figured the breakdown of variance for (three-year average) park effects (BPFs) from the Lahman database. It came out like this:

All Parks   [chart 1]
-------------------------
4.8 = SD(3-year observed)
-------------------------
4.3 = SD(3-year true) 
2.1 = SD(3-year luck)
-------------------------

Using the usual method, we would figure, theoretically, that you have to regress park factor by (2.1/4.8)^2, which is about 20 percent. 
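
In code form, that's just:

sd_luck, sd_observed = 2.1, 4.8
print((sd_luck / sd_observed) ** 2)   # about 0.19 -- regress roughly 20 percent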

But when we used empirical data to calculate the real-life amount of regression required, it turned out to be 38 percent.

Why the difference? Because the 20 percent figure is to regress the observed three-year BPF to its true three-year average. But the 38 percent is to regress the observed three-year BPF to a single-year BPF.

My first thought was: the 3-year true value is the average of three 1-year true values. If each of those were independent, we could just break the 3-year SD into three 1-year SDs by dividing by the square root of 3. 

But that wouldn't work. That's because when we split a 3-year BPF into three 1-year BPFs, those three are from the same park. So, we'd expect them to be closer to each other than if they were three random BPFs from different parks. (That fact is why we choose a three-year average instead of a single year -- we expect the three years to be very similar, which will increase our accuracy.)

Three years of the same park are similar, but not exactly the same. Parks do change a bit from year to year; more importantly, *other* parks change. (In their first season in MLB, the Rockies had a BPF of 118. All else being equal, the other 13 teams would see their BPF drop by about 1.4 points to keep the average at 100.)

So, we need to figure out the SD(true) for different seasons of the same park. 

--------

From the Lahman database, I took all ballparks (1960-2016) with the same name for at least 10 seasons. For each park, I calculated the SD of its BPF for those years. Then, I took the root mean square of those numbers. That came out to 3.1.

We already calculated that the SD of luck for the average of three seasons is 2.1. That means we can fill in SD(true)=2.3.

Same Park     [Chart 2]
------------------------------------
3.1 = SD(3-year observed, same park)
------------------------------------
2.3 = SD(3-year true, same park)
2.1 = SD(3-year luck, any parks)
------------------------------------

(That's the only number we will actually need from that chart.)

Now, from Chart 1, we found SD(true) was 4.3 for all park-years. That 4.3 is the combination of (a) variance of different years from the same park, and (b) variance between the different parks. We now know (a) is 2.3, so we can calculate (b) is root (4.3 squared minus 2.3 squared), which equals 3.6.

So we'll break the "4.3" from Chart 1 into those two parts:

All Parks     [Chart 3]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
2.3 = SD(3-year true within park)
2.1 = SD(3-year luck)
-----------------------------------

Now, let's assume that for a given park, the annual deviations from its overall average are independent from year to year. That's not absolutely true, since some changes are more permanent, like when Coors Field joins the league. But it's probably close enough.

With the assumption of independence, we can break the 3-year SD down into three 1-year SDs.  That converts the single 2.3 into three SDs of 1.3 (obtained by dividing 2.3 by the square root of 3):

All Parks     [Chart 4]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)
1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------

What we're interested in is the SD of this year's value. That's the combination of the first two numbers in the breakdown: the SD of the difference between parks, and the SD of this year's true value for the current park.

The bottom three numbers are different kinds of "luck," for what we're trying to measure. The actual luck in run scoring, and the "luck" in how the park changed in the other two years we're using in the smoothing for the current year. 

All Parks     [Chart 4a]
-----------------------------------
4.8 = SD(3-year observed)
-----------------------------------
3.6 = SD(3-year true between parks)
1.3 = SD(this year true for park)

1.3 = SD(next year true for park)
1.3 = SD(last year true for park)
2.1 = SD(3-year luck)
-----------------------------------

Combining the top two and the bottom three, we get

All Parks      [Chart 5]
----------------------------------------------
4.8 = SD(3-year observed)
----------------------------------------------
3.8 = SD(true values impacting observed BPF)
2.8 = SD(random values impacting observed BPF)
----------------------------------------------

So we regress by (2.8/4.8) squared, which works out to 34 percent. That's pretty close to the actual figure of 38 percent.
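
Here's the whole chain from Chart 2 through Chart 5 as a quick Python sketch (same numbers as above, just strung together):

from math import sqrt

sd_observed_3yr  = 4.8    # Chart 1: all parks, 3-year observed
sd_true_3yr      = 4.3    # Chart 1: all parks, 3-year true
sd_luck_3yr      = 2.1    # Chart 1: luck in a 3-year average
sd_same_park_obs = 3.1    # Chart 2: same park across seasons, observed

sd_within_3yr = sqrt(sd_same_park_obs**2 - sd_luck_3yr**2)   # about 2.3 (Chart 2)
sd_between    = sqrt(sd_true_3yr**2 - sd_within_3yr**2)      # about 3.6 (Chart 3)
sd_within_1yr = sd_within_3yr / sqrt(3)                      # about 1.3 (Chart 4)

sd_signal = sqrt(sd_between**2 + sd_within_1yr**2)           # the "true values" part
sd_noise  = sqrt(2 * sd_within_1yr**2 + sd_luck_3yr**2)      # the "random values" part
print((sd_noise / sd_observed_3yr)**2)                       # about 0.34 -- regress 34 percent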

We can do another attempt, with a slightly different assumption. Back in Chart 2, when we figured SD(3-year observed, same park) was 3.1 ... that estimate was based on parks with at least ten years of data. If I reduce the requirement to three years of data, SD(3-year observed, same park) goes up to 3.2, and the final result is ... 36 percent regression to the mean.

So there it is.  I think this method is valid, but I'm not completely sure. The 95% confidence interval for the true value seems to be wide -- regression to the mean between 28 percent and 49 percent -- so it might just be coincidence that this calculation matches. 

If you see a problem, let me know.


