Tuesday, March 13, 2012

An economist predicts the Olympic medal standings: summer edition

Dan Johnson, a professor at Colorado College, did a regression to predict medal wins at the Olympics for a given country. There are articles about it in various newspapers.

I wrote about this a couple of years ago, when he did the same thing for the Winter Olympics. At the time, I was a little skeptical. Nothing much has changed.

Except ... this time we have the equation! The National Post was kind enough to provide it in the print edition.

Here it is:

Medals = 0.33 +
.00271 * total medals available +
.02 * income +
.0000549 * income squared +
.024 * population +
19.02 * population squared +
11.91 if it's the home nation +
3.85 if it'll be the next home nation +
3.35 if it was the home nation last time or the time before +
0.29 if it borders the current home nation +
coefficients for dummy variables for nation
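
For anyone who wants to plug numbers in, here is a rough Python sketch of the equation as printed. The coefficients are the ones listed above. The function name, the argument names, and the `nation_effect` argument (a stand-in for the nation dummy coefficient, which wasn't published) are my own placeholders. The article doesn't give units for income or population, so the caller has to supply them in whatever units the original regression used.

```python
def predicted_medals(total_medals_available, income, population,
                     is_home=False, is_next_home=False,
                     was_recent_home=False, borders_home=False,
                     nation_effect=0.0):
    """Evaluate the printed equation for one nation at one Olympics.

    nation_effect is a placeholder for the unpublished nation dummy
    coefficient. Units of income and population are not given in the
    source, so they are an assumption left to the caller.
    """
    return (
        0.33
        + 0.00271 * total_medals_available
        + 0.02 * income
        + 0.0000549 * income ** 2
        + 0.024 * population
        + 19.02 * population ** 2
        + (11.91 if is_home else 0.0)
        + (3.85 if is_next_home else 0.0)
        + (3.35 if was_recent_home else 0.0)
        + (0.29 if borders_home else 0.0)
        + nation_effect
    )


# The home-nation bump is the same no matter what else is plugged in,
# because every term is simply added on.
home_bump = (predicted_medals(900, 0, 0, is_home=True)
             - predicted_medals(900, 0, 0))
print(round(home_bump, 2))
```

The 900 in the usage lines is a made-up figure for total medals available, used only to show that the home bump doesn't depend on the other inputs.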

Apparently, the dummy variables are new -- they represent a

... "'cultural specific factor' to account for things that are hard to quantify, like the prevalence of doping or the culture of competition, and also to correct the historical under-predictions for countries such as China and Australia."

The dummy variables should make the predictions even more accurate.

BTW, an interesting thing about the regression is that if you have a dummy variable for each country, your accuracy should be very similar even if you don't include the income and population variables! Those variables are nice to have because they illustrate the effects of income and population, but if you leave them out, the dummy variables will pick up the slack and adjust themselves for the income and population of each country.
(UPDATE: it won't be exactly the same, just close. See the bottom of the post for an explanation.)

(The Post story also says that Johnson updated his previous regression to remove variables for climate and politics. By the same token, I think he'd get the same results if he left them in.)

So I think that if you leave out income and population, you'll get the same coefficients for everything except the dummy variables:

Medals = 0.33 +
.00271 * total medals available +
11.91 if it's the home nation +
3.85 if it'll be the next home nation +
3.35 if it was the home nation last time or the time before +
0.29 if it borders the current home nation +
new coefficients for dummy variables for nation

I don't think this is the best way to account for the variables still included. That's because (as I think I said in the other post) all the factors are linear. I'm not sure that's right: it implies that if the USA is the home team, it gains 12 medals on top of the 100 or so it usually wins, but if Canada is the home team, it also gains 12 medals (to go with its 15 or so).

The "12 extra medals" is, roughly speaking, the average. It should work for a roughly average country. Since the UK isn't too far from average, the formula should probably work pretty well this year.
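
To put numbers on that, here is a quick Python sketch of what a flat bonus means in relative terms. The 11.91 is the home-nation coefficient from the equation; the 100 and 15 are the rough "usual" medal counts mentioned above, not actual data, and the function name is just something I made up for illustration.

```python
HOME_BONUS = 11.91  # home-nation coefficient from the printed equation


def relative_home_gain(usual_medals):
    """Return the flat home bonus as a percentage of a nation's usual haul."""
    return 100.0 * HOME_BONUS / usual_medals


# Rough usual medal counts taken from the text above (illustrative only).
for nation, usual in [("USA", 100), ("Canada", 15)]:
    print(f"{nation}: +{relative_home_gain(usual):.0f}% as home nation")
```

Under those rough figures, the same flat bonus is a modest bump for the USA but close to doubling for Canada, which is why a purely additive home effect looks suspect for countries far from average.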


UPDATE: it now occurs to me that that's not exactly right. I was assuming that the population and income were constants for each country. They're not -- they vary over time. So, the regression coefficients with population/income will be *close* to the ones without, but won't be exactly equal.

(I should have also realized that the regression wouldn't work if population and income were constants, because you'd get multicollinearity.)
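
Here is a small Python sketch of that multicollinearity point, using a toy design matrix with two countries and two Games each. The population figures are invented purely for illustration, and `matrix_rank` is a hypothetical helper written here with exact fractions so that no outside library is needed. When population is constant within each country, the population column is an exact combination of the country dummies, so the matrix loses a rank and the regression can't separate the two effects.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column from every row below the pivot row.
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Columns: dummy for country A, dummy for country B, population (made up).
constant_pop = [[1, 0, 10], [1, 0, 10], [0, 1, 50], [0, 1, 50]]
varying_pop = [[1, 0, 10], [1, 0, 11], [0, 1, 50], [0, 1, 52]]

print(matrix_rank(constant_pop))  # fewer than 3 columns' worth: singular
print(matrix_rank(varying_pop))   # all 3 columns are independent
```

With the constant populations the rank comes out to 2 for 3 columns, and with the time-varying ones it comes out to 3, which matches the idea that the real regression only works because population and income change from one Games to the next.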



At Wednesday, March 14, 2012 3:35:00 PM, Anonymous Patrick Kilgo said...

Hey Phil, great article as usual. A couple of thoughts, mostly nit-picky, about the data analysis:

1) This is best modeled using Poisson regression since the outcome (number of medals) is a count and there are likely a lot of zeros. From the way the reported model is parameterized I doubt Poisson regression was used. It's probably not a huge deal but theoretically the estimates would be better under the Poisson structure.

2) The independence assumption will be violated unless he clustered on country. That further complicates the model and requires a generalized estimating equations approach (or something similar).

3) I don't think that using the same values of population and income across years necessarily results in collinearity. By doing so he would be treating those variables as fixed effects. Collinearity has more to do with the correlation of the predictors (in this case, across all observations). Also, collinearity is not as big a deal in a predictive model like this one (as opposed to an associative model where it is a huge deal).

4) The "dummy variable for each nation" approach would likely lead to poor future predictions and little reproducibility because the model would be over-parameterized. It would also chew up a ton of degrees of freedom.

5) The "12 extra medals" intercept could be adjusted based on income, population, etc if there was meaningful interaction.

Hope you are well... Pat Kilgo

At Wednesday, March 14, 2012 4:38:00 PM, Blogger Phil Birnbaum said...

Hi, Pat,

2. Won't that only affect the confidence intervals, and not the predictions?

3. If all the rows with China dummy variables had the same population, and all the Russia dummy variables had the same population, and so on ... you'd get a singular matrix, wouldn't you? It would be impossible to untangle the population effects from the country effects.

4. Agreed. You'd have to hope that you were getting more signal than noise, doing that.

At Wednesday, March 14, 2012 5:07:00 PM, Anonymous Patrick Kilgo said...

2) Properly clustering would affect the standard errors of the estimates (making them larger) which in turn would make the CIs larger and statistical significance less likely. If significance is less likely then his model selection choices might have been different. Variables would look more significant than they really are.

3) I would have to think about that one some. My gut says it would be OK still, something akin to an ANOVA with fixed levels and a random outcome. I could certainly be wrong but I think it would be fine.


At Sunday, April 01, 2012 11:34:00 PM, Anonymous Jerome Solanum said...

I was going to make a comment similar to Patrick's, but he basically said what I was going to. I think the point to emphasize is about the country-specific dummies... how is that not over-parameterization?

