## Wednesday, October 11, 2006

### Psychocinemetrics

Netflix, the big rent-movies-by-mail company, has a website that tries to guess which movies you’ll like based on your evaluations of movies you’ve already seen. They have now announced that they will award a million dollars to anyone who can improve their prediction algorithm by 10%. It’s open to anyone, including sabermetricians, anywhere in the free world except Québec.

Here’s how it works. When you register at the site, you download a “training” database of 100,000,000 assorted movie ratings (1 to 5 stars, integers only) from 480,000 different customers. You then devise the best algorithm you can to predict some of those ratings from the rest. When you’re happy with your algorithm, you run it on a “final exam” database of 2,800,000 customer-movie pairs drawn from those same customers (with the actual ratings withheld). You send Netflix your 2,800,000 estimates, they compare them to the “real” ratings to calculate your accuracy, and they post your results on the site.

If you beat Netflix’s standard error by more than 10%, you win a million bucks. If you can’t beat 10%, but you’re still the best for the year, you get \$50,000.
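For the record, the yardstick Netflix uses is root-mean-squared error (RMSE), and “beating by 10%” means an RMSE 10% lower than their own Cinematch system’s. A minimal sketch of how that score is computed (the ratings here are made-up toy numbers, not Netflix data):

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error over a set of rating predictions."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Toy example: four predicted ratings vs. the customers' real ones.
guesses = [3.4, 4.1, 2.0, 4.8]
truth   = [3,   4,   3,   5]
print(round(rmse(guesses, truth), 4))
```

Note that squaring the errors means one badly blown prediction hurts you much more than several near-misses.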

My first reaction is that reducing the standard error by 10% could very easily be impossible. I’m pretty sure it’s impossible to chop 10% off the standard error of, say, Runs Created, because there’s a minimum level of inherent noise in the data, and Runs Created is already within 10% of it. You could figure this out mathematically – take a typical batting line, scramble its at-bats around randomly, and see how many runs score. Repeat a couple of hundred thousand times. Find the standard error of all your results. That’s the minimum intrinsic error of any stat that’s based on a batting line. If that’s more than 90% of the standard error of Runs Created, well, you can’t win the million dollars.
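The scrambling experiment above can be sketched in code. This is only a toy: the batting line is invented, and the base-running model is deliberately crude (every runner advances exactly as far as the batter, and walks are treated like singles), so the numbers mean nothing beyond illustrating the method of estimating the intrinsic noise floor.

```python
import random
import statistics

random.seed(0)

# Hypothetical season batting line -- counts invented for illustration.
LINE = {"out": 400, "walk": 70, "single": 100, "double": 30,
        "triple": 5, "homer": 20}
BASES_GAINED = {"walk": 1, "single": 1, "double": 2, "triple": 3, "homer": 4}

def score_sequence(events):
    """Play a shuffled season through three-out innings with a crude
    model where every runner advances as many bases as the batter."""
    runs, outs, bases = 0, 0, [False, False, False]
    for ev in events:
        if ev == "out":
            outs += 1
            if outs == 3:
                outs, bases = 0, [False, False, False]
            continue
        n = BASES_GAINED[ev]
        new = [False, False, False]
        for i, occupied in enumerate(bases):   # i = 0 is first base
            if occupied:
                if i + n >= 3:
                    runs += 1                  # runner crosses home
                else:
                    new[i + n] = True
        if n >= 4:
            runs += 1                          # home run: batter scores too
        else:
            new[n - 1] = True                  # batter stops at base n
        bases = new
    return runs

# Scramble the same batting line a couple of thousand times and look at
# the spread of run totals -- that spread is the irreducible noise.
events = [ev for ev, count in LINE.items() for _ in range(count)]
totals = []
for _ in range(2000):
    random.shuffle(events)
    totals.append(score_sequence(events))

print(statistics.mean(totals), statistics.pstdev(totals))
```

The standard deviation printed at the end is the floor: no estimator fed only the batting line can beat it, because the batting line itself doesn’t determine the run total.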

To say that another way, suppose a team matching Derek Jeter’s batting line scored 80 runs half the time, and 90 runs the other half. In that case, the best you could do would be to guess 85 runs, for a standard error of 5. If the rules of baseball are flaky enough to evenly split real life between 80 and 90, that’s your limit of accuracy, period. No matter how much money you were offered, it would be impossible to do any better.
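To see that 85 really is the best possible guess in this hypothetical, you can just check every candidate guess against the 50/50 split directly:

```python
# Two equally likely outcomes, 80 and 90 runs; the error of guessing g
# every time is the RMSE over the two cases.
def rmse_of_guess(g):
    return (((g - 80) ** 2 + (g - 90) ** 2) / 2) ** 0.5

best = min(range(75, 96), key=rmse_of_guess)
print(best, rmse_of_guess(best))   # 85 5.0
```

Any guess above or below 85 trades a small gain on one outcome for a larger loss on the other, so 5 runs is the hard floor.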

And I think estimators like Runs Created are already butting up against that limit. I’m pretty sure that 10% is out of the question.

Can you do the same kind of calculation for Netflix data? Not the same way – there are decent simulations of baseball, but no reliable simulation of the human brain’s response to movies. But by analyzing the hundred million ratings and the statistical properties of the data, you might be able to come up with a persuasive argument that the million-dollar threshold is unreachable. I don’t think they’d pay you for the argument, but I’d certainly buy you a beer.

Of course, I might be wrong about this. Someone has already taken 1% off, and the contest has been running for only a week …

(Thanks to Freakonomics for the pointer.)

At Wednesday, October 11, 2006 2:04:00 PM,  JavaGeek said...

I took a course in regressions at university, and I'll always remember the great example of why studies like these can mislead: if you throw enough darts, you'll eventually hit a bullseye...

In other words, if Netflix gets ~1,000,000,000 entries, there will likely be one that improves by 10% on the sample but is no better than the original. But they'll have gotten a lot of advertising (likely worth over \$1,000,000).

At Friday, October 13, 2006 11:52:00 AM,  Arb said...

javageek's point is different from Phil's, and it's worth looking at the difference.

Phil points out that you can't expect "in the long run" to do much better than the standard offensive measures in baseball. However, as javageek notes, it *is* possible to do much better on a particular sample.

So, it could be possible for someone to claim the million-dollar prize Netflix is offering -- but to do so, they might wind up overfitting their model. An overfit model picks up chance configurations of relationships in the sample (or between the training sample and the testing sample) that are just dumb luck, not something that will hold up in the long haul. It's just like run estimation: on a particular sample, you can sometimes construct really wacky-looking run estimators that work well on that sample, but don't represent true "fundamentals."
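Javageek's darts and the overfitting point are two faces of the same coin, and a toy Monte Carlo makes it concrete. Everything below is invented (a pretend 50-rating quiz set, nothing to do with Netflix's real data): an honest entrant predicts the average rating every time, while thousands of skill-free entrants just jitter that average at random -- and on this one sample, the luckiest dart-thrower still wins.

```python
import random

random.seed(1)

quiz = [random.randint(1, 5) for _ in range(50)]    # pretend quiz-set ratings
mean = sum(quiz) / len(quiz)

def rmse(preds):
    return (sum((p - a) ** 2 for p, a in zip(preds, quiz)) / len(quiz)) ** 0.5

baseline = rmse([mean] * len(quiz))    # honest "predict the average" entry

# 10,000 entrants who jitter the average at random -- zero real skill.
best = min(
    rmse([mean + random.uniform(-1, 1) for _ in quiz])
    for _ in range(10_000)
)

# On this particular sample the luckiest jitterer beats the honest
# baseline, even though in expectation every jittered entry is worse.
print(baseline, best)
```

The lucky winner's jitter happened to line up with the residuals of this one sample; on a fresh quiz set it would do worse than the plain average, which is exactly the danger of judging an algorithm by one test set.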