Sunday, October 29, 2006

How fast did the market learn from "Moneyball?"

In “Moneyball,” Michael Lewis told the story of how the Oakland A’s were able to succeed on a small budget by realizing that undervalued talent could be picked up cheap. Specifically, GM Billy Beane realized that the market was undervaluing on-base percentage, and acquired hitters who took lots of walks at salaries that undervalued their ability to contribute to victory.

But once Moneyball was released in 2003, every GM in baseball learned Beane’s "secret." Furthermore, a couple of his staff members left that year for GM jobs with other teams. In theory, this should have increased competition for high-OBP players, and raised their salaries to the point where the market inefficiency disappeared.

Did that really happen? In “An Economic Evaluation of the Moneyball Hypothesis,” economists Jahn K. Hakes and Raymond D. Sauer say that yes, it did.

Hakes and Sauer ran regressions for each year from 2000 to 2004, trying to predict (the logarithm of) players’ salary from several other variables: on-base percentage, slugging percentage, whether they were free agents, and so on. They found that in each year from 2000 to 2003, slugging percentage contributed more to salary than on-base percentage. But, in 2004, the first year after Lewis’s book, the coefficients were reversed – on-base percentage was now more highly valued than slugging percentage.

"This diffusion of statistical knowledge," they write, " … was apparently sufficient to correct the mispricing of skill."

But I’m not really sure about that.

The main reason is that, taking the data at face value, the 2004 numbers show not the market correcting, but the market overcorrecting.

For 2004, the study shows 100 points of on-base percentage worth 44% more salary, and 100 points of slugging worth only 24% more salary.

At first, that looks like confirmation: a ratio of 44:24 (1.8) is almost exactly the “correct” ratio of the two values (for instance, see the last two studies here). But the problem is the “inertia” effects that Hakes and Sauer mention: the market can’t react to a player until his long-term contract is up.

Suppose only half the players in 2004 were signed that year. The other half would have been signed at something around the old, 2002, ratio, which was a 14% increase for OBP, but a 23% increase for SLG.

So half the players are 14:23, and the average of both halves is 44:24. That means the other half must be about 74:25. That is, the players signed in 2004 would have had their on-base valued at three times their slugging, when the correct ratio is only around two. Not only did GMs learn from “Moneyball,” they overlearned. The market is just as inefficient, but in the other direction!
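
The arithmetic behind that 74:25 figure can be checked with a short sketch. It assumes, as above, that exactly half the players were signed at the old coefficients and that the published 2004 coefficient is a simple average of the two halves. The function name is mine, not the study's.

```python
def implied_new_coefficient(overall, old, share_new=0.5):
    """Solve overall = share_new * new + (1 - share_new) * old for new."""
    return (overall - (1 - share_new) * old) / share_new

# Percent salary increase per 100 points, as quoted in this post.
obp_new = implied_new_coefficient(overall=44, old=14)
slg_new = implied_new_coefficient(overall=24, old=23)

print(obp_new, slg_new)              # implied coefficients for newly signed players
print(round(obp_new / slg_new, 2))   # implied OBP:SLG ratio for that group
```

Changing `share_new` shows how sensitive the implied ratio is to the guess about how many contracts were actually new.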

And only free-agent salaries are set in a competitive market. The rest are set arbitrarily by the teams, or by arbitrators. If those salaries didn’t adjust instantly, then the market ratio for 2004 is even higher than 3:1. In the extreme case, if you assume that only free-agent salaries were affected by the new knowledge, and only half of players were free agents, the ratio would be higher, perhaps something like 6:1.

Could this have happened, that teams suddenly way overcompensated for their past errors? I suppose so. But the confidence intervals for the estimates are so wide that the difference between the 14% OBP increase for 2003 and the 44% increase for 2004 isn’t even statistically significant – it’s only about one standard deviation. So we could just be looking at random chance.

Also, the regression included free agents, arbitration-eligible players, and young players whose salaries are pretty much set by management. This forces the regression to make the implicit assumption that all three groups will be rewarded in the same ratios. For instance, if free agents who slug 100 points higher increase their salaries by 24%, this assumes that even rookies with no negotiating power will get that same 24%. As commenter Guy wrote here, this isn’t necessarily true, and a case could be made that these younger players should have been left out entirely.

Suppose that teams knew the importance of OBP all along, but players didn’t. Then it would make sense that it would appear to be “undervalued” among players whose salaries aren’t set by the market. Teams are paying players the minimum they would accept, and if those high-OBP players don’t know to ask what they’re worth, they would appear to be underpaid. That could conceivably account for the study’s entire effect, since the study combines market-determined salaries equally with non-market salaries.

So the bottom line is this: if you consider the 2004 data at face value, the conclusion is that GMs overcompensated for the findings in “Moneyball” and paid far more than intrinsic value for OBP. But because of the way the study is structured, there is reason to be skeptical of this result.

In any case, the study does seem to suggest something about salaries in 2004 was very different from the four years preceding. What is that something? I don’t think there’s enough evidence there yet to embrace the authors’ conclusions … but if you restricted the study to newly-signed free agents, and added 2005 and 2006, I’d sure be interested in seeing what happens.

Saturday, October 28, 2006

Are lottery retailers stealing jackpots from customers?

Making the news in Ontario this week is a CBC report finding that people who sell lottery tickets are winning jackpots much more frequently than you’d expect. The CBC estimates that lottery clerks should have won 57 jackpots, but they won "nearly 200." According to statistician Jeffrey Rosenthal, the probability of this happening by chance "is about one chance in a trillion, trillion, trillion, trillion."

The suspicion is that when a winner brings his ticket in to be checked, the operator tells him it’s a loser, pretends to throw it away, and cashes it in himself later. Indeed, there was a lawsuit to that effect by one customer a few years ago – the retailer settled out of court for $150,000 – so it’s not unreasonable to suspect these things happen.

Taking the reported numbers at face value, the implication is that lottery retailers rip off unsuspecting winners twice as often as they actually win legitimately. That doesn’t strike me as unreasonable – if the average operator verifies 1000 times as many tickets as he buys himself, only 1 in 500 retailers has to be dishonest for these numbers to happen.
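
As a rough check of that arithmetic, here is a sketch under the same assumptions: each retailer checks 1,000 customer tickets for every ticket he buys himself, and a dishonest retailer steals every winner he checks. Both are illustrative assumptions, not figures from the CBC report.

```python
checked_per_own_ticket = 1000   # assumed: customer tickets verified per ticket bought
dishonest_share = 1 / 500       # assumed: fraction of retailers who steal winners

# Stolen jackpots per legitimate retailer jackpot. Every ticket is equally
# likely to win, so the ratio of wins is just the ratio of tickets involved.
stolen_per_legit = checked_per_own_ticket * dishonest_share
print(stolen_per_legit)

# The CBC figures: 57 expected wins, "nearly 200" actual (taken here as 200).
expected, actual = 57, 200
excess_per_legit = (actual - expected) / expected
print(round(excess_per_legit, 2))
```

With "nearly 200" taken as 200, the excess works out to roughly two and a half per legitimate win, in the same ballpark as the "twice as often" figure.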

The Ontario Lottery and Gaming Corporation (OLGC), the government entity that runs the lotteries, is digging in its heels and insisting that security is fine. They say they investigate all large jackpots.

But if the guys behind the counter are ripping off jackpots, shouldn’t it be easy to tell, at least in some cases? After all, the OLGC has a record of every lottery transaction made.

-- sometimes, when jackpots are very large, the lineups are huge at the busiest terminals. If an operator cashes a ticket that was bought in the middle of a continuous rush, wouldn’t it be implausible that he kept everybody in line waiting just because he had a whim to buy his own ticket at that moment?

-- a lot of people play the same numbers every week. The OLGC has the records to figure out if that’s the case with those winning numbers. If so, the winner can be asked to verify how often he buys those numbers, and whether he ever bought them after the win. Only the real buyer would be able to answer correctly. Moreover, if those same numbers were occasionally bought at other places, that’s a pretty good indication something’s wrong. Why would the owner go out of his way to buy a ticket elsewhere, when he can get it immediately (and at a discount) at his own machine?

-- it’s not always the same person – owner or employee – running the machine. Was the claimant actually working the day and hour the ticket was sold? Does he remember what time he bought the ticket?

-- Stores often display signs that say "$500,000 winning ticket sold here." Does the operator find out he sold the winning ticket immediately, even before the ticket is verified? If so, find out when the ticket was first checked. If it was, say, days after the draw, that would be suspicious. After all, if you had a ticket, and you found out that a $250,000 winner came from your store, wouldn’t you pull your ticket out right away to check it?

-- in the past, has the owner regularly cashed winning tickets before or after his store was closed? If so, those are probably his own tickets. Was this jackpot ticket also checked when the store was closed? If not, why the difference? And was this winning ticket in line with the pattern of other owner tickets? For instance, if the owner usually buys a $10 Quick Pick, but the jackpot was won on a $5 ticket where the customer chose his own numbers, that’s something of a red flag.

-- many people pay for their tickets with debit cards. Was there a debit card transaction at the time the winning ticket was sold? Was it for at least the cost of the tickets printed? If so, whose debit card was it?

There are probably a lot more ways a cheater might be caught – these are just off the top of my head. It sure does seem like there would be a lot of opportunity for an OLGC investigation to discover suspicious circumstances.

Probably, the government won’t want to release details of how the investigations actually work. But the Ontario ombudsman is investigating, and it’ll be interesting to see how this plays out.

Right now, this is more Freakonomics than sabermetrics ... but we’ll see if we eventually get enough information to draw our own statistical conclusions.

Not that I don’t trust the government or anything.

Friday, October 27, 2006

WSJ on baseball strategy

To bunt or not to bunt? The Wall Street Journal's Carl Bialik goes to the right sources for answers ... Tango, Lichtman, Levitt, Fox, and Woolner.

I shouldn't be so happy about this ... it's just good journalism, and, after all, Mr. Bialik is "The Numbers Guy." But still, it's always encouraging to see an article in the MSM that gets it.

Thursday, October 26, 2006

New evidence that cold weather affects hitting

Chris Constanzio breaks offense down by game-time temperature, and finds some very interesting things. Walks and strikeouts are higher in cold weather, but home runs are down, even after adjusting for fewer balls in play.

Most astounding to me was that batting average on balls in play -- the DIPS measure -- is a full 20 points lower in very cold weather, as compared to very hot weather. Twenty points is huge, isn’t it?

Constanzio writes, "The most straightforward explanation for these findings is that the ball simply does not carry very well in cold weather. Batted baseballs are slowed down by air resistance in the heavy, dense air of cool April and October nights."

Thanks to “The Book” blog for the pointer. There’s some discussion there … Guy asks if day/night breakdowns have anything to do with it, which is a very good question.

Great little study!

Sunday, October 22, 2006

Wall Street Journal on managers

A new Wall Street Journal article extensively quotes Chris Jaffe's recent manager study.

Strangely, reporter Russell Adams lists outgoing managers in order of Chris's rankings, but doesn't actually give the numbers.

(Update: Chris has posted part III of his series here.)

Saturday, October 21, 2006

Study: error rates and official scorer bias both declining

The latest issue of JQAS came out this week, and its five articles include one paper on baseball.

Simply titled “Baseball Errors,” it’s a nice research study on the history of error rates. It’s written in a more conversational style than most other academic papers, and its citation list includes a number of mainstream articles and interviews.

Authors David E. Kalist and Stephen J. Spurr start out with a discussion of official scorers, who, of course, are the ones who decide whether a ball is a hit or an error. They include a bit of history, including a discussion on how much scorers have been paid over the years (this year, $130 per game).

Kalist and Spurr are economists, so their interest turns to whether scorers may be biased. One of their interesting counterintuitive observations is that if the scorer does indeed favor the home side, he should call more errors on the home team than on the visiting team. That’s because an error on the visiting team hurts the home batter’s stats, depriving him of a base hit. But when the home team misplays a ball, either the home pitcher’s record is hurt (via a hit allowed and potential earned runs), or the home fielder’s record is hurt (via fielding percentage). There is no obvious reason for the scorer to prefer the hitter to the pitcher, and so there’s no incentive to call fewer errors on the home team.

The authors also quote LA Times writer Bill Plaschke “that scorers are allowed to drink on the job, fraternize with players, and play in rotisserie leagues in which a fictitious ‘team’ can win thousands of dollars.” And, they note cases in which scorers made controversial calls that kept alive a hitting streak or a no-error streak. “While we certainly do not claim that our survey … is complete,” they write, “all the articles we have seen involved calls made in favor of a player on the home team.”

So, are scorers actually biased towards the home team? In the paper’s second regression, using Retrosheet data for games from 1969-2005, Kalist and Spurr find that from 1969-1976, scorers called 4.2% more errors against the home team. But from 1977-2005, the difference between home and visitors was essentially zero. Is that the result of a decreasing trend? We don’t know. The authors don’t give us year by year data for this variable, just the 1976 cutoff.

The rest of the regression doesn’t tell us a whole lot we didn’t already know. Error rates have steadily fallen over the last 36 years, fewer errors are committed on turf, expansion teams commit more errors than established teams, and the more balls in play, the more errors. What was interesting is that more errors were committed in April than in other months (5% more than in May, for instance), and the difference is statistically significant.

The paper’s other regression predicts annual error rates, instead of per-game rates. The authors divide baseball history into decades, and again find, unsurprisingly, that error rates have steadily declined. One finding that was surprising is that during World War II, error rates continued to decline from pre-war levels (even though conventional wisdom is that the caliber of play declined in almost all other respects).

The authors also tried to determine a relationship between error rates and speed. However, they used stolen base rates as a proxy for speed, and, over periods as long as a decade, SBs are probably much more related to levels of offense than to leaguewide speed. Managers play for a single run more frequently when runs are scarce and games are close – so you’d expect more SBs in the sixties because it was a low-scoring era, rather than just because those players were faster than players in the 70s. (And, of course, more steals may simply mean the catchers’ arms aren’t as good.)

However, the study does find a very significant relationship between stolen base rates and errors, in a positive direction. So I may be wrong. Or there might be other factors operating; for instance, with steals being more important, managers started filling their rosters with speedier players.

Also, the National League sported error rates 2% higher than in the American League – and the study starts in the 19th century, so the difference can’t be just the DH. The authors discuss this a bit (although you wish they had included a DH dummy variable). They found that the NL error rate was much higher (they don’t say by how much, but p=.0006) than the AL rate from 1960-72. And they note that fielders may be better in the AL, since old sluggers can DH instead of playing defense.

One feature of the article I liked is that not only do the authors give a brief explanation of what “multiple regression on the log of the error rate” means, but they also illustrate how to interpret the results mathematically. It’s a small courtesy that acknowledges that JQAS has many layman readers, and helps make this study one that should be of interest to many different kinds of baseball fans, not just the sabermetrics crowd.

Friday, October 20, 2006

Edmonton's hot hand over Saskatchewan ends after 29 seasons

In 2006, for the first time in 30 years, the CFL’s Saskatchewan Roughriders will finish with a better record than the Edmonton Eskimos.

In this article (link may expire), Brian Grest, a math teacher from the Saskatoon area, quotes the odds against 29 straight 50-50 underachievings at 536,870,912 to 1.

Actually, the correct odds are 536,870,911 to 1. But I bet that error is the reporter’s fault, not Grest’s. (Also, the reporter calls this “nearly half-a-billion to one” -- the word “nearly” seems kind of inappropriate here.)

A quick check of the CFL website shows that the two teams actually tied in the standings twice in that 29-year span. In both cases, the CFL website shows Edmonton ahead. The article says the Eskimos got the nod in 2004 based on points differential, which was the CFL tiebreaker rule. However, in 1988, it was Saskatchewan that had the better differential. Maybe the tiebreaker rule was different then, but I’m too lazy to try to find out.

Mr. Grest seems like he would have thought of ties, and I’m betting he got it right. But if we count ties as just ties, and they happen 5% of the time, the probability of 29 straight wins or ties is “only” 1 in 130,430,813. If ties happen twice in 29 years on average, the chance is 1 in 77,610,895.
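
For anyone who wants to reproduce those figures, here is a minimal sketch. It assumes seasons are independent and that the non-tied seasons split 50-50; the function names are just for illustration.

```python
def one_in(p_season, seasons=29):
    """Return N such that the streak has probability 1 in N."""
    return 1 / p_season ** seasons

def p_ahead_or_tied(p_tie):
    """Chance Edmonton finishes ahead or level, with non-tied seasons split 50-50."""
    return (1 - p_tie) / 2 + p_tie

no_ties = one_in(0.5)                        # 2**29
five_pct = one_in(p_ahead_or_tied(0.05))     # ties 5% of the time
two_in_29 = one_in(p_ahead_or_tied(2 / 29))  # two ties per 29 seasons

print(round(no_ties), round(five_pct), round(two_in_29))
```

A probability of 1 in N corresponds to odds of N - 1 to 1, which is where the 536,870,911 comes from.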

Finally, I think we have a new winner of the “reporter attributing every mathematical fact to someone else, just in case” award:

“Grest calculated that, all things being equal, Edmonton has a one in two chance of finishing ahead of Saskatchewan in any given season.”

Not to mention that the word “calculated” also seems kind of inappropriate here.

By the way, I don’t know whether 29 consecutive years of domination is exceedingly rare, or just plain rare. Did the Yankees ever finish ahead of anyone for 29 consecutive years? Maybe the A’s?

Thursday, October 19, 2006

Another reporter doesn't understand sabermetrics

This one gets more facts wrong than usual, even for articles of this type.

Monday, October 16, 2006

On predicting log(salary)

It’s common in the economics literature to run studies trying to predict how various factors affect a person’s salary. When those studies run regressions, they don’t try to predict salary – they try to predict the logarithm of the salary.

The reason you’d use log(salary) is that you expect that an arithmetic (additive) change in the inputs produces a geometric (multiplicative) change in outputs. For instance, leaving your money in a 5% savings account for one extra year (adding 1 to the number of years) increases your balance by five percent (multiplying by 1.05). Because of that, if you were to run a regression to predict log(balance) based on years, you’d get a perfect correlation of 1.00. But if you ran the regression on just balance, instead of log(balance), it wouldn’t work as well.
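
Here is a small sketch of the savings-account example. The $1,000 starting deposit and the 40-year span are assumptions I picked for illustration; the correlation is computed by hand so nothing beyond the standard library is needed.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

years = list(range(0, 41))
balance = [1000 * 1.05 ** t for t in years]   # assumed $1,000 starting deposit
log_balance = [math.log(b) for b in balance]

r_log = pearson(years, log_balance)  # a straight line in years, so essentially 1
r_raw = pearson(years, balance)      # a curve, so somewhat below 1
print(round(r_log, 6), round(r_raw, 4))
```

The raw-balance correlation is still high, which is part of the point: a decent-looking fit doesn't tell you that you picked the right functional form.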

On the other hand, the relationship between wages and hours worked is just the opposite. If you work an extra day (at, say, $10 per hour), you’re going to increase your wages by a fixed $80. It’s additive, not multiplicative – add one day, add $80. In this case, using just plain “wages” would give the perfect correlation, and using log(wages) would be the less accurate method.

So sometimes “salary” is the better choice, and sometimes “log(salary)” is the better choice. Which to choose depends on whether the relationship is additive or multiplicative. Does adding one X change salary by a fixed number? If so, don’t use the log. Does adding one X change salary by a certain proportion? If so, then the logarithm is necessary.

But why does it have to be one or the other? Isn’t there actually an infinite number of possible relationships between salary and other factors?

Here’s a hypothetical example. Suppose you sell pay-per-use cell phones to high school kids, and you get a commission every time two of your customers call each other. (And assume that every user is equally likely to call any other user.) In that case, the right regression is not log(salary), and it’s not just salary. A better choice is to use the square root of salary. That’s because, as the number of users gets large, the number of calls starts to become almost proportional to the square of the number of users.
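
A quick sketch of why the square root is the natural transformation here. It assumes commission is proportional to the number of distinct pairs of users, which is my reading of the example rather than anything more precise.

```python
import math

def sqrt_pairs_per_user(users):
    """sqrt(number of possible calling pairs) divided by number of users."""
    pairs = users * (users - 1) // 2   # distinct pairs who could call each other
    return math.sqrt(pairs) / users

for users in (10, 100, 1000, 10000):
    print(users, round(sqrt_pairs_per_user(users), 4))
```

The ratio settles toward a constant (one over the square root of two) as the user count grows, which is what "almost proportional to the square" means in practice.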

And what do you do if some of the independent variables have an arithmetic relationship but others have a geometric one? Suppose you want to figure the impact of hiring assistants for a car salesman. Some assistants scout for clients and bring in 5 new prospects a day. This increases the salesman’s commission arithmetically. Other assistants help the salesman close, and each assistant increases the probability of closing by 1%. This increases the salesman’s commission geometrically. Which means that the effect of assistants on salary is neither arithmetic nor geometric.

So what should you use, salary or log(salary)? My guess is that researchers would try both and see which works better, and use that one. This seems a completely reasonable way to decide.

But whether you use salary or log(salary), you’re probably going to be somewhat wrong. Baseball studies that I’ve seen all use log(salary) – but it’s very unlikely that any performance measure actually does work exactly geometrically, at all points on the salary scale. Suppose hitting an extra 5 home runs increases pay by 10%. Does it necessarily follow that hitting an extra 10 home runs should increase pay by 21% (10% on top of 10% compounded)? Or that hitting an extra 15 home runs should increase pay by exactly 33.1%? Who says it has to work that way? Shouldn’t you have to show it?
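
To make the compounding explicit, here is a sketch of what a log(salary) model implies, using the hypothetical 10%-per-5-home-runs figure from above. The function name is mine.

```python
rate_per_5_hr = 0.10  # this post's hypothetical: 5 extra home runs -> +10% pay

def pay_increase(extra_hr):
    """Percent pay increase implied by a log-linear (geometric) model."""
    return ((1 + rate_per_5_hr) ** (extra_hr / 5) - 1) * 100

for hr in (5, 10, 15):
    print(hr, round(pay_increase(hr), 1))
```

An additive model would give 10, 20 and 30 percent instead. The two diverge faster the further you move from the middle of the scale.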

What you can do is show that a regression appears to give reasonable results – that Barry Bonds doesn’t come out to be worth a billion dollars a year. But that’s far from proving that log(salary) is as accurate as it needs to be.

And, of course, there are many different performance measures, all of which differ in some way. Even the most accepted measurements can have significant differences. If log(salary) is proportional to linear weights batting runs, it cannot also be proportional to VORP. If log(salary) is proportional to VORP, then it cannot be proportional to Win Shares. If log(salary) is proportional to Win Shares, it cannot also be proportional to RC27. It can be roughly proportional to all of them at the same time, of course, but not exactly proportional. But what is the size of the “roughly” error? It can be pretty substantial. I’d bet that at the extremes, an increase of 10% in Linear Weights would easily be a 20% difference in RC27.

Which brings me, finally, to the point of all this. The academic studies I’ve seen that do this kind of thing are all very careful about the detailed calculations they do on the independent variables. They’ll adjust for season, they’ll adjust for DH. They won’t hesitate to criticize other studies for using offensive measure X when measure Y has a higher correlation to runs scored. They’ll add lots of indicator variables for all kinds of things, like managers and parks, just to make sure they’ve considered everything reasonable.

But isn’t that kind of overkill, to try to make sure all the independent variables are correct to three decimal places, when the results might be so much more dependent on which function of salary you take as the dependent variable? Your results are only as valid as your weakest link. And the salary function, to me, seems like a pretty weak link.

Saturday, October 14, 2006

Flawed study on stolen base risk analysis

Here’s a working paper on stolen base decision making. It’s badly flawed mainly because it fails to consider all the costs of a CS – specifically, it forgot to note that a caught stealing significantly reduces the chance of having a big inning.

I won’t review it here, because I'd just be repeating others' analysis at "The Sports Economist" and "The Book" (scroll down to October 14).

Thanks to Sabernomics for the original pointer.

Friday, October 13, 2006

NHL: Does the two-referee system deter on-ice evildoers?

If criminals are rational, they should commit fewer crimes when the chances of getting caught increase. This is the “economic model of crime,” and economists are fond of trying to find data to show that fear of getting caught means that increased enforcement leads to fewer offenses.

In this 2002 paper, “Testing the Economic Model of Crime: The National Hockey League’s Two-Referee Experiment,” economist Steven Levitt tries to test this theory using NHL hockey data.

For most of NHL history, there was only one referee on the ice. Recently, the league switched to a two-referee system, in order to be able to spot and call more penalties.

Should we expect more penalties, or fewer, in games with the additional referee? Economic theory says that we can’t know just from theory. Either result is possible.

On the one hand, the extra referee means that more offenses will be spotted and called. That means more penalties. On the other hand, the fact that offenses are more likely to be called means that players are less inclined to commit them. That means fewer penalties.

In theory, there’s no way to figure out which effect is larger. Penalties could go up, or they could go down. You have to look at real-life data to find out.

As it happens, there’s an ideal body of data to look at for this question. In the first half of the 1998-99 season, to allow the NHL to evaluate the new system, about one-third of games were played with two referees, and the other two-thirds were played using the old system of only one referee. This unexpectedly provided Levitt with a natural control on which to base his study.

In that sample, an average of 10.33 minor penalties per game were called in the 510 games with one referee, and 10.9 per game in the 270 games with two referees. That’s a smaller difference than I expected – only a 5.5% increase -- and it’s barely statistically significant (z = 1.9). It appears that roughly, the two effects cancel each other out – more infractions are being called at about the same rate as players are choosing to play nicer. The change in enforcement roughly equals the change in deterrence.

What Levitt wants to do, though, is to break out the two effects separately, to see if the economic theory of crime indeed holds – that is, whether players are indeed responding to the increased enforcement by committing fewer offenses. For that, he wants to isolate the absolute change in number of offenses. And based on only this data, there’s no way to tell.

It could be that players aren’t changing their behavior at all, but the second ref is seeing more of their offenses, causing those extra 5.5% of penalties. It could be that players are committing only half as many fouls as before, but the second ref more than doubles the chance of a foul being spotted. Or it could be any other combination of X% fewer offenses and Y% more enforcement, where the two multiply out to a 5.5% increase in penalties called.
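
Here is a sketch of that identification problem, assuming penalties called equal offenses committed times the chance each one is detected. The function name and the particular offense factors are illustrative only.

```python
observed = 10.9 / 10.33   # two-referee penalties per game relative to one-referee

def required_detection_factor(offense_factor):
    """Detection multiplier needed if offenses are multiplied by offense_factor."""
    return observed / offense_factor

# Each of these pairs reproduces the same 5.5% rise in penalties called.
for offense_factor in (1.0, 0.9, 0.75, 0.5):
    print(offense_factor, round(required_detection_factor(offense_factor), 3))
```

Every row fits the penalty data equally well, which is why a second observable is needed to pin down which row is the real one.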

Separating the two factors out is a hard problem. Levitt thinks about it a bit, and then comes up with a set of assumptions that make it possible. As it will turn out, one of the assumptions is questionable, and so the idea doesn’t quite work – but it’s an intriguing attempt nonetheless.

What Levitt does is assume that all penalties are committed defensively. That is, players will take penalties only in an attempt to prevent the opposition from getting an immediate scoring chance. If players are rational, they will take a penalty only when the expected cost of the offense is less than the benefit of curtailing the scoring chance.

For instance, suppose a single referee calls a penalty only 50% of the time. A penalty has a “linear weight” [my term] of 0.17 goals, Levitt finds. So a hook or a slash costs 0 goals if not called, and 0.17 goals if called, for a linear weight of 0.085 goals. A rational player will therefore commit the hook if and only if it reduces the immediate probability of a goal by more than 0.085.

However, suppose adding a second referee increases the chance of getting caught from 50% to 75%. Now, the linear weight of a hook rises to 0.1275. So the defense will hook only to reduce a scoring chance by 12.75% or more. So, with the second referee, all those hooks that previously occurred between 8.5% and 12.75% will no longer happen. That’s the amount of deterrence we want to measure.
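
The decision rule in this example can be written out as a short sketch. The 0.17 goals figure is the one attributed to Levitt above; the 50% and 75% detection rates are the made-up ones from the example, and the function names are mine.

```python
PENALTY_COST = 0.17  # goals per penalty called, per the figure quoted above

def hook_threshold(p_called):
    """Expected cost in goals of a foul: chance it's called times its cost."""
    return p_called * PENALTY_COST

def worth_fouling(goal_prob_saved, p_called):
    """A rational player fouls only if it saves more than it is expected to cost."""
    return goal_prob_saved > hook_threshold(p_called)

one_ref = hook_threshold(0.50)    # assumed detection rate with one referee
two_refs = hook_threshold(0.75)   # assumed detection rate with two referees
print(round(one_ref, 4), round(two_refs, 4))

# A hook that cuts the chance of a goal by 10 percentage points:
print(worth_fouling(0.10, 0.50), worth_fouling(0.10, 0.75))
```

The hook that saves 10 points is profitable under one referee and unprofitable under two, so it falls in the deterred band.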

If that deterrence is actually happening, what evidence would it leave behind?

Answer: more even-strength goals.

Previously, the defense had lots of profitable opportunities to foul the opposition for free, and prevent them from scoring. Now, because of increased enforcement, some of those goal-prevention opportunities are no longer profitable. And so, the defense has to allow the offense more scoring chances, and the opposition scores more often.

(To see that more vividly, consider the extreme case. If penalties are never called, opponents are manhandled off the puck whenever they cross the blue line, and no goals are ever scored. If penalties are always called, opponents are left alone and have a chance to score. So increased enforcement means more goals.)

So what do the data show? Only a 5% increase in even-strength goals. That means, under Levitt’s assumptions, that not many penalties are becoming unprofitable. Furthermore, after doing a bit of algebra, Levitt finds that the numbers show that the second referee actually leads to an increase in offenses committed! It’s only a 1.7% increase, but it’s still the opposite of what you’d expect – more enforcement should lead to less crime, not more.

But Levitt’s equations also allow him to estimate the change in probability of getting caught. And it’s small, which explains why fouls are committed as often as ever.

That is, under Levitt’s assumptions, the data show that because the second referee didn’t help much in spotting offenses, players didn’t have any reason to stop hooking and slashing. He writes,

“while the result … might superfically argue against the deterrence hypothesis … the true explanation … seems to be that there was no discernible change in the probability of detection.”

Intuitively, that doesn’t make sense; adding a second referee should substantially increase the chances of catching offenders. For instance, suppose both referees watch the action the same way, and each independently has an 80% chance of catching an offender. In that case, one referee will catch 80%, and the second will catch 16% more (80% of the 20% the first guy didn’t see). That’s a 20% increase, which is substantial. And when you consider that the second referee is positioned specifically to try to spot what the first referee cannot, the real-life number should be even higher than 20%.
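
That back-of-the-envelope number is easy to check. The sketch assumes the referees spot fouls independently, each with the same 80% chance, which is my illustrative figure and not anything from the paper.

```python
def detection_rate(p_single, referees):
    """Chance that at least one of several independent referees sees the foul."""
    return 1 - (1 - p_single) ** referees

one = detection_rate(0.80, 1)
two = detection_rate(0.80, 2)
print(round(one, 2), round(two, 2), round(two / one - 1, 2))
```

The lower the single-referee rate, the bigger the relative gain from adding a second pair of eyes.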

The problem, of course, is Levitt’s assumption that all penalties are defensive. I’m not an expert, but from watching hockey I know that’s not even close to being the case – penalties are common in the offensive zone. (If they weren’t, few teams would be penalized while on the power play, but, of course, many are.) Further, many penalties are taken for purposes of intimidation or retaliation, which can happen anywhere.

The change in even-strength scoring, and thus the calculation of deterrence, is very sensitive to what percentage of penalties are defensive, what percentage are offensive, and what percentage are neither. Any combination where defensive equals offensive – say, 40% defensive, 40% offensive, 20% neither -- will make the numbers come out roughly equivalent to Levitt’s.

That is, there are two competing explanations for the observations:

(a) Penalties are fairly evenly divided between offensive and defensive; or
(b) Penalties are all defensive, but the second referee does not lead to an increase in enforcement.

It seems obvious that (a) is a much more reasonable explanation than Levitt’s (b).

So if there’s one criticism I have of this paper, it’s that I wish Levitt had made it clearer that the results are unreliable because of his assumption. He does imply that in a footnote (page 8, footnote 12), but I’d have liked to see it stated more explicitly. As the paper stands, the reader who examines only the abstract and conclusions will be led to believe that the “second referee doesn’t make a difference” conclusion is strongly shown. And I don’t think it is.

In any case, this is a fun study. We learn a bit about hockey. We find out that adding a referee increases penalties by 5% but has little effect on scoring. And Levitt also shows us that fighting penalties decreased by 14%. He conjectures that’s because fights arise from escalations of uncalled fouls, and the second referee reduces those uncalled fouls. That makes sense to me.

But mostly, I was intrigued by Levitt’s idea for solving the problem of separating out the effects of deterrence. It’s a creative “Eureka!” kind of discovery -- that if you increase enforcement on the defense, it leads to less obstruction, better scoring chances, and therefore a measurable change in even-strength goals. That insight alone makes the paper worth a look.

Wednesday, October 11, 2006


Netflix, the big rent-movies-by-mail company, has a website that tries to guess which movies you’ll like based on your evaluations of movies you’ve already seen. They have now announced that they will award a million dollars to anyone who can improve their prediction algorithm by 10%. It’s open to anyone, including sabermetricians, anywhere in the free world except Québec.

Here’s how it works. When you register at the site, you download a “training” database of 100,000,000 assorted movie ratings (1 to 5 stars, integers only) from 450,000 different customers. You then devise the best algorithm you can to accurately predict some of those ratings from the rest. When you’re happy with your algorithm, you run it on a “final exam” database of 2,800,000 ratings from those same customers (with the actual ratings withheld). You send Netflix your 2,800,000 estimates, they compare them to the “real” ratings to calculate your accuracy, and post your results on the site.

If you beat Netflix’s standard error by more than 10%, you win a million bucks. If you can’t beat 10%, but you’re still the best for the year, you get $50,000.

My first reaction is that reducing the standard error by 10% could very easily be impossible. I’m pretty sure it’s impossible to chop 10% off the standard error of, say, Runs Created, because there’s a minimum level of inherent noise in the data, and Runs Created is already within 10% of it. You could figure this out mathematically – take a typical batting line, scramble its at-bats around randomly, and see how many runs score. Repeat a couple of hundred thousand times. Find the standard error of all your results. That’s the minimum intrinsic error of any stat that’s based on a batting line. If that’s more than 90% of the standard error of Runs Created, well, you can’t win the million dollars.

To say that another way, suppose a team matching Derek Jeter’s batting line scored 80 runs half the time, and 90 runs the other half. In that case, the best you could do would be to guess 85 runs, for a standard error of 5. If the rules of baseball are flaky enough to evenly split real life between 80 and 90, that’s your limit of accuracy, period. No matter how much money you were offered, it would be impossible to do any better.
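To make that floor concrete, here is a small Python sketch of the 80-or-90 example. It just tries a grid of possible guesses and measures the root-mean-square error of each one against the two equally likely outcomes; the numbers are the made-up ones from the paragraph above, not real data.

```python
import math


def rmse(guess, outcomes):
    """Root-mean-square error of a single guess against equally likely outcomes."""
    return math.sqrt(sum((x - guess) ** 2 for x in outcomes) / len(outcomes))


# Hypothetical: the same batting line produces 80 runs half the time, 90 the other half.
outcomes = [80, 90]

# Try every guess from 75.0 to 95.0 in steps of 0.1 and keep the best one.
guesses = [g / 10 for g in range(750, 951)]
best_guess = min(guesses, key=lambda g: rmse(g, outcomes))
error_floor = rmse(best_guess, outcomes)

print(best_guess, error_floor)  # prints: 85.0 5.0
```

No estimator that sees only the batting line can get under that 5-run floor, because the remaining error comes from the outcomes themselves and not from the formula.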

And I think estimators like Runs Created are already butting up against that limit. I’m pretty sure that 10% is out of the question.

Can you do the same kind of calculation for Netflix data? Not the same way – there are decent simulations of baseball, but no reliable simulation of the human brain’s response to movies. But maybe by analyzing the hundred million ratings, and the statistical properties of the data, you might be able to come up with a persuasive argument that the million dollar threshold is unreachable. I don’t think they’d pay you for the argument, but I’d certainly buy you a beer.

Of course, I might be wrong about this. Someone has already taken 1% off, and the contest has been running for only a week …

(Thanks to Freakonomics for the pointer.)

Saturday, October 07, 2006

Did the baseball salary market anticipate DIPS?

According to the famous Voros McCracken DIPS hypothesis, pitchers have little control over what happens once the ball is hit off them. As long as they stay in the park, balls hit off bad pitchers are no more likely to drop for hits than balls hit off good pitchers. No matter who the pitcher is, his batting average on balls in play (BABIP) should be roughly the same.

If that’s true, then teams shouldn’t be evaluating pitchers based on their BABIP, since it’s not evidence of skill. And, therefore, they shouldn’t be paying players based on BABIP.

J.C. Bradbury’s study, “Does the Baseball Market Properly Value Pitchers?” sets out to check whether that is actually the case.

The paper starts off by checking whether the DIPS hypothesis holds. That part of the paper is technical and hard to summarize concisely, so I’ll skip the details and run it down in one paragraph. First, Bradbury shows that if you choose the right combination of variables to predict this year’s ERA, adding last year’s ERA and BABIP doesn’t help with the prediction. Then, he runs a second regression to see what other statistics correlate with BABIP – and the answer, it turns out, is strikeouts and home runs.

One important point Bradbury makes is that even if last year’s BABIP doesn’t help other stats to predict this year’s BABIP, that doesn’t by itself imply that pitchers have no control over BABIP. It could be that pitchers do have BABIP skill, but that skill correlates perfectly with the other stats he considered. He finds a high correlation with strikeouts, but not a perfect one.

In summary, Bradbury writes,

“… it appears that pitchers do have some minor control over hits on balls in play; but, this influence is small … this skill just happens to be captured in strikeouts.”

Having concluded that BABIP isn’t much of a skill, Bradbury now checks whether players are nonetheless being paid based on it. He regresses (the logarithm of) salary on various measures of the pitcher’s performance in his previous year – strikeouts, walks, home runs, hit batters, and BABIP. He takes every year separately from 1985-2004.

His findings:

-- strikeouts are a significant predictor of salary 16 out of 20 seasons;
-- walks are significant 11 out of 20 seasons;
-- home runs allowed are significant 8 out of 20 seasons;
-- HBP are significant 0 out of 20 seasons;
-- BABIP is significant 4 out of 20 seasons.

Bradbury concludes that because BABIP shows up as significant in so many fewer seasons than some of the other factors, GMs were less likely to base decisions on it. In effect, they (or rather, the market) knew about DIPS before even Voros did:

“… the market seems to have solved the problem of estimating pitcher MRPs [marginal revenue products – the benefit of signing a pitcher versus not signing him] well before McCracken published his findings in 2001.”

This is not as farfetched as it sounds – economists are fond of finding ways in which markets are capable of appearing to “figure out” things that no individual knows. For instance, even though every individual has a different idea of what a stock may be worth, the market “figures it out” so well that it’s very, very difficult for any individual to outinvest any other. (And here is another possible example of the market “knowing” something about sports that most of its participants do not.)

So it’s certainly possible. But I think the evidence points the other way.

Bradbury’s conclusion, that GMs aren’t paying for BABIP, is based only on the number of years that came out statistically significant. The other seasons, the ones that do not reach significance, are treated as if they show no effect at all that year. But, taken all together, they show obvious significance.

A look at the study’s Table 6 shows that from 1985-1999, the direction of the relationship between BABIP and salary is almost exactly as you’d expect – all negative but one. (Negative, because higher BABIP means lower salary.) Plus, the positive one is only 0.02, and only two of the other t-statistics are smaller than 0.5 in absolute value. Clearly, this is a very strong negative relationship between BABIP and salary.

Here, I’ll list those 15 scores so you can see for yourself. Remember, if there were no effect, they should be centered around zero:


Only the three highest of those are individually significant, but taken as a whole, there’s no doubt. The chance of getting fourteen or more negative t-statistics out of fifteen is 1 in 2048. (Of course, that’s not a proper test of significance because I chose the criterion after I saw the data. But still…)
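For anyone who wants to check the 1-in-2048 figure, here is a short Python sketch. It assumes that, if BABIP had no effect on salary, each year’s t-statistic would be equally likely to come out positive or negative, independently of the other years; the function name is mine.

```python
from fractions import Fraction
from math import comb


def tail_probability(n_years: int, min_negative: int) -> Fraction:
    """P(at least min_negative negative signs in n_years fair coin flips)."""
    favourable = sum(comb(n_years, k) for k in range(min_negative, n_years + 1))
    return Fraction(favourable, 2 ** n_years)


# Fourteen or more negative t-statistics out of fifteen seasons.
p = tail_probability(15, 14)
print(p)  # prints: 1/2048
```

That works out to 16 favourable sign patterns (fifteen with exactly one positive year, plus one with none) out of 32,768.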

If you combined all fifteen years into one regression (you’d have to adjust for salary inflation, of course), you’d wind up with a massive t-statistic. It’s the data being broken up into small samples that hides the substantial significance.

If you look at an entire season, Rod Carew would easily show up as a statistically significantly better hitter than Mario Mendoza. But if you carved their season into individual weeks, not that many of those weeks would show a statistically significant difference.
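A rough Python sketch shows how much the sample size matters here. The batting averages (.330 and .215) and the at-bat counts (600 for a season, 25 for a week) are round numbers I’m assuming for illustration, not either player’s actual line, and the test is an ordinary two-proportion z-score computed at the expected averages.

```python
import math


def two_proportion_z(avg_a: float, avg_b: float, at_bats: int) -> float:
    """Expected z-score for the gap between two hitters with equal at-bats."""
    pooled = (avg_a + avg_b) / 2
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / at_bats)
    return (avg_a - avg_b) / std_err


# Assumed illustrative values, not real season data.
CAREW_AVG, MENDOZA_AVG = 0.330, 0.215

season_z = two_proportion_z(CAREW_AVG, MENDOZA_AVG, 600)  # a full season
week_z = two_proportion_z(CAREW_AVG, MENDOZA_AVG, 25)     # one typical week

print(f"season z = {season_z:.2f}, typical week z = {week_z:.2f}")
```

Under these assumptions the full-season gap is more than four standard errors, while a typical week is under one. Real weeks would bounce around that figure, so a few would cross the usual 1.96 threshold, but most would not.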

In terms of actual salary impact, the numbers show a great deal of baseball significance. Take 1990, which is pretty close to average (z=-1.04). If your BABIP in 1989 was .320, you would earn about 5.4% less money in 1990 than if your BABIP was only .300. (Assuming I’ve done the math right.) That’s a reasonable difference, 5% of salary for only a moderate increase in BABIP.

Furthermore, the real-life effect is almost certainly higher than that. Players don’t sign new contracts every year, so there’s a time lag between performance and salary. Suppose, on average, half of pitchers sign a new contract in any given year. The 5% difference overall must then be a 10% difference on the pitchers who actually sign (to counterbalance the 0% effect on pitchers whose salary didn’t change).

And still further, pitchers aren’t evaluated on just their most recent season. Suppose that GMs intuitively give the most recent season only 50% weight in their forecast of what the pitcher will do for them. Again, that makes the 5% difference really a 10% difference in the GMs’ evaluations.

Combine the two adjustments, and now you’re talking real money.
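As a back-of-the-envelope Python sketch, here are the two adjustments applied in turn. The 50% figures are the assumptions from the paragraphs above, and simply dividing through treats the percentages as additive, which is only an approximation to what a log-salary regression would imply.

```python
# Roughly the 1990 estimate: a .320 BABIP costs about 5% of salary versus .300.
overall_effect = 0.05

# Assumptions from the text, not measured values.
share_signing_new_contract = 0.5   # half of pitchers sign in any given year
weight_on_latest_season = 0.5      # GMs give last season half the weight

# Undo the dilution from pitchers whose salary could not react at all.
effect_on_signers = overall_effect / share_signing_new_contract

# Undo the dilution from earlier seasons sharing the weight in the forecast.
effect_after_both = effect_on_signers / weight_on_latest_season

print(f"{effect_on_signers:.0%} after the first adjustment, {effect_after_both:.0%} after both")
# prints: 10% after the first adjustment, 20% after both
```

So under both assumptions at once, the 5% regression effect would correspond to something like a 20% difference in how the market values the pitcher’s BABIP.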

So in terms of both statistical significance and baseball significance, it seems pretty solid to me that the market for pitchers does consider BABIP to be significant in determining pitcher skill.

But there’s still something interesting in the data. From 2000 to 2004, the years McCracken’s original DIPS study was in the public domain, the numbers become less consistent:


Two of the five years have the correlation between BABIP and salary going the wrong way. One of the others is close to zero, and the numbers seem to be jumping up and down a bit more. All this might be because the number of pitchers in the sample is a bit lower – but it also might be that GMs are catching on.

It’s weak, but it’s something. This study provides the first hint I’ve seen that baseball’s labor market might actually have learned something about DIPS. It’ll be very interesting to extend the table ten years from now and see if that’s really true.

Thursday, October 05, 2006

Liver follow-up

Tuesday, I argued that academics show little respect for the work of us amateurs. Specifically, I said that a recent academic paper by J. C. Bradbury on DIPS should have cited non-academic work that looked at the same question.

Many readers disagreed and said I was out of line.

Some of that discussion is here. Dr. Bradbury’s response is here.

Also, some people objected to my description of the DIPS consensus, saying that’s not the consensus at all, and if it were, the consensus would be wrong. While the specific view on the DIPS question didn’t affect my argument from Tuesday, I do want to get it right. I’ll investigate the more recent DIPS work and post on that soon.

Tuesday, October 03, 2006

Chopped liver

In 1999, Voros McCracken discovered what has come to be known as the “DIPS” theory – that pitchers have only a small influence on whether a ball in play becomes a hit. Since then, and in the standard fashion of scientific enquiry, there have been frequent studies testing McCracken’s hypothesis. The consensus, seven years later, after countless hours of research, analysis, and dialogue, is that the theory is generally true – pitchers have much less control over batted balls than previously believed, and any material deviation from league average is almost always just luck.

But to J.C. Bradbury, that research might as well not exist.

As he writes, all that research was done only by the “analytical baseball community” – not by reputable Ph.D.s in economics. It has not undergone “formal peer review.” Further, “it has not been tested with sufficient statistical rigor” and “has undergone very little formal scrutiny.” It does not use “proper econometric techniques.”

And so Dr. Bradbury sets out to correct this. How? Not by reviewing the existing research, and validating it academically. Not by finding those studies which have “insufficient statistical rigor” and analyzing them statistically. Not by summarizing what’s already out there and criticizing it.

No, Dr. Bradbury’s paper ignores it. Completely. He mentions none of it in his article, not even in the bibliography. Instead, Dr. Bradbury explains his own study as if it’s the first and only test of the DIPS hypothesis.

This happens all the time in academic studies involving baseball. Years of sabermetric advances, as valid as anything in the journals, are dismissed out of hand because of a kind of academic credentialism, an assumption that only formal academic treatment entitles a body of knowledge to be considered, and the presumption that only the kinds of methods that econometricians use are worthy of acknowledgement.

The truth is, there’s pretty decent statistical rigor in some of what we amateurs have done. In “Solving DIPS,” a bunch of really smart people, statistically literate, probably no less intelligent than academic economists and much better versed in sabermetrics, do an awesome and groundbreaking job of determining the causes of variation in the results of balls in play. There’s Tom Tippett’s famous study that showed that power pitchers performed better than projected by DIPS -- the same conclusion that Dr. Bradbury reaches. (Tom didn’t do any significance testing, though, which I guess makes his excellent analysis unworthy of citation.) The May, 2001 issue of “By the Numbers” had three articles on DIPS, one of which (mine) did roughly the same kind of regressions (although admittedly much less thorough) as Dr. Bradbury does in his own paper.

Dr. Bradbury knows about our previous work. In this article from last year, he mentions the Tippett article and a few others. He knows what’s out there. He must know that all this research has been peer reviewed, albeit informally, in hundreds, or even thousands, of blog postings and message boards. He must understand that the network of sabermetricians is perfectly respectful of the scientific method and understands the way theories must be tested and revised in the face of contradictory evidence. He’s got to know that our method of peer review, while informal, is much, much more likely to expose flaws in the theory than submitting a paper to a couple of economics referees who know little about sabermetrics.

And he’s got to be aware that in the field of sabermetrics, the achievements of non-academics, starting with Bill James and Pete Palmer, have been orders of magnitude above anything that’s come out of academia. Think of all the discoveries that have changed the way we look at baseball – DIPS, runs created, linear weights, the clutch hitting myth, catcher ERA, major league equivalencies, and so forth. Of the most important principles of sabermetrics, how many of those ideas were developed in academia? If the answer isn’t zero, it’s pretty close.

I feel bad singling out Dr. Bradbury. He does some good work on his blog, and, outside of this paper, seems very supportive and appreciative of the work we do. He seems like a nice guy. And ignoring non-academic research is pandemic when professors venture into sabermetrics – “The Wages of Wins” being the most prominent recent example.

So perhaps I’m being unfair. Maybe it’s one of the realities of academic life that to get an article published, you have to ignore non-academic work. Maybe the professorial culture requires that you presume no knowledge is real until it’s been published in peer-reviewed journals. Maybe Dr. Bradbury is right that all the previous results are questionable unless he uses the exact technique that he does. And maybe I’m just overreacting to a couple of throwaway sentences intended only to get his paper past the referees.

But still, there is a basic tradition among scientists of acknowledging that we stand on the shoulders of giants. When it comes to sabermetrics, academia repeatedly pretends that it hoisted itself up on its own bootstraps.

And that’s an insult to all of us.

Monday, October 02, 2006

Golfers improved by 10 strokes since 1934

This five-year-old golf study isn’t all that exciting, but it’s got useful information on how professional golfers have improved substantially over the last decades.

The study is called “Studying Improved Performance in Golf,” by Sangit Chatterjee, Frederick Wiseman, and Robert Perez. It looks at the winning scores from the Masters (72 holes), from 1934-2001, and finds:

-- the mean score improved from 296.5 to 286.3;
-- the median score improved at almost exactly the same rate;
-- the variance of scores dropped until the mid-80s, then levelled off;
-- the winning scores have been improving, but not as much as the mean;
-- but the extent to which the winner dominates the field has been fairly level.

The last two findings seem to contradict each other – if the mean is dropping, but the winning score isn’t dropping as much as the mean, doesn’t it follow that players are winning by less? I’m thinking it does follow. But the authors used Z-score (number of SDs above or below the mean) for their measure of domination -- since the variance of scores is also dropping, the Z-score is staying fairly level.
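A quick Python sketch shows how both findings can hold at once. The two means are the ones quoted above; the standard deviations and winning scores are numbers I made up purely to illustrate the mechanism, not figures from the study.

```python
def z_score(score: float, mean: float, sd: float) -> float:
    """Number of standard deviations a score sits above (+) or below (-) the mean."""
    return (score - mean) / sd


# Means from the study; SDs and winning scores are hypothetical.
early = {"mean": 296.5, "sd": 6.0, "winner": 284.5}
late = {"mean": 286.3, "sd": 4.0, "winner": 278.3}

for label, era in (("early", early), ("late", late)):
    margin = era["mean"] - era["winner"]
    z = z_score(era["winner"], era["mean"], era["sd"])
    print(f"{label}: winner beats the mean by {margin:.1f} strokes, z = {z:.1f}")
```

In this made-up case the winner’s margin over the field shrinks from 12 strokes to 8, but because the spread of scores shrinks too, the Z-score stays at about -2 in both eras.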