## Tuesday, January 30, 2007

### Home field advantage seems to disappear in 3-run games

The first article in the new issue of JQAS is a baseball paper on home field advantage (HFA), by William Levernier and Anthony Barilla. Unfortunately, the authors appear to be unfamiliar with some pertinent sabermetric results, and there are, in my opinion, a couple of problems with their analysis.

They start by examining run scoring: they find that in games of 2004-2005, the home team scored .093 runs per game more than the visiting team. This number looks small, and so the authors conclude this "little supports" the explanation that home teams are "more proficient at scoring runs."

Of course, .093 runs per game is reasonably significant. Using the rule of thumb that 10 runs equals one win, the effect the authors found is about 1.5 wins per season -- or a winning percentage difference of .009. Given that the entire home field advantage in 2004-5 (as found by this study) was .036, the run differential explains 25% of it.

But the authors didn't take into account the fact that home teams often don't bat in the bottom of the ninth inning. That is, the home team scores only .093 runs per game more than the visiting team, but *in fewer opportunities*. If we make a rough guess that home teams lose the equivalent of 36 full innings over an 81-home-game season, and that they score 5 runs per nine-inning game on average, that's 20 runs right there -- another two wins out of 81 home games, or 4 wins in 162. That brings home teams up to about .534, almost exactly what the authors found.
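For anyone who wants to check the arithmetic in the last two paragraphs, here is a minimal sketch. The variable names are mine, and the 36 lost innings and 5 runs per nine innings are the rough guesses from the text, not measured values.

```python
# Back-of-envelope check of the home-field arithmetic above.
# All inputs are the post's own figures or rough guesses.
RUNS_PER_WIN = 10          # rule of thumb: 10 runs equals one win
SEASON_GAMES = 162
HOME_GAMES = 81

run_edge_per_game = 0.093  # home minus visitor runs per game, 2004-2005
observed_hfa = 0.036       # home winning percentage above .500, per the study

# Step 1: the raw run edge converted to wins and winning percentage.
raw_wins = run_edge_per_game * SEASON_GAMES / RUNS_PER_WIN
raw_pct_edge = raw_wins / SEASON_GAMES
share_of_hfa = raw_pct_edge / observed_hfa

# Step 2: runs the home team never gets to score in unplayed innings.
lost_innings = 36          # rough guess, per 81 home games
runs_per_nine = 5          # rough guess
missing_runs = lost_innings / 9 * runs_per_nine
missing_wins = missing_runs / RUNS_PER_WIN

# Step 3: estimated home winning percentage.
home_pct = 0.5 + raw_pct_edge + missing_wins / HOME_GAMES

print(round(raw_wins, 1), round(raw_pct_edge, 3), round(share_of_hfa, 2))
print(missing_runs, missing_wins, round(home_pct, 3))
```

Changing either guess moves the final figure noticeably, so the .534 should be read as a ballpark number only.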

(The difference is some combination of the home team both scoring more runs and preventing opposition runs.)

The authors then note that the observed HFA is higher in close games, and lower in blowouts:

.602 Games decided by one run
.539 Games decided by two runs
.500 Games decided by three or more runs

This surprised me a bit; I didn't expect to see this kind of effect. Why can it happen? I can only think of two explanations:

First, you have a small effect caused by walk-off games where the home team doesn't get to pile on more runs (but the visiting team does). Second, blowout games are disproportionately won by the better team, and better teams have smaller home field advantages than average teams. (I think Bill James showed this once, and started by observing that a 1.000 team must have a home field advantage of zero.)

It seems to me that these two factors alone shouldn't be enough to account for home teams' .500 record in 3+ run games. But I don't know. Are there other explanations?

In any case, the authors run a logistic regression on home/road, runs scored, run differential (unsigned), and roster size (25 or 40). They find everything significant except home/road. I'm not sure how to interpret that, but I think the idea is that if you know that the home team scored 7 runs, you're pretty much assured that they won, home or road. If they scored 1 run and it was a two run game, you know that they lost, home or road. The leftover games may be sufficiently few that a statistically significant result doesn't appear -- especially considering that runs scored aren't adjusted for innings.

In any case, it's a bit weird trying to predict home field advantage based on the run differential of the final score of the game. Shouldn't the predictions go the other way? Following the authors' logic, you could say the HFA is infinite in games won by walk-off home runs.

It may be true that teams were only about .500 in games decided by more than two runs. But the statement "HFA is non-existent in games decided by more than two runs" is false. There is a home field advantage in those games, but selective sampling makes it look like there isn't.


### New issue of JQAS

The new issue of JQAS has just come out. I'm away on vacation this week, but will take a look at it when I get back.

From the table of contents, there are four articles, one each on baseball, highland dance (!), soccer, and ping pong.

## Friday, January 26, 2007

### Are NFL coaches too conservative on fourth down?

NFL coaches are cowards.

That’s the conclusion of a study by David Romer, “Do Firms Maximize? Evidence From Professional Football.” Romer finds that teams could do better by kicking on fourth down less often, and “going for it” more often.

Romer took NFL play-by-play data from 1998, 1999, and 2000. (He looked only at first quarter results, in order to minimize the effects of score and time remaining.) He then used “dynamic programming” to fit a smooth curve to the point values of first downs on every yard line on the field. Having done that, he’s now easily able to figure the best fourth-down strategy.

For instance, Romer’s analysis found that having the ball first-and-ten on your own 30-yard line is worth about +0.9 points (these numbers may not be exact – I’m reading them off a graph). The opponent having the ball in the same spot is worth –2.9 points.

Now, suppose you have a fourth-and-short situation on your 30. If you go for it and make it, you’ll be at +0.9 points. If you go for it and miss, you’re at –2.9. If you kick to your opponent’s 30, you’re at –0.9 (since the other team will be at +0.9).

Therefore, Romer would find that if you can make fourth down with at least 45% probability, you should go for it. That’s because (45% of 0.9) + (55% of –2.9) equals the sure -0.9 you get from punting.
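The breakeven logic is just a weighted average set equal to the punt value. Here is a minimal sketch (the function name is mine, not Romer's) that solves for the breakeven probability using the three values quoted above.

```python
def breakeven_probability(value_success, value_failure, value_punt):
    """Probability p at which going for it equals punting:
    p * value_success + (1 - p) * value_failure == value_punt
    """
    return (value_punt - value_failure) / (value_success - value_failure)

# Values read off Romer's graph, as quoted in the post (approximate).
p = breakeven_probability(value_success=0.9, value_failure=-2.9, value_punt=-0.9)
print(round(p, 3))
```

With these rounded graph readings the formula gives roughly 53% rather than 45%. The inputs are eyeballed, so the gap presumably reflects how imprecise they are; small changes in the three values move the breakeven a fair bit.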

Romer then goes one more step. He figures out that the chance of making five yards or more is very close to that 45% chance. Therefore, a team who’s fourth-and-5 would presumably be indifferent (in the economic sense) between punting and going for it. And so you’d expect that teams in that situation would punt about half the time.

But in real life, teams don’t go for it 50% of the time. Romer doesn’t tell us the actual percentage. However, he does figure out the empirical “yards to go” breakeven point for coaches based on their behavior. That is, if coaches won’t go for five yards 50% of the time, when *will* they go 50% of the time? Is it three yards? Two? One?

In their own territory, it’s effectively zero – coaches never went for it more than 50% of the time, even on fourth and 1. In the opponent’s half of the field, though, they did. Eyeballing the graph shows the trend is irregular, but appears to average about 1/4 of the theoretical yardage. For instance, on the opponent’s 30, the breakeven point is fourth-and-7. But coaches acted as if the breakeven were about fourth-and-3. (This information comes from figure 5, which is a very interesting chart.)

It would be interesting to see the converse – now that we know that coaches go for it half the time on fourth and 3, how often do coaches actually go for it on fourth and 7? Is it 10%? 20%? Never? Romer doesn’t give us this information. This means (at least in theory) that coaches may be rational, but their own calculations are different. For instance, maybe they think that fourth-and-6 is the breakeven, so they seldom go for it with 7 yards to go, but often go for it with 5 yards to go. This is theoretically possible, especially since in “The Hidden Game of Football,” authors Bob Carroll, Pete Palmer, and John Thorn [CPT] come up with different numbers.

CPT’s methods seem similar to Romer’s, but their conclusions are different. Romer writes that CPT “do not spell out their method for estimating the values of different situations ... and it yields implausible results.” (I wasn’t able to notice anything that implausible, but I’ll give Chapter 10 another read.) Also, Romer writes that “their conclusions differ substantially from mine.”

One significant difference is that Romer advocates going for it on your own 30 when you have fewer than 5 yards to go. CPT, on the other hand, advocate punting even on fourth-and-1 (according to CPT, punting is worth –0.5 points, going for it on fourth-and-1 is worth –0.8, and going for it on fourth-and-6 is worth –1.9).

The difference seems to be the values assigned to the different field positions. Both studies give –2 points for first-and-ten on your own goal line, and +6 points for first-and-goal on the opponent’s one-inch line. But CPT assigns values linearly between those two values, while Romer has an S-shaped curve, flat in the middle but steep near both goal lines. To CPT, the value of a punt from one 30-yard line to the other is more than three points – but to Romer, it’s only two points. That’s a big difference, and it’s enough to make the difference between punting and not punting.
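To see where "more than three points" comes from, here is a sketch of the linear scale as described above. It assumes a straight line from –2 at your own goal line to +6 at the opponent's; the function name and the yards-from-own-goal convention are mine, and CPT's actual table may differ in detail.

```python
def cpt_linear_value(yards_from_own_goal):
    """Straight-line interpolation: -2 at your own goal line, +6 at the opponent's.
    This is an assumed reading of CPT's scale, not their published table."""
    return -2.0 + 8.0 * yards_from_own_goal / 100.0

# Values from the punting team's point of view (opponent's value, negated).
after_punt = -cpt_linear_value(30)       # opponent starts on their own 30
after_turnover = -cpt_linear_value(70)   # opponent starts on your 30

cpt_punt_swing = after_punt - after_turnover
romer_punt_swing = -0.9 - (-2.9)         # Romer's values quoted earlier in the post

print(round(cpt_punt_swing, 1), round(romer_punt_swing, 1))
```

On the straight-line scale every yard is worth the same 0.08 points, so the 40 yards between the two 30-yard lines come to 3.2 points, against about 2 on Romer's flatter midfield curve.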

I don’t know whose numbers are right. It would have been nice if either source had shown us a chart of actuals – that is, how many times did team X have the ball on their 30, and what were the eventual results? That way, we could try to evaluate both authors ourselves.

In any case, it’s probably true that all the conclusions, on both sides, are dependent on the situation values. Without actuals, we really don’t know which is correct. If CPT are correct, and Romer is wrong, it could be that coaches aren’t all that conservative.

This is a meaty paper. It’s interesting to read, and provides much data (mostly in graphs) that would be useful to football sabermetricians. I do wish I understood Romer’s method a bit better – I’m not familiar with “dynamic programming” – so I could form an opinion on whether his values are empirically superior to CPT’s. (CPT don’t provide any raw data either.)

But even with Romer’s interesting data, I’d still wish for more. We are told that when it’s a 50% shot, coaches go for it less than 50% of the time. But what about when it’s a 70% shot? Or a 90% shot? Does that make a difference? Do coaches go by yards, so that they’ll always go for it on fourth-and-inches regardless of field position? And, as I wondered above, is it possible that coaches agree with Romer’s analysis, but are just using different numbers?

It seemed like Romer found one way of looking at the data that showed strong risk aversion, and stopped there. Sabermetricians, who are more interested in the particulars of coaches’ strategy than the yes/no question of whether they maximize, may be frustrated that the author has so much data yet chose to show only a small part of it.

And if anyone has any data or suggestions on figuring out whether Romer’s data is better or worse than CPT’s, I’d be interested in knowing.

---

Hat tip: The Wages of Wins. This study got a fair bit of press ... see this espn.com article for NFL coaches’ reactions. Here’s the first of a four-part series from Doug Drinen. And you can google “Romer NFL” for others.


## Wednesday, January 24, 2007

### New issues of "By the Numbers" now available

Two new issues of “By the Numbers,” the SABR Statistical Analysis Newsletter that I edit, are now available at my website, www.philbirnbaum.com.

The August issue contains:

-- a review of two recent “Chance” articles by Charlie Pavitt;
-- an analysis by Victor Wang of the historical value breakdown between OBP and SLG;
-- a steroid-related study by Yoshio Muki and John Hanks, showing that sluggers today see their power diminish much more slowly with age than they used to.

The November issue contains:

-- a Charlie Pavitt review of the recent JQAS paper on error rates;
-- an article by Gary Gillette and Pete Palmer on measuring middle relief performance, complete with updated data for 2006;
-- a study by Abbott Katz on the consistent historical relationship between at-bats and batting average.

<bleg>

If you’re wondering why the August and November issues are coming out in January, it’s because we were short of material until recently. If you’re interested in contributing an article, we’d be pretty darn grateful. Even if you’ve already had a blog post on your work, or published it online, we’d be interested in having it. Our only rule is that we won’t publish an exact duplicate of something that’s appeared elsewhere – if you want to do some revisions, or update your data for a more recent year, or anything else that constitutes value added, we’re happy.

To see the kinds of articles we’ve used, check out the many back issues at the above link. (People have said that the February, 1999 issue was one of the best.)

“By the Numbers” goes out to about 1,000 SABR members – 900 by download, 100 by postal mail – of which some are “famous” established sabermetricians. I suspect that most BTN readers don’t frequent online sabermetrics sites, so you might consider a contribution as a way to reach a new audience.

</bleg>

Thanks, and hope you enjoy the issues.


## Saturday, January 20, 2007

### Do NBA teams value consistency?

Last week, in honor of Martin Luther King, Jr. Day, David Berri posted a list of academic studies of racial discrimination in basketball. Some found discrimination, some didn’t – check out the link for a summary of all the findings.

I thought I’d take a look at some of these studies. So far I’ve looked at only one: "Do Employers Pay for Consistent Performance?" by Orn B. Bodvarsson and Raymond T. Brastow. I should state in advance that the copy I was able to download from my library was missing a lot of the math and tables, but I think I was able to get the gist of what the study was trying to do. An expensive download is here.

The idea is something like this. In any employer/employee relationship, it is difficult for the employer to figure out how good the employee’s output is, in quality, quantity, or both. Therefore, "costly monitoring" of the employee’s output is required.

In this study, the employer is the team, and the employee is the player. Assume the player’s MRP ("Marginal Revenue Product," which is basically the monetary value of his output) has an expected value of theta. (At least I think it’s theta – the crappy text version I have uses "0", which makes no sense.) The team initially doesn’t know what theta is, so it has to monitor the player. The team will spend c dollars per game in monitoring, until it has seen enough to be able to estimate theta within a narrow range. Then it will stop spending those c dollars.

Therefore, players who are more consistent will earn more money than players who are less consistent. That’s because inconsistent players have to be monitored for more games. Since teams pay for the monitoring by the game, inconsistent players will cost more, and the team will factor that into their salary offers.

With this model in mind, the authors run a regression on player salary based on a whole bunch of variables. They use points per minute as the main performance statistic, and include the observed variance of the player’s PPM, since the model assumes that should be an important determinant of salary.

I don’t have the full regression results in my copy of the study, but the authors kindly list the most important findings, which are:

-- variance is significant at the 5% level;
-- the more seasons played, the less the variance predicts salary;
-- there are no observed effects for race.

My main problem with this study is the applicability of its assumptions to the NBA. I have no doubt that in real life, as opposed to professional sports, knowing the capabilities of your employees is difficult, and monitoring costs are indeed high. Magazines are full of articles trying to help managers to figure out who their best performers are, and how to know whom to hire. I’m a software developer, and where I used to work, there were programmers of all different ability levels, and some of them were literally ten times as productive as others. But management was generally oblivious to the differences. And if that’s the case for programmers – where you actually can measure output with not too much difficulty if you choose to – I can imagine that for jobs that don’t have quantitative outputs, like customer service or management, the monitoring problem is indeed very significant.

But in basketball? There aren’t very many fields of human endeavor where it’s easier or cheaper to measure an employee’s output. If your metric is points per minute, like in this study, the cost is pretty much zero – USA Today, and the NBA itself, will do it for you, almost instantly.

And so as for the variance of output affecting salaries, I doubt that the authors’ explanation is the correct one. If the cost of monitoring is zero, it’s hard to accept that the lower salaries are caused by that cost. It seems more likely that players who vary in terms of points per minute are inconsistent because of lesser playing time. Theoretically, the standard deviation of single game PPMs is inversely proportional to the square root of minutes played per game. (Minutes played is included in the regression as a separate variable, but I don’t think there’s any interaction term between PPM and the inverse of the square root of minutes.) The fewer minutes played, the less skilled the player is likely to be, and that would explain the lower salary.
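The square-root relationship is the usual one for an average of independent chunks. Here is a minimal sketch, assuming (unrealistically) that each minute's scoring is independent with the same spread; the function name and the per-minute SD of 1.0 are placeholders, not anything from the study.

```python
import math

def sd_of_single_game_ppm(per_minute_sd, minutes_played):
    """SD of a game's points-per-minute figure, ASSUMING each minute's
    scoring is independent with the same standard deviation."""
    return per_minute_sd / math.sqrt(minutes_played)

# Same underlying player, different playing time (hypothetical numbers).
bench = sd_of_single_game_ppm(per_minute_sd=1.0, minutes_played=9)
starter = sd_of_single_game_ppm(per_minute_sd=1.0, minutes_played=36)

print(round(bench / starter, 2))  # ratio of game-to-game spread
```

Under that assumption, a player getting a quarter of the minutes shows twice the game-to-game spread in PPM without being any less "consistent" in skill.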

Even if I’m mistaken, and the study does actually adjust standard deviation for minutes played, there are other possible explanations. For instance, players with a high SD might be played in many different kinds of situations – with different teammates, or more often in garbage time, and so on. Those players are likely to be role players, since guys with lots of minutes pretty much play in all kinds of situations. And role players earn less than stars.

Even if all else were equal, I don’t see why teams should overvalue players who are more consistent. For one thing, if a player’s performance varies by situation, he becomes more valuable, not less. If a player is twice as good at home as on the road, the team can play him only at home, as a kind of platoon player. Who’s more valuable in baseball: the consistent .240 hitter, or the guy who hits .300 against lefties but .210 against righties? The inconsistent one, obviously.

Second, I’d bet that almost all the observed differences in SD are due to luck. If you look at just foul shooting, you can calculate the theoretical variance from the binomial distribution. Any player who has a higher per-game variance, and is thus “inconsistent,” must be having hot streaks and cold streaks. And, as numerous studies have told us, streakiness is almost always random. (See Alan Reifman’s "hot hand" blog for links and references.)
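The binomial benchmark for foul shooting can be sketched in a few lines. The 75% shooter with 8 attempts per game is a made-up example, and the calculation assumes every attempt is independent with the same success chance.

```python
import math

def ft_made_variance(attempts, success_rate):
    """Binomial variance of free throws made in a game, assuming
    independent attempts with a constant success rate."""
    return attempts * success_rate * (1.0 - success_rate)

# Hypothetical player: 8 attempts per game at 75%.
variance = ft_made_variance(attempts=8, success_rate=0.75)
sd = math.sqrt(variance)

print(variance, round(sd, 2))
```

A player whose observed game-to-game variance is close to this figure is showing no more inconsistency than coin-flip luck would produce.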

Finally, I’ve never heard any sports executive complain that a player was too streaky, except in the context of how badly he played during his off-days. Take, for example, baseball. In the 80s, both Bill James and the Elias people broke down every regular batter’s record by month. Do you remember any of them? Do you remember anyone ever complaining that a particular .300 hitter was less desirable than another because he got his .300 by hitting .250 in May but .350 in August? (It’s true that if a player hits .170 in September, you might wonder if something’s wrong with him, or he’s washed up. But that goes to the question of whether his established ability has changed, rather than any inconsistency.)

It makes sense that consistency is desirable in the non-sports context, for exactly the reasons the authors give – it makes it less costly for the employer to evaluate the employee. But in the NBA context, I don’t think that’s really the case.


### Did the NHL arbitrarily void Rory Fitzpatrick votes?

Rory wuz robbed!

Rory Fitzpatrick, a near-replacement-level journeyman defenseman for the Vancouver Canucks, was the focus of an All-Star ballot write-in campaign started by a fan named Steve Schmid. Schmid wrote some automated voting software, and got himself some publicity and followers. By mid-December, Fitzpatrick was in second place among Western Conference defensemen, and set to make the team.

According to Slate’s Daniel Engber, the NHL didn’t like this state of affairs. After ineffectual efforts to foil the automated voting programs, the league simply took the step of wiping out 100,000 of Fitzpatrick’s votes.

How did Engber (and the fans he links to) figure that out? Pretty easily. Engber writes,

"Since the league counted only ballots that were entirely filled in, there should have been an equal number of votes cast for hockey's two conferences. But for the week after Christmas, players in the Eastern Conference received 6 percent more votes than those in Fitzpatrick's Western Conference. Among defensemen, the results were even more skewed: The guys in the West—Rory among them—got 16 percent fewer votes overall. (These discrepancies were about three times bigger than any that had come before.) As bloggers were quick to point out, the numbers were exactly what you'd expect to see if the league had manually dumped 100,000 Rory votes. Nothing has been proved, but I'm hard-pressed to come up with another reasonable explanation." [Links in original.]

That is: there should have been exactly as many votes for Western Conference defensemen as Eastern Conference defensemen, since the league required ballots to be completely filled out. But there were 100,000 fewer. And according to the NHL’s own rules, that can’t happen.

The funny thing is, this is the absolute worst possible way to fudge the results without getting caught. It would have taken barely any extra effort for the league to have fully voided the 100,000 ballots instead of just the 100,000 Fitzpatrick votes. But they didn’t. Either they wanted to get caught, or they’re a lot less competent than you'd expect from a professional sports league.


## Thursday, January 18, 2007

### How much better is the AL than the NL?

It seems clear that in 2006, the AL was a better league than the NL. “Better,” that is, in terms of the quality of the players in the league. The most obvious reason is that in interleague games last year, the AL went 154-98 (.611). But there is other evidence to look at, too.

In his “Keeping Score” column in last Sunday’s New York Times, Alan Schwarz looked at starting pitchers who switched leagues between 2000-01 and 2005-06. He found that the ALers who moved to the NL chopped 0.85 off their ERA, while those who went the other way had their ERA jump by 0.70.

Of course, much of this is due to the DH. The old Bill James rule of thumb is that the designated hitter adds half a run per game. Half a run is about 0.45 of an earned run. It might be reasonable to bump that back up to 0.50 because of increased scoring since the 80s. Anything above that – or the 20 to 35 points that Schwarz found – might have something to do with the relative caliber of the two leagues.

When Schwarz repeated his study using “ERA+” – which adjusts for league averages – he found a consistent difference of 13%. For instance, pitchers whose ERA was 10% better than average in the NL found they were now 3% below average in the AL. However, I wonder if ERA+ really does take out the effects of the DH – 13% is about 0.65 earned runs per game, roughly equivalent to the raw totals Schwarz found without a DH adjustment.

And any difference found by this method is due only to the relative caliber of the hitters, of course. Schwarz’s study can’t tell us anything about how good the pitching is. Suppose the pitching is absolutely the same in both leagues. That would mean AL teams are about a quarter of a run per game better than NL teams. That works out to 4 wins per 162 games (.524).

Also, Schwarz’s study covered the past five years; the difference might have varied over that time. It has for W-L record – as recently as 2003, the AL had a losing record in interleague games, at 115-137.

Another way to look at the difference is by payroll. In 2006, the average National League payroll was lower than the AL’s, \$72 million to \$83 million. The difference is eleven million dollars, which should buy, say, three wins per season. That puts the average AL team three wins above the average NL team, which means that, all else being equal, the AL should have gone .519 (84-78) against the NL last year (see chart below).

These two studies are dwarfed by the state-of-the-art three-part investigation by Mitchel Lichtman last summer. Here’s what Lichtman did:

First, he showed us interleague records for the past few years, which I will reproduce here because they’re interesting (I've filled in full-season numbers for 2006):

| Year | AL-NL record |
|-------|--------------|
| 1997 | 97-117 |
| 1998 | 114-110 |
| 1999 | 116-135 |
| 2000 | 136-115 |
| 2001 | 132-120 |
| 2002 | 123-129 |
| 2003 | 115-137 |
| 2004 | 126-125 |
| 2005 | 136-116 |
| 2006 | 154-98 |
| Total | 1249-1202 |

Then, he investigated the actual talent shifts between the two leagues. That is, if lots of good players are moving from the NL to the AL, while it’s mostly mediocre players moving from the AL to the NL, that would certainly explain the record.

You can check out Lichtman’s study for the technical details, but the bottom line is that overall, since 1999, talent has indeed flowed, in substantial quantities, from the NL to the AL. In 2006, the American League’s talent advantage was 0.57 runs per game, split almost equally between hitting and pitching. That translates to an AL winning percentage of .560. That’s a lot larger than I would have expected: the average AL team would finish with a record of 91-71 against an average NL team.

But the numbers seem to make sense, and there's a good correspondence between year-to-year expected and actual AL vs. NL records.

Finally, Lichtman verifies the numbers via one other technique: he looks at the performance of AL pitchers against NL batters, and vice versa. Eventually, he finds that the AL’s 2006 winning percentage against the NL should be .543 by this method.

So we have four different estimates:

.524 – ERA+ differences in pitchers changing leagues (Schwarz)
.519 – salaries
.560 – talent moving between leagues (Lichtman)
.543 – batter/pitcher stats from interleague games (Lichtman).

Clearly, none of these numbers are anywhere near the actual .611 number posted last year by the American League. Might the AL just have had a run of good luck in 2006? If the true AL winning percentage should have been .550, the actual .611 is less than 2 standard deviations away. That’s not as outrageous as you might think. Personally, I suspect it was just random chance, and I’d be surprised if, in 2007, the AL was much over .550.
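The "less than 2 standard deviations" claim is a standard binomial calculation. This sketch treats the 252 interleague games as independent trials with a constant .550 chance of an AL win, which is a simplification.

```python
import math

al_wins, al_losses = 154, 98      # 2006 interleague record from the table above
games = al_wins + al_losses
actual_pct = al_wins / games

assumed_true_pct = 0.550          # the hypothetical "true" AL advantage

# Standard deviation of a winning percentage over `games` independent games.
sd = math.sqrt(assumed_true_pct * (1 - assumed_true_pct) / games)
z = (actual_pct - assumed_true_pct) / sd

print(round(actual_pct, 3), round(sd, 4), round(z, 2))
```

One SD works out to about 31 points of winning percentage, so the 61-point gap is a little under two SDs.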


## Monday, January 15, 2007

### An NBA free throw coach

A post from the Freakonomics blog points to a New York Times article about the free throw coach hired by the Dallas Mavericks.

Apparently, coaching works well, so well that Dubner calls it an "arbitrage opportunity" and wonders why other teams haven't done likewise.

If you want to read the full NYT article, go now -- it doesn't take long for the Times to make their archives subscription-only.

My previous post on underhanded foul shooting is here.


## Sunday, January 14, 2007

### Why don't teams and GMs care about the long term?

Would you pay an interest rate of 174% for a loan? Apparently, football GMs do. In their study of the NFL draft (previously reviewed here), Cade Massey and Richard Thaler looked at trades that involved only draft choices, and discovered, based on the value of the draft choices, that the average implied discount rate on those trades was 173.8%.

For instance, the authors write,

“... teams seem to have adopted a rule of thumb indicating that ... for example, a team trading this year’s 3rd-round pick for a pick in next year’s draft would expect to receive a 2nd-round pick in that draft.”

According to NFL teams’ own internal charts of draft choice values, a mid-second-round pick is worth “420,” while a third-round pick is “190”. Receiving 190 this year, in exchange for 420 next year, calculates to an interest rate of 121%. (Presumably, to get from 121% up to the average of 174%, there must be other trades with an interest rate in the 200s!)
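The implied interest rate is just the ratio of the two chart values minus one. A minimal sketch (the function name is mine):

```python
def implied_annual_rate(value_received_now, value_owed_next_year):
    """One-year interest rate implied by swapping a pick now for a pick next year."""
    return value_owed_next_year / value_received_now - 1.0

# Chart values quoted above: third-round pick = 190, mid-second-round pick = 420.
rate = implied_annual_rate(value_received_now=190, value_owed_next_year=420)
print(f"{rate:.0%}")
```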

In real life, if the market sets an interest rate of 121%, that would suggest at least a 50/50 chance of the borrower going bankrupt and the creditor not getting any money back at all. But, in the NFL, there’s no doubt that the borrowing team will make good on its promise of the second round choice – the league offices will see to that. So why would a team think of paying 121%?

It could be the incentives its management faces. Every GM knows that he could be out of a job at any time, which gives him an urgent incentive to show improvement right now, before he gets fired. If the GM is thinking only of himself, and not the team, the high discount rate might be perfectly appropriate – a 50% chance of getting fired is roughly equivalent to a 50% chance of going bankrupt. Either way, the party choosing to defer the possible benefits has only a one-in-two chance of reaping them.

This is the argument that J. C. Bradbury and Keith Law are making (in the Sabernomics blog) for a slightly different situation – the question of why baseball teams pay so much for players who are demonstrably not worth it, and why they continue to issue long-term contracts to players in their declining years.

Law writes,

“The problem is that the best way to keep a GM job when you know you’re in danger of losing it is to produce results in the short term, sometimes in the very short term. This idea of trading a dollar in the future for 10 cents in the present often manifests itself in moves like trading prospects or young players for “proven” veterans, signing well-known free agents whose name value exceeds their on-field value, and backloading deals to maximize disposable payroll in the current year without regard to the payroll consequences for future years.”

While I think that this argument is certainly a part of the answer, I don’t believe it’s even close to the entire answer. In the football case, the 174% discount rate is more than just an unwritten rule of thumb – it’s literally accepted, in writing, in the form of 30 teams’ similar draft value charts. If thirty general managers were conspiring – in writing! -- against thirty team owners, in an effort to keep their positions, all it would take is for a few owners to tear up the draft value chart and present the GM with a better one along with orders to use it. Or for one GM to start aggressively trading current draft choices for future ones. At a 100% discount rate, one choice today would be worth 32 choices five years from now. That seems like a good way for a GM to attract attention as a genius (or at least attract attention as something).

More practically, a team could trade this year’s draft choices and receive double draft choices forever! To do that, trade this year’s choices for two choices next year – in addition to the “regular” choice next year, that gives you three. Next year, trade one of those three for two again the next year. Repeat, and you’ll have two draft choices instead of one, every year, forever. If you were a GM, wouldn’t you think a deal like that would impress your ownership?
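Both the 32-choices figure and the "double draft choices forever" chain can be sketched quickly. This assumes a two-for-one trade (a 100% rate) is available every single year, which is the post's hypothetical, not a claim about the real trade market; the function names are mine.

```python
def future_equivalent(rate, years):
    """How many picks `years` from now one pick today is worth at a given annual rate."""
    return (1 + rate) ** years

def simulate_trade_chain(years):
    """Picks actually used each year if the team trades one pick for two
    next-year picks every year (assumed to always be possible)."""
    used_per_year = []
    extra_from_last_trade = 0
    for _ in range(years):
        on_hand = 1 + extra_from_last_trade  # regular pick plus last year's proceeds
        on_hand -= 1                         # trade one pick away...
        extra_from_last_trade = 2            # ...for two picks next year
        used_per_year.append(on_hand)
    return used_per_year

print(future_equivalent(1.0, 5))
print(simulate_trade_chain(6))
```

The cost of the chain is a single year with no pick; every year after that the team drafts twice.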

In the baseball case, if it’s against the better interests of the team for the GM to offer the contracts he does, that would imply that not offering those contracts would make the team better off. But if 29 GMs are offering too much money to free agents, the (rational and ethical) thirtieth GM does not have the option of paying less – he only has the option of not paying at all. That would put his team in the basement, and that’s not clearly in the interests of the team. I think a better explanation of out-of-control contracts is a model where some owners are willing to lose money for the pride of having a winning team. Their overspending forces other teams to also overspend, even teams that want to maximize profits rather than wins, and the result is a salary spiral.

In any case, Bradbury has what I think is an excellent idea for how owners can bring rationality back to the executive office. His suggestion is to offer the GM a bonus for how well draft choices turn out – even if the GM has since been fired. (This doesn’t mean GMs would necessarily make more money overall, since their base salaries could drop by the expected amount of the bonuses.) The same calculation could work for trades – GMs could be credited (or debited) based on the performances of the traded players in subsequent years.

It seems like this extra incentive would work very well. An early draft choice next year would earn more money overall for the GM than a later draft choice this year. Once every GM starts realizing this – which would probably take about fifteen seconds – the market price would immediately adjust. It won’t adjust perfectly, because the GM will receive only part of the value of the draft choice. If a second round choice is worth a million dollars more than the third round choice, but the GM only gets to keep 1% of that, he might be willing to sacrifice \$10,000 in bonuses to show results this year and keep his job.

And on the flip side, the incentive structure needs to be considered carefully, because sometimes you absolutely need to make a “bad” trade for a short-term need. In 1992, the Blue Jays traded Jeff Kent and Ryan Thompson for two months’ worth of David Cone. In raw long-term value, the trade looks like a disaster. But Cone helped the Jays win the World Series, and GM Pat Gillick probably has no regrets. But would he have made that same trade if it would have cost him thousands of dollars personally?

Still, it seems like the right set of incentives could work to correct whatever anomalies ownership is concerned about. If GMs received part of the profit from a Nolan-Ryan-for-Jim-Fregosi trade, then they’d at least be thinking a little more about long term benefits. And if they had to take a loss for trading a draft choice that turns into a superstar, they'd suddenly realize that part of the 121% interest rate they’re paying would be on their own money.


## Tuesday, January 09, 2007

### Charlie Pavitt on recent clutch hitting studies

I am writing this entry in order to put my two cents in on an essay written by Bill James and published in Baseball Research Journal 33 (2005) a couple years back called “Underestimating the Fog.” It got a lot of response at the time, most notably from Jim Albert and Phil Birnbaum in By the Numbers, Volume 15 Number 1 (see page 3 for Jim, page 7 for Phil). Phil’s response was specifically relevant to one of the claims Bill made in that essay: that the conclusions made by previous analysts that clutch hitters as such don’t exist were premature. Bill believed the existence of clutch hitters to be an open question; Phil attempted to provide evidence that we have good reason to believe that there is no such thing. I wish to make some comments on the validity of Bill’s claim. Although I will use the clutch hitting case as an example, my comments are also relevant to analogous issues such as “hot and cold streaks aren’t real” (which Bill also claimed to be an open question in his essay).

I write this from the standpoint of the traditional logic of hypothesis testing (apologies to those readers already familiar with this way of thinking). In this tradition, one proposes a “research hypothesis” (e.g., clutch hitting exists) and a corresponding “null hypothesis” voicing its opposite (e.g., clutch hitting does not exist). One then compares one’s data to these hypotheses, and finds support for whichever hypothesis the data more closely reflects. The problem is that no matter what the data look like, one can never be sure that one’s finding is accurate, because, due to the law of chance, fluky data is always a possibility. It could be that in the real world there is no such thing as clutch hitting but that it appears to exist in one’s data, leading to an incorrect conclusion in favor of the research hypothesis. This is like flipping a fair coin 20 times and getting 17 heads (which I have seen occur twice in a course exercise I have used) and concluding that one’s coin is biased; statisticians call this “Type I error,” although it would have better been called Type I bad luck. It could also be that in the real world there is such a thing as clutch hitting but it does not show up in one’s data, leading to an incorrect conclusion in favor of the null hypothesis. This is like flipping a biased coin 20 times and, despite the bias, getting 10 heads; this is known as Type II error, although again it is more bad luck than mistake.
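The coin examples can be checked directly with the binomial distribution. This is a minimal sketch using only the standard library; the 0.7 bias in the second call is an assumed value for illustration, since the text does not say how biased the coin is.

```python
from math import comb


def prob_exactly(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more heads in n flips."""
    return sum(prob_exactly(i, n, p) for i in range(k, n + 1))


# Type I flavour: a fair coin that happens to show 17 or more heads in 20 flips.
print(f"{prob_at_least(17, 20):.5f}")   # 0.00129

# Type II flavour: a biased coin (assumed p = 0.7) that shows exactly 10 heads.
print(prob_exactly(10, 20, p=0.7))
```

A result that rare for a fair coin (about 1 in 776) will still turn up now and then across many classrooms, which is the point about Type I "bad luck."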

On any single occasion, one can never be sure whether one’s findings are accurate or not. However, in a given research situation, we can use laws of probability to estimate the odds that Type I or Type II error would occur, and use these estimates as the basis for our decision concerning which hypothesis should gain support. If our data appears to support the research hypothesis and the calculated probability of Type I error is very small (one’s “significance level,” traditionally in the social and behavioral sciences, less than five percent), then we claim support for that research hypothesis, although we know there is some chance that claim is wrong. If our data appears to support the null hypothesis and the calculated probability of Type II error is very small, then we can analogously claim support for the null hypothesis with the corresponding proviso.

The point I wish to make concerns the second of these possibilities. Although there have been two recent cases in which researchers claim to have found an impact, both of which I will comment on later, research as a general rule has supported the null hypothesis that clutch hitting as a skill does not exist. However, our trust in this conclusion has to be tempered by the chance for Type II error. Now, rather than Type II error, the odds of supporting the null hypothesis in error, we tend to think about the issue in terms of its complement: statistical power, the odds of correctly supporting the research hypothesis. In this case, that would be the odds of finding evidence for clutch hitting in a given data set assuming it really exists. Now, statistical power (and Type II error) is determined by three different factors. First, significance level, which is as noted earlier traditionally .05 in the social and behavioral sciences. If one becomes more lenient and makes it say .10, then in so doing one increases one’s statistical power. The problem with doing so is that one increases the odds of Type I error; as a consequence, we don’t normally muck around with significance level. Second, one’s sample size; the more data we can work with, the more random fluctuations in data are likely to cancel one another out, making it more likely to find something that is there. Third, the strength of the effect itself. This is called “effect size.”
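The way those three factors drive power can be sketched for one common case, a test of whether a correlation differs from zero. This uses the Fisher z approximation and a two-sided test, both of which are my assumptions for illustration and not necessarily what any of the studies discussed here used; the sample sizes and effect sizes in the usage lines are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test of zero correlation.

    rho   -- assumed true correlation (the effect size)
    n     -- number of paired observations (the sample size)
    alpha -- significance level
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)      # critical value under the null
    shift = atanh(rho) * sqrt(n - 3)     # expected test statistic if rho is real
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


# Hypothetical inputs: each factor moves power in the direction described.
print(correlation_power(0.05, 500))               # weak effect, moderate sample
print(correlation_power(0.05, 5000))              # same effect, larger sample
print(correlation_power(0.05, 500, alpha=0.10))   # more lenient significance level
print(correlation_power(0.20, 500))               # stronger effect
```

With an effect size of zero the function returns the significance level itself, which is just the Type I error rate.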

Let us turn now to clutch hitting. To be clear, the issue is not whether there is such a thing as a clutch hit; of course there is. The issue is whether there are certain players that have some special talent such that they are consistently better in the clutch than in “normal” game situations, whereas other players do not have this talent. If clutch hitting as an inherent ability is a significant factor in performance, it would be easier to find than if its impact is weak. Given the difficulty people have had in finding a clutch hitting effect, if it actually exists, it must have a very small effect size. Given that we don’t mess with significance level, we would need to increase our sample size as much as feasible if we want to increase our power and the resulting odds of finding an effect if it exists. Having said this: If we run a study and find no evidence for clutch hitting, we will be confident in that conclusion to the extent that statistical power is high and, saying the same thing a different way, Type II error is low. We can calculate that power, based on our sample size, our significance level, and the effect size of clutch hitting. We know our significance level, we know our sample size, but we do not know our effect size. We can only guestimate it. We can never prove that clutch hitting does not exist as an inherent talent, no matter what our data look like, because our guestimate of its effect may be too large. Herein lies the fog that Bill alluded to.

Now, let’s get to what Phil Birnbaum did. Bill James had claimed that the Dick Cramer method for studying the clutch hitting issue, looking at year-to-year correlations in clutch performance (see the 1977 Baseball Research Journal), is invalid, because one cannot assume anything based on random data. In his article, using the Elias definition of a “late inning pressure situation” to distinguish clutch from non-clutch plate appearances, Phil computed the statistical power of this type of test assuming various effect sizes. This is a good way to deal with the fog issue; show the possibilities and then let the reader decide. For example, based on his data, if the correlation in clutch performance from season to season was a trifling .08, he would have found it 97.7% of the time; as he did not find it, if such an effect exists, it must be even weaker than that. Assuming the validity of his data set and analysis, what cannot be denied from his work is that if clutch hitting does exist as a talent, then its effect size must be extremely small. The moral of our story: Bill James was right about the following: we will never be in the position to definitively say that an effect does not exist. This is because, no matter how large our sample, our analysis will assume some effect size, and the effect could always be smaller than our assumption. But Bill did not understand that we can estimate the odds that there is no effect given different conceivable effect sizes. And at some point, we can also place ourselves in the position of concluding that if such-and-such an effect does exist, it is too small to have any appreciable impact on the game.

Let me now turn to the two recent claimed demonstrations of a clutch hitting effect. One is by Nate Silver in the Baseball Prospectus folks’ Baseball Between the Numbers. I like Nate’s general method; rather than making a clear distinction between clutch and non-clutch situations, Nate used the concept of situational leverage (the likelihood that the outcome of a given game situation will determine which team wins the game) to estimate the degree of clutch-ness in a batter’s plate appearances, which could then be compared to the player’s overall performance to see if he tended to do better or worse as situational leverage increased. Turning a dichotomous clutch/non-clutch distinction into a gradated degree of clutch-ness scale (in technical terms, from ordinal measurement into interval measurement) improves the subtlety of one’s measure, which in effect is another way to improve one’s statistical power. Now, apologies for getting a bit technical; using a sample of 292 players, Nate found a correlation of .33 in their “situational hitting” (his term, which is probably better than “clutch” given his method) between the first and last half of their careers, which is significant at better than .001; in other words, Type I error rate is less than one in a thousand. Finally, some evidence that clutch hitting might indeed exist as a skill. Nonetheless, Nate is very careful to downplay the effect, guestimating that it may account for 2 percent of the impact of offense on wins.
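The "better than .001" claim can be sanity-checked from the two numbers given, the correlation of .33 and the sample of 292 players. This is a rough sketch using the Fisher z approximation and a two-sided test, which are assumptions on my part; the book may have used a different test.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for an observed correlation r on n pairs."""
    z_stat = atanh(r) * sqrt(n - 3)   # Fisher z transform, scaled by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


p = correlation_p_value(0.33, 292)
print(p < 0.001)   # True
```

Under these assumptions the p-value comes out far below .001, so the significance claim looks safe; what it says about the size of the skill is a separate question.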

Tom Tango, Mitchel Lichtman, and Andrew Dolphin also devote a chapter to the issue in The Book. They unfortunately use the dichotomous clutch versus non-clutch distinction despite the fact that most of the analyses in their book rely on situational leverage, and they claim to have found an effect but do not provide any relevant data for the rest of us to examine (other than lists of players with good and bad clutch hitting indices, which does not rate as evidence that the effect is non-random). So I cannot judge the validity of their conclusion one way or another. This chapter is one of the few weak points of an otherwise impressive body of work.

-- Charlie Pavitt


## Sunday, January 07, 2007

### How dominant should the Yankees be to maximize league revenues?

Every couple of years, you read complaints about how World Series TV ratings are going to be very low because the pennant winners are small market teams. The implication is that if Major League Baseball had a choice, they would make a lot more money if they could arrange for big city teams to win most of the time.

If that's true for the World Series, it's also true for the regular season. If winning lots of regular season games increases attendance by 10%, then it's better for the league as a whole if it's the Yankees doing the winning. Ten percent of the Yankees' revenues is much, much higher than ten percent of the Expos' revenues.

At some point, though, that arrangement could backfire. If the Yankees are 15 games up by the end of May, fan interest in Baltimore and Toronto might drop so much that even higher attendance in New York won't make up for it. Therefore, there must be some kind of equilibrium, where making the Yankees good, but not *too* good, maximizes the revenues of MLB as a whole.

Finding that equilibrium is the goal of a 1976 study (from the American Economic Review, JSTOR access required) by Joseph W. Hunt Jr., and Kenneth A. Lewis, "Dominance, Recontracting, and the Reserve Clause: Major League Baseball."

Hunt and Lewis run regressions on team-seasons from 1969-73, to predict home and road attendance based on a bunch of factors. These factors are:

-- metro population
-- if the team is a division winner, its average lead over the season
-- a weighted average of games behind on August 15 and end of season
-- that same weighted average for last year
-- a dummy variable (2, 1, or 0) for the length of time since the last pennant
-- the number of superstars on the team (45 HR or 25 wins)
-- a bunch of other stuff specific to the team – stadium, ticket price, etc.

Armed with their attendance estimators, they can then compute expected revenues (the sum of: home attendance, road attendance, post-season, and TV money). I won't say much about the results of the regression, because the signs are about what you'd expect in terms of direction and magnitude.

But having done all that, the authors now want to figure out exactly where the equilibrium level of dominance is, in terms of winning the division. How often should the city with the highest population -- call it the Yankees -- win the AL East, in order to maximize the entire division's total revenues?

To calculate this, the authors have to do some additional work; that's because their regression doesn't include "chance of winning the division". They must therefore figure out the relationship between the probability of winning, and those other variables above.

To do that, they run more regressions to try to relate the known dependent variables listed above (average games behind, number of superstars) to probability of winning the division. Based on that, they sketch out – by experimenting -- what a typical division might look like, for a given Yankee probability of winning the division.

Suppose you wanted to construct a division where the Yankees win 30% of the time. The authors have it looking like this:

| Team | Prob. of winning division | Population (millions) | Avg. games behind leader | Avg. lead (games) | Pennant recency (0-2) | Superstars |
|---|---|---|---|---|---|---|
| Yankees | 30% | 7.0 | 9.0 | 2.0 | 1.67 | 0.36 |
| Team 2 | 25% | 3.5 | 16.0 | 1.5 | 1.20 | 0.34 |
| Team 3 | 15% | 3.0 | 18.0 | 1.0 | 0.50 | 0.32 |
| Team 4 | 15% | 2.5 | 20.0 | 1.0 | 0.50 | 0.32 |
| Team 5 | 10% | 2.0 | 22.0 | 0.5 | 0.45 | 0.30 |
| Team 6 | 5% | 1.5 | 25.0 | 0.2 | 0.23 | 0.23 |

To read the top row of the chart, for the Yankees to have a long-term 30% chance of winning the division (with a population of 7.0 million), they would average 9.0 games behind the leader, they would lead the division by an average 2.0 games throughout the 30% of seasons they win, the time since their last pennant would be 1.67 out of 2, and they would have 0.36 of a superstar on their team.

With their model constructed, the study can go back to its original regression to see how much revenue each team will now have. Totalling up all six teams gives \$29,307,095.

That \$29.3 million is when the Yankees win 30% of divisions. The maximum revenue occurs when the Yankees actually win 43% of divisions. Here's their chart:

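| Yankees' share of division titles | Total division revenue |
|---|---|
| 20% | \$28.5 million |
| 30% | \$29.3 million |
| 40% | \$29.4 million |
| 50% | \$29.3 million |
| 60% | \$28.9 million |
| 70% | \$28.5 million |
| 80% | \$28.0 million |
| 90% | \$27.2 million |
| 100% | \$25.1 million |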

Hunt and Lewis write,

"Over the past thirty years, the level of domination achieved by the New York Yankees has been 50 percent, while that achieved by the Los Angeles/Brooklyn Dodgers has been about 37 percent. The long-run experience is consistent with the predictions implied by the experiments, indicating that the recontracting market [where players are sold from small market teams to the Yankees and Dodgers] operates reasonably well over the long run."

From an economic standpoint, what the authors are saying is this: as predicted by the Coase Theorem, players will go where they can earn the most revenue for a team. Since a superstar makes a lot more money for the Yankees than he would on the Brewers, the Brewers should be selling players to the Yankees so as to maximize both teams' revenues, but only until the Yankees are good enough to win the division 43% of the time. And the study supports the idea that this happens.

But I don't think the technique is strong enough to support the conclusions. Notice that the revenue estimates are very close – the 30% figure is only 3% higher than the 70% figure. Are the regressions, based on a small amount of data as they are, reliable enough that we can trust the results within 3%? I think it's obvious that we cannot.
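To see just how flat the revenue curve is, the chart's figures can be compared directly. This is a small sketch that only re-does the arithmetic on the published numbers (in millions of dollars); the dictionary and function names are mine.

```python
# Total division revenue (millions of dollars) by Yankee share of division titles,
# copied from the Hunt and Lewis chart quoted above.
revenue = {20: 28.5, 30: 29.3, 40: 29.4, 50: 29.3, 60: 28.9,
           70: 28.5, 80: 28.0, 90: 27.2, 100: 25.1}


def pct_gap(share_a: int, share_b: int) -> float:
    """Percentage by which revenue at share_a exceeds revenue at share_b."""
    return 100 * (revenue[share_a] / revenue[share_b] - 1)


print(round(pct_gap(30, 70), 1))   # 2.8
print(round(pct_gap(40, 20), 1))   # 3.2
```

Everything between 20% and 70% sits within roughly 3% of the peak, so a modest error in the regression estimates would be enough to move the apparent optimum a long way.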

And some of the regression variables seem arbitrary – for instance, the dummy variable for pennants is set to 2 if the team won in the past four years, and 1 if they won within the past nine years. Is that really what we'd expect? What if teams that haven't won in a long time – say, the Cubs and Red Sox – generate attendance *because* of that? And why should the *average* number of games ahead matter? Won't a close race attract more fans than a runaway championship?

Generally, my gut says that if you changed the variables used and the experiment even a little bit, you'd get substantially different results.

My feeling is that the method behind this paper is reasonable, and if you had perfect data, it might work. But I'd bet that you'd need so much data, and there are so many unknown variables driving attendance and revenues, that even doing the best you can, with the best information available – as Hunt and Lewis did – just isn't going to be enough.


## Thursday, January 04, 2007

### Do fans care about "uncertainty of outcome"?

One of the arguments in "The Wages of Wins" is that fan interest in a game is proportional to the amount of "uncertainty of outcome" in the game. For instance, two closely-matched teams playing each other should draw higher interest than the Yankees against the Royals.

The TWOW authors illustrate this theory with the example of ESPN Classic, the channel that broadcasts reruns of historic games. Even though these are some of the best games in sports history, the authors argue, ratings are very, very low. According to the authors, that's because viewers value uncertainty, and everyone knows who's going to win the rerun.

I can't fully agree. I don't think the reason few people watch the reruns is just because they know who's going to win – it's because they also know the score, and the sequence of events, and the plot. It's also because they may have seen the game before. It's also that live games have a more "participatory" feel than taped ones. And it's also because live games are "news," while old ones aren't. (Would you watch a game tape-delayed by six hours, even if you had no idea what happened? I wouldn't.)

You can understand that fans don't want to watch an old game without necessarily believing that they care, for a live game, whether a team is a 55% favorite or a 60% favorite.

So I decided to look up the three studies the authors cited. I only found one of them online (via my public library), and I don't find it at all convincing.

It's from the Fall 1992 issue (36.2) of the journal "American Economist." It's by Glenn Knowles, Keith Sherony, and Mike Haupert, and it's called "The Demand for Major League Baseball: A Test of the Uncertainty of Outcome Hypothesis." (I couldn't find it online, even an abstract.)

Knowles, Sherony and Haupert take every game in the 1988 National League. They run a regression to predict attendance based on these variables:

Games Behind (sum of GB for both teams)
Whether the game is a weekend game
Whether the game is a night game
Population of home city
Unemployment rate of home city
Per Capita Income of home city
Distance between the teams' two cities

Their findings: the two income variables are not significant. All the other variables are significant in the expected direction (for instance, the bigger the city, the higher the attendance).

And, most important for this study, the authors find that attendance increases up to the point where the home team has a .600 chance of winning; then it declines.

The authors conclude:

"The uncertainty of the game's outcome is indeed a significant determinant of attendance. In addition it was shown that attendance is maximized when the home team is slightly favored. ... These results indicate that competitive balance is important in MLB, and that league-wide cooperation in establishing it will increase attendance ..."

But there are obvious problems with the regression.

-- the assumption that attendance should depend on the sum of the two teams' GB is not reasonable, since, for the most part, paying customers are fans of only the home team. If the home team is one game out of first place, and the visiting team is 20 games out, you'd expect attendance to be high – the other way around, you'd expect it to be low.

-- GB has different meaning early in the season vs. late in the season, but the regression treats them the same.

-- why should attendance be based on the distance between the two cities? Do you really expect more Yankee fans to come out to see the Twins than the Mariners because Minnesota is only half as far away? An indicator variable for "traditional rivalry" or something would have been more appropriate.

-- while attendance does correlate with city size, the relationship is not perfect. Some large cities, it is argued, don't have as much per-capita fan support for cultural reasons – I've heard Philadelphia described this way. If, in 1988, baseball hotbed teams tended to be close to .500, while less fanatical cities tended to be at the extremes, this could cause the observed effect without anything to do with in-game parity.

-- in 1988, the teams with the best records happened to be the ones with the highest populations – New York and Los Angeles. Suppose that attendance isn't exactly proportional to population, but drops off after a few million people in the Metro area (which kind of makes sense – once a game is sold out, it's sold out, no matter how many people want to attend). Suppose that, for this reason (or any other), the regression overestimates the Mets' expected attendance. Then, the regression might falsely attribute the difference to the fact that the Mets are so good that they play too many "unbalanced" games.

It's possible that despite these problems, the results still hold. To find out, you'd have to fix some of these issues and see if the ".600" result changes. (My intuition is that it would indeed change.)

And there's another argument. Conventional wisdom is that fans want to see their team win, but, also, they come out to see good opposition, and noted superstar players on the opposing team. If both those are true, what should happen?

First, more fans should come out when the home team has a higher chance of winning the game. And, second, fewer fans should come out when the other team doesn’t have any glamour.

This would explain the ".600" result perfectly. Attendance increases as the team improves from .500 – but, when the opposition is a bad team (and therefore the chance of winning increases past .600), some fans decide to stay home.

Under that theory, it may *appear* that fans care about uncertainty of outcome, but what they really care about is watching their good team play another decent team.

And that latter theory seems a lot more plausible to me.

## Tuesday, January 02, 2007

### A winning strategy: bet home underdogs

This Steve Levitt study is more about gambling markets than the sports themselves, but it comes to a surprising conclusion – that it's possible to make money betting on football using a very simple strategy.

Conventional wisdom is that a bookie will deliberately set the point spread for a football game so that an equal amount of money is bet on each of the two teams. That way, and since bettors have to bet \$110 to win \$100, the bookie is guaranteed a fixed profit regardless of who wins. If bettors put \$110,000 on each side, the bookie takes in \$220,000 but pays out only \$210,000 to the winners. He therefore assures himself a profit of \$10,000 (which is a little less than 5% of the total amount bet).
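The balanced-book arithmetic is easy to reproduce, and the same function shows what happens when the book is not balanced. This is a minimal sketch; the function name is mine, and the 60/40 split in the second example is a hypothetical split of the same \$220,000 handle.

```python
def bookie_profit(stake_a: float, stake_b: float, winner: str,
                  risk: float = 110, win: float = 100) -> float:
    """Bookie's profit when side `winner` ('a' or 'b') covers the spread.

    Bettors risk `risk` dollars to win `win` dollars. The bookie keeps the
    losing side's stakes and pays winnings to the winning side.
    """
    winning = stake_a if winner == "a" else stake_b
    losing = stake_b if winner == "a" else stake_a
    return losing - winning * win / risk


# Balanced book: the same profit whichever side wins.
balanced = bookie_profit(110_000, 110_000, "a")
print(balanced, balanced / 220_000)   # profit and its share of all money bet

# Hypothetical 60/40 book: the result now depends on who covers.
print(bookie_profit(132_000, 88_000, "a"))   # heavy side wins
print(bookie_profit(132_000, 88_000, "b"))   # light side wins
```

With a balanced book the bookie has no stake in the outcome; with a lopsided one he is effectively betting on the light side.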

That's the theory; is it true in practice? It's hard to check, because bookies are reluctant to give out this information. But Levitt was able to find some public data from an online betting tournament. He found that, contrary to expectation, there are unequal amounts bet on the two sides of the spread. Instead of about 50/50, a typical distribution is 60% on one side and 40% on the other. That's significantly different from what you would expect by chance.

What does this mean? It means that the consensus is wrong -- bookies do *not* successfully choose the point spread to equalize betting on both sides.

Is it because they're not smart enough to predict what the spread should be? No, that can't be the case, because the deviations aren't random. Levitt found that the effect is skewed towards favorites. That is, in the typical 60/40 split, the "60" is usually bet on the favorite. And, in fact, when the favorite is the visiting team, more money is bet on the road favorite than the home underdog about 90% of the time!

Clearly, there's something else going on. Otherwise, bookies would just bump the spread down a couple of points to move more action to the home underdog, thus evening out the betting. The fact that they don't do that suggests that they have a reason for not wanting to.

That reason: home underdogs don't make the bookie as much money. They beat the spread much more often than 50% of the time. That is, bettors are so biased towards road favorites that they're willing to bet on them even when their odds are below 50/50. Bookies are therefore happy to see extra bets on them, because, even though they'll lose money if the favorite comes through, in the long run they'll still make more profit.

(As an extreme example, think of it this way: suppose a thousand dumb people are willing to bet you \$110 to \$100 that the Raiders will win the Super Bowl next year. And suppose another thousand rational people are willing to bet you \$110 to \$100 that the Raiders won't win. In that case, if you take all the bets, you're guaranteed a profit of \$10,000. But wouldn't you be tempted to move the odds a little bit to encourage more people to bet on the Raiders, and fewer to bet against them? Sure, if the Raiders win, you'll lose money, but the chances of that happening are very low, and so your expected profit will be significantly more than \$10,000.)

This hypothesis needs data to support it, of course – and Levitt comes up with that data. It does indeed turn out that both requirements for the hypothesis are met – (a) favorites cover the spread less than 50% of the time, and (b) more than 50% of customers bet on the favorite anyway. A summary of the findings:

Home favorites attract 56.1% of the bets, which are won 49.1% of the time;
Home underdogs attract 31.8% of the bets, which are won 57.7% of the time;
Road favorites attract 68.2% of the bets, which are won 47.8% of the time; and
Road underdogs attract 43.9% of the bets, which are won 50.4% of the time.

If you do the arithmetic, as Levitt did, you find that the above results show that bettors, being unduly biased towards favorites, win only 49.45% of their bets, instead of 50%. The missing 0.55% goes to the bookie. That increases his profit from 5% to 6.1%, which is a 23% increase. In exchange, the bookie takes the risk that, over a given time period, favorites will hit a lucky streak, and he'll make less money (or even post a loss). Levitt argues that the risk is small compared to the 23% increase in earnings.
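The jump from 5% to about 6.1% can be reproduced from the bettors' overall win rate alone. This sketch assumes the percentages are measured per \$100 that a bettor is trying to win, which is my reading of the figures and not something the summary above states; it also ignores pushes.

```python
def bookie_margin(bettor_win_rate: float, risk: float = 110, win: float = 100) -> float:
    """Expected bookie profit, in dollars, on one bet risking `risk` to win `win`."""
    return (1 - bettor_win_rate) * risk - bettor_win_rate * win


fair = bookie_margin(0.50)       # bettors win half the time
skewed = bookie_margin(0.4945)   # bettors win 49.45% of the time

print(fair, skewed)                        # about 5.0 and 6.155
print(round(100 * (skewed / fair - 1)))    # 23
```

A shortfall of barely half a percentage point in the bettors' win rate is enough to raise the bookie's expected take by almost a quarter.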

And so Levitt's conclusions are:

-- bettors consistently overestimate favorites;
-- bettors like to bet on favorites anyway;
-- bookies recognize this, and are willing to allow more bets on favorites to increase their expected profits (despite the extra risk).

Moreover, Levitt looked at all NFL spreads from 1980-2001. He found that home underdogs beat the spread 53.3% of the time – higher than the 52.4% success rate a bettor needs to overcome the "110-to-win-100" vigorish and break even. And so, the simple strategy of betting the home underdog can turn a profit. Not only that, but the bookie actually knows it, but is willing to put up with it to make more money from the favorite bettors.
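Both the break-even rate and the edge from home underdogs follow from the 110-to-win-100 terms. This is a rough sketch that ignores pushes and assumes every bet is the same size; the function names are mine.

```python
def break_even(risk: float = 110, win: float = 100) -> float:
    """Win rate at which a bettor's expected profit is exactly zero."""
    return risk / (risk + win)


def expected_profit(win_rate: float, risk: float = 110, win: float = 100) -> float:
    """Bettor's expected profit, in dollars, per bet of `risk` dollars."""
    return win_rate * win - (1 - win_rate) * risk


print(f"{break_even():.3f}")              # 0.524
print(round(expected_profit(0.533), 2))   # 1.93
```

At a 53.3% cover rate the bettor expects to make a little under two dollars on each \$110 risked, a thin edge but a positive one.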

(Levitt finds that in both NCAA and NBA basketball, home underdogs also cover in about 53% of cases.)

Bookies could go even further – skew the line even more towards the favorite – to try to make even more money (again, at higher risk). But at some point, the advantage to the underdog bettors becomes so great that they wind up betting much more than they would otherwise, and the bookie loses his advantage. Levitt thinks that line occurs when betting *all* underdogs, not just home underdogs, becomes a winning strategy. At that point, the wisdom of the "you can make money betting on underdogs" rule would become so well-known that the bookies would no longer be able to depend on customer ignorance.

The study was published almost three years ago. Has the market adjusted to the new information? Maybe, but Levitt thinks that home underdogs are still profitable.
