Sabermetric Research: June 2008

Wednesday, June 25, 2008

SABR presentation on Hamermesh study

I'll be giving a presentation on the Hamermesh study Saturday, at the SABR convention in Cleveland. For those interested, I've posted my PowerPoint presentation at my website, www.philbirnbaum.com. Slides are subject to change between now and then ... the presentation is too long, so I'll probably wind up shortening it a bit.

Comments and suggestions are welcome.

Labels: baseball, Hamermesh update, umpires

Did NBA playoff referees cheat in favor of the home team? Part II

In the last post, I looked at Kevin Hassett's analysis of home field advantage in the NBA playoffs. It occurred to me that it was possible that HFA should be higher in the playoffs because the teams are closer together in talent (since the worst teams don't play). However, before writing the post, I did some quick math and concluded that wasn't the case.

I was wrong. Brian Burke, over at Advanced NFL Stats, convinced me of that, with logic and empirical data. Here's his original post, and a follow up on the topic.

My mistake (in my mind, not in the post) was assuming that HFA manifests itself in winning percentage. It does not. It manifests itself in better play – improved shooting, and defense, and rebounding. That turns into more points and fewer points against, and that's what causes the increase in wins.

But a fixed improvement in point differential translates into a different number of wins, depending on how close the two teams are in talent.

HFA in basketball is about 3 points. That means that the home team's advantage will only change the outcome of the game if, without the HFA, the visiting team would have won by 0-3 points. If the visiting team would have won by more than that, the three points wouldn’t have helped the home team. And, of course, if the home team would have won anyway, the HFA doesn't matter.

What's the probability that the visiting team would win by 0-3 points? That depends on the relative quality of the two teams. If the home team is way better than the visitors, it won't be very high. And if the visiting team is much better than the home team, it will win by more than 3 points so often that the probability again will be low.

So the closer the talent, the higher the home field advantage. That means HFA is higher in the NBA playoffs than in the regular season. How much higher? I'm not sure, but if you bump it up from .600 to .630, the significance level of what we saw in this year's playoffs (a home record of 64-22) goes from a 1 in 161 chance down to about a 1 in 40 chance.
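For anyone who wants to see the effect in numbers, here is a minimal sketch. It assumes the final margin of a game is normally distributed around the talent gap, with a standard deviation of 12 points. That SD, the normal model, and the function names are illustrative assumptions, not figures taken from Burke's posts.

```python
from math import erf, sqrt

MARGIN_SD = 12.0   # assumed SD of a single-game point margin (illustrative)
HFA_POINTS = 3.0   # home-court edge in points, as quoted in the post


def win_prob(expected_margin, sd=MARGIN_SD):
    """P(margin > 0) when the margin is Normal(expected_margin, sd)."""
    return 0.5 * (1.0 + erf(expected_margin / (sd * sqrt(2.0))))


def home_win_pct(talent_gap):
    """Average home winning percentage when two teams split home games.

    talent_gap is how many points better team A is on a neutral court.
    Team A at home has an expected margin of gap + 3; team B at home
    has an expected margin of -gap + 3.
    """
    a_at_home = win_prob(talent_gap + HFA_POINTS)
    b_at_home = win_prob(-talent_gap + HFA_POINTS)
    return (a_at_home + b_at_home) / 2.0


for gap in (0, 3, 6, 9, 12):
    print(f"talent gap {gap:2d} points: home teams win {home_win_pct(gap):.3f}")
```

Under these assumptions the same 3 points are worth the most wins when the gap is zero, and less and less as the teams get further apart.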

That means that Hassett's evidence of abnormal home field advantage is even weaker than I previously argued.

In fairness, it also means that ignoring the first round, as Hassett did in his analysis, is not totally a case of "cherry picking" – HFA should indeed be higher in subsequent rounds as the relative talent tightens up with the elimination of the weaker teams. But, in that case, the baseline HFA we compare against should rise in those rounds as well, and, in that light, I doubt that even the 34-8 record in the second through fourth rounds is more than 2 SD above expected.

Labels: basketball, cheating, NBA

Tuesday, June 24, 2008

Did NBA playoff referees cheat in favor of the home team?

Columnist and economist Kevin Hassett today looks at the 2007-08 NBA playoffs, and their unusually large home-court advantage. He sees "troubling" evidence to suggest that the NBA manipulated the results, perhaps instructing referees to favor the team trailing in games, allowing the league to wind up with longer, more exciting, more remunerative series.

That's a huge accusation, and one that calls for a lot of supporting evidence. And I think Hassett doesn't provide nearly enough evidence.

Here's what he does tell us. First, in terms of game statistics, then, in terms of wins:

Game Stats

First, in the last two playoff seasons, when the home team was leading 3-1 in games, the visiting team actually had a better field-goal percentage than the home team, by 1.1 percentage points. But when the visiting team was leading 3-1, the home team outshot the visitors by 5.4 points:

-1.1: home team ahead in series
+5.4: home team behind in series

Hassett tells us there were 25 games total. Assuming 12.5 games in each group, and 80 field-goal attempts per team per game, the SD of the difference would be 2.2 points, which would make the observed difference very significant.

But if there were only, say, 5 games in one of the two groups, the SD increases to 2.8 points. And if there were only 2 games in one of the two groups, the SD is now 4.1, and the result isn't significant at all.
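Those SDs can be reproduced with the standard formula for the difference between two proportions. This is a minimal sketch, assuming a typical field-goal percentage of .450 (my assumption; the column doesn't give one) and treating each group's shooting as a single binomial sample of 80 attempts per game.

```python
from math import sqrt

FG_PCT = 0.45            # assumed typical field-goal percentage (not from the column)
ATTEMPTS_PER_GAME = 80   # field-goal attempts per team per game, as in the post


def sd_of_difference(games_a, games_b, p=FG_PCT, fga=ATTEMPTS_PER_GAME):
    """SD of the difference between two shooting percentages, in points."""
    n_a = games_a * fga
    n_b = games_b * fga
    return 100.0 * sqrt(p * (1.0 - p) * (1.0 / n_a + 1.0 / n_b))


# The three splits of 25 games discussed in the post.
for split in ((12.5, 12.5), (5, 20), (2, 23)):
    print(split, round(sd_of_difference(*split), 1))
```

With those inputs the three splits come out to 2.2, 2.8, and 4.1 points.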

However: is it possible that success rates aren't independent? A team that falls behind might try higher risk shots, for instance. That would increase the SD, and lower the observed significance.

And isn't it possible that when a team is down 3-1, it plays harder, plays its regulars more, doesn't worry as much about fouling out, and so on?

Bottom line: it looks like this may be significant, but there isn't enough information to tell.

Second, in Game 6s, again over the past two playoffs, the home team was called for 4.1 fewer fouls than the visiting team when it was behind in the series three games to two. In Game 7s, the 4.1 drops to 1.0.

Home team 2-3 in games: 4.1 fewer fouls
Home team 3-3 in games: 1.0 fewer fouls

Again, how many such games were there? There's nothing about sample size, variance, or statistical significance. As it turns out, there have been only four seven-game series in the past two seasons. I'd bet a lot of money that 1.0 is not significantly different from 4.1 over *four games*. And that's even without considering the "desperation" factor.

Third, Hassett shows that in the regular season, home teams get called for 0.8 fewer fouls than visitors. In the playoffs since 2003, the figure jumps to 1.4. The home team advantage in field goal percentage also increases in the playoffs, from 1.3 to 2.3. This year, it was 3.4. This, Hassett writes, "suggests the home team is allowed to play aggressively."

These differences are statistically significant. But it's a long step from showing a difference in home advantage, to showing the league is corrupt. There are lots of possible other explanations:

-- in the playoffs, all teams are fairly strong
-- in the playoffs, teams are more closely-matched than in the regular season
-- in the playoffs, teams may play more aggressively
-- in the playoffs, teams may give their regulars more court time.

Or, how's this? In the playoffs, games are closer because the teams are more evenly matched. Deliberate fouls in the last minute happen only in close games. Therefore, there are more fouls in the playoffs by the team that's behind. And the team that's behind is more likely to be the visiting team.

So I'm certainly not convinced that this is evidence of anything.

Wins

Hassett writes,

"In the 2008 playoffs, the home team won 64 of 86 games -- or 74 percent of the time. If we exclude the first round, where there are bound to be some blowouts, the home team won 34 out of 42, an 81 percent clip ...

"Since the 2002 regular season, home teams won a little more than 60 percent of the time ... If the true odds of the home team winning were six in 10, as in the regular season, then the odds of observing 34 home victories in 42 games simply by chance are close to zero."

There are a few objections to this analysis.

First, why exclude the first round? Isn't that cherry picking, to notice that most of the observed effect is in subsequent rounds, and rationalizing after-the-fact that only those rounds matter?

Second: unlike the regular season, not every team has the same number of home games. In the first three rounds of the NBA playoffs, if the series goes an odd number of games, the better team gets the extra game. So the naive measure of HFA in the playoffs suffers from selection bias – better teams are overrepresented.

In 2008, there were eight series that went exactly 5 or 7 games; five series in the first round, and three in the next two rounds. The home team, obviously, won all those games.

So let's adjust for that.

In the playoffs as a whole, the home team would have been expected to go .613. Let's assume it should have gone .750 in those odd-numbered games (because the better team was at home). That means 78 games at .600, and 8 games at .750. That works out to .613.

Home teams actually went .744. They went 64-22; they should have gone 53-33. In winning percentage terms, they overachieved by .131.

Using the binomial approximation, the SD of winning percentage over 86 games is .052. That means the .744 figure is 2.5 SDs over expected, which is significant at 0.6%.

That means what happened this year should have happened 1 out of 161 years.
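Here is the same arithmetic as a minimal sketch, using the normal approximation to the binomial. The function names are mine, and because it carries unrounded figures the z-score and odds land near, rather than exactly on, the 2.5 SD and 1-in-161 quoted above.

```python
from math import erf, sqrt


def upper_tail(z):
    """One-sided probability of landing at least z SDs above the mean."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))


def home_record_z(wins, games, expected_pct):
    """z-score of an observed home record against an expected percentage."""
    sd = sqrt(expected_pct * (1.0 - expected_pct) / games)
    return (wins / games - expected_pct) / sd


# 78 ordinary games at .600, plus 8 "extra" home games for the better team at .750
expected = (78 * 0.600 + 8 * 0.750) / 86
z = home_record_z(64, 86, expected)
p = upper_tail(z)

print(f"expected home pct: {expected:.3f}")
print(f"z-score:           {z:.2f}")
print(f"one-sided chance:  about 1 in {1.0 / p:.0f}")
```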

Is 1 in 161 strong enough evidence that we should consider the conspiracy allegations? I don't think so. Do you really want to be in a position where you falsely accuse the league of this kind of corruption every 161 years? If you include all four major sports, you'd be making a false accusation every 40 years. If you looked at statistics other than home field advantage, for anything unusual, it would be even more often than that. (I'm sure that if, instead of an abnormal HFA, there were some unusual comebacks, there would be accusations of collusion for those, too.)

If you look at 20 different statistical categories in each league, and they're independent, you'll come up with a corruption charge every two years. The 1 in 161 is certainly noteworthy, but not so noteworthy that it's enough in and of itself.

Also, the 1 in 161 again assumes that there is no other explanation for why home field advantage is higher in the playoffs than in the regular season. If you hypothesize that, because of changes in play, the actual playoff HFA is .629 (which it was in the 2002-03 regular season), you wind up below 2 SD.

Also, why concentrate so hard on home-field advantage? If the NBA were indeed colluding with the referees to make series go longer, why would they tell them to favor the home team? Wouldn't they tell them to favor the team that's behind in the series? If the goal is to extend the series, and the visiting team wins the first two games of the series, why would the NBA tell the refs to go ahead and continue to cheat for the home team, when that would just ensure a 4-0 series finish? That would work against their interests!

I suppose you could argue – albeit implausibly -- that the NBA doesn't want the referees to know it's looking for longer series. It could lie to the refs and tell them they're favoring the visiting team too much. The refs then lean towards the home team.

But would that really make the series longer? I ran a simulation. I assumed one team would be .600 over the other team at a neutral site (which means the other team would be .400). Then, with a .110 home field [dis]advantage, 28.2% of series go 7 games. Increase the HFA to .160, and the percentage increases to 29.5%. It's a very minor difference -- one series out of 70.
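The post doesn't spell out how the HFA was applied in the simulation, so the following is only a sketch under stated assumptions: the .600 team has home court, and the HFA is added to the home team's neutral-site winning percentage and subtracted from the visitor's. Instead of simulating, it enumerates every way the first six games can split 3-3, which gives an exact answer for those assumptions.

```python
from itertools import product


def prob_seven_games(neutral_pct, hfa):
    """Exact chance that a best-of-seven series reaches a seventh game.

    neutral_pct: favourite's winning percentage on a neutral court.
    hfa: amount added to (subtracted from) that percentage at home (away).
    """
    # The favourite hosts three of the first six games (games 1, 2 and 5
    # in a 2-2-1-1-1 format). Only the 3/3 split matters here.
    favourite_at_home = (True, True, False, False, True, False)

    total = 0.0
    for outcome in product((True, False), repeat=6):
        if sum(outcome) != 3:      # game 7 happens only at 3-3
            continue
        prob = 1.0
        for at_home, favourite_wins in zip(favourite_at_home, outcome):
            p = neutral_pct + hfa if at_home else neutral_pct - hfa
            prob *= p if favourite_wins else 1.0 - p
        total += prob
    return total


for hfa in (0.110, 0.160):
    print(f"HFA {hfa:.3f}: {prob_seven_games(0.600, hfa):.3f} of series go seven")
```

Under these assumptions the two cases come out in the same neighbourhood as the simulated figures, though not identical to them, and the gap between the two is similarly small.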

Rather than corruptly increase the HFA, the league would do better to pick one crucial game, and arrange for the referees to favor the team that's behind just that one game. That would be worth more than any reasonable estimate of the value of illicitly increasing home field advantage, and greatly reduce the amount of effort required.

Even then, the benefit is going to be small. Would the NBA risk its existence, government hearings, and criminal charges, for a tiny bit of advantage? Would the NBA be so stupid as to believe that the referees would not only agree, but also be able to keep it all a secret? Would they have thought that if they started favoring the home team, nobody would notice?

The whole idea is just absurd, this weak statistical evidence notwithstanding.

-----

UPDATE: I originally had written that the true HFA in basketball was .613 rather than .600. Thanks to Justin Kubatko, who provided me with the data, I found out that .600 is indeed accurate for the past few regular seasons (.613 was the figure for 2003-04). I have now updated the calculations and arguments to use .600.

Labels: basketball, cheating, NBA

Monday, June 23, 2008

Do teams play worse after a time zone change? Part III

(UPDATE: This post was updated after I discovered my own analysis had three teams in the wrong time zones.)

In the last two posts, I reviewed reports of Dr. W. Christopher Winter's study on jet lag and baseball performance. The data suggested that the observed effect was just home field advantage.

In a comment to the second post, Dr. Winter said that most of the effect came from 2- and 3-hour time changes. So I reran the numbers, and considered only the first game where one team had 3-hour jet lag (from the day before), and the other team had none.

1997-2006 home winning percentage
----------------------------------
Home team jet-lagged: 14-19 (.424)
Road team jet-lagged: 72-46 (.610)

There's actually something here, although it's not statistically significant with the small sample of games.

Here are the breakdowns by decade:

2000-2007 home winning percentage
----------------------------------
Home team jet-lagged: 6-16 (.273)
Road team jet-lagged: 68-43 (.613)

1990-1999 home winning percentage
----------------------------------
Home team jet-lagged: 20-17 (.541)
Road team jet-lagged: 68-45 (.602)

1980-1989 home winning percentage
----------------------------------
Home team jet-lagged: 11-15 (.423)
Road team jet-lagged: 43-43 (.500)

1970-1979 home winning percentage
----------------------------------
Home team jet-lagged: 14-9 (.609)
Road team jet-lagged: 47-48 (.495)

None of these results looks statistically significant. The overall totals are:

1970-2007 home winning percentage
-----------------------------------
Home team jet-lagged: 51-57 (.472)
Road team jet-lagged: 226-179 (.558)

The effect goes in the right direction, but neither result is significantly different from .530. The difference between the two numbers is 86 points; that's about 1.5 SD from zero, which again is not statistically significant.
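The "about 1.5 SD" figure can be checked with a minimal sketch like this one, which treats each record as a binomial sample and compares the two home winning percentages. The function name is mine.

```python
from math import sqrt


def pct_and_variance(wins, losses):
    """Winning percentage and its binomial sampling variance."""
    games = wins + losses
    pct = wins / games
    return pct, pct * (1.0 - pct) / games


# 1970-2007 home-team records from the table above
home_lagged_pct, home_lagged_var = pct_and_variance(51, 57)
road_lagged_pct, road_lagged_var = pct_and_variance(226, 179)

difference = road_lagged_pct - home_lagged_pct
z = difference / sqrt(home_lagged_var + road_lagged_var)

print(f"difference: {difference * 1000:.0f} points")
print(f"z-score:    {z:.2f}")
```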

Labels: baseball, home field advantage, jet lag

Saturday, June 21, 2008

Do teams play worse after a time zone change? Part II

(UPDATE: This post was updated after I discovered my own analysis had three teams in the wrong time zones.)

In the previous post, I discussed an MLB-funded study on jet lag, "Measuring Circadian Advantage in Major League Baseball: A 10-Year Retrospective Study," by W. Christopher Winter.

It claimed to find that teams recently changing time zones performed worse than expected. Conventional wisdom in sleep science is that, for each time zone crossed, it takes one day to adapt. So a team that flew from Tampa to Oakland two days ago should be at a "2-day disadvantage" in their circadian rhythm. The study looked at all MLB games where the two teams were not equally adapted to their time zone, and claimed to have found that the disadvantaged teams did in fact play worse.

While I couldn't find the actual study, the data quoted in a press release actually supports the opposite conclusion: that jet lag has no effect. It appears that the study didn’t correct for home field advantage, and jet-lagged teams tended to be road teams. So what the researcher thought was jet lag was really just the normal road team effect.

To double-check, I ran the numbers myself. I was able to substantially reproduce the numbers in the press release.

I'll start with the records of teams with the "circadian advantage" (less jet-lagged than the opposition). All numbers, by the way, are 1998-2007. (This represents only about 20% of all games, because, in most games, the teams are equally jet-lagged.)

All teams less jet-lagged than opposition
-----------------------------------------
2621-2425 (.519) – study
2537-2337 (.520) – me

The numbers are very slightly different, and I'm not sure why.

Now, here are home teams that had the jet-lag advantage:

Home teams less jet-lagged than opposition
------------------------------------------
2002-1679 (.544) – study
1930-1609 (.545) – me

Again, I'm not sure why the study has so many more games than I do. It could be my 2:00am programming was wrong; it could be I assumed the wrong time zone for certain teams (Arizona is on Pacific time, right?); it could be the press release got a number wrong. Regardless, I think the results are close enough that I did the same analysis the study did.

Here are home teams that had a jet-lag DISadvantage:

Home teams more jet-lagged than opposition
------------------------------------------
746-619 (.547) – study
728-607 (.545) – me

Both the original study, and my study, contradict the press release and the press reports: having a "circadian advantage" does NOT improve your chances of winning. In fact, the original study shows such teams did very slightly *worse* than normal, not better.

Of course, this doesn't adjust for the quality of teams. But over 10 years, you'd think it would all even out.

Full year-by-year data is available on request.

Labels: baseball, circadian advantage, home field advantage, jet lag

Thursday, June 19, 2008

Do teams play worse after a time zone change?

According to a recent research presentation by a Baltimore sleep scientist, baseball teams that have recently travelled across time zones play worse than otherwise.

It's called "Measuring Circadian Advantage in Major League Baseball: A 10-Year Retrospective Study," by W. Christopher Winter, M.D. It was funded by MLB.

The study was presented at a conference on June 10, but I can't find it online. (That drives me nuts -- the study is quoted by a whole bunch of press releases, the author is quoted directly, but the actual paper isn't publicly available? What's with that?)

The Scientific American writeup quotes the results this way: if a team travels three time zones west (like from New York to San Francisco), its chance of winning would be

-- 40% on the first day
-- 47% on the second day
-- 48% on the third day
-- 50% on the fourth day

That, I assume, doesn't include home field advantage.

The article does imply, near the end, that the first game of a three-time-zone trip happened only 160 times in the 10 years of the study. At 40%, that's a 64-96 record for the tripping team. That works out to about 2.5 SDs away from .500, which is statistically significant. But it depends on the study having controlled for the quality of the teams and home field advantage.

I have a vague feeling that I've seen studies that checked the time-zone theory of home field advantage, and couldn't find any effect. But I'm not sure. In any case, when the study becomes available, I'll take a look at it.

(Hat tip: Freakonomics)

UPDATE: This article has more details, and it seems like the data doesn't support the conclusion. Here's the summary:

Approximately 79.1 percent of the games analyzed (19,084 of 24,133 games) were played between teams at equal circadian times. The remaining 5,046 games featured teams with different circadian times. In these games, the team with the circadian advantage won 2,621 games (51.9 percent). However, 3,681 of these 5,046 games were also played with a home field advantage. In isolating games in which the away team held the circadian advantage (1,365 games), the away team won 619 games (45.3 percent).

From this, we can figure that:

When the road team had the "circadian advantage" -- meaning the home team had to travel more time zones to get to the game -- the disadvantaged home team's winning percentage was 54.7% (746-619), almost exactly the normal home field advantage.

When the home team had the circadian advantage, they were 2002-1679, for 54.4% -- again almost exactly the normal home field advantage, and almost exactly the same HFA they had when the other team had the circadian advantage!

In bold:

Home teams were .544 with circadian advantage;
Home teams were .547 with circadian disadvantage.

So, basically, the study's data show that time zone travel doesn't matter at all. The apparent difference is completely caused by the fact that teams that have recently travelled are more likely to be road teams.
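The two home records are backed out of the press-release totals by subtraction. A minimal sketch of that arithmetic, with variable names of my own choosing:

```python
# Figures quoted in the press release summary
advantaged_wins = 2621        # wins by the team with the circadian advantage
home_advantaged_games = 3681  # games where the home team had the advantage
away_advantaged_games = 1365  # games where the away team had the advantage
away_advantaged_wins = 619    # away-team wins in those 1,365 games

# Home team at a circadian DISadvantage: it wins whatever the away team didn't.
home_dis_wins = away_advantaged_games - away_advantaged_wins
home_dis_losses = away_advantaged_wins

# Home team with the advantage: the rest of the advantaged team's wins.
home_adv_wins = advantaged_wins - away_advantaged_wins
home_adv_losses = home_advantaged_games - home_adv_wins

print(f"home, advantaged:    {home_adv_wins}-{home_adv_losses} "
      f"({home_adv_wins / home_advantaged_games:.3f})")
print(f"home, disadvantaged: {home_dis_wins}-{home_dis_losses} "
      f"({home_dis_wins / away_advantaged_games:.3f})")
```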

Labels: baseball, circadian advantage, home field advantage, jet lag

Wednesday, June 18, 2008

Replacement players, VORP, salaries, and MRP

In a post on his blog, J.C. Bradbury argues, again, that a free-agent player's salary is equal to his "MRP," which stands for "marginal revenue product." That is, a player should be paid exactly the amount by which his performance increases the team's revenue.

And he seems to think that revenue is exactly proportional to the player's performance, rather than the player's performance as measured against replacement level. Because of that, he dismisses the concept of replacement value (and VORP). But I don't understand why he would do that.

The idea behind MRP is this: the more employees you hire, the less each one contributes to the bottom line. If you're running a Wal-Mart, you might want ten cashiers. If you hire an eleventh cashier, it might help a little bit: if there's a crowd of customers, fewer might leave the store if the lineups are shorter. But the eleventh is only useful in busy times, so he's worth less than the other 10.

The idea is this: suppose cashiers earn $30,000 a year. The first five or six might bring the company $70,000 in revenue each. The seventh might bring in only $60K. The eighth adds $50K, the ninth $40K, the tenth $30K, and the eleventh $20K. The eleventh cashier is actually losing the company money, so she never gets hired in the first place. And the last cashier hired brought in $30,000, which exactly matches his salary. Thus the equivalence: salary = MRP.
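The hiring rule in that example can be written out as a minimal sketch. The revenue figures are the made-up ones from the paragraph above (taking "five or six" as six), and the function name is mine.

```python
SALARY = 30_000

# Revenue added by each successive cashier, first hire to eleventh.
marginal_revenue = [70_000] * 6 + [60_000, 50_000, 40_000, 30_000, 20_000]


def cashiers_to_hire(revenue_by_hire, salary):
    """Keep hiring while the next cashier brings in at least his salary."""
    hired = 0
    for revenue in revenue_by_hire:
        if revenue < salary:
            break
        hired += 1
    return hired


n = cashiers_to_hire(marginal_revenue, SALARY)
print(f"hire {n} cashiers; the last one adds ${marginal_revenue[n - 1]:,}")
```

The store stops at ten, and the tenth cashier's contribution equals the salary, which is where "salary = MRP" comes from.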

That works for Wal-Mart, but not for baseball. Why? Because in baseball, the number of employees is fixed, and so is the minimum salary. At Wal-Mart, if you have 11 cashiers, you figure that the 11th costs $30,000 but is bringing in only $20,000 in revenue. So you fire him. In baseball, you might figure that you're paying the 25th man $390,000, but he's contributing nothing to the bottom line (because he gets no playing time). So you want to release him. But you can't – there's a rule requiring you to have 25 men on your roster. And there's a minimum salary of $390,000. So you're stuck. In this case, the MRP of the 25th man is less than his salary.

It can work the other way around, too. Suppose you're a big-market team with lots of fans who love to win, and you figure that a 25th man, while costing only $390,000, is bringing in revenues of over a million. You'd like to hire a 26th player, who would bring in another $900,000 or so. But, unlike Wal-Mart, you can't go hiring that extra player. There are rules against that. In that case, the 25th man is earning less than his MRP.

So, in baseball, a player's salary could easily be more, or less, than his MRP.

The real-world equivalence between salary and MRP is

Salary = MRP

But that's a special case that just happens to apply to Wal-Mart. I would argue that the more general equivalence is

Salary over and above the alternative = MRP over and above the alternative

At Wal-Mart, the alternative is "nobody" – you just never hire the 11th cashier. That alternative has zero salary and zero MRP, so the second equation collapses into the first equation. But in baseball, the alternative is NEVER "nobody" – you have to fill the roster spot, whether you want to or not. The alternative is a player at minimum salary, creating a replacement-level MRP. It's one of the many freely available minor-leaguers. That means

Salary over and above the $390,000 minimum = MRP over replacement player

If you choose to define "VORP" in terms of dollars instead of runs, you get

Salary - $390,000 = VORP

Which, I think, is what's really happening in baseball.

J.C. doesn't agree with that formulation – he wants to stick with "salary = MRP". He wants to value marginal runs from zero, rather than from replacement value. But, I argue, that clearly leads to untenable conclusions.

For instance, suppose a marginal win is worth $5 million. Then a marginal run is worth about $500,000.

Suppose a replacement-level player creates 39 runs. At $500,000 per run, you'd expect him to cost $19.5 million. But you can pick up any one of these guys for $390,000! So "salary = MRP" just doesn't make sense.
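To put the two formulations side by side, here is a minimal sketch using the round numbers from this post: a marginal run worth about $500,000 (a tenth of the $5 million win), a 39-run replacement player, and the $390,000 minimum. The function names are mine.

```python
DOLLARS_PER_RUN = 500_000   # a $5 million win at roughly 10 runs per win
MIN_SALARY = 390_000
REPLACEMENT_RUNS = 39


def salary_valued_from_zero(runs):
    """'Salary = MRP', with every run counted from zero."""
    return runs * DOLLARS_PER_RUN


def salary_valued_over_replacement(runs):
    """Minimum salary plus the dollar value of runs above replacement."""
    return MIN_SALARY + (runs - REPLACEMENT_RUNS) * DOLLARS_PER_RUN


print(f"valued from zero:        ${salary_valued_from_zero(39):,}")
print(f"valued over replacement: ${salary_valued_over_replacement(39):,}")
```

Only the second version matches what a replacement-level player actually costs.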

------

If you don't buy that argument, here's another one. Suppose you really, truly believe that a player earns his MRP. And suppose the 25th guy on your roster earns $390,000, the MLB minimum.

Now, halfway through the season, the union and MLB agree to double the minimum salary. The 25th guy gets to keep his job – after all, the team has to have 25 guys, and this is still the best one available. But now he's making $780,000.

His salary doubled, but his MRP, obviously, is exactly the same! So even if his salary was equal to MRP before, it certainly isn't now. Which means that there's no reason to have expected them to be equal in the first place.

------

One of J.C.'s arguments is that not all replacement-level players are worth only $390,000. It could be that all the players eligible for the minimum are young draft choices, and you don't want to use up their "slave" years if your season is a lost cause – you'd rather save them for when your team is a contender. In that case, you might have to sign a replacement-level veteran for $1,000,000 or so.

To which I say: mediocre veterans who can be had for $390K are not in such short supply that you would have to spend a million. If you DO spend a million, it's probably because you peg the veteran as better than replacement. The extra $610,000 is worth it if you expect about 1.2 runs better than replacement over a full season, which isn't a lot.

------

By the way, you could argue that a player is never paid less than his MRP. In a sense, even if the 25th guy on your roster never bats, his presence contributes more than $390,000. That's because if you released him, and didn't call anyone up, the commissioner would fine you a lot more than $390,000.

But that's a trick technicality, and it's not what J.C. is arguing here.

------

Finally, I am puzzled by J.C.'s dislike of VORP because, according to him, it's an insider term and hard to explain:

The big advantage of these is that I can have these conversations with people other than die-hard stat-heads ... I view VORP as an insider language, and by using it you can signal that you are insider. It’s like speaking Klingon at a Star Trek convention. I can signal to others who speak the language that I am one of you. But, the danger of VORP is that once you bring it up the discussion goes down the wrong path as the uninitiated have reason to feel they are being told they are not as smart as the person making the argument. It’s like constantly bringing up the fact that you only listen to NPR or watch the BBC news at dinner parties. The response is likely going to be the same, “well fuck you too, you pretentious asshole!”

But, as I think a commenter on J.C.'s site points out, "MRP" is also pretty jargon-y insider economist talk, isn't it? And it's a lot harder to explain than VORP. So I'm a bit confused by J.C.'s aversion to the term. How come sabermetric abbreviations are pompous, but economics abbreviations are not?

------

Tuesday, June 17, 2008

NBA teams win 7 games more when their coach is a former All-Star

I'm shocked by the results of this NBA study: "Why Do Leaders Matter? The Role of Expert Knowledge." It's by Amanda H. Goodall, Lawrence M. Kahn, and Andrew J. Oswald.

Basically, the study tries to figure out if having been a successful NBA player would make you a more successful coach. The answer: yes, and hugely.

After controlling for a bunch of possible confounding variables, it turns out that:

-- for every year the coach played in the NBA, his team will win an extra 0.7 games per season [over a coach who never played].

-- if the coach was *ever* an NBA All-Star, his team will win an extra 7 games per season.

That last number is not a misprint. I'll repeat it in bold:

A coach who was an NBA All-Star at least once is associated with a 7-game increase in team wins.

The result is statistically significant, at 2.5 SDs above zero.

What could be causing this?

My first reaction was that teams who spend a lot of money on talent might also be willing to spend to hire a high-profile coach. But the study controlled for team payroll, so that can't be it.

So what is it? I can't figure it out, but I'd bet a lot of money that it's not (as the study thinks it is) that All-Star coaches are somehow "experts" in getting the most out of their teams. I see no reason why it should be assumed that better players would make better coaches (and, in fact, I have heard plausible arguments the other way, that guys with natural talent have no idea how they do it, and so can't teach others to do the same).

For the record, here are the other variables controlled for: payroll, race of coach, age and age squared, NBA head coaching experience (and experience squared), college head coaching experience, "other pro" head coaching experience, and number of years as an NBA assistant coach.

(In the study, I am looking at the regression with no team fixed effects. That's Table 1, bottom panel, second from the right. There were 16 All-Star coaches in the sample, for 68 seasons. The authors don't list them.)

Any ideas? I feel like I must be missing something ... coaches can't be contributing seven wins to their team just because they were better players. Could they?

(Hat tip: Andrew Leigh via Justin Wolfers.)

Friday, June 13, 2008

Chipper Jones' chance for .400

Over at Baseball Prospectus, Nate Silver gives us an estimate of the chance that Chipper Jones will wind up hitting .400 this year. (Subscription required, I think.)

Silver starts by estimating that, before the season started, our best estimate for Chipper's talent was a normal curve centered around .310. But, now that he's hitting .419 so far this year (or whatever it was when Silver wrote his analysis), the best estimate is now a normal curve centered at .345 or so.

Then, it's straightforward: figure the chance that Chipper is a .300 hitter, and multiply by the chance that, as a .300 hitter, he'll hit well enough the rest of the way to finish above .400. Repeat for .301, .302, and so on. Add up all those numbers and you have his probability.

Actually, Silver didn't quite do it that way: he did it by simulation, 1000 repetitions of picking a random talent, then playing out the season. I think a thousand reps isn't very many to get a precise estimate, but it should be unbiased, at least.
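The weighted-sum version (not Silver's simulation) can be sketched as follows. Everything numeric here except the .345 centre and the .400 target is a hypothetical placeholder: the 90-for-215 line, the 300 remaining at-bats, and the .015 spread of the talent curve are my assumptions, chosen only to make the sketch run, so the printed probability shouldn't be read as an estimate for Chipper.

```python
from math import ceil, comb, exp


def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k)
               for k in range(max(k_min, 0), n + 1))


def prob_of_400(talent_mean, talent_sd, hits, at_bats, remaining_ab,
                target=0.400):
    """Chance of finishing at or above target, averaged over possible talents."""
    total_ab = at_bats + remaining_ab
    # Hits still needed; the tiny epsilon guards against float round-up.
    needed = ceil(target * total_ab - 1e-9) - hits

    weighted_sum = 0.0
    weight_total = 0.0
    for thousandths in range(250, 451):          # talents .250, .251, ... .450
        talent = thousandths / 1000.0
        # Unnormalised normal weight for "Chipper is really this good"
        weight = exp(-0.5 * ((talent - talent_mean) / talent_sd) ** 2)
        weighted_sum += weight * binom_tail(remaining_ab, needed, talent)
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical inputs: 90-for-215 so far (.419), 300 at-bats to go.
print(f"{prob_of_400(0.345, 0.015, 90, 215, 300):.3f}")
```

Moving the centre of the talent curve down from .345 toward .310 shrinks the answer sharply, which is the sensitivity Tango's second point turns on.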

Silver comes up with a probability of 12-13%.

However, there are a couple of problems with the analysis, that Tom Tango nails over at his blog.

First, the estimate of Chipper's talent shouldn't be normal. It should be biased towards the left. That is, even if your best guess is that Chipper is .345, you should give him a much better chance to be .335 than .355. Silver makes them equal. [UPDATE: Tango's comment, and further reflection, have convinced me that this criticism is not correct, and that Silver's symmetrical distribution is in fact correct. See the comments.]

Second, bumping Chipper from .310 to .345 on the basis of a couple of hundred at-bats is too big a jump. His talent almost certainly should be centered lower than .345.

As Tango points out, when you're looking at low-probability events, a small change in assumptions makes a huge change in probabilities. And so Silver's probability estimate is probably too high – much too high.

Labels: baseball

Wednesday, June 11, 2008

A hockey fighting study

An academic study on hockey fights made the news in a couple of papers up here in Ottawa, a day or two after Marginal Revolution blogged it.

Here's the paper: "Blood Money: Incentives for Violence in NHL Hockey." It tries to figure out if fighting, and winning fights, helps teams win.

The study is a bit aggravating throughout, mostly because the authors don't seem to know hockey very well at all, and make some strange assumptions.

For instance, they equate fighting with "real" violence in hockey, not acknowledging that fighting is an accepted part of the game, and allowed by the rules (if you're willing to accept the five-minute penalty). The authors cite violent incidents like J. P. Parise's aborted attack on referee Josef Kompalla in the eighth game of the 1972 Canada-Russia series, and Dino Ciccarelli's 1988 attack on Luke Richardson (video), and don't seem to understand that these are entirely different, in hockey culture, from a routine fight.

This leads them, in their conclusions, to recommend how much the NHL would have to fine players to give them incentives not to fight. I don't think that particular number is relevant, not just because of flaws in the study (which I'll get to), but because the NHL would have no trouble banning fighting if it really wanted to. The authors seem to imply that the league wants fighting to end, but doesn't know how much the fines have to be.

Also, judging from the middle paragraph of page 11, they seem to think that in individual scoring, a goal is worth two points (it's only worth one).

------

Anyway, let me get to the meat of the study.

The authors run a bunch of regressions to predict team performance based on a few factors. For some reason, they regress on the probability of making various rounds of the playoffs, rather than regular season points. I'm not sure why they do this; obviously, this adds a lot of randomness.

It's a weird regression, too: the authors include games won, but also goals for and goals against, which aren't all that important once you have games won. They also have a dummy variable for team, which doesn't make much sense either, since the Philadelphia Flyers of 1967-68 (when the study starts) have little in common with the Philadelphia Flyers of 2007-08.

However, the authors wind up with the statistically significant result that penalty minutes do lead to playoff success. All the coefficients for the probabilities (for simply making the playoffs, all the way up to winning the cup) are positive. Penalty minutes help teams make the various rounds of the playoffs with significance ranging from 1.76 SDs (making the finals) to 2.70 SDs (making the semi-finals).

For fights, the effect is even more extreme – all five categories show fights as statistically significant, except for winning the Stanley Cup. Fighting helps you win the first and second round of the playoffs with SDs of 4.37 and 4.62, respectively.

If I've done the math right, an extra 10 fights in the regular season (the average team in the last forty years had 73) improves your chances of making the quarter-finals from 26.7% to 32.5%, which seems like a lot.

I guess this confirms conventional wisdom, that, given two equal teams, the more aggressive one will do better in the playoffs, where checking is tighter and it's a more physical game.

Next, the authors try to predict a player's salary based on how often he fights. Some of their regressions include dummy variables for each player. That doesn't make sense to me. Effectively, those regressions compare a player's fights to the rest of his own career, and his salary is set long before he actually gets into those fights. So I'll ignore those.

Fortunately, some of the other regressions don't control for the player. According to those other regressions, every additional fight is worth $1606 (not statistically significant).

Even ignoring the insignificance, I'm not sure that figure is meaningful – instead of adjusting for salary inflation, the authors used year dummies. So $1606 is the average over 40 years of play, which means we don't really know what that is in today's dollars.

The authors also break down the fight bonus by position. A fight *costs* a winger $7120 in salary. It *costs* a centre $31129. But it *increases* a defenseman's salary by $47726 (significant at 7 SD). Again, this is kind of bizarre. Why would defensemen be rewarded, but not forwards?

Perhaps what's going on is this: the study controlled for games played, but not for ice time. For forwards, ice time is roughly proportional to points (which the study controlled for). But for defensemen, there are so many different styles that if a guy gets 30 points, you don't know if he's a part-time offensive defenseman or a full-time defensive defenseman. So maybe lots of fights, for a defenseman, is an indication of ice time, which is an additional indication of high salary.

The most interesting part of the study is where the authors look only at fights *won* (they got this information from various web sources, such as dropyourgloves.com). Compared to just plain fights, the coefficients for forwards stay roughly the same. But the bonus for defensemen is about double.

For fights "not won" (a loss or a no-decision), the numbers are almost the same as for fights won. This suggests that it doesn't really matter if the player wins the fight -- which shouldn't be a big surprise to most fans.

However, in the regression that included player dummy variables, the numbers for forwards came out double for winning than for not winning. The authors cite this in their conclusions, but I'm not convinced it means anything because of those inappropriate dummy variables. And why would it hold for wingers but not centers?

And I don't think too much of this stuff matters anyway, because I don't think the controls are strong enough that we can trust the results. I wouldn't be surprised if what's really happening is that players are being rewarded for being physical, for ice time, or for something else that correlates with fights. Only goals, assists, and games are otherwise controlled for, so there are a lot of other factors that could be correlated with fights.

Also, could it be that most or all of the effect is being caused by the presence of a few "goons," who are indeed paid to fight? Even if the "real" players' salaries were unaffected by fights, couldn't it be that the entire results of the regression are caused by the presence of enforcers, who we know are paid to fight?

The authors think so:

"We provide evidence for the proposition that observed low-ability wing players are paid a substantial wage premium to protect high-ability center players who can score goals. They do this by fighting with any other opposing player who threatens their star players, allowing the star player unfettered scoring possibilities. ... For this the wing players are paid a premium ..."

This is just weird. Did we really need more evidence that enforcers exist? Do the authors think this is new information? Isn't the best evidence for enforcers the fact that their existence is admitted to by every GM, coach, and player? And if you don't believe the hockey establishment, won't you believe their stat lines?

Did the authors of this study really do all this work just to prove that certain players are paid to fight?

Labels: fighting, hockey

Friday, June 06, 2008

The Hamermesh umpire/race study revisited -- part IX

This is the ninth post in this series, on the Hamermesh racial-bias study. The previous posts are here.

There's not going to be anything new in this post – I'm just going to recap the various issues I've already talked about, with fewer numbers. You can consider this a condensed version of the other eight posts.

------

I'll start by summarizing the study one more time.

The Hamermesh study analyzed three seasons worth of pitch calls. It attempted to predict whether a pitch would be called a ball or a strike based on a whole bunch of variables: who the umpire was, who the pitcher was, the score of the game, the inning, the count, whether the home team was batting, and so forth.

But it added one extra variable: whether the race of the pitcher (black, white, or hispanic) matched the race of the umpire. That was called the "UPM" variable, for "umpire/pitcher match." If umpires had no racial bias, the UPM variable would come out close to zero, meaning knowing the races wouldn't help you predict whether the pitch was a ball or a strike. But if the UPM came out significant and positive, that would mean that umpires were biased -- that all else being equal, pitches were more likely to be strikes when the umpire was of the same race as the pitcher.

It turned out that, when the authors looked at *all* the data, the UPM variable was not significant; there was only weak evidence of racial bias. However, when the data were split, it turned out that the UPM coefficient *was* significant, at 2.17 standard deviations, in parks in which the QuesTec system was not installed. Umpires appear to have called more strikes for pitchers of their own race in parks in which their calls were not second-guessed by QuesTec.

An even stronger result was found when selecting only games in parks where attendance was sparse. In those games, the UPM coefficient was significant at 2.71 standard deviations. The authors interpreted this to mean that when fewer people were there scrutinizing the calls, umpires felt freer to indulge their taste for same-race discrimination.

In this latter case, the UPM coefficient was 0.0084, meaning that the percentage of strikes called for same-race pitchers increased by 0.84 of a percentage point. That would suggest that about 1 pitch in 119 is influenced by race.

That's the study.

------

I do not agree with the authors that the results show the widespread existence of same-race bias. I have two separate sets of arguments. First, there are statistical reasons to suggest that the significance levels might be inflated. And, second, the model the authors chose has embedded assumptions which I don't think are valid.

------

1. The Significance Arguments

In their calculations, the authors calculated standard errors as if they were analyzing a random sample of pitches. But the sample is not random. Major League Baseball does not assign a random umpire for each pitch. They do, roughly, assign a random umpire for each *game*, but that means that a given umpire will see a given pitcher for many consecutive pitches.

If the sample of pitches were large enough, this wouldn't be a big issue – umpires would still see close to a random sample of pitchers. But there are very few black pitchers, and only 7 of the 90 umpires are of minority race (2 hispanic, 5 black). This means that some of the samples are very small. For instance, there were only about 900 pitches called by black umpires on black pitchers in low-attendance situations. That situation is very influential in the results, but it's only about 11 games' worth. It seems reasonable to assume that these umpires saw only a very few starting pitchers.

What difference does that make? It means that the pitches are not randomly distributed among all other conditions, because they're clustered into only a very few games. That means that if the study didn't control for everything correctly, the errors will not necessarily cancel out, because they're not independent for each pitch.

For instance, the authors didn't control for whether it was a day game or a night game. Suppose (and I'm making this up) that strikes are much more prevalent in night games than day games because of reduced visibility. If the sample were very large, it wouldn't matter, because if there were 1000 starts or so, the day/night situation would cancel out. But suppose there were only 12 black/black starting pitcher games. If, just by chance, 8 of those 12 happened to be night games, that might explain all those extra strikes.

8 out of 12 is 66%. The chance of 66% of 12 random games being night games is reasonably high. But the chance of 66% of 900 random *pitches* being in night games is practically zero. And it's the latter that the study incorrectly assumes. (Throughout the paper, the standard errors are based on the normal approximation to binomial, which assumes independence.)
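Here's a small sketch of that comparison. I'm assuming, purely for illustration, that half of all games are night games; the real share is different, but the point doesn't depend on it. The function computes an exact binomial tail, which is only valid when the trials are independent.

```python
from fractions import Fraction
from math import comb

def binom_tail(n, k_min, p):
    """Exact P(X >= k_min) for X ~ Binomial(n, p), assuming independent trials."""
    p = Fraction(p)
    q = 1 - p
    total = sum(comb(n, k) * p**k * q**(n - k) for k in range(k_min, n + 1))
    return float(total)

NIGHT_SHARE = Fraction(1, 2)  # hypothetical share of night games

# 8 or more night games out of 12 independent games.
games = binom_tail(12, 8, NIGHT_SHARE)
# 600 or more night pitches out of 900, if every pitch were independent.
pitches = binom_tail(900, 600, NIGHT_SHARE)

print(f"12 games:    {games:.3f}")
print(f"900 pitches: {pitches:.2e}")
```

With a 50/50 split, 8 or more night games out of 12 happens about 19% of the time. The same two-thirds share among 900 independent pitches is ten standard deviations out, which is why treating pitches as the unit of observation makes a lopsided split look impossible when it isn't.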

(I emphasize that this is NOT an argument of the form, "you didn't control for day/night, and day/night might be important, so your conclusions aren't right." That argument wouldn't hold much weight. In any research study, there's always some possible explanation, some factor the study didn't consider. But, if that factor is random among observations, the significance level takes it into account. The argument "you didn't control for X" might suggest that X is a *cause* of the statistical significance, but it is not an argument that the statistical significance is overstated.

So my argument is not "you didn’t control for day/night." My argument is, "the observations are not sufficiently independent for your significance calculation to be accurate enough." The day/night illustration is just to show *why* independence matters.)

Now, I don't have any evidence that day/night is an issue. But there's one thing that IS an issue, and that's how the study corrected for score. The study assumed that the bigger the lead, the more strikes get thrown, and that every extra run by which you lead (or trail) causes the same positive increase in strikes. But that's not true. Yes, there are more strikes with a five-run lead, but there are also more strikes with a five-run deficit, as mop-up men are willing to challenge the batter in those situations. So when the pitcher's team is way behind, the study gets it exactly backwards: it predicts very few strikes, instead of very many strikes.

Again, if the sample were big enough, all that would likely cancel out – all three races would have the same level of error. But, again, the black/black subsample has only a few games. What if one of those games was a 6-inning relief appearance by a (black) pitcher down by 10 runs? The model expects almost no strikes, the pitcher throws many, and it looks like racial bias. That isn't necessarily that likely, but it's MUCH more likely than the model gives it credit for. And so the significance level is overstated.

So we have at least one known problem, of the score adjustment. And there were many adjustments that weren't made: day/night, runners on base, wind speed, days of rest ... and you can probably think of more. All these won't be independent, and there's going to be some clustering. So if any of those other factors influence balls and strikes – which they probably do -- the significance levels will be even wronger.

How wrong? I don't know. It could be that if you did all the calculations, they'd be only slightly off. It could be that they're way off. But they're definitely too high.

Note that this argument only applies to the significance levels, and not the actual bias estimates. Even with all the corrections, the 0.0084 would remain. The question is only whether it would still be statistically significant.

------

2. The Model

As I mentioned earlier, the study included only one race-related UPM variable, for whether or not the umpire and pitcher were of the same race. Because the variable came out significantly different from zero, the study concluded that its value represents the effect of umpires being biased in favor of pitchers of their own race.

However, the choice of a single value for UPM is based on two hidden assumptions. The first one:

-- Umpire bias is the same for all races.

That is: the study assumes that a white umpire is exactly as likely to favor a white pitcher as a black umpire is to favor a black pitcher, and exactly as likely as a hispanic umpire is to favor a hispanic pitcher.

Why is this a hidden assumption? Because there is only one UPM variable that applies to all races. But it's not hard to think of an example where you'd need to measure bias for each race separately.

Suppose umpires were generally unbiased, except that, for some reason, black umpires had a grudge against black pitchers, and absolutely refused to call strikes against them (if you like, you can suppose that fact is well-known and documented). If that were the case, the analysis done in this study would NOT pick that up. It would find that there is indeed racial discrimination against same-race pitchers, but it would be forced to assume that it's equally distributed among the three races of umpires.

That's a contrived example, of course. But, in the real world, things are different. In real life, is it necessarily true that all races of umpires would have exactly the same level of bias?

It seems very unlikely to me. Historically, discrimination has gone mostly one way, mostly whites discriminating against minorities, mostly men discriminating against women. There are probably a fair number of white men who wouldn't want to work for a black or female boss. Are there as many black women who wouldn't want to work for a white or male boss? I doubt it.

Why, then, assume that significant bias must exist for all races? And, why assume, as the study did, that not only does it exist for all races, but that the effects are *exactly the same* regardless of which race you're looking at?

If you remove the assumption, you wind up with a much weaker result. There still turns out to be a statistically significant bias, but you no longer know where it is. Take another hypothetical example: there are white and black pitchers and umpires, and three of the four combinations result in 50% strikes. However, the fourth case is off -- white umpires call 60% strikes for white pitchers.

Who's discriminating? You can't tell. It could be whites favoring whites. But it could be that white pitchers are just better pitchers, and it's the black umpires discriminating against whites, calling only 50% strikes when they should be calling 60%. If you open up the possibility that one set of umpires might be more biased than the other – an assumption which seems completely reasonable to me – you can find that there's discrimination, but not what's causing it.

Also, you can't even tell how many pitches are affected. If the white/white case had 350,000 pitches and the black umpire/white pitcher case had only 45,000 pitches, you could have as many as 35,000 pitches affected (if it's the white umpires discriminating), as few as 4,500 pitches (if it's the black umpires) or something in the middle (both sets of umpires discriminate, to varying extents).
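The arithmetic behind those two bounds, as a short sketch. The 60% and 50% strike rates are the hypothetical figures from the example above, and the function name is mine.

```python
def pitches_affected(pitches, observed_rate, unbiased_rate):
    """Number of calls that differ from what an unbiased umpire would have made."""
    return round(pitches * abs(observed_rate - unbiased_rate))

# White umpires are the biased ones: 60% called, 50% would be fair.
print(pitches_affected(350_000, 0.60, 0.50))
# Black umpires are the biased ones: 50% called, 60% would be fair.
print(pitches_affected(45_000, 0.50, 0.60))
```

The same ten-point gap translates into very different pitch counts depending on which group of umpires you attribute it to, because the two groups called such different numbers of pitches.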

And maybe it's not white umpires favoring white pitchers, or black umpires disfavoring white pitchers. Maybe it's black umpires favoring black pitchers. Maybe black umpires generally have a smaller strike zone, but they enlarge it for black pitchers. Since there are so few black pitchers, maybe in this case only 800 pitches are affected.

The point is that without the different-races-have-identical-biases assumption, all the conclusions fall apart. The only thing you *can* conclude, if you get a statistically-significant UPM coefficient, is that there is bias *somewhere*. You can reject the hypothesis that bias is zero everywhere, but that's it.

The other hidden assumption is:

-- All umpires have identical same-race bias.

The study treats all pitches the same, again with the same UPM coefficient, and assumes the errors are independent. This assumes that any umpire/pitcher matchup shows the same bias as any other umpire/pitcher matchup – in other words, that all umpires are biased by the same amount.

To me, that makes no sense. In everyday life, attitudes towards race vary widely. There are white people who are anti-black, there are people who are race-neutral, and there are people who favor affirmative action. Why would umpires be any different? Admittedly, we are talking about unconscious bias, and not political beliefs, but, still, wouldn't you expect that different personalities would exhibit different quantities of miscalled pitches?

Put another way: remember when MLB did its clumsy investigations of umpires' personal lives, asking neighbors if the ump was, among other things, a KKK member? Well, suppose they had found one umpire who, indeed *was* a KKK member. Would you immediately assume that *all* white umpires now must be KKK members? That would be silly, but that's what is implied by the idea that all umpires are identical.

I argue that umpires are human, and different humans must exhibit different degrees of conscious and unconscious racial bias.

Once you admit the possibility that umpires are different, it no longer follows that bias must be widespread among umpires or races of umpires. It becomes possible that the entire observed effect could be caused by one umpire! Of course, it's not necessarily true that it's one umpire – maybe it's several umpires, or many. Or, maybe it is indeed all umpires – even though they have different levels of bias, they might all have *some*.

How can we tell? What we can do is, for all 90 umpires, see how much they favor their own race over another. Compare them to the MLB average, so that the mean umpire becomes zero relative to the league. Look at the distribution of those 90 umpire scores.

Now, suppose there is no bias at all. In that case, the distribution of the individual umpires should be normally distributed exactly as predicted by the binomial distribution, based on the number of pitches.

What if only one or two umpires are biased? In that case, we probably can't tell that apart from the case where no umpire is biased – it's only a couple of datapoints out of 90. Unless the offending umps are really, really, really biased, like 3 or 4 standard deviations, they'll just fit in with the distribution.

What if half the umpires are biased? Then we should get something that's more spread out than the normal distribution – perhaps even a "two hump" curve, with the biased umps in one hump, and the unbiased ones in the other. (The two humps would probably overlap).

What if all the umpires are (differently) biased? Again we should get a curve more spread out than the normal distribution. Instead of only 5% of umpires outside 2 standard errors, we should get a lot more.

So we should be able to estimate the extent of bias by looking at the distribution of individual umpires. I checked, using a dataset similar to the one in the study (details are in part 8).

What I found was that the result looked almost perfectly normal. (You would have expected the SD of the Z-scores to be exactly 1. It was 1.02.)

This means one of the following:

-- no biased umps
-- 1 or 2 biased umps
-- many biased umps with *exactly the same bias*.

As I said, I don't find the third option credible, and the statistical significance, if we accept it, contradicts the first option. So I think the evidence suggests that only a very few umpires are biased, but at least one.
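The logic of that check can be illustrated with a toy simulation. This is not the analysis from part 8; it's a sketch with made-up inputs. I'm assuming 90 umpires, 2,000 same-race and 2,000 other-race pitches each, a 32% base strike rate, and, in the "varied" case, individual biases drawn from a normal curve with an SD of three percentage points.

```python
import random
from statistics import pstdev

def z_score_spread(biases, pitches=2000, base_rate=0.32, seed=1):
    """SD of per-umpire z-scores for the same-race vs. other-race strike gap."""
    rng = random.Random(seed)
    zs = []
    for bias in biases:
        same = sum(rng.random() < base_rate + bias for _ in range(pitches))
        other = sum(rng.random() < base_rate for _ in range(pitches))
        pooled = (same + other) / (2 * pitches)
        # Binomial standard error of the difference between two strike rates.
        se = (2 * pooled * (1 - pooled) / pitches) ** 0.5
        zs.append((same - other) / pitches / se)
    # pstdev measures spread around the group mean, like comparing to league average.
    return pstdev(zs)

N_UMPS = 90  # all other inputs are hypothetical
bias_rng = random.Random(0)

no_bias = z_score_spread([0.0] * N_UMPS)
identical = z_score_spread([0.01] * N_UMPS)
varied = z_score_spread([bias_rng.gauss(0.0, 0.03) for _ in range(N_UMPS)])

print(f"no bias:        SD of z = {no_bias:.2f}")
print(f"identical bias: SD of z = {identical:.2f}")
print(f"varied bias:    SD of z = {varied:.2f}")
```

With no bias, or with every umpire biased by exactly the same amount, the spread of z-scores stays near 1. Only when umpires differ from each other does the spread widen, which is why an observed SD near 1 points to one of the three options listed above.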

However: the white/white sample is so large that one or two biased white umpires wouldn't be enough to create statistical significance. So, if we must assume bias, we should assume we're dealing with a very small number of biased *minority* umpires. Maybe even just one.

And as it turns out (part 7), if you take out one of the two hispanic umpires (who, it turns out, both called more strikes for hispanic pitchers), the statistical significance disappears.
If you take out a certain black umpire, who had the highest increase in strike calls for black pitchers out of all 90 umps, the statistical significance again disappears. This doesn't necessarily mean that any or all of those umps are biased. It *does* mean that the possibility explains the data just as well as the assumption of universal unconscious racial bias.

------

Based on all that, here's where the study and I disagree.

Study: there is statistically significant evidence of bias.
Me: there *may* be statistically significant evidence of bias, but you can't tell for sure because some of the critical observations aren't independent.

Study: the findings are the result of all umpires being biased.
Me: the findings are more likely the result of one or two minority umpires being biased.

Study: many pitches are affected by this bias.
Me: there is no way to tell how many pitches are affected, but, if the effect is caused by one or two minority umpires favoring their own race, the number of pitches would be small.

Study: overall, minority pitchers are disadvantaged by the bias.
Me: That would be true if all umpires were biased, because most umpires are white. But if only one or two minority umpires are biased, then minority pitchers would be the *beneficiaries* of the bias.

Study: the data show that bias exists.
Me: I don't think the data shows that bias exists beyond a reasonable doubt. For instance, suppose the results found are significant at a 5% level. And suppose your ex-ante belief is that umpire racism is rare, and the chance at least one minority umpire is biased is only about 5%. Then you have equal probabilities of luck and bias.
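That "equal probabilities" claim is just Bayes' rule. A minimal sketch, with one added assumption of mine: if a biased umpire really exists, the study is certain to come out significant (power of 1). With lower power, the probability of bias would be lower still.

```python
def posterior_bias(prior_bias, alpha, power=1.0):
    """P(bias | significant result), by Bayes' rule."""
    true_positive = prior_bias * power          # bias exists and is detected
    false_positive = (1 - prior_bias) * alpha   # no bias, significant by luck
    return true_positive / (true_positive + false_positive)

# 5% prior that at least one minority umpire is biased, 5% significance level.
print(f"{posterior_bias(0.05, 0.05):.3f}")
```

With a 5% prior and a 5% significance level, the result comes out a shade over one half, so luck and bias are close to equally likely explanations.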

I do not believe the evidence in this study is strong enough to lead to a conclusion of bias beyond a reasonable doubt. But it is strong enough to suggest further investigation, specifically of those minority umpires who landed on the "same-race" side. My unscientific gut feeling is that if you did look closely -- perhaps by watching game tapes and such -- most of the effect would disappear and the umps would be cleared.

But that's just my gut. Your gut may vary. I will keep an open mind to new evidence.

Labels: baseball, Hamermesh update, race

Sunday, June 01, 2008

Why are runs so scarce in the 2008 American League?

Last month, J.C. Bradbury showed that comparing April home run rates to average April temperatures showed a surprising correlation: the two curves moved together almost in unison.

So is this year's drop in home run power due to the weather? Apparently not. In a follow-up post, Bradbury shows that if you look more closely at the relationship between game-time temperature and home runs (outdoor games only), only 4% of this year's decline can be explained by the cold.

Also, and more interestingly, home runs are down a lot more in the American League than in the National League. In fact, AL offense is actually lower than NL offense so far this year: at time of writing, 4.38 runs per game vs. 4.60 runs per game. That's a big difference, especially considering the AL, with its DH rule, is normally about half a run *higher* than the NL. So the American League is about 0.72 runs below where it "should" be, although you have to adjust for interleague games, in which the DH advantage disappears. Call it, say, 0.6 runs per game.

Could it be that AL cities have been colder than NL cities so far this year? Nope. Bradbury checked that too, and it turns out that the temperature in AL parks has been pretty typical this year.

And to put one more nail in the temperature coffin, Zubin Jelveh looked at domed stadiums, and found the same sharp dropoff in slugging percentage as in other parks.

So what's causing the drop? One theory is that it's not a dropoff in hitting, but, rather, an improvement in pitching. Supporting that theory is that, in interleague play to date, the AL is three games above the NL, suggesting that it's still the better league. Of course, that could just be random chance, because there haven't been that many interleague games so far this year. (I couldn't find exact interleague records, but the AL as a whole has three more wins than losses, and the NL vice-versa.)

Another theory is that it's the PED clampdown causing the drop in power, but, as others have pointed out, steroids testing actually started several years ago, and, in any case, it doesn't make sense that users would be so heavily concentrated in the American League.

Over at the Sporting News, David Pinto suggests that it's an age thing. He notes that AL hitters are significantly older than their NL counterparts, 29.5 to 28.8. That's come as a reversal from 2005, when the AL was actually 0.4 years *younger* -- a four-year change of 1.1 years.

But there's no hard evidence that younger players are actually better than older players – looking at Pinto's (very interesting) charts, you note that the NL has better younger players, and the AL has better older players. But the AL old-guy advantage is smaller than the NL young-guy advantage. As Pinto writes,

"... at the ages where the NL OPS is higher, it tends to be much higher than it is at the corresponding AL age. Where the AL OPS is higher, the gap is not quite as large."

So it seems to me that it's not as simple as an age thing.

It's an interesting problem, and I don't know what the answer is. But any theory would have to explain:

-- why the sudden drop
-- why the drop is so much larger in the AL
-- and why the AL is ahead in interleague play, despite its anemic offense.