Sabermetric Research: July 2007

Tuesday, July 31, 2007

Jonathan Gibbs study unconvincing on NBA point shaving

The Tim Donaghy NBA scandal has reignited interest in point shaving in general. In that light, a bunch of sources – a New York Times column, the Freakonomics blog, and TWOW, to name three – have mentioned a new study by Jonathan Gibbs, an undergraduate at Stanford. ESPN.com has an excellent overview article on its TrueHoop blog.

The study is called "Point Shaving in the NBA: An Economic Analysis of the National Basketball Association's Point Spread Betting Market." Gibbs says he found reasonable grounds to suspect point shaving, and everyone (that I've read) seems to believe that the study does indeed show good evidence.

I disagree. Let me start by summarizing Gibbs' results (in between dotted lines), then I'll comment.

-----

Gibbs built a database of a large number of NBA games. He found that the favorite beat the spread almost exactly half the time, 49.95%. However, the larger the spread, the less likely the favorite would cover. If you bet underdogs of 10 or more points, you'll win 52.15% of your bets. At 13 points or more, 53.04% of underdogs covered. The betting market seems to be inefficient when it comes to games between unbalanced teams.

Taking home underdogs only, it's even worse. At 10 points or more, home underdogs had a .548 winning percentage against the spread. At 13 points or more, it was an unbelievably large .718 (although in only 42 games). You can apparently make very good money by betting on home underdogs. [There's a similar effect in the NFL, also – see here.]

If you divide all games into three groups, based on the size of the spread (0-6 points, 6.5-12 points, 12.5 points and up), you find a continuum. Narrow favorites tend to cover more than 50% of the time. Medium favorites cover almost exactly 50% of the time. And, as mentioned, heavy favorites cover less than 50% of the time.

Moreover, there is an interesting "last minute" effect. After 47 minutes of play – or 46, 45, 44, and 43 -- favorites in games with small spreads have actually covered less than half the time. But in the last minute, they tend to outscore their opposition to the extent that they go from a couple of points below the spread, to a couple of points above it. That's because of the opposition's fouling strategy. If a team is up by 4 with less than a minute to play, the opposition will take deliberate fouls in order to get the ball back. On very rare occasions, the strategy works and they win. Usually, it just gives the team in the lead more points. And so, the favorite's 3 point lead often turns into 6 or 7 following the gift foul shots.

The same pattern appears for medium-spread games, but to a lesser extent – which makes sense, because fewer medium-spread games are within "fouling distance" towards the end.

But for games with large spreads, the effect goes the other way. Heavy favorites are a little bit ahead of the spread with a minute to go, but wind up behind the spread by the end of the game. The author believes this is evidence of point shaving by the favorite.

Finally, Gibbs runs a few regressions on the final score (against the spread) against a bunch of other variables. In several cases, the "point spread" variable is significant, which suggests to Gibbs that teams are aware of, and being affected by, the spread.

-----

Okay, now my comments.

First, the shift by the heavy favorites from covering the spread after 47 minutes, to not covering the spread after 48. Isn't there a more likely explanation than point shaving? Couldn't it just be normal "garbage time" strategy? Up by 15 with a minute to go, the team leading will perhaps want to seal the victory by wasting as much of the 24 seconds as it can, rather than rubbing it in by trying to score more points. The opposition has no such motivation, and they might gain a few points in the last minute on the couple of possessions they get.

Second, Gibbs is puzzled that the point spread variable should be significant in regressions. But, of course it should, because it's a proxy for team quality. Suppose team X is tied with team Y with a minute to go. Who will win? From the information I gave, it's 50-50. But now suppose I tell you that X was a ten-point favorite over Y. Now who's more likely to win? Team X. Not because X cares about the spread, but because the spread means they're a better team than Y! Of course, it's only one minute, and so team X might only be a very slight favorite. But Gibbs' study includes an awful lot of games, which provides more than enough data for a slight advantage to become statistically significant.

Third, I think Gibbs misinterprets the results of his regressions. In one model, he tries to estimate the probability of the favorite covering from the betting line and the five score margins with 1 to 5 minutes to go. He finds that "as the ... point spread becomes ... larger, the likelihood of ... covering decreases." (And he takes this as evidence of shaving.)

But that holds all the other variables in the regression equal – for instance, the amount of the lead (against the spread) after 47 minutes. Consider two situations where that margin is one point.

-- X is five points ahead of Y, against a 4-point spread, with one minute to go.
-- A is fifteen points ahead of B, against a 14-point spread, with one minute to go.

Which team is less likely to beat the spread? A, of course. They're 15 points up, so they’re going to kill time, let B score a few points, and win by 11. X is only 5 points up, so they can't afford to do that, so they'll try to score and increase their margin.

As Gibbs says, "the betting line is a significant determinant for whether the favored team covers." Yes, but that's only because the way he set up his regression makes it a proxy for actual margin of victory. It has nothing to do with the spread at all!

Fourth, there's something about these regressions I don't understand. Take the one just described, for instance. It turned out that all the variables were significant – margin with 5 minutes to go, margin with 4 minutes to go, margin with 3 minutes to go, margin with 2 minutes to go, and margin with 1 minute to go. Why is that? If A is ahead by 4 points with a minute to go, what difference does it make how much they led by three minutes ago? Shouldn't only the 1-minute-to-go number be significant? If the score is 93-89, why does it matter how it got that way?

I suppose you could come up with some hypothetical ... maybe if you were 10 points up before, but only 1 point up now, you might have benched the unfocused regulars who blew 9 points in three minutes, and that might make the last minute go differently. But that doesn't seem right, and I wonder what's really going on in that regression.

Fifth, and very important: the fact that some teams beat the spread more or less often than 50% can't possibly be evidence of point shaving. Oddsmakers have studied thousands of games, and are the best in the world at what they do. If 6% of games were being fixed – or even 1% of games – the oddsmakers would have taken that into account. Even if they didn't know, or even suspect, that point shaving was happening, they would just notice that heavy favorites win by fewer points, and adjust accordingly.

Suppose that the oddsmakers figure that team X is good enough that they should beat team Y by 12 points. But, they know from studying past games that their model isn't good enough – that because of point shaving, the average is 11 points. So they drop the line to 11. Now, the corrupt player, seeing that, will try to lose by only 10. The oddsmaker figures that out, and drops the line to 10. The corrupt player now tries for 9. The oddsmaker matches, and the player drops to 8. On it goes, until the line has dropped so far that the player can just barely shave that many points without arousing suspicion. That's an equilibrium, and now the odds correct exactly for the probability of point-shaving.

So my argument is that looking at how often teams beat the spread can't possibly provide evidence of fixed levels of point-shaving. It can only show evidence of point shaving that's *unexpected* by the oddsmakers. Since Gibbs' sample of games is many seasons long, and oddsmakers are very, very good at what they do, the failure to beat the line can't automatically be attributed to cheating.

Finally, suppose you're a corrupt player trying to miss shots to come under the spread. Wouldn't you do as little as possible to ensure that result? Suppose the spread is 10, you're up by 9, and so you deliberately miss a shot with 30 seconds left. The opposition scores a three, and now you're only up by 6 with five seconds left. Wouldn’t you stop trying to lose? No matter what, the opposition is going to cover the spread. You won't risk getting caught just to lose by 6, when losing by 8 is perfectly acceptable.

And you're not going to cheat in the first or second quarter, when the game is close and you don't know how the score's going to wind up. For one thing, your team might lose outright because of your shaving. For another, the opposition might play so well that you don't have to take the risk of shaving at all.

To save your butt, you're not going to start deliberately missing until late in the fourth quarter. And even then, only when the game is close compared to the spread.

And so, if teams were regularly point shaving, you'd see an unusual shape when you plotted the results. You wouldn't expect to see the nice curve that Gibbs found. His graphs are smooth and symmetrical, just shifted left. Not only are –1 and -2 (against the spread) too high, but so is –5, and –10, and –15. But why would anyone cheat on a –15 game?

Instead of smoothness, wouldn't you expect a big hump only close to zero, at minus 1 and 2 and 3? And wouldn't all those extra games come from games that were plus 1 and plus 2 and plus 3? Gibbs claims that 6% of lopsided games may involve point shaving. That should create a strikingly huge growth immediately to the left of 0, and a similar valley immediately to the right of zero. But that's not what we see.
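The "hump and valley" argument can be illustrated with a toy simulation. This is only a sketch under invented assumptions, not Gibbs' data or method: final margins against the spread are drawn from a normal curve with a standard deviation of 11 points (an assumed figure), shaving is attempted in 6% of games, and a shaver acts only when the favorite would otherwise have covered by three points or fewer. The function names are made up for the example.

```python
import random

def margins_vs_spread(n_games, shave_rate=0.06, sd=11.0, window=3.0, seed=0):
    """Return (clean, shaved) lists of final margins relative to the spread.

    Hypothetical model: each game's margin against the spread is N(0, sd).
    In a fraction `shave_rate` of games, a favorite that would have covered
    by `window` points or fewer does just enough to fall short instead.
    """
    rng = random.Random(seed)
    clean, shaved = [], []
    for _ in range(n_games):
        m = rng.gauss(0.0, sd)
        clean.append(m)
        fixed = rng.random() < shave_rate
        if fixed and 0.0 < m <= window:
            m = -m  # narrowly covering becomes narrowly not covering
        shaved.append(m)
    return clean, shaved

def count_between(values, lo, hi):
    """Count values v with lo <= v < hi."""
    return sum(1 for v in values if lo <= v < hi)

clean, shaved = margins_vs_spread(100_000)
for label, data in (("clean", clean), ("shaved", shaved)):
    just_under = count_between(data, -3.0, 0.0)
    just_over = count_between(data, 0.0, 3.0)
    print(f"{label:>6}: just under the spread = {just_under}, just over = {just_over}")
```

In this toy model the extra games just under the spread come entirely out of the games just over it, and the rest of the curve is untouched, which is the lumpy shape described above rather than a smooth leftward shift.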

---

It's definitely possible that I'm missing something, especially considering that so many people have read Gibbs' paper and found it convincing. But I don't think it gives any real evidence of point shaving whatsoever.

Labels: basketball, economics, point shaving

Sunday, July 22, 2007

Are MLB player skills normally distributed?

There's a paper in the most recent JQAS on evaluating outfielder and catcher throwing arms. It's called "Evaluating Throwing Ability in Baseball," by Matthew Carruth and Shane Jensen.

I'm still going through it, but the first thing that struck me was that the authors assumed that throwing skills in the baseball population are normally distributed.

That seemed to be wrong. In the 1985 Abstract (Blue Jays comment, page 113), Bill James argued, convincingly, that major-league player skills should be shaped not like the normal bell curve, but like the right tail of that curve. The general population is shaped like the normal distribution, but only the very best players make the majors. Those are the ones at the extreme right tail, which doesn't have a bell-curve shape at all.

So shouldn't outfielder and catcher arms also be shaped like the right tail? If they are, the Carruth/Jensen study shouldn't work at all.

After thinking about it a bit, I wonder if individual skills might not be close to normal after all.

Suppose a player's overall talent was the sum of 1,000 independent skills, each of which is normally distributed in the population. Wouldn't each of those skills be almost normally distributed in MLB? Consider skill number 502. If you were average in skill 502, it would barely affect your chance of making the majors – you'd have 999 other skills to be good at. And so if you took an MLB-wide profile of skill 502, it would barely be different from the general population. The same, of course, would be true for each of the other skills. Each skill would look very close to normal, but the *sum* of all skills would be shaped like Bill James' right tail.

Of course, the example of 1000 different skills is farfetched. But, it turns out, the "right tail" disappears much sooner than 1000. It goes away also with a much more realistic model of baseball skills.

Suppose there are three basic skills: hitting, range, and throwing. All three are normally distributed in the general population with mean 100 and SD 16 (which I think puts them on the same scale as IQ). All are independent. A player's overall value is the sum of 70% of his hitting score, 20% of range, and 10% of arm. Any player with an overall score of 146 or more makes the majors.

Under those conditions, you'd still expect the MLB distribution of overall score to be shaped like the right-tail of the normal distribution. But what about the three individual skills? Will they be right-tail shaped, or bell shaped? Probably batting, being 70% of the total, will be tail shaped. But what about range? And arm?

To check, I ran a simulation. I created a general population of 2,000,000 baseball players (I would have used more, but the VB random number generator apparently started repeating). Of those two million, only 87 players made the majors.
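A simulation along these lines can be sketched in a few lines of Python. This is a reconstruction from the description above, not the original VB program: the weights, the mean of 100, the SD of 16 and the 146 cutoff come from the post, while the function name, the seed and the smaller demo population are choices made for the example.

```python
import random
import statistics

WEIGHTS = {"bat": 0.7, "range": 0.2, "arm": 0.1}  # value weights from the model above
CUTOFF = 146.0                                     # overall score needed to make the majors

def simulate_majors(population, seed=0):
    """Draw `population` players and return those whose overall score clears CUTOFF."""
    rng = random.Random(seed)
    majors = []
    for _ in range(population):
        # Each skill is independent, normal, mean 100, SD 16.
        player = {skill: rng.gauss(100.0, 16.0) for skill in WEIGHTS}
        player["overall"] = sum(WEIGHTS[s] * player[s] for s in WEIGHTS)
        if player["overall"] >= CUTOFF:
            majors.append(player)
    return majors

# A smaller population than the post's 2,000,000, to keep the demo quick.
majors = simulate_majors(200_000)
print(f"{len(majors)} players made the majors")
if majors:
    for skill in ("overall", "bat", "range", "arm"):
        mean = statistics.mean(p[skill] for p in majors)
        print(f"average {skill} rating: {mean:.1f}")
```

Calling `simulate_majors(2_000_000)` matches the population size in the post; the exact head count will vary with the random seed.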

I then plotted graphs for those 87 players. As you would expect – and as the simulation forced – the overall rating of those 87 players looked like the right tail. 29 players scored at 146. Only 18 scored at 147. 12 scored at 148. 9 were at 149, 11 at 150, and then they trickled off, so that only 8 were between 151 and 161. Definitely a right-tail picture.

That was the overall rating. So I then plotted just batting. And what happened? The distribution looked more like a normal curve! Even though batting comprised 70% of the "right tail" rating, that 70% was bell shaped!

(I'm not good enough with HTML to show the curve here, but I've posted it on my website. Take a look here.)

The other two skills – range and arm – looked even "more" normal. They're shown in the above link also.

I didn't run any formal statistical tests for normality. Indeed, it's probably easy to show that none of the three skill distributions should be normal. But they're *approximately* normal, pretty good bell shapes. A normal distribution has a "skewness" (measure of symmetry) of zero, and a "kurtosis" (measure of peakedness) of 3. Here are the stats for these four curves:

Normal: Skewness = +0.0, Kurtosis = 3.0
----------------------------------------------------------
Overall: Skewness = +2.5, Kurtosis = 11.5
Batting: Skewness = +0.4, Kurtosis = 3.6
Range: Skewness = -0.1, Kurtosis = 2.8
Arm: Skewness = +0.1, Kurtosis = 2.4

The above numbers don't tell you anything the graphs don’t – they're just a numerical way of summing up the pictures. I have no idea if the last three are statistically significantly different from normal (0.0, 3.0), but they're pretty close in real-life terms.
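For reference, here is one common way to compute the two statistics, using plain population moments so that a normal curve comes out at skewness 0 and kurtosis 3. The post does not say which estimator was used, so treat the formulas and function names below as an assumption; spreadsheet functions, for example, often apply small-sample corrections or report kurtosis minus 3.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third central moment scaled by SD cubed; 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment scaled by variance squared; 3 for a normal curve."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

sample = [1, 2, 3, 4, 5]  # a flat, symmetric toy sample
print(f"skewness = {skewness(sample):.2f}, kurtosis = {kurtosis(sample):.2f}")
```

A flat sample like this one is symmetric but less peaked than a bell curve, so its kurtosis comes out well under 3.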

I can't help but conclude that most individual player skills, so long as they're not overwhelmingly correlated with the player's overall value, could be pretty close to normally distributed.

But a few points. First, I assumed that all three skills were independent of each other in the general population. In real life, that's obviously not true: good athletes will have both good range and good arms. The more correlated the skills, the more likely they'll be in line with overall value, which is right-tail shaped. So that might change things.

Second, the assumption was that an arm of –2 SDs is worth twice as much (badness) as an arm of –1 SD. That's again not true: a minus-two player can be taken advantage of by baserunners, and can therefore cost his team three or four times as much as the minus-one guy. That might mean there are fewer –2s in real life than the model, which would make the distribution look more right-tailed.

(Finally, my sample may be too big. 87 out of two million is the equivalent of some 4,000 players in the U.S. male population of baseball-playing age. Throwing out half the players may make everything more right-tailed. Hang on, let me check ... nope, results stayed roughly the same. Never mind!)

By the way, a few more interesting notes from the sample of 87:

1. The players had an average batting rating of 163. The average range rating was only 119. The average arm rating was barely above average at 105. This is as you'd expect: if you can hit 4 SDs above average, you're a major leaguer, no matter how bad your arm. But if you can throw 4 SDs above average, so what? Unless you can hit, your arm just isn't that valuable to the team.

This works for other sports too – in golf, as I understand it, there are guys who can drive a ball 400 yards. But the other aspects of their game aren't very good, so they're doing driving contests for a living instead of playing on the PGA tour.

2. The best player in the sample, with an overall 161, had only an 88 arm. Again, this makes sense. It would probably take months of going through people at the mall before you found one who can hit like a major-league outfielder. But you could probably find some guy who can throw like a major-league outfielder much more easily.

3. There was a high negative correlation (r = -0.75) between hitting and range. That makes sense too. Players who make it to the majors with their bat don't have to have good range to keep their jobs. And players who earn a job with their glove are unlikely to be among the best hitters.

The correlation between range and arm was –0.14, and between bat and arm was –0.18.

4. Every one of the 87 players had a better hitter rating than his overall rating. This is probably indicative of the study being oversimplified, since there are numerous players in MLB whose value comes mostly from defense. A better study is probably called for, but I think the conclusion – that individual skills may indeed be normally distributed – is still supported.

Labels: baseball, distribution of talent, statistics

Saturday, July 21, 2007

The NBA referee scandal -- any evidence yet?

An NBA referee is accused of manipulating games to come out on a certain side of the point spread. If that's true, it shouldn’t be too difficult to look at that referee's games and come up with the evidence. As Steven Levitt points out, maybe we should expect to be hearing from Justin Wolfers before long.

A thought: if this were baseball instead of basketball, would we have expected the evidence to have come out already? There are lots more sabermetricians in baseball than in basketball, and the existence of Retrosheet (which provides umpire data) would take care of the data-gathering step.

Regardless, I'd give pretty good odds that the amateur APBRmetricians will beat the academics to the punch on this one.

Labels: basketball, gambling

Tuesday, July 17, 2007

Tom Benjamin on NHL salary cap "loopholes"

I recently wrote about a loophole in the NHL's collective bargaining agreement that may allow teams to skirt the salary cap by front-loading contracts.

Tom Benjamin points out that the CBA actually limits the amount of trickery teams can do. For instance, no contract can specify a salary that is more than 50% below the previous season's. Also, poorer teams who back-load contracts get an advantage of their own; towards the end of the contract, when the player makes much more than his cap amount, he can be profitably traded to a rich team that's short of cap space.

Very good stuff; read Tom's entire post.

Update: James Mirtle comments on the 50% restriction here.

Labels: economics, NHL, salary cap

Monday, July 16, 2007

Bonds' steroid cycles -- a Tom Tango study

A few days ago, I posted a review of Derek Zumsteg's study of Barry Bonds' hitting patterns, which, Zumsteg argued, showed evidence that Bonds was on a steroid cycle.

It turns out that Tom Tango studied the issue in much more detail over three months ago, which I missed. (You'll have to read the comments here to find his studies.)

The original idea was this: if Bonds was on steroids for three weeks, and then off them for one week, we should find that his performance similarly cycles over 28 days – three weeks better hitting, one week worse hitting. Comparing the years Bonds was allegedly on PEDs to the years he was off, Zumsteg found the cycles were larger in the steroid years.

Tango built on that by looking at *all* players over several years, and examining their own cycles. If Bonds was juicing and most other players were not, his cycles should be much more pronounced than those of the clean players.

It turns out that if you compare cycles by looking at the difference in wOBA (Tango's stat) after subtracting "off cycle" from "on cycle", Bonds is very near the least consistent, suggesting PED use. But if you look at the *ratio* of the two stats (by dividing rather than subtracting), his cycles become much more ordinary. As Guy points out in comment 39 of Tango's post, he's only 1 standard deviation above average.

Guy is probably right that ratio is a better indicator than difference. It's a lot easier to gain X points of OPS by hitting a few more home runs than to gain the same number by hitting singles.

And so I think Bonds' showing in Tango's excellent study doesn't really constitute evidence of juicing. And while I might be biased in favor of my own work, I think the simulation results in my previous post are good evidence that Bonds' 2002 cycles are almost exactly what you would expect by chance.

Labels: baseball, steroids, streakiness

Sunday, July 15, 2007

Massive loophole in the NHL salary cap?

The NHL's labor agreement has a loophole that allows teams to skirt the salary cap almost at will. That is, if I correctly understand Michael Farber's article in the July 16 Sports Illustrated (which I couldn't find online – maybe it's not up yet?).

According to Farber, when a player is given a multi-year contract that doesn't pay the same amount each year, the team's salary cap is debited the average annual amount. For instance, Daniel Brière signed with the Flyers for $52 million over eight years. That's an average of $6.5MM, which will be charged each season against the Flyers' cap. But for the eight years, Brière is actually being paid $10, $8, $8, $7, $7, $7, $3, and $2 million, respectively.

While this contract does add up to $52 million, its present value is obviously much higher than if it were eight equal payments. At a discount rate of 10%, the present value of Brière's contract is about $39.1 million. If the payments were equal, the PV would be only about $36.4 million. So the Flyers are effectively handing Brière a bonus of $2.7 million, right now, without it affecting the salary cap.

Obviously, the Flyers could game the system even worse, by paying Brière $51,999,993 this season, but only $1 for each of the last seven seasons. It still adds up to $52 million, and $6.5 million on the cap each year. But now the present value is $49.6 million, for a salary-cap-free bonus of $13.2 million.

That might be a bit obvious, though, and Gary Bettman would probably take notice.
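The present-value arithmetic above can be sketched as follows. The payment schedules and the 10% rate are from the post; the mid-year payment timing is an assumption, chosen because it reproduces the rounded figures quoted ($39.1, $36.4 and $49.6 million), and the post does not say which convention was actually used. The function name is made up for the example.

```python
def present_value(payments, rate=0.10, timing=0.5):
    """PV of yearly payments; payment i (0-based) is discounted over i + timing years.

    timing=0.5 assumes each season's salary arrives, on average, mid-year.
    """
    return sum(p / (1.0 + rate) ** (i + timing) for i, p in enumerate(payments))

# All amounts in millions of dollars.
schedules = {
    "actual Briere deal": [10, 8, 8, 7, 7, 7, 3, 2],
    "eight equal payments": [6.5] * 8,
    "everything up front": [51.999993] + [0.000001] * 7,
}

flat_pv = present_value(schedules["eight equal payments"])
for name, payments in schedules.items():
    pv = present_value(payments)
    print(f"{name}: total {sum(payments):.1f}, PV {pv:.1f}, "
          f"bonus over equal payments {pv - flat_pv:.1f}")
```

All three schedules total $52 million and carry the same $6.5 million cap charge; only the present value changes.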

Now, suppose the Flyers are so rolling in money that they're willing to pay Brière something completely unreasonable, like $50 million a year for 20 years. Technically, they can sign him to a billion-year contract, at $1 per year, and front load all the money into the first twenty years. The cap gets charged $1 per year, for a billion years – and the Flyers barely notice. Brière gets rich, and the cap (in spirit, if not in letter) is blown away.

Of course, even if Gary Bettman could ignore the front loading, he couldn't let a billion-year contract slip by. So here's another trick. According to Farber,

"Under the collective bargaining agreement, if a player signs a multi-year deal before the age of 35, and retires before it finishes, there is no salary-cap charge for the unplayed seasons."

So, suppose you want to sign a 34-year-old free agent to a six-year, $120 million contract, and the player plans to retire at 40. You'd normally take a $20 million hit to the salary cap each of the six years. But what if you sign him to a twelve-year, $120 million contract, instead, but front-load the contract onto the first six years? Now, his salary cap charge is only $10 million a year. And, when he retires at 40, the last six years of cap charge disappear completely. So you've doubled his salary over the cap charge, and gotten away with it!
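The cap arithmetic in that example is simple enough to write out. This sketch assumes, as the post does, that the cap charge is the plain average of the contract and that unplayed seasons cost nothing; the function name is made up for the example.

```python
def cap_charge(total_millions, contract_years):
    """Annual cap charge: the contract total averaged over its full length."""
    return total_millions / contract_years

total = 120            # $ millions actually paid, all within the first six years
seasons_played = 6     # the player retires at 40

for years in (6, 12):
    charge = cap_charge(total, years)
    charged_overall = charge * seasons_played
    print(f"{years}-year deal: ${charge:.0f}M per season against the cap, "
          f"${charged_overall:.0f}M charged in total for ${total}M paid")
```

Stretching the paper length of the deal from six years to twelve halves the annual charge while the money paid stays the same.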

If I understand this right, front-loading has the potential to undermine the intent of the salary cap, and allow the rich teams to continue to outspend the poor teams. Maybe in negotiating the agreement that ended the lockout, the NHLPA was a lot smarter than everybody thought?

Labels: economics, NHL, salary cap

Thursday, July 12, 2007

"The Cheater's Guide to Baseball," and Bonds/steroids evidence?

A reader of this blog sent me a copy of Derek Zumsteg's "The Cheater's Guide to Baseball," on condition that I write about it. So here you go. And I'm happy to report that it's a fine book.

Zumsteg takes us through many different kinds of cheating, even kinds that aren't cheating at all (like heckling and the hidden-ball trick). We get a good explanation of the 1919 Black Sox, and I learned about some groundskeeper shenanigans that I'd never heard of before. For instance, Cleveland's Emil Bossard would water down the infield when a ground-ball pitcher was scheduled to start for the Indians, so that opposition balls would die in the soggy ground.

The book is written in a friendly, easy style. Zumsteg usually sticks to the issue at hand, but, just when you least expect it, he'll throw in some pretty good sarcasm, and then get serious again. It works very well, not just because he's genuinely funny, but because his sarcasm is so well-directed that you feel you can trust him on the other stuff.

Zumsteg gives himself freer rein in the little boxes that dot the book. He mocks Pete Rose's evasions with an imaginary conversation after you've just videotaped Pete hitting your car:

"You hit my car."
"No, I didn't."
"Yes, you did. I was right here. You slammed right into it."
"I don't know what you're talking about."
"I have video footage of your car ramming mine."
"I wasn't driving."
"You're clearly visible in the tape."
"Maybe that tape's from some other car I hit, I don't know."
"You can see today's newspaper on the dash of my car."
"Maybe it was some guy who looks like me."
"When you got out of your car, your wallet flopped open to your driver's license."
"I don't know why you're persecuting me like this. I don't deserve this kind of treatment."
"You hit my car!"

Readers who didn't understand the Pete Rose situation before reading this should now have a pretty vivid idea of Rose's attempts at spin. Especially since most of us have had this kind of argument with some idiot or other.

As far as sabermetrics is concerned, there's one major study (which I'll talk about in a bit) and a bunch of claims that would be interesting to follow up on. For instance, Ichiro Suzuki is very particular about the moisture content of his bats. An academic study found that a dry bat could have a one percent increase in performance over a bat with a higher moisture content. I don't know what that means, but, if it means that a batted ball would go 1% farther, that's a huge advantage for Ichiro, or anyone.

Also notable:

-- Jamie Moyer is credited with being able to work an umpire, by arguing "as if he's giving the umpire a shoulder massage ... he pumps the umpire for information on balls ("Too low?" Ump nods. "Okay.") as he works out what can and can't get called a strike that day."

-- The Twins experimented with the Metrodome ventilation system, trying to get air currents running out when the Twins were at-bat, and running towards the batter when the opposition was at bat.

-- According to Baseball Prospectus, umpire Chuck Meriwether "was about two and a half times more likely than Mike Everitt" to call a base stealer out. Also, according to Zumsteg, "Derek Jeter is a master of the graceful-looking fake tag."

There are also several occasions where one character or another estimates the benefit of a particular form of cheating:

-- Lou Boudreau said "I wouldn't be surprised if [Bossard's groundskeeping] helped us win as many as ten games a year."

-- Earl Weaver argues that groundskeeping increased the Orioles' batting average by "more than 30 points."

-- Zumsteg himself argues that exceptional grounds crews "might be worth a few games a year."

-- And George Steinbrenner argued that Earl Weaver gained "eight or ten games a year" through intimidation of umpires.

Obviously, these are huge overestimates, and probably could be shown to be such. Were the Indians 10 games better at home than teams with less creative grounds crews? Were the Orioles actually 30 points better at home than on the road, after accounting for normal home field advantage? I haven't checked, but I'm pretty sure the answers would be no.

The book's only major sabermetric study comes near the end of the book, in the chapter on steroids. And it's an ingenious idea.

Zumsteg notes that according to the book "Game of Shadows," Barry Bonds was on a three-week-on, one-week-off steroid cycle in 2002. Moreover, Bonds complained that he didn't feel his usual self on his clean weeks.

If that were true, it should show up in Bonds' day-to-day records. He should be playing three weeks great, then one week not-so-great, then another three weeks great, and so on.

The only problem: Zumsteg didn’t know when Bonds' cycle started. So he ran all 28 possible breakdowns, and looked for the one with the biggest drop-off. That corresponded to a cycle starting on April 1 (or April 29). If you divide 2002 into four-week blocks, starting April 1, and consider Bonds "on" the first three weeks of each block, and "off" the last week of each block, you get:

"On"-- BA=.392, SLG=.878, HR/H=34%
"Off"- BA=.293, SLG=.533, HR/H=19%
----------------------------------
Diff-- BA=.099, SLG=.345, HR/H=15%

Does this evidence support the thesis that Bonds was on PEDs? Zumsteg says yes. One reason is that if you look at the second-best cycle, and the third-best cycle, and so on, they're all clustered around April 1 (or 29). "If we tried to apply this ... grouping to a season where a player was not experiencing [non-random] cycles ... the best fit dates would be randomly scattered."

But I don't think that's right. There will always be a cycle that looks bad, even if the results are just luck – the worst random cycle out of 28 will always be poor. And if a player hit poorly from (say) April 20-26, he will also have hit poorly from April 21-27, since, after all, the two intervals have six of their seven days in common! And that's true regardless of *why* he hit poorly those days. So I'd argue that the clustering of the bad cycles means nothing at all as far as steroids are concerned.

Zumsteg also tried this for other seasons (actually, he had Keith Woolner, of Baseball Prospectus, do it). They don't give much information on those other seasons, but say that in 1997 and 1998, when Bonds was almost certainly steroid-free, the effect is half the size of 2002, with less clustering. But in the suspect 2001 season, they again find clustering.

But still.

I ran a rough simulation of Bonds' 2002 season, assuming at-bats were random and his hitting was constant every day. Then I checked all 28 cycles, and found the one that made Bonds look the most suspicious. I then repeated that test 10,000 times. The results:

Simulation difference: .106 BA, .282 SLG, 17% HR/H
Observed 2002 numbers: .099 BA, .345 SLG, 15% HR/H

(technical details: I divided 2002 into 180 days. Each day, Bonds had 3 PA two-thirds of the time, and 4 PA one-third of the time. That meant he may have had more PA than his actual 2002, or fewer. Also, I found the *three* cycles with the highest difference, one cycle for each of the three stats. Zumsteg concentrated on finding only one cycle that maximized some function of all three stats, but it turned out that they all had their highs during that same cycle, so I figure this comparison is still legitimate.)
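For anyone who wants to try the idea, here is a simplified sketch of that kind of test for batting average only. It is not the code behind the numbers above. The 180 days, the 3-or-4 PA split and the 28 candidate cycles follow the technical details; treating every PA as an at-bat (no walks) and using a constant .370 hit probability are simplifying assumptions, so the output will not match the figures in this post. All names are made up for the example.

```python
import random

DAYS = 180        # season length used in the post
CYCLE = 28        # three weeks "on" plus one week "off"
ON_DAYS = 21
HIT_PROB = 0.370  # assumption: constant chance of a hit in every at-bat

def simulate_season(rng):
    """Return a list of (hits, at_bats), one per day, for a hitter with constant skill."""
    season = []
    for _ in range(DAYS):
        at_bats = 3 if rng.random() < 2 / 3 else 4
        hits = sum(1 for _ in range(at_bats) if rng.random() < HIT_PROB)
        season.append((hits, at_bats))
    return season

def on_off_gap(season, offset):
    """Batting average on 'on' days minus 'off' days, for a cycle starting at `offset`."""
    on_h = on_ab = off_h = off_ab = 0
    for day, (hits, at_bats) in enumerate(season):
        if (day - offset) % CYCLE < ON_DAYS:
            on_h += hits
            on_ab += at_bats
        else:
            off_h += hits
            off_ab += at_bats
    return on_h / on_ab - off_h / off_ab

def worst_cycle_gap(season):
    """The most suspicious-looking gap among all 28 possible cycle starts."""
    return max(on_off_gap(season, offset) for offset in range(CYCLE))

def p_value(observed_gap, trials=10_000, seed=0):
    """Share of random seasons whose worst cycle looks worse than the observed gap."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(trials)
                 if worst_cycle_gap(simulate_season(rng)) > observed_gap)
    return exceed / trials

# A short run for illustration; the post used 10,000 trials.
print(p_value(0.099, trials=200))
```

The key step is `worst_cycle_gap`: because the real analysis picked the best of 28 breakdowns, the random seasons have to be given the same 28 chances to look suspicious.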

In truth, the actual Bonds was almost exactly as consistent as the random, non-steroidal one. The flesh-and-blood Bonds was actually slightly *more* consistent in terms of BA and HR/H percentage. He was, however, moderately more cyclic in terms of slugging percentage.

In terms of p-values, none of Zumsteg's findings are statistically significant compared to random:

p=.54 for BA
p=.28 for SLG
p=.58 for HR/H

(The p-values were derived from the simulation: in 10,000 trials, about 5,400 of them showed a "worst cycle" difference for BA of more than the actual .099 observed.)
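For readers who want to try something similar, here is a minimal sketch of this kind of simulation. It is not the code used for the numbers above, and it rests on several assumptions. A "cycle" is taken to be a 28-day period containing one 7-day window, tried at each of the 28 possible starting offsets. Every plate appearance is treated as an at-bat. Only batting average is modeled. The `p_hit` value of .370 and the function names are illustrative placeholders.

```python
import random

def simulate_season(rng, days=180, p_hit=0.370):
    """Return a list of (at_bats, hits) per day with constant true ability.

    Assumption: every plate appearance counts as an at-bat (walks ignored).
    """
    season = []
    for _ in range(days):
        # 3 appearances two-thirds of the time, 4 appearances one-third
        ab = 3 if rng.random() < 2 / 3 else 4
        hits = sum(rng.random() < p_hit for _ in range(ab))
        season.append((ab, hits))
    return season

def worst_cycle_gap(season, period=28, window=7):
    """Largest BA gap (outside the window minus inside it) over all offsets."""
    best = float("-inf")
    for offset in range(period):
        in_ab = in_h = out_ab = out_h = 0
        for day, (ab, h) in enumerate(season):
            if (day - offset) % period < window:
                in_ab += ab
                in_h += h
            else:
                out_ab += ab
                out_h += h
        best = max(best, out_h / out_ab - in_h / in_ab)
    return best

def simulated_p_value(observed_gap, trials=10_000, seed=0):
    """Share of random seasons whose worst cycle beats the observed gap."""
    rng = random.Random(seed)
    exceed = sum(
        worst_cycle_gap(simulate_season(rng)) > observed_gap
        for _ in range(trials)
    )
    return exceed / trials
```

Calling `simulated_p_value(0.099)` gives the share of random seasons whose most suspicious cycle shows a bigger batting-average gap than the one observed. Because the cycle definition here is a guess, the output should not be expected to reproduce the figures above exactly.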

I have to admit that I was disappointed by these findings – Zumsteg's idea was so intriguing that I was hoping that he'd indeed discovered something new. And if the method had "worked," it would have been great for testing other players and coming up with a "suspects" list. Alas, it's not so.

Still, Zumsteg's book is good reading. Thanks again, anonymous benefactor, for the free copy.

Labels: baseball, steroids, streakiness

Sunday, July 08, 2007

Charlie Pavitt: A long-winded rant concerning the evaluation of fielding

By Charlie Pavitt

I start with Branch Rickey’s famous quote from his (and Allan Roth’s) ground-breaking Life Magazine statistical-analytic article “Goodby to Some Old Baseball Ideas” (August 2, 1954): “there’s nothing anybody can do with fielding” (this and subsequent quotes all from page 83). Rickey/Roth realized that fielding averages were “utterly worthless as a yardstick” because they say nothing about fielding range. But their conclusion that “fielding could not be measured” is surprising given their insights into evaluating batting and pitching. There’s quite a bit we can do with fielding. But we’ve not gotten as far in this regard as we have with batting and pitching; and when I see many of the evaluation methods used by some of our otherwise-best analysts, I am disappointed. My goal in this blog is to rant long-windedly about how I think we ought to be thinking about fielding evaluation in the first place.

Let’s start with the basics. We make progress in evaluating an aspect of baseball when we can successfully break the aspect into its component skills, measure each skill, and then combine these measurements in a meaningful common metric. Take batting prowess. There are four major component skills: the ability to get base-hits, the ability to hit for power, the ability to coax bases on balls, and the ability to steal bases without being caught. One factor common among the four that suggests a metric is that successful performance occurs when bases are gained without the loss of outs. Each can be measured fairly easily in that way, and the measurements combined, resulting in tools such as OPS (sans the steals) and total average, both of which work fairly well despite their simplicity. Another common metric is number of runs gained, leading to runs created and a large set of regression-based methods.

Defense as a whole works analogously; it is relatively simple to concoct measures of bases or runs given up relative to either outs gained or opportunity (e.g., innings played). The trick is to distinguish the pitching and fielding parts of the equation. Recent work starting with Voros McCracken’s insight implies that pitchers are best evaluated through examination of events that fielders cannot influence. This means that pitching prowess has the following three components: the ability to hit the strike zone (as measured by walks and hit-by-pitches), the ability to miss bats (as measured by strikeouts), and the ability to keep batted balls in play (as measured by home runs allowed). Measures based only on these skills would follow; the Baseball Prospectus people have proposed just that (DIPS ERA) in their recent Baseball Between the Numbers book. (I should add that pitchers do have some influence on batted balls in play, which implies a fourth skill; but this influence is far less than we thought pre-McCracken, and I would agree with those recommending that we ignore it for the time being.)

This leaves batted balls in play as the responsibility of the fielder, so that the evaluation of what fielders do with these is the issue at hand. Can we distinguish the component skills? (We need to exclude catcher defense here, which is a whole ‘nother matter.) Surely they include range and sure-handedness on batted balls and throwing ability. For middle infielders, one must consider adeptness as the middle man in a double play; for first basemen, the knack of picking up throws in the dirt. The trick lies in measuring each of them and then combining them into a common metric. This is a challenging project. Baseball differs from football, basketball, soccer, etc. in that it is an individual sport in a team context; i.e. its outcomes are primarily due to the pitcher versus hitter matchup. However, with fielding, particularly infielding, team coordination matters. As a whole it makes sense to credit fielders with the out when they successfully field a ball in their territory. But what do we do with the second out in a double play? Should the fielder get full credit, or the middleman, or should they split it? What do we do with assists from the outfield, especially when cut-off men are involved?

In order to make this rant particularly long-winded, I shall continue with a bit of history. Back in the March 1976 issue of "Baseball Digest", Bill James proposed the seemingly-novel idea that we measure infielding by the number of putouts and assists the infielder makes per game (in so doing reinventing, a hundred years later, an identical but soon-forgotten measure credited to Al Wright). Range factor was clearly an advance over fielding percentage, but it was laced with problems. First, it intermixes different skills without previous reflection concerning each. Infielders amass putouts and assists both through fielding batted balls and through participation in double plays and force outs, but these are reflective of different skills, only the first of which is directly relevant to range. I did a couple of studies in which I attempted to solve this problem by measuring infielding purely by assists, under the assumption that they were a purer measure of range than putouts. Bill published both, respectively in issues 24 (June 1986) and 31 (August 1987) of the "Baseball Analyst", although I believe that he disagreed with my method given the display of range that can be shown when infielders catch pop-ups far from their position. And I knew full well that many assists are racked up as a double-play middleman. Second, and these were the issues that my two studies were really about, range factor ignored two facts: pitching staffs with high strikeout totals limit infielder opportunities to field balls, and pitching staffs with a high proportion of innings taken by lefthanded pitchers will face a preponderance of righthanded batters, leading to proportionally more grounders to the third baseman and shortstop and fewer to the second and first basemen, when compared to pitching staffs with few lefty innings.
I presented this material at a SABR convention near Washington D.C., if I remember correctly in 1986; during my presentation, an audience member noted that pitching staff groundball vs. fly ball tendencies have analogous implications. Interestingly enough, John Dewan took the pitching-handedness bias as given and presented fielding measures adjusted for it at the same SABR convention, the beginning of a concern with this issue that he has maintained ever since.
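To make the two measures concrete, here is a small sketch of range factor as described above (putouts plus assists per game) next to the assists-only variant. The function names and the fielding line in the usage note are hypothetical and chosen only for illustration. The sketch makes no attempt at the strikeout or handedness adjustments, whose details are not given here.

```python
def range_factor(putouts, assists, games):
    """Putouts plus assists per game."""
    if games <= 0:
        raise ValueError("games must be positive")
    return (putouts + assists) / games

def assists_per_game(assists, games):
    """Assists-only variant, which leaves putouts out of the measure."""
    if games <= 0:
        raise ValueError("games must be positive")
    return assists / games
```

For a made-up shortstop with 250 putouts and 450 assists in 150 games, `range_factor(250, 450, 150)` comes to about 4.67 plays per game, while `assists_per_game(450, 150)` is 3.0.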

It was obvious that if we wanted to measure fielding plays made on batted balls independently of participation in double plays and free of biases due to pitching staff tendencies, we would have to go beyond the standard statistical measures of fielding and use play-by-play data to measure the proportion of balls hit into the portion of the ballpark for which each position is responsible that are successfully fielded. Fortunately, at about this time Project Scoresheet was beginning to supply the needed data, and analysts started using it for this purpose. The earliest effort of which I am aware was Pete DeCoursey’s work on what he called defensive average, first published in the March 1989 issue of a (sadly) short-lived publication called the "Philadelphia Baseball File." I believe others among “amateur” statistical analysts continued in this vein, and would be happy to hear from readers who have information on anybody doing this work during the 1990s. As for the “professionals,” and probably thanks to Dewan, the STATS annual Player Profiles books during the 1990s included a measure called zone rating, which unfortunately gave credit for two plays for fielded balls turned into double plays, in so doing conflating two different skills.

What are the lessons I think we should take home from all this? Let’s start with two do nots. First, do not use the standard indices, because no matter how well they are massaged they do not provide valid information. An example of this is the Defense ratings appearing in the Baseball Prospectus group’s annual. They are not always clear about their methods; from a description in "Baseball Between The Numbers" (page 97), it seems that Clay Davenport’s version of fielding runs begins with the standard measures and then adjusts them for park factor and the pitching staff tendencies mentioned above. As far as I can tell, they do not take the double play problem into account, but otherwise these adjustments are right-headed. But the method doesn’t seem to work. If you glance through their books, their Defense ratings for players differ wildly from year to year, at least by eye-ball analysis far more than random factors would allow. And they don’t trust their own numbers, regularly making verbal comments clearly inconsistent with their own calculations. For two examples from the 2007 book: on page 418 they ask whether Chris Duncan is “the single worst defensive outfielder in modern memory,” but his 2006 ratings are slightly above average (+1) in both left and right field; on page 381, they wittily call Pat Burrell “the Zeno’s Paradox Outfielder, in that no matter how close he seems to be to catching the ball he’s only halfway there,” but his 2006 rating (-2) isn’t all that bad. An interesting case, of course, is the normally-maligned Derek Jeter. According to Davenport’s numbers, after years of futility (-12 in 1999, -22 in 2000, -17 in 2001, -19 in 2002, -15 in 2003) Jeter improved to -4 in 2004 and became a good shortstop the past two seasons (+12 in 2005, +7 in 2006), and this change, attributed to having Alex Rodriguez next to him, is the main theme of Chapter 3-1 of "Baseball Between The Numbers".
As I will describe below, we have good evidence that Jeter’s defense has not improved, and, while I like most of what BP does, I don’t trust their fielding numbers for a second.

Second, if you are going to combine indices for the different skills involved in fielding, do not do so arbitrarily. The example here is Bill James. I admire what he attempted to do in his Win Shares book, but much of it is based on what seem to be arbitrary decisions that make no sense to me. To begin, pitching is given 67.5 percent of the credit for defense and fielding the remaining 32.5 percent; the reader is never told where these numbers come from. The division of this 32.5 across positions is performed according to criteria that the author himself admits to be arbitrary. Ratings for each position are made in the context of their different skills; here is the method for infielders:

Second base - forty points double plays, thirty points assists, twenty points error percentage, and ten points putouts.
Third base - fifty points assists, thirty points errors, ten points sacrifice hits allowed, and ten points double plays.
Shortstops - forty points assists, thirty points double plays, twenty points error percentage, and ten points putouts.

There is no indication of where these numbers come from: why double plays are the most important part of second base play, why putouts are irrelevant to third base and so low for the other positions; could this be a late recognition that I was correct more than fifteen years before the book was written about removing putouts from range factor? Unless and until we get a convincing rationale for these proportions, as with the BP work I don’t trust any of Bill’s ratings for a second.

What would I like to see? First and foremost, I would like to see all measurements of range and sure-handedness based on play-by-play data. Dewan has continued work in this regard with his Baseball Info Solutions; his book The Fielding Bible is a gold mine of valuable data on defense. I might add that Dewan’s work makes plain Jeter’s continued defensive shortcomings; in 2005, he ranked 31st in Dewan’s metric among 32 rated shortstops. David Pinto’s Probabilistic Model of Range and Mitchel Lichtman’s Ultimate Zone Rating are basically identical with Dewan’s work in this regard.

But more generally, I think it is possible to come up with a fielding metric that does a fairly good job of evaluating most aspects of fielding (catchers excluded) in the context of either bases or runs. Beginning with range and sure-handedness, turning a measure such as Dewan’s or Pinto’s into either a base or run measure should be easy; Lichtman already calculates Ultimate Zone Rating in run metrics. As for the other aspects of fielding: in Volume 10 Number 3 of SABR’s Baseball by the Numbers, Clem Comly proposed a nice method (Average Arm Equivalent Method, or ARM) for evaluating the number of runs outfielders either save or cost their team based on their number of assists relative to their number of opportunities to throw out baserunners. As ARM is already in a run metric, it would merely have to be summed with Ultimate Zone Rating for outfielders. We do need to come up with a good method for dividing up responsibility for the second out on double plays; are there any out there of which I am unaware? I know that Pinto has recently paid some attention to the double-play problem. I admit that the first baseman’s ability to turn errant throws into outs gets shortchanged here; I’m not sure whether any of the currently available play-by-play data provides enough detail for us to enter that into the equation. It may not be perfect, but contra Rickey/Roth there’s quite a bit we can do with fielding.

Labels: baseball, fielding, Pavitt

Saturday, July 07, 2007

"Uncertainty of Outcome" revisited

Among economists, there's a theory that competitive balance within a league will improve attendance. The argument is that, when historic games are rerun on ESPN Classic, ratings are near zero. The economists conclude that when fans know in advance who's going to win, they're not interested. So, by extension, when they "almost" know in advance who's going to win – such as a live contest between a strong team and a weak team – they should also be uninterested. Therefore, fans are the most interested in games between two equal teams.

I think they're jumping to conclusions. It may be that fans prefer competitive balance, but the "ESPN Classic" argument just doesn't make sense. There are lots of other, more likely, reasons that fans may prefer a live game to a rebroadcast game.

1. Fans care about a lot more than who wins. When watching a live game, anything can happen, and we might see something rare and exciting. It might be a perfect game, or a no-hitter. Someone may hit for the cycle. There could be spectacular catches, or managerial blunders. Our favorite player may go 4-for-4.

Why conclude that fans prefer uncertainty of who wins? Maybe they prefer uncertainty of whether there's going to be a no-hitter. No-hitters are more frequent when one team is really bad. By the same ESPN Classic logic, maybe fans prefer *less* competitive balance?

2. Fans are more likely to go to games featuring the best opponents – even when their home team is mediocre. In that case, they are favoring games where the visiting team is very likely to win. This, of course, is contrary to the thesis that fans prefer uncertain games. (See Guy's comment here, in my previous post on this subject.)

3. Even when a game is almost certain to go to the favored team, sometimes it doesn't. When that happens, the game is very exciting and satisfying for the fans. Isn't it possible that the thrill of the underdog win compensates for the fact that it won't win very often?

Maybe it only partially compensates, or maybe it even overcompensates. In any case there's got to be *some* effect there, unlike the ESPN Classic game, in which we know one team *never* wins.

4. How are rebroadcasts different from live games? One obvious answer is that we know who will win. But there are other aspects too. One of them is simply the fact that a live game is ... live.

Suppose you're watching a live game, and the phone rings. You put your TiVo on pause, and chat for a few minutes. Then you hang up the phone. Do you watch the part of the game you missed? I don't – I skip the parts I missed, and immediately go live. There's just something unsatisfying about watching the game delayed, when the rest of the world already knows what happened. Maybe that's just me, but I can't be the only person in the world who thinks that way. I'd argue that, in general, a live game is desirable for its own sake, and not just because you don't know what's going to happen.

5. The NHL Network often shows vintage games from the 60s and 70s. They don’t tell you in advance how they're going to come out, and you'd have to have a pretty good memory to remember every game over two decades. So when you watch these, you're uncertain about the outcome.

But they're still less exciting. Why? Not because the outcome is known (it isn't), but because the game isn't news. When the Devil Rays play the Red Sox live, there's an impact – the standings change, the players' stats change, and so on. When we see the Devil Rays play the Red Sox in a replay from 2001, it just doesn't matter, even if we don't remember what happened at the time.

6. Finally, if uncertainty of outcome is what's important, how come fans aren't consistent in when they care about it?

According to the sports economists, fans are significantly less interested in watching a game where one team has a 70% chance of winning than one where it's 50-50.

But suppose it's 5-3 in the top of the 9th, two outs, runners on first and third, and the go-ahead run coming to the plate. Pretty exciting, right? But, historically, the batting team has a very small chance of winning. Between 1974 and 1990, this situation happened 171 times, and the team at bat managed to win only 9 of those, or about 5 percent. (Data here.) That's not much uncertainty of outcome. But how many fans would turn off the TV in those circumstances?

Of course, this ninth inning isn't exactly the same as a full unbalanced game. For instance, waiting for the outcome will probably take only a couple of minutes here, rather than the three hours it takes to sit through a Yankees/Royals game. So there's less of an investment. And, the situation is indeed pretty exciting, where one swing of the bat might make the difference.

So let's take a less exciting example. Suppose two teams are exactly equal, but in the bottom of the first, the leadoff batter hits a home run. The home team now has a .685 chance of winning the game. In terms of uncertainty of outcome, that 1-0 situation is the equivalent of a last place team hosting a first place team.

If uncertainty of outcome is so important, should we expect TV ratings to drop after the home run? Should we expect that if fans knew, in advance, that it would be 1-0 with no outs in the bottom of the first, they would be less likely to buy a ticket?

Just doesn't seem plausible to me.
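To put the three situations from this item side by side, here is a small sketch that expresses each as the chance that the more likely side wins. The helper name `favorite_probability` and the dictionary labels are mine, for illustration only. The inputs are the figures quoted above (a 70% favorite, 9 wins in 171 tries, and .685).

```python
def favorite_probability(p_win):
    """Chance that the more likely side wins, given one side's win probability."""
    return max(p_win, 1.0 - p_win)

# Win probabilities taken from the figures quoted in the text
situations = {
    "unbalanced game, 70% favorite": 0.70,
    "trailing 5-3, two outs in the 9th (9 wins in 171)": 9 / 171,
    "home team after a leadoff homer in the 1st": 0.685,
}

for label, p in situations.items():
    print(f"{label}: favorite wins {favorite_probability(p):.3f}")
```

By this measure the ninth-inning situation is far more lopsided than the unbalanced game, and the 1-0 game is about as lopsided as the 70% matchup.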

Labels: baseball, competitive balance, economics

Monday, July 02, 2007

Are GMs overoptimistic in their signings?

Economist Steve Walters had an article recently on the Baltimore Orioles and "optimism bias," which you can read on the "Wages of Wins" blogsite.

Optimism Bias is the tendency of human beings, when considering a decision, to overweight the positive possibilities of the decision, while underweighting the negative ones. (Here's the Wikipedia definition.)

According to Walters, optimism bias is rampant in baseball:

"Economist John Burger and I have researched the baseball labor market, and (in a paper that will soon appear in the Southern Economics Journal) found that teams, on average, under-value consistency and over-value players who produce eye-catching but rare “big years,” resulting in considerable red ink."

This is very interesting stuff, and I'm excited about reading the paper when it comes out. For now, Walters argues that this bias is part of the reason the Orioles are in last place. He argues that the O's signed three relievers over the winter -- Danys Baez, Jamie Walker, and Chad Bradford -- and all three have had careers with lots of ups and downs. Orioles' management must have been seduced by these guys' good years, and ignored their bad years – optimism bias at work.

I'm puzzled by this. Admittedly, I don't follow the Orioles, but looking at their 2007 stats, it does seem that Walker and Bradford are doing pretty much what you'd have expected from their recent years. The third pitcher, Baez, is indeed a disappointment, but he's even worse than his worst year ever, so you can't argue that you should have been able to predict his collapse. And while his 2006 ERA was indeed much worse than his 2005, the difference in his basic stats appears to be only about six extra hits, which is partly compensated for by three fewer home runs.

My take is that while optimism bias might be a factor, you can't see it in these three signings. Unless there's something going on deeper than you can see in the stats ... as I said, I don't follow these guys much.

Labels: baseball, economics

NASCAR teams worth less than other sports teams

Forbes has estimated the values of NASCAR teams. They run cheaper than in other sports, averaging about 10x operating income. The NHL averages around 43, for instance.

I suspect they're cheaper because they're not as much fun to own. But I have no evidence or argument to support that.

My comments about team ownership in other sports are here.

Labels: economics, NASCAR, NHL, picasso
