Sunday, December 30, 2007

Are the NFL gambling lines consistent with each other?

According to this old Boston Globe article, Daryl Morey discovered that the Pythagorean Projection for the NFL should use the exponent 2.37. That means that from the Vegas betting line and the over/under, we should be able to come up with an estimate of the probability of winning the game.

For instance, last night, the Patriots were favored by 13.5 points over the Giants. And the over/under was 46.5 points. That means that the expected score, in a sense, was Patriots 30, Giants 16.5.

Using Pythagoras on that score, we get that New England should have had an 80.5% chance of winning the game.
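The arithmetic above is easy to sketch in a few lines of Python. This is just this post's example worked out (the function names are mine), not any official formula:

```python
# Convert a Vegas point spread and over/under into expected scores,
# then apply the Pythagorean projection with Morey's NFL exponent.

def expected_scores(spread, total):
    """Favorite's and underdog's expected points, from the betting line."""
    favorite = (total + spread) / 2
    underdog = (total - spread) / 2
    return favorite, underdog

def pythagorean(pf, pa, exponent=2.37):
    """Estimated win probability from expected points for/against."""
    return pf ** exponent / (pf ** exponent + pa ** exponent)

pats, giants = expected_scores(13.5, 46.5)
print(pats, giants)                        # 30.0 16.5
print(round(pythagorean(pats, giants), 3)) # 0.805
```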

But the market prediction was 88.0%. (Sorry, no link.)

So why the difference? I can think of a couple of possible reasons:

1. Pythagoras doesn't work well on such heavy favorites;

2. The distributions are not symmetrical, so even though the *median* score is 30-16.5, the *mean* score is something else;

3. The market for outright wins is less efficient than the point-spread market.

I'd bet #2 is the correct answer, that strategic differences (such as the leading team taking time off the clock instead of going for more points) make the comparison inaccurate. In any case, I doubt #3: if it were that easy to make money by betting the heavy underdog to win, someone would have noticed by now.

This NYT article says that as of last week, the Patriots were 1:8 favorites to win the Super Bowl. That can't be right – those would be the odds of winning one game against a mediocre team, not three straight against quality opponents. The betting is at about even odds at TradeSports.

Labels: , , , ,

Wednesday, December 26, 2007

Why does a more precise fielding system give less precise results?

I was puzzled by something in Dan Fox's latest post on the SFR fielding method (which, as described in my previous post, is similar to Sean Smith's "Total Zone" method). My puzzlement isn't about the method itself, but about how its results compare to the "plus/minus" system in "The Fielding Bible."

Dan got the "plus/minus" numbers from the 2008 Hardball Times (which I don't have yet – it's in the mail). What bothers me is that their numbers are so much more extreme than Dan's.

If you look at Dan's chart, his top three extremes (ignoring sign) are 62, 48, and 46. If you look at the THT extremes, though, they're bigger. The top three are 98, 81, and 68 plays; at 0.8 runs per play, that's 78, 65, and 54. The standard deviation of Dan's results is 28.6 runs; the SD of the Hardball Times' results is 33.5 runs.

The SDs aren't all *that* different – what bothers me is not the difference, but the fact that Dan's SD is smaller than THT's SD. It should be bigger.

That's because the more luck in the statistic, the larger the variance. As Tangotiger has written many times,

Variance of observations = Variance of talent + Variance of luck

We can break down the luck further. In the case of Dan's measure, there's luck in the sense of, here's a ground ball that the shortstop will get to 80% of the time, but this time the ball just gets past him, and, over the season, just by that kind of luck, he winds up at 78.5% instead of 80%. Call that "binomial" luck.

Then, there's luck involved in misclassifying balls. Dan docks the shortstop for a percentage of all ground-ball singles to center. Some of those were actually playable, and some weren't. Dan doesn't know, and so sometimes a shortstop will be assigned too few chances, or too many. Call that "misclassification luck."

So for Dan's measure, we get

Var(observations) = var(talent) + var(binomial luck) + var(misclassification luck)

But now, for THT's observations, there's no misclassification luck – every ball is observed precisely, and assigned exactly to the proper fielder. So for THT,

Var(observations) = var(talent) + var(binomial luck)

Comparing the two, and remembering that variance is always positive, it seems that it's Dan's results that should have more variance, not THT's results.

(If you don't like the formulas, here's a non-mathematical explanation. It's a fact of life that what you observe has a wider spread than what caused the observations. For instance, if you repeatedly toss a coin 10 times, sometimes you'll get 3 heads, sometimes 5, sometimes 7, and so on. But what *caused* this distribution is that the coin always has a "talent" of 5 heads – but sometimes it gets lucky. The observations are 3, 5, 7, but the talent is narrower -- 5, 5, 5.

The same thing happens for fielders. Fielders may range from, say, 75% to 80%. But, just like a coin may land more than 50% heads, a fielder may "land" more than 80% of ground balls. Indeed, with so many fielders, it's almost certain that at least a few will overshoot their talent and break the 80% mark – and so some fielders will show more than 80%, even though none of them is really good enough to average more than 80%. So the observations may be 70%-85%, but the talent is narrower: 75%-80%.

Now, suppose you add even more luck. For each player, you take a random number of plays and reverse their status, converting makeable plays to nonmakeable, or nonmakeable ones to makeable. What happens? You get an even wider spread. Because it's likely that one of the players at the top will be boosted even higher by a positive random number. If your three top players are at 85, 86, and 87%, and you move the three of them randomly by a few points, you're likely going to boost one of them past the 87% mark, and the spread will be even wider. So the observations might now be 68%-87%, but the talent is narrower: still 75%-80%.

In general: the more independent sources of randomness, the wider the spread of observations.)
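Here's a quick simulation of that idea. It's a minimal sketch: the talent range is the made-up 75%-80% from above, and the net effect of misclassified plays is crudely modeled as extra Gaussian noise, not an actual reclassification of individual balls:

```python
import random

random.seed(1)

# 1,000 fielders with true success rates between 75% and 80% (made-up
# numbers), each seeing 400 chances.
N, CHANCES = 1000, 400
talents = [random.uniform(0.75, 0.80) for _ in range(N)]

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Binomial luck only (the THT case: every ball classified correctly):
binom = [sum(random.random() < t for _ in range(CHANCES)) / CHANCES
         for t in talents]

# Add misclassification luck, crudely modeled as Gaussian noise worth
# roughly 20 misassigned plays out of 400:
misclass = [obs + random.gauss(0, 20 / CHANCES) for obs in binom]

# Each spread should be wider than the last:
print(sd(talents), sd(binom), sd(misclass))
```

Run it and the observed spread widens with each extra source of randomness, which is exactly why Dan's results "should" be more extreme than THT's.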

So what's going on? Is my logic wrong? Why are we seeing the reverse of what we expect? Any ideas? Because I'm stumped.

Labels: , ,

Sunday, December 23, 2007

A Retrosheet-based fielding evaluation method

UPDATE: After posting this as Dan Fox's method, Joe Arthur noted in the comments that Sean Smith previously introduced much the same method back in April. See Joe's link in the comments. As far as I can tell, it looks like Sean deserves credit for the method, and Dan for the improvements. I've changed the title, but the rest of the post remains the same.


Dan Fox has a significant new fielding evaluation method, which he explains over at Baseball Prospectus. (If that link is subscriber-only, try this post from Dan's own blog.)

How good your fielding stat is depends in large part on how much data you have. If you're only using "Baseball Encyclopedia" data, you've got range factor, or, more notably, Bill James' improved range-factor-type stat as explained in "Win Shares." If you've got full information about which "zone" of the field each ball was hit to, you can calculate a zone rating: the percentage of plays within reach of the defender that he actually turned into outs.

Fox's new stat is in the middle. It doesn't use full observational data, like Zone Rating, but it does use the play-by-play data found at Retrosheet. Specifically, it uses Retrosheet's rough description of the type of hit, and where on the diamond it went. I wish I had thought of it myself. I think it's got to be close to the best possible evaluation given the limit of publicly available data. Back in Win Shares, Bill James said, about his new method,

"this is the best sabermetric work I have done in the last ten years, maybe twenty ... I feel I have made some ... breakthroughs which will certainly lead, when other people see the work I have done and apply their own abilities to the issues I have raised, to even more and even larger advances."

My feeling is that Dan's method is the next advance above Bill's. I don't think Dan explicitly started with Bill's method; more likely he wound up where he did through other, more obvious paths. But, perhaps because of my experience with Bill's method (we adapted it for the eighth edition of Total Baseball), my first reaction was to see Dan's work as an extension of Bill's, so that's how I'll explain it.

In Win Shares, Bill noted that ordinary range factor (plays made per team game) suffers from a major flaw – which is, that the composite range factor for a team will always be around (27 minus strikeouts). No matter how bad a team's defense, it will keep getting chances to make plays until it's made 27 outs total. And no matter how good its defense, once it makes 27 outs, even 27 super-spectacular Ozzie-Smithian or Masafuri-Yamamorian outs, it has to stop playing.

So Bill's insight was that you have to adjust range factor to take into account plays *not* made. If team A has a "defensive efficiency ratio" of 68% (meaning it makes an out on 68% of the balls in play), but team B has a DER of 72%, then team B's players are actually making about 6% more plays than team A. You can't tell, just from the DER, which players are responsible for that extra 6%. But, as a first approximation, if you're comparing A's shortstop to B's shortstop, you can start by devaluing A's numbers by 6%. That may be wrong, of course. The team's 6% shortfall could be due to a bad second baseman, or left fielder, or first baseman, or some combination – the shortstop could be perfectly good, or even spectacular. But in the absence of any evidence either way, you assume the 6% reduction. In effect, A's man handled the same number of chances *per game* as B's man – but 6% fewer chances *per ball in play*. And the latter figure is what really matters when evaluating fielding.
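The adjustment can be sketched as follows. The play count is hypothetical, and this is an illustration of the idea, not Bill James's actual Win Shares formula:

```python
# Scale a fielder's raw play count by his team's defensive efficiency
# relative to the comparison team's, as described above.

def adjusted_plays(plays, team_der, baseline_der):
    """Devalue (or boost) raw plays by the team's DER shortfall."""
    return plays * (team_der / baseline_der)

# Team A's shortstop made 450 plays, but his team converted only 68%
# of balls in play versus team B's 72%:
print(adjusted_plays(450, 0.68, 0.72))   # 425.0 -- about 6% fewer effective plays
```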

What Dan Fox now adds to that is some information that tries to figure out how much of those 6% extra balls in play actually should be attributed to the shortstop. What he does is use the information Retrosheet provides – the type of hit (line drive, fly, ground), and who fielded the ball. Bill James used, as the denominator, all balls in play for the entire team. Dan improves the stat by limiting the denominator to balls hit near the defender.

For shortstops, Fox considers all the ground balls fielded by the shortstop or left fielder, and half the ground balls fielded by the center fielder; all the line drives fielded by the shortstop, all the line-drive hits fielded by the left fielder, and half the line-drive hits fielded by the center fielder. For all those balls, he figures, there's at least a chance the shortstop could have handled them. But for balls hit to the right fielder, there was no chance, so they shouldn't affect your evaluation of the shortstop. The Bill James method doesn't have a breakdown of where the balls went, so it has to assume that every fielder had a chance at every ball.

Put another way: whereas Bill James considered that the shortstop is exactly as good or bad as the DER for the team, Dan figures a specific DER for the shortstop by considering only balls in play that are roughly in his area. Also, he takes into consideration the fact that line drives are more likely to be unfieldable than ground balls, so a shortstop who sees a lot of line drive hits past him will have a better rating than one with the same number of hits, but more of those on ground balls.
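A shortstop-specific DER along those lines might look like this sketch. All the counts are hypothetical, and the function names are mine, not Dan's:

```python
# A rough sketch of the shortstop-specific denominator described above:
# ground balls fielded by SS, LF, and half of CF, plus line drives
# fielded by SS and line-drive hits to LF and half of those to CF.

def shortstop_chances(gb_ss, gb_lf, gb_cf, ld_ss, ld_hits_lf, ld_hits_cf):
    return (gb_ss + gb_lf + 0.5 * gb_cf) + (ld_ss + ld_hits_lf + 0.5 * ld_hits_cf)

def shortstop_der(plays_made, chances):
    return plays_made / chances

chances = shortstop_chances(380, 60, 40, 45, 30, 20)  # hypothetical counts
plays = 380 + 45        # balls the shortstop actually turned into outs
print(chances)                               # 545.0
print(round(shortstop_der(plays, chances), 3))
```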

Of course, the method is not perfect. For one thing (as Dan notes), a third baseman who covers a lot of ground to his left will make a shortstop look good. For another thing, not all ground balls handled by the left fielder were actually balls the shortstop could have got to. (However, in later revisions, Dan addresses this point by, for instance, tweaking his formula to assume any doubles to left were actually solely the third baseman's responsibility. There are other excellent tweaks too, such as updating the "50%" figure depending on the handedness of the batter.)

As an empirical test, there's a pretty good correlation between this method and the state-of-the-art, watch-where-every-ball-actually-goes, "Ultimate Zone Rating".

I really liked this system when I first read about it; now, with the revisions, I really, *really* like it. It's the best approach I've seen for players from the past. Do you want to know, for instance, how good a fielder Clete Boyer was, without needing access to any proprietary data? In my opinion, this is by far the best objective method to use.

Labels: ,

Saturday, December 22, 2007

New York Times article on steroids and performance

There's another statistical follow-up to the Mitchell Report, this one in today's New York Times. There, Jonathan R. Cole and Stephen M. Stigler – a sociologist and statistics professor, respectively – look at whether there is evidence that steroids improved players' performances.

For the players named by Mitchell, the authors looked at their career performance (ERA, HR, BA, SLG) before their steroid year, and their career performance after:

"After excluding those with insufficient information for a comparison, we were left with 48 batters and 23 pitchers. ... For pitchers there was no net gain in performance and, indeed, some loss. Of the 23, seven showed improvement after they supposedly began taking drugs (lower E.R.A.’s), but 16 showed deterioration (higher E.R.A.’s). ... Hitters didn’t fare much better. For the 48 batters we studied, the average change in home runs per year “before” and “after” was a decrease of 0.246. The average batting average decreased by 0.004. The average slugging percentage increased by 0.019 — only a marginal difference."
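For what it's worth, a simple sign test shows how unusual 7 improvements out of 23 would be if each pitcher were a coin flip. This is only a rough check; it ignores age and everything else:

```python
from math import comb

# One-sided binomial probability of 7 or fewer improvements out of 23,
# assuming each pitcher improves with probability 0.5.
n, k = 23, 7
p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
print(round(p_tail, 3))   # 0.047 -- borderline-unusual under the coin-flip model
```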

But the authors didn't adjust for age. It could be, as the authors conclude, that the steroids were ineffective. Or, it could be that the players who did take PEDs tended to be older, and the increased performance worked to offset age-related decline. For instance, the authors write,

"Roger Clemens is a case in point: a great pitcher before 1998, a great (if increasingly fragile) pitcher after he is supposed to have received treatment. But when we compared Clemens’s E.R.A. through 1997 with his E.R.A. from 1998 on, it was worse by 0.32 in the later period."

Of course, in 1998, Roger was 35 years old. For a pitcher to lose only 0.32 in ERA from age 35 to 44 ... well, that's remarkable. Adjusting for park -- which the article did not do -- would make the decrease worse. But the gradualness of the decline would still be impressive. I'd say that if Clemens took PEDs, it would be a reasonable presumption that they worked.

Also, what's with the "after excluding those with insufficient information?" If it's information on steroid use, that's one thing. But if the information is "insufficient" because the player dropped out of the major leagues or got injured, that's an important data point.

Finally, the authors say,

Our results run contrary to the prevailing wisdom. One reason might be that most baseball skills depend primarily upon reaction times and judgments, factors unaffected (or even degraded) by these drugs.

In that case, wouldn't steroids be of more help to those players whose skills depend on muscle strength, like power hitters and strikeout pitchers? Say, Barry Bonds or Roger Clemens?

HT: J.C. Bradbury, who is more impressed than I am with the article and its authors. Bradbury in turn links to a BTF post, which has a couple of good rebuttal points in its comments.

UPDATE: here's a better, more thorough, more sarcastic analysis.

UPDATE: Isn't it annoying how the New York Times insists on writing "E. R. A." instead of just "ERA" like the rest of the world does it? I saw one article where they kept referring to the "N. F. L.", except the TV channel was the "NFL Network." I'm sure they have their own internal logic in their style book about how "N. F. L." is actually an abbreviation, but "NFL Network" is its proper copyrighted name. But I don't care. They still wind up looking like a bunch of pedantic dorks.

Labels: ,

Wednesday, December 19, 2007

Milwaukee newspaper article on steroids and performance

This article, from the Milwaukee Journal-Sentinel, professes to be "the first statistical analysis of player performance for those named in the 409-page [Mitchell] report." But it's really just a bunch of anecdotes; there's little useful analysis there at all.

The authors start by telling us that of the ninety players named in the report, 33 "immediately improved in the first season [after starting to juice] compared with their career averages." So, what does that mean? Wouldn't you expect that, even without steroids, 45 of the 90 would improve, and the other 45 wouldn't? Are they trying to tell us that steroids actually *hurt* performance? Probably not, but we don't know for sure, because, having given us this statistic in the second paragraph, they never mention it again.

Then, they immediately tell us that 27 hitters and 19 pitchers "raised their statistical performances." That's 46 players, not 33. Why the difference? They don't say, but, based on an accompanying chart, it looks like the extra 13 players are those who improved in the second season, but not the first. And that 46 out of 90 is again very low. If seasons were random, then you'd expect 75-80% of subsequent two-year records to show at least one improvement, wouldn't you?
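The back-of-envelope expectations are easy to verify under the naive coin-flip assumption (p = 0.5 is an assumption here, not a measured improvement rate):

```python
# Expected improvers among the 90 named players, assuming each player
# improves in any given season with probability p, independently.
players = 90
p = 0.5

one_season = players * p                      # improve in the first season
two_seasons = players * (1 - (1 - p) ** 2)    # improve in at least one of two seasons
print(one_season, two_seasons)                # 45.0 67.5
```

So 33 improvers versus an expected 45, and 46 versus an expected 67.5 (75% of 90), are both well below the coin-flip baseline.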

Perhaps the difference in both cases is due to mostly *older* players turning to performance-enhancing drugs; older players tend to be on the decline. But without a control group, how do we know whether these numbers are actually different from the norm?

The authors go on to tell us that "thirty players used performance-enhancing drugs only in their last year or two in the big leagues or after they had slipped back to the minor leagues." I'd have phrased it a different way: that thirty players dropped out of the major leagues within two years after first starting to take PEDs. That would suggest that maybe the steroids didn't work, wouldn't it? The article says that "many of these players tried to hang on to the tail end of their careers," but, again, we don't have a control group. (And perhaps some of these players didn't know it was "the tail end of their careers" when they started on the drugs; it's not like baseball players have a best-before date tattooed on their body.)

Then, the article says that "eight of the 33 all-stars named in the report were selected only in seasons after they reportedly began using performance-enhancing drugs." Is that significant? Again, how many were selected only in seasons after they changed teams or got married? Maybe honeymoons are also performance-enhancing.

And, finally, the authors note that some players signed huge contracts after starting to use PEDs. They didn't mention that some players also signed huge contracts after NOT starting to use PEDs.

A real statistical analysis would take an unbiased prediction of the players' next season – Marcels, or PECOTAs – and see if those players outperformed. It would be interesting to find out if that was actually the case, and to what extent.

UPDATE: J.C. Bradbury links to a similar comment today.

Labels: ,

Thursday, December 13, 2007

Stats vs. scouting: a thought experiment

I was thinking about the Moneyball debate, about traditional scouting vs. statistical analysis. Here's a thought experiment I came up with.

Suppose you take the 25 best scouts today, and put them in suspended animation for 40 years. Then, you wake them up. You ask them to evaluate the major-league first basemen of 2047. Of course, none of the scouts know anything about the players, who weren't even born when the scouts went to sleep in 2007.

The scouts get to watch the players hit. You don't want them to evaluate the players by keeping track of their stats, so you make sure all the stats work out the same. To do that, you show them only 300 PA for every player. You pick those plate appearances by making sure to include exactly 80 hits, 10 home runs, 14 doubles, and so on. (The exact PA in each category are picked randomly.) The scouts can watch those plate appearances as many times as they want. The technology of 2047 lets them see everything holographically, in 3D, from any angle. They can even use radar guns if they like. (Indeed, since this is a thought experiment, assume any additional technology you want.)

You then ask the scouts to rank the 30 players by how well they'll do next year. Would they do a decent job?

I'm probably less qualified than most readers of this post to guess at this question, but I'll try anyway. I'd bet that the scouts wouldn't do very well. I'd bet that an Albert Pujols single doesn't look that much different from a Kevin Youkilis single. However, I think the scouts might figure out who has power by looking at home run distances, and who walks a lot by noting plate discipline and the ability to lay off pitches. They'd also see who has good speed.

Now suppose you froze 25 sabermetricians. To this group, instead of showing them holographic replays of plate appearances, you were to show them only the players' stats. Would they do better than the scouts? I think it's almost certain they would. The sabermetricians would have the stats for the player's whole career in front of them. The traditional scouts wouldn't have that. They might know a few small things the stats group doesn't – plate discipline, for instance – but unless they counted, their impressions would be off a bit, over 300 PA times 30 players. But the sabermetricians would know a LOT more than the scouts -- batting average, home runs, walks, and so on.

And suppose that you *included* statistics for all these things for the sabermetricians – speed, pitch counts, home run distances, line drive frequency, average pitch speed against, and so on. In fact, let the sabermetricians have any stats they want (within reason). Would there be anything left for the scouts? Only things that can't be measured. What are those things? Subjective impressions of personality and drive to win? Leadership? Certain aspects of body type? Are those really enough to measure up against all that data?

Doesn't it seem like a copy of the 2047 Baseball Prospectus and 2047 Baseball Forecaster should beat the crap out of a bunch of scouts who aren't allowed to count things?

Before this thought experiment, I felt like traditional scouting was of substantial value – although not as important as the statistical record. But now, it seems to me that hard data would trump live scouting in almost every case.

Here's an experiment you could do right now, to check that. Find your top 25 scouts right now, and ask them: you've seen a lot of current major-league players live this year. For which players have you seen live indications that suggest the player's prospects are better or worse than what his statistical record suggests? Maybe you've seen something like, "hey, Joe Blow normally hits .320, but he's weak on curve balls on the outside corner, and once pitchers catch on, he'll only hit .270." Or, maybe, "you know, these five guys have had stats very similar to those five guys. But these five have drive and leadership, and are going to make themselves into better players. Those other five just coast through the season, and they're going to be washed up before too long."

That is: ask scouts to make testable predictions that are based only on observations of things that can't be measured by sabermetricians.

Can any scouts reliably make successful predictions like that? If they can, that would be evidence that scouting is valuable, much more valuable than I think it is. If not, though, isn't that itself evidence that traditional scouting only has value because there isn't enough good data?

It seems to me that scouting is a *substitute* for data, and an inferior one. For those who think it's a *complement* to data, my view is that you have to show me where the benefit is.

P.S. As Tango points out, scouts sometimes add value by noticing trends that statistical analysts can verify. In that case, you can argue that they're really doing sabermetrics ...

Labels: , ,

Tuesday, December 11, 2007

Zimbalist reviews Bradbury, Bradbury responds

Famed sports economist Andrew Zimbalist has posted a review of two recent baseball economics books: Vince Gennaro's "Diamond Dollars," and J. C. Bradbury's "The Baseball Economist." The Bradbury review is of more interest to me, since I haven't finished Gennaro's book. Plus, Bradbury has responded to Zimbalist on his website.

Here's Zimbalist's review; here's Bradbury's response.

It seems to me that Zimbalist makes some good points, but also some very questionable ones. Bradbury defends himself well.

And my sympathy is with Bradbury, because Zimbalist doesn't know much about sabermetrics, and, for that reason, some of his criticisms are badly miscast. On various points, it's clear that Zimbalist is oblivious to most of the sabermetric progress of the past thirty years or so.

Take clutch hitting, for example, the subject that has arguably been debated the most of any controversy in the field. It seems like Zimbalist has seen none of that work. When Bradbury argues that clutch hitting talent is an illusion, Zimbalist responds

"There you have it – there is no such thing as clutch hitting. This is an awfully linear, materialist view of the world where a player's emotions and his state of physical depletion over a 162-game season play no role."

This kind of intuitivist, naive response is what you expect from sportswriters, not from people who study sports economics for a living. But Zimbalist seems unaware of any of the body of literature on this question.

Here's another one. Bradbury talks about OPS, and how a better version of the formula would give more weight to OBP, perhaps by a factor of 3 (as Michael Lewis quotes Paul DePodesta in Moneyball). Zimbalist writes,

"... SLG ... is a much higher number than OBP. The coefficient, therefore, will necessarily be smaller on SLG. If elasticity is used instead of the estimated coefficient, OBP is 1.8 times greater than SLG."

Here, Zimbalist is trying to criticize Bradbury for the way he casts the question, arguing that the coefficient is an inappropriate measurement here. But if he were familiar with Moneyball, or the debate on OPS, he'd have known that the question DOES refer to the coefficients, and, yes, we are indeed aware that SLG is a higher number. Again, it's apparent from his comments that he has no idea there's already an extensive literature on the subject.

And here's one more:

"[Bradbury argues] that a pitcher’s ERA from one year to the next is highly variable, but that a pitcher’s walks, strikes and home runs allowed are more stable over time. The inference is that ERA depends more on outside factors, such as a team’s fielding prowess, and, hence, is a poor measure of the inherent skills of a pitcher. While there is something compelling to this logic, it seems caution is in order. First, a pitcher’s skills may actually vary from year to year, along with his ERA, as other factors change, such as, his ballpark, his pitching coach, his bullpen, his team’s offense, the angle of his arm slot, his confidence level, etc. This variability does not mean that the skill is spurious. Second, if all we consider is strikeouts, walks and home runs, what are we saying about sinkerball pitchers who induce groundballs or pitchers who throw fastballs with movement or offspeed pitches that induce weak swings and popups?"

That last sentence, about pitchers inducing weaker balls in play ... well, what we are saying about it is the DIPS theory. And that chapter of Bradbury's book does include an extensive discussion on DIPS ... if Zimbalist did indeed read it, you can't tell by his argument here. Bradbury rips into Zimbalist for this, with a lot more restraint than I would. Also – and this Bradbury does not mention – it's not just "outside factors such as a team's fielding prowess" that make ERA unreliable. It's mostly just luck – whether the hits, walks, and home runs are bunched together or not. I'm sure Bradbury, or any one of countless bloggers and writers in the field, could have told Zimbalist this.

Anyway, as I said, Bradbury defends himself against Zimbalist quite well. For my part, though, I have to say that it's disappointing that Zimbalist, who is so respected in the realm of sports economics, would know so little about sabermetrics. After all, sabermetrics is an established scientific discipline, and one that quite substantially impacts his own. Moreover, Zimbalist seems unaware that he is unaware. You'd expect a reviewer to be well-versed in the subject he's reviewing, but that doesn't seem to be the case here.

Zimbalist's review has been published in the "Journal of Economic Literature," an academic publication. This is unfortunate. I don't know much about how things work in academia, but it does seem that Bradbury's reputation will take an unfair hit -- at least on these sabermetric points, on which Zimbalist's less-than-fully-informed criticisms are way off the mark.

Labels: , , ,

August, 2007 issue of "By the Numbers" released

The latest issue of "By the Numbers," the SABR Statistical Analysis newsletter I edit, is now available (.pdf).


Monday, December 10, 2007

"Super Crunchers" -- Bill James and oenometrics

A few minutes ago, I started reading "Super Crunchers," by Ian Ayres. It's a book about how sabermetrics is better than intuition for finding things out. Of course, Ayres doesn’t use the word "sabermetrics," but that's what he means.

I'm only a few pages in, but already there's a great example. Economist Orley Ashenfelter looked at wine ratings, and figured out that the quality of a given year's vintage can be easily predicted from rainfall data. He produced this formula, which appears to have been derived from a simple regression:

Wine Quality = 12.145 + .00117 (winter rainfall) + .0614 (average growing-season temperature) - .00386 (harvest rainfall)
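The regression is easy to express as a function. The coefficients are the ones quoted above; the input units aren't given in the text, so the sample values below are purely illustrative:

```python
# Ashenfelter's wine-quality regression as a plain function.

def wine_quality(winter_rain, growing_season, harvest_rain):
    return (12.145
            + 0.00117 * winter_rain
            + 0.0614 * growing_season
            - 0.00386 * harvest_rain)

# The signs tell the story: more harvest rain lowers predicted quality,
# a better growing season raises it.
print(wine_quality(600, 17.0, 100) > wine_quality(600, 17.0, 200))  # True
print(wine_quality(600, 18.0, 100) > wine_quality(600, 17.0, 100))  # True
```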

The reaction from traditional wine experts was not surprising. Robert Parker, perhaps the world's most famous wine critic, said the method was "so absurd as to be laughable." A wine magazine said "the formula's self-evident silliness invites disrespect."

However, it worked. Ashenfelter used the formula to predict that the 1989 vintage would be exceptional, and that 1990 would be even better. He turned out to be right.

That's the first six pages. On page 7, Ayres notes that the technique of wine sabermetrics can also be applied to sports. He calls Bill James the "Orley Ashenfelter of baseball," and talks a bit about Runs Created and Moneyball.

So now I'm nine pages into the book, and it looks very promising.

P.S. Appropriately, I think I remember a Joe Morgan baseball card from the '70s that said Joe is a wine connoisseur ...

Labels: , ,

Predicting the Heisman the old-fashioned way

This article, from Carl Bialik, the Wall Street Journal "Numbers Guy," profiles a man who's come up with a way of predicting the Heisman Trophy winner.

Alas, Keri Chisholm's method is not a sabermetric one – he just surveys the voters. I was hoping for something analytic, like Rob Wood's method for predicting baseball MVPs.

Labels: ,

Monday, December 03, 2007

Does pace impact defensive efficiency? Don't use r-squared, use the regression equation

When someone runs a regression, they will wind up reporting a value for r or r-squared. If the value is small, they'll argue that the two variables don't really have much of a relationship. But that's not necessarily true.

Before I talk about why, I should say that if the result goes the other way – the r or r-squared is high, and statistically significant -- that *does* mean there's a strong relationship. If the correlation between cigarettes smoked and lung cancer is, say, 0.7, that's a pretty big number, and we can conclude that lung cancer and smoking are strongly related.

But a low value doesn't necessarily mean the opposite.

For instance: inspired by the smoking example, I looked at another lifestyle choice. Then, I ran a regression on expected remaining lifespan, based on that lifestyle choice. The results:

r = -.17, r-squared = .03

What is the effect of that lifestyle choice on lifespan? It looks like it should be small. After all, it "explains" only 3% of the variance in years lived.

But that wouldn't be correct. The lifestyle choice really does have a large effect on lifespan. In fact, the lifestyle choice I used in the equation is (literally) suicide.

Here's what I did. I took 999 random 40-year-olds, and assumed their expected remaining lifespan was 40 years, with an SD of about 8. Then, I assumed the 1,000th person jumped in front of a moving subway train, with an expected remaining lifespan of zero. (These numbers are made up, by the way.)

The results were what I showed above: an r of only –.17.
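Here's a minimal Python sketch of that experiment. The random seed and the by-hand regression arithmetic are mine; the exact numbers will differ from run to run, but the slope comes out near –40 while the r stays small:

```python
import random

random.seed(1)

# 999 random 40-year-olds: expected remaining lifespan 40 years, SD 8
lifespans = [random.gauss(40, 8) for _ in range(999)]
suicide = [0] * 999

# the 1,000th person commits suicide: remaining lifespan zero
lifespans.append(0.0)
suicide.append(1)

# simple linear regression of lifespan on the 0/1 suicide indicator
n = len(suicide)
mean_x = sum(suicide) / n
mean_y = sum(lifespans) / n
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(suicide, lifespans)) / n
var_x = sum((x - mean_x) ** 2 for x in suicide) / n
var_y = sum((y - mean_y) ** 2 for y in lifespans) / n

slope = cov / var_x
intercept = mean_y - slope * mean_x
r = cov / (var_x ** 0.5 * var_y ** 0.5)

print(f"lifespan = {intercept:.1f} + ({slope:.1f}) * suicide")
print(f"r = {r:.2f}, r-squared = {r * r:.3f}")
```

The slope tells you suicide costs about 40 years of life; the r, dragged down by the 999 non-suicides, tells you almost nothing about that.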

Why does this happen? It happens because the r, and the r-squared, do NOT measure whether suicide and lifespan are related. Rather, they measure something subtly different: whether suicide is a big factor in how long people live.

And suicide is NOT that big a factor in how long people live. Most people don't commit suicide; in my model, only 1 in 1000. The r-squared shows how much effect suicide has *as a percentage of all other factors*. Because there are so many other factors – in real life, heart disease and cancer are about 40 times as common as suicide -- the r-squared comes out small.

If you want to know the strength of the relationship between A and B, don't look at the r or the r-squared. Instead, look at the regression equation. In my suicide experiment, the equation turned out to be

Lifespan = 40.0 – 40.0 (suicide)

That is, exactly what you would expect: the lifespan is 40 years, but subtract 40 from that (giving zero) if you commit suicide.

And, even though the r-squared was only 0.03, that r-squared is statistically significant, at greater than 99.99%.

Again: the r-squared is heavily dependent on how "frequent" the lifestyle choice is in the population. But the significance level, and the regression equation, are not.

To prove it, let me rerun my experiment a few times, with different percentages of suicide in the population:

1 in 10,000: r-squared = .003
lifespan = 40.1 – 40.1 (suicide); p > 99.99%

1 in 1,000: r-squared = .03
lifespan = 40.0 – 40.0 (suicide); p > 99.99%

1 in 100: r-squared = .26
lifespan = 40.3 – 40.3 (suicide); p > 99.99%

1 in 10: r-squared = .71
lifespan = 37.3 – 37.3 (suicide); p > 99.7%

The r-squared varies a lot – but all the experiments tell you that suicide costs you 40 years of life, and that the result is statistically significant.

The moral of the story:

The r-squared (or r) does NOT tell you the extent to which A causes B, or even the strength of the relationship between A and B. It tells you the extent to which A explains B relative to all the other explanations of B.

If you want to quantify the effect a change in A has on B, do not look at the r or r-squared. Instead, look at the regression equation.


Which brings us to today's post at "The Wages of Wins." There, David Berri checks whether teams who play a fast-paced brand of basketball (as measured by possessions per game) wind up playing worse defense (as measured by points allowed per possession) because of it. Berri quotes Matthew Yglesias:

"For example, there’s a popular conception of a link between pace and defensive orientation — specifically the idea that teams that choose to play at a fast pace are sacrificing something in the defense department. On the most naive level, that’s simply because a high pace leads to more points being given up. But I think it’s generally assumed that it holds up in efficiency terms as well. The 2006-2007 Phoenix Suns, for example, were first in offensive efficiency, third in pace, and fourteenth in defense. But is this really true? If you look at the data season-by-season is there a correlation between pace and defense?"

Berri runs a regression for 34 years of team data. So, is there a relationship? He writes,

"The correlation coefficient between relative possessions and defensive efficiency is 0.17. Regressing defensive efficiency on relative possession reveals that there is a statistically significant relationship. The more possessions a team has per game - again, relative to the league average - the more points the team’s opponents will score per possession. But relative possessions only explains 2.8% of defensive efficiency. In sum, pace doesn’t tell us much about defensive efficiency ... " [emphasis mine]

But I don't think that's right. As we saw, the r-squared of 2.8% (or the r of 0.17) means only that, historically, pace is small *compared to other explanations of defensive efficiency.* And that makes sense. Even if pace has a significant impact on defense, we'd expect other factors to be even more important. The players on the team, for instance, are a big factor. Luck is also a big factor. The coach's strategy probably has a large impact on defensive efficiency. Compared to all those things, pace is pretty minor. And we probably knew before we started that personnel matters more than pace.

And so I would guess that's not really what Yglesias wants to know. What I bet he's interested in, and what teams would be interested in, and what I'd be interested in, is this: if a team speeds up the pace by (say) 2 possessions per team per game, how much will its defense suffer? That's an important question: if you're evaluating how good a team (or player) is on defense, you want to know if you can take the stats at face value, or if you have to bump them up to compensate for fast play, or if you have to discount them for teams who play a little slower. It's like a park factor, but for defensive efficiency. The regression should be able to tell you just how big that park factor is.
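To make the "park factor" idea concrete: if Berri had published the regression equation, the adjustment would be a one-liner. Every number below is made up for illustration – the slope in particular is hypothetical, since the post we're discussing doesn't report it:

```python
# Hypothetical pace adjustment, assuming a regression of the form
#   defensive_efficiency = intercept + slope * relative_pace
# where defensive efficiency is points allowed per possession and
# relative pace is possessions per game above league average.

slope = 0.002          # hypothetical: extra points allowed per possession,
                       # per extra possession of pace above average
relative_pace = 2.0    # team plays 2 possessions per game faster than average
raw_def_eff = 1.060    # observed points allowed per possession

# subtract the estimated pace effect to get a pace-adjusted figure
adjusted = raw_def_eff - slope * relative_pace
print(f"{adjusted:.3f}")  # prints 1.056
```

With the real slope in hand, you could tell at a glance whether a fast team's raw defensive numbers need a meaningful correction or a trivial one.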

That's the real question, and the r-squared doesn't answer it at all. Given the data Berri gives us, the effect of pace on defensive efficiency could be small, or it could be large. After all, the effect of suicide on lifespan was huge, even though the r-squared was small. And just like in the suicide case, if a lot more teams suddenly decide to start playing at a different pace, the r and r-squared will go up – but the relationship between pace and defense will likely not change.

To really understand what effect pace has on defense, we need the regression equation. Berri doesn't give it to us. He does tell us the result is statistically significant, so we do know there *is* some kind of non-zero effect. But without the equation, we don't know how big it is (or even whether it's positive or negative). All we know is that pace *does* significantly impact a team's defensive stats, and that the effect (as judged by statistical significance) appears to be real.

Labels: ,