Tuesday, March 29, 2011

"Sabermetrics" and "Analytics"

What is sabermetrics?

We sabermetricians think we know what it means ... one definition is that it's the scientific search for understanding about baseball through its statistics. But, like a lot of things, it's something that's more understood in practice than by strict definition. I think a few years ago Bill James quoted Potter Stewart: "I know it when I see it."

But how we see it seems to be different from how the rest of the world sees it. The recent book "Scorecasting" is full of sabermetrics, isn't it? There are studies on how umpires call more strikes in different situations, on how hitters bat when nearing .300 at the end of the season, on how hitters aren't really streaky even though conventional wisdom says they are, and on how lucky the Cubs have been throughout their history.

So why isn't "Scorecasting" considered a book on sabermetrics? It should be, shouldn't it? None of the reviews I've seen have called it that. The authors don't describe themselves as sabermetricians either. In fact, on page 120, they say,

"Baseball researchers and Sabermetricians have been busily gathering and applying the [Pitch f/x] data to answer all sorts of intriguing questions."


That suggests that they think sabermetricians are somehow different from "baseball researchers".

Consider, also, the "MIT Sloan Sports Analytics Conference," which is about applying sabermetrics to sports management. But, no mention of "sabermetrics" there either -- just "analytics".

What's "analytics"? It's a business term, about using data to inform management decisions. The implication seems to be that the sabermetrician nerds work to provide the data, and then the executives analyze that data to decide whom to draft.

But, really, that's not what's going on at all. The executives make the decisions, sure, but it's the sabermetricians who do the actual analysis. Sabermetrics isn't the field of creating the data, it's the field of scientifically *analyzing* the data in order to produce valid scientific knowledge, both general and specific.

For instance, here's a question a GM needs to consider. How much is free agent batter X worth?

Well, towards that question, sabermetricians have:

-- come up with methods to turn raw batter statistics into runs
-- come up with methods to turn runs into wins
-- come up with methods to estimate future production from past production
-- come up with methods to quantify player defense, based on observation and statistical data
-- come up with methods to compare players at different positions
-- come up with methods to estimate the financial value teams place on wins. (A rough sketch of the first couple of those steps follows the list.)
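The toy Python below walks that chain from a stat line to a dollar figure. The linear weights, the ten-runs-per-win rule of thumb, and the dollars-per-win figure are round numbers I've plugged in for illustration, not anyone's published estimates:

```python
# A toy version of the valuation chain: stats -> runs -> wins -> dollars.
# The weights and the dollars-per-win figure are illustrative round numbers,
# not anyone's published estimates.

LINEAR_WEIGHTS = {"1B": 0.47, "2B": 0.78, "3B": 1.09, "HR": 1.40, "BB": 0.33, "OUT": -0.27}

def batting_runs(stats):
    """Convert a stat line into runs above average using linear weights."""
    return sum(LINEAR_WEIGHTS[event] * count for event, count in stats.items())

def runs_to_wins(runs):
    """Rule of thumb: roughly ten runs equal one win."""
    return runs / 10.0

def wins_to_dollars(wins, dollars_per_win=5_000_000):
    """Assume the market pays a flat (hypothetical) rate per marginal win."""
    return wins * dollars_per_win

season = {"1B": 110, "2B": 30, "3B": 3, "HR": 25, "BB": 60, "OUT": 400}
runs = batting_runs(season)
wins = runs_to_wins(runs)
print(f"{runs:+.1f} runs above average, {wins:+.2f} wins, worth about ${wins_to_dollars(wins):,.0f}")
```

The real methods are far more careful about park effects, aging, defense, and replacement level, but the skeleton of the analysis looks like this.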

But isn't that also what "analytics" is supposed to do? I don't understand how the two are different. I suppose you could say, the sabermetricians figure out that the best estimate for batter X's value next year will be, say, $10 million a season. And then the analytics guy says, "well, after applying my MBA skills to that, and analyzing the $10 million estimate the sabermetricians have provided, I conclude that the data suggest we offer the guy no more than $10 million."

I don't think that's what the MIT Sloan School of Management has in mind.

Really, it looks like everyone who does sabermetrics knows that they're doing sabermetrics, but they just don't want to call it sabermetrics.

Why not? It's a question of signalling and status. Sabermetrics is a funny, made-up, geeky word, with the flavor of nerds working out of their mother's basements. Serious people, like those who run sports teams, or publish papers in learned journals, are far too accomplished to want to be associated with sabermetrics.

And so, an economist might publish a paper with ten pages of analysis of sports statistics, and three paragraphs evaluating the findings in the light of economic theory. Still, even though that paper is sabermetrics, it's not called sabermetrics. It's called economics.

A psychologist might analyze relay teams' swim times, discover that swimmers on the later legs beat their individual times while the leadoff swimmer doesn't, and conclude it's because of group dynamics. Even though the analysis is pure sabermetrics, the paper isn't called sabermetrics. It's called psychology.

A new MBA might get hired by a major-league team to find ways to better evaluate draft prospects. Even though that's pure sabermetrics, it's not called sabermetrics. It's called "analytics," or "quantitative analysis."

I think that word, sabermetrics, is costing us a lot of credibility. My perception is that "sabermetrics" has (incorrectly) come to be considered the lower-level, undisciplined, number crunching stuff, while "analytics" and "sports economics" have (incorrectly) come to symbolize the serious, learned, credible side. If you looked at real life, you might come to the conclusion that the opposite is true.

My perception is that there isn't a lot of enthusiasm for the word "sabermetrics." Most of the most hardcore sabermetric websites -- Baseball Analysts, The Hardball Times, Inside The Book, Baseball Prospectus -- don't use the word a whole lot. Even Bill James, who coined the word, has said he doesn't like it. From 1982 to 1989, Bill James produced and edited a sabermetrics journal. He didn't call it "The Sabermetrician." He called it "The Baseball Analyst." It was a great name. About ten years ago, I suggested resurrecting that name for SABR's publication, to replace "By the Numbers" (.pdf, see page 1). I was voted down (.pdf, page 3). I probably should have tried harder.

In light of all that, I wonder if we should consider slowly moving to accept MIT's word and start calling our field "analytics."

It's a good word. We've got a historical precedent for using it. It will help correct misunderstandings of what it is we do. And it'll put us on equal footing with the MIT presenters and the JQAS academics and the authors of books of statistical analysis -- all of whom already do pretty much exactly what we do, just under a different name.



Saturday, March 26, 2011

The swimming psychology paper

In the previous post, I wrote about a paper (gated) that showed an anomaly in team swimming relays. It turned out that the first swimmer's times were roughly equivalent to his times in individual events -- but the second through fourth swimmers had relay times that were significantly faster than their individual times.

The paper concluded that this happens because people put more effort into group tasks than individual tasks. They do this because other people are depending on their contributions. However, the leadoff swimmer's time is seen to be less important to the team's finish than the other three swimmers' times, and that explains why swimmers 2-4 are more motivated to do better in the team context.

My point was not really to criticize that individual paper, but to make a broader point -- that just because the results are *consistent* with your hypothesis, doesn't necessarily mean that's what's causing them. In this case, I agreed with an anonymous e-mailer, who speculated that it might have to do with reaction times. The first swimmer starts by a gun, while the other swimmers start by watching the preceding swimmer touch the wall. I said that I didn't know whether the authors of the paper considered this or other possibilities.

Commenter David Barry kindly sent me a copy of the study, and it turns out the authors *did* consider that:


"We corrected both performance times for the swimmer's respective reaction time by subtracting the time the athlete spent on the starting block after the starting signal (also retrieved from [swimrankings.net]). ... Please note, however, that previous research did not find any differences between the individual and relay competition after a swimming distance of 10m. Thus, faster swimming times for relay swimmers are unlikely to be merely due to differences in the starting procedure."


Fair enough. But ... well, the effect is so strong that I'm still skeptical. Could it really be that swimmers, who have trained their entire lives for this one Olympic individual moment, are still sufficiently unmotivated that they can give so much more to their relay efforts?

Here are the results for the four relay positions (times are an average of the 100m and 200m events; a positive difference means the relay leg was faster):

#1: individual 78.19, team 78.38. Diff: -0.19
#2: individual 87.30, team 86.92. Diff: +0.38
#3: individual 87.73, team 87.39. Diff: +0.34
#4: individual 87.40, team 86.66. Diff: +0.74

It seems to me that the 2-4 differences are *huge*. Are the #4 individual swimmers so blasé about the Olympics that they swim almost three-quarters of a second slower than they could if they were just more motivated? My gut says: no way.

One thing I wonder, following Damon Rutherford's comment in the previous post: could it be that correcting for the swimmer's reaction time to the starting gun isn't enough? Mr. Rutherford points out that the first swimmer's reaction time measures how long it takes him to *start moving* after the gun, while subsequent swimmers are already well into their diving motions when the previous swimmer touches the wall. That could explain the large discrepancies, if the reaction time correction only compensates for part of the difference.

Is there anyone who knows swimming and is able to comment?

Oh, and there's one more issue with the differences, and that's a selective sampling issue. The authors write,

"We focused our analysis on the data from the semi-finals to obtain a reasonable sample size. If a swimmer did not advance to the semi-finals in the individual competition, we included his/her performance time from the first heats."


That means the individual times are going to be skewed slow: if the swimmer did poorly in the heats, his unsuccessful time is included in the sample. But if the swimmer did *well* in the heats, his successful result is thrown away in favor of his semi-final time.

That would certainly account for some of the differences observed.
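Just to check the direction of that effect, here's a quick simulation sketch. The 0.4-second race-to-race spread and the advancement rule are invented; the only point is that keeping the bad heats while replacing the good ones skews the "individual" sample slow:

```python
import random

random.seed(1)

TRUE_TIME = 87.0   # swimmer's underlying ability (seconds)
NOISE = 0.4        # race-to-race standard deviation (made-up number)
TRIALS = 200_000

def swim():
    return random.gauss(TRUE_TIME, NOISE)

individual, relay = [], []
for _ in range(TRIALS):
    heat = swim()
    # Pretend a heat faster than the swimmer's true ability is good enough to
    # advance; then a fresh semi-final swim replaces the heat in the
    # "individual" sample. A slow heat stays in the sample.
    recorded = swim() if heat < TRUE_TIME else heat
    individual.append(recorded)
    relay.append(swim())   # relay split: just another unconditional swim

avg = lambda xs: sum(xs) / len(xs)
print(f"individual sample: {avg(individual):.3f}")
print(f"relay sample:      {avg(relay):.3f}")
```

Under these made-up numbers the "individual" average comes out a bit slower than the unconditional average, which is the direction of the bias I'm worried about -- though how much of the observed 0.3 to 0.7 seconds it explains, I can't say.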



Sunday, March 20, 2011

Psychology should be your last resort

Bill James, from a 2006 article:

"... in order to show that something is a psychological effect, you need to show that it is a psychological effect -- not merely that it isn't something else. Which people still don't get. They look at things as logically as they can, and, not seeing any other difference between A and B conclude that the difference between them is psychology."


Bill wrote something similar in one of the old Abstracts, too. At the time, I thought it referred to things like clutch hitting, and clubhouse chemistry, where people would just say "psychology" as (in James' words) a substitute for "witchcraft." It was kind of a shortcut for "I don't know what's going on."

Today, it's a little more sophisticated. They don't say "psychology" just like that, as if that one word answers the question. Now, they do a little justification. Here's a recent sports study, described in the New York Times by David Brooks:

"Joachim Huffmeier and Guido Hertel ... studied relay swim teams in the 2008 Summer Olympics. They found that swimmers on the first legs of a relay did about as well as they did when swimming in individual events. Swimmers on the later legs outperformed their individual event times."


Interesting! Why do you think this happens? The authors, of course, say it's psychology. But they have an explanation:

"in the heat of a competition, it seems, later swimmers feel indispensible to their team’s success and are more motivated than when swimming just for themselves."


OK ... but what's the evidence?

"A large body of research suggests it’s best to motivate groups, not individuals. [Other researchers] compared compensation schemes in different manufacturing settings and found that group incentive pay and hourly pay motivate workers more effectively than individual incentive pay."


Well, that paragraph actually makes sense, and I have no objection to the finding that group pressure is a good motivator. Still, that doesn't constitute evidence that that's what's going on in the swimming case. Yes, it shows that the results are *consistent* with the hypothesis, but that's all it shows.

You can easily come up with a similar argument in which the same logic would be obviously ridiculous. Try this:

I've done some research, and I've found that a lot of runs were scored in Colorado's Coors Field in the 1990s -- more than in any other National League ballpark. Why? It's because the Rockies led the league in attendance.

How do I know that? Because there's a large body of research that shows that people are less likely to slack off when lots of other people are watching them. Since Coors Field had so many observers, batters on both teams were motivated to concentrate harder, and so more runs were scored.


See?

The point is that, as Bill James points out, it's very, very hard to prove psychology is the cause, when there are so many other possible causes that you haven't looked at. When the Brooks article came out, someone e-mailed me saying, couldn't it be that later swimmers do better "because they can see their teammate approach the wall, and time their dive, as opposed to reacting to starter's gun"? Well, yes, that would explain it perfectly, and it's very plausible. Indeed, it's a lot better than the psychology theory. Because, why would later swimmers feel more indispensable to their team's success than the first swimmer? Does the second guy really get that much more credit than the first guy?


In fairness, I haven't read the original paper, so I don't know if the authors took any of these arguments into account. They might have. But even so, couldn't there be other factors? Just off the top of my head: maybe in later legs, the swimmers are more likely to be spread farther apart from each other, which creates a difference in the current, which makes everyone faster. Or maybe in later legs, each swimmer has a worse idea of his individual time, because he can't gauge himself by comparing himself to the others. Maybe he's more likely to push his limits when he doesn't know where he stands.

I have no idea if those are plausible, or even if the authors of the paper considered them. But the point is: you can always come up with more. Sports are complicated endeavors, full of picky rules and confounding factors. If you're going to attribute a certain finding to psychology, you need to work very, very hard to understand the sport you're analyzing, and spend a lot of time searching for factors that might otherwise explain your finding.

Your paper should go something along the lines of, "here are all the things I thought of that might make the second through fourth swimmers faster. Here's my research into why I don't think they could be right. Can you think of any more? If not, then maybe, just maybe, it's psychology."


If you don't do that, you're not really practicing science. You're just practicing wishful thinking.



Sunday, March 13, 2011

An adjusted NHL plus-minus stat

There were a whole bunch of new research papers presented at last week's MIT Sloan Sports Analytics Conference. I actually didn't see any of the research presentations -- I concentrated more on the celebrity panels, as did most of the attendees -- but that doesn't matter much, because every attendee got an electronic copy of all the papers presented. Also, there were poster summaries of most of the presentations, with the authors there to answer questions.

Anyway, I'm slowly going through those papers, and my plan is to summarize a few of them here.

I'll start with a hockey paper. This one (.PDF) is called "An Improved Adjusted Plus-Minus Statistic for NHL Players." It's by Brian Macdonald, a civilian math professor at West Point.

In hockey, the "plus-minus" statistic is the difference between the number of goals (excluding power-play goals) a team scores when a player is on the ice, and the number the opposition scores when the player is on the ice. The idea is great. The problem, though, is that a player's plus-minus depends heavily on his teammates and the quality of the opposition. Even the best player on a bad team would struggle to score a plus, if his linemates are giving up the puck all the time and missing the net.

So what this paper does is try to adjust for that. The author took the past three seasons' worth of hockey data and ran a huge regression that tries to predict goals scored based on which players are on the ice. In that regression, every row represents a "shift" -- a period of time in which the same players (for both teams) are on the ice. The regression estimates the value of a player by simultaneously teasing out the values of his linemates and opponents and adjusting for them.
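To make the setup concrete, here's a stripped-down sketch of that kind of shift-level regression -- not Macdonald's actual model, just the general idea, with a handful of fake shifts and plain weighted least squares. Players on the ice for the scoring side get a +1 in the design matrix, opponents get a -1, and the response is the shift's goal-differential rate:

```python
import numpy as np

# Each fake "shift": (players on ice for team A, players on ice for team B,
#                     goals for A minus goals for B, shift length in minutes)
shifts = [
    ({"A1", "A2", "A3"}, {"B1", "B2", "B3"}, 1, 0.8),
    ({"A1", "A4", "A5"}, {"B1", "B2", "B4"}, 0, 1.1),
    ({"A2", "A3", "A5"}, {"B3", "B4", "B5"}, -1, 0.9),
    ({"A1", "A2", "A4"}, {"B2", "B3", "B5"}, 0, 1.0),
    ({"A3", "A4", "A5"}, {"B1", "B4", "B5"}, 1, 0.7),
]

players = sorted({p for a, b, _, _ in shifts for p in a | b})
index = {p: i for i, p in enumerate(players)}

X = np.zeros((len(shifts), len(players)))
y = np.zeros(len(shifts))
w = np.zeros(len(shifts))
for row, (team_a, team_b, diff, minutes) in enumerate(shifts):
    for p in team_a:
        X[row, index[p]] = 1.0    # +1 if on the ice for the "for" team
    for p in team_b:
        X[row, index[p]] = -1.0   # -1 if on the ice for the opposition
    y[row] = diff / minutes * 60  # goal differential per 60 minutes
    w[row] = minutes              # weight longer shifts more heavily

# Weighted least squares: each coefficient is a player's estimated
# contribution to goal differential per 60 minutes, with teammates and
# opponents held constant.
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
for p in players:
    print(f"{p}: {coef[index[p]]:+.2f}")
```

With five fake shifts the coefficients are meaningless, of course; the real thing uses hundreds of thousands of shifts, hundreds of players, and Macdonald's refinements (separate situations, the faceoff-zone variable, and so on). But the structure of the design matrix is the same.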

Another improvement Macdonald's stat offers over traditional plus-minus is that it includes power-play and shorthanded situations as well. He did that by running separate regressions for those situations and combining them. Also, besides the identities of the players on the ice, he included one more variable -- which zone the originating faceoff was in (if, indeed, the shift started with a faceoff).

Here are his results. The numbers are "per season," where a season means the number of minutes the player actually averaged over the three years. The number in brackets at the end is the standard error of the estimate.

+52.2 Pavel Datsyuk (20.9)
+45.8 Ryan Getzlaf (19.6)
+45.3 Jeff Carter (15.7)
+43.0 Mike Richards (17.2)
+42.6 Joe Thornton (17.6)
+42.1 Marc Savard (15.5)
+40.2 Alex Burrows (13.3)
+40.0 Jonathan Toews (15.5)
+39.8 Nicklas Lidstrom (25.9)
+38.3 Nicklas Backstrom (18.4)

As you can see, the standard errors are pretty big. I'd say you have pretty good assurance that these players are good, but not very much hope that the method gets the order right. You look at the list and see Pavel Datsyuk looks like the best player, but with such wide error bars, it's much more likely that one of the other players is actually better.
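Here's a rough way to see how little the ordering means with errors that size. Treat each estimate as a normal distribution centered at the reported value with the reported SE -- an oversimplification, since the estimates aren't independent -- and count how often the top-rated player actually comes out on top:

```python
import random

# Estimates and standard errors from the top of Macdonald's list, above.
players = [
    ("Datsyuk", 52.2, 20.9), ("Getzlaf", 45.8, 19.6), ("Carter", 45.3, 15.7),
    ("Richards", 43.0, 17.2), ("Thornton", 42.6, 17.6), ("Savard", 42.1, 15.5),
    ("Burrows", 40.2, 13.3), ("Toews", 40.0, 15.5), ("Lidstrom", 39.8, 25.9),
    ("Backstrom", 38.3, 18.4),
]

random.seed(7)
TRIALS = 100_000
datsyuk_best = 0
for _ in range(TRIALS):
    # Naive assumption: independent normal errors around each estimate.
    draws = [(random.gauss(est, se), name) for name, est, se in players]
    if max(draws)[1] == "Datsyuk":
        datsyuk_best += 1

print(f"Datsyuk comes out 'best' in {datsyuk_best / TRIALS:.0%} of simulations")
```

If that fraction comes out well under one-half, as I'd expect given how much the error bars overlap, that's exactly the point about the ordering.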

The standard errors are large because there's not a whole lot of data available, compared to all the players you're trying to estimate. But why do the standard errors vary so much from player to player? Because they depend on how many different sets of teammates and opponents a player was combined with. The highest standard error belonged to Henrik Sedin (+33.8, SE 27.0), because he and his twin brother Daniel "spend almost all of their time on the ice playing together, and the model has difficulty separating the contributions of the two players."

(If I'm not mistaken, this problem is why a similar Adjusted Plus-Minus technique doesn't work well in the NBA. With so few players on a basketball team, and most of the superstars spending a lot of time playing together, there aren't enough "control" shifts to allow the contributions of the various players to be separated.)

However, the imprecision doesn't mean the statistic isn't useful. It's still a lot better than traditional plus-minus. That may not be obvious, because traditional plus-minus doesn't come with estimates of the standard error, like this study does. But if it did, those SEs would be significantly higher -- and the estimates would be biased, too. As far as I know, Macdonald's statistic is the best plus-minus available for hockey, and the fact that it explicitly acknowledges and estimates its shortcomings is a positive, not a negative.

---

Oh, a couple more things.

Macdonald actually ran separate regressions for offense and defense (the numbers above are the sum of the two). It turns out that Datsyuk wound up leading the league in large part by virtue of his defense: his +52.2 breaks down into +37.8 offense and +14.5 defense. But Datsyuk's is not the league's highest defensive score: the superstar of defense turns out to be the Canucks' Alex Burrows, at least among the players Macdonald lists in the paper. Most of Burrows's value looks like it came on D: +21.3, versus +18.9 on offense.

And: where's Sidney Crosby, who's supposedly the best player in the NHL? He's down the list at +33.6: +36.4 on offense, and -2.8 on defense. But the standard error is 16.5, so if you tack on two SEs to his score, he pretty much doubles to 66.6. So you can't say where Crosby really ranks -- it's still very possible that he's the best.

Also, the numbers make it look like Crosby is below-average on defense, which I suppose he might be ... but the relevant statistic is the sum of the two components, not how they're broken up. The idea is to score more than your opponents, whether it's 2-1 or 5-4.

Alexander Ovechkin is similar to Crosby: +38.7 offense, -1.5 defense, total +37.2. Nicklas Backstrom has the best power play results; Alex Burrows is, by far, the highest-ranking penalty killer. Download the paper for lots more.



Sunday, March 06, 2011

Is "superstar bias" caused by Bayesian referees?

Would you rather have referees be more accurate, or less biased in favor of superstars?

In the NBA, a foul is called when the player with the ball makes significant contact with a defender while he's moving to take a shot. But which player is charged with the foul? Is it an illegal charge on the offensive player (running into a defender who's set and immobile), or is it an illegal block by the defensive player (who gets in the way of a player in the act of shooting)?

It's a hard one to call, because it depends on the sequence of events. As this Sports Illustrated article says,

"... the often-fractional difference between a charge and a block call is decided by a referee who has to determine, in a split second: a) were the defender's feet set, b) was he outside the court's semicircle, c) who initiated contact, and d) does the contact merit a call at all?"


It seems reasonable to assume that, in a lot of cases, the referee doesn't know for sure, and has to make an uncertain call. Maybe he's 80% sure it's a charge, or 70% sure it's a block, and makes the call according to that best guess. (Not that the ref necessarily thinks in terms of those numbers, but he might have an idea in his mind of what the chances are.)

Now, suppose there's a case where, to the referee's eyes, he sees a 60% chance it was a charge, and only a 40% chance it was a block. He's about to call the charge. But, now, he notices who the players are. Defensive player B ("bad guy") is known as a reckless defender, and gets called for blocks all the time. Offensive player G ("good guy") is known to be a very careful player with his head in the game, who doesn't charge very often at all.

Knowing the characteristics of the two players, the referee now guesses there's an 80% chance it's really a block. Instead of 60/40, the chance is now 20/80.
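For what it's worth, that 60/40-to-20/80 flip is just Bayes' rule in odds form: the ref's eyes give 3:2 odds for a charge, and if the two players' track records count six times as heavily toward a block, the posterior works out to 4:1 for a block. A tiny sketch, with the likelihood ratio obviously invented to make the numbers come out that way:

```python
def update(prior_charge, likelihood_ratio_block):
    """Bayes' rule in odds form: start from the ref's visual read, then
    multiply by how much more consistent the players' reputations are
    with a block than with a charge."""
    prior_odds_block = (1 - prior_charge) / prior_charge
    posterior_odds_block = prior_odds_block * likelihood_ratio_block
    posterior_block = posterior_odds_block / (1 + posterior_odds_block)
    return 1 - posterior_block, posterior_block  # (charge, block)

# 60/40 charge from what the ref saw; a made-up likelihood ratio of 6
# (B blocks a lot, G almost never charges) flips it to 20/80.
charge, block = update(0.60, 6.0)
print(f"charge {charge:.2f}, block {block:.2f}")  # charge 0.20, block 0.80
```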

What should the ref do? Should he call the charge, as he originally would have if he hadn't known who the players were? Or should he take into account that G doesn't foul often, while B is a repeat offender, and call the block instead?

------

If the ref calls the foul on player B, he'll be right a lot more often than if he calls it on G. When the NBA office reviews referees on how accurate their calls are, he'll wind up looking pretty good. But, B gets the short end of the stick. He'll be called for a lot more fouls than he actually commits, because, any time there's enough doubt, he gets the blame.

On the other hand, if the ref calls the foul on G, he'll be wrong more often. But, at least there's no "profiling." G doesn't get credit for his clean reputation, and there's no prejudice against B because of his criminal past.

Still, one player gets the short end of the stick, either way. The first way, B gets called for too many fouls. The second way, G gets called for too many fouls. Either way, one group of players gets the shaft. Do we want it to be the good guys, or the bad guys?
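Here's a small simulation sketch of that tradeoff, with every rate invented: on each ambiguous play the ref gets a noisy visual read, and we compare the "call what you saw" policy against the Bayesian "call what's most likely" policy, counting how often each player gets whistled for something he didn't do:

```python
import random

random.seed(2)

P_TRUE_BLOCK = 0.80   # made up: how often contact between these two really is a block by B
EYES_ACCURACY = 0.60  # made up: how often the ref's split-second read matches reality
TRIALS = 100_000

wrong = {"eyes": {"B": 0, "G": 0}, "bayes": {"B": 0, "G": 0}}

for _ in range(TRIALS):
    truth = "block" if random.random() < P_TRUE_BLOCK else "charge"
    seen = truth if random.random() < EYES_ACCURACY else ("charge" if truth == "block" else "block")

    # Policy 1: call exactly what the eyes say.
    eyes_call = seen

    # Policy 2: combine the eyes with the base rate, and call the more likely event.
    p_seen_given_block = EYES_ACCURACY if seen == "block" else 1 - EYES_ACCURACY
    p_seen_given_charge = EYES_ACCURACY if seen == "charge" else 1 - EYES_ACCURACY
    post_block = P_TRUE_BLOCK * p_seen_given_block
    post_charge = (1 - P_TRUE_BLOCK) * p_seen_given_charge
    bayes_call = "block" if post_block > post_charge else "charge"

    for policy, call in (("eyes", eyes_call), ("bayes", bayes_call)):
        if call != truth:
            # A wrong "block" call is an undeserved foul on B; a wrong "charge" is on G.
            wrong[policy]["B" if call == "block" else "G"] += 1

for policy in ("eyes", "bayes"):
    print(policy, {k: f"{v / TRIALS:.1%}" for k, v in wrong[policy].items()})
```

With these made-up numbers, the Bayesian ref ends up wrong less often overall, but every one of his mistakes lands on B; the eyes-only ref is wrong more often, and most of his mistakes land on G. That's the tradeoff in a nutshell.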

Maybe you think it's better that the bad guys, the reckless players, get the unfair calls. If you do, you shouldn't be complaining about "superstar bias," the idea that the best players get favorable calls from referees. Because, I'd guess, superstars are more likely to be Gs than Bs. Tell me if I'm wrong, but here's my logic.

First, they're better players, so they can be effective without fouling, and probably are better at avoiding fouls. Second, because they're in the play so much more than their teammates, they have more opportunities to foul. If they were Bs, they'd foul out of games all the time; this gives them a strong incentive to be Gs. And, third, a superstar fouling out costs his team a lot more than a marginal player fouling out. So superstars have even more incentive to play clean.

So if superstar bias exists, it might not be subconscious, irrational bias on the part of referees. The refs might be completely rational. They might be deciding that, in the face of imperfect information on what happened, they're going to make the call that's most likely to be correct, given the identities and predilections of the players involved. And that happens to benefit the stars.

------

When I started writing this, I thought of it as a tradeoff: the ref can be as accurate as possible, or he can be unbiased -- but not both. But, now, as I write this, I see the referee *can't* be unbiased. If there's any doubt in his mind on any play, his choices are: act in a way in which there will be a bias against the Bs; or act in a way in which there will be a bias against the Gs.

Is there something wrong with my logic? If not, then I have two questions:

1. Which is more fair? Should the ref be as Bayesian as possible, and profile players to increase overall accuracy at the expense of the Bs? Or should the referee ignore the "profiling" information, and reduce his overall accuracy, at the expense of the Gs?

2. For you guys who actually follow basketball -- what do you think refs actually do in this situation?



