Saturday, May 31, 2008

The Hamermesh umpire/race study revisited -- part VIII

This is part 8 of a continuing series on the Hamermesh study on umpire racial bias. The previous parts can be found here.


One of the criticisms I had of the Hamermesh study is its implicit assumption that if there is same-race bias on the part of umpires, it must be the same for all three races. The model used in the paper assumes that if (for instance) white umpires are 0.5 percentage points more likely to call a strike for a white pitcher, then hispanic umpires must also be 0.5 percentage points more likely to call a strike for a hispanic pitcher. I argued that racism usually goes more one way than the other, and it's usually the majority who's more biased against the minority.

But there's another implicit, and incorrect, assumption – that the *individual umpires* themselves have identical bias. That obviously can't be right, can it? Grab any random group of white people, and some will be more racist than others. You might have some that are explicitly racist, and some that harbor racist biases but won't admit it. But, on the other hand, you'll have some people who favor affirmative action, and may even practice it in their daily lives. And you'll find others who strictly believe in color-blindness in all aspects of life, and oppose affirmative action and racism with equal fervor.

If that's how it works in everyday life, wouldn't it be reasonable to assume that's how it would work with umpires too? Even if some umpires would have an unconscious bias in favor of their own race, wouldn't there be others who didn't, and even some who were unconsciously biased the other way? It's not that hard to imagine a white umpire so concerned about racism (societal or baseballistic) that he bends over backwards to be fair to minority pitchers, with the effect that he becomes unconsciously biased in their favor.

In other words, the idea that *every* umpire is biased in favor of his own race is probably not very close to reality.

What if, say, only half of umpires were biased? Well, then, what we'd get is a bimodal distribution of strike calls. There's still random luck in strike calls, so we'd get the sum of two normal distributions: the race-biased umpires would be one normal distribution, and the non-biased umpires would be the other. Half the pitchers would form a bell curve around unbiased, and the other half would form a bell curve around biased. (If there were a substantial number of "affirmative-action-biased" umpires, they might form a third.)

Now, I bet that the differences would probably be small enough that the sum of the two normals would still look bell-shaped if you drew a graph. However, you might expect more outliers, more umpires far from the mean. For instance, 2.5% of the biased umpires would be more than 2 SD above the biased mean – which means they'd be *more* than 2 SD from the overall mean. That is, the "half of umpires are biased" bell curve would be spread out more than you'd expect from the regular binomial distribution of balls and strikes.
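As a rough check on that intuition, here's a simulation sketch. The pitch count, base strike rate, and umpire counts are illustrative assumptions, not figures from the study (the pitch count is chosen to match the ~2.5-point individual-umpire SD mentioned later in the post); the question is how much a half-biased population widens the spread of strike rates beyond the luck-only binomial SD:

```python
import math, random, statistics

random.seed(0)
N = 350          # 0-0 pitches per umpire (assumed; gives a ~2.5-point SD)
p = 0.32         # base called-strike rate (assumed)
bias = 0.005     # the alleged 0.5-percentage-point bias

# Luck-only SD of one umpire's observed strike rate
binom_sd = math.sqrt(p * (1 - p) / N)

# Normal approximation to each umpire's rate: half unbiased, half biased
rates = ([random.gauss(p, binom_sd) for _ in range(400)] +
         [random.gauss(p + bias, binom_sd) for _ in range(400)])

obs_sd = statistics.stdev(rates)
print(f"luck-only SD: {binom_sd:.4f}, observed SD: {obs_sd:.4f}")
```

The mixture's SD is sqrt(binom_sd² + (bias/2)²), so a half-point bias on top of a 2.5-point luck SD barely inflates the spread – which is consistent with the sum still looking bell-shaped.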

To check, I looked at all the umpires on 0-0 counts, and how they judged hispanic pitchers relative to white pitchers. The results looked very normal. (The Anderson-Darling test, which checks for normality, gave a statistic of 0.13, definitely not significant.)
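For reference, the Anderson-Darling statistic needs nothing beyond the standard library. This sketch runs it on simulated normal data (not the actual umpire numbers); values well below the ~0.75 critical level are consistent with normality:

```python
import math, random

def anderson_darling_normal(xs):
    """Anderson-Darling A^2 for normality, mean and SD estimated from the data."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    z = sorted((x - mu) / sd for x in xs)
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2)))
    s = sum((2 * i + 1) * (math.log(cdf(z[i])) + math.log(1.0 - cdf(z[n - 1 - i])))
            for i in range(n))
    return -n - s / n

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(85)]   # stand-in for 85 umpire z-scores
a2 = anderson_darling_normal(sample)
print(round(a2, 3))
```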

Out of about 85 umpires with large enough sample sizes, there were two umpires more than 2 SDs in favor of hispanic pitchers (Welke at 2.2 SD, Dellinger at 2.0), and two umpires more than 2 SDs in favor of white pitchers (Nauert at 2.2, Carlson at 2.5). That's almost exactly what you'd expect if nobody were biased. The alleged bias is on the order of 0.5 percentage points, and the random SD for an individual umpire is about 2.5 points. That ratio, I think, is big enough that if bias actually existed, we'd notice something funny in the data. But everything looks very close to normal.

What if it were *more* than half of umpires that were biased? Well, you'd still expect some variation in how biased they are. Some might be *really* biased, and some might be only slightly biased. Some might be biased against blacks but without it affecting their ability to call strikes. And some might have prejudices that actually benefit the minorities. (For instance, suppose an umpire harbors an unconscious assumption that blacks are intrinsically better athletes than whites. Might that prejudice lead him to call more strikes for black pitchers? I think it might.)

In that case, if the extent of the bias varied substantially between umpires, the distribution of strikes would be more spread out than we observed: again, we'd expect more than 5% of umpires to wind up outside the +/- 2 SD range. I'm not sure how big the effect would be; it would depend on the extent of the variance in bias. But, again, the study is talking about half a percentage point overall, which would require a significantly higher amount of bias at the extremes to counteract the mildly-biased umpires. I can't prove it offhand, but I think if you added any reasonably large (but varying) bias, you'd see something non-normal in the distribution of strikes.

So I think we can reject the idea that most of the umpires are biased, or even that half the umpires are biased. And logic suggested earlier that we reject the idea that *all* (or almost all) the umpires are biased. What does that leave us? Only the possibility that *a few* umpires are biased.

And if only a few are biased ... well, isn't that suggestive of the minority umpires? From Part 7, here are the pictures of where the minority umpires fit in relation to pitchers of their own race:



The top line is the hispanic pitchers, the bottom line is the black pitchers. In both cases, umpires of the same race (marked with an "x") lean towards the left side of the line (favoring their own race).

If you had to balance the x's around the center of the lines, how would you do it? You could move a couple of x's to the right. Or, you could move a whole bunch of hyphens (mostly the white umpires) to the left.

But moving 10 or 20 hyphens would imply widespread racism (20-40 umpires, since each hyphen represents two). We rejected that idea. That means we have to move Xs, which implies that it's the minority umpires who are biased (if, in fact, bias exists at all).

Which is why I disagree with the authors' conclusions on page 24 of the paper:

"In particular, non-White pitchers are at a significant disadvantage relative to their White peers ... the fact that nearly 90 percent of the umpires are White implies that the measured productivity of non-White pitchers may be downward biased."

What they're saying is this: black umpires favor black pitchers, and white umpires favor white pitchers. But since there are so many more white umpires than black umpires, the white pitchers get favored more often, and get a better deal.

But, again, that can only happen if racial bias is widespread. If only a few umpires are biased, it's probably the minorities. It could well be that it's the minority pitchers who are the beneficiaries of any bias.

If, in fact, there is any bias at all.


Wednesday, May 28, 2008

Pop music's answer to Retrosheet

The "Whitburn Project" is an online collaborative effort to "preserve and share high-quality recordings of every popular song since the 1890s." The project has led to the availability of a public database full of chart data, which Andy Baio is now starting to analyze.

In this blog post, Baio gives us three separate studies, based on Billboard chart data from 1957-2008. He tells us that:

-- "diversity" is down -- there are fewer new songs appearing on the charts (which I suppose is the same thing as saying that songs stay on the chart longer);

-- "one-hit wonders" peaked between the 50s and 70s. But, because those were the years when diversity was high, you can era-adjust the numbers. Once you do that, you find some mild inter-decade variations, but, overall, there's a pretty consistent level of one-hit wonders throughout chart history.

-- and some other interesting bits -- read the full blog post.

It's amazing what you can learn about stuff if you have enough good data. I'm looking forward to seeing what else Baio can come up with.


Monday, May 26, 2008

Are black NBA coaches more likely to be fired?

Alan Reifman was kind enough to point me to a new study on whether racial discrimination affects the firing of NBA coaches. In case you only care about the answer, it's no – coaches appear to be fired independently of whether they're black or white.

The study is called "Race, Technical Efficiency, and Retention: The Case of NBA Coaches." The authors are Rodney Fort, Young Hoon Lee, and David Berri. Fort is a renowned sports economist, as is Berri (who is co-author of "The Wages of Wins"). Lee is an academic economist in Korea.

The study is here (PDF). A press release describing the research is here.

As far as I can gather, "Technical Efficiency" (TE) is an economics term that refers to a firm's ability to efficiently produce valuable goods from its inputs of labor and capital. A TE of 1.00 would signify a firm that would produce 100% of the best possible theoretical output, given the staff and technology available to it. (That might not be completely correct, but it's what I gather from the Wikipedia description.)

The bulk of this paper tries to figure out how much efficiency coaches achieved from their teams, where the inputs are players and the outputs are wins. To do that, the authors figure out the win values of various basketball statistics, which are the same as the ones in "The Wages of Wins" (for instance, a rebound is worth 0.034 wins). Then, they look at last year's statistics for the players on every team, and figure out what they should have been expected to produce.
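The expected-wins step can be sketched like this. The only win value quoted in the post is 0.034 wins per rebound; every other weight below, and the player stat lines, are made-up placeholders, not the actual Wages of Wins numbers:

```python
# Hypothetical win values per unit of each stat; only "rebounds" (0.034)
# comes from the paper as described in the post.
WIN_WEIGHTS = {"points": 0.033, "rebounds": 0.034, "steals": 0.033,
               "turnovers": -0.034, "missed_fg": -0.034}

def expected_wins(roster):
    """Sum each player's prior-year stats times their win values."""
    return sum(WIN_WEIGHTS[stat] * value
               for player in roster
               for stat, value in player.items())

# Two illustrative (invented) player stat lines
roster = [
    {"points": 1200, "rebounds": 600, "steals": 80, "turnovers": 190, "missed_fg": 700},
    {"points": 900, "rebounds": 300, "steals": 60, "turnovers": 120, "missed_fg": 500},
]
print(round(expected_wins(roster), 1))
```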

At this point, you could just look at the expected wins from the players, and use that to figure the expected wins for the team. But the authors added extra variables. There's roster stability, for how much this team's personnel varies from last year. There's years of experience for the coach. There's the coach's career winning percentage. And a few more.

I don't fully understand the statistical technique the authors used; it's some kind of maximum likelihood estimation, rather than a straight regression. It seems like it's standard in the economics literature when estimating technical efficiency.

In any case, the authors estimate TE for each season for each coach. The average TE for all coaches was .760. But coaches that got fired averaged only .670 before being dismissed. This suggests, the authors say, that "firings tend to occur as if owners use TE in the decision."

Furthermore, fired black coaches had TEs of .676, while fired white coaches had TEs of .666. This is a very small difference, not statistically significant, which leads to the paper's main conclusion that there is no racial discrimination by team owners.

Despite the fact that I don't completely understand the authors' methodology, it seems to me that it's a very complicated way of trying to figure out by how much the teams exceeded (or fell short of) expectations. Suppose you want to figure out whether the coach got the most out of his team. A simple way would be to just look at how it did last year, mentally account for any personnel changes, regress to the mean as appropriate, and see if it met that standard this year. An even simpler way would be to look at the pre-season Vegas over/under for wins. I recall that Bill James created a really simple "expected wins" formula, and did a similar manager evaluation based on that. It worked pretty reasonably, as I recall.

In that light, the algorithm used in this study seems a lot more complicated than it needs to be. (To me, it seems hugely, absurdly complex, but I'm not an academic economist, for whom it's probably standard.)

Admittedly, a complex methodology would be worth it if it led to greater accuracy. But the authors don't compare their estimates to simpler ones. They do say, at one point, that their model has a high correlation with actual wins – 0.983. It seems to me that this must be retroactive, because, with the extent of luck in basketball, there's no way to predict wins with anywhere near that accuracy. So there's still no test of how well their model predicts *future* wins, which, of course, is what it's actually trying to do.

Frankly, I doubt that the system is that much better than a more naive one, if only because of the flaws in the measures of player productivity (as pointed out in discussions of "The Wages of Wins," here and elsewhere – my original review is here).

Another thing that bothers me is the apparent significance of the coach's career winning percentage:

"For a team with TE=0.70 and a coach with a career winning percentage of 0.500, hiring a coach with a career winning percentage of 0.600 increases the TE to 0.79."

I'm not 100% sure what a difference of 0.09 in TE means in practical terms. But it seems pretty big. And, regardless, do you really want to base any conclusions on the coach's previous record without taking into account the talent level of his teams?

And perhaps the one thing that bothers me the most about this study is that it doesn't mention, anywhere, the effects of luck. A team expected to go 41-41 will vary from that with a standard deviation of 4.5 games. Actually, that's the binomial estimate – in real life, it might be a bit lower, so let's call it 4. What is the variance in wins based on the coach's talent at molding a team? Even if it's the same 4 games – which I think is a huge overestimate – that means half of any "efficiency" discrepancy is caused by luck. Shouldn't the study discuss what this means? Because it seems to me that it's likely that many coaches – perhaps most – would be fired due to bad luck, rather than bad coaching.
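The 4.5-game figure comes straight from the binomial formula, and the "half of it is luck" point is just a ratio of variances:

```python
import math

games, p = 82, 0.5                        # a team expected to go 41-41
sd_luck = math.sqrt(games * p * (1 - p))  # binomial SD of season wins
print(round(sd_luck, 2))                  # ≈ 4.53 games

# If coaching skill also moves wins with an SD of ~4 games, luck and
# coaching each explain about half of any "efficiency" discrepancy:
share_luck = 4.0 ** 2 / (4.0 ** 2 + 4.0 ** 2)
print(share_luck)                         # 0.5
```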

As for the finding that coaches aren't fired based on race, I do think the study supports the conclusion. It may have used a complex methodology, but it does seem to properly distinguish the teams that exceeded expectations from the teams that did not. That I think there are much simpler ways of doing that doesn't change the fact that theirs probably does the job too.

Still, I have reservations about the study's methodology, in terms of predicting wins. I'd bet that if the authors had run a table of expected versus actual wins, using their methodology, it wouldn't be hard to find a simpler one -- without all the logarithms and likelihood estimators and regressions -- that would be at least as good.


Friday, May 23, 2008

Does the Wonderlic test predict QB performance?

The Wonderlic is a kind of intelligence test, used to screen potential employees. Apparently the NFL has been using it as part of scouting the draft.

Does a player's Wonderlic score predict how well he'll perform in the NFL? According to this study, by Arthur J. Adams and Frank E. Kuzmits, the answer is no. They found no statistically-significant correlations between Wonderlic score and measures of draft position, salary, QB rating, or games played.

However, a subsequent study by a testing company called "Criteria" *did* find an effect. Criteria looked at the 61 QBs drafted between 2000 and 2004. For performance measures, they looked at total passing yards and TD passes (which you would think would be so highly correlated with each other that they would be measuring almost the same thing, but never mind). They found a mild positive correlation of about .2.

But: when they included only quarterbacks with at least 1,000 yards – effectively eliminating the ones that didn't get any playing time – the correlation jumped to .5. As Brian Burke points out on his own blog, .5 is a HUGE correlation in a study of this type.

Now, it should be mentioned that Criteria has a vested interest in making aptitude tests look effective. If this is a trial and I'm on the jury, I'm going to assume that they tried a bunch of possible cutoffs and methods and chose the one that makes the test look the most predictive. Brian notes that if you change the cutoff from 1,000 yards to 2,000, and take Tom Brady (high Wonderlic, very high NFL performance) out of the study, the correlation drops to about .26 (still pretty significant).
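Brian's cutoff-sensitivity point is easy to demonstrate on simulated data. Every constant below is invented for illustration (this is not the Criteria dataset); the point is just that a Pearson correlation computed after filtering on the outcome variable can move around a lot:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

random.seed(2)
# 61 simulated QBs: Wonderlic ~ N(24, 6), passing yards weakly linked to it
qbs = []
for _ in range(61):
    w = random.gauss(24, 6)
    qbs.append((w, max(0.0, 120 * (w - 24) + random.gauss(4000, 3000))))

r_all = pearson([w for w, y in qbs], [y for w, y in qbs])
kept = [(w, y) for w, y in qbs if y >= 1000]       # the 1,000-yard cutoff
r_cut = pearson([w for w, y in kept], [y for w, y in kept])
print(round(r_all, 2), round(r_cut, 2), len(kept))
```

Re-running with different cutoffs (or dropping one extreme point, as Brian did with Tom Brady) shows how fragile a filtered correlation can be.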

I'm even more skeptical than Brian. Well, I am and I'm not. First, I would be willing to bet a lot of money that there IS a significant correlation between the Wonderlic and QB effectiveness. There's no doubt that QB requires good intelligence and quick decision-making skills, which is partly what the Wonderlic measures. In fact, I'd bet that Vince Young's supposed score of 6 out of 50 is a mistake – I don't believe that anyone with Wonderlic skills that low could have been a successful college quarterback, never mind a top-ranked one. (See these sample questions for yourself – supposedly Vince Young would only score 2 out of 15.)

So, yes, of course intelligence (or whatever you want to call what the Wonderlic measures) is important to making a good quarterback. But that's really not the question. The question is, *given what you already know* about the quarterback, in four years of extremely competitive college play, does the Wonderlic give you any *additional* useful information about the player?

On that, I'd bet that it doesn't.

Look at it this way: switch to baseball for a minute, and suppose I ask you to predict what Albert Pujols will do in the remainder of this season. You'd have a pretty good idea, and would make a fairly accurate estimate. Now, suppose I boot on over to St. Louis, give Pujols the Wonderlic test, and tell you his score. Will that affect your estimate? No, of course not. Whatever effect Pujols' Wonderlic skills have on his performance has already shown itself in several seasons of major league play. No matter how much Wonderlic correlates with having a high OPS, if you know the player's skill in achieving OPS, no intelligence test is going to give you a better estimate.

Where the Wonderlic would help you is if you didn't know what real-world skills an applicant has. It would be much more useful to know the Wonderlic scores of presidential candidates than football players, because, for instance, we don't have a good idea of their presidential skills -- we haven't seen Obama or Hillary play president for four years in college. It would be less useful for George W. Bush, because we have strong, first-hand observations on what he's like as president. Did he achieve his Bushness with a high Wonderlic, a medium Wonderlic, or a low Wonderlic? That might be interesting trivia, but shouldn't affect how we evaluate his potential, because we've actually seen him in action.

And the same is true for drafting quarterbacks. The question we *really* want to answer, in the NFL context, is this: given what you know about the player, and have discovered in the NFL combine and from watching him play college football for four years, does the Wonderlic tell you anything *additional* to that?

That is: you have two identical quarterback candidates, with identical college records playing for identically-skilled college teams against identical opponents. At the combine, they look exactly the same in every respect, except the Wonderlic. One of them has a Wonderlic score of 30, and one has a Wonderlic score of 20 (24 is average for QBs). Should you expect a difference in NFL performance?

It's possible to study that question, but I think you'd have trouble finding enough data, and adjusting for the different environments in which quarterbacks play, to be able to make any significant conclusions.

And intuition suggests that they'd probably be about the same. If A and B had equal performance, despite A being significantly "smarter" than B, it follows that B must have had *other* compensating aptitudes that made up for it. Doesn't it?

As for this study, if you believe the results, all we have is a conclusion that better and faster thinkers make better quarterbacks. Well, duh.

Hat tip: NFL Stats (Brian Burke)


Monday, May 19, 2008

Soccer sabermetrics

If you're a soccer fan, how would you try to determine who the better players are? You could look at goals and shots, and if you watched enough games, you could get a general impression of who the best players are just from watching their moves. But if you have two guys who look like middle-of-the-road players, how can you better evaluate them?

According to this Sports Illustrated article, private firms have been calculating additional stats for a while now: touches per 90 minutes, possessions won, passing percentage, and total distance run during a game ("top midfielders" log over 7 miles).

One of those firms is called "Match Analysis." They've been hired by Billy Beane, GM of the Oakland A's, and now "strategic overseer" of the MLS San Jose Earthquakes, to provide statistical support for what is presumably a sabermetric approach to soccer.

According to this firm, David Beckham

"... not only led MLS with an average of 87.9 touches per 90 minutes last season ... but he also dominated in shot creation – how frequently a player is involved in an attack that leads to a shot – helping to set up 11.2 shots per 90 minutes, or a whopping three more than the next-best player."

To me, these stats seem to be measuring opportunity more than effectiveness. But there's another stat, called "possession percentage" (PP):

"Ever heard of Jesse Marsch? Neither had I, but the Chivas USA midfielder led MLS in PP last year: He got the ball and passed it successfully 81% of the time. Conversely, Yura Movsisyan of the Kansas City Wizards had a 37% rate ..."
I'm not sure what PP actually measures. Is it the percentage of time you pass cleanly after getting the ball? If it is, wouldn't your percentage be lower when you're deep in the opposing zone, with more defenders in the vicinity? Anyway, I'm sure Billy Beane is busy figuring out what the stats mean and how they can be used.

In any case, I'm very optimistic about this approach, of watching and counting what happens when the player has the ball. I think it would work well for hockey, too. I remember back in the 80s and 90s, watching Leaf games, that it seemed whenever Todd Gill had the puck he'd do something bad with it ... give it away, or miss the pass. I'd bet you could learn a lot about players by just keeping track of the equivalent of "possession percentage." When a defenseman has the puck in his own zone, how often does he finish a clean pass to an open teammate? How often is the pass a bit off, where the teammate loses the puck? How often does he fail to get the puck out of the zone?
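The tracking scheme described above amounts to logging one event code per puck possession and tallying a "possession percentage." Here's a minimal sketch; the event codes are invented, and the tallies are toy examples, not real game data:

```python
from collections import Counter

# One (player, outcome) record per puck possession, as an intern might log it.
# Players and outcomes here are illustrative, not actual tracking data.
events = [
    ("Gill", "clean_pass"), ("Gill", "giveaway"), ("Gill", "missed_pass"),
    ("Lefebvre", "clean_pass"), ("Lefebvre", "clean_pass"),
    ("Lefebvre", "zone_clear"),
]

tallies = {}
for player, outcome in events:
    tallies.setdefault(player, Counter())[outcome] += 1

GOOD = {"clean_pass", "zone_clear"}   # outcomes counted as successful possessions
for player, c in sorted(tallies.items()):
    pp = sum(c[o] for o in GOOD) / sum(c.values())
    print(f"{player}: {pp:.0%}")
```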

Another thing I used to notice is that Sylvain Lefebvre, my favorite player back in the early-to-mid-90s, was a rock on defense. An opposing forward would carry the puck into the zone – and Lefebvre would be on him, pinning him to the boards or taking him out of the play. Couldn't you rank defensemen by what happens when they go one-on-one with the puck carrier?

If I were an NHL general manager, I'd hire a bunch of students to watch replays of every game and tabulate all these stats. For almost nothing, you could have a record of almost anything you wanted (subject to a couple of days' delay while your beleaguered interns revisit the video). And identifying just one significantly overrated or underrated player could easily be worth millions.

Indeed, it makes so much sense that I wonder if teams are actually doing this already, and just not letting it be known.


Wednesday, May 14, 2008

Home run rates and temperature

Home run rates are down a bit this season. Could it be the weather? J. C. Bradbury looks at April home runs over the past few seasons, and shows that they correlate pretty well with the temperature.


The NFL passing premium revisited

There is a so-called "passing premium" in the NFL. A passing play, on average, will gain more yards than a running play, even after accounting for turnovers. But teams still call for the run, and fairly often. What can explain that?

In an article in the current JQAS, Duane Rockerbie takes a shot at the question. His study is called "The Passing Premium Puzzle Revisited."

Rockerbie argues that the effect is caused by risk aversion on the part of the coaches. Passes gain more than runs, but they are also riskier – sometimes they succeed spectacularly, and sometimes they fail badly. Rushing plays also vary, but not as much as passes.

Risk aversion is an established concept in finance. Investments with little or no variance, such as short-term government bonds, are expensive, yielding only a couple of percentage points. Investing in the stock market, however, gives average returns that are much higher. However, if you invest in equities, you take the risk of a decline or a crash.

Rockerbie argues that the same is true in football. Coaches are willing to accept fewer yards in exchange for less risk, and that's why they often call for the run. They care about the average gain, but they care about the risk, too.

So, as expected, strong rushing teams will call more rushing plays than weak rushing teams, and weak passing teams will call more rushing plays than strong passing teams. But, also, teams with *predictable* rushing results and varying passing results will call more rushing plays than teams with predictable passing and varying rushing.

Economists model risk aversion in terms of "diminishing marginal utility." What that means is that the more money you have, the less each additional dollar means to you. No matter how rich you are, you will never take an even gamble for $10, because the $10 you might lose would mean more to you than the $10 you might win. (To make you take the bet, I'd have to offer you, maybe, $11 or $12 against your $10.) Diminishing marginal utility is, on its own, enough to explain risk aversion.
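The $10-gamble logic is easy to verify with any concave utility function. Log utility and the $1,000 starting wealth below are illustrative choices, not the specific function Rockerbie uses:

```python
import math

def utility(wealth):
    # Concave (log) utility: each extra dollar is worth a little less.
    return math.log(wealth)

wealth = 1000
u_now = utility(wealth)
even_bet = 0.5 * utility(wealth + 10) + 0.5 * utility(wealth - 10)
print(even_bet < u_now)        # True: the even $10 gamble is declined

# Find the winning payoff that makes the gamble worth taking:
win = 10.0
while 0.5 * utility(wealth + win) + 0.5 * utility(wealth - 10) < u_now:
    win += 0.01
print(round(win, 2))           # a bit over $10, as the text suggests
```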

And so Rockerbie uses a standard economics utility function, one that models dollars as worth less the more of them you already have. And he applies that to yardage – the more yards you've already gained, the less you care about the next few yards. (There's no evidence that football yards should follow the same marginal utility function as dollars, but I guess you have to start somewhere.)

Using that equation, and full regular-season data for the 2006 NFL season, Rockerbie sets out to figure out what proportion of running plays "should" have been called for each team. However, doing this requires him to estimate a mathematical parameter for risk aversion. Because the San Diego Chargers had the best record in 2006, he uses the parameter that makes the Chargers look like they made a perfect decision. (In Figure 3, he shows that choosing some other parameter wouldn't affect the results all that much.)

The results: two teams ran more often than they "should have" – the Patriots and Saints – and the rest of the teams (not including the Chargers, who were modelled to be perfect) ran less often than they "should have."

Rockerbie shows that, generally, the farther a team was from its optimal run/pass proportion, the worse its record. That fact, he argues, supports the model.

But I don't think it does.

Actually, it's Brian Burke who doesn't think it does; I'm just agreeing with him. Brian notes (in parts 2 and 3 of a recent three-part posting) that it's not running that leads to wins, but the other way around. Teams with a fourth-quarter lead will call a lot of running plays, to reduce the risk of a turnover and run out the clock. And that explains the correlation more convincingly than risk aversion.

Further, Brian points out, in a financial application, every dollar is of equal value. But in football, there are other constraints, like having to earn ten yards in four downs, or suffer a big loss. As Brian says,

"At the end of each year no one takes most of your money away if your mutual funds don't earn at least 10%. If they did, and you hadn't made your 10% by November, your risk tolerance would dramatically increase for the final 2 months of the year."

So there are situations in the game – third and 9, say – where you have to take more risk, and probably call for a passing play. And any analysis has to take those constraints into account.

Also, it's not just the down-and-yards situation that calls for an increase in passing plays – it's the score. The value of an additional yard may be reduced, on average, the more yards you have. But the goal of football is not to gain yards – it's to win the game. No matter how many yards you have, if you're down by 2 points on your own 20 late in the fourth quarter, you'd better try a few throws.

And so, the more points – or yards – the other team has, the riskier you have to be. So, as Brian notes, the better your defense, the more you can afford to run.

Brian's conclusion is that he doesn't think the study shows what it purports to show, and I think he's right on in his arguments.


A couple of additional things that bother me a bit:

First, the study uses the teams' raw numbers from the 2006 season. But diminishing marginal utility doesn't depend on the outcome; it depends on the *expectation* of outcomes. And observed outcomes are always more spread out than their expectations – the guy who went 3-for-5 probably wasn't really a .600 hitter. You have to regress all the results to the mean, somewhat, to get the actual expectation. Rockerbie doesn't do this. He assumes that each team's observed performance exactly measures its talent, which is false.


Secondly – and most important – shouldn't the risk-aversion model account for the fact that risk is reduced when you have multiple plays? The risk-return tradeoff is different for one play than for fifty consecutive plays.

Suppose I offer you a bet on a coin flip, at favorable odds of $15,000 to $10,000. You recognize that the game is weighted heavily in your favor, but, because you're risk averse, you reluctantly turn me down – the risk of losing $10,000 on a 50/50 flip is too much for you.

But, now, suppose I offer to play the same game with you, not just once, but one hundred consecutive times. Now, the odds of you losing money are almost zero. To finish in the red, you'd have to lose more than 60 of the 100 coin flips (the break-even point is 40 wins, since $15,000 × 40 = $10,000 × 60), and the probability of that is very, very low. So you'd accept the bet and get rich.
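That "very, very low" probability can be pinned down exactly with the binomial distribution (math.comb requires Python 3.8+):

```python
from math import comb

# Losing money over 100 flips requires 15000 * wins < 10000 * (100 - wins),
# i.e. fewer than 40 wins -- the break-even point is 40 wins out of 100.
p_lose = sum(comb(100, k) for k in range(40)) * 0.5 ** 100
print(f"{p_lose:.4f}")   # a couple of percent at most
```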

It seems to me, if I'm getting this right, that the analysis in this study is for a single play. (Specifically, in Equation 6, the sigma-squareds are those hypothesized for one play.) But over, say, 100 plays, the variance of the average play would be 1/100th of that in the study. Over a season, you could almost argue that the variance would be almost zero, and you should just go 100% for the play (probably passing) that would yield the most benefit.

That is: for a single play, if you pass, there's the possibility you gain lots of yards (a completion), but the possibility that you lose lots of yards (an interception). Even though there's a higher chance of a completion than a turnover, it's still risky. But over a season, the chance of having more interceptions than completions is very close to zero. So shouldn't lots of passing be your best option, even if you're risk averse?

Am I missing something?


Finally, there are results in the table that confuse me. I may not understand how this really works, but, if you look at the chart, for the Raiders, Steelers, Falcons, Broncos, and Titans, (a) running has a higher expectation than passing, and (b) running has a lower variance than passing.

For those five teams, shouldn't a risk-aversion model predict that they should run 100% of the time? According to this model, why would you EVER pass if your expectation is lower and your risk is higher?

Labels: ,

Friday, May 09, 2008

New issues of "By the Numbers" now available

Two new issues of "By the Numbers," the SABR Statistical Analysis newsletter which I edit, are now available at my website.

As always, submissions for future issues are appreciated.


Wednesday, May 07, 2008

Back-to-back NBA games and home field advantage

Do NBA teams suffer if they've played a game the day before? Apparently they do.

In a study in JQAS, called "The Role of Rest in the NBA Home-Court Advantage," authors Oliver A. Entine and Dylan S. Small run a straightforward regression on 2,415 NBA games – two seasons' worth. They use dummy variables for each of the two teams, another variable for home-court advantage, and three more dummy variables for days of rest. They find that, compared to three or more days of rest, teams playing the second of back-to-back games suffer a disadvantage of 1.77 points (split somehow between offense and defense).

That finding was statistically significant at p=.03.

Teams playing with one day of rest or two days of rest showed little difference in performance – the entire effect was in the "zero days of rest" case.

In terms of wins, rather than points, the findings were consistent – an odds ratio of 0.75 for the second of back-to-back games. If I understand that right, that's a winning percentage of .429 (3/7). At roughly 30 points per win, that works out to 2.13 points, close enough to the 1.77 the authors found.
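As a sanity check on that arithmetic (a sketch; the 30-points-per-win conversion is the rule of thumb the paragraph above appears to use):

```python
# An odds ratio of 0.75 applied to an otherwise even matchup (odds 1.0)
# gives odds of 0.75, i.e. a winning percentage of 0.75 / 1.75 = 3/7.
odds = 1.0 * 0.75
win_pct = odds / (1 + odds)

# Converting the drop from .500 into points at ~30 points per win:
points = (0.5 - win_pct) * 30
print(f"win pct = {win_pct:.3f}, ~{points:.2f} points")
```

That gives a winning percentage of .429 and roughly 2.1 points, in line with the 1.77-point regression estimate.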

What does days of rest have to do with home-court advantage? Well, the authors note that road teams play a lot more back-to-back games than home teams: 33%, as opposed to 15%. The difference translates into a larger apparent visiting-court disadvantage. It works out that 0.3 points of home-court advantage, out of 3.24 total, comes from the fact that home teams are likely to be more rested.

The authors also checked whether the length of a road trip has any effect on winning percentage. They find that teams in the second game of a road swing play significantly worse than in the first game, by 1.04 points (p=0.07). However, there is no effect for the third game or beyond, which suggests that this finding could just be random.

One more test by the authors checks whether the back-to-back effect is different for the home team as opposed to the visiting team. They say, no, the effect is roughly the same. Unfortunately, they don't give us any coefficients or significance levels for the estimates. They just tell us that the fit is not significantly better for the home/road regression (in terms of residual sums of squares), which still leaves us wondering whether the effect might be different in basketball terms.

Perhaps the authors chose to do it this way because the home and road coefficients, taken separately, would not be statistically significant. (The combined coefficient (1.77 points, as described above) is 2.14 standard errors, so the home/road coefficients, if also equal to 1.77, would be about 1.5 standard errors each.) Still, it would be nice to have seen how different the effects are.

Labels: , ,

Thursday, May 01, 2008

The Hamermesh umpire/race study revisited -- part VII

This is Part 7 of the discussion of the Hamermesh study. The previous six parts can be found here.


In comments to posts here and elsewhere, Tom Tango asked whether the "same-race" effect could be caused by one umpire – specifically, Angel Hernandez, a controversial umpire with a poor reputation. (Here's a fan page.)

So I checked. I started by trying to reproduce the "low attendance" finding, because that was where the original study found the largest effect.

(Technical note: To save time, I only approximated what the Hamermesh authors did. I didn't correct for count, score, home/road pitcher, batter, or umpire. And I selected games by actual attendance (less than 30,000) instead of the study's 70% of capacity. For umpires, I considered only Hernandez and Alfonso Marquez as hispanic, and CB Bucknor, Laz Diaz, Chuck Meriwether, and Kerwin Danley as black. The original study included one additional hispanic umpire and one additional black umpire, but I don't know which ones those are. Also, for hispanic pitchers, I used only those born in one of the countries listed in the study; and, for black pitchers, I used only those in this list.

But I got similar results both in relative number of pitches seen, and in the effect of those pitches. So I figure it's close enough.)

Here's what I got for the equivalent of the "Table 2" matrix in the original study (which the authors didn't actually publish for this sub-sample):

Pitcher ------ White Hspnc Black
White Umpire-- 31.88 31.27 31.27
Hspnc Umpire-- 31.41 32.47 28.29
Black Umpire-- 31.22 31.21 32.52
All Umpires -- 31.83 31.30 31.32

It does seem obvious that there's a same-race effect here: all three numbers on the diagonal are the highest in their row and column. The "UPM" coefficient was 0.76, which is a bit over two standard errors, and so significant at p=.04. (The original study had 2.7 standard errors, which was more highly significant; I think that's because they (properly) adjusted for count. But the differences won't affect most of the discussion here.)

As I mentioned before, one of the things the authors didn't do was show us how the individual umpires varied. Here are the two hispanic umpires and four black umpires:

Umpire ------- White Hspnc Black
Marquez ------ 30.86 31.24 27.78
Angel -------- 31.91 33.76 28.49
Danley ------- 31.86 29.66 26.42
M'wthr ------- 30.29 31.65 33.98
Diaz --------- 31.84 32.11 39.55
Bucknor ------ 31.11 31.03 31.25

Out of these six umpires, five of them (all but Danley) called more strikes for own-race pitchers than for other-race pitchers. But, remember that we can only find umpire bias relative to other umpires. So it's important to remember that this could also be evidence of *white* umpires being biased, in the other direction. Maybe hispanic and black pitchers are actually better than white, but the white umpires keep the minorities down.

In any case, these are based on small sample sizes. Typically, the above umpires saw about 1500 hispanic pitches and 110 black pitches each (although Bucknor saw 352 black pitches, strangely enough). So, taken individually, none of the umps showed statistically significant differences. I'll give you the Z-scores for the differences for same-race compared to white:

+0.66 Marquez
+1.68 Angel

-1.13 Danley
+0.90 Meriwether
+1.94 Diaz
+0.27 Bucknor

By the usual standard of 2 standard deviations, none of these six umpires shows individual significance. Diaz is the closest, at +1.94.
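For readers who want to reproduce z-scores like these, the standard tool is a two-proportion z-test comparing an umpire's strike rate for same-race pitchers against his rate for white pitchers. The pitch counts below are illustrative placeholders, not the actual data from my study.

```python
from math import sqrt

# Hedged sketch: two-proportion z-test with a pooled variance estimate.
# The inputs are raw strike and pitch counts for the two groups.
def z_score(strikes_a, pitches_a, strikes_b, pitches_b):
    p_a = strikes_a / pitches_a
    p_b = strikes_b / pitches_b
    pooled = (strikes_a + strikes_b) / (pitches_a + pitches_b)
    se = sqrt(pooled * (1 - pooled) * (1 / pitches_a + 1 / pitches_b))
    return (p_a - p_b) / se

# Illustrative counts only: 33.8% strikes on 1,500 same-race pitches
# versus 31.9% strikes on 4,000 pitches from white pitchers.
print(round(z_score(507, 1500, 1276, 4000), 2))
```

With sample sizes like 110 black pitches per umpire, the standard error is so large that even a sizable percentage-point gap stays well under two standard deviations, which is why none of the individual umpires reaches significance.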

Having said that, I should admit that these z-scores are probably underestimates. As I wrote previously, ball and strike calls are somewhat mean-reverting, because (for instance) after a 3-0 ball, a pitcher is likely to throw a strike, and after an 0-2 strike, the next pitch is likely to be a ball. To get more accurate significance tests, you'd want to adjust for count, like the Hamermesh authors did.

Here's how the black umpires called hispanic pitchers, and how the hispanic umpires called the black pitchers:

-0.48 Marquez
-0.85 Angel

-1.19 Danley
+1.61 Meriwether
+0.61 Diaz
+0.34 Bucknor

Nothing close to significance here, either.

In fact, if you look at *all* the umpires, not just the minorities, you find only three statistically-significant white/other comparisons: one umpire in black/white (in favor of white), and two umpires in hispanic/white (one each way). You'd expect about four or five significant results in each group (5% of 93 umpires), not one or two.

Again, I think that's because I didn’t control for count. If I had, Diaz would probably have come out as significant, and maybe Angel Hernandez too.

But, sticking to these numbers, none of the umpires are significant by themselves. But, if we take all six together, we do get significance, at the .04 level – 2.01 standard errors from expected.


The question now is, how should we interpret these results?

Let's start by asking, as Tango did, if Angel Hernandez might be responsible for the finding of significance. If Angel wasn't included, would the results still be significant?

To check, I replaced Angel, in the hispanic group, with Gary Darling, who was almost completely average in the hispanic/white comparison. And, yes, the results became non-significant:

With Angel ----- UPM = 0.76, p=0.04
With Darling --- UPM = 0.41, p=0.28

This is in my study. Going back to the original Hamermesh study, they found a UPM of 0.84. Extrapolating, we can assume that replacing Angel there would have reduced it to 0.45, which would be about 1.5 standard errors away. So replacing Angel with Darling would have eliminated significance even in the original paper.

So it's true that without Angel, there would be no effect. But, in any finding of statistical significance, in any field, you can always find outlying datapoints to remove and eliminate the effect. That doesn’t mean there was no effect in the first place. And it doesn't mean that you can say the outliers "caused" the effect.

For instance, suppose there's a theory that Russian roulette causes death. I check six participants, and one is dead. You could argue, "well, if you eliminate Bob from the study, Russian roulette is risk free." But that wouldn't be a fair argument.

However, there is a slightly different argument in this case that *might* have some merit. You might say, "look, the finding of significance rejects unbiasedness. So it really only tells us that *at least one* umpire may be biased. Maybe the only umpire that's biased is Angel, and the others are innocent. After all, if you eliminate Angel from the study, the significance disappears."

That's a better argument. But: why Angel? Diaz had a higher z-score, and if you neutralize Diaz, you probably also lose the significance. And probably even Marquez. When you're so close to non-significance – in this case, 2.01 standard errors, when 2.00 is the threshold – neutralizing any of the same-race-positive umpires will drop you below significance.

Where this argument makes the most sense is when one umpire is so clearly out of normal range, so obvious an outlier, that he almost demands to be a special case. But that's not happening here. Of the 93 umpires (of which about 70 are full-time enough to be considered), Angel Hernandez is only the 6th most favorable to hispanic pitchers (the other five are white). Why single him out?

Laz Diaz, on the other hand, is the most favorable of all 93 umpires towards black pitchers. By a lot. He's at 1.93 SD, while the second place ump is only at 1.51. But, still, that's based on only 134 pitches, and I don't think it would be fair, on the basis of this evidence, to suggest that he's biased. Besides, *someone* has to rank highest out of 93. And there are five black umpires. So, it's 1 in 20. Is that really that much of a coincidence?

It is certainly *possible* that any one of the moderate-to-high Z-score guys (Diaz, Meriwether, Angel, Marquez) is the only biased umpire. But this data isn't enough to tell. What you actually could do for the black umpires (because there are so few black/black pitches) is just use QuesTec (or trained observers) to call the pitches, and see if they're accurate or not. That way, you get right to the point. Did Laz Diaz lead MLB in apparent bias simply because the pitchers legitimately threw more strikes? If you have access to video archives, you can just look at the pitches themselves.

If they find that Laz Diaz did indeed make the right call as often as expected, that means he saw more black strikes just out of random chance. You'd be able to eliminate his datapoint, and your finding of significance will fade away.

But if you find that he made the wrong call a bit too often with black pitchers on the mound, you've confirmed the statistical evidence with some observational evidence.


Here's one last way to look at it, which might be more intuitive. There were 93 umpires in the study. If you sort them by the difference between their strike calls against whites and hispanics, here's what you get.


The two Xs are the hispanic umpires, and they're over to the "favor hispanics" side of the line. (The hyphens each represent two non-hispanic umpires.)

Here's the same for the black umpires with the black pitchers:


Again, more black umpires on the "favors blacks" side of the line.

Very roughly speaking, the significance level of .04 means that if you were to scatter six Xs randomly onto these lines, the chance they would land that far to one side (or farther) would be 1 in 25. At the same time, you can see that if you take one of the two leftmost Xs (Diaz and Hernandez) and move it to the middle, it doesn't look that significant anymore. And, if you move *both* of them to the middle, suddenly it looks really, really close to perfectly unbiased.
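The "scatter the Xs randomly" idea is essentially a permutation test, and the mechanics are easy to sketch. The differentials below are randomly generated placeholders, not the real umpire data; only the method carries over.

```python
import random

# Monte Carlo permutation test sketch. Under the null hypothesis of no
# same-race bias, the minority umpires' positions on the sorted line are
# a random draw from all 93 umpires' differentials.
random.seed(1)
diffs = [random.gauss(0, 1) for _ in range(93)]  # placeholder differentials

def prob_as_extreme(diffs, observed_sum, k=2, trials=100_000):
    """Chance that k randomly chosen umpires have a summed differential
    at least as large as the one observed."""
    hits = sum(sum(random.sample(diffs, k)) >= observed_sum
               for _ in range(trials))
    return hits / trials

# Suppose the two hispanic umpires' summed differential were, say, 3.0:
p_value = prob_as_extreme(diffs, 3.0)
print(p_value)
```

Run with the real differentials and the real observed sums, a test like this would reproduce (roughly) the 1-in-25 figure, and moving the leftmost Xs toward the middle would push the p-value well above .05.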

Let me repeat that. If you take one of Laz Diaz and Angel Hernandez out of the sample of umpires, and replace him with an average ump, every statistically significant effect in the original Hamermesh study becomes statistically insignificant. And if you replace *both* of those two umpires, the effect not only becomes insignificant, but almost completely disappears.

Make of that what you will.


Labels: , ,