## Thursday, August 30, 2007

### How well can anyone predict the NFL standings?

In a post at The Sports Economist, Brian Goff reports that pre-season predictions of NFL teams’ rankings turned out to be not all that accurate. Comparing predicted to actual ranking within the conference, Peter King (of Sports Illustrated) came up with an r-squared of only .11. His colleague “Dr. Z” did a bit better, at .21, while the ranking implied by Las Vegas oddsmakers did the best, at .26.

But you’d expect that correlations would appear fairly weak if you use rankings. If the talent distribution of teams is shaped like a bell curve, many of the teams will all be bunched together in the middle. Those teams would be so close together in talent that their actual results are effectively random. Combining the “random” teams in the middle with the obvious choices at the top and bottom, those numbers don’t seem all that bad.

To check, I ran a simulation. First, I chose a random “talent” for each of the 16 teams in a conference, from a normal distribution with mean .500 and SD of .143 (per Tangotiger’s technique). Then, I ran an independent 16-game season for each team, where their chances of winning any given game were equal to their talent. Finally, I computed the r-squared between their talent rank and the rank of their actual performance (both ranks from 1 to 16). I ran this simulation 10,000 times and got the average r-squared.
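For anyone who wants to try this at home, here is a minimal sketch of that kind of simulation using only the Python standard library. It is not the code behind the numbers in this post. The function names are made up for illustration, and two details are assumptions: tied win totals share their average rank, and r-squared is the squared Pearson correlation of the two rank lists. The optional `guess_sd` argument (default 0) lets the predictor see each team's talent only through normally distributed noise.

```python
import random


def average_ranks(values):
    """Rank values from 1 (highest) to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def r_squared(xs, ys):
    """Squared Pearson correlation; 0.0 if either list has no variation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def mean_rank_r2(trials=10_000, teams=16, games=16,
                 talent_sd=0.143, guess_sd=0.0, seed=1):
    """Average r-squared between predicted rank and actual rank over many seasons."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        talent = [rng.gauss(0.5, talent_sd) for _ in range(teams)]
        # The predictor's view of talent: exact when guess_sd is 0.
        guess = [t + rng.gauss(0.0, guess_sd) for t in talent]
        # Each team plays an independent season; win chance equals talent.
        wins = [sum(rng.random() < t for _ in range(games)) for t in talent]
        total += r_squared(average_ranks(guess), average_ranks(wins))
    return total / trials


if __name__ == "__main__":
    print(round(mean_rank_r2(trials=2_000), 2))
```

Calling `mean_rank_r2(guess_sd=0.05)` or `mean_rank_r2(guess_sd=0.07)` reruns the same seasons with a noisier predictor. Exact values will wobble a little with the seed and the tie-handling choice.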

The results:

r-squared = 0.53

This was a lot higher than I expected; but, remember, this is an upper bound on how well any prediction can do. It assumes that the predictor is capable of knowing a team’s actual talent level. In reality, that’s not possible. Even if you’ve watched every NFL game ever, the players’ historical performances have a fair bit of luck embedded in them too. If a QB had a passer rating of, say, 80 in his rookie season, you don’t really know if he was a 90 who had an unlucky year, or a 70 who had a lucky year, so your estimates are going to be somewhat off.

Suppose we build that uncertainty into the model, so that instead of correlating wins and actual talent, we correlate wins with a *guess* at the actual talent. We get the guess by starting with the talent, but adding an error term, with mean 0 and SD of 0.050.

If we do that, we now get:

r-squared = 0.48

Now, what about talent lost to injuries and criminal convictions and so forth? Those are unpredictable too. Suppose we lump those together with the other SD, and raise it from .050 to .070. Repeating the simulation, I get

r-squared = 0.44

This is still quite a bit higher than the actual performance of even the Las Vegas oddsmakers. What could explain the difference? It could be that our assumptions were too conservative: maybe the limit on human ability to predict talent isn’t actually .070, but even higher. But to get the r-squared down to .26, I had to raise the uncertainty from .070 all the way to .160, which seems way too high.

Another possibility is that the pundits just had a bad year. The SDs of the r-squareds over the 10,000 trials were all in the 0.19 range. So an r-squared in the .20s would be only a bit more than 1 SD away, and so should happen reasonably often.

One more factor: in my simulation, all the teams’ records were independent (as if every game was against a team in the other conference). But in real life, teams play each other, and upsets thus impact two teams instead of one. I’m too lazy to simulate that, but I’d bet the theoretical r-squared would go down, because the variance of the difference between two teams who wound up playing each other would increase.

But in any case, and even with knowledge as perfect as anyone can have, the r-squared couldn’t get any higher than somewhere in the .40s. In that light, the Vegas figure of .26 is looking pretty reasonable.


## Monday, August 27, 2007

### Managers and the Pythagorean Projection

A reader of this blog recently sent me a couple of old academic articles arguing about managers and Pythagoras.

In 1994, Ira Horowitz wrote an article (subscription required for all these academic articles) called “On the Manager as Principal Clerk.” I’m not exactly sure what the title means (I actually haven’t read this one, but it’s referenced by the two I did read), but it attempts to evaluate major league managers by whether and by how much their teams beat their Pythagorean projection.

I’ve never been convinced that beating Pythagoras is a managerial skill, and I’ve always thought it was mostly luck and relief pitching. And, as far as I can tell, Horowitz didn’t really produce any evidence to address the question. Instead, he quotes the projection, in this form:

(W/L) = (R/OR)^2

which is correct. But then, inexplicably, he tries to model each individual manager by a regression of the form

(W/L) = a(R/OR) + b(R/OR)^2

Why would you include a linear term when the proven relationship doesn’t include one? Maybe Horowitz explains in the paper, but I can’t think of any reason for doing this. Maybe it’s standard that if you suspect a quadratic relationship, you include the linear term -- but in this case, we know there’s no linear term.

Anyway, Horowitz “rates” each manager by the sum a+b. The thinking is that that represents the winning percentage the manager would squeeze out of his team if he had exactly as many runs scored as runs allowed. That is, the paper makes the assumption that a+b represents the skill of the manager, and is constant across all RS and RA for any possible team.

That doesn’t really make sense, and, in a 1997 rebuttal paper, “A Note on the Pythagorean Theorem of Baseball Production,” John Ruggiero, Lawrence Hadley, Gerry Ruggiero, and Scott Knowles provide a neat explanation why.

They note that for any two managers, with two different “a+b” parameters, there is a value of R/OR for which the two managers predict equal results. On one side of that equilibrium number, manager X appears to be better, while on the other side, manager Y appears to be better. This contradicts Horowitz’s assumption that “a+b” provides a well-ordered ranking of individual skill.
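To see why two fitted curves of the form a(R/OR) + b(R/OR)^2 must trade places somewhere, set them equal and solve for the runs ratio. Here is a small sketch; the coefficients below are hypothetical, picked only so the crossover lands at 1.2, and are not taken from either paper.

```python
def predicted_win_loss(a, b, runs_ratio):
    """Horowitz-style fitted win-loss ratio: a*(R/OR) + b*(R/OR)**2."""
    return a * runs_ratio + b * runs_ratio ** 2


def crossover_ratio(a1, b1, a2, b2):
    """Positive runs ratio at which two managers' curves meet, or None.

    Setting a1*x + b1*x**2 == a2*x + b2*x**2 and dividing by x (x > 0)
    gives x = (a2 - a1) / (b1 - b2).
    """
    if b1 == b2:
        return None  # parallel in the quadratic term: no positive crossing
    x = (a2 - a1) / (b1 - b2)
    return x if x > 0 else None


# Hypothetical (a, b) pairs, for illustration only.
manager_x = (0.40, 0.62)  # a + b = 1.02
manager_y = (0.64, 0.42)  # a + b = 1.06

if __name__ == "__main__":
    print(crossover_ratio(*manager_x, *manager_y))
    for ratio in (1.0, 1.5):
        print(ratio,
              predicted_win_loss(*manager_x, ratio),
              predicted_win_loss(*manager_y, ratio))
```

With these made-up numbers, manager Y has the higher a+b and so "wins" at a runs ratio of 1.0, but manager X comes out ahead at 1.5. That is the kind of reversal the rebuttal is pointing at.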

As the rebuttal puts it,

“... the results indicate that [Al] Lopez has a higher predicted win-loss ratio than [Earl] Weaver for any runs ratio greater than 1.2 while Weaver has a higher predicted win-loss ratio for any runs ratio less than 1.2. In other words, Horowitz’s index indicates that Lopez is the better manager as long as a team is expected to outscore their opponents by 20%; otherwise Weaver is the better manager. We believe this is illogical.”

And it is indeed illogical. Unfortunately, so are some of these authors’ own criticisms. I’ll mention just one. The rebuttal paper comes up with this identity:

W - L = R – OR – E

where “E” is “excess runs” – the sum of all runs, for and against, that create a win or loss by more than the single run necessary to win the game.
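The identity is easy to check on a handful of games. In this sketch the scores are hypothetical, and I'm assuming E is signed: each win contributes its margin beyond the first run as a positive number, and each loss contributes its margin beyond the first run as a negative number. That sign convention is what the identity needs in order to hold.

```python
# Hypothetical (runs for, runs against) results, for illustration only.
games = [(5, 4), (9, 2), (1, 3), (0, 7), (6, 5)]

wins = sum(rf > ra for rf, ra in games)
losses = sum(rf < ra for rf, ra in games)
runs_for = sum(rf for rf, _ in games)
runs_against = sum(ra for _, ra in games)


def excess_runs(rf, ra):
    """Signed margin beyond the single run needed to decide the game."""
    margin = rf - ra
    if margin > 0:
        return margin - 1   # surplus runs in a win
    if margin < 0:
        return margin + 1   # surplus runs allowed in a loss (negative)
    return 0


excess = sum(excess_runs(rf, ra) for rf, ra in games)

if __name__ == "__main__":
    print(wins - losses, runs_for - runs_against - excess)
```

Here the team is outscored in blowouts but wins the close ones: the run differential is zero, E is negative, and W - L comes out positive.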

The identity is, I guess, true, but what good is it? The authors don’t say, except to argue that because of their identity, the Pythagorean Projection is wrong because “the functional form of the equation is misspecified, and E is omitted.” I have no idea what they really mean, but it sounds like those silly arguments you hear that it’s impossible that OPS can be any good because you shouldn’t add two numbers with different denominators.

Based on that logic, the paper also argues that there is no mechanism by which managers can beat Pythagoras: “Apparently, the belief is that an efficient manager will forgo runs in a ball game once a lead is obtained, and then use these forgone runs during future games when his team is behind.”

Well, no, there are other, more realistic interpretations. One reasonable idea is that managers will use their best pitchers when it matters most, thus having a better winning percentage in one-run games, which leads to outdoing Pythagoras. The authors improperly reject this idea, too, on the grounds that they can come up with counterexamples where this doesn’t happen.

These, and many of the authors’ other comments, suggest that they don’t really understand the issues behind Pythagoras at all, and the main contribution of their paper is the explanation of why Horowitz’s measure doesn’t work.

Finally, Horowitz responds in an article called “Pythagoras’s Petulant Persecutors.” I think he rebuts the rebuttal correctly, but in academic and economics terms. Even so, the three papers don’t really tell us anything useful about anything.

Chris Jaffe’s recent article on the 2007 Diamondbacks, though, does. As has been noted in many different places, the D-Backs are in first place in their division despite allowing significantly more runs than they have scored. As of right now, they are 12 games above their Pythagorean projection.
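For reference, here is a quick sketch of how "games above Pythagoras" can be computed, using the exponent-2 form quoted earlier, which is equivalent to a winning percentage of R^2 / (R^2 + OR^2). The record and run totals below are hypothetical stand-ins, not the Diamondbacks' actual numbers.

```python
def pythagorean_win_pct(runs, opp_runs, exponent=2):
    """Expected winning percentage implied by (W/L) = (R/OR)**exponent."""
    return runs ** exponent / (runs ** exponent + opp_runs ** exponent)


def wins_above_projection(wins, losses, runs, opp_runs):
    """Actual wins minus the wins the Pythagorean projection expects."""
    games = wins + losses
    return wins - pythagorean_win_pct(runs, opp_runs) * games


if __name__ == "__main__":
    # Hypothetical team: 70-60 while being outscored 580 to 620.
    print(round(wins_above_projection(70, 60, 580, 620), 1))
```

A team that is outscored projects to a sub-.500 record, so a winning record on those run totals shows up as a large positive number.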

Is that just luck? Is Arizona likely to regress to the mean and fall back out of contention? Jaffe says no – they will continue to win. The reason: they aren’t outperforming because of luck, but because their mop-up men stink, with a combined ERA above 7.00. Since those pitchers are often being used when the game is completely out of hand anyway, those allowed runs are less important than others. The result: the Diamondbacks lose a lot of blowouts, and therefore appear to be “lucky” in beating Pythagoras.

Jaffe credits the manager for this; he’s got crappy relievers, and has succeeded in saving them for situations where it doesn’t matter. That’s one way manager skill can beat Pythagoras: know which pitchers are better and which are worse, and save the worse ones for situations where it doesn’t matter how bad they are.

Alternatively, maybe those relievers have just been unlucky; after all, no manager will keep a pitcher on the roster if the manager believes his real skill is in the 7.00 range. If that’s the case, then, as Jaffe says, it won’t be the W-L record that reverts to the mean; it’ll be the opposition runs against.

Anyway, it seems to me that the discrepancy is still due to luck – at least in the sense that the Diamondbacks were lucky that the excess runs came when it didn’t matter. That (good) luck is offset by (bad) luck, in that the long relievers are giving up a lot more runs than they should. Combining those two leads to Chris’s conclusion that the D-Backs really are as good as their actual record.

All this leaves the question of managers and Pythagoras still open. In a study last year, Jaffe found that some managers were able to consistently beat Pythagoras over a long career. At the time, I wondered if that means discrepancies were actually something other than luck. This could be one reason: allowing blowouts to get out of hand when it doesn’t matter anyway. It doesn’t seem to me like that should be enough, but it’s worth looking into.


## Thursday, August 23, 2007

### "Freakonomics" on the Rangers' 30-run line score

Freakonomics' Stephen Dubner argues that the Rangers' line score in yesterday's 30-3 win is unusual. Dubner says he would have predicted something like this:

431 056 353

But the actual line score was

000 509 0(10)6

There was a lot more clustering than Dubner would have expected.

The post went up 40 minutes ago, but already commenters are arguing that there are reasons that runs come in clusters in baseball. TWstroud makes the important point that there is a "queueing issue," in that the first 2 or 3 singles don't score right away, but instead join a "queue" that makes it easier to score future runs.

One point I'll make is that record-breaking scores are more likely to be clustered than non-record-breaking scores. Scoring 30 runs likely means you hit better than normal with runners in scoring position. That means your hits were clustered more than usual.

That is, suppose team A and team B both put 40 men on base; team A scores 20 runs, team B scores 28 runs. Team B left fewer men on base, which means more clustering of hits.

The Rangers had 29 hits and 8 walks. Scoring 30 out of 37 baserunners, you should expect few men left on base, which means the hits were clustered more than usual, which means that the runs must have been scored in bunches.


## Thursday, August 16, 2007

### Lichtman study fails to find umpire bias

On “The Book” blog, Mitchel Lichtman (mgl) runs his own version of the Hamermesh study, and finds no race bias on the part of the umpires.

He does it without using regression, adjusting the umpires’ stats for pitcher characteristics using more straightforward techniques. Definitely worth reading the whole thing now, even though mgl promises there’s more to come.


### An alternative significance test for the Hamermesh data

A couple of days ago, I looked at the data from the Hamermesh (et al) study on racial bias among umpires. There, I gave an intuitive argument that there is no bias evident between whites and blacks. Specifically, I showed that if black umpires had called only five fewer strikes in favor of black pitchers, that would have wiped out all traces of differential treatment.

Here, I’m going to try to take that informal argument and turn it into a statistical test, kind of. Any statisticians reading this, tell me if I’ve done something wrong.

I’m going to start with the matrix of results from Table 2 of the study, except that I’m going to leave out the column of Asian pitchers, because they won’t affect the results.

Here’s the table:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.80 30.77
Black Umpire-- 31.93 30.87 30.76

(Apologies for the crappy-looking table, but I’ve had trouble with HTML tables in Blogger.)

The numbers are the percentages of non-swung-at pitches that are called strikes.

What I’m going to try to do is find a table where the values are close to the ones above, but where they show no bias. Then, I’m going to check for statistical significance between the two tables.

The most obvious non-biased table might be to make all the rows the same. I set every cell to the overall average of its column, which got me this:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.05 31.45 30.62
Hspnc Umpire-- 32.05 31.45 30.62
Black Umpire-- 32.05 31.45 30.62

If the data had come out like this, we would all agree that umpires aren’t biased at all.

Now, suppose umpires truly aren’t biased, and this is actually the way they call balls and strikes. That means the data came out the way they did only because of random chance. The question is: what is the probability that random chance could turn this beautiful unbiased table into the real table 2?

I’ll start by figuring out how many pitches would have to change to turn the second table into the first. Take the top-left cell, for instance. The difference between 32.06 and 32.05 is 0.01 of a percent. There were 741,729 called pitches in that cell. 0.01 percent of that is 74. So the difference is that in the first table, white umpires called 74 more strikes on white pitchers than expected.

Repeating that calculation for every cell gives this result:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +74 +47 -3
Hspnc Umpire-- -34 +26 +1
Black Umpire-- -56 -81 +2
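Here is a sketch of that per-cell calculation for the four white/black cells, whose pitch counts are quoted in these posts (741,729 above; the other three from the table in the earlier post on the study). The Hispanic cells are left out because their counts aren't listed here. Keys are (umpire, pitcher), and the function name is just for illustration.

```python
# (observed strike %, unbiased-table strike %, called pitches)
cells = {
    ("White", "White"): (32.06, 32.05, 741_729),
    ("White", "Black"): (30.61, 30.62, 25_108),
    ("Black", "White"): (31.93, 32.05, 46_825),
    ("Black", "Black"): (30.76, 30.62, 1_765),
}


def extra_strikes(observed_pct, unbiased_pct, pitches):
    """Strikes called beyond what the unbiased table implies, to the nearest pitch."""
    return round((observed_pct - unbiased_pct) / 100 * pitches)


discrepancies = {key: extra_strikes(*vals) for key, vals in cells.items()}

if __name__ == "__main__":
    for key, value in discrepancies.items():
        print(key, value)
```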

Now, what kind of bias was the study looking for? It was looking to see if same-race umpires call more strikes on same-race pitchers than on different race pitchers.

The same-race case comprises the three cells on the diagonal; the different-race cases are the other six cells. Adding up the pitches gives:

Same race: 74 + 26 + 2 = +102
Different race: -34 - 56 + 47 - 81 - 3 + 1 = -126

The difference between the two is 228. The same-race umpires called 228 more strikes on same-race pitchers than on different-race pitchers. Is 228 big enough to be significant? We’ll check, but, so far, at least we know that the bias goes in the right direction.

How can we test this for significance? Well, the result of any cell is a binomial distribution. The mean is zero. The variance is given by p(1-p)n, where p is the probability of a strike (say, 0.31) and n is the number of observations. For the white/white cell, the variance is 0.31 * 0.69 * 741,729, which equals 158656.

Just for information, here are the variances I calculated for all 9 cells:

Pitcher ------- White Hspnc Black
--------------------------------
White Umpire-- 158656 50680 5370
Hspnc Umpire-- __5260 _1566 _181
Black Umpire-- _10015 _2969 _378

Now, the calculation of interest is the diagonal cells minus the non-diagonal cells; we did that calculation a few paragraphs ago and got 228. If you number the cells 1-9 starting horizontally, the calculation was

cell 1 + cell 5 + cell 9 – cell 2 – cell 3 – cell 4 – cell 6 – cell 7 – cell 8.

These variables are all independent, so the variance of that expression is just the sum of the variances of the nine terms. Add up the nine variances, and you get 242,338. The standard deviation is the square root of that number, which is about 492.

Our result was a difference of 228 extra strikes. That’s less than one-half an SD away from zero, and so obviously not statistically significant.
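The arithmetic can be rechecked in a few lines. This sketch uses the nine variances exactly as listed in the table above and p = 0.31 as the assumed strike probability. Summing the listed values gives 235,075 and an SD of about 485, a little below the 242,338 and 492 quoted above. Either way, the 228-pitch difference is under half an SD.

```python
import math


def binomial_variance(pitches, p=0.31):
    """Variance of the strike count in a cell: n * p * (1 - p)."""
    return pitches * p * (1 - p)


# The nine cell variances as listed in the table above, row by row.
variances = [158656, 50680, 5370,
             5260, 1566, 181,
             10015, 2969, 378]

total_variance = sum(variances)
sd = math.sqrt(total_variance)
z = 228 / sd  # same-race minus different-race discrepancy, in SDs

if __name__ == "__main__":
    print(round(binomial_variance(741_729)))  # the white/white cell
    print(total_variance, round(sd, 1), round(z, 2))
```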

By the way, this test checks for unbiased umpires, but it doesn't test that the "null hypothesis" unbiased matrix is actually a decent fit overall. To do that, you can just add up the nine pitch-discrepancy cells (that is, add the non-diagonals instead of subtracting). That result should also be normal with mean 0 and SD 492. The farther you get from zero, in either direction, the more likely the test matrix is a crappy fit.

In this case, the total is -24, which is close to zero. So we can conclude that not only is the measure of bias close enough to zero, but also that the "null hypothesis" matrix is a pretty good fit to the data as a whole.

---

But there are lots of other unbiased tables we could test for fit. The one we tried had all the rows the same, but that isn’t necessary. Here’s an alternative “null hypothesis” table:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.32 30.46
Black Umpire-- 31.93 31.34 30.48

This new matrix is as unbiased as the other one, but that may not be obvious when you look at it. It’s different in that, here, we don’t demand that all races of umpires be exactly the same. We just demand that they treat the various pitchers in the same way as the other umpires.

In this table, every race of umpire calls 0.59 percentage points fewer strikes for hispanic pitchers than for white pitchers. And every race of umpire calls 0.86 percentage points fewer strikes for black pitchers than for hispanic pitchers. Sure, white umpires appear to have a bigger strike zone than hispanic or black umpires, but they have that same bigger strike zone regardless of who the pitcher is.
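One way to build a table like this: keep the white umpires' row as the reference, then shift it up or down for each other group by that group's gap on white pitchers. A sketch (the function name and the choice of the white row as reference are just for illustration):

```python
# Observed strike percentages from Table 2; columns are White, Hspnc, Black pitchers.
observed = {
    "White": [32.06, 31.47, 30.61],
    "Hspnc": [31.91, 31.80, 30.77],
    "Black": [31.93, 30.87, 30.76],
}


def additive_null(table, reference="White"):
    """Unbiased table where every umpire row is the reference row plus a constant."""
    ref_row = table[reference]
    null = {}
    for umpire, row in table.items():
        shift = row[0] - ref_row[0]  # this group's overall strike-zone offset
        null[umpire] = [round(value + shift, 2) for value in ref_row]
    return null


null_table = additive_null(observed)

if __name__ == "__main__":
    for umpire, row in null_table.items():
        print(umpire, row)
```

Every row has the same gaps between pitcher groups, which is what "unbiased" means in this model; only the overall level differs by umpire group.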

If that were the model, would the observed results be statistically significant?

Let’s check. Here is the matrix of pitch discrepancies:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +35 +03
Black Umpire-- +00 -74 +05

Same race: 35 + 5 = +40
Different race: -74 + 3 = -71

So the “own race/different race” difference is 111. Still in the right direction, but even smaller than the previous model. And still a very low significance level, less than a quarter of the standard error of 492.

Even without the formal significance test, we can see intuitively that the result is very close to non-bias. If only 117 pitches were called differently, we would have absolutely no bias at all! And 117 pitches over three seasons is a very small number.

---

We tested two different models of unbiasedness, and both of them were a good fit with the observed data. But there’s no reason to stop at two. We could build another, and another, and another. That might seem like cheating, but I don’t think it is. After all, when you do a regression, you find the best-fit straight line out of a literal infinity of possible lines. Finding the best-fit matrix (out of all the matrices that represent unbiased umpires) seems like the same idea.

If there actually *were* large amounts of between-race bias in the data, we wouldn’t be able to fit *any* unbiased matrix to the data with low significance levels. It just turns out that the apparent bias in the real-life data was so small that both “null hypothesis” matrices went unrejected with low significance levels.

Which of the two matrices is better? I suppose you could argue that the one with the smaller significance level is better. But this exercise won’t really help tell you what the *real* umpire characteristics are. Maybe all racial groups of umpires do call the game the same, and the first matrix matches real-life better. Or maybe the groups call the game differently, and the second matrix is more realistic. You can’t tell from the data; you have to use your judgement. It’s as if you do a regression and you find the equation comes out to y=4x+1. Are you sure real life is y=4x+1? No, not necessarily; it could really be y=3.99x+1.01; either is consistent with the data. 4x+1 is *more* consistent, but if you have really good reason to think 3.99x+1.01 is better, then, hey, go for it.

So all I think we can say here is that, from the evidence of this test, there is absolutely no reason to suspect overall umpire bias in favor of their own race.

That is, no umpire bias in the sense of “a random same-race encounter having a higher strike probability than a random mixed-race encounter,” because that’s all our test checked for. It’s possible there are other forms of bias, and you’ll need other tests. For instance, suppose the matrix looked like this:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +01 +06 +00
Hspnc Umpire-- -04 +99 -12
Black Umpire-- +06 +99 -01

If you run the above test on this data, you get something extremely insignificant. And that’s correct, because what you’re trying to prove – that a same-race encounter is more likely to lead to a strike – isn’t true. However, there is another manifestation of umpire bias shown in the data -- a same-race encounter involving a *hispanic* umpire will be biased, and a black/hispanic encounter will also be positively biased. And there are other tests, obviously, that will find an effect for that.

Also, even by our measure, our test may not be powerful enough to pick up an effect. Our measure of “same-race” bias was the overall difference in number of pitches. A different test that considers, say, the sum of squares of the number of pitches in each cell, might be more powerful and find an effect this test didn’t.

The Hamermesh authors did try to test roughly the same thing we did: the difference between same-race and different-race. In fairness, they did do a more sophisticated analysis, doing a regression and correcting for pitcher and count. But our own test found so little bias – only 117 pitches for complete unbiasedness! – that, even with their extra sophistication, I still wonder how their test was able to come up with statistical significance.


## Tuesday, August 14, 2007

### The Hamermesh study: what percentage of pitches are biased?

The Hamermesh study on racial bias among umpires (which I posted about here) concluded that “a given called pitch is approximately 0.34 percentage points more likely to be called a strike if the pitcher and umpire match race/ethnicity.” But I don’t think they actually say what percentage of calls are biased. I’ve tried to figure this out for myself, without using regression. It depends on the assumptions you make; under the assumptions I’ll show you in a bit, I get 0.13%.

---

I’m going to start with a simple example. Suppose you have only white umpires and black umpires, and only white pitchers. The white umpires call 31% strikes, but the black umpires call only 30% strikes. We don’t know yet whether this is bias or just random fluctuation. But we can say that for whatever reason, the pitchers are advantaged by 1 percentage point with a white umpire, relative to a black umpire.

But suppose it were bias. What percentage of (called) pitches would be affected?

Your first inclination might be to say that 1% of pitches are affected when there’s a white umpire, but none are affected when there’s a black umpire.

But, wait. We don’t know what direction the bias goes. It’s possible that all the bias is from the black umpire. He should be calling 31% strikes, but he’s only calling 30%. In that case, 1% of pitches are affected when there’s a black umpire, but 0% when there’s a white umpire.

There’s still another case. Perhaps the pitcher should actually get 30.5% strikes, and both umpires are biased by 0.5%. In that case, 0.5% of pitches are affected when there’s a black umpire, and 0.5% are affected when there’s a white umpire.

Indeed, there is an infinity of possibilities. Maybe the black umpire is biased 0.3% and the white 0.7%. Maybe the black umpire is biased 2%, and the white umpire is biased negative 1%. And so on.

So we can’t say what percentage of pitches are biased for which umpire. We can’t even say what percentage of total pitches are biased. In the last case I suggested, where the black umpire was biased 2% and the white umpire was biased 1% in the other direction, there would be 1.5% of total pitches biased (assuming the white and black umpires called equal numbers of pitches). But in all the other cases I used as examples, it would be only 0.5%.

So it all depends on your assumptions of where the bias is.
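The scenarios above can be tabulated in a few lines. This sketch assumes, as in the example, that white and black umpires call equal numbers of pitches, so the overall share of biased calls is just the average of the two absolute biases. Each scenario keeps the observed 1-point gap between the two groups.

```python
def share_of_calls_biased(white_bias, black_bias):
    """Percent of all calls that are biased, with equal pitch counts per group."""
    return (abs(white_bias) + abs(black_bias)) / 2


# (white umpire bias, black umpire bias) in percentage points.
scenarios = [
    (1.0, 0.0),    # all the bias is the white umpire's
    (0.0, 1.0),    # all the bias is the black umpire's
    (0.5, 0.5),    # split evenly
    (0.7, 0.3),
    (-1.0, 2.0),   # biased in opposite directions
]

if __name__ == "__main__":
    for white, black in scenarios:
        print(white, black, share_of_calls_biased(white, black))
```

Only the last scenario moves the total away from 0.5%, which is why the answer hinges on where you assume the bias sits.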

---

OK, now let’s move to a real-life example, one where the umpires don’t call equal numbers of pitches. In fact, I’ll take the “white pitcher” column of the paper’s Table 2. Here it is:

White umpires: 32.06% strikes.
Hspnc umpires: 31.91% strikes.
Black umpires: 31.93% strikes.

Now, again, let’s suppose all the differences between umpires are bias. That means we want equal numbers for all umpires. Since the overall strike rate for white pitchers was 32.05% (from Table 2), we might choose that. So we *really* want the table to look like:

White umpires: 32.05% strikes.
Hspnc umpires: 32.05% strikes.
Black umpires: 32.05% strikes.

How do we make that happen? Well, we can subtract called strikes from the first case, and add them to the second and third cases:

White umpires: subtract 0.01 from the original 32.06
Hspnc umpires: add 0.14 to the original 31.91
Black umpires: add 0.12 to the original 31.93.

This will make all the percentages equal to 32.05%.

Since we assumed the differences represented bias, that means that

0.01% of white umpires’ calls were biased;
0.14% of hispanic umpires’ calls were biased;
0.12% of black umpires’ calls were biased.

Those numbers depended on our choice of 32.05% as the baseline. Another choice we can make is to use 32.06% as the baseline instead, which assumes that both hispanic and black umpires should call the same 32.06% strikes as the white umpires. In that case, you need to add 37 strikes to the hispanic umpires, and 61 strikes to the black umpires. That means

0.00% of white umpire calls are biased
0.15% of hspnc umpire calls are biased
0.13% of black umpire calls are biased

Which do you like? Either: there’s no real reason to choose one over the other, and no way to prove which is more accurate.

We could go extreme the other way, by assuming that the hispanic umpires are correct, and the “real” percentage is 31.91%. In that case,

0.15% of white umpire calls are biased
0.00% of hspnc umpire calls are biased
0.02% of black umpire calls are biased

Do you like that better? It’s arbitrary still.

Arbitrary or not, my feeling is that we should find a pattern where all three groups of umpires seem about equally biased. That way, we don’t have to single out one group. Maybe we can choose 31.99 as our estimate of the “true” value. That would mean:

0.07% of white umpire calls are biased
0.08% of hspnc umpire calls are biased
0.06% of black umpire calls are biased.

This is my favorite, because it spreads the blame around; in the absence of evidence that one group is “guiltier” than another, this seems like the ethical default assumption.

Now, back to the original question: what percentage of *all pitches* are biased? From this last distribution of bias, it looks like about 0.07% of pitches are biased (regardless of who the umpire is). That’s one out of every 1,400 pitches.
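To make the dependence on the baseline concrete, here is a sketch that recomputes the white-pitcher column for each of the four baselines tried above. The implied bias for a group is just the absolute gap between its strike rate and the assumed "true" rate; the function name is illustrative.

```python
# Strike percentages for white pitchers, by umpire group.
white_pitcher_rates = {"White": 32.06, "Hspnc": 31.91, "Black": 31.93}


def implied_bias(rates, baseline):
    """Percent of each group's calls that differ from the assumed true rate."""
    return {umpire: round(abs(rate - baseline), 2)
            for umpire, rate in rates.items()}


if __name__ == "__main__":
    for baseline in (32.05, 32.06, 31.91, 31.99):
        print(baseline, implied_bias(white_pitcher_rates, baseline))
```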

I’ll quickly do the Hispanic and black pitchers too. For Hispanic pitchers, it looks like 31.15% might be a pretty good stab at the “real” strike percentage, the one that makes all the umpires look equally biased. That means that for Hispanic pitchers,

0.32% of white umpire calls are biased
0.35% of hspnc umpire calls are biased
0.28% of black umpire calls are biased.

Since white umpires are an overwhelming majority, the overall average of these numbers is probably 0.32%.

For black pitchers, let’s use 30.69% as the base:

0.08% of white umpire calls are biased
0.08% of hspnc umpire calls are biased
0.09% of black umpire calls are biased

That’s about 0.08% overall for the black pitchers.

Summarizing the three groups of pitchers:

White pitchers: 0.07% of calls are biased
Hspnc pitchers: 0.32% of calls are biased
Black pitchers: 0.08% of calls are biased

Hispanic pitchers are 25% of the total, and black pitchers about 5%. If we weight the three groups accordingly, we get:

Overall: 0.13% of all umpire calls are biased. That’s about 1 pitch in 750.
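The weighting works out like this. The 25% and 5% shares are the ones given above; the 70% share for white pitchers is my assumption here (the remainder, ignoring Asian pitchers).

```python
# Percent of calls biased, by pitcher group (from the summary above).
bias_by_pitcher = {"White": 0.07, "Hspnc": 0.32, "Black": 0.08}

# Share of all called pitches; the white share is an assumed remainder.
pitch_share = {"White": 0.70, "Hspnc": 0.25, "Black": 0.05}

overall_pct = sum(bias_by_pitcher[group] * pitch_share[group]
                  for group in bias_by_pitcher)
pitches_per_biased_call = 100 / overall_pct

if __name__ == "__main__":
    print(round(overall_pct, 2), round(pitches_per_biased_call))
```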

Again, this is subject to assumptions, not all of which might be true:

-- we assume that all umpires have an equal propensity to be biased for a particular race of pitcher.

-- we assume that 100% of the discrepancies actually seen represent bias. This is almost certainly not true, because there is inherent random variation in what kinds of pitches umpires will see.

-- more importantly, we assume that bias exists. I’m still not convinced that the amounts of variation seen in the study are statistically significant.

But it appears that if you do accept the above assumptions, it follows that at most 1 pitch in 700 will be biased. Unless I’ve screwed up the logic somewhere.


### The Hamermesh study on umpires and race

Yesterday, an article in Time Magazine discussed a new study about biased umpiring. Here’s the study. It’s called “Strike Three: Umpires’ Demand for Discrimination,” by Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh.

The paper is quite similar to the Price/Wolfers paper on basketball referees (which I earlier reviewed in three parts). However, the data are much less convincing, and there are seeming conflicts in the results that I don’t understand.

The authors divide umpires and pitchers into four racial (or ethnic) groups: white, black, Hispanic, and Asian. (Hispanic is defined as any player born in one of several Spanish-speaking countries, regardless of skin color.) For three full seasons of play-by-play data, they counted balls and called strikes for each combination of pitcher and umpire. They conclude that pitchers have an advantage when facing an umpire of their same group. White umpires seem to favor white pitchers (over black, hispanic and asian), black umpires favor black pitchers, and so on. That is, umpires discriminate in favor of their own kind.

---

To show what they did, I’ll simplify things by ignoring the Hispanic and Asian groups (there are no Asian umpires anyway), and just show the data for white and black. Here’s the authors’ summary of the results (taken from their table 2):

| | White Pitcher | Black Pitcher | Difference |
|---|---|---|---|
| White Umpire | .3206 (741,729) | .3061 (25,108) | .0145 |
| Black Umpire | .3193 (46,825) | .3076 (1,765) | .0117 |

The numbers in the table are the proportions of called pitches that were strikes; the number of pitches follows in brackets.

You can see that white pitchers got more strike calls than black pitchers, regardless of who the umpire was; the overall numbers were 32.05% strikes for the white pitchers, and 30.62% for the black pitchers. We can reasonably conclude that white pitchers are actually more skilled in this regard. However, the size of the black pitchers’ disadvantage depends on the umpire. White umpires gave black pitchers a .0145 disadvantage relative to white pitchers, while black umpires cut that disadvantage to .0117.

This indeed seems to show that umpires favor their own group. But what would the chart look like if there were no bias at all? There are many ways to equalize the two groups of umpires. The easiest is to subtract .0028 from the Black/Black cell, in order to widen the .0117 to .0145. The chart would then look like this:

| | White Pitcher | Black Pitcher | Difference |
|---|---|---|---|
| White Umpire | .3206 | .3061 | .0145 |
| Black Umpire | .3193 | .3048 (was .3076) | .0145 (was .0117) |

What’s the real difference between the two cells? In the real chart, black umpires called 543 strikes out of 1,765 pitches. To get the bottom chart, black umpires would have had to call only 538 strikes out of 1,765 pitches. The difference: five pitches.

Over more than 7,000 ballgames over three seasons, the two groups of umpires are five pitches away from showing absolutely no racial bias. Obviously, that’s not statistically significant.
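A minimal sketch of that five-pitch arithmetic, using only the rates and pitch count from the tables above. Rounding to whole pitches is an assumption of the sketch, and the variable names are illustrative.

```python
# Black umpire / black pitcher cell from the study's Table 2.
PITCHES = 1765
ACTUAL_RATE = 0.3076

# Rate that would make the black umpires' gap match the white umpires' .0145.
NO_BIAS_RATE = 0.3193 - 0.0145  # .3048

def called_strikes(rate, pitches):
    """Convert a strike rate into a whole number of called strikes."""
    return round(rate * pitches)

actual_strikes = called_strikes(ACTUAL_RATE, PITCHES)
no_bias_strikes = called_strikes(NO_BIAS_RATE, PITCHES)
pitch_gap = actual_strikes - no_bias_strikes

print(actual_strikes, no_bias_strikes, pitch_gap)
```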

If you repeat this analysis, this time comparing white and Hispanic pitchers and umpires, the difference is 35 pitches out of 7,323. It’s a bit higher a proportion, half a percent, but still not statistically significant (the SD, using the binomial distribution, is 39).
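The binomial SD quoted there can be checked with a short sketch. The strike probability of 0.31 is an assumption (roughly the called-strike rate in the tables above); the post does not say which rate was used.

```python
import math

def binomial_sd(n, p):
    """Standard deviation of the number of successes in n independent trials."""
    return math.sqrt(n * p * (1 - p))

PITCHES = 7323
ASSUMED_STRIKE_RATE = 0.31  # assumption: approximate called-strike rate
OBSERVED_GAP = 35           # pitches, from the paragraph above

sd = binomial_sd(PITCHES, ASSUMED_STRIKE_RATE)
gap_in_sds = OBSERVED_GAP / sd

print(round(sd, 1), round(gap_in_sds, 2))
```

A gap of less than one SD is well short of the usual two-SD threshold for significance.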

Finally, if you compare Hispanic and black, the difference is less than two pitches.

Why are the discrepancies so small over such a large sample of data? It’s because of the small samples in the Hispanic/Hispanic and black/black cells. There are about three times as many white pitchers as Hispanic, and 30 times as many white pitchers as black. There are also few black umpires, and even fewer Hispanic umpires. The result: there are, literally, 420 times as many white/white datapoints as black/black datapoints.

Given all this, I’m not sure how the authors manage to come up with statistically significant (and baseball significant) findings of bias. I’ll return to that in a bit.

---

When the umpire and pitcher are of the same race, the study calls it “UPM” for “umpire/pitcher match”. In the black/white pairings, there are 743,494 UPM pitches, but only 71,933 non-UPM pitches. More than 99% of UPM pitches involve a white pitcher, while only 65% of non-UPM pitches do. And so, since white pitchers have higher strike frequencies than black pitchers, the UPMs also have higher strike frequencies than non-UPMs:

.3206 UPMs
.3147 non-UPMs

This has nothing to do with racial bias – it’s just a consequence of how the numbers work out. White pitchers throw more strikes than black pitchers, and the UPM group is dominated by the white/white group. So even if umpires were unbiased – or even moderately biased in favor of blacks – we’d still see this effect.
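The two pooled rates can be reproduced from the four cells of the 2x2 table. This is a minimal sketch; the dictionary layout and function name are illustrative, and the data are the rates and pitch counts quoted earlier.

```python
# (umpire, pitcher) -> (called-strike rate, number of called pitches)
CELLS = {
    ("white", "white"): (0.3206, 741729),
    ("white", "black"): (0.3061, 25108),
    ("black", "white"): (0.3193, 46825),
    ("black", "black"): (0.3076, 1765),
}

def pooled_rate(match):
    """Pitch-weighted strike rate over cells where umpire == pitcher is `match`."""
    chosen = [(rate, n) for (ump, pit), (rate, n) in CELLS.items()
              if (ump == pit) == match]
    total_pitches = sum(n for _, n in chosen)
    return sum(rate * n for rate, n in chosen) / total_pitches

upm_rate = pooled_rate(True)
non_upm_rate = pooled_rate(False)

print(round(upm_rate, 4), round(non_upm_rate, 4))
```

The UPM rate is almost identical to the white/white cell because that cell supplies more than 99% of the UPM pitches.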

I mention this because it completely explains two of the columns in the study’s Table 3. In columns “1d” and “2d,” the authors run a regression that includes UPM. The table shows, not surprisingly, that the coefficient for UPM is hugely significant. Again, that’s just because the white pitchers throw more strikes than the black pitchers.

That problem goes away if you include a race variable in the regression, such as pitcher race. In that case, you wind up adjusting for the fact that white pitchers are strikier than black, and the coefficient for UPM starts to make more sense.

That’s what the study does in the other columns of Table 3 – notably, column “3d,” which, I think, is the major result of the paper. There, the authors run a regression that includes UPM, but also a bunch of control variables. They have an indicator variable for each of the 12 possible counts (from 0-0 to 3-2), and an indicator variable for each of the 900 or so pitchers in the study. The inclusion of the pitcher variables means that when a white umpire calls a white pitcher, the encounter is adjusted to the proportion of strikes that pitcher is expected to throw. So the fact that whites throw more strikes than blacks should be factored out.

What’s the result? Statistical significance at about the 2.5% level (almost exactly 2 standard errors). The estimate of the UPM factor is 0.00341. Since the overall probability of a strike is somewhere around 0.30, the increase is about 1%. That’s fairly significant in a baseball sense. 1% is about half a pitch a start, perhaps (remember, swinging strikes aren't included). My research shows that changing a ball into a strike (or vice-versa) is worth about .14 of a run; a study in Baseball Prospectus estimated it at .10 runs. Either way, taking a hit of .05-.07 on your ERA, just by facing the “wrong” race umpire, is fairly significant.
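A minimal sketch of the arithmetic in this paragraph. It takes the half-a-pitch-per-start figure as given rather than deriving it, and the names are illustrative.

```python
UPM_COEFFICIENT = 0.00341    # from column 3d of the study's Table 3
BASE_STRIKE_RATE = 0.30      # approximate overall called-strike probability
PITCHES_PER_START = 0.5      # the post's rough estimate, taken as given

# Relative increase in called strikes for a matched umpire/pitcher pair.
relative_increase = UPM_COEFFICIENT / BASE_STRIKE_RATE

# Run cost per start under each estimate of the value of a ball/strike swap.
cost_high = PITCHES_PER_START * 0.14  # the author's run value
cost_low = PITCHES_PER_START * 0.10   # the Baseball Prospectus run value

print(round(relative_increase, 4), round(cost_low, 2), round(cost_high, 2))
```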

But, to be honest, I don’t understand where this number could have come from, based on the raw data. As I showed in the 2x2 table above, the discrepancy between black and white is 5 pitches in 1,765, which is 1/3 of 1%. For white/Hispanic, it’s about half a percent. For Hispanic/black, it’s about an eighth of a percent. None of these is statistically significant.

So, how is it that Table 3 can combine all these and get something higher than even the largest of the 2x2 discrepancies, and so much more statistically significant? The only thing I can think of is the control variables for the count and specific pitcher. Even though the raw data don’t show much discrimination, it could be that when you look closer, the white umpires, by random chance, faced better black pitchers than average, so they should have called even more strikes than average (but didn’t). Still, that seems unlikely, given the size of the sample.

Anyone have an idea what’s going on? Is there something wrong with my 2x2 analysis?

---

In Table 4, the regressions include a variable for whether or not the game was pitched in a QuesTec park. In those parks, the umpires are “graded” against an electronic observation on whether or not a pitch should have been called a strike. The idea is that when umpires are being observed, they should discriminate less, because they have an incentive to be accurate instead of biased. That’s where the title of the study, "Umpires’ Demand for Discrimination," comes from. The implication is that umpires (perhaps unconsciously) like to discriminate, but will “buy” less discrimination when the price goes up (QuesTec).

The results show there's very much an effect. The UPM factor was much less when QuesTec was in operation. Indeed, the umps appear to have *overcompensated.* Unmonitored, umpires were 2% more likely to call a strike on a different-race pitcher. Under QuesTec, they were two-thirds of a percent *less* likely (although that number is not significantly different from zero).

While I’m convinced that umpires call pitches differently under QuesTec, I’m still uncomfortable about the discrimination estimates.

---

The authors then proceed to Table 5, where they add control variables for attendance and “terminal count” (any count with three balls or two strikes, where this pitch is more likely to make the at-bat "terminate").

I don’t think we can learn anything from the attendance analysis, because the sample for these new variables is not random. Parks with low attendance probably have worse pitchers. The study did control for that, but what it didn’t control for was that parks with low attendance have worse pitchers *playing at home*. This would tend to cause the worse pitchers to play slightly better, and the better pitchers (who are on the road) to play slightly worse.

For what it’s worth, the authors found that high attendance almost completely cancels out the racial bias (a small amount is left). But I’m not sure whether that would still hold if you re-ran the study, correcting the pitcher effects for home/road. Normally I would think it's not that important, but this study does look for very small effects.

As for the “terminal count” study, different pitchers will have different types of terminal counts. Good pitchers will have a lot of 0-2s, and (semi-deliberately) throw a lot of balls. Bad pitchers would have a lot of 3-0s, and (semi-deliberately) throw a lot of fastball strikes. I think that would render the results of the study unreliable.

But, again, for what it’s worth, on “terminal counts,” the bias is completely reversed: umpires judge own-race pitchers more harshly. The authors suggest it's because the ump knows his judgement on this pitch will be watched more closely.

---

Table 6 attempts to relate the race bias to team winning percentage -- how many wins is the bias worth? (The authors seem unaware of the previous work on the run value of balls and strikes.) I’m not sure how to interpret the “probit” coefficients in the table (are they just percentage points?), but the authors say that the probability of winning increases by "over 4.2 percentage points for a home team if its starting pitcher matches the umpire’s race/ethnicity." That seems way too big: it would improve the chance of winning from .540 to .582. But if two pitches per game are affected, that’s .28 runs, which is almost .030 in winning percentage (at 10 runs per win).

Still, two pitches per game is a lot. Overall, how many pitches in a typical game are borderline? Most calls are pretty clear. Suppose there are, say, 20 pitches a game that can go either way. Suppose 10 go to the pitcher and 10 go to the hitter. Now, suppose there are two calls that go the "wrong" way because of race. That brings the 10 up to 12. That means that umpires are so biased that 17% of close pitches are decided by race!

My gut says that's pretty unlikely.
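The back-of-the-envelope numbers in the last two paragraphs can be laid out as a short sketch. The 20 borderline pitches per game is the post's own supposition, not a measured figure.

```python
RUNS_PER_PITCH = 0.14     # author's run value of a ball/strike swap
RUNS_PER_WIN = 10
AFFECTED_PITCHES = 2      # pitches per game swung by race, as supposed above

runs_per_game = AFFECTED_PITCHES * RUNS_PER_PITCH
win_pct_gain = runs_per_game / RUNS_PER_WIN

# Of 20 borderline pitches, 10 go the pitcher's way on merit, plus 2 from bias.
calls_for_pitcher = 10 + AFFECTED_PITCHES
share_decided_by_race = AFFECTED_PITCHES / calls_for_pitcher

print(round(runs_per_game, 2), round(win_pct_gain, 3),
      round(share_decided_by_race, 2))
```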

---

-- In a section on robustness checks, the authors find that the race of the batter doesn’t seem to matter – only that of the pitcher. They hypothesize that the umpire is reluctant to show bias against the batter because he stands so close. The batter can confront the umpire, while the pitcher can’t. Doesn’t seem too unreasonable to me. Also, there is “weak evidence” that the more-experienced umps show less bias than the less-experienced ones; and, there is *no* observed bias among the crew chiefs.

-- To their credit, the authors suggest explanations for all the observed effects that don’t involve racial bias, although those only appear in their FAQ. For instance, they suggest that perhaps Hispanic umpires and pitchers have distinctive “styles,” and so both will implicitly understand a certain kind of pitch to be a strike. Other umpires may not call it a strike, and other pitchers won’t throw it, and so the Hispanic/Hispanic strike count is inflated.

-- I think all the standard errors and significance levels given in the study are too small. That’s because the analysis the authors do assumes that the umpires are seeing a random sample of pitches. They aren’t – they see pitchers in bunches. If umpire X has 30 games a year behind the plate, he’s going to see fewer than half the starters in the league. So it’s kind of a cluster sample instead of a random sample. Again, I don’t know how much of a difference this makes in the significance levels, whether they're a little too small or a lot too small, but I wish the study had tried to estimate the effect.

---

My bottom line is that I’m a bit confused by this study. The raw numbers seem as close to non-biased as you can get, but the regressions seem to show a significant effect. Until the discrepancy is explained to me, I have to go with the 2x2 data, because that’s what I understand. And that says no discrimination.

One explanation might be that the specific technique the authors used, despite using “pitcher fixed effects,” didn’t fully adjust for the fact that white pitchers throw more called strikes than black pitchers. If that’s the case, it would explain most of the positives. It wouldn’t explain why QuesTec completely reversed them into negatives, though. At any rate, these guys are professionals, and I can't think of anything wrong that might have caused this to happen.

And almost all the results go the expected way -- the umpires improve under QuesTec, they appear to improve when the pitch is more important, they improve when the crowds are large, and the experienced umpires show less bias. Again, this would be a big coincidence if there weren't real bias going on.

So am I missing something? Can anyone help resolve the contradiction?


## Friday, August 10, 2007

### NCAA point-shaving study convincingly debunked

There have been a couple of Justin Wolfers papers in the news recently. There was the “racial bias in the NBA” study a few months ago (co-authored with Joseph Price), which I reviewed here. And, last May, there was an article on NCAA point shaving, which I’ll talk about now.

Wolfers analyzed the outcomes and betting lines for over 70,000 NCAA Division I games. He found that, overall, the favorite covered the spread almost exactly half the time (50.01%), exactly as you would expect. But heavy favorites (defined as a spread of more than 12 points) covered only 48.37% of the time.

He then found that, for heavily favored teams, the results weren’t symmetrical. That is, teams favored by, say, 4.5 points would win by 0-4 almost exactly as often as by 5-8. But for large spreads, like 15.5 points, teams won by 0-15 a lot more often than by 16-31.

“Among teams favored to win by 12 points or more, 46.2% of teams won but did not cover, while 40.7% were in the comparison range of outcomes.”

The difference is about 6% of those games, which we can call the “Wolfers Discrepancy” for his dataset.

Wolfers concludes that 6% figure is evidence of cheating – that point shaving caused 3% of potential covers to become non-covers. Since only half of games can be shaved (games where the favorite is leading), that means that the proportion of corrupt games was double that. Conclusion: the fix was in in about 6% of 12-point-plus-spread games.
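A minimal sketch of the chain of arithmetic behind the 6% figure, using the two percentages from the quote above. The variable names are illustrative.

```python
WON_NOT_COVERED = 46.2   # % of heavy favorites that won but did not cover
COMPARISON_RANGE = 40.7  # % that landed in the mirror-image range above the spread

# The "Wolfers Discrepancy": the gap between the two ranges.
discrepancy = WON_NOT_COVERED - COMPARISON_RANGE

# Each shaved game leaves one range and joins the other, so it moves the gap twice.
shaved_share = discrepancy / 2

# Only about half of games are candidates for shaving (favorite is leading),
# so the share of games with a fix in place is double the share actually shaved.
corrupt_share = shaved_share * 2

print(round(discrepancy, 1), round(shaved_share, 2), round(corrupt_share, 1))
```

The result is 5.5, which the post rounds to "about 6%".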

This study, and the 6% figure, got a lot of publicity.

Here’s a post from The Wages of Wins. Here’s an article from the New York Times. And you can find a lot more by Googling.

Not receiving anywhere near as much publicity is a rebuttal paper by Dan Bernhardt and Steven Heston. And that’s too bad, because it’s a great study, and it thoroughly debunks Wolfers’ conclusions. These guys nail it.

Bernhardt and Heston argue that there’s no reason to expect the symmetry that Wolfers assumes. For instance,

"Consider a 14 point favorite that is up by only seven points with five minutes to play. To maximize its chances of winning, the favorite will “manage the clock,” holding the ball to reduce the number of opportunities that the other team has to score ... In contrast, the same favorite up by 21 is sure to win, and has no need to manage the clock, raising the expected increment in winning margin, generating an asymmetric distribution in winning margins."

That’s very plausible, and the authors back it up with some clever analysis. They argue that if the Wolfers Discrepancy is caused by cheating, then it should be much higher than 6% in games that are more likely to be corrupt, and less than 6% in games that are less likely to be corrupt. And so they check – four different ways.

First, they note that a fixed game will attract heavy betting on the underdog, and all that money will move the betting line to reduce the spread. So they split games into two groups: games where the line moved the “wrong” way, towards the favorite, and all other games. It turns out that the Wolfers Discrepancy is almost exactly the same between the two groups, suggesting that it’s a natural part of the scoring distribution rather than an artifact of corruption.

Second, they note that if game fixing happens, it happens in games where the final score is very close to the spread. So, for a 14-point spread, instead of comparing finals of 1-14 points to finals of 15-28 points, they compare 8-14 to 15-21, to narrow the range. Now, the number of corrupt games in these smaller samples should be equal to the number in the bigger sample. But the denominator, the number of total games, is smaller. Therefore, the percentage of apparently corrupt games – the Wolfers Discrepancy – should rise.

It didn’t. In fact, it *dropped*, by more than two-thirds.
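To see why the narrower window should have raised the discrepancy under the point-shaving story, here is a toy sketch. Every count in it is hypothetical, chosen only to illustrate the logic; none comes from either paper, and measuring the discrepancy as a share of games inside the window is an assumption.

```python
def discrepancy(below_spread, above_spread, shaved):
    """Gap between the two ranges, as a share of games in the window.

    `below_spread` and `above_spread` are HYPOTHETICAL honest game counts;
    `shaved` games are moved from just above the spread to just below it.
    """
    total = below_spread + above_spread
    return ((below_spread + shaved) - (above_spread - shaved)) / total

SHAVED_GAMES = 30  # hypothetical; all assumed to finish close to the spread

wide_window = discrepancy(1000, 1000, SHAVED_GAMES)   # e.g. 1-14 vs 15-28
narrow_window = discrepancy(500, 500, SHAVED_GAMES)   # e.g. 8-14 vs 15-21

print(wide_window, narrow_window)
```

With the same shaved games and half as many total games, the gap doubles. The real data went the other way.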

As a third test, they check games where the line did move towards the underdog, games where corruption is plausible. You’d expect, in those games, that the more the line moved, the more likely the game is fixed, and so the larger the Wolfers Discrepancy for those games.

The result -- nothing statistically significant:

5.61% -- line moved towards the favorite
6.64% -- line didn’t move
7.17% -- line moved towards the underdog by 0 - 0.5 points
8.43% -- line moved towards the underdog by 0 - 1 point
6.83% -- line moved towards the underdog by 0 - 1.5 points.

Finally, they analyzed games where there was no betting line, which usually happens when there isn’t enough interest in the game for the sports books to bother. (For those games, Bernhardt and Heston had to estimate a spread using team strength ratings.)

If there is no betting, there’s no incentive to shave points. And so, if the Wolfers Discrepancy is really caused by corruption, it should be zero for those games.

But it wasn’t zero. In fact, it was 6% -- almost exactly the same as Wolfers found in his own study!

Truly outstanding work by Bernhardt and Heston. They took a statistical effect that Wolfers claimed showed corruption, and proved, four different ways, that the effect exists even when corruption is unlikely. In addition, they provided a plausible explanation of what the effect might be.

The NCAA should buy these guys a beer.

Hat tip: Zubin Jelveh


## Thursday, August 09, 2007

### May issue of "By the Numbers" released

The new issue of "By the Numbers," the SABR statistical analysis newsletter, is now available at my website here. It's dated May, 2007, but it's still new.

It contains articles by Charlie Pavitt, J.P. Caillault, David Roher, Russell Carleton, Fred Worth, and Andrew Boslett/Matt Hoover/Thomas Pfaff.

As always, all back issues are available at my website, philbirnbaum.com.


### Would legalizing steroids create near-perfect competitive balance?

In a recent article from Slate, Daniel Engber speculates on what would happen if MLB effectively legalized the use of performance enhancing drugs.

According to Stephen Jay Gould, increasing the quality of baseball players reduces the variance of their talent. Since increased steroid use makes players better, Engber argues, the variance will continue to decrease. And so,

"A guy like Ichiro might lead the league while batting .280. A guy who hit .250 might ride the pine ... Teams would regularly take the pennant by winning just 83 or 84 games."

This is obviously absurd, even if you accept Gould’s original argument (which I’m not sure I do).

First, even if every player in baseball was exactly a .265 hitter, there would be lots of variation in batting averages just by luck. Over 500 AB, the standard deviation would be about 19 points, which means that one player in 40 would be above .303 or below .227. Assuming 240 regulars, that’s six players above .303. One of those would likely wind up close to .320 or so.
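A minimal sketch of the luck calculation, treating each at-bat as an independent trial with a .265 chance of a hit (the usual binomial assumption). The computed SD lands between 19 and 20 points, so the cutoffs differ from the rounded figures above by a point or two.

```python
import math

TRUE_AVERAGE = 0.265
AT_BATS = 500
REGULARS = 240

# Binomial standard deviation of a batting average over AT_BATS trials.
luck_sd = math.sqrt(TRUE_AVERAGE * (1 - TRUE_AVERAGE) / AT_BATS)

# Roughly 1 player in 40 lands more than 2 SDs above the mean.
upper_cutoff = TRUE_AVERAGE + 2 * luck_sd
lower_cutoff = TRUE_AVERAGE - 2 * luck_sd
players_above = REGULARS / 40

print(round(luck_sd, 4), round(upper_cutoff, 3), round(lower_cutoff, 3),
      players_above)
```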

And that’s if every player were identical. The current standard deviation of player batting average talent is about 27 points (according to an estimate here). How much would legal steroids reduce that? Not by the full 27 points, by any means. I’d be astonished if it dropped much at all, but suppose it went down to 20 points. That means that the SD of observed batting average (which is the combination of the 20 points of talent and 19 points of luck) would go from 33 to 28. Instead of the top 2% of the league being at .340, they might be around .330. Not really a big deal.
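Talent and luck are independent, so their SDs combine as the square root of the sum of squares. A short sketch of that step, in points of batting average:

```python
import math

LUCK_SD = 19            # points, binomial luck over 500 AB
TALENT_SD_NOW = 27      # points, current estimate cited above
TALENT_SD_STEROIDS = 20 # points, the hypothetical reduced spread

# Independent sources of variation add in quadrature.
observed_sd_now = math.hypot(TALENT_SD_NOW, LUCK_SD)
observed_sd_steroids = math.hypot(TALENT_SD_STEROIDS, LUCK_SD)

print(round(observed_sd_now), round(observed_sd_steroids))
```

Cutting the talent SD by a quarter moves the observed SD by only about five points, because luck makes up so much of the total.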

As for team performance, Engber’s estimate is even more out of line. Even if every team were .500 in talent, the SD of observed wins would be 6.3 wins. One team out of 40 would finish with 93 wins, and another with 93 losses. And, again, that’s the minimum possible, which only happens when every team is exactly equal to every other team. Even under a free-steroid regime, that’ll never happen.
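The team figure follows from the same binomial reasoning, sketched here for a 162-game season of coin flips. The result is close to the 6.3 quoted above.

```python
import math

GAMES = 162
WIN_PROBABILITY = 0.5  # every team exactly average

# Binomial SD of wins when every game is a coin flip.
wins_sd = math.sqrt(GAMES * WIN_PROBABILITY * (1 - WIN_PROBABILITY))

# About 1 team in 40 finishes more than 2 SDs above 81 wins.
two_sd_wins = GAMES * WIN_PROBABILITY + 2 * wins_sd

print(round(wins_sd, 2), int(two_sd_wins))
```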

And why assume that taking steroids would make players and teams so equal to each other anyway? Neifi Perez didn’t turn into Barry Bonds, or even come close. Isn’t it more realistic to assume that steroids make the player a little better than he was, instead of believing that it turns great and poor hitters alike into equal supermen? Is Engber just misunderstanding the implications of Gould’s work?

My intuition, for what it’s worth, is that steroids would indeed increase both home runs and strikeouts. But if you deadened the ball a bit, to bring HRs back to normal, you’d lose a lot of free-swinging power-hitters, which would bring the strikeouts down too. And then you’d barely notice a difference.


## Monday, August 06, 2007

### Is Barry Bonds' record "tainted" by his elbow protector?

Here's an article that argues that part of the reason for Barry Bonds' late-career power surge is the elbow guard he wears. According to Michael Witte, the "armor" has special features which limit the motion of Bonds' arm, thus keeping his swing in the correct plane. Among other things.

I have no idea what to make of this; just thought I'd throw it out there.

Hat Tip: John Pastier of SABR


## Friday, August 03, 2007

### MLB: how good are 100-win teams?

Here's a study (.pdf), called "On Why Teams Don't Repeat," that I wrote 18 years ago, before I was born. It originally appeared in the February, 1989 issue of "Baseball Analyst," the journal of sabermetrics edited by Bill James.

The idea is this: teams that make the playoffs often fail to repeat the next year. Why? Because they probably were mostly lucky the first time.

Suppose a team won 100 games. How good a team was it in terms of talent? The answer: probably less than a 100-game talent. Either it was a less-than-100 talent that got lucky, or it was a more-than-100 talent that got unlucky. But since there are many more below-100 teams than above-100 teams, it was probably on the lucky side.

The question is, *how much* less than 100 games? To find out, I tried to find a distribution of talent that would roughly approximate the actual distribution of AL wins from 1961-1984. Once I found that, I figured the chance of the 100-win team coming from each segment of that distribution. Then I figured the average.

The results:

The average 100-win team is really only a 93-win team.

Other interesting results:

The average 81-81 team is really an 82-80 team.
The average 62-100 team is really a 66-96 team.
The average 116-46 team is really a 102-60 team.

There is a 79% chance a 100-62 team was lucky.
There is a 32% chance a 65-97 team was lucky.

All these figures are approximate, of course. And circumstances have changed since 1984; now, financial imbalance means there are probably more extreme teams (at both ends of the talent scale) than there were previously, which means the amounts of luck found here might be overestimates in today's MLB.
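The general idea can be sketched with a simple Bayesian calculation. This is not the method from the 1989 study, which fitted a talent distribution to actual AL standings. The sketch assumes a normal talent distribution, and the talent SD of .055 in winning percentage is an illustrative guess, so the numbers will not match the study's exactly.

```python
import math

GAMES = 162
TALENT_SD = 0.055  # ASSUMPTION: SD of true winning percentage across teams

def posterior_over_talent(wins, games=GAMES, talent_sd=TALENT_SD, steps=2001):
    """Return (talent, weight) pairs: posterior over true win probability."""
    grid = [i / (steps - 1) for i in range(1, steps - 1)]  # points inside (0, 1)
    weights = []
    for p in grid:
        # Assumed normal prior centered on .500.
        prior = math.exp(-0.5 * ((p - 0.5) / talent_sd) ** 2)
        # Binomial likelihood, up to a constant that cancels when normalizing.
        likelihood = p ** wins * (1 - p) ** (games - wins)
        weights.append(prior * likelihood)
    total = sum(weights)
    return [(p, w / total) for p, w in zip(grid, weights)]

posterior = posterior_over_talent(100)

# Average true talent of a team that won 100 games, expressed in wins.
expected_talent_wins = GAMES * sum(p * w for p, w in posterior)

# Chance the team's true talent was below its 100-win record, i.e. it was lucky.
chance_lucky = sum(w for p, w in posterior if p < 100 / GAMES)

print(round(expected_talent_wins, 1), round(chance_lucky, 2))
```

Under these assumptions the 100-win team comes out as a talent in the low-to-mid 90s, in the same neighborhood as the study's 93.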

Also, of course, there is an easier way to estimate the SD of talent, as seen here.

## Wednesday, August 01, 2007

### Is there a bubble in sports franchise prices?

Some comments from Arnold Kling on whether sports franchises make money.

In the post, Kling is asked whether sports franchises constitute a speculative bubble. After all, many owners lose money from operations and only profit from increases in the value of the franchise.

Kling responds that there is no bubble, with an argument that (boiled down) says that teams actually make money in terms of EBITDA (before interest on any debt). I agree with him that there's no bubble, but for different reasons.

For one thing, even if teams are profitable, there could still be a bubble – some of the tech stocks that crashed in 2000 were indeed slightly profitable. You don't need absolute losses for an asset to be overinflated.

But stocks are different than sports teams. Stocks have no consumption value, and are valued only for their ability to produce an income stream. Sports teams, on the other hand, are fun to own. My feeling is that it's the consumption value of being a happy owner that keeps the prices high ... and I think the prices are reasonable, as I argued here.


### Game fixing in Quidditch

There's been a lot written about game-fixing in basketball lately, but not much about an obvious game fixing episode in World Cup Quidditch.

Quidditch is played on flying broomsticks by two opposing teams of seven players. There are two ways to score. A goal counts as 10 points, and is achieved by putting a soccer-ball sized "quaffle" through the other team's goal hoop. Meanwhile, a magical golden ball with wings, called the "snitch," flies throughout the area of play, trying to elude the players. Catching it scores 150 points, and ends the game.

In Harry Potter's fourth year at Hogwarts, the Quidditch World Cup pits Ireland against Bulgaria. Ireland takes a commanding 170-10 point lead. Then, Bulgaria's Viktor Krum catches the snitch, ending the match. Ireland wins, 170-160.

Why did Krum throw the game by catching the snitch? According to Potter, "he knew they were never going to catch up ... he wanted to end it on his terms."

Well, that's kind of weird, isn't it? Suppose the Patriots were losing by four points in the Super Bowl with one second left, and decided to kick a field goal. Bill Belichick would be fired, then probably indicted.

It should also be noted that Fred and George Weasley, brothers of Harry Potter's best friend Ron, had placed a substantial bet on exactly this series of events – Viktor Krum deliberately throwing the World Cup of Quidditch by catching the snitch to lose. They collected on their wager, at undisclosed odds.

Apparently there were no repercussions. Either Quidditch is the wizarding equivalent of WWE, or, more likely, J. K. Rowling isn't much of a sports fan.
