Thursday, April 24, 2008

Racial bias and NBA referees -- a follow-up study

In comments to one of my posts on the Hamermesh study, commenter Guy mentioned a study on a similar subject, NBA referees and race. I hadn't seen that paper before.

The article is called "Racial Bias in the NBA: Implications in Betting Markets." It's a follow-up to last year's famous Joe Price/Justin Wolfers study (which I reviewed back then in three parts). This one, now also co-authored by Tim Larsen (so I'll refer to the authors as "LPW"), examines the impact of race on NBA betting markets -- that is, the Vegas line.

The paper comes to similar conclusions to last year's study – that referees appear to be either biased in favor of players of their own race, or biased against players of a different race (or both – it's impossible to tell which). Also, it suggests a profitable betting strategy to take advantage of this bias.

However, I disagree that the study has found referee bias. The way I see it, there is only one finding in this paper that suggests such a bias, and it's significant only at the 10% level. I'll get to that finding in a bit, but I'm going to start by listing some of the paper's findings that do NOT have to do with racial bias. Some of these, actually, are very interesting findings, and I don't think I've seen them before.

------

1. Black referees favor the home team more than white referees.

In Panel C of Table 2 of the paper, the authors do a regression on the home team's winning margin, against the relative racial composition of the two teams. On average, the home team's winning margin is:

3.167 when there are three white referees
3.541 when there are two white referees [and one black]
3.631 when there is one white referee [and two black]
4.591 when there are no white referees [so three black].


The differences are not statistically significant, but it is interesting that each one is higher than the last, and that the extremes are significant in a basketball sense (1.4 points difference is a lot).


2. White players, on average, are better than black players.

Again, in the same table, the authors show the point differential between when one team has 100% black players, and the other team has 100% white players. Actually, this is extrapolated from real life – a more realistic explanation is that the study found what happened when one team has (say) one "extra" white player, and multiplied it by 5. In any case, here are the results, expressed in more points for the all-white team (relative to the all-black team):

6.128 more points when all three referees are white
5.140 more points when two referees are white
3.673 more points when one referee is white
0.903 more points when no referees are white.


Obviously, "whiter" teams, overall, are better than "blacker" teams. A 100% white team seems to be about 5 points better than a 100% black team, so the difference is about 1 point per "extra" white player.

Also, there's some evidence here of the racial bias the authors are searching for. It does look like white refs are easier on white teams (and vice-versa). This is indeed suggestive. However, none of the differences are statistically significant. The standard errors of the four estimates above are 1.1, 0.9, 1.3, and 3.2 points, respectively.
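To make the "not statistically significant" claim concrete, here is a rough sketch that compares each pair of the four estimates. It assumes the four estimates are independent of each other, which is my simplification and not something the paper states; the names (`estimates`, `z_for_difference`) are just for illustration.

```python
from itertools import combinations
from math import sqrt

# (white referees on crew) -> (point advantage for the all-white team, standard error),
# as quoted from the paper's table above.
estimates = {3: (6.128, 1.1), 2: (5.140, 0.9), 1: (3.673, 1.3), 0: (0.903, 3.2)}

def z_for_difference(a, b):
    """z-score of the gap between two estimates, ASSUMING they are independent."""
    (est_a, se_a), (est_b, se_b) = a, b
    return (est_a - est_b) / sqrt(se_a ** 2 + se_b ** 2)

z_scores = {
    (i, j): z_for_difference(estimates[i], estimates[j])
    for i, j in combinations(estimates, 2)
}

for (i, j), z in z_scores.items():
    print(f"{i} vs {j} white refs: z = {z:.2f}")
```

Under that independence assumption, every pair comes out below the usual 1.96 cutoff, including the biggest gap (three white refs versus none), mostly because the no-white-refs estimate is so noisy.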


3. The Vegas line underestimates white teams' chances of winning.

We know that from Panel D of Table 2, where LPW show how the point spread varies by the race of referees. When one team has 100% white players, and the other has 100% black players, the spread in favor of the white team is:

2.976 points when all three referees are white
3.593 points when two referees are white
2.413 points when one referee is white
2.144 points when no referees are white.


Compare these spread numbers to the actual numbers above: except for the no-white-refs case (which is only 3% of games), the spread underestimates how much the white teams will win by. In real life, the difference was about 5 points. In the spread, it's about 3 points.

This appears to contradict the hypothesis that betting markets are efficient – they appear to be a couple of points off in these cases.

Since the Vegas line doesn't try to pick the correct score, but, rather, tries to pick the spread at which the betting on both sides will be even, the most likely explanation here is that bettors, as a group, are slightly biased in favor of teams with more black players. That could be because black players are more likely to be superstars (and have more fans who bet on their teams). It could be that cities with more basketball bettors happen to have more black players. It is also possible that basketball fans are biased against white players (although I prefer not to accuse anyone of racial bias, even bias on who to be a fan of, except when all other explanations have been considered and found wanting).

------

Okay, those are three statistical effects the study found that have nothing to do with racial bias among refs. I'll repeat them:

1. Black refs tend to slightly favor the home team.
2. White players tend to be slightly better than black players.
3. The betting line underrates whiter teams.

------

Now, let's go to the bias tests.

In Table 3 of the paper, LPW run a regression to predict various aspects of their sample of games. They correct for the race of the teams and the races of the referees. Then, after making those corrections, they look at what's left – specifically, what happens when the race of the refs matches the race of the players, and when it doesn't.


(Technical note: they define the "same-race" parameter as

% white refs * (% home black players - % visitor black players)

This reaches its minimum of -1.00 when (a) all the refs are white, (b) all the home players are white, and (c) all the road players are black. It's +1.00 when (b) and (c) are reversed.)
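For anyone who wants to check the extremes, here is a minimal sketch of that parameter. The function name and the 0-to-1 "share" arguments are my own choices for illustration, not the paper's notation.

```python
def same_race_parameter(white_ref_share, home_black_share, visitor_black_share):
    """% white refs * (% home black players - % visitor black players); shares are 0.0 to 1.0."""
    return white_ref_share * (home_black_share - visitor_black_share)

# All-white crew, all-white home team, all-black road team: the minimum.
print(same_race_parameter(1.0, 0.0, 1.0))   # -1.0
# Same crew with the teams' compositions reversed: the maximum.
print(same_race_parameter(1.0, 1.0, 0.0))   # 1.0
# With no white referees the parameter is zero whatever the rosters look like.
print(same_race_parameter(0.0, 1.0, 0.0))   # 0.0
```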


Here are the differences in this most extreme case (all white team, all black team, all white refs):

A 16% additional chance of beating the spread
An extra 3.3 points relative to the spread
An extra 4.1 points relative to the other team


(The difference between the 3.3 and 4.1 comes from the spread itself being 0.8 points lower.)

These seem like big differences. However, none of these numbers are highly significant. All three are a little less than 2 standard deviations away from zero, so they're only significant at the 10% level.

In terms of basketball, are they significant? Well, in the extreme case, yes. But in real life, you're not going to have these extreme conditions very often, if at all. No team is 100% white, and no team is 100% black. More typically, the home team might have one extra white player (out of 5, adjusted for expected playing time), and there might be two white refs on average. That would give a parameter of (66% * 20%), which is about .13. So the effects would be 13% of the ones above. That means the extra chance of beating the spread would be closer to two percentage points, not 16.

Still, that's something, even though (as I said) statistically significant only at the 10% level.
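Here is the scaling arithmetic as a small sketch. It assumes the effects shrink in direct proportion to the parameter, which is what a linear regression term implies; the dictionary labels are mine.

```python
# Effects quoted above for the extreme case (parameter magnitude 1.00).
extreme_effects = {
    "chance of beating the spread (pct points)": 16.0,
    "points relative to the spread": 3.3,
    "points relative to the other team": 4.1,
}

white_ref_share = 2 / 3      # two white refs on a three-man crew
player_race_gap = 1 / 5      # one "extra" white player out of five
parameter = white_ref_share * player_race_gap

# ASSUMPTION: effects scale in proportion to the parameter (it is a linear regression term).
typical_effects = {name: parameter * size for name, size in extreme_effects.items()}

print(f"parameter = {parameter:.3f}")
for name, size in typical_effects.items():
    print(f"{name}: {size:.2f}")
```

The scaled values come out to roughly 2 percentage points on the spread bet and about half a point on the scoreboard.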

------

Now, for Table 4, the authors suggest a betting strategy: bet on the team with the greatest racial similarity to the refereeing crew. If there are more white refs than blacks, bet on the whiter team. Otherwise, bet on the blacker team.

Here's the percentage of bets you'll win:

3 white refs: 51.37%
2 white refs: 50.99%
1 white ref: 50.39%
0 white refs: 52.53%


In all cases, you win more than half your bets!

However: remember finding number 3: the betting line underrates whiter teams. The "3 white refs" and "2 white refs" cases involve betting on the whiter team. So it's likely that what's happening here is not referee bias, but Vegas line inefficiency! Betting on the whiter team is generally a better than 50-50 shot.

The other two cases are not as easily explained – now, you're betting on the blacker team, and those should be slightly below-average bets. But they're not – they're better than even wagers. What's going on?

Probably statistical insignificance. The "0 white refs" sample is very small, and the 52.53% figure is only one standard error from 50%. As for the "1 white ref" sample, there's an interesting result: the more black players you're betting on, the lower your chance of winning the bet. If the blacker team is blacker by more than half a player (out of 5), your winning percentage is less than 50%. It's only when one team is *slightly* blacker – less than half a player – that the bet is a winning one. This is certainly not consistent with the idea that it's racial bias causing the effect.

In the 3-white-refs and 2-white-refs cases, the odds of winning go up the more white players there are on the team you're betting. Since this is consistent with white players being better, you can't really tell if there's referee bias happening there too.

------

Finally, we come to Table 5, where the authors consider various betting strategies.

They propose a simple rule: wait for an all-black or all-white refereeing crew. Then, bet on the team whose racial composition better matches the referees. Doing this, they find you win 51.48% of bets. This is significant at the 5% level (a tiny bit over 2 SD).

But, by this rule, one team might be only very, very slightly blacker (or whiter) than the other. Shouldn't you improve your odds if you also wait for a large discrepancy in team race? It turns out that you do. If you don't bet until you find a race difference is over 10% (one half a player), you win 51.82% of your bets.

If you wait for a 20% advantage (one player), you move up to 54.34%. If you wait for a 30% advantage, it's 56.30%. And if you wait for a 40% advantage, you'll only bet 160 times in 14 years. But you'll win 61.88% of those bets!

Here's all this in a table. Remember, these are games with three same-race refs only:

51.48% -- all games
51.82% -- 10% player race advantage
54.34% -- 20% player race advantage
56.30% -- 30% player race advantage
61.88% -- 40% player race advantage


The first three of these are statistically significant at 5%; the fourth one is significant at 1%. The last one is also significant at 1% (although the authors didn't mark it as such).
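The only bet count quoted above is the 160 bets in the 40% group, so that's the one row that can be checked directly. A quick sketch, assuming each bet is an independent coin flip under the null and ignoring pushes (both are my simplifications):

```python
from math import sqrt

def z_vs_coin_flip(win_rate, n_bets):
    """Standard errors between an observed win rate and 50%, under a binomial model."""
    standard_error = sqrt(0.5 * 0.5 / n_bets)
    return (win_rate - 0.5) / standard_error

z_40 = z_vs_coin_flip(0.6188, 160)
print(f"z = {z_40:.2f}")
```

That works out to about three standard errors, which is past the two-sided 1% cutoff of roughly 2.58.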

So does all this significance indicate racial bias? Nope.

Remember that there are many more white referees than black. Three white refs happens 28.1% of the time, but, because there are so few black referees, three black refs in one game happens only 3.0% of the time. So if you wait for three same-race refs, over 90% of the time, those three refs will be white.

And since you bet on teams that match the referees, that means that 90% of the time, you'll be betting on the whiter teams. And we saw, from number 3, that the Vegas line underrates white teams. So you're probably winning because your teams are whiter, not because of the referees.
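The "over 90%" figure follows straight from the two crew frequencies. A two-line check, with variable names of my own choosing:

```python
all_white_crew = 0.281   # share of games with three white referees
all_black_crew = 0.030   # share of games with three black referees

same_race_games = all_white_crew + all_black_crew
white_share = all_white_crew / same_race_games

print(f"same-race crews: {same_race_games:.1%} of games")
print(f"of those, all-white: {white_share:.1%}")
```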

To check for sure, you need to (for instance) find all the games, not just the ones with same-race refs, where the race difference is 30% or more. Then see what happens when you bet on the whiter team. I bet it would be not too far off from the 56.30% observed with only white refs. And I'd bet the difference between the two would not be statistically significant.

------

Bottom line: in this study, the only evidence of same-race bias came in Table 3. The difference one extra white player makes (assuming two white refs) is:

-- an extra .022 chance of beating the spread
-- an extra .54 points relative to the opponent


That has reasonable basketball significance – half a point. But even so, I hesitate to accuse anyone of racial bias on the basis of a significance level of only 10%.





Wednesday, April 23, 2008

US government weather service regularly lies, admits lying

The National Oceanic and Atmospheric Administration (NOAA), a branch of the US government, admits to deliberately lying in their weather forecasts.

No, really. In this Freakonomics blog post, J.D. Eggleston reports on his very interesting statistical study that found that weather forecasters aren't very accurate, especially when it comes to predicting rain. When he asked the NOAA about it,

... [NOAA] meteorologist Noelle Runyan ... stated, "Our forecasts are more conservative than the television stations. We raise our P.O.P. predictions to over 50 percent only when we are sure of rain."

So when the government determines, to the best of its scientific ability, that there's a 70% chance of rain, they will *at best* tell you there's a 49% chance. Even when there's a 90% chance, they'll still quote you 49% or less.

I wonder if Runyan misspoke ... a literal reading of her statement suggests that NOAA will NEVER predict anywhere between 50.1% and 99.9%. I suppose you could check whether that's really true.

Still, calling an estimate "conservative" usually applies when you're trying to make an argument, and you're showing that even if you err towards the opponent's side, your position still holds. That's not the case here. If what Runyan says is correct, I would characterize the NOAA's predictions not as "conservative," but as "false."


Tuesday, April 22, 2008

Regular-season performance and playoff success

Alan Reifman, author of the "Hot Hand" blog, comments on NHL and NBA performance in a recent New York Times article.

He tells us that when you're trying to predict playoff success, regular season performance is a much better indicator in the NBA than in the NHL. Reifman ran correlations between regular season points and playoff rounds won. Here's the NHL:

2007 NHL Playoffs: r = .50
2006 NHL Playoffs: r = .04 (.22 excluding Detroit)
2004 NHL Playoffs: r = .31
2003 NHL Playoffs: r = .33
2002 NHL Playoffs: r = .50


And the NBA:

2007 NBA Playoffs: r = .33 (.58 excluding Dallas)
2006 NBA Playoffs: r = .67
2005 NBA Playoffs: r = .71


All this is as expected: NBA games are more predictable in the sense that the better team wins more of them. If NBA games were shorter, or had fewer possessions, the numbers would be closer.

Also affecting these correlations are the relative strengths of teams in playoff matchups, but my impression is that these are roughly equal between the two leagues.


Monday, April 21, 2008

The Hamermesh umpire/race study revisited -- part VI

This is Part 6 of the discussion of the Hamermesh study. The previous five parts can be found here.


------


A few days ago, I thought I was done with the Hamermesh paper, but I found something else that might substantially impact the results.

In almost all its regressions, the paper controls for the score of the game. That makes sense; you'd expect that, with an 8-run lead, a team would tend to throw more strikes than when the game is tied. That's because,





"... if a pitcher is ahead in the game, he typically pitches more aggressively and is more likely to throw a pitch in the strike zone. ... The reason is that having a lead effectively reduces the pitcher's risk aversion. Relative to throwing a pitch likely to result in a walk, throwing a "hittable" pitch is risky – it increases the probabilities of both a very poor outcome for the pitcher (such as a home run) and a very good one (a fly out)."

So far, I agree. But the problem is that the study doesn't use indicator variables for separate scores – it just uses one variable for the number of runs. This assumes that the effect is linear with respect to the size of the lead. For instance, it forces the assumption that when a pitcher has a 10-run lead, the effect is twice as big as when he has a 5-run lead. More importantly, it assumes that a pitcher who throws lots more strikes with a 6-run lead will throw that same number of strikes *fewer* when he is *behind* by six runs.

That doesn't seem to make sense, does it? You'd think that when a pitcher is (say) 10 runs behind, he'll throw lots of strikes. For one thing, his team isn't going to win the game anyway, which means there's no risk, and no risk aversion. For another thing, he probably doesn't want to wear out his arm by throwing too many pitches, so he's not going to try to pick the corners quite as much.

You'd think that more strikes happen when the score is extreme either way, not when it's only in the positive direction. Wouldn't you?

And I think that's what happens. I checked all pitches from the 1991 to 1996 seasons (thanks, as always, to Retrosheet). I limited the sample to relief pitchers to eliminate part of the bias where pitchers who are five runs behind will throw fewer strikes simply because they're not very good (which is why they gave up so many runs in the first place). That's necessary, because the Hamermesh study controlled for pitcher quality, and I have no way of doing that. It doesn't fix the problem entirely, because even relief pitchers might be responsible for part of the score, but it's better than nothing.

So here's what I found for called strike percentages based on score. Plus means the pitcher is ahead; minus means he's behind:

28.59 -6
28.53 -5
28.36 -4
28.01 -3
28.18 -2
28.10 -1
27.93 0
29.25 +1
31.12 +2
32.04 +3
32.27 +4
31.29 +5
31.45 +6

As I guessed, the strike percentage does increase along with the lead, but it also increases with the *deficit*. In any case, it's obviously not linear.

Here's a regression I did on lead vs. strikes. The trend is positive – more runs do lead to more strikes – but if you look at the graph, it's obviously not a very good fit. Sorry about the quality.






It looks like a quadratic curve might be a better fit. To check, I ran another regression, this time including a "lead squared" term to test for a quadratic. The squared term was statistically significant, and made the fit much better:




And keep in mind there is some "quality leakage" here too. Remember that this is relievers only. When 1 to 3 runs ahead, teams are more likely to put in their stopper, so those pitchers are better than normal. When they're 4-6 runs behind, teams are more likely to put in a mop-up man, so those pitchers are worse than normal. If you normalize for pitcher quality by lifting the left end of the curve and lowering the right end, you get a nice U-shape. I'm betting that's what would happen if you added indicator variables for the individual pitchers, as the real study did.

(Technical note: if you use weighted least squares instead of ordinary least squares, weighting by the number of times that lead occurred, you get much the same result.)
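If you want to play with this yourself, here is a rough sketch that fits a straight line and a quadratic to the thirteen percentages in the table above. It is unweighted ordinary least squares on the table rows only (the table doesn't give pitch counts per lead), so it won't reproduce my regressions exactly, and it doesn't compute significance. The helper names `fit_polynomial` and `residual_sum_of_squares` are made up for this example.

```python
leads = list(range(-6, 7))
strike_pct = [28.59, 28.53, 28.36, 28.01, 28.18, 28.10, 27.93,
              29.25, 31.12, 32.04, 32.27, 31.29, 31.45]

def fit_polynomial(xs, ys, degree):
    """Ordinary least squares via the normal equations; returns [c0, c1, ...] (lowest power first)."""
    size = degree + 1
    # Augmented matrix for (X'X) c = X'y, where column k of X holds x**k.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum((x ** i) * y for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]

def residual_sum_of_squares(coeffs, xs, ys):
    """Sum of squared gaps between the observed values and the fitted polynomial."""
    predict = lambda x: sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

linear = fit_polynomial(leads, strike_pct, 1)
quadratic = fit_polynomial(leads, strike_pct, 2)
rss_linear = residual_sum_of_squares(linear, leads, strike_pct)
rss_quadratic = residual_sum_of_squares(quadratic, leads, strike_pct)

print(f"linear slope: {linear[1]:.3f} points per run of lead")
print(f"lead-squared coefficient: {quadratic[2]:.3f}")
print(f"residual SS, linear: {rss_linear:.2f}")
print(f"residual SS, quadratic: {rss_quadratic:.2f}")
```

On these thirteen points the slope is positive and so is the lead-squared coefficient, meaning the fitted curve bends upward at both ends.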

Now, what you'll notice is that the discrepancies for the top curve (the bottom one too, but the top is what the Hamermesh study used) can get pretty big. If you look at the tied games – "0" on the horizontal axis – you notice that strikes in tie games are a lot scarcer than the study thinks they are. By design, the Hamermesh model has tie-game strikes occurring at the overall average. But tie-game strikes happen a lot less than average. In this sample, the mean was 29.35% strikes, but tie-strikes were 27.93%. That's a difference of 1.43 percentage points.

1.43 percentage points is very, very large in the context of this study. That's about the same as the difference between white pitchers and black pitchers. It's one-and-a-half times as large as the biggest "racial bias" coefficient found in the study (0.84%).

And consider games with the pitcher ahead by 2-3 runs. According to the study, they should have been 0.36 percentage points above average in strikes. But in real life, they were over 2.00 percentage points higher! Again, that's about a point and a half wrong. (That's probably an exaggeration, because of the fact that my sample has ace relievers overrepresented, but the point remains.)

Remembering that the "black/black" sample consisted of only 11 games total, is it possible that those 11 games happened to be games in which those pitchers had the lead, which made black pitchers appear to have gotten more strikes from black umpires than they deserved? Or, could the black/hispanic sample, which was barely over 5 full games, have had a lot of close games, which caused those hispanic pitchers to look like they didn't get enough calls from those same black umpires? It seems possible, doesn't it?

The fact is that the linear score adjustment used in the study is wildly inaccurate. That would add an element of randomness selectively to those pitchers who appeared in certain situations. That would increase the standard error of all the estimates in the paper. I can't prove it mathematically, but I think the increase in variance would be enough that some of the statistically-significant findings would become non-significant.

The broader point, as commenters to previous posts have touched on, is this: if you're looking for a very, very small effect in the data, you need to be sure that your model, your assumptions, and your corrections are sufficiently accurate that, if you *do* find an effect, you can assume it's real and not an artifact of the method used. And, after considering this problem with the score adjustment, I don't think this study is able to give the reader confidence that the assumptions are sufficiently precise.

Put another way: a completely unadjusted, naive reading of the data (Table 2) shows almost no bias whatsoever. Controlling for a whole bunch of other variables suddenly *does* show bias. Is the bias truly what's left after properly adjusting for the randomness surrounding it? Or did improper adjustments, incorrect assumptions, and incomplete calculations *cause* the appearance of bias?

At this point, I think the adjustments – especially this one – aren't clean enough that we can be confident that what's left is real.

------

I think I'm done talking about this paper now ... unless I think of something else to say. Either way, in a few days, I'll come back and summarize everything properly.

------


Thursday, April 17, 2008

The Hamermesh umpire/race study revisited -- part V

(This is a continuation of an analysis of the Hamermesh "racial discrimination among umpires" paper. The other parts are the posts immediately preceding this one. If you're new to this, I'd recommend you go back and read the other parts, in order, or this won't make much sense to you.)

-----

In section V of the paper, the authors turn from counting pitches to examining other measures of game performance. If a same-race umpire increases the number of called strikes the pitcher receives, shouldn't that also lead to improved results in the win column, or in ERA?

The authors investigate starting pitchers "for the roughly 14,000 starting pitchers in the roughly 7,000 games in the three seasons in [the] sample."

The authors first look at white starters. They compare their performance when the umpire is white, to their performance when the umpire is a minority. The results, as read from Figure 4 of the paper:

-- 2.5% higher [Bill James] game scores [with a white umpire]
-- 1.0% more wins
-- 0.1% fewer strikeouts
-- 4.5% fewer home runs
-- 2.2% fewer hits
-- 5.2% fewer runs
-- 4.1% fewer walks
-- 5.8% lower ERA


The results are much better for a white pitcher when he faces a white umpire. And when a minority pitcher faces a same-race umpire, the results are also favorable:

-- 0.5% higher game scores [with a same-race umpire]
-- 11% (!) more wins
-- 7% more strikeouts
-- 14% (!!) fewer home runs
-- 4% fewer hits
-- 4.7% fewer runs
-- 1% fewer walks
-- 1% lower ERA

For both sets of pitchers, all the numbers go in the expected direction, except white/white strikeouts (which show a very small effect the other way). From this, the study concludes:

"For virtually every measure of pitcher performance, the impact of having a matched umpire benefits the pitcher ... many indirect outcomes, such as the number of home runs allowed by the pitcher, are also affected, suggesting that the umpire's behavior may alter the strategies of pitchers and batters."


Looking quickly at the raw data, you might agree with the authors. But if you look a bit more closely, you can see that's not necessarily so.

1. White Pitchers

First, although the authors talk about "14,000 starting pitchers" in the full sample, the numbers that make up the comparisons are much smaller. There are indeed a lot of starts (9,335 by my estimate) where a white pitcher faced a white umpire, but only about 899 starts where a white pitcher faced a non-white umpire. And so the standard error of the differences is pretty large.

Suppose the SD of earned runs per game is 3. Then the variance is 9. Assuming a starter pitches seven innings, the variance of earned runs per start becomes 7. So the variance over 9,335 games is 7/9335. That means the SD is .027. Converting that back to 9 innings gives .035.

Repeating for 899 starts gives an SD of .113.

Then the standard error of the difference between the two sets of umpires is the square root of the sum of the squares, which is .118.

The observed difference in ERA was 5.8% fewer earned runs than expected, which is about a quarter of a run. 0.25 divided by 0.118 is a bit over 2 standard deviations, which is just over the threshold of significance.
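Here is that back-of-the-envelope calculation as a short sketch, so the steps are easy to rerun with a different guess for the variance. The constants are the assumptions stated above (variance of 7 per seven-inning start, 9,335 and 899 starts); the function name is just for illustration.

```python
from math import sqrt

VARIANCE_PER_START = 7.0   # the post's assumption for a seven-inning start
TO_NINE_INNINGS = 9 / 7    # rescale runs per seven innings to an ERA basis

def era_standard_error(n_starts):
    """Standard error of the mean earned runs per start, rescaled to nine innings."""
    return sqrt(VARIANCE_PER_START / n_starts) * TO_NINE_INNINGS

se_white_ump = era_standard_error(9335)   # white pitcher, white umpire
se_other_ump = era_standard_error(899)    # white pitcher, non-white umpire
se_difference = sqrt(se_white_ump ** 2 + se_other_ump ** 2)
z = 0.25 / se_difference                  # observed gap of about a quarter of a run

print(f"{se_white_ump:.3f} {se_other_ump:.3f} {se_difference:.3f} z = {z:.2f}")
```

Carrying full precision through, the combined standard error lands within about a thousandth of the .118 above (the hand calculation rounds the intermediate steps), and the z-score stays a bit over 2.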

However, there are several reasons that we should still suspect non-significance.

Reason 1: remember that white umpires call more strikes than minority umpires, regardless of who the pitcher is. The raw difference was about 0.2 percentage points. The authors don't tell us what it is after adjusting for batter, count, pitcher, catcher, inning and score, but, as I argued earlier, it should be higher after adjusting for count. Suppose it's 0.3 percentage points. That's about one pitch every four games. If turning a ball into a strike is worth about 0.12 runs (I have it at 0.14, while another study has it at 0.1), that should reduce ERA by about .03. That's enough to push the observed result into statistical non-significance (although just barely).
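A quick sketch of why that adjustment matters, using the figures already given (a quarter of a pitch per start, 0.12 runs per flipped call, a quarter-run observed gap, and a standard error of .118). The 0.3 percentage point figure behind the quarter pitch is a supposition, as noted above.

```python
pitches_per_start = 0.25      # "about one pitch every four games"
runs_per_flipped_call = 0.12  # the post's value for turning a ball into a strike
observed_era_gap = 0.25       # about a quarter of a run
se_of_gap = 0.118             # standard error of the difference, from above

era_from_umpire_tendency = pitches_per_start * runs_per_flipped_call
adjusted_gap = observed_era_gap - era_from_umpire_tendency
z_before = observed_era_gap / se_of_gap
z_after = adjusted_gap / se_of_gap

print(f"ERA explained by white umpires' strike tendency: {era_from_umpire_tendency:.2f}")
print(f"z before: {z_before:.2f}, z after: {z_after:.2f}")
```

The z-score starts a little above 1.96 and ends up a little below it.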

Reason 2: the observed difference makes no sense in light of the probable number of pitches affected. Overall, before any adjustments, the UPM coefficient was 0.27 percentage points, which could mean 0.27 points for each of the three races. Now, to be the most conservative, suppose the entire effect is white/white bias. That might mean a W/W coefficient of 0.30 or so (there are so many more W/W than others that the coefficient wouldn't rise much).

Take that 0.30, add another 0.30 for the fact that white umpires call more strikes, and you get 0.6 pitches per game. That's about .07 runs per game, which is less than a 2% drop in ERA. The observed drop in ERA was about 5.8%.

The huge difference between the effect in pitches, and the effect in runs, suggests that you're looking at a substantial amount of luck.
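To put numbers on the mismatch, here is a small sketch. The baseline ERA isn't given directly, so it is backed out from the earlier statement that 5.8% is about a quarter of a run; that inference is mine.

```python
extra_strikes_per_game = 0.30 + 0.30   # W/W coefficient plus white umpires' general tendency
runs_per_flipped_call = 0.12
expected_run_effect = extra_strikes_per_game * runs_per_flipped_call

# A quarter-run gap was described as 5.8% of ERA, which implies this baseline ERA.
implied_era = 0.25 / 0.058
expected_pct_drop = expected_run_effect / implied_era

print(f"expected effect: {expected_run_effect:.3f} runs per game")
print(f"implied ERA: {implied_era:.2f}")
print(f"expected ERA drop: {expected_pct_drop:.1%} versus 5.8% observed")
```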

Reason 3: Given that the authors found that most of the racial bias seems to occur in situations of lesser importance, you would think that the cost of a single biased strike call would be *less* than 0.12. That makes the large ERA drop even harder to explain.

Reason 4: The observations of the separate statistics aren't consistent with each other either. A 5.2% reduction in runs should lead to a 5.2% increase in wins. But the increase in wins was only 1%. Again, that suggests luck.

Also, the basic stat most affected was home runs (4.5% decrease). If the improved performance was the result of more called strikes, wouldn't you expect the largest effect to show up in strikeouts and walks? But Ks were actually *lower* than expected, not higher. True, BBs were down about 4.1%, but it's hard to understand why strikeouts and walks should move in opposite directions. Again, that suggests luck at work.

The study's authors seem to believe that the decrease in home runs is indeed due to umpire bias. They argue that pitchers, (subconsciously?) aware of the discrimination they face, may alter their strategies to (for instance) throw more hittable pitches. However, ignoring the bias (and not changing strategy) would cost them at most a 2% increase in ERA. The theory that the threat of a 2% increase makes pitchers alter their behavior to cause a 5% increase seems kind of implausible.


2. Minority Pitchers

Because there are so few non-white pitchers and umpires, there were only about 114 times in the sample where a minority starting pitcher faced a same-race umpire. So all the results in Figure 4 are decidedly non-significant.

For instance, there was an 11% increase in wins. Assuming that starters in that 114-start sample would have gone 40-40, they actually went 44-36. This, I think, is about 1 standard deviation. You do have to adjust for the fact that minority umpires call fewer strikes than white umpires, but that would be another 1% effect, which is less than another half win.
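The "about 1 standard deviation" estimate can be checked with a binomial model. This sketch treats each of the 80 decisions as an independent coin flip, which is my simplification.

```python
from math import sqrt

decisions = 80          # the post's assumed 40-40 baseline
expected_wins = 40
actual_wins = 44

# Binomial SD of wins when each decision is a coin flip.
sd_wins = sqrt(decisions * 0.5 * 0.5)
z = (actual_wins - expected_wins) / sd_wins
print(f"SD = {sd_wins:.2f} wins, z = {z:.2f}")
```

Four extra wins against an SD of about four and a half wins is a bit under one standard deviation.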

And, again, there's a mismatch among the stats. Wins increase 11%, but runs decrease only 5%.

However, one thing in favor of the study's hypothesis, in this case, is that, because the same-minority-race sample is so small, it is possible (as we saw) that if all the bias is concentrated in the B/B and H/H cases, there could be a LOT of bias there – certainly enough to cause these results. But there's not enough data here to distinguish luck from bias. And so you certainly shouldn't conclude anything just based on the idea that it's *possible* that bias is the cause.

So when the authors write ...

"... [the] bias is strong enough to affect pitchers' measured performance and games' outcomes."

... I don't think they're correct. I bet that had the authors done significance tests on this data, and considered the sabermetric inconsistencies among the categories, they wouldn't have come to this conclusion.

------

Next, the study examines the effect of race matches in home games. When both teams (or neither) have a starter/umpire race match, the home team wins 53.8% of games. When only the home team has a match, they win 55.6% of games. But when only the visiting team has a match, the home team again wins 53.8% of games:

53.8% when neither or both teams have a same-race pitcher
53.8% when only the visiting team has a same-race pitcher
55.6% when only the home team has a same-race pitcher


The authors write,

"These differences ... suggest that there is an asymmetry in the impact of racial/ethnic matching: Matches are more important between the umpire and the home team's pitcher than between the umpire and the visiting team's pitcher."

In my opinion, the differences suggest no such thing, because the sample size is so small. The difference is 1.8% out of two groups of about 1350 games each – 24 games total out of 2700. That's less than one standard error.
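Here's the arithmetic behind that, as a sketch using the standard error of a difference between two proportions. The 1,350 games per group is the approximate figure from above, and treating games as independent is my assumption.

```python
from math import sqrt

n_per_group = 1350           # approximate games in each group, per the post
p_home_match = 0.556         # home win rate when only the home starter matches
p_visitor_match = 0.538      # home win rate when only the visiting starter matches

se_difference = sqrt(
    p_home_match * (1 - p_home_match) / n_per_group
    + p_visitor_match * (1 - p_visitor_match) / n_per_group
)
difference = p_home_match - p_visitor_match
z = difference / se_difference

print(f"difference = {difference:.3f}, SE = {se_difference:.4f}, z = {z:.2f}")
print(f"games swung: {difference * n_per_group:.0f}")
```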

The authors break the numbers down by race of umpire, but with only 11 B/B and 36 H/H games, respectively, the results are certainly not meaningful.

-----

So, I think the authors' attempt to establish game-level differences by race fails: first, because of lack of statistical significance, and, second, because of the implausible relationships among the affected statistics.

-----


Tuesday, April 15, 2008

The Hamermesh umpire/race study revisited -- part IV

This discussion of the Hamermesh study is continued from the previous post. I'm going to go through this part pretty quick, since it's probably getting boring.

-----

The previous posts have brought us to the finding with the most statistical significance: the attendance study.

The authors divided their three-seasons' worth of games into high attendance vs. low attendance, with a cutoff point of 70% of capacity. Their idea is that the more people are attending the game, the worse the consequences for the umpire if he gets the call wrong: attendance "prox[ies] the scrutiny of umpires and thus the price of discrimination."

And the authors' expectations are confirmed. It turns out that when attendance is high, what discrimination there is is opposite to what you'd expect – the racial preference goes *against* the umpire's own race. But when attendance is low, there is a large, and statistically significant, same-race preference. Expressed in "UPM," the results are

-0.28% same-race strikes with high attendance (0.8 SD)
+0.84% same-race strikes with low attendance (2.7 SD)


The same qualifications apply as in the previous cases ... plus a couple of additional things:

-- In the QuesTec case, all home games of the QuesTec teams were omitted from the regression. Here, some home games will be included while some won't. Suppose that (say) only one home game in Boston is included, but all their road games. Might that skew the estimate of the "home" parameter in the regression, if the one Boston game was somehow an outlier? That adds a little more of a clustering effect, which would cause the standard error to be underestimated even a little bit more.

-- According to a recent study by David Gassko, there are park effects for strikeouts and walks – and those effects appear to be real (even after regressing to the mean to eliminate luck). If that's the case, that would cause even more clustering – it becomes more likely that the small number of pitches in the B/B cell, just by luck, featured games in high K/BB ratio parks. Again, that would cause the regression to overestimate significance levels.

But I think there's really something there. I think this is real evidence that differences in attendance lead to umpires calling pitches differently with respect to race. As I wrote before, I think the next step would be to look at the small number of hispanic/hispanic and black/black games, to see if there's anything unusual, and then to look at each of the individual umpires.

It would be premature to conclude, as the authors do, that they have found conclusive evidence of equal, subconscious bias among all umpires. The results are just as consistent with conscious bias, unequal bias, and bias among a small minority of umpires.

-----

There's a second way that the study's authors try to show that racial bias decreases when scrutiny increases. They classify pitches with two strikes or three balls as "terminal" – meaning they have the potential to produce a strikeout or walk and end the plate appearance. They point out that such terminal pitches are more highly scrutinized, and umpires would be motivated to put their biases on hold when the situation is so important that they might get in trouble.

And indeed, they find some differences:

+0.46% UPM when not in terminal count;
-0.28% UPM when in terminal count.


They also show that the effect is larger in the early innings (when, presumably, the pitch isn't important and the price of discrimination is low), and smaller in the late innings (when scrutiny is high).

In general, the individual UPMs are not statistically significant (the +0.46 above is 1.9 SD), but the difference between terminal and non-terminal count *is* significant.

However: I'm not sure if the authors controlled for the actual count; in Table 5, they said they did, but in Table 7, which gives the same result, they say they didn't. And if they didn't, the results aren't meaningful.

Why? Because white pitchers throw more strikes than minority pitchers. So, when it comes to a terminal count, white pitchers will be more heavily weighted among the 2-strike counts, and minority pitchers will be more heavily weighted among the 3-ball counts.

That means that white pitchers will be in terminal-count situations where they're more likely to waste a pitch and throw a ball, and minority pitchers where they're more likely to assume the batter is taking and throw a called strike.

Also, since the data show that white umpires call more strikes than minority umpires, the white/white cell will have the largest decrease in strikes on terminal counts. And the minority/minority cells will have the largest increase in strikes on terminal counts. How will this affect the UPM coefficient? I'm not sure, but if you compare white umpires calling extra 0-2 pitches on white pitchers to minority umpires calling extra 3-0 pitches on minority pitchers, I don't think the regression will tell you anything of value.

If the authors *did* control for count, the results are reasonable.

-----

In Table 7, the authors do a master regression, including all three factors: QuesTec, attendance, and terminal count. When all three factors are low scrutiny, the sum of the UPM factors is 0.0107. When all three factors are high-scrutiny, the UPM total is *negative* 0.0120:

+1.07 percentage points: non-QuesTec, low attendance, non-terminal count
-1.20 percentage points: QuesTec, high attendance, terminal count


This seems to be pretty good evidence that there's a difference between the two cases. It does, however, bring up an obvious question: if umpires have a tendency to be biased in favor of their own race in general, why are they suddenly biased *against* their own race in high-scrutiny situations? Shouldn't they, at best, become neutral? The authors argue,


"One might speculate that umpires feel that they are favoring matched pitchers [at other times] and that they sub-consciously overcompensate in instances when they know they are under scrutiny."

I don't know anything about the psychology of racism, so I'll defer to Hamermesh's expertise that this is a plausible explanation.

-----

In cases of more scrutiny, umpires have negative racial bias. In cases of less scrutiny, they have positive racial bias. Is it possible the two cancel out?

The bias does seem to be larger on the positive side, but the negative cases might have higher leverage in terms of winning the pennant. "Terminal counts" are the most important to winning games, and high attendance games are probably more important towards winning the pennant. Interpreting the results the way the study does, you might conclude that, yes, umpires have bias in low-scrutiny situations, but that in subconsciously undoing that bias in high-scrutiny situations, they even everything out.

Of course, they don't even everything out evenly: the Yankees and Red Sox will have negatively-biased umpires most of the time, and the Marlins will have positively-biased umpires most of the time. So the point is somewhat irrelevant.



Friday, April 11, 2008

The Hamermesh umpire/race study revisited -- part III

This is part 3 of a discussion of the Hamermesh study. There have been several previous posts:

Old: first, second, third, fourth
Recent: part 1, part 1 (addendum), part 2

-----

Here's a summary of my analysis of the Hamermesh paper so far:

-- I agree with the study that there is no evidence of bias in the overall data (Table 2 in the paper).


-- I believe the study is flawed because the model forces the assumption that umpires of every race are equally biased in favor of their own race.

-- Because of that flaw, I believe the coefficients found by the study's regression are not meaningful, although the significance levels are.

-- Doing my own regressions, I found an insignificant same-race preference for black and hispanic umpires, but a slight insignificant *opposite-race* preference for white umpires.

-- I argue that the results from the raw data can be explained by as few as 100 pitches out of a million.

-- After controlling for pitcher, umpire, batter, and count, the authors of the original study found an effect about five times as large as without those controls. That effect is also not statistically significant.

------

Now, let's turn to the paper's next test, the one that *does* turn out to be significant.

The authors note that about a third of major-league ballparks have the "QuesTec" system installed. QuesTec is a hardware and software package that measures whether or not a given pitch is in the strike zone. After the game, the umpire's calls are compared to the QuesTec results. The authors write,



" ... QuesTec is the primary mechanism to gauge umpires' performance. In particular, if more than 10 percent of an umpire's calls differ from QuesTec's records, his performance is considered substandard, and that may influence his promotion to "crew chief," assignment to post-season games, or even retention in MLB."

They note that if an umpire discriminates by race, whether consciously or unconsciously, he faces a higher "price" of doing so when QuesTec is in operation. He has a much stronger incentive to abandon his same-race preference when the cost is high, and, therefore, we should see much less evidence of discrimination when QuesTec is in effect.

And we do.

In the previous post, we saw that, after adjusting for batter, pitcher, umpire, and count, the coefficient of "UPM" (designating an umpire/pitcher race match) was 0.27 percentage points more called strikes. With QuesTec running, the effect is almost reversed: UPM comes out to –0.35, which in fact favors other races. However, when umpires do not have to face an objective review, same-race favoritism appears to skyrocket: it jumps to 0.63.

I'll write those numbers out again:

+0.63 without QuesTec
-0.35 with QuesTec
---------------------
+0.27 overall


The –0.35 figure is not statistically significant (about 1 SD), but the +0.63 figure definitely is, at +2.2 SDs (p=.015). The difference between the two figures, 0.98%, is exactly 2 SDs from zero, and so is also significant.

It makes sense to me that umpires will be calling pitches differently with QuesTec and without. It is well-accepted that every umpire has his own strike zone. If 93 umpires have 93 strike zones, but have a strong motivation to all use the same strike zone when QuesTec is running, there should be some significant changes in pitch calling.

It's unfortunate, though, that the authors chose not to provide any other data on the QuesTec breakdown. They don't even tell us whether umpires call more strikes under QuesTec, or fewer. That's a question that's of great interest to sabermetricians and baseball fans. It turns out (thanks to a Mitchel Lichtman study as part of a discussion of this paper) that there's no difference: there are 31.9% strikes in both QuesTec and non-QuesTec parks.

Another omission is that the paper doesn't show us what races of pitcher or umpire were affected the most. If they had provided us with the QuesTec equivalent of their Table 2, with strike percentages for each of the nine cells, we could figure this out.

Actually, in Figure 1, the authors do provide a bit of that data in graphic form. White umpires call 32.1% strikes on white pitchers in non-QuesTec parks, but only 32.0% in QuesTec parks – a difference of 0.1 percentage points. Minority umpires (the authors group black and hispanic together here) calling white pitchers go from 31.8% to 32.2% -- suggesting a much higher race preference, about four times as high.

For minority pitchers, there's a huge difference in how minority umps call strikes when the pitcher is of the same race. They go from 32.2% strikes without QuesTec, all the way down to 30.6% strikes when QuesTec is watching them. That's a difference of 1.6%. On the other hand, when the umpire is not of the same race as the minority pitcher (which mostly means white umpires), the difference is only 0.1%.

White ump, white pitcher: -0.1% under QuesTec
Minority ump, same-race pitcher: -1.6% under QuesTec
White ump, minority pitcher: +0.1% under QuesTec (approx.)
Minority ump, white pitcher: +0.4% under QuesTec

So if the QuesTec effect really does come from race bias, it looks like minority umpires' bias is between 4 and 16 times as high as white umpires' bias! This is something the paper never mentions; the authors never noticed that the empirical results strongly contradict their assumption about equal bias for every race.

However: I don't think any of these results are statistically significant. I think even the 1.6% figure is only about 1 SD (because of the small number of minority-on-same-minority pitches).

------

While none of the above four findings is statistically significant by itself, taken together, they combine to create the significant UPM coefficient the authors found for non-QuesTec parks.

That coefficient, again, is 0.63 percentage points. As I said in the last post, that coefficient means only that IF the races have identical biases, that bias is likely to be 0.63. But if the biases aren't identical, how big would they have to be?

To test, I started with an unbiased matrix that's close to the real Table 2:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.45 30.67
Hspnc Umpire-- 32.01 31.41 30.63
Black Umpire-- 31.85 31.24 30.46

As I mentioned, the authors didn’t provide a "Table 2" for non-QuesTec parks only. They did, however, say how many pitches there were, and it came out to 62.89% of the total. So I assumed that every cell had 62.89% as many pitches as the original. That's not exactly correct, I found out, because the first thing I did was to add the study's 0.63 to each of the diagonals. I wound up getting a significance level of .04, instead of .015. That might be the result of different numbers of pitches, or rounding errors. To get down to p=.016 I had to use 0.75 as the UPM, instead of 0.63.

Anyway, there are lots of unequal combinations of W/W, H/H, and B/B bias that would give us the same UPM coefficient of 0.75, and significance level of .016. For instance, if we add 1.2 percentage points to the W/W cell, and leave the H/H and B/B cells alone, we get UPM=0.76, and p=.015. Here are some other combinations that work. I had to get these by trial and error:

-- W/W bias of 1.2 points (UPM=0.76, p=.015)
-- H/H bias of 2.6 points (UPM=0.76, p=.014)
-- B/B bias of 9.5 points (UPM=0.75, p=.015)

As it turns out, you can take any 100% combination of those three biases, and get roughly the same result. So if we go one-third/one-third/one-third, we get W/W bias of 0.4, H/H bias of 0.87, and B/B bias of 3.2:

-- W/W 0.4, H/H 0.87, B/B 3.2 (UPM=0.76, p=.015)

We can also go 1/2 H/H and 1/2 B/B:

-- W/W 0.0, H/H 1.3, B/B 4.7 (UPM=0.75, p=.016)
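
The blending arithmetic is simple enough to sketch. This only reproduces the mixing of the three single-cell biases listed above; it doesn't rerun the regression, and the function and variable names are mine.

```python
# Single-cell biases (percentage points) that each reproduce the target
# UPM on their own, taken from the trial-and-error list in the text.
SINGLE_CELL_BIAS = {"W/W": 1.2, "H/H": 2.6, "B/B": 9.5}


def blend(weights):
    """Scale each single-cell bias by its share; shares should sum to 1."""
    return {cell: round(SINGLE_CELL_BIAS[cell] * share, 2)
            for cell, share in weights.items()}


thirds = blend({"W/W": 1 / 3, "H/H": 1 / 3, "B/B": 1 / 3})
halves = blend({"W/W": 0.0, "H/H": 0.5, "B/B": 0.5})

print(thirds)
print(halves)
```

The text rounds the B/B figures to one decimal place, so expect small differences in the last digit.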

Let's take that last case and see what happens. That means that the bias in each cell is:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- +0.00 00.00 00.00
Hspnc Umpire-- 00.00 +1.30 00.00
Black Umpire-- 00.00 00.00 +4.70

If that were the case, it would explain the QuesTec findings completely. How many pitches does it take to create this effect? This many:


Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +60 +00
Black Umpire-- +00 +00 +52

So: it is possible that the entire effect the study found is due to only 112 pitches, out of 700,000. Now, that's a little unfair, because, even though it's a small number of pitches, it's particularly unlikely for it to happen exactly this way. (I think to get the most likely combination, you want to find the maximum-likelihood diagonal that still creates a UPM of 0.75. I don't know how to do that. But that doesn't matter to my argument here.)
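
For anyone who wants to check the pitch counts, here's a back-of-the-envelope sketch. The B/B cell size (1,110 pitches) is the one quoted in this post. The H/H cell size is my rough assumption, built from this post's estimates of about 29 games for that cell and about 160 called pitches per game.

```python
PITCHES_PER_GAME = 160               # rough figure used in this post

# Cell sizes: B/B is quoted directly; H/H is an ASSUMED approximation
# (about 29 games' worth of pitches).
cell_pitches = {"H/H": 29 * PITCHES_PER_GAME, "B/B": 1110}

# Biases from the half-H/H, half-B/B scenario, in percentage points.
cell_bias_points = {"H/H": 1.3, "B/B": 4.7}

extra_strikes = {
    cell: round(cell_pitches[cell] * cell_bias_points[cell] / 100)
    for cell in cell_pitches
}
total_extra = sum(extra_strikes.values())

print(extra_strikes, total_extra)
```

With those inputs the two cells come out to about 60 and 52 extra called strikes, which is where the 112 comes from.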

And the B/B bias is pretty high, at almost 5 percentage points. Since there are only 30% strikes in the first place, 5 percentage points out of 30 is a 17% increase in called strikes. You'd think that would be hard for an umpire to get away with.

But the point is that there's no way of knowing how many pitches are affected. And it's at least *possible* that the entire effect is only 112 pitches – which is probably a surprise.

-----

Now, let me argue that the significance level may not be all it's cracked up to be. That would be because the significance calculation made one assumption that isn't true in the smaller cells: the assumption that the errors in each cell are independent of each other.

Let me try to explain what I'm getting at here.

The regression in the study corrected for batter, pitcher, umpire, count, score, home team, and QuesTec. But we can imagine there are some other things that would affect strike percentage.

For instance: maybe runners on base and number of outs. With one out and a runner on third, the pitcher is willing to risk a walk, in order to avoid giving the batter something good to hit. If black pitchers facing black umpires had fewer such situations, that could account for part of the effect.

Also, there's the day/night factor. It's likely that called strikes vary between day and night games. Suppose the ball is harder to see at night. Then there might be more swinging strikes. That means more 0-2 counts, which means more called balls. Or, it could be the other way around: bad visibility might mean more called strikes, as batters swing less.

In any case, suppose there's a significant difference between day and night. It could turn out that H/H and B/B combinations just happened to take place during the time of day when conditions were more "called-strikey."

I'm sure you can think of more factors that weren't controlled for. Maybe there's a platoon interaction: more called strikes when the pitcher has the platoon advantage.

These factors could be fairly large. And it could turn out that the same-race combinations just happened to have better circumstances.

Now, that's kind of grasping at straws, this "you didn't control for everything" criticism. In any study, there's always something you don't control for, and that, in and of itself, doesn't make the results of the study invalid. In fact, the significance *takes into account* that there are uncontrolled variables: the significance level is defined as the probability that the uncontrolled variables (including luck) could cause that kind of result.

But: that's only true if the errors are independent.

Suppose that the league bats .250. You pull 480 random AB from the league. What's the chance those random AB hit .300? Well, you expect 120 hits. You need 144. That's about 2.5 standard deviations away, so the chance is less than 1 in 100.

But: instead of pulling 480 random AB from the league, what if you first select a random player, then select 480 random AB from only that player's batting line? Now, the chances are much higher: roughly, the chance of having pulled a .300 hitter. Maybe they're only, say, 1 in 15.

In the first case, the chance of drawing an AB from a certain player was independent of the last one. But in the second case, it's not. And so the normal binomial probability calculation doesn't apply.
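
Here's a sketch of the two sampling schemes. The independent case is an exact binomial tail: 480 AB at .300 means 144 hits, and the question is how often a .250 league produces at least that many. The one-player case needs a distribution of true talent, and the one below is completely made up for illustration, so only the direction of the comparison means anything.

```python
import math


def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed in log space."""
    total = 0.0
    for k in range(k_min, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total


AT_BATS = 480
HITS_NEEDED = 144          # 480 AB at .300

# Case 1: every AB drawn independently from a .250 league.
p_independent = binom_tail(AT_BATS, HITS_NEEDED, 0.250)

# Case 2: pick one player first, then draw all 480 AB from him.
# HYPOTHETICAL talent distribution: (true average, share of players).
TALENT = [(0.200, 0.1), (0.225, 0.2), (0.250, 0.4), (0.275, 0.2), (0.300, 0.1)]
p_one_player = sum(share * binom_tail(AT_BATS, HITS_NEEDED, avg)
                   for avg, share in TALENT)

print(f"independent AB: {p_independent:.4f}")
print(f"one player's AB: {p_one_player:.4f}")
```

Whatever talent spread you plug in, the second probability comes out far larger than the first, and that gap is exactly what the plain binomial calculation misses.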

Now, let's talk about the B/B cell. Are its pitches completely random in terms of day/night, platoon advantage, and runners on base? They are not. And that's because the umpire/pitcher combinations aren't dotted throughout the season – they're concentrated into only a few specific games.

There are about 160 called pitches per game (80 per team). The B/B cell has only 1,110 pitches. 1,110 divided by 160 is 7 games' worth of pitches. That's it, just seven games. If the pitches were independent, the chance of getting, say, 80% of those pitches being during day games would be infinitesimal. But it could easily happen that 6 of the 7 games would be day games. And because of that, the regression underestimates the probability that there could be a lot of day games, and therefore underestimates the amount of luck, and therefore overstates the significance of the findings.
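
To put a rough size on that, here's a sketch comparing how much the day-game share of the B/B pitches can wander under the two assumptions. The 35% day-game share is a number I made up for illustration; the 1,110 pitches and 160 pitches per game are from above.

```python
import math

BB_PITCHES = 1110
PITCHES_PER_GAME = 160
DAY_SHARE = 0.35    # HYPOTHETICAL fraction of games played in daylight

games = round(BB_PITCHES / PITCHES_PER_GAME)   # about 7 games

# SD of the observed day-game share if each pitch were an independent draw.
sd_if_independent = math.sqrt(DAY_SHARE * (1 - DAY_SHARE) / BB_PITCHES)

# SD if the pitches arrive in whole games, so only the games are independent.
sd_if_clustered = math.sqrt(DAY_SHARE * (1 - DAY_SHARE) / games)

ratio = sd_if_clustered / sd_if_independent

print(games, round(sd_if_independent, 3), round(sd_if_clustered, 3), round(ratio, 1))
```

The ratio doesn't depend on the day-game share I picked: it's just the square root of pitches per game, so the clustered spread is more than ten times the spread the independence assumption implies.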

(This argument is a little oversimplified because of relief pitchers – the 7 games is probably 11 seven-inning games worth of starters, and maybe 48 innings of relievers. But it's still easy to see how the B/B cell could be daylight-intensive.)

Because of this concentration of pitches into small numbers of games – the B/B cell contains only 7 games, the B/H cell only 3 games, and the H/H cell only 29 games – the significance levels are overstated. Overstated by how much? I don't know. But the UPM coefficient was barely significant – 2.2 SD instead of just 2. If you figured out how to calculate the significance levels correctly – using cluster sampling methods that I don't know how to use – it seems to me very possible that the UPM would slip below significance.

Having said that, I'm not really that thrilled with the idea that you should draw such a fine line at 5% significance, where one side of the line is meaningful and the other side is not. But a difference between p=.015 (which the study found) and (for instance) p=.08 (after adjusting for clusters) is a difference between a 1 in 67 chance that the results would obtain by luck, and a 1 in 12 chance that the results would obtain by luck.

And, in any case, just think about the B/B cell, which had 52 called strikes too many in seven full games. Does that seem like something that could be caused by a few too many day games? I'm not sure. We are talking about one extra strike every two innings, which seems like a lot. How about one strike every four innings? That would reduce the cell from +52 to +26, which would still be enough to make the results no longer statistically significant.

My gut says no, that's still too much, but my gut doesn't really know. To find out, you can find the SD of called strikes in full games, and compare it to the SD of one game's worth of called strikes, taken randomly from multiple games, and do some calculations.
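
Here's a sketch of what that comparison might look like, using simulated pitches rather than real data. The league strike rate and, especially, the game-to-game spread in that rate are hypothetical numbers chosen only to show the mechanics.

```python
import random
import statistics

random.seed(2008)

PITCHES_PER_GAME = 160   # called pitches per game, as estimated above
LEAGUE_RATE = 0.32       # rough called-strike rate (round number)
GAME_RATE_SD = 0.03      # HYPOTHETICAL game-to-game spread in that rate
N_SAMPLES = 2000


def draw_game_rate():
    """Called-strike rate for one random game, clipped to [0, 1]."""
    return min(max(random.gauss(LEAGUE_RATE, GAME_RATE_SD), 0.0), 1.0)


def strikes_in_full_game():
    """All 160 pitches share a single game's rate (clustered)."""
    rate = draw_game_rate()
    return sum(random.random() < rate for _ in range(PITCHES_PER_GAME))


def strikes_in_scattered_pitches():
    """Each of the 160 pitches comes from a different random game."""
    return sum(random.random() < draw_game_rate()
               for _ in range(PITCHES_PER_GAME))


sd_full_games = statistics.stdev(
    [strikes_in_full_game() for _ in range(N_SAMPLES)])
sd_scattered = statistics.stdev(
    [strikes_in_scattered_pitches() for _ in range(N_SAMPLES)])

print(round(sd_full_games, 2), round(sd_scattered, 2))
```

With real pitch-by-pitch data you'd replace the simulated draws with actual games. The ratio of the two SDs would then tell you roughly how much to inflate the study's standard errors.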

Anyway, I'd argue that the significance level is definitely overstated due to the clustering of pitches within games. I just don't know by how much.

------

Now, suppose that, even after adjusting for day/night clustering and platoon clustering and runners-on-base clustering, we find there's still significance. That would be evidence of bias. We still wouldn't know how much bias, or among which races – as we saw above, there is an infinity of possible combinations – but we might conclude there's at least some bias.

But could it be anything other than racial bias?

The authors acknowledge that it could. Actually, I don't think they discuss this in the paper, but in an FAQ posted on Dr. Hamermesh's website, he says,



"[Perhaps] different race/ethnicity groups espouse differen[t] styles of play. Suppose for example, that youth baseball coaching is different in Latin America than elsewhere, and that Hispanic pitchers consequently develop pitching “styles” that differ from those of Black, Asian, or White pitchers. If Hispanic umpires and pitchers both espouse similar styles that differ from other races/ethnicities, then what appears as discrimination may simply reflect these stylistic differences."

I think it's quite possible that something like this is going on – but it might have nothing to do with race or ethnicity. Every umpire is different, so it's possible that the three hispanic umpires, or the five black umpires, might have different "styles" just by chance.

For instance, suppose some umpires, just by virtue of their own quirks, have trouble calling curve balls that just miss the low inside corner – they usually call them strikes. And suppose, that, by chance, three of the five black umpires (60%) have that quirk, but only (say) 20% of other umpires.

Also, pitchers are different, too. And there are only 27 black pitchers (as compared to 669 white pitchers). So it's also possible there are lots of black curveball pitchers, as compared to white curveball pitchers, just by chance.

This combination of black curveball pitchers and umpires who have oversized strike zones for curveballs could easily create the appearance of racial bias, when it's just random style interaction.

And, again, the regression algorithm doesn't consider the probability that this will happen – it assumes, by necessity, that curveball quirks are spread evenly among all umpires, and that curveballs are spread evenly among all pitchers. So, again, the regressions in the study overstate the significance of any discrepancies.

------

So, to summarize the findings so far:

-- the non-QuesTec finding of racial bias is significant at 2.2 SD (p=0.015);

-- but because of clustering of possible uncontrolled variables (like day/night), the actual number is certainly less than 2.2 SD;

-- and because of clustering of uncontrolled umpire and pitcher styles (like curve ball characteristics), the actual number is lower still;

-- I don't know how much to reduce the 2.2 SD by. I suspect not too much, but it might be enough to go below 2.0 and make the results "officially" not significant;

-- there is some suggestion that the differences are larger among minority pitchers than white pitchers;

-- and so probably only a small number of overall pitches are affected, many fewer than the 0.63% suggested by the regression coefficient.

Let my official wimpy conclusion be that for the non-QuesTec games, the data is suggestive, but probably not quite significant, and more research is called for.

------

And the one obvious question that hasn't been asked yet: if the effect is small, as it seems to be, couldn't it be just one or two umpires causing it? Why not just look at individual umpires and see if there's an outlier? It could be as few as, say, 25 pitches that move the result from significant to random. Since it's obvious that, even within the same ethnic group, different human beings have different propensities to conscious or unconscious racial bias, why does the study insist on looking only at global "same race" effects? Not only wouldn't they break that down into W/W, H/H, and B/B, but they also didn't break it down by umpire. That's kind of puzzling to me.

Some commenters, here and elsewhere, have suggested it's worth looking at Angel Hernandez, because he has a reputation for being an outlier in other aspects of his umpiring career. I don't want to get sued or anything, so let me emphasize that there is absolutely no data to support this theory, and I have no idea whether it's true or not. But if it does turn out that he, or any other umpire, is an outlier who shows evidence of bias, and removing him is enough to make the rest of the umpires look unbiased ... well, won't a lot of work have been wasted on this intricate study for nothing? Why not just actually take a look at the data the way a fan would before you start doing complicated regressions with hard-to-interpret results?

------

Part IV will talk about some of the study's other tests, which (thank God) are a lot simpler than the regression we've been talking about in these last three posts.



Wednesday, April 09, 2008

The Hamermesh umpire/race study revisited -- part II

This is the latest in a series of posts about the Hamermesh umpire/race study.

August, 2007 posts: first, second, third, fourth.
Recent posts: part 1, part 1 (addendum)

This is part 2. What follows is a little complicated, and I hope I got it right. Please let me know.

---

In the previous post, I showed that a regression on the Table 2 data from the Hamermesh study showed no significance for the "UPM" parameter, the one that represents same-race bias on the part of umpires.

That regression was still simpler than the ones in the study. In column 8 of Table 3, the authors do pretty much the same regression I did, but they control for a whole bunch of other variables:

-- the count
-- the pitcher
-- the umpire

When I say "the count," I mean that the authors included a dummy variable for every possible count (except one – you need to omit one, any one, to get the regression to work properly and avoid a singular matrix). That's 11 variables. They added a variable for each umpire, which is 92 variables (since there were 93 umpires). And they also included a variable for each pitcher (except one), which is about 900 more variables.
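
If the "omit one" business is unfamiliar, here's a small sketch of how the count dummies could be built. The choice of 0-0 as the omitted baseline is mine; any count would do.

```python
# Every ball-strike count a pitch can be thrown on: 4 x 3 = 12 counts.
ALL_COUNTS = [(balls, strikes) for balls in range(4) for strikes in range(3)]

# Drop one count as the baseline (0-0 here, an arbitrary choice) so the
# dummy columns aren't perfectly collinear with the intercept.
BASELINE = (0, 0)
DUMMY_COUNTS = [c for c in ALL_COUNTS if c != BASELINE]


def count_dummies(count):
    """Return the 11 indicator values for a pitch thrown on `count`."""
    return [1 if count == c else 0 for c in DUMMY_COUNTS]


n_count_dummies = len(DUMMY_COUNTS)   # 11
n_umpire_dummies = 93 - 1             # 92: one umpire is the baseline

print(n_count_dummies, n_umpire_dummies)
print(count_dummies((3, 2)))
```

A pitch on the baseline count gets all zeros; every other count switches on exactly one of the 11 columns.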

Why include all these variables? Because, whatever effect you get, it could be just luck, that certain pitchers wound up facing certain umpires more than usual. Maybe the white umpire with the big strike zone happened to face Jake Peavy a lot, and that's what inflated the W/W cell. By controlling for the individual pitcher, and the individual umpire, you help eliminate some of the random luck from the results.

A more important issue than controlling for pitchers and umpires is controlling for count. It's good that the authors did that, because using just the raw data might understate the differences among umpires and pitchers.

There's probably a bit of a negative correlation between one pitch and the next. After a strike, a given pitcher is more likely to throw a ball (perhaps wasting an 0-2 pitch); and, after a ball, he's more likely to throw a strike (such as on 3-0, when the batter is probably taking). So there is an underlying force – the count – that pulls pitchers' strike percentages towards each other. Pitcher A might have 5% more strikes than pitcher B on every count, but have only 3% more strikes overall because of all of B's extra 3-0 strikes. The 5% figure is the accurate indicator of the relative output of the pitchers – or of the umpires that called them.
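
A toy example shows how that can happen. All the rates below are invented, and I've collapsed the twelve counts into just two buckets: counts where the pitcher is ahead (fewer called strikes) and counts where he's behind (more called strikes).

```python
# HYPOTHETICAL called-strike rates by count bucket. Pitcher A gets
# 5 points more than pitcher B in both buckets.
RATES = {
    "A": {"ahead": 0.25, "behind": 0.45},
    "B": {"ahead": 0.20, "behind": 0.40},
}

# HYPOTHETICAL share of each pitcher's pitches thrown in each bucket.
# A throws more strikes, so he spends more of his time ahead in the count.
MIX = {
    "A": {"ahead": 0.60, "behind": 0.40},
    "B": {"ahead": 0.50, "behind": 0.50},
}


def overall_rate(pitcher):
    """Called-strike rate pooled across both count buckets."""
    return sum(MIX[pitcher][b] * RATES[pitcher][b] for b in RATES[pitcher])


per_count_gap = RATES["A"]["ahead"] - RATES["B"]["ahead"]
overall_gap = overall_rate("A") - overall_rate("B")

print(f"gap on every count: {per_count_gap:.2f}")
print(f"gap in raw totals:  {overall_gap:.2f}")
```

The 5-point gap on every count shrinks to 3 points in the raw totals, which is why you want the count in the regression.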

So anyway, having added those 1,000 variables to the regression, the study found more of a UPM [umpire/pitcher matching in race] effect:

.05% -- my regression
.28% -- the study's regression, after controlling for pitcher, umpire, and count

With the extra variables, the racial bias coefficient was over five times as high as without – but it was still not statistically significant, at about p=0.12 (about 1.2 standard deviations above zero). The authors also ran one more regression, adding another 1,400 variables to control for the batter, and got almost the same result (.27%).

The authors write:



"... the point estimates imply that a given called pitch is approximately 0.27 percentage points more likely to be called a strike if the umpire and pitcher match race/ethnicity."


And this is where I disagree. I believe that the significance level is accurate, but that the actual coefficients are not.

There's nothing wrong with the way the authors read the results of the regression output – everything is fine statistically – but I disagree with the model that produced the regression. Specifically, the addition of the UPM parameter carries a hidden assumption: that same-race bias must be the same regardless of the race involved. The UPM parameter of .27 means that each of the three cells on the diagonal is 0.27 percentage points higher than it would otherwise be. So white umpires called 0.27 percentage points "too many" strikes; hispanic umpires called 0.27 percentage points too many strikes; and black umpires called 0.27 percentage points too many strikes. The assumption forces the regression to choose equal estimates for all three races. That equal estimate came out to 0.27.

I don't think forcing this equality is appropriate. It's certainly *possible* that the races have the same bias, but is it so certain that you can embed it in the model that way? I'd argue that it isn't even likely. In almost every aspect of life that has real, proven bigotry, it almost always goes one way. Whites used to lynch blacks; did blacks ever lynch whites? Are there gangs of gay men who roam public parks looking for smooching heterosexuals to beat up?

Even where it's obvious that two groups mutually dislike each other, does it really follow that one group will be *exactly* as biased as the other? Is a Republican boss exactly as unlikely to hire a Democrat as a Democrat boss is to hire a Republican? Even if they're equal today, what about tomorrow? When George W. Bush does something controversial overnight, don't you think Democrats will get a lot more pissed off than Republicans, and the relative bias will wind up a little bit more extreme than yesterday?

If you agree that it's reasonable that the races would have different levels of bias towards each other – and even if you don’t – you have to qualify the results of the study. Instead of saying

-- "The best estimate of racial bias is 0.27% of pitches."

You need to say

-- "IF racial bias is the same across all races, THEN the best estimate of racial bias is 0.27% of pitches."

Since we don't know that bias is the same across the races (and I think we have reason to believe that it's not), we can't just assume that the 0.27% is the right number.

But, suppose that even without that assumption, we'd get a similar result. Then my objection would just be a technicality. But I don't think that's the case. Let me show how wildly different the numbers are if the bias isn't equally spread among the races.

Suppose the unbiased frequency of strikes should be this:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.00 33.00 34.00
Hspnc Umpire-- 35.00 36.00 37.00
Black Umpire-- 38.00 39.00 40.00

Now, suppose that same-race umpires are biased equally, as the model demands. Say, by 1 percentage point each. We'll add 1.00 to each cell of the main diagonal. That gives us:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 33.00 33.00 34.00
Hspnc Umpire-- 35.00 37.00 37.00
Black Umpire-- 38.00 39.00 41.00

In this case, the model works. If we do a regression now (keeping the same number of pitches for each cell as in the original study), we get exactly the results we expect, with a UPM coefficient of exactly 1%. The regression works perfectly because the actual bias matches the model. And the 1% means that, all else being equal, a pitcher will receive, on average, one extra percentage point of strikes when the umpire matches his race.

But, now, what if the bias isn't equal? What if it's all concentrated in the white umpires? That is, instead of a 1% same-race bias on each umpire, we have a 3% bias in the W/W case, and 0% in the other two. The observations look like this:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 35.00 33.00 34.00
Hspnc Umpire-- 35.00 36.00 37.00
Black Umpire-- 38.00 39.00 40.00

Now, all else being equal, how many percentage points more strikes will a pitcher get if the umpire matches his race? Well, the only time that happens is in the white/white case. So 3% more strikes out of 741,729 is 22,251 more strikes. There are 750,817 pitches in the same-race diagonal. 22,251 divided by 750,817 equals 2.964%.

So we'd expect the UPM coefficient to be about 2.964%. When we do the regression, it turns out to be 1.883%. That's way off, and it's way off because the reality doesn't match the model.

It's even worse if we try the other same-race cases. Suppose all 3% goes in the H/H cell, and the W/W and B/B umpires are unbiased:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.00 33.00 34.00
Hspnc Umpire-- 35.00 39.00 37.00
Black Umpire-- 38.00 39.00 40.00

Now, 0.029% of same-race matchups are affected, but the UPM coefficient is 0.882%.

Finally, suppose all the bias (3%) is in the B/B case:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.00 33.00 34.00
Hspnc Umpire-- 35.00 36.00 37.00
Black Umpire-- 38.00 39.00 43.00

Now, in reality, only 0.007% of pitches are affected (about 1 in 14,000). But the UPM coefficient is 0.23% (1 in 427).
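
The four scenarios above can be rerun without pitch-level data. Here's a minimal sketch, assuming the regression is an ordinary least-squares fit on called-strike indicators; in that case, fitting the nine cell percentages weighted by their pitch counts gives the same coefficients as fitting every pitch individually. The names `fit_upm` and `biased` are my own labels, not anything from the study.

```python
import numpy as np

# Called pitches per cell (rows: White, Hspnc, Black umpire;
# columns: White, Hspnc, Black pitcher), from the study's Table 2.
PITCHES = np.array([[741729, 236937, 25108],
                    [24592, 7323, 845],
                    [46825, 13882, 1765]], dtype=float)

# The made-up "unbiased" strike percentages used in the example.
BASE = np.array([[32.0, 33.0, 34.0],
                 [35.0, 36.0, 37.0],
                 [38.0, 39.0, 40.0]])


def fit_upm(strike_pct, pitches=PITCHES):
    """UPM coefficient (percentage points) from a pitch-weighted fit of
    constant + umpire race + pitcher race + same-race indicator."""
    rows = [[1.0, u == 0, u == 1, p == 0, p == 1, u == p]
            for u in range(3) for p in range(3)]
    X = np.array(rows, dtype=float)
    y = np.asarray(strike_pct, dtype=float).ravel()
    sw = np.sqrt(pitches.ravel())  # sqrt weights turn WLS into plain least squares
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[-1]


def biased(cell, points=3.0):
    """BASE with `points` of extra strikes added to one (umpire, pitcher) cell."""
    m = BASE.copy()
    m[cell] += points
    return m


equal = fit_upm(BASE + np.eye(3))   # 1 point of bias in every same-race cell
only_ww = fit_upm(biased((0, 0)))   # all 3 points in the white/white cell
only_hh = fit_upm(biased((1, 1)))   # all 3 points in the hispanic/hispanic cell
only_bb = fit_upm(biased((2, 2)))   # all 3 points in the black/black cell

print(f"equal bias : {equal:.3f}")
print(f"W/W only   : {only_ww:.3f}")
print(f"H/H only   : {only_hh:.3f}")
print(f"B/B only   : {only_bb:.3f}")
```

Because least squares is linear in the observed percentages, the three concentrated-bias coefficients have to add up to 3, which is what you'd get if every same-race cell carried 3 points of bias. The 1.883, 0.882 and 0.23 quoted above do add up to about 3. What the model gets wrong is how it splits that total among the cells.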

My conclusion, again, is that if real-life doesn't match the model's assumption that all umpires are biased equally, then the estimated coefficients are so unreliable as to be meaningless.

-----

Now, I said the UPM coefficient is meaningless. But I am NOT arguing that the significance level is meaningless. In fact, I agree with the study's authors that the significance tests are valid.

Why? Because the significance test is checking for no bias by anyone. And if there's no bias by anyone, then real-life DOES fit the model: all races are biased equally. Equally at zero, but equally nonetheless.

That is: if there is no bias, then real life fits the model, and the UPM coefficient should come out statistically indistinguishable from zero. But if the UPM coefficient is significantly different from zero, we know we have bias. The actual equation probably doesn't match real life very well, but we still have evidence that there's *some kind of bias*. Just not necessarily the kind the regression thinks it found.

To put it more formally:

-- if there is no racial bias, then racial bias is the same across all races (at zero).
-- if racial bias is the same across all races, then the UPM coefficient is the best estimate of that bias.
-- so if there is zero racial bias, the UPM coefficient should come out near zero.
-- if we do NOT get a UPM coefficient close to zero, then there must be a non-zero bias – although then the UPM coefficient is not necessarily a good estimate of that bias.

Hope that makes sense. In summary, it seems to me that we can trust the significance level of the UPM estimate, but not the number itself.

-----

What's funny about the study is that the authors didn't really need to make the assumption that the bias was all equal. There's more than enough data to estimate the three races' biases individually. All you have to do is get rid of the UPM variable, and replace it with three "umpire matches pitcher" variables, one for each race. That eliminates the assumption, and the results become applicable in cases where the real-world groups don't have equal bias.

If you take the study data, get rid of UPM as a variable, and add the three race-specific UPMs, the regression gives these coefficients (I'm using the original Table 2 data, as usual):

W/W bias estimate: -0.416%
H/H bias estimate: +0.880%
B/B bias estimate: +0.689%


The regression is telling us that, indeed, it looks like the three races of umpires have fairly different biases. The white umpires are actually slightly biased *against* their own race. The black and hispanic umpires, however, are on the same-race side, perhaps as expected.

The other regressions didn't find statistical significance, and this one doesn't either. The first two coefficients are about 1 SD from zero, and the third one only half an SD. And even the differences between any pair aren't statistically significant.

But at least these coefficients are meaningful. We can say that when a white pitcher is on the mound, a same-race umpire means he loses an estimated 0.416 percentage points of his strikes. A hispanic pitcher facing a hispanic umpire gains 0.880 percentage points, and a black pitcher facing a black umpire gains 0.689 percentage points.
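
For anyone who wants to try the three-variable version themselves, here's a rough sketch. It assumes the same pitch-weighted least-squares setup as a stand-in for a pitch-by-pitch regression, and it works from the rounded percentages in the Table 2 matrix, so the output should land close to the coefficients quoted above without necessarily matching every digit. `fit_split_upm` is a name I made up for illustration.

```python
import numpy as np

# Called pitches per cell (rows: White, Hspnc, Black umpire;
# columns: White, Hspnc, Black pitcher).
PITCHES = np.array([[741729, 236937, 25108],
                    [24592, 7323, 845],
                    [46825, 13882, 1765]], dtype=float)

# Observed called-strike percentages (the Table 2 matrix).
TABLE2 = np.array([[32.06, 31.47, 30.61],
                   [31.91, 31.80, 30.77],
                   [31.93, 30.87, 30.76]])


def fit_split_upm(strike_pct, pitches=PITCHES):
    """Return the W/W, H/H and B/B same-race coefficients (percentage points)
    from a fit of constant + umpire race + pitcher race + three UPM dummies."""
    rows = [[1.0, u == 0, u == 1, p == 0, p == 1,
             u == p == 0, u == p == 1, u == p == 2]
            for u in range(3) for p in range(3)]
    X = np.array(rows, dtype=float)
    y = np.asarray(strike_pct, dtype=float).ravel()
    sw = np.sqrt(pitches.ravel())  # weight each cell by its number of pitches
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[5:]


# Sanity check on made-up data: 3 points of bias in the W/W cell only.
synthetic = np.array([[35.0, 33.0, 34.0],
                      [35.0, 36.0, 37.0],
                      [38.0, 39.0, 40.0]])
recovered = fit_split_upm(synthetic)
print("synthetic:", np.round(recovered, 3))

ww, hh, bb = fit_split_upm(TABLE2)
print(f"W/W {ww:+.3f}  H/H {hh:+.3f}  B/B {bb:+.3f}")
```

On the made-up matrix, where all the bias sits in the white/white cell, this version recovers it exactly, which the single-UPM model can't do.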

These are all *relative* to when a pitcher faces a different-race umpire (with the same overall strike zone). It is impossible to ever know the absolute level of "real" racial preference (if in fact there is any), because any relative difference equal to these coefficients is equally possible. It could indeed be that hispanic umpires have +0.880% bias in favor of hispanics, and 0.00% bias against non-hispanics. Or it could be that they have +0.440% bias in favor of hispanics, and –0.440% bias against non-hispanics. Or any other combination.

To be more explicit, the regression came up with this matrix of relative own-race preferences:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- -0.42 00.00 00.00
Hspnc Umpire-- 00.00 +0.88 00.00
Black Umpire-- 00.00 00.00 +0.69

But to get *absolute* preferences, we can add the same constant to every cell. Suppose we subtract 0.2 from every cell, because it seems to us that other-race umpires should have about that much bias. Then we get

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- -0.62 –0.20 –0.20
Hspnc Umpire-- -0.20 +0.68 –0.20
Black Umpire-- -0.20 –0.20 +0.49

Either of those matrices fits the data. There is no way, no matter how much data we have, to figure out what's the true level of bias: the top one, the bottom one, or any of the infinity of possibilities. That's because we don't know where the zero is – we don't know which cell, if any, calls balls and strikes perfectly. If we were to measure performance objectively, using QuesTec or something, and call that "zero bias," then we could find out where the zero mark is, and adjust accordingly. (Of course, in that case, we wouldn't need to do a regression – we could just measure every racial combination against QuesTec.)

-----

In the regression we just did, there was also a hidden assumption: that all different-race pairs had the same bias. If we didn't want to assume that all *same-race* pairs had the same bias, why would we want to assume that the *different-race* pairs had the same bias?

Unfortunately, we're stuck with a little bit of that. By adjusting for the tendencies of the individual pitchers and umpires – for instance, by assuming that white pitchers are legitimately better than hispanic pitchers, instead of considering that the result may be bias – we are simultaneously assuming that:

-- the bias in each column must sum to the same amount
-- the bias in each row must sum to the same amount

That means:

-- at most, we can estimate two of the three cells in each column, relative to the other
-- at most, we can estimate two of the three cells in each row, relative to the other

Which means that we have to cross out one row and one column. That leaves four cells, in a rectangle, that we can estimate. If we try to estimate more than that pattern of four cells, the regression won't work, because we get an infinity of possible answers. (Technical note: when we try to estimate more than four cells, we get collinearity among the independent variables.)

(Just to emphasize: this is NOT a matter of not having enough data; it's a matter of trying to estimate too many things at once. To take a simpler example: if an umpire calls 31% strikes against whites and 32% against blacks, it is simply impossible, using only statistics, to figure out if he's biased against whites or in favor of blacks; the data, no matter how many pitches you have, equally support both possibilities. Here, with three races instead of just two, it's a little more complicated, but no matter how much data you have, you can't do all of: (a) estimate an umpire effect, (b) estimate a pitcher effect, and (c) estimate nine cells. If you want (a) and (b), the best you can do for (c) is estimate four cells relative to the other five.)
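
The collinearity is easy to see by checking the rank of the design matrix. This is a small sketch under my own setup: one row per umpire/pitcher cell, a constant, two umpire dummies, two pitcher dummies, and then one indicator for each cell we try to estimate. The `design` helper is hypothetical, written just for this check.

```python
import numpy as np
from itertools import product


def design(cells):
    """Design matrix over the nine (umpire, pitcher) cells: constant,
    two umpire dummies, two pitcher dummies, one indicator per cell in `cells`."""
    rows = []
    for u, p in product(range(3), repeat=2):
        base = [1.0, u == 0, u == 1, p == 0, p == 1]
        extra = [(u, p) == c for c in cells]
        rows.append(base + extra)
    return np.array(rows, dtype=float)


# Index 0 = White, 1 = Hspnc, 2 = Black.
rectangle = [(1, 1), (1, 2), (2, 1), (2, 2)]        # cross out the white row and column
five_cells = rectangle + [(0, 0)]                   # one cell too many
all_nine = list(product(range(3), repeat=2))        # every cell at once

for name, cells in [("four cells", rectangle),
                    ("five cells", five_cells),
                    ("nine cells", all_nine)]:
    X = design(cells)
    print(f"{name}: {X.shape[1]} columns, rank {np.linalg.matrix_rank(X)}")
```

With the four-cell rectangle, the number of columns equals the rank, so every coefficient is pinned down. Add a fifth cell and there are more columns than the rank can support, no matter how many pitches sit behind each row.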

So which four cells should you estimate? That depends on what you're most interested in. Whichever four cells you pick, you have to hold the other five (one row and one column) constant, and assume that the bias is the same in all five of those cells. The previous regression, where we chose the three same-race cells and assumed the other six were equally biased, was a reasonable choice.

Since our null hypothesis is that there's no bias, another reasonable choice might be to hold constant the row and column involving whites. Why? Because they have the most pitches. By looking for bias in the other four cells, we are investigating bias where there are the fewest pitches, and where the difference would be most likely to happen by chance. That would be the most conservative test of whether the bias is just random.

(As an example: suppose you want to compare two players. Joe hit .333 in 480 AB, Manny went 0-for-3. It's more informative, intuitively, to say "Manny is only one hit worse than Joe's average" than to say "Joe is 159 hits better than Manny's average.")

If we choose those bottom-right four cells, here are the results:


Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 00.00 00.00 00.00
Hspnc Umpire-- 00.00 +0.48 +0.31
Black Umpire-- 00.00 –0.47 +0.28

In terms of pitches:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +35 +03
Black Umpire-- +00 -65 +05

If you change the "-65" to "-74", you get *exactly* what we got just by quickly playing with the numbers back in August.

And since the results do appear to be non-significant, I think this particular regression is probably pretty close to what's actually going on.

-----

Hope this all makes sense. Next will be Part III, where I'll talk about some of the other findings in the paper, including the statistically significant ones.



Monday, April 07, 2008

The Hamermesh umpire/race study revisited -- part I (addendum)

In the previous post, I ran some regressions on the Hamermesh data using a 1/100 sample. But thanks to a couple of commenters who suggested "gretl" regression software, I am now able to regress on the full dataset (a bit over a million pitches).

I thought the full data would give almost the same results as the smaller dataset -- after all, I cut down the data in exact proportion. But the rounding errors proved to be more important than I thought.
When I divided by 100 and rounded to the nearest strike, that allowed an error of up to 0.5 strikes in the new, smaller sample, which is the equivalent of up to 50 strikes in the large sample. That's a fairly large difference, considering that we're dealing in hundredths of percentage points.
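
To get a feel for how much damage the rounding can do, here's an illustrative sketch. It assumes the simplest possible procedure (divide both pitches and strikes by 100 and round each to the nearest whole number), which is not necessarily the exact procedure behind the 1/100 sample, and it rebuilds strike counts from the rounded Table 2 percentages. `downsample` is a hypothetical helper.

```python
# (label, called pitches, called-strike percentage) for a few Table 2 cells.
CELLS = [
    ("White ump / White pitcher", 741729, 32.06),
    ("Hspnc ump / Hspnc pitcher", 7323, 31.80),
    ("Hspnc ump / Black pitcher", 845, 30.77),
    ("Black ump / Black pitcher", 1765, 30.76),
]


def downsample(pitches, strike_pct, factor=100):
    """Shrink one cell by `factor`, rounding pitches and strikes to whole numbers.
    Returns (pitches, strikes, strike percentage) in the smaller sample."""
    n_small = round(pitches / factor)
    strikes_small = round(pitches * strike_pct / 100 / factor)
    return n_small, strikes_small, 100 * strikes_small / n_small


for label, pitches, pct in CELLS:
    n, k, small_pct = downsample(pitches, pct)
    print(f"{label}: {pct:.2f}% of {pitches} -> {k}/{n} = {small_pct:.2f}%")
```

The big white/white cell barely moves, but the cells with only a handful of pitches after shrinking can swing by whole percentage points.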

So I'm going to rerun the results for the larger sample, here, just to be as consistent as possible with the data in the study. If you're not interested in the details, I'll tell you the conclusions right now so you can skip the rest of this post. The full regression gives:


-- slightly different numbers;
-- slightly less evidence of same-race bias;
-- and pretty much the same overall conclusions.

For those who want to see the updated regressions, keep reading.

-----

Here's the "expected" strikes matrix for the regression that did NOT include a variable for racial bias:


Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.46 30.64
Hspnc Umpire-- 32.03 31.43 30.61
Black Umpire-- 31.83 31.23 30.41

Subtracting that from the real-life observations in the original Table 2 matrix :

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- +0.00 +0.01 –0.03
Hspnc Umpire-- -0.12 +0.37 +0.18
Black Umpire-- +0.10 –0.36 +0.35

Converting that to pitches:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- -15 +23 -08
Hspnc Umpire-- -29 +27 +01
Black Umpire-- +45 -50 +06

The results are a bit different than in the 1/100 sample. For instance, white umpires are even less biased in favor of their own race, by negative 15 pitches here vs. negative 4 pitches in the other regression. And the same-race cells add up to only +22, as compared to the +37 from before, which is also less suggestive of same-race bias.

-----

Now, here's the regression that includes the UPM variable for same-race bias:

Chance of a called strike equals:

30.4448%
---- plus .1906% if the ump is white
---- plus .1827% if the ump is hispanic
---- plus 1.377% if the pitcher is white
---- plus .8206% if the pitcher is hispanic
---- plus .0513% UPM (if the umpire matches the pitcher)


The old regression had the UPM term at .1169 – this one has it at less than half that, at .0513.

Now that we're using the full dataset, we can get a significance level for the UPM parameter. It turns out it's not significant at all, with a p-value of about .83, far above the .05 threshold for significance. In fact, the real-life data show *less* racial bias than if the data were random (which would give a p-value of 0.5).

Doing the calculation for baseball significance shows that the proportion of pitches affected by the presence of a same-race pair is somewhere between 1 pitch in 2855, and 1 pitch in 6140.

-----

It's interesting how such a small change in the observed percentages – caused just by rounding! – could bring on such a large difference in the estimate of racial bias. In part, that's because this regression is trying to reproduce the numbers in the nine cells, regardless of whether those cells contain 800 pitches or 700,000 pitches. While the number of observations certainly does affect the results of the regression, it seems that the raw numbers in the cells matter even more.

So the conclusions in the previous post still stand – no evidence of bias, no statistical significance, and no baseball significance either.


And, again, we haven't actually got to the Hamermesh study itself yet. We'll do that next.




The Hamermesh umpire/race study revisited -- part I

Back in August, a study came out finding that umpires appear to have racial preferences when calling balls and strikes. It found that when the umpire was of the same race (white, black, asian or hispanic) as the pitcher, he was more likely to call a strike.

I commented on the study at the time, in a series of posts (first, second, third, fourth).

Since then, a few things have happened. First, a new version of the study came out. Second, I've been thinking about things a bit more. And, third, I'll be presenting some comments to the study at the SABR convention in Cleveland, and this is a good chance to get some feedback in advance. So here we go.

The study is called "Strike Three: Umpires' Demand For Discrimination," by Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh. Time Magazine interviewed Hamermesh about the paper, and said that he led the research, so I'm going to call it the "Hamermesh study." (The above link is to the new version, dated December 2007. The old version, dated August, 2007, is here.)

Hamermesh et al looked at all major-league called (not swung on) pitches from the 2004-2006 seasons. They found what percentage of those were called strikes rather than balls. They then collated the results according to the race of the umpire and the race of the pitcher. Here is that data. I'm going to call it the "Table 2 matrix," since it's taken from Table 2 of the paper.

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.80 30.77
Black Umpire-- 31.93 30.87 30.76

(I've left out Asian pitchers, as did the study itself. There are no Asian umpires.)

There are a few important things to note about the data.

First, all umpires call more strikes for white pitchers than hispanic pitchers, and also for hispanic pitchers over black pitchers. It seems safe to assume, then, that the white pitchers as a group are better than the hispanics, who are in turn better than the blacks.

Second, it's not obvious from above, but white umpires called more strikes than hispanic umpires, and hispanic umpires called slightly more strikes than black umpires. (It's just a coincidence that the umpires happen to land in the same race ranking as the pitchers.)

Third, just from glancing at the table, there does seem to be a slight tendency for umpires to call more strikes for pitchers of their own race. (When the umpire matches the pitcher in race, the study calls this a "UPM," for "umpire-pitcher match".) White pitchers get the most strike calls from white umps; hispanic pitchers get the most strikes called from hispanic umpires, and black pitchers get *almost* the most strikes called from black umpires.

Fourth, there are a LOT more white pitchers and umpires than any other race. 87 percent of umpires are white, and 71 percent of pitchers. In fact, there are only 3 hispanic and 5 black umpires out of 93 total. So (for instance) a black umpire calling a black pitcher is rare; there were only 1,765 such called pitches in the three years of the study. On the other extreme, there were 741,729 white-on-white pitches. Full data (number of called pitches):

Pitcher ------  White  Hspnc  Black
------------------------------------
White Umpire-- 741729 236937  25108
Hspnc Umpire--  24592   7323    845
Black Umpire--  46825  13882   1765


There are 877 times as many pitches in the white/white cell as in the black pitcher/Hispanic umpire cell.

----

Okay, so: how can we tell if there's discrimination going on here? The study's authors used a regression, which I'll get to. But, for now, let's try to figure this intuitively (in a rehash of what I did back in my August posts). What would the table look like if there were no racial bias?

Perhaps all the umpires would see all pitches the same. That would look something like this (call this matrix 1):

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.05 31.45 30.62
Hspnc Umpire-- 32.05 31.45 30.62
Black Umpire-- 32.05 31.45 30.62

Obviously, there are no racial preferences showing up here. But this is an extreme case – there's no need to make the assumption that all the umpires are so perfectly identical. After all, every ump has his own strike zone, and, even if there's no racial component to strike-zone judgement, the races would be different just by random chance.

So an unbiased set of umpires might instead look like this (call this matrix 2):

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.32 30.46
Black Umpire-- 31.93 31.34 30.48

Now, every umpire calls 0.59% more strikes for white pitchers than for hispanic, and 0.86% more for hispanic than for black. White umpires have a bigger strike zone than minority umps, but they're consistent with the other umpires in how they treat the races.
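
Matrix 2 can be built mechanically from the Table 2 matrix. The sketch below assumes the construction is "keep the white-umpire row and the white-pitcher column as observed, and fill in every other cell as row value plus column value minus the white/white cell," which reproduces the numbers shown above.

```python
import numpy as np

# Observed called-strike percentages (rows: White, Hspnc, Black umpire;
# columns: White, Hspnc, Black pitcher).
TABLE2 = np.array([[32.06, 31.47, 30.61],
                   [31.91, 31.80, 30.77],
                   [31.93, 30.87, 30.76]])

first_row = TABLE2[0][None, :]      # how white umpires treat each pitcher race
first_col = TABLE2[:, 0][:, None]   # how each umpire race treats white pitchers

# Every cell = umpire's white-pitcher rate + pitcher's white-umpire rate - W/W rate.
matrix2 = np.round(first_row + first_col - TABLE2[0, 0], 2)
print(matrix2)

# Gaps between adjacent pitcher columns are identical for every umpire race.
gaps = np.round(matrix2[:, :-1] - matrix2[:, 1:], 2)
print(gaps)
```

Each row of `gaps` comes out as 0.59 and 0.86, which is just another way of saying that all three groups of umpires treat the pitcher races identically in this matrix.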

So how does the Table 2 matrix vary from this "ideal" one? By this much:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 00.00 00.00 00.00
Hspnc Umpire-- 00.00 +0.48 +0.31
Black Umpire-- 00.00 -0.47 +0.28

Here, too, it looks like there might be a bit of bias: the two same-race cells (UPMs [umpire/pitcher racial matches]) both come out positive (actual higher than expected), and the only negative is a non-UPM. (The hispanic umpire/black pitcher cell, also a non-UPM, is positive too.)

But the differences aren't that big. Let me convert them to actual pitches. Start with the black/black cell. In that cell, there were 1,765 pitches. 0.28 percent of that is 5 strike calls too many, or +5.

Repeating the calculation for the other cells gives:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- +00 +00 +00
Hspnc Umpire-- +00 +35 +03
Black Umpire-- +00 -65 +05

So, over three seasons, and over a million pitches, it turns out that changing only 108 total pitches would lead to a completely unbiased result. Intuitively, it does seem that there's little evidence of serious racism here.

Looked at another way: the same-race umpires called 40 more strikes than expected. The mixed-race umpires called 62 fewer strikes than expected. The difference: 102 pitches.
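
The conversion from percentage points to pitches, and the totals quoted above, can be checked in a few lines. This is only the arithmetic from the preceding paragraphs: each deviation times the number of called pitches in its cell, rounded to the nearest pitch.

```python
import numpy as np

# Deviations of Table 2 from Matrix 2, in percentage points.
DEVIATION = np.array([[0.00, 0.00, 0.00],
                      [0.00, 0.48, 0.31],
                      [0.00, -0.47, 0.28]])

# Called pitches per cell (rows: White, Hspnc, Black umpire;
# columns: White, Hspnc, Black pitcher).
PITCHES = np.array([[741729, 236937, 25108],
                    [24592, 7323, 845],
                    [46825, 13882, 1765]])

extra_strikes = np.rint(DEVIATION / 100 * PITCHES).astype(int)
same_race = int(np.trace(extra_strikes))              # diagonal cells
mixed_race = int(extra_strikes.sum()) - same_race     # everything else
pitches_to_change = int(np.abs(extra_strikes).sum())

print(extra_strikes)
print("same-race extra strikes:", same_race)
print("mixed-race extra strikes:", mixed_race)
print("pitches to change:", pitches_to_change)
print("same-race minus mixed-race:", same_race - mixed_race)
```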

Now, there is no real reason that we had to choose Matrix 2 as our example of what unbiased umpires would look like; we could have chosen Matrix 1 instead. Or, we could have chosen any other matrix where none of the umpires show racial bias. Basically, if you start with unbiased Matrix 1, pick any number, and add it to each of the entries in any row, you get another unbiased matrix. Or if you add the same number to each of the entries in any *column*, you get an unbiased matrix.
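
One way to test whether a matrix is "unbiased" in this sense is to check that every cell is just a row effect plus a column effect. The sketch below does that by double-centering; the helper names `interaction` and `is_unbiased`, and the particular shifts applied, are mine and chosen only for illustration.

```python
import numpy as np


def interaction(m):
    """What is left of each cell after removing row and column averages.
    All zeros means the matrix is purely row effect + column effect."""
    m = np.asarray(m, dtype=float)
    return (m - m.mean(axis=1, keepdims=True)
              - m.mean(axis=0, keepdims=True) + m.mean())


def is_unbiased(m, tol=1e-9):
    return bool(np.all(np.abs(interaction(m)) < tol))


matrix1 = np.tile([32.05, 31.45, 30.62], (3, 1))   # every umpire identical

shifted = matrix1.copy()
shifted[1] -= 0.14       # give hispanic umpires a smaller strike zone
shifted[:, 2] += 0.30    # make black pitchers throw more strikes

table2 = np.array([[32.06, 31.47, 30.61],
                   [31.91, 31.80, 30.77],
                   [31.93, 30.87, 30.76]])

print(is_unbiased(matrix1))
print(is_unbiased(shifted))
print(is_unbiased(table2))
```

Shifting whole rows or whole columns never creates an interaction, so `shifted` still passes. The real Table 2 matrix doesn't pass exactly, and the question in the rest of the post is whether it misses by enough to matter.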

My argument is that if ANY of those matrices are sufficiently similar to real life, you have to argue that the data doesn't show any evidence of bias. And that's what I'd argue for Matrix 2. The real life data in Table 2 was only 108 pitches away from matching the unbiased one, and none of the individual discrepancies was statistically significant. And if we were to test the difference between the same-race and different-race numbers – which came out to 102 pitches – we'd also find that it's not statistically significant.

----

OK, that was my quick, intuitive analysis, using a specific "unbiased" matrix, selected out of an infinity of possibilities. But we can go more formal, and use linear regression to investigate the issue.

Here's what I did. Using the pitch data from the authors' Table 2 (but divided by 100 because my 20-year-old DOS-based software can't handle a million rows), I regressed each pitch using indicator variables for umpire and pitcher race. What that translates to, in plain English, is basically that I asked for a matrix where the umpires are unbiased.

(Technical note: after dividing by 100, I added 2 called balls to the Hispanic Ump/Black Pitcher cell, and one called ball to the Black/Black cell. This makes the percentages closer to the original ones, fixing rounding errors caused by dividing by 100.)

But if there is an infinity of such matrices, which one will the regression choose? Well, the nature of the regression is that it will insist that:

(a) the total number (or average proportion) of strikes has to be the same in the unbiased matrix as in the original – that is, no adding or subtracting strikes, just redistributing;

(b) no adding or subtracting strikes in any individual row or column of the matrix, either; and

(c) subject to the two restrictions above, choose the unbiased matrix that is the closest to the original, based on sums of squares.

Here's what the regression came up with:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.48 30.74
Hspnc Umpire-- 31.78 31.20 30.46
Black Umpire-- 31.80 31.22 30.48
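
Properties (a) and (b) can be verified numerically. This is a sketch under two assumptions: that the regression is ordinary least squares on called-strike indicators (so a pitch-weighted fit of the cell percentages is equivalent), and that we use the full pitch counts rather than the 1/100 sample, which means the fitted matrix won't match the one above digit for digit. `fit_additive` is a hypothetical helper.

```python
import numpy as np

PITCHES = np.array([[741729, 236937, 25108],
                    [24592, 7323, 845],
                    [46825, 13882, 1765]], dtype=float)

TABLE2 = np.array([[32.06, 31.47, 30.61],
                   [31.91, 31.80, 30.77],
                   [31.93, 30.87, 30.76]])


def fit_additive(strike_pct, pitches):
    """Pitch-weighted fit of constant + umpire race + pitcher race.
    Returns the fitted ("unbiased") 3x3 matrix of strike percentages."""
    rows = [[1.0, u == 0, u == 1, p == 0, p == 1]
            for u in range(3) for p in range(3)]
    X = np.array(rows, dtype=float)
    y = strike_pct.ravel()
    sw = np.sqrt(pitches.ravel())  # weight each cell by its number of pitches
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return (X @ coef).reshape(3, 3)


fitted = fit_additive(TABLE2, PITCHES)
extra_strikes = (TABLE2 - fitted) / 100 * PITCHES   # actual minus expected, in pitches

print(np.round(fitted, 2))
print("row totals   :", np.round(extra_strikes.sum(axis=1), 6))
print("column totals:", np.round(extra_strikes.sum(axis=0), 6))
```

The row and column totals of actual-minus-expected strikes come out to zero (up to floating-point noise): the regression only moves strikes around within each row and column, never adds or removes any.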

Subtracting that from the original Table 2 matrix:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- +0.00 +0.01 –0.13
Hspnc Umpire-- -0.13 +0.60 –0.31
Black Umpire-- +0.13 –0.35 +0.28


Converting that to pitches:

Pitcher ------ Wht Hsp Blk
--------------------------
White Umpire-- -04 +17 -16
Hspnc Umpire-- -18 +22 -04
Black Umpire-- +20 -40 +19

Overall, the same-race pairs resulted in 37 more strikes than expected, and the different-race pairs 37 fewer strikes. That's a difference of 74 strikes, which doesn't seem like a lot. I can't actually give you a p-value, because I had to use only 1/100 of the full sample (if anyone knows of any good, free regression software for Windows, let me know), but it's definitely below 95% significance.

Another thing to notice is that, even though there does appear to be a 37 strike bias on the diagonal, all of it appears to be in the H/H and B/B cells. When a white umpire calls a white pitcher, he actually calls FEWER strikes than expected! This is kind of the opposite of what you might expect if the effect was really caused by racial preferences; racism, historically, has been inflicted by whites on minorities, but, here, white umpires actually appear to be showing no bias at all! (I'll return to this point in a future post.)

Again, this exercise is something I did for myself. It's not quite what the Hamermesh study did, yet.

----

Before we do get to the study itself, let me do one more regression, but add an indicator variable for "same race" – or, as the study puts it, UPM. That is, instead of just fitting the umpire and pitcher tendencies, let's also add a variable for whether the umpire matches the pitcher, and see how much that improves the fit of the model. In English, what this regression is saying is this:

1. Start with a matrix where all the cells are exactly the same.
2. Adjust all the rows by various amounts (one amount per row) to reflect that different (races of) umpires have different (collective) strike zones.
3. Adjust all the columns by various amounts (one amount per column) to reflect that different (races of) pitchers have different abilities to throw strikes.
4. Adjust the same-race diagonal, each cell by the same amount, to reflect any racial bias among umpires for their same race (UPM bias).
5. In doing all these adjustments, don't add or subtract any total strikes – just readjust the strikes that are already there.
6. Choose all the adjustments to come as close as possible to the original matrix, as measured by sum of squares.

The adjusted matrix isn't very important, but here it is anyway:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.07 31.47 30.73
Hspnc Umpire-- 31.73 31.37 30.51
Black Umpire-- 31.77 31.29 30.66

It isn't important because what we really care about is the value of the UPM variable. Is it zero, which means no race bias? Is it positive, which means (favorable) same-race bias? Is it negative, which means unfavorable same-race bias? And how big is it?

It's positive: +0.1169 percentage points. The full regression equation is:

Chance of a called strike equals:

30.5466%
--- plus .1799% if the ump is white
-- minus .0409% if the ump is hispanic
--- plus 1.221% if the pitcher is white
--- plus .7469% if the pitcher is hispanic
--- plus .1169% UPM [if the umpire matches the pitcher]


It turns out that, even with over a million pitches, the UPM coefficient of 0.1169 is not statistically significantly different from zero. We can't conclude any racial bias.

But how baseball-ly significant is it? How important is a UPM coefficient of .1169? And what does it mean?

At first glance, it might seem that 0.1169% of pitches are affected. But that's not what it means.

The 0.1169 means that, after adjusting for the overall tendencies of the umpire and pitcher, a same-race pairing will produce 0.1169 percentage points more strikes, as compared to a different-race pairing. Put into the language of the 3x3 matrix, the regression tells us that the three cells on the diagonal average 0.1169 percentage points higher than the average of the other six cells. (If you want to verify, just do the calculation on the "expected" matrix a few paragraphs up.)

Also, the 0.1169 is a *relative* value. It's the *difference* between the same-race case and the different-race case. We don't know if the 0.1169 comes from "too many" strikes called in the same-race cases, "too few" strikes called in the different-race cases, or (most likely), a combination of both.

Suppose the entire effect is extra strikes by the same-race pairs. There are 750,817 pitches on the same-race diagonal. Multiplying that by 0.001169 gives 878 pitches.

On the other hand, suppose the entire effect is caused by too few strikes (too many balls) called in the different-race pairs. There are 348,189 pitches in those six mixed-race cells. Multiplying that by 0.001169 gives 407 pitches.

If the effect is mixed between extra strikes and extra balls, the number of affected pitches will be somewhere between 407 and 878.

Since there were 1,099,006 total called pitches, the effect is somewhere between one pitch in 1,252, and 1 pitch in 2700.

At 70 called pitches per game, that's somewhere between one race-related pitch every 18 games, and one every 39 games. Not very baseball significant at all.
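
Here is the same back-of-the-envelope calculation in code, using only the numbers already given above (the 0.1169 coefficient, the cell counts, and the 70-called-pitches-per-game figure).

```python
UPM = 0.1169 / 100                      # coefficient as a proportion
SAME_RACE = 741729 + 7323 + 1765        # pitches on the same-race diagonal
TOTAL = 1099006                         # all called pitches
MIXED_RACE = TOTAL - SAME_RACE          # pitches in the six mixed-race cells
CALLED_PER_GAME = 70

most_affected = SAME_RACE * UPM         # if it's all extra strikes for same-race pairs
fewest_affected = MIXED_RACE * UPM      # if it's all extra balls for mixed-race pairs

print(f"affected pitches: {fewest_affected:.0f} to {most_affected:.0f}")
print(f"one pitch in {TOTAL / most_affected:.0f} to one in {TOTAL / fewest_affected:.0f}")
print(f"one every {TOTAL / most_affected / CALLED_PER_GAME:.0f} to "
      f"{TOTAL / fewest_affected / CALLED_PER_GAME:.0f} games")
```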

----

So far, and based on only the raw data in the paper's Table 2, we've determined that:

-- the data looks pretty close to non-race-related;

-- if anything, any race bias appears to be in the hispanic/hispanic case and the black/black case, with the white/white case almost perfectly unbiased;

-- adding the UPM variable gives a coefficient that is slightly positive for same-race bias, but

-- the coefficient is not statistically significant (I haven't shown that yet, but trust me for now), and

-- the coefficient is not very significant in the baseball sense, either.

So far, this is all preliminary – I'm trying to lay some groundwork before we look at the actual Hamermesh regressions. Nothing I've written here, so far, is a direct comment on anything in the actual paper. That will come in the next post.


