Friday, October 30, 2009

Don't use regression to calculate Linear Weights, Part II

Last post, I wrote about how using regression to estimate Linear Weights values is a poor choice, and that the play-by-play (PBP) method is better. After thinking a bit more about it, I realized that I could have made a stronger case than I did. Specifically, that the regression method gives you an *estimate* of the value of events, but the play-by-play method gives you the *actual answer*. That is: a "perfect" regression, with an unlimited amount of data, will never be able to be more accurate than the results of the PBP method.

An analogy: suppose that, a while ago, someone randomly dumped some red balls and some white balls into an urn. If you were to draw one ball out of the urn, what would be the probability it's red?

Here are two different ways of trying to figure that out. First, you can observe people coming and drawing balls, and you can see what proportion of balls drawn turned out to be red. Maybe one guy draws ten balls (with replacement) and six are red. Maybe someone else comes up and draws one ball, and it's white. A third person comes along and draws five white out of 11. And so on. Over all, maybe there are 68 balls drawn total, and 40 of them are red.

So what do you do? You figure that the most likely estimate is 40/68, or 58.8%. You then use the normal approximation to the binomial distribution to figure out a confidence interval for your estimate.
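As an aside, here's what that first method looks like as a few lines of Python -- just a sketch of the point estimate and its normal-approximation confidence interval (the 1.96 is the usual 95% multiplier):

```python
from math import sqrt

# The "observe the draws" method: estimate the proportion of red balls
# from 40 reds in 68 observed draws, with a 95% confidence interval
# from the normal approximation to the binomial.
red, total = 40, 68
p_hat = red / total                      # point estimate
se = sqrt(p_hat * (1 - p_hat) / total)   # standard error of a proportion
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimate: {p_hat:.3f}, 95% CI: ({low:.3f}, {high:.3f})")
# estimate: 0.588, 95% CI: (0.471, 0.705)
```

Notice how wide that interval is after 68 draws -- which is exactly why counting the balls directly is so much better.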

That's the first way. What's the second way?

The second way is: you just empty the urn and count the balls! Maybe it turns out that the urn contains exactly 60 red balls and 40 white balls. So we now *know* the probability of drawing a red ball is 0.6.

If the second method is available, the first is completely unnecessary. It gives you no additional information about the question once you know what's in the urn. The second method has given you an exact answer.

That, I think, is the situation with Linear Weights. The regression is like observing people draw balls; you can then make inferences about the actual values of the events. But the PBP method is like looking in the urn -- you get the answer to the question you're asking. It's a deterministic calculation of what the regression values will converge to, if you eventually get the regression perfect.


To make my case, let me start by (again) telling you what question the PBP method answers. It's this:

-- If you were to take a team made up of league-average players, and add one double to its batting line, how many more runs would it score?

That question is pretty much identical to the reverse:

-- If you were to take a team made up of league average players, and remove one double from its batting line, how many fewer runs would it score?

I'm going to show how to answer the second question, because the explanation is a bit easier. I'll do it for the 1992 American League.

Start with a listing of the play-by-play for every game (available from Retrosheet, of course). Now, let's randomly choose which double we're going to eliminate. There were 3,596 doubles hit that year; pick one at random, and find the inning in which it was hit.

Now, write out the inning. Write out the inning again, without the double. Then, see what the difference is in runs scored. (The process is almost exactly the same as how you figure earned runs: you replay the inning pretending the error never happened, and see if it saves you some runs.)

Suppose we randomly came up with Gene Larkin's double in the bottom of the fourth inning against the Angels on June 24, 1992. The inning went like this:

Actual: Walk / Double Play / DOUBLE / Fly Out -- 0 runs scored.

Without the double, our hypothetical inning would have been

Hypoth: Walk / Double Play / ------ / Fly Out -- 0 runs scored.

Pretty simple: in this case, taking away the double makes no difference, and costs the team 0 runs.

On the other hand, suppose we chose Brady Anderson's leadoff first-inning double on April 10. With and without the double:

Actual: DOUBLE / Double / Single / Double Play / HR / Ground Out -- 3 runs scored.

Hypoth: ------ / Double / Single / Double Play / HR / Ground Out -- 2 runs scored.

So, in this case, removing the double cost 1 run.

If we were to do this for each of the 3,596 doubles, we could just average out all the values, and we'd know how much a double was worth. The only problem is that sometimes it's hard to recreate the inning. For instance, Don Mattingly's double in the sixth inning on September 8:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Removing the double gives

Hypoth: Out / Single / ------ / Single / Fly Ball / Fly Ball

How many runs score in this reconstructed inning? We don't know. If the second single advanced the runner to third, and the subsequent fly ball was deep enough, one run would have scored. Otherwise, it would be 0 runs. So we don't know which it would have been. What do we do?

The problem arose because we picked an inning randomly and divided it into halves (the half before the double, and the half after the double): that process creates an inconsistency in the hypothetical inning. The second half of the inning, in real life, started with one out and runners on second and third. The hypothetical second half started with one out and a runner on first. That's what created the problem.

So, since we're picking randomly anyway, why don't we throw away the *real* second half of the inning, and instead pick the second half of some *other* inning, some inning where there actually IS one out and a runner on first? That will always give us a consistent inning. And while it will give us a different result for this run of our random test, over many random tests, it'll all even out.

We might randomly choose Cleveland's fourth inning against the Royals on July 16. In that frame, Mark Whiten struck out and Glenallen Hill singled, which gives us our required runner-on-first-with-one-out. After that, Jim Thome singled, and Sandy Alomar Jr. grounded into a double play.

Grafting the end of that inning (single, double play) on to the beginning of the original inning gives us our "consistent" hypothetical inning:

Hypoth: (stuff to put a runner on first and one out) / ------ / Single / GIDP -- 0 runs scored.

Since the Yankees scored two runs in the original, unadulterated inning, and zero runs score in the hypothetical inning, this run of the simulation winds up with a loss of two runs.

Now, there's nothing special about that Cleveland fourth inning: we just happened to choose it randomly. There were 6,380 cases of a runner on first with one out, and we could have chosen any of them instead.

The inning could have gone:

Out / Single / ------ / result of inning 1 of 6,380
Out / Single / ------ / result of inning 2 of 6,380
Out / Single / ------ / result of inning 3 of 6,380
Out / Single / ------ / result of inning 4 of 6,380
...
Out / Single / ------ / result of inning 6,380 of 6,380

If we run the simulation long enough, we'll choose every one of the 6,380 equally. And so, we'll wind up with just the average of those 6,380 innings. So we can get rid of the randomness in the second half of the inning just by substituting the average of the 6,380. Then our "remove the double" hypothetical becomes:

Out / Single / ------ / average of all 6,380 innings

And, with the help of Retrosheet, we find that after having a runner on first and one out, those 6,380 innings resulted in an average of 0.510 runs being scored. So now we have:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Hypoth: Out / Single / ------ / plus an additional 0.510 runs.

I'll rewrite that to make it a bit easier to see what's going on:

Actual: (stuff that put a runner on first with one out) / DOUBLE / other stuff that caused 2 runs to be scored

Hypoth: (stuff that put a runner on first with one out) / other stuff that caused 0.510 runs to be scored, on average

So, for this inning, we can say that removing the double cost 1.490 runs.

Now, the "actual" inning was again random. We happened to choose the Yankees' 6th inning on September 8. But we might have chosen another, similar, inning where there was a runner on first with one out, and a double was hit, and the runner held at third. This particular September 8 inning led to two runs. Another such inning may have led to six runs, or three runs, or no runs (maybe there were two strikeouts after the double and the runners were stranded).

So what we can do is aggregate all these types of innings. If we look to Retrosheet, we find that there were 796 innings where there were runners on second and third with one out. In the remainder of those 796 innings, 1,129 runs were scored. That's an average of 1.418 per inning.

So we can write:

Actual: stuff that put a runner on 1st with one out / DOUBLE leading to runners on 2nd and 3rd with one out / Other stuff leading to 1.418 runs scoring, on average.

Hypoth: stuff that put a runner on 1st with one out / ------ / Other stuff leading to 0.510 runs scoring, on average.

And so, we know that a double with a runner on first and one out, which ends with runners on 2nd and 3rd, is worth, on average, 0.908 runs.

Let's write this down this way, as an equation:

+0.908 runs = Runner on 1st and one out + Double, runner holds

We can repeat this analysis. What if the runner scores? Then, it turns out, the average inning led to 1.646 runs scoring instead of 0.510. So:

+1.136 runs = Runner on 1st and one out + Double, run scores

We can repeat this for every combination of bases and double results we want. For instance:

-0.217 runs = Runner on 1st and one out + Double, runner thrown out at home

+1.000 runs = Runner on 2nd and nobody out + Double

+1.166 runs = Runner on 1st and two out + Double, runner safe at home and batter goes to third on the throw

I'm not sure how many of these cases there are, but we can look to Retrosheet and list them all. At the end, we have a huge list of all possible combinations of doubles, and what they were worth in runs. We just have to average them, weighted by how often they happened, and we're done. We then have the answer.

As it turns out, the answer for the 1992 American League works out to 0.763 runs.

The answer is NOT an estimate based on a model with random errors that we have to eliminate. It's the exact answer to the question, the same way counting the balls in the urn gave us an exact answer.

Just to be absolutely clear, here's what we've shown:

Suppose we randomly remove one double from the 1992 American League. Then, we reconstruct the inning from the point of that double forward, by looking at the base/out situation before the double, finding a random inning with that same base/out situation, and substituting that new inning instead of what really happened.

If we do that, we should expect 0.763 runs fewer will be scored. If we were to run this same random test a trillion times, the runs lost will average out to .763 almost exactly.

If you try to answer this question by running a regression, to the extent that your estimate is different from 0.763, you got the wrong answer.


Anyway, the explanation above was a complicated way of describing the process. Here's a simpler description of the algorithm.

1. Using Retrosheet data, find every situation where there was a runner on second and no outs. It turns out there were 1,572 such situations in the 1992 AL. Count the total number of runs that were scored in the remainder of those innings. It turns out there were, on average, 1.095 runs scored each time that happened (1,722 runs scored in those 1,572 innings).

2. Repeat this process for the other 23 base-out states (two-outs-bases-loaded, one-out-runners-on-first-and-third, and so on). If you do that, and put the results in the traditional matrix, you get:

              0 out   1 out   2 out
nobody on     0.482   0.248   0.096
first         0.853   0.510   0.211
second        1.095   0.646   0.293
first/second  1.494   0.907   0.423
third         1.356   0.940   0.377
first/third   1.804   1.151   0.470
second/third  2.169   1.418   0.598
loaded        2.429   1.549   0.745

3. Find every double hit in the 1992 AL. For each of those 3,596 doubles, figure (a) the run value from the above table *before* the double was hit; (b) the run value for the situation *after* the double; and (c) the number of runs that scored on the play.

The value of that double is (b) - (a) + (c). For instance, a 3-run double with the bases loaded and 2 outs is worth 0.293 minus 0.745 plus 3. That works out to 2.548 runs.

4. Average out each of the 3,596 run values. You'll get 0.763.
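Steps 3 and 4 can be sketched in a few lines of Python. This is just an illustration, not anyone's production code: the run-expectancy values come from the 1992 AL matrix above, the `doubles` list is a hypothetical stand-in for the full Retrosheet file of 3,596 events, and the sketch ignores the special case where the play makes the third out (run expectancy afterward is then zero).

```python
# Run expectancy for the 1992 AL, keyed by (runners, outs), copied from
# the matrix above.
RE = {
    ("none", 0): 0.482, ("none", 1): 0.248, ("none", 2): 0.096,
    ("first", 0): 0.853, ("first", 1): 0.510, ("first", 2): 0.211,
    ("second", 0): 1.095, ("second", 1): 0.646, ("second", 2): 0.293,
    ("first/second", 0): 1.494, ("first/second", 1): 0.907, ("first/second", 2): 0.423,
    ("third", 0): 1.356, ("third", 1): 0.940, ("third", 2): 0.377,
    ("first/third", 0): 1.804, ("first/third", 1): 1.151, ("first/third", 2): 0.470,
    ("second/third", 0): 2.169, ("second/third", 1): 1.418, ("second/third", 2): 0.598,
    ("loaded", 0): 2.429, ("loaded", 1): 1.549, ("loaded", 2): 0.745,
}

def run_value(before, after, runs_on_play):
    """Step 3: value of one event = RE(after) - RE(before) + runs scored."""
    return RE[after] - RE[before] + runs_on_play

# The 3-run double with the bases loaded and two outs from the example:
print(round(run_value(("loaded", 2), ("second", 2), 3), 3))  # 2.548

# Step 4, in miniature: average the run values over every double.
# (With all 3,596 real doubles, this average comes out to 0.763.)
doubles = [  # two hypothetical sample events: (before, after, runs on play)
    (("loaded", 2), ("second", 2), 3),
    (("first", 1), ("second/third", 1), 0),
]
print(round(sum(run_value(*d) for d in doubles) / len(doubles), 3))  # 1.728
```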

It's that simple. You can repeat the above for whatever event you like: triples, stolen bases, strikeouts, whatever. Here are the values I got:

-0.272 strikeout
-0.264 other batting out
+0.178 steal
+0.139 defensive indifference
-0.421 caught stealing
+0.276 wild pitch
+0.286 passed ball
+0.277 balk
+0.307 walk (except intentional)
+0.174 intentional walk
+0.331 HBP
+0.378 interference
+0.491 reached on error
+0.460 single
+0.763 double
+1.018 triple
+1.417 home run


Anyway, my point in the original post wasn't meant to be "regression is bad." What I really meant was, why randomly pull balls from urns when Retrosheet gives you enough data to actually count the balls? This method gives you an exact number.

One objection might be that, to do it this way, there's way too much data to use, and so regression is a more practical alternative. But is it really better to use a wrong value just because it's easier to calculate?

Besides, you don't have to calculate them yourself -- they've been calculated, repeatedly, by others who can be referenced. In the worst case, they're close to the "traditional" weights, as calculated by Pete Palmer some 25 years ago. If you need a solid reference, just use Pete's numbers. They're closer than you'll get by running a regression, even a particularly comprehensive one.

Labels: ,

Tuesday, October 27, 2009

Don't use regression to calculate Linear Weights

About a year ago, in a post titled "Regression, Schmegression," Tom Tango argued that regression is not usually one of the better techniques to use in sabermetric research. He's right, especially for the example he used, which is using regression to find the correct Linear Weight values for the basic offensive events.

What a lot of researchers have done, and are still doing, is listing batting lines for various team-years -- singles, doubles, triples, etc. -- and running a regression to predict runs scored. It's not that bad a technique, but there are other, better ones you can use. Still, by looking a bit closer at the regression results, you can get a good idea of why regression results don't always mean what you think they mean.

Let's start with the triple. How much is a triple worth? That is: how many more runs would an average team score if you gave them exactly one extra triple?

We can run a simple regression, runs scored vs. triples hit. I used a dataset consisting of all full team-seasons from 1961 to 2008 (only for teams that played at least 159 games, to omit strike seasons). That was 1,121 teams. The result of the regression:

Runs = 731 - (0.44 * triples)

That's not a misprint: the regression tells us that every triple actually *costs* its team almost half a run!

It's not a sample size issue, either. The standard error of the -0.44 estimate is 0.27. The estimate was actually significantly different from zero (in the wrong direction!) at the 10% level.

Is it possible that a triple actually lowers your runs scored? Of course not. Our baseball knowledge tells us that's logically impossible. A triple maximizes the value of runners on base (by scoring them all), and then puts a runner on third, where he's also likely to score. It's all positive. There must be something else happening here.

It's pretty obvious, but to understand that we can't take the results at face value, we needed subject matter expertise -- we needed to know something about baseball. In this case, we didn't need to know much, just that triples have to be a good thing. But that's subject matter knowledge nonetheless.

No matter how expert you are in the technique of regression, you have to know something about the subject you're researching to be able to reach the correct conclusions from the evidence. Because, as the saying goes, correlation doesn't imply causation. But it doesn't imply *non-causation* either. It could be that triples cause fewer runs, or it could be that there's some third factor that's positively correlated with triples, but negatively correlated with runs scored. Knowing something about baseball lets us argue for which conclusion makes more sense.

Normally, when you interpret a regression result like this, you say something like: "all else being equal, one extra triple will reduce the number of runs scored by about 0.44." But that's not quite right. The "all else" doesn't refer to everything in the universe -- it only refers to everything else *you controlled for in that regression*. Which, in this case, was nothing -- we only regressed on triples.

A more accurate way to put the regression result is:

"One extra triple is associated with a reduction of the number of runs scored by about 0.44. That's either because of the triple itself, or because of something else about teams who hit more triples, something that wasn't controlled for in the regression."

Now, a possible explanation becomes apparent. Teams that hit lots of triples are usually faster teams. Fast teams tend to have fewer fat strong guys who hit for power. Therefore, maybe hitting lots of triples suggests that your team doesn't have much power, which is why triples are negatively correlated with runs scored.

Again, the regression didn't suggest that: it was our knowledge of baseball.

We can test that hypothesis, and there are a couple of ways to test it. First, we can test for a correlation between triples and other hits. And, yes, the correlation between triples and home runs is -0.31: teams who hit a lot of triples do indeed hit fewer home runs than average.
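The mechanism is easy to reproduce with synthetic data. The sketch below is NOT the real 1961-2008 team data, and its coefficient sizes are arbitrary: it just builds in a hidden "speed" factor that raises triples while lowering home runs, and makes runs depend only on home runs. Regressing runs on triples alone then produces a negative slope, even though triples have zero direct effect on runs in this toy world:

```python
import random

# Synthetic teams: speed raises triples, lowers homers; runs are driven
# only by homers (plus noise). Triples have no direct effect on runs.
random.seed(1)

teams = []
for _ in range(1121):  # same number of rows as the post's dataset
    speed = random.gauss(0, 1)
    triples = 35 + 10 * speed + random.gauss(0, 5)
    homers = 150 - 25 * speed + random.gauss(0, 20)
    runs = 450 + 1.4 * homers + random.gauss(0, 25)
    teams.append((triples, runs))

def ols_slope(points):
    """Slope of the simple least-squares regression of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Regressing runs on triples alone picks up the hidden speed/power link
# and comes out strongly negative, despite no causal effect at all.
print(ols_slope(teams))
```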

Or, we can just include home runs in the regression. If we do that, we get the equation

Runs = 373 + (1.84 * triples) + (1.93 * home runs)

Which means:

"Home runs being equal, one extra triple will increase the number of runs scored by about 1.84. That's either because of the triple itself, or because of something else about teams who hit more triples (something other than home runs, which was controlled for)."

Of course, there's still something else other than home runs: our baseball knowledge tells us that teams that hit lots of triples are likely to be different in doubles power, too. And, in fact, in almost every other category: singles, outs, walks, steals, and caught stealings. So if we do a regression on all that stuff, we get:

Runs = 42
+ (0.52 * singles)
+ (0.67 * doubles)
+ (1.18 * triples)
+ (1.48 * home runs)
+ (0.33 * walks)
+ (0.18 * steals)
- (0.21 * caught stealing)
- (0.11 * batting outs (which is AB-H)).

Now we're getting values that are close to traditional Linear Weights. But not completely. For instance (and as Tango noted), we get that a double is worth only 0.67 runs, rather than the 0.8 that we're used to.

Assuming the 0.8 is actually the correct number, we ask: What's going on? Why are we getting only 0.67? It isn't just random variation, because the standard error of the 0.67 estimate in the regression is only 0.02. So what is it?

There must be something about teams that hit a lot of doubles that reduces the number of runs they score, in ways *other* than changing the number of singles, doubles, triples, home runs, walks, steals, caught stealings, and batting outs. What could that be?

I don't know the answer, but here are some possibilities:

-- maybe teams that hit a lot of doubles (relative to the other events) are more likely to be intentionally walked. Therefore, their walks are less valuable than those of other teams. Every additional double may correlate with one extra walk turning out to be an IBB, which results in the regression giving a lower coefficient for the double.

-- the regression went from 1961 to 2008. Maybe teams that hit a lot of doubles (relative to other events) played in low-offense eras (like the mid 60s). The extra doubles mark the team as being from that era, which makes all events worth less, which makes the regression adjust by giving a lower coefficient for the double.

-- maybe teams that hit a lot of doubles ground into a lot of double plays. Since double plays are extra outs that don't show up in (AB-H), that would cause runs to be overestimated. The regression adjusts for that by building the extra DPs into the value of the double.

And so on. I don't know what the true answer is; none of the suggestions above seem very likely to me. It's a bit of a mystery. I'd add IBB to the regression, but the Lahman database doesn't seem to have it for teams. Maybe I'll calculate it some other way and try again.

Anyway, it was a bit of a shock to me that the doubles estimate was so far off. I would have thought that a technique like linear regression, with over 1,000 rows of data, would be able to come up with the answer. But it didn't, almost certainly because of outside factors that we didn't control for. Not only that, but we don't even really know what those outside factors are! (Although if you have an idea, let me know in the comments.)


So the accepted value for the double is 0.8 runs, using a method I will explain shortly. The regression, on the other hand, gives only 0.67 runs.

It's not that the regression answer is wrong -- it's just that the regression answers a different question than the one we want.

The 0.8 method asks: "If a team happens to hit an extra double, how many more runs will it score?"

The 0.67 method asks, "If one team hits one more double than another team, how many more runs will it score taking into account that the extra double means the team might be slightly different in other ways?"

To most analysts, the first question is more important. Why? Because we really *do* want to find the cause-and-effect relationship. Confounding variables may be interesting, but they usually get in the way of what we're really trying to find. It's interesting to note that more triples is correlated with fewer runs scored. But that information isn't very useful to our understanding of baseball -- it doesn't tell us what makes teams win. That is, we hopefully aren't about to tell the 1985 Cardinals that they would have scored more runs by hitting fewer triples, at least not unless we really believe that triples hurt offense.

What we usually want to do, using Linear Weights, is answer questions like: if you release player X, and sign player Y, where player Y hits 10 more triples than player X, how much will the team improve? And, for that question, the regression gives us the wrong answer.

So what's the method we use to find the *true* value of a triple? It's pretty simple:

-- For a certain period of baseball history, look at the play-by-plays of all the games, and divide all plate appearances (and baserunning events) according to which of the 24 base/out states (such as no outs, runners on second and third) was happening when it occurred.

-- For each state, calculate how many runs were scored after that state was achieved. (Here's the one for 2009, from Baseball Prospectus.)

-- Now, for every triple, calculate the difference between the runs before the triple, and the runs after the triple. For instance, a leadoff triple would have been worth .79 runs (there was an expectation of .52 runs before the triple, and 1.31 runs after the triple). But a triple with the bases loaded and two outs was worth 2.36 runs (before the triple, .75 runs were expected to score. After the triple, only .11 runs were expected, but 3 runs actually did score. 3.11 minus .75 equals 2.36).

-- Average out all the values of all the triples, as calculated above.

That average is how much an extra triple is worth to an average team.

If you do this for the 1988 American League (which is the one I happen to have on hand), you get that a triple is worth 1.024 runs. A double is worth 0.775.

There are several reasons why this estimate is much better than what you could get from a regression:

-- as we pointed out, the regression estimate is influenced by superfluous other factors possessed by teams that hit triples.

-- as Tango pointed out in his link, the regression uses data aggregated into team-seasons, which means you're losing a lot of information. This method uses PA by PA, inning by inning data, for a much more reliable estimate.

-- we have a direct, logical, cause-and-effect relationship.

-- in effect, we are able to hold *everything* constant, even factors we don't know about. That's because we are not comparing team X's triples to team X's runs. We're comparing a league-average triple to the league-average runs. All other confounding factors are averaged out.

Another way to look at it: the regression looks only at inputs and outputs. So it has no idea if the input *caused* the output, or if there's some third factor that links the two. But the play-by-play method isolates the direct effect of the input. It knows for sure that the triple *caused* the change from one state (bases empty, no outs) to the next (runner on third, no outs), and so it's not fooled by outside factors.

Correlation does not imply causation, and regression can only provide correlation. Why not use this method, which is based on causation, and therefore gives you the right answer?


UPDATE: Below, commenter Ted links to a paper (.pdf) where he used regression to figure Linear Weights, and found that when he added variables for GIDP and reached on error, the doubles coefficient increased (from .689 to .722). The HR coefficient also increased (by 10 points). So that, I think, explains part of the mystery: the doubles are artificially low because teams that hit a lot of doubles and home runs are so slow that they also hit into a lot of DPs and don't reach base on error as much. When these are accounted for separately, some of the true value of the double is restored.

Labels: ,

Monday, October 19, 2009

Premature accusations of anti-French NHL racism

Another accusation of racism in sports hit the newspapers today, on the front page of Canada's "National Post." This time, it's English-speakers who are accused of discrimination, in the form of racism against French-Canadian players.

The story is about Bob Sirois, a former NHL forward from Montreal, who did some analysis on NHL demographics and concluded that there is an "anti-francophone virus" in pro hockey. The reporter also quotes Réjean Tremblay, a sportswriter for Montreal's "La Presse," who got a look at the findings, and argues that "discrimination against the frogs is absolute." (Here's an article by Tremblay on the issue.)

What is the evidence for these accusations? We don't know for sure, because the full argument is in Sirois' upcoming book. But the article gives a few statistics:

-- Forty-two percent of francophone Québeckers who played three or more years in the NHL won a trophy or were named to the All-Star team. "Only francophones at the highest level were able to have lasting careers," Sirois said.

-- Of all 16-year-old players at the midget level in Quebec, 1 in 334 anglophones was eventually drafted, but only 1 in 618 francophones.

-- Francophone players in Quebec are less likely to get drafted than anglophone players in Quebec, and they go lower in the draft.

-- Of the 763 francophones drafted since 1970, one-third of them went to four teams: the Quebec Nordiques, Montréal Canadiens, Buffalo Sabres, and Philadelphia Flyers. (The teams drafting the fewest francophones were the Dallas Stars, Nashville Predators, and Phoenix Coyotes.)

-- Sometimes, undrafted players manage to eventually make it into the NHL. That group represents 10% of players overall, but 19% of players from Québec, suggesting that more francophone players are going overlooked.

Since the Post reporter wrote that he had obtained a pre-publication copy of the book, we can probably assume these are the most damning facts behind the accusations.

But are they actually evidence of discrimination? In every case, there are other, more plausible, explanations for the results. Let's take them one by one.

1. 42% of francophone Quebeckers who played three or more years in the NHL won a trophy or were named to the All-Star team.

The idea, presumably, is that to last in the NHL as a francophone, you have to be really, really good. But where's the evidence? Maybe the figure for anglophones is even higher than that? Forty-two percent sounds like a lot, but it's meaningless without a comparison number.

But maybe the lack of a contrasting figure for English Canada is the reporter's fault. Let's suppose the book has the anglophone number, and it's less. Does that prove anything?

No, actually, it doesn't. This is an old argument. A couple of decades ago, baseball was accused of discriminating against blacks on similar evidence: there were lots of blacks among the league leaders, but fewer blacks as marginal players. Bill James effectively rebutted the argument then, based on the characteristics of the distribution of players.

Converted to hockey, the argument goes like this. Suppose that francophones happen to be better players, on average, than anglophones. More specifically, suppose skills are normally distributed with a standard deviation of 10 "points". English players have an average skill of (say) 100 points, but French players have an average skill of 105 points. You need to be over 130 to make the NHL, and over 135 to be considered a star.

So anglophone players need to be 3 SDs above the mean to hit 130 and make the league. That's about 135 players per 100,000 candidates. Francophone players need only be 2.5 SDs above their own mean. That's about 620 players per 100,000 candidates.

To hit the superstar 135 mark, the anglophones need to be 3.5 SDs above 100; that's about 23 stars per 100,000 population, which means 23 stars per 135 players. But the francophones only need to be 3 SDs above 105. That's 135 stars out of 100,000, or 135 stars out of 620 players.

Which means:

17% of anglophone players are stars (23/135)
21% of francophone players are stars (135/620).
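You can verify the tail arithmetic with Python's standard library. The z-scores below are just the contrived example's assumptions: the NHL cutoff sits 3 SDs above the anglophone mean but only 2.5 SDs above the (higher) francophone mean, and the star cutoff sits 3.5 and 3 SDs above, respectively:

```python
from statistics import NormalDist

def per_100k(z):
    """Players per 100,000 candidates beyond z SDs of a normal curve."""
    return 100_000 * (1 - NormalDist().cdf(z))

anglo_players, anglo_stars = per_100k(3.0), per_100k(3.5)
franco_players, franco_stars = per_100k(2.5), per_100k(3.0)

print(round(anglo_players))   # 135 players per 100,000
print(round(franco_players))  # 621 players per 100,000
print(round(anglo_stars / anglo_players, 2))    # 0.17 -> 17% of players are stars
print(round(franco_stars / franco_players, 2))  # 0.22 -> about 21-22%
```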

So there's a larger proportion of francophone stars than anglophone stars. The difference in our contrived example is only 21% to 17%. But it would be relatively easy to come up with numbers to make the difference bigger, or smaller.

The point is that a small difference in means adds up to a big difference at the far tails of the normal distribution. That's not discrimination, it's just the way the bell curve works.

Here's a more intuitive way to look at it. Suppose the anglophones and francophones were exactly equal in terms of players and stars. Now, let's make the francophones better by taking a couple of Mario Lemieux clones and throwing them into the francophone pot. Doesn't it now make sense that a larger proportion of francophones will be superstars? It's not racism -- it's just that the francophones are now BETTER.

One objection to this line of reasoning might be: if francophones are so much better than anglophones, shouldn't we see them disproportionately represented in the NHL? Yes, we probably should. And who says we don't? The Post article does NOT say that fewer francophones make the NHL, per capita, than anglophones. It says only that francophone Quebeckers were less likely to be drafted than *Anglophone Quebeckers*. I'd be willing to bet, right now, that francophones are more likely to be drafted than non-Quebec anglophones. That's based partly on this logic, and partly on my feeling that if it weren't true, Mr. Sirois would be trumpeting that fact in the article.

[ --> UPDATE: that's apparently not right. "Hawerchuk" says that Québeckers comprise 18% of Canadian NHL players (by games played). But they're 23% of the population. ]

(Oh, and why might it be that francophone players are better than anglophone players? It could be that anglophone Quebeckers live mostly in Montréal, where ice time is harder to get. Francophone Quebeckers are more likely to be in small, northern towns, where there are more rinks per capita and more frozen ponds to play on after school. That would give francophone boys more ice time and practice time, which would make them better players. It would be roughly the same reason that the Canadian Olympic team is competitive with the US team, despite having only one-tenth the population.)

2. Of all 16-year-old players at the midget level in Québec, 1 in 334 anglophones was eventually drafted, but only 1 in 618 francophones.

That could easily happen without discrimination. All it would take is for hockey to be a bigger part of francophone culture than anglophone culture.

Suppose that hockey is popular enough among French-speaking families that the top 20% of boys are still playing organized hockey when they're 16. And suppose that hockey is less popular among English-speaking families, so that only the top 11% of boys are still playing organized hockey when they're 16.

That would explain the numbers exactly. The mediocre francophone players don't get drafted. The mediocre anglophone players don't get drafted either, but they dropped out of organized hockey early enough that they don't make Sirois's survey.
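Under those assumed participation rates (the 20% and 11% are illustrative numbers, not Sirois's), the per-capita draft rates come out almost identical:

```python
# Hypothetical participation rates from the text: the top 20% of
# francophone boys and the top 11% of anglophone boys are still
# playing organized hockey at 16.
franco_playing, anglo_playing = 0.20, 0.11

# Observed draft rates among 16-year-old midget players (from the article).
anglo_draft_per_player = 1 / 334
franco_draft_per_player = 1 / 618

# Draft rates per capita, counting ALL boys -- including the drop-outs.
anglo_per_capita = anglo_playing * anglo_draft_per_player
franco_per_capita = franco_playing * franco_draft_per_player

print(f"anglophones:  1 in {1 / anglo_per_capita:.0f} boys drafted")
print(f"francophones: 1 in {1 / franco_per_capita:.0f} boys drafted")
```

Once the drop-outs go back into the denominator, the two rates (roughly 1 in 3,036 versus 1 in 3,090) are within a couple of percent of each other -- no discrimination required.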

Again, I'd be willing to bet that this is what's going on. I live in Ottawa, which is on the border with Québec, and I can tell you that the francophone families I know are much, much more hockey-mad than the anglophone families, on both sides of the border.

Sirois says,

"If you're francophone and your son is talented in minor hockey, anglicize his name and you will double his chances of being drafted."

If Sirois is basing that comment only on this particular statistic, his conclusion is premature, to say the least.

3. Francophone players in Québec are less likely to get drafted than anglophone players in Québec, and they go lower in the draft.

Same argument. The less-skilled anglophones drop out of hockey more frequently, while the less-skilled francophones drop out of hockey less frequently. So the remaining anglophones are better, on average, than the remaining francophones.

Again, I'd bet that if you looked at raw population numbers, more Québec francophones get drafted than Québec anglophones, at every level of the draft. They are less likely to be drafted, as Sirois says, if you look only at the pool of 18-year-old players. But I'd bet they are MORE likely to be drafted if you look at the pool of all 18-year-olds in Québec, whether they play hockey or not.

It's selective sampling if the mediocre francophones are more likely to be in the sample than the mediocre anglophones.

4. Of the 763 francophones drafted since 1970, one-third of them went to four teams: the Québec Nordiques, Montréal Canadiens, Buffalo Sabres, and Philadelphia Flyers. (The teams drafting the fewest francophones were the Dallas Stars, Nashville Predators, and Phoenix Coyotes.)

First, and easiest: are there fewer francophone players in the league now than in the past? Given the number of players these days being drafted from outside North America, that would seem likely. That would explain why the Stars, Predators, and Coyotes -- teams that weren't in the league in the 70s and 80s -- would have drafted fewer francophones than the more established teams.

It's the same reason you'd also find that Québec, Montréal, Buffalo and Philadelphia have had more non-helmeted players than Dallas, Nashville, and Phoenix. It's not because Nashville discriminates against bare heads, but because the Predators weren't around when it was legal to go without a helmet.

Secondly: it might just be a difference in scouting. Back in 1985, Bill James did a study of the MLB draft, and found that, in baseball, players in the southern United States were much, much more likely to be drafted than players in the cold states, even if the players were of equal talent. That wasn't racism against Minnesotans, it was just where the scouts decided to go. James wrote,

" ... the explanation seems obvious. ... The scouts spend a lot of time in the South because it gets warm down there while the North is still freezing, and they go where the baseball is. They see more of the players, see the ones they like more often, and wind up falling in love with them."

Doesn't it make sense that the same thing might apply in hockey? There isn't a weather issue, but there *is* a language issue. Doesn't it make sense that the Phoenix Coyotes are less likely to have a French-speaking scout, and are therefore less likely to send someone up to Chicoutimi in February to check out some prospect? If francophone scouts are rarer than anglophone scouts (which they obviously are), it makes sense that not every team would have one, and, as a result, francophone players would be disproportionately drafted by the teams that do. That's not discrimination, it's just rational allocation of resources.

Dallas might just be saying, "you know, we don't have a francophone scout, so we'll let the Canadiens concentrate on prospects in Trois-Rivières, and we'll send our guy to Regina."

5. Sometimes, undrafted players manage to eventually make it into the NHL. That group represents 10% of players overall, but 19% of players from Québec, suggesting that more francophone players are going overlooked.

This can easily follow from the hypothesis that there are more second-tier francophones in the draft pool than anglophones.

Again, suppose that 20% of francophone boys are still playing at age 16 (and therefore scouted for the draft), but only 11% of anglophone boys are. Scouts know that only 1% of Québec boys, of either language, will make the NHL. So they duly draft only 1% of the anglophone population, and 1% of the francophone population.

Scouts aren't omniscient, and they'll miss a few good prospects. There will be undrafted players who bloom later, and finally attract some interest from NHL teams.

Under our assumptions, 19% of francophone boys will be initially passed over, but only 10% of anglophone boys. That leaves almost twice as many francophones who might get noticed (and signed) later. Of course, those nine percentage points of extra francophones are less skilled than the top ten percent, but some players are late bloomers, and the bigger the pool, the more missed players you're going to sign later.

I think that's the obvious true explanation: more players means more late bloomers.
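The arithmetic behind that, using the same assumed 20%/11% participation rates (again, illustrative numbers, not the article's):

```python
# Assumed fractions of all boys still playing at 16 (hypothetical).
franco_playing, anglo_playing = 0.20, 0.11
drafted = 0.01  # scouts draft the top 1% of boys from each group

franco_passed_over = franco_playing - drafted  # ~0.19 of francophone boys
anglo_passed_over = anglo_playing - drafted    # ~0.10 of anglophone boys

ratio = franco_passed_over / anglo_passed_over
print(f"undrafted francophone pool is {ratio:.1f}x the anglophone pool")
```

Almost twice the pool of passed-over players means almost twice the late bloomers, even if the two groups bloom at the same rate.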


So the points raised in the article are certainly not enough evidence to conclude discrimination -- there is a perfectly plausible, non-racist explanation of each of them.

If you want to show discrimination, you need better arguments than these, to remove the selective-sampling problem intrinsic to each of the arguments here. What you can do is this: find all anglophones drafted in position X, and all francophones drafted in position X. See how they do in the NHL. If there's discrimination, you'll find that francophone 14th picks do better than anglophone 14th picks.

And even if you find there's discrimination, it doesn't mean it's racist, or even language-specific. It might just be a scouting issue, where there are fewer scouts in Québec than elsewhere, just as there were fewer MLB scouts in Minnesota than in Georgia.

My gut says you won't find much discrimination. I guess I wouldn't be surprised if you found a little bit: that team X might be less interested in a francophone eighth-round pick because of perceived language issues with the other players, when they can't see much difference between him and a similar anglophone player who's also available. But discrimination is expensive, and every team wants to win. If you want to convince me that teams are deliberately leaving money on the table because of racism, you'll have to come up with some pretty good evidence.

These arguments, though, just don't cut it. There are many better, more plausible explanations for the apparent statistical anomalies in the article -- enough so, in fact, that, in my view, the accusations of racism are premature and irresponsible.

(Other views: Here's Tango, and here's mc79.)


Sunday, October 11, 2009

Doesn't "The Book" study pretty much settle the clutch hitting question?

The clutch hitting debate continues. For the latest, here's Tango quoting Bradbury quoting Barra. Bradbury references Bill James' essay, and Barra references Dick Cramer's 1977 study.

In Tango's post, he says,

Anyway, as for actually finding a clutch skill, Andy [Dolphin] did in fact find it, and the results are published in The Book.

Absolutely. It's time, I think, that this study be acknowledged as the most relevant to the clutch question. Cramer's study gets quoted because it's the most famous, but recent studies (like Tom Ruane's) have used a lot more data. Dolphin's study improves on Ruane's by including even more data, by correcting for various factors, and by giving an actual quantitative estimate of how much clutch hitting talent there really is.

The one fault with Dolphin's work is that it hasn't been published in full. This is understandable: "The Book" contains a huge number of studies, and if they were all written up in detail, the book would be a couple of thousand pages. But this is one of the most important studies, on one of the most asked questions in sabermetrics. If we want sabermetricians, academics, and reporters to accept the results, the study should be published in full, so as to be subject to full peer review. I'm not even completely sure how the study worked. I have a pretty good idea of the outline, but not the details. Part of the reason the study needs to be published is for the technical details to be available, so others can evaluate the method and reproduce the results if they choose to.

Anyway, here's what I *think* Andy did:

-- he took every regular-season game from 1960 to 1992.
-- he considered only PAs involving RHP, to eliminate platoon bias.
-- for every player who met minimum playing time, he computed his clutch and non-clutch OBP.
-- he adjusted those OBPs to reflect the quality of the opposing pitcher, and the fact that overall clutch and non-clutch OBPs differ.
-- he computed clutch performance by subtracting non-clutch from clutch.

That gave him clutch numbers for 848 players.

-- he looked at the distribution of clutch hitting, and figured the observed variance.
-- he then figured what the variance would have been if there were no clutch hitting.

It turned out that the actual variance was higher than the predicted variance, which is what you'd expect if there were something other than just luck causing the results (such as clutch hitting talent). The difference we can presume to be clutch hitting.

If luck and talent are independent (which is a pretty reasonable assumption), then

Variance caused by talent = (Total Variance) - (Variance caused by luck)

That calculation led Andy to conclude that the talent variance was .008 squared, which meant the standard deviation of clutch talent was 8 points of OBP.
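As a sketch of that subtraction, with made-up intermediate numbers (Dolphin's actual observed and luck variances aren't published; these are chosen only so the answer lands on .008):

```python
from math import sqrt

# Hypothetical inputs: the spread of players' observed (clutch minus
# non-clutch) OBP, and the spread expected from binomial luck alone.
# Both values are invented for illustration.
sd_observed = 0.017
sd_luck = 0.015

# If luck and talent are independent, their variances add, so:
var_talent = sd_observed ** 2 - sd_luck ** 2
sd_talent = sqrt(var_talent)
print(f"SD of clutch talent: {sd_talent:.3f}")  # 0.008 -> 8 points of OBP
```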

Andy phrased it like this:

"Batters perform slightly differently when under pressure. About one in six players increases his inherent "OBP" skill by eight points or more in high-pressure situations; a comparable number of players decreases it by eight points or more."

That finding, I think, is the strongest we have, and I agree with Tango 100% that we should consider Andy's .008 figure to be the best available answer to the clutch hitting question.


As I said in previous posts, however, I do have some minor reservations about what we can conclude from the analysis, so it's appropriate to add a few caveats.

1. Mostly, I'm not convinced that the .008 represents individual clutch ability in the sense in which most fans think of it -- that the player "bears down" in important situations and performs better than normal. I wonder if, instead, it might just be a matter of both hitters and pitchers using different strategies in those clutch situations.

For instance, suppose you have a power hitter and a singles hitter, and neither gets any better in the clutch. But in those situations, the relative values of offensive events might change. Maybe, with the score close in the late innings, a home run becomes more valuable relative to a single. I'm making these numbers up, but, maybe instead of the HR being three times as valuable as a 1B, it becomes four times as valuable.

Now, the pitcher's strategy changes. Fearing the home run a little more than normal, he'd be apt to pitch around the power hitter, trading fewer home runs for more walks. That would cause the power hitter's OBP to increase more than expected. Even if there's no similar effect for the singles hitter, he'll look relatively worse in the clutch than the power hitter.

So it's possible, and even plausible, that the .008 might not be a reflection of the clutch behavior of an individual hitter, but just an artifact of the strategic manoeuvring in the batter-pitcher matchup.

To find out, you could check whether certain types of hitters have better clutch performances as a group. If you did find that, it would be evidence that at least part of what Andy found as "clutch ability" is just a characteristic of the type of player, rather than any individual knack for the clutch.

There is some evidence that some of this is happening: in the book, Andy says that when he used wOBA (which weights events by their value, so HRs are worth about three times what a single is worth) instead of OBP (which weights all on-base events equally), the SD dropped from 8 points to 6. That suggests that clutch performance did indeed involve a trade-off between getting on base and hitting for power.

If you went one step further, and analyzed performance in terms of win probability (instead of OBP or wOBA), you might find some other result, such as no evidence of clutch talent at all. It could be that all the clutch differences are the result of hitters adjusting their game to what the situation requires, such as (say) a power hitter trying for a single with the bases loaded, vs. a home run with two outs and nobody on.

2. Just today, Matt Swartz suggested that lefties might be more "clutch" than righties, because they hit better with runners being held at first (I always thought that was because of the hole between first and second, but Matt suggests it's because that limits the defense's ability to shift in other ways). Again, that's something that's real -- so the team would know they could benefit from it -- but not "clutch" in the sense that the hitter is actually better in some way.

3. Another quibble I have with the conclusion is that the result is not significantly different from zero. Andy says there's a 68% probability that clutch talent is between 3 and 12 points; I calculated that the 95% confidence interval easily includes zero (the p-value against the null hypothesis of zero clutch talent is somewhere around .14). So even if you're only interested in whether there's an ability to have a higher OBP in the clutch (in the sense that some players' clutch OBPs vary more than others), the evidence is not conclusive beyond a reasonable doubt.

4. As Andy implies in "The Book" (and Guy explicitly suggests elsewhere), there could be other explanations for the .008. It could be that some players happened to have more clutch AB at home, so what we're seeing is partly HFA. It could be that some players happened to see a starter for the third time that game (when batters start gaining an advantage) more often than expected in the clutch. It could be a lot of other things.

Guy suggests doing the same study, but choosing the PA randomly (instead of splitting them into clutch and non-clutch). That would tell us how much of the .008 happens due to random clustering of factors.

(Note: just as I was about to submit this post, I found an earlier Andy Dolphin study that *does* do this kind of check. Andy found that dividing PA into other situations did not produce any false positives.)


Even if some of these criticisms turn out to be justified, it doesn't mean that clutch doesn't matter. Even if we find the entire effect is (say) due to lefties hitting better with runners on base, that's still something a manager or a GM should take into account. If you have two .270 hitters, but one hits .270 all the time, while the other hits .268 usually but .276 in the clutch ... well, you want the second guy. It doesn't really matter to you whether the extra performance comes from the player's gutsiness, or just from something that's inherent in the game.

But my perception is that fans who talk about "clutch" are talking about something in a player's make-up or psychology that makes him more heroic in critical situations. I'd argue that while "The Book"'s study convincingly showed that some players hit slightly better (or worse) in clutch situations, it has NOT showed that it's because the players themselves are "clutch".


Looking back at what I wrote, I realize I'm repeating things I said before. But the point I was trying to make is that I agree with Tango: the study in "The Book" is state of the art, and, to my mind, the question of whether players hit differently in the clutch now has an answer.

I'm not sure how to get the result accepted. Well, publication of the study would help; the media are more likely to pay attention to a result if it's a full academic-type study instead of a few pages of a book. I'm sure JQAS would be happy to run it. Even a web publication would help.

What else? Well, I suppose that the more the sabermetric community cites the result, the more it'll spread, and the more likely sportswriters will be to come across it when researching clutch.

Or maybe a press release? It works for Steven Levitt!


Monday, October 05, 2009

Stacey Brook on salary caps and competitive balance

You'd think that when a sport introduces a salary cap, it would lead to greater competitive balance in the league. That would make sense; with a cap, you won't have teams like the Yankees, who spend two-and-a-half times as much on players as the average team, and about five times as much as the Marlins. If you forced the Yankees to spend only the league average, they would have to get rid of many of their expensive star players, and they'd win fewer games.

In theory, if every team had to spend the same amount, they'd all start the year with equal expectations. I say "in theory" because, in practice, different teams would have different philosophies, some of which might work better than others. Certain teams might spend more on scouting, wind up drafting better, and win more games with the same payroll (at least until the draftees reach free agency). But, generally, you'd expect more balance among teams.

It seems that Stacey Brook, co-author of "The Wages of Wins," doesn't think that's true. He thinks that the salary cap (and floor) the NHL instituted in 2005 has had no effect on competitive balance.

Here are Brook's "Noll-Scully" measures of competitive balance for the last few years of the NHL (lower numbers = more balance):

2000-01 1.858
2001-02 1.581
2002-03 1.592
2003-04 1.633
salary cap begins

2005-06 1.637
2006-07 1.600
2007-08 1.037
2008-09 1.369

It does seem, Brook acknowledges, that competitive balance has improved the last couple of years. But, he says, that's part of a trend that's been going on for a long time. For one thing, there was virtually no change in the Noll-Scully the first two years after the cap. For another, balance has been improving since at least the 1970s:

1970s 2.557
1980s 1.969
1990s 1.796
2000s 1.538
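For reference, the Noll-Scully figure in these tables is the observed spread of winning percentage divided by the spread an all-luck league of equal teams would show. Here's a sketch; the 6-team standings are invented, and note that the NHL's extra point for an overtime loss muddies the "winning percentage" definition somewhat.

```python
from math import sqrt
from statistics import pstdev

def noll_scully(win_pcts, games_per_team):
    # Idealized SD for a league of exactly equal teams: 0.5 / sqrt(G).
    ideal_sd = 0.5 / sqrt(games_per_team)
    return pstdev(win_pcts) / ideal_sd

# An invented 6-team, 82-game league for illustration.
standings = [0.640, 0.580, 0.520, 0.480, 0.420, 0.360]
print(f"{noll_scully(standings, 82):.2f}")  # ~1.70, in line with the table
```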

Since competitive balance has been increasing even through most of hockey history that had no salary cap, he argues, it's just a continuation of the trend, and the salary cap doesn't have anything to do with the recent decline. He writes,

"As we argue in The Wages of Wins, and detail in our paper - The Short Supply of Tall People - competitive balance is declining not because of changes in league institutional rules - such as payroll caps - but rather due to the increasing pool of talent to play sports, such as hockey."

But that doesn't make logical sense. Sure, there's already a decreasing trend, for whatever reason, but that doesn't mean a change to the rules can't contribute to the trend. Does having the ability to send text messages lead to people using their phone more? Of course it does! But if you apply the same argument, you get something like, "well, cell phones were becoming more and more popular even before text messaging, so text messaging can't have anything to do with it." That's not right.

And, indeed, it contradicts their own findings in "The Wages of Wins" itself. The authors found that there was an r-squared of .16 between salary and performance in MLB. Which means that if you were to flatten out salaries, so that each team paid an equal amount, it would reduce the variance of wins by 16%. So, absent any compensating factors, "The Wages of Wins" itself argues that a salary cap MUST reduce the Noll-Scully measure!
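That back-of-envelope claim can be made concrete. If salary explains 16% of the variance of wins, equalizing salaries removes that slice, and the SD of wins (the Noll-Scully numerator) shrinks by a factor of sqrt(1 - .16). The 1.60 starting point below is just an illustrative league value, not a published figure.

```python
from math import sqrt

r_squared = 0.16  # salary vs. wins, per "The Wages of Wins" (MLB)

# Removing the salary-explained variance shrinks the SD of wins by:
shrink = sqrt(1 - r_squared)
print(f"SD of wins falls to {shrink:.1%} of its old value")  # 91.7%

# Noll-Scully scales with that SD, so a league sitting at (say) 1.60
# would drop to about:
print(f"{1.60 * shrink:.2f}")  # ~1.47
```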


By the way, take a look at the value of 1.037 for 2007-08. That's really, really low; the lowest you can expect Noll-Scully to be is 1.000, and that's when every team is of exactly equal talent. A value so close to 1 suggests a combination of (a) the league being really balanced that year, and (b) teams, by luck, playing closer to .500 than their talent suggested.

If you look at the standings, you see the usual suspects at the top of the conferences, so it doesn't really seem like all the teams were equal that year. Could it be that Brook used a formula for Noll-Scully that didn't consider the extra point for an overtime loss?


But what about Brook's (and Berri's) argument that balance has increased because players' skills are becoming more equal? Well, sure, that's been part of it, no question. But effects often have more than one cause. You may be earning more money because you're working overtime, but that doesn't mean winning the office hockey pool can't *also* make you richer. Whatever was causing the levelling of team talent before might still be there ... but, now, there's an additional effect, the salary cap effect.

Now, maybe I'm not interpreting Brook's argument correctly. Maybe he's thinking that the salary cap does contribute to balance, but so much less than the other effect (players getting more equally talented) that it's not worth considering. But I think it's the other way around. With a salary cap, it doesn't matter much how the players' talent is distributed.

Suppose players vary a lot in talent, 100 players equally spaced from 0 to 100, with an average of 50. A team that has lots of money might buy players with an average of 70, and a team owned by Harold Ballard might buy players with an average of 30. Big difference.

Now, suppose the talent pool gets bigger, and competition gets tougher, and now the players are all spaced between 40 and 60. Now, no matter how much you want to spend, you can't get above 60. And no matter how cheap you are, you can't get below 40. But the league average is still 50.

So, yes, Brook is correct, a narrower range of talent leads to more competitive balance.

But, now, suppose that every team has a salary cap and a floor: they all have to spend exactly the same amount of money. Now, it doesn't matter how the talent is distributed: assuming every team is equally good at evaluating players, they'll all sign a team with an average of 50. Even if the distribution of talent is like it was in the 1970s, with lots of spread, it doesn't matter -- because even if there are lots of players in the 90s and 100s, no team can afford to sign more than one or two. The more talented the player, the more likely a team who signs him will have to sign *less* talented players to stay within the cap.

Even if you have the Babe Ruth of hockey, a player who's (say) a 500 when the other players top out at 100, it won't matter, because the teams will bid up the price of his services until they pay him what he's worth. The team who gets him will have less money to spend on other players, and it all evens out in the end.
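A toy market makes the point. Everything here is invented: prices are set equal to talent, each team greedily signs the best players it can afford from the same pool, and empty roster spots are filled with free, talent-zero players. It's a sketch of the argument, not a model of real NHL free agency.

```python
import random

random.seed(0)

def team_average(pool, budget, roster=20):
    """Greedily sign the best affordable players (price = talent);
    pad leftover roster spots with free, talent-zero players."""
    signed, spent = [], 0.0
    for talent in sorted(pool, reverse=True):
        if len(signed) == roster:
            break
        if spent + talent <= budget:
            signed.append(talent)
            spent += talent
    signed += [0.0] * (roster - len(signed))
    return sum(signed) / roster

# A 1970s-style pool with lots of spread: talents uniform on 0..100.
# (Teams draw from the same pool independently -- good enough for a sketch.)
pool = [random.uniform(0, 100) for _ in range(2000)]

# No cap: a rich team vs. a Harold Ballard team.
rich = team_average(pool, budget=1400)   # average talent ~70
cheap = team_average(pool, budget=600)   # average talent ~30
print(f"no cap: {rich:.0f} vs {cheap:.0f}")

# Cap AND floor at the same number: everyone lands at ~50,
# no matter how spread out the talent pool is.
capped = team_average(pool, budget=1000)
print(f"capped: {capped:.0f}")
```

Under equal budgets, every team's average is pinned near budget divided by roster size, regardless of the talent distribution -- which is exactly the "it all evens out" point above.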

What's happening is this: in the past decades, competitive balance decreased steadily for many reasons, including the increase of the talent pool that Brook cites. But, now, with a salary cap and floor, most of that stuff doesn't matter much any more!

It matters a bit, because not everyone is a free agent. The distribution of talent does matter for draft choices, because the top draft choice doesn't cost that much more than the others (but can be a whole lot better, as in Sidney Crosby).


Of course, NHL hockey teams are more than collections of free agents priced at market value, so we shouldn't expect competitive balance to be perfectly level. There are some factors that might cause the Noll-Scully to actually rise a bit from the theoretical bottom created by the salary cap.

For instance: the first draft choice goes to a team near the bottom of the standings. Back in the days of less competitive balance, that went to a team that was probably legitimately awful. Now, with teams closer in talent, it could go to a team that was just unlucky. If the team that gets the next Sidney Crosby is an average team, rather than a bad team, that won't reduce competitive balance the way it used to.

Also, scouting: an investment in scouting now pays off more than it used to. Before, if you were a low-spending team, maybe a better draft choice might move you from .400 to .450. Now, if all teams are medium-spending, maybe it'll move you from .500 to .550, and give you a legitimate shot at the Stanley Cup. So more teams should be willing to spend the money to improve their drafting. And so, the rich teams could "buy" better players, not by spending to pay them, but by spending to identify them better.

And there are probably other ways to get around the cap: didn't companies introduce employee health plans to get around wage controls in World War II? If a superstar free agent has knee problems, and I wanted to sign that player, I'd offer to hire the best knee doctor in the business and keep him on staff. Whatever he costs, it's not going to count against my cap. That may not actually be practical, but I'm sure rich teams will figure out ways to buy better teams, one way or another.

My point is not to say that these factors will push inequality back to where it was when teams could sign all the free agents they were willing to pay for, just that there may be other theoretical reasons that Noll-Scully may bounce back up a little bit. I think all those factors will be minor, and as long as the salary cap and floor stay within roughly the same range of each other, we'll continue to see a balanced league, regardless of how the talent pool changes.

Hat tip: The Wages of Wins
