Wednesday, August 30, 2006

The color bar: discounting Babe Ruth's accomplishments

One of the more common arguments against Babe Ruth as the best player of all time is that, because of the color bar, he never had to face black pitchers. I’m not buying it, at least not at face value.

At first, it sounds pretty reasonable – there were many pitchers, better than those Ruth actually faced, who would have certainly been good enough to play in the American League, if not for segregation. But since Ruth didn’t have to face them, the pitchers he did face were a little worse, and his stats were inflated for that reason.

But by that logic, everyone’s stats are inflated. Sure, Ruth didn’t have to hit against black pitchers. But Rogers Hornsby never had to face Walter Johnson, arguably the best pitcher ever. That’s because, according to the rules at the time, Hornsby was segregated to the National League for all of Johnson’s career.

Morally, of course, there’s a big difference between segregation by race and segregation by league. But the question is not a moral one; it’s an empirical one. If we don’t discount Mike Schmidt’s career because he didn’t have to face American Leaguer Jim Palmer, why do we discount Hank Greenberg’s career because he didn’t have to face Negro Leaguer Satchel Paige?

If you consider the Negro Leagues like a third major league, equal in talent to the AL and NL, the parallel is exact. It doesn’t matter which pitchers are in which leagues, so long as the talent level is about the same. Schmidt didn’t have to face Palmer, but, because of that, he had to face Tom Seaver that much more. And since Greenberg didn’t have to face Paige, he had to face Lefty Grove a few more times. As long as the leagues are even, the segregation is irrelevant.

But the Negro Leagues were almost certainly worse than the majors.
The highest percentage ever of major league players who were black was in 1974, when there were about one-third as many black players as non-black. But in the 30s, there were probably at least half as many Negro League players as white major leaguers (as far as I can quickly estimate). So the Negro Leagues were, proportionally, somewhat diluted.

And if the talent level in the Negro Leagues was worse than the majors, then the white players had it harder, not easier. To even out the talent, you’d have to send some of the better white players to the Negro Leagues, and bring some of the lesser black players to the majors. That would decrease the quality of play in the majors, which means that integration would inflate the white players’ stats.

Now, in fairness to those who make the Babe Ruth argument, they’re implicitly assuming that integration would lead to the elimination of the Negro Leagues, or their conversion into minor leagues. You’d pull Josh Gibson and Oscar Charleston into the majors, they would displace marginal major-leaguers, and the white stars would stay put. Under this scenario, yes, of course, integration would have improved the major leagues. Dizzy Dean’s job would be harder, as he’d be pitching to better hitters overall.

But that’s simply an argument that the pre-integration leagues were worse than they would have been if you contracted high-level baseball (by eliminating the Negro Leagues) at the same time you eliminated the color bar. That’s trivially true. It doesn’t follow that the leagues were worse than today, significantly worse, worse enough that Babe Ruth’s stats have to be discounted.

Consider that in the last 30 years, the proportion of blacks in MLB has dropped from 27% to 9%, purportedly for cultural reasons. (Same link as above.) Blacks are segregating themselves out of baseball. Again, there is a huge moral difference between baseball rejecting blacks and blacks rejecting baseball. But as it impacts the issue of league quality, the effect is the same. What we have today is, for purposes of quality of play, the same as a two-thirds color bar. If Babe Ruth’s career is to be discounted, should we also discount Derek Jeter’s, by two-thirds as much? Should we give three times as much credit to guys like Carl Yastrzemski and Joe Morgan, because they played in the era of maximum integration?

Perhaps we should, a little bit. But the race factor is one of many, many factors that affect the quality of play over time. Off the top of my head:

-- players today earn far more money than ever. This means more young athletes are likely to pursue a baseball career, which increases the pool of talent substantially.

-- on the other hand, other sports are now more competitive with baseball as a career choice for a talented athlete, which may act to reduce the talent pool.

-- baseball is dropping in relative popularity, because of an explosion of other recreational opportunities for children. Kids may be playing baseball less, which again reduces the quality of the talent.

-- expansion increases the number of players in MLB, which, all else being equal, reduces the talent level.

-- without question, players today have much better defensive skills than ever. This means that in the past, players’ batting averages would have been inflated by hits which would likely be turned into outs today.

-- players today have access to better medical procedures, which keeps more of today’s best players off the DL. Tommy John surgery has saved countless careers which would have been lost in the 30s.

-- more and more players are coming from countries other than the USA, for numerous reasons.

-- the pool of potential MLB players today includes players from Japan, a huge source of formerly-segregated talent.

And I’m sure you can think of more.

So compared to these, how important is the dropping of the color bar to the question of league improvement? It’s probably a drop in the bucket. Well, maybe it’s a cup in the bucket, or even a pitcher in the bucket – it would take a bit of research to know which. But in any case, to cite segregation as the only factor, or even the only important factor, while ignoring all the rest, doesn’t make sense. It attempts to answer the question by what feels good morally, rather than by a full accounting of the evidence.

Tuesday, August 29, 2006

Snakes on a Plane

Sabermetrics and Freakonomics meet again, as sabermetrician Cyril Morong makes an appearance in the Freakonomics blog. He's analyzing whether internet ratings for "Snakes on a Plane" show evidence of manipulation.

Evidently, "Snakes on a Plane" is a recent movie. I wonder what it's about ...

Cy's own sabermetric research page is here.

Monday, August 28, 2006

NHL playoff shot quality differences -- better defense, or better goaltending?

In 2005, Ken Krzywicki authored an excellent hockey shot quality study (blog entry) that produced a formula for predicting the probability of a goal, based on shot type, shot distance, whether it was a rebound, and whether or not it was on a power play.

Later that year, Krzywicki published a follow-up study, repeating his analysis for playoff shots. This time, I’m not so sure about the results.

“It is often said that playoff hockey is a different brand of hockey,” he starts out. But, then, instead of rerunning his full analysis on playoff data, he uses the same regular-season formula, as if the playoffs are the same brand of hockey.

Is that a problem? No, not if the formula works just as well. After all, it’s pretty much assumed that Runs Created and other such baseball formulas work fine for World Series games. Is the same true for hockey? Does a 55-foot slapshot have the same chance of scoring in the playoffs as it does in the regular season?

The answer, as Krzywicki finds, is no – in general, the chance of scoring on any given shot is lower in the playoffs. The study divides shots into deciles, based on scoring probability. In eight of the ten deciles, the chance is lower. In the other two, the chance is only very slightly higher.

So clearly, shots are less deadly in the playoffs. There are two possibilities:

1. The style of play is different in the playoffs; perhaps checking is tighter, and so shots are of lower quality, even from the same distance.

2. The shot quality is the same, but the goalies are better at stopping those shots.

It seems to me that number 1 is more likely. But Krzywicki argues that it’s number 2 – that the data shows the difference is improved goaltending. And that’s where I disagree.

Krzywicki’s reasoning is that statistical tests on playoff data using the regular-season shot formula give results that show a good fit – and therefore, we should assume that the shot probabilities haven’t changed. Specifically, the Kolmogorov-Smirnov (KS) statistic is only 2.8% worse for the playoff data; the population stability index is a low 0.0114, which indicates a stable population between playoff and non-playoff shots; and the “deviation in the logit” – a measure of how much the coefficients vary between the two datasets – is -0.0313, “a small number.” Therefore, since the statistical tests show non-significant differences, the old model is a good enough fit for the playoff data.

My response, as I argued a few days ago in the case of correlation coefficients, is that the results of statistical tests, taken alone, can’t provide sufficient evidence of a good enough fit for the purpose at hand. We already know that there is a high degree of similarity between playoff shots and regular season shots, just by the fact that hockey rules don’t change and nobody has noticed a serious difference even after watching hundreds of games. But that intuitive gut feel isn’t enough – if it were, we wouldn’t need this study. So how do we know that 2.8% goes farther than that, that it’s low enough to prove that there’s really no difference between the styles of play? Maybe 2.8% is huge – the difference between playing an AHL team and playing the New Jersey Devils. It looks small in the statistical sense, but it might be huge in the hockey sense. You’d need a whole other study to show whether 2.8% really is small enough to ignore.

As for the conclusion about goaltending, suppose that there are two alternate universes. In one universe, playoff goaltending improves by 15% but defense stays the same (as Krzywicki argues here). In the other universe, goaltending stays the same, but defense improves by 15% -- that is, shots are 15% less likely to score because the defense forces softer shots, even given the type and distance.

Both those cases will result in exactly the same play-by-play data, and the same results for the statistical tests! There’s no way of figuring out which of the two universes we live in without additional data (such as shot speed and accuracy, for instance).

So, I would argue, this study leaves the question unanswered. The difference might be mostly the playoff style of play. It might be mostly goalie improvement. Or it might be some significant combination of both.

Saturday, August 26, 2006

"Abstracts from the Abstracts" and Bill James Index

Many of you probably know about these already ...

Rich Lederer summarizes all the Bill James Baseball Abstracts, from his self-published 1977-1981 books, up to the last Ballantine one in 1988. Here's 1977; the others are linked to on the left side of that page. Good stuff.

Also, there's Stephen Roney's index to the works of Bill James. Mostly for reference, but it links to a few on-line articles.

Friday, August 25, 2006

Advancing on outs -- how many runs is a skilled baserunner worth?

In a three-part series on Baseball Prospectus (subscription may be required, but he blogs on his study here), Dan Fox looks at which baserunners benefit their teams most by advancing on outs.

To do that, he starts with the usual base-out run value charts. For instance, suppose there’s a runner on second and nobody out, and the following batter hits a ground ball to the first baseman. If the runner advances, there will be a runner on third and one out, which is worth an expectation of 0.978 runs scoring in the remainder of the inning. If the runner stays put, the value is 0.704 runs. So that particular advancement would be worth 0.274 runs, which would be credited to the baserunner.
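The accounting above can be sketched in a few lines. The two run-expectancy values (0.978 and 0.704) are the ones quoted from the post; everything else is just subtraction:

```python
# Sketch of Fox's baserunning credit, using the run-expectancy
# values quoted above. Keys are (runner's base, outs).
run_expectancy = {
    ("3rd", 1): 0.978,  # runner on third, one out
    ("2nd", 1): 0.704,  # runner on second, one out
}

def advancement_credit(state_if_advance, state_if_stay):
    """Runs credited to the baserunner for advancing on an out."""
    return run_expectancy[state_if_advance] - run_expectancy[state_if_stay]

credit = advancement_credit(("3rd", 1), ("2nd", 1))
print(round(credit, 3))  # 0.274
```

A full implementation would need the complete 24-state base-out table, but the per-play credit is always this same difference.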

Fox compiled full baserunning stats for the 2000-2005 seasons, based on play-by-play data. The first article in the series deals with ground outs; that article contains an explanation of the system, but, because of an error, the results were adjusted in part 3.

It turns out that the best ground-ball baserunners in that six-year period were Juan Pierre, who created 9.37 ground ball baserunning runs above what the average runner would do (EqGAR), and Adam Kennedy, who had 9.05. The third through tenth-place runners were between 5.31 and 3.82 runs.

The second part of the study dealt with fly balls. Derek Jeter was the best fly-ball advancer with 6.93 extra runs (EqAAR). Ray Durham was second with an EqAAR of 4.49, and the next eight were 4.15 down to 3.25.

Aside from the identities of the best and worst players – for which you can see full top-ten listings in the study – an important question is just how much a good baserunner can contribute to a team. A naive estimate would be simply the number of runs saved, such as 9.37 for Juan Pierre. However, as Fox notes,

"… the extreme values … don’t represent an actual skill, but instead a combination of skill and random variation which has the effect of pushing some values to the extremes."

That is, the study makes it appear that the best baserunners can save their teams between 4 and 9 runs. But we can’t say that’s true – those might have just been the luckiest players. At the other extreme, it’s possible that skill doesn't matter at all -- that all players are equally good baserunners, and Pierre and Jeter wound up at the top simply because they happened to be on base for a lot of slow grounders and deep fly balls.

That’s unlikely, of course. But just as the effect of skill is unlikely to be zero, it’s also unlikely to be as high as the 4 to 9 runs observed. The truth is somewhere in the middle. It is probably closer to the top than to zero, as suggested by the fact that Juan Pierre is a very fast runner, the type we would expect to be leading the chart.

But to know for sure, to find out the actual value of baserunning skill, more work is required. If Dan publishes full data, we could do a check using Tangotiger’s technique.

Thursday, August 24, 2006

Finding a true talent level for an outcome distribution

What is the distribution of team talent in major league baseball? For instance, how many teams are good enough to actually win 100 games in a season?

You can’t just go by actual won-loss records, because any team that actually wins 100 games has probably done so aided by a bit of luck. To oversimplify, a team that wins 100 games might be a 95-game team that got lucky, or a 105-game team that got unlucky. There are many, many more 95-game teams than 105-game teams, and so the average team that wins 100 games is probably closer to 95 games than to 105 games.

In general, the distribution of talent is much narrower than the distribution of actual results. One way to see this is to consider the extreme case – suppose every team had the same (.500) talent, as if every game outcome was determined by flipping a coin. In that case, the talent distribution is the narrowest it can possibly be, but the season outcomes are normally distributed with standard deviation about 6.
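That “standard deviation about 6” figure follows directly from the binomial distribution, with 162 coin-flip games:

```python
import math

# Binomial SD of season wins for an all-.500-talent league:
# sd = sqrt(n * p * (1 - p)) with n = 162 games, p = .500
n_games, p = 162, 0.5
sd_wins = math.sqrt(n_games * p * (1 - p))
print(round(sd_wins, 2))  # 6.36
```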

So, how can we determine the talent distribution, given that we only know the outcome distribution?
Tangotiger has a method. He points out that

var(outcome) = var(talent + luck) = var(talent) + var(luck) + 2 cov(talent, luck)

Since luck is random, it doesn’t correlate to talent, and the covariance term is zero. So

var(outcome) = var(talent) + var(luck)

We can observe var(outcome) from actual W-L records, and we can figure out var(luck) from the binomial distribution. And so we can easily figure out var(talent). Here’s a quote from Tom’s post describing the method in more detail. He’s figuring out var(talent) for the NFL:

Here is one way to figure out the var(true) for any league.

Step 1 - Take a sufficiently large number of teams (preferably all with the same number of games).

Step 2 - Figure out each team’s winning percentage.

Step 3 - Figure out the standard deviation of that winning percentage. I just did it quick, and I took the last few years in the NFL, and the SD is .19, which makes var(observed) = .19^2

Step 4 - Figure out the random standard deviation. That’s easy: sqrt(.5*.5/16), where 16 is the number of games for each team.

So, var(random) = .125^2

Solve for: var(obs) = var(true) + var(rand)

var(true), in this case, is .143^2

So the standard deviation of talent in the NFL is .143.

Tango tells us the SD of talent in MLB is about 0.060. If we assume that MLB teams are normally distributed with mean .500 and SD .060, then 99.5 wins (.614) is 1.90 standard deviations from the mean.

Looking up 1.90 in a cumulative normal distribution table tells us that 2.9% of teams have the talent to win 100 games or more. That’s 1 in 34, or about one team per season.
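Tom's steps, and the 100-win calculation above, can be reproduced in a few lines. The NFL figures (.19 observed SD, 16 games) and the MLB talent SD (.060) are the numbers quoted in the post:

```python
import math

def talent_sd(observed_sd, games_per_team):
    """var(talent) = var(observed) - var(luck), luck from the binomial."""
    var_luck = 0.25 / games_per_team          # (.5 * .5 / n), squared SD
    var_talent = observed_sd**2 - var_luck
    return math.sqrt(var_talent)

# NFL example from Tango's post: observed SD .19 over 16 games
print(round(talent_sd(0.19, 16), 3))  # 0.143

# MLB: talent SD about .060. Chance a team's talent is 99.5+ wins (.614):
z = (99.5 / 162 - 0.500) / 0.060              # about 1.90 SDs above the mean
p_100_wins = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # normal upper tail
print(round(100 * p_100_wins, 1))  # about 2.9 percent
```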

But in any case, the point of this post is not the specific value, but rather, Tangotiger’s method. It’s simple, it’s easy to calculate, it’s theoretically sound, and it’s extremely useful in all kinds of situations.


Tuesday, August 22, 2006

On correlation, r, and r-squared

The ballpark is ten miles away, but a friend gives you a ride for the first five miles. You’re halfway there, right? Nope, you’re actually only one quarter of the way there.

That’s according to traditional regression analysis, which bases some of its conclusions on the square of the distance, not the distance itself. You had ten times ten, or 100 miles squared to go – your buddy gave you a ride of five times five, or 25 miles squared. So you’re really only 25% of the way there.

This makes no sense in real life, but, if this were a regression, the "r-squared" (which is sometimes called the "coefficient of determination") would indeed be 0.25, and statisticians would say the ride "explains 25% of the variance." There are good mathematical reasons why they say this, but they mean "explains" in the mathematical sense, not in the real-life sense.

For real life, you can also use "r". That’s the correlation coefficient, which is the square root of 0.25, or 0.5. In this example, obviously r=0.5 is the value that makes the most sense in the context of getting to the ballpark. Because you really are, in the real-life sense, halfway there.

r is usually the value you use to draw real life conclusions from a regression. According to "The Hidden Game of Baseball," if you regress Runs Scored against Winning Percentage, you get an r of .737, which is an r-squared of .543. A statistician might use the r-squared to say that runs "explains 54.3% of the variation in winning percentage." Which is true if you are concerned with the sums of the squares of the differences – and only a statistician cares about those.

What real people are concerned about is what conclusions we can draw about baseball. And those conclusions are based on the "r", the 0.737. What that tells us is that (a) if a team ranks one standard deviation above average in runs scored, then (b) on average, it will rank 0.737 standard deviations above average in winning percentage.

The 73.7% is useful information about the value of runs to winning ballgames. But the 54.3% figure doesn’t tell you anything you need to know.
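The "one SD in runs gives r SDs in winning percentage" claim is just the standardized regression slope, and the identity holds exactly for any dataset. A quick check with made-up numbers (the runs and winning percentages below are invented for illustration, not real teams):

```python
import numpy as np

# Made-up team data, for illustration only
runs = np.array([650.0, 700.0, 720.0, 800.0, 850.0, 900.0])
wpct = np.array([0.430, 0.480, 0.470, 0.520, 0.560, 0.580])

r = np.corrcoef(runs, wpct)[0, 1]

# Regress the standardized (z-score) versions: the slope equals r exactly
zx = (runs - runs.mean()) / runs.std()
zy = (wpct - wpct.mean()) / wpct.std()
slope = np.polyfit(zx, zy, 1)[0]

print(np.isclose(slope, r))  # True: 1 SD in runs predicts r SDs in wpct
print(round(r**2, 3))        # the r-squared a statistician would report
```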

I made this point in my review of "The Wages of Wins," where the authors found that payroll "explains only 18%" of wins. They were using r-squared. The r is the square root of .18, which is about .42. Every SD of increased salary leads to an increase of 0.42 SD in wins. In real life, salary explains 42% of wins – although a statistician would probably never put it that way.

Sometimes, the correlation coefficient is used not to predict anything, but just to give you an idea of the relationship between variables. Everyone knows that +1 is a perfect positive relationship, -1 is a perfect negative relationship, and 0 is no relationship at all. And the higher the absolute value of the number, the stronger the relationship. So an r of 0.1 is a weak relationship, but -0.9 is a very strong relationship.

But a "very strong relationship" depends on the context. Sean Forman reports that the correlation between year-to-year players’ batting average is 0.45. That’s pretty high. But if the game-to-game correlation was 0.45, that would be enormous! It would indicate a huge "hot hand" effect. It would mean that if a player was two hits above average one night – say, he went 3-for-4 instead of 1-for-4 – he would be 0.9 hits above average the next night. That would mean that a .250 hitter turns into a .475 hitter after a 3-for-4 game!

Obviously, if you really did the experiment of computing game-to-game correlations, you’d get a very small number. I’m guessing, but, for the sake of argument, let’s say it might be 0.04.

Now, these two correlations are measuring the same ability – hitting for average. But because of context, an 0.45 can be pretty high in the season case, but earth-shattering in the game case. Conversely, 0.04 is meaningful in the game case, but, in the season case, it would show that batting average is barely a repeatable skill at all.

It all depends on context.

I mention this because of a blog entry on the "Wages of Wins" website. There, David J. Berri compares his book’s quarterback ranking to various versions of more sophisticated stats from Football Outsiders. He finds that the correlations are 90%, 92%, and 95% respectively.

And so he writes, "this exercise reveals that there is a great deal of consistency between the Football Outsiders metrics and the metrics we report in The Wages of Wins."

With which I disagree. The interpretation of correlation coefficient depends, again, upon the context. If you were completely ignorant about football statistics, then, yes, a 90% correlation would indicate that you’re measuring roughly the same thing. But given the vast amount of sabermetric knowledge we have about football, 90% could mean the statistics are very different at the margins of knowledge.

For instance, I’d bet that, in baseball, Total Average and Runs Created might correlate on the order of 90%. But, given our knowledge of baseball, we know that Total Average is unsatisfactory in many ways, and the differences are significant at the level of detail that we need for future research. 90% is enough to put Babe Ruth on the top and Mario Mendoza on the bottom. But it’s not good enough to tell the productive base stealers from the unproductive, or give us reliable information about the relative value of hits, or even to distinguish the 55th percentile player from the 45th.

To sum up: in one example, a 0.45 correlation was huge; in another example, a 0.9 correlation was mediocre. If your analysis starts and stops with the correlation coefficient, you really haven’t proven anything at all.


More posts on r and/or r-squared:

The regression equation vs. r-squared

Still more on r-squared

Why r-squared doesn't tell you much, revisited

R-squared abuse

"The Wages of Wins" on r and r-squared


Monday, August 21, 2006

How good are the best fielders?

How many runs is a great fielder worth? There are lots of rating systems and lots of different answers. Tangotiger (Tom) takes a stab at the answer, but instead of analyzing play-by-play data, he just goes with his gut.

He starts by dividing batted balls into degree of difficulty. About 70% are “automatic,” and there should be little if any difference between good and bad fielders -- Tom figures the good fielders convert 99% of those plays, and the bad fielders convert 97%. On plays that take “some effort,” Tom estimates 80% vs. 50%. On difficult plays, it’s 50% vs. 15%. And on “highlight reel” plays, he estimates that great fielders convert five times as many of those plays as bad fielders do: 15% to 3%.

Do the arithmetic, and the difference between a good and bad fielder is 10%, or 60 plays a season. That is, the best fielders are 5% above average, and the worst are 5% below average. It’s higher for a shortstop, because that position sees 30% more balls in play -- so figure 40 plays above average instead of 30.

What it boils down to, basically, is that if 70% of plays are easily handled even by bad fielders, there’s not much room left in the other 30% to make more than 30-40 plays worth of difference.
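Tom's arithmetic can be reproduced once you assume a mix of play difficulties. The conversion rates below are his estimates as quoted above; the 70/15/10/5 mix is my own guess (he gives only the 70% automatic share), chosen purely to illustrate how a roughly 10% gap emerges:

```python
# Conversion rates (good fielder, bad fielder) are Tom's estimates;
# the share of each play type, beyond the 70% automatic, is assumed.
plays = [
    # (share of balls, good rate, bad rate)
    (0.70, 0.99, 0.97),  # "automatic"
    (0.15, 0.80, 0.50),  # "some effort"    (share assumed)
    (0.10, 0.50, 0.15),  # "difficult"      (share assumed)
    (0.05, 0.15, 0.03),  # "highlight reel" (share assumed)
]

gap = sum(share * (good - bad) for share, good, bad in plays)
print(round(gap, 2))           # 0.1, i.e. about a 10% gap in conversion rate

balls_in_play = 600            # rough chances per season at one position
print(round(gap * balls_in_play))  # about 60 plays between best and worst
```

Varying the assumed mix moves the result around, which is exactly Tom's invitation to substitute your own numbers.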

Tom welcomes you to substitute your own numbers -- but they have to meet the constraint that, on average, 70% of plays get turned into outs, since that’s the MLB average conversion rate. He says that the highest he can go on reasonable assumptions is a maximum of 42 plays above average, which is 55 for a shortstop.

(What he doesn’t mention is that some shortstops might see even more plays, due to an overbalance of left-handed pitchers, or groundball pitchers, or just plain luck. But the 5% figure would still hold, even if the number of plays winds up a bit higher.)

Now, obviously, this kind of analysis is not an evidence-based, empirical study. But it helps a lot. It gives us a general range that we expect any result to fall into. If someone comes out with a study that purportedly shows that a second baseman can save his team 100 hits per year, we now have a basis to question it. I agree with Tom that it’s very difficult to save that many hits -- with 70% of balls in play being automatic, that leaves only 225 plays remaining in which to make a difference. It’s very difficult to watch a game and pick up one-and-a-half balls a game that (a) are hit in the vicinity of the second baseman, (b) are difficult plays, (c) that the fielder still manages to turn into an out, and (d) an average second baseman would almost certainly not have made. Our general observations, while not in the same league as good empirical evidence, are still pretty good, good enough to justify a strong skepticism.

And it seems to me that 40 plays a year is about right. Turning 40 hits into outs is equivalent to about 80 points in batting average for a regular player. If the average SS hits .270, even the best fielding shortstops won’t be in the lineup if they hit .190. And Derek Jeter, whose defense has a horrible reputation in sabermetric circles, just happens to be more than 40 singles above average offensively.

Saturday, August 19, 2006

Challenge: design a study that measures player improvement

If everything doubled in size last night, how could we prove it happened? We can’t do it by measuring things, because the size of rulers, and therefore of inches, would also have doubled. We could check the speed of light, but if the laws of nature changed at the same time, to reflect the doubling, we’d be out of luck. This is an old philosophical riddle, to which one answer is that it makes no sense to say that everything doubled, because if there’s absolutely no way in which the universe can be seen to differ after the doubling, the universe is actually exactly the same as before.

The baseball equivalent is, if all players got twice as good in the last 100 years, how could we prove that happened? I posted on this a couple of weeks ago (here), and my point then was that the Cramer method (which compares players last year to the same players this year to see if the rest of the league got better) doesn’t work.

As I argued, one thing you can do is notice that unlike the “doubling in size” case, where the laws of physics also changed, in baseball, the laws of nature are the same as ever. You can look at pitch speeds, and ball distances, and so on. You can use physics as the unchanging ruler to check performance against.

But can you design an experiment, like Cramer tried to do, that will find an answer without looking to physics? I can’t find the reference, but I’m pretty sure Bill James once speculated that there’s no way to do it. I think I agree.

I think the changes in major-league skill are so closely tied in with the changes in players as they age, that you can’t disentangle one from the other. For instance, Bill James’ famous 1982 study showed (I’m oversimplifying) that hitters lost 7% of their value between age 27 and 28. Which is true. But did they also lose 7% of their ability? Not necessarily. They might have lost only 6% of their ability, but the league improved by 1% under their feet, so they lost a total of 7% of their value. Or maybe they lost 5% and the league improved by 2%. Or some other combination. Right now, we don’t know which is correct.

Can anyone think of a study, using only Retrosheet data (no physics), that would allow us to reach any conclusions about the rate of improvement in major league baseball? You don’t have to actually do the study, just describe it. It can be as complicated as you want. And it can return any valid conclusions at all – even a wide confidence interval is OK, as long as it’s justified by the logic of the study and the evidence.

I’m betting it can’t be done.

Friday, August 18, 2006

Hockey: a shot quality formula

In this post from last week, I described a study by Alan Ryder that analyzed play-by-play data to determine the expected value of different types of shots on goal, based on distance and shot type.

Ryder’s data was in the form of graphs, rather than formulae – if you wanted to find out how many goals an even-strength 40-foot wrist shot was worth, you’d look up some values on his graphs and do a quick calculation.

This study by Ken Krzywicki actually calculates a single formula for all shots.

Krzywicki uses a logistic regression, which is just a regular regression, but instead of trying to predict the probability of scoring, it predicts the logarithm of the odds of scoring. The regression results are on page 4 of the study, but I’ll just run through a quick example.

What is the chance of scoring on a 20 foot wrist shot on the power play that’s not a rebound? According to Krzywicki’s formula,

Start with –2.2369.
Add 0.3654 for a shot of 17-22 feet.
Add 0.0093 for a wrist shot.
Add 0.0000 because the shot is not a rebound.
Add 0.4007 for a shot that’s on the power play.
The total is –1.4615. Take the negative, which is 1.4615.
Take the natural antilog (e to the power) of 1.4615, giving 4.31.
Add 1, giving 5.31.
Finally, compute 1 divided by 5.31. That gives 0.19.
So the chance of scoring on the shot is 0.19.

Or, put another way, a 20 foot power-play wrist shot is worth 0.19 goals.
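The worked example above is just the logistic transform applied to a sum of coefficients. Here it is as code, using the coefficient values quoted above (the variable names are mine, not Krzywicki's):

```python
import math

# Krzywicki's logistic model, with the coefficients quoted above.
INTERCEPT = -2.2369
coef = {
    "distance_17_22_ft": 0.3654,  # a 20-foot shot falls in the 17-22 band
    "wrist_shot": 0.0093,
    "rebound": 0.0000,            # not a rebound
    "power_play": 0.4007,
}

logit = INTERCEPT + sum(coef.values())   # -1.4615
prob = 1.0 / (1.0 + math.exp(-logit))    # logistic transform of the log-odds
print(round(prob, 2))  # 0.19
```

The five-step recipe in the post (negate, exponentiate, add 1, take the reciprocal) is exactly this 1/(1+e^(-x)) transform, unrolled.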

Krzywicki runs through a bunch of statistical tests to validate the model ... for instance, he presents a graph of predicted versus actual probabilities by decile, and the model appears to fit quite well. As far as I understand the tests, it’s still possible that the model is biased for certain types and distances of shots, but if so, the biases appear to balance out overall.

Wednesday, August 16, 2006

MLB overstates interleague attendance boost

Today, SABR’s “Business of Baseball” committee released the summer issue of its quarterly newsletter, “Outside the Lines.” The issue includes a new study (scroll to page 5) by Gary Gillette and Pete Palmer, entitled “Interleague Attendance Boost Mostly a Mirage.”

Here's a summary.

Gillette and Palmer start by quoting an
MLB press release boasting that since 1997, attendance for interleague games has outdrawn regular games by 13.2%.

But then they break down the numbers. Overall, interleague outdraws regular 32,838 to 29,099. (They say that is indeed a 13.2% increase, but my calculator says 12.8%.)

But Gillette and Palmer find that most interleague games take place in June and July, when regular game attendance is high. Adjusting for dates, the 29,099 average attendance for regular games becomes 29,763, which reduces the increase to 10%. (My calculator agrees this time and from here on.)

Also, 61% of interleague games take place on weekends, versus only 46% of regular games. Adjusting regular attendance for the increased weekends would raise average attendance from 29,099 to 29,910, which again translates to only a 10% increase.

Making both adjustments – calculating average attendance of regular games on June and July weekend days – gives a regular game attendance figure of 30,574. The difference is now only 7%.

Finally, if you take 1997 out of the equation – when fan interest in interleague play was at its highest due to novelty – the figures become 32,782 interleague versus 31,122. That’s only a 5.3 percent increase, rather than the 13.2% in MLB’s claims.

Here are all the numbers again:

32,838 ... 29,099 ... 13.2% (unadjusted)
32,838 ... 29,763 ... 10.3% (adjusted for month)
32,838 ... 29,910 .... 9.8% (adjusted for day of week)
32,838 ... 30,574 .... 6.8% (both adjustments)
32,782 ... 31,122 .... 5.3% (both adjustments, 1998-2006)

(Update: the issue is available to all -- use the link in the first paragraph.)

Tuesday, August 15, 2006

Are pitchers chicken?

The issue of retaliation for hit batsmen is a popular one among economists, because it provides a natural test of the standard “people respond to incentives” theory.

The theory predicts that a pitcher will plunk more batters if he doesn’t have to come to the plate and risk being hit himself. That is, in the National League, when the pitcher faces potential retaliation, he has an incentive to refrain from deliberate plunkings – or, at least, to pitch more carefully to prevent the accidental hit batsman. But with the DH, the pitcher doesn’t have to bat, and he has less incentive to avoid the HBP.

(This is a hot topic, and there are more studies than I’m summarizing here, but I’ll eventually get to the only two Retrosheet-data-based ones I know of, which are probably the best.)

In 1997, Brian Goff, William Shughart, and Robert Tollison started the ball rolling – er, beaning – with a study showing that hit batsmen were substantially more common in the AL than the NL, seemingly confirming the “moral hazard” hypothesis. (The paper doesn’t seem to be online, and I haven’t actually read it, but it’s cited a lot.)

Steven Levitt responded in 1998 with this nice article, pointing out that almost the entire difference is explained by the fact that pitchers, being sucky hitters (I am paraphrasing), don’t get hit very much. Non-pitchers are hit at almost the same rate in both leagues. Furthermore, pitchers who hit more batters are no more likely to be hit themselves, which means that self-preservation is unlikely to be a motive.

More recently, this study by J. C. Bradbury and Douglas Drinen used Retrosheet data to analyze the question in more detail. For all interleague games from 1997 to 2003, they ran a regression on a game-by-game basis instead of team by team, attempting to predict HBP (received) from a number of season-level and game-level team factors.

They find some serious significance. Walks, HR, and game score difference were significant at 1% or better.

Most importantly, the DH was significant at the 5% level (the DH meant more HBP, as expected). Retaliation, as measured by the number of batters the team itself hit, was extremely significant at 1% -- in fact, the observed value was seven standard deviations from the mean.

The authors write that this means the DH is associated with an 11% increase in HBP, and each batter hit increases the number of its own HBP by 10% to 15%.

I don’t know if the authors accounted for this, but if one HBP in the game increases the frequency of opponent HBPs by 10% in that game, it follows that it must have increased the frequency of subsequent retaliatory HBPs by more – since only half of HBPs can be retaliatory. (Assuming a team won’t retaliate after a retaliation.) And that’s the number we’re most concerned with: how the rate of plunking increases after an HBP, not before.

Here’s an example, which you can ignore. Consider two teams, A and B. A hits B unprovoked in 1/3 of games. B hits A unprovoked in 1/3 of games. Nobody gets hit in the last 1/3 of games. Finally, teams retaliate 50% of the time they are hit unprovoked.

On average, every six games will look like this:

1. No HBP
2. No HBP
3. A hits
4. A hits then B hits
5. B hits
6. B hits then A hits

On average, A and B each hit 1/2 batter per game. But in games in which a team is itself hit, it hits 2/3 of a batter per game. That looks like an increase of 33% due to retaliation. But we’ve seen that there is actually 50% retaliation. That comes out only if you look at the order in which the plunkings occurred.
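You can verify the example by enumerating the six-game cycle directly (the labels and code are mine, just restating the numbers above):

```python
# Each game is the sequence of hit batsmen in order:
# 'A' means team A hits a batter, 'B' means team B does.
games = [[], [], ['A'], ['A', 'B'], ['B'], ['B', 'A']]

# Overall: A hits 1/2 batter per game.
a_hits = sum(g.count('A') for g in games)
overall_rate = a_hits / len(games)

# Game-level view: in games where A is itself hit, A hits 2/3 of a batter.
hit_games = [g for g in games if 'B' in g]
rate_when_hit = sum(g.count('A') for g in hit_games) / len(hit_games)

# Order-aware view: A retaliates after 1 of B's 2 unprovoked hits = 50%.
unprovoked_b = [g for g in games if g and g[0] == 'B']
retaliation_rate = sum(1 for g in unprovoked_b if 'A' in g) / len(unprovoked_b)

print(overall_rate, rate_when_hit, retaliation_rate)  # 0.5, 2/3, 0.5
```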

So when Bradbury and Drinen find an increase of 10% to 15%, that’s probably understating the actual retaliation effect, because they’re including “pre-taliations” – and those presumably are less frequent, which brings the average down.

The same would be true of the home run and score situation. It seems to me that because this study doesn’t account for what came first, the effects it finds are lower bounds, and real life probably has an even stronger cause and effect relationship than what the study found.

But anyway, the main purpose of the study was to find a DH effect. And it did find one, significant at the 5% level.

Finally, Bradbury and Drinen have one more paper, this time analyzing the question in even more detail – by plate appearance rather than by game. For each plate appearance from several seasons in both leagues, they regressed HBP on about 15 variables, including score, batter and pitcher quality, whether the previous batter hit a home run, whether there was an opposing HBP in the previous half-inning, and so on.

Their findings: the DH increases the probability of an HBP by 11 to 17 percent. Also, the study finds that the pitcher has four times the chance of being personally hit after hitting an opponent in the previous half-inning. It appears that pitchers do reduce their HBP out of fear of getting hit themselves.

We can even estimate the fear. I’ll run through the calculation – let me know if I’ve made any mistakes.

Levitt’s study shows that pitchers are hit once per 335 at-bats. The increase in risk is three times that (the difference between average and four times average) or, say, about once per 100 at-bats. The chance of a pitcher coming to bat the inning following an HBP is, say, 40%. So after an HBP, the increase in pitcher plunkings is about 1 in 250 at-bats. That is, every 250 HBPs committed by a pitcher cause him to get hit once himself.

So since pitchers reduce their HBP in response to this incentive, they marginally value their own safety about 250 times as much as they value the safety of the other team’s batters.

That might not be fair … Bradbury and Drinen’s study assumes that pitchers are out of danger if they don’t bat the next inning. Suppose that they always bear the four times risk in their next at-bat, even if it comes in a later inning or game. Then the 40% factor disappears, and instead of valuing their own safety 250 times as much, it becomes only 100 times as much.
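Here is the calculation both ways, using the unrounded inputs (the article rounds 1/112 to 1/100, which is where the 250 and 100 come from):

```python
base_rate = 1 / 335            # Levitt: pitchers are hit once per 335 at-bats
extra_risk = 3 * base_rate     # going from 1x to 4x risk adds 3x, about 1/112
p_bats_next_inning = 0.40      # chance the pitcher bats the following inning

# If the danger applies only when he bats the next inning:
hbp_per_own_plunking = 1 / (extra_risk * p_bats_next_inning)
print(round(hbp_per_own_plunking))   # 279 -- rounds to "about 250" above

# If the danger always applies to his next at-bat, whenever it comes:
print(round(1 / extra_risk))         # 112 -- rounds to "about 100" above
```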

So: introduce the DH, and a pitcher feels emboldened enough to hit 100 batters – just because it’ll save him being hit once. Is 100 to 1 the normal level of human self-interest? Or are pitchers particularly chicken?

Monday, August 14, 2006

How good was the WHA?

In 1985, Bill James’ famous minor-league study determined that the level of pitching in the majors was 18% higher than in AAA. That is, you’d have to discount minor-league hitter’s stats by 18% to predict what they would have done in the major leagues.

In his article “League Equivalencies,” Gabriel Desjardins looks to do the same for hockey.

For instance, Desjardins found all players in the 1972-73 WHA (its inaugural season) who played in the NHL the following year. Those 39 players subsequently scored 46% as many points per game in the NHL as they had in the WHA, and so the “league quality” of the 1972-73 WHA was 0.46.
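The equivalency itself is just a ratio of scoring rates. A sketch (the points-per-game numbers here are invented for illustration; only the 0.46 result is Desjardins’):

```python
def league_quality(ppg_after_jump, ppg_before_jump):
    """Ratio of the same players' scoring rates after and before moving up."""
    return ppg_after_jump / ppg_before_jump

# Hypothetical aggregate rates for the players who jumped to the NHL:
wha_ppg = 0.80   # invented WHA points per game
nhl_ppg = 0.368  # invented NHL points per game the following year

print(round(league_quality(nhl_ppg, wha_ppg), 2))  # 0.46
```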

The WHA’s 0.46 was only slightly higher than the 0.43 for the minor-league AHL that year. “This is not surprising,” Desjardins writes, “since the WHA mined the AHL to fill out its teams.”

The WHA’s quality increased during its life – up to 0.76 the next year, and then irregularly to 0.89 in its final season of 1978-79. By contrast, the AHL stayed in the 0.50 range in the 70s, and is now at 0.45.

The Russian Elite League is the highest-quality non-NHL league at 0.91; the Czech league is second at 0.61, followed by Sweden (0.59) and Finland (0.54).

Desjardins argues, also, that the “real” quality is likely to be higher than the figures he presents, because players moving to the NHL normally get much less power play time than they did in the other leagues. That reduces their point scoring more than just their ability would suggest.

To which I would add: what about playing time? Wouldn’t it also be true that players good enough to be promoted will get less playing time in the NHL than they did in the minors? That would deflate their numbers even more. You’d think this would be a very large factor, at least as large as the power play issue. (On the other hand, you’d think that playing time in high-caliber leagues (like the Russian league) would be less of an issue, since the better the hockey, the more likely the NHL recruit is good enough to get substantial ice time.)

In baseball, of course, we have statistics broken down by plate appearance or outs made, so playing time is accounted for. In hockey, though, without playing time numbers, these results are lower-bound estimates and may be substantially off. But, on that basis, and taking the numbers for what they're worth, this is pretty valuable information.

A Response to Win Shares: A Partial Defense of Linear Weights

In a new article written for this blog, Charlie Pavitt defends the Linear Weights system against some Bill James criticisms:

"One of the impressions that reading [Win Shares] left me with is the seemingly constant attacks Bill makes in this book against Pete Palmer’s Linear Weights player evaluation system ... I believe these attacks to be at least partly misguided, in that, at least implicitly, Pete’s system is attempting to do something quite a bit different than Bill’s system is."

Click for the full article, "A Response to Win Shares: A Partial Defense of Linear Weights."

For those of you not familiar with Charlie's work, he regularly writes reviews of sabermetric studies for "By the Numbers." (Click here, scroll down for current and back issues.) He also maintains an indispensable sabermetric bibliography.

Friday, August 11, 2006

Bill James for Nobel

Apparently, economists are starting to pay attention to sports.

In one sense, they always have; there’s always been talk about player salaries, and stadium deals, and luxury boxes, and profit and loss. But that’s the traditional part of economics, the money part. All that talk about employment and recessions and interest rates and GDP and stuff is pretty boring, even when the subject is sports.

The fun part of economics is the side that applies it to human behavior. Economist Steven E. Landsburg argued that all of economics is based on one principle: “people respond to incentives.” And there are all kinds of interesting non-monetary incentives out there.

For instance (as Landsburg explains in one of his books), you would think that the introduction of air bags and ABS in cars would reduce accidents. But accidents actually increased. The reason: with all the safety features protecting them, drivers have less incentive to drive carefully. And, more recently, Landsburg described a study of how women respond to the incentives for faking orgasm.

As well, Tim Harford has explained why his favorite restaurant has an incentive to hire rude waiters. And in Freakonomics, famed economist Steve Levitt explains his academic finding that real estate agents get higher prices when selling their own houses than when selling their clients’ houses. Again, that’s because of incentives. For an equivalent amount of effort, they keep 100% of any extra dollars on their own sale price, but only 5% of any extra on their client’s.

Until recently, there hasn’t been a whole lot of this kind of thinking applied to sports. The only older example I can think of is Bill James arguing in the 1985 Abstract that because of the low level of competition in the AL West, the teams there had less incentive to get rid of their mediocre players. After all, if 86 games is enough to win the pennant, there’s less need to take chances than if you need 95 games or more.

But now, the economic way of thinking is gaining ground. Levitt has written a paper on incentives to hit batters. Another Levitt paper shows that when a third referee was added to college basketball, the number of fouls dropped (since the probability of getting caught increased) -- but when a second referee was added to NHL games, the frequency of penalties didn't change (because the probability of getting caught remained the same).

And, of course, there’s this study on how Sumo wrestlers cheat when it’s in their interests to do so (see my summary here).

There’s also the field of “behavioral economics.” In one of my favorite chapters in “Basketball on Paper,” Dean Oliver talks about the famous “ultimatum game.” In that game, player A is given $10, and makes an ultimatum offer to player B about how to split it between them. If B agrees, the money is split. If B refuses, neither player gets anything.

In theory, A should offer B only a penny, and B should accept, since even a penny is better than nothing. But, in real life, players offered a penny (or even a dollar) refuse out of spite, feeling that fairness entitles them to a larger chunk of the $10.

Oliver parallels this game to a coach splitting playing time among his team. Players not receiving enough time (and glory) may rebel out of spite, putting in less effort and co-operation despite the negative effect it would have on their careers. The coach’s job, Oliver argues, is to manage the situation and keep players from feeling slighted by what is, in effect, the coach’s ultimatum.

So there’s three different aspects of economics so far – the boring money stuff, the fun incentive stuff, and the field of behavioral economics.

But now, there’s the possibility that economics may be branching into mainstream sabermetrics. “The Wages of Wins,” the recent book by three academic economists, starts out talking about salaries and attendance. But it quickly moves on to evaluating basketball players, and much of the book deals with formulas for measuring offensive productivity – no incentives, no dollar figures, just sabermetric analysis. I wouldn’t have expected this from economics, since it’s not about responses to incentives, but about the internal details of how to measure output. It’s as if an economic analysis of outsourcing suddenly started talking about what kind of computers make your customer service representatives in India most productive.

But, having said that, is there any other academic field more suitable for sabermetric work? Probably not. The actual nuts and bolts of sabermetrics consist of taking large databases containing the end results of human interactions, and trying to come up with theories and relationships that make the data meaningful. And that’s what economists do all the time. Whether it’s clutch batting statistics, accident rates with and without air bags, sumo wrestling decisions in different types of critical situations, or lists of real estate transactions in Chicago involving real estate agents’ own houses, the logic and math involved in doing the actual work are pretty much the same. If Retrosheet had decades’ worth of play-by-play traffic data instead of baseball data, sabermetricians could tell us as much about the value of performance tires as they can about the value of a stolen base.

So if this is a real phenomenon, instead of just a fad, we will see economists getting more and more into sports analysis over the next while. And if that happens, and academia eventually accepts sabermetrics as a worthy and legitimate branch of economics, there’s a good chance that Bill James could be in the running for a Nobel.

No, seriously, I mean it. You’ve got to admit that the body of Bill’s work is an amazing intellectual achievement. Is it as good as other Nobel work? My uneducated impression is that it certainly is. All that’s left is for some Nobel field to encompass this area of study in its purview. And, now, there’s a small but real chance it could be economics.

The Bill James Nobel is a long shot, sure, but it’s not impossible. If someone offered me, say, 500 to 1, I’d probably take it.

Thursday, August 10, 2006

NCAA home field advantage estimated within 14 points

How much blood can one study try to squeeze from a tiny little stone?

A lot. This academic study by Byron J. Gajewski, “There’s no Place Like Home: Estimating Intra-Conference Home Field Advantage Using a Bayesian Piecewise Linear Model,” tries to estimate home field advantage in Big 12 NCAA football – by using a sample of only 432 intra-conference games from 1996-2004.

With 432 games, the standard deviation of winning percentage for a .600 team is about .023. So, even if you observed a home winning percentage of .600, the 95% confidence interval would still be (.553, .647) – a pretty wide interval. So you’re not going to get all that much useful information from a sample of this size.
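The binomial arithmetic behind that interval, in Python (this is the standard formula, nothing specific to the study):

```python
import math

games = 432
p = 0.600                      # observed home winning percentage

# Standard deviation of a winning percentage over `games` games:
sd = math.sqrt(p * (1 - p) / games)
print(round(sd, 4))            # 0.0236

# 95% confidence interval: p +/- 1.96 standard deviations
lo, hi = p - 1.96 * sd, p + 1.96 * sd
print(round(lo, 3), round(hi, 3))  # 0.554 0.646
```

(The (.553, .647) quoted above reflects slightly coarser rounding of the standard deviation.)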

But the study is much more ambitious than even that – it tries to estimate a separate home field advantage (HFA) for each of the twelve teams. The assumption that each team has its own particular home field advantage is, I suppose, not unreasonable, but trying to find it off a sample of only 72 games per team seems like overreaching.

Further, those 72 games are played over nine years. If you assume that every team is going to have a different home field advantage, wouldn’t you also assume that it could vary from year to year, along with the players? This is college football, where there’s complete turnover every four years. Why assume that the 1996 Sooners will have the same HFA as the 2002 Sooners? The author calls the assumption “reasonable” because “the fan base is likely to be very stable,” making the implicit assumption (and I think I’ve seen at least one study disproving this for baseball) that HFA is a function of attendance.

There’s still more complexity. The study doesn’t just figure out the difference between a team’s home record and its road record. It actually tries to estimate each team’s intrinsic quality, and the quality of its opponent. And that’s hard to do. You can’t just take the season record, because of luck. You could take each season and regress it to the mean, but then you’re ignoring information about the previous or next season. For instance, a team that goes 6-2 is perhaps really a 5-3 team that got lucky -- but a team that goes 6-2 between 8-0 seasons might actually have 6-2 talent, or even 7-1.

The study chooses to solve this problem by fitting a straight line over the nine years, but allowing it to change direction at three fixed points. (That is, the best-fitting sequence of four connected straight-line segments, with no discontinuities.) This seems reasonable, but other, equally reasonable decisions could give substantially different results, especially with so few data points.

Finally, the author uses a Bayesian model and a simulation via “Markov Chain Monte Carlo” to get the final results. I’m not sure how this affects the conclusions, but some of them are unexpected. For instance, over the years of the study, Baylor was .167 (6-30) at home but .000 (0-36) on the road. A naive estimate of the home field advantage would be half of the difference, or .083. Gajewski’s method comes up with -.025, suggesting that Baylor was actually worse at home than on the road. (Part of the reason, presumably, is that they faced easier opponents at home, a fact the naive method wouldn’t consider.)
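For reference, the naive estimate is just half the gap between home and road winning percentages (a sketch of the method the study improves on):

```python
def naive_hfa(home_pct, road_pct):
    """Naive home field advantage: half the home-minus-road winning pct gap."""
    return (home_pct - road_pct) / 2

# Baylor, 1996-2004: 6-30 at home, 0-36 on the road
print(round(naive_hfa(6 / 36, 0 / 36), 3))  # 0.083
```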

Here’s the full list. The study’s numbers are approximate, as I had to read them off a graph.

Team ....... Naive estimate .. Study estimate

Baylor ........ .083 .......... -.02
Colorado ...... .069 ........... .02
Iowa State .... .069 ........... .15
Kansas State .. .069 ........... .03
Kansas ........ .097 ........... .10
Missouri ...... .083 ........... .12
Nebraska ...... .139 ........... .22
Oklahoma State. .083 ........... .15
Oklahoma ...... .069 ........... .08
Texas A&M ..... .097 ........... .11
Texas Tech .... .139 ........... .17
Texas ......... .111 ........... .05

While I admittedly don’t understand all of the Bayesian and Monte Carlo aspects of the study, I can’t imagine how this small amount of data, with so many variables, could yield anything close to an accurate estimate of anything.

And the study admits it. The estimates in the table above have very wide 90% error bars -- estimating from the graph, about .15 (almost exactly one touchdown) in each direction. The Oklahoma State home field advantage could be as low as zero points, or as high as 14 points. Which, really, isn’t anything we didn’t already know.

Wednesday, August 09, 2006

New Issue: By the Numbers

The just-released May issue of “By the Numbers,” the sabermetrics newsletter of the Society for American Baseball Research (SABR), is now available at my website. I am the editor, so I won’t be writing reviews of these articles, just these summaries:

“Academic Research: Pitcher Luck vs. Skill” by Charlie Pavitt: a review of two academic studies by Jim Albert, which talk about how to break down a pitcher’s record between luck and skill components.

“The Wages of Wins – Right Questions, Wrong Answers” by Phil Birnbaum (me): a review of the recent book on sports sabermetrics/econometrics.

“The Interleague Home Field Advantage” by Eric Callahan, Thomas J. Pfaff, and Brian Reynolds: a study that shows that the home field advantage in interleague games is significantly (in the baseball sense, not the statistical sense) higher than in other games.

“Best of the Ball Hawks” by Tom Hanrahan: a study that evaluates the best centerfielders of all time using Win Shares and the Baseball Prospectus ratings.

We are always looking for material for future issues – please e-mail me if interested in contributing.

Tuesday, August 08, 2006

How much is a slap shot worth? An empirical study

Back in 1986, Jeff Z. Klein and Karl-Eric Reif released a book called “The Klein and Reif Hockey Compendium.” Obviously modeled after The Bill James Baseball Abstract, it was entertaining and had a lot of numbers, but was not all that sabermetrically informative.

In that book, and again in the 2001 update (now simply called “The Hockey Compendium”), they introduced a goaltender rating stat called “perseverance.” The idea was that save percentage isn’t good enough – goalies who face many more shots per game will also face a higher caliber of shots, and their rating should be adjusted upward to account for their more difficult task.

It sounded reasonable, but the authors didn’t do any testing on it -- they chose their formula because it looked good to them. With the proper data, it would be pretty simple to actually investigate how save percentages change with number of shots, and create a formula that matches the empirical evidence. I’ve always thought that would be a good study to do.

But now, Alan Ryder has an awesome study that goes many steps better and makes the Klein/Reif method obsolete. It uses (extremely useful!) NHL play-by-play data (sample) to investigate shot stoppability not just by number of shots, but by type of shot and distance from the net.

Here’s what Ryder did. First, he found that there are five special kinds of shots where type and distance don’t matter much – empty net goals, penalty shots, very long shots, rebounds, and scramble shots (shots from less than six feet away that weren’t rebounds). For those, the shot is rated by the overall probability of a goal from that type – so an even-strength rebound, which went in 34.8 percent of the time, counts as 34.8 percent regardless of the details of the shot.

For all other shots – “normal” shots -- distance matters. At even strength, the chance of scoring on a 10-foot shot was 15% -- but from 20 feet out, the chance dropped to 10%. (All figures are for even strength – on the power play, “normal” shots are uniformly about 50% more effective.)

He also found that the probabilities varied for different types of shots. Only 6.7% of slapshots were goals, but over 20% of tip-ins went in.

(Ryder is quick to note that the data does not mean that players should change their shot selection based on these findings, implicitly acknowledging that players are likely choosing the shot most appropriate for the situation.)

Combining types and distances, Ryder came up with a graph of the chance of scoring on any combination of shot and distance, and then smoothed out the results. Unfortunately, we don’t get the full set of data, but we get a graph of relative probabilities. For instance, a slapshot is a bit above average in effectiveness (relative to other shot types at that distance) anywhere from 5 to 50 feet, but drops after that.

Having done all that groundwork, Ryder is now in a position to easily evaluate defenses and goaltenders. Basically, the best defense is one that keeps the opposition from taking more dangerous shots. It’s now possible to add up the probabilities of all shots taken, to see how many “expected goals” the defense allowed. For instance, if a team allows six shots, each with a 15% chance of scoring, it’s effectively yielded 0.9 of a statistical goal.

And, of course, you can now evaluate the goalies, too. If the offense’s shot probabilities added up to 4.4, but only four goals were scored, you can credit the goalie with the 0.4 goals saved. Or, as Ryder chooses to do it, you’d give him a “goaltending index” of 0.909, which is 4 divided by 4.4.
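The bookkeeping is simple once every shot has a probability attached (the numbers are from Ryder's examples above):

```python
def expected_goals(shot_probabilities):
    """'Statistical goals': the sum of each shot's scoring probability."""
    return sum(shot_probabilities)

def goaltending_index(actual_goals, exp_goals):
    """Ryder's index: actual goals divided by expected goals (lower is better)."""
    return actual_goals / exp_goals

# Six shots allowed, each with a 15% chance of scoring:
print(round(expected_goals([0.15] * 6), 1))    # 0.9

# Shots adding up to 4.4 expected goals, but only 4 actually scored:
print(round(goaltending_index(4, 4.4), 3))     # 0.909
```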

I’ll leave it to you to check out the study – which is very easy to read and understand – to find out who the best and worst goalies and defenses are. I’ll mention only one of Ryder’s examples. In 2002-03, the Rangers allowed 21 more goals than the Lightning. But, after adjusting for the types and distances of shots, it turns out that the Rangers’ goaltending was actually significantly better – but was more than made up for by a defense that allowed many more quality opportunities.

Ryder’s study is by far the best hockey study I’ve seen (subject to the disclaimer that I haven’t seen that many). My only concern is that, just as Ryder points out that not all shots are equal, it’s probably also true that not all 30-foot wrist shots are equal. There could be many other factors that affect that kind of shot – who the shooter is, whether it was a one-timer, whether the goalie is screened, whether the defense is out of position, and so forth.

This doesn’t affect a team’s overall rating (which is simply goals allowed), but it would affect the proportion of credit or blame to assign to the goaltender. If the goalie is faced with a lot of difficult 30-foot wrist shots, he will be underrated by this system. If the 30-foot wrist shots are easy, he will be overrated.

Is this a big factor? One way to find out would be to see if a goalie’s rating is reliable and consistent from year to year, especially when he changes teams. If it’s not, and to what extent it’s not, that would be evidence that defenses vary in ways that aren’t captured by shot type and distance alone.

Monday, August 07, 2006

A flawed competitive balance hockey study

Pitching has nothing to do with winning baseball games. I've proved it!

Here’s what I did: I ran a regression to predict team wins. I had ten independent variables: team home runs, batting average, OPS, winning percentage, sacrifice hits, ERA, manager experience, total average, total payroll, and pitcher strikeouts.

After running the regression, only one variable was significant: team winning percentage. The others weren’t significant at all. And, so, obviously, ERA and strikeouts have nothing to do with winning. I’ve proven it!

Well, of course, I haven’t, and the flaw is kind of obvious: the regression includes “winning percentage” as one of the independent variables. Winning percentage is almost exactly the same as the “wins” I’m trying to predict. In fact, the regression equation would work out close to:

Wins = (162 * winning percentage) + (0 * ERA) + (0 * batting average) + (0 * OPS) …

ERA has a lot of effect on wins, but it does so by changing winning percentage. In this case, winning percentage “absorbs” all the effects of ERA. That is, once you know winning percentage, knowing ERA doesn’t help you predict wins any better. A .600 team wins 96 games, regardless of how good its pitching staff was.
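A quick simulation shows the absorption effect (all numbers here are invented). In this synthetic league, ERA completely drives winning percentage, yet its regression coefficient comes out as zero once winning percentage is in the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic teams: ERA drives winning percentage, which determines wins.
era = rng.normal(4.0, 0.5, n)
wpct = 0.500 - 0.080 * (era - 4.0) + rng.normal(0, 0.02, n)
wins = 162 * wpct

# Regress wins on an intercept, winning percentage, and ERA.
X = np.column_stack([np.ones(n), wpct, era])
coefs, *_ = np.linalg.lstsq(X, wins, rcond=None)

# Winning percentage absorbs everything: coefficients come out ~ [0, 162, 0]
print(np.round(coefs, 4))
```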

(There is a statistical term for this effect – multicollinearity – when massive cross-correlations among your independent variables cause otherwise-significant variables to be absorbed by others.)

Suppose we try to cure the problem by getting rid of “winning percentage” and using “expected wins” (pythagoras) instead. Our correlation would still be very high, because pythagoras predicts wins very well. And, again, we’d wind up with ERA being insignificant, for the same reason – all of the information ERA gives you is already included in the information in "expected wins". A team that scores 500 runs and allows 450 will win about 90 games, regardless of its staff’s ERA.

One more try: let’s remove “expected wins,” and add separate variables for “runs scored” and “runs allowed.” The flaw is more subtle, but it’s still there. Our correlation will drop a bit because, while you can predict winning percentage from a combination of runs scored and allowed (at about 10 runs per win), it’s not as accurate as pythagoras. But the other variables will still wind up insignificant. And again, that’s because once you know a team’s runs scored and allowed, ERA does not give you any more information.

This last situation is the flaw in this hockey study by Tom Preissing and Aju J. Fenn.

Preissing (who is now an active NHL player) and Fenn tried to figure out what factors are predictive of competitive balance in the NHL (as measured by single-season clustering around .500). They included variables like free agency, the draft, the availability of European players, the existence of the WHA, and so on. Unfortunately, they also included variables for competitive balance in goals scored and allowed – which, as we saw, is a fatal flaw.

Since GS and GA directly cause winning percentage, most of the study’s other variables show up as insignificant. For instance, the amateur draft may have increased competitive balance measured in wins – but if it did, it would have done so by increasing competitive balance in goals scored, or in goals allowed.

That is, a league that has lots of variation in goals scored and lots of variation in goals allowed will have lots of variation in wins, regardless of whether there was an amateur draft or not.

Sadly, the flaw means we don’t really get any reliable information from the study. But it would sure be interesting to run it again, without those two variables.

(Thanks to Tangotiger for the pointer.)

Friday, August 04, 2006

"Driving the green" among best predictors of PGA success

Golf Digest reports some real statistical research from the PGA itself:

"… five [variables] have clearly emerged as leading indicators and predictors of success: "birdie average," "par breakers," "par-5 scoring average," "par-5 birdie percentage" and "going for the green" (the percentage of times a player tries to drive a par 4 or hit a par 5 in two.) In these stats in 2004, the worst ranking recorded by any of the top five players in the world--Vijay Singh, Tiger Woods, Ernie Els, Retief Goosen and Phil Mickelson--was eighth (Goosen in par breakers and Lefty in par-5 birdie percentage)."

Four of their top five categories relate to scoring, which tells us nothing. The finding that golfers who score low are more successful is not a useful piece of knowledge – it’s a tautology.

The useful and surprising information is the fifth category, “going for the green.” This isn’t the number of times the golfer is successful, but simply the number of times he tries. This suggests that driving the green is a good strategy.

It further implies that long drives are very important, and that “drive for show, putt for dough” might not be as accurate as we thought.

Of course, it could be that golfers who try to drive the green win not because it’s a good strategy for everyone, but that they’re simply good at it. But still, it's something to think about for other players.

Thanks to John Matthew IV for the pointer.

Thursday, August 03, 2006

Can we measure player improvement over the decades?

Conventional wisdom is that baseball players are getting better and better over the decades. How can we know if that’s true? 

We can’t go by hitting stats, because the pitchers are improving just as much as the batters. We can’t go by pitching stats, either, because the batters are improving just as much as the pitchers. It could be that players have improved so much that if Babe Ruth came back today, he’d only hit like, say, Tino Martinez, or maybe Raul Mondesi. But can we prove that? 

One way to measure improvement is to look at evidence that doesn’t involve other players. If the Babe regularly hit a 90-mph fastball 400 feet, but Martinez and Mondesi don’t, that’s good evidence that Ruth is better. If the best pitchers in 1960 could throw fastballs at 95 miles per hour, and that’s still their speed today, that might be evidence that there hasn’t been much improvement. But, for one thing, we don’t have enough data on pitch speeds or bat speeds to make valid comparisons. For another, it’s hard to compare intangibles, like the average deceptive movement on a slider today versus 15 years ago.

But it would be nice if there were a way to measure improvement just from the statistics, from the play-by-play data.  One such attempt to measure batter improvement was a study by Dick Cramer in SABR’s 1980 “Baseball Research Journal”. For every year in baseball history, Cramer compared every player’s performance this year to the same players’ performance last year. Cramer found, for instance, that the same group of batters hit .004 worse in 1957 than they did in 1956. Therefore, the estimate is that baseball players in general had improved by four points in 1957. 
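Cramer’s matched-player comparison is easy to sketch in a few lines of Python. The batting averages below are invented for illustration, not the real 1956–57 figures:

```python
# Cramer's method: for each player who appeared in both seasons, take the
# change in batting average, average those changes, and flip the sign --
# if the same hitters did worse, the league is inferred to have improved.

batting = {
    # player: (first-season avg, second-season avg) -- made-up numbers
    "Player A": (0.280, 0.275),
    "Player B": (0.310, 0.305),
    "Player C": (0.250, 0.248),
}

changes = [y2 - y1 for (y1, y2) in batting.values()]
avg_change = sum(changes) / len(changes)   # -0.004: same hitters hit worse

league_improvement = -avg_change           # inferred improvement: +0.004
```

With these made-up figures, the method concludes the league improved by four points of batting average, just like the 1956-to-1957 example above.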

Of course, there’s a fair bit of randomness in this measure, and so when Cramer graphs it over the years, there are lots of little ups and downs. In general, the line rises a bit every year, but, like a stock price graph, there are lots of little adjustments. In fact, the method works quite well in identifying those seasons when there were reasons for the level of play getting better or worse. There’s a sharp drop in the mid-1940s, when many players went off to war, and an increase in 1946. There are drops in 1961 and 1977 for expansion. 

But while the method may look like it works, it doesn’t. As Bill James explained, the study doesn’t take into account that players’ performances change naturally with age. A group of 26-year-olds will play better, on average, than they did at 25 – not because the league is worse, but because those players themselves are better. And the reverse is true for 36-year-olds; they’ll play worse than when they were 35, independent of the rest of the league. 

Now, since the method covers all players, not just one age, you’d think it would all cancel out. But it doesn’t, necessarily. There’s no guarantee that if you take all the players in the league one year, and compare their performance the next year, the age-related performance change will be zero. And it has to be exactly zero for this to work – even a small deviation from zero biases the study. So when you compare 1974 to 1973, and find a difference of 2 points in batting average, that’s the sum of two things: 

-- The improvement in the league, plus 

-- The age-related improvement/decline in the players studied. 

And there’s no easy way to separate the two. However, if you assume that the age-related term is constant, you can see the year-to-year relative changes. That’s why the method correctly shows a relative decline during the war and a relative improvement after. You can see which years had the biggest changes, but you can’t figure out how much any change actually was, or even whether it was an improvement or a decline. On page 440 of the first Bill James Historical Baseball Abstract (the 1986 hardcover), James gives a long discussion of the method and why it doesn’t work. He wrote:

"Suppose you were surveying a long strip of land with a surveying instrument which was mismarked, somehow, so that when you were actually looking downward at an 8% grade, you thought you were looking level. As you moved across the strip of land, what would you find? Right. You would wind up reporting that the land was rising steadily, at a very even pace over a large area. You would report real hills and real valleys, but you would report these things as occurring along an illusory slope.

"And what does Cramer report? Average batting skill improving constantly at a remarkably even rate, over a long period of time."

Cramer’s graph is a jagged, irregular line, at about 45 degrees. What James is saying is that the jaggedness is correct, but the 45 degrees is likely wrong. Imagine rotating the jagged line so that it’s at 60 degrees, which means a stronger batting improvement than Cramer thought. That’s consistent with the data. Rotate it down so it’s now at 20 degrees, which means slight improvement. That’s also consistent. Rotate it down to zero degrees, so it’s roughly horizontal, meaning no improvement. That’s still consistent. And even rotate it so it slopes down, meaning a gradual decline in batting ability over time. That, also, is perfectly consistent with the statistical record. 

Here’s a hypothetical situation showing how this might work. Suppose that every player enters the league with 15 "points" of talent. This declines by one point every year of the player’s career – he retires 15 years later, after he’s dropped to zero. If you were to draw this as a graph, it would be a straight line going down at 45 degrees – the older the player, the worse the talent. Now, that curve is the same every year, so it’s obvious that the talent level of the league is absolutely steady from year to year – right? But if you look at any player who is in the league in consecutive seasons, his talent drops by one point. By Cramer’s logic, it looks like the league is improving by one point a year. But it isn’t – it’s standing still! 

If you don’t like the declining talent curve, here’s one that’s more realistic. Suppose that again, players come into the league at 15 talent points. But this time, they rise slowly to 25 talent points at age 27, then decline down to zero, at which point they retire after a 16-year career. The exact same analysis holds. If you were to actually put some numbers in, and look at the players who play consecutive seasons, you would find that, on average, they drop one point a year. (Mathematically, if every player declines from 15 to zero in fifteen years, the average must be minus one per year.) It still looks like the league is improving, when in fact it’s standing still. 
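The arithmetic in that second example is easy to verify with a quick simulation. This sketch uses the hypothetical talent curve just described (enter at 15 points, peak at 25, retire at 0 after a 16-year career) in a steady-state league with one player at each career stage:

```python
# A steady-state league: one player at each stage of a 16-year career.
# Talent enters at 15, rises to a peak of 25, then falls to 0 at
# retirement -- the hypothetical curve from the example, not real data.

def talent_curve():
    rise = [15 + 10 * i / 7 for i in range(8)]     # seasons 1-8: 15 up to 25
    fall = [25 - 25 * i / 8 for i in range(1, 9)]  # seasons 9-16: 25 down to 0
    return rise + fall

curve = talent_curve()                             # 16 career seasons

# The league looks identical every year (one player per career stage),
# so its true average talent never changes.
true_league_average = sum(curve) / len(curve)

# Cramer's method: average the year-to-year change of returning players.
changes = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
avg_change = sum(changes) / len(changes)           # -1.0 points per season

cramer_estimate = -avg_change                      # apparent "improvement": +1
```

Even though `true_league_average` is identical season after season, the returning players decline by an average of one point a year, so Cramer’s method reports a league improving by one point a year – exactly the illusion described above.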

It’s easy to come up with similar examples where the returning players decline while the league is also declining, or where the returning players improve even while the league is getting better. The year-to-year change in the same players simply doesn’t pin down what the league as a whole is doing. 

The bottom line is that even if players appear to be declining, it does not mean the league is improving -- or vice versa. 

Why do I mention this now, 26 years after Cramer’s study? Because this year, in the excellent Baseball Prospectus book “Baseball Between the Numbers,” Nate Silver repeats the same study in Chapter 1. Silver figures that if Babe Ruth played in the modern era, he’d look like – yes, Tino Martinez or Raul Mondesi. For personal, non-empirical, hero-worshipping reasons, I’m glad that’s not necessarily true.

Wednesday, August 02, 2006


My friend John Matthew IV e-mails me, "Why limit your blog to sports?"

Effect of batting order and inning on scoring

Retrosheet guru David Smith, looking at tons of play-by-play data, breaks down scoring by inning, and by which batter led off the inning.

Among many other findings, David’s study confirms Bill James’ observations that

-- the first inning is the highest scoring
-- the second inning is the lowest scoring
-- taken together, the first and second inning are below average.

Tuesday, August 01, 2006

Southern pitcher update: few data points

Over at Baseball Think Factory, where they have a thread on yesterday’s post here about Tom Timmerman's paper on Southern pitchers, poster Swedish Chef (I’ll call him SC) alertly notes that the Southern pitcher data is based on very few hit batsmen (search for comment 14 in the above thread).

SC notes that there were originally 20,357 HBPs in the study. The chart on page 27 shows that of those, only 1.9% of them were when the previous batter hit a home run. Also, only 25.7% of HBPs were by pitchers from the South, and only 29.3% of HBPs were against black batters.

Assuming these are “pretty independent variables,” you can multiply them all out and get 29. SC concludes that there are only

"about 30 HBP on black players by southern pitchers and about 90 by non-southern pitchers after a HR (… these numbers aren't in the paper, why not?)."
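SC’s back-of-the-envelope multiplication is easy to reproduce, under his assumption that the three conditions are roughly independent:

```python
# Reproducing Swedish Chef's sample-size estimate, assuming the three
# conditions (after a HR, Southern pitcher, black batter) are independent.

total_hbp = 20357
p_after_hr = 0.019     # previous batter hit a home run
p_southern = 0.257     # pitcher from the South
p_black = 0.293        # batter is black

southern = total_hbp * p_after_hr * p_southern * p_black
non_southern = total_hbp * p_after_hr * (1 - p_southern) * p_black

print(round(southern), round(non_southern))  # roughly 29 and 84
```

That’s about 30 HBP of black batters after a home run by Southern pitchers, and somewhere under 90 by non-Southern pitchers – tiny samples either way.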
Timmerman's study concluded that pitchers from the South are much less likely to bean black batters. But these small sample sizes cast a bit more doubt on the study's conclusions. Even though the results in the paper are said to be statistically significant, could the significance be driven by just a couple of pitchers?

More data would help.

Thanks to Swedish Chef for the catch.

Basketball study: should you foul in the dying seconds when three points ahead?

Your basketball team is up by three points with only a few seconds left to play. The other team has the ball. Which defensive strategy do you choose?

Strategy 1: play defense and try to prevent the other team from sinking a three point shot.

Strategy 2: immediately foul the other team, giving them two foul shots. They will try to sink the first. But they won’t want to make the second, because doing so would leave them down one point while surrendering possession, effectively losing the game. Instead, they will deliberately miss the second shot, hoping to get the offensive rebound and tip in a two-pointer to tie the game. (If they missed the first foul shot, they will try for a desperation three-pointer after the deliberate miss.)

In his study "Optimal End-Game Strategy in Basketball," author David H. Annis compares the chances of winning with each of the two strategies. The method is pretty simple: Annis creates two “decision trees,” which are flow charts of how the play will go depending on what happens, and calculates the probabilities from the charts. It’s a bit dry while he goes through the theoretical formulas, but it picks up once he assigns actual probabilities. For his example, he uses:

Probability of making a free throw = 0.75;
Probability of getting an offensive rebound after deliberately missing a free throw = 0.15;
Probability of a two-point tip-in after getting the offensive rebound = 0.7;
Probability of successfully sinking a desperation three-pointer after an offensive rebound with almost zero time left on the clock = 0.1;
Probability of successfully sinking a non-desperation three-pointer if not fouled = 0.25;
Probability that the offense will be unable to even attempt a non-desperation three point shot = 0.1;
Probability that the defense will get the rebound after a missed three-point shot = 0.7.

Under these assumptions, the probability of winning if you foul the other team is .9588. The probability of winning if you don’t foul them is much less -- .8661. So you should foul.
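The fouling branch of the calculation can be reconstructed directly from the probabilities above. This is my own sketch of the decision tree, not Annis’s exact formulas: I’ve assumed overtime is a 50/50 proposition (the paper’s treatment may differ), and my simplified don’t-foul branch won’t exactly reproduce the .8661 figure.

```python
# Decision-tree sketch of the end-game calculation, using Annis's inputs.
# Assumption (not stated in the post): overtime is a 50/50 proposition.

p_ft = 0.75      # make a free throw
p_oreb = 0.15    # offensive rebound off a deliberate miss
p_tip = 0.7      # two-point tip-in after that rebound
p_desp3 = 0.1    # desperation three after an offensive rebound
p_three = 0.25   # non-desperation three if not fouled
p_noshot = 0.1   # offense can't even attempt a three
p_dreb = 0.7     # defense rebounds a missed three
p_ot = 0.5       # assumed chance of winning in overtime

# Strategy 2 (foul): if FT1 is good, only rebound-and-tip ties the game;
# if FT1 misses, only rebound-and-desperation-three ties it.
win_if_ft1_good = 1 - p_oreb * p_tip * p_ot
win_if_ft1_missed = 1 - p_oreb * p_desp3 * p_ot
p_win_foul = p_ft * win_if_ft1_good + (1 - p_ft) * win_if_ft1_missed

# Strategy 1 (defend the three), simplified: one shot, one rebound chance.
win_after_miss = p_dreb + (1 - p_dreb) * (1 - p_desp3 * p_ot)
p_win_defend = p_noshot + (1 - p_noshot) * (
    p_three * p_ot + (1 - p_three) * win_after_miss)

print(p_win_foul)    # about 0.9588, matching the paper
print(p_win_defend)  # about 0.877 in this simplified tree
```

The fouling branch lands exactly on the paper’s .9588; my cruder don’t-foul branch comes out a bit above Annis’s .8661, but either way fouling wins by a wide margin.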

The author concludes, “for virtually any reasonable values of these probabilities,” fouling is the better strategy.


Addendum: In the comments, "alan r." points to a nice study by Adrian Lawhorn. Lawhorn examines the same question, does a similar calculation, and comes to conclusions that are roughly equivalent.

One difference is that Lawhorn's non-foul probability works out to 0.8 instead of 0.8661. I think that's partly because he doesn't consider the possibility that the offense can't get a shot off (which Annis estimated at 10%), and partly because (to his credit) he used actual game data from that situation.