Sabermetric Research: March 2012

Tuesday, March 27, 2012

How stable is a baseball player's talent?

Last post showed that players fluctuate a fair bit, year-to-year, in their talent for free-throw shooting and field goal shooting.

What about baseball? How much does batting average talent vary from year to year?

I used the same method that I did in the NBA case, but I used a lot more data: specifically, every player in MLB from 1973 (compared to 1974) to 2008 (compared to 2009). Players were included if they had at least 100 AB both years.

Going through the numbers real quick (skip to the bolded part if you want):

For each player, I calculated the Z-score of his year-to-year difference, relative to what you'd expect if it were just binomial variation. I adjusted for league changes (thanks to a suggestion from Tango).

The SD of the Z-scores was 1.04875.

Since the SD (of the Z-scores) from luck should have been 1.000, the SD (of the Z-scores) from talent is the square root of [1.04875 squared minus 1 squared], which works out to 0.316.

Assuming a typical count of 400 AB, the SD (of the difference) from luck should have been 0.0313. That means the SD (of the difference) from talent is 0.316 multiplied by 0.0313, which works out to .0099. So:

**The SD of the change in an MLB player's batting average talent, from year to year, is about 10 points.**

(As before, keep in mind that "talent" actually refers to anything other than binomial variation -- injuries, changes to park, differences in batting opportunities with runners on base, and so on.)
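As a sanity check on that kind of estimate, here is a minimal simulation sketch. It is not the calculation used above: it just invents players, gives each a talent change drawn with an SD of 10 points, and looks at the SD of the resulting Z-scores. The .270 baseline average, the 400 AB, the player count, and the function name are all assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_z_sd(n_players, at_bats, base_avg, talent_change_sd, seed=0):
    """SD of year-to-year Z-scores for simulated hitters (hypothetical setup)."""
    rng = random.Random(seed)
    z_scores = []
    for _ in range(n_players):
        p1 = base_avg
        p2 = base_avg + rng.gauss(0.0, talent_change_sd)  # true talent shifts
        # Each at-bat is an independent binomial trial.
        hits1 = sum(rng.random() < p1 for _ in range(at_bats))
        hits2 = sum(rng.random() < p2 for _ in range(at_bats))
        avg1, avg2 = hits1 / at_bats, hits2 / at_bats
        # Binomial SD of the difference, from the observed averages.
        sd_diff = math.sqrt(avg1 * (1 - avg1) / at_bats +
                            avg2 * (1 - avg2) / at_bats)
        z_scores.append((avg2 - avg1) / sd_diff)
    return statistics.pstdev(z_scores)

luck_only = simulate_z_sd(6000, 400, 0.270, 0.0)
with_talent = simulate_z_sd(6000, 400, 0.270, 0.010)
print(f"SD of Z-scores, luck only:          {luck_only:.3f}")
print(f"SD of Z-scores, 10-point talent SD: {with_talent:.3f}")
```

With no talent change, the SD of the Z-scores should land near 1. With a 10-point talent SD it should come out a little higher, in the neighborhood of the 1.05 observed in the real data, though any single run will wobble a bit.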

I repeated the calculation, but this time only for players who had 400 AB both seasons (the average in this sample was 534 AB the first year, and 532 the second year). This time, the SD of the change was only about 7 points.

That makes sense -- players who played full-time two years in a row were probably pretty good both years, which means large changes in talent are left out of the sample. Still, it's nice to see that it worked out as expected.

Labels: baseball, distribution of talent, luck

Monday, March 26, 2012

Which NBA talent is less stable: free throws or field goals?

Last post, I wondered about why NBA players' free-throw percentage seems to fluctuate so much. If you start with the hypothesis that changes are all just luck, you'd expect at most about one of the 50 players to be more than 2.5 SD from zero. But, from 2004-05 to 2005-06, there were 9 such players. So, clearly, it can't be just luck. Talent must be changing too.

Since then, commenter "doc" was kind enough to e-mail me similar numbers for field goal attempts. My hypothesis was that the talent change for FG should be bigger than the talent change for FT, because FG requires a bigger set of skills.

The results surprised me.

Let me take you through the math of how I did this, just so you can make sure I got it right. If you don't care, skip to the bold parts and the last section.

-----

Start, for instance, with Allen Iverson. In ~~2004-05~~ 2005-06, Iverson shot .447 in 1,822 FG attempts. The binomial SD of that is the square root of (.447 * (1-.447) / 1822), which works out to .01165.

In ~~2005-06~~ 2004-05, Iverson was .424 in 1,818 attempts. That's an SD of .01159.

The SD of the *difference* between the two seasons is the square root of (.01165 squared + .01159 squared), which works out to .0164.

Iverson's actual difference between the two seasons was .023. Divide .023 by .0164, and you get 1.400. So, Iverson's Z-score for the difference is 1.4.
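For anyone who wants to check the arithmetic, here is a small Python sketch of the Iverson calculation. The function names are made up for illustration; the inputs are the figures quoted above.

```python
import math

def binomial_sd(pct, attempts):
    """Binomial SD of a shooting percentage over a given number of attempts."""
    return math.sqrt(pct * (1 - pct) / attempts)

def z_score_of_change(pct_a, att_a, pct_b, att_b):
    """Z-score of the change from season A to season B, assuming pure luck."""
    sd_diff = math.sqrt(binomial_sd(pct_a, att_a) ** 2 +
                        binomial_sd(pct_b, att_b) ** 2)
    return (pct_b - pct_a) / sd_diff

# Iverson: .424 in 1,818 attempts (2004-05), then .447 in 1,822 (2005-06).
sd_0506 = binomial_sd(0.447, 1822)
sd_0405 = binomial_sd(0.424, 1818)
z = z_score_of_change(0.424, 1818, 0.447, 1822)

print(f"SD 2005-06: {sd_0506:.5f}")   # about .01165
print(f"SD 2004-05: {sd_0405:.5f}")   # about .01159
print(f"Z-score:    {z:.2f}")         # about 1.40
```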

Repeating this for all players in Doc's sample, we wind up with 120 separate Z-scores. If the differences were all luck, we'd expect that if we took the standard deviation of those 120 numbers, we'd get something very close to 1.00. Instead, we get an SD of 2.18.

----

So, FG shooters change year-to-year by an SD of 2.18 "luck units". Since we know 1.00 of those units are actually luck, that leaves 1.94 units of talent change (since 2.18 squared minus 1.00 squared equals 1.94 squared).

What does 1.94 units mean in real life? Well, the typical player in Doc's sample went around 45% in 930 attempts. So, the SD from luck would have been .0231 (or 2.31 percentage points). Multiply that by 1.94, and you get .045. So:

**-- An NBA player's FG% talent changes from year-to-year with an SD of 4.5 percentage points.**
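The FG% steps can be sketched in a few lines of Python. One assumption to flag: the .0231 luck figure appears to be the SD of the *difference* between two seasons of equal length, so the sketch multiplies the single-season variance by two. The function names are illustrative only.

```python
import math

def talent_units(observed_z_sd, luck_z_sd=1.0):
    """Talent share of the Z-score SD, assuming luck and talent are independent."""
    return math.sqrt(observed_z_sd ** 2 - luck_z_sd ** 2)

def luck_sd_of_difference(pct, attempts):
    """Binomial SD of the difference between two seasons with the same
    percentage and the same number of attempts (an assumption)."""
    return math.sqrt(2 * pct * (1 - pct) / attempts)

fg_units = talent_units(2.18)                # about 1.94 "luck units"
fg_luck = luck_sd_of_difference(0.45, 930)   # about .0231
fg_talent_sd = fg_units * fg_luck            # about .045

print(f"talent units:      {fg_units:.2f}")
print(f"luck SD of change: {fg_luck:.4f}")
print(f"talent SD:         {fg_talent_sd:.3f}")
```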

-----

Now, let's do the same for FT%, so we can compare.

FT shooters change year-to-year by an SD of 1.87 "luck units". That leaves 1.58 units of talent change.

The typical foul shooter went 77% in 471 attempts. So, the SD from luck would have been .0274. Multiply that by 1.58, and you get .043. So:

**-- An NBA player's FT% talent changes from year-to-year with an SD of 4.3 percentage points.**

-----

They're almost exactly the same!

Part of the reason I wouldn't have expected that is that for FG attempts, what we're calling "talent" isn't really just talent. It's "everything except binomial luck." So, it also includes changes in quality of opposition, quality of teammates, role on the team, ratio of 2-point and 3-point tries, and so on -- actually, quality of shot attempts. Since we're really measuring the sum of two variances -- talent, and shot quality -- we'd expect it to be higher than just for FT attempts.

Another reason I expected FG to be higher is that there might be some selective sampling involved in the FG case: a player who has a really bad year (perhaps by luck) might not play again next year. That would remove a bunch of outliers. But, the average player in the sample actually declined the second year, by 0.7 percentage points, so it doesn't look like that's it.

On the other hand, part of the reason could be that FG percentages are lower than FT percentages. For FG, we have 4.5 percentage points out of 45. For FT, we have 4.3 percentage points out of 77. So, looking at it that way, the talent change for FG is 10% either way, but for FT, it's less than 6% either way.

What do you guys think?

-----

P.S. I ran the same numbers for batting average in MLB. I'll save that for a future post.

UPDATE: commenter bsball points out that FG% includes both 2-point and 3-point attempts. So, that's another way shot quality is affected.

Indeed, this might be a big effect. Iverson took 338 three-point tries in 2004-05, but only 223 of them in 2005-06.

I've updated the post. Also, I corrected where I had inadvertently reversed the two seasons.

Labels: basketball, luck, NBA

Sunday, March 18, 2012

Why does free throw percentage fluctuate so much?

Typically, we assume that a player's talent doesn't vary much from year to year, except perhaps because of injury and aging. When an established hitter drops, say, from .320 to .280, while in the prime of his career, we usually assume most of it is luck. It might be just randomness of outcome, perhaps in combination with random circumstances, like differences in the opposition he faced. On occasion, we might consider the possibility that the player changed his approach or technique.

If you were to choose a skill that doesn't depend much on circumstances or technique, it might be NBA free throws. Every shot is from exactly the same place, and it's something the player has been doing since childhood. You wouldn't expect a player's foul shooting talent to jump around a lot.

But ... it seems that it does.

I took the 50 NBA shooters with the most attempts in 2004-2005, and checked how they did the next season, 2005-2006. I expected they'd be roughly the same. Specifically, I expected that if I converted each difference to a Z-score, they'd form a bell curve with mean 0 and SD about 1. I say "about" 1 rather than exactly 1, because there are probably slight changes due to injuries, aging, etc. There's also a selective sampling issue, since I chose the players with the most attempts. (E-mail me for the spreadsheet -- I had to do it manually.)

But, the SD of the Z-scores was much more than 1.00. It was 1.87.

The biggest changes were almost all declines. Desmond Mason went from .802 to .682 (4.8 SD). Tyson Chandler dropped from .673 to .503 (4.7 SD). Drew Gooden, Stephon Marbury, Jalen Rose, Mehmet Okur, and Pau Gasol also had big declines, 2.5 SD or more.

Only Yao Ming (.783 to .853, 2.7 SD) and Mike Bibby (2.5 SD) had improvements above 2.5.

Looking at these players' careers, you see that there's definitely some luck involved: many of them went from "too high" relative to the rest of their careers, to "too low". A couple of them changed (roughly) permanently -- Yao, for instance.

The scale for free throws isn't that much different than the scale for batting averages. For Mason, imagine a player who hits .302 in 420 AB one season, and then hits .182 in 524 AB the next. That can happen, but it doesn't happen that often (even taking into account that MLB players are unlikely to get 524 AB while hitting .182). Furthermore, whatever reasons you can think of that it would happen in MLB -- injury, conditioning, age -- you'd think wouldn't really apply to foul shooting.

None of the changes are that extreme, taken alone: the issue is that there are *too many of them* for it just to be binomial randomness. A Z-score of 3.0 or more (in either direction) should happen only about one in 370 times, by chance. In this sample, it happened 4 times out of 50.
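Here is a rough way to put a number on "too many". The sketch below assumes the Z-scores are normal and that the 50 players are independent of each other (neither is exactly true), and the function names are just for illustration.

```python
import math

def two_sided_tail(z):
    """P(|Z| >= z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))

def prob_at_least(k, n, p):
    """P(X >= k) when X is binomial(n, p)."""
    below = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
                for i in range(k))
    return 1.0 - below

p3 = two_sided_tail(3.0)                  # chance one player is 3+ SD out
expected_in_50 = 50 * p3                  # expected count in a sample of 50
p_four_or_more = prob_at_least(4, 50, p3)

print(f"P(|Z| >= 3):              {p3:.5f} (one in {1 / p3:.0f})")
print(f"Expected among 50:        {expected_in_50:.2f}")
print(f"P(4 or more of 50):       {p_four_or_more:.2e}")
```

Under pure luck you'd expect well under one such player in a sample of 50, so seeing four is very hard to explain by chance alone.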

Let me make one more adjustment, for a "season effect". It turns out the overall league average FT% dropped by .011 between those two seasons. I don't know if that was cause or effect -- but, in any case, if I adjust every player by that .011, the scale is now balanced between improvers and decliners. But the overall SD stays the same (actually, it increases slightly, from 1.87 to 1.88). Also, the three biggest changes are still all decliners, and those three (Mason, Chandler, and Gooden) are at 4.4, 4.4, and 3.7 SD, respectively.

So, what's going on? Do players' free throw talents fluctuate that much? If so, what does that say about more complicated talents?

---

P.S. This post was inspired by a discussion about correlation at Sport Skeptic, here.

Labels: basketball, free throws, luck, NBA

Thursday, March 15, 2012

Stop revering doctors

Warning: Non-sports post. And, this is only tangentially related to the recent "should employers' insurance companies have to pay for employees' birth control" controversy, or Rush Limbaugh's response. My arguments do not depend on which side of that debate is correct. They apply to only one small aspect. (If you're not familiar with the debate, Google "Sandra Fluke".)

----

A little while ago, Steven E. Landsburg decided to post about some of the economic issues surrounding the question of employer coverage of contraceptives. He was critical of Fluke for demanding free birth control without providing a good economic reason why. He then got in trouble from the president of the University of Rochester, where he teaches.

Afterward, reporters started calling him for comment. One reporter asked him,

"Do you think that “reasons” accepted by an economist deserve more weight and respect than “reasons” a medical doctor might have for recommending that birth control be universally covered by medical insurance?"

Landsburg replied,

Yes, absolutely. Here’s why: Economists are trained to look at all the consequences of a decision before passing judgment; doctors tend to focus only on some kinds of consequences (those directly related to health) while ignoring others (for example, the many other effects that flow from raising people’s taxes or insurance premiums). ...

Economists have thought long and hard about how to make sure we do that. We don’t always get it right, but at least we’ve got a framework for it. Doctors don’t.

Landsburg is, of course, absolutely correct.

Why would this reporter, or anyone, think that a doctor was as qualified to talk about an economic question as an economist is?

Because we respect doctors too much. We respect them well beyond the scope of their expertise. We somehow think that they're better and smarter than the rest of us, and we have this unspoken feeling that their opinions have extra weight, because of their higher class and status.

It's certainly not that we need their subject matter expertise to argue the question. Because, suppose Sandra Fluke had stepped up before Congress to demand that the car companies include oil changes in their warranty coverage. And economists disagreed that that was a good idea. And suppose the reporter had asked,

"Do you think that “reasons” accepted by an economist deserve more weight and respect than “reasons” an auto mechanic might have for recommending that oil changes be universally covered by warranty?"

That would be laughable. It's equally laughable when it's a doctor.

I think it's all a matter of pecking order. Doctors have high status, and auto mechanics have low status.

Economists have high status too, but significantly lower status than doctors. To see that, imagine the situation reversed, where Landsburg said something about medicine -- say, about why birth control works, biochemically -- and a doctor corrected him. And the reporter asked the doctor,

"Do you think that “reasons” accepted by a doctor deserve more weight and respect than “reasons” an economist might have for understanding how birth control works, biochemically?"

That question answers itself. The other one should have, too.

-----

P.S. A previous post about doctors overstepping their expertise is here.

Labels: doctors

Tuesday, March 13, 2012

An economist predicts the Olympic medal standings: summer edition

Dan Johnson, a professor at Colorado College, did a regression to predict medal wins at the Olympics for a given country. There are articles about it in various newspapers.

I wrote about this a couple of years ago, when he did the same thing for the Winter Olympics. At the time, I was a little skeptical. Nothing much has changed.

Except ... this time we have the equation! The National Post was kind enough to provide it in the print edition.

Here it is:

Medals = 0.33 +
.00271 * total medals available +
.02 * income +
.0000549 * income squared +
.024 * population +
19.02 * population squared +
11.91 if it's the home nation +
3.85 if it'll be the next home nation +
3.35 if it was the home nation last time or the time before +
0.29 if it borders the current home nation +
coefficients for dummy variables for nation

Apparently, the dummy variables are new -- they represent a

... "'cultural specific factor' to account for things that are hard to quantify, like the prevalence of doping or the culture of competition, and also to correct the historical under-predictions for countries such as China and Australia."

The dummy variables should make the predictions even more accurate.

BTW, an interesting thing about the regression is that if you have a dummy variable for each country, your accuracy should be ~~the same~~ very similar even if you don't include the income and population variables! Those variables are nice to have because they illustrate the effects of income and population, but if you leave them out, the dummy variables will pick up the slack and adjust themselves for the income and population of each country. (UPDATE: it won't be exactly the same, just close. See the bottom of the post for an explanation.)

(The Post story also says that Johnson updated his previous regression to remove variables for climate and politics. By the same token, I think he'd get the same results if he left them in.)

So I think that if you leave out income and population, you'll get the same coefficients for everything except the dummy variables:

Medals = 0.33 +
.00271 * total medals available +
11.91 if it's the home nation +
3.85 if it'll be the next home nation +
3.35 if it was the home nation last time or the time before +
0.29 if it borders the current home nation +
new coefficients for dummy variables for nation

I don't think this is the best way to account for the variables still included. That's because (as I think I said in the other post) all the factors are linear. I'm not sure that's right: it implies that if the USA is the home team, it gains 12 medals on top of the 100 or so it usually wins, but if Canada is the home team, they also gain 12 medals (to go with their 15 or so).
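To see the additivity concretely, here is the published equation written as a Python function. The units for income and population aren't given in the article, so the input values below are pure placeholders, not real data, and the `nation_effect` argument stands in for the unpublished country dummy coefficients. The function name and parameter names are mine.

```python
def predicted_medals(total_medals, income, population, nation_effect=0.0,
                     is_home=False, is_next_home=False,
                     was_recent_home=False, borders_home=False):
    """Transcription of the equation as printed; units are unknown."""
    return (0.33
            + 0.00271 * total_medals
            + 0.02 * income
            + 0.0000549 * income ** 2
            + 0.024 * population
            + 19.02 * population ** 2
            + 11.91 * is_home
            + 3.85 * is_next_home
            + 3.35 * was_recent_home
            + 0.29 * borders_home
            + nation_effect)

# Two made-up countries: one "big" and one "small" (placeholder inputs).
away_big = predicted_medals(900, 45, 0.3, nation_effect=60)
home_big = predicted_medals(900, 45, 0.3, nation_effect=60, is_home=True)
away_small = predicted_medals(900, 40, 0.03, nation_effect=5)
home_small = predicted_medals(900, 40, 0.03, nation_effect=5, is_home=True)

print(f"home bonus, big country:   {home_big - away_big:.2f}")
print(f"home bonus, small country: {home_small - away_small:.2f}")
```

Whatever values go in, the home bonus comes out the same for both countries, because the home term is simply added on rather than scaled to the country's usual medal count.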

The "12 extra medals" is, roughly speaking, the average. It should work for a roughly average country. Since the UK isn't too far from average, the formula should probably work pretty well this year.

----------

UPDATE: it now occurs to me that that's not exactly right. I was assuming that the population and income were constants for each country. They're not -- they vary over time. So, the regression coefficients with population/income will be *close* to the ones without, but won't be exactly equal.

(I should have also realized that the regression wouldn't work if population and income were constants, because you'd get multicollinearity.)

Labels: forecasting, olympics

Tuesday, March 06, 2012

Are early NFL draft picks no better than late draft picks? Part IV

This is the last post about the Berri/Simmons NFL draft paper, in which they say that draft position doesn't matter much for predicting quarterback performance. Here are Parts I and II and III.

------

In his Freakonomics post, Dave Berri argues, reasonably, that quarterbacks are harder to predict from season to season than basketball players.

When he runs a regression to predict NFL quarterbacks' completion percentage this season, based on only the stat from last season, he gets an r-squared of .311. On the other hand, if he does the same thing for "many statistics in the NBA," his r-squared "exceeds 70 percent."

According to Berri,

"This is not surprising since much of what a quarterback does depends upon his offensive line, running backs, wide receivers, tight ends, play calling, opposing defenses, etc. Given that many of these factors change from season to season, we should not be surprised that predicting the performance of veteran quarterbacks is difficult."

But ... aren't basketball players also subject to changes in the quality of their teammates? Why should teammates be so much more important for football than basketball?

Well, they're not. Almost the entire difference is just sample size. Let me show you.

The r-squared from season to season depends on the variances of what kinds of things are constant between seasons, and what kinds of things are not. For the most part, we can call these "talent" (t) and "noise" (n), respectively.

If the r-squared for QBs between seasons is .31, that means

(t/(t+n)) * (t/(t+n)) = .31

Taking the square root of both sides gives

t / (t+n) = .56

And from there, you can multiply both sides by (t+n), and discover that

n = .79 * t

So, for a single NFL season, the variance due to noise is 79% of the variance due to talent.

Now, in the NFL, a quarterback will get maybe 450 passing attempts per season. In the NBA, a full-time player might get three times as many (FG attempts, FT attempts (even if you take those at half weight), and 3P attempts). So, the noise should be only 1/3 as large. Instead of noise being 79% of talent, it will be only maybe 26%. Call the new value of noise n'. Then,

n' = .26 * t

If you sub that back into the first equation, you get

(t/(t+n')) * (t/(t+n')) = .63

See? Just considering opportunities raises the r-squared of .31 all the way up to .63. Berri says it should "exceed 70%", and we probably could get that to happen if we included rebounds, or used a more sophisticated stat than just shooting percentage.
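The algebra above is easy to check in code. This is only a sketch of the same steps, with illustrative function names; the "three times as many attempts" factor is the rough guess from the text, not a measured value.

```python
import math

def noise_to_talent_ratio(r_squared):
    """Solve (t / (t + n))**2 = r_squared for n / t."""
    return 1.0 / math.sqrt(r_squared) - 1.0

def r_squared_from_ratio(ratio):
    """Season-to-season r-squared when noise variance is `ratio` times talent."""
    return (1.0 / (1.0 + ratio)) ** 2

qb_ratio = noise_to_talent_ratio(0.311)   # about .79
nba_ratio = qb_ratio / 3                  # three times the opportunities
nba_r2 = r_squared_from_ratio(nba_ratio)  # about .63

print(f"QB noise/talent:  {qb_ratio:.2f}")
print(f"NBA noise/talent: {nba_ratio:.2f}")
print(f"Implied NBA r-squared: {nba_r2:.2f}")
```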

So, if quarterbacks are harder to predict than basketball players, it's simply because they don't play enough for their stats to be as reliable.

(UPDATE: As Alex alludes to in the first comment to this post, my logic assumes that "t" -- the variance of talent -- is roughly the same for QBs and NBA shooters. It might not be. But the point is, assuming they're the same is a reasonable first approximation, and that leads to the conclusion that sample size is the biggest difference.

So, maybe I should have been more conservative and said that it *could be* that they don't play enough for their stats to be as reliable.)

------

Which brings me back to Berri's (and Simmons's) academic study. There, they write,

"[Our] results suggest that NFL scouts are more influenced by what they see when they meet the players at the combine than what the players actually did playing the game of football."

Well, yes -- and perhaps the scouts SHOULD be more influenced by the combine. There's lots of noise in only one season of performance, and a rational scout won't weight it too heavily. What if the scout only saw one play? Then, it's obvious that he should be more influenced by the combine than the results. The less data you have, the more you have to weight the combine results.

Look at it this way. You have two pitchers. One throws 100 mph and had an ERA of 3.50 in 50 innings. The other throws 80 mph and had an ERA of 3.20. Which do you draft? Well, *of course* you draft the 100 mph guy. It's only after 200, 300, 400, 1000 innings that you might have enough evidence to change your mind.

------

The idea of random noise and sample size never figures into this paper at all. I don't think the authors even think about it. When they see unexplained variance, they always argue that it's something like the effect of teammates, instead of looking at binomial randomness. In fact, you get the impression they think there's no randomness at all, and the scouts could be perfect if only they were smarter.

For the record, the paper has no occurrences of the words "luck," "random," "binomial," or "sample size."

Labels: Berri, draft, football, freakonomics, NFL

Saturday, March 03, 2012

Are early NFL draft picks no better than late draft picks? Part III

This is about the Berri/Simmons NFL draft paper, in which they say that draft position doesn't matter much for predicting quarterback performance. Here are Parts I and II.

-----

One of the paper's most important claims is that scouts are looking at the wrong things -- specifically, the results of the NFL combine.

At the combine, prospects are tested on a bunch of objective measurements. How fast they run the 40-yard dash. Their BMI (a measure of weight to height ratio). Their intelligence, as measured by the Wonderlic test. And, of course, their height.

But, the authors argue, those things don't matter, and scouts are completely misguided in looking at what happens at the combine. They say that a QB prospect's height, BMI, 40-yard-dash time, and Wonderlic score have almost no effect on performance.

And that one point is key. Because, most of the authors' argument goes (in my words):

Premise 1: Scouts care about combine stats.
Premise 2: Combine stats affect draft position.
Premise 3: Combine stats don't predict performance.

Conclusion: Scouts don't know what they're doing.

So, premise 3 is key. How do the authors prove it?

Here's what they did. They took the 121 QBs drafted from 1999 to 2008, for which they had full combine data. Then they ran a regression to predict the QB's senior year performance based on those factors.

They found no statistical significance for any of them. And they conclude:

"Such results indicate that the combine measures are not able to capture key attributes of the quarterback."

And, in a related footnote,

"Such results indicate that there is little relationship between the combine statistics and per play performance."

Again, I think the problem is that the effect is there, but there just isn't enough data for significance. Indeed, it seems to me that they almost COULDN'T find significance in a study of that type.

Look, how much of QB performance is affected by height? Probably not much, right? There are so many other things involved. I mean, this isn't basketball: you don't see a lot of quarterbacks who are 6-foot-9, which suggests that height can't be that big a deal.

If the effect is that small, how are you going to find statistical significance with only 121 datapoints? Especially when you're trying to predict ONE SINGLE YEAR of college performance, which is very noisy (made even more so because the authors chose to predict a measure that's dependent on playing time)?

You can't, and the authors didn't.

But ... that just means your study isn't precise enough. It doesn't show the effect isn't there. You can't look for a needle in a haystack, from fifty feet away, looking through the wrong end of a pair of binoculars, then say, "we didn't see a needle, so it doesn't exist."

Berri and Simmons didn't even show the results of that regression, even though it's key to their story. They just mention "not significant therefore zero" and move on. But if they HAD given their results, I bet you'd see the confidence interval is wide enough to encompass not just zero, but also many possible values that are perfectly reasonable and perfectly in line with what scouts think height is worth.

The same thing for the other factors -- 40 yard dash speed, Wonderlic, and BMI. It was almost guaranteed that the regression wouldn't find small effects in that sample.

What about the overall results for the four factors? You might get none of the individual combine stats being significant, but the overall correlation might be. Was it?

We really need to see the estimates for the coefficients. How many of them are reasonable individually? If you add them all up, are they also reasonable? If they are, that's all the more reason to point out that the lack of significance doesn't prove anything.

Again, the authors don't show the results ... but they do give a little hint. They run a second regression, this time using rate statistics instead of playing time stats. In a footnote to that, the authors say,

"The adjusted R-squared from these regressions, though, is in the negative range and the F-statistic is statistically insignificant."

A negative adjusted R-squared ... at first glance, that seems to say no relationship.

Except ... I looked up "adjusted R-squared". And, it turns out, for a regression with 5 variables and 121 rows, you can have a negative adjusted R-squared even if the "real" R-squared is as high as .042. That's not as small as it looks. An r-squared of .042 is an r of around 0.2, which is nothing to sneeze at.

(That makes sense. According to this calculator, a single-variable regression on 121 rows needs to find a correlation of 0.178 to find statistical significance, and I think the "adjusted" is meant to make the 5-variable case comparable to the 1-variable case.)
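The .042 figure can be reproduced from the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - k - 1). I'm assuming that's the version the paper used; the function names below are just for illustration.

```python
import math

def adjusted_r_squared(r2, n_rows, n_predictors):
    """Standard adjusted R-squared for a regression with an intercept."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

def negative_threshold(n_rows, n_predictors):
    """Adjusted R-squared is negative whenever R-squared is below this."""
    return n_predictors / (n_rows - 1)

threshold = negative_threshold(121, 5)
print(f"R-squared threshold: {threshold:.4f}")          # about .042
print(f"Equivalent r:        {math.sqrt(threshold):.3f}")  # about 0.2
print(f"Adjusted at .040:    {adjusted_r_squared(0.040, 121, 5):+.4f}")
print(f"Adjusted at .050:    {adjusted_r_squared(0.050, 121, 5):+.4f}")
```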

But 0.2 is probably higher than the effect we're looking for. Or at least, on par with it.

Suppose you ranked all the QBs on their combine stats. And then you took a QB who was +1 in SD in combine stats, and compared him to one that was -1 SD in combine stats. What kind of difference would you expect in on-field performance between the two?

Well, to get a correlation of 0.2, you'd have to expect a difference of about 3 points of NFL Quarterback Rating, or 3 or 4 positions in the performance rankings. (To estimate that, I looked here, and added 3 points to the rating of a typical QB.)

Now, remember, QB performance is very noisy. 3-4 positions in the performance rankings probably means 5-6 positions in *talent* rankings.

That seems to me like it's too much. There's no way height, BMI, Wonderlic, and 40-yard-dash speed could be *that* important, could they, that they're worth 5 or 6 rankings?

If not, then we're looking for an effect that's too small to find with only 121 datapoints to look at.

So, I think Berri and Simmons' regression was doomed from the start. They were guaranteed not to find significance, even if the scouts were right.

Labels: Berri, draft, football, freakonomics, NFL
