Sabermetric Research: Replacement level talent vs. observations: a study

Recently, ~~both~~ JC Bradbury ~~and King Kaufman~~ expressed some skepticism about the concept of "replacement value". In theory, replacement value is the level of talent that can be obtained quickly at league minimum salary, like your best minor-leaguer, or the free agent who almost got an offer but didn't.

Generally, conventional sabermetric wisdom is that a replacement-level player is one who performs at a level 20 runs (or two wins) below average, if pro-rated to a full season. (Two wins below average is zero "wins above replacement," or "WAR".) By this standard, no team should play anyone expected to perform below this level.

In his argument, Bradbury points out that many players did, in fact, perform below this level. In a recent post on replacement value, King Kaufman checked, and found that

"In the major leagues in 2010, 24.5 percent of all innings were thrown by pitchers who ended the season with a negative WAR."

(UPDATE: In the original version of this post, I incorrectly painted Kaufman as a replacement-value skeptic, which he is not. See his comment below.)

The explanation, of course, is that the fact that they *performed* below replacement doesn't mean teams *expected* them to perform below replacement. Teams might have overestimated their abilities, or, more likely, they just had bad years due to random chance. If you bet on heads ten times with each of a bunch of fair coins, some coins are going to land heads only 4 times out of 10. It doesn't mean that there was anything wrong with your prior expectations of the fairness of the coin.
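To put a number on the coin example, here is a minimal sketch of the binomial arithmetic. The function name `prob_at_most` is just a label chosen for this illustration.

```python
from math import comb

def prob_at_most(k, n, p=0.5):
    """Probability of k or fewer heads in n flips of a coin that lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Chance that a perfectly fair coin shows 4 or fewer heads in 10 flips.
print(round(prob_at_most(4, 10), 3))  # 0.377
```

So a fair coin looks "below average" over ten flips more than a third of the time, with nothing at all wrong with the coin.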

Anyway, I thought I'd run a little experiment.

I started with every batter in Jeff Sackmann's "Marcel" database from 2000-2009. ("Marcel" is a prediction method created by Tom Tango, which forecasts a player's performance this year based on his statistics the previous three years.) Assuming the Marcel predictions are reasonable, I counted how many player-seasons were expected to be below replacement level. If the theory is correct, it should be "zero."

It wasn't zero, but it was close. Over 10 years, only 152 players total were expected below replacement value that year. That's 15 players per year, spread among 30 teams. Half a player per team. That's not bad.

And, it's possible I had replacement value wrong. I didn't include fielding, just basic linear weights. And I used arbitrary position adjustments -- catchers and middle infielders had to be -40 runs per 600 PA to be replacement level, 1B and DH had to be 0, and everyone else had to be -20. It's very possible that, of the 152 players, many of them actually *weren't* below replacement level because of their defensive skills.
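As a rough sketch, the position-adjusted cutoffs described above could be encoded like this. The position codes, the function name, and the choice of a strict "less than" comparison are assumptions made for the illustration. The post does not say how ties were handled.

```python
# Replacement level in batting runs per 600 PA, using the cutoffs from the post.
# Catchers and middle infielders: -40. 1B and DH: 0. Everyone else: -20.
REPLACEMENT_PER_600 = {"C": -40.0, "2B": -40.0, "SS": -40.0, "1B": 0.0, "DH": 0.0}
DEFAULT_REPLACEMENT = -20.0

def is_below_replacement(position, runs_above_average, pa):
    """True if the batting line, pro-rated to 600 PA, falls below the cutoff for the position."""
    per_600 = runs_above_average * 600 / pa
    cutoff = REPLACEMENT_PER_600.get(position, DEFAULT_REPLACEMENT)
    return per_600 < cutoff

# A catcher at -15 runs in 300 PA pro-rates to -30, which clears the -40 bar.
print(is_below_replacement("C", -15, 300))   # False
# A first baseman at -5 runs in 300 PA pro-rates to -10, which is below the 0 bar.
print(is_below_replacement("1B", -5, 300))   # True
```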

Also, there were playing-time limits. Jeff's database excludes players with low expected playing time (the smallest he forecasts is for 185 PA). And I left out all players who had fewer than 50 actual plate appearances that season. I figured that if a guy projected below replacement, but the team only gave him 10 AB, that's close enough to zero that we won't count it.

So, half a player per team per season seems reasonably consistent with the concept of replacement level. It's not like teams are signing these guys left and right.

If there were 15 players per season *projected* to be below replacement level, how many of them actually *performed* below replacement level?

The answer: 1,025 total, or 102 per season. There were about seven times as many players *observed* to be below replacement level as *predicted* to be below replacement level.

That makes sense -- if you flip 100 pennies, where replacement level is 0.5, there would be *infinitely* more coins observed below 0.5 than actually below 0.5. (Assuming all pennies are fair coins, that is.)

-----

Now, the experiment. For every player in the database, I decided to randomly simulate their season based on their Marcels. Basically, I treated their Marcel prediction like an APBA card, and ran off a bunch of plate appearances. I simulated the exact number of PA that they *actually* had that season, regardless of how Marcel predicted their playing time.

Then, to simulate the uncertainty about the player's talent, I chose an adjustment from a normal curve, with standard deviation of 5 runs, and added that to the performance. (UPDATE: the 5 was for 500 Marcel PA. I adjusted accordingly for fewer Marcel PA by the square root of the ratio, so for 125 PA, I used 10.)
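A minimal sketch of what one simulated season might look like follows. This is not the code used for the study. The run values are approximate linear weights chosen for illustration, the sample "card" is a made-up batting line, and adding the talent adjustment to the season total (rather than to a per-600 rate) is an assumption.

```python
import math
import random

# Approximate linear-weights run values per event, relative to an average PA.
# These are illustrative assumptions, not the weights used in the post.
RUN_VALUES = {"1B": 0.47, "2B": 0.77, "3B": 1.04, "HR": 1.40, "BB": 0.31, "OUT": -0.27}

def talent_sd(marcel_pa, base_sd=5.0, base_pa=500):
    """Talent uncertainty in runs: 5 at 500 Marcel PA, scaled by the square root of the PA ratio."""
    return base_sd * math.sqrt(base_pa / marcel_pa)

def simulate_season(probs, actual_pa, marcel_pa, rng):
    """Draw actual_pa plate appearances from the card, then add a random talent adjustment."""
    events = list(probs)
    outcomes = rng.choices(events, weights=[probs[e] for e in events], k=actual_pa)
    luck_runs = sum(RUN_VALUES[e] for e in outcomes)
    talent_adjustment = rng.gauss(0.0, talent_sd(marcel_pa))
    return luck_runs + talent_adjustment

# Hypothetical projected line: per-PA probability of each event.
card = {"1B": 0.15, "2B": 0.045, "3B": 0.005, "HR": 0.03, "BB": 0.09, "OUT": 0.68}
season_runs = simulate_season(card, actual_pa=400, marcel_pa=500, rng=random.Random(2010))

print(talent_sd(500), talent_sd(125))  # 5.0 10.0
```

The `talent_sd` helper reproduces the scaling in the update note: 500 Marcel PA gives 5 runs, and 125 PA gives 10.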

If Marcels are good, unbiased predictors, and teams were indeed getting rid of players who fell below replacement, then we should see 1,025 below-replacement performances in the simulation, not just in real life.

Well, we don't.

I ran the simulation 10 times, and the average was 561 players, not 1,025. We got a little over half.

Why? After I ran this, I realized the reason is selective sampling. Suppose you have two players who have talent of -10. Six weeks into the season, and just by luck, one of them is awful, at a rate of -30, and the other one is doing OK, at a rate of +10.

What happens? The -30 guy is released, and winds up the season at -30 over 100 AB. The second guy is allowed to play the whole year, and winds up at -5 over 500 AB.

One out of two wound up having performed below replacement in real life. But, in the simulation, it'll be less than that.

In the simulation, there's less than a 50% chance that the first guy will wind up at less than -20 over 100 AB. And there's a much, much smaller than 50% chance that the -10 guy will wind up below -20 over a full 500 AB.

So the simulation will underestimate the number of below-replacement performances, because, in real life, once a marginal player is below replacement, he's not often given a chance to rise back out of it. But in the simulation, he gets his full number of PA regardless.
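The two-player example can be roughed out with a normal approximation. This is a sketch under stated assumptions: talent and cutoff are treated as rates per 600 PA, and the per-PA standard deviation of about 0.43 runs is an assumed ballpark figure for linear-weights outcomes, not a number from the study.

```python
from math import sqrt
from statistics import NormalDist

SD_PER_PA = 0.43  # assumed spread of linear-weights runs for a single plate appearance

def prob_below(talent_per_600, cutoff_per_600, pa, sd_per_pa=SD_PER_PA):
    """Normal-approximation chance that the observed rate per 600 PA ends up below the cutoff."""
    sd_of_rate = sd_per_pa * 600 / sqrt(pa)
    return NormalDist(talent_per_600, sd_of_rate).cdf(cutoff_per_600)

# A true -10 player, measured against a -20 replacement level.
p_100 = prob_below(-10, -20, 100)   # short stint: luck dominates
p_500 = prob_below(-10, -20, 500)   # full season: luck has mostly evened out
print(round(p_100, 2), round(p_500, 2))
```

Under these assumptions both probabilities are below 50 percent, and the full-season one is the smaller of the two, which is why a simulation that always plays out the full number of PA undercounts the below-replacement lines.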

-----

In that light, I adjusted the simulation to add one new rule: if a player was expected to be +10 or less, and, a third of the way through his expected season, he's below replacement, he gets released. (If, after a third of the season, he's above replacement -- even a little bit -- he plays the entire rest of the season regardless of what happens afterwards.)
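Here is a sketch of how a release rule like that could be bolted onto a simulation. To keep it short, each plate appearance is drawn from a normal curve rather than from an APBA-style card, and the 0.43 runs-per-PA spread, the per-600 units, and the function name are all assumptions made for the illustration.

```python
import random

def simulate_with_release(talent_per_600, pa, replacement_per_600=-20.0,
                          sd_per_pa=0.43, rng=None):
    """Simulate a season; release a marginal player who is below replacement after a third of it.

    Returns (total runs, PA actually played).
    """
    if rng is None:
        rng = random.Random()
    mean_per_pa = talent_per_600 / 600
    checkpoint = pa // 3
    marginal = talent_per_600 <= 10  # only players expected to be +10 or less can be released
    runs = 0.0
    for played in range(1, pa + 1):
        runs += rng.gauss(mean_per_pa, sd_per_pa)
        if marginal and played == checkpoint and runs * 600 / played < replacement_per_600:
            return runs, played  # released: his season ends here
    return runs, pa  # survived the checkpoint, so he plays everything

# Count how often a true -10 player over 450 scheduled PA gets cut.
rng = random.Random(13)
results = [simulate_with_release(-10, 450, rng=rng) for _ in range(1000)]
released = sum(1 for _, played in results if played < 450)
print(released, "of 1000 simulated seasons ended in a release")
```

Every released player is locked in with a below-replacement line, which is the effect the original simulation was missing.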

Now, the simulation goes from 561 below-replacement performances, to 800. Still less than 1,025, but better.

So, finally, I did one more thing: I changed the standard deviation of the uncertainty of the player's talent from 5 runs to 10.

Now, we get to 863. That's 84 percent of the way there.

-----

After all that, I'm not sure quite how much the simulation tells us. To do a proper comparison, we need a better model of how teams decide how much playing time to give a hitter based on expectations and performance.

What we *do* find out, though, is:

-- If you trust Marcel, then it does seem that few teams are willing to keep a player who has performed below replacement.

-- Regardless, many players *do* perform below replacement.

-- Simple probability shows that, at a bare minimum, over half the players who perform below replacement do so because of luck.

-- With other not-too-unreasonable assumptions, we can get that percentage up into the 80s.

My view about all this is that it's less than fully conclusive. Still, it should be fairly persuasive. If you didn't accept the "replacement player" hypothesis before, this little study should have enough in it to get you to reconsider.

What do you think?

-----

UPDATE, 12/14: King Kaufman posts in the comments that I misinterpreted his views on replacement value. My apologies to King, and I've revised the post accordingly.

Labels: baseball, replacement level

10 Comments:

At Monday, December 13, 2010 11:31:00 PM, Phil Birnbaum said...: Tango suggested using these position adjustments to the -20 replacement value:

Catcher: -20 (so -40 is now replacement)
1B/DH: +20
SS/2B/3B: -5
OF: +5

Using Tango’s position values, the results change to:

121 players sub-replacement Marcel
1206 players sub-replacement observed performance
1048 players sub-replacement observed simulation.

Tango also suggested using -17 as the replacement value base instead of -20. Using Tango’s position values and -17 instead of -20 as replacement level:

217 players sub-replacement Marcel
1412 players sub-replacement observed performance
1237 players sub-replacement observed simulation.

A bit of an improvement ... thanks!
At Tuesday, December 14, 2010 12:05:00 AM, Phil Birnbaum said...: In the previous comment, I should have said catcher was -15 (so -35 is now replacement). All other numbers stand.
At Tuesday, December 14, 2010 7:46:00 PM, King Kaufman said...: Not that it matters with regard to the rest of what you write here, but I just want to say that I did not express "some skepticism about the concept of 'replacement value.'"

You quote me saying that in 2010 MLB "24.5 percent of all innings were thrown by pitchers who ended the season with a negative WAR," then you write: "The rebuttal, of course, is that the fact that they *performed* below replacement doesn't mean teams *expected* them to perform below replacement."

Rebuttal to what? In my post, I wrote about replacement level: "It’s the level of play expected from the least valuable players who are still good enough to play in the majors."

I also wrote: "But just because replacement level is the lowest level of play teams can expect from a big league player, that doesn’t mean they’ll get it. There are always some guys who perform below replacement level. Their Wins Above Replacement, or WAR, is a negative number."

In other words, the observed performance is below replacement level. That doesn't by itself cast doubt on the concept of replacement level as an expected performance, and I didn't write that it does.

I've seen the conversation about my post at Tango's website, and I acknowledge that I got a little loose with the language toward the end of my post, not making crystal clear the distinction between expected and observed performance at every mention, of which there were a lot. Basically, once I'd explained the terms, I used phrases like "below-replacement guys" as a shorthand for "guys whose observed performance in 2010 was below replacement."

We might have to agree to disagree about whether that is acceptable or not. I don't know. I'm going to do a little follow-up post on that issue in the next day or two. But I certainly didn't express skepticism about the concept of replacement level.

Thanks for listening. A lot of what you, Tango and others discuss is a little over my head, but I love it.

Cheers.

king
At Tuesday, December 14, 2010 7:51:00 PM, Phil Birnbaum said...: King,

Fair enough, and thanks! I'll revise the post to reflect what you meant.
At Tuesday, December 14, 2010 8:04:00 PM, King Kaufman said...: Hey, thanks. I appreciate that. But what took you so long! Five minutes!
At Tuesday, December 14, 2010 8:05:00 PM, Phil Birnbaum said...: Sorry, I was in the bathroom. :)
At Wednesday, December 15, 2010 10:03:00 AM, Mike said...: Phil, you crap out some great posts. Really enjoyed this.
At Wednesday, December 15, 2010 10:14:00 AM, Phil Birnbaum said...: Thank you!
At Saturday, December 18, 2010 1:53:00 AM, Wheell said...: How many teams would be predicted to be below replacement level? I only ask because of the Tigers.
At Saturday, December 18, 2010 11:28:00 AM, Phil Birnbaum said...: Not sure ... but the SD of team wins just by binomial luck is more than 6. So, if a team is a 63-win talent, 1 in 40 of those teams will win 50 games or less just by luck.

I did a study a few years back on which teams were lucky, and, up to 2001, no team was "really" worse than 53-109. So every team with less than 53 wins was unlucky.

The data/slides are at my website, www.philbirnbaum.com. Search for "1994 Expos".

Monday, December 13, 2010
