### New Bill James study on Pythagoras

In light of all the discussion about why the Diamondbacks beat their Pythagorean projection, and what that means, Bill James wrote up a new study on whether such teams improve beyond expectations in the following season.

Bill sent the study to interested readers on the SABR Statistical Analysis mailing list, but kindly allowed me to post his article, and the accompanying spreadsheet, for anyone interested:

Article

Spreadsheet

You will notice that at the end of the article, Bill asks for comments ... if you comment here, I'll post to the list. I already posted some comments, which I will reprint below in small font. You probably want to read the study first.

------------------------------------------------------------------------------

Bill, thanks for the article! I thought I'd comment, since nobody else has (oops, Dvd Avins posted as I was writing this).

-----

Summarizing Bill's study, if I've got it right:

If you take teams that overachieved – that "should have" been about .482 according to Pythagoras, but were instead actually .538 -- they wound up at .496 the next season.

If you take the most closely matched teams in runs scored and runs allowed, but that *underachieved* Pythagoras – they "should have been" .478, but were actually .447 – they wound up at .474 the next season.

Converting everything to a 162 game season, to make things easier to understand:

Group 1: Should have been 78-84. Were actually 87-75. Next year were actually 80-82.

Group 2: Should have been 77-85. Were actually 72-90. Next year were actually 77-85.

The difference: three games (actually, 3.7 if you don't round everything). Adjusting for the fact that Group 1 was (according to Pythagoras) about 0.6 games better than Group 2 in the first place, brings us back to about 3 games.

In terms of runs scored and runs allowed, the difference is only about 2.3 games. The other 0.7 comes from Pythagoras. That is, the teams with the pythagorean advantage of 13 wins the first year had a pythagorean advantage of only 0.7 wins the next year.
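The arithmetic above can be checked with a short script. The winning percentages are rounded to three decimals, which is why the difference comes out at 3.6 here rather than the unrounded 3.7:

```python
# Check of the 162-game conversions, using the published (rounded)
# winning percentages from the summary above.

def wins(pct, games=162):
    return pct * games

# Group 1: Pythagorean .482, actual .538, next year .496
# Group 2: Pythagorean .478, actual .447, next year .474
g1_next = wins(0.496)  # ~80.4 wins
g2_next = wins(0.474)  # ~76.8 wins

diff = g1_next - g2_next               # ~3.6 games
pyth_gap = wins(0.482) - wins(0.478)   # Group 1 was ~0.6 games better
adjusted = diff - pyth_gap             # ~2.9, "about 3 games"
print(round(diff, 1), round(pyth_gap, 1), round(adjusted, 1))
```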

----

First, I think it's plausible that the 0.7 win advantage is real. Pythagoras just counts runs; it doesn't count how important they are. Bill once wrote (and several others have verified) that each run given up by a stopper is, in terms of wins, worth double what it's worth to an average pitcher. So if the stopper on one team has an ERA 1.00 run better than another, and he pitches 90 innings, the 10 runs he saves are actually worth 20. That will mean his team beats Pythagoras by one win.
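The stopper arithmetic, sketched out, assuming the usual rule of thumb that ten runs equal one win:

```python
# A stopper with an ERA 1.00 run better than average, over 90 innings.

innings = 90
era_advantage = 1.00                       # runs per 9 innings
runs_saved = era_advantage * innings / 9   # 10 runs

leverage = 2.0                             # stopper runs count double
effective_runs = runs_saved * leverage     # worth 20 runs, ~2 wins

# Pythagoras only sees the raw 10 runs (~1 win), so the extra
# 10 runs of leveraged value (~1 win) show up as beating Pythagoras.
pythag_beat = (effective_runs - runs_saved) / 10
print(pythag_beat)  # 1.0 win
```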

It's probably reasonable to assume that the teams that beat Pythagoras the most would have better stoppers than the ones who "un-beat" Pythagoras the most. 0.7 wins – 7 runs difference in stopper talent – seems reasonable to me.

-----

That brings the unexplained difference down to 2.3 wins. What could explain that?

Dvd Avins suggested that it's management making changes: that, next year, when the overachieving team drops back to normal, the team will make some improvements.

Or, perhaps the changes might come in the off-season. The team that went 87-75 thought it was really an 87-75 team, and went out and signed an expensive free agent – the one guy they thought could take them over the top to the playoffs. The 72-90 team did not.

But an average of one free agent per team, over 100 teams, seems large, especially when Bill's sample covered all of baseball history, much of which didn't allow for easy free-agent signings.

However, these 87-75 teams are different from other 87-75 teams, in that the players' season records (not including pitcher W-L) look like the records of a 79-83 team, not an 87-75 team. That means there's more opportunity for management to make changes. When your team scores 120 more runs than it allows, almost everyone looks good. But when your team gives up more runs than it scores, there are going to be more players who obviously seem in need of replacement. This may be *especially* true if the team is perceived to be a playoff contender.

That is, take two identical teams who give up a few more runs than they score. They both have a below-average DH who hits .260 with 15 home runs. The team that went 87-75 might consider it urgent to replace him – they think they need just one or two moves to make the playoffs. The team that went 72-90 knows they need a lot more than that, and may not spend money on free agents until their young players start improving.

Anyway, this may be wrong ... I’m just thinking out loud.

---

As for significance testing ... Tom Tango has figured that in MLB these days, the SD of team wins is 11.66 games. That breaks down as 9.7 games (in 162) for team talent, and 6.3 games for luck.

For the teams in Bill's list, 9.7 games for talent is too much, since the teams were not chosen randomly, but specifically to match the other group. So let's assume that instead of 9.7 games, the SD of the talent difference between the matched pairs is, say, 3 games.

That makes the SD of actual next-season wins come out to about 7 games (the square root of 6.3 squared plus 3 squared).

Since there were 100 pairs in the study, the SD of the average is 7 divided by the square root of 100, which is 0.7. And so the 2.3 win difference is three-and-a-bit SDs. Obviously that's significant, and so not just chance. But the management effect could be causing it.
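That calculation, in a few lines:

```python
import math

# SD of one team's next-season wins: luck plus the assumed talent
# difference within matched pairs, combined in quadrature.

sd_luck = 6.3     # SD of wins due to luck (per Tango)
sd_talent = 3.0   # assumed SD of talent difference within the pairs

sd_team = math.sqrt(sd_luck**2 + sd_talent**2)  # ~7.0 games
sd_avg = sd_team / math.sqrt(100)               # ~0.70, over 100 pairs
z = 2.3 / sd_avg                                # ~3.3 SDs
print(round(sd_team, 2), round(sd_avg, 2), round(z, 1))
```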

Suppose you wanted to get that down to 1 SD, to feel comfortable calling it random variation. You'd have to reduce the difference by 1.6 wins. The most plausible way to do that is to attribute those 1.6 wins to management.

Suppose you have those two identical teams, but one overshoots Pythagoras to win 87 games, and the other undershoots to win 72 games. Both teams may try to improve ... but is it plausible to argue that, averaged over all 100 pairs, the 87-win teams will deliberately go out and improve their teams by 1.6 wins more than the 72-win teams?

I honestly don't know if that sounds reasonable. It's only 16 runs, though ...

Labels: baseball, Bill James, pythagoras

## 20 Comments:

Thanks to you and Bill James for posting this.

Phil: One observation and one theory.

Observation: I think it's interesting to look not just at run differential, but at RS and RA separately. The control teams are +4 on offense and -4 on pitching in yr2, about what you'd expect (the +4 probably reflects a small tendency toward higher scoring over time). But the overperformers increase their RS by 30 runs (!), while pitching/defense is 14 runs *worse*. So these teams didn't just improve generically: they dramatically improved their offense while losing ground on defense. That doesn't tell us why it happened, of course, but it may be an important clue.

I can see a reason for the pitching decline: one way to overperform pythag is very inconsistent pitching. So many overperformers probably have 1 or 2 starters enjoying Cy Young type seasons, while other starters lag considerably. Between regression to the mean and injuries, I'm sure the average yr2 falloff for pitchers enjoying great seasons must be substantial.

Related to this, it would be nice to know how these teams compared to league average. For example, were the overachievers 12 runs below average on both offense and defense in yr1, or, say, -35 on offense and +10 defense? If the latter, it's easier to see how they would improve offensively.

Theory: it seems very likely to me that changes in personnel account for more of the 2 game improvement by overachievers than does performance change from returning players. As you say, the overachievers have more incentive to improve, having won about 15 more games (in 162-game context). I would add to that the issue of revenue: 15 extra wins has a considerable attendance/revenue impact. So the overachievers have more money to spend than a true 78-win team should, to go along with their motivation to improve. Doesn't seem farfetched to think those two together could produce a two-game improvement. However, I don't know why the gains would come exclusively on the offensive side.

Good points, Guy. No time now, will reply later.

A casual check of Guy's theory yields inconclusive results. Most teams who overperformed did not have 1 or 2 starters enjoying Cy Young type seasons. One notable exception was the 1997 Giants who had Kirk Rueter and Shawn Estes having great years. Regression to the mean is a particularly powerful concept in these instances.

Mike: I should amend my point to include teams with very highly-performing relievers. Such high-leverage performance would help a team exceed pythag, but also include a high risk of decline in following season (given fluctuation of reliever performance). Several teams on James' list got exceptional performance from 2 or more relievers (the 2007 DBacks had 5 relievers with ERA+ above 140).

That said, it's just a theory and could certainly be wrong. And of course it's the offensive improvement, more than the pitching decline, that's the real mystery here.

Hey, Guy, I'm back.

I think the fact that offense improves more than defense is evidence for the "management makes more changes" theory.

Two reasons for that. First, "clutch pitching" would cause pythagorean discrepancies more than "clutch hitting". That's because in a clutch situation, you can choose your pitcher, but not your batter. So if your team hits better than expected in the clutch, it's probably luck. But if it *pitches* better than average in the clutch, part of the reason is that your stopper was most excellent.

Therefore, you'd expect the pitching part of the effect to be real. And if your stopper gets hurt, or gets traded, or loses effectiveness, or becomes a free agent, your runs allowed will increase. There is no similar effect for hitting, at least none that (it seems to me) would correlate with beating pythagoras.

Second, if management is more likely to look at W-L records for pitchers, those look better when you beat Pythagoras. But hitting records don't look better. So management may feel their hitting is worse than their pitching, and look there when trying to improve the team.

--

Can you really outperform Pythagoras by inconsistent pitching? How do you mean?

---

As for the 2-game improvement, your theory makes sense -- more money to spend, more improvement. It would be interesting to see if there is a bigger effect in the free-agent era (Bill's study was all of baseball history).

Two games ON AVERAGE, including non-free-agent decades, still seems like a lot to me. I'm more comfortable guessing that maybe it was one game, with the last game being luck. But, that's just my opinion, and yours is probably as good as mine.

Phil:

A team with high variance in RA should win more games for any given mean. For example, in a 4.5 R/G league, a 2.50 pitcher will be .736 while a 6.50 pitcher is .325. Being +2 runs wins more games than you lose by being -2. So a 2.5/2.5/6.5/6.5 staff should win more games than a 4.5/4.5/4.5/4.5 staff (to use an extreme example). Hitting is just the reverse: a team that scored 4 or 5 runs every single day would win more games than an inconsistent offense. (Another way to say this is that the runs with the greatest marginal win impact are #2 thru #5.)
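Guy's convexity point can be sketched with the exponent-2 Pythagorean formula (his .736/.325 figures evidently use a slightly different exponent, but the asymmetry is the same either way):

```python
# Per-start winning percentage under exponent-2 Pythagoras, in a
# league where the team's own offense scores 4.5 runs per game.

def pythag_pct(rs, ra, exp=2):
    return rs**exp / (rs**exp + ra**exp)

league = 4.5

even = pythag_pct(league, 4.5)  # constant 4.5 staff: .500
split = (pythag_pct(league, 2.5) + pythag_pct(league, 6.5)) / 2

# The half-2.5 / half-6.5 staff wins more than the constant staff,
# because the gain from -2 runs exceeds the loss from +2 runs.
print(round(even, 3), round(split, 3))
```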

Less technically: win a lot of blowouts, you will underperform pythag (you 'wasted' RS); lose a lot of blowouts, you will overperform (you gave up harmless RA).

Incidentally, this means that really great starters who suppress scoring down to the 2-3 R/G level are undervalued by a stat like PRAA. The runs they prevent have more win value than average.

* *

I agree that the large number of pre-free agent teams raises questions about the importance of revenue. Still, one can imagine ways that being flush with revenue would make a difference. The owner of the 90-win team might be more willing to trade for a relatively well-paid veteran, or to keep a good player who the 75-win owner might unload.

But yes, 2 extra wins still seems like a lot.....

Question: are we sure that the 2+ win difference is real and needs explaining?

Looking at expected wins predicting next-year wins:

The Pythag overachievers beat their first-year record by 2.2 games on average. But that distribution was spread pretty wide: the stdev = 10.5 (1.05 for a sample of 100), so the 2.2 is just barely 2 stdevs away from 0.

The underachievers essentially equalled their first-year record in year 2 (with a stdev of 12.6).

It seems like there is too much variation in these numbers to assume there is something in there to explain.

bsball,

Are you sure the SD was 10.5 for the discrepancy? Maybe it was 10.5 for the actual number of wins, but the SD for the discrepancy (of the 100 teams) looks to be only a couple of games.

But your point makes me realize I was wrong about one thing: I said the SD of the difference between the overachieving team and its matching team was 7 wins. That's not right -- as I calculated it, 7 wins is the SD of the one team. The difference between the two teams is 7 times the square root of 2, which is about 10.

So we really *are* talking about only 2 SDs.
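The corrected arithmetic:

```python
import math

# SD of the difference between two independent teams is the one-team
# SD times sqrt(2); averaging over 100 pairs divides by sqrt(100).

sd_one = 7.0                       # SD of one team's next-season wins
sd_diff = sd_one * math.sqrt(2)    # ~9.9 for the pair difference
sd_avg = sd_diff / math.sqrt(100)  # ~0.99 over 100 pairs
z = 2.3 / sd_avg                   # ~2.3 SDs
print(round(sd_diff, 1), round(sd_avg, 2), round(z, 1))
```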

Or is that what you meant?

Guy (two comments up):

Right, I see what you're saying. But didn't Bill James once show that each of the first five runs was of roughly equal value, and that the values started diminishing after that? So an increasing variance in runs allowed would beat Pythagoras only to the extent that it caused more games allowing 6+ runs. Which, as you say, means more blowouts.

If that's correct, you can use variance as a proxy for blowouts, but using runs allowed in *actual* blowouts would be better.

Phil:

I think the first run scored has less win value than those that follow, but I might be wrong about that (and would depend on run environment: getting 1 run in 1968 meant more than it does today).

I don't see how a team could have a lot of variance without having a bunch of 6+ run games, so I'm not sure I follow your argument. In any case, I'm pretty sure that higher SD for RA means higher win%, while reverse is true for RS.

* *

BTW, the fact that each successive run prevented is worth a bit more than the one before it is pretty important. It means that a metric like Win Shares that treats all runs as equally valuable will understate the contribution of great starting pitchers. David Gassko has written about this, and uses a pythag win% approach to try to adjust for it.

Guy, I haven't read the Gassko study, will check it out.

Yes, you're right that more variance means more 6+ run games. But if a great starting pitcher doesn't give up very many 6+ run games, even with high variance, then it might not matter. But if the value of each prevented run increases over the last, then I see your point.

Phil (a few up),

Yes, that's pretty much what I'm saying. It's been a long time since I took stats, but if I do an ANOVA in Excel, it seems to say that the p-value of an F-test on the means is .19, which tells me it's not a slam dunk that our means are different at all, much less by 2 wins.

I'm puzzled that there seems to be no interest in the distribution of runs scored by both winning and losing teams. I assume that this distribution is approximately binomial, but it might not be, for reasons explained in the next paragraph.

The reason the distribution of runs scored might not be binomial is the behavior of the manager. In the 2007 season, Bob Melvin sometimes seemed to reach a point at which he concluded the game was lost. Then he would cease to use his best bullpen men, instead sending in guys who were obviously sacrificial lambs.

For teams that exceeded their Pythagorean expectations, I would be interested in some kind of time-series analysis to see if there was improvement during the season. For a start, it might be nice to try a 10-game, 20-game, and 30-game smoothed average. A simple graph could show if this is worth pursuing.
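The smoothing the commenter suggests could be sketched like this (the game-by-game run differentials below are made up for illustration):

```python
# Simple moving average of per-game run differential (RS - RA),
# the kind of smoothing one might graph at 10/20/30-game windows.

def moving_average(xs, window):
    return [sum(xs[i - window:i]) / window
            for i in range(window, len(xs) + 1)]

run_diff = [1, -2, 3, 0, 4, -1, 2, 5, -3, 1, 2, 0]  # invented data
print(moving_average(run_diff, 10))
```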

Finally, the usual "Pythagorean" projections are a kind of resampling, though the quality could be improved by enriching the simulation (introducing more variables). In any case, instead of simply giving the expected wins, one ought to give the distribution of results, maybe in a little box-and-whisker plot or something.

You know, all this talk about Pythagoras is annoying. If you take runs scored minus runs allowed, divided by 10, so a +10 team would project to 82 wins, you get the same answer and it is much easier to calculate. If you want a little more accuracy, use 10 × sqrt(runs per inning by both teams) as the runs-per-win divisor. It took Bill 20 years to figure out how to relate runs created to win shares, when actually the relationship is quite simple.

Hi, Pete,

I usually use the 10-run rule in casual calculations, but for more serious stuff I use Pythagoras. Am I mistaken in thinking that Pythagoras works a bit better for extreme teams?
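For what it's worth, a quick check under the exponent-2 Pythagorean formula (the run totals below are invented for illustration) shows the two rules tracking closely for ordinary teams and drifting apart at the extremes:

```python
# Pythagoras (exponent 2) vs. the ten-runs-per-win linear rule,
# over a 162-game season.

def pythag_wins(rs, ra, games=162):
    return games * rs**2 / (rs**2 + ra**2)

def linear_wins(rs, ra, games=162):
    return games / 2 + (rs - ra) / 10

for rs, ra in [(760, 740), (850, 700), (1000, 600)]:
    print(rs, ra,
          round(pythag_wins(rs, ra), 1),
          round(linear_wins(rs, ra), 1))
```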

My first thought is this just supports the idea that the difference between actual and expected record is not all luck. Teams that outperform their expected record have certain qualities in common and those qualities are fairly likely to be there the next season.

Maybe it supports the idea that leveraging players (managerial decisions) is more important than even we think.

Phil: has there been any further discussion on the SABR list that you can share?

A little bit more. The two most significant (IMO) are these. Hope Joe and Mike don't mind me reproducing their comments.

---------

Joe Dimino wrote,

"One of the things I noticed when reading the article is that the teams that overachieved showed their improvement the following season on offense.

They scored 30 more runs and allowed 14 more in the season following their overachieving year.

The matched teams scored and allowed 4 more in the subsequent season.

Not sure if it means anything, but it seemed interesting that the improvement wasn't split equally between offense and defense/pitching."

------

Mike Emeigh wrote,

> [quoting Phil Birnbaum] "And one explanation is (still) that the overperforming team is more likely to spend money to improve it over the off-season." [end quote]

It's an explanation, certainly, and might very well apply in the free-agent era (if not before). But you can find counter-examples. The 1978 Reds show up in Bill's list of overachieving teams. That team was pretty clearly UN-willing to spend money to improve in the offseason: the only major change they made was to let Pete Rose go in free agency and to replace him with the younger, cheaper Ray Knight. Granted, they didn't improve, but they stayed at about the same level in 1979, winning 90 games and a division title.

Most of the overachieving teams at which I looked did (somewhat) aggressively attack their weakest areas. The 1932 Pirates replaced their worst regular hitter, Adam Comorosky, with a much better hitter, Freddy Lindstrom. The 1984 Mets did the same thing, bringing in Gary Carter. The 1997 Giants, saddled with a top-heavy bullpen which lost its ace closer, rebuilt it around Robb Nen and a better supporting cast. The 1982 Orioles added Cal Ripken and (in-season) moved him to SS, adding an offensive player (Glenn Gulliver) who was an improvement over the cast of SS they had been running out there. Very few of the overachievers stood pat. I didn't look at the underachievers, by comparison, and that's probably worth a look.

------

The offensive improvement is interesting, but I think it makes sense if we buy the assumption that the overachiever has more incentive to improve than the underachiever. Simply put, it's easier to reliably improve offense through personnel moves than it is to improve pitching. Even the dumbest batting stats (AVG/HR/RBI) tell you something about true skill levels that can be expected to repeat in subsequent years. Pitchers, on the other hand, are legendary for their inconsistency; even basing investments on something like ERA+ can lead to disappointment, let alone basing decisions on W-L record. In other words, even dumb GMs can improve an offense if they put their minds to it, and even smart GMs can look stupid in pitching acquisitions, because batters are more consistent than pitchers, and conventional wisdom better describes true batting skill than pitching skill.
