Monday, April 21, 2014

Which players are making baseball games so slow?

Who are the fastest and slowest players in baseball, in terms of keeping the games moving along?

Four years ago, I wrote two posts that tried to answer that question.  I took Retrosheet game logs from 2000 to 2009, and ran a huge regression, trying to predict game time (hours and minutes) from the 18 players in the starting lineups, the two starting pitchers, the two teams, the years, the number of pitches, the number of stolen base attempts, and anything else I could think of.  

It turned out that players appeared to vary quite a bit -- at the extremes, it appeared that some batters could affect game time by as much as four minutes, and pitchers as much as seven minutes.  For the record, here are the "fastest" and "slowest" players I found, in minutes and seconds they appear to add:

+4:10 Slowest batter:  Denard Span 
-4:11 Fastest batter:  Chris Getz

+7:43 Slowest pitcher: Matt Garza 
-7:13 Fastest pitcher: Tim Wakefield

+5:45 Slowest catcher: Gary Bennett
-4:39 Fastest catcher: Eddie Perez

A full spreadsheet of my 2010 estimates is here.


Last week, Carl Bialik, of FiveThirtyEight, revisited the question of slow games, and I learned that FanGraphs now has real data available.  They took timestamped PITCHf/x game logs, and calculated the average time between pitches, for each batter and pitcher.  They call that statistic "pace," and it's available for 2007 to 2014.

So, now, there's some hard data to verify my numbers with.  

I found 290 batters who were in both my study and the Fangraphs data.  Then, I ran a regression to predict FanGraphs' number from mine.

It turned out that for every extra minute per game that I found, FanGraphs found only about half a second per pitch (a slope of 0.4778).  So if I have a player at +2 minutes per game, you'd expect FanGraphs to have clocked him at about +0.95 seconds per pitch.

How many pitches does a starting batter see in a game?  Maybe 16?  So, for that +2 batter, my estimate works out to 120 seconds per game, while FanGraphs' works out to only about 15 seconds.  My estimate is eight times theirs!  
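That factor-of-eight arithmetic can be checked directly.  A minimal sketch, using the 0.4778 slope from the regression and the rough guess of 16 pitches per game from above:

```python
# The batter mismatch, in arithmetic. The 0.4778 slope comes from the
# regression described above; 16 pitches per game is the rough guess.
slope = 0.4778          # FanGraphs seconds-per-pitch per regression minute
pitches_per_game = 16   # pitches a starting batter might see

# One regression minute (60 seconds) maps to this many FanGraphs seconds:
fangraphs_seconds = slope * pitches_per_game    # about 7.6

print(round(60 / fangraphs_seconds))            # the factor of eight
```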

But, hang on.  The correlation between the two measures is surprisingly high -- +0.40.  That suggests that, generally, I got the *order* of the players right.  Generally, the regression was successful in separating the fast batters from the slow batters.

If you look at the 20 players Fangraphs has as slowest, my study estimated 19 of them as slower than average.   (Wil Nieves was the exception.)  

For their fastest 19 batters (I skipped over the six-way tie for 20th), my study wasn't quite as good.  It found only 11 of 19 as faster than average.  But, of Fangraphs' next 19 fastest, my study was 15-4.  

My feeling is ... it's not great, but it's not bad.  I'm willing to argue that the regression is reasonably capable of differentiating the speedy batters from the time-wasters. 


For pitchers, the fit was much better.  

The correlation was +0.73, much higher than I had expected.  And the units were decent, too.  For every minute of slowness per game the regression found, FanGraphs found the equivalent of about 42 seconds per game (based on a guess of 94 pitches per start).  
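The same sanity check for the pitchers, again assuming the guessed 94 pitches per start:

```python
# For pitchers, FanGraphs' 42 seconds per game (at a guessed 94 pitches
# per start) is much closer to the regression's 60 seconds per minute.
pitches_per_start = 94
fangraphs_seconds = 42.0                        # per regression minute of slowness

slope = fangraphs_seconds / pitches_per_start   # implied sec/pitch per minute
ratio = 60 / fangraphs_seconds                  # discrepancy factor

print(round(slope, 2), round(ratio, 1))         # a factor of 1.4, not 8
```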

Why did the pitchers come out so much better than the hitters?  My guess: it's hard to separate the effect of Derek Jeter from the other eight batters, because they play together so much of the time.  In contrast, for every starting pitcher, there are at least 120 games without him where the rest of the lineup is almost the same.  That makes the differences much more obvious for the regression to figure out.

(BTW, Derek Jeter was the second "slowest" batter in my study, at +3.78 minutes per game.)


But, still ... why is the regression so far off for the batters, by a factor of 8?

No doubt, some of it is the effects of random luck.  But I think the most important factor is that time between pitches isn't the only factor affecting the length of a game.  

In his article, Carl Bialik noted that from 2008 to 2014, Yankees games took 12.8 minutes longer, on average, than Blue Jays games.  But, after adjusting for the pace of the hitters and pitchers, there were still 6.5 minutes unexplained.  

For the Red Sox, it was +11.8 minutes overall, with +3.0 minutes left unexplained after adjusting for pace.  

That's just a two-team sample, but "pace" appears to be only about a half to two-thirds of the reason games take so long.

What's the rest?  

There are lots of possibilities. Time between innings?  16 half-inning breaks, an extra 5 seconds each, adds up to 80 seconds.  Seventh-inning stretches, I bet those take longer in some parks than others.  The time it takes to announce a batter ... are some parks faster than others, and does the batter wait for it?  Does it take five extra seconds for crowd cheering when Derek Jeter bats?  What about defense ... how long it takes to get the ball back to the pitcher after an out?  Do some outfielders take their time throwing the ball back? 

To check for the park-related factors, I re-ran my regression, but, this time, I used an extra dummy variable for home/road. And, indeed, some significant differences showed up.  Even after controlling for everything else, including the players, games take an extra 1.9 minutes at Yankee Stadium, and an extra 1.4 minutes at Fenway Park.
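Here's a toy version of that kind of regression, so you can see how a park dummy separates out from the player effects.  Everything is synthetic -- the effect sizes are just the ones quoted above, not real Retrosheet data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic game times: a baseline, a "slow pitcher" effect, a park
# effect, and noise. The 1.9-minute park effect mimics the Yankee
# Stadium number found above.
n = 20_000
slow_pitcher = rng.integers(0, 2, n)   # 1 if the slow starter pitches
home_park = rng.integers(0, 2, n)      # 1 if the game is at the slow park

true_park_effect = 1.9                 # minutes
minutes = (175 + 7.0 * slow_pitcher + true_park_effect * home_park
           + rng.normal(0, 15, n))

# Ordinary least squares with an intercept, pitcher dummy, and park dummy
X = np.column_stack([np.ones(n), slow_pitcher, home_park])
coef, *_ = np.linalg.lstsq(X, minutes, rcond=None)

print(round(coef[2], 1))               # recovered park effect, near 1.9
```

Because the pitcher and park dummies are uncorrelated here, the regression recovers the park effect cleanly; the real-life problem is that players and parks overlap much more than this.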

And ... it looks like there might possibly be a pattern: larger market teams appear to be hosting slower games than small-market teams. Here are the teams with the largest unexplained slowness:

+1.91 Yankees
+1.88 Nationals
+1.71 Braves
+1.45 Red Sox
+1.40 Dodgers

And the most unexplained quickness:

-2.94 Blue Jays
-2.45 Giants
-2.25 A's
-1.83 Tigers
-1.17 Cubs

Maybe it's not a perfect pattern, but it's still suggestive.


So, that's some support for the idea that there is some kind of park-related effect.  It doesn't explain why the batter numbers are so extreme, since this is after controlling for players.  But, it could still be part team and part player, if (say) the fans cheer an extra 10 seconds for any Yankee batter, but an extra 30 seconds for Derek Jeter.  

Another thing that might be happening: there are many players who miss only a few games.  I checked Derek Jeter in 2010.  He started all but seven games, and every one of those seven was on the road.  Could the regression be conflating Jeter games with home games?  (It doesn't really show up as a huge confidence interval in the regression, though.)

Or, maybe everyone works faster in meaningless end-of-season games, and those are the ones where Derek Jeter is more likely to be sat out.  

You can probably think of other possibilities, some being a real effect, and some a statistical artifact.  Or, I might have made a mistake somewhere.  

But, I suspect there's something real going on, something other than what's measured by "pace." And, the regression seems to be assigning a lot of it to individual batters -- with some to the team in general, and some to the home park.

I'm not sure what it is, though. 


Monday, April 14, 2014

Accurate prediction and the speed of light III

There's a natural "speed of light" limitation on how accurate pre-season predictions can be.  For a baseball season, that lower bound is 6.4 games out of 162.  That is, even if you were to know everything that can possibly be known about a team's talent -- including future injuries -- the expected SD of your prediction errors could never be less than 6.4.   (Of course, you could beat 6.4 by plain luck.)

Some commenters at FiveThirtyEight disagree with that position.  One explicitly argued that the limit is zero -- which implies that, if you had enough information, you could be expected to get every team exactly right.  That opinion isn't an outlier  -- other commenters agreed, and the original comment got five "likes," more than any other comment on the post where it appeared.


Suppose it *were* possible to get the win total exactly right. By studying the teams and players intently, you could figure out, for instance, that the 2014 Los Angeles Dodgers would definitely go 92-70.

Now, after 161 games, the Dodgers would have to be either 91-70 or 92-69.  In either case, for them to finish exactly 92-70, you would have to *know*, before the last game, whether it would be a win or a loss.  If there were any doubt at all, there would be a chance the prediction would be wrong.

Therefore, if you believe there is no natural limit to how accurate you can get in predicting a season, you have to believe that it is also possible to predict game 162 with 100% accuracy.

Do you really want to bite that bullet?  

And, is there something special about the number 162?  If you also think there's no limit to how accurate you can be for the first 161 games ... well, then, you have the same situation.  For your prediction to have been expected to be perfect, you have to know the outcome of the 161st game in advance.

And so on for the 160th game, and 159th, and so on.  A zero "speed of light" means that you have to know the result of every game before it happens.  


From what I've seen, when readers reject the idea that the lowest error SD is 6.4, they're reacting to the analogy of coin flipping.  They think something like, "sure, the SD is 6.4 if you think every baseball game is like a coin flip, or a series of dice rolls like in an APBA game.  But in real life, there are no dice. There are flesh-and-blood pitchers and hitters.  The results aren't random, so, in principle, they must be predictable."

I don't think they are predictable at all.  I think the results of real games truly *are* as random as coin tosses.  

As I've argued before, humans have only so much control of their bodies. Justin Verlander might be wanting to put the ball in a certain spot X, at a certain speed Y ... but he can't.  He can just come as close to X and Y as he can, and those discrepancies are random.  Will it be a fraction of a millimeter higher than X, or a fraction lower?  Who knows?  It depends on which neurons fire in his brain at which times.  It depends on whether he's distracted for a millionth of a second by crowd noise, or his glove slipping a bit.  It probably depends on where the seam of the baseball is touching his finger. (And we haven't even talked about the hitter yet.)

It's like the "chaos theory" example of how a butterfly flapping its wings in Brazil can cause a hurricane in Texas. Even if you believe it's all deterministic in theory, it's indistinguishable from random in practice.  I'd bet there aren't enough atoms in the universe to build a computer capable of predicting the final position of the ball from the state of Justin Verlander's mind while he goes into his stretch -- especially, to an extent where you can predict how Mike Trout will hit it, assuming you have a second computer for *his* mind.

What I suspect is, people think of the dice rolls as substitutes for identifiable flesh-and-blood variation. But they aren't. The dice rolls are substitutes for randomness that's already there. The flesh-and-blood variation goes *on top of that*.

APBA games *don't* consider the flesh-and-blood variation, which is why it's much easier to predict an APBA team's wins than a real-life team's wins.  In a game, you know the exact probabilities before every plate appearance.  In real life, you don't know that the batter is expecting fastballs today, but the pitcher is throwing more change-ups.  

The "speed of light" is *higher* in real-life than in simulations, not lower.


Now, perhaps I'm attacking a straw man here.  Maybe nobody *really* believes that it's possible to predict with an error of zero.  Maybe it's just a figure of speech, and what they're really saying is that the natural limit is much lower than 6.4.

Still, there are some pretty good arguments that it can't be all that much lower.

The 6.4 figure comes from the mathematics of flipping a fair coin.  Suppose you try to predict the outcome of a single flip. Your best guess is, "0.5 heads, 0.5 tails".  No matter the eventual outcome of the flip, your error will be 0.5.  That also happens to be the SD, in this case.  

(If you want, you can guess 1-0 or 0-1 instead ... your average error will still be 0.5.  That's a special case that works out for a single fair coin.)  

It's a statistical fact that the SD of the total grows with the square root of the number of independent flips.  The square root of 162 is around 12.7. Multiply that by 0.5, and you get 6.364, which rounds to 6.4. That's it!
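If you don't trust the formula, a quick simulation confirms it -- many 162-game seasons of a true .500 team, with the spread of win totals checked against 0.5 times root 162:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 200,000 seasons of 162 coin-flip games and measure the
# spread of win totals around the expected 81.
seasons = rng.binomial(162, 0.5, size=200_000)

print(round(seasons.std(), 2))   # should land very close to 6.36
```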

Skeptics will say ... well, that's all very well and good, but baseball games aren't independent, and the odds aren't 50/50!  

They're right ... but, it turns out, fixing that problem doesn't change the answer very much.

Let's check what happens if one team is a favorite.  What if the home team wins 60% of the time instead of 50%?

Well, in that case, your best bet is to guess the home team will have a one-game record of 0.6 wins and 0.4 losses.  Six times out of 10, your error will be 0.4.  Four times out of ten, your error will be 0.6.  The root mean square of (0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.6, 0.6, 0.6, 0.6) is around 0.4899. Multiply that by the square root of 162, and you get 6.235.  

That, indeed, is smaller than 6.364.  But not by much.  Still, the critics are right ... the assumption that all games are 50/50 does make for slightly too much pessimism.

Mathematically, the SD is always equal to the square root of (chance of team A winning)*(chance of team B winning)*(162).  That has its maximum value when each game is a 50/50 toss-up.  If there's a clear favorite, the SD drops;  the more lopsided the matchup, the lower the SD.  

But the SD drops very slowly for normal, baseball levels of competitive balance.  As we saw, a season of 60/40 games (60/40 corresponds to Vegas odds of +150/-150 before vigorish) only drops the speed of light from 6.36 to 6.23.  If you go two-thirds/one-third -- which is roughly the equivalent of a first-place team against a last-place team -- the SD drops to exactly 6.000, still pretty high.  

In real life, every game is different; the odds depend on the teams, the starting pitchers, the park, injuries, and all kinds of other things. Still, that doesn't affect the logic much. With perfect information, you'll know the odds for each individual game, and you can just use the "average" in some sense.  

Looking at the betting lines for tomorrow (April 15, 2014) ... the average favorite has an implied expected winning percentage of .568 (or -132 before vigorish).  Let's be conservative, and say that the average is really .600.  In that case, the "speed of light" comes out to 6.2 instead of 6.4.  
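The formula makes all of these cases one-liners.  Here's the square-root expression from above, evaluated at the probabilities discussed:

```python
import math

def speed_of_light(p, games=162):
    """Lowest achievable error SD when every game is a p vs. (1-p) proposition."""
    return math.sqrt(games * p * (1 - p))

# 50/50 coin flips, the observed .568 average favorite, the conservative
# .600, and the two-thirds first-place-vs-last-place matchup:
for p in (0.50, 0.568, 0.60, 2 / 3):
    print(round(p, 3), round(speed_of_light(p), 2))
```

The drop from 6.36 to 6.24 over the whole 50/50-to-60/40 range is the "vacuum vs. air" difference.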

For an analogy, if you like, think of 6.4 as the speed of light in a vacuum, and 6.2 as the speed of light in air.  


What about independence?  Coin flips are independent, but baseball games might not be.  

That's a fair point.  If games are positively correlated with each other, the SD will increase; if they're negatively correlated, the SD will decrease.  

To see why: imagine that games are so positively correlated that every result is the same as the previous one.  Then, every team goes 162-0 or 0-162 ... your best bet is to predict each team at their actual talent level, close to .500. Your error SD will be around 81 games, which is much higher than 6.2.

More importantly: imagine that games are negatively correlated so that every second game is the opposite of the game before.  Then, every team goes 81-81, and you *can* predict perfectly.

But ... these are very extreme, unrealistic assumptions.  And, as before, the SD drops very, very slowly for any remotely plausible level of negative correlation.

Suppose that every second game, your .500 team has only a 40% chance of the same result as the previous game.  That would still be unrealistically huge ... it would mean an 81-81 team is a 97-65 talent after a loss, and a 65-97 talent after a win.  

Even then, the "speed of light" error SD drops only slightly -- from 6.364 to 6.235.  It's a tiny drop, for such a large implausibility.

But, yes, the point stands, in theory.  If the games are sufficiently "non-independent" of each other, you can indeed get from 6.4 to zero.  But, for anything remotely realistic, you won't even be able to drop your error by a tenth of a game. For that reason, I think it's reasonable to just go ahead and do the analysis as if games are independent.  


Also, yes, it is indeed theoretically possible to get the limit down if you have very, very good information.  To get to zero, you might need to read player neurons.  But what about the more modest goal of cutting your error in half?  How good would you need to be to get from 6.4 down to 3.2?  

*Really* good.  You'd need to be able to predict the winner 93.3% of the time ... 15 out of 16 games.


You have to be able to say something like, "well, it looks like the Dodgers are 60/40 favorites on opening day, but I know they're actually 90/10 favorites because Clayton Kershaw is going to be in a really good mood, and his curve will be working really well."  And you have to repeat that 161 times more.  And 90/10 isn't actually enough, overall ... you need to average around 93/7. 

Put another way: when the Dodgers win, you'd need to have been able to predict that 15/16 of the time.  And when the Dodgers lose, you'd need to have been able to predict that 15/16 of the time.  

That's got to be impossible.  

And, of course, you have to remember: bookies' odds on individual games are not particularly extreme.  If you can regularly predict winners with even 65% accuracy, you can get very, very rich.  This suggests that, as a practical matter, 15 out of 16 is completely out of reach.  

As a theoretical possibility ... if you think it can be done in principle, what kind of information would you actually need in order to forecast a winner 93% of the time?  What the starter ate for breakfast?  Which pitches are working and which ones aren't?  The algorithm by which the starter chooses his pitches, and by which each batter guesses?  

My gut says, all the information in the world wouldn't get you anywhere close to 93%.

Take a bunch of games where Vegas says the odds are 60/40.  What I suspect is: even if you had a team of a thousand investigative reporters, who can hack into any computer system, spy on the players 24 hours a day, ask the manager anything they wanted, and do full blood tests on every player every day ... you still wouldn't have enough information to pick a winner even 65 percent of the time.  

There's just too much invisible coin-flipping going on.


Tuesday, April 08, 2014

Predictions should be narrower than real life

Every year since 1983 (strike seasons excepted), at least one MLB team finished with 97 wins or more.  More than half the time, the top team had 100 wins or more.

In contrast, if you look at ESPN's 2014 team projections, their highest team forecast is 93-69.  

What's going on?  Does ESPN really expect no team to win more than 93 games?

Nope.  I bet ESPN would give you pretty good odds that some team *will* win 94 games or more, once you add in random luck.  

The standard deviation (SD) of team wins due to binomial randomness is around 6.4.  That means on average, nine teams per season will be lucky by six wins or more.  If you have a bunch of teams forecasted in the low 90s -- and ESPN has five of those -- chances are, one of them will get lucky and finish around a hundred wins.

But you can't predict which teams will get that luck.  So, if you care only about getting the best accuracy of the individual team forecasts, you're always going to project a narrower range than the actual outcomes.

A more obvious way to see that is to imagine simulating a season by flipping coins -- heads the home team wins, tails the visiting team wins.  Obviously, any one team is as "good" as any other.  Under those circumstances, the best prediction you can make is that every team will go 81-81.  Of course, that probably won't happen, and some teams, by just random chance, will go 92-70, or something.  But you don't know which teams those will be, and, since it's just luck, there's no way of being able to guess.  
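The coin-flip league is easy to simulate, and it shows exactly why the 81-81 prediction and the spread of outcomes can coexist -- some team almost always lucks into the 90s:

```python
import numpy as np

rng = np.random.default_rng(2)

# A league of 30 identical coin-flip teams, simulated 2,000 times.
# The right prediction for every team is 81-81, yet the luckiest team
# in each simulated season routinely finishes in the 90s.
n_seasons = 2_000
wins = rng.binomial(162, 0.5, size=(n_seasons, 30))

avg_best = wins.max(axis=1).mean()
print(round(avg_best, 1))   # typically in the low-to-mid 90s
```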

It's the same logic for real baseball.  No *specific* team can be expected to win more than 93 games.  Some teams will probably win more than 96 by getting lucky, but there's no way of predicting which ones.


That's why your range of projections has to be narrower than the expected standings.  How much narrower?

Over the past few seasons, the SD of team wins has been around 11.  Actually, it fluctuates a fair bit (which is expected, due to luck and changing competitive balance).  In 2002, it was over 14.5 wins; in 2007, it was as low as 9.3.  But 11 is pretty typical.

Since a team's observed performance is the sum of talent and luck, and because talent and luck are independent, 

SD(observed)^2 = SD(talent)^2 + SD(luck)^2.

Since SD(observed) equals 11, and SD(luck) = 6.4, we can figure that, after rounding,

SD(talent) = 9

So: if a season prediction has an SD that's significantly larger than 9, that's a sign that someone is trying to predict which teams will be lucky.  And that's impossible.  


As I wrote before, it's *really* impossible, as in, "exceeding the speed of light" impossible.  It's not a question of just having better information about the teams and the players.  The "9" already assumes that your information is perfect -- "perfect" in the sense of knowing the exact probability of every team winning every game.  If your information isn't perfect, you have to settle for even less than 9.

Let's break down the observed SD of 11 even further.  Before, we had

Observed = talent + luck

We can change that to:

Observed = talent we can estimate + talent we can't estimate + luck

Clearly, we'd be dumb in trying to estimate talent we don't know about -- by definition, we'd just be choosing random numbers.  What kind of things are there that affect team talent that we can't estimate?  Lots:

-- which players will get injured, and for how long?
-- which players will blossom in talent, and which will get worse?
-- how will mid-season trades change teams' abilities?
-- which players will the manager choose to play more or less than expected?

How big are those issues?  I'll try guessing.

For injuries: looking at this FanGraphs post, the SD of team player-days lost to injury seems to be around 400.  If the average injured player has a WAR of 2.0, that's an SD of about 5 wins (400 player-days is around 2.5 player-seasons).  

But that's too high.  A WAR of 2.0 is the standard estimate for full-time players, but there are many part-time players whose lost WAR would be negligible.  The FanGraphs data might also include long-term DL days, where the team would have had time to find a better-than-replacement substitute.

I don't know what the right number is ... my gut says, let's change the SD to 2 wins instead of 5.

What about players blossoming in talent?  I have no idea.  Another 2 wins?  Trades ... call it 1 win?  And, playing time ... that could be significant.  Call that another 2 wins.

Use your own estimates if you think I don't know what I'm talking about (which I don't).  But for now, we have:

9 squared equals
 -- SD of known talent squared +
 -- 2 squared [injuries] +
 -- 2 squared [blossoming] +
 -- 1 squared [trades] +
 -- 2 squared [playing time].

Solving, and rounding, we get 

SD of known talent = 8
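The two-step decomposition above, in arithmetic form.  The component guesses (2, 2, 1, and 2 wins) are just the rough estimates from the text:

```python
import math

# Step 1: observed spread minus binomial luck gives total talent spread.
sd_observed = 11.0
sd_luck = 6.4
sd_talent = math.sqrt(sd_observed**2 - sd_luck**2)   # about 8.9, rounds to 9

# Step 2: strip out the guessed-at unpredictable components of talent.
unpredictable = [2, 2, 1, 2]   # injuries, blossoming, trades, playing time
sd_known = math.sqrt(sd_talent**2 - sum(x * x for x in unpredictable))

print(round(sd_talent), round(sd_known))   # 9 8
```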


What that means is: the SD of your predictions shouldn't be more than around 8.  If it is, you're trying to predict something that's random and unpredictable.

And, again: that calculation is predicated on *perfect knowledge* of everything that we haven't listed here.  If you don't have perfect knowledge -- which you don't -- you should be even lower than 8.

In a sense, the SD of your projections is a measure of your confidence.  The higher the SD, the more you think you know.  A high SD is a brag.  A low SD is humility.  And, a too-high SD -- one that violates the "speed of light" limit -- is a sign that there's something wrong with your methodology.


What about an easy, naive prediction, where we just project based on last year's record?  

This blog post found a correlation of .58 between team wins in 2012 and 2013.  That would suggest that, to predict next year, you take last year, and regress it to the mean by around 42 percent.  

If you do that, your projections would have an SD of 6.38.  (It's coincidence that it works out nearly identical to the SD of luck.)

I'd want to check the correlation for other pairs of years, to get a bigger sample size for the .58 estimate.  But, 6.38 does seem reasonable.  It's less than 8, which assumes excellent information, and it's closer to 8 than 0, which makes sense, since last year's record is still a pretty good indicator of how good a team will be this year.
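The 6.38 comes straight from the shrinkage: predicting 81 + 0.58 * (last_year - 81) scales every team's deviation by 0.58, so it scales the spread of the projections the same way:

```python
# Naive projection: regress last year's wins 42 percent toward the mean,
# i.e. predict 81 + 0.58 * (last_year - 81). Multiplying deviations by
# 0.58 multiplies their SD by 0.58 as well.
r = 0.58             # year-to-year correlation of team wins
sd_last_year = 11.0

sd_projection = r * sd_last_year
print(round(sd_projection, 2))   # 6.38
```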


A good practical number has to be somewhere between 6.38 (where we only use last year's record), and 8 (where we have perfect information, everything that can truly be known). 

Where in between?  For that, we can look to the professional bookmakers.  

I think it's safe to assume that bookies pretty much need to have the most accurate predictions of anyone.  If they didn't, smart bettors would bankrupt them.

The Bovada pre-season Over/Under lines had a standard deviation of 7.16 wins.  The Las Vegas Hotel and Casino also came in at 7.16.   (That's probably coincidence -- their lines weren't identical, just the SD.)

7.16 seems about right.  It's almost exactly halfway between 6.38 and 8.00.

If we accept that number, it turns out that more of the season is unpredictable than predictable.  The SD of 11 wins per team comes from 7.2 wins that the sports book can figure out, and 8.3 that the sports book *can't* figure out (and is probably unknowable).  
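That split is the same Pythagorean decomposition as before, now with the bookmakers' SD as the "knowable" component:

```python
import math

# If the Vegas lines (SD 7.16) capture everything knowable, the leftover
# spread in actual win totals is the unknowable part.
sd_observed = 11.0
sd_bookie = 7.16

sd_unknowable = math.sqrt(sd_observed**2 - sd_bookie**2)
print(round(sd_unknowable, 2))   # about 8.35 -- bigger than the 7.16
```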


So, going back to ESPN: how did they do?  When I saw they didn't predict any teams higher than 93 wins, I suspected their SD would come out reasonable.  And, yes, it's OK -- 7.79 wins.  A little bit immodest, in my judgment, but not too bad.

I decided to check some others.  I did a Google search to find all the pre-season projections I could, and then added the one I found in Sports Illustrated, and the one from "ESPN The Magazine" (without their "chemistry" adjustments).  

Here they are, in order of [over]confidence:

  11.50 Sports Illustrated
   9.23 Mark Townsend (Yahoo!)
   9.00 Speed of Light (theoretical est.)
   8.76 Jeff Passan (Yahoo!)
   8.72 ESPN The Magazine 
   8.53 Jonah Keri (Grantland)
   8.51 Sports Illustrated (runs/10)
   8.00 Speed of Light (practical est.)
   7.83 Average ESPN voter (FiveThirtyEight)
   7.79 ESPN Website
   7.78 Mike Oz (Yahoo!)
   7.58 David Brown (Yahoo!)
   7.16 Vegas Betting Line
   6.90 Tim Brown (Yahoo!)
   6.38 Naive previous year method (est.)
   5.55 Fangraphs (4/7/14)
   0.00 Predict 81-81 for all teams

Anything that's not from an actual prediction is an estimate, as discussed in the post.  Even the theoretical "speed of light" is an estimate, since we arbitrarily chose 11.00 as the SD of observed wins.  None of the estimates are accurate to two decimals (or even one decimal), but I left them in to make the chart look nicer.

Sports Illustrated ran two sets of predictions: wins, and run differential.  You'd think they would have predicted wins by using Pythagoras or "10 runs equals one win", but they didn't.  It turns out that their runs estimates are much more reasonable than their win predictions.  

Fangraphs seems way too humble.  I'm not sure why.  Fangraphs updates their estimates every day, for the season's remaining games.  At time of writing, most teams had 156 games to go, so I pro-rated everything from 156 to 162.  Still, I think I did it right; their estimates are very narrow, with no team projected better than 90-72.

FiveThirtyEight got access to the raw projections of the ESPN voters, and ran a story on them.  It would be fun to look at the individual voters, and see how they ranged, but FiveThirtyEight gave us only the averages (which, after rounding, are identical to those published on the ESPN website).

If you have any others, let me know and I'll add them in.  


If you're a forecaster, don't think you need to figure out what your SD should be, and then adjust your predictions to it.  If you have a logical, reasonable algorithm, it should just work out.  Like when we realized we had to predict 81-81 for every team in the coin-flip league: we didn't need to ask, "hmmm, how do I get the SD to come out to zero?"  We just realized that we knew nothing, so 81-81 was the right call.  

The SD should be a check on your logic, not necessarily part of your algorithm.


Related links: 

-- Last year, Tango and commenters discussed this in some depth and tested the SDs of some 2013 predictions.  

-- Here's my post on why you shouldn't judge predictions by how well they match outliers.  

-- Yesterday, FiveThirtyEight used this argument in the context of not wanting to predict the inevitable big changes in the standings.

-- IIRC, Tango had a nice post about how narrow predictions and talent estimates should be, but I can't find it right now.  


Friday, March 21, 2014

ESPN quantifies clubhouse chemistry

"ESPN the Magazine" says they've figured out how to measure team chemistry. 

"For 150 years, "clubhouse chemistry" has been impossible to quantify. ... Until now.

"Working with group dynamics experts Katerina Bezrukova, an assistant professor at Santa Clara, and Chester Spell, an associate professor at Rutgers, we built a proprietary team-chemistry regression model.  Our algorithm combines three factors -- clubhouse demographics, trait isolation and stratification of performance to pay -- to discover how well MLB teams concoct positive chemistry.

"According to the regression model, teams that maximize these factors can produce a four-win swing during a season."

The article doesn't tell us much more about the algorithm, calling it "proprietary".  They do define the three factors, a bit:

"Clubhouse demographics" is "the impact from diversity, measured by age, tenure with the team, nationality, race, and position.  Teams with the highest scores have several overlapping groups based on shared traits and experiences."

"Trait isolation" is when you have so much diversity that some players are have too few teammates similar to them, and thus are "isolated" in the clubhouse.

"Stratification of performance to pay" -- or, "ego factor" -- is based on how many all-stars and highly-paid players the team has.  A happy medium is best.  Too few big shots creates "a lack of leadership," but too many creates conflict.

Sounds silly to me, but who knows?  The data might prove me wrong.  Unfortunately, the article provides no evidence at all, not even anecdotal.  


This is from the magazine's 2014 MLB preview issue, and their little twist is that they add the chemistry estimates onto the regular team projections.  For instance, Tampa Bay is expected to rack up 91 "pre-chem" wins.  But, their chemistry is worth an extra two victories, for a final projection of 93-69.  (.pdf)

But even if you accept that the regression got the effects of chemistry exactly right -- unlikely as that is, but just pretend -- there's an obvious problem here.

If Tampa's chemistry is worth two wins, how would those two wins manifest themselves?  They must show up in the stats somewhere, right?  It's not like the Rays lose 5-4 to the Red Sox, but the chemistry fairy comes along and says, "Tampa, you guys love each other so much that you're getting the win anyway."

The idea must be that if a team has better chemistry, they'll play better together.  They'll hit better, pitch better, and/or field better.  There are other possibilities -- I suppose they could get injured less, or the manager could strategize better, or an aging superstar might swallow his ego and accept a platoon role.  But, you'd think, most of the benefit should show up in the statistics.

But, if that's the case, chemistry must already be embedded in last year's statistics, on which the magazine based its "pre-chem" projections.  Since teams are largely the same from year to year, ESPN is adding this year's chemistry on top of last year's chemistry.  In other words, they're double counting.

Maybe Tampa's "chemistry" was +1.5 last year ... in that case, they've only improved on last year's chemistry by half a win, so you should put them at 91.5 wins, not 93.

It's possible, of course, that ESPN backed out last year's chemistry before adding this year's.  But they didn't say so, and the footnote on page 57 gives the impression that the projections came from outside sources.  


Here's another thing: every overall chemistry prediction adds up to a whole number.  Taking the Rays again, they get +0.1 wins in "ego", +1.7 in "demographics," and +0.2 in "isolation".  The total: +2 even.

There are a couple of teams ending in .9 or .1, which I assume are rounding errors, but the rest are .0.

How did that happen?  Maybe they rounded the wins and worked backwards to split them up?  


Another important problem: there are so many possible confounding factors that the effect ESPN found could be one of a million other things.

We can't know for sure, because we don't know what the model actually is.  But we can still see some possible issues.  Like age.  Performance declines as a player gets older, so, holding everything else equal, as a regression does, older teams will tend to look like they underperformed, while younger teams will look like they overperformed.

The regression's "demographic factor" explicitly looks at diversity of age.  The more players of diverse ages, they say, the better the chemistry.  

I did a quick check ... in 2008, the higher the diversity (SD) of batter ages, the older the team.  The five lowest SDs had a team average (integer) age of 27.8.  The seven highest had a team average of 29.1.  

Hmmm ... that goes the opposite way from the regression, which says that the older, high-diversity, high-chemistry teams do *better* than expected.  Anyway, the point remains: there's a hidden correlation there, and probably in the other "chemistry" measures, too.

A team with lots of All-Stars?  Probably older.  Few highly-paid players?  Probably younger with lots of turnover.  "Isolated" players?  Maybe a Yu Darvish, who plays for a good team that will do whatever it takes to win next year.  Lots of variation in nationality?  Maybe a team with a good scouting department or farm system, one that can easily fill holes.

You can probably think of lots of others.

Oh, wait, I found a good one.  

On page 46, ESPN says that high salary inequality is one of the things that's bad for chemistry.  In 2008, the five teams with the highest SD of batter salary had an average age of 30.4.  The seven teams with the lowest SD had an average age of 28.7.  

That one goes the right way.


Anyway, this is overkill, and I probably wouldn't have written it if I hadn't gotten so frustrated when I read ESPN's piece.  Geez, guys, you have as much right to draw dubious conclusions from fancy regressions by academic experts as anyone else.  But if the regression isn't already public, you've got to publish it.  At the very least, give us details about what you did.  "We figured out the right answer and you just have to trust us" just doesn't cut it.

Journalistically, "We have a secret algorithm that measures clubhouse chemistry" is the sports equivalent of, "We have a secret expert that proves Barack Obama was born in Kenya."

Labels: , , ,

Friday, March 14, 2014

Consumer Reports on extended warranties

When you buy a new car, the dealer will try to sell you an extended warranty.  It would add on to your regular warranty, extending coverage for a certain number of extra miles or months.  It's like extra "repair insurance" for your car.

Is it worth buying?  I've always wondered, so when I saw that Consumer Reports (CR) tackles the subject in this month's magazine, I thought it might help with the decision on my next car.  

It turns out that CR believes an extended warranty to be a bad purchase.  Alas, their logic is so flawed that I don't think their conclusion is supported at all.


Several months ago, CR surveyed 12,000 subscribers who had purchased an extended warranty for a 2006 to 2010 car, and asked them how satisfied they were.

But their answers don't tell us much, because of hindsight bias.  For instance, many customers were unsatisfied with their warranty *because their cars never broke down.*  

"100,000 miles came and went, and the car never needed any repairs other than regular maintenance.  What a waste!  I will never buy another extended warranty for a car," said Honda Civic owner Liz Garibaldi.

At the risk of stating the obvious: the fact that you didn't have a claim doesn't mean you shouldn't have bought the insurance.  In fact, you almost *always* hope not to use your coverage.  Your life insurance policy wasn't a waste of money just because you didn't die last year.  

You'd think CR would recognize the effects of hindsight bias.  They do, but concentrate only on the side that supports their position:  

"[Making more claims] probably helps customers feel more justified about having spent money for the coverage -- a bittersweet way to rationalize the purchase."

But doesn't it go both ways?  What about customers who had the opposite experience, like Liz Garibaldi?  Why doesn't CR say,

"[Not having had a claim] probably helps customers feel unjustified about having spent money for the coverage -- a convenient way to rationalize not making the purchase in future."

"Rationalize" is a biased word, one that presupposes faulty reasoning.  CR doesn't use the second quote because they don't believe you "rationalize" a correct decision.  If the article were about fire insurance, instead, you can be sure it would be the second quote that would have made it into print.


Buyer satisfaction isn't very useful in figuring out whether or not to buy an extended warranty, but what *would* be useful is to know what the coverage is actually worth.  On average, how much does an extended warranty save you, out of pocket?

In CR's survey, the median price of a warranty was $1,214.  The expected value of covered repairs would certainly be less than that, to allow for a sufficient margin of profit.  But how much less?  If it's only a bit less -- say, $100 -- it might be a small price to pay for peace of mind.  But if it's $1,000 less, then it's probably not worth it, even to the most risk-averse buyer.

So that's the key number, isn't it?  But CR doesn't give it to us.  They don't even try to estimate it, or acknowledge that it matters.  What they *do* tell us is:

-- 55 percent of owners didn't use their extended warranty at all;

-- of the other 45 percent, the median out-of-pocket savings was $837.

That helps a little, but not enough.  

First, why do they tell us the median but not the mean?  The mean is almost certainly higher than the median: a blown engine runs several thousand dollars.

Second, the survey includes warranties that are still in force, which means the final number will wind up higher once all the claims are in.  Wouldn't it have been better to restrict the sample to warranties that had already expired? 

From what we're given, all we can tell is that the value of the warranty is at least $376 (45 percent of $837).  But it's probably significantly more.  
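That lower bound is just arithmetic.  Here's the calculation as a sketch, using only CR's two published figures, and substituting the users' median savings for their unknown mean (which almost certainly understates the true number):

```python
# Rough lower bound on the expected value of an extended warranty,
# from CR's two published figures.  Assumes the 55% of non-users
# saved $0, and uses the users' *median* savings in place of their
# mean -- which likely understates the true expected value.
share_with_claims = 0.45            # 45 percent of owners used the warranty
median_savings_among_users = 837    # dollars, per CR's survey

lower_bound = share_with_claims * median_savings_among_users
print(int(lower_bound))  # 376 dollars
```

Against a median price of $1,214, that leaves at most $838 unaccounted for -- and probably much less, since the mean claim should exceed the median.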

It's very strange, the way CR laid it all out.  I don't know what was going on in their heads, but ... the most obvious possibility is that once they concluded that extended warranties were a bad buy, they chose to give only the numbers, arguments, and shadings that support their conclusion.   


In a box, on the first page of the article, in a huge font, is the number "26%".  That's the "percentage of consumers who would definitely buy the same extended warranty again."

Note the word "definitely."  I've seen that word in surveys before ... usually, it's when they give you several choices: "definitely," "probably," "not sure," "probably not," and "definitely not."

But CR doesn't tell us the options, or even how many there were.  So how can we interpret what the "26 percent" means?  If the other 74 percent said "probably," that's very different from if they said "definitely not."  

The "26%" on its own doesn't mean anything, but if you don't realize that -- or you miss the word "definitely" -- it sure looks damning.


On the second page of the article, there's a box with a picture of an unhappy looking customer named Helene Heller, who says,
"It was a horrific experience.  I feel like the dealer ripped me off."

Ms. Heller makes no other appearance in the article.  We never find out why the experience was horrific and why Ms. Heller feels cheated.  It sounds like what someone would say when the dealer refuses to honor a claim.  Perhaps there was a dispute over whether a particular problem was covered?  

Probably not, since the article doesn't actually raise any issues about claims being dishonored -- or, indeed, any issues other than cost and usage.  So, we're in the dark.

Expecting us to judge by anecdote is bad enough, but CR refuses to even give us the anecdote!


Another photo features Brent Lammers, who says,
"I feel like I probably paid too much for peace of mind."

That's a fair enough comment, but ... why choose that one?  Since 26 percent of buyers would "definitely" repeat their purchase, CR could easily have found a quote from someone who feels otherwise.  And was there nobody who blew a transmission and regretted *not* buying the warranty?

In any case, what's the relevance?  We don't want to know what Mr. Lammers feels.  We do want to know if he did, in fact, pay too much.  We never find out.


And then there's the title: "Extended warranties: An expensive gamble."

That's completely backwards!  The warranty may be expensive, but it isn't a gamble at all.  The gamble is in NOT buying the warranty!

A gamble is something that increases your risk.  That is: it increases the variance of possible outcomes.  You can have a sure $10 in your wallet (variance of zero), or you can bet it on a hand of blackjack (outcome from $0 to $25).  

If you buy the warranty, you're buying certainty: $1,214, or whatever, is what your repairs will cost.  If you don't, anything can happen: you could be out anywhere from zero to tens of thousands of dollars.  That's why the warranty gives some people "peace of mind" -- it eliminates the worry of the outlier, an extremely expensive repair bill.

Now, it's fine to accept the gamble if the warranty is too expensive.  There's only so much a reasonable person should be willing to pay to eliminate the risk.  But a justifiable gamble is still a gamble.
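The variance point is easy to demonstrate.  Here's a toy simulation -- the repair-cost distribution is entirely invented, just for illustration -- showing that declining the warranty is the choice with spread in its outcomes:

```python
# Toy illustration (invented numbers) of why *declining* the warranty
# is the gamble: it's the choice with the higher variance of outcomes.
import random
import statistics

random.seed(0)
warranty_price = 1214  # CR's median warranty price

def repair_cost():
    # Hypothetical distribution: most cars need nothing, some need a
    # moderate repair, and rarely there's a very expensive one.
    r = random.random()
    if r < 0.55:
        return 0.0
    elif r < 0.98:
        return random.uniform(100, 2000)
    else:
        return random.uniform(4000, 10000)

with_warranty = [warranty_price] * 10_000                 # cost is certain
without_warranty = [repair_cost() for _ in range(10_000)] # cost varies

print(statistics.stdev(with_warranty))     # 0.0 -- no risk at all
print(statistics.stdev(without_warranty))  # large -- that's the gamble
```

Buying the warranty pins the outcome at a single number; skipping it is what exposes you to the long right tail of repair bills.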

What CR was probably thinking, is something like this: "buying the warranty is a bad decision, and you could easily regret it if the car doesn't break.  That sounds like a gamble to us."  

But: a gamble is a gamble regardless of whether or not it's a good decision. If the manufacturers had a 60% off sale on an extended warranty, I bet CR would quickly stop calling it a gamble.  But a gamble doesn't stop being a gamble just because the odds shift in your favor.

Moreover, *any* decision can lead to regret over the alternative.  If a tipster suggests I put a month's pay on a certain longshot horse, I'll probably pass.  I will certainly regret that if the horse winds up winning at 100:1.  But my regret doesn't mean my refusal itself was a "gamble."  

There's some "status quo bias" here too.  Suppose that, historically, the extended warranty was built into the price of the car, but you could get $1,214 back in the mail by agreeing to get rid of it.  CR's headline would now have to read: "Keeping your six-year warranty instead of forfeiting half of it: an expensive gamble."

Less likely, I'd say.


There's one thing that CR didn't talk about that I wish it had.  When I got my car, I wondered if the warranty might pay for itself when I eventually sold the vehicle. 

Suppose that my $1,500 warranty has an actuarial value of only $1,000.  Wouldn't I recoup the extra $500 when I sell the car?  If I'm still covered, buyers will be able to expect that there are no significant problems with my car (because I'd have had them fixed for free).  And, since most warranties are transferable, they can even claim any repairs that I missed.

I know I'd pay a lot more than a $500 premium for a car still under warranty, and I bet others would too.

CR didn't consider that.  Still, if they had at least told us the actual value of the warranty, I might have been able to estimate it for myself.


OK, here's the final kicker.  After ragging on warranties with every argument they can think of for why they're not worth the money and you'll regret having bought one ... CR actually recommends you buy one!

"Consider an extended warranty for the long haul.  All cars tend to become less reliable over time, so an extended warranty might be worth considering if you're planning to keep your vehicle long after the factory warranty runs out."

What they seem to be saying is: we just spent hundreds and hundreds of words telling you, over and over, why extended warranties aren't worth it.  But we meant that only for short warranties.  For long warranties, you should buy one, because our reasoning doesn't apply any more!

Why the change of mind?  

It's true, of course, that cars break down more often when they're older.  But the prices of the warranties rise to reflect that.  Don't they?  Does CR believe that short warranties are overpriced, but longer ones are underpriced?

Obviously, the total value of repairs over a longer warranty is going to be higher than for a shorter warranty.  That means the median value of covered repairs -- which, since 55 percent of owners never used theirs, must have been zero -- might move into positive territory.  Is that the difference, the higher median?

In their survey, CR likely found fewer buyers regretting their warranty purchase when they had more time to use it.  Did that change their thinking, the higher satisfaction scores?

Maybe it's an editing issue, and they just said it wrong.  The quote actually appears in a separate section with separate advice for people who *do* decide to buy a warranty.  Maybe they're just saying, "if you *must* buy one, at least buy a long one."  (But it sure doesn't read that way.)

Perhaps they didn't change their mind at all.  Maybe the extra section and the article were written by two different people, with two different opinions.  Or they just cut and pasted their "buy" advice from a previous article, one written before their recent survey data came in.

Or, they're just going with their gut.  Before writing the article, the CR editorial staff thinks, "Newer cars don't break much, so extended warranties are dumb.  But older cars have lots of problems, so it makes sense then."  And then it's all confirmation bias from there.

I really have no idea.


Sometimes, it seems like Consumer Reports is two different magazines.  You've got the product ratings, which are written by respected sabermetricians.  Then you've got the advice and investigative pieces, which are written by outraged sportswriters who are sure that Jack Morris needs to be in the Hall of Fame, and have the hand-selected numbers to prove it, and logic and reason be damned.

Can I buy just half a subscription?

(Some of my previous Consumer Reports posts are here.)

Labels: ,

Friday, March 07, 2014

Rating systems and rationalizations

The Bill Mazeroski Baseball Annuals, back in the 80s, had a rating system that didn't make much sense.  They'd assign each team a bunch of grades, and the final rating would be the total.  So you'd get something out of 10 for outfielders, something out of 10 for the starting pitching staff, and something out of 10 for the manager, and so on.  Which meant, the manager made as much difference as all three starting outfielders combined.

Maclean's magazine's ratings of Canadian universities are just as dubious.  Same idea: rate a bunch of categories, and add them up.   I've been planning to write about that one for a while, but I just discovered that Malcolm Gladwell already took care of it, three years ago, for similar American rankings.

In the same article, Gladwell also critiques the rating system used by "Car and Driver" magazine.

In 2010, C&D ran a "comparo" of three sports cars -- the Chevy Corvette Grand Sport, the Lotus Evora, and the Porsche Cayman S.  The Cayman won by several points:

193 Cayman
186 Corvette
182 Evora

But, Gladwell points out, the final score is heavily dependent on the weights of the categories used.  Car and Driver assigned only four percent of the available points to exterior styling.  That makes no sense: "Has anyone buying a sports car ever placed so little value on how it looks?"

Gladwell then notes that if you re-jig the weightings to make looks more important, the Evora comes out on top:

205 Evora
198 Cayman
192 Corvette

Also, how important is price?  The cost of the car counted for only ten percent of C&D's rating.  For normal buyers, though, price is one of the most important criteria.  What happens when Gladwell increases that weighting relative to the others?

Now, the Corvette wins.

205 Corvette
195 Evora
185 Cayman
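Gladwell's re-weighting exercise is easy to reproduce in miniature.  The category scores below are invented for illustration (the article doesn't give C&D's actual breakdowns), but they show the same instability: pick the weights, pick the winner.

```python
# Sketch of Gladwell's point, with *invented* category scores:
# the same three cars finish in any order, depending on the weights.
cars = {
    "Cayman":   {"performance": 9.5, "styling": 8.0, "price": 6.0},
    "Corvette": {"performance": 9.0, "styling": 7.0, "price": 8.5},
    "Evora":    {"performance": 8.5, "styling": 9.5, "price": 6.5},
}

def rank(weights):
    # Weighted total for each car, sorted best-first.
    totals = {car: sum(w * scores[cat] for cat, w in weights.items())
              for car, scores in cars.items()}
    return sorted(totals, key=totals.get, reverse=True)

# C&D-style: styling barely counts
print(rank({"performance": 0.86, "styling": 0.04, "price": 0.10}))
# -> ['Cayman', 'Corvette', 'Evora']

# Looks matter more: the Evora wins
print(rank({"performance": 0.50, "styling": 0.40, "price": 0.10}))
# -> ['Evora', 'Cayman', 'Corvette']

# Price matters more: the Corvette wins
print(rank({"performance": 0.40, "styling": 0.15, "price": 0.45}))
# -> ['Corvette', 'Evora', 'Cayman']
```

Three defensible weightings, three different winners -- with the very same underlying scores.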


Why does this happen?  Gladwell argues that it's because Car and Driver insists on using the same weightings for every car in every issue in every test.  It may be reasonable for looks to count for only four percent when you're buying an econobox, but it's much more important for image cars like the Porsche.  

"The magazine’s ambition to create a comprehensive ranking system—one that considered cars along twenty-one variables, each weighted according to a secret sauce cooked up by the editors—would also be fine, as long as the cars being compared were truly similar. ...  A ranking can be heterogeneous, in other words, as long as it doesn’t try to be too comprehensive. And it can be comprehensive as long as it doesn’t try to measure things that are heterogeneous. "

I think Gladwell goes a bit too easy on Car and Driver.  I don't think the entire problem is that the system tries to be overbroad.  I think a big part of the problem is that, unless you're measuring something real, *every* weighting system is arbitrary.  It's true that a system that works well for family sedans might not work nearly as well for luxury cars, but it's also true that the system doesn't necessarily work for either of them separately, anyway!

It's like ... rating baseball players by RBIs.  Sure, it's true that this system is inappropriate for comparing cleanup hitters to leadoff men.  But even if you limit your evaluation to cleanup hitters, it still doesn't do a very good job. 

In fact, Gladwell shows that explicitly in the car example.  His two alternative weighting systems are each perfectly defensible, even within the category of "sports car".  Which is better?  Which is right?  Neither!  There is no right answer.

What I'd conclude, from Gladwell's example, is that rating systems are inappropriate for making fine distinctions.  Any reasonable system can tell the good cars from the bad, but there's no way an arbitrary evaluation process can tell whether the Evora is better than the Porsche.  It would always be too sensitive to the weightings.

In fact, you can always make the result come out either way, and there's no way to tell which one is "right."  There's no "right" at all, because "better" has no actual definition.  Your inexpressible intuitive view of "better" might involve a big role for looks, while mine might be more weighted to handling.  Neither of us is wrong.

However: most people's definitions of "better" aren't *that* far from each other.  We may not be able to agree whether the Porsche is better than the Corvette, but we definitely can agree that both are better than the Yugo.  Any reasonable system should wind up with the same result.

Which, in general, is what rating systems are usually good for: identifying *large* differences.  I may not believe Consumer Reports that the Sonata (89/100) is better than the Passat (80) ... but I should be able to trust them when they say the Camry (92) is better than the Avenger (43).  


In the March, 2014, issue, Car and Driver compares six electric cars.  The winner was the Chevrolet Spark EV, with 181 points out of 225.  The second-place Ford Focus Electric was only eight points behind, at 173.

That's pretty typical, that the numerical ratings are close.  They're always much closer than they are in Consumer Reports.  I dug out a few back issues of C&D, and jotted down the comparo scores.  Each row below is a different test:

189 - 164
206 - 201 - 200 - 192
220 - 205
196 - 190 - 184 - 179

All are pretty close -- the biggest gap from first to last is 15 percent.  Although, I deliberately left out the March issue: there, the gap is bigger, mostly because of the electric Smart car, which they didn't like at all:

181 - 173 - 161 - 157 - 153 - 126

Leaving out the Smart, the difference between first and last is 18 percent.  (For the record: Consumer Reports didn't rate the electric Smart, but they gave the regular one only 28/100, the lowest score of any car in their ratings.)

Anyway, as I said, the Spark beat the Focus by only 8 ratings points, or five percent.  But, if you read the evaluations of those two cars ... the editors like the Spark *a lot more* than the Focus.  

Of the Spark, they say,

"Here's a car that puts it all together ... It's a total effort, a studied application of brainpower and enthusiasm that embraces the electric mandate with gusto ... Everything about the Spark is all-in. ... It is the one gold that sparkles."

But they're much more muted when it comes to the Focus, even in their compliments:

"The most natural-feeling of our EVs, the Focus delivers a smooth if somewhat muted rush of torque and has excellent brakes. ... At low speeds ... you can catch the motor clunking ... but otherwise the Focus feels solid and well integrated. ... What the Focus Electric really does best is give you a reason to go test drive the top-of-the-line gas-burning Focus."

When Car and Driver actually tells you what they think, it sounds like the cars are worlds apart.  All that for eight points?  Actually, it's worse than that: the Spark had a price advantage of seven points.  So, when it comes to the car itself, the Chevy wins by only *one point* -- but gets much, much more appreciation and plaudits.

What's going on?  Gladwell thinks C&D is putting too much faith in its own rating:

"Yet when you inspect the magazine’s tabulations it is hard to figure out why Car and Driver was so sure that the Cayman is better than the Corvette and the Evora."

I suspect, though, that it's the other way around: after they drive the cars, they decide which they liked best, then tailor the ratings to come out in the right order.  I suspect that, if the ratings added up to make the Focus the best, they'd say to each other, "Well, that's not right!  There must be something wrong with our numbers."  And they'd rejig the category scores to make it work out.

Which probably isn't too hard to do, because, I suspect, the system is deliberately designed to keep the ratings close.  That way, every car looks almost as good as the best, and everybody gets a good grade.  A Ford salesman can tell his customer, "Look, we finished second, but only by 8 points ... and, 7 of them were price!  And look at all the categories we beat them in!"

That doesn't mean the competition is biased.  The magazine is just making sure the best car wins.  Car and Driver is my favorite car magazine, and I think the raters really know their stuff.  I don't want the winner to go to the highest scorer of an arbitrary point system ... I want the winner to be the one the magazine thinks is best.  That's why I'm reading the article, to get the opinions of the experts.  

So, they're not "fixing" the competition, as in making sure the wrong car wins.  They ARE "fixing" the ratings --  but in the sense of "repairing" them.  Because, if you know the Spark is the best, but it doesn't rate the highest, you must have scored it wrong!  Well, actually, you must have chosen a lousy rating system ... but, in this case, the writer is stuck with the longstanding C&D standard.


"Fixing" the algorithm to match your intuition is probably a standard feature of ranking systems.  In baseball, we've seen the pattern before ... someone decides that Jim Rice is underrated, and tries to come up with a rating that slots him where his gut says he should be slotted.  Maybe give more weight to RBIs and batting average, and less to longevity.  Maybe add in something for MVP votes, and lower the weighting for GIDP.  Eventually, you get to a weighting that puts Jim Rice about as high as you'd like him to be, and that's your system.

And it doesn't feel like you're cheating, because, after all, you KNOW that Jim Rice belongs there!  And, look, Babe Ruth is at the top, and so is Ted Williams, and a whole bunch of other HOFers.  This, then, must be the right system!

That's what always has to happen, isn't it?  Whether you're rating cars, or schools, or student achievement, or fame, or beauty, or whatever ... nobody just jots a system down and goes with it.  You try it, and you see, "well, that one puts all the small cars at the top, so we've rated fuel economy too high."  So you adjust.  And now you see, "well, now all the luxury cars rate highest, so we better increase the weighting for price."  And so on, until you look at the results, and they seem right to you, and the Jim Ricemobile is in its proper place.

That's another reason I hate arbitrary rankings: they're almost always set to fit the designer's preconceptions.  To a certain extent, rating systems are just elaborate rationalizations.

Labels: , , ,