Sabermetric Research: March 2014

Friday, March 21, 2014

ESPN quantifies clubhouse chemistry

"ESPN the Magazine" says they've figured out how to measure team chemistry.

"For 150 years, "clubhouse chemistry" has been impossible to quantify. ... Until now.

"Working with group dynamics experts Katerina Bezrukova, an assistant professor at Santa Clara, and Chester Spell, an associate professor at Rutgers, we built a proprietary team-chemistry regression model. Our algorithm combines three factors -- clubhouse demographics, trait isolation and stratification of performance to pay -- to discover how well MLB teams concoct positive chemistry.

"According to the regression model, teams that maximize these factors can produce a four-win swing during a season."

The article doesn't tell us much more about the algorithm, calling it "proprietary". They do define the three factors, a bit:

"Clubhouse demographics" is "the impact from diversity, measured by age, tenure with the team, nationality, race, and position. Teams with the highest scores have several overlapping groups based on shared traits and experiences."

"Trait isolation" is when you have so much diversity that some players have too few teammates similar to them, and thus are "isolated" in the clubhouse.

"Stratification of performance to pay" -- or, "ego factor" -- is based on how many all-stars and highly-paid players the team has. A happy medium is best. Too few big shots creates "a lack of leadership," but too many creates conflict.

Sounds silly to me, but who knows? The data might prove me wrong. Unfortunately, the article provides no evidence at all, not even anecdotal.

----

This is from the magazine's 2014 MLB preview issue, and their little twist is that they add the chemistry estimates onto the regular team projections. For instance, Tampa Bay is expected to rack up 91 "pre-chem" wins. But, their chemistry is worth an extra two victories, for a final projection of 93-69. (.pdf)

But even if you accept that the regression got the effects of chemistry exactly right -- unlikely as that is, but just pretend -- there's an obvious problem here.

If Tampa's chemistry is worth two wins, how would those two wins manifest themselves? They must show up in the stats somewhere, right? It's not like the Rays lose 5-4 to the Red Sox, but the chemistry fairy comes along and says, "Tampa, you guys love each other so much that you're getting the win anyway."

The idea must be that if a team has better chemistry, they'll play better together. They'll hit better, pitch better, and/or field better. There are other possibilities -- I suppose they could get injured less, or the manager could strategize better, or an aging superstar might swallow his ego and accept a platoon role. But, you'd think, most of the benefit should show up in the statistics.

But, if that's the case, chemistry must already be embedded in last year's statistics, on which the magazine based its "pre-chem" projections. Since teams are largely the same from year to year, ESPN is adding this year's chemistry on top of last year's chemistry. In other words, they're double counting.

Maybe Tampa's "chemistry" was +1.5 last year ... in that case, they've only improved their 2013 chemistry by half a win, so you should put them at 91.5 wins, not 93.
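To make the double-counting arithmetic concrete, here's a small sketch. The 91 and +2 are ESPN's published numbers for Tampa Bay; the +1.5 for last year is only the hypothetical from the paragraph above, and the function name is made up for illustration.

```python
def corrected_projection(pre_chem_wins, chem_this_year, chem_last_year):
    """Add only the *change* in chemistry, on the assumption that last
    year's chemistry is already baked into the stats behind the baseline."""
    return pre_chem_wins + (chem_this_year - chem_last_year)

# ESPN's approach: the full chemistry estimate goes on top of the baseline.
espn_total = 91 + 2

# Hypothetical: last year's chemistry was already worth +1.5 wins.
adjusted_total = corrected_projection(91, 2.0, 1.5)

print(espn_total, adjusted_total)
```

If last year's chemistry were zero, the two numbers would agree; the gap between them is exactly the part that would be counted twice.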

It's possible, of course, that ESPN backed out last year's chemistry before adding this year's. But they didn't say so, and the footnote on page 57 gives the impression that the projections came from outside sources.

-----

Here's another thing: every overall chemistry prediction adds up to a whole number. Taking the Rays again, they get +0.1 wins in "ego", +1.7 in "demographics," and +0.2 in "isolation". The total: +2 even.

There are a couple of teams ending in .9 or .1, which I assume are rounding errors, but the rest are .0.

How did that happen? Maybe they rounded the wins and worked backwards to split them up?
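For a rough sense of how unusual that is, here's a sketch that counts how often three one-decimal components would sum to a whole number by chance. The assumption is mine, not ESPN's: each component's tenths digit is treated as independent and equally likely to be 0 through 9. The 30 is the number of MLB teams.

```python
from itertools import product

# Assumption: tenths digits of the three factors are independent and uniform.
tenths = range(10)
combos = list(product(tenths, repeat=3))

# A total is a whole number when the three tenths digits sum to a multiple of 10.
whole_number_combos = sum(1 for c in combos if sum(c) % 10 == 0)
share = whole_number_combos / len(combos)

teams = 30
expected_whole_totals = teams * share

print(f"{whole_number_combos} of {len(combos)} combinations give a whole number")
print(f"expected teams with a whole-number total: about {expected_whole_totals:.0f} of {teams}")
```

Under that assumption you'd expect only a handful of teams to land on a whole number, not nearly all of them, which fits the guess that the totals were rounded first and split up afterwards.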

-----

Another important problem: there are so many possible confounding factors that the effect ESPN found could be one of a million other things.

We can't know for sure, because we don't know what the model actually is. But we can still see some possible issues. Like age. Performance declines as a player gets older, so, holding everything else equal, as a regression does, older teams will tend to look like they underperformed, while younger teams will look like they overperformed.

The regression's "demographic factor" explicitly looks at diversity of age. The more players of diverse ages, they say, the better the chemistry.

I did a quick check ... in 2008, the higher the diversity (SD) of batter ages, the older the team. The five lowest SDs had a team average (integer) age of 27.8. The seven highest had a team average of 29.1.

Hmmm ... that goes the opposite way from the regression, which says that the older, high-diversity, high-chemistry teams do *better* than expected. Anyway, the point remains: there's a hidden correlation there, and probably in the other "chemistry" measures, too.

A team with lots of All-Stars? Probably older. Few highly-paid players? Probably younger with lots of turnover. "Isolated" players? Maybe a Yu Darvish, who plays for a good team that will do whatever it takes to win next year. Lots of variation in nationality? Maybe a team with a good scouting department or farm system, that can easily fill holes.

You can probably think of lots of others.

Oh, wait, I found a good one.

On page 46, ESPN says that high salary inequality is one of the things that's bad for chemistry. In 2008, the five teams with the highest SD of batter salary had an average age of 30.4. The seven teams with the lowest SD had an average age of 28.7.

That one goes the right way.

-----

Anyway, this is overkill, and I probably wouldn't have written it if I hadn't gotten so frustrated when I read ESPN's piece. Geez, guys, you have as much right to draw dubious conclusions from fancy regressions by academic experts as anyone else. But if the regression isn't already public, you've got to publish it. At the very least, give us details about what you did. "We figured out the right answer and you just have to trust us" just doesn't cut it.

Journalistically, "We have a secret algorithm that measures clubhouse chemistry" is the sports equivalent of, "We have a secret expert that proves Barack Obama was born in Kenya."

Labels: baseball, chemistry, ESPN, forecasting

Friday, March 14, 2014

Consumer Reports on extended warranties

When you buy a new car, the dealer will try to sell you an extended warranty. It would add on to your regular warranty, extending coverage for a certain number of extra miles or months. It's like extra "repair insurance" for your car.

Is it worth buying? I've always wondered, so when I saw that Consumer Reports (CR) tackles the subject in this month's magazine, I thought it might help with the decision on my next car.

It turns out that CR believes an extended warranty to be a bad purchase. Alas, their logic is so flawed that I don't think their conclusion is supported at all.

-----

Several months ago, CR surveyed 12,000 subscribers who had purchased an extended warranty for a 2006 to 2010 car, and asked them how satisfied they were.

But their answers don't tell us much, because of hindsight bias. For instance, many customers were unsatisfied with their warranty *because their cars never broke down.*

"100,000 miles came and went, and the car never needed any repairs other than regular maintenance. What a waste! I will never buy another extended warranty for a car," said Honda Civic owner Liz Garibaldi.

At the risk of stating the obvious: the fact that you didn't have a claim doesn't mean you shouldn't have bought the insurance. In fact, you almost *always* hope not to use your coverage. Your life insurance policy wasn't a waste of money just because you didn't die last year.

You'd think CR would recognize the effects of hindsight bias. They do, but concentrate only on the side that supports their position:

"[Making more claims] probably helps customers feel more justified about having spent money for the coverage -- a bittersweet way to rationalize the purchase."

But doesn't it go both ways? What about customers who had the opposite experience, like Liz Garibaldi? Why doesn't CR say,

"[Not having had a claim] probably helps customers feel unjustified about having spent money for the coverage -- a convenient way to rationalize not making the purchase in future."

"Rationalize" is a biased word, one that presupposes faulty reasoning. CR doesn't use the second quote because they don't believe you "rationalize" a correct decision. If the article were about fire insurance, instead, you can be sure it would be the second quote that would have made it into print.

-----

Buyer satisfaction isn't very useful in figuring out whether or not to buy an extended warranty, but what *would* be useful is to know what the coverage is actually worth. On average, how much does an extended warranty save you, out of pocket?

In CR's survey, the median price of a warranty was $1,214. The expected value of covered repairs would certainly be less than that, to allow for a sufficient margin of profit. But how much less? If it's only a bit less -- say, $100 -- it might be a small price to pay for peace of mind. But if it's $1,000 less, then it's probably not worth it, even to the most risk-averse buyer.

So that's the key number, isn't it? But CR doesn't give it to us. They don't even try to estimate it, or acknowledge that it matters. What they *do* tell us is:

-- 55 percent of owners didn't use their extended warranty at all;

-- of the other 45 percent, the median out-of-pocket savings was $837.

That helps a little, but not enough.

First, why do they tell us the median but not the mean? The mean is almost certainly higher than the median: a blown engine runs several thousand dollars.

Second, the survey includes warranties that are still in force, which means the final number will wind up higher once all the claims are in. Wouldn't it have been better to restrict the sample to warranties that had already expired?

From what we're given, all we can tell is that the value of the warranty is at least $376 (45 percent of $837). But it's probably significantly more.
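Here's the floor calculation spelled out, along with a toy illustration of why the mean matters. The 45 percent, $837, and $1,214 come from CR's survey as quoted above. The list of claim amounts is invented (it is NOT CR data), picked only so that its median is $837 and one big repair sits in the tail.

```python
from statistics import mean, median

share_with_claim = 0.45   # CR: 45 percent of owners used the warranty
median_savings = 837      # CR: median savings among owners who made a claim
median_price = 1214       # CR: median price paid for the warranty

# Floor from the post: treat the median as a stand-in for the mean.
floor_value = share_with_claim * median_savings
print(f"floor: ${floor_value:.2f} against a ${median_price} price")

# Invented claim amounts (NOT CR data): median of $837, one large repair.
claims = [300, 500, 837, 1200, 6000]
value_from_mean = share_with_claim * mean(claims)

print(f"median claim: ${median(claims)}, mean claim: ${mean(claims):.2f}")
print(f"value using the mean: ${value_from_mean:.2f}")
```

With a single expensive repair in the mix, the mean pulls well away from the median, and the per-buyer value of the warranty rises with it. How far it rises in CR's real data is exactly what the article doesn't tell us.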

It's very strange, the way CR laid it all out. I don't know what was going on in their heads, but ... the most obvious possibility is that once they concluded that extended warranties were a bad buy, they chose to give only the numbers, arguments, and shadings that support their conclusion.

-----

In a box, on the first page of the article, in a huge font, is the number "26%". That's the "percentage of consumers who would definitely buy the same extended warranty again."

Note the word "definitely." I've seen that word in surveys before ... usually, it's when they give you several choices: "definitely," "probably," "not sure," "probably not," and "definitely not."

But CR doesn't tell us the options, or even how many there were. So how can we interpret what the "26 percent" means? If the other 74 percent said "probably," that's very different from if they said "definitely not."

The "26%" on its own doesn't mean anything, but if you don't realize that -- or you miss the word "definitely" -- it sure looks damning.

-----

On the second page of the article, there's a box with a picture of an unhappy looking customer named Helene Heller, who says,

"It was a horrific experience. I feel like the dealer ripped me off."

Ms. Heller makes no other appearance in the article. We never find out why the experience was horrific and why Ms. Heller feels cheated. It sounds like what someone would say when the dealer refuses to honor a claim. Perhaps there was a dispute over whether a particular problem was covered?

Probably not, since the article doesn't actually raise any issues about claims being dishonored -- or, indeed, any issues other than cost and usage. So, we're in the dark.

Expecting us to judge by anecdote is bad enough, but CR refuses to even give us the anecdote!

-----

Another photo features Brent Lammers, who says,

"I feel like I probably paid too much for peace of mind."

That's a fair enough comment, but ... why choose that one? Since 26 percent of buyers would "definitely" repeat their purchase, CR could have easily found a quote from someone who feels otherwise. And was there nobody who blew a transmission and regretted *not* buying the warranty?

In any case, what's the relevance? We don't want to know what Mr. Lammers feels. We do want to know if he did, in fact, pay too much. We never find out.

-----

And then there's the title: "Extended warranties: An expensive gamble."

That's completely backwards! The warranty may be expensive, but it isn't a gamble at all. The gamble is in NOT buying the warranty!

A gamble is something that increases your risk. That is: it increases the variance of possible outcomes. You can have a sure $10 in your wallet (variance of zero), or you can bet it on a hand of blackjack (outcome from $0 to $25).

If you buy the warranty, you're buying certainty: $1,214, or whatever, is what your repairs will cost. If you don't, anything can happen: you could be out anywhere from zero to tens of thousands of dollars. That's why the warranty gives some people "peace of mind" -- it eliminates the worry of the outlier, an extremely expensive repair bill.
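The variance point can be sketched in a few lines. The $1,214 is CR's median price; the list of repair-cost scenarios is invented for illustration (ten equally likely outcomes, not real claims data), and the sketch assumes the warranty covers every repair in full.

```python
from statistics import mean, pvariance

warranty_price = 1214  # CR's median price

# Invented, equally likely repair bills for an owner with no warranty.
repair_scenarios = [0, 0, 0, 0, 0, 400, 800, 1500, 3000, 12000]

# With the warranty (and full coverage assumed), the cost is the same every time.
cost_with_warranty = [warranty_price for _ in repair_scenarios]
cost_without_warranty = repair_scenarios

var_with = pvariance(cost_with_warranty)
var_without = pvariance(cost_without_warranty)

print(f"with warranty:    mean ${mean(cost_with_warranty):.0f}, variance {var_with}")
print(f"without warranty: mean ${mean(cost_without_warranty):.0f}, variance {var_without:.0f}")
```

Whatever numbers you plug in for the uncovered owner, the covered owner's variance is zero. Whether giving up the spread is worth the price is a separate question, and it depends on the mean, which is the number CR never gives.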

Now, it's fine to accept the gamble if the warranty is too expensive. There's only so much a reasonable person should be willing to pay to eliminate the risk. But a justifiable gamble is still a gamble.

What CR was probably thinking is something like this: "buying the warranty is a bad decision, and you could easily regret it if the car doesn't break. That sounds like a gamble to us."

But: a gamble is a gamble regardless of whether or not it's a good decision. If the manufacturers had a 60% off sale on an extended warranty, I bet CR would quickly stop calling it a gamble. But a gamble doesn't stop being a gamble just because the odds shift in your favor.

Moreover, *any* decision can lead to regret over the alternative. If a tipster suggests I put a month's pay on a certain longshot horse, I'll probably pass. I will certainly regret that if the horse winds up winning at 100:1. But my regret doesn't mean my refusal itself was a "gamble."

There's some "status quo bias" here too. Suppose that, historically, the extended warranty was built into the price of the car, but you could get $1,214 back in the mail by agreeing to get rid of it. CR's headline would now have to read: "Keeping your six-year warranty instead of forfeiting half of it: an expensive gamble."

Less likely, I'd say.

-----

There's one thing that CR didn't talk about that I wish it had. When I got my car, I wondered if the warranty might pay for itself when I eventually sold the vehicle.

Suppose that my $1,500 warranty has an actuarial value of only $1,000. Wouldn't I recoup the extra $500 when I sell the car? If I'm still covered, buyers will be able to expect that there are no significant problems with my car (because I'd have had them fixed for free). And, since most warranties are transferable, they can even claim any repairs that I missed.

I know I'd pay a lot more than a $500 premium for a car still under warranty, and I bet others would too.

CR didn't consider that. Still, if they had at least told us the actual value of the warranty, I might have been able to estimate it for myself.

-----

OK, here's the final kicker. After ragging on warranties with every argument they can think of for why they're not worth the money and you'll regret having bought one ... CR actually recommends you buy one!

"Consider an extended warranty for the long haul. All cars tend to become less reliable over time, so an extended warranty might be worth considering if you're planning to keep your vehicle long after the factory warranty runs out."

What they seem to be saying is: we just spent hundreds and hundreds of words telling you, over and over, why extended warranties aren't worth it. But we meant that only for short warranties. For long warranties, you should buy one, because our reasoning doesn't apply any more!

Why the change of mind?

It's true, of course, that cars break down more often when they're older. But the prices of the warranties rise to reflect that. Don't they? Does CR believe that short warranties are overpriced, but longer ones are underpriced?

Obviously, the total value of repairs over a longer warranty is going to be higher than for a shorter warranty. That means the median value of covered repairs -- which CR earlier told us is zero -- might move into positive territory. Is that the difference, the higher median?

In their survey, CR likely found fewer buyers regretting their warranty purchase when they had more time to use it. Did that change their thinking, the higher satisfaction scores?

Maybe it's an editing issue, and they just said it wrong. The quote actually appears in a separate section with separate advice for people who *do* decide to buy a warranty. Maybe they're just saying, "if you *must* buy one, at least buy a long one." (But it sure doesn't read that way.)

Perhaps they didn't change their mind at all. Maybe the extra section and the article were written by two different people, with two different opinions. Or they just cut and pasted their "buy" advice from a previous article, one written before their recent survey data came in.

Or, they're just going with their gut. Before writing the article, the CR editorial staff thinks, "Newer cars don't break much, so extended warranties are dumb. But older cars have lots of problems, so it makes sense then." And then it's all confirmation bias from there.

I really have no idea.

------

Sometimes, it seems like Consumer Reports is two different magazines. You've got the product ratings, which are written by respected sabermetricians. Then you've got the advice and investigative pieces, which are written by outraged sportswriters who are sure that Jack Morris needs to be in the Hall of Fame, and have the hand-selected numbers to prove it, and logic and reason be damned.

Can I buy just half a subscription?

(Some of my previous Consumer Reports posts are here.)

Labels: Consumer Reports, gambling

Friday, March 07, 2014

Rating systems and rationalizations

The Bill Mazeroski Baseball Annuals, back in the 80s, had a rating system that didn't make much sense. They'd assign each team a bunch of grades, and the final rating would be the total. So you'd get something out of 10 for outfielders, something out of 10 for the starting pitching staff, and something out of 10 for the manager, and so on. Which meant, the manager made as much difference as all three starting outfielders combined.

Maclean's magazine's ratings of Canadian universities are just as dubious. Same idea: rate a bunch of categories, and add them up. I've been planning to write about that one for a while, but I just discovered that Malcolm Gladwell already took care of it, three years ago, for similar American rankings.

In the same article, Gladwell also critiques the rating system used by "Car and Driver" magazine.

In 2010, C&D ran a "comparo" of three sports cars -- the Chevy Corvette Grand Sport, the Lotus Evora, and the Porsche Cayman S. The Cayman won by several points:

193 Cayman
186 Corvette
182 Evora

But, Gladwell points out, the final score is heavily dependent on the weights of the categories used. Car and Driver assigned only four percent of the available points to exterior styling. That makes no sense: "Has anyone buying a sports car ever placed so little value on how it looks?"

Gladwell then notes that if you re-jig the weightings to make looks more important, the Evora comes out on top:

205 Evora
198 Cayman
192 Corvette

Also, how important is price? The cost of the car counted for only ten percent of C&D's rating. For normal buyers, though, price is one of the most important criteria. What happens when Gladwell increases that weighting relative to the others?

Now, the Corvette wins.

205 Corvette
195 Evora
185 Cayman
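The mechanism is easy to reproduce with a toy example. Everything below is invented: the three cars are hypothetical, the category scores are made up, and the three-category layout is a simplification (Car and Driver uses many more categories, and Gladwell's actual reweightings aren't given here). The point is only that the same scores produce three different winners under three defensible weightings.

```python
# Invented 0-10 category scores for three hypothetical cars.
scores = {
    "Car A": {"performance": 9, "styling": 6, "price": 5},
    "Car B": {"performance": 7, "styling": 9, "price": 6},
    "Car C": {"performance": 8, "styling": 5, "price": 9},
}

def rank(weights):
    """Return car names ordered from highest to lowest weighted total."""
    totals = {
        car: sum(weights[cat] * value for cat, value in cats.items())
        for car, cats in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

# Three weightings, each summing to 10 points.
performance_heavy = {"performance": 8, "styling": 1, "price": 1}
styling_heavy = {"performance": 3, "styling": 6, "price": 1}
price_heavy = {"performance": 3, "styling": 1, "price": 6}

for name, weights in [("performance", performance_heavy),
                      ("styling", styling_heavy),
                      ("price", price_heavy)]:
    print(f"{name}-heavy weights: {rank(weights)}")
```

None of the three weightings is obviously wrong, and each one crowns a different car.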

------

Why does this happen? Gladwell argues that it's because Car and Driver insists on using the same weightings for every car in every issue in every test. It may be reasonable for looks to count for only four percent when you're buying an econobox, but it's much more important for image cars like the Porsche.

"The magazine’s ambition to create a comprehensive ranking system—one that considered cars along twenty-one variables, each weighted according to a secret sauce cooked up by the editors—would also be fine, as long as the cars being compared were truly similar. ... A ranking can be heterogeneous, in other words, as long as it doesn’t try to be too comprehensive. And it can be comprehensive as long as it doesn’t try to measure things that are heterogeneous. "

I think Gladwell goes a bit too easy on Car and Driver. I don't think the entire problem is that the system tries to be overbroad. I think a big part of the problem is that, unless you're measuring something real, *every* weighting system is arbitrary. It's true that a system that works well for family sedans might not work nearly as well for luxury cars, but it's also true that the system doesn't necessarily work for either of them separately, anyway!

It's like ... rating baseball players by RBIs. Sure, it's true that this system is inappropriate for comparing cleanup hitters to leadoff men. But even if you limit your evaluation to cleanup hitters, it still doesn't do a very good job.

In fact, Gladwell shows that explicitly in the car example. His two alternative weighting systems are each perfectly defensible, even within the category of "sports car". Which is better? Which is right? Neither! There is no right answer.

What I'd conclude, from Gladwell's example, is that rating systems are inappropriate for making fine distinctions. Any reasonable system can tell the good cars from the bad, but there's no way an arbitrary evaluation process can tell whether the Evora is better than the Porsche. It would always be too sensitive to the weightings.

In fact, you can always make the result come out either way, and there's no way to tell which one is "right." In fact, there's no "right" at all, because "better" has no actual definition. Your inexpressible intuitive view of "better" might involve a big role for looks, while mine might be more weighted to handling. Neither of us is wrong.

However: most people's definitions of "better" aren't *that* far from each other. We may not be able to agree whether the Porsche is better than the Corvette, but we definitely can agree that both are better than the Yugo. Any reasonable system should wind up with the same result.

Which, in general, is what rating systems are usually good for: identifying *large* differences. I may not believe Consumer Reports that the Sonata (89/100) is better than the Passat (80) ... but I should be able to trust them when they say the Camry (92) is better than the Avenger (43).

-------

In the March, 2014, issue, Car and Driver compares six electric cars. The winner was the Chevrolet Spark EV, with 181 points out of 225. The second place Ford Focus Electric was only eight points behind, at 173.

That's pretty typical, that the numerical ratings are close. They're always much closer than they are in Consumer Reports. I dug out a few back issues of C&D, and jotted down the comparo scores. Each row below is a different test:

189 - 164
206 - 201 - 200 - 192
220 - 205
196 - 190 - 184 - 179

All are pretty close -- the biggest gap from first to last is 15 percent. Although, I deliberately left out the March issue: there, the gap is bigger, mostly because of the electric Smart car, which they didn't like at all:

181 - 173 - 161 - 157 - 153 - 126

Leaving out the Smart, the difference between first and last is 18 percent. (For the record: Consumer Reports didn't rate the electric Smart, but they gave the regular one only 28/100, the lowest score of any car in their ratings.)
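For anyone who wants to check the percentages, here's the calculation. The scores are the ones listed above. The one assumption is the baseline: the gap is taken as a percentage of the last-place score, which is the choice that reproduces the 15 and 18 percent figures. The test labels are mine.

```python
# Comparo scores as listed in the post; the labels are just for reference.
comparos = {
    "test 1": [189, 164],
    "test 2": [206, 201, 200, 192],
    "test 3": [220, 205],
    "test 4": [196, 190, 184, 179],
    "EVs without the Smart": [181, 173, 161, 157, 153],
    "EVs with the Smart": [181, 173, 161, 157, 153, 126],
}

def gap_pct(scores):
    """First-to-last gap as a percentage of the last-place score."""
    return 100 * (max(scores) - min(scores)) / min(scores)

for label, scores in comparos.items():
    print(f"{label}: {gap_pct(scores):.1f} percent")
```

Measured against the first-place score instead, every gap comes out a little smaller, but the tests stay just as tightly bunched.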

Anyway, as I said, the Spark beat the Focus by only 8 ratings points, or five percent. But, if you read the evaluations of those two cars ... the editors like the Spark *a lot more* than the Focus.

Of the Spark, they say,

"Here's a car that puts it all together ... It's a total effort, a studied application of brainpower and enthusiasm that embraces the electric mandate with gusto ... Everything about the Spark is all-in. ... It is the one gold that sparkles."

But they're much more muted when it comes to the Focus, even in their compliments:

"The most natural-feeling of our EVs, the Focus delivers a smooth if somewhat muted rush of torque and has excellent brakes. ... At low speeds ... you can catch the motor clunking ... but otherwise the Focus feels solid and well integrated. ... What the Focus Electric really does best is give you a reason to go test drive the top-of-the-line gas-burning Focus."

When Car and Driver actually tells you what they think, it sounds like the cars are worlds apart. All that for eight points? Actually, it's worse than that: the Spark had a price advantage of seven points. So, when it comes to the car itself, the Chevy wins by only *one point* -- but gets much, much more appreciation and plaudits.

What's going on? Gladwell thinks C&D is putting too much faith in its own rating:

"Yet when you inspect the magazine’s tabulations it is hard to figure out why Car and Driver was so sure that the Cayman is better than the Corvette and the Evora."

I suspect, though, that it's the other way around: after they drive the cars, they decide which they liked best, then tailor the ratings to come out in the right order. I suspect that, if the ratings added up to make the Focus the best, they'd say to each other, "Well, that's not right! There must be something wrong with our numbers." And they'd rejig the category scores to make it work out.

Which probably isn't too hard to do, because, I suspect, the system is deliberately designed to keep the ratings close. That way, every car looks almost as good as the best, and everybody gets a good grade. A Ford salesman can tell his customer, "Look, we finished second, but only by 8 points ... and, 7 of them were price! And look at all the categories we beat them in!"

That doesn't mean the competition is biased. The magazine is just making sure the best car wins. Car and Driver is my favorite car magazine, and I think the raters really know their stuff. I don't want the winner to be the highest scorer of an arbitrary point system ... I want the winner to be the one the magazine thinks is best. That's why I'm reading the article, to get the opinions of the experts.

So, they're not "fixing" the competition, as in making sure the wrong car wins. They ARE "fixing" the ratings -- but in the sense of "repairing" them. Because, if you know the Spark is the best, but it doesn't rate the highest, you must have scored it wrong! Well, actually, you must have chosen a lousy rating system ... but, in this case, the writer is stuck with the longstanding C&D standard.

-------

"Fixing" the algorithm to match your intuition is probably a standard feature of ranking systems. In baseball, we've seen the pattern before ... someone decides that Jim Rice is underrated, and tries to come up with a rating that slots him where his gut says he should be slotted. Maybe give more weight to RBIs and batting average, and less to longevity. Maybe add in something for MVP votes, and lower the weighting for GIDP. Eventually, you get to a weighting that puts Jim Rice about as high as you'd like him to be, and that's your system.

And it doesn't feel like you're cheating, because, after all, you KNOW that Jim Rice belongs there! And, look, Babe Ruth is at the top, and so is Ted Williams, and a whole bunch of other HOFers. This, then, must be the right system!

That's what always has to happen, isn't it? Whether you're rating cars, or schools, or student achievement, or fame, or beauty, or whatever ... nobody just jots a system down and goes with it. You try it, and you see, "well, that one puts all the small cars at the top, so we've rated fuel economy too high." So you adjust. And now you see, "well, now all the luxury cars rate highest, so we better increase the weighting for price." And so on, until you look at the results, and they seem right to you, and the Jim Ricemobile is in its proper place.

That's another reason I hate arbitrary rankings: they're almost always set to fit the designer's preconceptions. To a certain extent, rating systems are just elaborate rationalizations.

Labels: Car and Driver, cars, Consumer Reports, ratings
