Sabermetric Research: Rating systems and rationalizations

The Bill Mazeroski Baseball Annuals, back in the 80s, had a rating system that didn't make much sense. They'd assign each team a bunch of grades, and the final rating would be the total. So you'd get something out of 10 for outfielders, something out of 10 for the starting pitching staff, something out of 10 for the manager, and so on. Which meant that the manager counted for as much as all three starting outfielders combined.

Maclean's magazine's ratings of Canadian universities are just as dubious. Same idea: rate a bunch of categories, and add them up. I've been planning to write about that one for a while, but I just discovered that Malcolm Gladwell already took care of it, three years ago, for similar American rankings.

In the same article, Gladwell also critiques the rating system used by "Car and Driver" magazine.

In 2010, C&D ran a "comparo" of three sports cars -- the Chevy Corvette Grand Sport, the Lotus Evora, and the Porsche Cayman S. The Cayman won by several points:

193 Cayman
186 Corvette
182 Evora

But, Gladwell points out, the final score is heavily dependent on the weights of the categories used. Car and Driver assigned only four percent of the available points to exterior styling. That makes no sense: "Has anyone buying a sports car ever placed so little value on how it looks?"

Gladwell then notes that if you re-jig the weightings to make looks more important, the Evora comes out on top:

205 Evora
198 Cayman
192 Corvette

Also, how important is price? The cost of the car counted for only ten percent of C&D's rating. For normal buyers, though, price is one of the most important criteria. What happens when Gladwell increases that weighting relative to the others?

Now, the Corvette wins.

205 Corvette
195 Evora
185 Cayman
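To see how little it takes to flip the order, here is a minimal sketch in Python. The category scores and the three weightings are made up for illustration; they are not Car and Driver's or Gladwell's actual numbers, and the real system uses twenty-one categories rather than three. The scores were chosen only so that the three orderings mirror the ones listed above.

```python
# Hypothetical 0-10 category scores. NOT Car and Driver's actual numbers.
SCORES = {
    "Cayman":   {"driving": 10, "styling": 7,  "price": 4},
    "Evora":    {"driving": 7,  "styling": 10, "price": 7},
    "Corvette": {"driving": 8,  "styling": 6,  "price": 10},
}

# Three made-up weightings, in percent of the total. Each sums to 100.
WEIGHTINGS = {
    "driving-heavy": {"driving": 80, "styling": 10, "price": 10},
    "styling-heavy": {"driving": 40, "styling": 50, "price": 10},
    "price-heavy":   {"driving": 40, "styling": 10, "price": 50},
}


def total(car_scores, weights):
    """Weighted sum of one car's category scores."""
    return sum(weights[cat] * car_scores[cat] for cat in weights)


def ranking(scores, weights):
    """Car names sorted from highest weighted total to lowest."""
    return sorted(scores, key=lambda car: total(scores[car], weights), reverse=True)


if __name__ == "__main__":
    for name, weights in WEIGHTINGS.items():
        print(name, ranking(SCORES, weights))
```

Same cars, same category scores, three different winners. Nothing in the data says which weighting is the correct one.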

------

Why does this happen? Gladwell argues that it's because Car and Driver insists on using the same weightings for every car in every issue in every test. It may be reasonable for looks to count for only four percent when you're buying an econobox, but it's much more important for image cars like the Porsche.

"The magazine’s ambition to create a comprehensive ranking system—one that considered cars along twenty-one variables, each weighted according to a secret sauce cooked up by the editors—would also be fine, as long as the cars being compared were truly similar. ... A ranking can be heterogeneous, in other words, as long as it doesn’t try to be too comprehensive. And it can be comprehensive as long as it doesn’t try to measure things that are heterogeneous. "

I think Gladwell goes a bit too easy on Car and Driver. I don't think the entire problem is that the system tries to be overbroad. I think a big part of the problem is that, unless you're measuring something real, *every* weighting system is arbitrary. It's true that a system that works well for family sedans might not work nearly as well for luxury cars, but it's also true that the system doesn't necessarily work for either of them separately, anyway!

It's like ... rating baseball players by RBIs. Sure, it's true that this system is inappropriate for comparing cleanup hitters to leadoff men. But even if you limit your evaluation to cleanup hitters, it still doesn't do a very good job.

In fact, Gladwell shows that explicitly in the car example. His two alternative weighting systems are each perfectly defensible, even within the category of "sports car". Which is better? Which is right? Neither! There is no right answer.

What I'd conclude, from Gladwell's example, is that rating systems are inappropriate for making fine distinctions. Any reasonable system can tell the good cars from the bad, but there's no way an arbitrary evaluation process can tell whether the Evora is better than the Porsche. It would always be too sensitive to the weightings.

You can always make the result come out either way, and there's no way to tell which one is "right." In fact, there's no "right" at all, because "better" has no actual definition. Your inexpressible intuitive view of "better" might involve a big role for looks, while mine might be more weighted to handling. Neither of us is wrong.

However: most people's definitions of "better" aren't *that* far from each other. We may not be able to agree whether the Porsche is better than the Corvette, but we definitely can agree that both are better than the Yugo. Any reasonable system should wind up with the same result.

Which, in general, is what rating systems are usually good for: identifying *large* differences. I may not believe Consumer Reports that the Sonata (89/100) is better than the Passat (80) ... but I should be able to trust them when they say the Camry (92) is better than the Avenger (43).
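One rough way to check the "large differences survive, small ones don't" idea is to try every weighting on a grid and count how often one car outscores another. The sketch below uses hypothetical scores in three invented categories (driving, styling, value), and it assumes the Yugo scores lower than the other two cars in every category.

```python
from itertools import product

# Hypothetical (driving, styling, value) scores on a 0-10 scale.
# Assumption: the Yugo trails the other two cars in every category.
SCORES = {
    "Cayman":   (10, 7, 4),
    "Corvette": (8, 6, 10),
    "Yugo":     (2, 1, 3),
}


def share_preferring(a, b, scores=SCORES, steps=10):
    """Fraction of grid weightings under which car `a` outscores car `b`.

    Each category weight runs over the integers 0..steps.
    """
    n_categories = len(scores[a])
    wins = trials = 0
    for weights in product(range(steps + 1), repeat=n_categories):
        if not any(weights):
            continue  # skip the all-zero weighting
        total_a = sum(w * s for w, s in zip(weights, scores[a]))
        total_b = sum(w * s for w, s in zip(weights, scores[b]))
        trials += 1
        wins += total_a > total_b
    return wins / trials


if __name__ == "__main__":
    print("Cayman over Yugo:    ", share_preferring("Cayman", "Yugo"))
    print("Corvette over Yugo:  ", share_preferring("Corvette", "Yugo"))
    print("Cayman over Corvette:", share_preferring("Cayman", "Corvette"))
```

Under these assumptions, both sports cars beat the Yugo under every weighting on the grid, while Cayman versus Corvette goes one way or the other depending on the weights. The first result depends on the Yugo losing every category. If it won one (raw sticker price, say), an extreme enough weighting could put it on top, which is the kind of system "reasonable" has to rule out.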

-------

In the March 2014 issue, Car and Driver compares six electric cars. The winner was the Chevrolet Spark EV, with 181 points out of 225. The second-place Ford Focus Electric was only eight points behind, at 173.

It's pretty typical for the numerical ratings to be that close. They're always much closer than they are in Consumer Reports. I dug out a few back issues of C&D, and jotted down the comparo scores. Each row below is a different test:

189 - 164
206 - 201 - 200 - 192
220 - 205
196 - 190 - 184 - 179

All are pretty close -- the biggest gap from first to last is 15 percent. Although, I deliberately left out the March issue: there, the gap is bigger, mostly because of the electric Smart car, which they didn't like at all:

181 - 173 - 161 - 157 - 153 - 126

Leaving out the Smart, the difference between first and last is 18 percent. (For the record: Consumer Reports didn't rate the electric Smart, but they gave the regular one only 28/100, the lowest score of any car in their ratings.)
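The text doesn't say which score the percentages are taken against. The sketch below assumes the first-to-last gap is expressed as a percentage of the last-place score, which is the choice that reproduces the 15 and 18 percent figures. The scores are the ones listed above; the row labels are my own.

```python
# Comparo scores as listed above. The labels are invented for readability.
COMPAROS = {
    "back issue 1": [189, 164],
    "back issue 2": [206, 201, 200, 192],
    "back issue 3": [220, 205],
    "back issue 4": [196, 190, 184, 179],
    "March EVs": [181, 173, 161, 157, 153, 126],
    "March EVs, no Smart": [181, 173, 161, 157, 153],
}


def gap_percent(scores):
    """First-to-last gap as a percentage of the last-place score (assumed base)."""
    first, last = max(scores), min(scores)
    return 100 * (first - last) / last


if __name__ == "__main__":
    for label, scores in COMPAROS.items():
        print(f"{label}: {gap_percent(scores):.1f}%")
```

On that basis the four back-issue tests come out to roughly 15, 7, 7, and 9.5 percent, the March test to about 44 percent with the Smart, and about 18 percent without it.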

Anyway, as I said, the Spark beat the Focus by only 8 ratings points, or five percent. But, if you read the evaluations of those two cars ... the editors like the Spark *a lot more* than the Focus.

Of the Spark, they say,

"Here's a car that puts it all together ... It's a total effort, a studied application of brainpower and enthusiasm that embraces the electric mandate with gusto ... Everything about the Spark is all-in. ... It is the one gold that sparkles."

But they're much more muted when it comes to the Focus, even in their compliments:

"The most natural-feeling of our EVs, the Focus delivers a smooth if somewhat muted rush of torque and has excellent brakes. ... At low speeds ... you can catch the motor clunking ... but otherwise the Focus feels solid and well integrated. ... What the Focus Electric really does best is give you a reason to go test drive the top-of-the-line gas-burning Focus."

When Car and Driver actually tells you what they think, it sounds like the cars are worlds apart. All that for eight points? Actually, it's worse than that: the Spark had a price advantage of seven points. So, when it comes to the car itself, the Chevy wins by only *one point* -- but gets much, much more appreciation and plaudits.

What's going on? Gladwell thinks C&D is putting too much faith in its own rating:

"Yet when you inspect the magazine’s tabulations it is hard to figure out why Car and Driver was so sure that the Cayman is better than the Corvette and the Evora."

I suspect, though, that it's the other way around: after they drive the cars, they decide which they liked best, then tailor the ratings to come out in the right order. I suspect that, if the ratings added up to make the Focus the best, they'd say to each other, "Well, that's not right! There must be something wrong with our numbers." And they'd rejig the category scores to make it work out.

Which probably isn't too hard to do, because, I suspect, the system is deliberately designed to keep the ratings close. That way, every car looks almost as good as the best, and everybody gets a good grade. A Ford salesman can tell his customer, "Look, we finished second, but only by 8 points ... and, 7 of them were price! And look at all the categories we beat them in!"

That doesn't mean the competition is biased. The magazine is just making sure the best car wins. Car and Driver is my favorite car magazine, and I think the raters really know their stuff. I don't want the winner to be the highest scorer in an arbitrary point system ... I want the winner to be the one the magazine thinks is best. That's why I'm reading the article, to get the opinions of the experts.

So, they're not "fixing" the competition, as in making sure the wrong car wins. They ARE "fixing" the ratings -- but in the sense of "repairing" them. Because, if you know the Spark is the best, but it doesn't rate the highest, you must have scored it wrong! Well, actually, you must have chosen a lousy rating system ... but, in this case, the writer is stuck with the longstanding C&D standard.

-------

"Fixing" the algorithm to match your intuition is probably a standard feature of ranking systems. In baseball, we've seen the pattern before ... someone decides that Jim Rice is underrated, and tries to come up with a rating that slots him where his gut says he should be slotted. Maybe give more weight to RBIs and batting average, and less to longevity. Maybe add in something for MVP votes, and lower the weighting for GIDP. Eventually, you get to a weighting that puts Jim Rice about as high as you'd like him to be, and that's your system.

And it doesn't feel like you're cheating, because, after all, you KNOW that Jim Rice belongs there! And, look, Babe Ruth is at the top, and so is Ted Williams, and a whole bunch of other HOFers. This, then, must be the right system!

That's what always has to happen, isn't it? Whether you're rating cars, or schools, or student achievement, or fame, or beauty, or whatever ... nobody just jots a system down and goes with it. You try it, and you see, "well, that one puts all the small cars at the top, so we've rated fuel economy too high." So you adjust. And now you see, "well, now all the luxury cars rate highest, so we better increase the weighting for price." And so on, until you look at the results, and they seem right to you, and the Jim Ricemobile is in its proper place.
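The adjust-until-it-looks-right loop is easy to mechanize, which is part of why it's so tempting. Here is a minimal sketch with made-up scores and a made-up starting weighting. The `tune_weights` helper is hypothetical: it illustrates the process described above and is not anything a real magazine is known to run.

```python
# Hypothetical 0-10 category scores (made up for illustration).
SCORES = {
    "Cayman":   {"driving": 10, "styling": 7,  "price": 4},
    "Evora":    {"driving": 7,  "styling": 10, "price": 7},
    "Corvette": {"driving": 8,  "styling": 6,  "price": 10},
}

# Made-up starting weighting, heavy on driving.
START = {"driving": 8, "styling": 1, "price": 1}


def total(car, weights, scores=SCORES):
    """Weighted sum of one car's category scores."""
    return sum(weights[cat] * scores[car][cat] for cat in weights)


def ranking(weights, scores=SCORES):
    """Car names sorted from highest weighted total to lowest."""
    return sorted(scores, key=lambda car: total(car, weights, scores), reverse=True)


def tune_weights(favorite, weights, scores=SCORES, max_steps=100):
    """Nudge weights one point at a time until `favorite` ranks first."""
    weights = dict(weights)  # work on a copy so the caller's dict is untouched
    for _ in range(max_steps):
        leader = ranking(weights, scores)[0]
        if leader == favorite:
            return weights
        # Category where the favorite most outscores the current leader.
        best = max(weights, key=lambda cat: scores[favorite][cat] - scores[leader][cat])
        if scores[favorite][best] <= scores[leader][best]:
            raise ValueError(f"{favorite} does not beat {leader} in any category")
        weights[best] += 1
    raise RuntimeError("gave up before the favorite reached first place")


if __name__ == "__main__":
    for favorite in SCORES:
        tuned = tune_weights(favorite, START)
        print(favorite, tuned, ranking(tuned))
```

With these numbers, any of the three cars can be made the winner just by deciding in advance that it should be, and each resulting weighting looks perfectly plausible on its own.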

That's another reason I hate arbitrary rankings: they're almost always set to fit the designer's preconceptions. To a certain extent, rating systems are just elaborate rationalizations.

Labels: Car and Driver, cars, Consumer Reports, ratings

9 Comments:

At Friday, March 07, 2014 7:58:00 PM, Don Coffin said...: When I teach introductory microeconomics, I spend some time talking about product "quality." Students always start by thinking that "quality" matters a lot. Until we start trying to define "quality." (I use automobiles as my primary example, and computers as a secondary example.) They wind up confused, which is what I'm aiming for.

But there's a line of research in economics that talks about "hedonic" pricing. The idea is that you look at (similar) products, each of which differs from the others in specific ways. For example, housing prices. You have your interior square footage, your exterior square footage, the number of bedrooms or bathrooms. The location of the house. Fireplace or not, garage or not, basement or not. Then you run a regression with housing characteristics as the independent variables and housing prices (actual sale prices) as the dependent variable. One study I saw found that basements have positive values in Chicago, but negative values in Houston...

I don't have any major point here, but the value of hedonics is that it ties back to something measurable (one way or another)...

At Saturday, March 08, 2014 12:24:00 AM, Anonymous said...: When I read those articles I look at those ratings individually. I imagine a lot of people do. That's really what they are designed for. Adding them up and calling that a "total rating" is just an afterthought - or at least it should be.

MGL

P.S. I don't have a clue what the first word of the captcha is. Phil, why don't you just require that we hike up Mount Everest in order to prove we're not a robot! (I realize it's not your very own hosting site!) ;)

At Saturday, March 08, 2014 12:26:00 AM, Anonymous said...: Why don't you just turn off the captcha if you can? Do you get many robots posting comments?

MGL

At Saturday, March 08, 2014 10:39:00 AM, Phil Birnbaum said...: I turned off the captcha at 5:00, and by 10:00 I had seven spams. Had to turn it back on, sorry!

At Sunday, March 09, 2014 12:53:00 PM, Unknown said...: I thought of something similar when watching The Godfather. People fundamentally want to think of themselves as being good people. People who do things that society would not consider good rejig their rankings to downplay that and emphasize what they are good at.

The example from the Godfather would be that they could not do crime but it is far easier to just say family or loyalty are the most important things.

At Thursday, March 13, 2014 10:01:00 AM, Zach said...: I think there is a possibility they determine which car they like best and then rate it in various categories. This essentially becomes their baseline for the test. They then rank the other cars as slightly better or worse in individual categories and can easily make sure that their favorite car ends up in the lead.

At Saturday, March 15, 2014 4:59:00 AM, Anonymous said...: I'm still confused by this fixation on rating systems being "arbitrary" that's lasted several posts now. And I think your own analogy says it best:

"It's like ... rating baseball players by RBIs. Sure, it's true that this system is inappropriate for comparing cleanup hitters to leadoff men. But even if you limit your evaluation to cleanup hitters, it still doesn't do a very good job."

Right, so RBI is an example of a rating system that is bad. It's bad for comparing different types of hitters, and it's bad for comparing within a type.

But then you have WAR as a very very very explicit counterexample of a rating system that is good.

Ratings are just a way of explicitly defining how you weight a set of criteria. They all are. And if you choose to not use a ratings system because they're "arbitrary" and "not measurements", then you're using one anyway - you're just using your own criteria, whether you make them explicit or just go by instinct or estimation.

All you're doing is giving examples of poorly weighted rating systems and then generalizing to some degree about rating systems in general. It would be like generalizing your RBI example to WAR, which would be poor. I really don't see what you're going for here.

At Sunday, March 16, 2014 1:44:00 PM, Phil Birnbaum said...: Hi, Anonymous,

WAR is not a rating; it's a measurement. Roughly speaking, it measures the number of wins the player produced compared to a minimum-salary free-agent.

Because it's a measurement, it's liable to being refined or overturned on objective grounds -- if you can prove a different version of WAR does it better.

Actually, WAR is a *bad* rating of player talent. For instance, it doesn't differentiate between average full-time players and superstar part-time players.

At Sunday, March 16, 2014 1:45:00 PM, Phil Birnbaum said...: *Bad* is probably too strong, but you know what I mean.


Friday, March 07, 2014
