Sunday, January 26, 2014

Number grades aren't really numbers

Recently, Consumer Reports (CR) evaluated the "Model S", the new electric car from Tesla.  It got their highest rating ever: 99 points out of a possible 100.

I was thinking about that, and I realized that the "99 out of 100" scale is the perfect example of what Bill James was talking about when, in the 1982 Abstract, he wrote about how we use numbers as words:

"Suppose that you see the number 48 in a player's home run column... Do you think about 48 cents or 48 soldiers or 48 sheep jumping over a fence? Absolutely not. You think about Harmon Killebrew, about Mike Schmidt, about Ted Kluszewski, about Gorman Thomas. You think about power.
"In this way, the number 48 functions not as a number, as a count of something, but as a word, to suggest meaning. The existence of universally recognized standards -- .300, 100 RBI, 20 wins, 30 homers -- ... transmogrifies the lines of statistics into a peculiar, precise form of language. We all know what .312 means. We all know what 12 triples means, and what 0 triples means. We know what 0 triples means when it is combined with 0 home runs (slap hitter, chokes up and punches), and we know what it means when it is combined with 41 home runs (uppercuts, holds the bat at the knob, can't run and doesn't need to). ..."

That's exactly what's going on with the 99/100, isn't it?  The 99 doesn't mean anything, as a number.  It's short for, "really, really outstanding," but said in a way that makes it look objective.  

If you try to figure out what the 99 means as a number ... well, there isn't anything.  As great a car as the Tesla might be, is it really 99% of the way to perfection?  Is it really that close to the most perfect car possible?   

One thing CR didn't like about the car was that the air conditioning didn't work that great.  Is that the missing point?  If Tesla fixes the A/C, does the rating bump to 100?  

If so, what happens when Tesla improves the car even more?  If they put in a bigger battery to get better acceleration, or they improve the handling, or they improve the ergonomics, what's CR going to do -- give them 106 out of 100?

Does it really make sense to fashion a rating on the idea that you can compare something to "perfect"?


To get the "99", CR probably broke the evaluation into categories -- handling, acceleration, comfort, etc. -- and then rated the car on each, maybe out of 10.  The problem with getting rated that way is that if you underperform in one category, you can't make up for it in another in which you're already excellent.

Suppose a car has a tight back seat, and gets docked a couple of points.  And the manufacturer says, "well, it's a tradeoff.  We needed to do that to give the car a more aerodynamic shape, to save gas and get better acceleration."  Which is fine, if it's a mediocre car: in that case, the backseat rating drops from (say) 7 to 5, but the acceleration rating jumps from 6 to 8.  The tradeoff keeps the rating constant, as you'd expect.

But, what if the car is really, really good already?  The backseat rating goes from 10 to 8, and the acceleration rating goes from 10 to ... 12?  
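If you like to see the capping effect spelled out, here's a throwaway Python sketch.  The category names and numbers are invented for illustration; the point is just that the same neutral tradeoff looks like a net loss once a category is already at the maximum:

```python
CAP = 10

def rate(raw):
    """Clamp a raw 'goodness' score to the scale's maximum."""
    return min(raw, CAP)

def total(ratings):
    """Sum the (capped) category ratings."""
    return sum(rate(r) for r in ratings.values())

def apply_tradeoff(car):
    """Shrink the back seat by 2 points of goodness, gain 2 in acceleration."""
    return {"back seat": car["back seat"] - 2,
            "acceleration": car["acceleration"] + 2}

mediocre = {"back seat": 7, "acceleration": 6}
excellent = {"back seat": 10, "acceleration": 10}

print(total(mediocre), total(apply_tradeoff(mediocre)))    # 13 13 -- tradeoff is neutral
print(total(excellent), total(apply_tradeoff(excellent)))  # 20 18 -- same tradeoff now "costs" 2
```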

The "out of 10" system arbitrarily sets a maximum "goodness" that it will recognize, and ignores everything above that.  That's fine for evaluating mediocrity, but it fails utterly when trying to evaluate excellence.  


In the same essay, Bill James added,

"What is disturbing to some people about sabermetrics is that in sabermetrics we use the numbers as numbers. We add them together; we multiply and divide them."

That, I think, is the difference between numbers that are truly numbers, and numbers that are just words.  You can only do math if numbers mean something numeric.

The "out of 100" ratings don't.  They're not really numbers, they're just words disguised as numbers.

We even acknowledge that they're words by how we put the article in front of them.  "I got AN eighty in geography."  It's not the number 80; it's a word that happens to be spelled with numbers instead of letters.

It's part of the fabric of our culture, what "out of 100" numbers mean as words.  Think of a report card.  A 35 is a failure, and a 50 is a bare pass, and a 60 is mediocre, and a 70 is average, and an 80 is very good, and a 90 is outstanding, and a 95 is very outstanding.  It's just the way things are.  

And that's why treating them as numbers doesn't really work.  We try -- we still average them out -- but it's not really right.  For instance: Ann takes an algebra course and gets a 90, but doesn't take calculus or statistics.  Bob takes all three, but gets a grade of 35 in each.  Does Bob, with a total of 105, know more math than Ann, who only has 90?  No, he doesn't.  It's clear, intuitively, that Ann knows more than Bob.  That's true even if Ann *does* take calculus and statistics, but gets 0 in each.  Bob has the higher average, but Ann really did better in math.
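For the skeptical, here's that arithmetic made explicit in a few lines of Python (the grades are the invented ones from the example, obviously):

```python
# Treating course grades as additive numbers says Bob "wins";
# intuition says otherwise.
ann = [90]           # algebra only
bob = [35, 35, 35]   # algebra, calculus, statistics: three failures

print(sum(ann), sum(bob))           # 90 105 -- Bob has the higher total
ann_all = [90, 0, 0]                # even if Ann takes all three courses
print(sum(ann_all) / 3, sum(bob) / 3)  # 30.0 35.0 -- and the higher average
```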

On the other hand, there are situations in which we CAN take those numbers as numbers -- but only if we're not interpreting them with normal meanings on the "out of 100" scale.  Suppose Ann knows 90 percent of the answers (sorry, "questions") in Jeopardy's "Opera" category, but 0 percent in "Shakespeare" and "Potent Potables."  Bob knows 30 percent in each.  In this case, Bob and Ann *are* equivalent.  

What's the difference?  In Jeopardy, "30 out of 100" isn't a rating -- it's a measurement.  It's actually a number, not a word.  And so, it doesn't have the same meaning as it does as a rating. Because, here, 30 percent is actually pretty good.

If we said, "Bob got a grade of 30% in Potent Potables," that would be almost a lie.  A grade of 30% is a failure, and Bob did not fail. 


The calculus example touches on the system's biggest failing: it can't handle quantity, only quality.  In real life, it's almost always true that good things are better when you have more.  But the "out of 100" system doesn't handle that well.  

Suppose I have two mediocre cars, a Tahoe which rates a 58, and a Yaris that rates a 41.  Those add up to 99.  Are the two cars together, in any sense, equivalent to a Tesla?  I don't think so.

Or, suppose I have the choice: drive a Yaris for two years, or a Tesla for six months.  Can I say, "well, two years of the Yaris is 82 point-years, and six months of Tesla is 49.5 point-years, so I should go with the Yaris?"  Again, no.
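Here's the (absurd) arithmetic from those two examples, written out as a little Python sketch, using the ratings from above:

```python
tahoe, yaris, tesla = 58, 41, 99

# Two mediocre cars "add up" to one great one -- which is nonsense.
print(tahoe + yaris == tesla)    # True

# "Point-years": two years of Yaris vs. six months of Tesla.
print(yaris * 2.0)               # 82.0
print(tesla * 0.5)               # 49.5
```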

What's better, watching two movies rated three stars, or watching three movies rated two stars?  It depends on the person, of course.  For me, I still wouldn't know the answer.  And, it completely depends on the context of what you're rating.  

My gut says that a five-star movie (to me) is worth at least ten three-star movies, and probably more.  But a night at a five-star hotel is worth a lot less than ten nights at a three-star hotel.

Rate your enjoyment of "The Hidden Game of Baseball," on a scale of 1 to 10.  Maybe, it's a 9.  Now, rate your enjoyment of the 1982 Baseball Abstract on the same scale.  Maybe that's a 9.5.  Fair enough.

Now, rate your enjoyment of the entire collected works of Pete Palmer.  I'll say 9.5.  Then, rate the entire collected works of Bill James.   I'd say 10.  

[If you're not familiar with sabermetrics books, try these translations: "The Hidden Game of Baseball" = "Catcher in the Rye".  "Pete Palmer" = "J.D. Salinger".  "1982 Baseball Abstract" = "Hamlet".  "Bill James" = "William Shakespeare".]

But: Bill James has probably published at least three times as many words as Pete Palmer!  And he gets no credit for that at all.  Because, Pete's other books (say, three of them) raised his rating half a point; and Bill's other books (say, ten of them) raised his rating by the same half a point!  If you insist on treating ratings as numbers, you have to conclude that you enjoyed each of Bill's subsequent books less than half as much as Pete's subsequent books.  

Even worse: if Bill James had written nothing else after 1988, I'd still give him a 10.  The "out of 10" rating system completely ignores that you liked Bill's later works.

The system just doesn't understand quantity very well.  

But, it *does* understand *quality*.  What if we can somehow collapse quantity into quality?

A few paragraphs ago, the entire collected works of Pete Palmer rated a 9.5 in enjoyment.  The entire collected works of Bill James rated a 10.

But: now, compare two Christmas gifts under the tree.  One, the entire collected works of Pete Palmer.  Second, the entire collected works of Bill James.  

Now, we start thinking about the quality of the gift, rather than the quality of the *enjoyment* of the gift.  And the quality of a gift does, obviously, depend on how much you're getting.  And so, my gut is now happy to rate the Pete Palmer gift as a 7, and the Bill James gift as a 9.  

They shouldn't really be different, right?  Because, effectively, the quality of the gift is exactly the same as the amount of enjoyment, almost by definition.  (It's not like I was thinking I could use the Bill James books as doorstops.)  But in the gift case, we're mentally equipped to calibrate by quantity.  In the enjoyment case, our brain has no idea how to make that adjustment. 



At Sunday, January 26, 2014 8:12:00 PM, Anonymous Anonymous said...

Talk about complaining for the sake of complaining... I'd give this post a zero (0) out of ten (10).

At Sunday, January 26, 2014 8:48:00 PM, Blogger Phil Birnbaum said...

Ha! I should have seen that coming.

At Sunday, January 26, 2014 10:04:00 PM, Blogger Don Coffin said...

You wound up not going where I thought you were going. Which is OK, but I think there's a stronger argument to be made.

Grades (and measurements underlying grades) are course-specific. It's really not reasonable to try to aggregate that across courses. Although we do every time we "compute" a "grade point average." (Or when someone "averages" the "scores" on a Likert (Strongly Agree/Agree/Neutral/Disagree/Strongly Disagree) scale.) These things are labels, not numbers, and the appropriate way to deal with them is to look at the distribution of the labels. In what percentage of courses did Ann or Bob get an A? *Within* the course, the numbers (of points accumulated on assignments, for example) may make sense, but the grades aren't numbers, they are labels, and should be treated that way.

The way I try to get this across is to point out that the "average" of "I love you" and "I hate you" is not "I don't care."

So I get the point, but I'm not sure anyone would really disagree.

At Sunday, January 26, 2014 10:17:00 PM, Blogger Don Coffin said...

"Suppose I have two mediocre cars, a Tahoe which rates a 58, and a Yaris that rates a 41. Those add up to 99. Are the two cars together, in any sense, equivalent to a Tesla? I don't think so."

But you have *2* cars, and even if the numbers meant something, *2* Teslas would have a score of 198, right? So you'd need to take the *average* here, if these are numbers. Or am I missing something?

Roger Maris hit 61 home runs in 1961. Norm Cash and Rocky Colavito hit 86. It is correct to say (a) that Maris hit more HRs than Cash or Colavito and (b) that Cash and Colavito combined hit more HRs than Maris, right? Because these are, in fact, numbers.

Mickey Mantle had an OPS+ of 206 in 1961. Cash had an OPS+ of 201 & Colavito had an OPS+ of 167. No one would add Cash's and Colavito's OPS+ together, because they are *labels*. No one, really, would even average them, because they are *labels*.

I think you sort of confuse that in your post, to be honest.

At Sunday, January 26, 2014 11:27:00 PM, Blogger Phil Birnbaum said...


My thinking is that the number of cars shouldn't matter. Like, if I rate a $10 bill "10", and a $5 bill "5", then if I have two fives and a ten, the sum of their ratings is exactly the same as the rating for a single $20.

That is: if the number was a measure of "goodness" of some sort, then "goodness" should be additive, generally, or at least there should be some way of combining two or more goodnesses into a bigger goodness.

You average them if they're rate stats, and you add them if they're bulk stats, or you do something else otherwise (like take the harmonic mean if they're reciprocals of rate stats).

But for course grades, what do you do? How does a 30 and a 40 compare to a single 50? Those are truly labels.

OPS+, on the other hand, is probably a rate stat. You can probably average them to see what a team of Cash/Colavito clones (4.5 of each) would hit.
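A quick sketch of those three combination rules in Python (all numbers invented for illustration):

```python
from statistics import harmonic_mean, mean

# Rate stats: average them.
battings = [0.300, 0.250]
print(mean(battings))          # ~0.275

# Bulk stats: add them.
home_runs = [41, 45]
print(sum(home_runs))          # 86

# Reciprocals of rate stats combine by harmonic mean -- e.g., speeds
# over equal *distances*, since time-per-distance is what adds.
speeds = [30, 60]
print(harmonic_mean(speeds))   # 40
```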

At Sunday, January 26, 2014 11:28:00 PM, Blogger Phil Birnbaum said...

"I'm not sure anyone would really disagree."

But everyone uses GPA, or the average of raw grades! Doesn't the revealed preference show they disagree?

At Monday, January 27, 2014 10:40:00 AM, Anonymous Anonymous said...

Thanks for the great article! You're describing the reasons economists developed preference theory and the notion of utility.

To an economist - and to you - it doesn't make sense to say "a Tahoe rating plus a Yaris rating equals a Tesla rating, therefore the two are equivalent to the one." The important question is: "how much would you enjoy owning both a Tahoe and a Yaris, compared to how much you would enjoy owning a Tesla?" (Note: your enjoyment of the Tahoe and Yaris need not be the sum of your enjoyment of owning each separately, in fact, it's probably not since you can't drive two cars at the same time).

None of this has anything to do with your discussion of the use of numbers as words - a topic which economics has not touched as far as I know - though there must be a linguist somewhere who has put some thought into this. Interesting stuff!

At Monday, January 27, 2014 1:27:00 PM, Anonymous Mike said...

It is for reasons like this post that I enjoyed learning about Ratio, Interval, Ordinal, and Nominal variables in my first college stats course.

I think at the beginning, Phil/Bill is saying that Home Runs are not only an Interval measure, but a Nominal one as well. Which is an interesting way to look at things.

At Monday, January 27, 2014 1:55:00 PM, Blogger Phil Birnbaum said...

What kind of measure is "30% in Psych 101"?

At Monday, January 27, 2014 5:04:00 PM, Anonymous Glert said...

"What kind of measure is "30% in Psych 101"?"

1 minus the score represents the probability that a philosophy or theatre degree is in their future.

At Wednesday, January 29, 2014 9:21:00 PM, Anonymous Anonymous said...

I'm going to operate on the basis that this wasn't a joke post, although I'm not sure.

First, let me say I grant the initial point (Bill James's): that sometimes a number is a stand-in for a concept. "That movie was a 9 out of 10" is probably not based on any kind of rigorous framework. It just means "I liked that movie a lot". Is it a problem when people use words in the guise of numbers to make a statement sound more scientific? Maybe, but you don't really pursue that at all.

But everything else you said seemed really silly (and why I'm still not sure this wasn't a joke post).

Your percentage grade in a class IS a measurement. In fact, a great analogy is baseball WAR. The teacher is attempting to condense your total performance in the course into one number. This number is representative of the teacher's subjective rubric, but it is applied consistently across all students. Maybe you disagree with the weights applied, but a student's grade is CLEARLY a measurement. You may think it's not doing a good job of measuring what it says it's measuring, or you may think that it should be trying to measure something else in the first place, but it's clearly a measurement.

The car analogy seems just as silly. In this case, these cars are all being graded at a certain point in time. In essence, they're being graded on a curve. It's not hard to see how these grades are applied, either: figure out "best in class" and work back from there. Car A has the best acceleration but very poor trunk space. Car B has the second-best acceleration but great trunk space.

These car grades are, again, measurements. They are designed to perform one job: to compare the cars to the other available cars, for the benefit of potential consumers. And next year, when the next set of grades come out, the curve will be readjusted based on new and different competitors. It's beyond common to see in reviews (of any product) "the 2014 model brings back the excellent [feature] from the 2013 model, but since it's been surpassed by the 2014 [competitor], its score drops from a 10 to a 9.5".

I really don't know what you were trying to say in this post.

At Friday, January 31, 2014 10:36:00 PM, Anonymous Anonymous said...

I love your posts, but...

"1982 Baseball Abstract" = "Hamlet". "Bill James" = "William Shakespeare".<<

...uh, no.

--Bob Lince

