Number grades aren't really numbers
Recently, Consumer Reports (CR) evaluated the "Model S", the new electric car from Tesla. It got their highest rating ever: 99 points out of a possible 100.
I was thinking about that, and I realized that the "99 out of 100" scale is the perfect example of what Bill James was talking about when, in the 1982 Abstract, he wrote about how we use numbers as words:
"Suppose that you see the number 48 in a player's home run column... Do you think about 48 cents or 48 soldiers or 48 sheep jumping over a fence? Absolutely not. You think about Harmon Killebrew, about Mike Schmidt, about Ted Kluszewski, about Gorman Thomas. You think about power.
"In this way, the number 48 functions not as a number, as a count of something, but as a word, to suggest meaning. The existence of universally recognized standards -- .300, 100 RBI, 20 wins, 30 homers -- ... transmogrifies the lines of statistics into a peculiar, precise form of language. We all know what .312 means. We all know what 12 triples means, and what 0 triples means. We know what 0 triples means when it is combined with 0 home runs (slap hitter, chokes up and punches), and we know what it means when it is combined with 41 home runs (uppercuts, holds the bat at the knob, can't run and doesn't need to). ..."
That's exactly what's going on with the 99/100, isn't it? The 99 doesn't mean anything, as a number. It's short for, "really, really outstanding," but said in a way that makes it look objective.
If you try to figure out what the 99 means as a number ... well, there isn't anything. As great a car as the Tesla might be, is it really 99% of the way to perfection? Is it really that close to the most perfect car possible?
One thing CR didn't like about the car was that the air conditioning didn't work that great. Is that the missing point? If Tesla fixes the A/C, does the rating bump to 100?
If so, what happens when Tesla improves the car even more? If they put in a bigger battery to get better acceleration, or they improve the handling, or they improve the ergonomics, what's CR going to do -- give them 106 out of 100?
Does it really make sense to fashion a rating on the idea that you can compare something to "perfect"?
To get the "99", CR probably broke the evaluation into categories -- handling, acceleration, comfort, etc. -- and then rated the car on each, maybe out of 10.
The problem with getting rated that way is that if you underperform in one category, you can't make up for it in another in which you're already excellent.
Suppose a car has a tight back seat, and gets docked a couple of points. And the manufacturer says, "well, it's a tradeoff. We needed to do that to give the car a more aerodynamic shape, to save gas and get better acceleration." Which is fine, if it's a mediocre car: in that case, the back-seat rating drops from (say) 7 to 5, but the acceleration rating jumps from 6 to 8. The tradeoff keeps the rating constant, as you'd expect.
But, what if the car is really, really good already? The back-seat rating goes from 10 to 8, and the acceleration rating goes from 10 to ... 12?
The "out of 10" system arbitrarily sets a maximum "goodness" that it will recognize, and ignores everything above that. That's fine for evaluating mediocrity, but it fails utterly when trying to evaluate excellence.
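Here's a rough sketch of that back-seat/acceleration tradeoff in Python. It assumes, as I guessed above, that each category is scored out of 10 and that the category scores are simply added up; I don't know CR's actual method, and the names (SCALE_MAX, capped, total) are made up for illustration.

```python
SCALE_MAX = 10  # assumed ceiling for each category; not CR's documented method


def capped(score, ceiling=SCALE_MAX):
    """Clamp a raw score to the highest value the scale will recognize."""
    return min(score, ceiling)


def total(ratings):
    """Add up the category scores after capping each one."""
    return sum(capped(score) for score in ratings.values())


# Mediocre car: the tradeoff moves two points from one category to the other.
mediocre_before = {"back_seat": 7, "acceleration": 6}
mediocre_after = {"back_seat": 5, "acceleration": 8}

# Excellent car: same tradeoff, but the "true" acceleration score of 12
# is above the ceiling, so the scale only sees a 10.
excellent_before = {"back_seat": 10, "acceleration": 10}
excellent_after = {"back_seat": 8, "acceleration": 12}

print(total(mediocre_before), total(mediocre_after))    # 13 13
print(total(excellent_before), total(excellent_after))  # 20 18
```

The mediocre car comes out even. The excellent car loses two points for making exactly the same trade, because the improvement has nowhere to go.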
In the same essay, Bill James added,
"What is disturbing to some people about sabermetrics is that in sabermetrics we use the numbers as numbers. We add them together; we multiply and divide them."
That, I think, is the difference between numbers that are truly numbers, and numbers that are just words. You can only do math if numbers mean something numeric.
The "out of 100" ratings don't. They're not really numbers, they're just words disguised as numbers.
We even acknowledge that they're words by how we put the article in front of them. "I got AN eighty in geography." It's not the number 80; it's a word that happens to be spelled with numbers instead of letters.
It's part of the fabric of our culture, what "out of 100" numbers mean as words. Think of a report card. A 35 is a failure, and a 50 is a bare pass, and a 60 is mediocre, and a 70 is average, and an 80 is very good, and a 90 is outstanding, and a 95 is very outstanding. It's just the way things are.
And that's why treating them as numbers doesn't really work. We try -- we still average them out -- but it's not really right. For instance: Ann takes an algebra course and gets a 90, but doesn't take calculus or statistics. Bob takes all three, but gets a grade of 35 in each. Does Bob, with a total of 105, know more math than Ann, who only has 90? No, he doesn't. It's clear, intuitively, that Ann knows more than Bob. That's true even if Ann *does* take calculus and statistics, but gets 0 in each. Bob has the higher average, but Ann really did better in math.
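To spell out the arithmetic, here's a minimal sketch that does what the report card does and treats the grades as plain numbers. The variable names are just for illustration.

```python
# Grades from the example: Ann's 90 in algebra (plus zeros if she takes
# the other two courses), and Bob's 35 in each of three courses.
ann_one_course = [90]
ann_three_courses = [90, 0, 0]
bob = [35, 35, 35]


def average(grades):
    """Plain arithmetic mean, the way a report card would compute it."""
    return sum(grades) / len(grades)


print(sum(bob), sum(ann_one_course))             # 105 90
print(average(bob), average(ann_three_courses))  # 35.0 30.0
```

By total or by average, the numbers put Bob ahead. The intuition that Ann knows more math is the part the arithmetic can't see.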
On the other hand, there are situations in which we CAN take those numbers as numbers -- but only if we're not interpreting them with normal meanings on the "out of 100" scale. Suppose Ann knows 90 percent of the answers (sorry, "questions") in Jeopardy's "Opera" category, but 0 percent in "Shakespeare" and "Potent Potables." Bob knows 30 percent in each. In this case, Bob and Ann *are* equivalent.
What's the difference? In Jeopardy, "30 out of 100" isn't a rating -- it's a measurement. It's actually a number, not a word. And so, it doesn't have the same meaning as it does as a rating. Because, here, 30 percent is actually pretty good.
If we said, "Bob got a grade of 30% in Potent Potables," that would be almost a lie. A grade of 30% is a failure, and Bob did not fail.
The calculus example touches on the system's biggest failing: it can't handle quantity, only quality. In real life, it's almost always true that good things are better when you have more. But the "out of 100" system doesn't handle that well.
Suppose I have two mediocre cars, a Tahoe which rates a 58, and a Yaris that rates a 41. Those add up to 99. Are the two cars together, in any sense, equivalent to a Tesla? I don't think so.
Or, suppose I have the choice: drive a Yaris for two years, or a Tesla for six months. Can I say, "well, two years of the Yaris is 82 point-years, and six months of Tesla is 49.5 point-years, so I should go with the Yaris"? Again, no.
What's better, watching two movies rated three stars, or watching three movies rated two stars? It depends on the person, of course. For me, I still wouldn't know the answer. And, it completely depends on the context of what you're rating.
My gut says that a five-star movie (to me) is worth at least ten three-star movies, and probably more. But a night at a five-star hotel is worth a lot less than ten nights at a three-star hotel.
Rate your enjoyment of "The Hidden Game of Baseball," on a scale of 1 to 10. Maybe, it's a 9. Now, rate your enjoyment of the 1982 Baseball Abstract on the same scale. Maybe that's a 9.5. Fair enough.
Now, rate your enjoyment of the entire collected works of Pete Palmer. I'll say 9.5. Then, rate the entire collected works of Bill James. I'd say 10.
[If you're not familiar with sabermetrics books, try these translations: "The Hidden Game of Baseball" = "Catcher in the Rye". "Pete Palmer" = "J.D. Salinger". "1982 Baseball Abstract" = "Hamlet". "Bill James" = "William Shakespeare".]
But: Bill James has probably published at least three times as many words as Pete Palmer! And he gets no credit for that at all. Because, Pete's other books (say, three of them) raised his rating half a point; and Bill's other books (say, ten of them) raised his rating by the same half a point! If you insist on treating ratings as numbers, you have to conclude that you enjoyed each of Bill's subsequent books less than half as much as Pete's subsequent books.
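Here's the "less than half" arithmetic as a small sketch, using my own ratings from above and the rough book counts (three and ten) that I guessed at. The function name is made up for illustration.

```python
def gain_per_book(rating_before, rating_after, extra_books):
    """Rating points added per additional book, treating ratings as numbers."""
    return (rating_after - rating_before) / extra_books


# Pete Palmer: 9 for "The Hidden Game", 9.5 for the collected works,
# with (say) three other books making up the difference.
pete = gain_per_book(9.0, 9.5, 3)

# Bill James: 9.5 for the 1982 Abstract, 10 for the collected works,
# with (say) ten other books making up the difference.
bill = gain_per_book(9.5, 10.0, 10)

print(round(pete, 3), round(bill, 3))  # 0.167 0.05
print(round(bill / pete, 2))           # 0.3
```

Taken literally, each later Bill James book was worth about three-tenths of a later Pete Palmer book, which is not a conclusion I'd stand behind.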
Even worse: if Bill James had written nothing else after 1988, I'd still give him a 10. The "out of 10" rating system completely ignores that you liked Bill's later works.
The system just doesn't understand quantity very well.
But, it *does* understand *quality*. What if we can somehow collapse quantity into quality?
A few paragraphs ago, the entire collected works of Pete Palmer rated a 9.5 in enjoyment. The entire collected works of Bill James rated a 10.
But: now, compare two Christmas gifts under the tree. One, the entire collected works of Pete Palmer. Second, the entire collected works of Bill James.
Now, we start thinking about the quality of the gift, rather than the quality of the *enjoyment* of the gift. And the quality of a gift does, obviously, depend on how much you're getting. And so, my gut is now happy to rate the Pete Palmer gift as a 7, and the Bill James gift as a 9.
They shouldn't really be different, right? Because, effectively, the quality of the gift is exactly the same as the amount of enjoyment, almost by definition. (It's not like I was thinking I could use the Bill James books as doorstops.) But in the gift case, we're mentally equipped to calibrate by quantity. In the enjoyment case, our brain has no idea how to make that adjustment.