Sunday, January 26, 2014

Number grades aren't really numbers

Recently, Consumer Reports (CR) evaluated the "Model S", the new electric car from Tesla.  It got their highest rating ever: 99 points out of a possible 100.

I was thinking about that, and I realized that the "99 out of 100" scale is the perfect example of what Bill James was talking about when, in the 1982 Abstract, he wrote about how we use numbers as words:

"Suppose that you see the number 48 in a player's home run column... Do you think about 48 cents or 48 soldiers or 48 sheep jumping over a fence? Absolutely not. You think about Harmon Killebrew, about Mike Schmidt, about Ted Kluszewski, about Gorman Thomas. You think about power.
"In this way, the number 48 functions not as a number, as a count of something, but as a word, to suggest meaning. The existence of universally recognized standards -- .300, 100 RBI, 20 wins, 30 homers -- ... transmogrifies the lines of statistics into a peculiar, precise form of language. We all know what .312 means. We all know what 12 triples means, and what 0 triples means. We know what 0 triples means when it is combined with 0 home runs (slap hitter, chokes up and punches), and we know what it means when it is combined with 41 home runs (uppercuts, holds the bat at the knob, can't run and doesn't need to). ..."

That's exactly what's going on with the 99/100, isn't it?  The 99 doesn't mean anything, as a number.  It's short for, "really, really outstanding," but said in a way that makes it look objective.  

If you try to figure out what the 99 means as a number ... well, there isn't anything.  As great a car as the Tesla might be, is it really 99% of the way to perfection?  Is it really that close to the most perfect car possible?   

One thing CR didn't like about the car was that the air conditioning didn't work that great.  Is that the missing point?  If Tesla fixes the A/C, does the rating bump to 100?  

If so, what happens when Tesla improves the car even more?  If they put in a bigger battery to get better acceleration, or they improve the handling, or they improve the ergonomics, what's CR going to do -- give them 106 out of 100?

Does it really make sense to fashion a rating on the idea that you can compare something to "perfect"?

----

To get the "99", CR probably broke the evaluation into categories -- handling, acceleration, comfort, etc. -- and then rated the car on each, maybe out of 10.  
The problem with getting rated that way is that if you underperform in one category, you can't make up for it in another in which you're already excellent.

Suppose a car has a tight back seat, and gets docked a couple of points.  And the manufacturer says, "well, it's a tradeoff.  We needed to do that to give the car a more aerodynamic shape, to save gas and get better acceleration."  Which is fine, if it's a mediocre car: in that case, the backseat rating drops from (say) 7 to 5, but the acceleration rating jumps from 6 to 8.  The tradeoff keeps the rating constant, as you'd expect.

But, what if the car is really, really good already?  The backseat rating goes from 10 to 8, and the acceleration rating goes from 10 to ... 12?  

The "out of 10" system arbitrarily sets a maximum "goodness" that it will recognize, and ignores everything above that.  That's fine for evaluating mediocrity, but it fails utterly when trying to evaluate excellence.  
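Here's a minimal sketch of that ceiling effect in Python. The category scores are the illustrative ones from the paragraphs above; the `capped` helper, and the idea that the overall rating is just a sum of capped category scores, are my assumptions for the sake of the example, not CR's actual method.

```python
MAX_SCORE = 10  # assumed per-category ceiling


def capped(score):
    """Clamp a category score to the scale's maximum."""
    return min(score, MAX_SCORE)


def total(scores):
    """Sum the category scores after applying the cap."""
    return sum(capped(s) for s in scores.values())


# Mediocre car: the back-seat/acceleration tradeoff is rating-neutral.
mediocre_before = {"back seat": 7, "acceleration": 6}
mediocre_after = {"back seat": 5, "acceleration": 8}

# Excellent car: same tradeoff, but acceleration "wants" to be a 12.
excellent_before = {"back seat": 10, "acceleration": 10}
excellent_after = {"back seat": 8, "acceleration": 12}

print(total(mediocre_before), total(mediocre_after))    # 13 13
print(total(excellent_before), total(excellent_after))  # 20 18
```

The same tradeoff costs the excellent car two points, only because the scale has nowhere to put the extra acceleration.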

-----

In the same essay, Bill James added,

"What is disturbing to some people about sabermetrics is that in sabermetrics we use the numbers as numbers. We add them together; we multiply and divide them."

That, I think, is the difference between numbers that are truly numbers, and numbers that are just words.  You can only do math if numbers mean something numeric.

The "out of 100" ratings don't.  They're not really numbers, they're just words disguised as numbers.

We even acknowledge that they're words by how we put the article in front of them.  "I got AN eighty in geography."  It's not the number 80; it's a word that happens to be spelled with numbers instead of letters.

It's part of the fabric of our culture, what "out of 100" numbers mean as words.  Think of a report card.  A 35 is a failure, and a 50 is a bare pass, and a 60 is mediocre, and a 70 is average, and an 80 is very good, and a 90 is outstanding, and a 95 is very outstanding.  It's just the way things are.  

And that's why treating them as numbers doesn't really work.  We try -- we still average them out -- but it's not really right.  For instance: Ann takes an algebra course and gets a 90, but doesn't take calculus or statistics.  Bob takes all three, but gets a grade of 35 in each.  Does Bob, with a total of 105, know more math than Ann, who only has 90?  No, he doesn't.  It's clear, intuitively, that Ann knows more than Bob.  That's true even if Ann *does* take calculus and statistics, but gets 0 in each.  Bob has the higher average, but Ann really did better in math.
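To make the arithmetic explicit, here is a small Python sketch using the grades from the example above (the variable names are mine). Either way you treat the grades as numbers, by total or by average, Bob comes out ahead, which is exactly the result that doesn't match intuition.

```python
ann_one_course = [90]           # algebra only
ann_three_courses = [90, 0, 0]  # algebra, then 0 in calculus and statistics
bob = [35, 35, 35]              # 35 in each of the three courses


def average(grades):
    """Plain arithmetic mean of a list of grades."""
    return sum(grades) / len(grades)


print(sum(ann_one_course), sum(bob))             # 90 105
print(average(ann_three_courses), average(bob))  # 30.0 35.0
```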

On the other hand, there are situations in which we CAN take those numbers as numbers -- but only if we're not interpreting them with normal meanings on the "out of 100" scale.  Suppose Ann knows 90 percent of the answers (sorry, "questions") in Jeopardy's "Opera" category, but 0 percent in "Shakespeare" and "Potent Potables."  Bob knows 30 percent in each.  In this case, Bob and Ann *are* equivalent.  

What's the difference?  In Jeopardy, "30 out of 100" isn't a rating -- it's a measurement.  It's actually a number, not a word.  And so, it doesn't have the same meaning as it does as a rating. Because, here, 30 percent is actually pretty good.

If we said, "Bob got a grade of 30% in Potent Potables," that would be almost a lie.  A grade of 30% is a failure, and Bob did not fail. 

-----

The calculus example touches on the system's biggest failing: it can't handle quantity, only quality.  In real life, it's almost always true that good things are better when you have more.  But the "out of 100" system doesn't handle that well.  

Suppose I have two mediocre cars, a Tahoe which rates a 58, and a Yaris that rates a 41.  Those add up to 99.  Are the two cars together, in any sense, equivalent to a Tesla?  I don't think so.

Or, suppose I have the choice: drive a Yaris for two years, or a Tesla for six months.  Can I say, "well, two years of the Yaris is 82 point-years, and six months of Tesla is 49.5 point-years, so I should go with the Yaris?"  Again, no.

What's better, watching two movies rated three stars, or watching three movies rated two stars?  It depends on the person, of course.  For me, I still wouldn't know the answer.  And, it completely depends on the context of what you're rating.  

My gut says that a five-star movie (to me) is worth at least ten three-star movies, and probably more.  But a night at a five-star hotel is worth a lot less than ten nights at a three-star hotel.

Rate your enjoyment of "The Hidden Game of Baseball," on a scale of 1 to 10.  Maybe, it's a 9.  Now, rate your enjoyment of the 1982 Baseball Abstract on the same scale.  Maybe that's a 9.5.  Fair enough.

Now, rate your enjoyment of the entire collected works of Pete Palmer.  I'll say 9.5.  Then, rate the entire collected works of Bill James.   I'd say 10.  

[If you're not familiar with sabermetrics books, try these translations: "The Hidden Game of Baseball" = "Catcher in the Rye".  "Pete Palmer" = "J.D. Salinger".  "1982 Baseball Abstract" = "Hamlet".  "Bill James" = "William Shakespeare".]

But: Bill James has probably published at least three times as many words as Pete Palmer!  And he gets no credit for that at all.  Because, Pete's other books (say, three of them) raised his rating half a point; and Bill's other books (say, ten of them) raised his rating by the same half a point!  If you insist on treating ratings as numbers, you have to conclude that you enjoyed each of Bill's subsequent books less than half as much as Pete's subsequent books.  
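For what it's worth, here is that calculation spelled out in Python, using the "say, three" and "say, ten" book counts from above (those counts are guesses for illustration, not a bibliography).

```python
from fractions import Fraction

half_point = Fraction(1, 2)  # rating gain credited to the "other books"

palmer_per_book = half_point / 3   # three other books (assumed count)
james_per_book = half_point / 10   # ten other books (assumed count)

# How much "enjoyment" each James book gets relative to each Palmer book
ratio = james_per_book / palmer_per_book

print(palmer_per_book, james_per_book, ratio)  # 1/6 1/20 3/10
```

Taken literally, each of Bill's later books would be worth three-tenths of one of Pete's, which is the absurd conclusion the ratings-as-numbers approach forces.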

Even worse: if Bill James had written nothing else after 1988, I'd still give him a 10.  The "out of 10" rating system completely ignores that you liked Bill's later works.

The system just doesn't understand quantity very well.  

But, it *does* understand *quality*.  What if we can somehow collapse quantity into quality?

A few paragraphs ago, the entire collected works of Pete Palmer rated a 9.5 in enjoyment.  The entire collected works of Bill James rated a 10.

But: now, compare two Christmas gifts under the tree.  One, the entire collected works of Pete Palmer.  Second, the entire collected works of Bill James.  

Now, we start thinking about the quality of the gift, rather than the quality of the *enjoyment* of the gift.  And the quality of a gift does, obviously, depend on how much you're getting.  And so, my gut is now happy to rate the Pete Palmer gift as a 7, and the Bill James gift as a 9.  

They shouldn't really be different, right?  Because, effectively, the quality of the gift is exactly the same as the amount of enjoyment, almost by definition.  (It's not like I was thinking I could use the Bill James books as doorstops.)  But in the gift case, we're mentally equipped to calibrate by quantity.  In the enjoyment case, our brain has no idea how to make that adjustment. 





Monday, January 20, 2014

Luck and the Olympic hockey tournament

There are twelve countries represented in Olympic men's hockey.  They play only three or four games each before getting to the quarter-finals; then it's single elimination after that.  

In the NHL, it takes a huge number of games -- 36 or 73 games per team -- to get to the point where talent is as important as luck.  But the entire Olympic tournament takes only 30 games.  Not 30 games per team, but 30 games period.

If that's the case, then how can the Olympics possibly filter out the best teams in such a short span?  

That's the subject of my article in the Ottawa Citizen today.  If you don't want to read the whole thing, the answer, basically, is:

1.  There is a much wider range of talent in the Olympics than in the NHL.  The top teams are almost as good as an NHL all-star team; the bottom teams are below minor-league.  That makes it much, much easier to separate good from bad.

2.  The IIHF (which structured the tournament) created an unbalanced schedule -- the bad teams disproportionately face the good teams, and vice-versa.  This makes it easier for the good teams to rise to the top.

3.  The IIHF noticed that, roughly speaking, there are six strong teams and six weak teams.  Therefore, luck will affect the middle of the standings more than the top or the bottom.  So, the teams in the middle get an extra game (against the bottom), in order to make it more likely that the better teams rise and the worse teams fall.

My conclusion is that the IIHF did an outstanding job in terms of squeezing the most "talent information" out of so few games.  For the full argument, check out the link.


Friday, January 10, 2014

Narrative stats vs. Rudy Gay

Earlier this season, the Toronto Raptors traded forward Rudy Gay to the Sacramento Kings.  At the time of the deal, Gay was second on the team in points per game and thought of as a star player.  His salary this season is $17 million.

At the time Gay was traded, the Raptors were 6-12.  Since the trade, they've gone 11-5.

What happened?  Gay's salary may be high, and his scoring may be decent, but, as Dave Berri points out, it looks like Gay just isn't very good.  He scores a lot of points, but only because he takes a lot of shots.  Berri writes,


"...Gay had an effective field goal percentage of 42.1%.  It shouldn’t take an understanding of “advanced stats” to conclude that such a player isn’t “good”."

Which makes complete sense.  A team has about 100 possessions per game in which to score.  A player who takes a shot has "used" one of his team's possessions, and, so, needs to use it wisely.  

The NBA average, last year, was that teams took shots in around 83 of those possessions, which (coincidentally) led to around 83 points on field goals.  But, if every player on a team shot like Rudy Gay, they'd score only around 70 points.  
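Here's the back-of-the-envelope version in Python. It relies on the standard definition of effective field goal percentage, under which points from field goals equal 2 x eFG% x attempts; the 83 shots, 83 points, and 42.1% figures are the ones quoted above, and the function and variable names are mine.

```python
def points_from_field_goals(efg_pct, shots):
    """Field-goal points implied by an effective FG%: 2 * eFG% * attempts."""
    return 2 * efg_pct * shots


SHOTS_PER_GAME = 83    # shots per game, from the paragraph above
LEAGUE_FG_POINTS = 83  # league-average points on those shots
GAY_EFG = 0.421        # Gay's eFG%, from the Berri quote

# 83 points on 83 shots implies a league eFG% of exactly 50%
league_efg = LEAGUE_FG_POINTS / (2 * SHOTS_PER_GAME)

# A whole team shooting at Gay's rate on the same number of shots
gay_team_points = points_from_field_goals(GAY_EFG, SHOTS_PER_GAME)

print(league_efg)                 # 0.5
print(round(gay_team_points, 1))  # 69.9
```

That is a gap of about 13 points per game from shooting efficiency alone.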

Of course, it could be that Gay's EFG% is low because he's the one who has to take the harder shots, like desperation attempts with the shot clock winding down.  But, unless you have some reason to believe that's what's happening, your first reaction would have to be that Gay's shooting is hurting the team.

(UPDATE: I should clarify that you can't conclude that Gay is hurting the team just because he's below average; a below-average player can still have value if he's better than the next best alternative (that is, better than "replacement value").  In this case, I'm assuming that Gay is sufficiently below average that he's worse than the bench, a reasonable assumption considering the Raptors' subsequent performance after the trade.)

Has any mainstream basketball journalist acknowledged Gay's poor efficiency?  Dave Berri asked on Twitter,


"Has any sports writer argued that the Raptors are better since the Gay trade because Gay is not a very productive player?"

There were only a couple of responses, pointing to online articles at Grantland and ESPN.  I did find another article that suggests teams are finally catching on, but also implies that, yes, it is indeed a fact that efficiency tends to be underrated.

------

As I said, I don't follow basketball as much as I should, so I may be wrong, but it does seem to me that broadcasters and sportswriters like to concentrate on per-game counting stats: points, rebounds, and assists.  When they discuss efficiency, it's usually for the team as a whole: "The Raptors lost 100-81 while shooting only 39% from the field."  Sometimes, they'll mention a player, but only when it's something really extreme: "Player X went only 2-for-15 in the 107-79 loss."

So, why the emphasis on points, instead of actual benefit to the team?  

I think it's because, as a rule, counting stats normally "work," in a narrative sense.  It's a rare case when you see a decent-sounding counting stat that doesn't mean a good performance that game.

Take baseball batters, for instance.  In a single game, if you accumulate any significant number of anything good, you probably were one of the biggest contributors, since your opportunities are limited.  The average player gets less than half an RBI per game.  So, if you drove in two or three, that's excellent.  If you got more than one hit, that's probably a .400 average for the game, which is outstanding.  

In hockey, the best forwards may average half a goal per game.  So, even a single goal is twice as good as average.  For a one-goal performance to be inefficient, you'd have to have taken twice as many shots as normal, which I guess would be possible but unlikely (except maybe for Alex Ovechkin).  And, of course, if you score two goals, or three, you've definitely helped the team win.

What about football?  Well, maybe, for rushing yards, you have the situation where you also need to know opportunities, how often that running back was given the ball.  But for QBs, there's less of that.  QB totals usually include a completion rate, and a negative (interceptions) to go with the positives ("Andrew Luck completed 29 of 45 passes for 443 yards with four touchdown passes and three interceptions").  Moreover, the number of throws doesn't vary *that* much from game to game ... for 443 yards to be mediocre, you'd probably need to have thrown, I dunno, 75 passes or something?  And that doesn't happen much.

Even in basketball, it's not a problem if the numbers are exceptionally high.  If a player scores 40, he's probably had an excellent game.  To score 40 points and still be inefficient, you'd have to take a lot of shots.  That doesn't happen.  Last season, LeBron James had seven games of 35+ points, but had only one game with more than 30 field goal attempts.

So a high scoring game is almost certainly good.  But those middling games, the ones in the low 20s ... those are where you can't tell from just the number.  Those are the ones where what looks like a positive performance may be good, or bad, depending.  

-----

My guess is: in a world where every other high counting stat is a sign of a good performance, it's easier to carry the narrative along to Rudy Gay territory than to remember this rare exception.




Thursday, January 02, 2014

Probabilities, genetic testing, and doctors, part II

(Part I is here)

------

Kira Peikoff ordered "direct-to-consumer" genetic tests from three competing companies.  In some cases, they gave her results that were very different from each other.  This led Peikoff to think that maybe she got ripped off, or that the firms aren't able to deliver what they promise.  In a New York Times article, she writes,


"At a time when the future of such companies hangs in the balance, their ability to deliver standardized results remains dubious, with far-reaching implications for consumers."

But: I think her concern stems from a misunderstanding of how the probabilities work.

The provider "23andMe" -- the one recently shut down by the FDA -- reported to Peikoff that she had a higher-than-normal risk of contracting psoriasis, twice the normal chance.  But a rival company, Genetic Testing Laboratories (GTL), told her she had a much *lower* risk -- 80% less than average.  

The two companies differed by a factor of ten, a proverbial "order of magnitude".  Clearly, those results can't both be right, can they?

Well, actually, they can, because GTL tested more genes than 23andMe.

In the illustration that accompanies the article, we can see that GTL tested eight sets of genes: HLA, IL12B, IL23R, Intergenic_1q21, SPATA2, STAT2, TNFAIP3, and TNIP1.

The article doesn't say what genes 23andMe tested, but, in my own report, my result is based on only 3 tests: HLA-C, IL12B, and IL23R.

So, it's quite reasonable that the two analyses would give different results, since they're based on different information. And, they're both correct, as far as they go.  If all you have is the three genes that 23andMe looked at, it's reasonable to say that your risk is twice normal.  The extra genes that GTL tested provided more information, and additional relevant information changes an estimate.  

This is the essence of Bayesian reasoning: start with your prior, and update your beliefs based on new information.
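To see how two such different numbers can both be right, here is a toy Python sketch. Everything in it is hypothetical: I'm assuming each marker contributes an independent multiplicative factor to relative risk (a simplification; the article doesn't describe either company's actual model), and the per-marker multipliers are made up so that three markers give twice the average risk and all eight give 80% less than average.

```python
from fractions import Fraction
from math import prod

# HYPOTHETICAL relative-risk multipliers, one per tested marker.
# Values above 1 raise risk, values below 1 lower it.
first_three = [Fraction(8, 5), Fraction(5, 4), Fraction(1)]
extra_five = [Fraction(1, 2), Fraction(1, 2), Fraction(4, 5),
              Fraction(1, 2), Fraction(1)]

# Estimate from the three-marker test alone
risk_three = prod(first_three)

# Estimate after updating on the five additional markers
risk_eight = prod(first_three + extra_five)

print(risk_three, risk_eight, risk_three / risk_eight)  # 2 1/5 10
```

Both estimates come from the same person and the same rules; the second just conditions on more information.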

------

You flip two coins, and leave them covered.  You ask a statistician what the chance is that you got two heads.  He says, "one in four."  That is the correct answer.

Then, you call in a second statistician, and you uncover the first coin, which turns out to be a head.  You ask the same question.  The second statistician says, "one in two".  That is again the correct answer.

But the first statistician was not wrong.  He was absolutely correct.  It's perhaps counterintuitive.  I mean, he said "one in four," and now we know the answer is "one in two".  How could he have been right?  You can argue that he did the best he could with the information available, and it's not his fault that he was wrong, but ... his answer wasn't right.

But his answer WAS right.  That's because the two statisticians were asked two different questions.  The first one was asked, "what's the chance that both coins landed heads?"  The second one was asked, "what's the chance that both coins landed heads given that we know the first one did?"
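You can check both answers by brute force. This Python sketch enumerates the four equally likely outcomes of two fair coins (fairness and independence are assumed) and computes each statistician's probability; the helper names are mine.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))


def prob(event, given=lambda o: True):
    """P(event | given), by counting equally likely outcomes."""
    pool = [o for o in outcomes if given(o)]
    hits = [o for o in pool if event(o)]
    return Fraction(len(hits), len(pool))


def both_heads(outcome):
    return outcome == ("H", "H")


def first_is_head(outcome):
    return outcome[0] == "H"


print(prob(both_heads))                       # 1/4  (first statistician)
print(prob(both_heads, given=first_is_head))  # 1/2  (second statistician)
```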

-----

Doesn't this at least demonstrate that 23andMe is being lax in its testing, not using enough information?  No, it doesn't. Any information is better than none.  23andMe costs $99 and uses saliva.  The GTL test costs $259 and uses blood.  I'm sure if you wanted to spend $1000, you could find even more genes to test.

Say you're buying car insurance.  Company A asks if you use a seat belt.  You say no, and they quote you a high rate. You go to company B.  They secretly shadow you around for a week, and discover that you're actually such a safe and cautious driver that it completely cancels out your non-seatbelt-risk, and they quote you a lower rate.  

Was Company A wrong in quoting you a high rate?  No, they weren't.  It was the right answer for the information they had.  Unless you fault them for not following you around to get the information for a better estimate.  If you do choose to fault them for that, then you have to fault every real-life risk estimate ever made, because there's always more information you can get if you take the time to uncover it.  Risk estimates *always* change with additional relevant information, which is what Bayes' Theorem is all about.


-----

This is a variation of the thinking I argued against last post: the idea that since there's always more information -- including information that hasn't even yet been discovered -- the information we do have is incomplete, and therefore not relevant.  From Peikoff's article:


"Imagine if you took a book and you only looked at the first letter of every other page,” said Dr. Robert Klitzman, a bioethicist and professor of clinical psychiatry at Columbia. (I [reporter Peikoff] am a graduate student there in his Master of Bioethics program.) "You’re missing 99.9 percent of the letters that make the genome. The information is going to be limited.""

Again: the information is limited, but still useful -- like the fact that you don't wear a seatbelt.  If gene X is linked to double the risk, it's not reasonable to say, "well, we might later find that gene Y turns off gene X, so don't worry about it."

Interestingly, in the same article, another bioethicist implicitly contradicts Klitzman!  Arthur L. Caplan, director of medical ethics at New York University, writes,


"If you want to spend money wisely to protect your health and you have a few hundred dollars, buy a scale, stand on it, and act accordingly."

That completely contradicts Dr. Klitzman, doesn't it?  Klitzman is saying, "if you don't have all the information on risk factors, the genetic information you do have isn't useful."  Caplan is saying, "if you don't have all the information on risk factors, the obesity information you do have is still very important."

What's going on, where the same story can quote two opposite arguments without noticing the contradiction?  I think, maybe, it's the fallacy of mental accounting.  There's the obesity mental account, and the DNA mental account. We have full knowledge of how fat you are, so we should consider what we know.  But we have only partial knowledge of your DNA, so we have to ignore what we know.

Except, probabilities don't work that way.  They don't keep separate mental buckets.  If there are 100 independent DNA datapoints, and 1 obesity datapoint, the laws of probability treat them the same, as 101 datapoints.  

It's like, if you roll 101 dice, but 100 are blue and only one is red ... the first blue die is just as useful in predicting the overall total as the first (and only) red one.
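Here's one way to put a number on "just as useful," as a Python sketch assuming fair, independent six-sided dice. It measures usefulness as the amount of variance in the total that disappears once a single die is revealed; that choice of measure is mine.

```python
from fractions import Fraction

faces = range(1, 7)
mean_face = Fraction(sum(faces), 6)  # 7/2

# Variance of one fair die
var_face = sum((Fraction(f) - mean_face) ** 2 for f in faces) / 6

n_dice = 101  # 100 blue + 1 red; colour plays no part in the math

# Variances of independent dice add
var_total = n_dice * var_face

# Reveal any single die, blue or red: 100 unknown dice remain
var_after_one_reveal = (n_dice - 1) * var_face

print(var_face, var_total - var_after_one_reveal)  # 35/12 35/12
```

Revealing one die removes exactly one die's worth of uncertainty, whichever colour it happens to be.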

Sure, my obesity might give me twice the risk of disease X. But if a gene you looked at gives you three times the risk ... you should be more worried than me, even if you only looked at one gene, and even if your other 999 genes might cancel it out.  

That's just how probabilities work.  


