Thursday, January 07, 2016

When log5 does and doesn't work

Team A, with an overall winning percentage talent of .600, plays against a weaker Team B with an overall winning percentage talent of .450. What's the probability that team A wins? 

In the 1980s, Bill James created the "log5" method to answer that question. The formula is

P = (A - AB)/(A+B-2AB)

... where A is the talent level of team A (in this case, .600), and B is the talent level of team B (.450).

Plug in the numbers, and you get that team A has a .647 chance of winning against team B. 
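In code, the formula is a one-liner. A quick sketch in Python:

```python
def log5(a, b):
    """Bill James's log5: probability that a team of talent a
    beats a team of talent b."""
    return (a - a * b) / (a + b - 2 * a * b)

print(round(log5(0.600, 0.450), 3))  # 0.647
```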

That makes sense: A is .600 against average teams. Since opponent B is worse than average, A should be better than .600. 

Team B is .050 worse than average, so you'd kind of expect A to "inherit" those 50 points, to bring it to .650. And it does, almost. The final number is .647 instead of .650. The difference is because of diminishing returns -- those ".050 lost wins" are what B loses to *average* teams because it's bad. Because A is better than average, it would have got some of those .050 wins anyway because it's good, so B can't "lose them again" no matter how bad it is.

In baseball, the log5 formula has been proven to work very well.


There was some discussion of log5 lately on Tango's site (unrelated to this post, but very worthwhile), and that got me thinking. Specifically, it got me thinking: log5 CANNOT be right. It can be *almost* right, but it can never be *exactly* right.

In the baseball context, it can be very, very close, indistinguishable from perfect. But in other sports, or other contexts, it could be way wrong. 

Here's one example where it doesn't work at all.

Suppose that, instead of actually playing baseball games, teams just measured their players' average height, and the taller team wins. And, suppose there are 11 teams in the league, and there's a balanced 100-game season.

What happens? Well, the tallest team beats everyone, and goes 100-0. The second-tallest team beats everyone except the tallest, and winds up 90-10. The third-tallest goes 80-20. And so on, all the way down to 0-100.

Now: when a .600 team plays a .400 team, what happens? The log5 formula says it should win 69.2 percent of those games. But, of course, that's not right -- it will win 100 percent of those games, because it's always taller.

For height, the log5 method fails utterly.
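To see the failure numerically, here's a small sketch of the "height baseball" league described above:

```python
def log5(a, b):
    return (a - a * b) / (a + b - 2 * a * b)

# Records in an 11-team "height baseball" league with a balanced
# 100-game schedule: 100-0, 90-10, ..., 0-100.
records = [(100 - 10 * rank) / 100 for rank in range(11)]

# log5's prediction for the .600 team against the .400 team:
print(round(log5(0.600, 0.400), 3))  # 0.692
# The true probability is 1.0 -- the .600 team is always taller.
```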


What's the difference between real baseball and "height baseball" that makes log5 work in one case but not the other?

I'm not 100% sure of this, but I think it's due to a hidden, unspoken assumption in the log5 method. 

When we say, "Team A is a .600 talent," what does that mean? It could mean either of two things:

-- A1. Team A is expected to beat 60 percent of the opponents it plays.

-- A2. If Team A plays an average team, it is expected to win 60 percent of the time.

Those are not the same! And, for the log5 method to work, assumption A1 is irrelevant. It's assumption A2 that, crucially, must be true. 

In both real baseball and "height baseball," A1 is true. But that doesn't matter. What matters is A2. 

In real baseball, A2 is close enough. So log5 works.

In "height baseball," A2 is absolutely false. If Team A (.600) plays an average team (.500), it will win 100 percent of the time, not 60 percent! And that's why log5 doesn't work there.


What it's really coming down to is our old friend, the question of talent vs. luck. In real baseball, for a single game, luck dwarfs talent. In "height baseball," there's no luck at all -- the winner is just the team with the most talent (height). 

Here are two possible reasons a sports team might have a .600 record:

-- B1: Team C is more talented than exactly 60 percent of its opponents.

-- B2: Team C is more talented than average, by some unknown amount (which varies by sport) that leads to it winning exactly 60 percent of its games.

Again, these are not the same. And, in real life, all sports (except "height baseball") are some combination of the two. 

B1 refers completely to talent, but B2 refers mostly to luck. The more luck there is, in relation to talent, the better log5 works.

Baseball has a pretty high ratio of luck to talent -- on any given day, the worst team in baseball can beat the best team in baseball, and nobody bats an eye. But in the NBA, there's much less randomness -- if Philadelphia beats Golden State, it's a shocking upset. 

So, my prediction is: the less that luck is a factor in an outcome, the more log5 will underestimate the better team's chance of winning.

Specifically, I would predict: log5 should work better for MLB games than for NBA games.
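A toy simulation makes the prediction concrete. Everything here is invented for illustration -- Gaussian "luck," an arbitrary talent scale, and luck SDs chosen only to represent "luck dwarfs talent," "luck merely exceeds talent," and "no luck at all" (height baseball):

```python
import random

def log5(a, b):
    return (a - a * b) / (a + b - 2 * a * b)

def season_pct(skill, league, sd_luck, n, rng):
    # Fraction of n games won against opponents drawn from the league,
    # where each side's "score" is its skill plus Gaussian luck.
    wins = sum(skill + rng.gauss(0, sd_luck) >
               rng.choice(league) + rng.gauss(0, sd_luck) for _ in range(n))
    return wins / n

def head_to_head(skill_a, skill_b, sd_luck, n, rng):
    wins = sum(skill_a + rng.gauss(0, sd_luck) >
               skill_b + rng.gauss(0, sd_luck) for _ in range(n))
    return wins / n

rng = random.Random(42)
league = [rng.gauss(0, 1) for _ in range(1000)]  # talent SD = 1 by construction
team_a, team_b = 1.0, -1.0                       # one SD above/below average

for sd_luck in (6.0, 2.0, 0.0):
    pa = season_pct(team_a, league, sd_luck, 20000, rng)
    pb = season_pct(team_b, league, sd_luck, 20000, rng)
    actual = head_to_head(team_a, team_b, sd_luck, 20000, rng)
    print(f"luck SD {sd_luck}: log5 = {log5(pa, pb):.3f}, actual = {actual:.3f}")
```

When luck is large, log5's prediction and the head-to-head result come out nearly identical; as the luck SD shrinks toward zero, the actual probability climbs to 1.000 while log5, working only from the two teams' league-wide percentages, predicts noticeably less.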


Maybe someone wants to do some heavy thinking and figure out how to move this forward mathematically. For now, here's how I started thinking about it.

In MLB, the SD of team talent seems to be about 9 games per season. That's 90 runs. Actually, it's less, because you have to regress to the mean. Let's call it 81 runs, or half a run per game. (I'm too lazy to actually calculate it.) Combining the team and opponent, multiply by the square root of two, to give an SD of around 0.7 runs.

The SD of luck, in a single game, is much higher. I think that if you computed the SD of a single team's runs scored across its 162 games, you'd get around 3. The SD of runs allowed is also around 3, so the SD of the difference would be around 4.2.

SD(MLB talent) = 0.7 runs
SD(MLB luck)   = 4.2 runs

Now, let's do the NBA. The SD of the SRS rating seems to be just under 5 points. That's based on outcomes, so it's too high to be an estimate of talent, and we need to regress to the mean. Let's arbitrarily reduce it to 4 points. Combining the two teams, we're up to about 5.7 points.

What about the SD of luck? This site shows that, against the spread, the SD of score differential is around 11 points. So we have

SD(NBA talent) =  5.7 points
SD(NBA luck)   = 11.0 points

In an MLB game, luck is 6 times as important as talent. In an NBA game, luck is only 2 times as important as talent. 
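Checking the arithmetic -- the SD figures are the rough guesses above, not measured values:

```python
import math

# MLB: talent SD ~0.5 runs per team per game; combine the two teams.
mlb_talent = 0.5 * math.sqrt(2)        # ~0.7 runs
mlb_luck = math.sqrt(3**2 + 3**2)      # ~4.2 runs (runs scored, runs allowed, SD ~3 each)

# NBA: regressed SRS SD ~4 points per team; combine the two teams.
nba_talent = 4 * math.sqrt(2)          # ~5.7 points
nba_luck = 11.0                        # point-differential SD against the spread

print(round(mlb_luck / mlb_talent, 1))  # 6.0
print(round(nba_luck / nba_talent, 1))  # 1.9
```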

But, how you apply that to fix log5, I haven't figured out yet. 

What I *do* think I know is that the MLB ratio of 6:1 is large enough that you don't notice that log5 is off. (I know that from studies that have tested it and found it works almost perfectly.) But I don't actually know whether the NBA ratio of 2:1 is also large enough. My gut says it's not -- I suspect that, for the NBA, in extreme cases, log5 will overestimate the underdog enough so that you'll notice. 


Anyway, let me summarize what I speculate is true:

1. The log5 formula never works perfectly. Only as the luck/talent ratio goes to infinity will log5 be theoretically perfect. (But, then, the predictions will always be .500 anyway.) In all other cases, log5 will underestimate, to some extent, how much the better team will dominate.

2. For practical purposes, log5 works well when luck is large compared to talent. The 6:1 ratio for a given MLB game seems to be large enough for log5 to give good results.

3. When comparing sports, the more likely it is that the more-talented team beats the less-talented team, the worse log5 will perform. In other words: the bigger the Vegas odds on underdogs, the worse log5 will perform for that sport.

4. You can also estimate how well log5 will perform with a simple test. Take a team near the extremes of the performance scale (say, a .600/.400 team in MLB, or a .750/.250 team in the NBA), and see how it performed specifically against only those teams with talent close to .500.

If a .750 team has a .750 record against teams known to be average, log5 will work great. But if it plays .770 or .800 or .900 ball against teams known to be average, log5 will not work well. 


All this has been mostly just thinking out loud. I could easily be wrong.


Friday, December 04, 2015

A new "hot hand" study finds a plausible effect

There's a recent baseball study (main page, .pdf) that claims to find a significant "hot hand" effect. Not just statistically significant, but fairly large:

"Strikingly, we find recent performance is highly significant in predicting performance ... Furthermore these effects are of a significant magnitude: for instance, ... a batter who is “hot” in home runs is 15-25% (0.5-0.75 percentage points or about one half a standard deviation) more likely to hit a home run in his next at bat."

Translating that more concretely into baseball terms: imagine a batter who normally hits 20 HR in a season. The authors are saying that when he's on a hot streak of home runs, he actually hits like a 23 or 25 home run talent. 

That's a strong effect. I don't think even home field advantage is that big, is it?

In any case, after reading the paper ... well, I think the study's conclusions are seriously exaggerated. Because, part of what the authors consider a "hot hand" effect doesn't have anything to do with streakiness at all.


The study took all player seasons from 2000 to 2011, subject to an AB minimum. Then, the authors tried to predict every single AB for every player in that span. 

To get an estimate for that AB, the authors considered:

(a) the player's performance in the preceding 25 AB; and
(b) the player's performance in the rest of the season, except that they excluded a "window" of 50 AB before the 25, and 50 AB after the AB being predicted.

To make this go easier, I'm going to call the 25 AB sample the "streak AB" (since it measures how streaky the player was). I'm going to call the two 50 AB exclusions the "window AB". And, I'm going to call the rest of the season the "talent AB," since that's what's being used as a proxy for the player's ability.

Just to do an example: Suppose a player had 501 AB one year, and the authors are trying to predict AB number 201. They'd divide up the season like this:

1. the first 125 AB (part of the "talent AB")
2. the next 50 AB (part of the "window AB")
3. the next 25 AB (the "streak AB")
4. the single "current AB" being predicted
5. the next 50 AB (part of the "window AB")
6. the next 250 AB (part of the "talent AB").
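The partition in the numbered example can be sketched as a function (a hypothetical helper for illustration, not code from the study):

```python
def partition_season(current_ab, total_ab, streak=25, window=50):
    """Split a season's at-bats (numbered 1..total_ab) around one
    'current AB': a streak sample just before it, excluded windows on
    either side, and the rest of the season as the talent sample."""
    streak_start = current_ab - streak
    pre_window_start = streak_start - window
    post_window_end = current_ab + window
    talent = (list(range(1, pre_window_start)) +
              list(range(post_window_end + 1, total_ab + 1)))
    window_abs = (list(range(pre_window_start, streak_start)) +
                  list(range(current_ab + 1, post_window_end + 1)))
    streak_abs = list(range(streak_start, current_ab))
    return talent, window_abs, streak_abs

# The example: predicting AB number 201 of a 501-AB season.
talent, window_abs, streak_abs = partition_season(201, 501)
print(len(talent), len(window_abs), len(streak_abs))  # 375 100 25
```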

They run a regression to predict (4), based on two variables:

B1 -- the player's ability, which is how he did in (1) and (6) combined
B2 -- the player's performance in (3), the 25 "streak AB" that show how "hot" or "cold" he was, going into the current AB.

Well, not just those two -- they also include the obvious control variables, like season, park, opposing pitcher's talent, platoon, and home field advantage. 

(Why did they choose to exclude the "windows" (2) and (5)? They say that because the windows occur so close to the actual streak, they might themselves be subject to streakiness, and that would bias the results.)

What did the study find? That the estimate of "B2" was large and significant. Holding the performance in the 375 AB "ability" sample constant, the better the player did in the immediately preceding 25 "streak" AB, the better he did in the current AB.

In other words, a hot player continues to be hot!


But there's a problem with that conclusion, which you might have figured out already. The methodology isn't actually controlling for talent properly.

Suppose you have two players, identical in the "talent" estimate -- in 375 AB each, they both hit exactly 21 home runs.

And suppose that in the streak AB, they were different. In the 25 "streak" AB, player A didn't hit any home runs. But player B hit four additional homers.

In that case, do you really expect them to hit identically in the 26th AB? No, you don't. And not because of streakiness -- but, rather, because player B has demonstrated himself to be a better home run hitter than player A, by a margin of 25 to 21. 

In other words, the regression coefficient confounds two factors -- streakiness, and additional evidence of the players' relative talent.


Here's an example that might make the point a bit clearer.

(a)  in their first 10 AB -- the "talent" AB -- Mario Mendoza and Babe Ruth both fail to hit a HR.
(b)  in their second 100 AB -- the "streak" AB -- Mendoza hits no HR, but the Babe hits 11.
(c)  in their third 100 AB -- the "current" AB -- Mendoza again hits no HR, but the Babe hits 10.

Is that evidence of a hot hand? By the authors' logic, yes, it is. They would say:

1. The two players were identical in talent, from the control sample of (a). 
2. In (b), Ruth was hot, while Mendoza was cold.
3. In (c), Ruth outhit Mendoza. Therefore, it must have been the hot hand in (b) that caused the difference in (c)!

But, of course, the point is ... (b) is not just evidence of which player was hot. It's also evidence of which player was *better*. 


Now, the authors did actually understand this was an issue. 

In a previous version of their paper, they hadn't. In 2014, when Tango posted a link on his site, it took only two-and-a-half hours for commenter Kincaid to point out the problem (comment 6).  (There was a follow-up discussion too.)

The authors took note, and now realize that their estimates of streakiness are confounded by the fact that they're not truly controlling for established performance. 

The easiest way for them to correct the problem would have been just to include the 25 AB in the talent variable. In the "Player A vs. Player B" example, instead of populating the regression with "21/0" and "21/4", they could easily have populated it with "21/0" and "25/4". 

Which they did, except -- only in one regression, and only in an appendix that's for the web version only.

For the published article, they decided to leave the regression the way it was, but, afterwards, try to break down the coefficient to figure out how much of the effect was streakiness, and how much was talent. Actually, the portion I'm calling "talent" they decided to call "learning," on the grounds that it's caused by performance in the "streak AB" allowing us to "learn" more about the player's long-term ability. 

Fine, except: they still chose to define "hot hand" as the SUM of "streakiness" and "learning," on the grounds that ... well, here's how they explain it:

"The association of a hot hand with predictability introduces an issue in interpretation, that is also present but generally unacknowledged in other papers in the area. In particular, predictability may derive from short-term changes in ability, or from learning about longer-term ability. ... We use the term “hot hand” synonymously with short-term predictability, which encompasses both streakiness and learning."

To paraphrase, what they're saying is something like:

"The whole point of "hot hand" studies is to see how well we can predict future performance. So the "hot hand" effect SHOULD include "learning," because the important thing is that the performance after the "hot hand" is higher, and, for predictive purposes, we shouldn't care what caused it to be higher."

I think that's nuts. 

Because, the "learning" only exists in this study because the authors deliberately chose to leave some of the known data out of the talent estimate.

They looked at a 25/4 player (25 home runs of which 4 were during the "streak"), and a 21/0 player (21 HR, 0 during the streak), and said, "hey, let's deliberately UNLEARN about the performance during the streak time, and treat them as identical 21-HR players. Then, we'll RE-LEARN that the 25/4 guy was actually better, and treat that as part of the hot hand effect."


So, that's why the authors' estimates of the actual "hot hand" effect (as normally understood outside of this paper) are way too high. They answered the wrong question. They answered,

"If a guy hitting .250 has a hot streak and raises his average to .260, how much better will he be than a guy hitting .250 who has a cold streak and lowers his average to .240?"

They really should have answered,

"If a guy hitting .240 has a hot streak and raises his average to .250, how much better will he be than a guy hitting .260 who has a cold streak and lowers his average to .250?"


But, as I mentioned, the authors DID try to decompose their estimates into "streakiness" and "learning," so they actually did provide good evidence to help answer the real question.

How did they decompose it? They realized that if streakiness didn't exist at all, each individual "streak AB" should have the same weight as each individual "talent AB". It turned out that the individual "streak AB" were actually more predictive, so the difference must be due to streakiness.

For HR, they found the coefficient for the "streak AB" batting average was .0749. If a "streak AB" were exactly as important as a "talent AB", the coefficient would have been .0437. The difference, .0312, can maybe be attributed to streakiness.

In that case, the "hot hand" effect -- as the authors define it, as the sum of the two parts -- is 58% learning, and 42% streakiness.
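A quick check of that decomposition, using the two coefficients quoted above:

```python
streak_coef = 0.0749   # weight of one "streak AB" (from the paper)
talent_coef = 0.0437   # weight it would carry with no streakiness at all
streakiness = streak_coef - talent_coef   # 0.0312

print(round(talent_coef / streak_coef, 2))   # 0.58 -- the "learning" share
print(round(streakiness / streak_coef, 2))   # 0.42 -- the streakiness share
```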


They didn't have to do all that, actually, since they DID run a regression where the Streak AB were included in the Talent AB. That's Table A20 in the paper (page 59 of the .pdf), and we can read off the streakiness coefficient directly. It's .0271, which is still statistically significant.

What does that mean for prediction?

It means that to predict future performance, based on the HR rate during the streak, only 2.71 percent of the "hotness" is real. You have to regress 97.29 percent to the mean. 

Suppose a player hit home runs at a rate of 20 HR per 500 AB, including the streak. During the streak, he hit 4 HR in 25 AB, which is a rate of 80 HR per 500 AB. What should we expect in the AB that immediately follows the streak?

Well, during the streak, the player hit at a rate 60 HR / 500 AB higher than normal. 60 times 2.71 percent equals 1.6. So, in the AB following the streak, we'd expect him to hit at a rate of 21.6 HR, instead of just 20.
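That regression-to-the-mean arithmetic, as a sketch:

```python
def predict_rate(overall, streak, coefficient):
    # Only `coefficient` of the streak's departure from the overall
    # rate carries forward into the next AB.
    return overall + coefficient * (streak - overall)

# A 20-HR-per-500-AB hitter who homered at an 80-per-500-AB pace
# during the streak:
print(round(predict_rate(20, 80, 0.0271), 1))  # 21.6
```

The same helper handles the other stats -- for example, predict_rate(80, 160, 0.0674) gives about 85.4, matching the walk calculation.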


In addition to HR, the authors looked at streaks for hits, strikeouts, and walks. I'll do a similar calculation for those, again from Table A20.

Batting Average

Suppose a player hits .270 overall (except for the one AB we're predicting), but has a hot streak where he hits .420. What should we expect immediately after the streak?

The coefficient is .0053. 150 points above average, times .0053, is ... less than one point. The .270 hitter becomes maybe a .271 hitter.


Strikeouts

Suppose a player normally strikes out 100 times per 500 AB, but struck out at double that rate during the streak (which is 10 K in those 25 AB). What should we expect?

The coefficient is .0279. 100 rate points above average, times .0279, is 2.79. So, we should expect the batter's K rate to be 102.79 per 500 AB, instead of just 100. 


Walks

Suppose a player normally walks 80 times per 500 PA, but had a streak where he walked twice as often. What's the expectation after the streak?

The coefficient here is larger, .0674. So, instead of walking at a rate of 80 per 500 PA, we should expect a walk rate of 85.4. Well, that's a decent effect. Not huge, but something.

(The authors used PA instead of AB as the basis for the walk regression, for obvious reasons.)


It's particularly frustrating that the paper is so misleading, because, there actually IS an indication of some sort of streakiness. 

Of course, for practical purposes, the size of the effect means it's not that important in baseball terms. You have to quadruple your HR rate over a 25 AB streak to get even a 10 percent increase in HR expectation in your next single AB. At best, if you double your walk rate over a hot streak, your walk expectation goes up about 7 percent.

But it's still a significant finding in terms of theory, perhaps the best evidence I've ever seen that there's at least *some* effect. It's unfortunate that the paper chooses to inflate the conclusions by redefining "hot hand" to mean something it's not.

(P.S.  MGL has an essay on this study in the 2016 Hardball Times. My book arrived last week, but I haven't read it yet. Discussion here.)


Tuesday, October 06, 2015

Does vaping induce teenagers to become smokers?

Do electronic cigarettes lead users into smoking real cigarettes? In other words, is vaping a "gateway activity" to smoking?

A recent study says that, yes, vapers are indeed more likely to become smokers than non-vapers are. In fact, they're *four times* as likely to do so. 

The study looked at a sample of young people aged 16 to 26 who said they didn't intend to become smokers. When they caught up with them a year later, only 9.6 percent of the non-vapers had smoked in the past year. But 37.5 percent of the vapers had!

Seems like pretty strong evidence, right? The difference was certainly statistically significant.

Except ... here's an article from FiveThirtyEight that suggests that, no, this is NOT strong evidence that vaping leads to smoking. Why not? Because the sample size was very small. The vaping group comprised only 16 participants, compared to 678 for the control group.

Vaping:      6/ 16  (37.5%)
Non-Vaping: 65/678   (9.6%)

FiveThirtyEight says,

"Voila, six out of 16 makes 37.5 percent — it’s a big number that comes from a small number, which makes it a dubious one. 
So because six people started smoking, news reports alleged that e-cigs were a gateway to analog cigs."

Well, I have some sympathy for that argument, but ... just a little. Statistical significance does adjust for sample size, so, in effect, the data does actually say that the sample size issue isn't that big a deal. To argue that 16 people isn't enough, you need something other than a "gut feel" argument. For instance, you could hypothesize that 16 vapers out of 694 people is a lower incidence of vaping than in the general population, and, therefore, you're getting only "out of the closet vapers" self-identifying, which makes the 16 vapers unrepresentative. 

But, the article doesn't make any arguments like that.
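For what it's worth, the point that significance already accounts for sample size can be checked with a standard pooled two-proportion z-test on the study's raw counts (a textbook test, not necessarily the study's own method):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two sample proportions,
    using the pooled proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 6 of 16 vapers vs. 65 of 678 non-vapers went on to smoke.
z = two_proportion_z(6, 16, 65, 678)
print(round(z, 1))  # 3.6 -- well past the usual 1.96 cutoff
```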


The FiveThirtyEight story tries to make the case that the study, and the press release describing it, are biased, because they're too overconfident about a sample that's too small to draw any conclusions. 

I don't agree with that, but I DO agree that there's bias. A much, much worse bias, one that's obvious when you think about it, but one that has nothing to do with the actual statistical techniques. 

What's the actual problem? It's that the whole premise is mistaken. Comparing vapers to non-vapers is NOT evidence for whether vaping entices young people into smoking. Not at all. Even with a huge sample size. Even if you actually counted everyone in the world, and it turned out that vapers were five times as likely to become smokers as non-vapers, that would NOT imply that vaping leads to smoking, and it would NOT imply that banning vaping would "protect our youth" from the dangers of smoking real cigarettes.

It could even be that, despite vapers being five times as likely to take up smoking, vaping actually *reduces* the incidence of smoking.

How? Well, suppose that vapers and smokers are the same "types" of people, those who want to send a signal that they're risk-takers and nonconformists. Before, they all took up smoking. Now, some take up smoking and some vaping. Sure, some of the vapers become smokers later. But, overall, you could easily have fewer smokers than before you started. 
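Here's a toy example, with every number invented for illustration, showing how both things can be true at once:

```python
# A made-up population: 1,000 teens, 300 of them "risk-taker types"
# who, in a world WITHOUT vaping, essentially all end up smoking.
risk_takers = 300
other_smokers = 14                     # background smokers among the other 700

smokers_without_vaping = risk_takers + other_smokers           # 314

# With vaping available: most risk-takers vape instead, and only
# some of those later move on to cigarettes.
vapers = 250                           # risk-takers who vape first
vapers_who_smoke = 100                 # 40% of vapers "progress" to smoking
direct_smokers = risk_takers - vapers  # 50 who skip straight to smoking

smokers_with_vaping = vapers_who_smoke + direct_smokers + other_smokers  # 164

vaper_rate = vapers_who_smoke / vapers                          # 40%
nonvaper_rate = (direct_smokers + other_smokers) / (1000 - vapers)  # ~8.5%

print(smokers_without_vaping, smokers_with_vaping)  # 314 164
print(round(vaper_rate / nonvaper_rate, 1))         # 4.7
```

So vapers are nearly five times as likely to become smokers, and yet vaping cut the total number of smokers in half.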

"What do I think? A vaper is in denial. It’s not the vaping itself that causes you to become a smoker, but simply that a smoker is a closet-vaper. 
"This is likely true of most vices. It won’t act as a gateway, but simply that you will try it because you were going to try to harder stuff anyway. Even if you didn’t want to admit it. 
" ... There’s a dozen ways to get from Chinatown to Times Square. Manhattan then adds a direct bus line that goes up Broadway. Does that bus “cause” people to go from Chinatown to Times Square? Or, does it simply become a stepping stone that they would have otherwise bypassed? 
"Basically, do the same number of people end up going Chinatown to Times Square? 
"Do the same number of people end up smoking the real stuff anyway? All vaping is doing is redirecting the flow of people?"


If that sounds too abstract in words, it'll become crystal clear if we just change the context, but leave the numbers and arguments the same.

"Ignore The Headlines: We Don’t Know If Suicide Hotlines Lead Kids to Kill Themselves.
"After a year, 37.5 percent of those who had called a Suicide Hotline had gone on to end their own lives. That's a big percentage when you consider that the suicide rate was only 9.6 percent among respondents who hadn’t called the hotline.  
"Our study identified a longitudinal association between suicide hotline use and progression to actual suicide, among adolescents and young adults. Especially considering the rapid increase and promotion of distress lines, these findings support regulations to limit suicide hotlines and decrease their appeal."

It's exactly the same thing! Really. I edited a bit, but most of the words come exactly from the original articles on vaping.

Now, you could argue: well, it's not REALLY the same thing. We know that suicide hotlines decrease suicide, but, come on, can you really believe that vaping reduces smoking?

To which I answer: absolutely. I *do* believe that vaping reduces smoking. If you believe differently, then, study the issue! This particular study doesn't provide evidence either way.

And, more importantly: "can you really believe?" is not science, no matter how incredulously you say it.


Logically and statistically, the relevant number is NOT what percentage of vapers (hotline callers) go on to smoke (commit suicide). The relevant number is, actually, how many people would go on to smoke (commit suicide) if vaping (suicide hotlines) did not exist. 

Why is this not as obvious in the vaping case as in the hotline case? Because of bias against vaping. No other reason. The researchers and doctors start out with the prejudice that vaping is a bad thing, and, because of confirmation bias, interpret the result as, obviously, supporting their view. It seems so obvious that they don't even consider any other possibility.

I bet it's not just vaping and suicide hotlines. I suspect that we'd be eager to accept the "A leads to more bad things than non-A" if we're against A, but we see it's obviously a ridiculous argument if we approve of A. Here are a few I thought of:

"37% of teenagers who play hockey went on to commit assault, as compared to only 9% who didn't play hockey. Therefore, hockey is a gateway to violence, and we need to limit access to hockey and make it less appealing to adolescents." 
"37% of teenagers who use meth go on to commit crimes, as opposed to only 9% who didn't use meth. Therefore, meth is a gateway to criminal behavior, and we need to limit access to meth and make it less appealing to adolescents." 
"37% of patients who get chemotherapy go on to die of cancer, as opposed to only 9% of patients who don't get chemo. Therefore, chemotherapy leads to cancer, and we need to limit access to chemo and make it less appealing to oncologists." 
"37% of men who harass women at work go on to commit at least one sexual assault in the next ten years. This shows that harassment is a precursor to violence, and we need to take steps to reduce it in society."

If you're like me, in the cases of "bad" precursors -- meth and harassment and vaping -- the arguments seem to make sense. But, in the cases of "good" precursors -- hockey and chemotherapy and suicide prevention lines -- the conclusions seem obviously, laughably, wrong.

It's all just confirmation bias at work.


The FiveThirtyEight piece references one of their other posts, titled: "Science Isn’t Broken.  It’s just a hell of a lot harder than we give it credit for."

In that piece, they give several reasons for why so many scientific findings turn out to be false. They mention poor peer review, "p-hacking" results, and failure to self-correct.

Those may all be happening, but, in my opinion, it's much less complicated than that. 

It's just bad logic. It's not as obvious as the bad logic in this case, but, a lot of the time, it's just errors in statistical reasoning that have nothing to do with confidence intervals or methodology or formal statistics. It's a misunderstanding of what a number really means, or a reversal of cause and effect, or an "evidence of absence" fallacy, or ... well, lots of other simple logical errors, like this one.

Regular readers of this blog should not be too surprised by my diagnosis here: most of the papers I've critiqued here suffer from that kind of error, the kind that's obvious only after you catch it. 

FiveThirtyEight writes:

"Science is hard — really f*cking hard."

But, no. It's *thinking straight* that's hard. It's being unbiased that's hard. It must be. There were hundreds of people involved in that vaping study -- scientists, FiveThirtyEight writers, doctors, statisticians, public policy analysts, editors, peer reviewers, anti-smoking groups -- and NONE of them, as far as I know, noticed the real problem: that the argument just doesn't make any sense.

Hat Tip: Tom Tango, who figured it out.


Tuesday, September 01, 2015

Consumer Reports on auto insurance, part IV

(Previous posts: part I; part II; part III)

Consumer Reports' biggest complaint is that insurance companies set premiums by including criteria that, according to CR, don't have anything to do with driving. The one that troubles them the most is credit rating:

"We want you to join forces with us to demand that insurers -- and the regulators charged with watching them on our behalf -- adopt price-setting practices that are more meaningfully tethered to how you drive, not to who they think you are. ..."

"In the states where insurance companies don't use credit information, the price of car insurance is based mainly on how people actually drive and other factors, not some future risk that a credit score 'predicts'. ..."

"... an unfair side effect of allowing credit scores to be used to set premium prices is that it effectively forces customers to dig deeper into their pockets to pay for accidents that haven't happened and may never happen."


Well, you may or may not agree on whether insurers should be allowed to consider credit scores, but, even if CR's conclusion is correct, their argument still doesn't make sense.

First: the whole idea of insurance is EXACTLY what CR complains about:  to "pay for accidents that haven't happened and may never happen." I mean, that's the ENTIRE POINT of how insurance works -- those of us who don't have accidents wind up paying for those of us who do. 

In fact, we all *hope* that we're paying for accidents that don't happen and may never happen! It's better if we don't suffer injuries, and our car stays in good shape, and our premium stays low. 

Well, maybe CR didn't actually mean that literally. What they were *probably* thinking, but were unable or unwilling to articulate explicitly, is that they think credit scores are not actually indicative of car accident risk -- or, at least, not correlated sufficiently to make the pricing differences fair.

But, I'm sure the insurance industry could demonstrate, immediately, that credit history IS a reliable factor in predicting accident risk. If that weren't true, the first insurance company to realize that could steal all the bad-credit customers away by offering them big discounts!

It's possible, I guess, that CR is right and all the insurance companies are wrong. Since it's an empirical question ... well, CR, show us your evidence! Prove to us, using actual data, that bad-credit customers cause no more accidents than their neighbors with excellent credit. If you can't do that, maybe show us that the bad-credit customers aren't as bad as the insurers think they are. Or, at the very, very least, explain how you figured out, from an armchair thought experiment and without any numbers backing you up, that insurance company actuaries are completely wrong, and have been for a long time, despite having the historical records of thousands, or even millions, of their own customers.


Just using common sense, and even without data, it IS reasonable that a better credit rating should predict a lower accident rate, holding everything else equal. You get better credit by paying your bills on time, and not overextending your finances -- both habits that demonstrate a certain level of reliability and conscientiousness. And driving safely requires ... conscientiousness. It's no surprise at all, to me, that credit scores are predictive, to some extent, of future accident claims.

And CR's own evidence supports that! As I mentioned, the article lauds USAA as being the cheapest, by far, of the large insurers they surveyed. 

But USAA markets to only a subset of the population. As Brian B. wrote in the comments to a previous post:

"[USAA insurance is available only to] military and families. So their demographics are biased by a subset of hyper responsible and conscientious professionals and their offspring."

Consumer Reports did, in fact, note that USAA limits its customers selectively. But they didn't bother demanding that USAA raise its rates, or stop unfairly judging military families by "what they think they are" -- more conscientious than average.


Not only does CR not bother mentioning the possibility that drivers with bad credit scores might actually be riskier drivers, they don't even hint that it ever crossed their minds. They seem to stick to the argument that nothing can possibly "predict" future risk except previous driving record. They even put "predict" in scare quotes, as if the idea is so obviously ludicrous that this kind of "prediction" must be a form of quackery.

Except when it's not. In the passage I quoted at the beginning of this post, they squeeze in a nod to "other factors" that might legitimately affect accident risk. What might those factors be? From the article, it seems they have no objection to charging more to teenagers. Or, to men. They never once mention the fact that female drivers pay less than males -- which, you'd think, would be the biggest, most obvious non-driving factor there is.

CR demands that I be judged "not by who the insurance companies think I am!" Unless, of course, I'm young and male, in which case, suddenly it's OK.

Why is it not a scandal that I pay more just for being a man? I may not be the aggressive testosterone-fueled danger CR might "think I am."  If I'm actually as meek as the average female, the insurer is going to "profit from accidents I may never have!"


I suspect they're approaching the question from a certain moral standard, rather than empirical considerations of the actual risk. It just bugs them that the big, bad insurance companies make you pay more just for having worse credit. On the other hand, men are aggressive, angry, muscle-car loving speeders, and it's morally OK for them to get punished. As well as young people, who are careless risk-takers who text when they drive.

A less charitable interpretation is that CR is just jumping to the conclusion that higher prices are unjustified, even when based on statistical risk, when they affect "oppressed" groups, like the poor -- but OK when they favor "oppressor" groups, like men. (Recall that CR also complained about "good student" discounts because they believe those discounts benefit wealthier families.)

A more charitable interpretation might see CR's thinking as working something like this:

-- It's OK to charge more to a certain group where it's obvious that they generally have a higher risk. Like, teenage drivers, who don't have much experience. Common sense suggests, of course they get into more accidents.

-- Higher rates are like a "punishment," and it's OK, and even good, to punish people who do bad things. People who have at-fault accidents did something bad, so their rates SHOULD go up, to teach them a lesson! As CR says,

"In California, the $1,188 higher average premium our single drivers had to pay because of an accident they caused is a memorable warning to drive more carefully. ... In New York, our singles received less of a slap, only $429, on average."

-- It's OK for men to pay more than women because psychologists have long known that men are more aggressive and prone to take more risks.

-- But it's *not* OK to charge more for someone in a high-risk group when (a) there's no proof that they're actually, individually, a high risk, and (b) the group is a group that CR feels has been unfairly disadvantaged already. Just because someone has bad credit doesn't mean they're a worse driver, even if, statistically, that group has more accidents than others. Because, maybe a certain driver has bad credit because he was conned into buying a house he couldn't afford. First, he was victimized by greedy bankers and unscrupulous developers ... now he's getting conned a second time, by being gouged for his auto policy, even though he's as safe as anyone else!

If CR had actually come out and said this explicitly, and argued for it in a fair and unbiased fashion, maybe I would change my mind and come to see that CR is right. But ... well, that doesn't actually seem to be what CR is arguing. They seem to believe that their plausible examples of bad credit customers with low risk are enough to prove that the overall correlation must be zero!

When a certain model of car requires twice as many repairs as normal, CR recommends not buying it. But when a certain subset of drivers causes twice as many accidents as average, CR not only suggests we ignore the fact -- they even refuse to admit that it's true!


Here's a thought experiment to see how serious CR is about considering only driving history.

Suppose an insurer decided to charge more for drivers who don't wear a helmet when riding a bicycle, based on evidence that legitimately shows that people who refuse to wear bicycle helmets are more likely to refuse to wear seatbelts.

But, they note, it's not a perfect correlation. I, for instance, am an exception. I don't wear a bicycle helmet, but I wouldn't dream of driving without a seatbelt (and I might even be scared to drive a car without airbags). 

Would CR come to my defense, demanding that my insurer stop charging me extra? Would they insist that I be judged by how I drive, not by "who they think I am" based on my history of helmetlessness?

I doubt it. I think they'd be happy that I'm being charged more. I think it's about CR judging which groups "deserve" higher or lower premiums, and then rationalizing from there.

(If you want to argue that bicycling is close enough to driving that this analogy doesn't work, just substitute hockey helmets, or life jackets.)


I'm not completely unsympathetic to CR's position. They could easily make a decent case.  They could say, "look, we know that drivers with bad credit cause more accidents, as a group, than drivers with good credit. But it seems fundamentally unfair, in too many individual cases, to judge people by the characteristics of their group, and make them pay twice as much without really knowing whether they fit the group average."

I mean, if they said that about race, or religion, we'd all agree, right? We'd say, yeah, it DOES seem unfair that a Jewish driver like Chaim pays less (or more) than a Muslim driver like Mohammed, just because his group is statistically less (or more) risky. 

But, what if it's actually the case that, statistically, one group causes more accidents than the other? We tell the insurance companies, look, it's not actually because of religion that the groups are different. It must be just something that correlates to religion, perhaps by circumstance or even coincidence.  So, stop being so lazy.  Instead of deciding premiums based on religion, get off your butts and figure out what's actually causing the differences! 

Maybe the higher risk is because of what neighborhoods the groups tend to live in, that some neighborhoods have narrower streets and restricted sightlines that lead to more accidents. Shouldn't the insurance company figure that out, so that if they find that Chaim (or Mohammed) actually lives in a safer neighborhood, they can set his premium by his actual circumstances, instead of his group characteristics, which they will now realize don't apply here?  That way, fewer drivers will be stuck paying unfairly high or low premiums because of ignorance of their actual risk factors.

If that works for religion, or race, it should also work for credit score. Can't the insurance companies do a bit more work, and drill down a bit more, to figure out who has bad credit out of irresponsibility, and who has bad credit because of circumstances out of their control?

Yes! And, I'd bet, the insurance companies are already doing that! Their profits depend on getting risk right, and they can't afford to ignore anything that's relevant, lest other insurers figure it out first, and undercut their rates.

And CR actually almost admits that this is happening. Remember, the article tells us that the insurers aren't actually using the customer's standard credit score -- they're tweaking parts of it to create their own internal metric. CR tells us that in order to complain about it -- it's arbitrary, and secret! -- but it might actually be the way the insurers make premiums more accurate, and therefore fairer. It might be the way insurers make it less likely that a customer will be forced to pay higher premiums for "accidents that may never happen."


But I don't think CR really cares that premiums are mathematically fair. Their notion of fairness seems to be tied to their arbitrary, armchair judgments about who should be paying what. 

I suspect that even if the insurance companies proved that their premiums were absolutely, perfectly correlated with individual driving talent, CR would still object. They don't have a good enough understanding of risk -- or a willingness to figure it out.

A driver's expected accident rate isn't something that's visible and obvious. It's hard for anyone but an actuary to really see that Bob is likely to have an accident every 10 years, while Joe is likely to have an accident every 20. To an outsider, it looks arbitrary, like Bob is getting ripped off, having to pay twice as much as Joe for no reason. 

The thing is: some drivers really ARE double the risk. But, because accidents are so rare, their driving histories look identical, and there doesn't seem to be any reason to choose between them. But, often, there is.

If you do the math, you'll see that a pitcher who retires batters at a 71 percent rate has more than double the "risk" of pitching a perfect game compared to a pitcher with only a 69 percent rate. But their normal, everyday baseball card statistics don't look much different at all -- just a two-percentage-point difference in opposition OBP.
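If you want to check that arithmetic, here it is in Python -- a perfect game means retiring all 27 batters in a row, so the probability is just the out rate raised to the 27th power:

```python
p_71 = 0.71  # retires 71 percent of batters faced
p_69 = 0.69  # retires 69 percent of batters faced

# Probability of a perfect game: retire all 27 batters in a row.
perfect_71 = p_71 ** 27
perfect_69 = p_69 ** 27

print(perfect_71 / perfect_69)  # about 2.16 -- more than double
```

A two-point difference per batter compounds 27 times, which is why the gap in perfect-game odds is so much bigger than the gap on the baseball card.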

I think a big part of the problem is just that luck, risk, and human behavior follow rules that CR isn't willing to accept -- or even try to understand.

Labels: ,

Wednesday, August 26, 2015

Consumer Reports on auto insurance, part III

(This is part III.  Part I is here; part II is here.)


As part of its auto insurance research, Consumer Reports says they "analyzed more than 2 billion car insurance price quotes."  

That number seems reasonable. CR looked at all 33,419 general US ZIP codes. Multiply that by the 20 demographic groups they considered, then by up to 19 different insurance companies per state, and you're up to nearly 13 million. That leaves around 160 different combinations of the other variables (credit rating, accident history, speeding tickets, etc.) for each.

In practical terms, how do you arrange to get two billion quotes? Even if you can get 20 at a time from a website, typing in all that information would take forever. Even at one quote per second, two billion quotes would take 63 years. Or, just one year, if CR had 63 staffers doing the work.
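Here's that back-of-the-envelope arithmetic in Python, using the counts from CR's methodology as described above:

```python
zip_codes    = 33_419  # general US ZIP codes
demographics = 20      # demographic groups CR considered
insurers     = 19      # up to 19 insurers per state

cells = zip_codes * demographics * insurers
print(cells)  # 12,699,220 ZIP/demographic/insurer cells

# Combinations of the other variables (credit, accidents,
# tickets...) per cell, to get to two billion quotes:
print(2_000_000_000 // cells)  # roughly 157

# And typing them all in by hand, at one quote per second:
print(2_000_000_000 / (60 * 60 * 24 * 365))  # about 63 years
```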

Well, it's much easier than that. CR reports that in most states, insurers have to file their mathematical pricing formulas -- their actuarial trade secrets, by the sound of it -- with state regulators. A private company, Quadrant Information Services, has access to all those formulas, somehow, and sells its services to clients like CR. So, two billion quotes was probably just a matter of contracting with Quadrant, who would just query its database and send along the results.

I always wondered how Progressive was able to get competitors' quotes, in real time, to compare alongside their own. Now I know.


CR says those quotes are the real deal, the actual prices policyholders pay:

"Under the state laws that regulate automobile insurance, carriers are required to adhere to the prices generated by their public rate filings. So the premiums we obtained from Quadrant are what each company legally obligates itself to charge consumers."

But ... I'm skeptical. If those quotes are really the actual premiums paid, that would have to contradict some of the issues CR raises elsewhere in the article.

For instance, one thing they're upset about is that some companies practice "price optimization." That's a euphemism for jacking up the price for certain customers -- the ones the company thinks won't complain. For instance, CR says, some insurers might bump your premium if "you're sticking with Verizon FIOS when DirectTV [sic] might be cheaper."

Except ... how can that be possible, if it's all done by formula? When you ask Progressive for quotes, they don't ask you who your TV provider is (or how many iPhones or beers you've purchased, which are other criteria CR mentions). 

Second, CR mentions that each insurer creates their own proprietary credit score, "very different from the FICO score you might be familiar with."  But, again, the formulas can't be taking that into account, can they? CR requested premiums for "poor," "good," and "excellent" credit scores ... but how would they know which was which, without knowing each insurer's proprietary formula?

Third, and also in the context of price optimization, they advise,

"... don't be shy about complaining a little more [to show you're not a pushover for next time]."

But if those formula prices are fixed and non-negotiable, how will complaining help? Unless "number of times having complained" is in the formulas filed with regulators.  


So, it doesn't make sense that the entire pricing system is encapsulated in the online (or Quadrant) pricing formulas.

So, what's actually going on? 

Maybe what's happening is that the companies aren't obligated to charge EXACTLY those formula prices -- maybe they're obligated to charge those prices OR LESS. 

Kind of like those prices you see on the back of your hotel room door -- the maximum that room would ever go for, like (I imagine) Cooperstown on induction weekend. Or, they're like the MSRP on cars, where you can negotiate down from there. Or, maybe they're like car repair estimates, where, if they don't know for sure how much it will cost, they give you the worst-case scenario, because they can lower their estimate much more easily than they can raise it.

If that's what's going on, that would easily and plausibly explain the pricing anomalies that CR found.

Take, for instance, the one that surprised me most -- the finding that some companies discriminate against long-term customers. As CR puts it, "some insurers salute your allegiance with a price hike." 

In the state of Washington, the article says, almost half the insurers surveyed didn't offer any discount at all to customers who had been with them for at least 15 years. That doesn't sound right, but, get this: not only did Geico not offer a discount, they actually charged their loyal customers MORE: almost $700 more, according to the article.

That smells a bit fishy to me.  But here's one that smells ... well, I don't have a good metaphor.  Maybe, like a rotting pile of fish carcasses in your driveway?

"Geico Casualty gave us whiplash with its $3,267 loyalty penalty in New Jersey and its $888 discount just across the state line in New York for longtime customers."

Well, that's just not possible, right? Overcharging loyal New Jersey customers THREE THOUSAND DOLLARS A YEAR? That would triple the typical price, wouldn't it? 

When CR came up with that result, didn't any of their staff think, "WTF, can that really be true?" At Consumer Reports, they must have some weird priors. I know they think insurance companies are out to gouge consumers for everything they can, but ... this is too big to make any sense at all, even under CR's own assumptions. 


I'd wager that those particular automated quotes aren't at all representative of what those particular customers actually pay. 

Insurance companies don't ask their long-term policyholders to go online and renew anonymously. They send renewal quotes directly. Which they have to, if premiums are tailored to characteristics that aren't asked for in online applications -- details from the customer's credit record, like those beer purchases.

What CR found could just be a Geico company practice, of not bothering to produce competitive "formula" quotes for established customers who won't be using them anyway. 

I don't know if that's actually the right answer, but, whatever the true explanation is ... well, I'd bet a lot of money that if the magazine surveyed real long-term Geico policyholders in New Jersey, and asked about their premiums, CR would find that "loyalty penalty" doesn't actually exist.  Or at least, not at anything within an order of magnitude of $3,267.

I might be wrong. Feel free to tell me what I'm missing.

(to be continued)

Labels: ,

Thursday, August 20, 2015

Consumer Reports on auto insurance, part II

(Part I is here.)


Last post, I linked to an article showing auto insurance profit margins were very low, less than 10 percent of premiums. And, I wondered, if that's the case, how is it possible that CR reports such a large difference in pricing between companies?

In its research, Consumer Reports got quotes for thousands of different drivers -- that is, different combinations of age, sex, and ZIP code -- from five different companies. The average premiums worked out to:  

$1,540 Allstate
$1,414 Progressive
$1,177 Geico
$1,147 State Farm
$  817 USAA

How is it possible that Allstate charged about a third more than Geico, but still made less profit (and only 5.3 percent profit, at that)? How does USAA stay in business charging roughly half what Allstate charges, when the others are barely in the black as it is?

For anyone to take those numbers seriously, Consumer Reports has to explain this apparent impossibility. Otherwise, the only reasonable conclusion is that something went badly wrong with CR's analysis or methodology.

Which I think is what happened. I'm going to take a guess at what's actually going on. I don't know for sure, but I'd be willing to bet it's pretty close.


With margins so low, and competition so tight, companies really, really have to get their risk estimates right. If not, they're in trouble.

Let's make some simple assumptions, to keep the analysis clean. First, suppose all customers shop around and always choose the lowest-priced quote. 

Second, suppose that the average teenage driver carries $3,000 in annual risk -- that is, the average teenager will cause $3,000 worth of claims each year. 

Now, we, the blog readers, know the correct number is $3,000 because we just assumed it -- we gave ourselves a God's-eye view. The insurance companies don't have that luxury. They have to estimate it themselves. That's hard work, and they're not going to be perfect, because there's so much randomness involved. (Also, they're all using different datasets.)

Maybe the actuaries at Progressive come up with an estimate of $3,200, while Geico figures it's $2,700. (I'll ignore profit to keep things simple -- if that bothers you, add $100 to every premium and the argument will still work.)

What happens? Every teenager winds up buying insurance from Geico. And Geico loses huge amounts of money: $300 per customer, as the claims start to roll in. Eventually, Geico figures out they got it wrong, and they raise their premiums to $3,000. They're still the cheapest, but now, at least, they're not bleeding cash.

This goes on for a bit, but, of course, Progressive isn't sitting still. They hire some stats guys, do some "Insurance Moneyball," and eventually they make a discovery: good students are better risks than poor students. They find that good students claim $2,500 a year, while the others claim $3,500.

Progressive changes their quotes to correspond to their new knowledge about the "driving talent" of their customers. Instead of charging $3,200 to everyone, they now quote the good students $2,500, and the rest $3,500, to match their risk profiles. That's not because they like the pricing that way, or because they think good students deserve a reward ... it's just what the data shows, the same way it shows that pitchers who strike out a lot of batters have better futures than pitchers who don't.

Now, when the good students shop around, they get quotes of $2,500 (Progressive) and $3,000 (Geico). The rest get quotes of $3,500 (Progressive) and $3,000 (Geico).

So, what happens? The good students go to Progressive, and the rest go to Geico. Progressive makes money, but Geico starts bleeding again: they're charging $3,000 to drivers who cost them $3,500 per year.

Geico quickly figures out that Progressive knows something they don't -- that, somehow, Progressive figured out which teenage customers are lower risk, and stole them all away by undercutting their price. But they don't know how to tell low risk from high risk. They don't know that it has to do with grades. So, Geico can't just follow suit in their own pricing.

So what do they do? They realize they've been "Billy Beaned," and they give up. They raise their price from $3,000 to $3,500. That's the only way they can keep from going bankrupt.

The final result is that, now, when a good student looks for quotes, he sees

$2,500 Progressive
$3,500 Geico

When a bad student looks for quotes, he sees

$3,500 Progressive
$3,500 Geico

Then Consumer Reports comes along, and gets a quote for both. When they average them for their article, they find

$3,000 Progressive
$3,500 Geico

And they triumphantly say, "Look, Progressive is 14 percent cheaper than Geico!"

But it's not ... not really. Because no good student actually pays the $3,500 that Geico quotes them. Since everyone buys from the cheapest provider, Geico's "good student" quote is completely irrelevant. They could quote the good students a price of $10 million, and it wouldn't make any difference at all to what anyone paid.

That's why averaging all the quotes, equally weighted, is not a valid measure of which insurance company gives the best deal.
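Here's that whole toy example in a few lines of Python, under the same assumption that every customer takes the lowest quote (the numbers are the hypothetical ones from above):

```python
# Quotes each company gives each type of teenage driver
# (hypothetical numbers from the example above):
quotes = {
    "good student": {"Progressive": 2500, "Geico": 3500},
    "bad student":  {"Progressive": 3500, "Geico": 3500},
}

# CR-style average: every quote weighted equally, accepted or not.
for company in ("Progressive", "Geico"):
    avg = sum(q[company] for q in quotes.values()) / len(quotes)
    print(company, avg)  # Progressive 3000.0, Geico 3500.0

# What actually happens: everyone buys from the cheapest provider.
for driver, q in quotes.items():
    cheapest = min(q, key=q.get)
    print(driver, "buys from", cheapest, "at", q[cheapest])

# The good students all pay Progressive's $2,500, so Geico's $3,500
# "good student" quote is never accepted -- averaging it in equally
# makes Geico look more expensive than it is to its actual customers.
```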


Want a more obvious example?

Company X:
$1,300      25-year-old male
$1,000      30-year-old female
$1 million  16-year-old male

Company Y:
$1,400      25-year-old male
$1,100      30-year-old female
$4,000      16-year-old male

By CR's measure, which is to take the average, company Y is much, much cheaper than company X: $2,166 to $334,100. But in real life, which company is giving its customers greater value? Company X, obviously. NOBODY is actually accepting the $1,000,000 quote. In calculating your average, you have to give it a weight of zero. 

Once you've discarded the irrelevant outlier, you see that, contrary to what the overall average suggested, company X is cheaper than company Y in every (other) case.
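Running that comparison in Python shows how one never-accepted outlier wrecks the equal-weight average (again, made-up numbers from the example above):

```python
# Quotes by demographic (made-up numbers from the example above):
x = {"25M": 1_300, "30F": 1_000, "16M": 1_000_000}
y = {"25M": 1_400, "30F": 1_100, "16M": 4_000}

# CR-style equal-weight averages:
print(sum(x.values()) / len(x))  # 334100.0
print(sum(y.values()) / len(y))  # about 2166.67

# But weight each quote by whether it would actually be accepted --
# i.e., whether it's the cheaper of the two -- and company X wins
# every demographic except the one nobody buys from it anyway:
print([d for d in x if x[d] < y[d]])  # ['25M', '30F']
```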


Want a non-insurance analogy?

"Darget" competes with Target. Their prices are all triple Target's, which is a big ripoff -- except that every product that starts with "D" sells for $1. By a straight average of all items, Darget is almost 300% as expensive as Target. Still, at any given time, Darget has twenty times the number of customers in the store, all crowding the aisles buying diamonds and diapers and DVD players. 

When evaluating Darget, the "300%" is irrelevant. Since everyone buys deodorant at Darget, but nobody buys anti-perspirant at Darget, it makes no sense to average the two equally when calculating a Darget price index.


And I suspect that kind of thing is exactly what's happening in CR's statistics. Allstate *looks* more expensive than USAA because, for some demographics of customer, they haven't studied who's less risky than whom. They just don't know. And so, to avoid getting bled dry, they just quote very high prices, knowing they probably won't get very many customers from those demographics.

I don't know which demographics, but, just to choose a fake example, let's say, I dunno, 75-year-olds. USAA knows how to price seniors, how to figure out the difference between the competent ones and the ones whose hands shake and who forget where they are. Allstate, however, can't tell them apart. 

So, USAA quotes the best ones $1,000, and the worst ones $5,000. Allstate doesn't know how to tell the difference, so they have to quote all seniors $5,000, even the good ones. 

What Allstate is really doing is telling the low-risk seniors, "we are not equipped to recognize that you're a safe driver; you'll have to look elsewhere."  But, I'm guessing, the quote system just returns an uncompetitively high price instead of just saying, "no thank you."


Under our assumption that customers always comparison shop, it's actually *impossible* to compare prices in a meaningful way. By analogy, consider -- literally -- apples and oranges, at two different supermarkets.

Store A charges $1 an apple,  and $10 an orange.
Store B charges $2 an orange, and  $5 an apple.

Who's cheaper overall? Neither! Everyone buys their apples at Supermarket A, and their oranges at Supermarket B. There's no basis for an apples-to-apples comparison.

We *can* do a comparison if we relax our assumptions. Instead of assuming that everyone comparison shops, let's assume that 10 percent of customers are so naive that they buy all their fruit at a single supermarket. (We'll also assume those naive shoppers eat equal numbers of apples and oranges, and that they're equally likely to shop at either store.)

Overall, combining both the savvy and naive customers, Store A sells 100 apples and 10 oranges, for a total of $200. Store B sells 100 oranges and 10 apples, for a total of $250.

Does that mean Store B is more expensive than Store A? No, you still can't compare, because Store B sells mostly oranges, and Store A sells mostly apples.

To get a meaningful measure, you have to consider only the 10 percent of customers who don't comparison shop. At store A, they spend $11 for one of each fruit. At store B, they spend $7 for one of each fruit.

Now, finally, we see that store B is cheaper than store A!
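The supermarket arithmetic, in Python:

```python
# Price per fruit at each store:
stores = {"A": {"apple": 1, "orange": 10},
          "B": {"apple": 5, "orange": 2}}

# Savvy shoppers buy each fruit wherever it's cheapest, so no
# per-store average describes what they pay. The naive shoppers,
# who buy one of each fruit at a single store, are the only ones
# whose baskets you can compare:
for name, prices in stores.items():
    print(name, prices["apple"] + prices["orange"])  # A 11, B 7
```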


1. To be able to say that, we had to know that the naive customers are evenly split both on the fruit they buy, and the stores they go to. We (and CR) don't know the equivalent statistics in the auto insurance case.

2. If "Store B is cheaper," it's only for those customers who don't shop around. For the 90 percent who always accept only the lowest price, the question still has no answer. CR wants us to be one of those 90 percent, right? So, their comparison is irrelevant if we follow their advice!

3. All CR's analysis tells us is, if we're completely naive customers, getting one quote at random from one insurance company, then blindly accepting it ... well, in that case, we're best off with USAA.

But, wait, even that's not true! It's only true if we're exactly, equally likely to be any one of CR's thousands of representative customers. Which we're not, since they gave ZIP code 10019 in Manhattan (population 42,870) equal weight with ZIP code 99401 in Alaska (population 273).


CR's mistake was to weight the quotes equally, even the absurdly high ones. They should have weighted them by how often they'd actually be accepted. Of course, nobody actually has that information, but you could estimate it, or at least try to. One decent proxy might be: consider only quotes that are within a certain (small) percentage of the cheapest. 

Also, you want to weight by the number of drivers in the particular demographic, not treat each ZIP code equally. You don't want to give a thousand 30-year-old Manhattanites the same total weight as the three 80-year-olds in a rural county of Wyoming.

By adjusting for both those factors, CR would be weighting the products by at least a plausible approximation of how often they're actually bought. 
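Here's a rough sketch of what that adjusted index might look like. To be clear, everything here -- the 10 percent threshold, the cell populations, the quotes -- is a made-up illustration of the idea, not anything from CR's actual data:

```python
def price_index(cells, threshold=0.10):
    """For each company, average only the quotes within `threshold`
    of the cheapest quote in each cell (a proxy for "would actually
    be accepted"), weighted by the cell's driver population."""
    totals, weights = {}, {}
    for population, quotes in cells:
        cheapest = min(quotes.values())
        for company, price in quotes.items():
            if price <= cheapest * (1 + threshold):
                totals[company] = totals.get(company, 0) + price * population
                weights[company] = weights.get(company, 0) + population
    return {c: totals[c] / weights[c] for c in totals}

# Two hypothetical cells: one big urban cell, and one tiny rural
# cell where company X quotes a price nobody would ever accept.
cells = [
    (42_870, {"X": 1_300, "Y": 1_400}),
    (273,    {"X": 5_000, "Y": 1_100}),
]
print(price_index(cells))
# X's $5,000 quote is excluded entirely, and the 273-driver cell
# barely moves Y's index instead of counting as much as Manhattan.
```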


Anyway, because of this problem -- and others that I'll get to in a future post -- most of CR's findings wind up almost meaningless. Which is too bad, because it was a two-year project, and they generated some two billion quotes in the effort. And, they're not done yet -- they promise to continue their analysis in the coming months. Hopefully, that analysis will be more meaningful.

(to be continued)

Labels: ,