Friday, May 15, 2015

Consumer Reports on bicycle helmets

In the June, 2015 issue of their magazine, Consumer Reports (CR) tries to convince me to wear a bicycle helmet. They do not succeed. Nor should they. While it may be true that we should all be wearing helmets, nobody should be persuaded by CR's statistical arguments, which are so silly as to be laughable.

It's actually a pretty big fail on CR's part. Because, I'm sure, helmets *do* save lives, and it should be pretty easy to come up with data that illustrate that competently. Instead ... well, it's almost like they don't take the question seriously enough to care about what the numbers actually mean. 

(The article isn't online, but here's a web page from their site that's similar but less ardent.)


Here's CR's first argument:

"... the answer is a resounding yes, you *should* wear a helmet. Here's why: 87 percent of the bicyclists killed in accidents over the past two decades were not wearing helmets."

Well, that's not enough to prove anything at all. From that factoid alone, there is really no way to tell whether helmets are good or bad.

If CR doesn't see why, I bet they would if I changed the subject on them:

"... the answer is a resounding yes, you *should* drive drunk. Here's why: 87 percent of the drivers killed in accidents over the past two decades were stone cold sober."

That would make it obvious, right?


If the same argument that proves you should wear a helmet also proves you should drive drunk, the argument must be flawed. What's wrong with it? 

It doesn't work without the base rate. 

In order to see if "87 percent" is high or low, you need something to compare it to. If fewer than 87 percent of cyclists go helmetless, then, yes, they're overrepresented in deaths, and you can infer that helmets are a good thing. But if *more* than 87 percent go without a helmet, that might be evidence that helmets are actually dangerous.

To make the drinking-and-driving argument work, you have to show that fewer than 87 percent of drivers are drunk. 

Neither of those is that hard. But you still have to do it!


Why would CR not notice that their argument was wrong in the helmet case, but notice immediately in the drunk-driving case? There are two possibilities:

1. Confirmation bias. The first example fits in with their pre-existing belief that helmets are good; the second example contradicts their pre-existing belief that drunk driving is bad.

2. Gut statistical savvy. The CR writers do feel, intuitively, that base rates matter, and "fill in the blanks" with their common sense understanding that more than 13 percent of cyclists wear helmets, and that more than 13 percent of drivers are sober.

As you can imagine, I think it's almost all number 1. I'm skeptical of number 2. 

In fact, there are many times that the base rates could easily go the "wrong" way, and people don't notice. One of my favorite examples, which I mentioned a few years ago, goes something like this:

"... the answer is a resounding yes, you *should* drive the speed limit. Here's why: 60 percent of fatal accidents over the past two decades involved at least one speeder."

As I see it, this argument actually may support speeding! Suppose half of all drivers speed. Then, there's a 75 percent chance of finding at least one speeder out of two cars. If those 75 percent of cases comprise only 60 percent of the accidents ... then, speeding must be safer than not speeding!
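Here is that arithmetic as a minimal Python sketch. It assumes, as the paragraph above does, two-car accidents, drivers who speed independently of each other, and a made-up 50 percent speeding rate. The function and variable names are mine.

```python
def share_with_speeder(p_speed, cars=2):
    """Chance that at least one of `cars` independent drivers is speeding."""
    return 1 - (1 - p_speed) ** cars

pairs_with_speeder = share_with_speeder(0.5)   # hypothetical: half of drivers speed
accidents_with_speeder = 0.60                  # the "60 percent" from the quote

# Accidents per unit of exposure, for pairs with and without a speeder.
rate_with = accidents_with_speeder / pairs_with_speeder
rate_without = (1 - accidents_with_speeder) / (1 - pairs_with_speeder)
ratio = rate_with / rate_without

print(f"{pairs_with_speeder:.0%} of pairs include a speeder")
print(f"their accident rate is {ratio:.2f} times that of speeder-free pairs")
```

Under those assumptions the ratio comes out below 1, which is the "speeding looks safer" result. Change the 0.5 and the conclusion can flip, which is the whole point about base rates.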

And, of course, there's this classic Dilbert cartoon.


It wouldn't have been hard for CR to fix the argument. They could have just added the base rate:

"... the answer is a resounding yes, you *should* wear a helmet. Here's why: 87 percent of the bicyclists killed in accidents over the past two decades were not wearing helmets, as compared to only 60 percent of cyclists overall."

It does sound less scary than the other version, but at least it means something.

(I made up the "60 percent" number, but anything significantly less than 87 percent would work. I don't know what the true number is; but, since we're talking about the average of the last 20 years, my guess would be that 60 to 70 percent would be about right.)
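To see how much the conclusion depends on the base rate, here is a rough Python sketch. The 87 percent is CR's figure; the base rates are hypothetical, including my made-up 60 percent. It also treats "share of cyclists" as a stand-in for share of riding exposure, which is a simplification.

```python
def implied_relative_risk(death_share, base_rate):
    """Death rate of helmetless riders relative to helmeted riders.

    death_share: fraction of cyclists killed who had no helmet (CR's 0.87)
    base_rate:   fraction of all cyclists who ride without a helmet (assumed)
    """
    helmetless = death_share / base_rate
    helmeted = (1 - death_share) / (1 - base_rate)
    return helmetless / helmeted

# Hypothetical base rates; only the comparison against 87 percent matters.
for base_rate in (0.60, 0.70, 0.87, 0.95):
    rr = implied_relative_risk(0.87, base_rate)
    print(f"base rate {base_rate:.0%}: helmetless riders die at {rr:.2f}x the helmeted rate")
```

A base rate below 87 percent gives a ratio above 1 (helmets look good), exactly 87 percent gives 1, and anything above 87 percent gives a ratio below 1. The "87 percent" alone tells you nothing.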


Even if CR had provided a proper statistical argument that riding with a helmet is safer than riding without ... it still wouldn't be enough to justify their "resounding yes". Because, I doubt that anyone would say,

"... the answer is a resounding yes, you *should* avoid riding a bike. Here's why: 87 percent of commuters killed in accidents over the past two decades were cycling instead of walking -- as compared to only 60 percent of commuters overall."

That would indeed be evidence that biking is riskier than walking -- but I think we'd all agree that it's silly to argue that it's not worth the risk at all. You have to weigh the risks against the benefits.

On that note, here's a second statistical argument CR makes, which is just as irrelevant as the first one:

"... wearing a helmet during sports reduces the risk of traumatic brain injury by almost 70 percent."

(Never mind that the article chooses to lump all sports together; we'll just assume the 70 percent is true for all sports equally.)

So, should that 70 percent statistic alone convince you to wear a helmet? No, of course not. 

Last month -- and this actually happened -- my friend's mother suffered a concussion in the lobby of her apartment building. She was looking sideways towards the mail room, and collided with another resident who didn't see her because he was carrying his three-year-old daughter. My friend's Mom lost her balance, hit her head on an open door, blacked out, and spent the day in hospital.

If she had been wearing a helmet, she'd almost certainly have been fine. In fact, it's probably just as true that

"... wearing a helmet when walking around in public reduces the risk of traumatic brain injury by almost 70 percent."

Does that convince you to wear a helmet every second of your day? Of course not.

The relevant statistic isn't the percentage of injuries prevented. It's how big the benefit is as compared to the cost and inconvenience.

The "70 percent" figure doesn't speak to that at all. 

If I were to find the CR writers, and ask them why they don't wear their helmets while walking on the street, they'd look at me like I'm some kind of idiot -- but if they chose to answer the question, they'd say that it's because walking, unlike cycling, doesn't carry a very high risk of head injury.

And that's the point. Even if a helmet reduced the risk of traumatic brain injury by 80 percent, or 90 percent, or even 100 percent ... we still wouldn't wear one to the mailroom. 

I choose not to wear a helmet for exactly the same reason that you choose not to wear a helmet when walking. That's why the "70 percent" figure, on its own, is almost completely irrelevant. 

Everything I've seen on the web convinces me that the risk is low enough that I'm willing to tolerate it. I'd be happy to be corrected by CR -- but they're not interested in that line of argument. 

I bet that's because the statistics don't sound that ominous. Here's a site that says there's around one cycling death per 10,000,000 miles. If it's four times as high without a helmet -- one death per 2,500,000 miles -- that still doesn't sound that scary. 

That latter figure is about 40 times as high as driving. If I ride 1,000 miles per year without a helmet, the excess danger is equivalent to driving 30,000 miles. I'm willing to accept that level of risk. (Especially because, for me, the risk is lower still because I ride mostly on bike paths, rather than streets, and most cycling deaths result from collisions with cars.)
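For anyone who wants to check the 30,000, here is the calculation as a short Python sketch. It treats the cited one-in-ten-million figure as the with-helmet baseline and the "four times" as a what-if. The driving rate is not a sourced number: I back it out of the "40 times as high" comparison.

```python
from fractions import Fraction

# Cited cycling rate, treated here as the with-helmet baseline.
helmeted_rate = Fraction(1, 10_000_000)    # deaths per mile
helmetless_rate = 4 * helmeted_rate        # the "what if it's four times as high" case
driving_rate = helmetless_rate / 40        # assumption implied by "40 times as high as driving"

def equivalent_driving_miles(bike_miles):
    """Driving miles that carry the same risk as the EXTRA risk of riding helmetless."""
    excess_deaths = (helmetless_rate - helmeted_rate) * bike_miles
    return excess_deaths / driving_rate

print(equivalent_driving_miles(1_000))
```

Exact fractions are used only to keep floating-point noise out of a back-of-the-envelope number.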


You can still argue that three extra deaths per ten million miles is a significant amount of risk, enough to create a moral or legal requirement for helmets. But, do you really want to go there? Because, if your criterion is magnitude of risk ... well, cycling shouldn't be at the top of your list of concerns. 

In the year 2000, according to this website, 412 Americans died after falling off a ladder or scaffolding.

Now, many of those deaths are probably job-related, workers who spend a significant portion of their days on ladders. Suppose that covers 90 percent of the fatalities, so only 10 percent of those deaths were do-it-yourselfers working on their own houses. That's 41 ladder deaths.

People spend a lot more time on bicycles than ladders ... I'd guess, probably by a factor of at least 1000. So 41 ladder deaths is the equivalent of 41,000 cycling deaths. 

But ... there were only 740 deaths from cycling. That makes it around fifty-five times as dangerous to climb a ladder as to ride a bicycle.
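Since two of the inputs are guesses, it's worth seeing how much the answer moves with them. A minimal sketch: the 412 and 740 are the figures quoted above, while the 10 percent do-it-yourself share and the factor of 1,000 are my guesses, left as parameters so you can substitute your own.

```python
def ladder_vs_bike(diy_share=0.10, exposure_ratio=1000,
                   ladder_deaths=412, cycling_deaths=740):
    """How many times riskier, per unit of time, a ladder is than a bicycle.

    diy_share:      guessed fraction of ladder deaths that are not job-related
    exposure_ratio: guessed ratio of time spent cycling to time spent on ladders
    """
    diy_deaths = round(ladder_deaths * diy_share)        # 41, as in the text
    cycling_equivalent = diy_deaths * exposure_ratio     # scaled up to cycling-level exposure
    return cycling_equivalent / cycling_deaths

print(f"{ladder_vs_bike():.1f}")                      # with the guesses above
print(f"{ladder_vs_bike(exposure_ratio=100):.1f}")    # with a much more cautious exposure guess
```

Even if the exposure guess is off by a factor of ten, ladders still come out riskier than bikes.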

And that's just the ladders -- it doesn't include deaths from falling after you've made it onto the roof!

If CR were to acknowledge that risk level is important, they'd have to call for helmets for others at high risk, like homeowners cleaning their gutters, and elderly people with walkers, and people limping with casts, and everyone taking a shower.


Finally, CR gives one last statistic on the promo page:

"Cycling sends more people to the ER for head injuries than any other sport -- twice as many as football, 3 1/2 times as many as soccer."

Well, so what? Isn't that just because there's a lot more cycling going on than football and soccer? 

This document (.pdf) says that in the US in 2014, there were six football fatalities. They all happened in competition, even though there are many more practices than games. All six deaths happened among the 1,200,000 players in high-school level football or beyond -- there were no deaths in youth leagues.

Call it a million players, ten games a year, on the field for 30 minutes of competition per game. Let's double that to include kids' leagues, and double it again to include practices -- both of which didn't have any deaths, but still may have had head injuries. 

That works out to 20 million hours of football.

In the USA, cyclists travel between 6 billion and 21 billion miles per year. Let's be conservative and take the low value. At an average speed of, say, 10 mph, that's 600 million hours of cycling.

So, people cycle 30 times as much as they play football. But, according to CR, cycling produces only twice as many head injuries. That means cycling is only 7 percent as dangerous as football.
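Here is the same estimate as a Python sketch. Every input is one of the rough numbers above (a million players, ten games, half an hour on the field, the two doublings, 6 billion miles at 10 mph), so treat the output as an order-of-magnitude figure and nothing more.

```python
# Football exposure, in player-hours per year (all rough guesses from the text).
players = 1_000_000
games_per_year = 10
hours_on_field_per_game = 0.5
football_hours = players * games_per_year * hours_on_field_per_game
football_hours *= 2    # add kids' leagues
football_hours *= 2    # add practices

# Cycling exposure, using the low end of the mileage range.
cycling_miles = 6_000_000_000
average_mph = 10       # assumed average speed
cycling_hours = cycling_miles / average_mph

exposure_ratio = cycling_hours / football_hours   # how much more cycling than football
injury_ratio = 2                                  # CR: twice as many ER head injuries
relative_danger = injury_ratio / exposure_ratio   # cycling's per-hour rate vs. football's

print(f"football hours: {football_hours:,.0f}")
print(f"cycling hours:  {cycling_hours:,.0f}")
print(f"cycling is {relative_danger:.0%} as dangerous per hour")
```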

That just confirms what already seemed obvious, that cycling is pretty safe compared to football. Did CR misunderstand the statistics so badly that they convinced themselves otherwise? 


(My previous posts on bicycle helmets are here:  one two three four)


Tuesday, May 05, 2015

Consumer Reports on unit pricing

Consumer Reports (CR) wants government to regulate supermarket "unit pricing" labels because they're inconsistent. Last post, I quoted their lead example:

"Picture this: You're at the supermarket trying to find the best deal on AAA batteries for your flashlight, so you check the price labels beneath each pack. Sounds pretty straightforward, right? But how can you tell which pack is cheaper when one is priced per battery and one is priced per 100?"

The point, of course, is that CR must be seriously math-challenged if they don't know how to move a decimal point.

I laughed at their example, and I thought maybe they just screwed up. But ... no, they also chose a silly example as their real-life evidence. 

In the article's photograph, they show two different salad-dressing labels, from the same supermarket. The problem: one is unit-priced per pint, but the other one is per quart. Comparing the two requires dividing or multiplying by two, which (IMO) isn't really that big a deal. But, sure, OK, it would be easier if you didn't have to do that.

Except: the two bottles in CR's example are *the same size*.

One 24-ounce bottle of salad dressing is priced at $3.69; the other 24-ounce bottle is priced at $3.99. And CR is complaining that consumers can't tell which is the better deal, because the breakdowns are in different units!

That doesn't really affect their argument, but it does give the reader the idea that they don't really have a good grip on the problem. Which, I will argue, they don't. Their main point is valid -- that unit pricing is more valuable when the units are the same so it's easier to compare -- but you'd think if they had really thought the issue through, they'd have realized how ridiculous their examples are.


The reason behind unit pricing, of course, is to allow shoppers to compare the prices of different-sized packages -- to get an idea of which is more expensive per unit.

That's most valuable when comparing different products. For the same product in different sizes, you can be pretty confident that the bigger packs are a better deal. It's hard to imagine a supermarket charging $3 for a single pack, but $7 for a double-size pack. That only happens when there's a mistake, or when the small pack goes on sale but the larger one doesn't. 

When it's different products, or different brands ... does unit pricing really mean a whole lot if you don't know how they vary in quality?

At my previous post, a commenter wrote,

"What if some batteries have different life expectancies?"

Ha! Excellent point. 

There's an 18-pack of HappyPower AA batteries for $5.99, and a 13-pack of GleeCell for $4.77. Which is a better deal? I guess if the shelf label tells you that each HappyPower battery works out to 33 cents, but a GleeCell costs 37 cents, that helps you decide, a little bit. If you don't know which one is better, you might just shrug, go for the HappyPower, and save the four cents.

Except ... there's an element of "you get what you pay for."  In general (not always, but generally), higher-priced items are of higher quality. I'd be willing to bet that if you ran a regression on every set of ratings CR has issued over the past decade, 95 percent of them would show a positive correlation between quality and price. There certainly is a high correlation in the battery ratings, at least.  (Subscription required.)

So, at face value, unit price isn't enough. The question you really want to answer is:

If someone chose two random batteries, and random battery A cost 11 percent more in a 13-pack than random battery B in an 18-pack, which is likelier to be the better value?

That's not just a battery question: it's a probability question. Actually, it's even more complicated than that. It's not enough to know whether you're getting more than 11 percent better value, because, to get that 11 percent, you have to buy a larger pack! Which you might not really want to do. 

Pack size matters. I think it's fair to say that, all things being equal, we almost always prefer a smaller pack to a larger pack. That must be true. If it weren't, smaller sizes would never sell, and everything would come in only one large size! 

To make a decision, we wind up doing a little intuitive balancing act involving at least three measures: the quality of the product, the unit price, and the size of the pack. The price is just one piece of the puzzle. 

In that light, I'm surprised that CR isn't calling for regulations to force supermarkets to post CR's ratings on the shelves. After all, you can always calculate unit price on the spot, with the calculator app on your phone. But not everyone has a data plan and a CR subscription.


Here's another, separate issue CR brings up:

"[Among problems we found:] Toilet paper priced by '100 count,' though the 'count' (a euphemism for 'sheets') differed in size and number of plies depending on the brand."

So, CR isn't just complaining that the labels use *inconsistent* units -- they're also complaining that they use the *wrong* units. 

So, what are the right units for toilet paper? Here in Canada, packages give you the total area, in square meters, which corrects for different sizes per sheet. But that won't satisfy CR, because that doesn't take "number of plies" into account. 

What will work, so that you can compare a pack of three-ply smaller sheets with a pack of two-ply larger sheets?

I guess they could do "price per square foot per ply."  That might work if you're only comparing products, and don't need to get your head around what the numbers actually mean.

They could also do "price per pound," on the theory that thicker, higher-quality paper is heavier than the thinner stuff. But that seems weird, that CR would want to tell consumers to comparison shop toilet paper by weight.

In either case, you're trading ease of understanding what the product costs, in exchange for the ability to more easily compare two products. Where is the tradeoff? I don't think CR has thought about it. On the promo page for their article, they do an "apples and oranges" joke, showing apples priced at $1.66 per pound, while oranges are 75 cents each. Presumably, they should both be priced per pound. 

Now, I have no idea how much a navel orange weighs. If they were $1.79 a pound, and I wanted to buy one only if it were less than, say, $1, I'd have to take it over to a scale ... and then, I'd have to calculate the weight times $1.79.

According to CR, that's bad: 

"To find the best value on the fruit below, you'd need a scale -- and a calculator."

Well, isn't that less of a problem than needing a scale and calculator *to find out how much the damn orange actually costs*?

I think CR hasn't really thought this through to figure out what it wants. But that doesn't stop it from demanding government action.


In 2012, according to the article, CR worked with the U.S. Department of Commerce (DOC) to come up with a set of recommended standards for supermarket labels. (Here's the .pdf, from the government site.)

One of the things they want to "correct" is a shelf label for a pack of cookies. The product description on the label says "6 count," meaning six cookies. The document demands that it be in grams.

Which is ridiculous, in this case. When products come in small unit quantities, that's how consumers think of them. I buy Diet Mountain Dew in packs of twelve, not in agglomerations of 4.258 liters. 

It turns out that manufacturers generally figure out what consumers want on labels, even if CR is unable to. 

For instance: over the years, Procter and Gamble has made Liquid Tide more and more concentrated. You need less to do the same job. That means that the actual liquid volume of the detergent is completely meaningless. What matters is the amount of active ingredient -- in other words, how many loads of laundry the bottle can do.

Which is why Tide provides this information, prominently, on the bottle. My bottle here says it does 32 loads. There are other sizes that do 26 loads, or 36, or 110, or ... whatever.

But, under the proposed CR/US Government standards, that would NOT BE ALLOWED. From the report:

"Unit prices must be based on legal measurement units such as those for declaring a packaged quantity or net content as found in the Fair Packaging and Labeling Act (FPLA). Use of unit pricing in terms of 'loads,' 'uses,' and 'servings' are prohibited."

CR, and the DOC, believe that the best way for consumers to intelligently compare the price of a bottle of Tide to some off-brand detergent that's diluted to do one-quarter the loads ... *is by price per volume*. Not only do they think that's the right method ... they want to make any other alternative ILLEGAL.

That's just insane.


I have a suggestion to try to get CR to change its mind. 

A standard size of Tide detergent does 32 loads of laundry. The premium "Tide with Febreze" variation does only 26 loads. But the two bottles are almost exactly the same size. 

I'll send a letter. Hey, Consumer Reports! Procter and Gamble is trying to rip us off! The unit price per volume makes it look like the two detergents are the same price, but they're not! The other one is watered down!

I bet next issue, there'll be an article demanding legislation to prohibit unit pricing by volume, so that manufacturers stop ripping us off.

I'm mostly kidding, of course. For one thing, P&G isn't necessarily trying to rip us off. The Febreze in the expensive version is an additional active ingredient. (And a good one: it works great on my stinky ball hockey chest pad.) Which is "more product" -- 32 regular loads, or 26 enhanced loads? P&G thinks they're about the same, which is why they made the bottle the same size, to signal what it thinks the product is worth.

Or, maybe they diluted both products similarly, and it just works out that the combined volume winds up similar.

Either way, unit pricing by volume doesn't tell you much. Unless you want to think that, coincidentally, a load with Febreze is exactly 32/26 as valuable a "unit" as a load without. But then, what will you do when Tide changes the proportions?

It makes no sense.


Anyway, I do agree with CR that it's better if similar products can be compared with the same unit. And, sometimes, that doesn't happen, and you get pints alongside quarts.

But I disagree with CR that the occasional lapse constitutes a big problem. I disagree that supermarkets don't care what consumers want. I disagree that CR knows better than manufacturers and consumers. And I disagree that the government needs to regulate anything, including font sizes (which, yes, CR complains about too -- "as tiny as 0.22 inch, unreadable for impaired or aging eyes"). 

CR's goal, to improve things for comparison shoppers, is reasonable. I'm just frustrated that they came up with such bad examples and bad answers, and that they want to make it illegal to do it any way other than their silly wrong way. 

If their way is wrong, what way is right?

Well, it's different for everyone. We're diverse, and we all have different needs. 

What should we do, for, say, Advil? Some people always take a 200 mg dose, and will much prefer unit price per tablet. Me, I sometimes take 200 mg, and sometimes 400 mg. For me, "per tablet" isn't that valuable. I'd rather see pricing per unit of active ingredient. In addition, I'm willing to take two tablets for a higher dose, or half a tablet for a lower dose, whichever is cheaper. 

It's an empirical question. It depends on how many people prefer each option. Neither the government nor CR can know without actually going out and surveying. 


Having said all that ... let me explain what *I* would want to see in a unit price label, based on how I think when I shop. You probably think differently, and you may wind up thinking my suggestion is stupid. Which it very well might be. 


A small jar of Frank's Famous Apricot Jam costs 35 cents per ounce. A larger jar costs 25 cents per ounce. Which one do you buy?

It depends on the sizes, right? If the big jar is ten times the size, you're less likely to buy it than if it's only twice the size. Also, it depends on how much you use. You don't want the big jar to go bad before you can finish it. On the other hand, if you use so much jam that the small jar will be gone in three days, you'd definitely buy the bigger one. But what if you've never tried that jam before? Frank's Famous Jam might be a mediocre product, like those Frank's Famous light bulbs you bought in 1985, so you might want to start with the small jar in case you hate it.

You kind of mentally balance the difference in unit price among all those other things.

Now, I'm going to argue: the unit price difference, "35 cents vs. 25 cents" is not the best way to look at it. I think the unit prices seriously underestimate the savings of buying the bigger jar. I think the issue that CR identified, the "sometimes it's hard to compare different units," is tiny compared to the issue that unit prices aren't that valuable in the first place.

Why? Because, as economists are fond of saying, you have to think on the margin, not the average. You have to consider only the *additional* jam in the bigger jar.

Suppose the small jar of jam is 12 ounces, and the large is 24 ounces (twice as big). So, the small jar costs $4.20, and the large costs $6.00.

But consider just the margin, the *additional* jam. If you upgrade to the big jar, you're getting 12 additional ounces, for $1.80 additional cost. The upgrade costs you only 15 cents an ounce. That's 57 percent cheaper! 

If you buy the small jar instead of the big one, you're passing up the chance to get the equivalent of a second jar for less than half price. And that's something you don't necessarily see directly if you just look at the average unit price.
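The marginal calculation is simple enough to write down. A minimal Python sketch, with prices in cents so the arithmetic stays exact; the function name is mine, and the jar sizes and prices are the hypothetical ones from this example.

```python
def marginal_unit_price(small_size, small_price, large_size, large_price):
    """Price per extra unit when upgrading from the small pack to the large one."""
    extra_size = large_size - small_size
    extra_cost = large_price - small_price
    return extra_cost / extra_size

# Frank's Famous Apricot Jam: 12 oz for 420 cents, 24 oz for 600 cents.
small_unit = 420 / 12                                   # average price of the small jar
marginal = marginal_unit_price(12, 420, 24, 600)        # price of the *additional* jam
discount = 1 - marginal / small_unit                    # savings on the extra ounces

print(f"small jar: {small_unit:.0f} cents/oz")
print(f"extra 12 oz: {marginal:.0f} cents/oz ({discount:.0%} off)")
```

If the large jar cost the same 420 cents as the small one, `marginal_unit_price(12, 420, 24, 420)` would return zero: the extra jam is free, even though the average unit price only halves.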


I think that's a much more relevant comparison: 35 cents vs. 15 cents, rather than 35 cents vs. 25 cents. 

Don't believe me? I'll change the example. Now, the small jar is still 35 cents an ounce, but the large jar is 17.5 cents an ounce. Now, which do you buy?

You always buy the large jar.  It's the same price as the small jar! At those unit costs, both jars cost $4.20. 

That's obvious when you see that when you upgrade to the bigger jar, you're getting 12 ounces of marginal jam for $0.00 of marginal cost.  It's not as obvious when you see your unit cost drop from 35 cents to 17.5 cents.


So, that's something I'd like to see on unit labels, so I don't have to calculate it myself: the marginal cost for the next biggest size. Something like this:

"If you buy the next largest size of this same brand of Raisin Bran, you will get 40% more product for only 20% more price. Since 20/40 equals 0.5, it's like you're getting the additional product at 50 percent off."

Or, in language we're already familiar with from store sales flyers:

"Buy 20 oz. of Raisin Bran at regular price, get your next 8 oz. at 50% off."


Unit price is a "rate" statistic. Sometimes, you'd rather have a bulk measure -- a total cost. If I want one orange, I might not care that they're $3 a pound -- I just want to know that particular single orange comes out to $1.06.

In the case of the jam, I might think, well, sure, half price is a good deal, but I'm running out of space in the fridge, and I might get sick of apricot before I've finished it all. What does it cost to just say "screw it" and just go for the smaller one?

In other words: how much more am I paying for the jam in the small jar, compared to what I'd pay if they gave it to me at the same unit price as the big jar?

With the small jar, I'm paying 35 cents an ounce. With the big jar, I'd be paying 25 cents an ounce. So, I'm "wasting" ten cents an ounce by buying the smaller 12 ounce jar. That's a cost of $1.20 for the privilege of not having to upgrade to the bigger one.

That flat cost is something that works for me, that I often calculate while shopping. I can easily decide if it's worth $1.20 to me to not have to take home twice as much jam. 
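That flat cost is also easy to compute. A small sketch, again in cents and again with the made-up jam prices; the function name is mine.

```python
def small_pack_premium(small_size, small_unit_price, large_unit_price):
    """Extra amount paid for the small pack versus the same quantity at the large pack's unit price."""
    return small_size * (small_unit_price - large_unit_price)

premium = small_pack_premium(12, 35, 25)    # 12 oz at 35 vs. 25 cents/oz

# Cross-check against the shelf prices: the small jar is 420 cents,
# and 12 oz at the big jar's rate would be half of 600 cents.
cross_check = 420 - 600 * 12 / 24

print(f"premium for the small jar: ${premium / 100:.2f}")
```

Both routes give the same number, which is reassuring: the premium is just the small jar's price minus what those ounces would cost inside the big jar.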


So here's an example of the kind of unit price label I'd like to see:

-- This size: 12 ounces at $0.35 per ounce

-- Next larger size: 12 extra ounces at $0.15 per extra ounce (57% savings)

-- This size costs $1.20 more than the equivalent quantity purchased in the next larger size.

I'd love to see some supermarket try this before CR makes it illegal.


Thursday, April 30, 2015

Math class is tough

"Picture this: You're at the supermarket trying to find the best deal on AAA batteries for your flashlight, so you check the price labels beneath each pack. Sounds pretty straightforward, right? But how can you tell which pack is cheaper when one is priced per battery and one is priced per 100?"

-- Consumer Reports, June 2015, calling for government regulation of "unit pricing" labels


Saturday, April 18, 2015

Making the NHL draft lottery more fun

The NHL draft lottery is to be held today in Toronto. Indeed, it might have already happened; I can't find any reference to what time the lottery actually takes place. The NHL will be announcing the results just before 8:00 pm tonight.

(Post-lottery update: someone on Twitter said the lottery took place immediately before the announcement.  Which makes sense; the shorter wait time minimizes the chance of a leak.  Also: the Oilers won, but you probably know that by now.)

The way it works is actually kind of boring, and I was thinking about ways to make it more interesting ... so you could televise it, and it would hold viewers' interest.

(Post-lottery update: OK, they did televise a medium-sized big deal for the reveal.  But, since the winner was already known, it was more frustrating than suspenseful. This post is about a way to make it legitimately exciting while it's still in progress, before the final result is known.)

Let me start by describing how the lottery works now. You can skip this part if you're already familiar with it.


The lottery is for the first draft choice only. The 14 teams that missed the playoffs are eligible. Whoever wins jumps to the number one pick, and the remaining 13 teams keep their relative ordering. 

The lower a team's position in the standings, the higher the probability it wins the lottery. The NHL set the probabilities like this:

 1. Buffalo Sabres       20.0%
 2. Arizona Coyotes      13.5%
 3. Edmonton Oilers      11.5%
 4. Toronto Maple Leafs   9.5%
 5. Carolina Hurricanes   8.5%
 6. New Jersey Devils     7.5%
 7. Philadelphia Flyers   6.5%
 8. Columbus Blue Jackets 6.0%
 9. San Jose Sharks       5.0%
10. Colorado Avalanche    3.5%
11. Florida Panthers      3.0%
12. Dallas Stars          2.5%
13. Los Angeles Kings     2.0%
14. Boston Bruins         1.0%

It's kind of interesting how they manage to get those probabilities in practice.

The NHL will put fourteen balls in a hopper, numbered 1 to 14. It will then randomly draw four of those balls.

There are exactly 1,001 combinations of 4 balls out of 14 -- that is, 1,001 distinct "lottery tickets". The 1001st ticket -- the combination "11, 12, 13, 14"  -- is designated a "draw again." The other 1,000 tickets are assigned to the teams in numbers corresponding to their probabilities. So, the Sabres get 200 tickets, the Hurricanes get 85 tickets, and so on. (The tickets are assigned randomly -- here's the NHL's .pdf file listing all 1,001 combinations, and which go to which team.)
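You can check the 1,001 and sketch the ticket assignment in a few lines of Python. The ticket counts are the percentages above times ten. The shuffle is only an illustration of "assigned randomly"; the NHL's actual assignment is whatever is in their .pdf, which this does not reproduce.

```python
import random
from itertools import combinations
from math import comb

# Tickets per team out of 1,000 (percentage x 10).
TICKETS = {
    "Buffalo Sabres": 200, "Arizona Coyotes": 135, "Edmonton Oilers": 115,
    "Toronto Maple Leafs": 95, "Carolina Hurricanes": 85, "New Jersey Devils": 75,
    "Philadelphia Flyers": 65, "Columbus Blue Jackets": 60, "San Jose Sharks": 50,
    "Colorado Avalanche": 35, "Florida Panthers": 30, "Dallas Stars": 25,
    "Los Angeles Kings": 20, "Boston Bruins": 10,
}

all_combos = list(combinations(range(1, 15), 4))   # every set of 4 balls out of 14
assert len(all_combos) == comb(14, 4) == 1001

REDRAW = (11, 12, 13, 14)                          # the designated "draw again" ticket
pool = [c for c in all_combos if c != REDRAW]
random.shuffle(pool)                               # illustrative random assignment

allocation, start = {}, 0
for team, count in TICKETS.items():
    allocation[team] = pool[start:start + count]
    start += count

print(len(pool), "tickets assigned;", len(allocation["Buffalo Sabres"]), "to Buffalo")
```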

It's just coincidence that the number of teams is the same as the number of balls. The choice of 14 balls is to get a number of tickets that's really close to a round number, to make the tickets divide easily.


So this works like a standard lottery, like Powerball or whatever: there's just one drawing, and you have the winner. That's kind of boring ... but it works for the NHL, which isn't interested in televising the lottery live.

But it occurred to me ... if you DID want to make it interesting, how would you do it?

Well, I figured ... you could structure it like a race. Put a bunch of balls in the machine, each with a single team's logo. Draw one ball, and that team gets a point. Put the ball back, and repeat. The first team to get to 10 points, wins.

You can't have the same number of balls for every team, because you want them all to have different odds of winning. So you need fourteen different quantities. The smallest sum of fourteen different positive integers is 105 (1 + 2 + 3 ... + 14). That particular case won't work: you want the Bruins to still have a 1 percent chance of winning, but, with only 1 ball to Buffalo's 14, it'll be much, much less than that.

What combinations work? I experimented a bit, and wrote a simulation, and I came up with a set of 746 balls that seems to give the desired result. The fourteen quantities go from 70 balls (Buffalo), down to 39 (Boston). 

In 500,000 runs of my simulation, Buffalo won 20.4 percent of the time, and Boston 1.1 percent. Here are the full numbers. First, the number of balls; second, the percentage of lotteries won; and, third, the percentage the NHL wants.

                                    result  target
 1. Buffalo Sabres        70 balls   20.4%   20.0%
 2. Arizona Coyotes       63 balls   13.5%   13.5%
 3. Edmonton Oilers       61 balls   11.5%   11.5%
 4. Toronto Maple Leafs   58 balls    9.3%    9.5%
 5. Carolina Hurricanes   57 balls    8.5%    8.5%
 6. New Jersey Devils     56 balls    7.3%    7.5%
 7. Philadelphia Flyers   54 balls    6.5%    6.5%
 8. Columbus Blue Jackets 53 balls    5.9%    6.0%
 9. San Jose Sharks       51 balls    4.7%    5.0%
10. Colorado Avalanche    49 balls    3.7%    3.5%
11. Florida Panthers      47 balls    3.2%    3.0%
12. Dallas Stars          45 balls    2.5%    2.5%
13. Los Angeles Kings     43 balls    1.9%    2.0%
14. Boston Bruins         39 balls    1.1%    1.0%

The probabilities are all pretty close. They're not perfect, but they're probably good enough. In other words, the NHL could probably live with awarding the Bruins a 1.1% chance instead of a 1% chance.
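
For anyone who wants to experiment with other ball counts, here is a minimal Python sketch of this kind of race simulation. It is not the code behind the numbers above; the function names, the seed and the run count are illustrative assumptions. It should land near the table's percentages, but not on them exactly, because it uses far fewer runs.

```python
import random
from collections import Counter

# Ball counts from the table above (Buffalo first, Boston last).
BALLS = [70, 63, 61, 58, 57, 56, 54, 53, 51, 49, 47, 45, 43, 39]
TARGET = 10                                  # first team to 10 points wins
MAX_DRAWS = len(BALLS) * (TARGET - 1) + 1    # worst case: 14-way tie at 9, then one more

def run_lottery(rng):
    """Simulate one race; return (winning team index, number of balls drawn)."""
    points = [0] * len(BALLS)
    # Balls are replaced after each draw, so every draw is an
    # independent pick weighted by the number of balls per team.
    draws = rng.choices(range(len(BALLS)), weights=BALLS, k=MAX_DRAWS)
    for n, team in enumerate(draws, start=1):
        points[team] += 1
        if points[team] == TARGET:
            return team, n
    raise AssertionError("unreachable: a winner must emerge within MAX_DRAWS")

def simulate(runs=5000, seed=1):
    """Return (list of win percentages by team, average draws per lottery)."""
    rng = random.Random(seed)
    wins = Counter()
    total_draws = 0
    for _ in range(runs):
        team, n = run_lottery(rng)
        wins[team] += 1
        total_draws += n
    win_pct = [100.0 * wins[t] / runs for t in range(len(BALLS))]
    return win_pct, total_draws / runs

win_pct, avg_draws = simulate()
for balls, pct in zip(BALLS, win_pct):
    print(f"{balls} balls  {pct:5.1f}%")
print(f"average draws per lottery: {avg_draws:.1f}")
```

To test a different set of quantities, change BALLS and raise the run count until the percentages stop moving.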

If you did the lottery this way, would it be more fun? I think it would. You'd be watching a race to 10 points. It would have a plot, and you could see who's winning, and the odds would change every ball. 

Every team would have something to cheer about, because they'd probably all have at least a few balls drawn. The ball ratio between first and last is only around 1.8 (70/39), so for every 9 points the Sabres got, the Bruins would get 5. 

The average number of simulated balls it took to find a winner was 72.4. If you draw one ball every 30 seconds ... along with filler and commercials, that's a one-hour show. Of course, it could go long, or short. The minimum is 10; the maximum is 127 (after a fourteen-way tie for 9). But I suspect the distribution is tight enough around 72 that it would be reasonably manageable.

Another thing too, is ... every team would have a reasonable chance of being close, and an underdog will almost always challenge. Here's how often each team would finish with 7 or more points (including times they won):

 1. Buffalo Sabres         53.3 percent
 2. Arizona Coyotes        43.4
 3. Edmonton Oilers        40.2
 4. Toronto Maple Leafs    35.7
 5. Carolina Hurricanes    33.9
 6. New Jersey Devils      32.4
 7. Philadelphia Flyers    29.5 
 8. Columbus Blue Jackets  27.7  
 9. San Jose Sharks        25.2 
10. Colorado Avalanche     22.1 
11. Florida Panthers       19.5 
12. Dallas Stars           16.7 
13. Los Angeles Kings      14.5 
14. Boston Bruins          10.6 

And here's the average score, by final position after it's over:

10.0  Winner
 8.0  Second
 7.3  Third
 6.6   ...

That's actually closer than it looks, because you don't know which team the bottom one will be. Also, just before the winning ball is drawn the first-place team would have been at 9.0, which means, at that time, the second-place team would, on average, have been only one point back. 


The problem is ... it still takes 746 balls to make the odds work out that closely. That's a lot of balls to have to put in the machine. Of course, that's just what I found by trial and error; you might be able to do better. Or, you could use a smaller number of balls, and accept a probability distribution that's different from the NHL's, but still reasonable.

Or, you could add a twist. You could give every ball a different number of points. Maybe the Sabres' balls are worth from 5 points down to 1, and the Bruins' balls are only 3 down to 1, and the first team to 20 points wins. I don't think it would be that hard to find some combination that works. 

That's kind of a math nerd thing. I'd bet you can find a system that comes as close as I got with less than 100 balls, and I bet you'd be able to get to it pretty quick by trial and error. 

At least, the NHL could, if it wanted to.


Saturday, April 11, 2015

MLB forecasters are more conservative this year

Every April, sabermetricians, bookies and sportswriters issue their won-lost predictions for each of the 30 teams in MLB. And, every year, some of them are overconfident, and essentially wind up trying to predict coin flips.

As I've written before, there's a mathematical "speed of light" limit to how much of a team's record can be predicted. That's the part that's determined by player talent. Any spread that's wider than the spread of talent must be just random luck, and, by definition, unpredictable.

Based on the historical record, we can estimate that the spread of team talent in MLB is somewhere around an SD of 9 games. Not all of that talent can be predicted beforehand, because some of it hasn't happened yet -- trades, injuries, players unexpectedly blossoming or declining, and so on. My estimate is that if you were the most omniscient baseball insider in the universe, maybe you could predict an SD of 8 games.

Last year, many pundits exceeded that "speed of light" limit anyway. I thought that there would be fewer this year, that the 2015 forecasts would project a narrower range of team outcomes. That's because last year's standings were particularly tight, and there's been talk about how we may be entering a new era of parity.

And that did happen, to some extent.

I'll show you the 2015s and the 2014s together for easy comparison. A blank space is a forecast I don't have for that year. (For 2015, I think Jonah Keri and the ESPN website predicted only the order of finish, and not the actual W-L record.) 

Like last year, I've included the "speed of light" limits, the naive "last year regressed to the mean" forecast, and the "every team will finish .500" forecast. Links are for 2015 ... for 2014, see last year's post.

 2015  2014
 9.32 11.50  Sports Illustrated
 9.07  8.76  Jeff Passan (Yahoo!)
 9.00  9.00  Speed of Light (theoretical est.)
 8.79        Bruce Bukiet
       8.53  Jonah Keri (Grantland)
       8.51  Sports Illustrated (runs/10)
 8.00  8.00  Speed of Light (practical est.)
       7.79  ESPN website
 7.92  7.78  Mike Oz (Yahoo!)
 6.99        Chris Cwik (Yahoo!)
 6.38  6.38  Naive previous year method (est.)
 6.34  9.23  Mark Townsend (Yahoo!)
 6.10  6.90  Tim Brown (Yahoo!)
 6.03  7.16  Vegas Betting Line (Bovada)
 5.46  5.55  Fangraphs 
 4.93  8.72  ESPN The Magazine 
 0.00  0.00  Predict 81-81 for all teams

Of those who predicted both seasons, only two out of eight increased the spread of their forecasts from last year. And those two, Jeff Passan and Mike Oz, increased only a little bit. 

On the other hand, some of the other forecasters see *dramatically* more equality in team talent. Yahoo's Mark Townsend dropped from 9.23 to a very reasonable 6.34. And ESPN dropped from one of the highest spreads, to the lowest -- by far -- at 4.93. 

Which is strange, because ESPN's words are so much more optimistic than their numbers. About the Washington Nationals, they write,

"It's the Nationals' time."
"They're here to stay."
"Anything less than an NL East crown will qualify as a big disappointment."

But their W-L prediction for the Nationals, whom they projected higher than any other team? A modest 91-71, only ten wins better than an 81-81 record.


In any case ... I wonder how much of the differences between 2014 and 2015 are due to (a) new methodologies, (b) the same methodologies reflecting a real change in team parity, and (c) just a gut reaction to the 2014 standings having been tighter than normal.

My guess is that it's mostly (b). I'd bet on Bovada's forecasts being the most accurate of the group. If that's true, then maybe teams really *are* tighter in talent than last year, by around 1 win of SD. Which is, roughly, in line with the rest of the forecasts.

I guess we'll know more next year. 


Wednesday, April 08, 2015

How much has parity increased in MLB?

The MLB standings were tighter than usual in 2014. No team won, or lost, more than 98 games. That's only happened a couple of times in baseball history.

You can measure the spread in the standings by calculating the standard deviation (SD) of team wins. Normally, it's around 11. Two years ago, it was 12.0. Last year, it was down substantially, to 9.4.

Historically, 9.4 is not an unprecedented low. In 1984, the SD was 9.0; that's the most parity of any season since the sixties. More recently, the 2007 season came in at 9.3, with a team range of 96 wins to 96 losses.

But this time, people are noticing. A couple of weeks ago, Ben Lindbergh showed that this year's preseason forecasts have been more conservative than usual, suggesting that the pundits think last year's compressed standings reflect increased parity of talent. They've also noted another anomaly: in 2014, payroll seemed to be less important than usual in predicting team success. These days, the rich teams don't seem to be spending as much, and the others seem to be spending a little more.

So, have we actually entered into an era of higher parity, where we should learn to expect tighter playoff races, more competitive balance, and fewer 100-win and 100-loss teams? 

My best guess is ... maybe just a little bit. I don't think the instant, single-season drop from 12 games to 9.4 games could possibly be the result of real changes. I think it was mostly luck. 


Here's the usual statistical theory. You can break down the observed spread in the standings into talent and luck, like this: 

SD(observed)^2 = SD(talent)^2 + SD(luck)^2

Statistical theory tells us that SD(luck) equals 6.4 games, for a 162-game season. With SD(observed) equal to 12.0 for 2013, and 9.4 for 2014, we can solve the equation twice, and get

2013: 10.2 games estimated SD of talent 
2014:  7.0 games estimated SD of talent

That's a huge one-season drop, from 10.2 to 7.0 ... too big, I think, to really happen in a single offseason. 
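
The arithmetic behind those two estimates can be reproduced in a few lines of Python. This is only a sketch: the function name is an arbitrary choice, and luck is taken as pure binomial variation for a .500 team. With the rounded inputs quoted here, 2013 comes out at about 10.2 and 2014 at about 6.9, a hair under the 7.0 above.

```python
import math

GAMES = 162

# Binomial luck for a .500 team: sqrt(n * p * (1 - p)) wins.
sd_luck = math.sqrt(GAMES * 0.5 * 0.5)

def talent_sd(sd_observed, sd_luck=sd_luck):
    """Solve SD(observed)^2 = SD(talent)^2 + SD(luck)^2 for SD(talent)."""
    return math.sqrt(sd_observed ** 2 - sd_luck ** 2)

print("SD(luck):", round(sd_luck, 1))
for year, sd_obs in ((2013, 12.0), (2014, 9.4)):
    print(year, round(talent_sd(sd_obs), 1))
```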

Being generous, suppose that between the 2013 and 2014 seasons, teams changed a third of their personnel. That's a very large amount of turnover. Would even that be enough to cause the drop?

Nope. At least, not if that one-third of redistributed "talent wealth" was spread equally among teams. In that case, the SD of the "new" one-third of talent would be zero. But the SD of the remaining two-thirds of team talent would be 8.3 (the 2013 figure of 10.2, multiplied by the square root of 2/3).

That 8.3 is still higher than our 7.0 estimate for 2014! So, for the SD of talent to drop that much, we'd need the one-third of talent to be redistributed, not equally, but preferentially to the bad teams. 

Is that plausible? To how large an extent would that need to happen?

We have a situation like this: 

2014 talent = original two thirds of 2013 talent 
            + new one third of 2013 talent 
            + redistribution favoring the worse teams

Statistical theory says the relationship between the SDs is this:

SD(2014 talent)^2 = 

SD(2013 talent teams kept)^2 +
SD(2013 talent teams gained)^2 + 
SD(2013 talent teams kept) * SD(2013 talent teams gained) * correlation between kept and gained * 2

It's the same equation as before, but with an extra term (the last line). That term shows up because we're assuming a non-zero correlation between talent kept and talent gained -- that the more "talent kept," the less your "talent gained". When we just did "talent" and "luck", we were assuming there was no correlation, so we didn't need that extra part. (We could have left it in, but it would have worked out to zero anyway.)

The equation is easy to fill in. We saw that SD(2014 talent) was estimated at 7.0. We saw that SD(talent teams kept) was 8.3. And we can estimate that SD(talent teams gained) is 10.2 times the square root of 1/3, which is 5.9.

If you solve, you get 

Correlation between kept and gained = -0.57

That's a very strong correlation we need, in order for this to work out. The -0.57 means that, on average, if a team's "kept" players were, say, 5th best in MLB (that is, 10 teams above average), its "gained" players must have been 9th worst in MLB (5.7 teams below average). 

That's not just the good teams getting worse by signing players that aren't as good as the above-average players they lost -- it's the good teams signing players that are legitimately mediocre. And, vice-versa. At -0.57, the worst teams in baseball would have had to replace one-third of their lineup, and those new players would have had to be collectively as good as those typically found on a 90-win team.

Did that actually happen? It's possible ... but it's something you would easily have been able to see at the time. I think we can say that nobody noticed -- going into last season, it didn't seem like any of the mainstream projections were more conservative than normal. (Well, with the exception of FanGraphs. Maybe they saw some of this actually happen? Or maybe they just used a conservative methodology.)

One thing that WAS noticed before 2014 is that the 51-111 Houston Astros had improved substantially. So that's at least something.

And, for what it's worth: the probability of randomly getting a correlation coefficient as extreme as 0.57, in either direction, is 0.001 -- that is, one in a thousand. On that basis, I think we can reject the hypothesis that team talent grew tighter just randomly.

(Technical note: all these calculations have assumed that every team lost *exactly* one-third of its talent, and that those one-thirds were kept whole and distributed to other teams. If you were to use more realistic assumptions, the chances would improve a little bit. I'm not going to bother, because, as we'll see, there are other possibilities that are more likely anyway.)
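
For reference, the kept/gained correlation can be reproduced by rearranging the three-term equation. The Python sketch below is an illustration only. It uses the talent estimates derived earlier (10.2 for 2013, 7.0 for 2014) and the same simplifying assumption that every team turns over exactly one-third of its talent; the function name is made up for the example. With those rounded inputs it gives about -0.56, and slightly less rounded inputs push it toward -0.57.

```python
import math

def implied_correlation(sd_total, sd_a, sd_b):
    """Solve sd_total^2 = sd_a^2 + sd_b^2 + 2*r*sd_a*sd_b for r."""
    return (sd_total ** 2 - sd_a ** 2 - sd_b ** 2) / (2 * sd_a * sd_b)

sd_talent_2013 = 10.2   # estimated SD of talent, 2013
sd_talent_2014 = 7.0    # estimated SD of talent, 2014

# Assumption from the text: two-thirds of talent kept, one-third gained.
sd_kept = sd_talent_2013 * math.sqrt(2 / 3)     # about 8.3
sd_gained = sd_talent_2013 * math.sqrt(1 / 3)   # about 5.9

r = implied_correlation(sd_talent_2014, sd_kept, sd_gained)
print(round(sd_kept, 1), round(sd_gained, 1), round(r, 2))
```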


So, if it isn't the case that the spread in talent narrowed ... what else could it be? 

Here's one possibility: instead of SD(talent) dropping in 2014, SD(luck) dropped. We were holding binomial luck constant at 6.4 games, but that's just the average. It varies randomly from season to season, perhaps substantially. 

It's even possible -- though only infinitesimally likely -- that, in 2014, every team played exactly to its talent, and SD(luck) was actually zero!

Except that ... again, that wouldn't have been enough. Even with zero luck, and talent 10.2, we would have observed an SD of 10.2. But we didn't. We observed only 9.4. 

So, maybe we have another "poor get richer" story, where, in 2014, the bad teams happened to have good luck, and the good teams happened to have bad luck.

We can check that, in part, by looking at the 2014 Pythagorean projections. Did the bad teams beat Pythagoras more than the good teams did?

Not really. Well, there is one obvious case -- Oakland. The 2014 A's were the best team in MLB in run differential, but won only 88 games instead of 99 because of 11 games of bad "cluster luck". 

Eleven games is unusually large. But, the rest of the top half of MLB had a combined eighteen games of *good* luck, which seems like it would roughly cancel things out.

Still ... Pythagoras is only a part of overall luck, so there's still lots of opportunity for the "good teams having bad luck" to have manifested itself. 


Let's do what we did before, and see what the correlation would have to be between talent and luck, in order to get the SD down to 9.4. The relationship, again, is:

SD(2014 standings)^2 = 

SD(2014 talent)^2 + 
SD(2014 luck)^2   + 
SD(2014 talent) * SD(2014 luck) * correlation between 2014 talent and 2014 luck * 2 

Leaving SD(2014 talent) at the 2013 estimate of 10.2, and leaving SD(2014 luck) at 6.4, we get

Correlation between 2014 talent and luck = -0.39

The chance of a correlation that big (either direction) happening just by random luck is 0.033 -- about 1 in 30. That seems like a big enough chance that it's plausible that's what actually happened. 

Sure, 1 in 30 seems low, and is statistically significantly unlikely in the classical "1 in 20" sense. But that doesn't matter. We're being Bayesian here. We know something unlikely happened, and so the reason it happened is probably also something unlikely. And the 1/30 estimate for "bad teams randomly got lucky" is a lot more plausible than the 1/1000 estimate for "bad teams randomly got good talent."  It might also be more plausible than "bad teams deliberately got good talent," considering that nobody noticed any unusual changes in talent at the time.


Having got this far, I have to backtrack and point out that these odds and correlations are actually too extreme. We've been assuming that all the unusualness happened after 2013 -- either in the offseason, or in the 2014 season. But 2013 might have also been lucky/unlucky itself, in the opposite direction.

Actually, it probably was. As I said, the historical SD of actual team wins is around 11. In the 2013 season, it was 12. We would have done better by comparing the "too equal" 2014 to the historical norm, rather than to a "too unequal" 2013. 

For instance, we toyed with the idea that there was less luck than normal in 2014. Maybe there was also more luck than normal in 2013. 

Instead of 6.4 both years, what if SD(luck) had actually been 8.4 in 2013, and 4.4 in 2014?

In that case, our estimates would be

SD(2013 talent) = 8.6
SD(2014 talent) = 8.3

That would be just a change of 0.3 wins in competitive balance, much more plausible than our previous estimate of a 3.2 win swing (10.2 to 7.0).


Still: no matter which of all these assumptions and calculations you decide you like, it seems like most of the difference must be luck. It might be luck in terms of the bad teams winding up with the good players for random reasons, or it might be that 2013 had the good teams having good luck, or it might be that 2014 had the good teams having bad luck.

Whichever kind of luck it is, you should expect a bounceback to historical norms -- a regression to the mean -- in 2015. 

The only way you can argue for 2015 being like 2014, is if you think the entire move from historical norms was due to changes in the distribution of talent between teams, due to economic forces rather than temporary random ones. 

But, again, if that's the case, show us! Personnel changes between 2013 and 2014 are public information. If they did favor the bad teams, show us the evidence, with estimates. I mean that seriously ... I haven't checked at all, and it's possible that it's obvious, in retrospect, that something real was going on.


Here's one piece of evidence that might be relevant -- betting odds. In 2014, the SD of Bovada's "over/under" team predictions was 7.16. This year, it's only 6.03.*

(* Bovada's talent spread is tighter than what we expect the true distribution to be, because some of team talent is as yet unpredictable -- injuries, trades, prospects, etc.)

Some of that might be a reaction to bettor expectations, but probably not much. I'm comfortable assuming that Bovada thinks the talent distribution has compressed by around one win.*

Maybe, then, we should expect a talent SD of 8.0 wins, rather than the historical norm of 9.0. That's more reasonable than expecting the 2013 value of 10.2, or the 2014 value of 7.0. 

If the SD of talent is 8, and the SD of luck is 6.4 as usual, that means we should expect the SD of this year's standings to be 10.2. That seems reasonable. 


Anyway, this is all kind of confusing. Let me try to summarize everything more understandably.


The distribution of team wins was much tighter in 2014 than in 2013. As I see it, there are six different factors that could have contributed to the increase in standings parity:

-- 1. Player movement from 2013 to 2014 brought better players to the worse teams (to a larger extent than normal), due to changes in the economics of MLB.

-- 2. Player movement from 2013 to 2014 brought better players to the worse teams (to a larger extent than normal), due to "random" reasons -- for instance, prospects maturing and improving faster for the worse teams.

-- 3. There was more randomness than usual in 2013, which caused us to overestimate disparities in team talent.

-- 4. There was less randomness than usual in 2014, which caused us to underestimate disparities in team talent.

-- 5. Randomness in 2013 favored the better teams, which caused us to overestimate disparities in team talent.

-- 6. Randomness in 2014 favored the worse teams, which caused us to underestimate disparities in team talent.

Of these six possibilities, only #1 would suggest that the increase in parity is real, and should be expected to repeat in 2015. 

#3 through #6 suggest that the drop from 2013 to 2014 was a random aberration, and that 2015 should look more like the historical norm (SD of 11 games) than like 2014 (SD of 9.4 games). 

Finally, #2 is a hybrid -- a one-time random "shock to the system," but with hangover effects into future seasons. If, for instance, the bad teams just happened to have great prospects arrive in 2014, those players will continue to perform well into 2015 and beyond. Eventually, the economics of the game will push everything back to equilibrium, but that won't happen immediately, so much of the 2014 increase in parity could remain.


Here's my "gut" breakdown of the contribution of each of those six factors:

25% -- #1, changes in talent for economic reasons
 5% -- #2, random changes in talent
10% -- #3, "too much" luck in 2013
20% -- #4, "too little" luck in 2014
10% -- #5, luck favoring the good teams in 2013
30% -- #6, luck favoring the bad teams in 2014

Caveats: (1) This is just my gut; (2) the percentages don't have any actual meaning; and (3) I could easily be wrong.

If you don't care about the reasons, just the bottom line, that breakdown won't mean anything to you. 

As I said, my gut for the bottom line is that it seems reasonable to expect 2015 to end with a standings SD of 10.2 wins ... based on the changes in the Vegas odds.

But ... if there were an over/under on that 10.2, my gut would say to take the over. Even after all these arguments -- which I do think make sense -- I still have this nagging worry that I might just be getting fooled by randomness.
