Thursday, May 28, 2015

Pitchers influence BAbip more than the fielders behind them

It's generally believed that when pitchers' teams vary in their success rate in turning batted balls into outs, the fielders should get the credit or blame. That's because of the conventional wisdom that pitchers have little control over balls in play.

I ran some numbers, and ... well, I think that's not right. I think individual pitchers actually have as much influence on batting average on balls in play (BAbip) as the defense behind them, and maybe even a bit more.


UPDATE: turns out all the work I did is just confirming a result from 2003, in a document called "Solving DIPS" (.pdf).  It's by Erik Allen, Arvin Hsu, and Tom Tango. (I had read it, too, several years ago, and promptly forgot about it.)

It's striking how close their numbers are to these, even though I'm calculating things in a different way than they did. That suggests that we're all measuring the same thing with the same accuracy.

One advantage of their analysis over mine is they have good park effect numbers.  See the first comment in this post for Tango's links to "batting average on balls in play" park effect data.


For the first step, I'll run the usual "Tango method" to divide BAbip into talent and luck.

For all team-seasons from 2001 to 2011, I figured the SD of team BAbip, adjusted for the league average. That SD turned out to be .01032, which I'll refer to as "10.3 points", as in "points of batting average."  

The average SD of binomial luck for those seasons was 7.1 points. Since

SD(observed)^2 = SD(luck)^2 + SD(talent)^2

we can calculate that SD(talent) = 7.5 points.

"Talent," here, doesn't yet differentiate between pitcher and fielder talent. Actually, it's a conglomeration of everything other than luck -- fielders, pitchers, slight randomness of opposition batters, day/night effects, and park effects. (In this context, we're saying that Oakland's huge foul territory has the "talent" of reducing BAbip by producing foul pop-ups.)


7.1 = SD(luck) 
7.5 = SD(talent) 

For a team-season from 2001 to 2011, talent was more important than luck, but not by much. 

I did the same calculation for other sets of seasons. Here's the summary:

            Obsrvd  Luck  Talent
1960-1968    11.41  6.95   9.05
1969-1976    12.24  6.86  10.14
1977-1991    10.95  6.94   8.46
1992-2000    11.42  7.22   8.85
2001-2011    10.32  7.09   7.50
"Average"    11.00  7.00   8.50

I've arbitrarily decided to "average" the eras out to round numbers:  7 points for luck, and 8.5 points for talent. Feel free to use actual averages if you like. 

It's interesting how close that breakdown is to the (rounded) one for team W-L records:

          Observed  Luck  Talent
BABIP        11.00  7.00   8.50
Team Wins    11.00  6.50   9.00

That's just coincidence, but still interesting and intuitively helpful.


That works for separating BAbip into skill and luck, but we still need to break down the skill into pitching and fielding.

I found every pitcher-season from 1981 to 2011 where the pitcher faced at least 400 batters. I compared his BAbip allowed to that of the rest of his team. The comparison to teammates effectively controls for defense, since, presumably, the defense is the same no matter who's on the mound. 

Then, I took the player/rest-of-team difference, and calculated the Z-score: if the difference were all random, how many SDs of luck would it be? 

If BAbip were all luck, the SD of the Z-scores would be exactly 1.0000. It wasn't, of course. It was actually 1.0834. 

Using the "observed squared = talent squared plus luck squared", we can calculate that SD(talent) is 0.417 times as big as SD(luck). For the full dataset, the (geometric) average SD(luck) was 21.75 points. So, SD(talent) must be 0.417 times 21.75, which is 9.07 points.
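The Z-score arithmetic, as a sketch (the 1.0834 and 21.75 figures are the ones quoted above):

```python
import math

# SD of the pitcher-vs-teammates Z-scores (from the text)
sd_z = 1.0834

# SD(Z)^2 = 1 (pure luck) + ratio^2, where ratio = SD(talent)/SD(luck)
ratio = math.sqrt(sd_z**2 - 1)      # ~0.417

sd_luck = 21.75                     # geometric-average luck SD, in points
sd_talent = ratio * sd_luck
print(round(sd_talent, 2))          # ~9.07 points
```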

We're not quite done. The 9.07 isn't an estimate of a single pitcher's talent SD; it's the estimate of the difference between that pitcher and his teammates. There's randomness in the teammates, too, which we have to remove.

I arbitrarily chose to assume the pitcher has 8 times the luck variance of the teammates (he probably pitched more than 1/8 of the innings, but there are more than 8 other pitchers to dilute the SD; I just figured maybe the two forces balance out). That would mean 8/9 of the total variance belongs to the individual pitcher, or the square root of 8/9 of the SD. That reduces the 9.07 points to 8.55 points.

8.55 = SD(single pitcher talent)

That's for individual pitchers. The SD for the talent of a *pitching staff* would be lower, of course, since the individual pitchers would even each other out. If there were nine pitchers on the team, each with equal numbers of BAbip, we'd just divide that by the square root of 9, which would give 2.85. I'll drop that to 2.5, because real life is probably a bit more dilute than that.
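The two adjustments, as arithmetic (the 8/9 variance share and the nine-man staff are the arbitrary assumptions described above):

```python
import math

sd_diff = 9.07   # pitcher-minus-teammates talent SD, in points

# Assume the pitcher carries 8/9 of the variance of the difference
sd_pitcher = sd_diff * math.sqrt(8 / 9)    # ~8.55 points

# A staff of ~9 pitchers with equal workloads dilutes that by sqrt(9)
sd_staff = sd_pitcher / math.sqrt(9)       # ~2.85 points

print(round(sd_pitcher, 2), round(sd_staff, 2))
```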

So for a single team-season, we have

8.5 = SD(overall talent) 
2.5 = SD(pitching staff talent) 
8.1 = SD(fielding + all other talent)


What else is in that 8.1 other than fielding? Well, there's park effects. The only effect I have good data for, right now (I was too lazy to look hard), is foul outs. I searched for those because of all the times I've read about the huge foul territory in Oakland, and how big an effect it has.

Google found me a FanGraphs study by Eno Sarris, showing huge differences in foul outs among parks. The spread from top to bottom is nearly threefold -- 398 outs in Oakland over two years, compared to only 139 in Colorado. 

The team SD from Sarris's chart was about 24 outs per year. Only half of those go to the home pitchers' BAbip, so that's 12 per year. Just to be conservative, I'll reduce that to 10.

Ten extra outs on a team-season's worth of BIP is around 2.5 points.

So: if 8.1 is the remaining unexplained talent SD, we can break it down as 2.5 points of foul territory, and 7.7 points of everything else (including fielding).
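In code, with my round assumption of 4,000 balls in play per team-season:

```python
import math

bip_per_season = 4000    # rough team-season BIP count (my assumption)
extra_outs = 10          # conservative park foul-out SD from the text

# Ten extra outs, expressed in points of batting average
sd_park = extra_outs / bip_per_season * 1000     # ~2.5 points

# Peel the park effect off the 8.1-point residual talent SD
sd_rest = math.sqrt(8.1**2 - sd_park**2)         # ~7.7 points

print(round(sd_park, 1), round(sd_rest, 1))
```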

Our breakdown is now:

11.0 = SD(observed) 
 7.1 = SD(luck) 
 2.5 = SD(pitching staff)
 2.5 = SD(park foul outs)
 7.7 = SD(fielders + unexplained)

We can combine the first three lines of the breakdown to get this:

11.0 = SD(observed) 
 7.9 = SD(luck/pitchers/park) 
 7.7 = SD(fielders/unexplained)

Fielding and non-fielding are almost exactly equal. Which is why I think you have to regress BAbip around halfway to the mean to get an unbiased estimate for the contribution of fielding.

UPDATE: as mentioned, Tango has better park effect data, here.


Now, remember when I said that pitchers differ more in BAbip than fielders? Not for a team, but for an individual pitcher,

8.5 = SD(individual pitcher)
7.7 = SD(fielders + unexplained)

The only reason fielding is more important than pitching for a *team*, is that the multiple pitchers on a staff tend to cancel each other out, reducing the 8.5 SD down to 2.5.


Well, those last three charts are the main conclusions of this study. The rest of this post is just confirming the results from a couple of different angles.


Let's try this, to start. Earlier, when we found that SD(pitchers) = 8.5, we did it by comparing a pitcher's BAbip to that of his teammates. What if we compare his BAbip to the rest of the pitchers in the league, the ones NOT on his team?

In that case, we should get a much higher SD(observed), since we're adding the effects of different teams' fielders.

We do. When I convert the pitchers to Z-scores, I get an SD of 1.149. That means SD(talent) is 0.57 times as big as SD(luck). With SD(luck) calculated to be about 20.54 points, based on the average number of BIPs in the two samples ... that makes SD(talent) equal to 11.6 points.

In the other study, we found SD(pitcher) was 8.5 points. Subtracting the square of 8.5 from the square of 11.6, as usual, gives

11.6 = SD(pitcher+fielders+park)
 8.5 = SD(pitcher)
 7.9 = SD(fielding+park)

So, SD(fielding+park) works out to 7.9 by this method, 8.1 by the other method. Pretty good confirmation.


Let's try another. This time, we'll look at pitchers' careers, rather than single seasons. 

For every player who pitched at least 4,000 outs (1333.1 innings) between 1980 and 2011, I looked at his career BAbip, compared to his teammates' weighted BAbip in those same seasons. 

And, again, I calculated the Z-scores for number of luck SDs he was off. The SD of those Z-scores was 1.655. That means talent was 1.32 times as important as luck (since 1.32 squared plus 1 squared equals 1.655 squared).

The SD of luck, averaged for all pitchers in the study, was 6.06 points. So SD(talent) was 1.32 times 6.06, or 8.0 points.

10.0 = SD(pitching+luck)
 6.1 = SD(luck)
 8.0 = SD(pitching)

The 8.0 is pretty close to the 8.5 we got earlier. And, remember, we didn't include all pitchers in this study, just those with long careers. That probably accounts for some of the difference.
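Same machinery as before, plugged through with the career-study numbers:

```python
import math

sd_z = 1.655      # SD of career Z-scores vs. teammates (from the text)
sd_luck = 6.06    # average binomial luck SD over the careers, in points

ratio = math.sqrt(sd_z**2 - 1)                   # ~1.32
sd_talent = ratio * sd_luck                      # ~8.0 points
sd_total = math.sqrt(sd_talent**2 + sd_luck**2)  # ~10.0 points

print(round(sd_talent, 1), round(sd_total, 1))
```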

Here's the same thing, but for 1960-1979:

 9.3 = SD(pitching+luck)
 6.0 = SD(luck)
 7.2 = SD(pitching)

It looks like variation in pitcher BAbip skill was lower in the olden times than it is now. Or, it's just random variation.


I did the career study again, but compared each pitcher to OTHER teams' pitchers. Just like when we did this for single seasons, the SD should be higher, because now we're not controlling for differences in fielding talent. 

And, indeed, it jumps from 8.0 to 8.8. If we keep our estimate that 8.0 is pitching, the remainder must be fielding. Doing the breakdown:

10.5 = SD(pitching+fielding+luck)
 5.8 = SD(luck)
 8.0 = SD(pitching)
 3.6 = SD(fielding)

That seems to work out. Fielding is smaller for a career than a season, because the quality of the defense behind the pitcher tends to even out over a career. I was surprised it was even that large, but, then, it does include park effects (and those even out less than fielders do). 

For 1960-1979:

10.2 = SD(pitching+fielding+luck)
 5.7 = SD(luck)
 7.2 = SD(pitching)
 4.4 = SD(fielding)

Pretty much along the same lines.


Unless I've screwed up somewhere, I think we've got these as our best estimates for BAbip variation in talent:

8.5 = SD(individual pitcher BAbip talent)
2.5 = SD(team pitching staff BAbip talent)
7.7 = SD(team fielding staff BAbip talent)
2.5 = SD(park foul territory BAbip talent)

And, for a single team-season,

7.1 = SD(team season BAbip luck)

For a single team-season, it appears that luck, pitching, and park effects, combined, are about as big an influence on BAbip as fielding skill.  


Monday, May 25, 2015

Do defensive statistics overrate the importance of fielding?

The job of a baseball team's pitchers and fielders is to prevent the opposition from scoring. It's easy to see how well they succeeded, collectively -- just count the number of runs that crossed the plate.

But how can you tell how many of those runs were because of the actions of the pitchers, and how many were because of the actions of the fielders?

I think it's very difficult, and that existing attempts do more harm than good.

The breakdowns I've been looking at lately are the ones from Baseball Reference, on their WAR pages -- pitching Wins Above Replacement (pWAR) and defensive Wins Above Replacement (dWAR). 

They start out with team Runs Allowed. Then, they try to figure out how good the defense was, behind the pitchers, by using play-by-play data. If they conclude that the defense cost the team, say, 20 runs, they bump the pitchers by 20 runs to balance that out.

So, if the team gave up 100 runs more than average, the pitchers might come in at -80, and the fielders at -20. 

I'm going to argue that method doesn't work right. The results are unreliable, inaccurate, and hard to interpret. 

Before I start, five quick notes:

1. My argument is strongest when fielding numbers are based on basic play-by-play data, like the kind from Retrosheet. When using better data, like the Fielding Bible numbers (which use subjective observations and batted ball timing measurements), the method becomes somewhat better. (But, I think, still not good enough.)  Baseball Reference uses the better method for 2003 and later.

2. The WAR method tries to estimate the value of what actually took place, not the skill level of the players who made it happen. That is: it's counting performance, not measuring talent. 

3. Defensive WAR (dWAR) correlates highly with opposition batting average on balls in play (BAbip). For team-seasons from 1982 to 2009, the correlation was +0.60. For 2003 to 2009, with the improved defense data, the correlation was still +0.56.

dWAR actually includes much more than just BAbip. There are adjustments for outfielder arms, double plays, hit types, and so on. But, to keep things simple, I'm going to argue as if dWAR and BAbip are measuring the same thing. The argument wouldn't change much if I kept adding disclaimers for the other stuff.

4. I'll be talking about *team* defense and pitching. The calculation for individuals has other issues that I'm not going to deal with here.  

5. I don't mean to pick on B-R specifically ... I think there are other systems that do things the same way. I just happened to run across this one most recently. Also, even though the example is in the context of WAR, the criticism isn't about WAR at all; it's only about splitting observed performance between pitching and fielding. 


The problem here is a specific case of a more general issue: it's easy to see what happens on the field, but often very difficult to figure out how to allocate it to the various players involved.

That's why hockey and football and basketball and soccer are so much harder to figure out than baseball. In basketball, when there's a rebound, how do you figure out who "caused" it? It could be the rebounder being skillful, or the other players drawing the defenders away, or even the coach's strategy.

But in baseball, when a single is hit, you pretty much know you can assign all the offensive responsibility to the batter. (The baserunners might have some effect, but it's small.) And even though you have to assign the defensive responsibility to the nine fielders collectively, the pitcher is so dominant that, for years, we've chosen to almost ignore the other eight players completely.

But now that we *don't* want to ignore them ... well, how do you figure out which players "caused" a run to be prevented? Now it gets hard.


Even for a single ball in play, it's hard.

The pitcher gives up a ground ball to the shortstop, who throws the batter out. We all agree that we observed an out, and we credit the out we observed to the pitcher and fielders collectively (under "opposition batting").

But how do we allocate it separately rather than collectively? Do we credit the pitcher? The fielders? Both, in some proportion?

It probably depends on the specifics of the play.

With two outs in the bottom of the ninth, when a third baseman dives to catch a screaming line drive over the bag, we say his defense saved the game. When it's a soft liner hit right to him, we say the pitcher saved the game. Our observation of what happened on the field actually depends on our perception of where the credit lies. 

We might want to credit the fielder in proportion to the difficulty of the play. If it's an easy grounder, one that would be a hit only 5% of the time, we might credit the defense with 5% of the value of the out, with the rest to the pitcher. If it's a difficult play, we might go 80% instead of 5%.

That sounds reasonable. And it's actually what dWAR does; it apportions the credit based on how often it thinks an average defense would have turned that batted ball into an out.

The problem is: how do you estimate that probability? For older seasons, where you have only Retrosheet data, you can't. At best, from play-by-play data, you can figure if it's a ground ball, fly ball, or liner. If you don't have that information, you have to just assume that every ball in play is league-average, around 30 percent chance of being a hit and 70 percent chance of being an out.

Here is the problem, and I think it's not immediately obvious: when you assume certain balls in play are all the same, you wind up, mathematically, giving 100 percent of the credit to the fielders. Even if you go through the arithmetic of calculating everything against the 70/30 average, assigning 30 percent of every hit to the pitcher, and so forth ... you'll still find, at the end of the calculation, that if the fielders gave up 29 percent hits instead of 30 percent, the fielders get ALL the credit for the "missing" hits.
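To see the arithmetic, here's a toy version with made-up numbers. Treat every BIP as league-average, and no matter how you shuffle the 70/30 bookkeeping, the pitcher's credit nets out to zero and the fielders absorb the whole deviation:

```python
# Toy example: 4,000 balls in play, league-average 30% hit rate,
# but the team actually allowed 29% hits. Numbers are illustrative.
bip = 4000
league_hit_rate = 0.30
actual_hits = int(bip * 0.29)                    # 1160

expected_hits = bip * league_hit_rate            # 1200

# With every BIP assumed league-average, the pitcher's "expected"
# line is exactly the league line, so his credit nets to zero...
pitcher_credit = expected_hits - bip * league_hit_rate   # 0

# ...and the fielders absorb the entire deviation.
fielder_credit = expected_hits - actual_hits             # 40 hits "saved"

print(pitcher_credit, fielder_credit)
```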

Of course, the actual Baseball Reference numbers don't treat every BIP the same ... they do have the hit type, and, for recent seasons, they have trajectory data to classify BIPs better. But, it's not perfect. There are still discrepancies, all of which wind up in the fielding column.  [UPDATE: I think even the best data currently available barely puts a dent in the problem.]

For Retrosheet data, and making up numbers:  suppose a fly ball has a 26 percent chance of being a hit, but a ground ball has a 33 percent chance. The pitchers will get the credit for what types of balls they gave up, but the fielders will get 100% of the credit after that. So, if the pitcher gives up fewer ground balls, he gets the credit. But if he gives up *easier* ground balls, the fielders get the credit instead.

This is the key point. Everything that's not known about the ball in play, everything that's random, or anything that has to be averaged or guessed or estimated or ignored -- winds up entirely in the fielders' column.

Now, maybe someone could argue that's actually what we want, that all the uncertainty goes to the fielders. Because, it's been proven, pitchers don't differ much in control over balls in play.

But that argument fails. "Pitchers don't differ much in BAbip" is a statement about *talent*, not about *observations*. In actual results, pitchers DO differ in BAbip, substantially, because of luck. Take two identical pitcher APBA cards, and play out seasons, and you'll find big differences. 

Observations are a combination of talent and luck. If you want to divide the observed balls in play into observed pitching and observed fielding, you're also going to have to divide the luck properly. Not, zero to the pitcher and 100 percent to the fielders.


Traditionally, at the team level, our observations were close to 100% perfect in how they reflected what happened on the field. But when you take those observations and allocate them, it's no longer 100% perfect. You're estimating, rather than observing.

For the first time, we are "guessing" about how to allocate the observation. 

Watching a specific game, in 1973 or whenever, it would have been obvious that a certain run was prevented by the pitcher inducing three weak ground balls to second base. Now, we can't tell that from Retrosheet data, and so we (mistakenly, in this case) assign the credit to the fielders. 

Nowhere else do we have to guess. We'll observe the same number of doubles, home runs, strikeouts, and popups to first base as the fan in the seats in 1973. We'll observe the same runs and runs allowed, and innings pitched, and putouts by the left fielder. But we will NOT "observe" the same "performance" of the fielders. Fans who watched the games got a reasonable idea how many runs the fielders saved (or cost) by their play. We don't; we have to guess.

Of course, in guessing, our error could go either way. Sometimes we'll overcredit the defense (compared to the pitcher), and sometimes we'll undercredit the defense. Doesn't it all come out in the wash? Nope. A single season is nowhere near enough for the variation to even out. 

For single seasons, we will be substantially overestimating the extent of the fielders' responsibility for runs prevented (or allowed) on balls in play.

This means that you actually have to regress team dWAR to the mean to get an unbiased estimate for what happened on the field. That's never happened before. Usually, we need to regress to estimate *talent*. Here, we need to regress just to estimate *observations*. (To estimate talent, we'd need to regress again, afterwards.)
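Regressing to the mean is just linear shrinkage. A minimal sketch, with the shrinkage factor left as a parameter:

```python
def regress_to_mean(observed, mean, r):
    """Shrink an observed value toward the mean.
    r is the fraction of the observed deviation you keep;
    a halfway regression corresponds to r = 0.5."""
    return mean + r * (observed - mean)

# A team shows +20 fielding runs by dWAR; regressing halfway
# toward zero gives +10 as the estimate of what the fielders did.
print(regress_to_mean(20, 0, 0.5))   # 10.0
```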


Here's an analogy I think will make it much clearer.

Imagine that we didn't have data about home runs directly, only about deep fly balls (say, 380 feet or more). And we found that, on average, 75% of those fly balls turn out to be home runs.

One year, the opposition hits 200 fly balls, but instead of 150 of them being home runs, only 130 of them are.

And we say, "wow, those outfielders must be awesome, to have saved 20 home runs. They must have made 20 spectacular, reaching-over-the-wall, highlight-reel catches!"

No, probably not. It most likely just turned out that only 130 of those 150 deep fly balls had enough distance. By ignoring the substantial amount of luck in the relationship between fly balls and home run potential, we wind up overestimating the outfielders' impact on runs.

That's exactly what's happening here.


"What actually happened on the field" is partly subjective. We observe what we think is important, and attribute it where we think it makes sense. You could count take the outcome of an AB and assign it to the on-deck hitter instead of the batter, but that would make no sense to the way our gut assigns credit or blame. (Our gut, of course, is based on common-sense understandings of how baseball and physics work.)

We assign the observations to the players we think "caused" the result. But we do it even when we know the outcome was probably just luck. It's just the way our brains work. If it was luck, we want to assign it to the lucky party. Otherwise, our observation is *wrong*. 

Here's an analogy, a coin-flipping game. The rules go like this:

-- The pitcher flips a coin. If it's a head, it's a weak ground ball, which is always an out. If it's a tail, it's a hard ground ball. 

-- If it's a hard ground ball, the fielders take over, and flip their own coin. If that coin is a head, they make the play for an out. If it's a tail, it's a hit. 

You play a season of this, and you see the team allowed 15 more hits than expected. How do you assign the blame? 

It depends.

Suppose the "fielders" coin flipped tails exactly half the time, as expected, but the "pitchers" coin flipped too many tails, so the pitchers gave up 30 too many hard ground balls. In that case, we'd say that the fielders can't be blamed, that the 15 extra hits were the pitcher's fault.

If it were the other way around -- the pitcher coin flipped heads half the time, but the "fielder" coin flipped 15 too few heads, letting 15 too many balls drop in for hits -- we'd "blame" the fielders. 

We have very specific criteria about how to assign the observations properly, even when they're just random.

The dWAR calculation violates those criteria. It refuses to look at the pitcher coin, or at least, it doesn't have complete data for it. So it just assigns all 15 hits, or whatever the incomplete pitcher coin data can't explain, to the fielder coin.
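Here's a quick simulation of the game (season length and number of seasons are my own round numbers). Splitting the blame properly -- pitcher coin to the pitcher, fielder coin to the fielders -- gives one answer; the dWAR-style split, which dumps everything the pitcher coin can't explain onto the fielders, inflates the fielders' column by the pitcher's luck:

```python
import random

random.seed(1)
BIP = 4000       # balls in play per simulated season (round number)
SEASONS = 1000

def season():
    # Pitcher coin: tails (p = .5) means a hard ground ball
    hard = sum(random.random() < 0.5 for _ in range(BIP))
    # Fielder coin: tails (p = .5) on a hard grounder means a hit
    hits = sum(random.random() < 0.5 for _ in range(hard))
    return hard, hits

pitcher_blame, fielder_blame, naive_fielder = [], [], []
for _ in range(SEASONS):
    hard, hits = season()
    # Proper split: charge the pitcher for extra hard grounders
    # (each worth half a hit), the fielders for extra hits allowed
    # on the hard grounders they actually faced
    pitcher_blame.append((hard - BIP * 0.5) * 0.5)
    fielder_blame.append(hits - hard * 0.5)
    # dWAR-style split: the entire deviation goes to the fielders
    naive_fielder.append(hits - BIP * 0.25)

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Expect roughly 16, 22, and 27 hits of SD, respectively
print(round(sd(pitcher_blame)), round(sd(fielder_blame)), round(sd(naive_fielder)))
```

The naive fielder SD comes out noticeably bigger than the true fielder SD; the difference is exactly the pitcher-coin luck that got misfiled.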


Why is this a big deal? What does it matter, that the split between pitching and defense has this flaw?

1.  First, it matters because it introduces something that's new to us, and not intuitively obvious -- the need to regress the dWAR to the mean *just to get an unbiased estimate of what happened on the field*. It took me a lot of thinking until I realized that's what's going on, partly because it's so counterintuitive.

2.  It matters because most estimates of fielding runs saved don't do any regression to the mean. This leads to crazy overestimates of the impact of fielding. 

My guess, based on some calculations I did, is that you have to regress dWAR around halfway to the mean, for cases where you just use BAbip as your criterion. If my guess is right, it means fielding is only half as important as dWAR thinks it is. 

Of course, if you have better data, like the Fielding Bible's, you may have to regress less -- depending on how accurate your estimates are of how hard each ball is to field. Maybe with that data you only have to regress, say, a third of the way to the mean, instead of a half. I have no idea.

The first edition of the Fielding Bible figures fielding cost the 2005 New York Yankees a total of 164 hits -- more than a hit a game. That's about 130 runs, or 13 wins.

They were saying that if you watched the games, and evaluated what happened, you'd see the fielders screw up often enough that you'd observe an average of a hit per game. 

I'm saying ... no way, that's too hard to believe. I don't know what the real answer is, but I'd be willing to bet that it's closer to half a hit than a full hit -- with the difference caused by the Fielding Bible's batted-ball observations not being perfect.

I'll investigate this further, how much you have to regress.

3.  The problem *really* screws up the pitching numbers. What you're really trying to do is start with Runs Allowed and subtract observed defense. But the measure of observed defense you're using is far too extreme. So, effectively, you're subjecting true pWAR to an overadjustment, along with a random shock.

Even so, that doesn't necessarily mean pWAR becomes less accurate. If the errors were the same magnitude as the true effect of fielding, it would generally be a break-even. If the Yankees are actually 40 runs worse than average, the error is the same whether you credit the pitchers 80 runs, or 0 runs ... it's just a matter of which direction. 

Except: the errors aren't fixed. Even if you were to adjust dWAR by regressing it to the mean exactly the right amount, it would still be just an estimate, with a random error. Add that in, and you'd still be behind.  

And, perhaps more importantly, with the adjustment, we lose our understanding of what the numbers might mean. The traditional way, when the error comes from *not* adjusting for defense, we intuitively know how to deal with the numbers, and what they might not mean. We've always known we can't evaluate pitchers based on runs allowed unless we adjust for fielding. But we've developed a gut feel for what the unadjusted numbers mean, because we've dealt with them so often. 

We probably even have an idea what direction the adjustment has to go, whether the pitchers in question had a good or bad defense behind them -- because we know the players, and we've seen them play, and we know their reputations. 

But, the dWAR way ... well, we have no gut feel for how we need to adjust, because it's no longer about adding in the fielders; it's about figuring out how bad the defense overestimate might have been, and how randomness might have affected the final number.  

When you adjust for dWAR, what you're saying is: "instead of living with the fact that team pitching stats are biased by the effects of fielding, I prefer to have team pitching stats biased by some random factor that's actually a bigger bias than the original." 

All things considered, I think I'd rather just stick with runs allowed.


Friday, May 15, 2015

Consumer Reports on bicycle helmets

In the June, 2015 issue of their magazine, Consumer Reports (CR) tries to convince me to wear a bicycle helmet. They do not succeed. Nor should they. While it may be true that we should all be wearing helmets, nobody should be persuaded by CR's statistical arguments, which are so silly as to be laughable.

It's actually a pretty big fail on CR's part. Because, I'm sure, helmets *do* save lives, and it should be pretty easy to come up with data that illustrate that competently. Instead ... well, it's almost like they don't take the question seriously enough to care about what the numbers actually mean. 

(The article isn't online, but here's a web page from their site that's similar but less ardent.)


Here's CR's first argument:

"... the answer is a resounding yes, you *should* wear a helmet. Here's why: 87 percent of the bicyclists killed in accidents over the past two decades were not wearing helmets."

Well, that's not enough to prove anything at all. From that factoid alone, there is really no way to tell whether helmets are good or bad. 

If CR doesn't see why, I bet they would if I changed the subject on them:

"... the answer is a resounding yes, you *should* drive drunk. Here's why: 87 percent of the drivers killed in accidents over the past two decades were stone cold sober."

That would make it obvious, right?


If the same argument that proves you should wear a helmet also proves you should drive drunk, the argument must be flawed. What's wrong with it? 

It doesn't work without the base rate. 

In order to see if "87 percent" is high or low, you need something to compare it to. If fewer than 87 percent of cyclists go helmetless, then, yes, they're overrepresented in deaths, and you can infer that helmets are a good thing. But if *more* than 87 percent go without a helmet, that might be evidence that helmets are actually dangerous.

To make the drinking-and-driving argument work, you have to show that fewer than 87 percent of drivers are drunk. 

Neither of those is that hard. But you still have to do it!
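Here's what the repaired argument actually computes. The 60 percent helmetless base rate is made up, just like in the example further down; with it, Bayes' rule turns the 87 percent into a relative risk:

```python
# Bayes-style check of the "87 percent" claim. The 60 percent
# base rate of helmetless riders is a made-up illustration.
p_nohelmet_given_death = 0.87
p_nohelmet = 0.60

# Relative risk of death, helmetless vs. helmeted riders:
# (P(death|no helmet) / P(death|helmet)) via Bayes' rule
rr = (p_nohelmet_given_death / p_nohelmet) / \
     ((1 - p_nohelmet_given_death) / (1 - p_nohelmet))

print(round(rr, 1))   # ~4.5 -- but only once the base rate is supplied
```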


Why would CR not notice that their argument was wrong in the helmet case, but notice immediately in the drunk-driving case? There are two possibilities:

1. Confirmation bias. The first example fits in with their pre-existing belief that helmets are good; the second example contradicts their pre-existing belief that drunk driving is bad.

2. Gut statistical savvy. The CR writers do feel, intuitively, that base rates matter, and "fill in the blanks" with their common sense understanding that more than 13 percent of cyclists wear helmets, and that more than 13 percent of drivers are sober.

As you can imagine, I think it's almost all number 1. I'm skeptical of number 2. 

In fact, there are many times that the base rates could easily go the "wrong" way, and people don't notice. One of my favorite examples, which I mentioned a few years ago, goes something like this:

"... the answer is a resounding yes, you *should* drive the speed limit. Here's why: 60 percent of fatal accidents over the past two decades involved at least one speeder."

As I see it, this argument actually may support speeding! Suppose half of all drivers speed. Then, there's a 75 percent chance of finding at least one speeder out of two cars. If those 75 percent of cases comprise only 60 percent of the accidents ... then, speeding must be safer than not speeding!
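The arithmetic behind that: with two cars per fatal accident and half of drivers speeding,

```python
# If half of all drivers speed, how often does a two-car accident
# involve at least one speeder?
p_speeder = 0.5
p_at_least_one = 1 - (1 - p_speeder) ** 2    # 0.75

# If that 75 percent of car-pairs accounts for only 60 percent of
# fatal accidents, pairs with a speeder are UNDER-represented.
share_of_accidents = 0.60
print(p_at_least_one, share_of_accidents < p_at_least_one)   # 0.75 True
```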

And, of course, there's this classic Dilbert cartoon.


It wouldn't have been hard for CR to fix the argument. They could have just added the base rate:

"... the answer is a resounding yes, you *should* wear a helmet. Here's why: 87 percent of the bicyclists killed in accidents over the past two decades were not wearing helmets, as compared to only 60 percent of cyclists overall."

It does sound less scary than the other version, but at least it means something.

(I made up the "60 percent" number, but anything significantly less than 87 percent would work. I don't know what the true number is; but, since we're talking about the average of the last 20 years, my guess would be that 60 to 70 percent would be about right.)


Even if CR had provided a proper statistical argument that riding with a helmet is safer than riding without ... it still wouldn't be enough to justify their "resounding yes". Because, I doubt that anyone would say,

"... the answer is a resounding yes, you *should* avoid riding a bike. Here's why: 87 percent of commuters killed in accidents over the past two decades were cycling instead of walking -- as compared to only 60 percent of commuters overall."

That would indeed be evidence that biking is riskier than walking -- but I think we'd all agree that it's silly to argue that it's not worth the risk at all. You have to weigh the risks against the benefits.

On that note, here's a second statistical argument CR makes, which is just as irrelevant as the first one:

"... wearing a helmet during sports reduces the risk of traumatic brain injury by almost 70 percent."

(Never mind that the article chooses to lump all sports together; we'll just assume the 70 percent is true for all sports equally.)

So, should that 70 percent statistic alone convince you to wear a helmet? No, of course not. 

Last month -- and this actually happened -- my friend's mother suffered a concussion in the lobby of her apartment building. She was looking sideways towards the mail room, and collided with another resident who didn't see her because he was carrying his three-year-old daughter. My friend's Mom lost her balance, hit her head on an open door, blacked out, and spent the day in hospital.

If she had been wearing a helmet, she'd almost certainly have been fine. In fact, it's probably just as true that

"... wearing a helmet when walking around in public reduces the risk of traumatic brain injury by almost 70 percent."

Does that convince you to wear a helmet every second of your day? Of course not.

The relevant statistic isn't the percentage of injuries prevented. It's how big the benefit is as compared to the cost and inconvenience.

The "70 percent" figure doesn't speak to that at all. 

If I were to find the CR writers and ask them why they don't wear their helmets while walking on the street, they'd look at me like I'm some kind of idiot -- but if they chose to answer the question, they'd say that it's because walking, unlike cycling, doesn't carry a very high risk of head injury.

And that's the point. Even if a helmet reduced the risk of traumatic brain injury by 80 percent, or 90 percent, or even 100 percent ... we still wouldn't wear one to the mailroom. 

I choose not to wear a helmet for exactly the same reason that you choose not to wear a helmet when walking. That's why the "70 percent" figure, on its own, is almost completely irrelevant. 

Everything I've seen on the web convinces me that the risk is low enough that I'm willing to tolerate it. I'd be happy to be corrected by CR -- but they're not interested in that line of argument. 

I bet that's because the statistics don't sound that ominous. Here's a site that says there's around one cycling death per 10,000,000 miles. If it's four times as high without a helmet -- one death per 2,500,000 miles -- that still doesn't sound that scary. 

That latter figure is about 40 times as high as driving. If I ride 1,000 miles per year without a helmet, the excess danger is equivalent to driving 30,000 miles. I'm willing to accept that level of risk. (Especially because, for me, the risk is lower still: I ride mostly on bike paths rather than streets, and most cycling deaths result from collisions with cars.)
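Here's that equivalence spelled out. This is a sketch using the rates above, plus an assumed driving fatality rate of roughly one death per 100 million miles (my assumption, not a figure from the article):

```python
deaths_per_mile_helmet = 1 / 10_000_000      # cited cycling rate
deaths_per_mile_no_helmet = 4 / 10_000_000   # four times as high without a helmet
deaths_per_mile_driving = 1 / 100_000_000    # assumed driving rate

# Excess risk of riding 1,000 miles per year without a helmet,
# compared to riding those same miles with one:
excess_deaths = 1000 * (deaths_per_mile_no_helmet - deaths_per_mile_helmet)

# Express that excess as an equivalent number of driving miles:
equivalent_driving_miles = excess_deaths / deaths_per_mile_driving
print(round(equivalent_driving_miles))  # 30000
```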


You can still argue that three extra deaths per ten million miles is a significant amount of risk, enough to create a moral or legal requirement for helmets. But, do you really want to go there? Because, if your criterion is magnitude of risk ... well, cycling shouldn't be at the top of your list of concerns. 

In the year 2000, according to this website, 412 Americans died after falling off a ladder or scaffolding.

Now, many of those deaths are probably job-related, workers who spend a significant portion of their days on ladders. Suppose that covers 90 percent of the fatalities, so only 10 percent of those deaths were do-it-yourselfers working on their own houses. That's 41 ladder deaths.

People spend a lot more time on bicycles than ladders ... I'd guess, probably by a factor of at least 1000. So 41 ladder deaths is the equivalent of 41,000 cycling deaths. 

But ... there were only 740 deaths from cycling. That makes it around fifty-five times as dangerous to climb a ladder as to ride a bicycle.
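The ladder comparison works out like this (the 10-percent DIY share and the 1000-to-1 exposure ratio are the guesses from the text, not measured figures):

```python
ladder_deaths = 412        # falls from ladders or scaffolding, year 2000
diy_ladder_deaths = 41     # assume only 10 percent were do-it-yourselfers

exposure_ratio = 1000      # guess: 1000x more hours cycling than on ladders
equivalent_cycling_deaths = diy_ladder_deaths * exposure_ratio  # 41,000

cycling_deaths = 740
relative_danger = equivalent_cycling_deaths / cycling_deaths
print(round(relative_danger))  # 55
```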

And that's just the ladders -- it doesn't include deaths from falling after you've made it onto the roof!

If CR were to acknowledge that risk level is important, they'd have to call for helmets for others at high risk, like homeowners cleaning their gutters, and elderly people with walkers, and people limping in casts, and everyone taking a shower.


Finally, CR gives one last statistic on the promo page:

"Cycling sends more people to the ER for head injuries than any other sport -- twice as many as football, 3 1/2 times as many as soccer."

Well, so what? Isn't that just because there's a lot more cycling going on than football and soccer? 

This document (.pdf) says that in the US in 2014, there were six football fatalities. They all happened in competition, even though there are many more practices than games. All six deaths happened among the 1,200,000 players in high-school level football or beyond -- there were no deaths in youth leagues.

Call it a million players, ten games a year, on the field for 30 minutes of competition per game. Let's double that to include kids' leagues, and double it again to include practices -- both of which didn't have any deaths, but still may have had head injuries. 

That works out to 20 million hours of football.

In the USA, cyclists travel between 6 billion and 21 billion miles per year. Let's be conservative and take the low value. At an average speed of, say, 10 mph, that's 600 million hours of cycling.

So, people cycle 30 times as much as they play football. But, according to CR, they suffer only twice as many head injuries. That means cycling, per hour, is only about 7 percent as dangerous as football.
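Putting those rough numbers together (all the exposure figures are the back-of-envelope guesses above):

```python
# Football: a million players, ten games a year, 30 minutes on the field.
football_hours = 1_000_000 * 10 * 0.5      # 5 million hours
football_hours *= 2                        # double for kids' leagues
football_hours *= 2                        # double again for practices
print(football_hours)                      # 20 million hours

# Cycling: the conservative 6 billion miles per year, at 10 mph.
cycling_hours = 6_000_000_000 / 10         # 600 million hours

exposure_ratio = cycling_hours / football_hours
print(exposure_ratio)                      # 30.0 -- 30x as much cycling

# Twice the injuries, spread over 30x the exposure:
print(round(2 / exposure_ratio * 100))     # 7 -- percent as dangerous, per hour
```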

That just confirms what already seemed obvious, that cycling is pretty safe compared to football. Did CR misunderstand the statistics so badly that they convinced themselves otherwise? 


(My previous posts on bicycle helmets are here:  one two three four)


Tuesday, May 05, 2015

Consumer Reports on unit pricing

Consumer Reports (CR) wants government to regulate supermarket "unit pricing" labels because they're inconsistent. Last post, I quoted their lead example:

"Picture this: You're at the supermarket trying to find the best deal on AAA batteries for your flashlight, so you check the price labels beneath each pack. Sounds pretty straightforward, right? But how can you tell which pack is cheaper when one is priced per battery and one is priced per 100?"

The point, of course, is that CR must be seriously math-challenged if they don't know how to move a decimal point.

I laughed at their example, and I thought maybe they just screwed up. But ... no, they also chose a silly example as their real-life evidence. 

In the article's photograph, they show two different salad-dressing labels, from the same supermarket. The problem: one is unit-priced per pint, but the other one is per quart. Comparing the two requires dividing or multiplying by two, which (IMO) isn't really that big a deal. But, sure, OK, it would be easier if you didn't have to do that.

Except: the two bottles in CR's example are *the same size*.

One 24-ounce bottle of salad dressing is priced at $3.69; the other 24-ounce bottle is priced at $3.99. And CR is complaining that consumers can't tell which is the better deal, because the breakdowns are in different units!

That doesn't really affect their argument, but it does give the reader the idea that they don't really have a good grip on the problem. Which, I will argue, they don't. Their main point is valid -- that unit pricing is more valuable when the units are the same so it's easier to compare -- but you'd think if they had really thought the issue through, they'd have realized how ridiculous their examples are.


The reason behind unit pricing, of course, is to allow shoppers to compare the prices of different-sized packages -- to get an idea of which is more expensive per unit.

That's most valuable when comparing different products. For the same product in different sizes, you can be pretty confident that the bigger packs are a better deal. It's hard to imagine a supermarket charging $3 for a single pack, but $7 for a double-size pack. That only happens when there's a mistake, or when the small pack goes on sale but the larger one doesn't. 

When it's different products, or different brands ... does unit pricing really mean a whole lot if you don't know how they vary in quality?

At my previous post, a commenter wrote,

"What if some batteries have different life expectancies?"

Ha! Excellent point. 

There's an 18-pack of HappyPower AA batteries for $5.99, and a 13-pack of GleeCell for $4.77. Which is a better deal? I guess if the shelf label tells you that each HappyPower battery works out to 33 cents, but a GleeCell costs 37 cents, that helps you decide, a little bit. If you don't know which one is better, you might just shrug, go for the HappyPower, and save the four cents.
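The per-battery figures are just pack price divided by count -- a one-liner each:

```python
happypower = 5.99 / 18
gleecell = 4.77 / 13
print(round(happypower, 2))  # 0.33 -- about 33 cents per battery
print(round(gleecell, 2))    # 0.37 -- about 37 cents per battery
```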

Except ... there's an element of "you get what you pay for."  In general (not always, but generally), higher-priced items are of higher quality. I'd be willing to bet that if you ran a regression on every set of ratings CR has issued over the past decade, 95 percent of them would show a positive correlation between quality and price. There certainly is a high correlation in the battery ratings, at least.  (Subscription required.)

So, at face value, unit price isn't enough. The question you really want to answer is:

If someone chose two random batteries, and random battery A cost 11 percent more in an 18-pack than random battery B in a 13-pack, which is likelier to be the better value?

That's not just a battery question: it's a probability question. Actually, it's even more complicated than that. It's not enough to know whether you're getting more than 11 percent better value, because, to get that 11 percent, you have to buy a larger pack! Which you might not really want to do. 

Pack size matters. I think it's fair to say that, all things being equal, we almost always prefer a smaller pack to a larger pack. That must be true. If it weren't, smaller sizes would never sell, and everything would come in only one large size! 

To make a decision, we wind up doing a little intuitive balancing act involving at least three measures: the quality of the product, the unit price, and the size of the pack. The price is just one piece of the puzzle. 

In that light, I'm surprised that CR isn't calling for regulations to force supermarkets to post CR's ratings on the shelves. After all, you can always calculate unit price on the spot, with the calculator app on your phone. But not everyone has a data plan and a CR subscription.


Here's another, separate issue CR brings up:

"[Among problems we found:] Toilet paper priced by '100 count,' though the 'count' (a euphemism for 'sheets') differed in size and number of plies depending on the brand."

So, CR isn't just complaining that the labels use *inconsistent* units -- they're also complaining that they use the *wrong* units. 

So, what are the right units for toilet paper? Here in Canada, packages give you the total area, in square meters, which corrects for different sizes per sheet. But that won't satisfy CR, because that doesn't take "number of plies" into account. 

What will work, that you can compare a pack of three-ply smaller sheets with a pack of two-ply larger sheets?

I guess they could do "price per square foot per ply."  That might work if you're only comparing products, and don't need to get your head around what the numbers actually mean.

They could also do "price per pound," on the theory that thicker, higher-quality paper is heavier than the thinner stuff. But that seems weird, that CR would want to tell consumers to comparison shop toilet paper by weight.

In either case, you're trading ease of understanding what the product costs, in exchange for the ability to more easily compare two products. Where is the tradeoff? I don't think CR has thought about it. On the promo page for their article, they do an "apples and oranges" joke, showing apples priced at $1.66 per pound, while oranges are 75 cents each. Presumably, they should both be priced per pound. 

Now, I have no idea how much a navel orange weighs. If they were $1.79 a pound, and I wanted to buy one only if it were less than, say, $1, I'd have to take it over to a scale ... and then, I'd have to calculate the weight times $1.79.

According to CR, that's bad: 

"To find the best value on the fruit below, you'd need a scale -- and a calculator."

Well, isn't that less of a problem than needing a scale and calculator *to find out how much the damn orange actually costs*?

I think CR hasn't really thought this through to figure out what it wants. But that doesn't stop it from demanding government action.


In 2012, according to the article, CR worked with the U.S. Department of Commerce (DOC) to come up with a set of recommended standards for supermarket labels. (Here's the .pdf, from the government site.)

One of the things they want to "correct" is a shelf label for a pack of cookies. The product description on the label says "6 count," meaning six cookies. The document demands that it be in grams.

Which is ridiculous, in this case. When products come in small unit quantities, that's how consumers think of them. I buy Diet Mountain Dew in packs of twelve, not in agglomerations of 4.258 liters. 

It turns out that manufacturers generally figure out what consumers want on labels, even if CR is unable to. 

For instance: over the years, Procter and Gamble has made Liquid Tide more and more concentrated. You need less to do the same job. That means that the actual liquid volume of the detergent is completely meaningless. What matters is the amount of active ingredient -- in other words, how many loads of laundry the bottle can do.

Which is why Tide provides this information, prominently, on the bottle. My bottle here says it does 32 loads. There are other sizes that do 26 loads, or 36, or 110, or ... whatever.

But, under the proposed CR/US Government standards, that would NOT BE ALLOWED. From the report:

"Unit prices must be based on legal measurement units such as those for declaring a packaged quantity or net content as found in the Fair Packaging and Labeling Act (FPLA). Use of unit pricing in terms of 'loads,' 'uses,' and 'servings' are prohibited."

CR, and the DOC, believe that the best way for consumers to intelligently compare the price of a bottle of Tide to some off-brand detergent that's diluted to do one-quarter the loads ... *is by price per volume*. Not only do they think that's the right method ... they want to make any other alternative ILLEGAL.

That's just insane.


I have a suggestion to try to get CR to change its mind. 

A standard size of Tide detergent does 32 loads of laundry. The premium "Tide with Febreze" variation does only 26 loads. But the two bottles are almost exactly the same size. 

I'll send a letter. Hey, Consumer Reports! Procter and Gamble is trying to rip us off! The unit price per volume makes it look like the two detergents are the same price, but they're not! The other one is watered down!

I bet next issue, there'll be an article demanding legislation to prohibit unit pricing by volume, so that manufacturers stop ripping us off.

I'm mostly kidding, of course. For one thing, P&G isn't necessarily trying to rip us off. The Febreze in the expensive version is an additional active ingredient. (And a good one: it works great on my stinky ball hockey chest pad.) Which is "more product" -- 32 regular loads, or 26 enhanced loads? P&G thinks they're about the same, which is why it made the bottles the same size, to signal what it thinks the products are worth.

Or, maybe they diluted both products similarly, and it just works out that the combined volume winds up similar.

Either way, unit pricing by volume doesn't tell you much. Unless you want to think that, coincidentally, a load with Febreze is exactly 32/26 as valuable a "unit" as a load without. But then, what will you do when Tide changes the proportions?

It makes no sense.


Anyway, I do agree with CR that it's better if similar products can be compared with the same unit. And, sometimes, that doesn't happen, and you get pints alongside quarts.

But I disagree with CR that the occasional lapse constitutes a big problem. I disagree that supermarkets don't care what consumers want. I disagree that CR knows better than manufacturers and consumers. And I disagree that the government needs to regulate anything, including font sizes (which, yes, CR complains about too -- "as tiny as 0.22 inch, unreadable for impaired or aging eyes"). 

CR's goal, to improve things for comparison shoppers, is reasonable. I'm just frustrated that they came up with such bad examples and bad answers, and that they want to make it illegal to do it any way other than their silly wrong way. 

If their way is wrong, what way is right?

Well, it's different for everyone. We're diverse, and we all have different needs. 

What should we do for, say, Advil? Some people always take a 200 mg dose, and will much prefer a unit price per tablet. Me, I sometimes take 200 mg, and sometimes 400 mg. For me, "per tablet" isn't that valuable. I'd rather see pricing per unit of active ingredient. In addition, I'm willing to take two tablets for a higher dose, or half a tablet for a lower dose, whichever is cheaper. 

It's an empirical question. It depends on how many people prefer each option. Neither the government nor CR can know without actually going out and surveying. 


Having said all that ... let me explain what *I* would want to see in a unit price label, based on how I think when I shop. You probably think differently, and you may wind up thinking my suggestion is stupid. Which it very well might be. 


A small jar of Frank's Famous Apricot Jam costs 35 cents per ounce. A larger jar costs 25 cents per ounce. Which one do you buy?

It depends on the sizes, right? If the big jar is ten times the size, you're less likely to buy it than if it's only twice the size. Also, it depends on how much you use. You don't want the big jar to go bad before you can finish it. On the other hand, if you use so much jam that the small jar will be gone in three days, you'd definitely buy the bigger one. But what if you've never tried that jam before? Frank's Famous Jam might be a mediocre product, like those Frank's Famous light bulbs you bought in 1985, so you might want to start with the small jar in case you hate it.

You kind of mentally balance the difference in unit price among all those other things.

Now, I'm going to argue: the unit price difference, "35 cents vs. 25 cents" is not the best way to look at it. I think the unit prices seriously underestimate the savings of buying the bigger jar. I think the issue that CR identified, the "sometimes it's hard to compare different units," is tiny compared to the issue that unit prices aren't that valuable in the first place.

Why? Because, as economists are fond of saying, you have to think on the margin, not the average. You have to consider only the *additional* jam in the bigger jar.

Suppose the small jar of jam is 12 ounces, and the large is 24 ounces (twice as big). So, the small jar costs $4.20, and the large costs $6.00.

But consider just the margin, the *additional* jam. If you upgrade to the big jar, you're getting 12 additional ounces, for $1.80 of additional cost. The upgrade costs you only 15 cents an ounce. That's 57 percent cheaper! 

If you buy the small jar instead of the big one, you're passing up the chance to get the equivalent of a second jar for less than half price. And that's something you don't necessarily see directly if you just look at the average unit price.
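The average-versus-marginal comparison, as a sketch with the jam numbers from the example:

```python
small_oz, small_price = 12, 4.20   # 35 cents per ounce
large_oz, large_price = 24, 6.00   # 25 cents per ounce

avg_unit_small = small_price / small_oz

# The margin: only the *extra* jam at the *extra* cost.
extra_oz = large_oz - small_oz                 # 12 ounces
extra_cost = large_price - small_price         # $1.80
marginal_unit = extra_cost / extra_oz          # 15 cents per ounce
print(round(marginal_unit, 2))                 # 0.15

savings = 1 - marginal_unit / avg_unit_small
print(round(savings * 100))                    # 57 -- percent off the small-jar rate
```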


I think that's a much more relevant comparison: 35 cents vs. 15 cents, rather than 35 cents vs. 25 cents. 

Don't believe me? I'll change the example. Now, the small jar is still 35 cents an ounce, but the large jar is 17.5 cents an ounce. Now, which do you buy?

You always buy the large jar.  It's the same price as the small jar! At those unit costs, both jars cost $4.20. 

That's obvious when you see that when you upgrade to the bigger jar, you're getting 12 ounces of marginal jam for $0.00 of marginal cost.  It's not as obvious when you see your unit cost drop from 35 cents to 17.5 cents.


So, that's something I'd like to see on unit labels, so I don't have to calculate it myself: the marginal cost for the next biggest size. Something like this:

"If you buy the next largest size of this same brand of Raisin Bran, you will get 40% more product for only 20% more price. Since 20/40 equals 0.5, it's like you're getting the additional product at 50 percent off."

Or, in language we're already familiar with from store sales flyers:

"Buy 20 oz. of Raisin Bran at regular price, get your next 8 oz. at 50% off."


Unit price is a "rate" statistic. Sometimes, you'd rather have a bulk measure -- a total cost. If I want one orange, I might not care that they're $3 a pound -- I just want to know that particular single orange comes out to $1.06.

In the case of the jam, I might think, well, sure, half price is a good deal, but I'm running out of space in the fridge, and I might get sick of apricot before I've finished it all. What does it cost to just say "screw it" and just go for the smaller one?

In other words: how much more am I paying for the jam in the small jar, compared to what I'd pay if they gave it to me at the same unit price as the big jar?

With the small jar, I'm paying 35 cents an ounce. With the big jar, I'd be paying 25 cents an ounce. So, I'm "wasting" ten cents an ounce by buying the smaller 12 ounce jar. That's a cost of $1.20 for the privilege of not having to upgrade to the bigger one.

That flat cost is something that works for me, that I often calculate while shopping. I can easily decide if it's worth $1.20 to me to not have to take home twice as much jam. 


So here's an example of the kind of unit price label I'd like to see:

-- This size: 12 ounces at $0.35 per ounce

-- Next larger size: 12 extra ounces at $0.15 per extra ounce (57% savings)

-- This size costs $1.20 more than the equivalent quantity purchased in the next larger size.
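A label like that could be generated mechanically from any two package sizes. Here's a sketch (the function name and wording are mine, just to show the calculation), with the jam numbers plugged in:

```python
def marginal_label(small_oz, small_price, large_oz, large_price):
    """Build the proposed three-line unit price label for the smaller size."""
    unit_small = small_price / small_oz
    unit_large = large_price / large_oz
    extra_oz = large_oz - small_oz
    marginal_unit = (large_price - small_price) / extra_oz
    savings = 1 - marginal_unit / unit_small
    # The flat cost of NOT upgrading: the small jar's contents, repriced
    # at the big jar's unit price, versus what you actually pay.
    premium = (unit_small - unit_large) * small_oz
    return (
        f"-- This size: {small_oz} ounces at ${unit_small:.2f} per ounce\n"
        f"-- Next larger size: {extra_oz} extra ounces at "
        f"${marginal_unit:.2f} per extra ounce ({savings:.0%} savings)\n"
        f"-- This size costs ${premium:.2f} more than the equivalent "
        f"quantity purchased in the next larger size."
    )

print(marginal_label(12, 4.20, 24, 6.00))
```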

I'd love to see some supermarket try this before CR makes it illegal.
