Wednesday, March 25, 2015

Does WAR undervalue injured superstars?

In an article last month, "Ain't Gonna Study WAR No More" (subscription required), Bill James points out a flaw in WAR (Wins Above Replacement), when used as a one-dimensional measure of player value. 

Bill gives an example of two hypothetical players with equal WAR, but not equal value to their teams. One team has Player A, an everyday starter who created 2.0 WAR over 162 games. Another team has Player B, a star who normally produces at the rate of 4.0 WAR, but one year created only 2.0 WAR because he was injured for half the season.

Which player's team will do better? It's B's team. He creates 2.0 WAR, but leaves half the season for someone from the bench to add more. And, since bench players create wins at a rate higher than 0.0 -- by definition, since 0.0 is the level of player that can be had from AAA for free -- you'd rather have the half-time player than the full-time player.

This seems right to me, that playing time matters when comparing players of equal WAR. I think we can tweak WAR to come up with something better. And, even if we don't, I think the inaccuracy that Bill identified is small enough that we can ignore it in most cases.

------

First: you have to keep in mind what "replacement" actually means in the context of WAR. It's the level of a player just barely not good enough to make a Major League roster. It is NOT the level of performance you can get off the bench.

Yes, when your superstar is injured, you often do find his replacement on the bench. That causes confusion, because that kind of replacement isn't what we really mean when we talk about WAR.

You might think -- *shouldn't* it be what we mean? After all, part of the reason teams keep reasonable bench players is specifically in case one of the regulars gets injured. There is probably no team in baseball that, when its 4.0 WAR player goes down the first day of the season, can't replace at least a portion of those wins with an available player. So if your centerfielder normally creates 4.0 WAR, but you have a guy on the bench who can create 1.0 WAR, isn't the regular really only worth 3.0 wins in a real-life sense?

Perhaps. But then you wind up with some weird paradoxes.

You lease a blue Honda Accord for a year. It has a "VAP" (Value Above taking Public Transit) of, say, $10,000. But, just in case the Accord won't start one morning, you have a ten-year-old Sentra in the garage, which you like about half as much.

Does that mean the Accord is only worth $5,000? If it disappeared, you'd lose its $10,000 contribution, but you'd gain back $5,000 of that from the Sentra. If you *do* think it's only worth $5,000 ... what happens if your neighbor has an identical Accord, but no Sentra? Do you really want to decide that his car is twice as valuable as yours?

It's true that your Accord is worth $5,000 more than what you would replace it with, and your neighbor's is worth $10,000 more than what he would replace it with. But that doesn't seem reasonable as a general way to value the cars. Do you really want to say that Willie McCovey has almost no value just because Hank Aaron is available on the bench?

------

There's also another accounting problem, one that commenter "Guy123" pointed out on Bill's site. I'll use cars again to illustrate it.

Your Accord breaks down halfway through the year, for a VAP of $5,000. Your mother has only an old Sentra, which she drives all year, for an identical VAP of $5,000.

Bill James' thought experiment says, your Accord, at $5,000, is actually worth more than your mother's Sentra, at $5,000 -- because your Accord leaves room for your own Sentra to add value later. In fact, you get $7,500 in VAP -- $5,000 from half a year of the Accord, and $2,500 from half a year of the Sentra.

Except that ... how do you credit the Accord for the value added by the Sentra? You earned a total of $7,500 in VAP for the year. Normal accounting says $5,000 for the Accord, and $2,500 for the Sentra. But if you want to give the Accord "extra credit," you have to take that credit away from the Sentra! Because, the two still have to add up to $7,500.

So what do you do?

------

I think what you do, first, is not base the calculation on the specific alternatives for a particular team. You want to base the calculation on the *average* alternative, for a generic team. That way, your Accord winds up worth the same as your neighbor's.

You can call that, "Wins Above Average Bench." If only 1 in 10 households has a backup Sentra, then the average alternative is one tenth of $5,000, or $500. So the Accord has a WAAB of $9,500.

All this needs to happen because of a specific property of the bench -- it has better-than-replacement resources sitting idle.

When Jesse Barfield has the flu, you can substitute Hosken Powell for "free" -- he would just be sitting on the bench anyway. (It's not like using the same starting pitcher two days in a row, which has a heavy cost in injury risk.)

That wouldn't be the case if teams didn't keep extra players on the bench, like if the roster size for batters were fixed at nine. Suppose that when Jesse Barfield has the flu, you have to call Hosken Powell up from AAA. In that case, you DO want Wins Above Replacement. It's the same Hosken Powell, but, now, Powell *is* replacement, because replacement is AAA by definition.

Still, you won't go too wrong if you just stick to WAR. In terms of just the raw numbers, "Wins Above Replacement" is very close to "Wins Above Average Bench," because the bottom of the roster, the players that don't get used much, is close to 0.0 WAR anyway.

For player-seasons between 1982 and 1991, inclusive, I calculated the average offensive expectation (based on a weighted average of surrounding seasons) for regulars vs. bench players. Here are the results, in Runs Created per 405 outs (roughly a full-time player-season), broken down by "benchiness" as measured by actual AB that year:

500+ AB: 75
401-500: 69
301-400: 65
201-300: 62
151-200: 60
101-150: 59
 76-100: 45
 51- 75: 33

A non-superstar everyday player, by this chart, would probably come in at around 70 runs. A rule of thumb is that everyday players are worth about 2.0 WAR. So, 0.0 WAR -- replacement level -- would be about 50 runs.

The marginal bench AB, the ones that replace the injured guy, would probably come from the bottom four rows of the chart -- maybe around 55. That's 5 runs above replacement, or 0.5 wins. 

So, the bench guys are 0.5 WAR. That means when the 4.0 guy plays half a season, and gets replaced by the 0.5 guy for the other half, the combination is worth 2.25 WAR, rather than 2.0 WAR. As Bill pointed out, the WAR accounting credits the injured star with only 2.0, so he comes out looking only as good as the full-time guy.

But if we switch to WAAB ... now, the full-time guy is 1.5 WAAB (2.0 minus 0.5). The half-time star is 1.75 WAAB (4.0 minus 0.5, all divided by 2). That's what we expected: the star shows more value.
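Here's that arithmetic as a little Python sketch. The 0.5 bench level, the player values, and the proration are all just the estimates from this post -- none of this is part of any official WAR formula:

    BENCH = 0.5  # full-season WAR of the marginal bench player (estimated above)

    def slot_war(star_rate, fraction_played):
        """Total WAR the roster slot produces: the star plus the bench fill-in."""
        return star_rate * fraction_played + BENCH * (1 - fraction_played)

    def waab(star_rate, fraction_played):
        """Wins Above Average Bench, prorated by playing time."""
        return (star_rate - BENCH) * fraction_played

    print(slot_war(2.0, 1.0))  # 2.00 -- Player A's team
    print(slot_war(4.0, 0.5))  # 2.25 -- Player B's team, bench included
    print(waab(2.0, 1.0))      # 1.50 WAAB for the full-time guy
    print(waab(4.0, 0.5))      # 1.75 WAAB for the half-time star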

But: not by much. 0.25 wins is 2.5 runs, which is a small discrepancy compared to the randomness of performance in general. And even that discrepancy is random, since something as large as a quarter of a win only shows up when a superstar loses half the season to injury. The only time when it's large and not random is probably star platoon players -- but there aren't too many of those.

(The biggest benefit to accounting for the bench might be when evaluating pitchers, who, unlike hitters, vary quite a bit in how much they're physically capable of playing.)

I don't see it as that big a deal at all. I'd say, if you want, when you're comparing two batters, give the less-used player a bonus of 0.1 WAR for each 100 AB of playing time.

Of course, that estimate is very rough ... the 0.1 wins could easily be 0.05, or 0.2, or something. Still, it's going to be fairly small -- small enough that I'd bet it wouldn't change too many conclusions that you'd reach if you just stuck to WAR.






Tuesday, December 02, 2014

Players being "clutch" when targeting 20 wins -- a follow-up

In his 2007 essay, "The Targeting Phenomenon" (subscription required), Bill James discussed how there are more single-season 20-game winners than 19-game winners. That's the only place in the win distribution where this happens -- where the higher number occurs more frequently than the number just below it.

This is obviously a case of pitchers targeting the 20-win milestone, but Bill didn't speculate on the actual mechanisms for how the target gets hit. In 2008, I tried to figure it out. But, this past June, Bill pointed out that my conclusion didn't fit with the evidence:

"... the Birnbaum thesis is that the effect was caused one-half by pitchers with 19 wins getting extra starts, and one-half by poor offensive support by pitchers going for their 21st win, thus leaving them stuck at 20. But that argument doesn't explain the real life data. 

"[If you look closely at the pattern in the numbers,] the bulge in the data is exactly what it should be if 20 is borrowing from 19 -- and is NOT what it should be if 20 is borrowing both from 19 and 21."

(Here's the link.  Scroll down to OldBackstop's comment on 6/6/2014.)

So, I rechecked the data, and rethought the analysis, and ... Bill is right, as usual. The basic data was correct, but I didn't do the adjustments properly.

-----

My original study covered 1940 to 2007. This study, though, will cover only 1956 to 2000. That's because I couldn't find my original code and data. The "1956" is what I happened to have handy, and I decided to stop at 2000 because Bill did. 

First, here are the raw numbers of seasons with X wins:

17 wins: 159
18 wins: 132
19 wins:  92
20 wins: 113
21 wins:  56
22 wins:  35
23 wins:  20
24 wins:  20

You can see the bulge we're dealing with: there are way too many 20-win pitchers. And the excess can't come entirely from the 21-win bucket. If it did, the borrowing wouldn't change the combined total of the 20 and 21 buckets -- and their average, 84.5, would sit barely below the 19 bucket's 92, much too shallow a drop for a chart that declines steeply everywhere else. That can't be right. And, as Bill pointed out, even if only *half* the excess came from the 21 bucket, 20 would still be too big relative to 19.

So, let me try fixing the problem.

In the other study, I checked four ways in which 20 wins could get targeted:

1. Extra starts for pitchers getting close
2. Starters left in the game longer when getting close
3. Extra relief appearances for pitchers getting close
4. Better performance or luck when shooting for 20 than when shooting for 21.

I'll take those one at a time.

-------

1. Extra starts

The old study found that pitchers who eventually wound up at 19 or 20 wins did, in fact, get more late-season starts than others -- about 23 more overall. In this smaller study (1956-2000 instead of 1940-2007), that translates down to maybe 18 extra starts. 

That's about 9 extra wins. Let's allocate four of them to pitchers who wound up at 19 instead of 18, and the other five to pitchers who wound up at 20 instead of 19. If we back that out of the actual data, we get:

18 wins: 132 136
19 wins:  92  93
20 wins: 113 108
21 wins:  56  56

(If you're reading this on a newsfeed that doesn't support font variations: the first column is the old values, which should be struck out.)

What happens is: the 18 bucket gets back the four pitchers who won 19 instead. The 19 bucket loses those four pitchers, but gains back the five pitchers who won 20 instead of 19. The 20 bucket loses those five pitchers.
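Here's that bucket-shifting as a tiny Python sketch, just to make the accounting concrete. The function is an illustration of the step above, not code from the study:

    # moves[w] = number of pitchers who reached w wins only because of the
    # effect being removed; backing out drops each one into the bucket below.

    def back_out(buckets, moves):
        adjusted = dict(buckets)
        for wins, count in moves.items():
            adjusted[wins] -= count      # leave the bucket they reached...
            adjusted[wins - 1] += count  # ...and return to the one below
        return adjusted

    raw = {18: 132, 19: 92, 20: 113, 21: 56}
    print(back_out(raw, {19: 4, 20: 5}))  # extra starts: 4 reached 19, 5 reached 20
    # {18: 136, 19: 93, 20: 108, 21: 56} -- matches the table above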

(In the other study, I didn't bother doing this, backing out the effects when I found them, so I wound up taking some of them from the wrong place, which caused the problem Bill found.)

So, we've closed the gap from 21 down to 15.

--------

2. Starters left in the game longer

After I had posted the original study, Dan Rosenheck commented,
"You didn't look at innings per start. I bet managers leave guys with 19 W's in longer if they are tied or trailing in the hope that the lineup will get them a lead before they depart."

I checked, and Dan was right. In a subsequent comment, I figured Dan's explanation accounted for about 10 extra twenty-game winners. Those are all taken from the 19-game bucket, because the effect occurred only for starters currently pitching with 19 wins.

For this smaller dataset, I'll reduce the effect from 10 seasons to 7. 

So:

18 wins: 136 136
19 wins:  93 100
20 wins: 108 101
21 wins:  56  56

Now, the bulge is down to 1.  We still have a ways to go, if the 19 is to be significantly higher than the 20, but we're getting there.

---------

3. Extra Relief Appearances

The other study listed every pitcher who got a win in relief while nearing 20 wins. Counting only the ones from 1956 to 2000, we get:

3 pitchers winding up at 19
5 pitchers winding up at 20
2 pitchers winding up at 21

Backing those out:

18 wins: 136 139
19 wins: 100 102
20 wins: 101  98
21 wins:  56  54

The gap now goes the proper direction, but only slightly.

------

4. Luck

This was the most surprising finding, and the one responsible for the "getting stuck at 20" phenomenon. Pitchers who already had 20 wins were unusually unlikely to get to 21 in a subsequent start. Not because they pitched any worse, but because they got poor run support from their offense.

When Bill pointed out the problem, I wondered if the run-support finding was just a programming mistake. It wasn't -- or, at least, when I rewrote the program, from scratch, I got the same result.

For every current starter win level, here are the pitchers' W-L records in those starts, along with the team's average runs scored and allowed:

17 wins:   483-311 .608   4.30-3.61
18 wins:   350-250 .583   4.30-3.61
19 wins:   260-182 .588   4.24-3.56
20 wins:   150-136 .524   3.81-3.54
21 wins:    94- 61 .606   4.49-3.44
22 wins:    59- 23 .720   4.26-2.80

The run support numbers are remarkably consistent -- except at 20 wins. Absent any other explanation, I assume that's just a random fluke.

If we assume that the 20-win starters "should have" gone 171-115 (.598) instead of 150-136 (.524), that makes a difference of 21 wins.
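One way to sanity-check the .598 figure: pool the W-L records at all the other win levels, and apply that percentage to the 286 decisions of the 20-win starters. (This is an approximation, not the study's exact calculation, but it lands within a win of it.)

    records = [(483, 311), (350, 250), (260, 182), (94, 61), (59, 23)]  # 17,18,19,21,22
    wins = sum(w for w, l in records)
    decisions = sum(w + l for w, l in records)
    pct = wins / decisions             # about .601

    print(round(pct * (150 + 136)))    # about 172 expected wins -- within a win
                                       # of the 171-115 (.598) figure above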

The mistake I made in the previous study was to assume that those wins were all stolen from the "21-win" bucket. Some were, but not all. Some of the unlucky pitchers eventually got past the 20-win mark; a few, for instance, went on to post 23 wins. In their case, it becomes the 23-win bucket stealing a player from the 24-win bucket.

I checked the breakdown. For every starter who tried for his 21st win but didn't achieve it that game, I calculated where he eventually finished the season. From there, I scaled the totals down to 21, the number of wins lost to bad luck. The result:

20  wins:  9 pitchers
21  wins:  5 pitchers
22  wins:  3 pitchers
23  wins:  1 pitcher
24  wins:  2 pitchers
25+ wins:  less than 1 pitcher

So: the 20-win bucket stole 9 pitchers from the 21-win bucket. The 21-win bucket stole 5 pitchers from the 22-win bucket. And so on. 

Adjusting the overall numbers gives this:

18 wins: 139 139
19 wins: 102 102
20 wins:  98  89
21 wins:  54  50
22 wins:  35  33

-------

And that's where we wind up. It's still not quite enough, to judge by Bill's formula and even just the eyeball test. It still looks like there's a little bulge at 20, by maybe five pitchers. If 20 could steal five more pitchers from 19, we'd be at 107/84, which would look about right.

But, we've done OK. We started with a difference of +21 -- that is, 21 more twenty-game winners than nineteen-game winners -- and finished with a difference of -13. That means we found an explanation for 34 games, out of what looks like a 39-game discrepancy.

Where would the other five come from? I don't know. It could be luck and rounding errors. It could also be that the years 1956-2000 aren't a representative sample of the original study, so we lost a bit of accuracy when I scaled down.  Or, it could be some fifth real factor I haven't thought of.

In any case, here's the final breakdown of the number of "excess" 20-game winners:

-- 5 from getting extra starts;
-- 7 from being left in games longer than usual;
-- 3 from getting extra relief appearances;
-- 9 from bad run support getting them stuck at 20;
-- 5 from luck/rounding/sources unknown.

By the way, one important finding still stands through both studies. Starters didn't seem to pitch any better than normal with their 20th win on the line, so you can't accuse them of trying harder in the service of a selfish personal goal.





Sunday, September 28, 2014

Experts

Bill James doesn't like to be called an "expert." In the "Hey Bill" column of his website, he occasionally corrects readers who refer to him that way. And, often, Bill will argue against the special status given to "experts" and "expertise."

This, perhaps understandably, puzzles some of us readers. After all, isn't Bill's expertise the reason we buy his books and pay for his website?  In other fields, too, most of what we know has been told to us by "experts" -- teachers, professors, noted authors. Do we want to give quacks and ignoramuses the same respect as Ph.D.s?

What Bill is actually arguing, I think, is not that expertise is useless -- it's that in practice, it's used to fend off argument about what the "expert" is saying.  In other words, titles like "expert" are a gateway to the fallacy of "argument from authority."

On November 8, 2011 (subscription required), Bill replied to a reader's question this way:


"I've devoted my whole career to battling AGAINST the concept of expertise. The first point of my work is that it DOESN'T depend on expertise. I am constantly reminding the readers not to regard me an expert, because that doesn't have anything to do with whether what I have to say is true or is not true."

In other words: don't believe something because an "expert" is saying it. Believe it because of the evidence. 

(It's worth reading Bill's other comments on the subject; I wasn't able to find links to everything I remember, but check out the "Hey Bill" pages for November, 2011; April 18, 2012; and August/September, 2014.)

Anyway, I'd been thinking about this stuff lately, for my "second-hand knowledge" post, and Bill's responses got me thinking again. Some of my thoughts on the subject echo Bill's, but all opinions here are mine.

-------

I think believing "experts" is useful when you're looking for the standard, established scientific answer.  If you want to know how far it is from the earth to the sun, an astronomer has the kind of "expertise" you can probably accept.

We grow up constantly learning things from "experts," people who know more than we do -- namely, parents and teachers. Then, as adults, if we go to college, we learn from Ph.D. professors. 

Almost all of our formal education comes from learning from experts. Maybe that's why it seems weird to hear that you shouldn't believe them. How else are you going to figure out the earth/sun distance if you're not willing to rely on the people who have studied astronomy?

As I wrote in that previous post, it's nice to be able to know things on your own, directly from the evidence. But there's a limit to how much we can know that way. For most factual questions, we have to rely on other people who have done the science that we can't do.

-------

The problem is: in our adult, non-academic lives, the people we call "experts" are rarely used that way, to resolve issues of fact. Few of the questions in "Ask Bill" are about basic information like that. Most of them are asking for opinion, or understanding, or analysis. They want to pick Bill's brain.

From 1/31/2011: "Would you have any problem going with a 4-man rotation today?"

From 10/7/2013: "Bill, you wrote in an early Abstract that no one can learn to hit at the major league level. Do you still believe that?"

From 10/29/2012: "Do you think baseball teams sacrifice bunt too much?"

In those cases, sure, you're better off asking Bill than asking almost anyone else, in my opinion. Even so, you shouldn't be arguing that Bill is right because he's an "expert."  

Why?  Because those are questions that don't have an established, scientific answer based on evidence. In all three cases, you're just getting Bill's opinion. 

Moreover: all three of those issues have been debated forever, and there's *still* no established answer. That means there are opinions on both sides. What makes you think the expert you're currently asking is on the correct side? Bill James doesn't think a four-man rotation is a bad idea, but any number of other "experts" believe the opposite. 

Subject-matter experts should agree on the basic canon, sure. It should be rare that a physics "expert" picks up a textbook and has serious disagreements with anything inside.

But: they can only agree on answers that are known. In real life, most interesting questions don't have an answer yet. That's what makes them so interesting!

When will we cure cancer? What's the best way to fight crime? When should baseball teams bunt? Will the Seahawks beat the spread?

Even the expertest expert doesn't know the answer to those questions. Some of them are unknowable. If anyone was "expert" enough to predict the outcome of football games, he'd be the world's richest gambler. 

-----

All you can really expect from an expert is that he or she knows the state of the science.  An expert is an encyclopedia of established knowledge, with enough understanding and experience to draw inferences from it in established ways.

Expertise is not the same as intelligence. It is not the same as wisdom. It is not the same as insight, or freedom from bias, or prescience, or rationality.

And that's why you can get different "experts" with completely different views on the exact same question, each of them thinking the other is a complete moron. That's especially true on controversial issues. (Maybe it's not that controversial issues are less likely to have real answers, but that issues that have real answers are no longer controversial.)

On those kinds of issues, where you know there are experts on both sides, you might as well flip a coin as rely on any given expert.

And hot-button issues are where you find most of the "experts" in the media or on the internet, aren't they?  I mean, you don't hear experts on the radio talking about how many neutrons are in an atom of vanadium. You hear them talking about what should be done to revive the sagging economy. Well, there's no consensus answer for that. If there were, the Fed would have implemented it long ago, and the economy would no longer be sagging.

Indeed, the fact that nobody is taking the expert's advice is proof that there must be other experts that think he's wrong.

Sometimes, still, I find myself reading something an expert says, and nodding my head and absorbing it without realizing that I'm only hearing one side. We don't always consciously notice the difference, in real time, between consensus knowledge and the "expert's" own assertions.

Part of the reason is that they're said in the same authoritative tone, most of the time. Listen to baseball commentators. "Jeter is hitting .302." "Pitching is 75 percent of baseball." You really have to be paying attention to notice the difference. And, if you don't know baseball, you have no way of knowing that "75 percent of baseball" isn't established fact! At least, until you hear someone dispute it.

Also, I think we're just not used to the idea that "experts" are so often wrong. For our entire formal education, we absorb what they teach us about science as unquestionably true. Even though we understand, in theory, that knowledge comes from the scientific method ... well, in practice, we have found that knowledge comes from experts telling us things and punishing us for not absorbing them.  It's a hard habit to break.

------

The fact is: for every expert opinion, you can find an equal and opposite expert opinion. 

In that case, if you can't assume someone's right just because he's an expert, can you maybe figure out who's right by *counting* experts?

Maybe, but not necessarily. As Bill James wrote (9/8/14),


"An endless list of experts testifying to falsehood is no more impressive than one."

It used to be that an "endless list" of experts believed that W-L record was the best indication of a pitcher's performance. It used to be that almost all experts believed homosexuality was a disease. It used to be that almost no experts believed that gastritis was caused by bacteria -- until a dissenting researcher proved it by drinking a beaker of the offending strain. 

Each of those examples (they're mine, not Bill's) illustrates a different way experts can be wrong. 

In the first case, pitcher wins, the expert conventional wisdom never had any scientific basis -- it just evolved, somehow, and the "experts" resisted efforts to test it. 

In the second case, homosexuality, I suspect a big part of it was the experts interpreting the evidence to conform to their pre-existing bias, knowing that it would hurt their reputations to challenge it. 

In the third case ... that's just the scientific method working as promised. The existing hypothesis about gastritis was refuted by new evidence, so the experts changed their minds. 

Bill has a fourth case, the case of psychiatric "expert witnesses" who just made stuff up, and it was accepted because of their credentials. From "Hey Bill," 11/10/2011 and 11/11/2011:


"Whenever and wherever someone is convicted of a crime he did not commit, there's an expert witness in the middle of it, testifying to something that he doesn't actually know a damned thing about.  In the 1970s expert witnesses would testify to the insanity of anybody who could afford to pay them to do so."

"Expert witnesses are PRAISED by professional expert witnesses for the cleverness with which they discuss psychological concepts that simply don't exist."

In none of those cases would you have got the right answer by counting experts. (Well, maybe in the third case, if you counted after the evidence came out.)  

Actually, I'm cheating here. I haven't actually shown that the majority isn't USUALLY right. I've just shown that the majority isn't ALWAYS right. 

It's quite possible that those four cases were rare exceptions: that, most of the time, when the majority of experts agree, they're generally right. Actually, I think that's true, that the majority is usually right -- but I'm only willing to grant that for the "established knowledge" cases, the "distance from the earth to the sun" issues. 

For issues that are legitimately in dispute, does a majority matter?  And does the size matter?  Does an 80/20 split among experts really mean significantly more reliability than a 70/30 split?

Maybe. But if you go by that, it's not *knowing*, right?  It's just handicapping. 

Suppose 70% of doctors believe X. And suppose that, historically, when seventy percent of doctors believed something, 9 out of 10 of those beliefs turned out to be true. In that case, you can't say, "you must trust the majority of experts."  You have to say, at best, "there's a 9 out of 10 chance that X is true."

But maybe I can say more, if I actually examine the arguments and evidence.

I can say, "well, I've examined the data, and I've looked at the studies, and I have to conclude that this is the 1 time out of 10 that the majority is dead wrong, and here is the evidence that shows why."  

And you have no reply to that, because you're just quoting odds.

And that's why evidence trumps experts. 

Here's Bill James on climate scientists, 9/9/2014 and 9/10/2014:


"[You should not believe climate scientists] because they are experts, no. You should believe them if they produce information or arguments that you find persuasive. But to believe them BECAUSE THEY ARE EXPERTS -- absolutely not.

"It isn't "consensus" that settles scientific disputes; it is clear and convincing evidence. An issue is settled in science when evidence is brought forward which is so clear and compelling that everyone who looks at the evidence comes to the same conclusion. ... The issue is NOT whether scientists agree; it is whether the evidence is compelling."

If you want to argue that something is true, you have two choices. You can argue from the evidence. Or, you can argue from the secondhand evidence of what the experts believe. 

But: the firsthand evidence ALWAYS trumps the secondhand evidence. Always. That's the basis of the entire scientific method, that new evidence can drive out an old theory, no matter how many experts and Popes believe otherwise, and no matter how strongly they believe it.

You're talking to Bob, a "denier" who doesn't believe in climate change. You say to Bob, "how can you believe what you believe, when the scientists who study this stuff totally disagree with you?"

If Bob replies, "I have this one expert who says they're wrong" ... well, in that case, you have the stronger argument: you have, maybe, twenty opposing experts to his one. By Bob's own logic -- "trust experts" -- the probabilities must be on your side. You haven't proven climate change is real, but you've convincingly destroyed Bob's argument. 

However: if Bob replies, "I think your twenty experts are wrong, and here's my logic and evidence" -- well, in that case, you have to stop arguing. He's looking at firsthand evidence, and you're not. Your experts might still be right, because maybe he's got bad data, or he's misinterpreting his evidence, or his worthless logic comes out of the pages of the Miss America Pageant. Still, your argument has been rendered worthless because he's talking evidence, which you're not willing or able to look at directly.

As I wrote in 2010,


"Disbelieving solely because of experts is NOT the result of a fallacy. The fallacy only happens when you try to use the experts as evidence. Experts are a substitute for evidence. 

"You get your choice: experts or evidence. If you choose evidence, you can't cite the experts. If you choose experts, you can't claim to be impartially evaluating the evidence, at least that part of the evidence on which you're deferring to the experts. 

"The experts are your agents -- if you look to them, it's because you are trusting them to evaluate the evidence in your stead. You're saying, "you know, your UFO arguments are extraordinary and weird. They might be absolutely correct, because you might have extraordinary evidence that refutes everyone else. But I don't have the time or inclination to bother weighing the evidence. So I'm going to just defer to the scientists who *have* looked at the evidence and decided you're wrong. Work on convincing them, and maybe I'll follow."  

In other words: it's perfectly legitimate to believe in climate change because the scientific consensus is so strong. It is also legitimate to argue with people who haven't looked at the evidence and have no firsthand arguments. But it is NOT legitimate to argue with people who ARE arguing from the evidence, when you aren't. 

That they're arguing first-hand, and you're not, doesn't necessarily mean you're wrong. It  just means that you have no argument or evidence to bring to the table. And if you have no evidence in a scientific debate, you're not doing science, so you need to just ... well, at that point, you really need to just shut up.

The climate change debate is interesting that way, because most of the activist non-scientists who believe it's real haven't really looked at the science enough to debate it. A large number have *no* firsthand arguments, except the number of scientists who believe it.

As a result, it's kind of fun to watch their frustration. Someone comes up with a real argument about why the data doesn't show what the scientists think it does, and ... the activists can't really respond. Like me, most have no real understanding of the evidence whatsoever. They could say, like I do to the UFO people, "prove it to the scientists and then I'll listen," but they don't. (I suspect they think that sounds like they're taking the deniers seriously.)

So, they've taken to ridiculing and name-calling and attacking the deniers' motivations. 

To a certain extent, I can't blame them. I'm in the same situation when I read about Holocaust deniers. I mean the serious ones, the "expert" deniers, the ones who post blueprints of the death camps and prepare engineering and logistics arguments about how it wasn't possible to kill that many people in that short a time. And what can I do?  I let other expert historians argue their evidence (which fortunately, they do quite vigorously), and I gnash my teeth and maybe rant to my friends.

That's just the way it has to be. You want to argue, you have to argue the evidence. You don't bring a knife to a gunfight, and you don't bring an opinion poll to a scientific debate.


Sunday, January 26, 2014

Number grades aren't really numbers

Recently, Consumer Reports (CR) evaluated the "Model S", the new electric car from Tesla.  It got their highest rating ever: 99 points out of a possible 100.

I was thinking about that, and I realized that the "99 out of 100" scale is the perfect example of what Bill James was talking about when, in the 1982 Abstract, he wrote about how we use numbers as words:

"Suppose that you see the number 48 in a player's home run column... Do you think about 48 cents or 48 soldiers or 48 sheep jumping over a fence? Absolutely not. You think about Harmon Killebrew, about Mike Schmidt, about Ted Kluszewski, about Gorman Thomas. You think about power.
"In this way, the number 48 functions not as a number, as a count of something, but as a word, to suggest meaning. The existence of universally recognized standards -- .300, 100 RBI, 20 wins, 30 homers -- ... transmogrifies the lines of statistics into a peculiar, precise form of language. We all know what .312 means. We all know what 12 triples means, and what 0 triples means. We know what 0 triples means when it is combined with 0 home runs (slap hitter, chokes up and punches), and we know what it means when it is combined with 41 home runs (uppercuts, holds the bat at the knob, can't run and doesn't need to). ..."

That's exactly what's going on with the 99/100, isn't it?  The 99 doesn't mean anything, as a number.  It's short for, "really, really outstanding," but said in a way that makes it look objective.  

If you try to figure out what the 99 means as a number ... well, there isn't anything.  As great a car as the Tesla might be, is it really 99% of the way to perfection?  Is it really that close to the most perfect car possible?   

One thing CR didn't like about the car was that the air conditioning didn't work that great.  Is that the missing point?  If Tesla fixes the A/C, does the rating bump to 100?  

If so, what happens when Tesla improves the car even more?  If they put in a bigger battery to get better acceleration, or they improve the handling, or they improve the ergonomics, what's CR going to do -- give them 106 out of 100?

Does it really make sense to fashion a rating on the idea that you can compare something to "perfect"?

----

To get the "99", CR probably broke the evaluation into categories -- handling, acceleration, comfort, etc. -- and then rated the car on each, maybe out of 10.

The problem with getting rated that way is that if you underperform in one category, you can't make up for it in another in which you're already excellent.

Suppose a car has a tight back seat, and gets docked a couple of points.  And the manufacturer says, "well, it's a tradeoff.  We needed to do that to give the car a more aerodynamic shape, to save gas and get better acceleration."  Which is fine, if it's a mediocre car: in that case, the backseat rating drops from (say) 7 to 5, but the acceleration rating jumps from 6 to 8.  The tradeoff keeps the rating constant, as you'd expect.

But, what if the car is really, really good already?  The backseat rating goes from 10 to 8, and the acceleration rating goes from 10 to ... 12?  

The "out of 10" system arbitrarily sets a maximum "goodness" that it will recognize, and ignores everything above that.  That's fine for evaluating mediocrity, but it fails utterly when trying to evaluate excellence.  

-----

In the same essay, Bill James added,

"What is disturbing to some people about sabermetrics is that in sabermetrics we use the numbers as numbers. We add them together; we multiply and divide them."

That, I think, is the difference between numbers that are truly numbers, and numbers that are just words.  You can only do math if numbers mean something numeric.

The "out of 100" ratings don't.  They're not really numbers, they're just words disguised as numbers.

We even acknowledge that they're words by how we put the article in front of them.  "I got AN eighty in geography."  It's not the number 80; it's a word that happens to be spelled with numbers instead of letters.

It's part of the fabric of our culture, what "out of 100" numbers mean as words.  Think of a report card.  A 35 is a failure, and a 50 is a bare pass, and a 60 is mediocre, and a 70 is average, and an 80 is very good, and a 90 is outstanding, and a 95 is very outstanding.  It's just the way things are.  

And that's why treating them as numbers doesn't really work.  We try -- we still average them out -- but it's not really right.  For instance: Ann takes an algebra course and gets a 90, but doesn't take calculus or statistics.  Bob takes all three, but gets a grade of 35 in each.  Does Bob, with a total of 105, know more math than Ann, who only has 90?  No, he doesn't.  It's clear, intuitively, that Ann knows more than Bob.  That's true even if Ann *does* take calculus and statistics, but gets 0 in each.  Bob has the higher average, but Ann really did better in math.

On the other hand, there are situations in which we CAN take those numbers as numbers -- but only if we're not interpreting them with normal meanings on the "out of 100" scale.  Suppose Ann knows 90 percent of the answers (sorry, "questions") in Jeopardy's "Opera" category, but 0 percent in "Shakespeare" and "Potent Potables."  Bob knows 30 percent in each.  In this case, Bob and Ann *are* equivalent.  

What's the difference?  In Jeopardy, "30 out of 100" isn't a rating -- it's a measurement.  It's actually a number, not a word.  And so, it doesn't have the same meaning as it does as a rating. Because, here, 30 percent is actually pretty good.

If we said, "Bob got a grade of 30% in Potent Potables," that would be almost a lie.  A grade of 30% is a failure, and Bob did not fail.

-----

The calculus example touches on the system's biggest failing: it can't handle quantity, only quality.  In real life, it's almost always true that good things are better when you have more.  But the "out of 100" system doesn't handle that well.  

Suppose I have two mediocre cars, a Tahoe which rates a 58, and a Yaris that rates a 41.  Those add up to 99.  Are the two cars together, in any sense, equivalent to a Tesla?  I don't think so.

Or, suppose I have the choice: drive a Yaris for two years, or a Tesla for six months.  Can I say, "well, two years of the Yaris is 82 point-years, and six months of Tesla is 49.5 point-years, so I should go with the Yaris?"  Again, no.

What's better, watching two movies rated three stars, or watching three movies rated two stars?  It depends on the person, of course.  For me, I still wouldn't know the answer.  And, it completely depends on the context of what you're rating.  

My gut says that a five-star movie (to me) is worth at least ten three-star movies, and probably more.  But a night at a five-star hotel is worth a lot less than ten nights at a three-star hotel.

Rate your enjoyment of "The Hidden Game of Baseball," on a scale of 1 to 10.  Maybe, it's a 9.  Now, rate your enjoyment of the 1982 Baseball Abstract on the same scale.  Maybe that's a 9.5.  Fair enough.

Now, rate your enjoyment of the entire collected works of Pete Palmer.  I'll say 9.5.  Then, rate the entire collected works of Bill James.   I'd say 10.  

[If you're not familiar with sabermetrics books, try these translations: "The Hidden Game of Baseball" = "Catcher in the Rye".  "Pete Palmer" = "J.D. Salinger".  "1982 Baseball Abstract" = "Hamlet".  "Bill James" = "William Shakespeare".]

But: Bill James has probably published at least three times as many words as Pete Palmer!  And he gets no credit for that at all.  Because, Pete's other books (say, three of them) raised his rating half a point; and Bill's other books (say, ten of them) raised his rating by the same half a point!  If you insist on treating ratings as numbers, you have to conclude that you enjoyed each of Bill's subsequent books less than half as much as Pete's subsequent books.  

Even worse: if Bill James had written nothing else after 1988, I'd still give him a 10.  The "out of 10" rating system completely ignores that you liked Bill's later works.

The system just doesn't understand quantity very well.  

But, it *does* understand *quality*.  What if we can somehow collapse quantity into quality?

A few paragraphs ago, the entire collected works of Pete Palmer rated a 9.5 in enjoyment.  The entire collected works of Bill James rated a 10.

But: now, compare two Christmas gifts under the tree.  One, the entire collected works of Pete Palmer.  Second, the entire collected works of Bill James.  

Now, we start thinking about the quality of the gift, rather than the quality of the *enjoyment* of the gift.  And the quality of a gift does, obviously, depend on how much you're getting.  And so, my gut is now happy to rate the Pete Palmer gift as a 7, and the Bill James gift as a 9.  

They shouldn't really be different, right?  Because, effectively, the quality of the gift is exactly the same as the amount of enjoyment, almost by definition.  (It's not like I was thinking I could use the Bill James books as doorstops.)  But in the gift case, we're mentally equipped to calibrate by quantity.  In the enjoyment case, our brain has no idea how to make that adjustment. 





Wednesday, December 11, 2013

Explaining

When I was a kid, the adult science writer I read the most was Isaac Asimov.  He wasn't the most expert in any of the fields he wrote about (except, perhaps, biochemistry, which was his Ph.D.), but he was easy to read and understand.  

Some call that kind of writing "accessible," which, I guess, means that you don't need a lot of background to follow what the author is saying.  But I don't think that really captures it. It's been a while since I read any Asimov, but I bet that even in subjects where I have a fair bit of background -- math, say -- Asimov would still be a cleaner read than other authors.  I think Asimov's real skill is: he's just really, really good at explaining things.  In fact, he's been nicknamed "The Great Explainer."

Explaining is one of those important skills that, in my view, gets no respect at all.  Ask what makes a good teacher, and what do people say?  Motivating the students, and understanding every pupil's strengths and weaknesses, and being able to gauge the mood of the classroom, and being an interesting and varied speaker, and using multimedia and experiments, and knowing the subject, and stuff like that. But to me, the biggest thing is: finding explanations that students will actually understand.

In my life, there have been things that confused me for years, or that I understood but didn't "really" understand. Then, one day, either I figured it out for myself, or I read something that instantly ended my confusion.  And I asked myself, "Why the hell didn't anyone explain it properly before?"

For example ... for years, I was confused about one of the aspects of regression analysis.  Statisticians would say, "IQ accounts for 35 percent of the variance of salary, and parental income accounts for another 40 percent."  I wondered, how does that work?  What if the effect of education turns out to be as important as IQ?  Then, you have 110 percent!  What's going on?

But I just lived with it, until eventually I figured it out.  What's the explanation?  It's a law of nature that standard deviations are Pythagorean: independent sources of variation combine like the sides of a right triangle, so it's their squares -- the variances -- that add.  So you can never find independent factor "triangle sides" whose squares add up to more than 100% of the overall "hypotenuse".

And with that, it made sense.  Even though I knew the sum-of-squares thing in another context, I never made the connection, and it was never explained to me -- until I figured it out, two decades after my last statistics class.  
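If you want to see it work, here's a quick Python simulation.  The factor SDs (3, 4, and 2) are arbitrary, but the shares of variance have to come out summing to about 100 percent, like squared triangle sides:

    import random

    N = 100_000
    iq     = [random.gauss(0, 3) for _ in range(N)]
    income = [random.gauss(0, 4) for _ in range(N)]
    noise  = [random.gauss(0, 2) for _ in range(N)]
    salary = [a + b + c for a, b, c in zip(iq, income, noise)]

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Variances add: 3^2 + 4^2 + 2^2 = 29, the "hypotenuse" squared.
    for name, xs in (("iq", iq), ("income", income), ("noise", noise)):
        print(name, round(var(xs) / var(salary), 3))  # ~0.31, ~0.55, ~0.14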

OK, was that one too mathy?  Here's an easier one: why it takes 10 runs in baseball to create one additional win.  I knew it was right, but I understood why only in a roundabout kind of way.  My gut still had a vague notion that 10 runs was much too high, and I had to keep correcting my gut.  

But then I found an explanation where it really made sense to me.  If you prefer a shorter summary, here's Eric T. with the hockey version (6 goals = 1 win):


"... imagine taking an average team, picking six of its games at random, and giving the team an extra goal in each game.

"Three of those games will be games it won anyway, so your extra goal doesn't change the result. In another game or two, the team lost by two or more and your goal still doesn't help. Only occasionally do you turn a loss into a win (or overtime loss), and so in the end, your six extra goals only produce roughly two extra points."

-----

It's not just me, right?  "Why didn't anyone explain it that way before?" happens to everyone.

Think about something you understand well, but had a bit of trouble with in the beginning.  Don't you think that you could have learned it in half the time if it had been explained differently?  How much time is wasted struggling through murky explanations in pedantic textbooks, or incoherent notes from class, when you might have been able to understand it in five minutes if it had been done a bit better?

-----

There are many reasons I admire Bill James.  One of the biggest is his ability to explain the things he's discovered.  His explanations are ... well, I think they're nearly perfect.   He explains what happens, and why, and how his method works, and it all comes together so well that you can read it once, at normal human reading speed, and ... you just get it.  His explanations just penetrate your brain effortlessly.

A lot of that is that he's such a good writer, but that's not enough.  William Shakespeare was a good writer, but I wouldn't bet on him being able to explain Runs Created.  A good writer will say things well, but a good explainer will also choose the right things to say.

-----

I used to teach a class regularly, in how to use a certain niche software product.  There was no proper textbook, so I had to figure out how to explain it so that the students would actually get it.  Some things I did worked better than others, and I'd try adjusting what I did from class to class. Occasionally, I would think, "Geez, you know, it's a complicated subject ... this is probably as clear as it can be explained, and they're going to have to work a bit to actually get it."

But then, I would stop and think, "What would Bill James do?" And I would realize that if it were Bill, he would have a way to get the point across.  He would have found the right analogy, or the right story, or the right thread of logic.  

I've learned a lot of things, beyond just baseball, from following Bill's work over the last thirty years.  One of the most important is: nothing is so complicated that it can't be explained well.  If I try to explain something, and it's not working, and people are having to work hard at understanding ... I have to think: it's my fault.  My explanation isn't good enough. 




Tuesday, July 16, 2013

Is there evidence for a "hot hand" effect?

You're in the desert, exactly 10 miles south of home.  Instead of walking straight home, you decide to walk east one mile first.  How far away from home are you when you're done?

The answer: about 10.05 miles.  Your one-mile detour only added 0.05 miles to your return trip: one-half of one percent.

That's just a simple application of the Pythagorean theorem.  Draw a right triangle with sides 10 miles and 1 mile, and the hypotenuse will be 10.05 miles long.

------

That triangle thing is meant to explain why it might be difficult to find evidence for or against streakiness in baseball hitting expectations.

There's a belief that there are times when a player has a "hot hand", in the sense that he's somehow "dialed in" and is likely to perform better than his usual.  And, there might be periods when he's "cold," and should be expected to perform worse.  

Torii Hunter hit .370 in April, 2013.  Did he have a hot hand, or was he just lucky?  That's the question that we want to answer.  Maybe not specifically about Torii Hunter, but, in general ... when a player has a hot streak or a cold streak, is there something behind that?

The difficulty answering that is that there's a lot of luck involved, and it's hard to separate out the luck from the "hot hand" (if any).

If you assume each AB is independent of the previous one, then, over a month's worth -- say, 100 AB, which was Torii Hunter's total -- the SD of batting average, by luck alone, will be about 43 points.  That's massive ... it means that one time out of 20, a player will be 86 points off expected in a given month.

Now, suppose there's also a real "hot hand" effect, that's 1/10 the size of the luck effect.  Now, the SD of batting average, by luck and hot hands combined, will be one half of one percent higher, just like in the Pythagorean triangle example.  That's almost nothing -- .0002 in batting average.

Effectively, this level of hot hand is invisible, if it exists.

This finding might be good news to those who believe that players are often "on" or "off" (Morgan Ensberg is one of those believers).  But ... there are still things we *can* say about hot-hand effects, and those things won't be compatible with some of the ways people talk about streakiness.

-------

First, let's try to make an intelligent guess on how much hot-handedness can remain invisible.  As we saw, 10 percent of luck is impossible to see because the SD difference is only 0.5 percent.  If hot hand variation were 20 percent of luck, the SD would go up by two percent.  If hot hands were 30 percent of luck, the SD would go up by around 4.5 percent.  If it were 40 percent of luck, the SD would go up by around 7.7 percent. 
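All those percentages come from the same right-triangle arithmetic.  Here it is in Python, if you want to check my numbers:

    import math

    # Observed SD is the hypotenuse of the luck SD and the hot-hand SD, so a
    # hot-hand SD that's a fraction f of luck inflates it by sqrt(1 + f^2) - 1.
    for f in (0.10, 0.20, 0.30, 0.33, 0.40):
        print(f"{f:.0%} of luck -> SD up {math.sqrt(1 + f * f) - 1:.1%}")

    # 10% -> 0.5%,  20% -> 2.0%,  30% -> 4.4%,  33% -> 5.3%,  40% -> 7.7%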

Actually, let me quote Bill James (subscription required), from his recent essay on streakiness that inspired me to write this:


" ... how much causal variation is it reasonable to think might be completely hidden in a mix of causal and random variation? 

"Well, if it was 50-50, it would be extremely easy ... [i]f you add a second level of variation equal to the first, it will create obviously non-random patterns.

"If it 70-30—in other words, if the causal variation was roughly half the size of the random variation—that, again, would be easy to distinguish from pure random variation.    Even if it was 90-10, we should be able to distinguish between that and pure random variation.   If it was 99-1, maybe we would have a hard time telling one from the other."

I think Bill is a little optimistic on the 90-10 ... you're still talking one-half of one percent.  On 70-30, though, I think he's right.  That would be around a 9 percent increase in the standard deviation, which I think we'd be able to find without too much difficulty.

For the sake of this argument, let's say that the limit we can observe is ... I dunno, let's say, 75-25.  That would mean the SD of hot hands was 33 percent of the SD of luck, which means the overall SD goes up by around 5.3 percent.  I think we could find that if we looked.  I think.

I may be wrong ... substitute your own number if you prefer.

------

It might be legitimate for a hot-hand advocate to say to us, "well, you don't know for sure that there's no talent streakiness, because you admit that your methods can't pick it up."

Well, maybe.  But if you make that argument, don't forget to also limit yourself to hot hand effects that are 33 percent of luck.  That's around 14 points (again, that's batting average over 100 AB).

That means only one player in 20 will be more than 28 points off his expected batting average for any given month.  Only one player in 400 will be more than 42 points "hot" or "cold" for the month.

Are you prepared to live with that?  When you find ten players hitting 100 points off their expected for the month of August, you'd have to say, "Well, it's virtually impossible that *any* of them were *that* hot ... that's 7 standard deviations, which never happens.  They may have been a bit hot, but they were mostly lucky."

Seriously, are you prepared to say that?

There's more.  With a 75-25 mix, the correlation between "hot hand" and performance would be a bit over 0.3.  That means, if you find a player hitting 2 SD better than expected, his "hot hand" expectation is +0.6 SDs.  That's around 8 points.

Seriously.  When a career .250 player hits .340 in May, you'd have to be saying, "wow, I bet that player has a hot hand.  I bet in May, he was really a .258 hitter!!!"

My guess is, most "hot hand" enthusiasts wouldn't bite that bullet.  They don't just want "hot hands" to exist, they want them to be a big part of the game.  When they see a player on a hot streak, they want to believe that player is *significantly better*, that he's a force to be reckoned with.  "He's 8 points hot" is just not a very good narrative.

But that's pretty much what it amounts to.  If 75-25 is the right ratio, then only around TEN PERCENT of a player's observed "hotness" would come from a hot hand.  The other 90 percent would still be luck.
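Here's the regression-to-the-mean arithmetic behind those 8 points, as a Python sketch.  The one-third ratio is just our assumed 75-25 detection limit from above:

    import math

    SD_LUCK = 0.043            # luck SD over ~100 AB, from above
    SD_HOT = SD_LUCK / 3       # the 75-25 assumption: ~14 points
    SD_OBS = math.sqrt(SD_LUCK ** 2 + SD_HOT ** 2)

    r = SD_HOT / SD_OBS        # correlation of hot hand with observed: ~0.32
    estimate = 2 * r * SD_HOT  # expected true "hotness" of a player observed 2 SD hot
    print(round(r, 2), round(estimate, 4))  # ~0.32, ~.0091 -- call it 8 or 9 points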

-------

Now, one assumption in all those calculations is that "hot hands" are random and normally distributed.  Followers of streakiness would probably argue that's not the case.  Intuitively, it would seem that most players are just "themselves" most of the time, but, occasionally, someone will get exceptionally hot and really be 100 points better for a brief period.

For instance, maybe hot-handedness manifests itself in outliers.  Maybe 38 out of 40 players are just normal.  But, the 39th player is cold and drops 60 points, while the 40th is hot, and gains those 60 points.   That still works out to the SD of around 14 points that we hypothesized as our limit.

If that's the case, then, the "check the SD" method would still fail to find anything.

But ... now, we'd have other methods that would actually work.  We could count outliers.  If one player in 40 gains .060 every month, we should see a lot more exceptional months than we would otherwise.  

So, if that's how you think hot hands work, let us know.  Those kinds of hot hands *won't* be invisible.

------

And even if they exist at that level, it won't tell us much.  

Suppose a player has a month where he actually hits .060 better than normal.  Can you now say, "Ha!  See, that player had a hot hand!"

Nope.  Because, by the model, the chance of a player having a hot hand is 1 in 40.  But, the chance of a player having a +60 point month, just by luck, is, I think, something like 1 in 12.

A quick naive calculation is that, even in this extreme case, there's a 3-in-4 chance the player was just lucky!  (It's probably actually higher than that, if you do the proper calculation.)
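For the record, here's that quick naive version of the calculation:

    p_luck = 1 / 12   # chance of a +60 month from luck alone (from above)
    p_hot = 1 / 40    # chance of being the hot-hand outlier in this model

    print(p_luck / (p_luck + p_hot))  # ~0.77 -- about a 3-in-4 chance it was just luck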

------

So here's what I think we have:

1.  Moderate variations in talent might be impossible to disprove by the standard method.  For monthly "hot hands", that might be 14 points (.014) of batting average, as a guess.

2.  If those exist, they will rarely be higher than, say, 30 or 40 points.  So, when a .270 player hits .380 in August, it *can't* all be a hot hand.  It'll still be mostly luck.

3.  If hot hands manifest themselves in outliers, we should have no problem finding them.  The idea that players will regularly get "scorching hot" in talent ... well, I bet we could debunk that pretty easily.

4.  Even if hot hands do exist, unusual performances will be more driven by luck than by performance variations.  There will never be a case where you can look at a hot month, and have good reason to believe it wasn't just luck.





(Hat tip and more discussion: Tango)


