Wednesday, February 05, 2014

Ratings vs. measurements

Sabermetrics may have a lot of fancy new methods, but there's no stat that rates who the best players are.

That's not just because there's no sabermetric "holy grail" of the ultimate statistic.  The real reason is that such a thing isn't even possible.

You can talk about which player is "better" than another, but the problem is that there's no objective definition of "better".  You can't use statistics to help measure something when you can't even define what it is you're measuring.  

What sabermetrics CAN do is provide statistics that bear on the question of "best".  It can provide objective data that can inform your arguments.  But those arguments, like your definition of "best," are always going to be subjective.

The very first chapter of the 1982 Bill James Baseball Abstract was about comparing hitters.  Who was better?  Johnny Pesky, who had little power but hit for a high average and lots of walks?  Or Dick Stuart, who didn't get on base a whole lot, but regularly hit 30 home runs?

Bill was able to discover that there is a mathematical connection between a batting line and the number of runs that score.  Using that "Runs Created" formula, Bill found that Pesky's three best years were almost perfectly identical to Stuart's three best years.  They created almost the exact same number of runs, in almost exactly the same opportunities (outs).
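To make the idea concrete, here is a rough Python sketch of the basic version of Runs Created, (hits + walks) times total bases, divided by (at-bats + walks). The essay doesn't say which version of the formula Bill used for the Pesky/Stuart comparison, so the choice of the basic version is my assumption. The two batting lines are made up for illustration; they are not Pesky's or Stuart's actual numbers.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB).

    Assumption: this is the simplest published version of the formula,
    not necessarily the one used in the 1982 Abstract.
    """
    opportunities = at_bats + walks
    if opportunities == 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / opportunities


def outs_made(hits, at_bats):
    """Approximate batting outs as at-bats minus hits (ignores CS, GIDP, etc.)."""
    return at_bats - hits


# Hypothetical batting lines, invented for illustration only.
hitters = {
    "high-average, high-walk hitter": dict(at_bats=600, hits=190, walks=70, total_bases=240),
    "low-average slugger": dict(at_bats=580, hits=150, walks=40, total_bases=290),
}

for label, line in hitters.items():
    rc = runs_created(line["hits"], line["walks"], line["total_bases"], line["at_bats"])
    outs = outs_made(line["hits"], line["at_bats"])
    print(f"{label}: {rc:.1f} runs created in {outs} outs")
```

Two very different-looking batting lines can come out close to each other in runs created per out, which is the kind of comparison the Pesky/Stuart essay was making.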

Doesn't that settle the question?  Doesn't that prove that Pesky and Stuart are equally good?

No, it doesn't.  "Equal performances" doesn't necessarily imply "equally good hitters."  

Part of the problem, as Bill mentioned in his essay, is that there are a lot of things that Runs Created doesn't consider: park effects, differing quality of opposing pitchers, etc.  We haven't corrected for those.  My experience has been that when people assume there will eventually be a perfect statistic, this is what they're thinking of: that some day we'll have so much data that we can correct for everything, and be almost perfectly accurate in terms of run estimates.

But, that's not the real problem.  The real problem is that when you're rating players, you're not trying to figure out who created the most runs.  You're trying to figure out who is the *best*.

Who says the highest-rated batter should be the one who creates the most runs?  Sure, creating runs is a very big deal, and our ratings of "best" are tremendously better informed now that we have that information.  But, there are other factors to consider.

For instance: suppose it's obvious a player had a lucky "career year".  Wouldn't that influence your rating of who's better?  What if one of the players saw a platoon advantage much more than the other?  What if one player hit much better with runners on base, but the other hit much better when the score was close?

How do you deal with all those?  It's subjective, isn't it, how much weight you have to give each of those factors in a "best hitter" evaluation?  I don't see how there can possibly be a "right answer" when the question is so subjective and vague.

-----

There are two different ways someone could disagree with you about who's better.  They could disagree with your definition, or they could disagree with your measurement.

The measurement one is easy.  Suppose you think the better player is the one who creates the most runs.  

To measure that, you decide to use "Total Average," the Thomas Boswell stat that's basically bases divided by outs.  You'd calculate every player's stat, and point to the top ones, and say, "those are the best."  And a sabermetrician would come along and say, "well, that's not right.  You're counting a stolen base as equal to a single.  But the SB advances only one runner, while the single advances the batter and also every other baserunner."

And someone else would pull out some other evidence, like when Bill James showed that teams that steal lots of bases don't win as many games as teams who hit lots of singles.  And someone would come along with Pete Palmer's linear weights calculation, and show that a single is worth half a run, on average, while a steal is worth only a fifth of a run.
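Here's a small Python sketch of that measurement critique. It uses the two run values quoted above (a single worth about half a run, a steal about a fifth of a run) and compares them against a simplified "every base counts the same" tally. That tally is my stand-in for the bases side of Total Average, not Boswell's full formula, and the 30-single and 30-steal totals are hypothetical.

```python
# Approximate linear-weights values as quoted in the text above.
SINGLE_RUNS = 0.5   # a single is worth about half a run
STEAL_RUNS = 0.2    # a stolen base is worth about a fifth of a run


def bases_credited(singles, steals):
    """Simplified bases tally (assumption): every single and every steal counts as one base."""
    return singles + steals


def run_value(singles, steals):
    """Linear-weights style estimate: each event is weighted by its average run value."""
    return SINGLE_RUNS * singles + STEAL_RUNS * steals


# Two hypothetical players who differ only in HOW they gained 30 bases.
singles_hitter = dict(singles=30, steals=0)
base_stealer = dict(singles=0, steals=30)

for label, events in (("30 extra singles", singles_hitter), ("30 extra steals", base_stealer)):
    print(
        f"{label}: {bases_credited(**events)} bases credited, "
        f"about {run_value(**events):.1f} runs by linear weights"
    )
```

The bases tally treats the two players as identical, while the run-value estimate does not. That gap is exactly what the sabermetrician in the example is objecting to.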

That's the easy part, critiquing and improving a measurement.  The hard part, and the subjective part, is the definition of "better".  

If you think better means "more runs per out," and I think better means "more runs above replacement," then how do we resolve that?  I guess, like any other debate on what's "better" -- like a political debate, arguing back and forth and appealing to principle.  

Does affirmative action make society better or worse?  Is legal abortion a good thing?  What's better, taking a strict interpretation of the first amendment, or protecting minorities from hate speech?  What's a better performance, a .300 hitter, or a .290 hitter who hits .325 in the clutch?

-----

Even in a simple, two-dimensional case, it can be all gut feel.

We all want the best players to be in the Hall of Fame, right?  Well, some players are among the best because they were very good for many, many years -- Phil Niekro, say.  And some players are among the best because they were superb, but for fewer years -- Sandy Koufax.

What's the tradeoff?  What's the definition of "best" that can tell you whether Koufax or Niekro is "better"?

If you think about it, and look up the stats, your gut will likely come to some firm conclusion about which player is better.  But your intuitive feeling will be different from my intuitive feeling.  And, neither of us can articulate just what the tradeoff is.  Maybe we can say, "well, if Niekro had a few more good years, I'd prefer him."  Or, "I'd be more comfortable saying Koufax is better if he hadn't had those mediocre years at the beginning."  But, we won't be able to say, "I'm weighting a Cy Young Sandy Koufax year as 2.32 times as important as an average Jim Kaat year."

We have a strong intuition, but we can't explain it.

One of my favorite things that Bill James ever did was his "Hall of Fame Monitor" method.  He figured out a point system that scores each player on his Hall of Fame qualifications.  The way the system works, if you have 100 points or more, you'll probably be voted in, and if you don't, you won't.

And it works pretty well -- it does a good job of separating the players who have been enshrined from the ones who haven't.   (It's a prediction, not a recommendation -- it separates on what the voters *did*, not what the voters *should have* done.)

What Bill James did, amazingly, was reverse-engineer the collective brain of the writers, to figure out what their internal mental "formula" was.  

If you look at the details of the system, how players score points for arbitrary-looking achievements, you'd say, "well, that's certainly not how *my* intuitive decision system is figuring it out!"  But ... you know, it's probably reasonably close.  For most of us, we generally agree with the voters.  There are exceptions -- many sabermetricians object to some of the weights the voters apparently apply to various stats -- but, we fans are generally in line with the writers.  Especially on the "peak value vs. career value" question.  

-----

Another way you can see how ratings are subjective is ... the use of the word "rating".  We "rate" the players by who's best, but we don't "rate" the players by who's tallest.  We "measure" the tallest players, or "order" the tallest players, or "rank" the tallest players.  All those words imply some objective criterion, which we can reliably measure.  Runs Created, too.  In his 1982 Pesky/Stuart article, Bill James wrote,


"... runs created is not a rating.  It is an estimated record.  A rating is something which tells you how good; a record is something which tells you how many. ... a record gives you specific information that you could use to move toward those evaluations ... If I *rated* one player at 90 and another at 88, then I would be saying that the player who was at 90 was a better hitter than the one at 88.  But in fact it is entirely possible -- indeed, commonplace -- that a player who created 88 runs for his team could be considered a better hitter than a player who created 90 ... "

The same thing is true for "grades", like grades in school.  "Jimmy got a grade of 75 percent in math" makes sense.  "Jimmy got a grade of 95 pounds in weight" does not.  

Consumer Reports "assigned" the Tesla a "grade" of 99.  But they did not "assign" the Honda Civic a "grade" of 29 miles per gallon.  

-----

Let me quote Bill James one last time.  In the 1985 Abstract, Bill tried to figure out which great teams were the best of all-time.  He ran a bunch of objective criteria, then added them up to get an answer.  But then he said:


"I offer no proof of that; it is only a carefully worked-out opinion, which is very different from something that can be shown to be true."

Perfect.  If a ranking is not a measurement, then it's an opinion ... no matter how carefully you try to work it out.



3 comments:

  1. Wow, solid piece, Phil. As an armchair sabermetrician myself, I thought for sure I would find something to pick on you about. I didn't. And I agree with most of your thoughts and applaud you for not bringing the tired "small sample size" anywhere into this piece.

    So here's my question on that. (Albeit somewhat unrelated to this post...) When is a small sample size no longer too small? Is it dependent on the stat, or the scale? I.e., when someone hits two HRs in four games, no one thinks they're going to hit 81 for the year. But if a pitcher earns two wins in four starts, you may more reasonably hypothesize that he'll win 16 games that year. Is the difference that a hitter may play 162 games, and a pitcher may start only 35? Therefore 4 of 162 is a much smaller sample than 4 of 35?

    I'm rambling now, but I'm wondering if there is any "percentage of the total" where a small sample becomes significant. If a guy hits 20 HRs in 81 games, you may very well predict that he'll hit 40 for the year (sample size is 50% of the total). But for a guy who hits 10 HRs in 40 games, you can't as easily say he'll hit 40 for the year (sample size of 25%).

    I guess what I'd like to research and try to find out is the answer to: at what percentage of the total does a sample become truly significant, and how much does this vary for different stats?

    Have you ever thought about this and done any research on it?

  2. @Mo

    http://www.baseballprospectus.com/article.php?articleid=17659

  3. Excellent link, thank you Anonymous. Regarding HR rate stabilizing at 170 PA, we can look at roughly the first two months of the season and more-or-less accurately project the full season. Barring anomalous streaks or injuries, I suppose.

    Thanks for the info.
