Sunday, February 23, 2014

Did Team Canada overestimate its players?

Joel Colomby, the fantasy columnist for the Sun chain of newspapers, has an interesting finding today showing how, in 2010, Team Canada's NHL players seemed to get worse after the Olympics.


Of the 19 Canadian players on Colomby's "top scorers" list, only five had a higher NHL points-per-game rate after the Olympics than before.  The other 14 dropped.

Colomby included only players who are still in the NHL in 2014 and are "likely owned in a fantasy pool" this year.  That creates a selective-sampling issue, but if you take the results at face value, a 5-14 record is about 2 SDs from 50-50.
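
(If you want to check that arithmetic: under a coin-flip null, 19 players should split about 9.5/9.5, with an SD of about 2.2, and five improvements is 4.5 below expected -- a bit over 2 SD.  Here's that back-of-envelope calculation as a quick Python sketch; the 50-50 null is my simplifying assumption, not Colomby's.)

    import math

    n = 19          # players in Colomby's Team Canada sample
    improved = 5    # players whose points-per-game went up after the Olympics

    expected = n * 0.5                 # 9.5 improvers expected under the coin-flip null
    sd = math.sqrt(n * 0.5 * 0.5)      # binomial SD, about 2.18
    z = (expected - improved) / sd     # a bit over 2 SDs below expectation

    print(round(sd, 2), round(z, 2))   # 2.18 2.06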

Looking at the other countries, there didn't seem to be any real effect.  It did vary from team to team, but, if you put all the other teams together, you get 16 improvements out of 34 players.  So:

 5-14 Canada
16-18 Rest of World

Going by actual performance differences, the results are similar.  The average dropoff in points per game (PPG) was:

-0.086 PPG Canada
-0.011 PPG World

By my rough estimate, Canada came in at roughly 1.8 SD from zero.  

------

These averages are from calculations I did by hand; Colomby gives only the individual player numbers.  He hints that we might see repeats for certain players.  For instance, "Leafs fans, and Randy Carlyle, have to hope Phil Kessel rebounds better than he did four years ago [-0.17]."

Of course, I don't think the individual numbers mean anything at all -- small sample size, as Colomby mentions.  But ... this got me wondering about something else.

The Canadian players dropped off in the latter part of the season.  Players who dropped off are more likely to have been lucky earlier in the season, just before Team Canada would have been selected.  

So: isn't it possible that the selection process was perhaps too influenced by randomness?   Could it be that GM Steve Yzerman and his staff put a little too much weight on players' recent "hot" performances, and wound up with a team that perhaps wasn't as good as it could have been?

That theory also explains why Canada's players dropped off more than the rest of the world's.  All the players on Team Canada came from the NHL, so Canada's roster is based on who management thought were the best Canadian NHL players.  For other teams, the list of NHL players to draw from is smaller, so those decisions are a lot easier.  In fact, several of the teams would obviously need to include *all* of their NHL players, lucky or not.

Also, in this case, the Sun's selective sampling doesn't hurt this hypothesis, and may actually help it.  The players omitted from Colomby's list are the ones who are no longer regulars in the NHL.  Those guys, you'd think, would have been *more* likely to have declined after the Olympics, not less.

Are there other explanations?  Probably.  I bet some of the dropoff has to do with injuries.  The team is selectively sampled for prior good health (injured players don't go to the Olympics), so you'd expect a certain amount of dropoff regardless when those players later get their normal share of injuries.  (I wonder if that might be the source of the slight decrease for the other countries' players.)

And it goes without saying that it could just be random.

------

My gut says ... I bet the "luck" explanation is at least part of what happened, that Team Canada management wound up slightly overestimating players who were hot the first half of 2009-10.  

I could be wrong.  Those of you who know hockey better than I do, check out the list of players, and see if there are any questionable selections that seem to have been based on the player's uncharacteristically good recent play. I'll list the players for you (and also repeat the link).

Player                  Pre   Post  Diff (hundredths of PPG)
-------------------------------------------------------------
Sidney Crosby PIT      1.28 / 1.55 / +27
Patrice Bergeron BOS    .68 /  .79 / +11
Brenden Morrow DAL      .59 /  .65 /  +6
Jonathan Toews CHI      .89 /  .90 /  +1
Scott Niedermayer ANA   .60 /  .61 /  +1
Mike Richards PHIL      .77 /  .73 /  -4
Drew Doughty LA         .74 /  .67 /  -7
Eric Staal CAR         1.02 /  .95 /  -7
Patrick Marleau SJ     1.03 /  .95 /  -8
Corey Perry ANA         .95 /  .85 / -10
Duncan Keith CHI        .87 /  .76 / -11
Chris Pronger PHIL      .70 /  .59 / -11
Rick Nash CLB           .90 /  .77 / -13
Dan Boyle SJ            .80 /  .65 / -15
Shea Weber NAS          .59 /  .42 / -17
Joe Thornton SJ        1.21 /  .82 / -19
Dany Heatley SJ        1.06 /  .80 / -26
Ryan Getzlaf ANA       1.09 /  .80 / -29
Jarome Iginla CGY       .92 /  .60 / -32
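
(To reproduce the -0.086 Canada average from above, just take the mean of the Diff column -- a quick Python sketch:)

    # Diff column from the table above, in hundredths of a point per game
    diffs = [27, 11, 6, 1, 1, -4, -7, -7, -8, -10,
             -11, -11, -13, -15, -17, -19, -26, -29, -32]

    avg_drop = sum(diffs) / len(diffs) / 100.0   # convert back to points per game
    print(round(avg_drop, 3))                    # -0.086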


-----

Of course, we will eventually be able to check whether the same thing happens this year.  For the current season, we might also see an effect for Team USA.  According to quanthockey.com, there were 136 American players with at least 30 games played at the Olympic break.  With fewer than half as many players to choose from as the 309 equivalent Canadian NHLers, the USA might have faced fewer tough decisions, which means less reliance on luck.  But, it's still worth checking.

If the result repeats for 2014, we'd have the cleanest evidence I've seen that sports GMs fail to fully consider luck when predicting future performance.  We already have a strong intuition that this happens, but it's been hard to tell for sure.

All decent players get contracts, even unlucky ones.  So, if John Doe has a career year, we need to know not *whether* he was signed, but for *how much*.  And even then, it's orders of magnitude more difficult to compare performance to salary than it is to compare first-half performance to second-half performance.  

It's a small sample, but this time we have an unambiguous "yes/no" of whether Team Canada thought this player was among the best.  And it turned out that almost three-quarters of the players chosen had, at the time, been playing over their (later-selves') heads.  

Was Team Canada fooled by randomness?







Wednesday, February 05, 2014

Ratings vs. measurements

Sabermetrics may have a lot of fancy new methods, but there's no stat that rates who the best players are.

That's not just because there's no sabermetric "holy grail" of the ultimate statistic.  The real reason is that such a thing isn't even possible.

You can talk about which player is "better" than another, but the problem is that there's no objective definition of "better".  You can't use statistics to help measure something when you can't even define what it is you're measuring.  

What sabermetrics CAN do is provide statistics that bear on the question of "best".  It can provide objective data that can inform your arguments.  But those arguments, like your definition of "best," are always going to be subjective.

The very first chapter of the 1982 Bill James Baseball Abstract was about comparing hitters.  Who was better?  Johnny Pesky, who had little power but hit for a high average and lots of walks?  Or Dick Stuart, who didn't get on base a whole lot, but regularly hit 30 home runs?

Bill was able to discover that there is a mathematical connection between a batting line and the number of runs that score.  Using his "Runs Created" formula, Bill found that Pesky's three best years were almost perfectly identical to Stuart's three best years.  They created almost exactly the same number of runs, in almost exactly the same number of opportunities (outs).
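
(For reference, the basic version of Runs Created is hits plus walks, times total bases, divided by at-bats plus walks.  A minimal sketch, with made-up batting lines rather than Pesky's or Stuart's actual numbers:)

    def runs_created(h, bb, tb, ab):
        """Bill James's basic Runs Created: (H + BB) * TB / (AB + BB)."""
        return (h + bb) * tb / (ab + bb)

    # Illustrative lines only -- a high-average, high-walk hitter
    # versus a low-average power hitter:
    print(round(runs_created(h=185, bb=85, tb=230, ab=570), 1))   # about 95 runs
    print(round(runs_created(h=145, bb=40, tb=280, ab=550), 1))   # about 88 runs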

Doesn't that settle the question?  Doesn't that prove that Pesky and Stuart were equally good hitters?

No, it doesn't.  "Equal performances" doesn't necessarily imply "equally good hitters."  

Part of the problem, as Bill mentioned in his essay, is that there are a lot of things Runs Created doesn't consider: park effects, differing quality of opposing pitchers, and so on.  We haven't corrected for those.  My experience has been that this is what people are thinking of when they assume there will eventually be that perfect statistic: that some day we'll have so much data that we can correct for everything, and our run estimates will be almost perfectly accurate.

But, that's not the real problem.  The real problem is that when you're rating players, you're not trying to figure out who created the most runs.  You're trying to figure out who is the *best*.

Who says the highest-rated batter should be the one who creates the most runs?  Sure, creating runs is a very big deal, and our ratings of "best" are tremendously better informed now that we have that information.  But, there are other factors to consider.

For instance: suppose it's obvious a player had a lucky "career year".  Wouldn't that influence your rating of who's better?  What if one of the players saw a platoon advantage much more than the other?  What if one player hit much better with runners on base, but the other hit much better when the score was close?

How do you deal with all those?  It's subjective, isn't it, how much weight you have to give each of those factors in a "best hitter" evaluation?  I don't see how there can possibly be a "right answer" when the question is so subjective and vague.

-----

There are two different ways someone could disagree with you about who's better.  They could disagree with your definition, or they could disagree with your measurement.

The measurement one is easy.  Suppose you think the better player is the one who creates the most runs.  

To measure that, you decide to use "Total Average," the Thomas Boswell stat that's basically bases divided by outs.  You'd calculate every player's stat, and point to the top ones, and say, "those are the best."  And a sabermetrician would come along and say, "well, that's not right.  You're counting a stolen base as equal to a single.  But the SB advances only one runner, while the single advances the batter and also every other baserunner."

And someone else would pull out some other evidence, like when Bill James showed that teams that steal lots of bases don't win as many games as teams who hit lots of singles.  And someone would come along with Pete Palmer's linear weights calculation, and show that a single is worth half a run, on average, while a steal is worth only a fifth of a run.
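
(To see the disagreement concretely: Total Average treats a steal and a single as one base each, while linear weights prices each event by its average run value.  A rough sketch using the approximate values above -- half a run for a single, a fifth of a run for a steal; the exact weights vary a bit by era and by source:)

    # Two hypothetical hitters, identical except one has 30 extra singles
    # and the other has 30 extra (successful) steals.  Total Average
    # (bases / outs) credits both with the same 30 extra bases and no extra outs.
    single_value = 0.5   # approximate linear-weights run value of a single
    steal_value  = 0.2   # approximate run value of a stolen base

    extra_runs_singles = 30 * single_value   # 15 runs
    extra_runs_steals  = 30 * steal_value    #  6 runs
    print(extra_runs_singles - extra_runs_steals)   # 9.0 runs apart, same Total Average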

That's the easy part, critiquing and improving a measurement.  The hard part, and the subjective part, is the definition of "better".  

If you think better means "more runs per out," and I think better means "more runs above replacement," then how do we resolve that?  I guess, like any other debate on what's "better" -- like a political debate, arguing back and forth and appealing to principle.  

Does affirmative action make society better or worse?  Is legal abortion a good thing?  What's better, taking a strict interpretation of the first amendment, or protecting minorities from hate speech?  What's a better performance, a .300 hitter, or a .290 hitter who hits .325 in the clutch?

-----

Even in a simple, two-dimensional case, it can be all gut feel.

We all want the best players to be in the Hall of Fame, right?  Well, some players are among the best because they were very good for many, many years -- Phil Niekro, say.  And some players are among the best because they were superb, but for fewer years -- Sandy Koufax.

What's the tradeoff?  What's the definition of "best" that can tell you whether Koufax or Niekro is "better"?

If you think about it, and look up the stats, your gut will likely come to some firm conclusion about which player is better.  But your intuitive feeling will be different from my intuitive feeling.  And, neither of us can articulate just what the tradeoff is.  Maybe we can say, "well, if Niekro had a few more good years, I'd prefer him."  Or, "I'd be more comfortable saying Koufax is better if he hadn't had those mediocre years at the beginning."  But, we won't be able to say, "I'm weighting a Cy Young Sandy Koufax year as 2.32 times as important as an average Jim Kaat year."

We have a strong intuition, but we can't explain it.

One of my favorite things that Bill James ever did was his "Hall of Fame Monitor" method.  He figured out a point system that scores each player on his Hall of Fame qualifications.  The way the system works, if you have 100 points or more, you'll probably be voted in, and if you don't, you won't.

And it works pretty well -- it does a good job of separating the players who have been enshrined from the ones who haven't.   (It's a prediction, not a recommendation -- it separates on what the voters *did*, not what the voters *should have* done.)
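
(If you've never seen it, the structure is simple: players pile up points for milestone-type accomplishments, and roughly 100 points marks a probable inductee.  The toy version below is just to show the shape of the thing -- the point values are invented for illustration and are not James's actual weights:)

    def hof_monitor_toy(seasons):
        # Invented point values, for illustration only -- NOT the real system.
        points = 0
        for s in seasons:
            if s["avg"] >= .300:
                points += 2
            if s["hr"] >= 40:
                points += 5
            if s["mvp"]:
                points += 8
        return points

    def probably_elected(points):
        # Per the system described above: roughly 100 points or more
        # means the voters will probably put the player in.
        return points >= 100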

What Bill James did, amazingly, was reverse-engineer the collective brain of the writers, to figure out what their internal mental "formula" was.  

If you look at the details of the system, how players score points for arbitrary-looking achievements, you'd say, "well, that's certainly not how *my* intuitive decision system is figuring it out!"  But ... you know, it's probably reasonably close.  For most of us, we generally agree with the voters.  There are exceptions -- many sabermetricians object to some of the weights the voters apparently apply to various stats -- but, we fans are generally in line with the writers.  Especially on the "peak value vs. career value" question.  

-----

Another way you can see how ratings are subjective is ... the use of the word "rating".  We "rate" the players by who's best, but we don't "rate" the players by who's tallest.  We "measure" the tallest players, or "order" the tallest players, or "rank" the tallest players.  All those words imply some objective criterion, which we can reliably measure.  Runs Created, too.  In his 1982 Pesky/Stuart article, Bill James wrote,


"... runs created is not a rating.  It is an estimated record.  A rating is something which tells you how good; a record is something which tells you how many. ... a record gives you specific information that you could use to move toward those evaluations ... If I *rated* one player at 90 and another at 88, then I would be saying that the player who was at 90 was a better hitter than the one at 88.  But in fact it is entirely possible -- indeed, commonplace -- that that a player who created 88 runs for his team could be considered a better hitter than a player who created 90 ... "

The same thing is true for "grades", like grades in school.  "Jimmy got a grade of 75 percent in math" makes sense.  "Jimmy got a grade of 95 pounds in weight" does not.  

Consumer Reports "assigned" the Tesla a "grade" of 99.  But they did not "assign" the Honda Civic a "grade" of 29 miles per gallon.  

-----

Let me quote Bill James one last time.  In the 1985 Abstract, Bill tried to figure out which great teams were the best of all-time.  He ran a bunch of objective criteria, then added them up to get an answer.  But then he said:


"I offer no proof of that; it is only a carefully worked-out opinion, which is very different from something that can be shown to be true."

Perfect.  If a ranking is not a measurement, then it's an opinion ... no matter how carefully you try to work it out.


