Monday, November 18, 2019

Why you can't calculate aging trajectories with a standard regression

I found myself in a little Twitter discussion last week about using regression to analyze player aging. I argued that regression won't give you accurate results, and that the less elegant "delta method" is the better way to go.

Although I did a small example to try to make my point, Tango suggested I do a bigger simulation and a blog post. That's this.

(Some details if you want:

For the kind of regression we're talking about, each season of a career is an input row. Suppose Damaso Griffin created 2 WAR at age 23, 2.5 WAR at age 24, and 3 WAR at age 25. And Alfredo Garcia created 1, 1.5, and 1.5 WAR at age 24, 25, and 26. The file would look like:

2    23  Damaso Griffin
2.5  24  Damaso Griffin
3    25  Damaso Griffin
1    24  Alfredo Garcia
1.5  25  Alfredo Garcia
1.5  26  Alfredo Garcia

And so on, for all the players and ages you're analyzing. (The names are there so you can have dummy variables for individual player skills.)

You take that file and run a regression, and you hope to get a curve that's "representative" or an "average" or a "consolidation" of how those players truly aged.)
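In code, the setup would look something like this. This is just my sketch, assuming pandas and statsmodels and using the toy rows from above; the column names are mine:

import pandas as pd
import statsmodels.formula.api as smf

rows = [
    (2.0, 23, "Damaso Griffin"),
    (2.5, 24, "Damaso Griffin"),
    (3.0, 25, "Damaso Griffin"),
    (1.0, 24, "Alfredo Garcia"),
    (1.5, 25, "Alfredo Garcia"),
    (1.5, 26, "Alfredo Garcia"),
    # ... one row per player-season for everyone in the sample
]
df = pd.DataFrame(rows, columns=["war", "age", "player"])

# Quadratic in age, plus a dummy variable for each player (C(player)),
# so the age curve is estimated net of each player's overall level.
model = smf.ols("war ~ age + I(age**2) + C(player)", data=df).fit()
print(model.params.filter(like="age"))

The player dummies are what are supposed to let the regression separate "how good the player is" from "how players age."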

------

I simulated 200 player careers. I decided to use a quadratic (parabola), symmetric around peak age. I would have used just a linear regression, but I was worried that it might seem like the conclusions were the result of the model being too simple.

Mathematically, there are three parameters that define a parabola. For this application, they represent (a) peak age, (b) peak production (WAR), and (c) how steep or gentle the curve is.* 

(*The equation is: 

y = peak production - (x - peak age)^2 / steepness.

"Steepness" is related to how fast the player ages: higher steepness is higher decay. Assuming a player has a job only when his WAR is positive, his career length can be computed as twice the square root of (peak WAR * steepness). So, if steepness is 2 and peak WAR is 4, that's a 5.7 year career. If steepness is 6 and peak WAR is 7, that's a 13-year career.

You can also represent a parabola as y = ax^2+bx+c, but it's harder to get your head around what the coefficients mean. They're both the same thing ... you can use basic algebra to convert one into the other.)

For each player, I randomly gave him parameters from these distributions: (a) peak age normally distributed with mean 27 and SD 2; (b) peak WAR with mean 4 and SD 2; and (c) steepness (mean 2, SD 5; but if the result was less than 1.5, I threw it out and picked a new one).

I arbitrarily decided to throw out any careers of length three years or fewer, which reduced the sample from 200 players to 187. Also, I assumed nobody plays before age 18, no matter how good he is. I don't think either of those decisions made a difference.
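If you want to play along, here's roughly what that simulation looks like in Python. This is my own reconstruction; the random seed and the age-18-to-45 grid are arbitrary choices of mine, not anything from the original setup:

import numpy as np

rng = np.random.default_rng(0)
careers = []
for _ in range(200):
    peak_age = rng.normal(27, 2)
    peak_war = rng.normal(4, 2)
    steep = rng.normal(2, 5)
    while steep < 1.5:                  # throw it out and pick a new one
        steep = rng.normal(2, 5)
    ages = np.arange(18, 46)            # nobody plays before age 18
    war = peak_war - (ages - peak_age) ** 2 / steep
    careers.append([(a, w) for a, w in zip(ages, war) if w > 0])

# Keep only careers longer than three seasons (the post ended up with 187 of 200).
careers = [c for c in careers if len(c) > 3]
print(len(careers))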

Here's the plot of all 187 aging curves on one graph:





The idea, now, is to consolidate the 187 curves into one representative curve. Intuitively, what are we expecting here? Probably, something like, the curve that belongs to the average player in the list.

The average random career turned out to have a peak age of 26.9, a peak WAR of 4.19, and a steepness of 5.36. Here's a curve that matches those parameters:





That seems like what we expect, when we ask a regression to find the best-fit curve. We want a "typical" aging trajectory. Eyeballing the graph, it does look pretty reasonable, although to my eye, it's just a bit small. Maybe half a year bigger left and right, and a bit higher? But close. Up to you ... feel free to draw on your monitor what you think it should look like.  

But when I ran the regression ... well, what came out wasn't close to my guess, and probably not close to your guess either:






It's much, much gentler than it should be. Even if your gut told you something different from the black curve, there's no way your gut was thinking this. The regression came up with a 19-year career. A career that long happened only once in the entire 187-player sample. We expected "representative," but the regression gave us the 99.5th percentile.

What happened?

It's the same old "selective sampling"/"survivorship bias" problem.

The simulation was set up so that when a player's curve dips below zero, those seasons aren't included. It makes sense to code the simulation that way, to match real life. If Jerry Remy had played five years longer than he did, what would his WAR have been at age 36? We have no idea.

But, with this simulation, we have a God's-eye view of how negative every player would go. So, let's include that in the plot, down to -20:





See what's happening? The black curve is based on *all* the green data, both above and below zero, and it lands in the middle. The red curve is based only on the green data above zero, so it ignores all the green negatives at the extremes.

If you like, think of the green lines as magnets, pulling the lines towards them. The green magnets bottom-left and bottom-right pull the black curve down and make it steeper. But only the green magnets above zero affect the red line, so it's much less steep.

In fact, if you scroll back up to the other graph, the one that's above zero only, you'll see that at almost every age, the red line bisects the green forest -- there's about as much green magnetism above the red line as there is below it.

In other words: survivorship bias is causing the difference.

------

What's really going on is that the regression is just falling for the same classic fallacy we've been warning against for the past 30 years! It's comparing players active (above zero) at age 27 to players active (above zero) at age 35. And it doesn't find much difference. But that's because the two sets of players aren't the same.

One more thing to make the point clearer. 

Let's suppose you find every player active last year at age 27, and average their performance (per 500PA, or whatever). And then you find every player active last year at age 35, and average their performance.

And you find there's not much difference. And you conclude, hey, players age gracefully! There's hardly any dropoff from age 27 to age 35!

Well, that's the fallacy saberists have been warning against for 30 years, right? The canonical (correct) explanation goes something like this:


"The problem with that logic is that it doesn't actually measure aging, because those two sets of players aren't the same. The players who are able to still be active at 35 are the superstars. The players who were able to be active at 27 are ... almost all of them. All this shows is that superstars at 35 are almost as good as the league average at 27. It doesn't actually tell us how players age."

Well, that logic is *exactly* what the regression is doing. It's calculating the average performance at every age, and drawing a parabola to join them. 

Here's one last graph. I've included the "average at each age" line (blue) calculated from my random data. It's almost a perfect match to the (red) regression line.






------

Bottom line: all the aging regression does is commit the same classic fallacy we repeatedly warn about. It just winds up hiding it -- by complicating, formalizing, and blackboxing what's really going on. 






Friday, August 04, 2017

Deconstructing an NBA time-zone regression

Warning: for regression geeks only.

----

Recently, I came across an NBA study that found an implausibly huge effect of teams playing in other time zones. The study uses a fairly simple regression, so I started thinking about what could be happening. 

My point here isn't to call attention to the study, just to figure out the puzzle of how such a simple regression could come up with such a weird result. 

------

The authors looked at every NBA regular-season game from 1991-92 to 2001-02. They tried to predict which team won, using these variables:

-- indicator for home team / season
-- indicator for road team / season
-- time zones east for road team
-- time zones west for road team

The "time zones" variable was set to zero if the game was played in the road team's normal time zone, or if it was played in the opposite direction. So, if an east-coast team played on the west coast, the west variable would be 3, and the east variable would be 0.

The team indicators are meant to represent team quality. 
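In other words, the structure is something like this. This is a sketch of my own in Python/statsmodels; the post doesn't say exactly how the authors coded the outcome, so I'm writing it as a linear probability model on whether the road team won:

import pandas as pd

# One row per game; the team indicators are team/season combinations.
games = pd.DataFrame([
    # road_win, home_team,  road_team,  zones_east, zones_west
    (0,         "DEN-1996", "TOR-1996", 0,          2),
    (1,         "TOR-1996", "LAL-1996", 3,          0),
    # ... one row for every regular-season game in the sample
], columns=["road_win", "home_team", "road_team", "zones_east", "zones_west"])

# Indicators for every home and road team/season, plus the two travel variables:
formula = "road_win ~ C(home_team) + C(road_team) + zones_east + zones_west"
# With the full dataset loaded, you'd fit it with something like
# statsmodels.formula.api.ols(formula, data=games).fit()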

------

When the authors ran the regression, they found the "number of time zones" variable large and statistically significant. For each time zone moving east, teams played .084 better than expected (after controlling for teams). A team moving west played .077 worse than expected. 

That means a .500 road team from the West Coast would actually play .752 ball on the East Coast. And that's regardless of how long the visiting team has been in the home team's time zone. It could be a week or more into a road trip, and the regression says it's still .752.

The authors attribute the effect to "large, biological effects of playing in different time zones discovered in medicine and physiology research." 

------

So, what's going on? I'm going to try to get to the answer, but I'll start with a couple of dead ends that nonetheless helped me figure out what the regression is actually doing. I should say in advance that I can't prove any of this, because I don't have their data and I didn't repeat their regression. This is just from my armchair.

Let's start with this. Suppose it were true, that for physiological reasons, teams always play worse going west, and teams always play better going east. If that were the case, how could you ever know? No matter what you see in the data, it would look EXACTLY like the West teams were just better quality than the East teams. (Which they have been, lately.)  

To see that argument more easily: suppose the West Coast teams are of true NBA quality, the MST teams are minor-league AAA, the CST teams are AA, and the East Coast teams are A ball -- but all the leagues play against each other.

In that case, you'd see exactly the pattern the authors got: teams are .500 against each other in the same time zone, but worse when they travel west to play against better leagues, and better when they travel east to play against worse leagues.

No matter what results you get, there's no way to tell whether it's time zone difference, or team quality.

So is that the issue, that the regression is just measuring a quality difference between teams in different time zones? No, I don't think so. I believe the "time zone" coefficient of the regression is measuring something completely irrelevant (and, in fact, random). I'll get to that in a bit. 

------

Let's start by considering a slightly simpler version of this regression. Suppose we include all the team indicator variables, but, for now, we don't include the time-zone number. What happens?

Everything works, I think. We get decent estimates of team quality, both home and road, for every team/year in the study. So far, so good. 

Now, let's add a bit more complexity. Let's create a regression with two time zones, "West" and "East," and add a variable for the effect of that time zone change.

What happens now?

The regression will fail. There's an infinite number of possible solutions. (In technical terms, the regression matrix is "singular."  We have "collinearity" among the variables.)

How do we know? Because there's more than one set of coefficients that fits the data perfectly. 

(Technical note: a regression will always fail if you have an indicator variable for every team. To get around this, you'll usually omit one team (and the others will come out relative to the one you omitted). The collinearity I'm talking about is even *after* doing that.)

Suppose the regression spit out that the time-zone effect is actually  .080, and it also spit out quality estimates for all the teams.

From that solution, we can find another solution that works just as well. Change the time-zone effect to zero. Then, add .080 to the quality estimate of every West team. 

Every team/team estimate will wind up working out exactly the same. Suppose, in the first result, the Raptors were .400 on the road, the Nuggets were .500 at home, and the time-zone effect is .080. In that case, the regression will estimate the Raptors at .320 against the Nuggets. (That's .400 - (.500 - .500) - .080.)

In the second result, the regression leaves the Raptors at .400, but moves the Nuggets to .580, and the time-zone effect to zero. The Raptors are still estimated at .320 against the Nuggets. (This time, it's .400 - (.580 - .500) - .000.)

You can create as many other solutions as you like that fit the data identically: just add any X to the time-zone estimate, and add the same X to every Western team.

The regression is able to figure out that the data doesn't give a unique solution, so it craps out, with a message that the regression matrix is singular.
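You can check the "two equally good solutions" arithmetic in a few lines. This is just the logic of the example above, not the authors' actual model:

def predict(road_quality, home_quality, zones_west, tz_effect):
    # road team's expected winning percentage in this game
    return road_quality - (home_quality - 0.500) - tz_effect * zones_west

# Solution 1: Raptors .400 on the road, Nuggets .500 at home, time-zone effect .080
print(predict(0.400, 0.500, 1, 0.080))   # 0.320

# Solution 2: set the effect to zero and add .080 to every West team's quality
print(predict(0.400, 0.580, 1, 0.000))   # 0.320 -- identical fit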

------

All that was for a regression with only two time zones. If we now expand to include all four zones, that gives six different effects in each direction (E moving to C, C to M, M to P, E to M, C to P, and E to P). What if we include six time-zone variables, one for each effect?

Again, we get an infinity of solutions. We can produce new solutions almost the same way as before. Just take any solution, subtract X from each E team quality, and add X to the E-C, E-M and E-P coefficients. You wind up with the same estimates.

------

But the authors' regression actually did have one unique best fit solution. That's because they did one more thing that we haven't done.

We can get to their regression in two steps.

First, we collapse the six variables into three -- one for "one time zone" (regardless of which zone it is), one for "two time zones," and one for "three time zones". 

Second, we collapse those three variables into one, "number of time zones," which implicitly forces the two-zone effect and three-zone effect to be double and triple, respectively, the value of the one-zone effect. I'll call that the "x/2x/3x rule" and we'll assume that it actually does hold.

So, with the new variable, we run the regression again. What happens?

In the ideal case, the regression fails again. 

By "ideal case," I mean one where all the error terms are zero, where every pair of teams plays exactly as expected. That is, if the estimates predict the Raptors will play .350 against the Nuggets, they actually *do* play .350 against the Nuggets. It will never happen that every pair will go perfectly in real life, but maybe assume that the dataset is trillions of games and the errors even out.

In that special "no errors" case, you still have an infinity of solutions. To get a second solution from a first, you can, for instance, double the time zone effects from x/2x/3x to 2x/4x/6x. Then, subtract x from each CST team, subtract 2x from each MST team, and subtract 3x from each PST team. You'll wind up with exactly the same estimates as before.

-------

For this particular regression to not crap out, there have to be errors. Which is not a problem for any real dataset. The Raptors certainly won't go the exact predicted .350 against the Nuggets, either because of luck, or because it's not mathematically possible (you'd need to go 7-for-20, and the Raptors aren't playing 20 games a season in Denver).

The errors make the regression work.

Why? Before, x/2x/3x fit all the observations perfectly. So you could create duplicate solutions by shifting the team estimates by X, 2X, and 3X, and shifting the one-, two-, and three-zone effects to match. Now, because of errors, the observed two- and three-zone effects aren't exactly double and triple the one-zone effects. So not everything cancels out, and you get different residuals.

That means that this time there's a unique solution, and the regression spits it out.

-------

In this new, valid, regression, what's the expected value of the estimate for the time-zone effect?

I think it must be zero.

The estimate of the coefficient is a function of the observed error terms in the data. But the errors are, by definition, just as likely to be negative as positive. I believe (but won't prove) that if you reverse the signs of all the error terms, you also reverse the sign of the time zone coefficient estimate.

So, the coefficient is as likely to be negative as positive, which means by symmetry, its expected value must be zero.

In other words: the coefficient in the study, the one that looks like it's actually showing the physiological effects of changing time zone ... is actually completely random, with expected value zero.

It literally has nothing at all to do with anything basketball-related!

-------

So, that's one factor that's giving the weird result, that the regression is fitting the data to randomness. Another factor, and (I think) the bigger one, is that the model is wrong. 

There's an adage, "All models are wrong; some models are useful." My argument is that this model is much too wrong to be useful. 

Specifically, the "too wrong" part is the requirement that the time-zone effect must be proportional to the number of zones -- the "x/2x/3x" assumption.

It seems like a reasonable assumption, that the effect should be proportional to the time lag. But, if it's not, that can distort the results quite a bit. Here's a simplified example showing how that distortion can happen.

Suppose you were to run the regression without the time-zone coefficient, and you get talent estimates for the teams, and you look at the errors in predicted vs. actual. For East teams, you find the errors are

+.040 against Central
+.000 against Mountain
-.040 against Pacific

That means that East teams played .040 better than expected against Central teams (after adjusting for team quality). They played exactly as expected against Mountain Time teams, and .040 worse than expected against West Coast teams.

The average of those numbers is zero. Intuitively, you'd look at those numbers and think: "Hey, there's no appreciable time-zone effect. Sure, the East teams lost a little more than normal against the Pacific teams, but they won a little more than normal against the Central teams, so it's mostly a wash."

Also, you'd notice that it really doesn't look like the observed errors follow x/2x/3x. The closest fit seems to be when you make x equal to zero, to get 0/0/0.

So, does the regression see that and spit out 0/0/0, accepting the errors it found? No. It actually finds a way to make everything fit perfectly!

To do that, it increases its estimates of every Eastern team by .080. Now, every East team appears to underperform by .080 against each of the three other time zones. Which means the observed errors are now 

-.040 against Central
-.080 against Mountain
-.120 against Pacific

And that DOES follow the x/2x/3x model -- which means you can now fit the data perfectly. Using 0/0/0, the .500 Raptors were expected to be .500 against an average Central team (.500 minus 0), but they actually went .540. Using -.040/-.080/-.120, the .580 Raptors are expected to be .540 against an average Central team (.580 minus .040), and that's exactly what they did.

So the regression says, "Ha! That must be the effect of time zone! It follows the x/2x/3x requirement, and it fits the data perfectly, because all the errors now come out to zero!"
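If you want to check that arithmetic yourself, here it is as code -- just the logic of the contrived example, nothing more:

observed_errors = {1: +0.040, 2: 0.000, 3: -0.040}   # zones west: error vs. expectation

bump = 0.080     # what the regression adds to every Eastern team's quality estimate
x = -0.040       # the per-zone "time-zone effect" it settles on

for zones, error in observed_errors.items():
    new_error = error - bump           # predictions rose by .080, so errors drop by .080
    assert abs(new_error - x * zones) < 1e-9
    print(zones, round(new_error, 3))  # -.040, -.080, -.120: a perfect x/2x/3x fit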

So you conclude that 

(a) over the whole sample, the East teams were .580 teams but played down to .500 because they suffered from a huge time-zone effect.

Well, do you really want to believe that? 

You have at least two other options you can justify: 

(b) over the whole sample, the East teams were .500 teams and there was a time-zone effect of +40 points playing in CST, and -40 points playing in PST, but those effects weren't statistically significant.

(c) over the whole sample, the East teams were .500 teams and, with no statistical significance and no obvious pattern, we conclude there's no real time-zone effect.

The only reason to choose (a) is if you are almost entirely convinced of two things: first, that x/2x/3x is the only reasonable model to consider, and, second, that 40/80/120 points is plausible enough to not assume that it's just random crap, despite the statistical significance.

You have to abandon your model at this point, don't you? I mean, I can see how, before running the regression, the x/2x/3x assumption seemed as reasonable as any. But, now, to maintain that it's plausible, you have to also believe it's plausible that an Eastern team loses .120 points of winning percentage when it plays on the West Coast. Actually, it's worse than that! The .120 was from this contrived example. The real data shows a drop of more than .200 when playing on the West Coast!

The results of the regression should change your mind about the model, and alert you that the x/2x/3x is not the right hypothesis for how time-zone effects work.

-------

Does this seem like cheating? We try a regression, we get statistically-significant estimates, but we don't like the result so we retroactively reject the model. Is that reasonable?

Yes, it is. Because you have to either reject the model, or accept its implications. IF we accept the model, then we're forced to accept that there's a 240-point West-to-East time zone effect, and we're forced to accept that West Coast teams that play at a 41-41 level against other West Coast teams somehow raise their game to the 61-21 level against East Coast teams that are equal to them on paper.

Choosing the x/2x/3x model led you to an absurd conclusion. Better to acknowledge that your model, therefore, must be wrong.

Still think it's cheating? Here's an analogy:

Suppose I don't know how old my friend's son is. I guess he's around 4, because, hey, that's a reasonable guess, from my understanding of how old my friend is and how long he's been married. 

Then, I find out the son is six feet tall.

It would be wrong for me to keep my assumption, wouldn't it? I can't say, "Hey, on the reasonable model that my friend's son is four years old, the regression spit out a statistically significant estimate of 72 inches. So, I'm entitled to conclude my friend's son is the tallest four-year-old in human history."

That's exactly what this paper is doing.  

When your model spews out improbable estimates for your coefficients, the model is probably wrong. To check, try a different, still-plausible model. If the result doesn't hold up, you know the conclusions are the result of the specific model you chose. 

------

By the way, if the statistical significance is concerning you, consider this. When the authors repeated the analysis for a later group of years, the time-zone effect was much smaller. It was .012 going east and -.008 going west, which wasn't even close to statistical significance. 

If the study had combined both samples into one, it wouldn't have found significance at all.

Oh, and, by the way: it's a known result that when you have strong correlation in your regression variables (like here), you get wide confidence intervals and weird estimates (like here). I posted about that a few years ago.  

-------

The original question was: what's going on with the regression, that it winds up implying that a .500 team on the West Coast is a .752 team on the East Coast?

The summary is: there are three separate things going on, all of which contribute:

1.  there's no way to disentangle time zone effects from team quality effects.

2.  the regression only works because of random errors, and the estimate of the time-zone coefficient is only a function of random luck.

3.  the x/2x/3x model leads to conclusions that are too implausible to accept, given what we know about how the NBA works. 





-----

UPDATE, August 6/17: I got out of my armchair and built a simulation. The results were as I expected. The time-zone effect I built in wound up absorbed by the team constants, and the time-zone coefficient varied around zero in multiple runs.




Friday, December 04, 2015

A new "hot hand" study finds a plausible effect

There's a recent baseball study (main page, .pdf) that claims to find a significant "hot hand" effect. Not just statistically significant, but fairly large:


"Strikingly, we find recent performance is highly significant in predicting performance ... Furthermore these effects are of a significant magnitude: for instance, ... a batter who is “hot” in home runs is 15-25% more likely (0.5-0.75 percentage points or about one half a standard deviation) more likely to hit a home run in his next at bat."

Translating that more concretely into baseball terms: imagine a batter who normally hits 20 HR in a season. The authors are saying that when he's on a hot streak of home runs, he actually hits like a 23 or 25 home run talent. 

That's a strong effect. I don't think even home field advantage is that big, is it?

In any case, after reading the paper ... well, I think the study's conclusions are seriously exaggerated. Because, part of what the authors consider a "hot hand" effect doesn't have anything to do with streakiness at all.

------

The study took all player seasons from 2000 to 2011, subject to an AB minimum. Then, the authors tried to predict every single AB for every player in that span. 

To get an estimate for that AB, the authors considered:

(a) the player's performance in the preceding 25 AB; and
(b) the player's performance in the rest of the season, except that they excluded a "window" of 50 AB before the 25, and 50 AB after the AB being predicted.

To make this go easier, I'm going to call the 25 AB sample the "streak AB" (since it measures how streaky the player was). I'm going to call the two 50 AB exclusions the "window AB". And, I'm going to call the rest of the season the "talent AB," since that's what's being used as a proxy for the player's ability.

Just to do an example: Suppose a player had 501 AB one year, and the authors are trying to predict AB number 201. They'd divide up the season like this:

1. the first 125 AB (part of the "talent AB")
2. the next 50 AB, (part of the "window AB")
3. the next 25 AB (the "streak AB")
4. the single "current AB" being predicted
5. the next 50 AB, (part of the "window AB")
6. the next 250 AB (part of the "talent AB").

They run a regression to predict (4), based on two variables:

B1 -- the player's ability, which is how he did in (1) and (6) combined
B2 -- the player's performance in (3), the 25 "streak AB" that show how "hot" or "cold" he was, going into the current AB.

Well, not just those two -- they also include the obvious control variables, like season, park, opposing pitcher's talent, platoon, and home field advantage. 

(Why did they choose to exclude the "windows" (2) and (5)? They say that because the windows occur so close to the actual streak, they might themselves be subject to streakiness, and that would bias the results.)
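Here's the partitioning in code, for concreteness. The function is my own sketch, and I'm ignoring the question of ABs too close to the start or end of a season:

def split_season(outcomes, i, streak=25, window=50):
    """outcomes: list of per-AB results; i: index of the AB being predicted."""
    streak_ab = outcomes[max(i - streak, 0):i]              # the 25 AB right before
    before = outcomes[:max(i - streak - window, 0)]         # earlier ABs, minus the 50-AB window
    after = outcomes[i + 1 + window:]                       # later ABs, minus the 50-AB window
    return streak_ab, before + after                        # (streak sample, talent sample)

# The 501-AB season from the example, predicting AB number 201 (index 200):
season = [0] * 501
streak_ab, talent_ab = split_season(season, 200)
print(len(streak_ab), len(talent_ab))    # 25, 375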

What did the study find? That the estimate of "B2" was large and significant. Holding the performance in the 375 AB "ability" sample constant, the better the player did in the immediately preceding 25 "streak" AB, the better he did in the current AB.

In other words, a hot player continues to be hot!

------

But there's a problem with that conclusion, which you might have figured out already. The methodology isn't actually controlling for talent properly.

Suppose you have two players, identical in the "talent" estimate -- in 375 AB each, they both hit exactly 21 home runs.

And suppose that in the streak AB, they were different. In the 25 "streak" AB, player A didn't hit any home runs. But player B hit four additional homers.

In that case, do you really expect them to hit identically in the 26th AB? No, you don't. And not because of streakiness -- but, rather, because player B has demonstrated himself to be a better home run hitter than player A, by a margin of 25 to 21.

In other words, the regression coefficient confounds two factors -- streakiness, and additional evidence of the players' relative talent.

-------

Here's an example that might make the point a bit clearer.

(a)  in their first 10 AB -- the "talent" AB -- Mario Mendoza and Babe Ruth both fail to hit a HR.
(b)  in their second 100 AB -- the "streak" AB -- Mendoza hits no HR, but the Babe hits 11.
(c)  in their third 100 AB -- the "current" AB -- Mendoza again hits no HR, but the Babe hits 10.

Is that evidence of a hot hand? By the authors' logic, yes, it is. They would say:

1. The two players were identical in talent, from the control sample of (a). 
2. In (b), Ruth was hot, while Mendoza was cold.
3. In (c), Ruth outhit Mendoza. Therefore, it must have been the hot hand in (b) that caused the difference in (c)!

But, of course, the point is ... (b) is not just evidence of which player was hot. It's also evidence of which player was *better*. 

-------

Now, the authors did actually understand this was an issue. 

In a previous version of their paper, they hadn't. In 2014, when Tango posted a link on his site, it took only two-and-a-half hours for commenter Kincaid to point out the problem (comment 6).  (There was a follow-up discussion too.)

The authors took note, and now realize that their estimates of streakiness are confounded by the fact that they're not truly controlling for established performance. 

The easiest way for them to correct the problem would have been just to include the 25 AB in the talent variable. In the "Player A vs. Player B" example, instead of populating the regression with "21/0" and "21/4", they could easily have populated it with "21/0" and "25/4".

Which they did, except -- only in one regression, and only in an appendix that's for the web version only.

For the published article, they decided to leave the regression the way it was, but, afterwards, try to break down the coefficient to figure out how much of the effect was streakiness, and how much was talent. Actually, the portion I'm calling "talent" they decided to call "learning," on the grounds that it's caused by performance in the "streak AB" allowing us to "learn" more about the player's long-term ability. 

Fine, except: they still chose to define "hot hand" as the SUM of "streakiness" and "learning," on the grounds that ... well, here's how they explain it:


"The association of a hot hand with predictability introduces an issue in interpretation, that is also present but generally unacknowledged in other papers in the area. In particular, predictability may derive from short-term changes in ability, or from learning about longer-term ability. ... We use the term “hot hand” synonymously with short-term predictability, which encompasses both streakiness and learning."

To paraphrase, what they're saying is something like:


"The whole point of "hot hand" studies is to see how well we can predict future performance. So the "hot hand" effect SHOULD include "learning," because the important thing is that the performance after the "hot hand" is higher, and, for predictive purposes, we shouldn't care what caused it to be higher."

I think that's nuts. 

Because, the "learning" only exists in this study because the authors deliberately chose to leave some of the known data out of the talent estimate.

They looked at a 25/4 player (25 home runs of which 4 were during the "streak"), and a 21/0 player (21 HR, 0 during the streak), and said, "hey, let's deliberately UNLEARN about the performance during the streak time, and treat them as identical 21-HR players. Then, we'll RE-LEARN that the 25/4 guy was actually better, and treat that as part of the hot hand effect."

-------

So, that's why the authors' estimates of the actual "hot hand" effect (as normally understood outside of this paper) are way too high. They answered the wrong question. They answered,


"If a guy hitting .250 has a hot streak and raises his average to .260, how much better will he be than a guy hitting .250 who has a cold streak and lowers his average to .240?"

They really should have answered,


"If a guy hitting .240 has a hot streak and raises his average to .250, how much better will he be than a guy hitting .260 who has a cold streak and lowers his average to .250?"

--------

But, as I mentioned, the authors DID try to decompose their estimates into "streakiness" and "learning," so they actually did provide good evidence to help answer the real question.

How did they decompose it? They realized that if streakiness didn't exist at all, each individual "streak AB" should have the same weight as each individual "talent AB". It turned out that the individual "streak AB" were actually more predictive, so the difference must be due to streakiness.

For HR, they found the coefficient for the "streak AB" batting average was .0749. If a "streak AB" were exactly as important as a "talent AB", the coefficient would have been .0437. The difference, .0312, can maybe be attributed to streakiness.

In that case, the "hot hand" effect -- as the authors define it, as the sum of the two parts -- is 58% learning, and 42% streakiness.

-------

They didn't have to do all that, actually, since they DID run a regression where the Streak AB were included in the Talent AB. That's Table A20 in the paper (page 59 of the .pdf), and we can read off the streakiness coefficient directly. It's .0271, which is still statistically significant.

What does that mean for prediction?

It means that to predict future performance, based on the HR rate during the streak, only 2.71 percent of the "hotness" is real. You have to regress 97.29 percent to the mean. 

Suppose a player hit home runs at a rate of 20 HR per 500 AB, including the streak. During the streak, he hit 4 HR in 25 AB, which is a rate of 80 HR per 500 AB. What should we expect in the AB that immediately follows the streak?

Well, during the streak, the player hit at a rate 60 HR / 500 AB higher than normal. 60 times 2.71 percent equals 1.6. So, in the AB following the streak, we'd expect him to hit at a rate of 21.6 HR, instead of just 20.
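As arithmetic, with the .0271 coefficient from Table A20 and the example's round numbers:

normal_rate = 20.0                 # HR per 500 AB, streak included
streak_rate = 4 / 25 * 500         # 4 HR in 25 AB = a rate of 80 per 500 AB
coef = 0.0271                      # the share of the "hotness" that's real

expected = normal_rate + coef * (streak_rate - normal_rate)
print(round(expected, 1))          # 21.6 HR per 500 AB, in the AB right after the streak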

-------

In addition to HR, the authors looked at streaks for hits, strikeouts, and walks. I'll do a similar calculation for those, again from Table A20.

Batting Average

Suppose a player hits .270 overall (except for the one AB we're predicting), but has a hot streak where he hits .420. What should we expect immediately after the streak?

The coefficient is .0053. 150 points above average, times .0053, is ... less than one point. The .270 hitter becomes maybe a .271 hitter.

Strikeouts

Suppose a player normally strikes out 100 times per 500 AB, but struck out at double that rate during the streak (which is 10 K in those 25 AB). What should we expect?

The coefficient is .0279. 100 rate points above average, times .0279, is 2.79. So, we should expect the batter's K rate to be 102.79 per 500 AB, instead of just 100. 

Walks

Suppose a player normally walks 80 times per 500 PA, but had a streak where he walked twice as often. What's the expectation after the streak?

The coefficient here is larger, .0674. So, instead of walking at a rate of 80 per 500 PA, we should expect a walk rate of 85.4. Well, that's a decent effect. Not huge, but something.

(The authors used PA instead of AB as the basis for the walk regression, for obvious reasons.)
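Here are those three calculations in one place, using the same formula as the home run example. The coefficients are the ones quoted above; the "hot" rates are just the doublings and the 150-point jump from the examples:

examples = [
    # (stat,            normal rate, hot-streak rate, coefficient)
    ("BA (points)",     270,         420,             0.0053),
    ("K per 500 AB",    100,         200,             0.0279),
    ("BB per 500 PA",    80,         160,             0.0674),
]
for stat, normal, hot, coef in examples:
    print(f"{stat}: {normal} -> {normal + coef * (hot - normal):.1f}")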

---------

It's particularly frustrating that the paper is so misleading, because, there actually IS an indication of some sort of streakiness. 

Of course, for practical purposes, the size of the effect means it's not that important in baseball terms. You have to quadruple your HR rate over a 25 AB streak to get even a 10 percent increase in HR expectation in your next single AB. At best, if you double your walk rate over a hot streak, your walk expectation goes up about 7 percent.

But it's still a significant finding in terms of theory, perhaps the best evidence I've ever seen that there's at least *some* effect. It's unfortunate that the paper chooses to inflate the conclusions by redefining "hot hand" to mean something it's not.




(P.S.  MGL has an essay on this study in the 2016 Hardball Times. My book arrived last week, but I haven't read it yet. Discussion here.)





Tuesday, August 12, 2014

More r-squared analogies

OK, so I've come up with yet another analogy for the difference between the regression equation coefficient and the r-squared.

The coefficient is the *actual signal* -- the answer to the question you're asking. The r-squared is the *strength of the signal* relative to the noise for an individual datapoint.

Suppose you want to find the relationship between how many five-dollar bills someone has, and how much money those bills are worth. If you do the regression, you'll find:

Coefficient = 5.00 (signal)
r-squared = 1.00 (strength of signal)
1 minus r-squared = 0.00 (strength of noise)
Signal-to-noise ratio = infinite (1.00 / 0.00)

The signal is: a five-dollar bill is worth $5.00. How strong is the signal?  Perfectly strong --  the r-squared is 1.00, the highest it can be.  (In fact, the signal to noise ratio is infinite, because there's no noise at all.)

Now, change the example a little bit. Suppose a lottery ticket gives you a one-in-a-million chance of winning five million dollars. Then, the expected value of each ticket is $5.  (Of course, most tickets win nothing, but the *average* is $5.)

You want to find out the relationship between how many tickets someone has, and how much money those tickets will win. With a sufficiently large sample size, the regression will give you something like:

Coefficient = 5.00 (signal)
r-squared = 0.0001 (strength of signal)
1 minus r-squared = 0.9999 (strength of noise)
Signal-to-noise ratio = 0.0001 (0.0001 / 0.9999)

The average value of a ticket is the same as a five-dollar bill: $5.00. But the *noise* around $5.00 is very, very large, so the r-squared is small. For any given ticketholder, the distribution of his winnings is going to be pretty wide.

In this case, the signal-to-noise ratio is something like 0.0001 divided by 0.9999, or 1:10,000. There's a lot of noise in with the signal.  If you hold 10 lottery tickets, your expected winnings are $50. But, there's so much noise, that you shouldn't count on the result necessarily being close to $50. The noise could turn it into $0, or $5,000,000.

On the other hand, if you own 10 five-dollar bills, then you *should* count on the $50, because it's all signal and no noise.
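You can watch this happen in a small simulation. I've toned the lottery down to a 1-in-1,000 shot at $5,000 -- still worth $5 a ticket in expectation -- so that a couple hundred thousand simulated ticket-holders is enough; with the one-in-a-million version you'd need an enormous sample before the coefficient settled near $5:

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
tickets = rng.integers(1, 21, size=n)                 # each person holds 1 to 20 tickets
winnings = rng.binomial(tickets, 1e-3) * 5_000.0      # each ticket: 1-in-1,000 shot at $5,000

slope, intercept = np.polyfit(tickets, winnings, 1)
r = np.corrcoef(tickets, winnings)[0, 1]
print(round(slope, 2))      # close to 5.00 -- the signal
print(round(r ** 2, 4))     # a fraction of one percent -- the strength of the signal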

It's not a perfect analogy, but it's a good way to get a gut feel. In fact, you can simplify it a bit and make it even easier:

-- the coefficient is the signal.
-- the r-squared is the signal-to-noise ratio.

You can even think of it this way, maybe:

-- the coefficient is the "mean" effect.
-- the (1 - r-squared) is the "variance" (or SD) of the effect.

Five-dollar bills have a mean value of $5, and variance of zero. Five-dollar lottery tickets have a mean value of $5, but a very large variance.  

------

So, keeping in mind these analogies, you can see that this is wrong: 

"The r-squared between lottery tickets and winnings is very close to zero, which means that lottery tickets have very little value."

It's wrong because the r-squared doesn't tell you the actual value of a ticket (mean). It just tells you the noise (variance) around the realized value for an individual ticket-holder. To really see the value of a ticket, you have to look at the coefficient.  

From the r-squared alone, however, you *can* say this:

"The r-squared between lottery tickets and winnings is very close to zero, which means that it's hard to predict what your lottery tickets are going to be worth just based on how many you have."

You can conclude "hard to predict" based on the r-squared. But if you want to conclude "little value on average," you have to look at the coefficient.  

------

In the last post, I linked to a Business Week study that found an r-squared of 0.01 between CEO pay and performance. Because the 0.01 is a small number, the authors concluded that there's no connection, and CEOs aren't paid by performance.

That's the same problem as the lottery tickets.

If you want to see if CEOs who get paid more do better, you need to know the size of the effect. That is: you want to know the signal, not the *strength* of the signal, and not the signal-to-noise ratio. You want the coefficient, not the r-squared.

And, in that study, the signal was surprisingly high -- around 4, by my lower estimate. That is: for every $1 in additional salary, the CEO created an extra $4 for the shareholders. That's the number the magazine needs in order to answer its question.

The low r-squared just shows that the noise is high. The *expected value* is $4, but, for a particular case, it could be far from $4, in either direction.  I haven't checked, but I bet that some companies with relatively low-paid executives might create $100 per dollar, and some companies who pay their CEOs double or triple the average might nonetheless wind up losing value, or even going bankrupt.

------

Now that I think about it, maybe a "lottery ticket" analogy would be good too: 


Think of every effect as a combination of lottery tickets and cash money.

-- The regression coefficient tells you the total value of the tickets and money combined.

-- The r-squared tells you what proportion of that total value is in money.  

That one works well for me.

------

Anyway, the idea is not that these analogies are completely correct, but that they make it easier to interpret the results, and to spot errors of interpretation. When Business Week says, "the r-squared is 0.01, so there is no relationship," you can instantly respond:

"... All that r-squared tells you is, whatever the relationship actually turns out to be, the signal-to-noise ratio is 1:99. But, so what? Maybe it's still an important signal, even if it's drowned out by noise. Tell us what the coefficient is, so we can evaluate the signal on its own!"

Or, when someone says, "the r-squared between team payroll and wins is only .18, which means that money doesn't buy wins," you can respond:

"... All that r-squared tells you is, whatever the relationship actually turns out to be, 82 percent of it comes in the form of lottery tickets, and only 18 percent comes in cash. But those tickets might still be valuable! Tell us what the coefficient is, so we can see that value, and we can figure out if spending money on better players is actually worth it."

------

Does either one of those work for you?  




(You can find more of my old stuff on r-squared by clicking here.)



Tuesday, July 29, 2014

Are CEOs overpaid or underpaid?

Corporate executives make a lot of money. Are they worth it? Are higher-paid CEOs actually better than their lower-paid counterparts?

Business Week magazine says, no, they're not, and they have evidence to prove it. They took 200 highly-paid CEOs, and did a regression to predict their company's stock performance from their chief executive's pay. The plot looks highly random, with an r-squared of 0.01. Here's a stolen copy:




The magazine says,


"The comparison makes it look as if there is zero relationship between pay and performance ... The trend line shows that a CEO’s income ranking is only 1 percent based on the company’s stock return. That means that 99 percent of the ranking has nothing to do with performance at all. ...

"If 'pay for performance' was really a factor in compensating this group of CEOs, we’d see compensation and stock performance moving in tandem. The points on the chart would be arranged in a straight, diagonal line."

I think there are several reasons why that might not be right.

First, you can't go by the apparent size of the r-squared. There are a lot of factors involved in stock performance, and it's actually not unreasonable to think that the CEO would only be 1 percent of the total picture.

Second, an r-squared of 0.01 implies a correlation of 0.1. That's actually quite large. I bet if you ran a correlation of baseball salaries to one-week team performance, the r-squared would probably be just as small -- but that wouldn't mean players aren't paid by performance. As I've written before, you have to look at the regression equation, because even the smallest correlation could imply a large effect.

Third, the study appears to be based on a dataset created by Equilar, a consulting firm that advises on executive pay. But Equilar's study was limited to the 200 best-paid CEOs, and that artificially reduces the correlation. 

If you take only the 30 best-paid baseball players, and look at this year's performance on the field, the correlation will be only moderate. But if you add in the rest of the players, and minor-leaguers too, the correlation will be much higher. 

(If you don't believe me: find any scatterplot that shows a strong correlation. Take a piece of paper and cover up the leftmost 90% of the datapoints. The 10% that remain will look much more random.)

Fourth, the observed correlation is fairly statistically significant, at p=0.08 (one-tailed -- calculate it here). That could be just random chance, but, on its face, 0.08 does suggest there's a good chance there's something real going on. On the other hand, the result probably comes out "too" significant because the 200 datapoints aren't really independent. It could be the case, for instance, that CEOs tend to get paid more in the oil industry, and, coincidentally, oil stocks happen to have done well recently.

-----

BTW, I don't think there's a full article accompanying the Business Week chart; I think what's in that link is all we get. Which is annoying, because it doesn't tell us how the 200 CEOs were chosen, or what years' stock performance was looked at. I'm not even sure that the salaries were negotiated in advance. If they weren't, of course, the result is meaningless, because it could just be that successful companies rewarded their executives after the fact. 

Furthermore, the chart doesn't match the text. The reporters say they got an r-squared of 0.01. I measured the slope of the regression line in the chart, by counting pixels, and it appears to be around 0.06. But an r of 0.06 implies an r-squared of 0.0036, which is far short of the 0.01 figure. Maybe the authors rounded up, for effect? 

It could be that my pixel count was off. If you raise the slope from 0.06 to 0.071, you now get an r-squared of 0.0051, which does round to 0.01. So, for purposes of this post, I'm going to assume the r is actually 0.07.

-----

A correlation of 0.07 means that, to predict a company's performance ranking, you have to regress its CEO pay ranking 93% towards the mean. (This works out because the X and Y variables have the same SD, both consisting of numbers from 1 to 200.)

In other words, 7 percent of the differences are real. That doesn't sound like much, but it's actually pretty big. 

Suppose you're the 20th ranked CEO in salary. What does that say about your company's likely performance? It means you have to regress it 93% of the way back to 100.5. That takes you to 95th. 

So, CEOs that get paid 20th out of 200 improve their company's stock price by 5 rankings more than CEOs who get paid 100.5th out of 200.

How big is five rankings?

I found a website that allowed me to rank all the stocks in the S&P 500 by one-year return. (They use one year back from today, so, your numbers may be different by the time you try it.  Click on the heading "1-Year Percent.")  

The top stock, Micron Technology, gained 151.47%. The bottom stock, Avon, lost 42.80%.

The difference between #1 and #500 is 194.27 percentage points. Divide that by 499, and the average one-spot-in-the-rankings difference is 0.39 percentage points.

Micron is actually a big outlier -- it's about 33 points higher than #2 (Facebook), and 52 points higher than #5 (Under Armour). So, I'm going to arbitrarily reduce the difference from 0.39 to 0.3, just to be conservative.

On that basis, five rankings is the equivalent of 1.5 percentage points in performance.

How much money is that, in real-life terms, for a stock to overperform by 1.5 points?

On the S&P 500, the average company has a market capitalization (that is, the total value of all outstanding stock) of 28 billion (.pdf). For the average company, then, 1.5 points works out to $420 million in added value.

If you want to use the median rather than the mean, it's $13.4 billion and $200 million, respectively.

Either way, it's a lot more than the difference in CEO compensation.

From the Business Week chart, the top CEO made about $142 million. The 200th CEO made around $12.5 million. The difference is $130 million over 199 rankings, or $650K per ranking. (The top four CEOs are outliers. If you remove them, the spread drops by half. But I'll leave them in anyway.)

Moving up the 80 rankings in our hypothetical example is worth only a $52 million raise -- much less than the apparent value added:

Pay difference:      $52 million
--------------------------------
Median value added: $200 million
Mean value added:   $420 million
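Here's the whole back-of-envelope chain in one place. Every input is a figure quoted above; nothing comes from the underlying dataset:

r = 0.07                     # estimated pay-rank vs. performance-rank correlation
pay_rank, mean_rank = 20, 100.5

perf_rank = mean_rank - r * (mean_rank - pay_rank)   # regress 93% to the mean
rank_gain = 5                                        # the ~5.5 spots above, rounded down

points = rank_gain * 0.3                             # ~1.5 percentage points of return
mean_cap, median_cap = 28e9, 13.4e9                  # S&P 500 mean and median market cap

print(round(perf_rank))                              # 95th in performance
print(round(points / 100 * mean_cap / 1e6))          # ~420 ($ million added, mean company)
print(round(points / 100 * median_cap / 1e6))        # ~200 ($ million added, median company)

pay_per_rank = 0.65e6                                # ~$650K of salary per ranking spot
print(round((mean_rank - pay_rank) * pay_per_rank / 1e6))   # ~52 ($ million of extra pay)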

Moreover ... the value of a good CEO is much higher, obviously, for a bigger company. The ten biggest companies on the S&P 500 have a market cap of at least $200 billion each. For a company of that size, the equivalently "good" CEO -- the one paid 20th out of 200 -- is worth three billion dollars. That's *60 times* the average executive salary.

Assuming my arithmetic is OK, and I didn't drop a zero somewhere.

-----

So, I think the Business Week regression shows the opposite of what they believe it shows. Taking the data at face value, you'd have to conclude that executives are underpaid according to their talent, not overpaid.

I'm not willing to go that far. There's a lot of randomness involved, and, as I suggested before, other possible explanations for the positive correlation. But, if you DO want to take the chart as evidence of anything, it's evidence that there is, indeed, a substantial connection between pay and performance. The r-squared of less than 0.01 only *looks* tiny.

-----

Although I think this is weak evidence that CEOs *do* make a difference that's bigger than their salary, the numbers certainly suggest that they *can* make that big an impact.

Suppose you own shares of Apple, and they're looking for a new CEO. A "superstar" candidate comes along. He wants twice as much money as normal. As a shareholder, do you want the company to pay it? 

It depends what you expect his (or her) production to be. What kind of difference do you think a good CEO will make in the company's performance?

Suppose that, next year, you think Apple will earn $6.50 a share with a "replacement level" CEO. How much more do you expect with the superstar CEO?

If you think he or she can make a 1% difference, that's an extra 6.5 cents per share. That might be too high. How about one cent a share, from $6.51 instead of $6.50? Does that seem reasonable? 

Apple trades at around 15 times annual earnings. So, one cent in earnings means about 15 cents on the stock price. With six billion Apple shares outstanding, 15 cents a share gives the superstar CEO a "value above replacement" of $900 million.

So, for a company as big as Apple, if you *do* think a CEO can make a 1-part-in-650 difference in earnings, even the top CEO salary of $142 million looks cheap.
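The arithmetic, spelled out (round numbers from the text, not actual Apple financials):

baseline_eps = 6.50        # earnings per share with a "replacement level" CEO
extra_eps = 0.01           # the superstar adds one cent a share
pe_ratio = 15              # Apple trades around 15 times earnings
shares = 6e9               # roughly six billion shares outstanding

print(extra_eps * pe_ratio * shares / 1e6)   # 900.0 -- about $900 million of market value
print(round(baseline_eps / extra_eps))       # 650   -- the "1-part-in-650" difference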

Apple has the largest market cap of all 500 companies in the index, at about 15 times the average, so it's perhaps a special case. But it shows that CEOs can certainly create, or destroy, a lot more value than their salaries.

----- 

So can you conclude that corporate executives are underpaid? Not unless you can provide good evidence that a particular CEO really is that much better than the alternatives. 

There's a lot of luck involved in how a company's business goes -- it depends on the CEO's decisions, sure, but also on the overall market, and the actions of competitors, and advances in technology in general, and world events, and Fed policy, and random fads, and a million other things. It's probably very hard to figure out who the best CEOs are, even based on a whole career. I bet it's as hard as, say, figuring out baseball's best hitters based on only a week's worth of AB.

Or maybe not. Steve Jobs was fired as Apple's CEO, then, famously, returned to the struggling company a few years later to mastermind the iPod, iPhone, and iPad. Apple is now worth around 100 times as much as it was before Jobs came back. That's an increase in value of somewhere around $500 billion. It was maybe closer to $300 billion at the time of Jobs' death in 2011.

How much of that is due to Jobs' actual "talent" as CEO? Was he just lucky that his ethos of "insanely great" wound up leading to the iPhone? Maybe Jobs just happened to win the lottery, in that he had the right engineers and creative people to create exactly the right product for the right time?

It's obvious that Apple created hundreds of billions of dollars worth of value during Jobs' tenure, but I have no idea how much of that is actually due to Jobs himself. Well, I shouldn't say *no* idea. From what I've read and seen, I'd be willing to bet that he's at least, say, 1 percent responsible. 

One percent of $300 billion is $3 billion. Divide that by 14 years, and it's more than $200 million per year.

If you give Steve Jobs even just one percent of the credit for Apple's renaissance, he was still worth 50 percent more than today's highest-paid CEO, 300 percent more than today's eighth-highest paid CEO, and 1500 percent more than today's 200th-highest-paid CEO. 


