Thursday, April 21, 2016

Noll-Scully doesn't measure anything real

The most-used measure of competitive balance in sports is the "Noll-Scully" measure. To calculate it, you figure the standard deviation (SD) of the winning percentage of all the teams in the league. Then, you divide by what the SD would be if all teams were of equal talent, and the results were all due to luck.

The bigger the number, the less parity in the league.

For a typical, recent baseball season, you'll find the SD of team winning percentage is around .068 (that's 11 wins out of 162 games). By the binomial approximation to normal, the SD due to luck is .039 (6.4 out of 162). So, the Noll-Scully measure works out to .068/.039, which is around 1.74.

In other words: the spread of team winning percentage in baseball is 1.74 times as high as if every team were equal in talent.
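If you want to check the arithmetic, here it is as a quick Python sketch, assuming a .500 league and the binomial approximation for luck:

import math

# observed SD of team winning percentage, and the SD you'd see from luck alone
observed_sd = 0.068                     # roughly, for a recent MLB season
luck_sd = 0.5 / math.sqrt(162)          # binomial SD for a .500 team over 162 games
print(round(luck_sd, 3))                # 0.039
print(round(observed_sd / luck_sd, 2))  # about 1.73 (1.74 if you use the rounded .039)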

------

Both "The Wages of Wins" (TWOW), and a paper on competitive balance I just read recently (which I hope to post about soon), independently use Noll-Scully to compare different sports. Well, not just them, but a lot of academic research on the subject.

The Wages of Wins (page 70 of the second edition) runs this chart:

2.84 NBA
1.81 AL (MLB)
1.67 NL (MLB)
1.71 NHL
1.48 NFL

The authors follow up by speculating on why the NBA's figure is so high, why the league is so unbalanced. They discuss their "short supply of tall people" hypothesis, as well as other issues.

But one thing they don't talk about is the length of the season. In fact, their book (and almost every other academic paper I've seen on the subject) claims that Noll-Scully controls for season length. 

Their logic goes something like this: (a) The Noll-Scully measure is actually a multiple of the theoretical SD of luck. (b) That theoretical SD *does* depend on season length. (c) Therefore, you're comparing the league to what it would be with the same season length, which means you're controlling for it.

But ... that's not right. Dividing by the theoretical SD *does* adjust for season length, but only partially.

------

Let's go back to the MLB case. We had

.068 observed SD
.039 theoretical luck SD
-------------------------
1.74 Noll-Scully ratio

Using the fact that SDs of independent quantities follow a Pythagorean relationship (variances add), it follows that

observed SD squared = theoretical luck SD squared + talent SD squared

So

.068 squared = .039 luck squared + talent squared

Solving, we find that the SD of talent = .056. Let's write that this way:

.039 theoretical luck SD
.056 talent SD
------------------------
.068 observed SD
---------------------------------------
1.74 Noll-Scully (.068 divided by .039)
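In code, backing out the talent SD is one line of algebra. Here's a sketch of the same calculation:

import math

observed_sd = 0.068
luck_sd = 0.5 / math.sqrt(162)
talent_sd = math.sqrt(observed_sd ** 2 - luck_sd ** 2)   # variances add, so subtract and take the root
print(round(talent_sd, 3))                               # 0.056, matching the table above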

Now, a hypothetical. Suppose MLB had decided to play a season four times as long: 648 games instead of 162. If that happened, the theoretical luck SD would drop in half (we'd divide by the square root of 4). So, the luck SD would be .020. 

The talent SD would remain constant at .056. The new observed SD would be the square root of (.020 squared plus .056 squared), which works out to .059:

.020 theoretical luck SD
.056 talent SD
-------------------------
.059 observed SD
---------------------------------------
2.95 Noll-Scully (.059 divided by .020)

Under this scenario, the Noll-Scully increases from 1.74 to 2.95. But nothing has changed about the game of baseball, or the short supply of home run hitters, or the relative stinginess of owners, or the populations of the cities where the teams play. All that changed was the season length.

--------

My only point here, for now, is that Noll-Scully does NOT properly control for season length. Any discussion of why one sport has a higher Noll-Scully than another *must* include a consideration of the length of the season. Generally, the longer the season, the higher the Noll-Scully. (Try a Noll-Scully calculation early in the season, like today, and you'll get a very low number. That's because after only 15 games, luck is huge, so talent is small compared to luck.)
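Here's a little sketch that makes the season-length dependence explicit: hold the talent SD fixed at .056, as in the hypothetical above, and let only the number of games change.

import math

talent_sd = 0.056                        # held constant

for games in (15, 162, 648):
    luck_sd = 0.5 / math.sqrt(games)
    observed_sd = math.sqrt(talent_sd ** 2 + luck_sd ** 2)
    print(games, round(observed_sd / luck_sd, 2))

# prints roughly 1.09 for 15 games, 1.74 for 162, and about 3.0 for 648
# (the 2.95 above comes from the rounded .059/.020)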

It's not like there's no alternative. We just showed one! Instead of Noll-Scully, why not just calculate the "talent SD" as above? That estimate *is* constant across season lengths, and it's still a measure of what academic authors are looking for.

Tango did this in a famous post in 2006. He got

.060 MLB
.058 NHL
.134 NBA

If you repeat Tango's logic for different season lengths, you should get roughly the same numbers. You'll get slightly different results because of random variation ... but they should average out somewhere close to those figures.

---------

Now, you could argue ... well, sometimes you *do* want to control for season length. Perhaps one of the reasons the best teams dominate the standings is because NBA management wanted it that way ... so they chose a longer, 82-game season, in order to create some space between the Warriors and the other teams. Furthermore, maybe the NFL deliberately chose 16 games partly to give the weaker teams a chance.

Sure, that's fine. But you don't want to use Noll-Scully there either, because Noll-Scully still *partially* adjusts for season length, by using "luck multiple" as its unit. Either you want to consider season length, or you don't, right? Why would you only *partially* want to adjust for season length? And why that particular part?

If you want to consider season length, just use the actual SD of the standings. If you don't, then use the estimated SD of talent, from the pythagorean calculation. 

Either way, Noll-Scully doesn't measure anything anybody really wants.






Tuesday, March 22, 2016

Charlie Pavitt: On some implications of the Adam LaRoche situation

Charlie Pavitt occasionally guest posts here ... it's been a while, but he's back! Here are Charlie's thoughts on Adam LaRoche and education.

-----

Let me begin this essay by stating that if it is true that the White Sox promised Adam LaRoche full clubhouse access for his son Drake, then they are wrong to reverse themselves on that promise now; and if LaRoche only agreed to sign the contract because of the promise, he has good reason to feel betrayed.

But this essay is not about this specific issue. It is about something more general.  My understanding is that LaRoche made a public statement that included something like the following: School is not so important, you can learn more about life in the baseball clubhouse than in school. Even if I am wrong about what LaRoche did or did not say, the question deserves consideration, so I want to discuss my thinking about this general issue and not about this specific case.

You most certainly can learn something about life in the baseball clubhouse. Here are three important things that you can learn: First, that a group of men from drastically different backgrounds (different races/ethnicities/social classes/religions/etc.) can work together in harmony in pursuit of a shared goal. Second, that a group of men from drastically different backgrounds can forge close friendships. Third, that success in one’s pursuits requires what I call the three D’s: desire, dedication, discipline. These are all extremely valuable lessons, and I suppose there are some other things that you can learn in the baseball clubhouse that I cannot think of right now.
But there are many things that are also important in life that you cannot learn in the baseball clubhouse.
First, you cannot learn how to interact in a mature enough manner with women to work in harmony with them in pursuit of a shared goal and to forge close friendships. I have no idea about now, but certainly in the “old days” a baseball clubhouse was anything but a good training ground for learning how to treat women as potential co-workers or non-sexual friends.
Second, you cannot learn how to interact in a mature enough manner with people with a different sexual orientation than you to work in harmony with them in pursuit of a shared goal and to forge close friendships. In this case, it is pretty clear that instances of disparaging treatment are still occurring, although happily the response to these instances has been to demand that the perpetrators grow up and act like adults. I would like to add that it is likely that there are quite a few gay major league baseball players right now who do not feel comfortable enough with how their teammates would react to come out of the closet, given how few past players have felt comfortable enough to do so even after retirement.
Third, you cannot learn how to interact in a mature enough manner with people with disabilities, be they of sight or hearing or physical limitations or psychiatric problems or low “intelligence,” whatever that is, to work in harmony with them in pursuit of a shared goal and to forge close friendships. I will say that baseball players have as a group been sensitive to and supportive of people such as these and also to other players with analogous issues (e.g., alcoholism), which is laudable.  But the issue here is whether one can learn this sensitivity in a baseball clubhouse as well as you can in school.
Fourth, you cannot learn how to interact in a mature enough manner with straight men with no disabilities, but a different temperament than one usually finds among baseball players, to work in harmony with them in pursuit of a shared goal and to forge close friendships.  I am thinking of men who think like artists or musicians or writers (or academics like me), who are basically non-competitive and wish to collaborate with everyone and not only one’s teammates. I am also thinking of men who have made financial sacrifices to dedicate their lives to the betterment of others; those who work in non-profits for pitiful wages, school teachers spending part of their much-lower-than-deserved income on their classroom and students because their school is so badly underfunded. I would not expect active disparagement of such people in the clubhouse, in fact if anything baseball players probably respect the hard work and achievements of such people. But I doubt they learned that attitude in the baseball clubhouse, and in any case the issue is once again whether the ability to understand those mindsets well enough to work alongside and becoming truly friendly with such people is better learned in a baseball clubhouse than in a school.
Fifth, you can learn the sorts of things that can make you a responsible public citizen and contribute to your community and to your nation; for example, someone who votes based on well-thought-out values and relevant knowledge rather than emotion and hearsay.  I am not saying that baseball players as a group do not have adequate public citizen skills, just like the general citizenry my guess is that some do and some don’t, but that you can learn those skills in school far better than in a baseball clubhouse. 
These are the sorts of things that you can learn in a school that you cannot learn in a baseball clubhouse. And, I might add, you can also learn in school about the three important things I listed above that you can learn in a clubhouse; in fact, you can learn them better. You can learn to pursue a goal with or become friends with people from drastically different backgrounds, and you can learn about the importance of the three D’s, and you can learn them better in a school because you are actually participating and not just observing others, as would be generally the case in the clubhouse. Note that unlike the clubhouse, you are learning these skills while interacting with people your own age rather than 10/20 years older. Research has conclusively shown that, while basic language and communication skills are originally learned from intense interaction with one’s immediate family, they are practiced and perfected through intense interaction with one’s age peers.
Now let me add one other thing that you can learn in a school that you cannot learn in a baseball clubhouse. You can learn a marketable skill other than playing baseball. Again, I want to consider the general issue and stay away from the specific LaRoche situation; my understanding is that Drake is being home-schooled and that Adam LaRoche does many things other than play baseball, and I trust that Drake is participating and learning from them. If a player thinks that the baseball clubhouse is a more educational environment for a son than a school, the player is thinking as if there were no question that the son was also going to spend his life in the baseball clubhouse. But what if the son’s interests and temperament are more in line with the examples I mentioned above: the musician/artist/writer/academic, the person who values others’ betterment over personal wealth? Or even if the son has the desire and temperament to play baseball, but not the skill? In particular, the latter type of son is woefully unprepared for work, not just for the nuts and bolts of the job but also for working in tandem with women, gay men, and those of different temperament. Or does the player think that, in that instance, it is fine for the son to live his life off the fortune the player is making?

Now, I understand that Drake LaRoche is being home-schooled, and for all I know he is learning about the things I am concerned about, and whether or not this is true in this instance is beside the general point. And I am not saying that boys should never spend any time in a baseball clubhouse, even if it means missing a little school. In conclusion, I am saying the following:

A baseball team that promises a player that the player’s son can totally share the baseball player life to the detriment of schooling is performing a great disservice to the son.

A baseball player who wants his son to totally share the baseball player life to the detriment of schooling is performing a great disservice to his son.

A baseball team that realizes the detriment to the son that the promise has caused and reverses itself on the promise certainly deserves censure for breaking a promise, but is performing a great service to the son.

And a baseball player in the latter situation who retires because of it and, in so doing, ensures educational experiences for the son beyond the clubhouse is performing a great service to the son.

-- Charlie Pavitt



Thursday, January 07, 2016

When log5 does and doesn't work

Team A, with an overall winning percentage talent of .600, plays against a weaker Team B with an overall winning percentage of .450. What's the probability that team A wins? 

In the 1980s, Bill James created the "log5" method to answer that question. The formula is

P = (A - AB)/(A+B-2AB)

... where A is the talent level of team A (in this case, .600), and B is the talent level of team B (.450).

Plug in the numbers, and you get that team A has a .647 chance of winning against team B. 
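The formula is simple enough to sanity-check in a couple of lines of Python (a quick sketch, using the same numbers):

def log5(a, b):
    # probability that a team with talent a beats a team with talent b
    return (a - a * b) / (a + b - 2 * a * b)

print(round(log5(0.600, 0.450), 3))   # 0.647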

That makes sense: A is .600 against average teams. Since opponent B is worse than average, A should be better than .600. 

Team B is .050 worse than average, so you'd kind of expect A to "inherit" those 50 points, to bring it to .650. And it does, almost. The final number is .647 instead of .650. The difference is because of diminishing returns -- those ".050 lost wins" are what B loses to *average* teams because it's bad. Because A is better than average, it would have got some of those .050 wins anyway because it's good, so B can't "lose them again" no matter how bad it is.

In baseball, the log5 formula has been proven to work very well.

------

There was some discussion of log5 lately on Tango's site (unrelated to this post, but very worthwhile), and that got me thinking. Specifically, it got me thinking: log5 CANNOT be right. It can be *almost* right, but it can never be *exactly* right.

In the baseball context, it can be very, very close, indistinguishable from perfect. But in other sports, or other contexts, it could be way wrong. 

Here's one example where it doesn't work at all.

Suppose that, instead of actually playing baseball games, teams just measured their players' average height, and the taller team wins. And, suppose there are 11 teams in the league, and there's a balanced 100-game season.

What happens? Well, the tallest team beats everyone, and goes 100-0. The second-tallest team beats everyone except the tallest, and winds up 90-10. The third-tallest goes 80-20. And so on, all the way down to 0-100.

Now: when a .600 team plays a .400 team, what happens? The log5 formula says it should win 69.2 percent of those games. But, of course, that's not right -- it will win 100 percent of those games, because it's always taller.

For height, the log5 method fails utterly.
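To see that concretely, here's a sketch of the height league. The winning percentages come straight from the rankings, and log5 gets the head-to-head matchup badly wrong:

def log5(a, b):
    return (a - a * b) / (a + b - 2 * a * b)

# 11 teams ranked by height, balanced 100-game schedule (10 games against each
# opponent): team i beats exactly the i teams shorter than it, so its winning
# percentage is just i divided by 10.
win_pct = [i / 10 for i in range(11)]

a, b = win_pct[6], win_pct[4]   # the .600 team and the .400 team
print(round(log5(a, b), 3))     # log5 says 0.692
# reality: the .600 team is taller, so it wins 100 percent of the time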

------

What's the difference between real baseball and "height baseball" that makes log5 work in one case but not the other?

I'm not 100% sure of this, but I think it's due to a hidden, unspoken assumption in the log5 method. 

When we say, "Team A is a .600 talent," what does that mean? It could mean either of two things:

-- A1. Team A is expected to beat 60 percent of the opponents it plays.

-- A2. If Team A plays an average team, it is expected to win 60 percent of the time.

Those are not the same! And, for the log5 method to work, assumption A1 is irrelevant. It's assumption A2 that, crucially, must be true. 

In both real baseball and "height baseball," A1 is true. But that doesn't matter. What matters is A2. 

In real baseball, A2 is close enough. So log5 works.

In "height baseball," A2 is absolutely false. If Team A (.600) plays an average team (.500), it will win 100 percent of the time, not 60 percent! And that's why log5 doesn't work there.

-------

What it's really coming down to is our old friend, the question of talent vs. luck. In real baseball, for a single game, luck dwarfs talent. In "height baseball," there's no luck at all -- the winner is just the team with the most talent (height). 

Here are two possible reasons a sports team might have a .600 record:

-- B1: Team C is more talented than exactly 60 percent of its opponents

-- B2: Team C is more talented than average, by some unknown amount (which varies by sport) that leads to it winning exactly 60 percent of its games.

Again, these are not the same. And, in real life, all sports (except "height baseball") are some combination of the two. 

B1 refers completely to talent, but B2 refers mostly to luck. The more luck there is, in relation to talent, the better log5 works.

Baseball has a pretty high ratio of luck to talent -- on any given day, the worst team in baseball can beat the best team in baseball, and nobody bats an eye. But in the NBA, there's much less randomness -- if Philadelphia beats Golden State, it's a shocking upset. 

So, my prediction is: the less that luck is a factor in an outcome, the more log5 will underestimate the better team's chance of winning.

Specifically, I would predict: log5 should work better for MLB games than for NBA games.

--------

Maybe someone wants to do some heavy thinking and figure how to move this forward mathematically.  For now, here's how I started thinking about it.

In MLB, the SD of team talent seems to be about 9 games per season. That's 90 runs, at roughly 10 runs per win. Actually, it's less, because you have to regress to the mean. Let's call it 81 runs, or half a run per game. (I'm too lazy to actually calculate it.) Combining the team and opponent, multiply by the square root of two, to give an SD of around 0.7 runs.

The SD of luck, in a single game, is much higher. I think that if you computed the SD of a team's single-game runs scored, over its 162 games, you'd get around 3. The SD of runs allowed is also around 3, so the SD of the difference would be around 4.2.

SD(MLB talent) = 0.7 runs
SD(MLB luck)   = 4.2 runs

Now, let's do the NBA. From basketball-reference.com, the SD of the SRS rating seems to be just under 5 points. That's based on outcomes, so it's too high to be an estimate of talent, and we need to regress to the mean. Let's arbitrarily reduce it to 4 points. Combining the two teams, we're up to 5.2 points.

What about the SD of luck? This site shows that, against the spread, the SD of score differential is around 11 points. So we have

SD(NBA talent) =  5.2 points
SD(NBA luck)   = 11.0 points

In an MLB game, luck is 6 times as important as talent. In an NBA game, luck is only 2 times as important as talent. 

But, how you apply that to fix log5, I haven't figured out yet. 

What I *do* think I know is that the MLB ratio of 6:1 is large enough that you don't notice that log5 is off. (I know that from studies that have tested it and found it works almost perfectly.) But I don't actually know whether the NBA ratio of 2:1 is also large enough. My gut says it's not -- I suspect that, for the NBA, in extreme cases, log5 will overestimate the underdog enough so that you'll notice. 
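Here's a rough way to check that suspicion with the numbers above. This is just a sketch, and the model is mine, not anything from a study: treat each game's margin as the talent difference plus normal luck, give the two teams talents one SD above and below average (using the single-team SDs, roughly 0.5 runs and 4 points), and compare log5, applied to their winning percentages against a random opponent, with the true head-to-head probability.

import random

def log5(a, b):
    return (a - a * b) / (a + b - 2 * a * b)

def win_pct_vs_league(talent, talent_sd, luck_sd, trials=200_000):
    # chance of beating a randomly drawn opponent: margin = talent difference + luck
    wins = sum(talent - random.gauss(0, talent_sd) + random.gauss(0, luck_sd) > 0
               for _ in range(trials))
    return wins / trials

def head_to_head(talent_a, talent_b, luck_sd, trials=200_000):
    wins = sum(talent_a - talent_b + random.gauss(0, luck_sd) > 0
               for _ in range(trials))
    return wins / trials

random.seed(1)
for label, talent_sd, luck_sd in (("MLB-ish", 0.5, 4.2), ("NBA-ish", 4.0, 11.0)):
    good, bad = talent_sd, -talent_sd            # a +1 SD team vs a -1 SD team
    p_good = win_pct_vs_league(good, talent_sd, luck_sd)
    p_bad = win_pct_vs_league(bad, talent_sd, luck_sd)
    print(label, "log5:", round(log5(p_good, p_bad), 3),
          "actual:", round(head_to_head(good, bad, luck_sd), 3))

With these numbers, the MLB-ish case should come out nearly identical either way, while the NBA-ish case should show log5 shortchanging the favorite by a point or two of winning percentage, which is at least consistent with the prediction.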

-------

Anyway, let me summarize what I speculate is true:

1. The log5 formula never works perfectly. Only as the luck/talent ratio goes to infinity will log5 be theoretically perfect. (But, then, the predictions will always be .500 anyway.) In all other cases, log5 will underestimate, to some extent, how much the better team will dominate.

2. For practical purposes, log5 works well when luck is large compared to talent. The 6:1 ratio for a given MLB game seems to be large enough for log5 to give good results.

3. When comparing sports, the more likely it is that the more-talented team beats the less-talented team, the worse log5 will perform. In other words: the bigger the Vegas odds on underdogs, the worse log5 will perform for that sport.

4. You can also estimate how well log5 will perform with a simple test. Take a team near the extremes of the performance scale (say, a .600/.400 team in MLB, or a .750/.250 team in the NBA), and see how it performed specifically against only those teams with talent close to .500.

If a .750 team has a .750 record against teams known to be average, log5 will work great. But if it plays .770 or .800 or .900 ball against teams known to be average, log5 will not work well. 

-------

All this has been mostly just thinking out loud. I could easily be wrong.





Friday, December 04, 2015

A new "hot hand" study finds a plausible effect

There's a recent baseball study (main page, .pdf) that claims to find a significant "hot hand" effect. Not just statistically significant, but fairly large:


"Strikingly, we find recent performance is highly significant in predicting performance ... Furthermore these effects are of a significant magnitude: for instance, ... a batter who is “hot” in home runs is 15-25% more likely (0.5-0.75 percentage points or about one half a standard deviation) more likely to hit a home run in his next at bat."

Translating that more concretely into baseball terms: imagine a batter who normally hits 20 HR in a season. The authors are saying that when he's on a hot streak of home runs, he actually hits like a 23 or 25 home run talent. 

That's a strong effect. I don't think even home field advantage is that big, is it?

In any case, after reading the paper ... well, I think the study's conclusions are seriously exaggerated. Because, part of what the authors consider a "hot hand" effect doesn't have anything to do with streakiness at all.

------

The study took all player seasons from 2000 to 2011, subject to an AB minimum. Then, the authors tried to predict every single AB for every player in that span. 

To get an estimate for that AB, the authors considered:

(a) the player's performance in the preceding 25 AB; and
(b) the player's performance in the rest of the season, except that they excluded a "window" of 50 AB before the 25, and 50 AB after the AB being predicted.

To make this go easier, I'm going to call the 25 AB sample the "streak AB" (since it measures how streaky the player was). I'm going to call the two 50 AB exclusions the "window AB". And, I'm going to call the rest of the season the "talent AB," since that's what's being used as a proxy for the player's ability.

Just to do an example: Suppose a player had 501 AB one year, and the authors are trying to predict AB number 201. They'd divide up the season like this:

1. the first 125 AB (part of the "talent AB")
2. the next 50 AB (part of the "window AB")
3. the next 25 AB (the "streak AB")
4. the single "current AB" being predicted
5. the next 50 AB (part of the "window AB")
6. the next 250 AB (part of the "talent AB").

They run a regression to predict (4), based on two variables:

B1 -- the player's ability, which is how he did in (1) and (6) combined
B2 -- the player's performance in (3), the 25 "streak AB" that show how "hot" or "cold" he was, going into the current AB.

Well, not just those two -- they also include the obvious control variables, like season, park, opposing pitcher's talent, platoon, and home field advantage. 

(Why did they choose to exclude the "windows" (2) and (5)? They say that because the windows occur so close to the actual streak, they might themselves be subject to streakiness, and that would bias the results.)
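Just to make the setup concrete, here's a little sketch of that windowing scheme (the function and variable names are mine, not the paper's):

def split_at_bats(outcomes, i, streak_len=25, window_len=50):
    # Partition a season of AB outcomes around the AB being predicted (index i):
    # the 25 AB just before it are the "streak AB", the 50 AB on either side are
    # thrown away as "window AB", and everything else is the "talent AB".
    streak = outcomes[max(0, i - streak_len):i]
    talent_before = outcomes[:max(0, i - streak_len - window_len)]
    talent_after = outcomes[i + 1 + window_len:]
    return talent_before + talent_after, streak, outcomes[i]

# the 501-AB example above, predicting AB number 201 (index 200)
season = list(range(501))                 # stand-ins for per-AB outcomes
talent, streak, current = split_at_bats(season, 200)
print(len(talent), len(streak))           # 375 and 25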

What did the study find? That the estimate of "B2" was large and significant. Holding the performance in the 375 AB "ability" sample constant, the better the player did in the immediately preceding 25 "streak" AB, the better he did in the current AB.

In other words, a hot player continues to be hot!

------

But there's a problem with that conclusion, which you might have figured out already. The methodology isn't actually controlling for talent properly.

Suppose you have two players, identical in the "talent" estimate -- in 375 AB each, they both hit exactly 21 home runs.

And suppose that in the streak AB, they were different. In the 25 "streak" AB, player A didn't hit any home runs. But player B hit four additional homers.

In that case, do you really expect them to hit identically in the 26th AB? No, you don't. And not because of streakiness -- but, rather, because player B has demonstrated himself to be a better home run hitter than player A, by a margin of 25 to 21. 

In other words, the regression coefficient confounds two factors -- streakiness, and additional evidence of the players' relative talent.

-------

Here's an example that might make the point a bit clearer.

(a)  in their first 10 AB -- the "talent" AB -- Mario Mendoza and Babe Ruth both fail to hit a HR.
(b)  in their second 100 AB -- the "streak" AB -- Mendoza hits no HR, but the Babe hits 11.
(c)  in their third 100 AB -- the "current" AB -- Mendoza again hits no HR, but the Babe hits 10.

Is that evidence of a hot hand? By the authors' logic, yes, it is. They would say:

1. The two players were identical in talent, from the control sample of (a). 
2. In (b), Ruth was hot, while Mendoza was cold.
3. In (c), Ruth outhit Mendoza. Therefore, it must have been the hot hand in (b) that caused the difference in (c)!

But, of course, the point is ... (b) is not just evidence of which player was hot. It's also evidence of which player was *better*. 

-------

Now, the authors did actually understand this was an issue. 

In a previous version of their paper, they hadn't. In 2014, when Tango posted a link on his site, it took only two-and-a-half hours for commenter Kincaid to point out the problem (comment 6).  (There was a follow-up discussion too.)

The authors took note, and now realize that their estimates of streakiness are confounded by the fact that they're not truly controlling for established performance. 

The easiest way for them to correct the problem would have been just to include the 25 AB in the talent variable. In the "Player A vs. Player B" example, instead of populating the regression with "21/0" and "21/4", they could easily have populated it with "21/0" and "25/4". 

Which they did, except -- only in one regression, and only in an appendix that's for the web version only.

For the published article, they decided to leave the regression the way it was, but, afterwards, try to break down the coefficient to figure out how much of the effect was streakiness, and how much was talent. Actually, the portion I'm calling "talent" they decided to call "learning," on the grounds that it's caused by performance in the "streak AB" allowing us to "learn" more about the player's long-term ability. 

Fine, except: they still chose to define "hot hand" as the SUM of "streakiness" and "learning," on the grounds that ... well, here's how they explain it:


"The association of a hot hand with predictability introduces an issue in interpretation, that is also present but generally unacknowledged in other papers in the area. In particular, predictability may derive from short-term changes in ability, or from learning about longer-term ability. ... We use the term “hot hand” synonymously with short-term predictability, which encompasses both streakiness and learning."

To paraphrase, what they're saying is something like:


"The whole point of "hot hand" studies is to see how well we can predict future performance. So the "hot hand" effect SHOULD include "learning," because the important thing is that the performance after the "hot hand" is higher, and, for predictive purposes, we shouldn't care what caused it to be higher."

I think that's nuts. 

Because, the "learning" only exists in this study because the authors deliberately chose to leave some of the known data out of the talent estimate.

They looked at a 25/4 player (25 home runs of which 4 were during the "streak"), and a 21/0 player (21 HR, 0 during the streak), and said, "hey, let's deliberately UNLEARN about the performance during the streak time, and treat them as identical 21-HR players. Then, we'll RE-LEARN that the 25/4 guy was actually better, and treat that as part of the hot hand effect."

-------

So, that's why the authors' estimates of the actual "hot hand" effect (as normally understood outside of this paper) are way too high. They answered the wrong question. They answered,


"If a guy hitting .250 has a hot streak and raises his average to .260, how much better will he be than a guy hitting .250 who has a cold streak and lowers his average to .240?"

They really should have answered,


"If a guy hitting .240 has a hot streak and raises his average to .250, how much better will he be than a guy hitting .260 who has a cold streak and lowers his average to .250?"

--------

But, as I mentioned, the authors DID try to decompose their estimates into "streakiness" and "learning," so they actually did provide good evidence to help answer the real question.

How did they decompose it? They realized that if streakiness didn't exist at all, each individual "streak AB" should have the same weight as each individual "talent AB". It turned out that the individual "streak AB" were actually more predictive, so the difference must be due to streakiness.

For HR, they found the coefficient for the "streak AB" batting average was .0749. If a "streak AB" were exactly as important as a "talent AB", the coefficient would have been .0437. The difference, .0312, can maybe be attributed to streakiness.

In that case, the "hot hand" effect -- as the authors define it, as the sum of the two parts -- is 58% learning, and 42% streakiness.

-------

They didn't have to do all that, actually, since they DID run a regression where the Streak AB were included in the Talent AB. That's Table A20 in the paper (page 59 of the .pdf), and we can read off the streakiness coefficient directly. It's .0271, which is still statistically significant.

What does that mean for prediction?

It means that to predict future performance, based on the HR rate during the streak, only 2.71 percent of the "hotness" is real. You have to regress 97.29 percent to the mean. 

Suppose a player hit home runs at a rate of 20 HR per 500 AB, including the streak. During the streak, he hit 4 HR in 25 AB, which is a rate of 80 HR per 500 AB. What should we expect in the AB that immediately follows the streak?

Well, during the streak, the player hit at a rate 60 HR / 500 AB higher than normal. 60 times 2.71 percent equals 1.6. So, in the AB following the streak, we'd expect him to hit at a rate of 21.6 HR, instead of just 20.
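Here's that regression-to-the-mean arithmetic as a tiny helper, just a sketch, with the .0271 being the Table A20 coefficient quoted above:

def predicted_rate(base_rate, streak_rate, coefficient):
    # expected rate right after the streak: keep only a sliver of the streak's
    # deviation from the player's overall rate
    return base_rate + coefficient * (streak_rate - base_rate)

# the HR example: a 20-HR-per-500-AB hitter on an 80-HR-per-500-AB streak pace
print(round(predicted_rate(20, 80, 0.0271), 1))   # 21.6 HR per 500 AB

The same helper works for the batting average, strikeout, and walk coefficients below.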

-------

In addition to HR, the authors looked at streaks for hits, strikeouts, and walks. I'll do a similar calculation for those, again from Table A20.

Batting Average

Suppose a player hits .270 overall (except for the one AB we're predicting), but has a hot streak where he hits .420. What should we expect immediately after the streak?

The coefficient is .0053. 150 points above average, times .0053, is ... less than one point. The .270 hitter becomes maybe a .271 hitter.

Strikeouts

Suppose a player normally strikes out 100 times per 500 AB, but struck out at double that rate during the streak (which is 10 K in those 25 AB). What should we expect?

The coefficient is .0279. 100 rate points above average, times .0279, is 2.79. So, we should expect the batter's K rate to be 102.79 per 500 AB, instead of just 100. 

Walks

Suppose a player normally walks 80 times per 500 PA, but had a streak where he walked twice as often. What's the expectation after the streak?

The coefficient here is larger, .0674. So, instead of walking at a rate of 80 per 500 PA, we should expect a walk rate of 85.4. Well, that's a decent effect. Not huge, but something.

(The authors used PA instead of AB as the basis for the walk regression, for obvious reasons.)

---------

It's particularly frustrating that the paper is so misleading, because, there actually IS an indication of some sort of streakiness. 

Of course, for practical purposes, the size of the effect means it's not that important in baseball terms. You have to quadruple your HR rate over a 25 AB streak to get even an 8 percent increase in HR expectation in your next single AB. At best, if you double your walk rate over a hot streak, your walk expectation goes up about 7 percent.

But it's still a significant finding in terms of theory, perhaps the best evidence I've ever seen that there's at least *some* effect. It's unfortunate that the paper chooses to inflate the conclusions by redefining "hot hand" to mean something it's not.




(P.S.  MGL has an essay on this study in the 2016 Hardball Times. My book arrived last week, but I haven't read it yet. Discussion here.)





Tuesday, October 06, 2015

Does vaping induce teenagers to become smokers?

Do electronic cigarettes lead users into smoking real cigarettes? In other words, is vaping a "gateway activity" to smoking?

A recent study says that, yes, vapers are indeed more likely to become smokers than non-vapers are. In fact, they're *four times* as likely to do so. 

The study looked at a sample of young people aged 16 to 26 who said they didn't intend to become smokers. When they caught up with them a year later, only 9.6 percent of the non-vapers had smoked in the past year. But 37.5 percent of the vapers had!

Seems like pretty strong evidence, right? The difference was certainly statistically significant.

Except ... here's an article from FiveThirtyEight that suggests that, no, this is NOT strong evidence that vaping leads to smoking. Why not? Because the sample size was very small. The vaping group comprised only 16 participants, compared to 678 for the control group.

Vaping:      6/ 16  (37.5%)
Non-Vaping: 65/678   (9.6%)

FiveThirtyEight says,

"Voila, six out of 16 makes 37.5 percent — it’s a big number that comes from a small number, which makes it a dubious one. 
So because six people started smoking, news reports alleged that e-cigs were a gateway to analog cigs."

Well, I have some sympathy for that argument, but ... just a little. Statistical significance does adjust for sample size, so, in effect, the data does actually say that the sample size issue isn't that big a deal. To argue that 16 people isn't enough, you need something other than a "gut feel" argument. For instance, you could hypothesize that 16 vapers out of 694 people is a lower incidence of vaping than in the general population, and, therefore, you're getting only "out of the closet vapers" self-identifying, which makes the 16 vapers unrepresentative. 

But, the article doesn't make any arguments like that.
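For what it's worth, the significance claim is easy to verify. Here's a sketch using a Fisher exact test (via scipy), which stays valid even with only 16 vapers in the sample:

from scipy.stats import fisher_exact

# went on to smoke / didn't, among vapers and non-vapers
table = [[ 6,  10],    # vapers:      6 of  16
         [65, 613]]    # non-vapers: 65 of 678

odds_ratio, p_value = fisher_exact(table)
print(round(odds_ratio, 1), p_value)   # odds ratio about 5.7, p-value well below .05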

------

The FiveThirtyEight story tries to make the case that the study, and the press release describing it, are biased, because they're too overconfident about a sample that's too small to draw any conclusions. 

I don't agree with that, but I DO agree that there's bias. A much, much worse bias, one that's obvious when you think about it, but one that has nothing to do with the actual statistical techniques. 

What's the actual problem? It's that the whole premise is mistaken. Comparing vapers to non-vapers is NOT evidence for whether vaping entices young people into smoking. Not at all. Even with a huge sample size. Even if you actually counted everyone in the world, and it turned out that vapers were five times as likely to become smokers as non-vapers, that would NOT imply that vaping leads to smoking, and it would NOT imply that banning vaping would "protect our youth" from the dangers of smoking real cigarettes.

It could even be that, despite vapers being five times as likely to take up smoking, vaping actually *reduces* the incidence of smoking.

How? Well, suppose that vapers and smokers are the same "types" of people, those who want to send a signal that they're risk-takers and nonconformists. Before, they all took up smoking. Now, some take up smoking and some vaping. Sure, some of the vapers become smokers later. But, overall, you could easily have fewer smokers than before you started. 


"What do I think? A vaper is in denial. It’s not the vaping itself that causes you to become a smoker, but simply that a smoker is a closet-vaper. 
"This is likely true of most vices. It won’t act as a gateway, but simply that you will try it because you were going to try to harder stuff anyway. Even if you didn’t want to admit it. 
" ... There’s a dozen ways to get from Chinatown to Times Square. Manhattan then adds a direct bus line that goes up Broadway. Does that bus “cause” people to go from Chinatown to Times Square? Or, does it simply become a stepping stone that they would have otherwise bypassed? 
"Basically, do the same number of people end up going Chinatown to Times Square? 
"Do the same number of people end up smoking the real stuff anyway? All vaping is doing is redirecting the flow of people?"

------

If that sounds too abstract in words, it'll become crystal clear if we just change the context, but leave the numbers and arguments the same.

"Ignore The Headlines: We Don’t Know If Suicide Hotlines Lead Kids to Kill Themselves.
"After a year, 37.5 percent of those who had called a Suicide Hotline had gone on to end their own lives. That's a big percentage when you consider that the suicide rate was only 9.6 percent among respondents who hadn’t called the hotline.  
"Our study identified a longitudinal association between suicide hotline use and progression to actual suicide, among adolescents and young adults. Especially considering the rapid increase and promotion of distress lines, these findings support regulations to limit suicide hotlines and decrease their appeal."

It's exactly the same thing! Really. I edited a bit, but most of the words come exactly from the original articles on vaping.

Now, you could argue: well, it's not REALLY the same thing. We know that suicide hotlines decrease suicide, but, come on, can you really believe that vaping reduces smoking?

To which I answer: absolutely. I *do* believe that vaping reduces smoking. If you believe differently, then, study the issue! This particular study doesn't provide evidence either way.

And, more importantly: "can you really believe?" is not science, no matter how incredulously you say it.

------

Logically and statistically, the relevant number is NOT what percentage of vapers (hotline callers) go on to smoke (commit suicide). The relevant number is, actually, how many people would go on to smoke (commit suicide) if vaping (suicide hotlines) did not exist. 

Why is this not as obvious in the vaping case as in the hotline case? Because of bias against vaping. No other reason. The researchers and doctors start out with the prejudice that vaping is a bad thing, and, because of confirmation bias, interpret the result as, obviously, supporting their view. It seems so obvious that they don't even consider any other possibility.

I bet it's not just vaping and suicide hotlines. I suspect that we'd be eager to accept the "A leads to more bad things than non-A" if we're against A, but we see it's obviously a ridiculous argument if we approve of A. Here are a few I thought of:

"37% of teenagers who play hockey went on to commit assault, as compared to only 9% who didn't play hockey. Therefore, hockey is a gateway to violence, and we need to limit access to hockey and make it less appealing to adolescents." 
"37% of teenagers who use meth go on to commit crimes, as opposed to only 9% who didn't use meth. Therefore, meth is a gateway to criminal behavior, and we need to limit access to meth and make it less appealing to adolescents." 
"37% of patients who get chemotherapy go on to die of cancer, as opposed to only 9% of patients who don't get chemo. Therefore, chemotherapy leads to cancer, and we need to limit access to chemo and make it less appealing to oncologists." 
"37% of men who harass women at work go on to commit at least one sexual assault in the next ten years. This shows that harassment is a precursor to violence, and we need to take steps to reduce it in society."

If you're like me, in the cases of "bad" precursors -- meth and harassment and vaping -- the arguments seem to make sense. But, in the cases of "good" precursors -- hockey and chemotherapy and suicide prevention lines -- the conclusions seem obviously, laughably, wrong.

It's all just confirmation bias at work.

-------

The FiveThirtyEight piece references one of their other posts, titled: "Science Isn’t Broken.  It’s just a hell of a lot harder than we give it credit for."

In that piece, they give several reasons for why so many scientific findings turn out to be false. They mention poor peer review, "p-hacking" results, and failure to self-correct.

Those may all be happening, but, in my opinion, it's much less complicated than that. 

It's just bad logic. It's not as obvious as the bad logic in this case, but, a lot of the time, it's just errors in statistical reasoning that have nothing to do with confidence intervals or methodology or formal statistics. It's a misunderstanding of what a number really means, or a reversal of cause and effect, or an "evidence of absence" fallacy, or ... well, lots of other simple logical errors, like this one.

Regular readers of this blog should not be too surprised by my diagnosis here: most of the papers I've critiqued here suffer from that kind of error, the kind that's obvious only after you catch it. 

FiveThirtyEight writes:

"Science is hard — really f*cking hard."

But, no. It's *thinking straight* that's hard. It's being unbiased that's hard. It must be. There were hundreds of people involved in that vaping study -- scientists, FiveThirtyEight writers, doctors, statisticians, public policy analysts, editors, peer reviewers, anti-smoking groups -- and NONE of them, as far as I know, noticed the real problem: that the argument just doesn't make any sense.




Hat Tip: Tom Tango, who figured it out.



Tuesday, September 01, 2015

Consumer Reports on auto insurance, part IV

(Previous posts: part I; part II; part III)


Consumer Reports' biggest complaint is that insurance companies set premiums by including criteria that, according to CR, don't have anything to do with driving. The one that troubles them the most is credit rating:


"We want you to join forces with us to demand that insurers -- and the regulators charged with watching them on ouir behalf -- adopt price-setting practices that are more meaningfully tethered to how you drive, not to who they think you are. ..."

"In the states where insurance companies don't use credit information, the price of car insurance is based mainly on how people actually drive and other factors, not some future risk that a credit score 'predicts'. ..."

"... an unfair side effect of allowing credit scores to be used to set premium prices is that it effectively forces customers to dig deeper into their pockets to pay for accidents that haven't happened and may never happen."

----

Well, you may or may not agree on whether insurers should be allowed to consider credit scores, but, even if CR's conclusion is correct, their argument still doesn't make sense.

First: the whole idea of insurance is EXACTLY what CR complains about:  to "pay for accidents that haven't happened and may never happen." I mean, that's the ENTIRE POINT of how insurance works -- those of us who don't have accidents wind up paying for those of us who do. 

In fact, we all *hope* that we're paying for accidents that don't happen and may never happen! It's better if we don't suffer injuries, and our car stays in good shape, and our premium stays low. 

Well, maybe CR didn't actually mean that literally. What they were *probably* thinking, but were unable or unwilling to articulate explicitly, is that they think credit scores are not actually indicative of car accident risk -- or, at least, not correlated sufficiently to make the pricing differences fair.

But, I'm sure the insurance industry could demonstrate, immediately, that credit history IS a reliable factor in predicting accident risk. If that weren't true, the first insurance company to realize that could steal all the bad-credit customers away by offering them big discounts!

It's possible, I guess, that CR is right and all the insurance companies are wrong. Since it's an empirical question ... well, CR, show us your evidence! Prove to us, using actual data, that bad-credit customers cause no more accidents than their neighbors with excellent credit. If you can't do that, maybe show us that the bad-credit customers aren't as bad as the insurers think they are. Or, at the very, very least, explain how you figured out, from an armchair thought experiment and without any numbers backing you up, that insurance company actuaries are completely wrong, and have been for a long time, despite having the historical records of thousands, or even millions, of their own customers.

------

Just using common sense, and even without data, it IS reasonable that a better credit rating should predict a lower accident rate, holding everything else equal. You get better credit by paying your bills on time, and not overextending your finances -- both habits that demonstrate a certain level of reliability and conscientiousness. And driving safely requires ... conscientiousness. It's no surprise at all, to me, that credit scores are predictive, to some extent, of future accident claims.

And CR's own evidence supports that! As I mentioned, the article lauds USAA as being the cheapest, by far, of the large insurers they surveyed. 

But USAA markets to only a subset of the population. As Brian B. wrote in the comments to a previous post:


"[USAA insurance is available only to] military and families. So their demographics are biased by a subset of hyper responsible and conscientious professionals and their offspring."

Consumer Reports did, in fact, note that USAA limits its customers selectively. But they didn't bother demanding that USAA raise its rates, or stop unfairly judging military families by "what they think they are" -- more conscientious than average.

-----

Not only does CR not bother mentioning the possibility that drivers with bad credit scores might actually be riskier drivers, they don't even hint that it ever crossed their minds. They seem to stick to the argument that nothing can possibly "predict" future risk except previous driving record. They even put "predict" in scare quotes, as if the idea is so obviously ludicrous that this kind of "prediction" must be a form of quackery.

Except when it's not. In the passage I quoted at the beginning of this post, they squeeze in a nod to "other factors" that might legitimately affect accident risk. What might those factors be? From the article, it seems they have no objection to charging more to teenagers. Or, to men. They never once mention the fact that female drivers pay less than males -- which, you'd think, would be the biggest, most obvious non-driving factor there is.

CR demands that I be judged "not by who the insurance companies think I am!" Unless, of course, I'm young and male, in which case, suddenly it's OK.

Why is it not a scandal that I pay more just for being a man? I may not be the aggressive testosterone-fueled danger CR might "think I am."  If I'm actually as meek as the average female, the insurer is going to "profit from accidents I may never have!"

------

I suspect they're approaching the question from a certain moral standard, rather than empirical considerations of the actual risk. It just bugs them that the big, bad insurance companies make you pay more just for having worse credit. On the other hand, men are aggressive, angry, muscle-car loving speeders, and it's morally OK for them to get punished. As well as young people, who are careless risk-takers who text when they drive.

A less charitable interpretation is that CR is just jumping to the conclusion that higher prices are unjustified, even when based on statistical risk, when they affect "oppressed" groups, like the poor -- but OK when they favor "oppressor" groups, like men. (Recall that CR also complained about "good student" discounts because they believe those discounts benefit wealthier families.)

A more charitable interpretation might see CR's thinking as working something like this:

-- It's OK to charge more to a certain group where it's obvious that they generally have a higher risk. Like, teenage drivers, who don't have much experience. Common sense suggests, of course they get into more accidents.

-- Higher rates are like a "punishment," and it's OK, and even good, to punish people who do bad things. People who have at-fault accidents did something bad, so their rates SHOULD go up, to teach them a lesson! As CR says,

"In California, the $1,188 higher average premium our single drivers had to pay because of an accident they caused is a memorable warning to drive more carefully. ... In New York, our singles received less of a slap, only $429, on average."

-- It's OK for men to pay more than women because psychologists have long known that men are more aggressive and prone to take more risks.

-- But it's *not* OK to charge more for someone in a high-risk group when (a) there's no proof that they're actually, individually, a high risk, and (b) the group is a group that CR feels has been unfairly disadvantaged already. Just because someone has bad credit doesn't mean they're a worse driver, even if, statistically, that group has more accidents than others. Because, maybe a certain driver has bad credit because he was conned into buying a house he couldn't afford. First, he was victimized by greedy bankers and unscrupulous developers ... now he's getting conned a second time, by being gouged for his auto policy, even though he's as safe as anyone else!


If CR had actually come out and said this explicitly, and argued for it in a fair and unbiased fashion, maybe I would change my mind and come to see that CR is right. But ... well, that doesn't actually seem to be what CR is arguing. They seem to believe that their plausible examples of bad credit customers with low risk are enough to prove that the overall correlation must be zero!

When a certain model of car requires twice as many repairs as normal, CR recommends not buying it. But when a certain subset of drivers causes twice as many accidents as average, CR not only suggests we ignore the fact -- they even refuse to admit that it's true!

------

Here's a thought experiment to see how serious CR is about considering only driving history.

Suppose an insurer decided to charge more for drivers who don't wear a helmet when riding a bicycle, based on evidence that legitimately shows that people who refuse to wear bicycle helmets are more likely to refuse to wear seatbelts.

But, they note, it's not a perfect correlation. I, for instance, am an exception. I don't wear a bicycle helmet, but I wouldn't dream of driving without a seatbelt (and I might even be scared to drive a car without airbags). 

Would CR come to my defense, demanding that my insurer stop charging me extra?  Would they insist they judge me by how I drive, not by "who they think I am" based on my history of helmetlessness?

I doubt it. I think they'd be happy that I'm being charged more. I think it's about CR judging which groups "deserve" higher or lower premiums, and then rationalizing from there.

(If you want to argue that bicycling is close enough to driving that this analogy doesn't work, just substitute hockey helmets, or life jackets.)

------

I'm not completely unsympathetic to CR's position. They could easily make a decent case.  They could say, "look, we know that drivers with bad credit cause more accidents, as a group, than drivers with good credit. But it seems fundamentally unfair, in too many individual cases, to judge people by the characteristics of their group, and make them pay twice as much without really knowing whether they fit the group average."

I mean, if they said that about race, or religion, we'd all agree, right? We'd say, yeah, it DOES seem unfair that a Jewish driver like Chaim pays less (or more) than a Muslim driver like Mohammed, just because his group is statistically less (or more) risky. 

But, what if it's actually the case that, statistically, one group causes more accidents than the other? We tell the insurance companies, look, it's not actually because of religion that the groups are different. It must be just something that correlates to religion, perhaps by circumstance or even coincidence.  So, stop being so lazy.  Instead of deciding premiums based on religion, get off your butts and figure out what's actually causing the differences! 

Maybe the higher risk is because of what neighborhoods the groups tend to live in, that some neighborhoods have narrower streets and restricted sightlines that lead to more accidents. Shouldn't the insurance company figure that out, so that if they find that Chaim (or Mohammed) actually lives in a safer neighborhood, they can set his premium by his actual circumstances, instead of his group characteristics, which they will now realize don't apply here?  That way, fewer drivers will be stuck paying unfairly high or low premiums because of ignorance of their actual risk factors.

If that works for religion, or race, it should also work for credit score. Can't the insurance companies do a bit more work, and drill down a bit more, to figure out who has bad credit out of irresponsibility, and who has bad credit because of circumstances out of their control?

Yes! And, I'd bet, the insurance companies are already doing that! Their profits depend on getting risk right, and they can't afford to ignore anything that's relevant, lest other insurers figure it out first, and undercut their rates.

And CR actually almost admits that this is happening. Remember, the article tells us that the insurers aren't actually using the customer's standard credit score -- they're tweaking parts of it to create their own internal metric. CR tells us this only to complain about it -- it's arbitrary, and secret! -- but it might actually be the way the insurers make premiums more accurate, and therefore fair. It might be the way insurers make it less likely that a customer will be forced to pay higher premiums for "accidents that may never happen."

-----

But I don't think CR really cares that premiums are mathematically fair. Their notion of fairness seems to be tied to their arbitrary, armchair judgments about who should be paying what. 

I suspect that even if the insurance companies proved that their premiums were absolutely, perfectly correlated with individual driving talent, CR would still object. They don't have a good enough understanding of risk -- or a willingness to figure it out.

A driver's expected accident rate isn't something that's visible and obvious. It's hard for anyone but an actuary to really see that Bob is likely to have an accident every 10 years, while Joe is likely to have an accident every 20. To an outsider, it looks arbitrary, like Bob is getting ripped off, having to pay twice as much as Joe for no reason. 

The thing is: some drivers really ARE double the risk. But, because accidents are so rare, their driving histories look identical, and there doesn't seem to be any reason to choose between them. But, often, there is.

If you do the math, you'll see that a pitcher who retires batters at a 71% rate is at more than double the "risk" of pitching a perfect game than a pitcher with only a 69% rate. But, in their normal, everyday baseball card statistics, they don't look that much different at all -- just a two percentage point difference in opposition OBP.
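The perfect-game arithmetic, if you want to check it (a two-percentage-point edge, compounded 27 times):

# chance of retiring 27 straight batters, at a 71% out rate vs. a 69% out rate
print(round(0.71 ** 27 / 0.69 ** 27, 2))   # 2.16, more than double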

I think a big part of the problem is just that luck, risk, and human behavior follow rules that CR isn't willing to accept -- or even try to understand.





