Tuesday, November 25, 2008

Is racism getting worse? Show us some data!

Seventeen months ago, in the light of several high-profile shootings, the Government of Ontario commissioned a study on young people and crime. The study is now out. Authored by two high-profile former politicians, Roy McMurtry and Alvin Curling, it's called "Roots of Youth Violence," runs to 500 pages, and cost some $2,000,000.

Before I start, I should say that I haven't read the actual study; I'm going by news reports such as these two. But if those articles are correct, the report is badly flawed.

The part of the paper that got most of the publicity is its finding that racism in Ontario is getting worse – that we Ontarians are more prejudiced than ever. The money quote seems to be

"Racism is worse than it was a generation ago."

The authors concluded that youth violence is caused by poverty, racism, and untreated mental health problems. But the articles I've read mention that only in passing – their main focus is on the "racism is worse" claim.

Now, it seems to me that any quantitative comparison between racism now and racism 30 years ago is a question of fact. That means that you're pretty much expected to

-- define racism;
-- show how you'll measure it;
-- measure it now;
-- measure it then;
-- and show how the two measurements compare.

I don't think that's just me – that's how we do it in sabermetrics, and I'm sure that's how they do it in academia. Your conclusions have to be supported by data and logic.

Of course, there will always be differences of opinion about what racism is and how best to measure it. No matter how the report does it, there'll probably be room for reasonable people to disagree, and dispute the findings.

But, as far as I can tell, the authors didn't measure racism at all. They seem to have made absolutely no effort to quantify anything. If the news reports are accurately portraying the study, the results were pulled out of thin air.

To quote Dan Gardner, from the first article,

"... extraordinary claims require extraordinary evidence. ... So what evidence did McMurtry and Curling provide?"

Apparently just anecdote:

"... they talked to people who told them, presumably, of their experiences and perceptions."

As Gardner points out, it's always good to hear what people have to say. But anecdote isn't evidence. More importantly, you can't compare levels of racism unless you count or measure something. And it doesn't seem like the authors did any of that. At best, they reference existing studies and random quotes.

For instance, they reference a study of university students in Peterborough (Ont.) that found

"85% of respondents reported racism in public places ... the same percentage of them experienced racism in downtown bars. Around 75% of them reported racism in university residences."

"Of all the public/private spaces studied, none was found to be racism-free," the report concluded.

Even if all this is absolutely true (although, as the newspaper article notes, the numbers are "off-the-charts astounding"), what does it really tell us? Not a whole lot. Suppose we take the findings at face value. What does it mean if 85% of people found racism in public places? Well, it depends, doesn't it, on how much interaction you have with the outside world. If everyone encounters 1900 unique other people in public, and only one-tenth of 1% of people are racist, you'll get almost exactly the reported number – 85% of people will have encountered at least one racist person (1 minus .999 to the power of 1900). What if they encounter only 500 people? Then about 0.4% of the population would have to be racist to produce the same 85%.

(Notice that even with the 85% figure, the actual number of racists can be pretty low. There are hundreds of bad things in life that have happened to 85% of us, but are nonetheless not common enough to panic about.)

So how racist is Peterborough? If you're a visible minority student responding to the survey, how much do you get out? Because it makes a huge difference – in my example, racism could be four times as high in one calculation as the other.
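The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumption as the example: every encounter is independent, and each person met is racist with the same fixed probability. The function names are mine, not anything from the report.

```python
def share_encountering(p_racist, n_people):
    """Chance of meeting at least one racist person among n_people
    independent encounters, each racist with probability p_racist."""
    return 1.0 - (1.0 - p_racist) ** n_people


def required_rate(target_share, n_people):
    """Racist share of the population needed so that target_share of
    respondents meet at least one racist person in n_people encounters."""
    return 1.0 - (1.0 - target_share) ** (1.0 / n_people)


# 1900 encounters, one-tenth of 1% of people racist
share_1900 = share_encountering(0.001, 1900)

# Same 85% of respondents, but only 500 encounters each
rate_500 = required_rate(0.85, 500)

print(f"share reporting racism with 1900 encounters: {share_1900:.3f}")
print(f"racist share needed with 500 encounters: {rate_500:.3%}")
print(f"ratio of the two racist shares: {rate_500 / 0.001:.1f}")
```

The first figure comes out at about 0.85, and the second at a little under 0.4%, so the same survey result is consistent with racist shares that differ by a factor of nearly four.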

Needless to say, the study doesn't appear to address this issue. Nor does it talk about how the current 85% compares to whatever the number might have been back in the 70s. Small-town Ontario might be different from, say, the US South, but as bad (or not bad) as racism might be today, I'd bet that thirty years ago in Mississippi, the proportion of blacks reporting racism in public would have been a lot higher than 85%.

Another part of the report talks about how visible minority persons in Ontario are much more likely to be lower-income than whites. Now, I knew that was the case, and I bet you did too. The question is, can you simply assume that the cause is racism? One alternative, and completely plausible, explanation (as Gardner points out) is that ethnic minorities are more likely to be recent immigrants, and recent immigrants tend to be poor.

There are other possible explanations, and you've probably seen those too. There is an extensive literature, and debate, on this topic. But apparently the authors don't consider that at all! Which is absolutely ridiculous, to wade into an ongoing debate and support the most naive, simplistic view without even acknowledging the existence of any other work on the subject.

It's like ... it's like hiring someone to investigate whether smoking causes cancer, and he says, "well, my granddaddy smoked, and he lived to be 100, so smoking is fine." The problem is not that you're wrong – the problem is that the taxpayers funded you for a year and a half so you could investigate an important public-policy issue, and you didn't even bother to do any research or question your own opinions. And, indeed, the report not only doesn’t appear to have any of this kind of analysis, but it even draws quantitative conclusions in the utter absence of quantitative data!

Shouldn't a government-funded, heavily-publicized study have to undergo at least as much peer review as an obscure academic paper?


Sunday, November 23, 2008

"How to Score" -- a soccer sabermetrics book

When I was in London a few months ago, I stumbled across an interesting soccer sabermetrics book. It's called "How to Score," by Ken Bray, and was originally published in 2006.

Actually, the book is only partly sabermetrics – most of it seems to be physics and strategy. But there's a substantial amount of sabermetric analysis in it. I'm about halfway through, so I haven't hit the meat yet, but here are a few things I learned from the introduction:

-- only 20 percent of goals scored result after four or more passes in the attacking zone (far third of the field).

-- 60 percent of goals result when the scoring team gained possession in the offensive zone (as opposed to bringing the ball in from outside without the defensive team intervening with a touch).

-- in the professional game, approximately one goal results from every ten shots.

-- in a 90-minute game, a midfielder runs about 6 miles.

I'll post more, and perhaps give a full review, as I finish reading.


Tuesday, November 18, 2008

New issues of "By the Numbers" now available

Two new issues of "By the Numbers" (SABR's statistical analysis newsletter, which I edit) have just been released. They're available at my website.


Sunday, November 16, 2008

JQAS: a rudimentary examination of ball-strike counts

A new issue of JQAS came out recently, and there's a baseball paper in it called "Slugging Percentage in Differing Baseball Counts," by Tharemy Hopkins and Rhonda C. Magel.

The paper compares slugging percentages on various ball-strike counts. Rather than just downloading a bunch of Retrosheet data and figuring it out, the authors watched games on TV. Those games took place between March 20, 2008, and April 20, 2008 (which means they must have included spring training games?), and led to the classification of 1260 at-bats (the authors say 1260 games, but that's obviously an error).

Rather than give the results for all possible counts, the authors divided the data into "pitcher's counts," "batter's counts," and "neutral counts". They did this by "communicat[ing] with 24 individuals who have had extensive baseball experience, including coaches, players, umpires, and sports writers." Eventually they settled on 0-0, 1-1, 2-1, and 3-2 being neutral counts, 2-2 being a pitcher's count, and all others being batter's or pitcher's depending on whether there were more strikes or more balls.

Their results for the 1260 at-bats were:

.5109 SLG in 505 neutral-count AB
.3233 SLG in 566 pitcher-count AB
.5753 SLG in 189 hitter-count AB

These numbers add up to exactly 1,260, which suggests that the authors considered only the count in which the AB was resolved. As has been noted many times, this is less useful than counting the results when the plate appearance *passes through* that count, regardless of where the count eventually ended up.
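To make the distinction concrete, here is a small Python sketch of the two ways of tallying. The four at-bats are invented purely for illustration (they are not from the paper or from Retrosheet), and the function name is my own.

```python
from collections import defaultdict

# Hypothetical at-bats: (counts passed through, in order; total bases on the last pitch)
at_bats = [
    (["0-0", "0-1", "1-1"], 2),          # double on 1-1
    (["0-0"], 0),                        # out on the first pitch
    (["0-0", "1-0", "1-1", "1-2"], 0),   # strikeout on 1-2
    (["0-0", "1-0", "2-0"], 4),          # home run on 2-0
]


def slg_by_count(at_bats, passes_through):
    """Slugging percentage by count.

    passes_through=False credits each AB only to the count it ended on;
    passes_through=True credits it to every count it went through.
    """
    bases = defaultdict(int)
    ab_totals = defaultdict(int)
    for counts, total_bases in at_bats:
        credited = counts if passes_through else counts[-1:]
        for count in credited:
            bases[count] += total_bases
            ab_totals[count] += 1
    return {count: bases[count] / ab_totals[count] for count in ab_totals}


resolved = slg_by_count(at_bats, passes_through=False)
through = slg_by_count(at_bats, passes_through=True)

print("resolved on 0-0:", resolved["0-0"])
print("passing through 0-0:", through["0-0"])
```

Under the "resolved" method, the at-bats credited to all counts sum to the number of at-bats, which is what the 505 + 566 + 189 = 1,260 total suggests the authors did. Under the "passes through" method, every at-bat counts toward 0-0, so the totals exceed the number of at-bats.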

Because of the small sample size, the authors found no significant difference between the neutral count and hitter's count, but did find a difference between neutral and pitcher's count.

I believe there have been many studies that have examined the implications of ball-strike counts in more detail with larger sample sizes. One of mine is on page 4 here (.pdf). Let me know of any others and I'll add more links here.

UPDATE: Tom Tippett study here, courtesy Studes in the comments. Also see Tango's second comment for a few more links.


Wednesday, November 05, 2008

Does last year's doubles predict this year's doubles?

In a recent JQAS, there's a paper by David Kaplan called "Univariate and Multivariate Autoregressive Time Series Models of Offensive Baseball Performance: 1901-2005."

The idea is to check whether you can predict one year's MLB offensive statistics from those of previous years. Suppose that doubles have been on the rise in the past few seasons. Does that mean they should continue to rise this year? How well can we predict this year from the past few years?

One way to answer this question is by just graphing doubles and looking at the chart. If the chart looks mildly zig-zaggy and random [note: "zig-zaggy" is not officially a statistical term], then it looks like you won't be able to make a decent prediction. On the other hand, if the plot of doubles looks like long gentle waves up and down, then it would look like trends tend to extend over a number of years. Finally, if the graph is really, really, really zig-zaggy, it could be that a high-doubles year is often followed by a low-doubles year, and that would also help you to predict.

(As it turns out, Figure 1 of the paper shows that doubles follow gentle waves, which means a high-doubles season tends to be followed by another high-doubles season. Actually, that tends to be the case for all the statistics in the paper. Sorry to give away the ending so early.)

Of course, the paper uses a more formal statistical approach, "time series analysis." I don't understand it fully – some of the paper reads like a textbook, explaining the methodology in great detail – but I did take one undergrad course in this stuff a long time ago, so I think I know what's going on, kind of. But someone please correct me if I don't.

One thing Kaplan does to start is to figure out how many previous seasons to use to predict the next one. If you know last year's doubles, that helps to predict next year's. But what if you have last year's *and* the year before? Does that help you make a better prediction? The answer, perhaps not surprisingly, is no: if you know last year's doubles, that's enough – you don't get any more accuracy, at least to a statistically-significant degree, by adding in more previous seasons.

So this is the point where I get a bit lost. Once you know that you only need one previous season, why not just run a regular regression on the pairs of seasons, and get your answer that way? I'm assuming that the time series analysis pretty much does just that, but in a more complicated way.

Another thing that Kaplan does is "differencing". That means that instead of using the actual series of doubles – say, 100, 110, 130, 125 – he calculates the *differences* between the years and uses that as his series. That gives him +10, +20, -5 as his series. Why does he do that? To make the series exhibit "stationarity," the definition of which is in the paper (and was in my brain for at least a week or two back in 1987).
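For anyone who wants to see the mechanics, first differencing is a one-liner. This is just a sketch of the basic operation on the made-up series above, not Kaplan's actual procedure; the function name is mine.

```python
def first_differences(series):
    """Year-over-year changes: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


doubles = [100, 110, 130, 125]
diffs = first_differences(doubles)
print(diffs)  # [10, 20, -5]
```

Note that the differenced series is one element shorter than the original, since the first year has nothing to be compared against.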

To my mind, the differencing defeats the purpose of the exercise – when you difference, you wind up measuring mostly randomness, rather than the actual trend.

A year's total of doubles can decompose into two factors: an actual trend towards hitting doubles, and a random factor.

Random factors tend not to repeat. And that means a negative change will tend to be followed by a positive change the next year, so you'll get a negative correlation from the randomness.

For instance, suppose that doubles skill is rising, from 100 to 102 to 104. And suppose that the league does in fact hit exactly the expected number of doubles in the first and third year. That means the series is:

100, ?, 104

No matter what happens the second year, the first and second difference have to sum to +4. If lots of doubles are hit the second year – 110, say -- you wind up with differences of +10 and –6. If few doubles are hit the second year – 96, say – you wind up with differences of –4 and +8. No matter what you put in for the question mark, the sum is +4. And that means that a positive difference will be followed by a negative one, and vice-versa.

So, just because of random chance, there will be a *negative* correlation between one difference and the next.
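A quick simulation shows the effect. This is a sketch under assumptions I'm making up for illustration: true doubles skill rises by exactly 2 per year, and each season's total is that skill plus independent normal noise with a standard deviation of 5. None of those numbers come from the paper.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)
n_seasons = 20000  # far more seasons than baseball has, to get a stable estimate

# Steady upward trend in skill, plus independent random noise each season
observed = [100 + 2 * t + random.gauss(0, 5) for t in range(n_seasons)]

# Difference the series, then correlate each difference with the next one
diffs = [later - earlier for earlier, later in zip(observed, observed[1:])]
r = pearson(diffs[:-1], diffs[1:])
print(f"correlation between consecutive differences: {r:.3f}")
```

With purely independent noise around a steady trend, the correlation between consecutive differences should come out near -0.5, because each season's random component shows up with a plus sign in one difference and a minus sign in the next. The real coefficients are closer to zero than that, which is what you'd expect if part of each year's change is a real shift that persists.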

And that's exactly what Kaplan finds. Here are his coefficients for the various offensive statistics, from Table 3:

-0.332 HR
-0.202 2B
-0.236 RBI
-0.180 BB
-0.027 SB

So I think this part of Kaplan's study doesn't really tell us anything we didn't already know: when a season unexpectedly has a whole bunch of doubles (or home runs, or steals, or ...), we should expect it to revert to the mean somewhat the next year.

Kaplan then proceeds to the time-series equivalent of multiple regression, where he tries to predict one of the above five statistics based on all five of their values (and R, for a total of six) the previous year.

He finds pretty much what you'd expect: for the most part, the best predictor of this year's value is last year's value of the same statistic. All of the this year/last year pairs were statistically significant, except RBI/RBI, which was only significant at the 10% level.

However, to predict RBIs, it turned out that last year's *doubles* was significant at the 3.5% level. Kaplan does mention that, but doesn't explain why or how that happens. My guess is that it's just random – it just happened that doubles and RBIs were correlated in such a way that doubles took some of the predictive power that would otherwise have gone to RBIs.

Indeed, there were 25 pairs of "last year/this year for a different statistic" in the study. With 25 possibilities, there's a good chance that at least one of them would show 1-in-20 significance for spurious reasons – and I bet that's what's happening with those doubles.
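Here's the back-of-the-envelope version of that calculation in Python. It assumes the 25 tests are independent of each other, which they almost certainly aren't (the statistics are correlated), so treat it as a rough guide rather than an exact figure. The function name is mine.

```python
def chance_of_any_false_positive(n_tests, alpha):
    """Chance that at least one of n_tests independent tests comes up
    'significant' at level alpha when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


p_any = chance_of_any_false_positive(25, 0.05)
expected_false_positives = 25 * 0.05

print(f"chance of at least one spurious result: {p_any:.2f}")
print(f"expected number of spurious results: {expected_false_positives:.2f}")
```

Under that independence assumption, the chance of at least one spurious 1-in-20 result among 25 tries is a bit over 70%, and you'd expect a little more than one such result on average.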