Wednesday, November 05, 2008

Does last year's doubles predict this year's doubles?

In a recent JQAS, there's a paper by David Kaplan called "Univariate and Multivariate Autoregressive Time Series Models of Offensive Baseball Performance: 1901-2005."

The idea is to check whether you can predict one year's MLB offensive statistics from those of previous years. Suppose that doubles have been on the rise in the past few seasons. Does that mean they should continue to rise this year? How well can we predict this year from the past few years?

One way to answer this question is by just graphing doubles and looking at the chart. If the chart looks mildly zig-zaggy and random [note: "zig-zaggy" is not officially a statistical term], then it looks like you won't be able to make a decent prediction. On the other hand, if the plot of doubles looks like long gentle waves up and down, then it would look like trends tend to extend over a number of years. Finally, if the graph is really, really, really zig-zaggy, it could be that a high-doubles year is often followed by a low-doubles year, and that would also help you to predict.

(As it turns out, Figure 1 of the paper shows that doubles follow gentle waves, which means a high-doubles season tends to be followed by another high-doubles season. Actually, that tends to be the case for all the statistics in the paper. Sorry to give away the ending so early.)

Of course, the paper uses a more formal statistical approach, "time series analysis." I don't understand it fully – some of the paper reads like a textbook, explaining the methodology in great detail – but I did take one undergrad course in this stuff a long time ago, so I think I know what's going on, kind of. But someone please correct me if I don't.

One thing Kaplan does to start is to figure out how many previous seasons to use to predict the next one. If you know last year's doubles, that helps to predict next year's. But what if you have last year's *and* the year before? Does that help you make a better prediction? The answer, perhaps not surprisingly, is no: if you know last year's doubles, that's enough – you don't get any more accuracy, at least to a statistically-significant degree, by adding in more previous seasons.

So this is the point where I get a bit lost. Once you know that you only need one previous season, why not just run a regular regression on the pairs of seasons, and get your answer that way? I'm assuming that the time series analysis pretty much does just that, but in a more complicated way.
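For the curious, here is roughly what "a regular regression on the pairs of seasons" would look like. This is only a sketch: the series is made-up illustration data, `fit_lag1` is a hypothetical helper name, and it is not a reproduction of whatever estimation method the paper actually uses.

```python
# Sketch: regress each season's total on the previous season's total.
# The numbers are invented for illustration; they are NOT Kaplan's data.

def fit_lag1(series):
    """Least-squares fit of series[t] = a + b * series[t-1]."""
    x = series[:-1]   # last year's values
    y = series[1:]    # this year's values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx             # slope: how much of last year carries over
    a = mean_y - b * mean_x   # intercept
    return a, b

doubles = [100, 110, 130, 125, 128, 135, 131, 140]   # hypothetical series
a, b = fit_lag1(doubles)
print(f"this_year = {a:.1f} + {b:.3f} * last_year")
```

A slope near 1 would say that last year's level mostly persists; a slope near 0 would say it tells you little.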

Another thing that Kaplan does is "differencing". That means that instead of using the actual series of doubles – say, 100, 110, 130, 125 – he calculates the *differences* between the years and uses that as his series. That gives him +10, +20, -5 as his series. Why does he do that? To make the series exhibit "stationarity," the definition of which is in the paper (and was in my brain for at least a week or two back in 1987).
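As a quick check on the arithmetic, first differencing is just "each value minus the one before it." A minimal sketch, using the example series from the paragraph above (`first_differences` is a made-up helper name):

```python
def first_differences(series):
    """Return the year-to-year changes: series[t] - series[t-1]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

doubles = [100, 110, 130, 125]
print(first_differences(doubles))   # [10, 20, -5]
```

Note that the differenced series is one element shorter than the original.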

To my mind, the differencing defeats the purpose of the exercise – when you difference, you wind up measuring mostly randomness, rather than the actual trend.

A year's total of doubles can decompose into two factors: an actual trend towards hitting doubles, and a random factor.

Random factors tend not to repeat. And that means a negative change will tend to be followed by a positive change the next year, so you'll get a negative correlation from the randomness.

For instance, suppose that doubles skill is rising, from 100 to 102 to 104. And suppose that the league does in fact hit exactly the expected number of doubles in the first and third year. That means the series is:

100, ?, 104

No matter what happens the second year, the first and second difference have to sum to +4. If lots of doubles are hit the second year – 110, say – you wind up with differences of +10 and –6. If few doubles are hit the second year – 96, say – you wind up with differences of –4 and +8. No matter what you put in for the question mark, the sum is +4. And that means that a big positive difference will tend to be followed by a small or negative one, and vice-versa.

So, just because of random chance, there will be a *negative* correlation between one difference and the next.
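To put a rough number on that, here is a sketch under an assumption of my own (not the paper's): each year's total x_t is a fixed trend value T_t plus independent random noise e_t with the same variance σ² every year. The symbols are just notation for this illustration.

```latex
\begin{align*}
x_t &= T_t + e_t \\
d_t &= x_t - x_{t-1} = (T_t - T_{t-1}) + e_t - e_{t-1} \\
\operatorname{Var}(d_t) &= \operatorname{Var}(e_t) + \operatorname{Var}(e_{t-1}) = 2\sigma^2 \\
\operatorname{Cov}(d_t, d_{t-1}) &= \operatorname{Cov}(e_t - e_{t-1},\; e_{t-1} - e_{t-2}) = -\sigma^2 \\
\operatorname{Corr}(d_t, d_{t-1}) &= \frac{-\sigma^2}{2\sigma^2} = -\tfrac{1}{2}
\end{align*}
```

So under that assumption, pure non-repeating noise around a trend would give a correlation of -0.5 between consecutive differences. If some of the year-to-year change does carry over, the correlation would land somewhere between -0.5 and 0.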

And that's exactly what Kaplan finds. Here are his coefficients for the various offensive statistics, from Table 3:

-0.332 HR
-0.202 2B
-0.236 RBI
-0.180 BB
-0.027 SB

So I think this part of Kaplan's study doesn't really tell us anything we didn't already know: when a season unexpectedly has a whole bunch of doubles (or home runs, or steals, or ...), we should expect it to revert to the mean somewhat the next year.

Kaplan then proceeds to the time-series equivalent of multiple regression, where he tries to predict one of the above five statistics based on all five of their values (and R, for a total of six) the previous year.

He finds pretty much what you'd expect: for the most part, the best predictor of this year's value is last year's value of the same statistic. All of the this year/last year pairs were statistically significant, except RBI/RBI, which was only significant at the 10% level.

However, to predict RBIs, it turned out that last year's *doubles* was significant at the 3.5% level. Kaplan does mention that, but doesn't explain why or how that happens. My guess is that it's just random – it just happened that doubles and RBIs were correlated in such a way that doubles took some of the predictive power that would otherwise have gone to RBIs.

Indeed, there were 25 pairs of "last year/this year for a different statistic" in the study. With 25 possibilities, there's a good chance that at least one of them would show 1-in-20 significance for spurious reasons – and I bet that's what's happening with those doubles.
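Here's the back-of-the-envelope version of that "good chance." It assumes the 25 tests are independent and that none of the cross-statistic effects is real, which is a simplification (the statistics are obviously correlated with each other), so treat the result as a ballpark. `prob_at_least_one` is a made-up helper name.

```python
def prob_at_least_one(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests comes up 'significant'."""
    return 1 - (1 - alpha) ** n_tests

print(round(prob_at_least_one(25), 3))   # 0.723
print(25 * 0.05)                          # expected number of false positives: 1.25
```

So under those assumptions, you'd see at least one spurious 5%-level result nearly three times out of four.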


At Wednesday, November 05, 2008 11:05:00 PM, Blogger Don Coffin said...

As I understand it, the primary reason for differencing time series data is to avoid the dread autocorrelation effect. Although you usually see it done when you're regressing one variable (e.g., doubles) on another (home runs), rather than on a lag of itself. The reason is fairly simple. If two time series both show time trends, then regressing one on the other is not evidence of a causal relationship, it's evidence of the trend. Differencing is intended to make the series stationary (i.e., no time trend). Then, if the differences are significantly related, you may be on to something.

I suppose the same thing could happen with a single piece of time series data. What you're capturing is a trend, not a causal relationship. So differencing gets you stationarity both of doubles and of lagged doubles. So what you're looking for is whether last year's change in doubles is related to this year's change. If it's not, then you have some reason to question whether you actually have a stable time trend. I think that's right. But don't try to raise any money on the strength of it.

At Wednesday, November 05, 2008 11:37:00 PM, Blogger Don Coffin said...

Let me try to make that last bit clearer. Suppose the average year-to-year change is +2. Differencing lets you see if that year-to-year change is stable. So I created a series of random numbers (which Excel will generate, and which fall between 0 and 1). I then subtracted 0.5 from this set of random numbers, which should give me a mean of about 0--the average change is, in fact, -0.056, with an S.D. of 0.2279. And I then created a time series, X, beginning with 0 and then adding each of my random numbers. If I graph this time series X against its lagged value, the time trend is a very tight fit; if I ran a regression (my SPSS license at home has expired), I'd get a real high R-sq.

Then I looked at the changes in X--remember, these are, by construction, random, with a mean of, essentially, zero. If I graph the change in X against the lagged change in X, I get a random relationship, not surprisingly, since the changes are random. (The implicit trend line is horizontal.)

So if I don't difference this series, it looks like there is a strong year-to-year correlation in my X. But that's spurious. The average year-to-year difference is actually zero. (I can send you the spreadsheet, with the charts, if you'd like.)
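The spreadsheet experiment described above can be re-run in a few lines. This is a sketch of the same idea, not the original spreadsheet: the seed and the series length of 2,000 are arbitrary choices, and `pearson` and `lag1_corr` are hypothetical helper names.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lag1_corr(series):
    """Correlation between a series and itself shifted by one step."""
    return pearson(series[:-1], series[1:])

random.seed(0)                                          # arbitrary seed
steps = [random.random() - 0.5 for _ in range(2000)]    # uniform on [-0.5, 0.5)

x = [0.0]                  # start at 0, then add each random number in turn
for s in steps:
    x.append(x[-1] + s)

changes = [b - a for a, b in zip(x, x[1:])]

print("X vs lagged X:          ", round(lag1_corr(x), 3))
print("change vs lagged change:", round(lag1_corr(changes), 3))
```

The first correlation should come out very close to 1 and the second close to 0, which is the pattern described in the comment.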

At Wednesday, November 05, 2008 11:44:00 PM, Blogger Phil Birnbaum said...

Hi, Doc,

I don't think you want random *differences*. I think you want random effects on each of the terms in the series.

That is, you don't want the random numbers to be cumulative.

For instance, suppose your "base" series is 2, 4, 6, 8, 10, and your random numbers just happen to be 0.1, -0.1, 0.3, -0.5, +0.4. You want to adjust your series to

2.1, 3.9, 6.3, 7.5, 10.4

You DON'T want to adjust it to

2.1, 4.0, 6.3, 7.8, 10.2

In the first case, the correlation between the differences and lagged differences will be negative. In the second case, the correlation will be zero.

Real life corresponds to the first case: randomness in one year doesn't accumulate to the next year.
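The two constructions are easy to compare side by side. A sketch, with assumptions labeled: the base series is 2, 4, 6, ... as in the example above, the noise is uniform on [-0.5, 0.5], and the seed and the length of 5,000 are arbitrary. The helper names are made up.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lag1_diff_corr(series):
    """Correlation between each year-to-year change and the previous change."""
    d = [b - a for a, b in zip(series, series[1:])]
    return pearson(d[:-1], d[1:])

random.seed(1)                                   # arbitrary seed
n = 5000
base = [2 * (i + 1) for i in range(n)]           # 2, 4, 6, 8, ...
noise = [random.uniform(-0.5, 0.5) for _ in range(n)]

# Case 1: each year's random number affects only that year.
per_year = [b + e for b, e in zip(base, noise)]

# Case 2: the random numbers accumulate from year to year.
cumulative = []
running = 0.0
for b, e in zip(base, noise):
    running += e
    cumulative.append(b + running)

print("non-cumulative noise:", round(lag1_diff_corr(per_year), 3))
print("cumulative noise:    ", round(lag1_diff_corr(cumulative), 3))
```

With this setup the first figure should land near -0.5 and the second near 0. Which construction better describes real baseball seasons is exactly what the comments here are arguing about; the code doesn't settle that.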

At Thursday, November 06, 2008 11:31:00 AM, Blogger Don Coffin said...

I think I'd disagree about how "real life" works. If there's some pattern, then the random shocks are to the pattern--that is, they build from the existing, not the base, level. But I don't know.

At Thursday, November 06, 2008 5:41:00 PM, Blogger Unknown said...

My simple explanation for the reason to take first differences is this: if you have a "non-stationary" series of data (your data trends upwards over time, for example) and you try to fit a straight line to it, which is what OLS regression does, then your straight line will fit really well, but only because of the time trend in the data. What you've done is fool yourself into thinking that you've got a good understanding of the data, when all you really know is that the data has trended up over time.

