Does last year's doubles predict this year's doubles?
In a recent issue of JQAS, there's a paper by David Kaplan called "Univariate and Multivariate Autoregressive Time Series Models of Offensive Baseball Performance: 1901-2005."
The idea is to check whether you can predict one year's MLB offensive statistics from those of previous years. Suppose that doubles have been on the rise in the past few seasons. Does that mean they should continue to rise this year? How well can we predict this year from the past few years?
One way to answer this question is by just graphing doubles and looking at the chart. If the chart looks mildly zig-zaggy and random [note: "zig-zaggy" is not officially a statistical term], then it looks like you won't be able to make a decent prediction. On the other hand, if the plot of doubles looks like long gentle waves up and down, then it would look like trends tend to extend over a number of years. Finally, if the graph is really, really, really zig-zaggy, it could be that a high-doubles year is often followed by a low-doubles year, and that would also help you to predict.
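To make "zig-zaggy" a bit more concrete, here's a quick simulation sketch – not from the paper, and the 0.8 coefficients and series are made up – showing how those three chart shapes correspond to the lag-1 correlation of a series:

```python
import numpy as np

def lag1_corr(x):
    """Correlation between each year's value and the next year's."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(0)
n = 500

# "Mildly zig-zaggy and random": independent noise, no year-to-year carryover.
noise = rng.normal(size=n)

# "Long gentle waves": each year is mostly last year plus a small shock.
waves = np.zeros(n)
for t in range(1, n):
    waves[t] = 0.8 * waves[t - 1] + rng.normal()

# "Really, really zig-zaggy": each year tends to reverse the previous one.
zigzag = np.zeros(n)
for t in range(1, n):
    zigzag[t] = -0.8 * zigzag[t - 1] + rng.normal()

print(lag1_corr(noise))   # near zero: the past doesn't help predict
print(lag1_corr(waves))   # strongly positive: high years follow high years
print(lag1_corr(zigzag))  # strongly negative: high years follow low years
```

Either of the last two patterns helps you predict; only the first one leaves you with nothing.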
(As it turns out, Figure 1 of the paper shows that doubles follow gentle waves, which means a high-doubles season tends to be followed by another high-doubles season. Actually, that tends to be the case for all the statistics in the paper. Sorry to give away the ending so early.)
Of course, the paper uses a more formal statistical approach, "time series analysis." I don't understand it fully – some of the paper reads like a textbook, explaining the methodology in great detail – but I did take one undergrad course in this stuff a long time ago, so I think I know what's going on, kind of. But someone please correct me if I don't.
One thing Kaplan does to start is to figure out how many previous seasons to use to predict the next one. If you know last year's doubles, that helps to predict next year's. But what if you have last year's *and* the year before? Does that help you make a better prediction? The answer, perhaps not surprisingly, is no: if you know last year's doubles, that's enough – you don't get any more accuracy, at least to a statistically-significant degree, by adding in more previous seasons.
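Here's a rough illustration of that finding – a toy simulation, not Kaplan's data or his exact method. If you build a series where only last year matters, and then regress this year on the previous *two* years anyway, the coefficient on two-years-ago comes out near zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a series where only the previous year matters (an AR(1) process
# with a made-up coefficient of 0.6).
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + rng.normal()

# Regress this year on the previous TWO years and look at the coefficients.
Y = y[2:]
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)
intercept, b_lag1, b_lag2 = coefs

print(b_lag1)  # close to 0.6 -- last year carries the signal
print(b_lag2)  # close to 0   -- two years back adds essentially nothing
```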
So this is the point where I get a bit lost. Once you know that you only need one previous season, why not just run a regular regression on the pairs of seasons, and get your answer that way? I'm assuming that the time series analysis pretty much does just that, but in a more complicated way.
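For what it's worth, here's a sketch of that "regular regression on the pairs of seasons" idea, on a made-up doubles series (the 0.5 coefficient and the scale are my inventions). The slope of the pairwise regression is essentially the AR(1) coefficient a time-series fit would report:

```python
import numpy as np

rng = np.random.default_rng(2)

# A hypothetical run of league doubles totals: mean-reverting around 100,
# with half of each year's deviation carrying over to the next year.
n = 1000
doubles = np.zeros(n)
doubles[0] = 100
for t in range(1, n):
    doubles[t] = 100 + 0.5 * (doubles[t - 1] - 100) + rng.normal(scale=5)

# "Regular regression on the pairs of seasons": this year vs. last year.
slope, intercept = np.polyfit(doubles[:-1], doubles[1:], 1)

print(slope)  # close to 0.5, the carryover we built in
```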
Another thing that Kaplan does is "differencing". That means that instead of using the actual series of doubles – say, 100, 110, 130, 125 – he calculates the *differences* between consecutive years and uses those as his series. That gives him +10, +20, –5. Why does he do that? To make the series exhibit "stationarity," the definition of which is in the paper (and was in my brain for at least a week or two back in 1987).
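In code, differencing is a one-liner, using the same made-up doubles totals:

```python
import numpy as np

doubles = np.array([100, 110, 130, 125])
diffs = np.diff(doubles)  # the year-to-year changes: +10, +20, -5
print(diffs)
```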
To my mind, the differencing defeats the purpose of the exercise – when you difference, you wind up measuring mostly randomness, rather than the actual trend.
A year's total of doubles can be decomposed into two components: the actual underlying trend in doubles hitting, and a random factor.
Random factors tend not to repeat. And that means a negative change will tend to be followed by a positive change the next year, so you'll get a negative correlation from the randomness.
For instance, suppose that doubles skill is rising, from 100 to 102 to 104. And suppose that the league does in fact hit exactly the expected number of doubles in the first and third year. That means the series is:
100, ?, 104
No matter what happens the second year, the first and second differences have to sum to +4. If lots of doubles are hit the second year – 110, say – you wind up with differences of +10 and –6. If few doubles are hit the second year – 96, say – you wind up with differences of –4 and +8. Whatever you put in for the question mark, the sum is +4. And that means that a positive difference will tend to be followed by a negative one, and vice versa.
So, just because of random chance, there will be a *negative* correlation between one difference and the next.
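You can watch that negative correlation pop out of pure randomness with a quick simulation (made-up numbers again, not the paper's): take a smooth trend, add independent noise, difference the series, and check the lag-1 correlation of the differences. For independent noise, it comes out around –0.5:

```python
import numpy as np

rng = np.random.default_rng(3)

# A smoothly rising "true skill" trend, plus independent year-to-year noise.
n = 10000
trend = 100 + 2 * np.arange(n)
observed = trend + rng.normal(scale=10, size=n)

# Difference the series, then correlate each difference with the next one.
diffs = np.diff(observed)
lag1 = np.corrcoef(diffs[:-1], diffs[1:])[0, 1]

print(lag1)  # close to -0.5: pure noise, once differenced, anti-correlates
```

The –0.5 isn't an accident: each noise term shows up in two consecutive differences, once with a plus sign and once with a minus sign.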
And that's exactly what Kaplan finds: in his Table 3, the coefficients for the various offensive statistics come out negative.
So I think this part of Kaplan's study doesn't really tell us anything we didn't already know: when a season unexpectedly has a whole bunch of doubles (or home runs, or steals, or ...), we should expect it to revert to the mean somewhat the next year.
Kaplan then proceeds to the time-series equivalent of multiple regression, trying to predict each of the five statistics from Table 3 based on the previous year's values of all five (plus R, for a total of six predictors).
He finds pretty much what you'd expect: for the most part, the best predictor of this year's value is last year's value of the same statistic. All of the this-year/last-year pairs were statistically significant, except RBI/RBI, which was significant only at the 10% level.
However, to predict RBIs, it turned out that last year's *doubles* were significant at the 3.5% level. Kaplan does mention that, but doesn't explain why or how that happens. My guess is that it's just random – it just happened that doubles and RBIs were correlated in such a way that doubles took some of the predictive power that would otherwise have gone to RBIs.
Indeed, there were 25 pairs of "last year/this year for a different statistic" in the study. With 25 possibilities, there's a good chance that at least one of them would show 1-in-20 significance for spurious reasons – and I bet that's what's happening with those doubles.
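The arithmetic backs that up. Assuming the 25 tests were independent (which is only roughly true here), the chance of at least one spurious "significant" result is high:

```python
# Each of 25 tests has a 5% false-positive rate, so the chance that at
# least one comes up "significant" by luck alone is:
p_at_least_one = 1 - 0.95 ** 25
print(round(p_at_least_one, 2))  # about 0.72
```

So finding one stray significant cross-statistic pair is pretty much what you'd expect from chance alone.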