Tuesday, February 03, 2009

A neural-net Hall of Fame prediction method

"Predicting Baseball Hall of Fame Membership using a Radial Basis Function Network," by Lloyd Smith and James Downey

This JQAS article, from the most recent issue, presents a new system to empirically predict who is and who is not in the Hall of Fame. But it's not a series of formulas -- it's an algorithm called a "radial basis function network," which is a type of neural net. I don't know much about this kind of thing, but it's called a "machine learning approach" because the algorithm figures out the algorithm, as it were.
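(For readers who, like me, aren't familiar with the term: a radial basis function network is a one-hidden-layer net whose hidden units fire according to how *close* an input is to a stored "center," with a linear output layer on top. Here's a minimal sketch of the general idea -- my own illustration with made-up toy data, not the authors' actual model or features:)

```python
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian activation of every row of X against every center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, y, width):
    # classic RBF net: one hidden unit centered on each training point,
    # linear output weights fit by least squares
    H = rbf_design(X, X, width)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

def predict_rbf(Xnew, X, w, width):
    # score for each new input; call anything above 0.5 a "yes"
    return rbf_design(Xnew, X, width) @ w

# toy data: two clusters of invented two-number career lines
X = np.array([[0., 0], [0, 1], [1, 0], [1, 1],
              [5., 5], [5, 6], [6, 5], [6, 6]])
y = np.array([0., 0, 0, 0, 1, 1, 1, 1])  # 1 = "in the Hall"
w = train_rbf(X, y, width=2.0)
preds = predict_rbf(X, X, w, width=2.0) > 0.5
```

The point of the sketch is just that the "learning" is mechanical: the net stores exemplars and weights, and classifies new players by similarity to those exemplars.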

The advantage, it seems to me, is that you don't have to figure out an algorithm yourself. But the disadvantage is that you have no idea what the "real" qualifications for the HOF are -- the neural net spits out a probability for each player, but you have no idea where that probability came from. As the authors note,

"A disadvantage of the approach described here is that the neural network model is opaque -- it is impossible to understand, with any degree of confidence, why the model fails to classify a player such as Lou Brock as a Hall of Fame member."

That, perhaps, is an unfortunate choice of example. Obviously, Brock is in the Hall of Fame because of his stolen bases. But the authors didn't feed steals into the model. Indeed, the words "stolen base" don't appear anywhere in the paper!

What *does* the model include? For pitchers, it considers: wins, saves, ERA, winning percentage, win shares, and number of times selected to the All Star Game. And it seems to do a reasonable job distinguishing HOFers from non. For pitchers retiring between 1950 and 2002, it makes only six errors -- it mistakenly calls Billy Pierce, Lee Smith, and John Wetteland Hall of Famers; but omits Fergie Jenkins, Hoyt Wilhelm, and Dennis Eckersley.

For hitters, the model includes: hits, HR, OPS, WS, and again All-Star selections. This time the algorithm misidentifies 13 players (Rice was listed as an error, but now is not). The incorrect selections are: McGwire, Dawson, Garvey, Baines, Santo, and Parker. The incorrect omissions are Brock, Appling, Yount, Kiner, Boudreau, Campanella, and Jackie Robinson.

Many of the errors are understandable; McGwire, Campanella, and Robinson, for instance, have statuses heavily influenced by factors other than their statistics. But a couple of the mistakes arise from the choice of data; Brock, of course, but also Robin Yount, who winds up misclassified because he had only three all-star selections -- by far the lowest of any HOFer in the 1950-2002 era. (The next lowest was 6, by Willie McCovey and Billy Williams. And all of the HOFers that the model failed to predict had 8 or fewer.)

The authors defend the use of All-Star selections on the basis that it's a proxy for position played; that's somewhat reasonable, and I guess that's why it somewhat works.

Anyway: is this method better than others, most notably Bill James' algorithms? Strangely, although the authors cite both of James' methods, they don't compare them to their own. My guess is that Bill's methods are probably at least as accurate as the ones in this paper. And Bill's have the advantage that we actually learn something from them -- they help us figure out what it takes to get into the Hall of Fame. The method in this paper, while perhaps being objective, accurate, and complex, doesn't tell us anything except its predictions, and so we don't learn very much about baseball from it.

P.S. The paper assumes that all sabermetrics comes from SABR. This is, of course, not true.



At Wednesday, February 04, 2009 9:01:00 AM, Anonymous Anonymous said...

My first thought is it should look at 3000 hits as an automatic selection (gamblers and roid users excepted). That would put Yount and Brock inside.

At Thursday, February 05, 2009 4:39:00 PM, Blogger Ted said...

Phil, I disagree with your perspective in the last paragraph.

What this paper shows is that it is not that hard to come up with a set of criteria that does a good job of distinguishing HOFers from non-HOFers. I've seen several variations on this idea. I saw a paper at a conference last year -- if really pressed, I could probably go and look up the authors -- which did a similar exercise as this paper, except they used a method which essentially generated a "flowchart" for determining HOF status.

The relative success of machine learning at solving the HOF classification problem does tell us something about baseball: it tells us that you don't have to know anything about baseball to figure out a good rough cut at standards for HOF election. This tells me something that, to me, is rather deep: when you look at the misclassifications, you can see where special-case considerations came into play with Campy, or Mac, or P*** R***, or whoever.

I also have a different perspective on where Bill James' HOF work sits in our framework of understanding baseball. You seem implicitly in the last paragraph to be using the syllogism:

* HOF voters put a set of players in the HOF
* Bill James constructed a system that puts a similar set of players in the HOF
* Therefore, Bill James' system tells us what it takes to be elected to the HOF

Which, logically, is not a valid conclusion. Many of the provisions in Bill's systems were taken from conventional wisdom as to what the standards for the HOF were. Bill's contribution was to take all those cliches and organize them systematically, and to show that you *could* organize them systematically and come up with good classification systems. That has always been Bill's strength in all of his research, and there's no understating the value of doing that.

The machine learning approaches complement Bill's work by showing that there are a large set of classifiers that all work basically equally well. In other words, HOFness seems to be a fairly well-defined concept.

I also do not agree with your comments on accuracy. The point of the exercise of doing the neural net classifier is to show that you can do pretty well with only a few degrees of freedom. You can always make a more "accurate" system by adding more degrees of freedom; if you allow me enough degrees of freedom, I can correctly classify 100% of HOFers as HOFers and 100% of non-HOFers as non-HOFers. By most standards of model evaluation, this neural net runs circles around Bill, because Bill had an enormous advantage in terms of the degrees of freedom he had in devising his systems.
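(To make that degrees-of-freedom point concrete -- a toy construction of my own, not anything from the paper: give a model one free coefficient per player, and it will "classify" any set of labels perfectly, no matter how arbitrary they are.)

```python
import numpy as np

# six "players" on an arbitrary career-stat axis, with arbitrary 0/1 labels
x = np.linspace(0.0, 1.0, 6)
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

# a degree-5 polynomial has six coefficients -- one per player -- so it
# interpolates the labels exactly, however meaningless they are
coeffs = np.polyfit(x, y, deg=5)
fitted = np.polyval(coeffs, x)
accuracy = ((fitted > 0.5) == (y > 0.5)).mean()  # 1.0: "perfect"
```

The fit is 100% accurate on the players it was built from, and it has learned nothing at all.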

At Thursday, February 05, 2009 4:53:00 PM, Blogger Phil Birnbaum said...


OK, I see your point. When you say "this paper shows ... that it is not that hard to come up with a set of criteria that does a good job of distinguishing HOFers from non-HOFers," I agree with you.

The paper does show us that it is not that hard to come up with a set of criteria, but it does NOT show us what those criteria are. But you're right, I'd have to agree that just knowing that the voters' criteria are simulatable is indeed valuable knowledge.

As for accuracy ... do we really know that the neural net algorithm is less arbitrary than Bill's? We do know that it uses fewer data points (or "degrees of freedom" in a more formal sense). But does that really make it simpler in an Occam's Razor kind of way?

Also: we do have some prior knowledge about what voters care about -- stolen bases, for instance. If Bill's methodology includes steals and the neural net doesn't, shouldn't the knowledge that SB is important be a factor in how we evaluate the two systems? My guess that Bill's system is more accurate isn't just a matter of how many errors it makes, but also how it incorporates our prior knowledge about what the criteria actually are. (I'm assuming that we do *know* what some of the considerations are. If you want to argue that voters don't care about steals, but just say they do, you rebut this paragraph completely.)

And lastly: I don't think it's that difficult to come up with a system that's reasonably accurate. The true test is not the first 90% of the variance (which is easy), but the last 10% (which is hard). While intuitively the neural net results look pretty good, I want to see a comparison before passing final judgement. Total Average is pretty good at ranking team offense, but Base Runs is much better.

At Thursday, February 05, 2009 8:38:00 PM, Blogger Ted said...

Your points about the arbitrariness of how to "count" prior information in creating a model are well-taken. We already know going into the exercise -- no matter what approach we choose to take -- that certain statistical totals are very likely to have a lot of explanatory power. It would indeed be silly (in my Bayesian perspective) to take a completely classical viewpoint and ignore the value of that prior information. And, so, for example, the authors' choice not to include SB, and not to explain why they chose not to (if it was a conscious choice), is puzzling, because I agree that by any metric, adding SB seems likely to improve predictive power.

The authors are correct in noting that the neural net method doesn't (easily) tell us the whys and wherefores of how it computes its predictions. (I guess technically speaking it does. It's nothing but a set of formulas with weights, and you could look at those weights and trace the computation for any given player. Whether you could *summarize* that computation in a couple of English sentences is a different matter.) But, for any neural net, the types of classifications it can do are well understood. So, it is possible to formally say that the amount of flexibility a particular neural net has is less than that of a human like Bill James, who could entertain any possible set of criteria.
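(For what it's worth, that tracing exercise is mechanical: an RBF net's score is just a sum of weighted similarities to its stored centers, so any one player's score splits into per-unit contributions. A sketch of my own -- the centers, weights, and feature values below are entirely invented for illustration:)

```python
import numpy as np

def explain_score(x, centers, weights, width):
    """Split one player's RBF-net score into per-hidden-unit contributions."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    contributions = weights * np.exp(-d2 / (2.0 * width ** 2))
    return contributions  # contributions.sum() is the net's score for x

# hypothetical two-feature net (say, hits and HR); all numbers invented
centers = np.array([[3000.0, 500.0], [1500.0, 100.0]])
weights = np.array([1.2, -0.4])
player = np.array([2900.0, 450.0])
parts = explain_score(player, centers, weights, width=2000.0)
```

Here the first hidden unit pushes the score up (the player resembles the first stored career) and the second pulls it down -- that's the trace. Whether a list like that counts as an intuitive "why" is, as Ted says, a different matter.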

My reaction to your "last 10%" comment is more reserved. One can always improve classification accuracy by adding more variables. There's a real danger of overfitting when you do this; you can easily create a model where coincidental features of the data drive the results, and, so, the model would not be as useful for *prediction* as a "less accurate" model that wasn't overfitted.

At Thursday, February 05, 2009 8:52:00 PM, Blogger Phil Birnbaum said...

Hi, Ted,

Agreed that in theory, you could figure out the whys and wherefores of the neural net. But *this paper* doesn't do that, and you raise questions about whether it can be done intuitively (as opposed to technically) at all. So what have we humans learned about baseball from the paper? Nothing yet.

Agreed that one can improve accuracy by adding more variables. But there's a cost/benefit tradeoff. And so: does the Bill James method add significantly more accuracy for the additional information it requires? Or is the marginal benefit of switching to James not worth the additional data? And you can also ask that question backwards: does the neural net method prove more accurate than, say, a simple regression?

My point about the "last 10%" is that picking the obvious HOFers (Ruth) and the obvious non-HOFers (Roy Howell) is trivial for even the dumbest algorithms. The true test of a method is how well it distinguishes the borderline cases. 13 errors since 1950 seems like a good record, but is it really, if you look only at the players who are difficult to classify?

Finally, one last point hinted at by a commenter on Tango's site: since the neural net uses only similarities, it's quite possible that it gives counterintuitive results in some cases. The commenter suggested that since Bruce Sutter is in with a relatively high ERA, the algorithm might conclude that Tom Henke's failure to be inducted stems from his ERA being too low.

At Thursday, February 05, 2009 8:55:00 PM, Blogger Phil Birnbaum said...

The commenter I referred to in the previous post is Will Belfield. Link is comment 4 here.

At Monday, February 09, 2009 9:53:00 AM, Blogger Ted said...

I think we may have come full circle here, because I'm having a hard time thinking of anything to say but "see my first comment."

I don't believe it's a priori obvious that relatively simple heuristics can give you a 90% hit rate on HOFers, and establishing that fact through a variety of approaches *does* tell us something about baseball: despite criticisms to the contrary, it is consistent with the hypothesis that HOF voters have established something resembling objective standards for election that seem to prevail more often than not.

And, again, I just don't agree with the last 10% being some sort of litmus test. If you give *any* system -- whether automated or human -- more data to work with, you will increase the hit rate; it's mathematically inevitable. Whether you are actually picking up anything meaningful by doing so is questionable. You'll be able to capture a lot of those borderline cases if you add more criteria, but you might not learn more, because what you *really* needed was a "Friend of Frankie" indicator variable. If you tried hard enough, you could reverse-engineer standards to get those guys right, but you wouldn't be learning anything about baseball.

At Tuesday, February 17, 2009 3:47:00 PM, Anonymous Anonymous said...

"As the authors note, 'A disadvantage of the approach described here is that the neural network model is opaque -- it is impossible to understand, with any degree of confidence, why the model fails to classify a player such as Lou Brock as a Hall of Fame member.' That, perhaps, is an unfortunate choice of example. Obviously, Brock is in the Hall of Fame because of his stolen bases. But the authors didn't feed steals into the model. Indeed, the words "stolen base" don't appear anywhere in the paper!"

Doesn't Lou Brock have the highest World Series batting average among MLB players with 20+ World Series plate appearances (or some similar threshold)? Unparalleled success under immense pressure is often held in very high regard, even if OBP may be more important statistically than mere batting average. And if Bill Mazeroski can make the HoF obviously on the strength of a single home run (clutch though it was), then surely Brock warrants entry based on his consistent clutch (World Series) career ... stolen bases notwithstanding.

