Wednesday, July 01, 2015

Do stock buybacks enrich CEOs at the expense of the economy?

Are share buybacks hurting the economy and increasing income inequality? 

Some pundits seem to think so. There was an article in Harvard Business Review, a while ago, which might have been an editorial (I can't find a byline). That followed a similar article from FiveThirtyEight, that concentrated on the economic effects. When I Googled, I came across another article from The Atlantic. I think it's a common argument ... I'm pretty sure I've seen it lots of other places, including blogs and Facebook.

They think it's a big deal, at least going by the headlines: 

-- "How stock options lead CEOs to put their own interests first" (Washington Post)

-- "Stock Buybacks Are Killing the American Economy" (The Atlantic)

-- "Profits Without Prosperity" (Harvard Business Review)

-- "Corporate America Is Enriching Shareholders at the Expense of the Economy" (FiveThirtyEight)

But ... it seems to me that neither the "hurt the economy" argument nor the "increase inequality" argument actually makes sense.

Before I start, here's a summary, in my own words, of what these articles seem to be saying. You can check them out and see if I've captured them fairly.

"Corporations have always paid out some of their earnings in dividends to shareholders. But lately, they've been dispersing even more of their profits, by buying back their own shares on the open market. 

"This is problematic in several ways. For one, it takes money that companies would normally devote to research and expansion, and just pays it out, reducing their ability to expand the economy to benefit everyone. In addition, it artificially boosts the market price of the stock. That benefits CEOs unfairly, since their compensation includes shares of the company, and provides a perverse incentive to funnel cash to buybacks instead of expanding the business.

"Finally, it makes the rich richer, boosting the stock values for CEOs and other shareholders at the expense of the lower and middle classes."

As I said, I don't think any part of this argument actually works. The reasons are fairly straightforward, not requiring any intricate macroeconomics.


1. Buybacks don't increase the value of the shares

At first consideration, it seems obvious that buybacks must increase the value of your stockholdings. With fewer shares outstanding, the value of the company has to be split fewer ways, so your piece of the pie is bigger.

But, no. Your *fraction* of the pie is bigger, but the pie is reduced in size by exactly the same fraction. You break even. That *has* to be the case, otherwise it would be a way to generate free money!

Acme has one million (1 MM) shares outstanding. The company's business assets are worth $2 MM, and it has $1 MM in cash in the bank with no debt. So the company is worth $3 a share.

Now, Acme buys back 100,000 shares, 10 percent of the total. It spends $300,000 to do that. Then, it cancels the shares, leaving only 900,000.

After the buyback, the company still owns a business worth $2 million, but now only has $700,000 in the bank. Its total value is $2.7 million. Divide that by the 900,000 remaining shares, and you get ... the same $3 a share as when you started.

It's got to be that way. You can't create wealth out of thin air by market-value transactions. 
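Here's a minimal sketch of the Acme arithmetic in Python, just to check the numbers. The function names are mine, made up for illustration, and the sketch assumes the shares are bought back at exactly their pre-buyback value.

```python
def value_per_share(business_value, cash, shares):
    """Per-share value of a debt-free company: (business assets + cash) / shares."""
    return (business_value + cash) / shares


def buy_back(cash, shares, shares_bought, price):
    """Return (cash, shares) after buying `shares_bought` at `price` and cancelling them."""
    return cash - shares_bought * price, shares - shares_bought


# Acme before the buyback: $2 MM business, $1 MM cash, 1 MM shares.
business, cash, shares = 2_000_000, 1_000_000, 1_000_000
before = value_per_share(business, cash, shares)

# Acme buys back 100,000 shares at the going price.
cash_after, shares_after = buy_back(cash, shares, shares_bought=100_000, price=before)
after = value_per_share(business, cash_after, shares_after)

print(before, after)  # both come out to $3 a share
```

The pie shrinks by 10 percent and the number of slices shrinks by 10 percent, so each remaining slice is the same size.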

The HBR author might realize this: he or she hints that buybacks increase stock prices "in the short term," and "even if only temporarily."  I'm not sure how that would happen -- for very liquid shares, the extra demand isn't going to change the price very much. Maybe the *announcements* of buybacks could boost the shares, by signalling that the company has confidence in its future. But that's also the case for announcements of dividend increases.

One caveat: it's true the share price is higher after a buyback than a dividend, but that's not because the buyback raises the price: it's because the dividend lowers it. If the company spends the $300,000 on dividends instead of buybacks, the value of a share drops to $2.70. The shareholders still have $3 worth of value: $2.70 for the share, and 30 cents in cash from the dividend. (It's well known, and easily observed, that the change in share price actually does happen in real life.)
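The dividend case can be sketched the same way. This is only an illustration using the Acme numbers; `after_dividend` is a hypothetical helper name, and taxes are ignored.

```python
def after_dividend(business_value, cash, shares, total_dividend):
    """Return (share value, cash received per share) after a cash dividend."""
    share_value = (business_value + cash - total_dividend) / shares
    cash_per_share = total_dividend / shares
    return share_value, cash_per_share


# Acme pays the $300,000 out as a dividend instead of buying back shares.
share_value, cash_per_share = after_dividend(2_000_000, 1_000_000, 1_000_000, 300_000)

# The shareholder's total is the share plus the cash in hand.
total_value = share_value + cash_per_share
print(share_value, cash_per_share, total_value)
```

The share itself is worth less than the $3 it would be worth after a buyback, but only because 30 cents of it has moved into the shareholder's pocket.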

If the CEO chooses to spend the cash on buybacks, then, yes, the stock price will be higher than if he chose to spend it on dividends. It won't just be higher in the short term, but in the long term too. 

Are buybacks actually replacing dividends? The FiveThirtyEight article shows that both dividends and buybacks are increasing, so it's not obviously CEOs choosing to replace one with the other. 

But, sure, if the company does replace expected dividends with buybacks, the share price will indeed sit higher, and the CEO's stock options will be more valuable.

To avoid conflicts of interest, it seems like CEOs should be compensated in options that adjust for dividends paid. (As should all stock options, including the ones civilians buy. But they don't do that, probably because it's too complicated to keep track of.)  But, again: the source of the conflict is not that buybacks boost the share price, but that dividends reduce it. If you believe CEOs are enriching themselves by shady dealing, you should be demanding more dividends, not decrying buybacks.

2. The buyback money is still invested

The narratives claim that the money paid out in share buybacks is lost, that it's money that won't be used to grow the economy.

But it's NOT lost. It's just transferred from the company to the shareholders who sell their stock. 

Suppose I own 10 shares of Apple, and they do a buyback, and I sell my shares for $600 of Apple's money. That's $600 that Apple no longer has to spend on R&D, or advertising, or whatever. But, now, *I have that $600*. And I'm probably going to invest it somewhere else. 

Now, I might just buy stock in another company -- Coca-Cola, say -- from another shareholder. That just transfers money from me to the other guy -- the Coca-Cola Corporation doesn't get any of that to invest. But, then, the other guy will buy some other stock from another guy, and so on, and so on, until you finally hit one last someone who doesn't use it to buy another stock.

What will he do? Maybe he'll use the $600 to buy a computer, or something, in which case that helps the economy that way. Or, he'll donate the $600 to raise awareness of sexism, to shame bloggers who assume all CEOs and investors are "he". Or, he'll use it to pay for his kids' tuition, which is effectively an investment in human capital. 

Who's to say that these expenditures don't help the economy at least as much as Apple's would?

In fact, the investor might use the $600 to actually invest in a business, by buying into an IPO. In 2013, Twitter raised $1.8 billion in fresh money, to use to build its business. It's quite possible that my $600, which came out of Apple's bank account, eventually found its way into Twitter's.

Is that a bad thing? No, it's a very good thing. The market judged, albeit in a roundabout way, that there was more profit potential for that $600 in Twitter than in Apple. The market could be wrong, of course, but, in general, it's pretty efficient. You'd have a tough time convincing me that, at the margin, that $600 would be more profitable in Apple than in Twitter.

The economy grows the best when the R&D money goes where it will do the most good. If Consolidated Buggy Whip has a billion dollars in the bank, do you really want it to build a research laboratory where it can spend it on figuring out how to synthesize a more flexible whip handle? 

At the margin, that's probably where Apple is coming from. It makes huge, huge amounts of profit, around $43 billion last year. It spent about $7 billion on R&D. Do we really want Apple to spend six times as much on research as it thinks is appropriate? It seems to me that the world is much better off if that money is given back to investors to put elsewhere into the economy.

That might be part of why buyback announcements boost the stock price, if indeed they do. When Apple says it's going to buy back stock, shareholders are relieved to find out they're not going to waste that cash trying to create the iToilet or something.

3. Successful companies are not restrained by cash in the bank

According to the FiveThirtyEight article, Coca-Cola spent around $5 billion in share repurchases in 2013. But their long-term debt is almost $20 billion.

For a company like Coca-Cola, $20 billion is nothing. It's only twice their annual profit. Their credit is good -- I'm sure they could borrow another $20 billion tomorrow if they wanted to.

In other words: anytime the executives at Coke see an opportunity to expand the business, they will have no problem finding money to invest. 

If you don't believe that, if you still believe that the $5 billion buyback reduces their business options ... then, you should be equally outraged if they used that money to pay down their debt. Either way, that's $5 billion cash they no longer have handy! The only difference is, when Coca-Cola pays down debt, the $5 billion goes to the bondholders instead of the shareholders. (In effect, paying off debt is a "bond buyback".)

The "good for the economy" argument isn't actually about buybacks -- it's about investment. If buybacks are bad, it's not because they're buybacks specifically; it's because they're something other than necessary investment.

It's as if people are buying Cadillac Escalades instead of saving for retirement. The problem isn't Escalades, specifically. The problem is that people aren't using the money for retirement savings. Banning Escalades won't help, if people just don't like saving. They'll just spend the money on Lexuses instead.

Is investment actually dropping? The FiveThirtyEight article thinks so -- it shows that companies' investment-to-payout ratio is dropping over time. But, so what? Why divide investment by payouts? Companies could just be getting rid of excess cash that they don't know what to do with (which they also get criticized for -- "sitting on cash"). Looking at Apple ... their capital expenditures went from 12 cents a share in 2007, to $1.55 in 2014 (adjusted for the change in shares outstanding). A thirteen-fold increase in capital spending doesn't suggest that they're scrimping on necessary investment.

4. Companies offset their buybacks by issuing new shares

As I mentioned, the FiveThirtyEight article notes that Coke bought back $5 billion in shares in 2013. But, looking at Value Line's report (.pdf), it looks like, between 2012 and 2013, outstanding shares only dropped by about half that amount.

Which means ... even while buying back and retiring $5 billion in old shares, Coca-Cola must have, at the same time, been issuing $2.5 billion in *new* shares. 

I don't know why or how. Maybe they issued them to award to employees as stock options. In that case, the money is part of employee compensation. Even if the shares went to the CEO, if they didn't issue those shares, they'd have to pay the equivalent in cash.

So if you're going to criticize Coca-Cola for wasting valuable cash buying shares, you also have to praise it, in an exactly offsetting way, for *saving* valuable cash by paying employees in shares instead. Don't you?

I suppose you could say, yes, they did the right thing by saving cash, but they could do more of the right thing by not buying back shares! But: the two are equal. If you're going to criticize Coca-Cola for buying back shares, you have to criticize other companies that pay their CEOs exclusively in cash. 

But the HBR article actually gets it backwards. It *criticizes* companies that pay their CEOs in shares!

Suppose Coca-Cola is buying back (say) a million shares for $40 MM, which is presumably bad. Then, they give those shares to the employees, which is also presumably bad. Instead, the Harvard article says, they should take the $40 MM, and give it to the employees directly. 

But that's exactly the same thing! Either way, Coca-Cola has the same amount of cash at the end. It's just that in one case, the original shareholders have shares and the CEO has cash. The other way, the original shareholders have the cash and the CEO has the shares.

What difference does that make to the economy or the company? Little to none.

5. Inequality is barely affected, if at all

Suppose a typical CEO makes about $40 million. And suppose half of that is in stock. And suppose, generously, that the CEO can increase the realized value of his shares by 5 percent by allegedly manipulating the price with share buybacks.

You're talking about $1 million in manipulation. 

How much does that affect inequality? Hardly at all. The top 1% of earners in the United States are, by definition, around 3 million people. That includes children ... let's suppose the official statistics use only 2 million people.

The Fortune 500 companies are, at most, 500 CEOs. Let's include other executives and companies, to get, say, 4,000 people. 

That's still only one-fifth of one percent of the "one percenters."

The average annual income of the top 1% is around $717,000. Multiply that by two million people, and you get total income of around $1.4 trillion.

After the CEOs finish manipulating the stock price, the 4,000 executives earn an extra $4 billion overall. So the income of the top 1% goes from about $1.400 trillion to about $1.404 trillion.

That's an increase of less than one-third of one percent. Well, yes, technically, that does "contribute" to inequality, but by such a negligible amount that it's hardly worth mentioning. 
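For anyone who wants to check the arithmetic, here's a rough sketch in Python. All the inputs are the assumptions stated above; the variable and function names are mine. Note that 2 million people times $717,000 is actually $1.434 trillion, which I've been rounding to $1.4 trillion.

```python
def manipulation_gain(total_pay, stock_fraction, boost):
    """Extra pay from boosting the realized value of the stock portion by `boost`."""
    return total_pay * stock_fraction * boost


gain_per_exec = manipulation_gain(40_000_000, 0.5, 0.05)  # about $1 million each

executives = 4_000          # CEOs plus other executives, as assumed above
top1_people = 2_000_000     # assumed size of the top 1% in the statistics
top1_avg_income = 717_000   # average annual income of the top 1%

extra_income = gain_per_exec * executives        # about $4 billion
top1_income = top1_people * top1_avg_income      # about $1.434 trillion
share_of_group = executives / top1_people        # executives as a share of the 1%
relative_increase = extra_income / top1_income   # boost to total top-1% income

print(f"{share_of_group:.2%} of the one percenters")
print(f"{relative_increase:.3%} increase in top-1% income")
```

Using the unrounded $1.434 trillion makes the increase slightly smaller still, but either way it's under one-third of one percent.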

And that one-third of one percent is still probably an overstatement:

1. We used very generous assumptions about how CEOs capitalize on stock price changes. 

2. When the board offers the CEO stock options, both parties are aware of the benefits of the CEO being able to time the announcements. Without that benefit, pay would probably have to increase (for the same reason you have to pay a baseball player more if you don't give him a no-trade clause). So, much of this alleged benefit is not truly affecting overall compensation.

3. Price manipulation is a zero-sum game. If the executives win, someone loses. Who loses? The investors who buy the executives' shares when they sell. Who are those investors? Mostly the well-off. Some of the buyers might be pension funds for line workers, or some such, but I'd bet most of the buyers are upper middle class, at least. 

We know for sure it isn't the poorest who lose out, because they don't have pension funds or stocks. So it's probably the top 1 percent getting richer on the backs of the top 10 percent.


Here's one argument that *does* hold up, in a way: the claim that buybacks increase earnings per share (EPS).

Let's go back to the Acme example. Suppose, originally, they have $200,000 in earnings: $190,000 from the business, and $10,000 from interest on the $1 MM in the bank. With a million shares outstanding, EPS is 20 cents.

Now, they spend $300K to buy back 100,000 shares. Afterwards, their earnings will be $197,000 instead of $200,000. With only 900,000 shares remaining outstanding, EPS will jump from 20 cents to 21.89 cents.

Does that mean the CEO artificially increased EPS? I would argue: no. He did increase EPS, but not "artificially."

Before the buyback, Acme had a million dollars in the bank, earning only 1 percent interest. On the other hand, an investment in Acme itself would earn almost 7 percent (20 cents on the $3 share price). Why not switch the 1-percent investment for a 7-percent investment? 

It's a *real* improvement, not an artificial one. If Acme doesn't actually need the cash for business purposes, the buyback benefits all investors. It's the same logic that says that when you save for retirement, you get a better return in stocks than in cash. It might be right for Acme for the same reason it's right for you.
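Here's the EPS calculation as a small Python sketch, using the Acme numbers. The `eps` helper is hypothetical, and it assumes the business earnings stay at $190,000 and the bank pays a flat 1 percent.

```python
def eps(business_earnings, cash, interest_rate, shares):
    """Earnings per share: business earnings plus interest on cash, over shares."""
    return (business_earnings + cash * interest_rate) / shares


SHARE_PRICE = 3.0  # Acme's value per share, from the earlier example

before = eps(190_000, 1_000_000, 0.01, 1_000_000)  # $1 MM in the bank
after = eps(190_000, 700_000, 0.01, 900_000)       # after the $300K buyback

earnings_yield = before / SHARE_PRICE  # what a dollar invested in Acme earns
cash_yield = 0.01                      # what a dollar left in the bank earns

print(f"EPS before: {before * 100:.2f} cents, after: {after * 100:.2f} cents")
print(f"Acme yield: {earnings_yield:.2%} vs. cash yield: {cash_yield:.2%}")
```

The jump in EPS is just the gap between those two yields showing up on the remaining shares.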

Does the improvement in EPS boost the share price? Probably not much -- the stock market is probably efficient enough that investors would have seen the cash in the bank, and adjusted their expectations (and stock price) accordingly. A small boost might arise if the buybacks are larger, or earlier, than expected, but hardly enough to make the CEO any more fabulously wealthy than he'd be without them.


There's another reason companies might buy back shares -- to defer tax for their shareholders.

Suppose Coca-Cola has money sitting around. They can pay $40 to me as a dividend. If they do, I pay tax on that -- say, $12. So, now, I have $12 less in value than before. The value of my stock dropped by $40, and I only have $28 in after-tax cash to compensate.

Instead of paying a dividend, Coke could use the $40 to buy back a share. In that case, I pay no tax, and the value of my account doesn't drop.  

Actually, the buybacks are just deferring my taxes, not eliminating them. When I sell my shares, my capital gain will be $40 more after the buyback than it would have been if Coke had issued a dividend instead. As one of the linked articles notes, the US tax rate on capital gains is roughly the same as on dividends. So, the total amount is a wash -- it's just the timing that changes.
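A quick sketch of the timing difference, in Python. It assumes, as above, that dividends and capital gains are taxed at the same rate, and that the shareholder holds on to the shares through the buyback. The `tax_timing` name is mine, for illustration only.

```python
def tax_timing(payout, tax_rate, via_buyback):
    """Return (tax owed now, tax owed later) for a shareholder who keeps the shares."""
    tax = payout * tax_rate
    # A dividend is taxed right away; a buyback leaves the value in the share,
    # so the same tax shows up later as a bigger capital gain.
    return (0.0, tax) if via_buyback else (tax, 0.0)


RATE = 12 / 40  # the $12 of tax on a $40 dividend from the example

dividend = tax_timing(40, RATE, via_buyback=False)
buyback = tax_timing(40, RATE, via_buyback=True)

print("dividend:", dividend)
print("buyback: ", buyback)
print("same total:", sum(dividend) == sum(buyback))
```

If the two rates differed, or if the gain were never realized, the totals would no longer match; the sketch covers only the equal-rate case described here.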

Maybe that tax deferral bothers you. Maybe you think the companies are doing something unfair, and exploiting a loophole. I don't agree; for one thing, I think taxing corporate profits, and also dividends, is double taxation, a hidden, inefficient and sometimes unfair way to raise revenues. (Companies already have to pay corporate income tax on earnings, regardless of whether they use it for buybacks, dividends, reinvestment, or cash hoards.)

You might disagree with me on that point.  If you do, then why aren't you upset at companies who don't pay dividends at all? If share buybacks are a loophole because they defer taxes, then retained earnings must be a bigger loophole, because they defer even *more* taxes!

Keep in mind, though, the deferral from buybacks is not quite as big as it looks. When the company buys the shares, the sellers realize a capital gain immediately. If the stock has skyrocketed recently, the total tax the IRS collects after the buyback could, in theory, be a significant portion of the amount it would have collected off the dividend. (For instance, if all the selling shareholders had originally bought Coca-Cola stock for a penny, the entire buyback (less one cent) would be taxed, just as the entire dividend would have been.)

There's another benefit: when Coca-Cola buys shares, it buys them from willing sellers, who are in a position to accept their capital gains tax burden right now. That's the main advantage, as I see it: the immediate tax burden winds up falling on "volunteers," those who are able and willing to absorb it right now.


In my view, buybacks have little to do with greedy CEOs trying to enrich themselves, and they have negligible effect on the economy compared to traditional dividends. They're just the most tax-efficient way for companies to return value to their owners.


Friday, June 19, 2015

Can fans evaluate fielding better than sabermetric statistics?

Team defenses differ in how well they turn batted balls into outs. How do you measure the various factors that influence the differences? The fielders obviously have a huge role, but do the pitchers and parks also have an influence?

Twelve years ago, in a group discussion, Erik Allen, Arvin Hsu, and Tom Tango broke down the variation in batting average on balls in play (BAbip). Their analysis was published in a summary called "Solving DIPS" (.pdf).

A couple of weeks ago, I independently repeated their analysis -- I had forgotten they had already done it -- and, reassuringly, got roughly the same result. In round numbers, it turns out that:

The SD of team BAbip fielding talent is roughly 30 runs over a season.


There are several competing systems for evaluating which players and teams are best in the field, and by how much. The Fangraphs stats pages list some of those stats, and let you compare.

I looked at those team stats for the 2014 season. Specifically, these three:

1. DRS, from The Fielding Bible -- specifically, the rPM column, runs above average from plays made. (That's the one we want, because it doesn't include outfielder/catcher arms, or double-play ability.)

2. The Fan Scouting Report (FSR), which is based on an annual fan survey run by Tom Tango.

3. Ultimate Zone Rating (UZR), a stat originally developed by Mitchel Lichtman, but which, as I understand it, is now public. I used the column "RngR," which is the range portion (again to leave out arms and other defensive skills).

All three stats are denominated in runs. Here are their team SDs for the 2014 season, rounded:

37 runs -- DRS (rPM column)
23 runs -- Fan Scouting Report (FSR)
29 runs -- UZR (RngR)
30 runs -- team talent

The SD of DRS is much higher than the SD of team talent. Does that mean it's breaching the "speed of light" limit of forecasting, trying to (retrospectively) predict random luck as well as skill?

No, not necessarily. Because DRS isn't actually trying to evaluate talent.  It's trying to evaluate what actually happened on the field. That has a wider distribution than just talent, because there's luck involved.

A team with fielding talent of +30 runs might have actually saved +40 runs last year, just like a player with 30-home-run talent may have actually hit 40.

The thing is, though, that in the second case, we actually KNOW that the player hit 40 homers. For team fielding, we can only ESTIMATE that it saved 40 runs, because we don't have good enough data to know that the extra runs didn't just result from getting easier balls to field.

In defense, the luck of "made more good plays than average" is all mixed up with "had more easier balls to field than average."  The defensive statistics I've seen try their best to figure out which is which, but they can't, at least not very well.

What they do, basically, is classify every ball in play according to how difficult it was, based on location and trajectory. I found this post from 2003, which shows some of the classifications for UZR. For instance, a "hard" ground ball to the "56" zone (a specific portion of the field between third and short) gets turned into an out 43.5 percent of the time, and becomes a hit the other 56.5 percent. 

If it turns out a team had 100 of those balls to field, and converted them to outs at 45 percent instead of 43.5 percent, that's 1.5 extra outs it gets credited for, which is maybe 1.2 runs saved.
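In code, the crediting works something like the sketch below. This is my own simplified illustration, not the actual UZR or DRS implementation; the 0.8 runs per out is an assumption backed out of the "1.5 outs is maybe 1.2 runs" figure above.

```python
RUNS_PER_OUT = 0.8  # assumption: implied by 1.5 extra outs ~ 1.2 runs


def zone_credit(balls, out_rate, baseline_rate, runs_per_out=RUNS_PER_OUT):
    """Return (extra outs, runs saved) for one zone, relative to the league baseline."""
    extra_outs = balls * (out_rate - baseline_rate)
    return extra_outs, extra_outs * runs_per_out


# 100 hard ground balls to zone 56, converted at 45% against a 43.5% baseline.
extra_outs, runs_saved = zone_credit(100, 0.45, 0.435)
print(round(extra_outs, 2), round(runs_saved, 2))
```

Everything hinges on `baseline_rate` being the right expectation for the particular balls the team actually faced.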

The problem with that is: the 43.5 percent is a very imprecise estimate of what the baseline should be. Because, even in the "hard-hit balls to zone 56" category, the opportunities aren't all the same. 

Some of them are hit close to the fielder, and those might be turned into outs 95 percent of the time, even for an average or bad-fielding team. Some are hit with a trajectory and location that makes them only 8 percent. And, of course, each individual case depends where the fielders are positioned, so the identical ball could be 80 percent in one case and 10 percent in another.

In a "Baseball Guts" thread at Tango's site, data from Sky Andrecheck and BIS suggested that only 20 percent of ground balls, and 10 percent of fly balls, are "in doubt", in the sense that if you were watching the game, you'd think it could have gone either way. In other words, at least 80% of balls in play are either "easy outs" or "sure hits."  ("In doubt" is my phrase, meaning BIPs in which it wasn't immediately at least 90 percent obvious to the observer whether it would be a hit or an out.)

That means that almost all the differences in talent and performance manifest themselves in just 10 to 20 percent of balls in play.

But, even the best fielding systems have few zones that are less than 20 percent or more than 80 percent. That means that there is still huge variation in difficulty *even accounting for zone*. 

So, when a team makes 40 extra plays over a season, it's a combination of:

(a) those 40 plays came from extra performance from the few "in doubt" balls;
(b) those 40 plays came from easier balls overall.

I think (b) is much more a factor than (a), and that you have to regress the +40 to the mean quite a bit to get a true estimate. 

Maybe when the zones get good enough to show large differences between teams -- like, say, 20% for a bad fielder and 80% for a good fielder -- well, at that point, you have a system that might work. But, without that, doesn't it almost have to be the case that most of the difference is just from what kinds of balls you get?

Tango made a very relevant point, indirectly, in a recent post. He asked, "Is it possible that Manny Ramirez never made an above-average play in the outfield?"  The consensus answer, which sounds right to me, was ... it would be very rare to see Manny make a play that an average outfielder wouldn't have made. (Leaving positioning out of the argument for now.)

Suppose BIPs to a certain difficult zone get caught 30% of the time by an average fielder, and Manny catches them 20% of the time. Since ANY outfielder would catch a ball that Manny gets to ... well, that zone must really be at least TWO zones: a "very easy" zone with a 100% catch rate, and a "harder" zone with a 10% catch rate for an average fielder, and a 0% catch rate for Manny.

In other words, if Manny makes 30% plays in that zone and a Gold Glove outfielder makes 25%, it's almost certain that Manny just got easier balls to catch. 
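Here's a small sketch of that two-zone split, in Python. It takes the Manny numbers literally and assumes the extreme case where the weak fielder catches every "easy" ball and none of the "harder" ones. The function name is made up for illustration.

```python
def split_zone(avg_rate, weak_rate):
    """Split one zone into an easy part (everyone catches) and a hard part
    (the weak fielder never catches). Returns
    (easy share of balls, hard share of balls, average fielder's rate on hard balls)."""
    easy_share = weak_rate                 # the weak fielder's outs are all easy balls
    hard_share = 1.0 - easy_share
    hard_rate_avg = (avg_rate - easy_share) / hard_share
    return easy_share, hard_share, hard_rate_avg


easy_share, hard_share, hard_rate_avg = split_zone(avg_rate=0.30, weak_rate=0.20)

# Extra catches the average fielder makes, as a fraction of ALL balls in the zone.
extra_of_all_balls = hard_rate_avg * hard_share

print(f"easy balls: {easy_share:.0%}, harder balls: {hard_share:.0%}")
print(f"average fielder on harder balls: {hard_rate_avg:.1%}")
print(f"extra catches as a share of the whole zone: {extra_of_all_balls:.0%}")
```

Under these assumptions, the average fielder's extra catches come to 10 percent of all balls hit to the original zone, which works out to 12.5 percent of the harder balls alone. Either way, the single "30% zone" is hiding two very different kinds of opportunity.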

The only way to eliminate that kind of luck is to classify the zones in enough micro detail that you get close to 0% for the worst, or close to 100% for the best.

And that's not what's happening. Which means, there's no way to tell how many runs a defense saved.


And this brings us back to the point I made last month, about figuring out how to split observed runs allowed into observed pitching and observed fielding. There's really no way to do it, because you can't tell a good fielding play from an average one with the numbers currently available. 

Which means: the DRS and UZR numbers in the Fangraphs tables are actually just estimates -- not estimates of talent, but estimates of *what happened in the field*. 

There's nothing wrong with that, in principle: but, I don't think it's generally realized that that's what those are, just estimates. They wind up in the same statistical summaries as pitching and hitting metrics, which themselves are reliable observations. 

At baseball-reference, for instance, you can see, on the hitting page, that Robinson Cano hit .302-28-118 (fact), which was worth 31 runs above average (close enough to be called fact).

On his fielding page, you can see that Cano had 323 putouts (fact) and 444 assists (fact), which, by Total Zone Rating, was worth 4 runs below average (uh-oh).

Unlike the other columns, the Total Zone column is an *estimate*. Maybe it really was -4 runs, but it could easily have been -10 runs, or -20 runs, or +6 runs. 

To the naked eye, the hitting and fielding numbers both look equally official and reliable, as accurate observations of what happened. But one is based on an observation of what happened, and the other is based on an estimate of what happened.


OK, that's a bit of an exaggeration, so let me backtrack and explain what I mean.

Cano had 28 home runs, and 444 assists. Those are "facts", in the sense that the error is zero, if the observations are recorded correctly.

Cano's offense was 31 runs above average. I'm saying that's accurate enough to be called a "fact."  But admittedly, it is, in fact, an estimate. Even if the Linear Weights formula (or whatever) is perfectly accurate, the "runs above average" number is after adjusting for park effects (which are imperfect estimates, albeit pretty good ones). Also, the +31 assumes Cano faced league-average pitching. That, again, is an estimate, but, again, it's a pretty strong one.

For defense, comparatively, the Total Zone figure of "-4" is a very, very weak estimate. It carries an implicit assumption that Cano's "relative difficulty of balls in play" was zero. That's much less reliable than the estimate that his "relative difficulty of pitchers faced" was zero. If you wanted, you could do the math, and show how much weaker the one estimate is than the other; the difference is huge.

But, here's a thought experiment to make it clear. Suppose Cano faces the worst pitcher in the league, and hits a home run. In that case, he's at worst 1.3 runs above average for that plate appearance, instead of our estimate of 1.4. It's a real difference in how we evaluate his performance, but a small one.

On the other hand, suppose Cano faces a grounder in a 50% zone, but one of the easy ones, that almost any fielder would get to. Then, he's maybe +0.01 hits above average, but we're estimating +0.5. That is a HUGE difference. 

It's also completely at odds with our observation of what happens on the field. After an easy ground ball, even the most casual fan would say he observed Cano saving his team 0 runs over what another player would do. But we write it down as +0.4 runs, which is ... well, it's so big, you have to call it *wrong*. We are not accurately recording what happened on the field.

So, if you take "what happened on the field" in broad, intuitive terms, the home run matches: "he did a good thing on the field and created over a run" both to the observer and the statistic. But for the ground ball, the statistic lies. It says Cano "did a good thing on the field and saved almost half a run," but the observer says Cano "made a routine play." 

The batting statistics match what a human would say happened. The fielding stats do not.


How much random error is in those fielding statistics? When UZR gives an SD of 29 runs, how much of that is luck, and how much is talent? If we knew, we could at least regress to the mean. But we don't. 

That's because we don't know the idealized actual SD of observed performance, adjusted for the difficulty of the balls in play. It must be somewhere between 47 runs (the SD of observed performance without adjusting for difficulty), and 30 runs (the SD of talent). But where in between?
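To see why the answer matters, here's a rough sketch of what regressing to the mean would look like at each end of that range. It leans on a simplifying assumption that's mine, not something established above: that observed runs are just talent plus independent luck, so the fraction to keep is the talent variance divided by the observed variance. The function names are hypothetical.

```python
def shrink_factor(sd_talent, sd_observed):
    """Fraction of an observed figure to keep, assuming observed = talent + independent luck."""
    return (sd_talent / sd_observed) ** 2


def regress(observed_runs, sd_talent, sd_observed):
    """Regress an observed runs-saved figure toward a mean of zero."""
    return observed_runs * shrink_factor(sd_talent, sd_observed)


SD_TALENT = 30  # SD of team fielding talent, in runs

# The two ends of the range for the SD of difficulty-adjusted performance.
for sd_observed in (47, 30):
    estimate = regress(40, SD_TALENT, sd_observed)
    print(f"observed SD {sd_observed}: +40 regresses to {estimate:+.1f}")
```

A team credited with +40 could be anywhere from roughly +16 to the full +40 in talent terms, depending on where the true SD falls. Without knowing that, you can't tell how hard to regress.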

In addition: how sure are we that the estimates are even unbiased, in the sense that they're independently just as likely to be too high as too low? If they're truly unbiased, that makes them much easier to live with -- at the very least, you know they'll get more accurate as you average over multiple seasons. But if they inappropriately adjust for park effects, or pitcher talent, you might find some teams being consistently overestimated or underestimated. And that could really screw up your evaluations, especially if you're using those fielding estimates to rejig pitching numbers. 


For now, the estimates I like best are the ones from Tango's "Fan Scouting Report" (FSR). As I understand it, those are actually estimates of talent, rather than estimates of what happened on the field. 

Team FSR has an SD of 23 runs. That's very reasonable. It's even more conservative than it looks. That 23 includes all the "other than range" stuff -- throwing arm, double plays, and so on. So the range portion of FSR is probably a bit lower than 23.

We know the true SD of talent is closer to 30, but there's no way for subjective judgments to be that precise. For one thing, the humans that respond to Tango's survey aren't perfect evaluators of what they see on the field. Second, even if they *were* perfect, a portion of what they're observing is random luck anyway. You have to temper your conclusions for the amount of noise that must be there. 

It might be a little bit apples-to-oranges to compare FSR to the other estimates, because FSR has much more information to work with. The survey respondents don't just use the ball-in-play stats for a single year -- they consider the individual players' entire careers, ages and trajectories; the opinions of their peers and the press; their personal understanding of how fielding works; and anything else they deem relevant.

But, that's OK. If your goal is to try to estimate the influence of team fielding, you might as well just use the best estimate you've got. 

For my part, I think FSR is the one I trust the most. When it comes to evaluating fielding, I think sabermetrics is still way behind the best subjective evaluations.


Friday, June 12, 2015

New issue of "By the Numbers"

The May, 2015 issue of "By the Numbers" is now available.  You can get it from the SABR website (.pdf), or from my website (.pdf).  Back issues can be found here or here.


Thursday, May 28, 2015

Pitchers influence BAbip more than the fielders behind them

It's generally believed that when pitchers' teams vary in their success rate in turning batted balls into outs, the fielders should get the credit or blame. That's because of the conventional wisdom that pitchers have little control over balls in play.

I ran some numbers, and ... well, I think that's not right. I think individual pitchers actually have as much influence on batting average on balls in play (BAbip) as the defense behind them, and maybe even a bit more.


UPDATE: turns out all the work I did is just confirming a result from 2003, in a document called "Solving DIPS" (.pdf).  It's by Erik Allen, Arvin Hsu, and Tom Tango. (I had read it, too, several years ago, and promptly forgot about it.)

It's striking how close their numbers are to these, even though I'm calculating things in a different way than they did. That suggests that we're all measuring the same thing with the same accuracy.

One advantage of their analysis over mine is they have good park effect numbers.  See the first comment in this post for Tango's links to "batting average on balls in play" park effect data.


For the first step, I'll run the usual "Tango method" to divide BAbip into talent and luck.

For all team-seasons from 2001 to 2011, I figured the SD of team BAbip, adjusted for the league average. That SD turned out to be .01032, which I'll refer to as "10.3 points", as in "points of batting average."  

The average SD of binomial luck for those seasons was 7.1 points. Since

SD(observed)^2 = SD(luck)^2 + SD(talent)^2

we can calculate that SD(talent) = 7.5 points.

"Talent," here, doesn't yet differentiate between pitcher and fielder talent. Actually, it's a conglomeration of everything other than luck -- fielders, pitchers, slight randomness of opposition batters, day/night effects, and park effects. (In this context, we're saying that Oakland's huge foul territory has the "talent" of reducing BAbip by producing foul pop-ups.)


7.1 = SD(luck) 
7.5 = SD(talent) 

For a team-season from 2001 to 2011, talent was more important than luck, but not by much. 
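
The "binomial luck" number treats every ball in play as an independent trial with the same hit probability. Here is a minimal Python sketch of that calculation. The .300 hit rate and the 4,100 balls in play are illustrative assumptions, not values taken from the 2001-2011 data, and the function name is made up for this example.

```python
import math

def binomial_luck_sd_points(hit_rate, balls_in_play):
    """SD of an observed rate over independent trials, in points (thousandths)."""
    return 1000 * math.sqrt(hit_rate * (1 - hit_rate) / balls_in_play)

# Assumed inputs, for illustration only
luck_points = binomial_luck_sd_points(0.300, 4100)
print(round(luck_points, 1))
```

With those assumed inputs the result lands close to the 7.1 points quoted above. Fewer balls in play would give a larger luck SD.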

I did the same calculation for other sets of seasons. Here's the summary:

          Observed  Luck  Talent
1960-1968    11.41  6.95   9.05
1969-1976    12.24  6.86  10.14
1977-1991    10.95  6.94   8.46
1992-2000    11.42  7.22   8.85
2001-2011    10.32  7.09   7.50
"Average"    11.00  7.00   8.50

I've arbitrarily decided to "average" the eras out to round numbers:  7 points for luck, and 8.5 points for talent. Feel free to use actual averages if you like. 
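
As a check on the table, each Talent entry can be recomputed from the Observed and Luck columns. A short Python sketch (the list and function names are just for illustration):

```python
import math

# (era, observed SD, luck SD) in points, copied from the table above
ERAS = [
    ("1960-1968", 11.41, 6.95),
    ("1969-1976", 12.24, 6.86),
    ("1977-1991", 10.95, 6.94),
    ("1992-2000", 11.42, 7.22),
    ("2001-2011", 10.32, 7.09),
]

def talent_sd(observed_sd, luck_sd):
    """Solve SD(observed)^2 = SD(luck)^2 + SD(talent)^2 for SD(talent)."""
    return math.sqrt(observed_sd ** 2 - luck_sd ** 2)

talents = {era: talent_sd(obs, luck) for era, obs, luck in ERAS}
for era, value in talents.items():
    print(era, round(value, 2))
```

The recomputed values agree with the table to within about a hundredth of a point. The small gaps come from the inputs already being rounded.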

It's interesting how close that breakdown is to the (rounded) one for team W-L records:

          Observed  Luck  Talent
BAbip        11.00  7.00   8.50
Team Wins    11.00  6.50   9.00

That's just coincidence, but still interesting and intuitively helpful.


That works for separating BAbip into skill and luck, but we still need to break down the skill into pitching and fielding.

I found every pitcher-season from 1981 to 2011 where the pitcher faced at least 400 batters. I compared his BAbip allowed to that of the rest of his team. The comparison to teammates effectively controls for defense, since, presumably, the defense is the same no matter who's on the mound. 

Then, I took the player/rest-of-team difference, and calculated the Z-score: if the difference were all random, how many SDs of luck would it be? 

If BAbip was all luck, the SD of the Z-scores would be exactly 1.0000. It wasn't, of course. It was actually 1.0834. 

Using the "observed squared = talent squared plus luck squared" relationship, we can calculate that SD(talent) is 0.417 times as big as SD(luck). For the full dataset, the (geometric) average SD(luck) was 21.75 points. So, SD(talent) must be 0.417 times 21.75, which is 9.07 points.
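
Here is a rough Python sketch of the two steps: turning a pitcher-versus-teammates difference into a Z-score, then converting the SD of those Z-scores into a talent SD. The luck SD of the difference is assumed to be plain binomial with a pooled hit rate, which may not match exactly how it was computed for this study. The sample pitcher (.280 on 600 balls in play, against .300 on 3,600 for his teammates) is invented.

```python
import math

def luck_sd_of_difference(p, n_pitcher, n_rest):
    """Binomial SD of (pitcher BAbip - rest-of-team BAbip) if both share true rate p."""
    return math.sqrt(p * (1 - p) * (1 / n_pitcher + 1 / n_rest))

def z_score(babip_pitcher, n_pitcher, babip_rest, n_rest):
    """How many luck SDs the pitcher is from his teammates (pooled-rate assumption)."""
    pooled = (babip_pitcher * n_pitcher + babip_rest * n_rest) / (n_pitcher + n_rest)
    return (babip_pitcher - babip_rest) / luck_sd_of_difference(pooled, n_pitcher, n_rest)

def talent_from_z_sd(z_sd, avg_luck_sd):
    """If the Z-scores have SD z_sd, talent is sqrt(z_sd^2 - 1) luck SDs wide."""
    ratio = math.sqrt(z_sd ** 2 - 1.0)
    return ratio, ratio * avg_luck_sd

# Invented pitcher, for illustration only
z = z_score(0.280, 600, 0.300, 3600)

# Figures quoted in the text: SD of Z-scores 1.0834, average luck SD 21.75 points
ratio, talent_points = talent_from_z_sd(1.0834, 21.75)
print(round(z, 2), round(ratio, 3), round(talent_points, 2))
```

The invented pitcher comes out about one luck SD better than his teammates, which by itself says very little. It is only the spread of thousands of such Z-scores that reveals the talent component.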

We're not quite done. The 9.07 isn't an estimate of a single pitcher's talent SD; it's the estimate of the difference between that pitcher and his teammates. There's randomness in the teammates, too, which we have to remove.

I arbitrarily chose to assume the pitcher has 8 times the luck variance of the teammates (he probably pitched more than 1/8 of the innings, but there are more than 8 other pitchers to dilute the SD; I just figured maybe the two forces balance out). That would mean 8/9 of the total variance belongs to the individual pitcher, or the square root of 8/9 of the SD. That reduces the 9.07 points to 8.55 points.

8.55 = SD(single pitcher talent)

That's for individual pitchers. The SD for the talent of a *pitching staff* would be lower, of course, since the individual pitchers would even each other out. If there were nine pitchers on the team, each with equal numbers of balls in play, we'd just divide that by the square root of 9, which would give 2.85. I'll drop that to 2.5, because real life is probably a bit more dilute than that.
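
The two adjustments can be sketched in Python as follows. The function names are hypothetical, and the staff calculation assumes the pitchers' talents are independent of each other, which is a simplification of my own for the example.

```python
import math

def strip_teammate_noise(diff_sd, variance_ratio=8.0):
    """Keep the pitcher's share (8/9 by default) of the variance of the difference."""
    return diff_sd * math.sqrt(variance_ratio / (variance_ratio + 1.0))

def staff_sd(single_pitcher_sd, workload_shares):
    """SD of a staff's blended talent, with each pitcher weighted by his share of balls in play."""
    return single_pitcher_sd * math.sqrt(sum(w * w for w in workload_shares))

single = strip_teammate_noise(9.07)
equal_staff = staff_sd(single, [1 / 9] * 9)  # nine pitchers, equal workloads
print(round(single, 2), round(equal_staff, 2))
```

Passing a different list of workload shares (summing to 1) shows how the staff figure moves as the innings are spread across more or fewer pitchers.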

So for a single team-season, we have

8.5 = SD(overall talent) 
2.5 = SD(pitching staff talent) 
8.1 = SD(fielding + all other talent)


What else is in that 8.1 other than fielding? Well, there's park effects. The only effect I have good data for, right now (I was too lazy to look hard), is foul outs. I searched for those because of all the times I've read about the huge foul territory in Oakland, and how big an effect it has.

Google found me a FanGraphs study by Eno Sarris, showing huge differences in foul outs among parks. The difference between top and bottom is more than double -- 398 outs in Oakland over two years, compared to only 139 in Colorado. 

The team SD from Sarris's chart was about 24 outs per year. Only half of those go to the home pitchers' BAbip, so that's 12 per year. Just to be conservative, I'll reduce that to 10.

Ten extra outs on a team-season's worth of BIP is around 2.5 points.

So: if 8.1 is the remaining unexplained talent SD, we can break it down as 2.5 points of foul territory, and 7.7 points of everything else (including fielding).
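
A quick Python sketch of that arithmetic: convert extra outs to points of BAbip, then peel a known component off a total by subtracting squares. The 4,000 balls in play per team-season is an assumed round number, and the function names are made up.

```python
import math

ASSUMED_BIP_PER_TEAM_SEASON = 4000  # assumption, not a figure from the study

def outs_to_points(extra_outs, balls_in_play):
    """Extra outs expressed as points (thousandths) of BAbip."""
    return 1000.0 * extra_outs / balls_in_play

def peel_off(total_sd, component_sd):
    """SD left over after removing one independent component from a total."""
    return math.sqrt(total_sd ** 2 - component_sd ** 2)

park_points = outs_to_points(10, ASSUMED_BIP_PER_TEAM_SEASON)
after_staff = peel_off(8.5, 2.5)         # overall talent minus pitching staff
after_park = peel_off(8.1, park_points)  # then minus foul territory
print(park_points, round(after_staff, 1), round(after_park, 1))
```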

Our breakdown is now:

11.0 = SD(observed) 
 7.1 = SD(luck) 
 2.5 = SD(pitching staff)
 2.5 = SD(park foul outs)
 7.7 = SD(fielders + unexplained)

We can combine the luck, pitching staff, and park lines of the breakdown to get this:

11.0 = SD(observed) 
 7.9 = SD(luck/pitchers/park) 
 7.7 = SD(fielders/unexplained)

Fielding and non-fielding are almost exactly equal. Which is why I think you have to regress BAbip around halfway to the mean to get an unbiased estimate for the contribution of fielding.
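
The "halfway" figure can be sketched in Python. This assumes the components are independent, so that fielding's share of the total variance is the fraction of an observed deviation to keep. The 20-point team is hypothetical.

```python
import math

def combine(*sds):
    """Combine independent components by adding their variances."""
    return math.sqrt(sum(s * s for s in sds))

def fielding_share(fielding_sd, other_sd):
    """Fraction of observed variance that belongs to fielding."""
    return fielding_sd ** 2 / (fielding_sd ** 2 + other_sd ** 2)

non_fielding = combine(7.1, 2.5, 2.5)      # luck, pitching staff, park foul outs
share = fielding_share(7.7, non_fielding)

# Hypothetical team whose BAbip allowed is 20 points better than league
observed_deviation = 20.0
fielding_estimate = share * observed_deviation
print(round(non_fielding, 1), round(share, 2), round(fielding_estimate, 1))
```

Under those assumptions, a little under half of the team's 20-point edge would be credited to the fielders, with the rest attributed to luck, pitching, and park.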

UPDATE: as mentioned, Tango has better park effect data, here.


Now, remember when I said that pitchers differ more in BAbip than fielders? Not for a team, but for an individual pitcher,

8.5 = SD(individual pitcher)
7.7 = SD(fielders + unexplained)

The only reason fielding is more important than pitching for a *team*, is that the multiple pitchers on a staff tend to cancel each other out, reducing the 8.5 SD down to 2.5.


Well, those last three charts are the main conclusions of this study. The rest of this post is just confirming the results from a couple of different angles.


Let's try this, to start. Earlier, when we found that SD(pitchers) = 8.5, we did it by comparing a pitcher's BAbip to that of his teammates. What if we compare his BAbip to the rest of the pitchers in the league, the ones NOT on his team?

In that case, we should get a much higher SD(observed), since we're adding the effects of different teams' fielders.

We do. When I convert the pitchers to Z-scores, I get an SD of 1.149. That means SD(talent) is 0.57 times as big as SD(luck). With SD(luck) calculated to be about 20.54 points, based on the average number of BIPs in the two samples ... that makes SD(talent) equal to 11.6 points.

In the other study, we found SD(pitcher) was 8.5 points. Subtracting the square of 8.5 from the square of 11.6, as usual, gives

11.6 = SD(pitcher+fielders+park)
 8.5 = SD(pitcher)
 7.9 = SD(fielding+park)

So, SD(fielding+park) works out to 7.9 by this method, 8.1 by the other method. Pretty good confirmation.


Let's try another. This time, we'll look at pitchers' careers, rather than single seasons. 

For every player who pitched at least 4,000 outs (1333.1 innings) between 1980 and 2011, I looked at his career BAbip, compared to his teammates' weighted BAbip in those same seasons. 

And, again, I calculated the Z-scores for number of luck SDs he was off. The SD of those Z-scores was 1.655. That means talent was 1.32 times as important as luck (since 1.32 squared plus 1 squared equals 1.655 squared).

The SD of luck, averaged for all pitchers in the study, was 6.06 points. So SD(talent) was 1.32 times 6.06, or 8.0 points.

10.0 = SD(pitching+luck)
 6.1 = SD(luck)
 8.0 = SD(pitching)

The 8.0 is pretty close to the 8.5 we got earlier. And, remember, we didn't include all pitchers in this study, just those with long careers. That probably accounts for some of the difference.

Here's the same thing, but for 1960-1979:

 9.3 = SD(pitching+luck)
 6.0 = SD(luck)
 7.2 = SD(pitching)

It looks like variation in pitcher BAbip skill was lower in the olden times than it is now. Or, it's just random variation.


I did the career study again, but compared each pitcher to OTHER teams' pitchers. Just like when we did this for single seasons, the SD should be higher, because now we're not controlling for differences in fielding talent. 

And, indeed, it jumps from 8.0 to 8.8. If we keep our estimate that 8.0 is pitching, the remainder must be fielding. Doing the breakdown:

10.5 = SD(pitching+fielding+luck)
 5.8 = SD(luck)
 8.0 = SD(pitching)
 3.6 = SD(fielding)

That seems to work out. Fielding is smaller for a career than a season, because the quality of the defense behind the pitcher tends to even out over a career. I was surprised it was even that large, but, then, it does include park effects (and those even out less than fielders do). 

For 1960-1979:

10.2 = SD(pitching+fielding+luck)
 5.7 = SD(luck)
 7.2 = SD(pitching)
 4.4 = SD(fielding)

Pretty much along the same lines.
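
Both career breakdowns can be reproduced with one small Python helper that subtracts the known components' variances from the total. The dictionary layout and names are only for illustration. The inputs are the rounded figures shown above.

```python
import math

def residual_sd(total_sd, *known_sds):
    """SD left over after removing several independent components from a total."""
    return math.sqrt(total_sd ** 2 - sum(s * s for s in known_sds))

# Career figures (in points) from the two breakdowns above
CAREER = {
    "1980-2011": {"total": 10.5, "luck": 5.8, "pitching": 8.0},
    "1960-1979": {"total": 10.2, "luck": 5.7, "pitching": 7.2},
}

fielding = {
    era: residual_sd(v["total"], v["luck"], v["pitching"])
    for era, v in CAREER.items()
}
for era, value in fielding.items():
    print(era, round(value, 1))
```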


Unless I've screwed up somewhere, I think we've got these as our best estimates for BAbip variation in talent:

8.5 = SD(individual pitcher BAbip talent)
2.5 = SD(team pitching staff BAbip talent)
7.7 = SD(team fielding staff BAbip talent)
2.5 = SD(park foul territory BAbip talent)

And, for a single team-season,

7.1 = SD(team season BAbip luck)

For a single team-season, it appears that luck, pitching, and park effects, combined, are about as big an influence on BAbip as fielding skill.  


Monday, May 25, 2015

Do defensive statistics overrate the importance of fielding?

The job of a baseball team's pitchers and fielders is to prevent the opposition from scoring. It's easy to see how well they succeeded, collectively -- just count the number of runs that crossed the plate.

But how can you tell how many of those runs were because of the actions of the pitchers, and how many were because of the actions of the fielders?

I think it's very difficult, and that existing attempts do more harm than good.

The breakdowns I've been looking at lately are the ones from Baseball Reference, on their WAR pages -- pitching Wins Above Replacement (pWAR) and defensive Wins Above Replacement (dWAR). 

They start out with team Runs Allowed. Then, they try to figure out how good the defense was, behind the pitchers, by using play-by-play data. If they conclude that the defense cost the team, say, 20 runs, they bump the pitchers by 20 runs to balance that out.

So, if the team gave up 100 runs more than average, the pitchers might come in at -80, and the fielders at -20. 

I'm going to argue that method doesn't work right. The results are unreliable, inaccurate, and hard to interpret. 

Before I start, five quick notes:

1. My argument is strongest when fielding numbers are based on basic play-by-play data, like the kind from Retrosheet. When using better data, like the Fielding Bible numbers (which use subjective observations and batted ball timing measurements), the method becomes somewhat better. (But, I think, still not good enough.)  Baseball Reference uses the better method for 2003 and later.

2. The WAR method tries to estimate the value of what actually took place, not the skill level of the players who made it happen. That is: it's counting performance, not measuring talent. 

3. Defensive WAR (dWAR) correlates highly with opposition batting average on balls in play (BAbip). For team-seasons from 1982 to 2009, the correlation was +0.60. For 2003 to 2009, with the improved defense data, the correlation was still +0.56.

dWAR actually includes much more than just BAbip. There are adjustments for outfielder arms, double plays, hit types, and so on. But, to keep things simple, I'm going to argue as if dWAR and BAbip are measuring the same thing. The argument wouldn't change much if I kept adding disclaimers for the other stuff.

4. I'll be talking about *team* defense and pitching. The calculation for individuals has other issues that I'm not going to deal with here.  

5. I don't mean to pick on B-R specifically ... I think there are other systems that do things the same way. I just happened to run across this one most recently. Also, even though the example is in the context of WAR, the criticism isn't about WAR at all; it's only about splitting observed performance between pitching and fielding. 


The problem here is a specific case of a more general issue: it's easy to see what happens on the field, but often very difficult to figure out how to allocate it to the various players involved.

That's why hockey and football and basketball and soccer are so much harder to figure out than baseball. In basketball, when there's a rebound, how do you figure out who "caused" it? It could be the rebounder being skillful, or the other players drawing the defenders away, or even the coach's strategy.

But in baseball, when a single is hit, you pretty much know you can assign all the offensive responsibility to the batter. (The baserunners might have some effect, but it's small.) And even though you have to assign the defensive responsibility to the nine fielders collectively, the pitcher is so dominant that, for years, we've chosen to almost ignore the other eight players completely.

But now that we *don't* want to ignore them ... well, how do you figure out which players "caused" a run to be prevented? Now it gets hard.


Even for a single ball in play, it's hard.

The pitcher gives up a ground ball to the shortstop, who throws the batter out. We all agree that we observed an out, and we credit the out we observed to the pitcher and fielders collectively (under "opposition batting").

But how do we allocate it separately rather than collectively? Do we credit the pitcher? The fielders? Both, in some proportion?

It probably depends on the specifics of the play.

With two outs in the bottom of the ninth, when a third baseman dives to catch a screaming line drive over the bag, we say his defense saved the game. When it's a soft liner hit right to him, we say the pitcher saved the game. Our observation of what happened on the field actually depends on our perception of where the credit lies. 

We might want to credit the fielder in proportion to the difficulty of the play. If it's an easy grounder, one that would be a hit only 5% of the time, we might credit the defense with 5% of the value of the out, with the rest to the pitcher. If it's a difficult play, we might go 80% instead of 5%.

That sounds reasonable. And it's actually what dWAR does; it apportions the credit based on how often it thinks an average defense would have turned that batted ball into an out.

The problem is: how do you estimate that probability? For older seasons, where you have only Retrosheet data, you can't. At best, from play-by-play data, you can figure if it's a ground ball, fly ball, or liner. If you don't have that information, you have to just assume that every ball in play is league-average, around 30 percent chance of being a hit and 70 percent chance of being an out.

Here is the problem, which I think is not immediately obvious: when you assume certain balls in play are the same, you wind up, mathematically, giving 100 percent of the allocation of credit to the fielders. Even if you go through the arithmetic of calculating everything to the average of 70/30, and assigning 30 percent of every hit to the pitcher, and so forth ... if you do all that, you'll still wind up, at the end of the calculation, that if the fielders gave up 29 percent hits instead of 30 percent, the fielders get ALL the credit for the one "missing" hit.

Of course, the actual Baseball Reference numbers don't treat every BIP the same ... they do have the hit type, and, for recent seasons, they have trajectory data to classify BIPs better. But, it's not perfect. There are still discrepancies, all of which wind up in the fielding column.  [UPDATE: I think even the best data currently available barely puts a dent in the problem.]

For Retrosheet data, and making up numbers:  suppose a fly ball has a 26 percent chance of being a hit, but a ground ball has a 33 percent chance. The pitchers will get the credit for what types of balls they gave up, but the fielders will get 100% of the credit after that. So, if the pitcher gives up fewer ground balls, he gets the credit. But if he gives up *easier* ground balls, the fielders get the credit instead.
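
To make the split concrete, here is a small Python sketch using the made-up 26 and 33 percent rates from above. The league mix of half ground balls and half fly balls, the team's totals, and the function name are all invented for the example. This is the general shape of the allocation as I've described it, not Baseball Reference's actual code.

```python
# Made-up league hit rates by batted-ball type, as in the text
LEAGUE_HIT_RATE = {"ground": 0.33, "fly": 0.26}

def split_credit(bip_by_type, hits_allowed, league_mix):
    """Split hits above/below league average into a pitcher part and a fielder part.

    Pitchers are credited only for the mix of ball types they allowed.
    Everything else, including how hard each ball was hit, lands on the fielders.
    Negative numbers mean hits saved.
    """
    total_bip = sum(bip_by_type.values())
    league_avg = sum(league_mix[t] * LEAGUE_HIT_RATE[t] for t in league_mix)
    expected_from_mix = sum(n * LEAGUE_HIT_RATE[t] for t, n in bip_by_type.items())
    pitcher_part = expected_from_mix - total_bip * league_avg
    fielder_part = hits_allowed - expected_from_mix
    return pitcher_part, fielder_part

# Invented team: 1,800 grounders, 2,200 flies, 1,140 hits allowed
pitchers, fielders = split_credit(
    {"ground": 1800, "fly": 2200}, 1140, {"ground": 0.5, "fly": 0.5}
)
print(round(pitchers, 1), round(fielders, 1))
```

In this invented case the pitchers are credited with 14 hits saved for inducing more fly balls, and the fielders with the other 26, whether or not those ground balls were simply easier to field.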

This is the key point. Everything that's not known about the ball in play, everything that's random, or anything that has to be averaged or guessed or estimated or ignored -- winds up entirely in the fielders' column.

Now, maybe someone could argue that's actually what we want, that all the uncertainty goes to the fielders. Because, it's been proven, pitchers don't differ much in control over balls in play.

But that argument fails. "Pitchers don't differ much in BAbip" is a statement about *talent*, not about *observations*. In actual results, pitchers DO differ in BAbip, substantially, because of luck. Take two identical pitcher APBA cards, and play out seasons, and you'll find big differences. 

Observations are a combination of talent and luck. If you want to divide the observed balls in play into observed pitching and observed fielding, you're also going to have to divide the luck properly. Not, zero to the pitcher and 100 percent to the fielders.


Traditionally, with observations at the team level, our observations were close to 100% perfect, in how they reflected what happened on the field. But when you take those observations and allocate them, it's no longer 100% perfect. You're estimating, rather than observing.

For the first time, we are "guessing" about how to allocate the observation. 

Watching a specific game, in 1973 or whenever, it would have been obvious that a certain run was prevented by the pitcher inducing three weak ground balls to second base. Now, we can't tell that from Retrosheet data, and so we (mistakenly, in this case) assign the credit to the fielders. 

Nowhere else do we have to guess. We'll observe the same number of doubles, home runs, strikeouts, and popups to first base as the fan in the seats in 1973. We'll observe the same runs and runs allowed, and innings pitched, and putouts by the left fielder. But we will NOT "observe" the same "performance" of the fielders. Fans who watched the games got a reasonable idea how many runs the fielders saved (or cost) by their play. We don't; we have to guess.

Of course, in guessing, our error could go either way. Sometimes we'll overcredit the defense (compared to the pitcher), and sometimes we'll undercredit the defense. Doesn't it all come out in the wash? Nope. A single season is nowhere near enough for the variation to even out. 

For single seasons, we will be substantially overestimating the extent of the fielders' responsibility for runs prevented (or allowed) on balls in play.

This means that you actually have to regress team dWAR to the mean to get an unbiased estimate for what happened on the field. That's never happened before. Usually, we need to regress to estimate *talent*. Here, we need to regress just to estimate *observations*. (To estimate talent, we'd need to regress again, afterwards.)


Here's an analogy I think will make it much clearer.

Imagine that we didn't have data about home runs directly, only about deep fly balls (say, 380 feet or more). And we found that, on average, 75% of those fly balls turn out to be home runs.

One year, the opposition hits 200 fly balls, but instead of 150 of them being home runs, only 130 of them are.

And we say, "wow, those outfielders must be awesome, to have saved 20 home runs. They must have made 20 spectacular, reaching-over-the-wall, highlight-reel catches!"

No, probably not. It most likely just turned out that only 130 of those 200 deep fly balls had enough distance. By ignoring the substantial amount of luck in the relationship between fly balls and home run potential, we wind up overestimating the outfielders' impact on runs.

That's exactly what's happening here.


"What actually happened on the field" is partly subjective. We observe what we think is important, and attribute it where we think it makes sense. You could take the outcome of an AB and assign it to the on-deck hitter instead of the batter, but that would make no sense given the way our gut assigns credit or blame. (Our gut, of course, is based on common-sense understandings of how baseball and physics work.)

We assign the observations to the players we think "caused" the result. But we do it even when we know the result is probably the result of luck. It's just the way our brains work. If it was luck, we want to assign it to the lucky party. Otherwise, our observation is *wrong*. 

Here's an analogy, a coin-flipping game. The rules go like this:

-- The pitcher flips a coin. If it's a head, it's a weak ground ball, which is always an out. If it's a tail, it's a hard ground ball. 

-- If it's a hard ground ball, the fielders take over, and flip their own coin. If that coin is a head, they make the play for an out. If it's a tail, it's a hit. 

You play a season of this, and you see the team allowed 15 more hits than expected. How do you assign the blame? 

It depends.

Suppose the "fielders" coin flipped tails exactly half the time, as expected, but the "pitchers" coin flipped too many tails, so the pitchers gave up 30 too many hard ground balls. In that case, we'd say that the fielders can't be blamed, that the 15 extra hits were the pitcher's fault.

If it were the other way around -- the pitcher coin flipped heads half the time, but the "fielder" coin flipped 15 too few heads, letting 15 too many balls drop in for hits -- we'd "blame" the fielders. 

We have very specific criteria about how to assign the observations properly, even when they're just random.

The dWAR calculation violates those criteria. It refuses to look at the pitcher coin, or at least, it doesn't have complete data for it. So it just assigns all 15 hits, or whatever the incomplete pitcher coin data can't explain, to the fielder coin.
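
Here is the coin game's bookkeeping as a short Python sketch, with both ways of assigning blame side by side. The 4,000-flip season and the function names are assumptions for the example.

```python
P_HARD = 0.5            # pitcher coin: tails means a hard ground ball
P_HIT_GIVEN_HARD = 0.5  # fielder coin: tails means a hit

def blame_with_both_coins(n_bip, hard_grounders, hits):
    """Assign extra hits to whichever coin actually deviated from expectation."""
    pitcher = (hard_grounders - n_bip * P_HARD) * P_HIT_GIVEN_HARD
    fielders = hits - hard_grounders * P_HIT_GIVEN_HARD
    return pitcher, fielders

def blame_without_pitcher_coin(n_bip, hits):
    """With no data on the pitcher coin, every extra hit lands on the fielders."""
    expected_hits = n_bip * P_HARD * P_HIT_GIVEN_HARD
    return 0.0, hits - expected_hits

# Assumed season of 4,000 balls in play: pitcher coin comes up tails 30 times
# too often, fielder coin splits exactly evenly
pitcher_luck = blame_with_both_coins(4000, 2030, 1015)
pitcher_luck_blind = blame_without_pitcher_coin(4000, 1015)

# Same 15 extra hits, but this time the fielder coin is the one that misbehaved
fielder_luck = blame_with_both_coins(4000, 2000, 1015)
print(pitcher_luck, pitcher_luck_blind, fielder_luck)
```

In the first scenario the full accounting puts all 15 extra hits on the pitcher and none on the fielders, while the version that can't see the pitcher coin puts all 15 on the fielders.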


Why is this a big deal? What does it matter, that the split between pitching and defense has this flaw?

1.  First, it matters because it introduces something that's new to us, and not intuitively obvious -- the need to regress the dWAR to the mean *just to get an unbiased estimate of what happened on the field*. It took me a lot of thinking until I realized that's what's going on, partly because it's so counterintuitive.

2.  It matters because most estimates of fielding runs saved don't do any regression to the mean. This leads to crazy overestimates of the impact of fielding. 

My guess, based on some calculations I did, is that you have to regress dWAR around halfway to the mean, for cases where you just use BAbip as your criterion. If my guess is right, it means fielding is only half as important as dWAR thinks it is. 

Of course, if you have better data, like the Fielding Bible's, you may have to regress less -- depending on how accurate your estimates are of how hard each ball is to field. Maybe with that data you only have to regress, say, a third of the way to the mean, instead of a half. I have no idea.

The first edition of the Fielding Bible figures fielding cost the 2005 New York Yankees a total of 164 hits -- more than a hit a game. That's about 130 runs, or 13 wins.

They were saying that if you watched the games, and evaluated what happened, you'd see the fielders screw up often enough that you'd observe an average of a hit per game. 

I'm saying ... no way, that's too hard to believe. I don't know what the real answer is, but I'd be willing to bet that it's closer to half a hit than a full hit -- with the difference caused by the Fielding Bible's batted-ball observations not being perfect.

I'll investigate this further, how much you have to regress.

3.  The problem *really* screws up the pitching numbers. What you're really trying to do is start with Runs Allowed and subtract observed defense. But the measure of observed defense you're using is far too extreme. So, effectively, you're subjecting true pWAR to an overadjustment, along with a random shock.

Even so, that doesn't necessarily mean pWAR becomes less accurate. If the errors were the same magnitude as the true effect of fielding, it would generally be a break-even. If the Yankees are actually 40 runs worse than average, the error is the same whether you credit the pitchers 80 runs, or 0 runs ... it's just a matter of which direction. 

Except: the errors aren't fixed. Even if you were to adjust dWAR by regressing it to the mean exactly the right amount, it would still be just an estimate, with a random error. Add that in, and you'd still be behind.

And, perhaps more importantly, with the adjustment, we lose our understanding of what the numbers might mean. The traditional way, when the error is due to *not* adjusting for defense, we intuitively know how to deal with the numbers, what they might not mean. We've always known we can't evaluate pitchers based on runs allowed, unless we adjust for fielding. But, we've developed a gut feel for what the unadjusted numbers mean, because we've dealt with them so often. 

We probably even have an idea what direction the adjustment has to go, whether the pitchers in question had a good or bad defense behind them -- because we know the players, and we've seen them play, and we know their reputations. 

But, the dWAR way ... well, we have no gut feel for how we need to adjust, because it's no longer about adding in the fielders; it's about figuring out how bad the defense overestimate might have been, and how randomness might have affected the final number.  

When you adjust for dWAR, what you're saying is: "instead of living with the fact that team pitching stats are biased by the effects of fielding, I prefer to have team pitching stats biased by some random factor that's actually a bigger bias than the original." 

All things considered, I think I'd rather just stick with runs allowed.
