Sabermetric Research: Does diversity help soccer teams win?

A new academic paper (.pdf) claims that linguistic diversity can be a huge factor in helping high-level soccer teams win games. Here's a Washington Post blog item where they explain their study and conclusions.

They looked at ten seasons' (2003-2013) worth of teams from the Champions League Group Round and beyond -- 168 team-seasons in all. To measure which teams were the most diverse, they used something called the "Automated Similarity Judgment Program" to rate every pair of languages on a 100-point scale. The higher the number, the more distant the languages are from each other. (Roughly speaking, the European languages are fairly close to each other, as are the East Asian languages, and the Middle Eastern languages. But each of those three groups is relatively distant from the others.)

Then, they calculated the overall team score by averaging the linguistic distance for every pair of players.
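To make the averaging step concrete, here is a minimal Python sketch. The distance numbers are invented for illustration and are not real ASJP output, and treating two players who share a language as zero distance apart is an assumption, not something the paper is known to specify.

```python
from itertools import combinations

# Hypothetical distances on a 0-100 scale (higher = more distant).
# These numbers are made up for illustration; they are not ASJP scores.
DISTANCE = {
    frozenset(["Spanish", "Portuguese"]): 30.0,
    frozenset(["Spanish", "German"]): 70.0,
    frozenset(["Portuguese", "German"]): 72.0,
}

def language_distance(a, b):
    """Distance between two languages; same language is assumed to be 0."""
    if a == b:
        return 0.0
    return DISTANCE[frozenset([a, b])]

def team_diversity(player_languages):
    """Average linguistic distance over every pair of players."""
    pairs = list(combinations(player_languages, 2))
    if not pairs:
        return 0.0
    return sum(language_distance(a, b) for a, b in pairs) / len(pairs)

# A four-player toy squad: six pairs in total.
squad = ["Spanish", "Spanish", "Portuguese", "German"]
print(team_diversity(squad))
```

Note that a squad with one outlier language gets a score pulled up by every pair that includes that player, so a single signing can move the team average noticeably.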

What they found: the more diverse teams have significantly better goal differential.

Could it simply be that the more diverse teams also have better players? The authors corrected for that by including "transfer value" in their regression. They used the log of the total estimated value a team could receive for its players, as estimated by transfermarkt.com .

What does that value represent? Well, as I understand it: in soccer, teams sell players to other teams all the time. The team gets the proceeds from the sale, and the player negotiates a contract with his new team. The contract value is significantly less than the sale price.

In 2013, Gareth Bale was transferred from Tottenham Hotspur to Real Madrid, for a sum of more than 78 million UK pounds, the highest sum ever. Bale's salary, though, is 15.6 million pounds per year. (Of course, the transfer price is a one-time payment, but Bale's salary is paid every year.)

Transfermarkt.com posts its own real-time estimates of transfer value for every player in most European leagues. I think they base the estimates on the player's current level of performance, so the values are a good proxy for how talented the team actually is.

The authors' main result says that after correcting for team value, if you increase diversity by 1 standard deviation, you gain 0.31 goals per game.

------

Now, 0.31 goals per game is a LOT. How big a lot? I'll translate that to MLS terms (even though the study was for the European Champions League), just because the teams are more comparable to each other in talent.

In MLS, teams play 34-game seasons, so an extra 0.31 goals per game works out to 10.5 overall. In both 2011 and 2012, *more than half* of MLS teams were within 10 goals of even.

That's how big 0.31 goals per game is. Can you really improve your standings position that much by favoring players who speak a foreign language?

I don't think so. I think there's a big problem with the study, one you may already have noticed.

The authors based their regression not on transfer value itself, but on the logarithm of team value. But, as I've written before, that doesn't make sense. What a log(value) model is assuming is that what matters -- what translates value into performance -- is not the dollar difference, but the *percentage difference*. It assumes that if you go from $10 million to $20 million, you get the same number of goals as if you go from $100 million to $200 million.

That can't be right, can it? It doesn't work that way for Mountain Dew. If you go from $1 to $2, you get one extra can. If you go from $10 to $20, do you still get only one extra can?

It's got to be almost as wrong for players. If you spend 20 million Euros on a striker, how many extra goals is he going to give you? According to the model, he'll give you more than twice as many goals if you're AC Milan than if you're Real Madrid. That's because 20 million Euros increases Milan's value by 9 percent, but it increases Madrid's value by only 3.5 percent.
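The "more than twice" figure can be checked with a few lines of Python. This is only a sketch: the two team values are back-calculated from the 9 percent and 3.5 percent figures above rather than taken from transfermarkt.com, and the regression coefficient is left out because it cancels in the ratio.

```python
import math

def log_model_gain(team_value, spend):
    """Change in log(team value) when `spend` is added to the squad.

    In a log(value) regression, predicted goals move in proportion to this.
    """
    return math.log(team_value + spend) - math.log(team_value)

spend = 20.0                   # million euros, from the example above
milan_value = spend / 0.09     # assumption: implied by the 9 percent figure
madrid_value = spend / 0.035   # assumption: implied by the 3.5 percent figure

# How many times more goals the model credits Milan with than Madrid
# for the very same striker.
ratio = log_model_gain(milan_value, spend) / log_model_gain(madrid_value, spend)
print(round(ratio, 2))
```

Under a linear model, the same calculation would give a ratio of exactly 1, since the striker adds the same 20 million to either team.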

That makes no sense. Well, I suppose it *could* make sense if higher-value teams are so good that there are diminishing returns. A team of eight Barry Bondses probably won't win that many more games if you add a ninth. But, real life European football teams aren't even close to that level.

Also: if that were true, we'd see almost every player sold to a weaker team. Suppose the last few players Barcelona signed gave them 0.25 extra goals per game. Barcelona's team value is 11 times Celtic FC's. So, why wouldn't Celtic have bought the players instead, and got 11 times the benefit -- almost *three goals per game* -- for the same money?

Because, of course, they wouldn't get three goals per game. It would give them the same 0.25 goals per game -- or, perhaps, a fraction more, or less.

Empirical evidence has shown that, in MLB, free agent salaries are close to linear in expected runs. There's no reason to think that European soccer is any different.

-----

Why does the incorrect use of log(value) lead to the overvaluing of diversity?

Well, if a relationship is linear, but you plot it as proportional to the log, you get a curved line, an exponential line. The graph of y=x is linear, of course, but here's a picture of what happens when you put log(x) on the x-axis instead. You get the equivalent of y=e^x:

It's obviously not linear, but if you don't realize that, and fit a straight line anyway, you get this:

What happens? The top and bottom teams get badly underestimated. And the middle teams get badly overestimated.
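That pattern can be reproduced with a toy calculation. The sketch below uses eight made-up team values, evenly spaced on the log scale, and sets performance exactly equal to value (so the true relationship is perfectly linear). It then fits a straight line against log(value) and prints actual minus predicted for each team.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

# Made-up team values, evenly spaced on the log scale (3.0, 3.5, ..., 6.5).
log_values = [3.0 + 0.5 * i for i in range(8)]
values = [math.exp(u) for u in log_values]
goals = values[:]  # performance exactly linear in value, arbitrary units

# Fit the (wrong) straight line against log(value).
a, b = fit_line(log_values, goals)
residuals = [g - (a + b * u) for u, g in zip(log_values, goals)]

for u, r in zip(log_values, residuals):
    # Positive means the model underestimates the team.
    print(f"log(value)={u:.1f}  actual minus predicted={r:+.1f}")
```

In this toy setup the residuals are positive at both ends and negative in between, which is the underestimate/overestimate pattern described above.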

Suppose that, now, you take the regression and throw in your variable for "linguistic diversity". What happens?

Well, diversity happens to be positively correlated with team value (r=+.23, the paper says). So, the regression "sees" that log(value) underestimates the top teams. It "sees" that the top teams also have high diversity. Then, it connects the two, and "notices" that high diversity is related to teams that appeared to do better than otherwise expected. So it "concludes" that diversity is a positive and significant factor.

Well, not quite; there's one more thing we have to explain. It's not just the top teams that are underestimated -- it's the bottom teams, too. So, if you add in diversity, there's no reason yet to assume it will help the fit. If it makes the right-side dots fit better because of high diversity, the left-side dots will fit worse because of low diversity, and it'll cancel out.

Well, here's a possible explanation (which I suspect is the right one).

In the study, the top teams are the same, successful ones over and over -- Chelsea, Madrid, Barcelona, Bayern Munich, and so forth. Presumably, those have a consistent (higher than normal) diversity score.

But the bottom teams are probably those who qualified for the Champions League maybe only one or two of the 10 years in the study. With so many of those weaker teams, their high and low diversity scores will tend to average out, more so than for the few good teams at the high end. So, there would be high diversity at the top, but average (or slightly-below average) diversity in the middle and bottom.

If that's true, then the diversity variable will fix the underestimates of the top teams, while not affecting the bottom teams much.
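Here is a small Python sketch of that mechanism, under loudly stated assumptions: eight made-up team values, performance exactly linear in value, and a "diversity" flag that is switched on only for the two richest teams and has no effect on performance at all. The coefficient on diversity is computed by first removing the value term from both performance and diversity (the Frisch-Waugh-Lovell shortcut for a two-variable regression). None of this is the paper's data or the authors' code.

```python
import math

def residualize(ys, xs):
    """Residuals of ys after a least-squares fit on xs (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

def diversity_coefficient(goals, value_term, diversity):
    """Coefficient on diversity in the regression goals ~ value_term + diversity."""
    ry = residualize(goals, value_term)
    rd = residualize(diversity, value_term)
    return sum(p * q for p, q in zip(ry, rd)) / sum(q * q for q in rd)

values = [math.exp(3.0 + 0.5 * i) for i in range(8)]  # made-up team values
goals = values[:]                     # truth: linear in value, zero diversity effect
diversity = [0, 0, 0, 0, 0, 0, 1, 1]  # only the two richest teams are "diverse"

# Misspecified model: control for log(value).
coef_log = diversity_coefficient(goals, [math.log(v) for v in values], diversity)
# Correctly specified model: control for value itself.
coef_linear = diversity_coefficient(goals, values, diversity)

print("diversity coefficient, log(value) model:", coef_log)
print("diversity coefficient, linear model:    ", coef_linear)
```

In this toy setup the log(value) model assigns diversity a positive coefficient even though diversity does nothing, while the linear model assigns it essentially zero. That shows the effect is possible, not that it is what happened in the paper.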

------

I can't prove this is happening, without having the full dataset, but I bet it is. In any event, it's not up to me to prove it. It's enough to show that (a) the logarithmic model is wrong, and (b) the logarithmic model is wrong in a way that will plausibly produce the exact kind of (spurious) result the authors found.

In other words: if you say the Yankees won last night, and it turns out your evidence is last week's newspaper ... well, that's enough to refute your claim. I don't actually have to prove that the score in your paper is actually different from last night's score.

-----

But, in the interests of double-checking my hypothesis, I gathered data for the 32 teams in the 2013-14 Champions League (which is outside the sample the authors used). I got team values from transfermarkt.com, like the authors did, although the values I used may have changed slightly by now. I used the group round only.

Here's the graph I got for a linear relationship:

Looks quite reasonable. Compare it to the graph I get for a log relationship:

It isn't obvious that the dots on the log graph actually form a curved pattern. But, once you fit the regression line, it shows up -- the top and bottom teams are underestimated, as theory suggests. (I'm sure the curving would be much more obvious if I used the full 10-season dataset instead of just the one year.)

In fairness to the authors, there's a reason it might have been less obvious to them. They didn't use all 32 teams each year, as I did. They included only teams from the five most successful countries (England, France, Italy, Spain, and Germany). All their teams were valued at 123MM Euros or higher (4.8 or higher on the log scale). The remaining 16 team values were generally lower -- nine were under 100MM Euros, and one was as low as 19MM (2.9 on the log scale).

Mathematical theory says that the smaller the range of log(X), the less curved the line will be. Which is why, if you compress the X-axis by eliminating every point to the left of 4.8, it's less obvious the dots form a curve.
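A quick way to see the range effect is to compare how well a straight line in log(value) tracks value itself over a wide range and over a truncated one. In the sketch below, the lower ends (2.9 and 4.8) come from the figures above; the upper end of 6.5 and the 16 evenly spaced teams are assumptions made only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def log_vs_linear_correlation(log_lo, log_hi, n=16):
    """Correlation between log(value) and value for n teams
    evenly spaced in log(value) between log_lo and log_hi."""
    logs = [log_lo + (log_hi - log_lo) * i / (n - 1) for i in range(n)]
    values = [math.exp(u) for u in logs]
    return pearson(logs, values)

full_range = log_vs_linear_correlation(2.9, 6.5)  # all teams (6.5 is assumed)
truncated = log_vs_linear_correlation(4.8, 6.5)   # only teams at 4.8 and above

print("full range:", round(full_range, 3))
print("truncated: ", round(truncated, 3))
```

The closer the correlation is to 1, the harder it is to tell the log scale from the linear scale by eye, which is the situation with the truncated sample.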

To illustrate, here's the full sample again, but this time on one graph. (The linear is linear, and the log is curved.) The lines are quite different:

But, for the smaller sample, not quite as much:

It turns out that in this truncated sample, the correlation is actually *higher* for the log scale than for the linear scale, .88 to .83. That doesn't make the model right; it's just that the random errors happen to offset the errors in the model, in this case. (Especially those two dots at the bottom.)

My guess is: if I plotted all 10 years, instead of just the one, the pattern of dots would look more like a curve. The highest-value teams would still fit the straight line much better than the curved line, and that would cause "diversity" -- which would apply disproportionately to the highest-value teams -- to make up the difference.

But, as I said, (a) I don't have the full dataset, and (b) it doesn't really matter. If your regression assumes that players are worth orders of magnitude more to poor teams than to rich teams, there's no obvious way to interpret what comes out the back end.

---

(Hat tip to GuyM for pointing me to the article.)

Labels: diversity, regression, soccer

2 Comments:

At Sunday, June 15, 2014 8:06:00 PM, Anonymous said...: I did not read the article but my first thought is that the better teams with more money and other resources have a greater ability to attract and retain players from all over the globe. They have more access to talent, wherever that talent comes from.

At Monday, June 16, 2014 4:56:00 AM, Anonymous said...: I am very dubious about the paper's hypothesis (as you present it). It seems to me that there will be hidden variables that could explain the relationship better.

The log relationship between spending and goals can be justified. It's a question of diminishing returns. It makes perfect sense that £10 million will have a bigger impact on a poor team than a rich team. At the top end of the wage/value scale there are some very high figures as there is a limited pool of talent and plenty of oligarchs and sheikhs willing to spend. We can only see what happens when UEFA fair financial rules kick in.

Also, Celtic don't buy the same players as Barca as they can't afford to. They have little choice there.

Friday, June 13, 2014
