Are soccer goals scored less valuable than goals prevented?
During this year's World Cup of Soccer, I found a sabermetric soccer book discounted at a Toronto bookstore. It's called "The Numbers Game," and subtitled "Why Everything You Know About Soccer Is Wrong."
Actually, I don't know that much about soccer, but much of the book fails to convince me -- for instance, when the authors argue that defense is more important than offense:
"To see if attacking leads to more wins, and whether defense leads to fewer wins and more draws, we conducted a set of rigorous, sophisticated regression analyses on our Premier League data."
As far as I can tell, the regressions tried to predict team wins based on team goals scored and conceded. The results:
0.230 wins -- value of a goal scored
0.216 wins -- value of a goal conceded
The authors write,
"That means goals created and goals prevented contribute about equally to manufacturing wins in English soccer."
But, when it came to losses:
0.176 losses -- value of a goal scored
0.235 losses -- value of a goal conceded
So,
"... defense provides a more powerful statistical explanation for why teams lose. ... when it came to avoiding defeat, the goals that clubs didn't concede were each 33 percent more valuable than the goals they scored."
------
The authors argue that
(a) goals scored and conceded contribute equally to wins;
(b) goals conceded contribute more to losses than goals scored.
Except ... aren't those results logically inconsistent with each other?
Suppose you look at the last 20 games where Chelsea faced Arsenal. From (b), you would deduce,
If Chelsea had scored one goal fewer, but also conceded one goal fewer, they'd probably have had fewer losses.
That's because, according to the authors' numbers, the lost goal would have cost Chelsea 0.176 losses, but the goal prevented would have saved them 0.235 losses. Net gain: 0.059 fewer losses.
But Chelsea's goals scored are Arsenal's goals conceded, and vice versa. Also, Chelsea's losses are Arsenal's wins, and vice versa. So, you can rephrase that deduction as,
If Arsenal had conceded one goal fewer, but also scored one goal fewer, they'd probably have had fewer wins.
Except ... the authors just argued that goals scored and conceded are *equal* in terms of wins.
Without realizing it, the book simultaneously makes two contradictory arguments!
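To make the arithmetic concrete, here is a small sketch that plugs the book's four coefficients into both versions of the scenario. The function names and the sign convention (a goal scored lowers losses and raises wins, a goal conceded does the reverse) are my assumptions; the book quotes only magnitudes.

```python
# Per-goal coefficients as quoted from the book's two regressions.
WINS_PER_GOAL_SCORED = 0.230
WINS_PER_GOAL_CONCEDED = 0.216
LOSSES_PER_GOAL_SCORED = 0.176
LOSSES_PER_GOAL_CONCEDED = 0.235


def change_in_losses(delta_scored, delta_conceded):
    """Predicted change in losses under the 'losses' regression.

    Assumed signs: scoring more lowers losses, conceding more raises them.
    """
    return (-LOSSES_PER_GOAL_SCORED * delta_scored
            + LOSSES_PER_GOAL_CONCEDED * delta_conceded)


def change_in_wins(delta_scored, delta_conceded):
    """Predicted change in wins under the 'wins' regression.

    Assumed signs: scoring more raises wins, conceding more lowers them.
    """
    return (WINS_PER_GOAL_SCORED * delta_scored
            - WINS_PER_GOAL_CONCEDED * delta_conceded)


# Chelsea score one fewer and concede one fewer.
print(round(change_in_losses(-1, -1), 3))  # -0.059

# The same goals seen from Arsenal's side: one fewer scored, one fewer conceded.
print(round(change_in_wins(-1, -1), 3))    # -0.014
```

The losses regression removes about 0.059 losses from Chelsea, which in head-to-head games means about 0.059 wins removed from Arsenal. The wins regression, applied to Arsenal directly, gives only about 0.014, which is the "about equal" result. The same hypothetical swap gets two different answers depending on which regression you ask.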
-----
So why did the coefficients for goals scored and goals allowed come out so different in the regression? I think it's just random chance.
If a team scores 20 goals and concedes 20 goals, you'd expect them to win as many games as they lose. But that might not happen if the goals aren't evenly distributed over games. For instance, the team might have lost 19 games by a score of 1-0, while winning a 20th game 20-1.
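Here's that extreme example tallied up in a few lines, just to confirm the totals balance while the record doesn't. The variable names are mine, and the 20-game "season" is the made-up one from the paragraph above, not real data.

```python
# Each result is (goals_for, goals_against) for one game.
# 19 losses by 1-0, then a single 20-1 win.
results = [(0, 1)] * 19 + [(20, 1)]

goals_for = sum(gf for gf, ga in results)
goals_against = sum(ga for gf, ga in results)
wins = sum(1 for gf, ga in results if gf > ga)
losses = sum(1 for gf, ga in results if gf < ga)

print(goals_for, goals_against)  # 20 20
print(wins, losses)              # 1 19
```

Season totals of 20 scored and 20 conceded say "average team," but the record is 1-19. A regression that sees only the totals has no way to tell this team from one that drew every game 1-1.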
In other words, team wins and losses vary randomly from their goal differential expectation. If the teams that underperformed happened to be teams that scored more than they conceded, and the teams that overperformed happened to be teams that conceded more than they scored ... in that case, the regression notices that overperformance is correlated with defense, and adjusts accordingly. And you wind up with the result the authors got.
(Another source of error is that performance isn't linear in terms of goals; it's pythagorean. But that's probably a minor issue compared to simple randomness.)
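For what "pythagorean" means here, a rough sketch: the expected share of wins goes as goals scored raised to some power, divided by the sum of that and goals conceded raised to the same power. The exponent of 2 below is the classic baseball convention and purely an assumption for soccer, and the formula ignores draws entirely, so treat the output as an illustration of the non-linearity and nothing more.

```python
def pythagorean_share(goals_for, goals_against, exponent=2.0):
    """Pythagorean expectation: GF^k / (GF^k + GA^k).

    exponent=2.0 is an assumed value borrowed from baseball;
    draws are not modelled.
    """
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)


# Value of one extra goal for an even team vs. an already dominant one.
gain_even = pythagorean_share(51, 50) - pythagorean_share(50, 50)
gain_dominant = pythagorean_share(81, 50) - pythagorean_share(80, 50)

print(gain_even > gain_dominant)  # True
```

Under this model the same extra goal is worth more to an even team than to one that's already far ahead, which a straight-line regression can't capture.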
I'd bet that, for the "wins" regression, there was no pattern for which teams randomly outperformed their win projections. But for the "losses" regression, there *was* that kind of pattern, where the teams with better defense did lose fewer games than projected.
I'd bet that if you grouped the games differently, and reran the regression, you'd get a different result. Instead of your regression rows being team-based, like "Chelsea's 38 games from 2007-08," make them time-based, like "the first four weeks of the 2007-08 schedule." That will scramble up the projection anomalies a different way, and I'd bet that the four coefficient estimates wind up much closer to each other.
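If you wanted to try the regrouping experiment, the fitting step might look something like the following. This is a minimal sketch and not the authors' method: the function name `fit_two_predictor_ols` and the row layout are hypothetical, and the demo rows are generated from a known formula purely to check the solver, not taken from any real season. You'd feed it one row per team-season for the original grouping, or one row per block of weeks for the time-based grouping, and compare the coefficients.

```python
def fit_two_predictor_ols(rows):
    """Least-squares fit of y = a + b*goals_for + c*goals_against.

    rows: iterable of (goals_for, goals_against, y) tuples, where y is
    the count being predicted (wins or losses) for that row.
    Returns (a, b, c).
    """
    # Accumulate the normal equations (X^T X) beta = X^T y, X = [1, gf, ga].
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for gf, ga, y in rows:
        x = (1.0, float(gf), float(ga))
        for i in range(3):
            xty[i] += x[i] * y
            for j in range(3):
                xtx[i][j] += x[i] * x[j]

    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; fit is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


# Synthetic check only: y is built from a known line, so the fit
# should hand back the same intercept and slopes.
demo_rows = [
    (gf, ga, 14 + 0.23 * gf - 0.22 * ga)
    for gf, ga in [(60, 30), (45, 50), (30, 60), (70, 40), (50, 35)]
]
a, b, c = fit_two_predictor_ols(demo_rows)
```

With real data you'd run this four times (wins and losses, under each grouping) and see whether the scored and conceded coefficients move closer together.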