World Cup 2018 predictions with Big Data: who is going to win what and when?

The Champions League final was not just an exciting game, it was also a potential game changer. Liverpool striker Mo Salah, Player of the Year, recipient of the Golden Boot and Egypt’s best-known player, suddenly had his ticket to the World Cup cast into doubt after an aggressive challenge from Sergio Ramos left him injured.

The implications of this tackle could have been huge. If Salah had been unable to play for his country (it turns out that he has done the almost impossible and is fit), this could have affected Egypt’s chances in the World Cup. And while Egypt is unlikely to be troubling the later rounds of the competition, like a butterfly flapping its wings on the other side of the world, Egypt’s performance could in turn affect how the other teams in its group do - and ultimately which country gets to lift the trophy.

That said, Salah’s potential fall is only one of literally millions of data points, any of which could ultimately affect the outcome of the competition.

How can we get a handle on what to expect, then? Is there any way to predict how teams will perform? Could Big Data, which has already transformed countless other industries, also unlock a deeper understanding of the beautiful game? Could it predict who will win the World Cup?

Data points

Opta Sports and STATS are two companies that try to answer questions like this. As sports data companies, their mission is to collect data and make sense of it for their clients, which include sports teams and federations, as well as media outlets that are hungry for data insights (ahem).

“It's very easy to think that more data is good, but until you know how you're going to use it and what you can learn from it, sometimes it can be data for data's sake”
Paul Power, STATS.com

What do they actually collect, then? Opta’s marketing manager Peter Deeley explained that for each football match, his company collects around 2,000 individual data points, mostly focused on “on-ball” actions. A team of three analysts - one for each side, plus someone to double-check tricky moments - sits in the company’s data hub in Leeds and records essentially everything that happens on the pitch: every pass, cross and shot, as well as the position on the field where each action took place.
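To make the idea of an “on-ball” data point a little more concrete, here is a minimal sketch of how a single logged event might be represented and tallied. The field names, the 0-100 pitch coordinates and the sample events are all assumptions made up for illustration, not Opta’s actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchEvent:
    """One hypothetical on-ball event; every field name here is illustrative."""
    minute: int
    team: str
    player: str
    action: str  # e.g. "pass", "cross", "shot"
    x: float     # position along the pitch, 0-100 (assumed convention)
    y: float     # position across the pitch, 0-100 (assumed convention)


def count_actions(events, team):
    """Tally how many events of each action type one team produced."""
    return Counter(e.action for e in events if e.team == team)


# Made-up sample events, purely to exercise the code.
events = [
    MatchEvent(3, "Team A", "Player 10", "pass", 62.0, 30.5),
    MatchEvent(3, "Team A", "Player 10", "shot", 88.0, 45.0),
    MatchEvent(7, "Team B", "Player 9", "cross", 80.0, 10.0),
    MatchEvent(9, "Team A", "Player 8", "pass", 50.0, 50.0),
]

team_a_totals = count_actions(events, "Team A")
print(team_a_totals)
```

Summing events like this is, roughly speaking, how a half-time figure such as “shots” could be derived from a stream of logged actions.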

The data is delivered to clients live, which is why, for example, UK pundit (and former England player) Gary Lineker is able to tell viewers about stats like possession and shots on goal at half time.

STATS does the same sort of thing - and Paul Power, a data scientist at the company, was keen to tell me that it isn’t just humans who collect the data, but new computer vision technologies too.

When it comes to accurately recording the position of each player on the pitch, his company uses cameras placed around the edge to work it out, which saves players from having to wear tracking beacons under their shirts, as happens in sports such as Rugby Union.

But why stick to humans? Couldn’t computer vision be used to log all of this sort of data? “People are still best because of nuances that computers are not going to be able to understand,” argues Paul.

He gives the example of a player who is cornered and kicks the ball away out of desperation, only for it to land luckily at the feet of a teammate. A machine cannot work out the context of what is going on, or read the look of panic on the player’s face, so it would log a long pass, whereas technically the event is something else: a clearance. Without a human to make these calls, the logged data could be less accurate.

The Opta approach

We know both companies have a lot of data - but who do they think will actually win the World Cup? Interestingly, STATS and Opta diverge when it comes to modelling this summer’s tournament.

In Opta’s case, Peter explained to me that their World Cup model doesn’t take into account the myriad data on individual players. Instead, Opta has chosen to look only at the performance of the national squads at a team level. For example, it assesses Egypt’s chances based on how the Egyptian team has performed in the past, without taking Mo Salah’s injury situation into account.

“The data scientists for the World Cup looked at the historical performance of different countries, what difference does it make if you are playing as the host nation, what difference does it make that you're playing in your home continent [and] what difference does it make if you have won the last few World Cups,” Peter explains.

The data scientists were then able to tweak the model by running it hundreds of thousands of times to make iterative improvements, adjusting the relative weight of each factor in the algorithm.
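Opta did not describe its algorithm in detail, so the following is only a rough sketch of what “adjusting the relative weight of each factor” over many runs could look like. It assumes a simple logistic score over three team-level factors and uses random-search hill climbing as a stand-in tuning method; the toy data, function names and parameters are all invented for illustration.

```python
import math
import random

# Toy rows: (history_score, is_host, home_continent, did_well).
# These numbers are made up and deliberately noisy.
TOY_DATA = [
    (0.9, 0, 1, 1),
    (0.8, 1, 1, 1),
    (0.7, 0, 0, 1),
    (0.3, 0, 1, 0),
    (0.2, 0, 0, 0),
    (0.4, 1, 1, 1),
    (0.1, 0, 0, 0),
    (0.8, 0, 0, 0),
]


def predict(weights, features):
    """Logistic score from a bias plus a weighted sum of team-level factors."""
    z = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, data):
    """Average negative log-likelihood; lower means a better fit."""
    total = 0.0
    for *features, outcome in data:
        p = predict(weights, features)
        total -= math.log(p) if outcome else math.log(1.0 - p)
    return total / len(data)


def tune(data, iterations=5000, step=0.1, seed=0):
    """Nudge the weights at random and keep a nudge only if the fit improves."""
    rng = random.Random(seed)
    weights = [0.0, 0.0, 0.0, 0.0]
    best = log_loss(weights, data)
    for _ in range(iterations):
        candidate = [w + rng.uniform(-step, step) for w in weights]
        loss = log_loss(candidate, data)
        if loss < best:
            weights, best = candidate, loss
    return weights, best


weights, best_loss = tune(TOY_DATA)
print("tuned weights:", [round(w, 2) for w in weights])
print("log loss:", round(best_loss, 3))
```

Each pass through the loop is one “run” of the model: the weights only move when the change makes the model’s past predictions fit better.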

This decision to leave out player-level data is a surprise, as you would assume the more data the better, but Peter believes that the model can still deliver good predictions.

“A World Cup is only done every four years, so you will often find that a decent quality player, playing for a country that often plays in World Cups, will only play in two World Cup tournaments - you won't have that much data on that player's impact on the wider team, within the international set up,” he says.

And he believes that this team level data is enough: “Italy won in 2006 - they weren't favourites and the quality of their squad though good, they weren't a team that had a Cristiano Ronaldo level superstar.”

He goes on to explain: “It is really interesting, with World Cups it is true that those teams that historically do well keep doing well. Germany, in the last three World Cups have at least got to the semi finals.

“Even though you can argue their team this time around is not as good as last time, they still have that track record of being current world champions, of being a team that generally performs well - and it is in their home continent. That would mean they have a good chance generally, not regardless of their squad, but they have a history of performing well at tournaments.”

“It's very easy to think that more data is good, but until you know how you're going to use it and what you can learn from it, sometimes it can be data for data's sake”, he says.

The STATS model

STATS has modelled the World Cup rather differently. Unlike its rival, it is taking individual player data into account for what it calls “What If?” Analytics.

According to Paul, this means that STATS can effectively use individual player data to work out not just how a team will perform, but also quantify the impact of swapping players in and out of the squad. In Mo Salah’s case, STATS claims its system would be able to work out the impact on Egypt of whether he is fit enough to play or not.

“You can plug in these different situations and that would be able to generate an outcome and that measure would either be number of goals scored or conceded, or simply win probability: how does that player increase or decrease chances?” Paul explains.

“We can look at this, run the simulations and this will actually tell us: Mo Salah might be worth 0.3 of a goal, or if he isn't playing and another player comes in, that reduces the win probability by 3% or 10% or it might actually increase it depending on the team that they're actually playing against.”
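STATS did not share how its simulations work, but the kind of “what if” comparison Paul describes can be sketched with a very simple model. The example below assumes each side’s goals follow an independent Poisson distribution, a common simplification rather than STATS’s method, and the baseline of 1.2 expected goals per side is invented. Only the 0.3-goal figure comes from the quote above.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def win_probability(goals_for, goals_against, max_goals=10):
    """Chance of outscoring the opponent, treating both scores as independent."""
    total = 0.0
    for scored in range(max_goals + 1):
        for conceded in range(scored):
            total += poisson_pmf(scored, goals_for) * poisson_pmf(conceded, goals_against)
    return total


BASELINE = 1.2    # hypothetical expected goals for each side
STAR_BOOST = 0.3  # "worth 0.3 of a goal", as in the quote

with_star = win_probability(BASELINE + STAR_BOOST, BASELINE)
without_star = win_probability(BASELINE, BASELINE)

print(f"win probability with the star player:    {with_star:.1%}")
print(f"win probability without the star player: {without_star:.1%}")
```

Changing the opponent’s expected goals changes the size of the gap, which is the point Paul makes about the result depending on who the team is playing.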

Why does STATS believe the individual approach works better than looking at teams?

“Everybody knows if you’re missing your star players it's going to impact on performance - you don't need a complex neural network to tell you that,” says Paul. “If you're missing that in your dataset, that's really going to skew your probabilities and your predictions”.

“We know that by adding in these additional features of the players that we get better impacts because what we're able to do better is model the direct relationships between individuals, and while it's a team sport, we know that certain individuals have a bigger influence on the outcome than certain others.

“If you’re missing a full-back for example, that’s potentially going to be less of an issue than missing a central midfielder, so you have to account for that, and as a result of that we're really confident in the model that we've generated.”

Tell me who is going to win, dammit

Now we come to the all-important question: which country do the two models predict is going to win? In both cases, as proper stats nerds, they have delivered probabilistic forecasts which contain rather more nuance than your mate Dave, who swears blind that Germany are going to win again because he’s got a good feeling about them.

I asked STATS for its predictions, and sadly, despite the company being willing to tell me about all of the data it has access to, and how it would actually make a prediction, I was told that they won’t be publishing their predictions this year. Why? Out of fear of being wrong? No, the answer is much more straightforward: this is valuable information, and they only want to spill the beans to paying clients.

We do, however, have a prediction from Opta. It rates perennial World Cup winners Brazil (just don’t mention 2014) as the most likely champions once again - giving them a 14.2% chance of winning. This means that if you ran the World Cup with the exact same teams 20 times over, you’d only expect Brazil to win around three times. Like your mate Dave, Opta also fancies Germany - giving them an 11.4% chance of once again taking home the trophy.
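The “around three times” figure is simple arithmetic on Opta’s 14.2%, and it can be checked in a few lines. The only assumption added here is that each imagined replay of the tournament is independent of the others.

```python
# Figures quoted in the article.
p_win = 0.142  # Opta's chance of Brazil winning
replays = 20   # hypothetical number of times the tournament is re-run

# Expected number of Brazil wins across the replays.
expected_wins = p_win * replays

# Chance that Brazil wins none of the replays, assuming independence.
p_no_wins = (1 - p_win) ** replays

print(f"expected wins: {expected_wins:.2f}")
print(f"chance of zero wins in {replays} replays: {p_no_wins:.1%}")
```

The expected value works out at 2.84, which rounds to the three wins mentioned above.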

Another company that likes to make predictions, and has a scarily good record, is EA Sports. For the last three World Cups, it has rightly predicted the eventual World Cup winner.

Using the detailed data it has on the players and team rankings in FIFA 18 and its World Cup add-on, it ran a simulation of the tournament and France were the eventual winners, defeating Germany in the final. Given it predicted Germany and Spain for the 2014 and 2010 World Cups respectively, this could be a good shout.

Then there’s Blue Yonder, a company famed for using AI to predict the ebb and flow of stock in some of the world’s biggest supermarkets, which recently turned its hand to predicting the World Cup. Left-field, yes, but its technology has analysed every international football match played since 1872 and run over 1 million simulations of the World Cup. It believes that Brazil are the favourites to win in Russia, with a 22.5% chance of winning.
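Blue Yonder has not published its method here, but the general idea of simulating a tournament many times and counting the winners can be sketched briefly. The example below assumes a four-team knockout, hypothetical strength ratings and a rule that a team’s chance of winning a match is its share of the two ratings; none of these come from Blue Yonder.

```python
import random
from collections import Counter

# Hypothetical strength ratings, invented for illustration.
RATINGS = {"Team A": 2.0, "Team B": 1.5, "Team C": 1.0, "Team D": 0.5}


def play(team_1, team_2, rng):
    """Pick a winner with probability proportional to rating (no draws)."""
    p_team_1 = RATINGS[team_1] / (RATINGS[team_1] + RATINGS[team_2])
    return team_1 if rng.random() < p_team_1 else team_2


def simulate_knockout(teams, rng):
    """Pair up neighbours round by round until one team remains."""
    while len(teams) > 1:
        teams = [play(teams[i], teams[i + 1], rng) for i in range(0, len(teams), 2)]
    return teams[0]


def title_chances(runs=20000, seed=1):
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)
    wins = Counter(simulate_knockout(list(RATINGS), rng) for _ in range(runs))
    return {team: wins[team] / runs for team in RATINGS}


chances = title_chances()
for team, share in sorted(chances.items(), key=lambda item: -item[1]):
    print(f"{team}: {share:.1%}")
```

A percentage such as 22.5% is, in this style of model, just the fraction of simulated tournaments in which that team ends up lifting the trophy.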

And what about England? The bad news for Gareth Southgate is that Opta gives his squad a lowly 1.9% chance, while Blue Yonder ups this a little to 5.7%.

If Opta and Blue Yonder are right, it’s highly likely we can look forward to losing yet another penalty shoot-out. Sigh.