Argentina Lost, But Don't Panic — Someone Still Thinks They Can Win It All! | Yunqi Capital Science Chat

Yunqi Capital · November 25, 2022

In the end, it turned out AI wasn't even as accurate as an octopus.

The Qatar World Cup group stage is in full swing, and alongside the thrilling matches, fans have been equally captivated by the endless pre-game predictions. Whether it's algorithmic models that "let the data speak" or animal oracles with a dash of mysticism, each seems to find its own audience.

How do these prediction models actually work? What elements help sharpen their accuracy? In this edition of "Yunqi Kepu" ("Yunqi Chats Science"), we share some fascinating explorations into machine learning & AI predictions. Enjoy~

➤➤➤ Argentina defeated Brazil 1-0 in the 2022 Qatar World Cup final, with Messi scoring the only goal to secure Argentina's first World Cup title since 1986.

Throughout the tournament, Messi scored 8 goals in 7 matches, claiming both the Golden Boot (top scorer) and the Golden Ball (best player). Brazil and France finished as runners-up and third place, respectively.

FIFA 23 prediction results | Image from official website

Wait, what? Didn't the World Cup just start? And didn't Argentina just lose?

The result above was simulated by the football video game FIFA 23. Yet plenty of fans bought into it, because the game series had correctly predicted the winner of each of the previous three World Cups. No wonder its publisher, EA, boasted that everyone could just skip watching the tournament: the "spoiler" was already out there.

You'll notice that every time a major tournament like the World Cup rolls around, predictions multiply like rabbits: AI, large models, high-tech wizardry (and low-tech too — remember Paul the Octopus?)...

What makes these "prophets" so "confident"?

What factors determine prediction outcomes?

In recent years, sports competitions including football have largely been predicted through traditional statistics and machine learning methods. Prediction agencies collect historical match data and structure the factors that can influence a game. Combining bookmakers' spreads and odds, they build models using machine learning algorithms to generate results.
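
As a rough illustration of how bookmaker odds can feed a model, the sketch below turns decimal odds into probabilities. The odds are made-up numbers, and the simple proportional rescaling is only one of several ways to remove the bookmaker's margin; the agencies mentioned here may do it differently.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for all outcomes of one event into probabilities.

    The raw inverses sum to more than 1 because of the bookmaker's margin
    (the "overround"), so they are rescaled to sum to exactly 1.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical odds for: team A wins, draw, team B wins (illustrative only).
odds = [1.80, 3.60, 4.50]
probs = implied_probabilities(odds)
overround = sum(1.0 / o for o in odds) - 1.0

print("implied probabilities:", [round(p, 3) for p in probs])
print("bookmaker margin:", round(overround, 3))
```

The shorter the odds, the larger the implied probability, so the favorite ends up with the biggest share after rescaling.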

One commonly used algorithm is the "random forest," which is also widely used in marketing and health insurance calculations. Simply put, the system builds a "forest" of many "trees," each trained on a random subset of the samples. When a new input arrives, each tree makes its own prediction, and a "democratic voting mechanism" (a majority vote, or an average for numeric predictions) produces the final result.
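
To make the "forest of trees" idea concrete, here is a deliberately tiny sketch written from scratch. Each "tree" is a one-split stump trained on a bootstrap sample and a randomly chosen feature, and the forest averages the stumps. The data, feature names, and function names are all invented for illustration; real random forests grow much deeper trees and are normally built with a library.

```python
import random


def fit_stump(rows, ys, feature):
    """Fit a one-split regression tree on one feature by minimising squared error."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, ys) if r[feature] <= t]
        right = [y for r, y in zip(rows, ys) if r[feature] > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # no valid split in this sample: always predict the mean
        mean = sum(ys) / len(ys)
        return (feature, rows[0][feature], mean, mean)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, ys, n_trees=100, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample matches with replacement
        feature = rng.randrange(n_features)         # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the individual tree predictions (the 'vote')."""
    preds = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return sum(preds) / len(preds)


# Made-up training data: (rating gap, market value gap) -> goal difference.
rows = [(-300, -400), (-200, -250), (-100, -150), (-50, -20), (0, 10),
        (50, 30), (100, 120), (200, 260), (300, 380)]
goal_diff = [-3, -2, -1, -1, 0, 0, 1, 2, 2]

forest = fit_forest(rows, goal_diff)
print("underdog:", round(predict(forest, (-250, -300)), 2))
print("favorite:", round(predict(forest, (250, 300)), 2))
```

No single stump is a good predictor, but averaging many of them, each trained on slightly different data, gives a smoother and more stable estimate.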

Another popular approach is the "Poisson distribution," which models how many times a discrete event occurs within a fixed interval of time. Real-life applications abound: the number of clicks on an e-commerce site during a given period, the number of particles emitted by radioactive decay per second, the number of factory robot malfunctions, and so on. Applied to football, it can estimate each team's attacking and defensive strength from historical data, and from that the probability of each team scoring a given number of goals.
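
A minimal sketch of the Poisson idea applied to one match follows. The expected-goals figures are hypothetical, and the sketch assumes the two teams score independently of each other, which is a common simplification rather than anything the models described here are confirmed to use.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected on average."""
    return lam ** k * exp(-lam) / factorial(k)


def match_probabilities(lam_a, lam_b, max_goals=10):
    """Win/draw/loss probabilities for team A, summing over all scorelines."""
    win = draw = loss = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, lam_a) * poisson_pmf(goals_b, lam_b)
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Hypothetical expected goals: team A 1.8 per match, team B 0.9 per match.
win, draw, loss = match_probabilities(1.8, 0.9)
print(f"A wins {win:.1%}, draw {draw:.1%}, B wins {loss:.1%}")
```

Scorelines above ten goals per side are ignored, which leaves out only a negligible sliver of probability at these scoring rates.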

Machine learning is a process of discovering and learning underlying patterns from existing data. | Image from Giphy

But complex machine learning models often employ multiple algorithms depending on the data features incorporated.

One international research team favored Brazil, Argentina's archrival, in this World Cup. First, they built a statistical model of team strength, based on a Poisson distribution, from international match data over the past eight years to estimate each team's current ability. This was not a simple average of past results: more recent matches carried greater weight. The estimate of "future strength" also incorporated odds from 28 international bookmakers. They then combined these with further data dimensions (team market value, FIFA ranking, squad structure, and country characteristics such as population and GDP per capita) to build a random forest model.
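
The point about recent matches carrying greater weight can be sketched with a recency-weighted average. The exponential decay and the two-year half-life below are assumptions chosen for illustration, not the research team's actual weighting scheme, and the goal tallies are invented.

```python
def recency_weighted_mean(values, ages_in_years, half_life=2.0):
    """Average values with weights that halve every `half_life` years of age."""
    weights = [0.5 ** (age / half_life) for age in ages_in_years]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical goals scored by one team in five matches, oldest first.
goals = [0, 1, 1, 3, 3]
ages = [7.5, 6.0, 4.0, 1.0, 0.2]  # years before the tournament

plain = sum(goals) / len(goals)
weighted = recency_weighted_mean(goals, ages)
print("simple average:", round(plain, 2))
print("recency-weighted average:", round(weighted, 2))
```

Because this made-up team has improved lately, the weighted figure sits well above the simple average, which is exactly the effect the weighting is meant to capture.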

The team's final result: Brazil with a 15% chance of winning, followed by Argentina, the Netherlands, Germany, and France.
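
A title probability like this typically comes from simulating the tournament many times and counting how often each team wins. The sketch below does that for a hypothetical four-team knockout bracket. The team names, strengths, and the simple strength-ratio rule for each tie are illustrative assumptions, not the research team's model.

```python
import random


def win_prob(strength_a, strength_b):
    """Bradley-Terry style chance that A beats B (no draws in a knockout tie)."""
    return strength_a / (strength_a + strength_b)


def simulate_knockout(strengths, n_sims=20000, seed=0):
    """Estimate title probabilities for a single-elimination bracket.

    `strengths` maps team name -> positive strength. The bracket follows the
    dict order, and the number of teams must be a power of two.
    """
    rng = random.Random(seed)
    titles = dict.fromkeys(strengths, 0)
    for _ in range(n_sims):
        alive = list(strengths)
        while len(alive) > 1:
            next_round = []
            for a, b in zip(alive[::2], alive[1::2]):
                p = win_prob(strengths[a], strengths[b])
                next_round.append(a if rng.random() < p else b)
            alive = next_round
        titles[alive[0]] += 1
    return {team: count / n_sims for team, count in titles.items()}


# Hypothetical strengths for four semi-finalists.
title_probs = simulate_knockout({"Team A": 4.0, "Team B": 2.0, "Team C": 1.0, "Team D": 1.0})
for team, p in title_probs.items():
    print(f"{team}: {p:.1%}")
```

Even a clear favorite wins only about half of these simulated tournaments, which helps explain why the top pick in a 32-team field can sit at around 15%.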

The choice of data dimensions matters enormously. The volume and variety of data can produce vastly different predictions. FIFA rankings are straightforward enough. But why do many models include socioeconomic factors?

Joachim Klement, an analyst at UK investment bank Liberum Capital who successfully predicted the 2014 and 2018 World Cup winners, used "GDP per capita" as an example: a country can't be too poor — developing football talent requires infrastructure and pitches; but if a country is too wealthy, kids have too many sports options beyond football.

The "population" factor only matters where football dominates the culture, such as in Latin America. Croatia, the 2018 World Cup runner-up, has just 4 million people — a small European nation — but invests heavily in its youth academy system.

Socioeconomic factors also influence football match outcomes | Image from Giphy

Weather (in the host country) is another important factor. Conditions that are too hot or too cold hurt a team's chances (just look at host Qatar). The ideal temperature is 14°C, roughly the annual average in southern Europe and much of South America. By this measure, aside from England (1966) and Germany (1954, 1974, 1990, 2014), every World Cup winner in history fits the pattern.

The hardest factor to measure is "home advantage." It could come from familiarity with the venue, support from domestic fans, or even favorable refereeing. Qatar is the only host ever to have lost its opening match, which suggests that while home advantage defies clear explanation, its impact is real.

Machine learning is a process of discovering and learning underlying patterns from existing data. A match outcome does indeed depend heavily on historical performance.

But every prediction model carries the same disclaimer: "No guarantees~"

Science or superstition — which is more accurate?

Football matches have far too many unexpected variables that determine outcomes.

Because of Qatar's scorching summer heat, this World Cup had to be moved to winter, completely disrupting domestic league schedules worldwide and making it harder for players to adapt. "National teams have less preparation time, compressing players' recovery window before the World Cup, and combined with Qatar's climate conditions, this increases injury risk," said the research institution that had favored Brazil.

Most prediction agencies shared similar concerns. With less time to prepare and gel, teams that rely on coordinated play and balanced squads — like Spain and Germany — saw their advantage diminish. For individual stars like Cristiano Ronaldo and Messi, the impact was relatively smaller. But on the flip side, given their age, physical fatigue became a significant variable that could swing match results.

Messi | Image from Giphy

Sports data provider Opta favored Brazil, giving them a 15.8% chance of the title, ahead of Argentina (12.6%) and France (12.2%). Yet as recently as June, it had firmly backed France as the favorite. The reason for the flip was that France's morale and team cohesion had gone into a cyclical decline, a judgment clearly based on recent observations. With predictions, then, the closer to the event they are made, the more accurate they tend to be.

Even after matches begin, predictions keep shifting. Data journalism site FiveThirtyEight uses its "SPI" (Soccer Power Index) to make advance predictions for every match. It also factors in what happens during the game, continuously recalculating the possible scorelines for the time remaining. If you follow European leagues, you have probably seen real-time win probability graphics on broadcasts.

They offered an example: Brazil vs. Croatia in 2014. Before kickoff, based on historical SPI, the model gave Brazil an 86% win probability. Eleven minutes in, a Brazilian defender deflected a wayward Croatian shot into his own net, and Brazil trailed 0-1.

Immediately, the model adjusted its prediction, calculating that Brazil could still come back with a 58% win probability. Based on past observations, they had concluded: excellent teams that fall behind briefly often get motivated to win by larger margins. The better the team, the greater the "drama."

So they adjusted their real-time prediction again, giving Brazil a 66% chance of winning. Final score: 3-1. Spot on.
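
The general idea of updating a win probability during a match can be sketched as follows. This is not FiveThirtyEight's model and does not reproduce its figures: the scoring rates are hypothetical, goals are assumed to follow independent Poisson distributions, and the tendency of strong teams to score more when trailing is left out entirely.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return lam ** k * exp(-lam) / factorial(k)


def live_win_probability(score_a, score_b, minute, lam_a, lam_b, max_goals=10):
    """Chance that team A wins in 90 minutes, given the score at `minute`.

    lam_a and lam_b are expected goals over a full match; the expectation
    for the rest of the game is scaled by the share of time remaining.
    """
    remaining = max(0.0, (90 - minute) / 90)
    win = 0.0
    for more_a in range(max_goals + 1):
        for more_b in range(max_goals + 1):
            if score_a + more_a > score_b + more_b:
                win += (poisson_pmf(more_a, lam_a * remaining)
                        * poisson_pmf(more_b, lam_b * remaining))
    return win


# Hypothetical full-match scoring rates for a strong favorite and an underdog.
LAM_FAVORITE, LAM_UNDERDOG = 2.4, 0.5

before = live_win_probability(0, 0, 0, LAM_FAVORITE, LAM_UNDERDOG)
after = live_win_probability(0, 1, 11, LAM_FAVORITE, LAM_UNDERDOG)
print(f"at kickoff: {before:.1%}")
print(f"trailing 0-1 after 11 minutes: {after:.1%}")
```

An early goal against the favorite lowers its chances without ending them, since most of the match is still left to play. The same deficit late in the game would be far more damaging.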

These models incorporating "real-time calculation" prove somewhat more accurate than pure "AI pattern-finding." But can football matches truly be "predicted"?

AI relies on big data, building machine learning models to produce what seems the most likely outcome — giving people a reason to "buy in" with data-driven authority.

"Even with the most sophisticated statistical techniques, predictions remain highly uncertain, because football is a very difficult game to predict." That's what Goldman Sachs wrote in its 2018 World Cup prediction report. In other words, after all the analysts' number-crunching and odds-calculating, the result might still be less reliable than Paul the Octopus.

Half science, half superstition | Image from Giphy

Paul predicted winners by choosing between two glass tanks, each decorated with a competing nation's flag, and retrieving the mussel placed inside one of them. During the 2010 South Africa World Cup, Paul went 8-for-8, including predicting Spain's final victory over the Netherlands. By contrast, the infamous "jinxed" football legend Pelé got it wrong again and again.
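
How lucky would a pure guesser need to be to match that record? The short calculation below treats each pick as an independent coin flip between the two tanks, which is an assumption made purely for illustration.

```python
from math import comb


def chance_of_at_least(hits, trials, p=0.5):
    """Probability of getting at least `hits` right out of `trials` guesses."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(hits, trials + 1))


print(chance_of_at_least(8, 8))  # 0.00390625, i.e. about 1 in 256
```

Unlikely, but with plenty of animal oracles making picks at every tournament, one of them going on a perfect run is less miraculous than it looks.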

Paul the Octopus | Image from Oriental IC

Science or superstition? There's no reasoning with it.

A local Qatari falconer used his bird to predict the World Cup opener between Qatar and Ecuador. He attached each country's flag to separate drones, baited with food, then released the falcon to see which it chose. The falcon soared — brushing past Qatar's flag before ultimately selecting Ecuador's.

In sports with such high randomness, there's never a "certainly accurate" prediction method. When results defy authority, even majority opinion, we can only fume: "That's not scientific!"

And that, too, is part of the joy of sports.

Author: Youzi

Editor: Shen Zhihan

This article is from Guokr (ID: Guokr42). Unauthorized reproduction prohibited. For reprint inquiries, contact sns@guokr.com