During my analytics internship in the Private Banking department at DBS, I took part in a competition open to all DBS staff globally. With the World Cup under way, the challenge was to use data-driven methods to predict the top 3 countries of the tournament, along with the probable scores for the final and the 3rd place playoff.

Drawing on the data analytics I had learned at school and through personal interest, I tackled this project in Python with the relevant statistics libraries. Although my prediction (Germany, Spain, France) turned out to be rather inaccurate because of several unexpected results earlier in the cup, I thoroughly enjoyed building my models and conducting the analyses.

For reference, here is the link to my GitHub repository for this project:

https://github.com/Angps1995/WorldCup2018Prediction

Selection

To conduct the analysis, I used the dataset provided by DBS and supplemented it with some data from Kaggle (https://www.kaggle.com).

The data primarily covered the following:

  • Soccer Matches Results from 1993
  • Country Soccer Rankings from 2011
  • FIFA 18 Players Statistics
  • WC2018 Squad Players Details
  • WC2018 Group and Fixtures

Cleaning

Most of these datasets contained an excess of information, much of it irrelevant, so the data had to be cleaned before further analysis. The major steps taken to clean the data were as follows:

  • Kept only soccer matches from 2011 onwards (since rankings data are only available from 2011)
  • Kept only the countries and players involved in WC2018
  • Removed records with missing data
  • Encoded categorical variables like Stadium used (“Home/Away/Neutral Side”) and result (“Win/Draw/Lose”)
  • Created a new dataset to record the results of WC matches up to 23/06 (as the submission date for the contest was 25/06)
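
As a rough illustration of the filtering and encoding steps above (not the notebook's actual code), here is a minimal plain-Python sketch. The field names, the placeholder teams, the rule that both teams must be WC2018 participants, and the result codes are all assumptions made for the example; only the venue codes (0 – neutral, 1 – away, 2 – home) follow the encoding described in this post.

```python
from datetime import date

# Venue codes as described in the post; result codes are an assumption.
VENUE_CODES = {"Neutral": 0, "Away": 1, "Home": 2}
RESULT_CODES = {"Lose": 0, "Draw": 1, "Win": 2}

# Hypothetical placeholder data, not real match results.
WC_TEAMS = {"Team A", "Team B", "Team C"}
MATCHES = [
    {"date": date(2010, 6, 1), "home": "Team A", "away": "Team B",
     "venue": "Home", "home_goals": 1, "away_goals": 1},
    {"date": date(2014, 7, 1), "home": "Team A", "away": "Team B",
     "venue": "Neutral", "home_goals": 1, "away_goals": 0},
    {"date": date(2016, 3, 1), "home": "Team A", "away": "Team Z",
     "venue": "Home", "home_goals": 4, "away_goals": 1},
    {"date": date(2017, 6, 1), "home": "Team B", "away": "Team C",
     "venue": "Away", "home_goals": None, "away_goals": 2},
    {"date": date(2018, 3, 1), "home": "Team C", "away": "Team A",
     "venue": "Home", "home_goals": 0, "away_goals": 2},
]


def clean(matches, wc_teams, start_year=2011):
    """Filter by year and team, drop incomplete rows, encode categories."""
    cleaned = []
    for match in matches:
        if match["date"].year < start_year:
            continue  # rankings data only start in 2011
        if match["home"] not in wc_teams or match["away"] not in wc_teams:
            continue  # assumption: both sides must be WC2018 teams
        if any(value is None for value in match.values()):
            continue  # remove records with missing data
        diff = match["home_goals"] - match["away_goals"]
        result = "Win" if diff > 0 else "Draw" if diff == 0 else "Lose"
        cleaned.append({
            **match,
            "venue_code": VENUE_CODES[match["venue"]],
            "result_code": RESULT_CODES[result],  # from the home side's view
        })
    return cleaned


cleaned_matches = clean(MATCHES, WC_TEAMS)
```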

The Python notebook used for the data cleaning can be found at:

https://github.com/Angps1995/WorldCup2018Prediction/blob/master/datacleaning.ipynb

Analysis

Initially, I tested several classification models (Logistic Regression, Decision Tree, Random Forest, etc.) to predict the number of goals the Home and Away sides would score in a match.

The 6 variables used in the classification models to predict the results of a match were:

  • Where the match is played (0 – neutral, 1 – away, 2 – home)
  • Whether the match is an important match or a friendly match (0 – Friendly, 1 – Important)
  • How much the Home team’s rank changes compared to the past period
  • How much the Away team’s rank changes compared to the past period
  • Difference between both teams’ ranking
  • Difference between both teams’ mean weighted ratings over the past 3 years

Logistic Regression was chosen out of the different classification models because it had the highest F1 score, which is the harmonic mean of precision and recall.
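
To make the metric concrete, here is a small sketch of how an F1 score can be computed for a multi-class problem such as predicting goal counts. It uses macro averaging (an unweighted mean of the per-class F1 scores), which is an assumption; the notebook may have used a different averaging scheme. The labels in the usage lines are made up.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lb) for lb in labels) / len(labels)


# Hypothetical goal-count labels, for illustration only.
actual_goals = [0, 1, 2, 2, 1, 0]
predicted_goals = [0, 1, 2, 1, 1, 0]
score = macro_f1(actual_goals, predicted_goals)
```

Computing this score for each candidate model on the same held-out matches and keeping the highest is one way to carry out the comparison described above.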

Next, I selected the following variables for each country:

  • Soccer Power Index
  • Average Age
  • Average Height
  • Total World Cup Appearances
  • Average goals scored per game
  • Average goals conceded per game
  • Potential

Each country's variables were compared against its opponent's to build a Poisson distribution model predicting the number of goals scored.
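
For readers unfamiliar with the Poisson distribution: if a team is expected to score λ goals on average, the probability of it scoring exactly k goals is e^(-λ) · λ^k / k!. The sketch below shows only this last step, turning an expected-goals figure into a probability for each goal count. The expected-goals value used is a made-up number, and how the notebook derives it from the variables above is not shown here.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def goal_distribution(lam, max_goals=6):
    """Probabilities for 0..max_goals-1 goals, plus a final bucket for
    'max_goals or more' so that the list sums to 1."""
    probs = [poisson_pmf(k, lam) for k in range(max_goals)]
    probs.append(1.0 - sum(probs))
    return probs


# Hypothetical expected goals for one team in one match.
expected_goals = 1.5
distribution = goal_distribution(expected_goals)
most_likely = max(range(len(distribution)), key=distribution.__getitem__)
```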

I then combined the results of the logistic regression and the Poisson distribution model to predict the exact score of a match. The Poisson model was given a weight of 0.8 and the logistic regression model a weight of 0.2. (I chose these weights arbitrarily rather than by tuning, as I felt that the Poisson model was built on more recent and possibly more relevant data.)
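
One plausible way to apply such weights, sketched here as an assumption rather than as the notebook's exact method, is to blend the two models' probabilities for each possible goal count and pick the most likely count for each side. The probability lists below are invented for illustration.

```python
POISSON_WEIGHT = 0.8   # weights taken from the post
LOGISTIC_WEIGHT = 0.2


def blend(poisson_probs, logistic_probs):
    """Weighted average of two probability lists over the same goal counts."""
    if len(poisson_probs) != len(logistic_probs):
        raise ValueError("both models must cover the same goal counts")
    return [
        POISSON_WEIGHT * p + LOGISTIC_WEIGHT * q
        for p, q in zip(poisson_probs, logistic_probs)
    ]


def most_likely_goals(probs):
    """Index (goal count) with the highest blended probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical probabilities for 0, 1, 2 and 3 goals from each model.
poisson_probs = [0.2, 0.5, 0.2, 0.1]
logistic_probs = [0.6, 0.2, 0.1, 0.1]
blended = blend(poisson_probs, logistic_probs)
predicted_goals = most_likely_goals(blended)
```

Because the weights sum to 1, the blended values remain a valid probability distribution. Repeating this for the Home and Away sides gives a predicted scoreline.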

The full explanation of the analysis and its corresponding code can be found in section 7 of:

https://github.com/Angps1995/WorldCup2018Prediction/blob/master/WorldCup18_Prediction_Modelling.ipynb

Reflection

I learned that other participants in the competition used methods such as random forests and simulations with Poisson distributions. The winner of the Best Analytical Approach award used Elo ratings to develop a Poisson regression model and then ran a Monte Carlo simulation 10,000 times to predict the results.
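
To give a feel for the simulation idea, here is a minimal sketch of a Monte Carlo run for a single match. It is not the winner's code: the expected-goals figures are hypothetical stand-ins for what a Poisson regression on Elo ratings would produce, and the Poisson sampler is Knuth's simple multiplication method.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson-distributed goal count (Knuth's method)."""
    threshold = math.exp(-lam)
    goals = 0
    product = rng.random()
    while product > threshold:
        goals += 1
        product *= rng.random()
    return goals


def simulate_match(lam_a, lam_b, n_sims=10_000, seed=0):
    """Estimate win/draw/loss probabilities by simulating the match n_sims times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = {"a_win": 0, "draw": 0, "b_win": 0}
    for _ in range(n_sims):
        goals_a = sample_poisson(lam_a, rng)
        goals_b = sample_poisson(lam_b, rng)
        if goals_a > goals_b:
            counts["a_win"] += 1
        elif goals_a == goals_b:
            counts["draw"] += 1
        else:
            counts["b_win"] += 1
    return {outcome: n / n_sims for outcome, n in counts.items()}


# Hypothetical expected goals for a stronger side (A) and a weaker side (B).
outcome_probs = simulate_match(lam_a=2.0, lam_b=0.8)
```

Running the same loop over every fixture in the bracket, and advancing the simulated winners, would extend this from one match to a whole tournament.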

There are many different ways to go about predicting the results, and it is fascinating to see how others approached the problem. But the most important takeaway for me was the process of learning and trying fun and creative new methods for my prediction, while excitedly watching the World Cup.
