In this work, we predict the FIFA World Cup matches and attempt to propose a method to generate ideal grouping of teams for a balanced tournament using past match results and rankings which can be used to faciliate a tournament end-to-end. In summary, we use the data to generate and extract relevant features, and then use multiple supervised techniques to predict winner of a match. We implement multiple algorithms and do a thorough comparison of each of them. Apart from real data, we also explore creating fictitious matches and use semi-supervised learning in an attempt to improve the models. Alongside match predictions, we also use unsupervised clustering techniques to create groups that can facilitate a good tournament. Through these two processes, we create an end-to-end tool that can take in participating teams, build groups, predict results of matches and ultimately, predict a complete tournament.
Please find the full report which contains detailed description and results here - Report
Team Members - Ananya Sharma, Snigdha Verma, Apoorva Sinha, Yu-Chen Lin, Neelabh Sinha
We analyze the performance of the various classification schemes on our dataset using all standard metrics of evaluating a supervised learning algorithm. We primarily want to optimize the accuracy of the prediction, but since we have a balanced dataset and no bias towards avoiding either true negatives or false positives, we treat them fairly. Our final results are as shown below:
Technique | Accuracy | Precision | Recall | F-1 score | ROC-AUC |
---|---|---|---|---|---|
Logistic Regression | 73.49% | 73.59% | 73.49% | 73.49% | 0.81 |
SVM | 73.38% | 73.51% | 73.38% | 73.37% | 0.81 |
Decision Tree | 70.29% | 70.46% | 70.29% | 70.26% | 0.77 |
kNN | 71.41% | 71.51% | 71.41% | 71.40% | 0.80 |
Gaussian Naive Bayes | 72.54% | 72.54% | 72.54% | 72.53% | 0.80 |
Random Forest | 71.69% | 71.81% | 71.69% | 71.68% | 0.80 |
Gradient Boosting | 71.52% | 72.06% | 71.52% | 71.41% | 0.80 |
Adaptive Boost (base model - logistic regression) | 73.16% | 73.30% | 73.16% | 73.14% | 0.81 |
Please review the report linked above for a detailed result and other experiments conducted including using synthetic data and self-supervised learning.
-
Clone the repository:
git clone https://github.com/neelabhsinha/fifa-world-cup-prediction-ml.git cd fifa-world-cup-prediction-ml
-
Install the required dependencies:
pip install -r requirements.txt
To run the script, you need to use Python's argparse
to specify the task and various options. Below are the tasks and arguments available.
train
: Train a machine learning model.preprocess
: Preprocess the data and generate features.simulate_tournament
: Simulate a FIFA World Cup tournament.generate_artificial_data
: Generate artificial match data.
--task
: Specifies the task to perform. Choices aretrain
,preprocess
,simulate_tournament
, andgenerate_artificial_data
.--model
: Specifies the model to use for training and prediction. Choices are defined in themodels
list.--tune
: If provided, tunes the hyperparameters for the model.--do_pca
: If provided, performs PCA on the data (optional).--select_features
: If provided, generates features from the data automatically.--simulate_tournament
: If provided, simulates a tournament end-to-end (task is not required in this case).--use_artificial_data
: If provided, uses artificial data for training. If used, generate the artificial match data
To train a model, use the following command:
python script_name.py --task train --model model_name --do_pca --select_features --use_artificial_data
To preprocess data and generate features,
python script_name.py --task preprocess
To generate artificial match data for self-supervised learning, do the following -
python script_name.py --task generate_artificial_data