# predictEPL: Predicting Premier League Outcomes with Random Forest
This project builds a machine learning model to predict whether a Premier League team will win or not, using a Random Forest Classifier and hyperparameter tuning with cross-validation.
The project uses the EPL Matches dataset containing match-by-match team statistics across seasons, and demonstrates the full workflow: from data preprocessing and feature engineering (rolling averages) to model evaluation with confusion matrices and classification metrics.
## Project Overview
Football outcomes depend on a variety of factors: opponent strength, home advantage, and recent form. Automating match outcome prediction helps quantify these patterns and test hypotheses about team performance.
My project leverages a Random Forest Classifier to predict match results (win = 1, not win = 0) based on both static and time-dependent features.
## Technologies Used
- Python 3.10+
- NumPy
- Pandas
- Matplotlib
- scikit-learn
## Feature Engineering
To improve prediction accuracy, the model incorporates rolling averages: smoothed versions of recent match statistics (e.g., goals, shots, distance covered).
These features dampen single-match randomness, allowing the model to pick up trends such as team momentum or fatigue effects.
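A minimal sketch of how such rolling features can be built with pandas. The column names (`team`, `date`, and the stat columns passed in `cols`) are assumptions about the dataset's schema, not the project's actual code:

```python
import pandas as pd

def add_rolling_averages(df, cols, window=3):
    """Append per-team rolling means of recent-match statistics."""
    df = df.sort_values("date").copy()
    for c in cols:
        # shift(1) ensures only *previous* matches feed each feature,
        # so the current match's own stats never leak into its inputs
        df[f"{c}_rolling"] = (
            df.groupby("team")[c]
            .transform(lambda s: s.shift(1).rolling(window).mean())
        )
    # Early-season matches lack a full window of history; drop them
    return df.dropna(subset=[f"{c}_rolling" for c in cols])
```

Dropping the incomplete-window rows keeps the training set free of NaN features at the cost of a few matches per team per season.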
## Model Training
- Model: `RandomForestClassifier(n_estimators=50, min_samples_split=10)`
- Train/test split: based on match date (before vs. after 2022-01-01)
- Hyperparameter tuning: performed using `RandomizedSearchCV` with 7-fold cross-validation over:
  - `n_estimators` ∈ [50, 500)
  - `min_samples_split` ∈ [2, 20) (scikit-learn requires a value of at least 2)
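The training setup above can be sketched as follows. This is an illustrative reconstruction, not the repository's code: the column names (`date`, `target`) and the `predictors` list are placeholders for the dataset's actual fields.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def train_model(matches, predictors, n_iter=20):
    """Date-based split, then randomized hyperparameter search."""
    # Temporal split: train on matches before 2022, test on the rest,
    # so the model never sees the future during training
    train = matches[matches["date"] < "2022-01-01"]
    test = matches[matches["date"] >= "2022-01-01"]

    rf = RandomForestClassifier(n_estimators=50, min_samples_split=10,
                                random_state=1)
    search = RandomizedSearchCV(
        rf,
        param_distributions={
            "n_estimators": randint(50, 500),
            # scikit-learn requires min_samples_split >= 2
            "min_samples_split": randint(2, 20),
        },
        n_iter=n_iter,
        cv=7,  # 7-fold cross-validation on the training set
        random_state=1,
    )
    search.fit(train[predictors], train["target"])
    return search, test
```

A date-based split matters more here than a random one: shuffling matches across time would let the model train on results that postdate the matches it is tested on.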
## Interpretation
The tuned Random Forest achieved 65% accuracy, a solid result given the complex and unpredictable nature of football matches.
## Results
- Best hyperparameters: `n_estimators = 343`, `min_samples_split = 13`
- Highest accuracy: 0.65
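The evaluation step the project describes (confusion matrix plus classification metrics) can be sketched as below; the helper works with any fitted scikit-learn classifier and is an assumption about how the metrics were produced, not the project's exact code:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

def evaluate(model, X_test, y_test):
    """Report accuracy, confusion matrix, and per-class metrics."""
    preds = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, preds),
        # Rows: actual class (0 = not win, 1 = win); columns: predicted
        "confusion_matrix": confusion_matrix(y_test, preds),
        "report": classification_report(y_test, preds),
    }
```

The confusion matrix is worth inspecting alongside accuracy: with "not win" the majority class, per-class precision and recall reveal whether the 0.65 comes from genuinely identifying wins or from defaulting to the common outcome.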
Conclusion: the Random Forest Classifier performs robustly on EPL match data, handling nonlinear relationships and mixed categorical features while limiting overfitting through ensemble averaging.
## Future Improvements
- Incorporate expected goals (xG) or player-level features for deeper predictive context
- Test advanced ensemble models (e.g., XGBoost, LightGBM)
- Add feature-importance visualization to interpret drivers of victory
- Extend the prediction horizon
## Author
Benjamin Yang, University of North Carolina at Chapel Hill
- Email: yangbenjamin19@gmail.com
- GitHub: BenYang12