A machine learning web application to predict student math scores based on demographic and academic performance data. Built with Flask and scikit-learn and deployed on AWS EC2, this project demonstrates a complete end-to-end ML workflow, from data exploration to deployment.
- Project Overview
- Features
- Technology Stack
- Project Architecture & Workflow
- Getting Started
- Running the Application
- Jupyter Notebooks
- Development Workflow
- Model Performance
- Configuration
- Troubleshooting
- API Endpoints
- Contributing
- License
- Future Enhancements
This project aims to understand and predict student performance in mathematics. By analyzing features such as gender, ethnicity, parental education level, and test preparation, we can build a model that provides an accurate estimate of a student's math score.
The application serves as a practical example of building and deploying a production-ready machine learning system, complete with a web interface for real-time predictions.
- Predictive Modeling: Utilizes regression models to predict student math scores.
- Comprehensive EDA: Detailed exploratory data analysis to uncover insights and relationships in the data.
- Multi-Model Evaluation: Trains and evaluates several models (Random Forest, Decision Tree, Gradient Boosting, Linear Regression, CatBoost, AdaBoost, and K-Neighbors) to select the best performer.
- Hyperparameter Tuning: Employs `GridSearchCV` to find the optimal parameters for each model.
- Modular Pipeline: A structured, reusable pipeline for data ingestion, transformation, and model training.
- Model Persistence: Saves trained models and preprocessors as pickle files for production use.
- Web Interface: A user-friendly web form built with Flask to input student data and receive instant predictions.
- Robust Engineering: Features custom logging, exception handling, and a modular project structure for maintainability.
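
As a rough illustration of that last point, the logging and exception pattern can be sketched as below; the names (`CustomException`, the log file location) are assumptions and may not match the actual `src/logger.py` and `src/exception.py`.

```python
# Illustrative sketch only -- the real src/logger.py and src/exception.py may differ.
import logging
import os
import sys

os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    filename="logs/app.log",  # assumed log file name
    format="[%(asctime)s] %(levelname)s %(module)s - %(message)s",
    level=logging.INFO,
)

class CustomException(Exception):
    """Enriches an exception message with the file and line where it occurred."""
    def __init__(self, error: Exception):
        _, _, tb = sys.exc_info()
        if tb is not None:
            message = f"{error} (in {tb.tb_frame.f_code.co_filename}, line {tb.tb_lineno})"
        else:
            message = str(error)
        super().__init__(message)

# Usage inside a pipeline component:
# try:
#     run_step()
# except Exception as e:
#     logging.error(str(e))
#     raise CustomException(e) from e
```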
- Backend: Flask
- ML & Data Science: Scikit-learn, CatBoost, Pandas, NumPy
- Data Visualization: Matplotlib, Seaborn
- Development Environment: Jupyter Notebook, uv (or venv/pip)
- Deployment: AWS EC2 with Elastic Beanstalk
The project is organized into a modular structure that separates concerns and makes the system easy to maintain and scale.
```
├── artifacts/                       # Stores output files like models and preprocessors
│   ├── model.pkl                    # Trained model object
│   └── preprocessor.pkl             # Preprocessing pipeline object
├── notebooks/                       # Jupyter notebooks for EDA and initial modeling
├── src/                             # Source code for the application
│   ├── components/                  # Core ML pipeline components
│   │   ├── data_ingestion.py        # Data loading and splitting
│   │   ├── data_transformation.py   # Feature engineering and preprocessing
│   │   └── model_trainer.py         # Model training and evaluation
│   ├── pipeline/                    # Manages training and prediction workflows
│   │   ├── prediction_pipeline.py
│   │   └── training_pipeline.py
│   ├── exception.py                 # Custom exception handling
│   ├── logger.py                    # Logging configuration
│   └── utils.py                     # Utility functions
├── application.py                   # Main Flask application entry point
├── requirements.txt                 # Project dependencies
└── README.md                        # This file
```
- Data Ingestion (`data_ingestion.py`):
  - Reads the raw data from `notebooks/data/stud.csv`.
  - Splits the data into training and testing sets.
  - Saves the raw, train, and test CSVs into the `artifacts/` directory.
  - Triggers the data transformation and model training steps.
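
  In outline, this step looks roughly like the sketch below. It is illustrative, not the exact contents of `data_ingestion.py`; the function name `ingest_data` and the split ratio are assumptions.

  ```python
  # Illustrative sketch of the ingestion step; the real data_ingestion.py may differ.
  import os
  import pandas as pd
  from sklearn.model_selection import train_test_split

  def ingest_data(source_csv: str = "notebooks/data/stud.csv",
                  artifacts_dir: str = "artifacts") -> tuple[str, str]:
      os.makedirs(artifacts_dir, exist_ok=True)

      # Read the raw dataset and keep an untouched copy in artifacts/
      df = pd.read_csv(source_csv)
      df.to_csv(os.path.join(artifacts_dir, "data.csv"), index=False)

      # Split into train/test sets and persist both for the next stage
      train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
      train_path = os.path.join(artifacts_dir, "train.csv")
      test_path = os.path.join(artifacts_dir, "test.csv")
      train_df.to_csv(train_path, index=False)
      test_df.to_csv(test_path, index=False)
      return train_path, test_path
  ```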
- Data Transformation (`data_transformation.py`):
  - Creates a preprocessing pipeline using `ColumnTransformer`.
  - Applies `StandardScaler` to numerical features and `OneHotEncoder` to categorical features.
  - Saves the fitted preprocessor object as `preprocessor.pkl` for later use.
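
  A minimal sketch of such a preprocessor is shown below. The feature lists are assumptions based on the student dataset; the exact column names live in `data_transformation.py`.

  ```python
  # Illustrative sketch of the preprocessing object; column lists are assumed.
  from sklearn.compose import ColumnTransformer
  from sklearn.impute import SimpleImputer
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  numerical_features = ["reading_score", "writing_score"]  # assumed
  categorical_features = [                                 # assumed
      "gender", "race_ethnicity", "parental_level_of_education",
      "lunch", "test_preparation_course",
  ]

  numeric_pipeline = Pipeline([
      ("imputer", SimpleImputer(strategy="median")),
      ("scaler", StandardScaler()),
  ])
  categorical_pipeline = Pipeline([
      ("imputer", SimpleImputer(strategy="most_frequent")),
      ("one_hot", OneHotEncoder(handle_unknown="ignore")),
  ])

  # The fitted ColumnTransformer is what gets pickled as artifacts/preprocessor.pkl
  preprocessor = ColumnTransformer([
      ("num", numeric_pipeline, numerical_features),
      ("cat", categorical_pipeline, categorical_features),
  ])
  ```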
- Model Training (`model_trainer.py`):
  - Receives the transformed data.
  - Runs a suite of regression models through `GridSearchCV` to find the best model and hyperparameters.
  - Selects the model with the highest R² score (minimum threshold of 0.6).
  - Saves the best-performing model as `model.pkl`.
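
  The selection loop can be sketched as follows; the candidate list and parameter grids here are placeholders, and the real `model_trainer.py` evaluates the full set of models listed under Features.

  ```python
  # Illustrative sketch of model selection; candidates and grids are placeholders.
  from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
  from sklearn.linear_model import LinearRegression
  from sklearn.metrics import r2_score
  from sklearn.model_selection import GridSearchCV

  candidates = {
      "random_forest": (RandomForestRegressor(), {"n_estimators": [64, 128, 256]}),
      "gradient_boosting": (GradientBoostingRegressor(), {"learning_rate": [0.05, 0.1]}),
      "linear_regression": (LinearRegression(), {}),
  }

  def train_best_model(X_train, y_train, X_test, y_test, threshold: float = 0.6):
      results = {}
      for name, (model, grid) in candidates.items():
          # Cross-validated grid search over each candidate's hyperparameters
          search = GridSearchCV(model, grid, cv=3, scoring="r2")
          search.fit(X_train, y_train)
          test_r2 = r2_score(y_test, search.best_estimator_.predict(X_test))
          results[name] = (search.best_estimator_, test_r2)

      # Keep the model with the highest test-set R², subject to the 0.6 threshold
      best_name, (best_model, best_r2) = max(results.items(), key=lambda kv: kv[1][1])
      if best_r2 < threshold:
          raise ValueError(f"Best R² {best_r2:.3f} is below the minimum threshold {threshold}")
      return best_name, best_model, best_r2  # best_model is then pickled as model.pkl
  ```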
- Prediction (`prediction_pipeline.py` & `application.py`):
  - The Flask app captures user input from the web form.
  - The `PredictPipeline` loads the saved `preprocessor.pkl` and `model.pkl`.
  - It transforms the new input data and feeds it to the model to generate a prediction, which is then displayed to the user.
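
  Conceptually, the prediction side reduces to loading the two artifacts and applying them in order. The sketch below assumes a pickle-based loader and a one-row `DataFrame` of form inputs; the actual `PredictPipeline` may wrap this differently.

  ```python
  # Illustrative sketch of the prediction flow; the real PredictPipeline may differ.
  import pickle
  import pandas as pd

  class PredictPipeline:
      def __init__(self,
                   model_path: str = "artifacts/model.pkl",
                   preprocessor_path: str = "artifacts/preprocessor.pkl"):
          with open(model_path, "rb") as f:
              self.model = pickle.load(f)
          with open(preprocessor_path, "rb") as f:
              self.preprocessor = pickle.load(f)

      def predict(self, features: pd.DataFrame):
          # Apply the fitted preprocessor, then the trained regressor
          transformed = self.preprocessor.transform(features)
          return self.model.predict(transformed)

  # In the Flask route, the form fields are collected into a one-row DataFrame:
  # features = pd.DataFrame([{"gender": "female", "reading_score": 72, ...}])
  # predicted_math_score = PredictPipeline().predict(features)[0]
  ```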
First, clone the repository and navigate to the project directory:
```bash
git clone https://github.com/GoJo-Rika/Student-Performance-Prediction-System.git
cd Student-Performance-Prediction-System
```
We recommend using `uv`, a fast, next-generation Python package manager, for setup.
- Install `uv` on your system if you haven't already:

  ```bash
  # On macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows
  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
- Create a virtual environment and install dependencies with a single command:

  ```bash
  uv sync
  ```

  This command automatically creates a `.venv` folder in your project directory and installs all listed packages from `requirements.txt`.

  Note: For a comprehensive guide on `uv`, check out this detailed tutorial: uv-tutorial-guide.
If you prefer to use the standard `venv` and `pip`:
- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use: venv\Scripts\activate
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt

  # Using uv:
  uv add -r requirements.txt
  ```
Follow these steps to run the project locally.
Before you can run the web application for the first time, you need to train the model. This will generate the necessary `model.pkl` and `preprocessor.pkl` files in the `artifacts/` directory.
```bash
python src/components/data_ingestion.py

# Using uv:
uv run src/components/data_ingestion.py
```
This single command will execute the entire training workflow: data ingestion, transformation, and model training.
Once the training is complete and the artifacts are saved, start the Flask web server:
```bash
python application.py

# Using uv:
uv run application.py
```
Open your web browser and navigate to: http://127.0.0.1:5000
You can now use the form to input student data and get a math score prediction.
The `notebooks/` directory contains two key notebooks that document the project's development:

1. `EDA STUDENT PERFORMANCE.ipynb`: This notebook contains a detailed Exploratory Data Analysis (EDA) of the student dataset, including visualizations and key insights that informed the feature engineering and model selection process.
2. `MODEL TRAINING.ipynb`: This notebook shows the initial model training and evaluation experiments. It serves as a scratchpad for testing different models and preprocessing steps before they were refactored into the main `src` pipeline.
This project follows a modular, pipeline-based architecture:
- Experimentation: Initial development in Jupyter notebooks
- Modularization: Successful experiments converted to reusable components
- Pipeline Integration: Components connected in training and prediction pipelines
- Error Handling: Custom exceptions and logging for debugging
- Testing: Iterative testing and refinement
- Deployment: AWS EC2 deployment with Elastic Beanstalk configuration
The system evaluates multiple algorithms and selects the best performer:
- Minimum R² score threshold: 0.6
- Grid search hyperparameter optimization
- Cross-validation for robust evaluation
- EC2 Instance: Configured via `.ebextensions/python.config`
- WSGI: Flask application served through `application:application`
- Environment: Production-ready with proper logging
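
For reference, an Elastic Beanstalk Python configuration of this kind is typically only a few lines; the snippet below is a sketch of the general shape, not the exact contents of the repository's `.ebextensions/python.config`.

```yaml
# Sketch of an .ebextensions/python.config; the repository's file may differ.
option_settings:
  "aws:elasticbeanstalk:container:python":
    WSGIPath: application:application
```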
```
artifacts/
├── model.pkl           # Trained model
├── preprocessor.pkl    # Feature transformation pipeline
├── train.csv           # Training data
├── test.csv            # Test data
└── data.csv            # Raw data

logs/
└── [timestamp].log     # Application logs
```
Common Issues:
- Import errors: Ensure all dependencies are installed
- Data not found: Check that `notebooks/data/stud.csv` exists
- Model not found: Run the training pipeline first
- Prediction errors: Check the input data format
Debugging:
- Check logs in the `logs/` directory
- Custom exceptions provide detailed error context
- Use logging output for pipeline debugging
- `GET /`: Home page
- `GET /predictdata`: Prediction form
- `POST /predictdata`: Submit a prediction request
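
As an example, the prediction endpoint can also be exercised programmatically; the field names below are assumptions based on the dataset columns and may not match the actual form exactly.

```python
# Hypothetical example of posting form data to the running app; field names are assumed.
import requests

payload = {
    "gender": "female",
    "ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree",
    "lunch": "standard",
    "test_preparation_course": "none",
    "reading_score": "72",
    "writing_score": "74",
}

# POST the data the same way the HTML form would
response = requests.post("http://127.0.0.1:5000/predictdata", data=payload)
print(response.status_code)  # the predicted score is rendered in the returned HTML page
```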
Contributions are welcome! If you have suggestions or want to improve the project, please follow these steps:
- Fork the repository.
- Create a new feature branch (`git checkout -b feature/your-feature-name`).
- Make your changes and commit them (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature/your-feature-name`).
- Open a Pull Request.
This project is licensed under the MIT License. See the `LICENSE` file for more details.
- REST API for programmatic access
- Model training & retraining pipeline
- Performance monitoring dashboard
- Additional ML algorithms
- A/B testing framework