Student Performance Prediction


A machine learning web application that predicts student math scores from demographic and academic performance data. Built with Flask and scikit-learn and deployed on AWS EC2, this project demonstrates a complete end-to-end ML workflow, from data exploration to deployment.

📋 Table of Contents

  1. Project Overview
  2. Features
  3. Technology Stack
  4. Project Architecture & Workflow
  5. Getting Started
  6. Running the Application
  7. Jupyter Notebooks
  8. Development Workflow
  9. Model Performance
  10. Configuration
  11. Troubleshooting
  12. API Endpoints
  13. Contributing
  14. License
  15. Future Enhancements

📖 Project Overview

This project aims to understand and predict student performance in mathematics. By analyzing features such as gender, ethnicity, parental education level, and test preparation, we can build a model that provides an accurate estimate of a student's math score.

The application serves as a practical example of building and deploying a production-ready machine learning system, complete with a web interface for real-time predictions.

✨ Features

  • Predictive Modeling: Utilizes regression models to predict student math scores.
  • Comprehensive EDA: Detailed exploratory data analysis to uncover insights and relationships in the data.
  • Multi-Model Evaluation: Trains and evaluates several models (Random Forest, Decision Tree, Gradient Boosting, Linear Regression, CatBoost, AdaBoost, and K-Neighbors) to select the best performer.
  • Hyperparameter Tuning: Employs GridSearchCV to find the optimal parameters for each model.
  • Modular Pipeline: A structured, reusable pipeline for data ingestion, transformation, and model training.
  • Model Persistence: Saves the trained model and preprocessor as pickle files for production use.
  • Web Interface: A user-friendly web form built with Flask to input student data and receive instant predictions.
  • Robust Engineering: Features custom logging, exception handling, and a modular project structure for maintainability.

🛠️ Technology Stack

  • Backend: Flask
  • ML & Data Science: Scikit-learn, CatBoost, Pandas, NumPy
  • Data Visualization: Matplotlib, Seaborn
  • Development Environment: Jupyter Notebook, uv (or venv/pip)
  • Deployment: AWS EC2 with Elastic Beanstalk

πŸ—οΈ Project Architecture & Workflow

The project is organized into a modular structure that separates concerns and makes the system easy to maintain and scale.

Directory Structure

├── artifacts/                          # Stores output files like models and preprocessors
│   ├── model.pkl                       # Trained model object
│   └── preprocessor.pkl                # Preprocessing pipeline object
├── notebooks/                          # Jupyter notebooks for EDA and initial modeling
├── src/                                # Source code for the application
│   ├── components/                     # Core ML pipeline components
│   │   ├── data_ingestion.py           # Data loading and splitting
│   │   ├── data_transformation.py      # Feature engineering and preprocessing
│   │   └── model_trainer.py            # Model training and evaluation
│   ├── pipeline/                       # Manages training and prediction workflows
│   │   ├── prediction_pipeline.py
│   │   └── training_pipeline.py
│   ├── exception.py                    # Custom exception handling
│   ├── logger.py                       # Logging configuration
│   └── utils.py                        # Utility functions
├── application.py                      # Main Flask application entry point
├── requirements.txt                    # Project dependencies
└── README.md                           # This file

ML Pipeline Workflow

  1. Data Ingestion (data_ingestion.py):

    • Reads the raw data from notebooks/data/stud.csv.
    • Splits the data into training and testing sets.
    • Saves the raw, train, and test CSVs into the artifacts/ directory.
    • Triggers the data transformation and model training steps.
  2. Data Transformation (data_transformation.py):

    • Creates a preprocessing pipeline using ColumnTransformer.
    • Applies StandardScaler to numerical features and OneHotEncoder to categorical features.
    • Saves the fitted preprocessor object as preprocessor.pkl for later use (see the preprocessing sketch after this list).
  3. Model Training (model_trainer.py):

    • Receives the transformed data.
    • Runs a suite of regression models through GridSearchCV to find the best model and hyperparameters.
    • Selects the model with the highest R² score (minimum threshold of 0.6).
    • Saves the best-performing model as model.pkl.
  4. Prediction (prediction_pipeline.py & application.py):

    • The Flask app captures user input from the web form.
    • The PredictPipeline loads the saved preprocessor.pkl and model.pkl.
    • It transforms the new input data and feeds it to the model to generate a prediction, which is then displayed to the user.
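
The exact implementation lives in src/components/data_transformation.py; the sketch below only approximates the preprocessing step described above. The column names (reading_score, gender, and so on) and the target name math_score are assumptions based on the dataset description, not code copied from the repository.

# Minimal sketch of the preprocessing step; column names are assumptions based
# on the dataset described in the EDA notebook, not code copied from the repo.
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_columns = ["reading_score", "writing_score"]  # assumed numeric features
categorical_columns = [                                 # assumed categorical features
    "gender",
    "race_ethnicity",
    "parental_level_of_education",
    "lunch",
    "test_preparation_course",
]

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric features to zero mean and unit variance.
        ("num", StandardScaler(), numerical_columns),
        # One-hot encode categoricals; ignore categories unseen at fit time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ]
)

train_df = pd.read_csv("artifacts/train.csv")
X_train = train_df.drop(columns=["math_score"])  # assumed target column name
X_train_transformed = preprocessor.fit_transform(X_train)

# Persist the fitted preprocessor for reuse by the prediction pipeline.
with open("artifacts/preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f)

At prediction time, the saved object is loaded again and only its transform method is applied to the new input before the model is called.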

🚀 Getting Started

Step 1: Clone the Repository

First, clone the repository and navigate to the project directory:

git clone https://github.com/GoJo-Rika/Student-Performance-Prediction-System.git
cd Student-Performance-Prediction-System

Step 2: Set Up The Environment and Install Dependencies

We recommend using uv, a fast, next-generation Python package manager, for setup.

Recommended Approach (using uv)

  1. Install uv on your system if you haven't already.

    # On macOS and Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # On Windows
    powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
  2. Create a virtual environment and install dependencies with a single command:

    uv sync

    This command automatically creates a .venv folder in your project directory and installs all of the project's declared dependencies.

    Note: For a comprehensive guide on uv, check out this detailed tutorial: uv-tutorial-guide.

Alternative Approach (using venv and pip)

If you prefer to use the standard venv and pip:

  1. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate
  2. Install the required dependencies:

    pip install -r requirements.txt  # Using uv: uv add -r requirements.txt

👟 Running the Application

Follow these steps to run the project locally.

Step 1: Run the Training Pipeline

Before you can run the web application for the first time, you need to train the model. This will generate the necessary model.pkl and preprocessor.pkl files in the artifacts/ directory.

python src/components/data_ingestion.py     # Using uv: uv run src/components/data_ingestion.py

This single command will execute the entire training workflow: data ingestion, transformation, and model training.

Step 2: Start the Prediction Server

Once the training is complete and the artifacts are saved, start the Flask web server:

python application.py   # Using uv: uv run application.py

Step 3: Access the Web App

Open your web browser and navigate to: http://127.0.0.1:5000

You can now use the form to input student data and get a math score prediction.

🧪 Jupyter Notebooks

The notebooks/ directory contains two key notebooks that document the project's development:

  1. 1 . EDA STUDENT PERFORMANCE .ipynb: This notebook contains a detailed Exploratory Data Analysis (EDA) of the student dataset, including visualizations and key insights that informed the feature engineering and model selection process.
  2. 2. MODEL TRAINING.ipynb: This notebook shows the initial model training and evaluation experiments. It serves as a scratchpad for testing different models and preprocessing steps before they were refactored into the main src pipeline.

🔄 Development Workflow

This project follows a modular, pipeline-based architecture:

  1. Experimentation: Initial development in Jupyter notebooks
  2. Modularization: Successful experiments converted to reusable components
  3. Pipeline Integration: Components connected in training and prediction pipelines
  4. Error Handling: Custom exceptions and logging for debugging
  5. Testing: Iterative testing and refinement
  6. Deployment: AWS EC2 deployment with Elastic Beanstalk configuration

📈 Model Performance

The system evaluates multiple algorithms and selects the best performer:

  • Minimum R² score threshold: 0.6
  • Grid search hyperparameter optimization
  • Cross-validation for robust evaluation
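
For illustration, here is a minimal sketch of how such a selection loop can be written with GridSearchCV; the candidate models and parameter grids below are assumptions, not the grids actually used in src/components/model_trainer.py.

# Illustrative sketch of the selection loop; the candidate models and parameter
# grids here are assumptions, not the project's exact configuration.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV


def select_best_model(X_train, y_train, X_test, y_test, min_r2: float = 0.6):
    candidates = {
        "RandomForest": (RandomForestRegressor(), {"n_estimators": [64, 128, 256]}),
        "GradientBoosting": (GradientBoostingRegressor(), {"learning_rate": [0.05, 0.1]}),
        "LinearRegression": (LinearRegression(), {}),
    }

    best_name, best_model, best_score = None, None, float("-inf")
    for name, (estimator, param_grid) in candidates.items():
        # Cross-validated grid search over this candidate's hyperparameters.
        search = GridSearchCV(estimator, param_grid, cv=3)
        search.fit(X_train, y_train)
        score = r2_score(y_test, search.best_estimator_.predict(X_test))
        if score > best_score:
            best_name, best_model, best_score = name, search.best_estimator_, score

    # Enforce the minimum R² threshold before accepting any model.
    if best_score < min_r2:
        raise ValueError(f"No model reached the minimum R² of {min_r2}")
    return best_name, best_model, best_score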

🔧 Configuration

AWS Deployment

  • EC2 Instance: Configured via .ebextensions/python.config
  • WSGI: Flask application served through application:application
  • Environment: Production-ready with proper logging

File Structure

artifacts/
├── model.pkl               # Trained model
├── preprocessor.pkl        # Feature transformation pipeline
├── train.csv               # Training data
├── test.csv                # Test data
└── data.csv                # Raw data

logs/
└── [timestamp].log         # Application logs

πŸ› Troubleshooting

Common Issues:

  1. Import errors: Ensure all dependencies are installed
  2. Data not found: Check notebooks/data/stud.csv exists
  3. Model not found: Run training pipeline first
  4. Prediction errors: Check input data format

Debugging:

  • Check logs in logs/ directory
  • Custom exceptions provide detailed error context
  • Use logging output for pipeline debugging
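
As a hedged illustration of what "detailed error context" can look like, the snippet below wraps an error with the file name and line number where it occurred; the actual implementation in src/exception.py may differ.

# Hedged illustration only; the real src/exception.py may differ in detail.
import sys


def error_message_with_context(error: Exception) -> str:
    """Build an error message that includes the file name and line number."""
    _, _, tb = sys.exc_info()
    if tb is None:
        return str(error)
    return (
        f"Error in {tb.tb_frame.f_code.co_filename} "
        f"at line {tb.tb_lineno}: {error}"
    )


class CustomException(Exception):
    """Wraps an error together with the location where it was raised."""

    def __init__(self, error: Exception) -> None:
        super().__init__(error_message_with_context(error))


# Typical usage inside a pipeline component:
try:
    1 / 0  # deliberately trigger an error
except Exception as exc:
    print(CustomException(exc))  # "Error in <file> at line <n>: division by zero"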

πŸ“ API Endpoints

  • GET /: Home page
  • GET /predictdata: Prediction form
  • POST /predictdata: Submit prediction request
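
A minimal sketch of how these endpoints can be wired to the saved artifacts is shown below; the template names, form field names, and inline artifact loading are assumptions and may differ from application.py and src/pipeline/prediction_pipeline.py.

# Minimal sketch of the endpoints; template names, form field names, and the
# inline artifact loading are assumptions, not the project's verified code.
import pickle

import pandas as pd
from flask import Flask, render_template, request

application = Flask(__name__)


@application.route("/")
def index():
    return render_template("index.html")  # assumed template name


@application.route("/predictdata", methods=["GET", "POST"])
def predict_datapoint():
    if request.method == "GET":
        return render_template("home.html")  # assumed form template

    # Collect the submitted form fields into a single-row DataFrame.
    input_df = pd.DataFrame([{
        "gender": request.form.get("gender"),
        "race_ethnicity": request.form.get("race_ethnicity"),
        "parental_level_of_education": request.form.get("parental_level_of_education"),
        "lunch": request.form.get("lunch"),
        "test_preparation_course": request.form.get("test_preparation_course"),
        "reading_score": float(request.form.get("reading_score", 0)),
        "writing_score": float(request.form.get("writing_score", 0)),
    }])

    # Load the saved artifacts and generate a prediction.
    with open("artifacts/preprocessor.pkl", "rb") as f:
        preprocessor = pickle.load(f)
    with open("artifacts/model.pkl", "rb") as f:
        model = pickle.load(f)
    prediction = model.predict(preprocessor.transform(input_df))[0]

    return render_template("home.html", results=prediction)


if __name__ == "__main__":
    application.run(host="0.0.0.0", port=5000)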

🤝 Contributing

Contributions are welcome! If you have suggestions or want to improve the project, please follow these steps:

  1. Fork the repository.
  2. Create a new feature branch (git checkout -b feature/your-feature-name).
  3. Make your changes and commit them (git commit -m 'Add some feature').
  4. Push to the branch (git push origin feature/your-feature-name).
  5. Open a Pull Request.

📄 License

This project is licensed under the MIT License. See the LICENSE file for more details.

🎯 Future Enhancements

  • REST API for programmatic access
  • Model training & retraining pipeline
  • Performance monitoring dashboard
  • Additional ML algorithms
  • A/B testing framework