# Hep Event Classifier

Hep Event Classifier is a complete machine learning pipeline for classifying high-energy physics events. It covers data preprocessing, model training and hyperparameter tuning, evaluation, and result visualization. The pipeline supports multiple classifiers and saves both the trained models and their evaluation artifacts.
## Table of Contents

- Features
- Project Structure
- Installation
- Usage
- Pipeline Steps
- Models & Hyperparameter Tuning
- Evaluation
- Future Work
- License
## Features

- End-to-end ML pipeline from raw CSV data to trained models.
- Preprocessing with scaling, encoding, and train-test split.
- Model training with GridSearchCV for hyperparameter tuning.
- Supports Logistic Regression, Decision Tree, Random Forest, XGBoost.
- Automatic saving of the following (a sketch of loading them back follows this list):
  - Best trained model (`.pkl`)
  - Evaluation metrics (`.json`)
  - Confusion matrix plots (`.png`)
- Modular design allows easy extension to new models or preprocessing steps.
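As a quick illustration of the saved artifacts, here is a minimal sketch of loading them back after a pipeline run. The file names `models/best_model.pkl` and `results/metrics.json` are assumptions made for illustration; the actual names depend on the pipeline configuration:

```python
import json

import joblib  # scikit-learn models are commonly serialized with joblib

# Load the best trained model (file name is an assumption).
model = joblib.load("models/best_model.pkl")

# Load the saved evaluation metrics (file name is an assumption).
with open("results/metrics.json") as f:
    metrics = json.load(f)

print(type(model).__name__, metrics)
```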
## Project Structure

```
hep-event-classifier/
│
├─ src/
│  ├─ pipeline.py         # Main entry point for running the full pipeline
│  ├─ preprocessing.py    # Data preprocessing utilities
│  ├─ models.py           # Model definitions and training functions
│  └─ evaluation.py       # Evaluation metrics & visualization
│
├─ tests/
│  ├─ test_preprocessing.py
│  ├─ test_models.py
│  ├─ test_evaluation.py
│  └─ test_pipeline.py
│
├─ models/                # Directory for saved trained models
├─ results/               # Directory for evaluation outputs
├─ data/                  # Example CSVs
├─ README.md
├─ requirements.txt
└─ .gitignore
```
## Installation

Clone the repository and set up a Python virtual environment:

```bash
git clone https://github.com/AafaqueAnjum06/hep-event-classifier.git
cd hep-event-classifier

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
## Usage

Run the full pipeline:

```bash
python src/pipeline.py
```

This runs the entire pipeline using default settings and the example data in `data/`.

Optional settings (currently edited in `pipeline.py`; CLI support may be added later):

- `DATA_PATH` – path to the CSV dataset
- `model_save_dir` – directory to save trained models
- `results_save_dir` – directory to save evaluation results

A sketch of how these settings might appear in `pipeline.py` is shown below.
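For illustration only, these could be plain module-level constants near the top of `src/pipeline.py`. The default values here are assumptions, not the project's actual configuration:

```python
# Hypothetical configuration block in src/pipeline.py; values are illustrative.
DATA_PATH = "data/events.csv"   # path to the CSV dataset (file name is an assumption)
model_save_dir = "models"       # directory to save trained models
results_save_dir = "results"    # directory to save evaluation results
```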
Run the pipeline test:

```bash
pytest -v tests/test_pipeline.py
```

This uses a synthetic dataset to verify the pipeline logic; a sketch of such a test follows.
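The project's actual test will differ; this self-contained sketch only mirrors the idea of verifying pipeline logic on synthetic data, using scikit-learn directly rather than the repo's own modules:

```python
# Hypothetical test sketch: exercises a scikit-learn pipeline on synthetic data,
# mirroring the kind of check tests/test_pipeline.py performs on the project code.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def test_pipeline_on_synthetic_data():
    # Generate a small, easily separable synthetic dataset.
    X, y = make_classification(n_samples=200, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Scale, then fit a simple classifier.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)

    # The pipeline should comfortably beat random guessing on this data.
    assert clf.score(X_test, y_test) > 0.8
```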
## Pipeline Steps

The pipeline runs three steps in order; a minimal sketch of each follows the list.

1. **Data Preprocessing**
   - Loads CSV
   - Splits features & target
   - Scales numeric features
   - Splits train/test sets
2. **Model Training & Hyperparameter Tuning**
   - Performs GridSearchCV for multiple models
   - Evaluates validation accuracy
   - Saves the best model
3. **Evaluation**
   - Computes metrics: accuracy, precision, recall, F1-score
   - Generates confusion matrix
   - Saves evaluation artifacts
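A minimal sketch of step 1, assuming a CSV whose last column is the target; the file path and column layout are assumptions:

```python
# Step 1 sketch: load CSV, split features/target, scale, train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/events.csv")        # path is illustrative
X, y = df.iloc[:, :-1], df.iloc[:, -1]     # assumes the target is the last column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on training data only to avoid leakage into the test set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```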
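Step 2 in the same spirit, continuing from the variables in the step 1 sketch: GridSearchCV over a few candidate models, keeping the one with the best validation accuracy. The candidate set and grids here are illustrative, not the project's exact configuration:

```python
# Step 2 sketch: tune several models with GridSearchCV, keep the best one.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=42), {"n_estimators": [100, 200]}),
}

best_model, best_score = None, -1.0
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    print(f"{name}: best CV accuracy = {search.best_score_:.3f}")
    if search.best_score_ > best_score:
        best_model, best_score = search.best_estimator_, search.best_score_

joblib.dump(best_model, "models/best_model.pkl")  # path is illustrative
```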
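And step 3, again continuing from the sketches above: compute the metrics, plot the confusion matrix, and save both. Output paths are illustrative (and assume `results/` exists); `average="weighted"` is one way to make the metrics work for both binary and multiclass targets:

```python
# Step 3 sketch: compute metrics, plot a confusion matrix, save artifacts.
import json

from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_pred = best_model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="weighted"),
    "recall": recall_score(y_test, y_pred, average="weighted"),
    "f1": f1_score(y_test, y_pred, average="weighted"),
}

# Save metrics as JSON and the confusion matrix as a PNG.
with open("results/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

disp = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
disp.plot()
disp.figure_.savefig("results/confusion_matrix.png")
```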
## Models & Hyperparameter Tuning

| Model | Key Hyperparameters |
|---|---|
| LogisticRegression | `C`, `solver` |
| DecisionTree | `max_depth`, `min_samples_split` |
| RandomForest | `n_estimators`, `max_depth`, `min_samples_split` |
| XGBoost | `n_estimators`, `max_depth`, `learning_rate` |
- GridSearchCV is used with cross-validation (default `cv=3`)
- Handles both binary and multiclass classification
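For illustration, the search space from the table could be expressed as parameter grids like the following; the specific values searched are assumptions, not the project's configured grids:

```python
# Hypothetical parameter grids matching the table above; values are illustrative.
param_grids = {
    "LogisticRegression": {"C": [0.01, 0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]},
    "DecisionTree": {"max_depth": [3, 5, 10, None], "min_samples_split": [2, 5, 10]},
    "RandomForest": {
        "n_estimators": [100, 200, 500],
        "max_depth": [5, 10, None],
        "min_samples_split": [2, 5],
    },
    "XGBoost": {
        "n_estimators": [100, 200],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1, 0.3],
    },
}
```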
## Evaluation

- Metrics: accuracy, precision, recall, F1-score
- Visualization: confusion matrix
- Example outputs are saved in `results/`
- Best model info and validation accuracy are logged to the console
## License

MIT License © 2025 Aafaque Anjum