A comprehensive Machine Learning pipeline for SMS Spam Detection using DVC (Data Version Control) for reproducible ML workflows and DVC Live for experiment tracking. This project demonstrates best practices for building scalable, maintainable ML pipelines.
- Overview
- Features
- Project Structure
- Pipeline Stages
- Technologies Used
- Installation
- Usage
- Configuration
- Experiment Tracking
This project implements a complete ML pipeline for SMS spam detection using a Random Forest classifier. The pipeline is built with DVC for data versioning and reproducible workflows, ensuring that every experiment can be reproduced exactly as it was run.
- End-to-end ML Pipeline: From data ingestion to model evaluation
- DVC Integration: Version control for data, models, and experiments
- Experiment Tracking: Real-time metrics and parameter tracking with DVC Live
- Modular Architecture: Clean separation of concerns across pipeline stages
- Comprehensive Logging: Detailed logging for debugging and monitoring
- Parameter Management: YAML-based configuration for easy hyperparameter tuning
- Reproducible Workflows: DVC ensures every experiment can be reproduced
- Experiment Tracking: Real-time metrics visualization with DVC Live
- Data Preprocessing: Text cleaning, stemming, and feature engineering
- Model Training: Random Forest classifier with configurable parameters
- Model Evaluation: Comprehensive metrics (Accuracy, Precision, Recall, AUC)
- Comprehensive Logging: Detailed logs for each pipeline stage
- Parameter Management: YAML-based configuration system
- Modular Design: Clean, maintainable code structure
```
DVC-Pipeline-AWS/
├── src/                      # Source code
│   ├── data_ingestion.py     # Data loading and splitting
│   ├── data_preprocessing.py # Text preprocessing and cleaning
│   ├── feature_eng.py        # Feature engineering (TF-IDF)
│   ├── model_training.py     # Model training
│   └── model_evaluation.py   # Model evaluation and metrics
├── data/                     # Data storage
│   ├── raw/                  # Raw data files
│   ├── interim/              # Intermediate processed data
│   └── processed/            # Final processed data
├── models/                   # Trained models
├── logs/                     # Log files
├── reports/                  # Evaluation reports
├── experiments/              # Jupyter notebooks and datasets
├── dvc.yaml                  # DVC pipeline configuration
├── params.yaml               # Hyperparameters and configuration
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
The ML pipeline consists of 5 main stages, each managed by DVC:
1. Data Ingestion (`src/data_ingestion.py`)
   - Downloads the spam dataset from GitHub
   - Splits the data into train/test sets
   - Saves raw data to `data/raw/`
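The ingestion step can be sketched as follows. The toy dataframe, function name, and column names here are illustrative stand-ins, not the project's actual dataset schema or source URL:

```python
# Minimal sketch of the ingestion stage: split a labeled dataset
# into train/test partitions (the real stage first downloads a CSV).
import pandas as pd
from sklearn.model_selection import train_test_split

def ingest(df: pd.DataFrame, test_size: float = 0.10, random_state: int = 2):
    """Split a labeled SMS dataframe into train and test partitions."""
    return train_test_split(df, test_size=test_size, random_state=random_state)

# Toy stand-in data with the kind of text/label columns an SMS dataset has
df = pd.DataFrame({"text": ["free prize now"] * 10,
                   "label": ["spam"] * 5 + ["ham"] * 5})
train_df, test_df = ingest(df, test_size=0.2)  # 8 train rows, 2 test rows
```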
2. Data Preprocessing (`src/data_preprocessing.py`)
   - Text cleaning and normalization
   - Label encoding for the target variable
   - Duplicate removal
   - Saves processed data to `data/interim/`
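A minimal sketch of the cleaning step, using NLTK's `PorterStemmer` (NLTK is in the project's dependencies; the exact helper names in `src/data_preprocessing.py` may differ):

```python
# Sketch of text cleaning: lowercase, strip punctuation, stem each token.
import string
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    text = text.lower()
    # Remove all ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Reduce each token to its stem, e.g. "winners" -> "winner"
    return " ".join(stemmer.stem(tok) for tok in text.split())

print(clean_text("Winners claimed!!"))  # -> "winner claim"
```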
3. Feature Engineering (`src/feature_eng.py`)
   - TF-IDF vectorization
   - Configurable feature extraction
   - Saves feature-engineered data to `data/processed/`
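The TF-IDF step might look like this sketch, with the vocabulary size capped to mirror the `max_features` knob in `params.yaml` (the corpus is a toy stand-in):

```python
# Sketch of feature engineering: TF-IDF vectorization with a capped vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "free prize claim now",
    "see you at lunch",
    "claim your free prize",
]

vectorizer = TfidfVectorizer(max_features=45)  # matches params.yaml
X = vectorizer.fit_transform(corpus)           # sparse (n_docs, n_features) matrix
print(X.shape)  # second dimension is the vocabulary size, at most 45
```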
4. Model Training (`src/model_training.py`)
   - Random Forest classifier training
   - Hyperparameter configuration
   - Model serialization to `models/`
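A sketch of the training step, with the hyperparameters the project configures in `params.yaml`; the feature matrix and output path here are toy placeholders:

```python
# Sketch of model training: fit a Random Forest and pickle it.
import pickle
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the TF-IDF features produced by the previous stage
X = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham

clf = RandomForestClassifier(n_estimators=25, random_state=2)  # params.yaml values
clf.fit(X, y)

# The real stage serializes to models/; "model.pkl" is a placeholder path
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
```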
5. Model Evaluation (`src/model_evaluation.py`)
   - Performance metrics calculation
   - Experiment tracking with DVC Live
   - Metrics logging to `reports/`
- Python 3.x: Core programming language
- DVC: Data version control and pipeline management
- DVC Live: Experiment tracking and visualization
- Scikit-learn: Machine learning algorithms and preprocessing
- Pandas: Data manipulation and analysis
- NLTK: Natural language processing
- NumPy: Numerical computing
- PyYAML: Configuration management
- Matplotlib: Data visualization
- Python 3.8 or higher
- Git
- DVC
1. Clone the repository

   ```bash
   git clone <your-repo-url>
   cd DVC-Pipeline-AWS
   ```

2. Install Python dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Install DVC (if not already installed)

   ```bash
   pip install dvc
   ```

4. Initialize DVC (if not already initialized)

   ```bash
   dvc init
   ```
```bash
# Run the entire pipeline
dvc repro

# Run a specific stage
dvc repro data_ingestion
dvc repro model_training
```
```bash
# Data ingestion
python src/data_ingestion.py

# Data preprocessing
python src/data_preprocessing.py

# Feature engineering
python src/feature_eng.py

# Model training
python src/model_training.py

# Model evaluation
python src/model_evaluation.py
```
```bash
# View experiment results
dvc exp show

# Compare experiments
dvc exp diff

# View metrics
dvc metrics show
```
The pipeline is configured through `params.yaml`:

```yaml
data_ingestion:
  test_size: 0.10

feature_eng:
  max_features: 45

model_training:
  n_estimators: 25
  random_state: 2
```
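Each stage reads its own section of this file. A minimal sketch of loading parameters with PyYAML (an inline string is parsed here so the snippet is self-contained; the real stages would open `params.yaml` instead):

```python
# Sketch of parameter loading with PyYAML.
import yaml

PARAMS = """
data_ingestion:
  test_size: 0.10
model_training:
  n_estimators: 25
  random_state: 2
"""

# In a pipeline stage this would be: yaml.safe_load(open("params.yaml"))
params = yaml.safe_load(PARAMS)

test_size = params["data_ingestion"]["test_size"]        # 0.10
n_estimators = params["model_training"]["n_estimators"]  # 25
```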
To experiment with different parameters:
1. Modify `params.yaml`
2. Run the pipeline: `dvc repro`
3. Track results: `dvc exp show`
This project uses DVC Live for comprehensive experiment tracking:
- Accuracy: Overall model performance
- Precision: Spam detection precision
- Recall: Spam detection recall
- AUC: Area under ROC curve
- Real-time metrics plotting
- Parameter tracking
- Experiment comparison
```bash
# View current metrics
dvc metrics show

# Compare experiments
dvc exp diff

# View experiment history
dvc exp show
```
The pipeline is defined in `dvc.yaml`:

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py
    deps:
      - src/data_ingestion.py
    params:
      - data_ingestion.test_size
    outs:
      - data/raw
  # ... other stages
```
The model evaluation stage calculates and tracks:
- Accuracy: Overall classification accuracy
- Precision: Precision for spam detection
- Recall: Recall for spam detection
- AUC: Area under the ROC curve
1. Modify parameters in `params.yaml`
2. Run the pipeline: `dvc repro`
3. Track results: `dvc exp show`
```yaml
# Modify test_size in params.yaml
data_ingestion:
  test_size: 0.20  # changed from 0.10
```

```bash
# Run the experiment and view results
dvc repro
dvc exp show
```
Each pipeline stage includes comprehensive logging:
- Console output: Real-time progress
- File logging: Detailed logs in the `logs/` directory
- Error handling: Graceful error handling and reporting
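A per-stage logger along these lines would produce both outputs; the handler setup and log format shown are an assumption, not the exact code in `src/`:

```python
# Sketch of a per-stage logger: console handler plus a file handler in logs/.
import logging
import os

def get_logger(stage: str) -> logging.Logger:
    os.makedirs("logs", exist_ok=True)
    logger = logging.getLogger(stage)
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

    console = logging.StreamHandler()     # real-time progress on the console
    console.setFormatter(fmt)
    file_h = logging.FileHandler(f"logs/{stage}.log")  # detailed log file
    file_h.setFormatter(fmt)

    logger.addHandler(console)
    logger.addHandler(file_h)
    return logger

logger = get_logger("data_ingestion")
logger.info("stage started")
```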