A production-ready MLOps pipeline that automatically classifies Reddit content using advanced multi-label machine learning with enterprise-level automation and continuous learning.
Live Application
- Multi-Label Classification: Simultaneous analysis across 5 dimensions (Safety, Toxicity, Sentiment, Topic, Engagement)
- Automated MLOps: Weekly retraining with model selection and deployment
- Production Scaling: Handles 25,000+ posts per training cycle
- Real-time Inference: Sub-second response times with 88%+ accuracy
| Category | Description | Classifications |
|---|---|---|
| Safety | Content safety assessment | Safe, NSFW |
| Toxicity | Harmful content detection | Non-toxic, Toxic |
| Sentiment | Emotional tone analysis | Positive, Neutral, Negative |
| Topic | Content categorization | Technology, Gaming, Business, Health |
| Engagement | Viral potential prediction | High, Low Engagement |
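A minimal sketch of how joint prediction across categories can be wired with scikit-learn's `MultiOutputClassifier` (toy data covering just two of the five categories; the actual training code lives in `src/train.py` and may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus labeled for two of the five categories (Safety, Sentiment);
# the production pipeline predicts all five simultaneously.
texts = [
    "great new gpu benchmark results",
    "this game update is terrible",
    "awful explicit adult content here",
    "lovely wholesome cooking thread",
]
labels = [
    ["Safe", "Positive"],
    ["Safe", "Negative"],
    ["NSFW", "Negative"],
    ["Safe", "Positive"],
]

# One underlying classifier is cloned and fit per label column
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, labels)
print(clf.predict(["fantastic new gpu benchmark"]))
```

Each input text yields one label per category in a single `predict` call, which is what lets the app report all five dimensions at once.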
```
Reddit API → Data Pipeline → ML Training → Model Deployment → Web Application
     ↓             ↓              ↓               ↓                 ↓
 PRAW API    GitHub Actions  Ensemble ML      Git LFS          Streamlit
```
MLOps Pipeline:
- Data Collection: Weekly automated Reddit data ingestion (25,000+ posts)
- Feature Engineering: TF-IDF vectorization (10k features, 1-2 grams)
- Model Training: Multi-algorithm competition (Logistic Regression, SVM, Neural Networks, LightGBM)
- Model Selection: Ensemble creation from top-performing models
- Deployment: Automated Git LFS versioning and cloud deployment
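Steps 2–4 of the pipeline can be sketched as a cross-validated model competition (toy data; LightGBM is omitted for brevity and the candidate set, scoring metric, and selection rule are illustrative, not taken from `src/train.py`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Stand-in for the TF-IDF features produced in the real pipeline
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate algorithms compete on cross-validated F1; the winner ships
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="f1").mean()
    for name, model in candidates.items()
}
champion_name = max(scores, key=scores.get)
champion = candidates[champion_name].fit(X, y)
print(champion_name, round(scores[champion_name], 3))
```

The same pattern extends to an ensemble: instead of keeping only the single best model, the top performers can be combined (e.g. via `VotingClassifier`) before deployment.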
| Metric | Value | Description |
|---|---|---|
| Binary F1-Score | 88.3% | SFW/NSFW classification accuracy |
| Multi-Label Jaccard | 82.7% | Overall multi-category performance |
| Training Data | 25,000+ posts | Reddit posts per training cycle |
| Inference Speed | <100 ms | Real-time response capability |
| Model Size | ~150 MB | Optimized for cloud deployment |
| Automation | Weekly | Continuous learning and updates |
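The sub-100 ms inference figure is easy to sanity-check locally; a rough timing sketch with a toy model (not the production artifacts, and absolute numbers will vary by machine):

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model to time single-post inference
texts = [f"sample reddit post number {i}" for i in range(200)]
labels = [i % 2 for i in range(200)]
vec = TfidfVectorizer(ngram_range=(1, 2)).fit(texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)

start = time.perf_counter()
clf.predict(vec.transform(["is this post safe for work?"]))
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"single-post inference: {elapsed_ms:.2f} ms")
```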
Core Technologies:
- Python 3.11, Scikit-learn, LightGBM, Pandas, NumPy
- Streamlit, Plotly (Visualization)
- PRAW (Reddit API), TF-IDF (NLP)
MLOps & Infrastructure:
- GitHub Actions (CI/CD), Git LFS (Model Versioning)
- Docker (Containerization), Streamlit Cloud (Deployment)
- Python 3.11+
- Git with Git LFS support
- Reddit API credentials (for data collection and training)
```bash
git clone https://github.com/RobinMillford/Reddit-content-classifier.git
cd Reddit-content-classifier

# Set up Git LFS for model files
git lfs install
git lfs pull

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
Get Reddit API Credentials:
- Go to Reddit App Preferences
- Click "Create App" or "Create Another App"
- Fill in the form:
  - Name: your app name (e.g., "Content Classifier")
  - App type: select "script"
  - Description: optional
  - About URL: leave blank
  - Redirect URI: `http://localhost:8080`
- Click "Create app"
- Note the Client ID (shown under the app name) and the Client Secret
Setup Environment Variables:
Create a `.env` file in the project root:

```bash
# Create .env file
touch .env  # Linux/macOS
# or create the file manually on Windows
```

Add your Reddit API credentials to `.env`:

```
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=YourAppName/1.0
```
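Loading these variables at startup is typically done with python-dotenv; a stdlib-only sketch of the same idea (the project's actual loading code may differ):

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments ignored.

    Stdlib-only illustration; python-dotenv's load_dotenv() does this
    (and more) in production setups.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't clobber variables already exported in the shell
        os.environ.setdefault(key.strip(), value.strip())


if Path(".env").exists():
    load_env()
    print(os.getenv("REDDIT_CLIENT_ID"))
```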
Alternative: Export Environment Variables:

```bash
# Linux/macOS
export REDDIT_CLIENT_ID="your_client_id"
export REDDIT_CLIENT_SECRET="your_client_secret"
export REDDIT_USER_AGENT="YourAppName/1.0"
```

```cmd
:: Windows Command Prompt
set REDDIT_CLIENT_ID=your_client_id
set REDDIT_CLIENT_SECRET=your_client_secret
set REDDIT_USER_AGENT=YourAppName/1.0
```

```powershell
# Windows PowerShell
$env:REDDIT_CLIENT_ID="your_client_id"
$env:REDDIT_CLIENT_SECRET="your_client_secret"
$env:REDDIT_USER_AGENT="YourAppName/1.0"
```
```bash
# Start the web application
streamlit run app.py
```

Access: the application runs at http://localhost:8501
```bash
# Collect fresh training data
python src/ingest_data.py

# Train and evaluate models
python src/train.py
```

Models are saved automatically and loaded by app.py.
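`src/ingest_data.py` pulls posts via PRAW; a hedged sketch of the general shape such a collector takes (the `post_to_record` helper, field choices, and subreddit are illustrative, not taken from the script):

```python
def post_to_record(post):
    """Flatten a Reddit submission into the fields a trainer might need.

    `post` can be any object exposing these attributes; in production it
    would be a praw.models.Submission. Field selection is illustrative.
    """
    return {
        "title": post.title,
        "text": post.selftext,
        "subreddit": str(post.subreddit),
        "score": post.score,
        "over_18": post.over_18,
    }


if __name__ == "__main__":
    import os

    import praw  # requires the REDDIT_* credentials from .env

    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ["REDDIT_USER_AGENT"],
    )
    records = [
        post_to_record(p)
        for p in reddit.subreddit("technology").hot(limit=10)
    ]
    print(f"collected {len(records)} posts")
```

Keeping the flattening step a pure function like this makes it easy to unit-test without touching the Reddit API.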
Common Issues:
- Git LFS files not downloading: run `git lfs pull`
- Reddit API errors: verify your `.env` credentials
- Model files missing: ensure Git LFS is installed and configured
- Import errors: check that the virtual environment is activated
Verify Setup:

```bash
# Check Git LFS status
git lfs ls-files

# Verify environment variables
python -c "import os; print(os.getenv('REDDIT_CLIENT_ID'))"

# Confirm PRAW is installed and a client can be constructed (no network call is made)
python -c "import praw; reddit = praw.Reddit(client_id='test', client_secret='test', user_agent='test'); print('PRAW imported successfully')"
```
```
├── src/
│   ├── ingest_data.py        # Reddit data collection
│   └── train.py              # ML model training
├── .github/workflows/        # CI/CD automation
├── app.py                    # Streamlit web application
├── champion_model.pkl        # Production binary model (Git LFS)
├── multi_label_model.pkl     # Production multi-label model (Git LFS)
├── vectorizer.joblib         # Text preprocessing pipeline (Git LFS)
└── model_metadata.joblib     # Model performance metrics (Git LFS)
```
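The pickled artifacts above are loaded once at app startup; a toy round-trip showing the joblib pattern (file names mirror the tree, but app.py's actual loading code may differ):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the production vectorizer and binary model
texts = ["safe tech post", "explicit adult content", "another safe post", "more adult content"]
labels = ["Safe", "NSFW", "Safe", "NSFW"]

vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

# Persist the same way the pipeline versions its artifacts (then via Git LFS)
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(model, "champion_model.pkl")

# At app startup: load once, reuse for every request
vec = joblib.load("vectorizer.joblib")
clf = joblib.load("champion_model.pkl")
print(clf.predict(vec.transform(["safe tech post"]))[0])
```

Because the vectorizer is versioned alongside the model, inference always applies exactly the same text preprocessing the champion was trained with.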
Business Value: Demonstrates end-to-end ML engineering capabilities with production-ready automation and scalable infrastructure design.
Technical Expertise: Showcases expertise in MLOps, automated pipelines, multi-label classification, and cloud deployment strategies.
Results Delivered: 88%+ accuracy system processing 25,000+ posts weekly with zero-downtime continuous deployment.
This project is open source and welcomes contributions from the community.
How to Contribute:
- Fork the repository
- Create a feature branch: `git checkout -b feature/enhancement`
- Make your changes with proper testing
- Submit a pull request with a detailed description
Areas for Contribution:
- Model performance improvements
- New classification categories
- Enhanced MLOps automation
- Documentation and testing
Development Setup:
```bash
git clone https://github.com/RobinMillford/Reddit-content-classifier.git
cd Reddit-content-classifier
pip install -r requirements.txt
streamlit run app.py
```
Project Repository: github.com/RobinMillford/Reddit-content-classifier
This project demonstrates production-ready MLOps implementation suitable for enterprise content moderation systems.