AI Content Classifier: Production MLOps Pipeline

A production-ready MLOps pipeline that classifies Reddit content across five label categories, with automated weekly retraining, model selection, and deployment.

🌐 Live Application

🎯 Project Overview

Multi-Label Classification: Simultaneous analysis across 5 dimensions (Safety, Toxicity, Sentiment, Topic, Engagement)
Automated MLOps: Weekly retraining with model selection and deployment
Production Scaling: Handles 25,000+ posts per training cycle
Real-time Inference: Responses in under 100 ms with an 88.3% binary F1-score

📊 Multi-Label Classification

Category	Description	Classifications
Safety	Content safety assessment	Safe, NSFW
Toxicity	Harmful content detection	Non-toxic, Toxic
Sentiment	Emotional tone analysis	Positive, Neutral, Negative
Topic	Content categorization	Technology, Gaming, Business, Health
Engagement	Viral potential prediction	High Engagement, Low Engagement
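
As an illustration of how the five categories could be represented in code, the sketch below encodes the table as a plain dictionary and checks that a prediction supplies exactly one valid class per category. The names `LABEL_SCHEMA` and `validate_prediction` are hypothetical and are not taken from the repository.

```python
# Hypothetical label schema mirroring the table above; not taken from the repository.
LABEL_SCHEMA = {
    "Safety": ("Safe", "NSFW"),
    "Toxicity": ("Non-toxic", "Toxic"),
    "Sentiment": ("Positive", "Neutral", "Negative"),
    "Topic": ("Technology", "Gaming", "Business", "Health"),
    "Engagement": ("High Engagement", "Low Engagement"),
}


def validate_prediction(prediction: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the prediction is valid."""
    problems = []
    for category, classes in LABEL_SCHEMA.items():
        if category not in prediction:
            problems.append(f"missing category: {category}")
        elif prediction[category] not in classes:
            problems.append(f"invalid class for {category}: {prediction[category]}")
    for category in prediction:
        if category not in LABEL_SCHEMA:
            problems.append(f"unknown category: {category}")
    return problems


example = {
    "Safety": "Safe",
    "Toxicity": "Non-toxic",
    "Sentiment": "Positive",
    "Topic": "Technology",
    "Engagement": "High Engagement",
}
print(validate_prediction(example))  # prints []
```

Keeping the schema in one place like this would make it straightforward to add a new category later, which is one of the contribution areas listed further down.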

🏗️ Architecture

Reddit API → Data Pipeline → ML Training → Model Deployment → Web Application
    │              │              │               │                │
 PRAW API       GitHub Actions   Ensemble ML      Git LFS        Streamlit

MLOps Pipeline:

Data Collection: Weekly automated Reddit data ingestion (25,000+ posts)
Feature Engineering: TF-IDF vectorization (10,000 features, unigrams and bigrams)
Model Training: Multi-algorithm competition (Logistic Regression, SVM, Neural Networks, LightGBM)
Model Selection: Ensemble creation from top-performing models
Deployment: Automated Git LFS versioning and cloud deployment
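
The feature engineering, training, and selection steps above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code in `src/train.py`: the toy corpus, the choice of F1 as the ranking metric, the top-2 cutoff, and hard voting are all assumptions, and LightGBM is left out to keep the example dependency-light.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 0 = Safe, 1 = NSFW. Real training uses Reddit posts.
texts = [
    "new laptop review with benchmark results",
    "best strategy games released this year",
    "quarterly earnings beat analyst expectations",
    "tips for a healthy morning routine",
    "open source compiler update announced",
    "indie game studio shares patch notes",
    "explicit adult content not safe for work",
    "nsfw explicit images adult only",
    "adult only explicit video nsfw",
    "graphic explicit material adult nsfw",
    "nsfw adult content explicit gallery",
    "explicit nsfw post adult warning",
]
labels = [0] * 6 + [1] * 6

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

# TF-IDF settings from the README: up to 10,000 features, unigrams and bigrams.
vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)  # fit on training text only
X_test = vectorizer.transform(test_texts)

# Candidate models competing on the same features.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), zero_division=0)

# Keep the two best-scoring candidates and combine them by majority vote.
top_names = sorted(scores, key=scores.get, reverse=True)[:2]
ensemble = VotingClassifier([(n, candidates[n]) for n in top_names], voting="hard")
ensemble.fit(X_train, y_train)
predictions = ensemble.predict(X_test)
print(top_names, list(predictions))
```

Fitting the vectorizer on the training split only keeps held-out vocabulary statistics from leaking into the scores used for model selection.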

🎯 Performance Metrics

Metric	Value	Description
Binary F1-Score	88.3%	F1-score for Safe/NSFW classification
Multi-Label Jaccard	82.7%	Overall multi-category performance
Training Data	25,000+	Reddit posts per training cycle
Inference Speed	<100ms	Real-time response capability
Model Size	~150MB	Optimized for cloud deployment
Automation	Weekly	Continuous learning and updates
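
For readers unfamiliar with the two headline metrics, the sketch below computes a binary F1-score and a sample-averaged Jaccard score in plain Python on made-up labels. The README does not state which Jaccard averaging the project uses, so sample averaging is an assumption here, and the function names are hypothetical.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def jaccard_samples(true_sets, pred_sets):
    """Mean of |truth & pred| / |truth | pred| over samples (assumes >= 1 sample)."""
    per_sample = []
    for truth, pred in zip(true_sets, pred_sets):
        union = truth | pred
        per_sample.append(len(truth & pred) / len(union) if union else 1.0)
    return sum(per_sample) / len(per_sample)


# Made-up binary labels: 1 = NSFW, 0 = Safe.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
binary_f1 = f1_binary(y_true, y_pred)  # tp=2, fp=1, fn=1 -> 2/3

# Made-up multi-label outputs, one set of predicted classes per post.
true_sets = [{"Safe", "Positive"}, {"NSFW", "Toxic"}]
pred_sets = [{"Safe", "Neutral"}, {"NSFW", "Toxic"}]
multi_jaccard = jaccard_samples(true_sets, pred_sets)  # (1/3 + 1) / 2 = 2/3

print(round(binary_f1, 3), round(multi_jaccard, 3))
```

Because F1 balances precision and recall, it is not the same quantity as accuracy, which is why the table reports it under its own name.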

🛠️ Technology Stack

Core Technologies:

Python 3.11, Scikit-learn, LightGBM, Pandas, NumPy
Streamlit, Plotly (Visualization)
PRAW (Reddit API), TF-IDF (NLP)

MLOps & Infrastructure:

GitHub Actions (CI/CD), Git LFS (Model Versioning)
Docker (Containerization), Streamlit Cloud (Deployment)

🚀 Local Development Setup

Prerequisites

Python 3.11+
Git with Git LFS support
Reddit API credentials (for data collection and training)

Step 1: Clone Repository

git clone https://github.com/RobinMillford/Reddit-content-classifier.git
cd Reddit-content-classifier

# Setup Git LFS for model files
git lfs install
git lfs pull

Step 2: Environment Setup

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Step 3: Reddit API Configuration

Get Reddit API Credentials:

Go to Reddit App Preferences (https://www.reddit.com/prefs/apps)
Click "Create App" or "Create Another App"
Fill in the form:
- Name: Your app name (e.g., "Content Classifier")
- App type: Select "script"
- Description: Optional
- About URL: Leave blank
- Redirect URI: http://localhost:8080
Click "Create app"
Note down the Client ID (under app name) and Client Secret

Setup Environment Variables:

Create a .env file in the project root:

# Create .env file
touch .env  # Linux/macOS
# or create manually on Windows

Add your Reddit API credentials to .env:

REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=YourAppName/1.0
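
To confirm the file has the shape shown above, a small standard-library check can be used. This is a sketch only: how the project itself reads `.env` is not shown in this README, and `parse_env_text` and `missing_keys` are hypothetical helper names.

```python
# The three variable names come from the README; the helpers are hypothetical.
REQUIRED_KEYS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
# Reddit credentials
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET="your_client_secret_here"
"""
parsed = parse_env_text(sample)
print(missing_keys(parsed))  # prints ['REDDIT_USER_AGENT']
```

To check the real file, pass its contents instead of `sample`, for example `parse_env_text(open(".env").read())`.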

Alternative: Export Environment Variables:

# Export variables (Linux/macOS)
export REDDIT_CLIENT_ID="your_client_id"
export REDDIT_CLIENT_SECRET="your_client_secret"
export REDDIT_USER_AGENT="YourAppName/1.0"

# Windows Command Prompt
set REDDIT_CLIENT_ID=your_client_id
set REDDIT_CLIENT_SECRET=your_client_secret
set REDDIT_USER_AGENT=YourAppName/1.0

# Windows PowerShell
$env:REDDIT_CLIENT_ID="your_client_id"
$env:REDDIT_CLIENT_SECRET="your_client_secret"
$env:REDDIT_USER_AGENT="YourAppName/1.0"

Step 4: Run Application

# Start the web application
streamlit run app.py

🌐 Access: Application runs at http://localhost:8501

Step 5: Custom Model Training (Optional)

# Collect fresh training data
python src/ingest_data.py

# Train and evaluate models
python src/train.py

# Models are automatically saved and can be loaded by app.py

Troubleshooting

Common Issues:

Git LFS files not downloading: Run git lfs pull
Reddit API errors: Verify your .env credentials
Model files missing: Ensure Git LFS is installed and configured
Import errors: Check virtual environment activation

Verify Setup:

# Check Git LFS status
git lfs ls-files

# Verify environment variables
python -c "import os; print(os.getenv('REDDIT_CLIENT_ID'))"

# Confirm PRAW is installed (placeholder credentials; this does not verify your real credentials)
python -c "import praw; reddit = praw.Reddit(client_id='test', client_secret='test', user_agent='test'); print('PRAW imported successfully')"
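
When the "model files missing" issue appears, the files are often present but are still Git LFS pointer stubs, small text files that begin with `version https://git-lfs.github.com/spec/v1`. The sketch below tells the three cases apart. The file names come from the Project Structure section; `is_lfs_pointer` and `check_artifacts` are hypothetical helpers and not part of the repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Artifact names as listed in the Project Structure section.
MODEL_FILES = (
    "champion_model.pkl",
    "multi_label_model.pkl",
    "vectorizer.joblib",
    "model_metadata.joblib",
)


def is_lfs_pointer(path: Path) -> bool:
    """True if the file is an un-fetched Git LFS pointer rather than real data."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def check_artifacts(root: Path) -> dict[str, str]:
    """Classify each expected artifact as 'missing', 'lfs-pointer', or 'ok'."""
    status = {}
    for name in MODEL_FILES:
        path = root / name
        if not path.is_file():
            status[name] = "missing"
        elif is_lfs_pointer(path):
            status[name] = "lfs-pointer"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    for name, state in check_artifacts(Path(".")).items():
        print(f"{name}: {state}")
```

Any file reported as `lfs-pointer` should be replaced with the real artifact by running `git lfs pull`.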

📁 Project Structure

├── src/
│   ├── ingest_data.py      # Reddit data collection
│   └── train.py            # ML model training
├── .github/workflows/      # CI/CD automation
├── app.py                  # Streamlit web application
├── champion_model.pkl      # Production binary model (Git LFS)
├── multi_label_model.pkl   # Production multi-label model (Git LFS)
├── vectorizer.joblib       # Text preprocessing pipeline (Git LFS)
└── model_metadata.joblib   # Model performance metrics (Git LFS)

💼 Professional Impact

Business Value: Demonstrates end-to-end ML engineering capabilities with production-ready automation and scalable infrastructure design.

Technical Expertise: Showcases expertise in MLOps, automated pipelines, multi-label classification, and cloud deployment strategies.

Results Delivered: A classifier with an 88.3% binary F1-score, retrained weekly on 25,000+ posts with zero-downtime continuous deployment.

🤝 Contributing

This project is open source and welcomes contributions from the community.

How to Contribute:

Fork the repository
Create a feature branch: git checkout -b feature/enhancement
Make your changes with proper testing
Submit a pull request with detailed description

Areas for Contribution:

Model performance improvements
New classification categories
Enhanced MLOps automation
Documentation and testing

Development Setup:

git clone https://github.com/RobinMillford/Reddit-content-classifier.git
cd Reddit-content-classifier
git lfs install
git lfs pull  # download the model files tracked by Git LFS
pip install -r requirements.txt
streamlit run app.py

Project Repository: github.com/RobinMillford/Reddit-content-classifier

This project demonstrates production-ready MLOps implementation suitable for enterprise content moderation systems.
