Problem
Pruning, a critical model compression technique that removes unnecessary weights or neurons to reduce model size and inference cost, is currently missing from the framework.
Existing
- Issue #278: Quantization – PTQ and QAT (INT8/INT4) with LoRA compatibility (different technique)
- LoRA adapters (low-rank, different technique)
Missing Implementations
Unstructured Pruning (HIGH; see the sketch after this list):
- Magnitude-based pruning (remove smallest weights)
- Gradient-based pruning
- Movement pruning (mask learning)
- Global vs layer-wise thresholds
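A minimal sketch of global magnitude-based pruning in plain PyTorch, assuming a PyTorch codebase; the `magnitude_prune` helper and its restriction to `nn.Linear` layers are illustrative choices, not a proposed API:

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float) -> dict:
    """Globally prune the smallest-magnitude weights across all Linear layers.

    Returns a dict of binary masks (1 = keep, 0 = pruned) keyed by parameter
    name. Hypothetical helper; the global-threshold policy is an assumption.
    """
    # Pool all weight magnitudes to find a single global threshold.
    all_weights = torch.cat([
        m.weight.detach().abs().flatten()
        for m in model.modules() if isinstance(m, nn.Linear)
    ])
    threshold = torch.quantile(all_weights, sparsity)

    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.Linear):
            mask = (m.weight.detach().abs() > threshold).float()
            m.weight.data.mul_(mask)          # zero out pruned weights
            masks[f"{name}.weight"] = mask    # keep mask to re-apply later
    return masks
```

Layer-wise thresholds would instead compute one quantile per layer inside the loop; the global variant tends to reach a given sparsity with less accuracy loss because it lets sensitive layers stay denser.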
Structured Pruning (CRITICAL; see the sketch after this list):
- Channel pruning (entire filters)
- Neuron pruning
- Head pruning (for transformers)
- Block pruning
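For structured pruning, `torch.nn.utils.prune` already covers the masking half; a sketch of channel pruning for convolutions (the physical removal of zeroed filters, which is what actually yields FLOPs savings, is a separate compaction step not shown here):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_channels(model: nn.Module, amount: float = 0.3) -> None:
    """Zero out entire output channels (filters) with the smallest L2 norm.

    Masking keeps tensor shapes intact, so downstream layers still see the
    same dimensions; a later compaction pass must drop the zeroed filters.
    """
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # dim=0 selects output channels; n=2 ranks them by L2 norm.
            prune.ln_structured(module, name="weight", amount=amount, n=2, dim=0)
```

The same pattern applies to neuron pruning (`nn.Linear`, rows of the weight matrix) and head pruning (grouping the rows belonging to one attention head).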
Advanced Techniques (MEDIUM; see the sketch after this list):
- Lottery Ticket Hypothesis (find winning subnetworks)
- Iterative pruning (gradual removal)
- Fine-tuning after pruning
- One-shot vs. iterative pruning trade-offs
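A sketch of the iterative prune/fine-tune loop, reusing the hypothetical `magnitude_prune` helper from the unstructured-pruning sketch above; `train_fn` is an assumed user-supplied fine-tuning routine:

```python
import torch
import torch.nn as nn

def apply_masks(model: nn.Module, masks: dict) -> None:
    """Re-zero pruned weights so fine-tuning cannot revive them."""
    params = dict(model.named_parameters())
    with torch.no_grad():
        for name, mask in masks.items():
            params[name].mul_(mask)

def iterative_prune(model, train_fn, target_sparsity=0.9, rounds=5):
    """Gradually prune to target_sparsity over several prune/fine-tune rounds.

    A linear sparsity schedule is assumed here; polynomial schedules are
    common in practice. One-shot pruning is the rounds=1 special case.
    """
    masks = {}
    for r in range(1, rounds + 1):
        sparsity = target_sparsity * r / rounds   # linear sparsity schedule
        masks = magnitude_prune(model, sparsity)
        train_fn(model)                # recover accuracy at this sparsity
        apply_masks(model, masks)      # keep pruned weights at zero
    return masks
```

A Lottery Ticket Hypothesis variant would additionally snapshot the initial weights and rewind surviving weights to those values after each round instead of continuing from the fine-tuned state.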
Metrics (see the sketch after this list):
- Sparsity ratio
- FLOPs reduction
- Memory reduction
- Accuracy degradation
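These metrics are cheap to read off the masked model; a sketch of a sparsity report (the memory figure assumes dense fp32 storage vs. storing nonzero values only, and ignores sparse-index overhead):

```python
import torch.nn as nn

def sparsity_report(model: nn.Module) -> dict:
    """Report the global sparsity ratio and approximate memory reduction."""
    total = zeros = 0
    for p in model.parameters():
        total += p.numel()
        zeros += (p == 0).sum().item()
    sparsity = zeros / total
    return {
        "sparsity_ratio": sparsity,          # fraction of zeroed weights
        "params_total": total,
        "params_nonzero": total - zeros,
        "memory_reduction": sparsity,        # upper bound, see note above
    }
```

FLOPs reduction and accuracy degradation need the task's dataloader and a profiler, so they belong in the benchmark harness rather than in this helper.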
Use Cases
- Deploy large models on edge devices
- 50-90% parameter reduction with minimal accuracy loss
- Faster inference
- Lower memory footprint
Architecture
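The issue leaves this section empty. Purely as an illustration of one possible shape, a hypothetical `Pruner` base class that the strategies above could plug into; all names and signatures here are assumptions, not an agreed design:

```python
from abc import ABC, abstractmethod
import torch.nn as nn

class Pruner(ABC):
    """Hypothetical base class; concrete strategies override compute_masks."""

    def __init__(self, model: nn.Module, target_sparsity: float):
        self.model = model
        self.target_sparsity = target_sparsity
        self.masks: dict = {}

    @abstractmethod
    def compute_masks(self) -> dict:
        """Rank parameters (magnitude, gradient, movement, ...) into masks."""

    def step(self) -> None:
        """Hook called after each optimizer step to re-apply self.masks."""
```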
Success Criteria
- Prune BERT to 50% sparsity with <1% accuracy loss
- Prune ResNet to 70% sparsity
- Integration with training loop (see the sketch below)
- Benchmarks on standard datasets
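For the training-loop integration criterion, the key detail is re-applying the masks after every optimizer step so fine-tuning cannot revive pruned weights; a minimal sketch with placeholder names:

```python
import torch
import torch.nn as nn

def finetune_pruned(model, loader, masks, epochs=3, lr=1e-3):
    """Fine-tune a pruned model while holding pruned weights at zero.

    `masks` maps parameter names to binary masks (see the sketches above);
    `loader` yields (inputs, targets). Everything else is standard training.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    params = dict(model.named_parameters())
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
            with torch.no_grad():            # re-zero pruned weights
                for name, mask in masks.items():
                    params[name].mul_(mask)
```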