Problem
Knowledge distillation is not yet implemented. Knowledge distillation transfers knowledge from a large teacher model to a smaller student model, allowing the student to reach comparable performance at a fraction of the size.
Missing Implementations
Standard Distillation (CRITICAL):
- Response-based: soft labels with temperature scaling (see the sketch after this list)
- Feature-based: matching intermediate layer activations
- Relation-based: matching pairwise similarities between samples
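A minimal sketch of the response-based variant (PyTorch, illustrative function name and defaults, not an existing interface): soften both sets of logits with a temperature T and minimize the KL divergence between the resulting distributions, scaled by T² as in Hinton et al. (2015).

```python
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student outputs.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```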
Advanced Techniques (HIGH):
- Self-distillation (student shares the teacher's architecture)
- Cross-modal distillation (teacher and student operate on different modalities)
- Online distillation (peers teach each other during training)
- Multi-teacher distillation (see the sketch after this list)
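One straightforward formulation of multi-teacher distillation (a sketch, assuming all teachers share the student's output space; the helper name is illustrative) is to have the student match the average of the teachers' softened distributions:

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=4.0):
    # Average the probability distributions produced by each (frozen) teacher.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * temperature ** 2
```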
Task-Specific (HIGH):
- Classification distillation
- Object detection distillation (e.g., Faster R-CNN → YOLO)
- Language model distillation (BERT → DistilBERT)
- Vision transformer distillation (ViT → DeiT)
Loss Functions:
- KL divergence on temperature-softened soft labels
- MSE for intermediate feature matching
- Attention transfer (matching normalized attention maps)
- Combined task loss + distillation loss (see the sketch after this list)
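A hedged sketch of how these terms could be combined (the weights `alpha`, `beta` and the optional projection layer are illustrative choices, not a fixed design):

```python
import torch.nn.functional as F

def combined_distillation_loss(
    student_logits, teacher_logits, labels,
    student_feats=None, teacher_feats=None, proj=None,
    alpha=0.5, beta=0.1, temperature=4.0,
):
    # Hard-label cross-entropy on the ground-truth targets.
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-label KL term (response-based distillation).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = alpha * task_loss + (1.0 - alpha) * kd_loss

    # Optional feature matching; `proj` maps student features to the
    # teacher's hidden size when the dimensions differ.
    if student_feats is not None and teacher_feats is not None:
        matched = proj(student_feats) if proj is not None else student_feats
        loss = loss + beta * F.mse_loss(matched, teacher_feats)
    return loss
```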
Use Cases
- Create lightweight models for deployment
- Compress GPT-3-scale models toward GPT-2 size while retaining ~95% of the teacher's performance
- Mobile/edge deployment
- Faster inference
Architecture
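No architecture is pinned down yet. One possible shape, sketched here with assumed names (`Distiller` is hypothetical, not an existing class in this repo), is a wrapper module that freezes the teacher and exposes a single forward pass so an existing training loop can treat it like any other model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Distiller(nn.Module):
    """Wraps a frozen teacher and a trainable student behind one module."""

    def __init__(self, teacher: nn.Module, student: nn.Module,
                 alpha: float = 0.5, temperature: float = 4.0):
        super().__init__()
        self.teacher = teacher.eval()
        for p in self.teacher.parameters():
            p.requires_grad_(False)          # teacher stays frozen
        self.student = student
        self.alpha = alpha
        self.temperature = temperature

    def forward(self, inputs, labels):
        with torch.no_grad():
            teacher_logits = self.teacher(inputs)
        student_logits = self.student(inputs)

        task_loss = F.cross_entropy(student_logits, labels)
        kd_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean",
        ) * self.temperature ** 2
        return self.alpha * task_loss + (1 - self.alpha) * kd_loss
```

An existing loop would then call `loss = distiller(x, y); loss.backward()` and pass only `distiller.student.parameters()` to the optimizer.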
Success Criteria
- DistilBERT-like results (≈40% smaller while retaining ≈97% of the teacher's performance)
- Integration with existing training loop
- Teacher-student architecture flexibility
- Benchmarks on ImageNet, GLUE