
[Model Compression] Implement Knowledge Distillation #408

@ooples

Description

Problem

Knowledge distillation is not yet implemented. It transfers knowledge from a large teacher model to a smaller student model, letting the student reach comparable performance at a much smaller size.

Missing Implementations

Standard Distillation (CRITICAL):

  • Response-based: soft labels via temperature scaling (see the sketch after this list)
  • Feature-based: intermediate layer matching
  • Relation-based: similarity between samples
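
The response-based variant could be wired up roughly as follows; this is a minimal sketch assuming a PyTorch-style setup, and the temperature and `alpha` weighting are illustrative defaults, not values specified in this issue.

```python
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, targets,
                               temperature=4.0, alpha=0.5):
    """Soft-label (response-based) distillation loss: KL divergence between
    temperature-softened teacher/student distributions plus the usual
    hard-label cross-entropy task loss."""
    # Soften both distributions; the T^2 factor rescales gradients so the
    # soft-label term stays comparable to the hard-label term.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2

    ce_loss = F.cross_entropy(student_logits, targets)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```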

Advanced Techniques (HIGH):

  • Self-distillation (student = teacher architecture)
  • Cross-modal distillation (different modalities)
  • Online distillation (peer teaching)
  • Multi-teacher distillation (see the sketch after this list)
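
A minimal sketch of how multi-teacher distillation (and, with co-trained peers instead of fixed teachers, online distillation) could blend several teachers' predictions; PyTorch is assumed and the uniform weighting is only a placeholder choice.

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_targets(teacher_logits_list, temperature=4.0, weights=None):
    """Weighted average of temperature-softened teacher predictions."""
    n = len(teacher_logits_list)
    weights = weights if weights is not None else [1.0 / n] * n
    # Soften each teacher's logits, weight them, and sum into one distribution.
    probs = [w * F.softmax(logits / temperature, dim=-1)
             for w, logits in zip(weights, teacher_logits_list)]
    return torch.stack(probs).sum(dim=0)
```

The averaged distribution can then feed the same KL term used for single-teacher, response-based distillation.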

Task-Specific (HIGH):

  • Classification distillation
  • Object detection distillation (Faster R-CNN → YOLO)
  • Language model distillation (BERT → DistilBERT)
  • Vision transformer distillation (ViT → DeiT)

Loss Functions:

  • KL divergence for soft labels
  • MSE for feature matching
  • Attention transfer
  • Combined task loss + distillation loss (see the combined-loss sketch after this list)
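
A sketch of how the terms above might be combined into one objective; it assumes the student's intermediate feature maps have already been projected to the teacher's shapes, uses the common squared-activation formulation for attention transfer, and treats the weights `alpha`, `beta`, and `gamma` as placeholders to be tuned.

```python
import torch.nn.functional as F

def attention_map(feature):
    # Attention transfer: collapse channels of a (B, C, H, W) map by the mean
    # of squared activations, then L2-normalise so layers are comparable.
    att = feature.pow(2).mean(dim=1).flatten(1)
    return F.normalize(att, dim=1)

def combined_distillation_loss(student_logits, teacher_logits, targets,
                               student_feats, teacher_feats,
                               temperature=4.0, alpha=0.5, beta=1e-2, gamma=1e-3):
    # Task loss on hard labels.
    ce = F.cross_entropy(student_logits, targets)

    # Response-based term: KL divergence on temperature-softened logits.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # Feature-based term: MSE between matched intermediate layers.
    feat = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))

    # Attention-transfer term on the same layer pairs.
    at = sum(F.mse_loss(attention_map(s), attention_map(t))
             for s, t in zip(student_feats, teacher_feats))

    return ce + alpha * kd + beta * feat + gamma * at
```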

Use Cases

  • Create lightweight models for deployment
  • Compress GPT-3-scale models toward GPT-2 size while retaining roughly 95% of performance
  • Mobile/edge deployment
  • Faster inference

Architecture
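
One possible shape for the teacher-student integration, sketched with hypothetical class and method names (nothing below is an existing API in this repository): the teacher is frozen and only the student's parameters are optimised, so the wrapper can sit inside an existing training loop.

```python
import torch

class DistillationTrainer:
    """Hypothetical teacher-student wrapper; names are illustrative only."""

    def __init__(self, teacher, student, loss_fn, optimizer):
        # Freeze the teacher; it only provides targets, never gradients.
        self.teacher = teacher.eval()
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student
        self.loss_fn = loss_fn      # e.g. response_distillation_loss above
        self.optimizer = optimizer

    def step(self, inputs, targets):
        with torch.no_grad():
            teacher_logits = self.teacher(inputs)
        student_logits = self.student(inputs)
        loss = self.loss_fn(student_logits, teacher_logits, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```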

Success Criteria

  • DistilBERT-like results (40% smaller, ~97% of teacher performance)
  • Integration with existing training loop
  • Teacher-student architecture flexibility
  • Benchmarks on ImageNet, GLUE
