Problem
MISSING: Graph-level optimization passes that fuse operators and restructure computational graphs for faster inference.
Existing
- #280: ONNX Export and ONNX Runtime (CPU/DirectML) Execution (complementary)
- #277: Inference Optimizations – KV Cache, RoPE, and FlashAttention (covers some inference optimizations)
Missing Implementations
Operator Fusion (CRITICAL):
- Conv + BatchNorm + ReLU fusion
- Matmul + Bias + Activation fusion
- Elementwise operation fusion
- Multi-head attention fusion
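As a concrete illustration of the Conv + BatchNorm part of the first bullet, here is a minimal NumPy sketch of folding BatchNorm statistics into the preceding convolution's weights and bias. The conv is stood in for by a per-channel matmul, and `fold_bn_into_conv` is a hypothetical helper name, not an existing API:

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    # Per-output-channel scale derived from the BatchNorm statistics.
    scale = gamma / np.sqrt(var + eps)
    W_f = W * scale[:, None]          # scale each output channel's weights
    b_f = (b - mean) * scale + beta   # fold the mean shift into the bias
    return W_f, b_f

rng = np.random.default_rng(0)
C_out, C_in = 4, 3
W = rng.standard_normal((C_out, C_in))
b = rng.standard_normal(C_out)
gamma, beta = rng.standard_normal(C_out), rng.standard_normal(C_out)
mean, var = rng.standard_normal(C_out), rng.random(C_out) + 0.1

x = rng.standard_normal(C_in)
# Unfused: conv (matmul stand-in) followed by a separate BatchNorm.
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
# Fused: a single conv with folded weights, no BatchNorm at runtime.
W_f, b_f = fold_bn_into_conv(W, b, gamma, beta, mean, var)
y_fused = W_f @ x + b_f
assert np.allclose(y_ref, y_fused)
```

Because BatchNorm at inference time is an affine map per channel, the fold is exact; a trailing ReLU can then be applied in the same pass.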
Graph Optimization (CRITICAL):
- Constant folding
- Dead code elimination
- Common subexpression elimination
- Layout optimization (NCHW vs NHWC)
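The first two bullets can be sketched on a toy expression graph. The `optimize` function, the node encoding, and the op set are illustrative assumptions, not the project's actual IR:

```python
def optimize(nodes, outputs):
    """nodes: name -> ("const", value) | ("input", []) | (op, [input names]),
    assumed to be listed in topological order. Runs constant folding, then
    dead-code elimination from the requested outputs."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    # Constant folding: evaluate any op whose inputs are all constants.
    for name, node in list(nodes.items()):
        if node[0] in ops and all(nodes[i][0] == "const" for i in node[1]):
            vals = [nodes[i][1] for i in node[1]]
            nodes[name] = ("const", ops[node[0]](*vals))
    # Dead-code elimination: keep only nodes reachable from the outputs.
    live, stack = set(), list(outputs)
    while stack:
        n = stack.pop()
        if n in live:
            continue
        live.add(n)
        if nodes[n][0] != "const":
            stack.extend(nodes[n][1])
    return {k: v for k, v in nodes.items() if k in live}

nodes = {
    "two":   ("const", 2.0),
    "three": ("const", 3.0),
    "x":     ("input", []),
    "c":     ("mul", ["two", "three"]),   # folds to ("const", 6.0)
    "y":     ("add", ["x", "c"]),
    "dead":  ("add", ["two", "three"]),   # unreachable from "y", eliminated
}
opt = optimize(nodes, ["y"])
assert opt["c"] == ("const", 6.0)
assert "dead" not in opt
```

Common subexpression elimination would be one more pass over the same structure, keying nodes by `(op, inputs)` and merging duplicates.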
Memory Optimization (HIGH):
- In-place operations
- Memory reuse
- Gradient checkpointing integration
- Activation memory planning
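A minimal sketch of activation memory planning via greedy buffer reuse over liveness intervals; the `plan_buffers` helper and the lifetime encoding are hypothetical, assuming equal-sized activations and a fixed topological schedule:

```python
def plan_buffers(lifetimes):
    """lifetimes: {tensor: (first_use, last_use)} as step indices in a
    topologically ordered schedule. Assigns each tensor a buffer id,
    reusing any buffer whose previous occupant is already dead."""
    assignment, in_use = {}, []   # in_use: list of (last_use, buffer_id)
    next_buf = 0
    for t, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        # Reuse a buffer whose occupant died strictly before this tensor.
        reusable = [entry for entry in in_use if entry[0] < start]
        if reusable:
            in_use.remove(reusable[0])
            buf = reusable[0][1]
        else:
            buf = next_buf            # no dead buffer available: allocate
            next_buf += 1
        in_use.append((end, buf))
        assignment[t] = buf
    return assignment, next_buf

# Four chained activations, each consumed one step after it is produced:
plan, n_buffers = plan_buffers({"a": (0, 1), "b": (1, 2),
                                "c": (2, 3), "d": (3, 4)})
assert n_buffers == 2   # 4 tensors, but only 2 live at any step
```

Real planners also weigh tensor sizes and alignment, but the liveness-interval core is the same idea.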
Computation Optimization (HIGH):
- Algebraic simplification
- Strength reduction
- Loop fusion
- Vectorization hints
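Algebraic simplification and strength reduction can be sketched as bottom-up rewrite rules on a toy expression tree. The `simplify` function and the tuple encoding are illustrative assumptions, and the x * 0 -> 0 rule deliberately ignores NaN/Inf propagation:

```python
def simplify(expr):
    """expr: nested tuples ("op", lhs, rhs), a number, or a variable name.
    Applies a few identities and strength reductions bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == "mul":
        if b == 1: return a                  # x * 1 -> x
        if a == 1: return b
        if a == 0 or b == 0: return 0        # x * 0 -> 0 (ignores NaN/Inf)
        if b == 2: return ("add", a, a)      # strength reduction: x*2 -> x+x
    if op == "add":
        if b == 0: return a                  # x + 0 -> x
        if a == 0: return b
    if op == "pow" and b == 2:
        return ("mul", a, a)                 # x**2 -> x*x
    return (op, a, b)

assert simplify(("mul", ("add", "x", 0), 1)) == "x"
assert simplify(("pow", "x", 2)) == ("mul", "x", "x")
```

A production pass would run such rules to a fixed point and restrict float rewrites to a fast-math mode.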
Frameworks to Compete With
- TensorRT (NVIDIA)
- TorchScript optimization
- ONNX Runtime optimizations
- TVM/Apache TVM
Architecture
Success Criteria
- 2-5x inference speedup from fusion
- Reduced memory footprint
- Integration with existing models
- Benchmarks vs TensorRT/ONNX Runtime
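For the benchmarking criterion, a micro-benchmark along these lines can compare unfused vs. partially fused execution. Numbers are machine-dependent, and this NumPy stand-in (in-place bias/ReLU instead of a real fused kernel) only illustrates the measurement methodology:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((512, 512)), rng.standard_normal(512)
x = rng.standard_normal((64, 512))

def unfused():
    # Three separate passes, each materializing a fresh array.
    y = x @ W.T
    y = y + b
    return np.maximum(y, 0.0)

def fused():
    # Bias add and ReLU applied in place on the matmul output.
    y = x @ W.T
    np.add(y, b, out=y)
    return np.maximum(y, 0.0, out=y)

assert np.allclose(unfused(), fused())   # same result either way
t_unfused = timeit.timeit(unfused, number=200)
t_fused = timeit.timeit(fused, number=200)
print(f"unfused: {t_unfused:.3f}s  fused: {t_fused:.3f}s")
```

The same harness shape (correctness check first, then timing) would carry over to comparisons against TensorRT or ONNX Runtime on real models.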