[RoadMap][Call For Contributions] Mooncake Store V3 Roadmap #1035

@stmatengss

Milestone 1: Core Architecture Refactor & Decoupling

This milestone focuses on foundational architectural changes to improve modularity, flexibility, and prepare for future scaling.

  • (TE/Store Separation): Decouple the TE (Transfer Engine) and Store components into separate, independent packages.
  • (Client/Worker Decoupling): Decouple the dummy client from the worker to remove strong dependencies. 
  • (Flexible Deployment): Update the Store to support various flexible deployment models, such as client-only, client + master, etc. 
  • (Tensor-native APIs): Extend the Put/Get Tensor APIs to carry TP (tensor-parallel) rank and model info.
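
To make the tensor-native API item more concrete, here is a minimal sketch of what a key carrying model and TP-rank info could look like. Everything in it (`TensorKey`, `InMemoryTensorStore`, `put_tensor`, `get_tensor`, the key layout) is a hypothetical illustration backed by a plain dict, not the existing Mooncake Store API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TensorKey:
    """Hypothetical key layout; not the actual Mooncake Store key format."""
    model: str     # model identifier
    tp_rank: int   # tensor-parallel rank that owns this shard
    tp_size: int   # total number of tensor-parallel ranks
    name: str      # tensor name within the model

    def encode(self) -> str:
        # Flatten the structured key into a single string key.
        return f"{self.model}/tp{self.tp_rank}-of-{self.tp_size}/{self.name}"


class InMemoryTensorStore:
    """Stand-in for a store client, backed by a dict (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put_tensor(self, key: TensorKey, payload: bytes) -> None:
        self._data[key.encode()] = payload

    def get_tensor(self, key: TensorKey) -> Optional[bytes]:
        # Returns None when this (model, rank, name) combination is absent.
        return self._data.get(key.encode())
```

With a layout like this, two TP ranks of the same model can store shards under the same tensor name without colliding, and a reader has to state which rank's shard it wants.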

Milestone 2: Master Service Enhancements

This milestone enhances the Master component to support new storage architectures and routing logic.

  • (Key-based Routing): Implement new key-based routing capabilities in the Master service. 
  • (Metadata Adaptation - Storage): Adapt the Master's metadata management to support the new multi-level storage architecture.
  • (Recovery): Persist KV metadata so that it can be recovered.
  • (KVCache Awareness Interface): Expose the hit ratio for different layers.
  • (Metadata Adaptation - HA): Upgrade metadata schema and logic to meet new High Availability (HA) requirements.
  • (Multi-tenant): Support multi-tenancy across different models, users, and auth keys.
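
As a discussion starter for the key-based routing item, here is a minimal sketch using rendezvous (highest-random-weight) hashing. This is a technique chosen only for illustration; the roadmap does not specify a routing algorithm, and `route`, `_score`, and the worker names are hypothetical.

```python
import hashlib
from typing import List


def _score(key: str, node: str) -> int:
    # Deterministic per-(node, key) weight derived from SHA-256.
    digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def route(key: str, nodes: List[str]) -> str:
    """Pick the node with the highest score for this key."""
    if not nodes:
        raise ValueError("no nodes available")
    return max(nodes, key=lambda node: _score(key, node))


workers = ["worker-0", "worker-1", "worker-2"]
owner = route("llama/tp0-of-2/layer0.kv", workers)
```

One property that makes this family of schemes attractive for routing: removing a node that does not own a key never changes that key's owner, so only the keys of the removed node need to move.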

Milestone 3: Worker: Multi-Level Storage Architecture

This is a major epic to build the next-generation multi-level storage system within the Worker.

  • 3.1: Abstraction & Caching

    • (Storage Abstraction Layer): Design and implement the core abstraction layer for multi-level storage. 
    • (Cache Scheduling Interface): Design the abstract interface for cache scheduling logic. 
    • (Eviction Logic): Implement basic data eviction logic within the new storage architecture. Related: "[store] Add disk eviction feature" #1028
    • (LRU Cache): Implement an LRU (Least Recently Used) policy as the default cache scheduling strategy. 
    • (Local Client Cache): Keep a local cache for better performance. Related: "[RFC]: Add Local Cache Mechanism for Mooncake Store Client" #1062
  • 3.2: Storage Backend Implementation

    • (DRAM Adaptation): Adapt the storage layer for DRAM, including support for NUMA affinity. 
    • (SSD Adaptation): Adapt the storage layer for SSDs, enabling local external storage read/write capabilities. Related: "[RFC] Contribute Local Storage to Distributed Pool" #1054
    • (VRAM Adaptation): Adapt the storage layer to utilize VRAM. 
    • (Huawei NPU Adaptation): Implement support for Huawei NPUs (H2D). 
  • 3.3: Elastic KVCache Storage
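
To ground the discussion of items 3.1 and 3.2, here is a rough sketch of how a storage abstraction with per-tier LRU eviction could fit together: entries evicted from a faster tier spill into the next one, and a hit in a slower tier promotes the entry back to the top. The class names (`LRUTier`, `TieredStore`), the entry-count capacities, and the spill/promote policy are all assumptions for illustration, not the planned design.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class LRUTier:
    """One storage level (e.g. DRAM or SSD) holding at most `capacity` entries."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> Optional[Tuple[str, bytes]]:
        """Insert an entry; return the evicted (key, value) pair, if any."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            return self._items.popitem(last=False)  # least recently used
        return None

    def remove(self, key: str) -> None:
        self._items.pop(key, None)


class TieredStore:
    """Tiers ordered fastest first; evictions cascade down the list."""

    def __init__(self, tiers: List[LRUTier]) -> None:
        self.tiers = tiers

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier.remove(key)  # drop stale copies in any tier
        spilled: Optional[Tuple[str, bytes]] = (key, value)
        for tier in self.tiers:
            spilled = tier.put(*spilled)
            if spilled is None:
                return
        # Anything evicted from the last tier is dropped.

    def get(self, key: str) -> Optional[bytes]:
        for index, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if index > 0:
                    self.put(key, value)  # promote to the fastest tier
                return value
        return None
```

For example, `TieredStore([LRUTier("dram", 2), LRUTier("ssd", 2)])` keeps the two most recently used entries in the first tier and spills older ones to the second. A real implementation would track bytes rather than entry counts and would make the scheduling policy pluggable, as the Cache Scheduling Interface item suggests.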

Milestone 4: Worker: Networking & Elasticity

This milestone focuses on refactoring worker communication and enabling resource elasticity.

Milestone 5: Deployment & Operations

This milestone covers K8s integration (e.g., RBG, https://github.com/sgl-project/rbg) and build process improvements.

  • (K8s Autoscaling): Implement support for Kubernetes-based autoscaling of worker and dummy client instances.
  • (Scenario-based Builds): Implement a build system capable of producing different worker binaries optimized for different scenarios. 
  • (Integration with AI Configurator): Use AI Configurator to better size worker resources and tune other configurations.
  • (Deployment Documentation & Guides): Create comprehensive, up-to-date deployment documentation and step-by-step setup guides to simplify installation and configuration for all environments.

Milestone 6: CI & CD Enhancement

  • (End-to-end CI tests): For SGLang, add tests covering HiCache, PD, Elastic EP, and the checkpoint engine.

Milestone 7: Performance & Benchmarks

  • (Store Master Benchmark): Design and integrate a dedicated benchmark for the Mooncake store master module to evaluate throughput, latency, and scalability.
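
As a starting point for the benchmark item, here is a small sketch of a harness that reports throughput and latency percentiles for a single operation. The operation measured here (`fake_put`, a dict write) is only a stand-in for a real master metadata call, and `run_benchmark` and its result field names are hypothetical.

```python
import statistics
import time
from typing import Callable, Dict


def run_benchmark(op: Callable[[int], None], num_ops: int) -> Dict[str, float]:
    """Call `op(i)` num_ops times and summarize throughput and latency."""
    if num_ops <= 0:
        raise ValueError("num_ops must be positive")
    latencies = []
    start = time.perf_counter()
    for i in range(num_ops):
        t0 = time.perf_counter()
        op(i)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "ops": num_ops,
        "throughput_ops_per_s": num_ops / elapsed,
        "p50_s": statistics.median(latencies),
        "p99_s": latencies[p99_index],
    }


# Stand-in for a master metadata operation (assumption: a plain dict write).
metadata: Dict[str, int] = {}


def fake_put(i: int) -> None:
    metadata[f"key-{i}"] = i


result = run_benchmark(fake_put, 1000)
```

Scalability would need the same measurement repeated across client counts and key-space sizes, which this single-threaded sketch does not cover.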

Thanks for being a part of the Mooncake community! You are welcome to discuss and contribute.

If you have any ideas, leave a comment below to help shape the roadmap.
