
[RFC] Contribute Local Storage to Distributed Pool #1054

@ykwd

Description

Changes proposed

This RFC proposes a new feature: contributing local SSD storage to the distributed store pool. The design introduces a unified storage backend interface with multiple backend implementations, file-storage Get/Put/Eviction workflows, and the corresponding Master-side coordination.

Motivation

Currently, Mooncake relies heavily on in-memory storage for fast access. However, memory capacity is limited and expensive. Adding support for local SSD-based storage enables:

  • Larger total storage capacity.
  • Tiered data access.
  • Improved data persistence and recovery.
  • Flexible resource allocation across heterogeneous clients.

Design Overview

1. Storage Backend: Unified Interface

All backends will implement a unified storage interface to simplify integration and extension.
We plan to support three backend implementations:

  1. File-per-key Backend (Already supported)

    • Each key-value pair is stored in a separate file.
    • Simple but inefficient for large-scale data.
  2. Bucket Backend @zhuxinjie-nz

    • Multiple key-value pairs are grouped into a single file (“bucket”).
    • Write and eviction are performed at the bucket level, improving I/O efficiency.
  3. OffsetAllocator Backend

    • A large pre-allocated file acts as a storage arena.
    • Space is allocated and released using an OffsetAllocator.
    • Enables efficient space management and fewer file operations.
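The three backends above can share one interface. The sketch below is illustrative only: the class and method names (`StorageBackend`, `put`/`get`/`evict`, `OffsetAllocatorBackend`) are assumptions for this RFC discussion, not Mooncake's actual API, and a Python `bytearray` stands in for the pre-allocated file arena.

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Hypothetical unified interface shared by all three backends."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def evict(self, key): ...

class OffsetAllocatorBackend(StorageBackend):
    """Sketch of the arena-style backend: one pre-allocated region,
    with space handed out by a simple first-fit offset allocator."""
    def __init__(self, capacity):
        self.arena = bytearray(capacity)   # stands in for the pre-allocated file
        self.free = [(0, capacity)]        # list of (offset, length) holes
        self.index = {}                    # key -> (offset, length)

    def put(self, key, value):
        need = len(value)
        for i, (off, length) in enumerate(self.free):
            if length >= need:             # first hole large enough wins
                self.arena[off:off + need] = value
                self.index[key] = (off, need)
                rest = (off + need, length - need)
                self.free[i:i + 1] = [rest] if rest[1] else []
                return
        raise MemoryError("arena full")

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None
        off, length = loc
        return bytes(self.arena[off:off + length])

    def evict(self, key):
        off, length = self.index.pop(key)
        self.free.append((off, length))    # a real allocator would coalesce holes
```

Because allocation and release touch only the allocator's bookkeeping, eviction needs no file deletion, which is the "fewer file operations" benefit noted above.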

2. Get Workflow @zhuxinjie-nz

  1. Client A issues a Get request.

  2. Master returns the replica information.

  3. Client A attempts to read from a memory replica first.

  4. If only remote file replica exists (located on Client B), then:

    • A sends an RPC to B.
    • B reads the requested data from local SSD into its local buffer memory.
    • B returns the buffer address to A, and guarantees that the buffer will not be overwritten with other data until a timeout expires (e.g. 5s).
    • A performs an RDMA read to fetch the data directly.
    • If the read completes before timeout → A returns OK.
    • If it times out → A returns Error.
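Step 4 of the Get workflow can be sketched as below. This is a minimal model of the control flow only: `rpc_to_b` and `rdma_read` are hypothetical stand-ins for the real RPC and RDMA primitives, and the 5-second lease is the example value from the text.

```python
import time

LEASE_SECONDS = 5  # example lease from the RFC: B pins the buffer for e.g. 5s

def get_from_remote_file(rpc_to_b, rdma_read, key, timeout=LEASE_SECONDS):
    """Client A's path when only a remote file replica exists on B."""
    # A asks B to stage the value; B reads SSD -> local buffer and returns
    # the buffer address, which stays valid until the lease expires.
    buf_addr = rpc_to_b(key)
    deadline = time.monotonic() + timeout
    data = rdma_read(buf_addr)          # A fetches the bytes directly from B
    if time.monotonic() <= deadline:
        return "OK", data               # read completed within the lease
    return "ERROR", None                # lease expired; buffer may be reused
```

The key design point is that A, not B, decides success: as long as the RDMA read finishes inside B's pinning window, the data is valid.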

3. Put Workflow @zhuxinjie-nz

  1. Client A issues a Put request.
  2. Master assigns the target replica to Client B.
  3. A performs an RDMA write to B.
  4. Once the write succeeds, A sends a PutEnd notification to Master.
  5. Upon receiving PutEnd, Master adds the key to B’s persistence queue.
  6. B periodically requests pending persistence tasks from Master.
  7. B obtains the persistence request, performs a BatchGetReplica to acquire the lease, and writes the data to local SSD.
  8. Once successfully persisted, B notifies Master with another PutEnd message.
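Steps 5-8 hinge on a Master-side persistence queue per client. The sketch below models just that bookkeeping; the class and method names (`Master`, `put_end`, `poll_persistence_tasks`) are illustrative assumptions, and the real flow would also carry replica metadata and lease handling.

```python
from collections import defaultdict, deque

class Master:
    """Sketch of the Master-side persistence queue (steps 5-8)."""
    def __init__(self):
        self.queues = defaultdict(deque)   # client_id -> keys awaiting persistence
        self.persisted = set()             # keys confirmed on SSD

    def put_end(self, client_id, key, on_ssd=False):
        if on_ssd:
            self.persisted.add(key)        # step 8: B confirms SSD persistence
        else:
            self.queues[client_id].append(key)  # step 5: enqueue for B

    def poll_persistence_tasks(self, client_id, limit=8):
        """Step 6: B periodically pulls a batch of pending tasks."""
        q = self.queues[client_id]
        return [q.popleft() for _ in range(min(limit, len(q)))]
```

Having B pull tasks (rather than Master pushing) lets each client persist at its own SSD bandwidth.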

Load Balancing Considerations

Clients may have heterogeneous resource configurations:

  • Client B: large memory, limited or no SSD.
  • Client C: minimal memory, large SSD.

To handle this heterogeneity, future versions will extend the Master's scheduling logic to assign each persistence task to the most suitable client. The rest of the flow remains unchanged.
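One plausible placement heuristic is sketched below: route each persistence task to the eligible client with the most free SSD space. Both the heuristic and the `free_ssd` field are assumptions for illustration; the actual scheduling policy is left open by this RFC.

```python
def pick_persistence_target(clients):
    """Illustrative heuristic for heterogeneous clients: prefer the
    client with the largest free SSD capacity; skip SSD-less clients."""
    eligible = [c for c in clients if c["free_ssd"] > 0]
    if not eligible:
        return None                       # no client can persist this key
    return max(eligible, key=lambda c: c["free_ssd"])["id"]
```

Under this policy, a memory-heavy client like B (no SSD) serves hot reads while an SSD-heavy client like C absorbs persistence work.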

4. Eviction Workflow

  1. Each client manages its local SSD storage usage.

  2. When nearing capacity, the client initiates Eviction:

    • The client sends a Remove request to Master.
    • Upon successful confirmation, the client deletes the local data.
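The eviction loop can be sketched as below. The high-watermark threshold and the key ordering are illustrative choices not fixed by this RFC; `remove_rpc` and `delete_local` stand in for the real Remove RPC and local file deletion.

```python
def maybe_evict(local_keys, used, capacity, remove_rpc, delete_local,
                high_watermark=0.9):
    """Evict local SSD entries until usage falls below the watermark.
    local_keys maps key -> size in bytes."""
    evicted = []
    for key, size in list(local_keys.items()):
        if used <= capacity * high_watermark:
            break                          # back under the watermark; stop
        if remove_rpc(key):                # Master must confirm removal first
            delete_local(key)              # only then drop the local data
            used -= size
            evicted.append(key)
    return evicted, used
```

Deleting only after Master confirms keeps Master's replica metadata authoritative: a reader never gets routed to a file the client already removed.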

5. Initialization Workflow

Upon startup, each client:

  • Reads local file metadata.
  • Validates file integrity.
  • Reports valid replicas back to Master via Put requests.

This ensures that valid persisted data is re-registered after restarts or failures.
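The startup path can be sketched as below, assuming (purely for illustration) that local metadata records a per-file checksum; `put_rpc` stands in for the Put request that re-registers a replica with Master.

```python
import hashlib

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

def reregister_on_startup(local_files, put_rpc):
    """Validate each local file against its recorded checksum and
    re-announce valid replicas via Put. local_files maps
    key -> (data, expected_checksum); the layout is an assumption."""
    registered = []
    for key, (data, expected) in local_files.items():
        if sha256_hex(data) == expected:   # integrity check
            put_rpc(key)                   # report the valid replica to Master
            registered.append(key)
        # corrupted entries are skipped; they could also be deleted here
    return registered
```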

6. Master Modifications

The Master component requires several extensions:

  1. Associate SSD segment information with each client.
  2. If a client has an SSD, associate it with a persistence queue.
  3. Extend the replica metadata structure to record the IP address and replica size, giving clients the information they need to issue remote file reads.
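The extended records might look as follows. All field and class names here are illustrative assumptions, not Mooncake's actual structures; they only show what the three extensions above need to carry.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicaInfo:
    """Extended replica metadata: IP address and size are exactly what a
    client needs to issue the remote file read described in the Get flow."""
    key: str
    ip: str              # client holding the file replica
    size: int            # bytes to read
    is_file: bool = True # distinguishes file replicas from memory replicas

@dataclass
class ClientInfo:
    """Master-side per-client record: SSD segment plus persistence queue."""
    client_id: str
    ssd_capacity: int = 0
    persistence_queue: list = field(default_factory=list)

    @property
    def has_ssd(self):
        return self.ssd_capacity > 0     # only SSD clients get a queue
```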

This feature is under active development. Suggestions and contributions are appreciated.

Related PR: #968 #1028 #1031
