
Conversation

rainyfly (Collaborator) commented Aug 1, 2025

Description

If we want to run PD disaggregated deployment in FD, we should use the splitwise scheduler to distribute tasks and use Redis to synchronize instance meta info, user requests, and generated results. Task dispatch is handled inside the LLMEngine by the splitwise scheduler after a request is received.

We also want to support an external module for dispatching tasks. The external module dispatches tasks to P and D instances, sends requests directly to the scheduled LLMEngine, and receives the responses.
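A minimal sketch of what such an external dispatcher could look like, assuming a plain pyzmq request/reply exchange; the endpoint, port, and message fields are illustrative assumptions, not FastDeploy's actual wire format:

```python
# Hypothetical external dispatcher; the endpoint and message schema are
# illustrative assumptions, not FastDeploy's actual wire format.
import zmq


def dispatch_to_instance(endpoint: str, request: dict) -> dict:
    """Send one request to a scheduled P or D instance and wait for the reply."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)  # simple request/reply, for illustration only
    try:
        sock.connect(endpoint)  # e.g. "tcp://10.0.0.5:8201"
        sock.send_json(request)
        return sock.recv_json()
    finally:
        sock.close()
```

In a real deployment the external module would first pick a P or D instance (e.g. from the instance metadata synchronized through Redis) and then send the request to that instance's endpoint.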


paddle-bot bot commented Aug 1, 2025

Thanks for your contribution!

Jiang-Jia-Jun requested a review from Copilot August 1, 2025 08:42
Jiang-Jia-Jun (Collaborator) commented:

Also, a CI check is missing here: we need to add a test similar to https://github.com/PaddlePaddle/FastDeploy/tree/develop/test/ci_use/EB_Lite to cover usage of the new interfaces.

Copilot

This comment was marked as outdated.

rainyfly requested a review from Copilot August 1, 2025 09:31
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds support for EP (expert parallelism) PD (prefill-decode) disaggregated deployment with external module support. It introduces infrastructure for external modules to dispatch tasks directly to P and D instances over TCP-based ZMQ communication, plus a new "dp" scheduler alongside the existing splitwise scheduler.

  • Adds ZmqTcpServer/ZmqIpcServer for TCP/IPC communication modes controlled by FD_ENABLE_INTERNAL_ADAPTER (a sketch follows after this list)
  • Introduces DPScheduler and DPLocalScheduler for external module task dispatch
  • Implements InternalAdapter for external module control commands (get_payload, get_metrics, connect_rdma)
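As a rough illustration of the first bullet: a reply socket could be bound over TCP or IPC depending on FD_ENABLE_INTERNAL_ADAPTER. Everything below other than that environment variable name is an assumption, not the PR's actual ZmqTcpServer/ZmqIpcServer code.

```python
# Sketch only: mirrors the TCP-vs-IPC selection described above.
# FD_ENABLE_INTERNAL_ADAPTER comes from the PR; the rest is assumed.
import os

import zmq


def create_reply_socket(ipc_path: str, tcp_port: int) -> zmq.Socket:
    """Bind a REP socket over TCP when the internal adapter is enabled,
    otherwise over local IPC."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)
    if os.getenv("FD_ENABLE_INTERNAL_ADAPTER", "0") == "1":
        sock.bind(f"tcp://*:{tcp_port}")  # reachable by external modules
    else:
        sock.bind(f"ipc://{ipc_path}")  # same-host processes only
    return sock
```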

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| fastdeploy/splitwise/internal_adapter_utils.py | New InternalAdapter class for handling external module control commands |
| fastdeploy/scheduler/dp_scheduler.py | New DPScheduler and DPLocalScheduler for distributed processing |
| fastdeploy/scheduler/config.py | Added DPLocalSchedulerConfig and dp scheduler support |
| fastdeploy/inter_communicator/zmq_server.py | New ZMQ server implementations for TCP and IPC communication |
| fastdeploy/inter_communicator/zmq_client.py | Refactored ZMQ client with base class and IPC implementation |
| fastdeploy/inter_communicator/engine_worker_queue.py | Added RDMA connection task queues and management |
| fastdeploy/inter_communicator/__init__.py | Updated imports for new ZMQ classes |
| fastdeploy/envs.py | Added environment variables for internal adapter configuration |
| fastdeploy/entrypoints/engine_client.py | Updated to use ZmqIpcClient instead of ZmqClient |
| fastdeploy/engine/expert_service.py | Added support for dp scheduler and internal adapter |
| fastdeploy/engine/engine.py | Enhanced with TCP/IPC server selection and dp scheduler support |
| fastdeploy/engine/args_utils.py | Added splitwise_role to scheduler config fields |
| fastdeploy/cache_manager/cache_transfer_manager.py | Added data_parallel_size parameter |
| fastdeploy/cache_manager/cache_messager.py | Enhanced with RDMA connection handling and data parallel support |
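To make the internal_adapter_utils.py row concrete: the control commands listed in the overview (get_payload, get_metrics, connect_rdma) suggest a small dispatch loop like the hypothetical sketch below; the reply payloads are placeholders, not the real InternalAdapter implementation.

```python
# Hypothetical control-command loop; the command names come from the PR
# overview, but the reply payloads here are placeholders.
import zmq


def serve_control_commands(sock: zmq.Socket) -> None:
    """Answer external-module control requests on an already-bound REP socket."""
    while True:
        msg = sock.recv_json()
        cmd = msg.get("cmd")
        if cmd == "get_payload":
            reply = {"payload": {"num_waiting": 0, "num_running": 0}}
        elif cmd == "get_metrics":
            reply = {"metrics": {"available_kv_blocks": 0}}
        elif cmd == "connect_rdma":
            reply = {"status": "ok"}  # would hand off to the cache messager
        else:
            reply = {"error": f"unknown command: {cmd!r}"}
        sock.send_json(reply)
```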

CLAassistant commented Aug 3, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ rainyfly
✅ EmmonsCurse
❌ root


root does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
