[Feature] report PD info to IM #8082
Conversation
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-26 15:08:18
📋 Review summary
PR overview: adds endpoints that let IM query FastDeploy/PD registration info, a ready health check, and a /fastdeploy/server/info reporting endpoint
Change scope: fastdeploy/entrypoints/openai/api_server.py
Impact tags: [APIServer] [PD Disaggregation]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:960 | dp_rank is a string compared against an integer, so is_master is never set to 1 |
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:985 | In async LLM mode llm_engine has no .engine, so the new info endpoint returns 500 |
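📝 PR convention check
The title contains an official tag, but every section of the current PR description is still template placeholder or empty content. Consider replacing it with the complete description below.
Suggested title (can be copied directly):
[APIServer] Report PD info to IM
Suggested PR description (can be copied directly):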
## Motivation
Report FastDeploy PD disaggregation/register information to IM, including server identity, role, resource information, connected decode nodes, and readiness status.
## Modifications
- Add `/register_info` for decode node registration metadata.
- Add `/v2/health/ready` for IM readiness checks backed by existing `/health`.
- Add `/fastdeploy/server/info` to report API server/PD fields, resource ranges, master flag, and connected decode node list.
- Start a background decode-node poller that reads `D_IP_LIST`/`DECODE_PORTS` and collects `/register_info` from decode nodes.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
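- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The direction and scope of the new endpoints are clear, but the current implementation produces a wrong result in master detection and a 500 response from the new endpoint under async LLM deployment. Fix these two functional issues and add endpoint-level tests before merging.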
with open(fed_member_file, 'r') as f:
    fed_member_list = f.read().strip().split(',')
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and \
        dp_rank == 0:
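🔴 Bug dp_rank has already been converted to a string above, so comparing it with the integer 0 here always evaluates to False.
When FED_MEMBER_FILE is configured, the current HOST_IP is the first entry in the member list, and the DP rank is 0, is_master still stays 0, so the IM side cannot identify the master node.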
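To illustrate why the condition never holds, here is a minimal sketch, not the PR's code. `compute_is_master` is a hypothetical helper and the IP addresses are made up; it assumes the rank may arrive as either an int or a numeric string and normalises it before comparing.

```python
def compute_is_master(member_list, host_ip, dp_rank):
    """Return 1 if host_ip is the first member and the DP rank is 0, else 0.

    Hypothetical helper for illustration; not part of FastDeploy.
    """
    rank = int(dp_rank)  # accepts 0 or "0"; the comparison below is int vs int
    # Checking member_list[0] avoids the ValueError that list.index raises
    # when the IP is not in the list at all.
    is_first = bool(member_list) and member_list[0] == host_ip
    return 1 if (is_first and rank == 0) else 0


members = ["10.0.0.1", "10.0.0.2"]  # made-up addresses
dp_rank_str = "0"                   # rank after an earlier str() conversion

# Pattern from the review: a str never equals an int, so this stays 0.
buggy = 1 if (members.index("10.0.0.1") == 0 and dp_rank_str == 0) else 0

# Normalised comparison: the first member with rank 0 is the master.
fixed = compute_is_master(members, "10.0.0.1", dp_rank_str)

print(buggy, fixed)  # prints "0 1"
```

Python does not raise an error when a string is compared with an integer using `==`; it simply returns False, which is why this kind of bug passes silently.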
Suggested fix: keep an integer rank for the logic checks, and convert it to a string only when building pod_name or writing the response.
dp_rank = cfg.parallel_config.local_data_parallel_id
# use str(dp_rank) where pod_name is assembled
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and dp_rank == 0:
    is_master = 1
cfg_dict["dp_rank"] = str(dp_rank)

cfg_dict["is_stopping"] = "running"
cfg_dict["is_master"] = is_master
cfg_dict["container_host_ip"] = os.getenv("HOST_IP", "None")
cfg_dict["free_block_num"] = llm_engine.engine.resource_manager.available_block_num()
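🔴 Bug Accessing llm_engine.engine.resource_manager directly here makes the new endpoint return 500 when FD_ENABLE_ASYNC_LLM=1.
In async mode, load_engine() sets the global llm_engine to an AsyncLLM. The EngineServiceClient that AsyncLLM inherits from creates the EngineService only in a child process, so the object in the main process has no .engine attribute. The existing lifecycle code in this file also uses not isinstance(llm_engine, AsyncLLM) to single out the synchronous engine path.
Suggested fix: for AsyncLLM, obtain free_block_num through a separate cross-process status query/control API, or return an explicit unavailable value in async mode. Do not read llm_engine.engine.resource_manager directly in the API server main process.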
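The second option, returning an explicit unavailable value, could look roughly like the sketch below. It covers only that fallback, not the cross-process query. `get_free_block_num`, the `UNAVAILABLE` sentinel, and the `Fake*` classes are hypothetical stand-ins for illustration; only the attribute path `engine.resource_manager.available_block_num()` comes from the diff above.

```python
UNAVAILABLE = -1  # hypothetical sentinel: value cannot be read from this process


def get_free_block_num(llm_engine):
    """Read the free block count if the engine object exposes it, else UNAVAILABLE."""
    engine = getattr(llm_engine, "engine", None)
    resource_manager = getattr(engine, "resource_manager", None)
    if resource_manager is None:
        # Async-style object: no in-process engine, so report "unavailable"
        # instead of raising AttributeError (which would surface as a 500).
        return UNAVAILABLE
    return resource_manager.available_block_num()


# Stand-ins that mimic only the attribute shapes discussed in the review.
class FakeResourceManager:
    def available_block_num(self):
        return 128


class FakeEngine:
    def __init__(self):
        self.resource_manager = FakeResourceManager()


class FakeSyncLLM:
    def __init__(self):
        self.engine = FakeEngine()


class FakeAsyncLLM:
    """No .engine attribute, like the main-process object described above."""


print(get_free_block_num(FakeSyncLLM()))   # prints 128
print(get_free_block_num(FakeAsyncLLM()))  # prints -1
```

A sentinel keeps the response schema stable for the IM side, but consumers then need to treat -1 as "unknown" rather than "no free blocks"; the cross-process query avoids that ambiguity at the cost of more plumbing.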
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8082 +/- ##
==========================================
Coverage ? 67.39%
==========================================
Files ? 475
Lines ? 67048
Branches ? 10335
==========================================
Hits ? 45187
Misses ? 18990
Partials ? 2871
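CI report generated from the following code (updated every 30 minutes):
1 Required jobs: 7/10 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)
Analyzer: generic analysis (fallback)
Failed cases:
Key logs:
The PR only modified
Suggested fix:
Related changes:
🔴 Pre Commit: PR issue (confidence: high)
Analyzer: generic analysis (fallback)
Failed cases:
Key logs:
Suggested fix:
Related changes:
🔴 Approval: approval required (confidence: high)
Analyzer: built-in analysis
Failed cases:
Key logs:
This job is an approval gate, not a code execution failure. The related CI jobs continue only after manual approval is given.
Suggested fix:
Related changes: no associated code changes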