[RL] Reuse GDR checkpoint transfer handle #8078
ee3f166 to b69ad2a
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8078 +/- ##
==========================================
Coverage ? 67.52%
==========================================
Files ? 475
Lines ? 66907
Branches ? 10317
==========================================
Hits ? 45182
Misses ? 18857
Partials ? 2868
Flags with carried forward coverage won't be shown.
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-06-26 11:04:01
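📋 Review Summary
PR overview: Cache the GDR CheckpointTransfer handle so that dynamic weight updates do not re-initialize the transfer handle each time.
Scope of changes: fastdeploy/rl/dynamic_weight_manager.py, tests/rl/test_dynamic_weight_gdr.py
Impact tag: [RL]
Issues
No new blocking issues found. PR convention issues are reported in the section below and are not repeated here.
Status of Previous Findings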
| Finding | Issue | Status |
|---|---|---|
| F1 | _destroy_gdr_handle() swallows exceptions from cleanup() without logging anything. | |
| F2 | The cached GDR CheckpointTransfer is not released on the sleep/clear weights path. | |
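Findings F1 and F2 can be illustrated with a minimal sketch. It is not the code in fastdeploy/rl/dynamic_weight_manager.py: only `_destroy_gdr_handle()` and `cleanup()` are named in the review, while `WeightManagerSketch`, `FailingHandle`, `clear_weights`, the `_gdr_handle` attribute, and the logger name are hypothetical stand-ins. Under those assumptions, one way to address both findings is to log the cleanup failure instead of silently swallowing it, and to call the destroy helper from the clear/sleep path:

```python
import logging

# Hypothetical logger name; the real module may use a different logger.
logger = logging.getLogger("gdr_handle_sketch")


class FailingHandle:
    """Hypothetical stand-in for a transfer handle whose cleanup() raises."""

    def cleanup(self):
        raise RuntimeError("simulated cleanup failure")


class WeightManagerSketch:
    """Hypothetical owner of a cached GDR handle."""

    def __init__(self, handle=None):
        self._gdr_handle = handle  # assumed attribute name

    def _destroy_gdr_handle(self):
        # Drop the cached reference first so a failing cleanup() cannot
        # leave a half-destroyed handle behind for later reuse.
        handle, self._gdr_handle = self._gdr_handle, None
        if handle is None:
            return
        try:
            handle.cleanup()
        except Exception:
            # F1: record the failure with its traceback instead of hiding it.
            logger.warning("GDR handle cleanup failed", exc_info=True)

    def clear_weights(self):
        # F2: release the cached handle when weights are cleared
        # (hypothetical name for the sleep/clear path).
        self._destroy_gdr_handle()


manager = WeightManagerSketch(FailingHandle())
manager.clear_weights()  # logs a warning; the cached handle is dropped
```

Clearing the reference before calling `cleanup()` is a design choice of this sketch: the next update would then create a fresh handle rather than reuse one whose teardown failed partway.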
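📝 PR Convention Check
Compliant. The title uses the official [RL] tag, and the PR description contains the Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist sections required by checklist §D2.
Overall Assessment
This round traced, in risk-priority order, GDR handle creation, reuse, and exception cleanup, the runner update/clear/sleep call chain, and the newly added unit tests. Apart from the unresolved previous findings, no new issues requiring inline comments were found.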
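The CI report is generated from the following code (refreshed every 30 minutes):
1 Required tasks: 8/10 passed
2 Failure details
🔴 Approval: approval required (confidence: high). This job needs manual approval; CI continues only after the approval is granted.
🔴 xpu_8cards_case_test / run_xpu_8cards_cases: flaky issue (confidence: medium). Analyzer: generic analysis (fallback)
Key logs:
Suggested fix:
Related changes: This PR only modifies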
Motivation
Avoid repeated `CheckpointTransfer` initialization during GDR dynamic weight updates. Reusing the initialized handle reduces repeated setup overhead across multiple update steps.
Modifications
- Cache the `CheckpointTransfer` handle in `DynamicWeightManager`.
- Reuse it across `update_weights_by_gdr` calls.
Usage or Command
No new user-facing command. Existing GDR weight update flow is unchanged.
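The reuse pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `FakeCheckpointTransfer`, its `receive` method, `DynamicWeightManagerSketch`, `_get_gdr_handle`, and the `_gdr_handle` attribute are hypothetical, and the real `CheckpointTransfer` constructor and methods are not shown in this PR page. Only the idea of initializing the handle once and reusing it across `update_weights_by_gdr` calls comes from the description.

```python
class FakeCheckpointTransfer:
    """Hypothetical stand-in for CheckpointTransfer; counts initializations."""

    init_count = 0

    def __init__(self):
        # Stands in for the expensive setup the PR wants to avoid repeating.
        FakeCheckpointTransfer.init_count += 1

    def receive(self, step):
        # Hypothetical method; returns a label instead of real weights.
        return f"weights@{step}"


class DynamicWeightManagerSketch:
    """Hypothetical manager that creates the handle lazily and caches it."""

    def __init__(self, transfer_factory=FakeCheckpointTransfer):
        self._transfer_factory = transfer_factory
        self._gdr_handle = None  # assumed attribute name

    def _get_gdr_handle(self):
        # Initialize on first use only; later calls return the cached handle.
        if self._gdr_handle is None:
            self._gdr_handle = self._transfer_factory()
        return self._gdr_handle

    def update_weights_by_gdr(self, step):
        return self._get_gdr_handle().receive(step)


manager = DynamicWeightManagerSketch()
results = [manager.update_weights_by_gdr(step) for step in range(3)]
```

In this sketch three update calls trigger a single handle initialization, which is the setup overhead the Motivation section describes saving.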
Accuracy Tests
Not applicable. This PR only changes checkpoint-transfer handle initialization behavior and does not affect model outputs.
Checklist
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.