[Cherry-Pick][RL] Reuse GDR checkpoint transfer handle(#8078) #8079

Merged
Jiang-Jia-Jun merged 1 commit into
PaddlePaddle:release/2.6 from
jackyYang6:jacky/optimize-checkpoint-transfer-handle-init-release-2.6
Jun 26, 2026
@jackyYang6

Contributor

Motivation

Avoid repeated CheckpointTransfer initialization during GDR dynamic weight updates. Reusing the initialized handle reduces repeated setup overhead across multiple update steps.

Modifications

  • Cache the GDR CheckpointTransfer handle in DynamicWeightManager.
  • Lazily initialize the handle on the first GDR weight update.
  • Reuse the cached handle for later update_weights_by_gdr calls.
  • Destroy and reset the cached handle when an update fails.
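The lifecycle in the list above can be illustrated with a minimal sketch. It is not the code from this PR: `StubCheckpointTransfer`, its `receive` and `close` methods, the `handle_factory` parameter, and the `_get_gdr_handle` helper are all hypothetical names chosen for illustration. Only `DynamicWeightManager`, `CheckpointTransfer`, `update_weights_by_gdr`, and `_destroy_gdr_handle` appear in the PR discussion, and their real signatures may differ.

```python
class StubCheckpointTransfer:
    """Hypothetical stand-in for CheckpointTransfer; counts how often it is set up."""

    init_count = 0

    def __init__(self):
        # In the real system this is the expensive setup the PR avoids repeating.
        StubCheckpointTransfer.init_count += 1
        self.closed = False

    def receive(self, step, fail=False):
        # Hypothetical transfer call; `fail` simulates a failed update.
        if fail:
            raise RuntimeError(f"transfer failed at step {step}")
        return f"weights@{step}"

    def close(self):
        self.closed = True


class DynamicWeightManagerSketch:
    """Illustrative manager that caches one transfer handle across updates."""

    def __init__(self, handle_factory=StubCheckpointTransfer):
        self._handle_factory = handle_factory
        self._gdr_handle = None  # nothing is created until the first update

    def _get_gdr_handle(self):
        # Lazy initialization: build the handle only on first use.
        if self._gdr_handle is None:
            self._gdr_handle = self._handle_factory()
        return self._gdr_handle

    def _destroy_gdr_handle(self):
        # Reset the cache first so a failing close() cannot leave a stale handle.
        handle, self._gdr_handle = self._gdr_handle, None
        if handle is not None:
            handle.close()

    def update_weights_by_gdr(self, step, fail=False):
        handle = self._get_gdr_handle()
        try:
            return handle.receive(step, fail=fail)
        except Exception:
            # On failure, drop the cached handle so the next call rebuilds it.
            self._destroy_gdr_handle()
            raise
```

In this sketch, consecutive successful calls to `update_weights_by_gdr` share one handle, while a failed call closes the handle and forces the next call to construct a fresh one.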

Usage or Command

No new user-facing command. Existing GDR weight update flow is unchanged.

Accuracy Tests

Not applicable. This PR only changes checkpoint-transfer handle initialization behavior and does not affect model outputs.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No unit tests added because this is a handle lifecycle optimization for GDR runtime behavior.
  • Provide accuracy results. Not applicable; no model output changes.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

This comment was marked as outdated.

@jackyYang6 jackyYang6 force-pushed the jacky/optimize-checkpoint-transfer-handle-init-release-2.6 branch from 6ae8098 to 4b77ebf on June 25, 2026 11:48
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-25 20:03:04

📋 Review Summary

PR overview: Cache and reuse the CheckpointTransfer handle on the GDR path of RL dynamic weight updates.
Change scope: fastdeploy/rl/dynamic_weight_manager.py, tests/rl/test_dynamic_weight_gdr.py
Impact tag: [RL]

Issues

No blocking issues found. PR convention issues are reported in the section below and are not repeated here.

Status of Previous Findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | _destroy_gdr_handle silently swallows cleanup failures. | ⚠️ Still present |

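📝 PR Convention Check

Non-compliant: the current title [Cherry-Pick][RL] Reuse GDR checkpoint transfer handle is missing the source develop PR number that the Cherry-Pick title convention for release branches requires. The expected format is [Cherry-Pick][Tag] Title description(#original PR number). Neither the current context nor the local git history contains a verifiable source PR number, so no number is invented here; please add the real source PR number and then use this format. The PR description already includes Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist, so its structure complies with §D2.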

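Overall Assessment

The implementation and the new unit tests cover the main paths: handle reuse and rebuilding after a failure. No new blocking code issues were found. The previously reported issue of cleanup exceptions being silently swallowed remains; a follow-up should at least log them to aid troubleshooting.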
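The suggestion to log cleanup failures instead of swallowing them could look roughly like the sketch below. It assumes a handle object with a `close()` method; the function name `destroy_handle_with_logging`, the logger name, and the boolean return value are hypothetical, and the real `_destroy_gdr_handle` may be structured differently.

```python
import logging

logger = logging.getLogger("gdr_handle_sketch")  # hypothetical logger name


def destroy_handle_with_logging(handle):
    """Close a cached handle, logging any cleanup failure.

    Returns True if cleanup succeeded (or there was nothing to clean up),
    and False if close() raised. `handle.close()` is an assumed interface.
    """
    if handle is None:
        return True
    try:
        handle.close()
        return True
    except Exception:
        # Keep the update path alive, but leave a traceback for troubleshooting.
        logger.exception("GDR handle cleanup failed; dropping cached handle")
        return False
```

Whether a cleanup failure should also be re-raised is a design choice; this sketch only records it, so the caller can still reset its cached handle and retry on the next update.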

@jackyYang6 jackyYang6 changed the title from "[Cherry-Pick][RL] Reuse GDR checkpoint transfer handle" to "[Cherry-Pick][RL] Reuse GDR checkpoint transfer handle(#8078)" on Jun 25, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 82.60870% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@ae64731). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/rl/dynamic_weight_manager.py | 82.60% | 3 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #8079   +/-   ##
==============================================
  Coverage               ?   71.60%           
==============================================
  Files                  ?      386           
  Lines                  ?    55790           
  Branches               ?     8764           
==============================================
  Hits                   ?    39951           
  Misses                 ?    13023           
  Partials               ?     2816           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 71.60% <82.60%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 82c7c7a into PaddlePaddle:release/2.6 Jun 26, 2026
34 of 37 checks passed