[Models] Update SWA RoPE theta for MLA/GQA attention #8077
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-25 19:30:12
📋 Review Summary
PR overview: Wires a separate swa_rope_theta / swa_rope_emb into the SWA attention path for MLA/GQA.
Scope of changes: ForwardMeta, InputBatch / ProposerInputBatch, GPU runner, append attention backend, DeepSeek V3 MLA attention.
Impact tags: [Models] [OP]
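To illustrate what a separate RoPE base changes, here is a minimal sketch that builds two cos/sin tables from two different bases. It is not FastDeploy code: the helper names, the table layout, and the two numeric theta values are assumptions chosen for illustration; only the idea of a second base for SWA layers comes from the PR.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def build_rope_table(max_pos: int, head_dim: int, theta: float):
    """Return table[pos][i] = (cos, sin) of pos * inv_freq[i]."""
    inv = rope_inv_freq(head_dim, theta)
    return [[(math.cos(p * f), math.sin(p * f)) for f in inv] for p in range(max_pos)]


# Hypothetical values; the real ones come from the model config.
ROPE_THETA = 1_000_000.0      # base for full-attention layers
SWA_ROPE_THETA = 10_000.0     # base for sliding-window layers

full_table = build_rope_table(max_pos=8, head_dim=8, theta=ROPE_THETA)
swa_table = build_rope_table(max_pos=8, head_dim=8, theta=SWA_ROPE_THETA)
```

The two tables agree at position 0 and diverge at later positions, which is why a layer that reads the wrong table produces different rotations for the same token position.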
Issues
No new blocking issues found. PR convention issues are reported in a separate section below and are not repeated here.
Status of Historical Findings
| Finding | Issue | Status |
|---|---|---|
| F1 | Non-SWA layers of glm_moe_dsa lose rope_parameters["rope_theta"]. | Unresolved |
| F2 | The PaddleFormers fallback for rope_already_applied applies RoPE a second time on SWA layers. | Unresolved |
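Why a second RoPE application (F2) matters can be shown with a small sketch. RoPE rotates each pair of channels by an angle proportional to the position, so applying it twice is not a no-op: it is equivalent to encoding twice the position. The code below is a hypothetical, standalone illustration using complex numbers for the channel pairs; it does not reproduce the PaddleFormers or FastDeploy code paths.

```python
import cmath


def apply_rope(pairs, pos, inv_freq):
    """Rotate each channel pair (stored as a complex number) by pos * inv_freq[i]."""
    return [z * cmath.exp(1j * pos * f) for z, f in zip(pairs, inv_freq)]


def close(a, b, tol=1e-9):
    """Element-wise comparison of two lists of complex numbers."""
    return all(abs(x - y) < tol for x, y in zip(a, b))


inv_freq = [1.0, 0.1]                           # illustrative frequencies
q = [complex(1.0, 0.0), complex(0.5, -0.5)]     # illustrative query pairs
pos = 3

once = apply_rope(q, pos, inv_freq)             # intended result
twice = apply_rope(once, pos, inv_freq)         # RoPE applied a second time
as_if_pos_6 = apply_rope(q, 2 * pos, inv_freq)  # same as encoding position 6

double_equals_pos_6 = close(twice, as_if_pos_6)
double_equals_once = close(once, twice)
```

Here `double_equals_pos_6` is true and `double_equals_once` is false, so a doubly rotated query or key behaves as if the token sat at a different position.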
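📝 PR Convention Check
The title lacks an official tag, the description is still the empty template, and no accuracy/alignment results are provided; a title and description that can be used as direct replacements are given below.
Suggested title (can be copied directly):
[Models] Update SWA RoPE theta for MLA/GQA attention
Suggested PR description (can be copied directly)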
## Motivation
Use a separate RoPE base for MLA/GQA sliding-window attention layers that configure `swa_rope_theta`, so that SWA layers no longer share `rope_theta` with full-attention layers.
## Modifications
- `ForwardMeta` gains `swa_rotary_embs`, which `gpu_model_runner` passes in from `share_inputs["swa_rope_emb"]`.
- `InputBatch` and `ProposerInputBatch` additionally build `swa_rope_emb` when `swa_rope_theta` is configured.
- `AppendAttentionBackend` uses `swa_rotary_embs` when `window_attn_skip_freq[layer_id] == 1` and `swa_rope_theta` is configured.
- `DeepseekV3MLAAttention` initializes RoPE with `swa_rope_theta` for SWA layers and caches `window_attn_skip_freq`.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
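- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall Assessment
This round reviewed, in risk-priority order, the main RoPE/SWA path across the 5 changed files, and cross-checked the append attention backend, DeepSeek MLA, GPU ForwardMeta initialization, and input batch construction. Apart from the previously reported and still unresolved F1/F2, no new diff-level blocking defects were found; fixing these two issues and adding accuracy/alignment notes is still recommended before merging.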
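The per-layer selection rule described in this review (use the SWA base only when `window_attn_skip_freq[layer_id] == 1` and `swa_rope_theta` is configured, otherwise keep `rope_theta`) can be sketched as follows. `RopeConfig`, `is_swa_layer`, and `select_theta` are hypothetical names for illustration and are not FastDeploy's actual classes or functions; the fallback branch is the behavior F1 asks to preserve for non-SWA layers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RopeConfig:
    """Hypothetical container for the three config fields named in the review."""
    rope_theta: float
    swa_rope_theta: Optional[float] = None
    window_attn_skip_freq: Sequence[int] = ()


def is_swa_layer(cfg: RopeConfig, layer_id: int) -> bool:
    """A layer is treated as SWA when its window_attn_skip_freq entry equals 1."""
    freq = cfg.window_attn_skip_freq
    return layer_id < len(freq) and freq[layer_id] == 1


def select_theta(cfg: RopeConfig, layer_id: int) -> float:
    """Pick the SWA base only for SWA layers with swa_rope_theta configured."""
    if cfg.swa_rope_theta is not None and is_swa_layer(cfg, layer_id):
        return cfg.swa_rope_theta
    # Non-SWA layers, and models without swa_rope_theta, keep rope_theta.
    return cfg.rope_theta


# Illustrative values only.
cfg = RopeConfig(rope_theta=1e6, swa_rope_theta=1e4, window_attn_skip_freq=[1, 0, 1])
thetas = [select_theta(cfg, i) for i in range(3)]
```

With these illustrative values, layers 0 and 2 pick the SWA base and layer 1 keeps `rope_theta`; a config without `swa_rope_theta` falls back to `rope_theta` for every layer.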
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8077 +/- ##
==========================================
Coverage ? 67.62%
==========================================
Files ? 475
Lines ? 66909
Branches ? 10321
==========================================
Hits ? 45249
Misses ? 18813
Partials ? 2847
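CI report generated from the following code (updated every 30 minutes):
1 Required tasks: 9/10 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)
Analyzer: generic analysis (fallback)
Failed case: coverage threshold check
Key logs:
Fix suggestions:
Related changes: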
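Motivation
Use a separate RoPE base for MLA/GQA sliding-window attention layers that configure `swa_rope_theta`, so that SWA layers no longer share `rope_theta` with full-attention layers.
Modifications
- `ForwardMeta` gains `swa_rotary_embs`, which `gpu_model_runner` passes in from `share_inputs["swa_rope_emb"]`.
- `InputBatch` and `ProposerInputBatch` additionally build `swa_rope_emb` when `swa_rope_theta` is configured.
- `AppendAttentionBackend` uses `swa_rotary_embs` when `window_attn_skip_freq[layer_id] == 1` and `swa_rope_theta` is configured.
- `DeepseekV3MLAAttention` initializes RoPE with `swa_rope_theta` for SWA layers and caches `window_attn_skip_freq`.
Usage or Command
N/A
Accuracy Tests
N/A
Checklist
- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.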