Skip to content

[Fix] discard failed and filtered rollout_state#1922

Merged
YanhuiDua merged 6 commits into
InternLM:mainfrom
YanhuiDua:discard-filtered
Jun 18, 2026
Merged

[Fix] discard failed and filtered rollout_state#1922
YanhuiDua merged 6 commits into
InternLM:mainfrom
YanhuiDua:discard-filtered

Conversation

@YanhuiDua

@YanhuiDua YanhuiDua commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

修复点

  1. 将failed / filtered的样本直接丢弃,也就是rollout_state中全部字段清空,并且不再放到replay buffer中去
  2. 由于修复点1,那就不能再从replaybuffer中统计leftover的failed和filtered的样本,于是改成在ProduceProgress中进行统计

修复后内存曲线

  • 机0:
img_v3_0212o_8ad1e9ac-c7fb-4093-825f-77b6697fdd0g
  • 机1:
img_v3_0212o_0ba2373d-417f-449d-aa06-205a0369ca4g

@YanhuiDua

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread xtuner/v1/rl/replay_buffer.py Outdated
Comment thread xtuner/v1/data_proto/rl_data.py Outdated
Comment thread xtuner/v1/rl/utils/ray_utils.py
@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Summary

Discards FAILED and FILTERED rollout groups in ReplayBuffer.put() to prevent unbounded accumulation, and adds free_rollout_state_refs / discard_rollout_state helpers to release Ray object refs. Also fixes reset_rollout_response to handle list[ObjectRef] routed_experts.

ProduceBatchResult impact: leftover_failed and leftover_filtered will now always be 0 (groups discarded before storage). This is the intended fix. However, the in-place mutation bug below can cause FAILED/FILTERED groups to be miscounted as COMPLETED by the producer.

RoutedExperts impact: Positive — fixes a gap where list[ObjectRef] routed_experts were not freed in reset_rollout_response. New free_rollout_state_refs properly handles all ref types.

Issues

Critical

  • replay_buffer.py:466-471discard_rollout_state mutates items in-place, resetting status to INIT. The caller (put_generated_group at produce_utils.py:187) re-reads get_group_status(group) after put() returns and sees COMPLETED for what was actually a FAILED/FILTERED group. This inflates the producer's completed_sample_count and can cause premature termination.

Warning

  • rl_data.py:167-178 — Setting the required field message to None creates an invalid RolloutState that would crash any code reading it.
  • Missing regression test — per project standards, bug fix PRs must include a test that reproduces the original issue.

Nit

  • ray_utils.py:97-105id()-based dedup operates on Python object identity, not Ray ObjectRef identity. Two wrappers for the same Ray object won't be deduplicated.
  • free_rollout_state_refs is a public function missing a docstring.

Verdict

REQUEST_CHANGES

@YanhuiDua

Copy link
Copy Markdown
Collaborator Author

@claude review

result.filtered_samples = status_counts.get(Status.FILTERED, 0)


def _merge_discarded_counts(result: ProduceBatchResult, progress: Any, task_name: str) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Nit — progress: Any bypasses mypy in a strict-checked path.

Both ProduceProgress and DisaggProduceProgress implement consume_discarded(task_name) -> tuple[int, int], but using Any here skips static verification. A lightweight Protocol or explicit union type would catch future signature drift:

Suggested change
def _merge_discarded_counts(result: ProduceBatchResult, progress: Any, task_name: str) -> None:
def _merge_discarded_counts(result: ProduceBatchResult, progress: ProduceProgress | DisaggProduceProgress, task_name: str) -> None:

(Requires importing DisaggProduceProgress — alternatively a Protocol with the consume_discarded method.)

Comment on lines +185 to +192
failed_samples_state = state.get(
"failed_samples",
state.get("failed_groups", state.get("discarded_failed_groups", {})),
)
filtered_samples_state = state.get(
"filtered_samples",
state.get("filtered_groups", state.get("discarded_filtered_groups", {})),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Nit — fallback keys "failed_groups" / "discarded_failed_groups" don't appear in any existing checkpoint format.

If these were intermediate names during development that never shipped, the nested .get() chains are dead code. Simplifying to just state.get("failed_samples", {}) makes the contract clearer and avoids future confusion about which key names are "supported."

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Updated Review (all 3 commits)

The previous review flagged critical issues in commit 1 (in-place mutation corrupting group status, None on required fields, id()-based dedup). All three issues were correctly fixed in commits 2–3:

  • In-place mutation: discard now happens in put_generated_group with initial_status captured before mutation; return False immediately.
  • Required fields: discard_rollout_state now skips required fields (continue), preserving message intact.
  • Dedup: uses ref.hex() (Ray ObjectRef identity) instead of id() (Python wrapper identity).

ProduceBatchResult impact: leftover_failed / leftover_filtered renamed to failed_samples / filtered_samples with updated semantics — now includes buffer leftovers + discarded-before-insertion counts. Buffer leftover portion will drain to 0 after deployment since FAILED/FILTERED groups no longer enter the buffer.

RoutedExperts impact: Positive — free_rollout_state_refs correctly handles all ref ownership shapes (single ObjectRef, list, and nested refs in extra_fields). reset_rollout_response now handles list[ObjectRef] routed_experts.

Remaining Issues

Nit

Verdict

APPROVE — the memory fix is correct and well-tested. Only cosmetic nits remain.

Comment thread xtuner/v1/rl/agent_loop_manager/produce_utils.py Outdated
@YanhuiDua YanhuiDua merged commit f099c89 into InternLM:main Jun 18, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants