Summary
We noticed that the action_advantage field — which conditions the Pi05 policy's behavior via the tokenized text prompt — appears to be hardcoded to 1.0 (bin 10, i.e., "expert-level") during both deployment inference and online RL world-model rollouts. We are unsure whether this is an intentional design choice or an oversight, but we are concerned it could lead to a self-deception feedback loop: the policy generates suboptimal actions yet is consistently told "you are performing optimally." Could this cause error to accumulate over time? We would appreciate clarification from the authors on the intended design here.
Background: Two Paths for action_advantage
In RISE's Pi05 architecture, action_advantage appears in two separate roles:
| Role | Data Path | Used For |
|---|---|---|
| Prompt conditioning | action_advantage → TokenizePrompt → "Advantage: X" in LLM prefix | Conditions action generation behavior |
| RL reward signal | reward_model.predict_reward() → rm_value / conditional_advantage → PPO advantage | Guides policy gradient update direction |
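To make the prompt-conditioning row concrete, here is a minimal sketch of how a scalar advantage could end up as an "Advantage: X" string. Only the "Advantage: X" prefix and the fact that 1.0 corresponds to bin 10 come from our reading of the code; the function names `advantage_to_bin` and `build_prompt`, the rounding scheme, and the exact prompt layout are our own assumptions, not the actual TokenizePrompt implementation.

```python
def advantage_to_bin(advantage: float, num_bins: int = 10) -> int:
    """Map an advantage in [0, 1] to an integer bin in [0, num_bins].

    Hypothetical binning: the real scheme in TokenizePrompt may differ.
    """
    clipped = min(max(advantage, 0.0), 1.0)  # keep the value inside [0, 1]
    return int(round(clipped * num_bins))


def build_prompt(task: str, advantage: float) -> str:
    """Append the advantage bin to the task text (assumed prompt layout)."""
    return f"{task}\nAdvantage: {advantage_to_bin(advantage)}"


if __name__ == "__main__":
    # With the hardcoded value 1.0, the policy always sees the top bin.
    print(build_prompt("pick up the cup", 1.0))
    # A frame labeled 0.3 during offline training would see a lower bin.
    print(build_prompt("pick up the cup", 0.3))
```

Under this assumed scheme, every online step produces the same top-bin prompt regardless of how the rollout is actually going.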
What We Found
Offline training (prompt path): action_advantage is pre-computed by a trained value model via label_frame_value.py, producing frame-specific advantage values (e.g., 0.3, 0.7, 0.1). This makes sense.
Inference and online RL (prompt path): When the policy is deployed or running in the online self-improvement loop, action_advantage appears to be hardcoded rather than dynamically computed:
# policy_and_value/policy_online/rlinf/envs/roborl/roborl_env_lerobot.py:335-336
if self.with_advantage_condition:
    pseudo_action_advantage = torch.tensor(1.0)  # hardcoded to "expert"
Meanwhile, the reward model IS running in the online loop, computing real-time advantage-like values:
# policy_and_value/policy_online/rlinf/models/embodiment/openpi_action_model.py:726
if need_infer:
    result["conditional_advantage"] = (
        reward_model_value * self.config.advantage_scale
    ).clamp(-1.0, 1.0)  # ← computed but not fed back into the prompt
This conditional_advantage appears to be used as the RL reward for PPO/GRPO updates, but doesn't seem to be written back to the action_advantage field in the tokenized prompt. We may be missing something in the data flow — please correct us if so.
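The disconnect we think we are seeing can be reproduced in a few lines of plain Python. This is a scalar sketch, not the repository's tensor code: `conditional_advantage` mirrors the scale-then-clamp expression quoted above, while `rollout_step`, `PROMPT_ADVANTAGE`, and the default `advantage_scale=1.0` are hypothetical names and values we made up for illustration.

```python
PROMPT_ADVANTAGE = 1.0  # mirrors the hardcoded torch.tensor(1.0) in the env


def conditional_advantage(reward_model_value: float, advantage_scale: float) -> float:
    """Scale the reward-model value and clamp it to [-1, 1]."""
    scaled = reward_model_value * advantage_scale
    return max(-1.0, min(1.0, scaled))


def rollout_step(reward_model_value: float, advantage_scale: float = 1.0) -> dict:
    """One hypothetical step: the prompt value never depends on the reward model."""
    return {
        "prompt_advantage": PROMPT_ADVANTAGE,
        "conditional_advantage": conditional_advantage(
            reward_model_value, advantage_scale
        ),
    }


if __name__ == "__main__":
    for value in (-0.2, 0.1, 0.6):
        print(rollout_step(value))
```

If our reading is right, the two dictionary entries are computed independently at every step, which is the missing connection we are asking about.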
Our Concern
If our understanding is correct, the online self-improvement loop might be running with conflicting signals:
Step t: Policy("Advantage: 10 (expert)") → action_t [policy told "you are optimal"]
↓
World Model → predicted future frames
↓
Reward Model: V(t+k) - V(t) = -0.2 [action was actually suboptimal]
↓
PPO loss penalizes the policy
↓
Step t+1: Policy("Advantage: 10 (expert)") → action_{t+1} [STILL told "you are optimal"]
↓
...PPO corrects the weights, but the prompt gives the opposite signal...
The PPO update tries to correct the policy, but the policy receives opposing signals each step: the gradient says "that action was bad," while the prompt says "you are at expert level." We are not certain this is actually a problem — perhaps the authors intentionally designed the policy to always operate in "expert mode." We hope the authors can clarify.
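One way to check whether this concern matters in practice would be to log both signals per step and count how often they disagree. The sketch below is a hypothetical diagnostic, not something that exists in the repository: `StepRecord`, `conflict_rate`, and the 0.9 "expert" threshold are all our own assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StepRecord:
    prompt_advantage: float    # value injected into the prompt at this step
    measured_advantage: float  # reward-model estimate for the same step


def conflict_rate(records: Sequence[StepRecord], expert_threshold: float = 0.9) -> float:
    """Fraction of steps where the prompt claims expert level but the
    measured advantage is negative."""
    if not records:
        return 0.0
    conflicts = sum(
        1
        for r in records
        if r.prompt_advantage >= expert_threshold and r.measured_advantage < 0.0
    )
    return conflicts / len(records)


if __name__ == "__main__":
    log = [
        StepRecord(1.0, -0.2),
        StepRecord(1.0, 0.3),
        StepRecord(1.0, -0.1),
        StepRecord(1.0, 0.0),
    ]
    print(f"conflict rate: {conflict_rate(log):.2f}")
```

A rate near zero would suggest the hardcoded prompt is harmless in practice; a high rate would support the conflicting-signals reading.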
Our Confusion
We understand action_advantage may serve two conceptually different roles:
- As a prompt condition → tells the model "what quality level to operate at" — useful for multi-quality datasets
- As an RL reward → tells the optimizer "was this action good?" — should be computed dynamically
However, reading the code, we're not sure how these two paths are connected during the online loop — the dynamically computed conditional_advantage from the reward model doesn't seem to be wired back into the prompt's action_advantage. We may well be misunderstanding the intended architecture here.
Possible Directions (unsure which is correct)
Direction A: Disable prompt conditioning in online RL, use value model only for RL reward:
algorithm:
  with_advantage_condition: False  # ← disable prompt conditioning
  add_reward_model: True  # ← value model still provides RL reward
Direction B: Jointly train a value model alongside the policy, so the policy can self-assess advantage:
# Pi0Config_Custom
with_value_head: True # shared backbone + value head
loss_value_weight: 0.1 # value loss weight
# at inference: value head predicts current advantage → inject into prompt
Direction C: Feed the reward model's real-time output back into the prompt:
if self.with_advantage_condition:
    if prev_pred_result is not None:
        pseudo_action_advantage = prev_pred_result["conditional_advantage"]
    else:
        pseudo_action_advantage = torch.tensor(0.5)  # neutral prior for first step
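One detail Direction C would need to settle is the value range: conditional_advantage is clamped to [-1, 1], whereas the offline labels and the hardcoded 1.0 appear to live in [0, 1]. The sketch below shows one possible linear rescaling. The helper `to_prompt_advantage` and the choice of mapping are purely our assumptions; the authors may intend a different convention, or none at all.

```python
from typing import Optional


def to_prompt_advantage(
    conditional_advantage: Optional[float], neutral_prior: float = 0.5
) -> float:
    """Map a reward-model advantage in [-1, 1] to a prompt value in [0, 1].

    Returns neutral_prior when no previous prediction exists (first step).
    The linear mapping is an assumption, not the repository's behavior.
    """
    if conditional_advantage is None:
        return neutral_prior
    clipped = max(-1.0, min(1.0, conditional_advantage))
    return (clipped + 1.0) / 2.0


if __name__ == "__main__":
    for value in (None, -1.0, -0.2, 0.0, 1.0):
        print(value, "->", to_prompt_advantage(value))
```

With this mapping, a zero measured advantage lands on the same 0.5 neutral prior used for the first step.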
Questions for the Authors
- Was the hardcoding of action_advantage = 1.0 during inference an intentional design choice, or a bug?
- Is the conditional_advantage from the reward model designed to eventually feed back into the prompt? The infrastructure seems to exist (openpi_action_model.py:726), but the last-mile connection appears to be missing.
- What is the recommended approach for the self-improvement loop? Should we:
  - (a) Train the policy without advantage conditioning entirely, using the value model only for the RL reward?
  - (b) Let the policy learn to self-assess advantage by jointly training a value model alongside the policy?
  - (c) Feed the reward model's real-time output back into the prompt?
  - (d) Use some other approach?