Is `action_advantage` being hardcoded to 1.0 during inference/online RL causing self-deception in the self-improvement loop?

### Summary

We noticed that the `action_advantage` field, which conditions the Pi05 policy's behavior through the tokenized text prompt, appears to be **hardcoded to `1.0` (bin 10, i.e., "expert-level")** during both deployment inference and online RL world-model rollouts. We are unsure whether this is an intentional design choice or an oversight. Our concern is a self-deception feedback loop: the policy may generate suboptimal actions while its prompt keeps asserting expert-level performance, which could let errors accumulate over time. We would appreciate clarification from the authors on the intended design.

### Background: Two Paths for `action_advantage`

In RISE's Pi05 architecture, `action_advantage` appears in **two separate roles**:

| Role | Data Path | Used For |
|------|-----------|----------|
| **Prompt conditioning** | `action_advantage` → `TokenizePrompt` → `"Advantage: X"` in LLM prefix | Conditions action generation behavior |
| **RL reward signal** | `reward_model.predict_reward()` → `rm_value` / `conditional_advantage` → PPO advantage | Guides policy gradient update direction |
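
To make the prompt-conditioning path concrete, here is a minimal sketch of how a scalar advantage might end up as text in the prompt. The bin count of 10, the rounding rule, the prompt layout, and the helper names `advantage_to_bin` and `build_prompt` are our assumptions for illustration. Only the `"Advantage: X"` prefix comes from the table above, and the real `TokenizePrompt` may format it differently.

```python
def advantage_to_bin(advantage: float, num_bins: int = 10) -> int:
    """Map an advantage in [0, 1] to an integer bin in [0, num_bins] (assumed scheme)."""
    clipped = min(max(advantage, 0.0), 1.0)
    return round(clipped * num_bins)


def build_prompt(task: str, advantage: float) -> str:
    """Prepend the advantage bin to the task text (hypothetical format)."""
    return f"Advantage: {advantage_to_bin(advantage)}. Task: {task}"


if __name__ == "__main__":
    # Offline-style labels vary per frame; the online value is pinned at 1.0.
    for adv in (0.1, 0.3, 0.7, 1.0):
        print(build_prompt("pick up the cup", adv))
```

Under this assumed scheme, a constant `1.0` always produces the top bin, regardless of how the policy is actually performing.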

### What We Found

**Path 1 (offline training):** `action_advantage` is pre-computed by a trained value model via `label_frame_value.py`, producing frame-specific advantage values (e.g., `0.3`, `0.7`, `0.1`). This makes sense.

**Path 2 (inference/online RL):** When the policy is deployed or running in the online self-improvement loop, `action_advantage` appears to be hardcoded rather than dynamically computed:

```python
# policy_and_value/policy_online/rlinf/envs/roborl/roborl_env_lerobot.py:335-336
if self.with_advantage_condition:
    pseudo_action_advantage = torch.tensor(1.0)   # hardcoded to "expert"
```

Meanwhile, the **reward model IS running** in the online loop, computing real-time advantage-like values:

```python
# policy_and_value/policy_online/rlinf/models/embodiment/openpi_action_model.py:726
if need_infer:
    result["conditional_advantage"] = (
        reward_model_value * self.config.advantage_scale
    ).clamp(-1.0, 1.0)   # ← computed but not fed back into the prompt
```

This `conditional_advantage` appears to be used as the RL reward for PPO/GRPO updates, but doesn't seem to be written back to the `action_advantage` field in the tokenized prompt. We may be missing something in the data flow — please correct us if so.
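
The disconnect we think we are seeing can be reproduced in a few lines of plain Python. This is a toy model, not the RISE code: `rollout_step`, the reward-model values, and `advantage_scale = 1.0` are made up for illustration. Only the constant `1.0` and the scale-then-clamp rule mirror the snippets above.

```python
def conditional_advantage(reward_model_value: float, advantage_scale: float = 1.0) -> float:
    """Scale the reward-model value and clamp it to [-1, 1], as in the snippet above."""
    return max(-1.0, min(1.0, reward_model_value * advantage_scale))


def rollout_step(reward_model_value: float) -> dict:
    """One toy step: what the prompt sees versus what the RL update sees."""
    return {
        "prompt_advantage": 1.0,  # constant, like pseudo_action_advantage
        "rl_advantage": conditional_advantage(reward_model_value),
    }


if __name__ == "__main__":
    for value in (-0.2, 0.4, 1.7):  # made-up reward-model outputs
        print(rollout_step(value))
```

In this toy version, `rl_advantage` tracks the reward model while `prompt_advantage` never moves, which is the behavior we believe we observe in the online loop.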

### Our Concern

If our understanding is correct, the online self-improvement loop might be running with conflicting signals:

```
Step t:   Policy("Advantage: 10 (expert)") → action_t   [policy told "you are optimal"]
          ↓
          World Model → predicted future frames
          ↓
          Reward Model: V(t+k) - V(t) = -0.2   [action was actually suboptimal]
          ↓
          PPO loss penalizes the policy
          ↓
Step t+1: Policy("Advantage: 10 (expert)") → action_{t+1}   [STILL told "you are optimal"]
          ↓
          ...PPO corrects the weights, but the prompt gives the opposite signal...
```

The PPO update tries to correct the policy, but the policy receives opposing signals each step: the gradient says "that action was bad," while the prompt says "you are at expert level." We are not certain this is actually a problem — perhaps the authors intentionally designed the policy to always operate in "expert mode." We hope the authors can clarify.
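
One way to state the concern more precisely, in our own notation rather than anything taken from the RISE code: write the conditioned policy as $\pi_\theta(a \mid s, c)$, where $c$ is the advantage value placed in the prompt, and let $\hat{A}_t$ be the advantage estimate derived from the reward model. Assuming the standard clipped PPO surrogate, the online objective with the condition pinned at $c = 1$ would be:

$$
L(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t, c = 1)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t, c = 1)}
$$

Under this reading, $\hat{A}_t$ changes from step to step while $c$ never does, so every update is applied only to the $c = 1$ slice of the policy. Whether that is a conflict or simply the intended "always act as expert" mode is exactly what we are asking.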

### Our Confusion

We understand `action_advantage` may serve two conceptually different roles:

1. **As a prompt condition** → tells the model "what quality level to operate at" — useful for multi-quality datasets
2. **As an RL reward** → tells the optimizer "was this action good?" — should be computed dynamically

However, reading the code, we're not sure how these two paths are connected during the online loop — the dynamically computed `conditional_advantage` from the reward model doesn't seem to be wired back into the prompt's `action_advantage`. We may well be misunderstanding the intended architecture here.

### Possible Directions (unsure which is correct)

**Direction A**: Disable prompt conditioning in online RL and use the value model only for the RL reward:

```yaml
algorithm:
  with_advantage_condition: False   # ← disable prompt conditioning
  add_reward_model: True            # ← value model still provides RL reward
```

**Direction B**: Jointly train a value model alongside the policy, so the policy can self-assess advantage:

```python
# Sketch of fields on Pi0Config_Custom (dataclass-style)
with_value_head: bool = True      # shared backbone + value head
loss_value_weight: float = 0.1    # weight of the value loss in the total loss
# at inference: value head predicts current advantage → inject into prompt
```

**Direction C**: Feed the reward model's real-time output back into the prompt:

```python
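# prev_pred_result: result dict from the previous step, or None on the first step
if self.with_advantage_condition:
    if prev_pred_result is not None:
        # reuse the reward model's latest estimate as the prompt condition
        pseudo_action_advantage = prev_pred_result["conditional_advantage"]
    else:
        pseudo_action_advantage = torch.tensor(0.5)  # neutral prior for first step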
```
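
For reference, a self-contained version of Direction C might look like the sketch below. `AdvantageFeedback`, `to_prompt_range`, and the linear rescaling are hypothetical. We assume the prompt expects values in [0, 1] (as the offline labels such as `0.3` and `0.7` suggest), whereas `conditional_advantage` is clamped to [-1, 1], so some mapping between the two ranges would be needed. We have not verified either range against the code.

```python
from typing import Optional


def to_prompt_range(conditional_advantage: float) -> float:
    """Linearly map [-1, 1] to [0, 1] (assumed convention, not from the repo)."""
    clamped = max(-1.0, min(1.0, conditional_advantage))
    return (clamped + 1.0) / 2.0


class AdvantageFeedback:
    """Carries the previous step's reward-model output into the next prompt."""

    def __init__(self, neutral_prior: float = 0.5) -> None:
        self.neutral_prior = neutral_prior
        self._previous: Optional[float] = None

    def next_prompt_advantage(self) -> float:
        # First step: nothing has been observed yet, so fall back to the prior.
        if self._previous is None:
            return self.neutral_prior
        return to_prompt_range(self._previous)

    def update(self, conditional_advantage: float) -> None:
        # Store the latest reward-model estimate for use in the next prompt.
        self._previous = conditional_advantage


if __name__ == "__main__":
    feedback = AdvantageFeedback()
    for observed in (-0.2, 0.6):  # made-up reward-model outputs
        print(feedback.next_prompt_advantage())
        feedback.update(observed)
    print(feedback.next_prompt_advantage())
```

Note that the prompt condition would lag the reward model by one step in this design, since the advantage for the current action is only known after the world model has rolled it out.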

### Questions for the Authors

1. **Was the hardcoding of `action_advantage = 1.0` during inference an intentional design choice, or a bug?**

2. **Is the `conditional_advantage` from the reward model designed to eventually feed back into the prompt?** The computation already exists (`openpi_action_model.py:726`), but as far as we can tell its result is never written into the prompt's `action_advantage`.

3. **What is the recommended approach for the self-improvement loop?** Should we:
   - (a) Train the policy without advantage conditioning entirely, using the value model only for the RL reward?
   - (b) Let the policy learn to self-assess advantage by jointly training a value model alongside it?
   - (c) Feed the reward model's real-time output back into the prompt?
   - (d) Use a different approach that we have not considered?
