
fix V-trace importance weighting in CPU and scalar #598

Open
valtterivalo wants to merge 1 commit into PufferAI:5.0 from valtterivalo:fix-vtrace-rho-scaling



@valtterivalo


Missing parentheses in the CPU and scalar kernels. The vectorized kernel is already correct.

…dvantage

The scalar CUDA kernel and the CPU advantage path applied the clipped
importance weight to the reward term only (rho*r + gamma*V' - V). The
vectorized CUDA kernel already applies it to the whole TD error
(rho*(r + gamma*V' - V)), the canonical V-trace form. The three paths agree
only on-policy (rho == 1), so off-policy advantage estimates diverged on the
scalar-routed and CPU paths. Align scalar and CPU with the vectorized kernel.
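
For reviewers, a minimal Python sketch of the two per-step forms described above. It is not the kernel code from this PR: the function names, the `rho_clip` parameter, and the sample values are hypothetical and chosen only to show that the two forms match at rho == 1 and diverge otherwise.

```python
def clipped_rho(ratio, rho_clip=1.0):
    """Clip the importance ratio from above (rho_clip is an assumed name)."""
    return min(ratio, rho_clip)


def td_error_reward_only(rho, r, gamma, v_next, v):
    """Form described as the bug: the weight scales the reward term only."""
    return rho * r + gamma * v_next - v


def td_error_vtrace(rho, r, gamma, v_next, v):
    """Canonical V-trace form: the weight scales the whole TD error."""
    return rho * (r + gamma * v_next - v)


if __name__ == "__main__":
    # Illustrative values only: r, gamma, V', V.
    r, gamma, v_next, v = 1.0, 0.5, 4.0, 1.0

    for ratio in (1.0, 0.5):
        rho = clipped_rho(ratio)
        print(
            f"rho={rho}: reward-only={td_error_reward_only(rho, r, gamma, v_next, v)}, "
            f"v-trace={td_error_vtrace(rho, r, gamma, v_next, v)}"
        )
```

The gap between the two forms is (1 - rho) * (gamma*V' - V), so it vanishes on-policy and grows as rho moves away from 1.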