Yuxiang Ji1,2*,
Zengbin Wang2*,
Yong Wang2†,
Shidong Yang2,
Ziyu Ma2,
Guanhua Chen3,
Zonghua Sun1,
Liaoni Wu1,
Xiangxiang Chu2
1Xiamen University
2AMAP, Alibaba Group
3Southern University of Science and Technology
*Equal contribution,
†Project lead.
- [May 12, 2026]: Codebase released. (work in progress)
Agentic reinforcement learning (RL) for LLMs critically depends on the exploration capability of the base policy: when reward states are beyond its reachable region, advantage estimates can collapse and training may stall. Instead of relying on costly supervised cold starts, we study how to use readily available action trajectories as plan-style guidance to help agents reach useful states during RL.
We propose ActGuide-RL, which injects action data as adaptive reference guidance and jointly optimizes guided and unguided rollouts, internalizing the exploration gains back into the unguided policy. On search-agent benchmarks, ActGuide-RL consistently improves over vanilla RL and can approach SFT+RL performance without requiring supervised warm-start data.
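The training loop can be pictured as follows. This is a minimal conceptual sketch under one reading of the description above, not the repository's implementation; `rollout`, `score`, and the shared-baseline advantage are illustrative stand-ins:

```python
"""Conceptual sketch of one ActGuide-RL step: guided and unguided
rollouts are scored together so that exploration gains transfer to
the unguided policy. All helpers are toy stubs, not the repo's API."""
import random

def rollout(policy, prompt, hint=None):
    # Stub: a real agent would interleave reasoning and tool calls,
    # optionally conditioned on a plan-style guidance trajectory.
    return {"prompt": prompt, "hint": hint, "answer": random.random()}

def score(traj):
    # Stub: the recipe scores trajectories with an LLM judge server.
    return 1.0 if traj["answer"] > 0.5 else 0.0

def actguide_step(policy, prompt, guidance, n=8):
    # Half the rollouts see the action-trajectory guidance, half do not.
    guided = [rollout(policy, prompt, hint=guidance) for _ in range(n // 2)]
    unguided = [rollout(policy, prompt, hint=None) for _ in range(n // 2)]

    # One shared baseline over both groups: states reached only under
    # guidance still yield non-zero advantages for the joint update.
    rewards = [score(t) for t in guided + unguided]
    baseline = sum(rewards) / len(rewards)

    # Since the guidance is purely an input-side hint, the resulting
    # gradient signal is internalized by the same (unguided) policy.
    return [(t, r - baseline) for t, r in zip(guided + unguided, rewards)]
```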
Set up the environment:

```bash
conda create -n actguide python=3.12 -y
conda activate actguide
pip install -e .
pip install swanlab
```

Prepare the DeepSearch data:

```bash
export DATA_DIR=/path/to/data/deepsearch
cd examples/data_preprocess
bash preprocess_deepresearch_actguide.sh
```
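As a quick sanity check, you can peek at the preprocessed output. The file name (`train.parquet`) and column layout below are assumptions about what the script emits, not documented guarantees:

```python
# Inspect the preprocessed data. The "train.parquet" name is an
# assumption about what preprocess_deepresearch_actguide.sh writes
# under $DATA_DIR; adjust to whatever files actually appear there.
import os
import pandas as pd

df = pd.read_parquet(os.path.join(os.environ["DATA_DIR"], "train.parquet"))
print(f"{len(df)} examples, columns: {df.columns.tolist()}")
print(df.iloc[0])
```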
Launch the DeepResearch tool server:

```bash
export SERPER_API_KEY=your_serper_key
bash tool_server/run_deepresearch_api_server.sh 0
```
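Before training, it can be worth confirming the tool server is reachable. The port and path below are placeholders, not the server's documented interface; check `run_deepresearch_api_server.sh` for the actual values:

```python
# Hypothetical smoke test: port 8000 and the "/health" path are
# placeholders; read run_deepresearch_api_server.sh for the real ones.
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code, resp.text)
```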
Launch one or more OpenAI-compatible reward judge servers. For example, with vLLM:

```bash
CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/judge-model --host 0.0.0.0 --port 7011 --disable-log-requests
```
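Because the judge server exposes the OpenAI-compatible API, a quick round trip with the official client verifies it is up; the model name is simply the path you passed to `vllm serve`:

```python
# Ping the judge server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7011/v1", api_key="EMPTY")
out = client.chat.completions.create(
    model="/path/to/judge-model",  # the same path passed to `vllm serve`
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(out.choices[0].message.content)
```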
If you run multiple reward servers or need a single external port, use the proxy. By default it maps /reward1/, /reward2/, ... to local ports 7011, 7012, ...:

```bash
cd searchagent_scripts/proxy
python run_proxy.py
```
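For intuition, that mapping can be sketched as a small reverse proxy. This FastAPI/httpx snippet is an illustrative stand-in for `run_proxy.py`, not the repository's actual implementation:

```python
# Illustrative stand-in for run_proxy.py: forwards /reward{i}/<path>
# to http://localhost:<7010 + i>/<path>. Not the repo's actual code.
import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.api_route("/reward{i}/{path:path}", methods=["GET", "POST"])
async def forward(i: int, path: str, request: Request) -> Response:
    url = f"http://localhost:{7010 + i}/{path}"
    async with httpx.AsyncClient() as client:
        upstream = await client.request(
            request.method,
            url,
            content=await request.body(),
            headers={"content-type": request.headers.get("content-type", "application/json")},
        )
    return Response(content=upstream.content, status_code=upstream.status_code)

# Run with: uvicorn proxy_sketch:app --host 0.0.0.0 --port 7000
```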
Run the ActGuide recipe:

```bash
bash searchagent_scripts/train_searchagent_actguide.sh
```

Run the evaluation script:

```bash
bash searchagent_scripts/test_searchagent.sh
```

Coming soon.