
AMAP-ML/ActGuide-RL


Learning Agentic Policy from Action Guidance

Yuxiang Ji1,2*, Zengbin Wang2*, Yong Wang2†, Shidong Yang2, Ziyu Ma2, Guanhua Chen3, Zonghua Sun1, Liaoni Wu1, Xiangxiang Chu2
1Xiamen University    2AMAP, Alibaba Group    3Southern University of Science and Technology
*Equal contribution, †Project lead.


News

  • [May 12, 2026]: Codebase released (work in progress).


Overview

Agentic reinforcement learning (RL) for LLMs critically depends on the exploration capability of the base policy: when reward states are beyond its reachable region, advantage estimates can collapse and training may stall. Instead of relying on costly supervised cold starts, we study how to use readily available action trajectories as plan-style guidance to help agents reach useful states during RL.
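As a hedged illustration of how advantage estimates can collapse, assume a group-normalized advantage of the kind used in GRPO-style training. The source does not state which estimator ActGuide-RL uses, so the symbols here are assumptions: $G$ is the number of rollouts per prompt, $r_i$ the reward of rollout $i$, and $\epsilon$ a small constant.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G) + \epsilon},
\qquad
r_1 = \dots = r_G = 0 \;\Longrightarrow\; \hat{A}_i = 0 \ \text{for all } i .
```

Under this assumption, a prompt whose rollouts never reach a rewarded state contributes no gradient signal, which is the stall described above.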

Figure: Overview of ActGuide-RL.

We propose ActGuide-RL, which injects action data as adaptive reference guidance and jointly optimizes guided and unguided rollouts, internalizing the exploration gains back into the unguided policy. On search-agent benchmarks, ActGuide-RL consistently improves over vanilla RL and can approach SFT+RL performance without requiring supervised warm-start data.

Quick start

Installation

conda create -n actguide python=3.12 -y
conda activate actguide
pip install -e .
pip install swanlab

Data preparation

export DATA_DIR=/path/to/data/deepsearch
cd examples/data_preprocess
bash preprocess_deepresearch_actguide.sh

Tool and reward servers

Launch the DeepResearch tool server:

export SERPER_API_KEY=your_serper_key
bash tool_server/run_deepresearch_api_server.sh 0

Launch one or more OpenAI-compatible reward judge servers. For example, with vLLM:

CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/judge-model --host 0.0.0.0 --port 7011 --disable-log-requests
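When more than one judge server is needed, the command above can be repeated with a different GPU and port. The sketch below only prints the commands rather than running them. `judge_cmd` and `JUDGE_MODEL` are hypothetical names used for illustration, and the one-GPU-per-judge layout with ports starting at 7011 is an assumption, not something the repository prescribes.

```shell
# Placeholder path, as in the example above.
JUDGE_MODEL=/path/to/judge-model

# Hypothetical helper: prints (does not run) the launch command for one judge.
# $1 = GPU index, $2 = port
judge_cmd() {
  printf 'CUDA_VISIBLE_DEVICES=%s vllm serve %s --host 0.0.0.0 --port %s --disable-log-requests\n' \
    "$1" "$JUDGE_MODEL" "$2"
}

# Assumption: one GPU per judge server, consecutive ports from 7011.
gpu=0
for port in 7011 7012; do
  judge_cmd "$gpu" "$port"
  gpu=$(( gpu + 1 ))
done
```

Each printed line can then be run in its own terminal or background session once the model path is filled in.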

If you run multiple reward servers or need a single external port, use the proxy. By default it maps /reward1/, /reward2/, ... to local ports 7011, 7012, ...:

cd searchagent_scripts/proxy
python run_proxy.py
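The default path-to-port mapping described above can be written out as a small sketch. It does not call the proxy. `BASE_PORT` and `reward_port` are illustrative names that do not come from the repository, and the loopback address is an assumption about where the judge servers listen.

```shell
# Sketch of the default mapping: /rewardN/ -> local port 7011 + (N - 1).
BASE_PORT=7011

# $1 = 1-based index N from the /rewardN/ path
reward_port() {
  echo $(( BASE_PORT + $1 - 1 ))
}

for n in 1 2 3; do
  echo "/reward${n}/ -> 127.0.0.1:$(reward_port "$n")"
done
```

With three judge servers, this lists ports 7011 through 7013, matching the ports the servers would be started on.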

RL training

Run the ActGuide recipe:

bash searchagent_scripts/train_searchagent_actguide.sh

Evaluation

bash searchagent_scripts/test_searchagent.sh

Citation

Coming soon.
