Yuxiang Ji1,2*,
Zengbin Wang2*,
Yong Wang2†,
Shidong Yang2,
Ziyu Ma2,
Guanhua Chen3,
Zonghua Sun1,
Liaoni Wu1,
Xiangxiang Chu2
1Xiamen University
2AMAP, Alibaba Group
3Southern University of Science and Technology
*Equal contribution,
†Project lead.
- [May 12, 2026]: Codebase released. (work in progress)
Agentic reinforcement learning (RL) for LLMs critically depends on the exploration capability of the base policy: when reward states are beyond its reachable region, advantage estimates can collapse and training may stall. Instead of relying on costly supervised cold starts, we study how to use readily available action trajectories as plan-style guidance to help agents reach useful states during RL.
We propose ActGuide-RL, which injects action data as adaptive reference guidance and jointly optimizes guided and unguided rollouts, internalizing the exploration gains back into the unguided policy. On search-agent benchmarks, ActGuide-RL consistently improves over vanilla RL and can approach SFT+RL performance without requiring supervised warm-start data.
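The training loop can be pictured as follows. This is a minimal conceptual sketch under one reading of the description above, not the repository's implementation; `rollout`, `score`, and the shared-baseline advantage are illustrative stand-ins:

```python
"""Conceptual sketch of one ActGuide-RL step: guided and unguided
rollouts are scored together so that exploration gains transfer to
the unguided policy. All helpers are toy stubs, not the repo's API."""
import random

def rollout(policy, prompt, hint=None):
    # Stub: a real agent would interleave reasoning and tool calls,
    # optionally conditioned on a plan-style guidance trajectory.
    return {"prompt": prompt, "hint": hint, "answer": random.random()}

def score(traj):
    # Stub: the recipe scores trajectories with an LLM judge server.
    return 1.0 if traj["answer"] > 0.5 else 0.0

def actguide_step(policy, prompt, guidance, n=8):
    # Half the rollouts see the action-trajectory guidance, half do not.
    guided = [rollout(policy, prompt, hint=guidance) for _ in range(n // 2)]
    unguided = [rollout(policy, prompt, hint=None) for _ in range(n // 2)]

    # One shared baseline over both groups: states reached only under
    # guidance still yield non-zero advantages for the joint update.
    rewards = [score(t) for t in guided + unguided]
    baseline = sum(rewards) / len(rewards)

    # Since the guidance is purely an input-side hint, the resulting
    # gradient signal is internalized by the same (unguided) policy.
    return [(t, r - baseline) for t, r in zip(guided + unguided, rewards)]
```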
Set up the environment:

```bash
conda create -n actguide python=3.12 -y
conda activate actguide
pip install -e .
pip install swanlab
```

Prepare the DeepSearch data:

```bash
export DATA_DIR=/path/to/data/deepsearch
cd examples/data_preprocess
bash preprocess_deepresearch_actguide.sh
```
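As a quick sanity check, you can peek at the preprocessed output. The file name (`train.parquet`) and column layout below are assumptions about what the script emits, not documented guarantees:

```python
# Inspect the preprocessed data. The "train.parquet" name is an
# assumption about what preprocess_deepresearch_actguide.sh writes
# under $DATA_DIR; adjust to whatever files actually appear there.
import os
import pandas as pd

df = pd.read_parquet(os.path.join(os.environ["DATA_DIR"], "train.parquet"))
print(f"{len(df)} examples, columns: {df.columns.tolist()}")
print(df.iloc[0])
```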
Launch the DeepResearch tool server:

```bash
export SERPER_API_KEY=your_serper_key
bash tool_server/run_deepresearch_api_server.sh 0
```
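Before training, it can be worth confirming the tool server is reachable. The port and path below are placeholders, not the server's documented interface; check `run_deepresearch_api_server.sh` for the actual values:

```python
# Hypothetical smoke test: port 8000 and the "/health" path are
# placeholders; read run_deepresearch_api_server.sh for the real ones.
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code, resp.text)
```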
Launch one or more OpenAI-compatible reward judge servers. For example, with vLLM:

```bash
CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/judge-model --host 0.0.0.0 --port 7011 --disable-log-requests
```
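Because the judge server exposes the OpenAI-compatible API, a quick round trip with the official client verifies it is up; the model name is simply the path you passed to `vllm serve`:

```python
# Ping the judge server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7011/v1", api_key="EMPTY")
out = client.chat.completions.create(
    model="/path/to/judge-model",  # the same path passed to `vllm serve`
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(out.choices[0].message.content)
```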
If you run multiple reward servers or need a single external port, use the proxy. By default it maps /reward1/, /reward2/, ... to local ports 7011, 7012, ...:

```bash
cd searchagent_scripts/proxy
python run_proxy.py
```
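For intuition, that mapping can be sketched as a small reverse proxy. This FastAPI/httpx snippet is an illustrative stand-in for `run_proxy.py`, not the repository's actual implementation:

```python
# Illustrative stand-in for run_proxy.py: forwards /reward{i}/<path>
# to http://localhost:<7010 + i>/<path>. Not the repo's actual code.
import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.api_route("/reward{i}/{path:path}", methods=["GET", "POST"])
async def forward(i: int, path: str, request: Request) -> Response:
    url = f"http://localhost:{7010 + i}/{path}"
    async with httpx.AsyncClient() as client:
        upstream = await client.request(
            request.method,
            url,
            content=await request.body(),
            headers={"content-type": request.headers.get("content-type", "application/json")},
        )
    return Response(content=upstream.content, status_code=upstream.status_code)

# Run with: uvicorn proxy_sketch:app --host 0.0.0.0 --port 7000
```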
Run the ActGuide recipe:

```bash
bash searchagent_scripts/train_searchagent_actguide.sh
```

Run the evaluation script:

```bash
bash searchagent_scripts/test_searchagent.sh
```

Coming soon.