CodeRM-NT: Reward Model for Code RL without Unit Tests

Figure (left): Synthetic unit tests can produce unreliable rewards, while CodeRM-NT correctly ranks code quality.
Figure (right): Qwen3-4B-Thinking trained with CodeRM-NT outperforms the same model trained with synthetic unit-test rewards on most benchmarks.

Overview

Providing accurate reward signals for code generated by LLMs is a significant challenge in applying reinforcement learning (RL) to code generation. Existing methods rely on unit tests, which are expensive to curate and unreliable when automatically synthesized.

CodeRM-NT is a code reward model with no reliance on unit tests. Instead of executing test cases, it learns to estimate the functional correctness of generated Python code from rewards that are collected via Monte Carlo Tree Search (MCTS) guided by LLM-as-a-Judge. Training Qwen2.5-Coder, GLM-4-9B-0414, and Qwen3-4B-Thinking with CodeRM-NT consistently outperforms synthetic unit-test rewards on HumanEval(+), MBPP(+), LiveCodeBench-v5, and BigCodeBench-Instruct-Hard.

Key Results

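Training with CodeRM-NT outperforms synthetic unit tests and other reward models across multiple code generation benchmarks. The table below compares it with synthetic unit-test rewards: CodeRM-NT reaches the higher average (Avg.) for all five models, although individual benchmarks vary. LCB-v5 denotes LiveCodeBench-v5 and BCB-I-Hard denotes BigCodeBench-Instruct-Hard.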

Model	Reward	HumanEval	HumanEval+	MBPP	MBPP+	LCB-v5	BCB-I-Hard	Avg.
Qwen2.5-Coder-1.5B	Unit Tests	73.2	67.7	70.9	61.1	5.1	6.1	47.4
	CodeRM-NT	75.0	69.5	72.0	60.8	5.5	7.4	48.4
Qwen2.5-Coder-3B	Unit Tests	86.6	82.3	74.9	64.6	13.0	15.5	56.2
	CodeRM-NT	88.4	82.3	75.9	66.1	13.6	14.2	56.8
Qwen2.5-Coder-7B	Unit Tests	90.9	87.8	85.4	73.0	17.3	18.2	62.1
	CodeRM-NT	90.2	86.0	86.8	74.6	17.5	18.2	62.2
GLM-4-9B-0414	Unit Tests	84.1	79.9	81.0	69.0	15.4	15.5	57.5
	CodeRM-NT	87.2	81.7	79.9	67.2	15.3	18.2	58.3
Qwen3-4B-Thinking	Unit Tests	97.6	92.7	91.0	75.1	50.3	25.7	72.1
	CodeRM-NT	97.6	94.5	92.6	77.2	52.1	22.3	72.7
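
The table can also be read per model by counting the benchmarks on which CodeRM-NT wins, ties, or loses against synthetic unit tests. The snippet below is a small sketch of that arithmetic using the numbers copied from the table; the names `RESULTS`, `tally`, and `summarize` are illustrative and not part of this repository.

```python
# Scores copied from the Key Results table.
# Column order: HumanEval, HumanEval+, MBPP, MBPP+, LCB-v5, BCB-I-Hard, Avg.
# Each entry maps a model to (unit-test row, CodeRM-NT row).
RESULTS = {
    "Qwen2.5-Coder-1.5B": ([73.2, 67.7, 70.9, 61.1, 5.1, 6.1, 47.4],
                           [75.0, 69.5, 72.0, 60.8, 5.5, 7.4, 48.4]),
    "Qwen2.5-Coder-3B": ([86.6, 82.3, 74.9, 64.6, 13.0, 15.5, 56.2],
                         [88.4, 82.3, 75.9, 66.1, 13.6, 14.2, 56.8]),
    "Qwen2.5-Coder-7B": ([90.9, 87.8, 85.4, 73.0, 17.3, 18.2, 62.1],
                         [90.2, 86.0, 86.8, 74.6, 17.5, 18.2, 62.2]),
    "GLM-4-9B-0414": ([84.1, 79.9, 81.0, 69.0, 15.4, 15.5, 57.5],
                      [87.2, 81.7, 79.9, 67.2, 15.3, 18.2, 58.3]),
    "Qwen3-4B-Thinking": ([97.6, 92.7, 91.0, 75.1, 50.3, 25.7, 72.1],
                          [97.6, 94.5, 92.6, 77.2, 52.1, 22.3, 72.7]),
}


def tally(unit_tests, coderm_nt):
    """Count benchmarks (Avg. excluded) where CodeRM-NT wins, ties, or loses."""
    wins = ties = losses = 0
    for ut, rm in zip(unit_tests[:-1], coderm_nt[:-1]):
        if rm > ut:
            wins += 1
        elif rm == ut:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


def summarize(results=RESULTS):
    """Return per-model win/tie/loss counts and the difference in Avg."""
    summary = {}
    for model, (ut, rm) in results.items():
        wins, ties, losses = tally(ut, rm)
        summary[model] = {
            "wins": wins,
            "ties": ties,
            "losses": losses,
            "avg_delta": round(rm[-1] - ut[-1], 1),
        }
    return summary


if __name__ == "__main__":
    for model, row in summarize().items():
        print(model, row)
```

Across the 30 model-benchmark pairs this gives 19 wins, 3 ties, and 8 losses for CodeRM-NT, with a positive Avg. difference for every model.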

Getting Started

Requirements

Python 3.10+
PyTorch 2.0+
CUDA-capable GPUs (8x A100 80GB recommended)
Docker

For MCTS, reward model training, and TRL-based RL, run:

pip install -r requirements.txt

For Slime-based training, using the provided Docker image is recommended:

docker run --rm --gpus all --shm-size=16g -it slimerl/slime:latest /bin/bash
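
Before starting the steps below, it can help to confirm that the machine meets the requirements listed above. The following is a minimal sketch of such a check, not a script shipped with this repository; the function name `check_environment` and the report keys are made up for illustration.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a report on the requirements listed in this README.

    Most values are booleans; "gpu count" is an integer.
    PyTorch-related entries are only added when torch is importable.
    """
    report = {
        "python>=3.10": sys.version_info >= (3, 10),
        "docker on PATH": shutil.which("docker") is not None,
        "torch installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        major = int(torch.__version__.split(".")[0])
        report["torch>=2.0"] = major >= 2
        report["cuda available"] = torch.cuda.is_available()
        report["gpu count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

The check only reports what it finds; it does not verify GPU memory or the specific package versions pinned in requirements.txt.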

Step 1: Reward Data Curation via MCTS

See mcts/README.md for the full pipeline.

Step 2: Train CodeRM-NT

See rm_train/README.md.

Step 3: RL

For the reward model, use either our published checkpoint, Rishubi/CodeRM-NT, or the model you trained in Step 2.

TRL (Qwen2.5-Coder on KodCode-V1): see rl_train_trl/README.md.
slime (Qwen3-4B-Thinking, GLM-4-9B-0414 on OpenCodeInstruct (OCI)/KodCode): see rl_train_slime/README.md.
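
To illustrate how a model-based reward can replace unit-test execution in an RL loop, here is a hedged sketch of a reward-function wrapper. It follows the custom reward function convention of TRL's GRPOTrainer (a callable that receives `prompts` and `completions` and returns a list of floats). Whether rl_train_trl uses exactly this interface is an assumption, and `score_fn`, `extract_code`, `make_reward_func`, and `dummy_score` are hypothetical names; the actual scoring call for CodeRM-NT is described in rl_train_trl/README.md.

```python
import re
from typing import Callable, List

# Built from single backticks so this snippet can itself live inside a fence.
FENCE = "`" * 3
_CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(completion: str) -> str:
    """Return the last fenced code block, or the whole completion if none exists."""
    blocks = _CODE_BLOCK.findall(completion)
    return blocks[-1].strip() if blocks else completion.strip()


def make_reward_func(score_fn: Callable[[str, str], float]):
    """Wrap a (prompt, code) -> score callable as a batch reward function.

    score_fn is a placeholder for a call into a reward model such as
    CodeRM-NT; how that model is actually queried is not shown here.
    """

    def reward_func(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
        # Assumes plain-string completions (not chat-style message lists).
        return [
            float(score_fn(prompt, extract_code(completion)))
            for prompt, completion in zip(prompts, completions)
        ]

    return reward_func


def dummy_score(prompt: str, code: str) -> float:
    """Stand-in scorer for demonstration only; not a real correctness estimate."""
    return 1.0 if "return" in code else 0.0


if __name__ == "__main__":
    reward_func = make_reward_func(dummy_score)
    answer = "Here you go:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
    print(reward_func(prompts=["Write add(a, b)."], completions=[answer]))
```

Keeping the scorer behind a plain callable means the same wrapper works with either the published checkpoint or a model trained in Step 2, since no test cases are executed at any point.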

Citation

If you find our work helpful, please cite our paper:

@inproceedings{xia-etal-2026-coderm,
    title = "{C}ode{RM}-{NT}: Reward Model for Code {RL} without Unit Tests",
    author = "Xia, Xiao  and
      Zhang, Dan  and
      Sun, Tianrui",
    editor = "Liakata, Maria  and
      Moreira, Viviane P.  and
      Zhang, Jiajun  and
      Jurgens, David",
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {ACL} 2026",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-acl.2150/",
    pages = "43316--43333",
    ISBN = "979-8-89176-395-1"
}

Acknowledgements

Built on Qwen2.5-Coder-Instruct, GLM-4-9B-0414, and Qwen3-4B-Thinking-2507 base models.
Uses huggingface/trl and THUDM/slime for RL training.
Uses Magicoder-OSS-Instruct-75K to curate reward data, and KodCode-V1 and OpenCodeInstruct for RL.
