Left: Synthetic unit tests can produce unreliable rewards, while CodeRM-NT correctly ranks code quality.
Right: Qwen3-4B-Thinking trained with CodeRM-NT outperforms synthetic unit tests on most benchmarks.
Providing accurate reward signals for code generated by LLMs is a significant challenge in applying reinforcement learning (RL) to code generation. Existing methods rely on unit tests, which are expensive to curate and unreliable when automatically synthesized.
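To make the unreliability concrete, here is a small hypothetical illustration. The function, the buggy solution, and both test lists are invented for this example and are not taken from the CodeRM-NT data. It shows how a synthesized test that covers too few cases can assign full reward to incorrect code.

```python
# Hypothetical example: a weak synthesized test rewards a buggy solution.
BUGGY_SOLUTION = """
def abs_diff(a, b):
    return a - b  # bug: result is negative when a < b
"""

WEAK_TESTS = ["assert abs_diff(5, 3) == 2"]
STRONGER_TESTS = WEAK_TESTS + ["assert abs_diff(3, 5) == 2"]


def unit_test_reward(solution: str, tests: list[str]) -> float:
    """Return the fraction of test statements that pass for `solution`.

    exec() is used on trusted strings for illustration only; a real
    pipeline would sandbox untrusted generated code.
    """
    namespace: dict = {}
    exec(solution, namespace)
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass  # a failing or crashing test earns no credit
    return passed / len(tests)


weak_reward = unit_test_reward(BUGGY_SOLUTION, WEAK_TESTS)
strong_reward = unit_test_reward(BUGGY_SOLUTION, STRONGER_TESTS)
print(weak_reward, strong_reward)
```

With only the weak test, the buggy solution receives the maximum reward, which is the kind of misleading signal a reward model that does not depend on tests is meant to avoid.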
CodeRM-NT is a code reward model with no reliance on unit tests. Instead of executing test cases, it learns to estimate the functional correctness of generated Python code from rewards that are collected via Monte Carlo Tree Search (MCTS) guided by LLM-as-a-Judge. Training Qwen2.5-Coder, GLM-4-9B-0414, and Qwen3-4B-Thinking with CodeRM-NT yields higher average scores than synthetic unit-test rewards across HumanEval(+), MBPP(+), LiveCodeBench-v5, and BigCodeBench-Instruct-Hard.
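As a rough intuition for how a tree search can turn judge verdicts into graded rewards, the sketch below shows only the generic MCTS backup step. Everything in it is an assumption made for illustration: the `Node` class, what a node represents, and the 0/1 verdict values are hypothetical, and the actual tree construction and judge prompts are described in mcts/README.md.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One search-tree node (what a node represents is an assumption here)."""
    parent: Optional["Node"] = None
    visits: int = 0
    value_sum: float = 0.0

    @property
    def value(self) -> float:
        # Mean of all verdicts backed up through this node.
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(leaf: Node, verdict: float) -> None:
    """Add one judge verdict to the leaf and every ancestor up to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += verdict
        node = node.parent


root = Node()
candidate_a = Node(parent=root)
candidate_b = Node(parent=root)

# Hypothetical verdicts: 1.0 means the judge deemed a rollout correct.
for verdict in (1.0, 1.0):
    backpropagate(candidate_a, verdict)
backpropagate(candidate_b, 0.0)

print(candidate_a.value, candidate_b.value, root.value)
```

The averaged node values give a soft score per candidate instead of a single pass/fail bit, which is one plausible form for the training targets of a reward model.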
Across six code generation benchmarks, training with CodeRM-NT yields a higher average score than synthetic unit-test rewards for every model below, although unit tests remain ahead on some individual benchmarks:
| Model | Reward | HumanEval | HumanEval+ | MBPP | MBPP+ | LCB-v5 | BCB-I-Hard | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-1.5B | Unit Tests | 73.2 | 67.7 | 70.9 | 61.1 | 5.1 | 6.1 | 47.4 |
| | CodeRM-NT | 75.0 | 69.5 | 72.0 | 60.8 | 5.5 | 7.4 | 48.4 |
| Qwen2.5-Coder-3B | Unit Tests | 86.6 | 82.3 | 74.9 | 64.6 | 13.0 | 15.5 | 56.2 |
| | CodeRM-NT | 88.4 | 82.3 | 75.9 | 66.1 | 13.6 | 14.2 | 56.8 |
| Qwen2.5-Coder-7B | Unit Tests | 90.9 | 87.8 | 85.4 | 73.0 | 17.3 | 18.2 | 62.1 |
| | CodeRM-NT | 90.2 | 86.0 | 86.8 | 74.6 | 17.5 | 18.2 | 62.2 |
| GLM-4-9B-0414 | Unit Tests | 84.1 | 79.9 | 81.0 | 69.0 | 15.4 | 15.5 | 57.5 |
| | CodeRM-NT | 87.2 | 81.7 | 79.9 | 67.2 | 15.3 | 18.2 | 58.3 |
| Qwen3-4B-Thinking | Unit Tests | 97.6 | 92.7 | 91.0 | 75.1 | 50.3 | 25.7 | 72.1 |
| | CodeRM-NT | 97.6 | 94.5 | 92.6 | 77.2 | 52.1 | 22.3 | 72.7 |
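To show how the Avg. column relates to the per-benchmark scores, the snippet below recomputes it for the two Qwen3-4B-Thinking rows, with the scores copied from the table. It assumes Avg. is the unweighted mean of the six benchmark columns, which reproduces the reported values for these rows; the variable names are illustrative.

```python
BENCHMARKS = ["HumanEval", "HumanEval+", "MBPP", "MBPP+", "LCB-v5", "BCB-I-Hard"]

# Qwen3-4B-Thinking rows, copied from the table.
unit_tests = [97.6, 92.7, 91.0, 75.1, 50.3, 25.7]
coderm_nt = [97.6, 94.5, 92.6, 77.2, 52.1, 22.3]


def mean(scores: list[float]) -> float:
    return sum(scores) / len(scores)


avg_unit_tests = round(mean(unit_tests), 1)
avg_coderm_nt = round(mean(coderm_nt), 1)

# Per-benchmark difference (CodeRM-NT minus unit tests), in points.
deltas = {
    name: round(rm - ut, 1)
    for name, ut, rm in zip(BENCHMARKS, unit_tests, coderm_nt)
}
unit_test_wins = [name for name, delta in deltas.items() if delta < 0]

print(avg_unit_tests, avg_coderm_nt)
print(deltas)
print(unit_test_wins)
```

For this model the average improves by 0.6 points even though unit-test rewards remain ahead on BCB-I-Hard, so the average and the per-benchmark picture are worth reading together.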
- Python 3.10+
- PyTorch 2.0+
- CUDA-capable GPUs (8x A100 80GB recommended)
- Docker
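The requirements above can be checked before installing anything with a short script such as the following. This is a convenience sketch rather than part of the repository: `check_environment` is a hypothetical helper, and it treats a missing PyTorch install as a failed check instead of raising.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report whether the listed requirements appear to be satisfied."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "docker_found": shutil.which("docker") is not None,
        "torch_ok": False,
        "cuda_available": False,
    }
    # Import torch only if it is installed, so the check never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        major = int(torch.__version__.split(".")[0])
        report["torch_ok"] = major >= 2
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

The script does not verify GPU count or memory, so the 8x A100 80GB recommendation still has to be confirmed separately.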
For MCTS, reward model training, and TRL-based RL, run:

    pip install -r requirements.txt

For slime-based training, using the provided Docker image is recommended:

    docker run --rm --gpus all --shm-size=16g -it slimerl/slime:latest /bin/bash

See mcts/README.md for the full pipeline.
For reward model training (Step 2), see rm_train/README.md.
For the reward model, use our published Rishubi/CodeRM-NT or use the model trained in Step 2.
- TRL (Qwen2.5-Coder on KodCode-V1): see rl_train_trl/README.md.
- slime (Qwen3-4B-Thinking, GLM-4-9B-0414 on OpenCodeInstruct/KodCode): see rl_train_slime/README.md.
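For orientation, a reward model typically enters an RL loop as a function that maps prompts and completions to scalar rewards. The sketch below uses the callable shape that recent TRL releases accept for custom reward functions in trainers such as `GRPOTrainer` (keyword arguments `prompts` and `completions`, returning a list of floats). Whether rl_train_trl uses exactly this interface is an assumption, and `make_reward_func` and `placeholder_score` are hypothetical names; the placeholder stands in for a real call to the reward model.

```python
from typing import Callable


def make_reward_func(score_fn: Callable[[str, str], float]):
    """Wrap a (prompt, completion) -> score function as a batch reward function."""

    def reward_func(prompts, completions, **kwargs):
        # One float per (prompt, completion) pair, in input order.
        return [float(score_fn(p, c)) for p, c in zip(prompts, completions)]

    return reward_func


def placeholder_score(prompt: str, completion: str) -> float:
    """Stand-in scorer; a real setup would query the reward model here."""
    return 1.0 if "return" in completion else 0.0


reward_func = make_reward_func(placeholder_score)
rewards = reward_func(
    prompts=["Write add(a, b).", "Write add(a, b)."],
    completions=["def add(a, b):\n    return a + b", "def add(a, b):\n    pass"],
)
print(rewards)
```

Keeping the scorer behind a plain callable means the same wrapper works whether the score comes from the published checkpoint or from a model trained locally.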
If you find our work helpful, please cite our paper:
@inproceedings{xia-etal-2026-coderm,
title = "{C}ode{RM}-{NT}: Reward Model for Code {RL} without Unit Tests",
author = "Xia, Xiao and
Zhang, Dan and
Sun, Tianrui",
editor = "Liakata, Maria and
Moreira, Viviane P. and
Zhang, Jiajun and
Jurgens, David",
booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {ACL} 2026",
month = jul,
year = "2026",
address = "San Diego, California, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.findings-acl.2150/",
pages = "43316--43333",
ISBN = "979-8-89176-395-1"
}
- Built on Qwen2.5-Coder-Instruct, GLM-4-9B-0414, and Qwen3-4B-Thinking-2507 base models.
- Uses huggingface/trl and THUDM/slime for RL training.
- Uses Magicoder-OSS-Instruct-75K to curate reward data, and KodCode-V1 and OpenCodeInstruct for RL training.

