Introduction

CodeRM-8B is a small yet powerful model for efficient, high-quality unit test generation. It is trained on 60k high-quality synthetic Python unit tests synthesized with Llama3.1-70B-Instruct. These unit tests are built on top of two well-regarded code instruction tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO. The training data is openly available as CodeRM-UnitTest.

For further details on training, refer to our paper "Dynamic Scaling of Unit Tests for Code Reward Modeling", available on arXiv.

You can also visit the paper's homepage and GitHub repository.

Model Information

The model is fine-tuned from Llama3.1-8B-Instruct.

Prompt Format

Below is a question and it's corresponding code answer. Please write test cases to check the correctness of the code answer. You need to use the unittest library in Python and create a test class for testing.

### question
{question}

### code solution
{code in function format}

Please add detailed comments to the test cases you write. You do not need to test the function's ability to throw exceptions.
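
For reference, the sketch below shows one way to fill this template and query the model through the Hugging Face transformers chat interface. It is a minimal sketch, not an official inference script: the model id KAKA22/CodeRM-8B comes from this repository, while the generate_unit_tests helper, its generation settings, and the placeholder inputs are illustrative assumptions.

```python
# Minimal sketch: build the unit-test generation prompt and query CodeRM-8B
# via transformers' chat template. Generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "KAKA22/CodeRM-8B"

# Template copied verbatim from the Prompt Format section above.
PROMPT_TEMPLATE = """Below is a question and it's corresponding code answer. Please write test cases to check the correctness of the code answer. You need to use the unittest library in Python and create a test class for testing.

### question
{question}

### code solution
{code}

Please add detailed comments to the test cases you write. You do not need to test the function's ability to throw exceptions."""


def generate_unit_tests(question: str, code: str, max_new_tokens: int = 1024) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Fill the template and wrap it as a single user turn (Llama 3.1 chat format).
    prompt = PROMPT_TEMPLATE.format(question=question, code=code)
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    # Sampling parameters here are placeholders, not the paper's exact configuration.
    output_ids = model.generate(
        input_ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.8
    )
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
```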

Performance

Best-of-N

First, we evaluate the performance of CodeRM-8B in a best-of-N setting. In this setup, an LLM (policy model) generates 100 candidate code solutions for a given programming problem, while another LLM (reward model) generates 100 unit tests. The best code solution is then selected by majority voting over the execution results of these unit tests.
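
As a rough illustration of this selection step (not the paper's exact implementation), each candidate solution can be scored by the number of generated unit tests it passes, and the highest-scoring solution is returned. The run_test helper below is hypothetical and stands in for sandboxed execution of one solution against one test.

```python
# Simplified sketch of best-of-N selection via unit-test majority voting.
from typing import Callable, List


def select_best_solution(
    solutions: List[str],                   # N candidate solutions from the policy model
    unit_tests: List[str],                  # M unit tests from the reward model (e.g. CodeRM-8B)
    run_test: Callable[[str, str], bool],   # hypothetical sandboxed execution of (solution, test)
) -> str:
    # Each unit test casts one vote for every solution that passes it;
    # the solution that passes the most tests is selected.
    votes = [sum(run_test(solution, test) for test in unit_tests) for solution in solutions]
    best_index = max(range(len(solutions)), key=lambda i: votes[i])
    return solutions[best_index]
```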

Under this framework, our trained unit test generator performs comparably to Llama3.1-70B-Instruct despite being about 8x smaller in parameter count. Detailed evaluation results on three well-known benchmarks are as follows:

| Model | Policy: Llama3-8B | Policy: Llama3-70B | Policy: GPT-3.5 | Policy: GPT-4o-mini |
|---|---|---|---|---|
| **Benchmark: HumanEval Plus** | | | | |
| Vanilla | 53.58 | 73.74 | 67.83 | 82.96 |
| Reward: Llama3.1-8B | 66.84 (+13.26) | 77.14 (+3.40) | 76.32 (+8.49) | 83.11 (+0.15) |
| Reward: Llama3.1-70B | 72.04 (+18.46) | 78.54 (+4.80) | 79.76 (+11.93) | 85.45 (+2.49) |
| Reward: CodeRM-8B | 72.01 (+18.43) | 78.69 (+4.95) | 78.01 (+10.18) | 86.38 (+3.42) |
| **Benchmark: MBPP Plus** | | | | |
| Vanilla | 49.20 | 69.33 | 70.53 | 71.59 |
| Reward: Llama3.1-8B | 64.31 (+15.11) | 71.64 (+2.31) | 74.18 (+3.65) | 74.48 (+2.89) |
| Reward: Llama3.1-70B | 65.26 (+16.06) | 71.85 (+2.52) | 75.72 (+5.19) | 74.96 (+3.37) |
| Reward: CodeRM-8B | 66.71 (+17.51) | 72.44 (+3.11) | 75.96 (+5.43) | 75.20 (+3.61) |
| **Benchmark: LiveCodeBench** | | | | |
| Vanilla | 11.98 | 25.30 | 20.55 | 34.83 |
| Reward: Llama3.1-70B | 13.28 (+1.30) | 28.46 (+3.16) | 22.80 (+2.25) | 38.60 (+3.77) |
| Reward: CodeRM-8B | 15.21 (+3.23) | 27.73 (+2.43) | 21.76 (+1.21) | 39.20 (+4.37) |

Quality of Unit Test

We also evaluate the quality of the unit tests generated by CodeRM-8B. Since each unit test acts as a classifier that labels candidate solutions as correct or incorrect, we first use accuracy and F1 score to measure its classification performance.

We further propose two new metrics that evaluate in more detail how likely the unit tests are to make incorrect judgments. False Acceptance Rate (FAR) measures the probability that wrong solutions are accepted by unit tests, and False Rejection Rate (FRR) measures the probability that correct solutions are rejected by unit tests. The formulas for all four metrics are given in Appendix D of the paper.
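
Reading the definitions above with "correct solution" as the positive class and "accepted by the unit test(s)" as a positive prediction, the four metrics can be computed from a confusion matrix roughly as in the following sketch. The helper and variable names are illustrative; refer to Appendix D of the paper for the exact formulas.

```python
# Illustrative computation of Acc, F1, FAR, and FRR for a unit-test verifier,
# based on the informal definitions quoted above (not the paper's exact formulas).
from typing import List


def verifier_metrics(is_correct: List[bool], is_accepted: List[bool]) -> dict:
    tp = sum(c and a for c, a in zip(is_correct, is_accepted))              # correct, accepted
    fn = sum(c and not a for c, a in zip(is_correct, is_accepted))          # correct, rejected
    fp = sum((not c) and a for c, a in zip(is_correct, is_accepted))        # wrong, accepted
    tn = sum((not c) and (not a) for c, a in zip(is_correct, is_accepted))  # wrong, rejected

    acc = (tp + tn) / len(is_correct)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0  # wrong solutions that get accepted
    frr = fn / (fn + tp) if fn + tp else 0.0  # correct solutions that get rejected
    return {"Acc": acc, "F1": f1, "FAR": far, "FRR": frr}
```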

Below we report the quality of individual unit tests and of combinations of multiple unit tests on HumanEval Plus, using Llama3.1-8B as the policy model. Within each block, the best result in each column is shown in bold and the second best is underlined.

| Model | Acc (↑) | F1 (↑) | FAR (↓) | FRR (↓) |
|---|---|---|---|---|
| **Quality of Individual Unit Tests** | | | | |
| Llama3.1-8B | 60.02 | 44.97 | 13.66 | 46.13 |
| Llama3.1-70B | **73.65** | **70.15** | **11.10** | **34.51** |
| CodeRM-8B (Ours) | <u>69.64</u> | <u>63.63</u> | <u>11.17</u> | <u>38.55</u> |
| **Quality of Multiple Unit Tests** | | | | |
| Llama3.1-8B | 74.21 | 74.35 | 20.44 | 30.55 |
| Llama3.1-70B | <u>78.30</u> | <u>78.76</u> | <u>17.19</u> | <u>25.97</u> |
| CodeRM-8B (Ours) | **80.46** | **81.27** | **16.48** | **22.71** |

Citation

If you find our model helpful, please cite the original paper:

@misc{ma2025coderm,
      title={Dynamic Scaling of Unit Tests for Code Reward Modeling}, 
      author={Zeyao Ma and Xiaokang Zhang and Jing Zhang and Jifan Yu and Sijia Luo and Jie Tang},
      year={2025},
      eprint={2501.01054},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01054}, 
}

Contact

If you have any problems, feel free to raise an issue or reach out to us via email at: [email protected].
