Introduction
CodeRM-8B is a small yet powerful model designed for efficient, high-quality unit test generation. It is trained on a dataset of 60k high-quality synthetic Python unit tests synthesized with Llama3.1-70B-Instruct. The unit tests are derived from two well-regarded code instruction tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO. The resulting training dataset is openly available as CodeRM-UnitTest.
For further details on training, refer to our paper, "Dynamic Scaling of Unit Tests for Code Reward Modeling", available on arXiv.
You can also visit the paper's homepage and GitHub repository.
Model Information
The model is fine-tuned from Llama3.1-8B-Instruct.
Prompt Format
Below is a question and it's corresponding code answer. Please write test cases to check the correctness of the code answer. You need to use the unittest library in Python and create a test class for testing.
### question
{question}
### code solution
{code in function format}
Please add detailed comments to the test cases you write. You do not need to test the function's ability to throw exceptions.
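For reference, here is a minimal sketch (not an official inference script) of how the prompt template above might be filled in and passed to the model with Hugging Face transformers. The example question, code solution, and sampling parameters are placeholders.

```python
# Minimal sketch: generating unit tests with CodeRM-8B via Hugging Face transformers.
# The question/solution below are illustrative placeholders, not part of any benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KAKA22/CodeRM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "Write a function that returns the sum of two integers."
code_solution = "def add(a, b):\n    return a + b"

# Fill in the prompt template shown above.
prompt = (
    "Below is a question and it's corresponding code answer. Please write test cases to check "
    "the correctness of the code answer. You need to use the unittest library in Python and "
    "create a test class for testing.\n\n"
    f"### question\n{question}\n\n"
    f"### code solution\n{code_solution}\n\n"
    "Please add detailed comments to the test cases you write. You do not need to test the "
    "function's ability to throw exceptions."
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
# Print only the newly generated unit tests, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```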
Performance
Best-of-N
First, we evaluate the performance of CodeRM-8B in a best-of-N setting. In this setup, one LLM (the policy model) generates 100 candidate code solutions for a given programming problem, while another LLM (the reward model) generates 100 unit tests. The best code solution is then selected by majority voting over the execution results of these unit tests.
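As a simplified illustration of this selection step, the sketch below assumes the unit-test execution results are already available as a pass/fail matrix (sandboxed execution of the generated code and tests is omitted) and picks the candidate that passes the most tests.

```python
# Simplified sketch of best-of-N selection via unit-test majority voting.
# `passed[i][j]` is True if candidate solution i passes unit test j.
# Actually running generated code and tests in a sandbox is out of scope here.
from typing import List, Sequence


def select_best_solution(solutions: Sequence[str], passed: List[List[bool]]) -> str:
    """Return the candidate solution that passes the most generated unit tests."""
    scores = [sum(row) for row in passed]  # votes: number of tests each solution passes
    best_index = max(range(len(solutions)), key=scores.__getitem__)
    return solutions[best_index]


# Toy usage: 3 candidate solutions scored against 4 unit tests.
candidates = ["solution_a", "solution_b", "solution_c"]
results = [
    [True, False, True, False],   # solution_a passes 2 tests
    [True, True, True, False],    # solution_b passes 3 tests
    [False, False, True, False],  # solution_c passes 1 test
]
print(select_best_solution(candidates, results))  # -> "solution_b"
```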
Under this framework, our trained unit test generator demonstrates performance comparable to Llama3.1-70B-Instruct, despite having an 8x smaller parameter size. The detailed evaluation results across three well-known benchmarks are as follows:
| Model | Policy: Llama3-8B | Policy: Llama3-70B | Policy: GPT-3.5 | Policy: GPT-4o-mini |
|---|---|---|---|---|
| Benchmark: HumanEval Plus | | | | |
| Vanilla | 53.58 | 73.74 | 67.83 | 82.96 |
| Reward: Llama3.1-8B | 66.84 (+13.26) | 77.14 (+3.40) | 76.32 (+8.49) | 83.11 (+0.15) |
| Reward: Llama3.1-70B | 72.04 (+18.46) | 78.54 (+4.80) | 79.76 (+11.93) | 85.45 (+2.49) |
| Reward: CodeRM-8B | 72.01 (+18.43) | 78.69 (+4.95) | 78.01 (+10.18) | 86.38 (+3.42) |
| Benchmark: MBPP Plus | | | | |
| Vanilla | 49.20 | 69.33 | 70.53 | 71.59 |
| Reward: Llama3.1-8B | 64.31 (+15.11) | 71.64 (+2.31) | 74.18 (+3.65) | 74.48 (+2.89) |
| Reward: Llama3.1-70B | 65.26 (+16.06) | 71.85 (+2.52) | 75.72 (+5.19) | 74.96 (+3.37) |
| Reward: CodeRM-8B | 66.71 (+17.51) | 72.44 (+3.11) | 75.96 (+5.43) | 75.20 (+3.61) |
| Benchmark: LiveCodeBench | | | | |
| Vanilla | 11.98 | 25.30 | 20.55 | 34.83 |
| Reward: Llama3.1-70B | 13.28 (+1.30) | 28.46 (+3.16) | 22.80 (+2.25) | 38.60 (+3.77) |
| Reward: CodeRM-8B | 15.21 (+3.23) | 27.73 (+2.43) | 21.76 (+1.21) | 39.20 (+4.37) |
Quality of Unit Test
We evaluate the quality of the unit tests generated by CodeRM-8B. Since each unit test acts as a classifier that labels candidate solutions as correct or incorrect, we first use accuracy and F1 score to assess its classification performance.
We further propose two new metrics that quantify how likely a unit test is to make an incorrect judgment. The False Acceptance Rate (FAR) measures the probability that wrong solutions are accepted by the unit tests, and the False Rejection Rate (FRR) measures the probability that correct solutions are rejected. The formulas for all four metrics are given in Appendix D of the paper.
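As an illustration of how these metrics relate, the sketch below computes Acc, F1, FAR, and FRR directly from the plain-language definitions above, treating the unit tests' verdicts as a binary classifier over candidate solutions. It is an assumption-based sketch and may differ in detail from the exact formulations in Appendix D of the paper.

```python
# Illustrative sketch of the four metrics, based on the definitions above.
# `is_correct[i]` is the ground-truth label of solution i; `accepted[i]` is whether
# the unit test(s) accept solution i. See Appendix D of the paper for the exact formulas.
from typing import Sequence


def unit_test_metrics(is_correct: Sequence[bool], accepted: Sequence[bool]) -> dict:
    tp = sum(c and a for c, a in zip(is_correct, accepted))          # correct and accepted
    fp = sum((not c) and a for c, a in zip(is_correct, accepted))    # wrong but accepted
    fn = sum(c and (not a) for c, a in zip(is_correct, accepted))    # correct but rejected
    tn = sum((not c) and (not a) for c, a in zip(is_correct, accepted))

    acc = (tp + tn) / len(is_correct)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0   # wrong solutions that get accepted
    frr = fn / (fn + tp) if (fn + tp) else 0.0   # correct solutions that get rejected
    return {"Acc": acc, "F1": f1, "FAR": far, "FRR": frr}


# Toy usage: 4 candidate solutions, 2 of them actually correct.
print(unit_test_metrics(is_correct=[True, True, False, False],
                        accepted=[True, False, True, False]))
```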
Below we report the quality of individual unit tests and of combinations of multiple unit tests on HumanEval Plus, using Llama3.1-8B as the policy model. For each metric, the best result is shown in bold and the second best is underlined.
| Model | Acc (↑) | F1 (↑) | FAR (↓) | FRR (↓) |
|---|---|---|---|---|
| Quality of Individual Unit Tests | | | | |
| Llama3.1-8B | 60.02 | 44.97 | 13.66 | 46.13 |
| Llama3.1-70B | **73.65** | **70.15** | **11.10** | **34.51** |
| CodeRM-8B (Ours) | <u>69.64</u> | <u>63.63</u> | <u>11.17</u> | <u>38.55</u> |
| Quality of Multiple Unit Tests | | | | |
| Llama3.1-8B | 74.21 | 74.35 | 20.44 | 30.55 |
| Llama3.1-70B | <u>78.30</u> | <u>78.76</u> | <u>17.19</u> | <u>25.97</u> |
| CodeRM-8B (Ours) | **80.46** | **81.27** | **16.48** | **22.71** |
Citation
If you find our model helpful, please cite the original paper:
@misc{ma2025coderm,
    title={Dynamic Scaling of Unit Tests for Code Reward Modeling},
    author={Zeyao Ma and Xiaokang Zhang and Jing Zhang and Jifan Yu and Sijia Luo and Jie Tang},
    year={2025},
    eprint={2501.01054},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2501.01054},
}
Contact
If you have any problems, feel free to raise an issue or reach out to us via email at: [email protected].