File size: 6,076 Bytes
e066062 9c39e4f e066062 13f7659 e066062 b91222a e066062 b91222a e066062 3511ed5 e066062 1c3e7c4 2767715 b91222a dd9b7c5 b91222a 9dc8592 b91222a 9dc8592 b91222a 18640c6 b91222a 18640c6 d11053e b91222a 9c39e4f |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 |
---
license: apache-2.0
datasets:
- KAKA22/CodeRM-UnitTest
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- code
- llama
library_name: transformers
---
# Introduction
CodeRM-8B is a small yet powerful model designed to enable efficient and high-quality unit test generation.
It is trained on a dataset of 60k high-quality synthetic Python unit tests using Llama3.1-70B-Instruct.
These unit tests are synthesized based on two well-regarded code instruction tuning datasets:
[CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) and the
training set of [TACO](https://huggingface.co/datasets/BAAI/TACO).
The training dataset used for unit test generation is openly available under
[CodeRM-UnitTest](https://huggingface.co/datasets/KAKA22/CodeRM-UnitTest).
For further information and details of training, refer to our paper:
"Dynamic Scaling of Unit Tests for Code Reward Modeling" available on [arXiv](https://arxiv.org/abs/2501.01054).
You can also visit the [homepage](https://code-reward-model.github.io/) and the github [repo](https://github.com/RUCKBReasoning/CodeRM) of the paper.
# Model Information
The model is trained based on [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
# Prompt Format
```
Below is a question and it's corresponding code answer. Please write test cases to check the correctness of the code answer. You need to use the unittest library in Python and create a test class for testing.
### question
{question}
### code solution
{code in function format}
Please add detailed comments to the test cases you write. You do not need to test the function's ability to throw exceptions.
```
# Performance
## Best-of-N
First, we evaluate the performance of CodeRM-8B using a best-of-N setting. In this setup, an LLM (policy model) generates
100 candidate code solutions for a given programming problem, while another LLM (reward model) generates 100 unit
tests. The optimal code solution is then selected based on majority voting derived from the execution results
of these unit tests.
Under this framework, our trained unit test generator demonstrates performance comparable to Llama3.1-70B-Instruct,
despite having an 8x smaller parameter size. The detailed evaluation results across three well-known benchmarks
are as follows:
| Model | Policy: Llama3-8B | Policy: Llama3-70B | Policy: GPT-3.5 | Policy: GPT-4o-mini |
| :------ | :------ | :------ | :------ | :------ |
| **Benchmark: HumanEval Plus** |||||
| Vanilla | 53.58 | 73.74 | 67.83 | 82.96 |
| Reward: Llama3.1-8B | 66.84 (+13.26) | 77.14 (+3.40) | 76.32 (+8.49) | 83.11 (+0.15) |
| Reward: Llama3.1-70B | **72.04 (+18.46)** | <u>78.54 (+4.80</u>) | **79.76 (+11.93)** | <u>85.45 (+2.49</u>) |
| Reward: CodeRM-8B | <u>72.01 (+18.43</u>) | **78.69 (+4.95)** | <u>78.01 (+10.18</u>) | **86.38 (+3.42)** |
| **Benchmark: MBPP Plus** |||||
| Vanilla | 49.20 | 69.33 | 70.53 | 71.59 |
| Reward: Llama3.1-8B | 64.31 (+15.11) | 71.64 (+2.31) | 74.18 (+3.65) | 74.48 (+2.89) |
| Reward: Llama3.1-70B | <u>65.26 (+16.06</u>) | <u>71.85 (+2.52</u>) | <u>75.72 (+5.19</u>) | <u>74.96 (+3.37</u>) |
| Reward: CodeRM-8B | **66.71 (+17.51)** | **72.44 (+3.11)** | **75.96 (+5.43)** | **75.20 (+3.61)** |
| **Benchmark: LiveCodeBench** |||||
| Vanilla | 11.98 | 25.30 | 20.55 | 34.83 |
| Reward: Llama3.1-70B | <u>13.28 (+1.30</u>) | **28.46 (+3.16)** | **22.80 (+2.25)** | <u>38.60 (+3.77</u>) |
| Reward: CodeRM-8B | **15.21 (+3.23)** | <u>27.73 (+2.43</u>)| <u>21.76 (+1.21</u>) | **39.20 (+4.37)** |
## Quality of Unit Test
We evaluate the quality of the unit test generated by CodeRM-8B. As each unit test functions as a classifier to
determine correct or incorrect solutions, we first utilize accuracy and F1 score as metrics to assess the
classification performance of the unit test.
We further propose two new metrics to detailed evaluate the possibility of the unit test making incorrect judgments.
False Acceptance Rate (FAR) measures the probability of wrong solutions being accepted by unit tests.
False Rejection Rate (FRR) measures the probability of correct solutions being rejected by unit tests.
The calculation formulas for these four metrics are introduced in Appendix D of the paper.
Below is the quality of individual unit tests and the combination of multiple unit tests on HumanEval Plus,
utilizing Llama3.1-8B as the policy model. The top two performances are marked in **bold** and _underlined_.
| **Model** | **Acc (↑)** | **F1 (↑)** | **FAR (↓)** | **FRR (↓)** |
|----------------------|---------------|---------------|---------------|---------------|
| **Quality of Individual Unit Tests** | | | | |
| Llama3.1-8B | 60.02 | 44.97 | 13.66 | 46.13 |
| Llama3.1-70B | **73.65** | **70.15** | **11.10** | **34.51** |
| *CodeRM-8B (Ours)* | <u>69.64</u> | <u>63.63</u> | <u>11.17</u> | <u>38.55</u> |
| **Quality of Multiple Unit Tests** | | | | |
| Llama3.1-8B | 74.21 | 74.35 | 20.44 | 30.55 |
| Llama3.1-70B | <u>78.30</u> | <u>78.76</u> | <u>17.19</u> | <u>25.97</u> |
| *CodeRM-8B (Ours)* | **80.46** | **81.27** | **16.48** | **22.71** |
# Citation
If you find our model helpful, please cite the original paper:
```
@misc{ma2025coderm,
title={Dynamic Scaling of Unit Tests for Code Reward Modeling},
author={Zeyao Ma and Xiaokang Zhang and Jing Zhang and Jifan Yu and Sijia Luo and Jie Tang},
year={2025},
eprint={2501.01054},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.01054},
}
```
# Contact
If you have any problems, feel free to raise an issue or reach out to us via email at: <[email protected]>. |