Coffee-Gym committed
Upload index.html

index.html CHANGED (+9 -7)
@@ -86,8 +86,11 @@
   <!-- @PAN TODO: change links -->
   <a href="https://huggingface.co/spaces/Anonymous-COFFEE/Project-COFFEE/blob/main/static/ACL24__Code_Edit.pdf"
      class="external-link button is-normal is-rounded is-dark" target="_blank">
-    <span class="icon">
+    <!-- <span class="icon">
       <i class="fas fa-file-pdf"></i>
+    </span> -->
+    <span class="icon">
+      <p style="font-size:18px">📝</p>
     </span>
     <span>Paper</span>
   </a>
@@ -189,7 +192,7 @@
   <!-- Abstract. -->
   <div class="columns is-centered has-text-centered">
     <div class="column is-four-fifths">
-      <h2 class="title is-3">🔔News</h2>
+      <h2 class="title is-3">🔔 News</h2>
       <div class="content has-text-justified">

@@ -219,7 +222,7 @@
   <h2 class="title is-3">Introduction</h2>
   <div class="content has-text-justified">
     <p>
-      This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available
+      This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
     </p>
   </div>
 </div>
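The abstract in the hunk above describes COFFEEEVAL as a reward function that judges feedback by how well the revised code performs on unit tests. A minimal sketch of that idea, assuming a plain subprocess harness and a pass-rate reward in [0, 1]; the helper names, data layout, and reward shape are assumptions here, not the authors' implementation:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_unit_test(code: str, test: str, timeout: float = 5.0) -> bool:
    """Execute a candidate solution plus one test snippet in a subprocess.

    Hypothetical helper: a zero exit code (no failed assert, no crash,
    no timeout) counts as a pass.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n" + test + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


def unit_test_reward(revised_code: str, tests: list[str]) -> float:
    """Pass rate of the revised code over the question's unit tests.

    One plausible shape for a COFFEEEVAL-style reward: feedback that
    leads to a better revision earns a higher score.
    """
    if not tests:
        return 0.0
    return sum(run_unit_test(revised_code, t) for t in tests) / len(tests)
```

In an RL setup like the one the abstract sketches, this pass rate would serve as the scalar reward for the feedback that produced `revised_code`.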
@@ -313,16 +316,15 @@ The strong performance of our CoffeeEval validates its effectiveness in assessin

   <div class="content has-text-centered">
     <img src="static/images/main_results.png" alt="algebraic reasoning" width="100%"/>
-    <p> We
-      ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
+    <p> Code editing results of our feedback model trained with Coffee-Gym, i.e., PPO-COFFEEVAL, on HumanEvalFix and COFFEE-Test. We pair our feedback model with an open-source code LLM as the code editor.</p>
   </div>

   <div class="content has-text-justified">
-    <p>
+    <!-- <p>
       Table above reports the model performance in editing solutions generated from ChatGPT for problems in HumanEvalSynthesize, MBPP, and APPS. CoffeePots outperforms all open-source baselines, including Code Llama (13B), the previous SOTA among open-source code LLMs. Furthermore, CoffeePots shows better results than feedback-augmented Code Llama (13B), i.e., prompted with Self-Refine and Self-Debug, suggesting the effectiveness of our strategy on generating feedback.
       In addition, while some open-source code LLMs show almost no improvement in MBPP and APPS (i.e., 0% ERR), CoffeePots shows moderate improvements on these benchmarks (i.e., up to 7.5% ERR).
       Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competitive results particularly on HumanEvalSynthesize and MBPP, showing that our framework can serve as a strong alternative to closed-source LLMs while being publicly available and much smaller in size.
-    </p>
+    </p> -->
   </div>

 </div>
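The replacement caption in the hunk above pairs the trained feedback model (PPO-COFFEEVAL) with a separate open-source code LLM acting as the editor. A rough sketch of that critique-then-edit pipeline, assuming generic prompt-to-completion callables for both models; the function names and prompt wording are illustrative, not from this repository:

```python
from typing import Callable

# Hypothetical interface: each model maps a prompt string to a completion.
GenerateFn = Callable[[str], str]


def feedback_then_edit(
    question: str,
    wrong_code: str,
    feedback_model: GenerateFn,
    editor_model: GenerateFn,
) -> str:
    """One feedback-augmented editing step: critique first, then revise."""
    # Stage 1: the feedback model explains what is wrong with the code.
    feedback = feedback_model(
        f"Problem:\n{question}\n\nIncorrect solution:\n{wrong_code}\n\n"
        "Explain what is wrong and how to fix it."
    )
    # Stage 2: the editor LLM rewrites the code, conditioned on the feedback.
    return editor_model(
        f"Problem:\n{question}\n\nIncorrect solution:\n{wrong_code}\n\n"
        f"Feedback:\n{feedback}\n\nRewrite the solution so it is correct."
    )
```

Keeping the critic and the editor as two separate models is the point of the caption: the feedback model is small and trainable with the COFFEEEVAL reward, while the editor can be any off-the-shelf code LLM.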