Coffee-Gym committed
Commit fb38f03 · verified · 1 Parent(s): 027fc14

Upload index.html

Files changed (1):
  1. index.html +9 -7
index.html CHANGED
@@ -86,8 +86,11 @@
  <!-- @PAN TODO: change links -->
  <a href="https://huggingface.co/spaces/Anonymous-COFFEE/Project-COFFEE/blob/main/static/ACL24__Code_Edit.pdf"
    class="external-link button is-normal is-rounded is-dark" target="_blank">
- <span class="icon">
+ <!-- <span class="icon">
  <i class="fas fa-file-pdf"></i>
+ </span> -->
+ <span class="icon">
+ <p style="font-size:18px">📝</p>
  </span>
  <span>Paper</span>
  </a>
@@ -189,7 +192,7 @@
  <!-- Abstract. -->
  <div class="columns is-centered has-text-centered">
  <div class="column is-four-fifths">
- <h2 class="title is-3">🔔News</h2>
+ <h2 class="title is-3">🔔 News</h2>
  <div class="content has-text-justified">


@@ -219,7 +222,7 @@
  <h2 class="title is-3">Introduction</h2>
  <div class="content has-text-justified">
  <p>
- This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available
+ This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
  </p>
  </div>
  </div>
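Aside: the abstract's central mechanism is rewarding feedback by how well the code revised under that feedback performs on unit tests. Below is a minimal sketch of such a pass-rate reward, assuming each test is a (stdin, expected-stdout) pair and that solutions are standalone Python scripts; the function name and scoring scheme are illustrative assumptions, not the authors' COFFEEEVAL implementation.

    import os
    import subprocess
    import tempfile

    def unit_test_reward(revised_code: str,
                         test_cases: list[tuple[str, str]],
                         timeout: float = 5.0) -> float:
        """Reward = fraction of unit tests the revised code passes.

        Feedback that leads to a better edit earns a higher scalar
        reward, which is what an RL trainer (e.g., PPO) consumes.
        """
        passed = 0
        for stdin_data, expected in test_cases:
            # Write the candidate program to a temporary file.
            with tempfile.NamedTemporaryFile("w", suffix=".py",
                                             delete=False) as f:
                f.write(revised_code)
                path = f.name
            try:
                result = subprocess.run(
                    ["python", path], input=stdin_data,
                    capture_output=True, text=True, timeout=timeout,
                )
                if result.stdout.strip() == expected.strip():
                    passed += 1
            except subprocess.TimeoutExpired:
                pass  # a hanging program counts as a failed test
            finally:
                os.unlink(path)
        return passed / len(test_cases)  # reward in [0, 1]

In an RL loop, this scalar would be attributed to the feedback model whose critique guided the edit.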
@@ -313,16 +316,15 @@ The strong performance of our CoffeeEval validates its effectiveness in assessin

  <div class="content has-text-centered">
  <img src="static/images/main_results.png" alt="algebraic reasoning" width="100%"/>
- <p> We use
- ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
+ <p> Code editing results of our feedback model trained with Coffee-Gym, i.e., PPO-COFFEEVAL, on HumanEvalFix and COFFEE-Test. We pair our feedback model with an open-source code LLM as the code editor.</p>
  </div>

  <div class="content has-text-justified">
- <p>
+ <!-- <p>
  Table above reports the model performance in editing solutions generated from ChatGPT for problems in HumanEvalSynthesize, MBPP, and APPS. CoffeePots outperforms all open-source baselines, including Code Llama (13B), the previous SOTA among open-source code LLMs. Furthermore, CoffeePots shows better results than feedback-augmented Code Llama (13B), i.e., prompted with Self-Refine and Self-Debug, suggesting the effectiveness of our strategy on generating feedback.
  In addition, while some open-source code LLMs show almost no improvement in MBPP and APPS (i.e., 0% ERR), CoffeePots shows moderate improvements on these benchmarks (i.e., up to 7.5% ERR).
  Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competitive results particularly on HumanEvalSynthesize and MBPP, showing that our framework can serve as a strong alternative to closed-source LLMs while being publicly available and much smaller in size.
- </p>
+ </p> -->
  </div>

  </div>
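Aside: the paragraph commented out above quotes ERR figures (0% to 7.5%) without defining ERR on this page. Assuming it denotes an error-resolve rate, i.e., the share of initially failing solutions that pass their tests after editing (an assumption, not a definition from the source), a hypothetical computation would be:

    def error_resolve_rate(passed_before: list[bool],
                           passed_after: list[bool]) -> float:
        """Assumed reading of ERR: among solutions that failed their
        unit tests before editing, the percentage that pass after."""
        failing = [i for i, ok in enumerate(passed_before) if not ok]
        if not failing:
            return 0.0  # nothing to resolve
        fixed = sum(1 for i in failing if passed_after[i])
        return 100.0 * fixed / len(failing)

Under this reading, 0% ERR would mean a model repaired none of the erroneous solutions in a benchmark.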
 