Coffee-Gym committed on
Commit
027fc14
·
verified ·
1 Parent(s): 814fe0c

Upload index.html

Files changed (1)
  1. index.html +20 -52
index.html CHANGED
@@ -61,11 +61,11 @@
61
  <div class="columns is-centered">
62
  <div class="column has-text-centered">
63
  <h1 class="title is-1 publication-title is-bold">
64
- <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
65
- <span class="opencodeinterpreter" style="vertical-align: middle">Coffee</span>
66
  </h1>
67
  <h2 class="subtitle is-3 publication-subtitle">
68
- Boost Your Code LLMs by Fixing Bugs with Feedback
69
  </h2>
70
 
71
 
@@ -134,13 +134,12 @@
134
 
135
  </div>
136
 
137
- <div class="links-row">
138
  <span class="link-block">
139
  <a href="#mainresults"
140
  class="external-link button is-normal is-rounded is-dark">
141
  <span class="icon has-text-white">
142
  <i class="fa-solid fa-trophy"></i>
143
- <!-- <p style="font-size:18px">🏆</p> -->
144
  </span>
145
  <span>Main Results</span>
146
  </a>
@@ -148,7 +147,6 @@
148
 
149
 
150
 
151
- <!-- EvalAI Link. -->
152
  <span class="link-block">
153
  <a href="#example"
154
  class="external-link button is-normal is-rounded is-dark">
@@ -159,7 +157,7 @@
159
  </a>
160
  </span>
161
 
162
- </div>
163
 
164
  </div>
165
  </div>
@@ -178,8 +176,8 @@
178
  <section class="hero teaser">
179
  <div class="container is-max-desktop">
180
  <div class="content has-text-centered">
181
- <img src="static/images/figure1.svg" alt="geometric reasoning" width="95%"/>
182
- <p> (Left) A motivating example of code editing with natural language feedback. With appropriate feedback, the editor model can produce a correct editing with the help of the feedback. (Right) By evaluating CoffeePots on HumanEvalFix, a benchmark that aims to assess code editing abilities, we show <b>our approach outperforms GPT-4</b>. </p>
183
  </div>
184
  <!-- </div> -->
185
  </div>
@@ -221,7 +219,7 @@
221
  <h2 class="title is-3">Introduction</h2>
222
  <div class="content has-text-justified">
223
  <p>
224
- This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYMincludes two major components: (1) COFFEE, a dataset containing humans' codeedit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfullyreflects the helpfulness of feedback by assess-ing the performance of the revised code in unittests. With them, COFFEE-GYM addresses theunavailability of high-quality datasets for train-ing feedback models with RL, and providesmore accurate rewards than the SOTA rewardmodel (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outper-form baselines in enhancing open-source code LLMs' code editing, making them comparablewith closed-source LLMs. We make the datasetand the model checkpoint publicly available
225
  </p>
226
  </div>
227
  </div>
@@ -235,8 +233,7 @@
235
  <div class="hero-body has-text-centered">
236
  <h1 class="title is-1 mmmu">
237
  <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
238
- <span class="mmmu" style="vertical-align: middle">Coffee: dataset
239
- for COde Fixing with FEEdback</span>
240
  </h1>
241
  </div>
242
  </section>
@@ -249,17 +246,12 @@
249
  <div class="column is-four-fifths">
250
  <h2 class="title is-3">Overview</h2>
251
  <div class="content has-text-centered">
252
- <img src="static/images/coffee_example.svg" alt="algebraic reasoning" class="center" style="width:40%">
253
  <p> An example instance from COFFEE dataset.</p>
254
  </div>
255
  <div class="content has-text-justified">
256
  <p>
257
- Recent large language models have demonstrated promising capabilities in correcting their code based on natural language feedback. However, this ability is currently limited to closed-source models (e.g., GPT-3.5-Turbo and GPT-4) and <b>does not extend to open-source models</b>. This poses a significant safety and privacy concern when code cannot be uploaded to any external server (e.g., the OpenAI API server).
258
- </p>
259
-
260
- <p>
261
- To this end, we introduce ☕COFFEE, a dataset for <b>code editing with feedback</b>.
262
- Our dataset includes diverse solutions to programming problems collected from an online competitive programming platform. For each solution, we additionally annotate natural language feedback to provide detailed explanations for the errors towards correct edits, and augment synthetic test cases to measure the correctness of the edited solutions.
263
  </p>
264
  </div>
265
  </div>
@@ -274,9 +266,7 @@
274
  <section class="hero is-light is-small">
275
  <div class="hero-body has-text-centered">
276
  <h1 class="title is-1 mmmu">
277
- <img src="static/images/coffeepot_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
278
- <span class="mmmu" style="vertical-align: middle"> COFFEEPOTS: Aligning Feedback
279
- with Preferred Edits
280
  </span>
281
  </h1>
282
  </div>
@@ -286,40 +276,18 @@ with Preferred Edits
286
  <div class="columns is-centered has-text-centered">
287
  <!-- <div class="column is-full-width has-text-centered"> -->
288
  <div class="column is-four-fifths">
289
- <h2 class="title is-3">Is Training on COFFEE with Supervised Finetuning Enough?</h2>
290
 
291
- <div class="container is-max-desktop">
292
- <div class="content has-text-centered">
293
- <img src="static/images/pass_ratio.svg" alt="algebraic reasoning" width="70%"/>
294
- <p> Pass@1 results of code editing with SFT feedback on the test set of COFFEE compared with editing with ChatGPT feedback and editing without any feedback settings.</p>
295
- </div>
296
- <br/>
297
- </div>
298
- <div class="content has-text-justified">
299
- <p>
300
- <b><h5>No.</h5></b> As we show in the figure above, we find that training a critic model to generate natural language feedback on a given erroneous code with a next-token prediction objective cannot produce helpful feedback. Editing code with feedback from the SFT critic performs even worse than direct editing (i.e., editing w/o feedback). We posit that learning to generate accurate feedback is a very challenging goal that cannot be achieved with a next-token prediction objective alone.
301
- </p>
302
- <br/>
303
- <div class="content has-text-centered">
304
- <img src="static/images/feedback_quality_analysis.svg" alt="algebraic reasoning" width="100%"/>
305
- <p>Human evaluation on the quality of feedback from SFT critic. </p>
306
- </div>
307
- <p>
308
- Further analysis of the feedback quality suggests that the feedback from the SFT-trained critic has not yet reached a satisfactory level. It remains only <b>'Partially Correct'</b> feedback, resulting in the decreased performance in code editing that we show in the previous
309
- analysis.
310
- </p>
311
- </div>
312
 
313
  <br/>
314
  <div class="content has-text-centered">
315
- <img src="static/images/coffeepots.svg" alt="algebraic reasoning" width="100%"/>
316
- <p>Overview of COFFEEPOTs. </p>
317
  </div>
318
 
319
  <div class="content has-text-justified">
320
  <p>
321
- To resolve the aforementioned issue, we introduce COFFEEPOTS, a framework for COde Fixing with FEEdback via Preference-Optimized Tuning and Selection. We first use our COFFEE dataset to train code LLMs via supervised fine-tuning (SFT) for feedback augmented code editing.
322
- Then, we additionally <b>leverage synthetic test cases in COFFEE to annotate preferred (i.e., helpful) solutions and apply preference alignment </b>to guide the generation of helpful feedback.
323
  </p>
324
  </div>
325
 
@@ -344,8 +312,8 @@ with Preferred Edits
344
  <div class="content">
345
 
346
  <div class="content has-text-centered">
347
- <img src="static/images/main_table.svg" alt="algebraic reasoning" width="100%"/>
348
- <p>Performances in editing machine-generated codes. We report pass@1 and ERR (in parentheses). We use
349
  ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
350
  </div>
351
 
@@ -362,7 +330,7 @@ Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competi
362
  </div>
363
  <!-------------------------------------------------------------------- Case Study-------------------------------------------------------------------->
364
 
365
- <div class="columns is-centered has-text-centered">
366
  <div class="column is-four-fifths">
367
  <h2 class="title is-3" id="example">Example</h2>
368
  <iframe
@@ -377,7 +345,7 @@ Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competi
377
  </div>
378
  </div>
379
  </div>
380
- </div>
381
 
382
 
383
  </div>
 
61
  <div class="columns is-centered">
62
  <div class="column has-text-centered">
63
  <h1 class="title is-1 publication-title is-bold">
64
+ <!-- <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/> -->
65
+ <span class="opencodeinterpreter" style="vertical-align: middle">Coffee-Gym</span>
66
  </h1>
67
  <h2 class="subtitle is-3 publication-subtitle">
68
+ An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
69
  </h2>
70
 
71
 
 
134
 
135
  </div>
136
 
137
+ <!-- <div class="links-row">
138
  <span class="link-block">
139
  <a href="#mainresults"
140
  class="external-link button is-normal is-rounded is-dark">
141
  <span class="icon has-text-white">
142
  <i class="fa-solid fa-trophy"></i>
 
143
  </span>
144
  <span>Main Results</span>
145
  </a>
 
147
 
148
 
149
 
 
150
  <span class="link-block">
151
  <a href="#example"
152
  class="external-link button is-normal is-rounded is-dark">
 
157
  </a>
158
  </span>
159
 
160
+ </div> -->
161
 
162
  </div>
163
  </div>
 
176
  <section class="hero teaser">
177
  <div class="container is-max-desktop">
178
  <div class="content has-text-centered">
179
+ <img src="static/images/comparison_w_prev.svg" alt="geometric reasoning" width="95%"/>
180
+ <p> Comparison between COFFEE-GYM and the previous approach. </p>
181
  </div>
182
  <!-- </div> -->
183
  </div>
 
219
  <h2 class="title is-3">Introduction</h2>
220
  <div class="content has-text-justified">
221
  <p>
222
+ This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
223
  </p>
224
  </div>
225
  </div>
 
233
  <div class="hero-body has-text-centered">
234
  <h1 class="title is-1 mmmu">
235
  <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
236
+ <span class="mmmu" style="vertical-align: middle">COFFEE: Human-written Code Edit Traces with Annotated Pairwise Feedback</span>
 
237
  </h1>
238
  </div>
239
  </section>
 
246
  <div class="column is-four-fifths">
247
  <h2 class="title is-3">Overview</h2>
248
  <div class="content has-text-centered">
249
+ <img src="static/images/data.svg" alt="example instance of coffee dataset" class="center" style="width:50%">
250
  <p> An example instance from COFFEE dataset.</p>
251
  </div>
252
  <div class="content has-text-justified">
253
  <p>
254
+ We curate ☕️ COFFEE, a dataset of code fixing with feedback, from human-written code edit traces. Coffee consists of problems of diverse levels of difficulty, including challenging problems that only human programmers can solve, and provides test cases for reward functions.
255
  </p>
256
  </div>
257
  </div>
 
266
  <section class="hero is-light is-small">
267
  <div class="hero-body has-text-centered">
268
  <h1 class="title is-1 mmmu">
269
+ <span class="mmmu" style="vertical-align: middle">COFFEEEVAL: Unit-test-driven Feedback Evaluation
 
 
270
  </span>
271
  </h1>
272
  </div>
 
276
  <div class="columns is-centered has-text-centered">
277
  <!-- <div class="column is-full-width has-text-centered"> -->
278
  <div class="column is-four-fifths">
 
279
 
280
 
281
  <br/>
282
  <div class="content has-text-centered">
283
+ <img src="static/images/coffeeeval_results.png" alt="evaluation results on CoffeeEval" width="100%"/>
284
+ <p>Performance of our evaluation protocol on the test sets of Coffee compared to the baselines. Wrong Feedback is abbreviated as WF due to limited space.</p>
285
  </div>
286
 
287
  <div class="content has-text-justified">
288
  <p>
289
+ DeepSeek-CoffeeEval achieves higher Pearson correlation and lower MSE than all G-Eval and Editing baselines. In particular, our approach shows even higher correlation than the G-Eval baseline implemented with GPT-4-Turbo.
290
+ The strong performance of our CoffeeEval validates its effectiveness in assessing the quality of NL feedback in the code editing task.
291
  </p>
292
  </div>
293
 
 
312
  <div class="content">
313
 
314
  <div class="content has-text-centered">
315
+ <img src="static/images/main_results.png" alt="algebraic reasoning" width="100%"/>
316
+ <p> We use
317
  ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
318
  </div>
319
 
 
330
  </div>
331
  <!-------------------------------------------------------------------- Case Study-------------------------------------------------------------------->
332
 
333
+ <!-- <div class="columns is-centered has-text-centered">
334
  <div class="column is-four-fifths">
335
  <h2 class="title is-3" id="example">Example</h2>
336
  <iframe
 
345
  </div>
346
  </div>
347
  </div>
348
+ </div> -->
349
 
350
 
351
  </div>
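
Note on the COFFEEEVAL text added in this commit: the page describes a reward that reflects how helpful feedback is by checking the revised code against unit tests, and it compares evaluators by Pearson correlation and MSE. The Python sketch below illustrates that idea only. Every name in it (`run_program`, `pass_rate`, `feedback_reward`, `toy_editor`, the sample program and tests) is hypothetical and not taken from the Coffee-Gym code. The real editor is a code LLM, replaced here by a string-rewriting stub, and in-process `exec` is used purely for brevity; untrusted code should run in a sandbox.

```python
import contextlib
import io
import math
import sys
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[str, str]  # (stdin text, expected stdout)


def run_program(code: str, stdin_text: str) -> Optional[str]:
    """Run `code` in-process with fake stdin; return its stdout, or None on error."""
    buffer = io.StringIO()
    real_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__main__"})  # assumption: trusted toy code only
    except (Exception, SystemExit):
        return None
    finally:
        sys.stdin = real_stdin
    return buffer.getvalue()


def pass_rate(code: str, tests: List[TestCase]) -> float:
    """Fraction of test cases whose expected output the program reproduces."""
    if not tests:
        return 0.0
    passed = 0
    for stdin_text, expected in tests:
        output = run_program(code, stdin_text)
        if output is not None and output.strip() == expected.strip():
            passed += 1
    return passed / len(tests)


def feedback_reward(wrong_code: str, feedback: str,
                    editor: Callable[[str, str], str],
                    tests: List[TestCase]) -> float:
    """Score feedback by the unit-test pass rate of the code revised with it."""
    revised = editor(wrong_code, feedback)
    return pass_rate(revised, tests)


def toy_editor(code: str, feedback: str) -> str:
    """Stand-in for an editor model: applies one hard-coded fix if asked to add."""
    if "add" in feedback.lower():
        return code.replace("a - b", "a + b")
    return code


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between predicted and reference scores."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * \
        math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / norm if norm else 0.0


def mse(xs: List[float], ys: List[float]) -> float:
    """Mean squared error between predicted and reference scores."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


BUGGY = "a, b = map(int, input().split())\nprint(a - b)"
TESTS = [("1 2", "3"), ("5 5", "10"), ("0 0", "0")]

helpful = feedback_reward(BUGGY, "You should add b instead of subtracting it.",
                          toy_editor, TESTS)
unhelpful = feedback_reward(BUGGY, "Rename the variables.", toy_editor, TESTS)
```

With these toy inputs, the helpful feedback leads to a revision that passes all three tests, while the unhelpful feedback leaves the bug in place and only the `0 0` case passes. `pearson` and `mse` show how such scores could be compared against reference helpfulness labels.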