Coffee-Gym
committed on
Upload index.html
index.html +20 -52
index.html
CHANGED
@@ -61,11 +61,11 @@
|
|
61 |
<div class="columns is-centered">
|
62 |
<div class="column has-text-centered">
|
63 |
<h1 class="title is-1 publication-title is-bold">
|
64 |
-
<img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
|
65 |
-
<span class="opencodeinterpreter" style="vertical-align: middle">Coffee</span>
|
66 |
</h1>
|
67 |
<h2 class="subtitle is-3 publication-subtitle">
|
68 |
-
|
69 |
</h2>
|
70 |
|
71 |
|
@@ -134,13 +134,12 @@
|
|
134 |
|
135 |
</div>
|
136 |
|
137 |
-
<div class="links-row">
|
138 |
<span class="link-block">
|
139 |
<a href="#mainresults"
|
140 |
class="external-link button is-normal is-rounded is-dark">
|
141 |
<span class="icon has-text-white">
|
142 |
<i class="fa-solid fa-trophy"></i>
|
143 |
-
<!-- <p style="font-size:18px">🏆</p> -->
|
144 |
</span>
|
145 |
<span>Main Results</span>
|
146 |
</a>
|
@@ -148,7 +147,6 @@
|
|
148 |
|
149 |
|
150 |
|
151 |
-
<!-- EvalAI Link. -->
|
152 |
<span class="link-block">
|
153 |
<a href="#example"
|
154 |
class="external-link button is-normal is-rounded is-dark">
|
@@ -159,7 +157,7 @@
|
|
159 |
</a>
|
160 |
</span>
|
161 |
|
162 |
-
</div>
|
163 |
|
164 |
</div>
|
165 |
</div>
|
@@ -178,8 +176,8 @@
|
|
178 |
<section class="hero teaser">
|
179 |
<div class="container is-max-desktop">
|
180 |
<div class="content has-text-centered">
|
181 |
-
<img src="static/images/
|
182 |
-
<p>
|
183 |
</div>
|
184 |
<!-- </div> -->
|
185 |
</div>
|
@@ -221,7 +219,7 @@
|
|
221 |
<h2 class="title is-3">Introduction</h2>
|
222 |
<div class="content has-text-justified">
|
223 |
<p>
|
224 |
-
This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-
|
225 |
</p>
|
226 |
</div>
|
227 |
</div>
|
@@ -235,8 +233,7 @@
|
|
235 |
<div class="hero-body has-text-centered">
|
236 |
<h1 class="title is-1 mmmu">
|
237 |
<img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
|
238 |
-
<span class="mmmu" style="vertical-align: middle">
|
239 |
-
for COde Fixing with FEEdback</span>
|
240 |
</h1>
|
241 |
</div>
|
242 |
</section>
|
@@ -249,17 +246,12 @@
|
|
249 |
<div class="column is-four-fifths">
|
250 |
<h2 class="title is-3">Overview</h2>
|
251 |
<div class="content has-text-centered">
|
252 |
-
<img src="static/images/
|
253 |
<p> An example instance from COFFEE dataset.</p>
|
254 |
</div>
|
255 |
<div class="content has-text-justified">
|
256 |
<p>
|
257 |
-
|
258 |
-
</p>
|
259 |
-
|
260 |
-
<p>
|
261 |
-
To this end, we introduce ☕COFFEE, a dataset for <b>code editing with feedback</b>.
|
262 |
-
Our dataset includes diverse solutions to programming problems collected from an online competitive programming platform. For each solution, we additionally annotate natural language feedback to provide detailed explanations for the errors towards correct edits, and augment synthetic test cases to measure the correctness of the edited solutions.
|
263 |
</p>
|
264 |
</div>
|
265 |
</div>
|
@@ -274,9 +266,7 @@
|
|
274 |
<section class="hero is-light is-small">
|
275 |
<div class="hero-body has-text-centered">
|
276 |
<h1 class="title is-1 mmmu">
|
277 |
-
<
|
278 |
-
<span class="mmmu" style="vertical-align: middle"> COFFEEPOTS: Aligning Feedback
|
279 |
-
with Preferred Edits
|
280 |
</span>
|
281 |
</h1>
|
282 |
</div>
|
@@ -286,40 +276,18 @@ with Preferred Edits
|
|
286 |
<div class="columns is-centered has-text-centered">
|
287 |
<!-- <div class="column is-full-width has-text-centered"> -->
|
288 |
<div class="column is-four-fifths">
|
289 |
-
<h2 class="title is-3">Is Training on COFFEE with Supervised Finetuning Enough?</h2>
|
290 |
|
291 |
-
<div class="container is-max-desktop">
|
292 |
-
<div class="content has-text-centered">
|
293 |
-
<img src="static/images/pass_ratio.svg" alt="algebraic reasoning" width="70%"/>
|
294 |
-
<p> Pass@1 results of code editing with SFT feedback on the test set of COFFEE compared with editing with ChatGPT feedback and editing without any feedback settings.</p>
|
295 |
-
</div>
|
296 |
-
<br/>
|
297 |
-
</div>
|
298 |
-
<div class="content has-text-justified">
|
299 |
-
<p>
|
300 |
-
<b><h5>No.</h5></b> As we show in the figure above, we find that training a critic model to generate natural language feedback on a given erroneous code with a next-token prediction objective cannot produce helpful feedback. Editing code with feedback from the SFT critic performs even worse than direct editing (i.e., editing w/o feedback). We posit that learning to generate accurate feedback is a very challenging goal that cannot be achieved with a next-token prediction objective alone.
|
301 |
-
</p>
|
302 |
-
<br/>
|
303 |
-
<div class="content has-text-centered">
|
304 |
-
<img src="static/images/feedback_quality_analysis.svg" alt="algebraic reasoning" width="100%"/>
|
305 |
-
<p>Human evaluation on the quality of feedback from SFT critic. </p>
|
306 |
-
</div>
|
307 |
-
<p>
|
308 |
-
Further analysis of the feedback quality suggests that the feedback from the SFT-trained critic has not yet reached a satisfactory level. It remains only <b>'Partially Correct'</b> feedback, resulting in the decreased performance in code editing as we show in the previous
|
309 |
-
analysis.
|
310 |
-
</p>
|
311 |
-
</div>
|
312 |
|
313 |
<br/>
|
314 |
<div class="content has-text-centered">
|
315 |
-
<img src="static/images/
|
316 |
-
<p>
|
317 |
</div>
|
318 |
|
319 |
<div class="content has-text-justified">
|
320 |
<p>
|
321 |
-
|
322 |
-
|
323 |
</p>
|
324 |
</div>
|
325 |
|
@@ -344,8 +312,8 @@ with Preferred Edits
|
|
344 |
<div class="content">
|
345 |
|
346 |
<div class="content has-text-centered">
|
347 |
-
<img src="static/images/
|
348 |
-
<p>
|
349 |
ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
|
350 |
</div>
|
351 |
|
@@ -362,7 +330,7 @@ Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competi
|
|
362 |
</div>
|
363 |
<!-------------------------------------------------------------------- Case Study-------------------------------------------------------------------->
|
364 |
|
365 |
-
<div class="columns is-centered has-text-centered">
|
366 |
<div class="column is-four-fifths">
|
367 |
<h2 class="title is-3" id="example">Example</h2>
|
368 |
<iframe
|
@@ -377,7 +345,7 @@ Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competi
|
|
377 |
</div>
|
378 |
</div>
|
379 |
</div>
|
380 |
-
</div>
|
381 |
|
382 |
|
383 |
</div>
|
|
|
61 |
<div class="columns is-centered">
|
62 |
<div class="column has-text-centered">
|
63 |
<h1 class="title is-1 publication-title is-bold">
|
64 |
+
<!-- <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/> -->
|
65 |
+
<span class="opencodeinterpreter" style="vertical-align: middle">Coffee-Gym</span>
|
66 |
</h1>
|
67 |
<h2 class="subtitle is-3 publication-subtitle">
|
68 |
+
An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
|
69 |
</h2>
|
70 |
|
71 |
|
|
|
134 |
|
135 |
</div>
|
136 |
|
137 |
+
<!-- <div class="links-row">
|
138 |
<span class="link-block">
|
139 |
<a href="#mainresults"
|
140 |
class="external-link button is-normal is-rounded is-dark">
|
141 |
<span class="icon has-text-white">
|
142 |
<i class="fa-solid fa-trophy"></i>
|
|
|
143 |
</span>
|
144 |
<span>Main Results</span>
|
145 |
</a>
|
|
|
147 |
|
148 |
|
149 |
|
|
|
150 |
<span class="link-block">
|
151 |
<a href="#example"
|
152 |
class="external-link button is-normal is-rounded is-dark">
|
|
|
157 |
</a>
|
158 |
</span>
|
159 |
|
160 |
+
</div> -->
|
161 |
|
162 |
</div>
|
163 |
</div>
|
|
|
176 |
<section class="hero teaser">
|
177 |
<div class="container is-max-desktop">
|
178 |
<div class="content has-text-centered">
|
179 |
+
<img src="static/images/comparison_w_prev.svg" alt="comparison between COFFEE-GYM and the previous approach" width="95%"/>
|
180 |
+
<p> Comparison between COFFEE-GYM and the previous approach. </p>
|
181 |
</div>
|
182 |
<!-- </div> -->
|
183 |
</div>
|
|
|
219 |
<h2 class="title is-3">Introduction</h2>
|
220 |
<div class="content has-text-justified">
|
221 |
<p>
|
222 |
+
This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
|
223 |
</p>
|
224 |
</div>
|
225 |
</div>
|
|
|
233 |
<div class="hero-body has-text-centered">
|
234 |
<h1 class="title is-1 mmmu">
|
235 |
<img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
|
236 |
+
<span class="mmmu" style="vertical-align: middle">COFFEE: Human-written Code Edit Traces with Annotated Pairwise Feedback</span>
|
|
|
237 |
</h1>
|
238 |
</div>
|
239 |
</section>
|
|
|
246 |
<div class="column is-four-fifths">
|
247 |
<h2 class="title is-3">Overview</h2>
|
248 |
<div class="content has-text-centered">
|
249 |
+
<img src="static/images/data.svg" alt="example instance of coffee dataset" class="center" style="width:50%">
|
250 |
<p> An example instance from COFFEE dataset.</p>
|
251 |
</div>
|
252 |
<div class="content has-text-justified">
|
253 |
<p>
|
254 |
+
We curate ☕️ COFFEE, a dataset of code fixing with feedback, from human-written code edit traces. COFFEE consists of problems of diverse difficulty levels, including challenging problems that only human programmers can solve, and provides test cases for reward functions.
|
255 |
</p>
|
256 |
</div>
|
257 |
</div>
|
|
|
266 |
<section class="hero is-light is-small">
|
267 |
<div class="hero-body has-text-centered">
|
268 |
<h1 class="title is-1 mmmu">
|
269 |
+
<span class="mmmu" style="vertical-align: middle">COFFEEEVAL: Unit-test-driven Feedback Evaluation
|
|
|
|
|
270 |
</span>
|
271 |
</h1>
|
272 |
</div>
|
|
|
276 |
<div class="columns is-centered has-text-centered">
|
277 |
<!-- <div class="column is-full-width has-text-centered"> -->
|
278 |
<div class="column is-four-fifths">
|
|
|
279 |
|
280 |
|
281 |
<br/>
|
282 |
<div class="content has-text-centered">
|
283 |
+
<img src="static/images/coffeeeval_results.png" alt="evaluation results on coffeeeval" width="100%"/>
|
284 |
+
<p>Performance of our evaluation protocol on the test sets of Coffee compared to the baselines. Wrong Feedback is abbreviated as WF due to limited space.</p>
|
285 |
</div>
|
286 |
|
287 |
<div class="content has-text-justified">
|
288 |
<p>
|
289 |
+
DeepSeek-CoffeeEval achieves higher Pearson correlation and lower MSE than all G-Eval and Editing baselines. In particular, our approach shows even higher correlation than the G-Eval baseline implemented with GPT-4-Turbo.
|
290 |
+
The strong performance of CoffeeEval validates its effectiveness in assessing the quality of NL feedback in the code editing task.
|
291 |
</p>
|
292 |
</div>
|
293 |
|
|
|
312 |
<div class="content">
|
313 |
|
314 |
<div class="content has-text-centered">
|
315 |
+
<img src="static/images/main_results.png" alt="algebraic reasoning" width="100%"/>
|
316 |
+
<p> We use
|
317 |
ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
|
318 |
</div>
|
319 |
|
|
|
330 |
</div>
|
331 |
<!-------------------------------------------------------------------- Case Study-------------------------------------------------------------------->
|
332 |
|
333 |
+
<!-- <div class="columns is-centered has-text-centered">
|
334 |
<div class="column is-four-fifths">
|
335 |
<h2 class="title is-3" id="example">Example</h2>
|
336 |
<iframe
|
|
|
345 |
</div>
|
346 |
</div>
|
347 |
</div>
|
348 |
+
</div> -->
|
349 |
|
350 |
|
351 |
</div>
|