Spaces: Coffee-Gym / Project-Coffee-Gym

Coffee-Gym committed on Jul 1, 2024

Commit 027fc14 (verified) · 1 Parent(s): 814fe0c

Upload index.html

Files changed (1)

index.html +20 -52

@@ -61,11 +61,11 @@
       <div class="columns is-centered">
         <div class="column has-text-centered">
           <h1 class="title is-1 publication-title is-bold">
-            <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
-            <span class="opencodeinterpreter" style="vertical-align: middle">Coffee</span>
             </h1>
           <h2 class="subtitle is-3 publication-subtitle">
-            Boost Your Code LLMs by Fixing Bugs with Feedback
           </h2>
@@ -134,13 +134,12 @@
             </div>
-            <div class="links-row">
               <span class="link-block">
                 <a href="#mainresults"
                    class="external-link button is-normal is-rounded is-dark">
                   <span class="icon has-text-white">
                     <i class="fa-solid fa-trophy"></i>
-                      <!-- <p style="font-size:18px">🏆</p> -->
                   </span>
                   <span>Main Results</span>
                 </a>
@@ -148,7 +147,6 @@
-              <!-- EvalAI Link. -->
               <span class="link-block">
                 <a href="#example"
                    class="external-link button is-normal is-rounded is-dark">
@@ -159,7 +157,7 @@
                 </a>
               </span>
-            </div>
           </div>
         </div>
@@ -178,8 +176,8 @@
 <section class="hero teaser">
   <div class="container is-max-desktop">
         <div class="content has-text-centered">
-          <img src="static/images/figure1.svg" alt="geometric reasoning" width="95%"/>
-          <p> (Left) A motivating example of code editing with natural language feedback. With appropriate feedback, the editor model can produce a correct editing with the help of the feedback. (Right) By evaluating CoffeePots on HumanEvalFix, a benchmark that aims to assess code editing abilities, we show <b>our approach outperforms GPT-4</b>. </p>
         </div>
       <!-- </div> -->
     </div>
@@ -221,7 +219,7 @@
         <h2 class="title is-3">Introduction</h2>
         <div class="content has-text-justified">
           <p>
-            This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYMincludes two major components: (1) COFFEE, a dataset containing humans' codeedit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfullyreflects the helpfulness of feedback by assess-ing the performance of the revised code in unittests. With them, COFFEE-GYM addresses theunavailability of high-quality datasets for train-ing feedback models with RL, and providesmore accurate rewards than the SOTA rewardmodel (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outper-form baselines in enhancing open-source code LLMs' code editing, making them comparablewith closed-source LLMs. We make the datasetand the model checkpoint publicly available
           </p>
         </div>
       </div>
@@ -235,8 +233,7 @@
   <div class="hero-body has-text-centered">
   <h1 class="title is-1 mmmu">
     <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
-    <span class="mmmu" style="vertical-align: middle">Coffee: dataset
-       for COde Fixing with FEEdback</span>
   </h1>
   </div>
 </section>
@@ -249,17 +246,12 @@
       <div class="column is-four-fifths">
         <h2 class="title is-3">Overview</h2>
         <div class="content has-text-centered">
-          <img src="static/images/coffee_example.svg" alt="algebraic reasoning" class="center" style="width:40%">
           <p> An example instance from COFFEE dataset.</p>
         </div>
         <div class="content has-text-justified">
           <p>
-            Recent large language models have demonstrated promising capabilities in correcting their codes based on natural language feedback. However, this ability is currently only applicable to open-source models (e.g., GPT-3.5-Turbo and GPT-4), and <b>not to closed-source models</b>. This poses a significant safety and privacy concern as the codes cannot be uploaded to any external servers (e.g., OpenAI API server).
-          </p>
-          <p>
-            To this end, we introduce ☕COFFEE, a dataset for <b>code editing with feedback</b>.
-            Our dataset includes diverse solutions to programming problems collected from an online competitive programming platform. For each solution, we additionally annotate natural language feedback to provide detailed explanations for the errors towards correct edits, and augment synthetic test cases to measure the correctness of the edited solutions.
           </p>
         </div>
     </div>
@@ -274,9 +266,7 @@
 <section class="hero is-light is-small">
   <div class="hero-body has-text-centered">
   <h1 class="title is-1 mmmu">
-    <img src="static/images/coffeepot_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
-    <span class="mmmu" style="vertical-align: middle"> COFFEEPOTS: Aligning Feedback
-with Preferred Edits
        </span>
   </h1>
   </div>
@@ -286,40 +276,18 @@ with Preferred Edits
     <div class="columns is-centered has-text-centered">
       <!-- <div class="column is-full-width has-text-centered"> -->
       <div class="column is-four-fifths">
-        <h2 class="title is-3">Is Training on COFFEE with Supervised Finetuning Enough?</h2>
-        <div class="container is-max-desktop">
-        <div class="content has-text-centered">
-          <img src="static/images/pass_ratio.svg" alt="algebraic reasoning"  width="70%"/>
-          <p>  Pass@1 results of code editing with SFT feedback on the test set of COFFEE compared with editing with ChatGPT feedback and editing without any feedback settings.</p>
-        </div>
-      <br/>
-        </div>
-        <div class="content has-text-justified">
-          <p>
-           <b><h5>No.</h5></b> As we show in the figure above, we find that training a critic model that generates natural language feedback on a given erroneous code with next token prediction obejct cannot produce any helpful deedback. Editing code with feedback from SFT critic shows performance even worse than direct editing (i.e., editing w/o feedback). We posit that learning to generate accurate feedback is a very challenging goal that cannot be achieved only with next-token prediction object.
-          </p>
-          <br/>
-          <div class="content has-text-centered">
-            <img src="static/images/feedback_quality_analysis.svg" alt="algebraic reasoning"  width="100%"/>
-            <p>Human evaluation on the quality of feedback from SFT critic. </p>
-          </div>
-          <p>
-            Further analysis of the feedback quality suggests that the feedback from the SFT-trained critic has not yet reached a satisfactory level. It remains only as <b>'Partially Correct'</b> feedback, resulting the decreased performance in code editing as we show in the previous
-             analysis.
-          </p>
-        </div>
           <br/>
           <div class="content has-text-centered">
-            <img src="static/images/coffeepots.svg" alt="algebraic reasoning"  width="100%"/>
-            <p>Overview of COFFEEPOTs. </p>
           </div>
         <div class="content has-text-justified">
         <p>
-          To resolve the aforementioned issue, we introduce COFFEEPOTS, a framework for COde Fixing with FEEdback via Preference-Optimized Tuning and Selection. We first use our COFFEE dataset to train code LLMs via supervised fine-tuning (SFT) for feedback augmented code editing.
-          Then, we additionally <b>leverage synthetic test cases in COFFEE to annotate preferred (i.e., helpful) solutions and apply preference alignment </b>to guide the generation of helpful feedback.
         </p>
         </div>
@@ -344,8 +312,8 @@ with Preferred Edits
           <div class="content">
           <div class="content has-text-centered">
-            <img src="static/images/main_table.svg" alt="algebraic reasoning"  width="100%"/>
-            <p>Performances in editing machine-generated codes. We report pass@1 and ERR (in parentheses). We use
 ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
           </div>
@@ -362,7 +330,7 @@ Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competi
     </div>
 <!-------------------------------------------------------------------- Case Study-------------------------------------------------------------------->
-    <div class="columns is-centered has-text-centered">
       <div class="column is-four-fifths">
         <h2 class="title is-3" id="example">Example</h2>
         <iframe
@@ -377,7 +345,7 @@ Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competi
           </div>
         </div>
       </div>
-    </div>
   </div>

       <div class="columns is-centered">
         <div class="column has-text-centered">
           <h1 class="title is-1 publication-title is-bold">
+            <!-- <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>  -->
+            <span class="opencodeinterpreter" style="vertical-align: middle">Coffee-Gym</span>
             </h1>
           <h2 class="subtitle is-3 publication-subtitle">
+            An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
           </h2>
             </div>
+            <!-- <div class="links-row">
               <span class="link-block">
                 <a href="#mainresults"
                    class="external-link button is-normal is-rounded is-dark">
                   <span class="icon has-text-white">
                     <i class="fa-solid fa-trophy"></i>
                   </span>
                   <span>Main Results</span>
                 </a>
               <span class="link-block">
                 <a href="#example"
                    class="external-link button is-normal is-rounded is-dark">
                 </a>
               </span>
+            </div> -->
           </div>
         </div>
 <section class="hero teaser">
   <div class="container is-max-desktop">
         <div class="content has-text-centered">
+          <img src="static/images/comparison_w_prev.svg" alt="comparison between COFFEE-GYM and the previous approach" width="95%"/>
+          <p> Comparison between COFFEE-GYM and the previous approach. </p>
         </div>
       <!-- </div> -->
     </div>
         <h2 class="title is-3">Introduction</h2>
         <div class="content has-text-justified">
           <p>
+            This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
           </p>
         </div>
       </div>
   <div class="hero-body has-text-centered">
   <h1 class="title is-1 mmmu">
     <img src="static/images/coffee_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
+    <span class="mmmu" style="vertical-align: middle">COFFEE: Human-written Code Edit Traces with Annotated Pairwise Feedback</span>
   </h1>
   </div>
 </section>
       <div class="column is-four-fifths">
         <h2 class="title is-3">Overview</h2>
         <div class="content has-text-centered">
+          <img src="static/images/data.svg" alt="example instance of coffee dataset" class="center" style="width:50%">
           <p> An example instance from COFFEE dataset.</p>
         </div>
         <div class="content has-text-justified">
           <p>
+            We curate ☕️ COFFEE, a dataset for code fixing with feedback, from human-written code edit traces. COFFEE consists of problems at diverse levels of difficulty, including challenging problems that only human programmers can solve, and provides test cases for reward functions.
           </p>
         </div>
     </div>
 <section class="hero is-light is-small">
   <div class="hero-body has-text-centered">
   <h1 class="title is-1 mmmu">
+    <span class="mmmu" style="vertical-align: middle">COFFEEEVAL: Unit-test-driven Feedback Evaluation
        </span>
   </h1>
   </div>
     <div class="columns is-centered has-text-centered">
       <!-- <div class="column is-full-width has-text-centered"> -->
       <div class="column is-four-fifths">
           <br/>
           <div class="content has-text-centered">
+            <img src="static/images/coffeeeval_results.png" alt="evaluation results on coffeeeval"  width="100%"/>
+            <p>Performance of our evaluation protocol on the test sets of Coffee compared to the baselines. Wrong Feedback is abbreviated as WF due to limited space.</p>
           </div>
         <div class="content has-text-justified">
         <p>
+          DeepSeek-CoffeeEval achieves higher Pearson correlation and lower MSE than all G-Eval and Editing baselines. In particular, our approach shows even higher correlation than the G-Eval baseline implemented with GPT-4-Turbo.
+The strong performance of our CoffeeEval validates its effectiveness in assessing the quality of NL feedback in the code editing task.
         </p>
         </div>
           <div class="content">
           <div class="content has-text-centered">
+            <img src="static/images/main_results.png" alt="main results"  width="100%"/>
+            <p> We use
 ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
           </div>
     </div>
 <!-------------------------------------------------------------------- Case Study-------------------------------------------------------------------->
+    <!-- <div class="columns is-centered has-text-centered">
       <div class="column is-four-fifths">
         <h2 class="title is-3" id="example">Example</h2>
         <iframe
           </div>
         </div>
       </div>
+    </div> -->
   </div>