Coffee-Gym committed
Commit fb38f03 · verified · 1 Parent(s): 027fc14

Upload index.html

Files changed (1):
  1. index.html +9 -7
index.html CHANGED
@@ -86,8 +86,11 @@
  <!-- @PAN TODO: change links -->
  <a href="https://huggingface.co/spaces/Anonymous-COFFEE/Project-COFFEE/blob/main/static/ACL24__Code_Edit.pdf"
    class="external-link button is-normal is-rounded is-dark" target="_blank">
- <span class="icon">
+ <!-- <span class="icon">
  <i class="fas fa-file-pdf"></i>
+ </span> -->
+ <span class="icon">
+ <p style="font-size:18px">📝</p>
  </span>
  <span>Paper</span>
  </a>
@@ -189,7 +192,7 @@
  <!-- Abstract. -->
  <div class="columns is-centered has-text-centered">
  <div class="column is-four-fifths">
- <h2 class="title is-3">🔔News</h2>
+ <h2 class="title is-3">🔔 News</h2>
  <div class="content has-text-justified">


@@ -219,7 +222,7 @@
  <h2 class="title is-3">Introduction</h2>
  <div class="content has-text-justified">
  <p>
- This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available
+ This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
  </p>
  </div>
  </div>
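Aside: the abstract's central mechanism is rewarding feedback by how well the code revised under that feedback performs on unit tests. Below is a minimal sketch of such a pass-rate reward, assuming each test is a (stdin, expected-stdout) pair and that solutions are standalone Python scripts; the function name and scoring scheme are illustrative assumptions, not the authors' COFFEEEVAL implementation.

    import os
    import subprocess
    import tempfile

    def unit_test_reward(revised_code: str,
                         test_cases: list[tuple[str, str]],
                         timeout: float = 5.0) -> float:
        """Reward = fraction of unit tests the revised code passes.

        Feedback that leads to a better edit earns a higher scalar
        reward, which is what an RL trainer (e.g., PPO) consumes.
        """
        passed = 0
        for stdin_data, expected in test_cases:
            # Write the candidate program to a temporary file.
            with tempfile.NamedTemporaryFile("w", suffix=".py",
                                             delete=False) as f:
                f.write(revised_code)
                path = f.name
            try:
                result = subprocess.run(
                    ["python", path], input=stdin_data,
                    capture_output=True, text=True, timeout=timeout,
                )
                if result.stdout.strip() == expected.strip():
                    passed += 1
            except subprocess.TimeoutExpired:
                pass  # a hanging program counts as a failed test
            finally:
                os.unlink(path)
        return passed / len(test_cases)  # reward in [0, 1]

In an RL loop, this scalar would be attributed to the feedback model whose critique guided the edit.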
@@ -313,16 +316,15 @@ The strong performance of our CoffeeEval validates its effectiveness in assessin

  <div class="content has-text-centered">
  <img src="static/images/main_results.png" alt="algebraic reasoning" width="100%"/>
- <p> We use
- ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
+ <p> Code editing results of our feedback model trained with Coffee-Gym, i.e., PPO-COFFEEVAL, on HumanEvalFix and COFFEE-Test. We pair our feedback model with an open-source code LLM as the code editor.</p>
  </div>

  <div class="content has-text-justified">
- <p>
+ <!-- <p>
  Table above reports the model performance in editing solutions generated from ChatGPT for problems in HumanEvalSynthesize, MBPP, and APPS. CoffeePots outperforms all open-source baselines, including Code Llama (13B), the previous SOTA among open-source code LLMs. Furthermore, CoffeePots shows better results than feedback-augmented Code Llama (13B), i.e., prompted with Self-Refine and Self-Debug, suggesting the effectiveness of our strategy on generating feedback.
  In addition, while some open-source code LLMs show almost no improvement in MBPP and APPS (i.e., 0% ERR), CoffeePots shows moderate improvements on these benchmarks (i.e., up to 7.5% ERR).
  Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competitive results particularly on HumanEvalSynthesize and MBPP, showing that our framework can serve as a strong alternative to closed-source LLMs while being publicly available and much smaller in size.
- </p>
+ </p> -->
  </div>

  </div>
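Aside: the paragraph commented out above quotes ERR figures (0% to 7.5%) without defining ERR on this page. Assuming it denotes an error-resolve rate, i.e., the share of initially failing solutions that pass their tests after editing (an assumption, not a definition from the source), a hypothetical computation would be:

    def error_resolve_rate(passed_before: list[bool],
                           passed_after: list[bool]) -> float:
        """Assumed reading of ERR: among solutions that failed their
        unit tests before editing, the percentage that pass after."""
        failing = [i for i, ok in enumerate(passed_before) if not ok]
        if not failing:
            return 0.0  # nothing to resolve
        fixed = sum(1 for i in failing if passed_after[i])
        return 100.0 * fixed / len(failing)

Under this reading, 0% ERR would mean a model repaired none of the erroneous solutions in a benchmark.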
 