cerebras/Cerebras-GPT-111M

@@ -128,26 +128,26 @@ Model Params | Sequence Length | Batch Size | Number of Steps | Tokens | Tokens
 We evaluate our models on the PILE validation set comprising 380M tokens. We also evaluate the public checkpoints of Pythia Eleuther (2022), OPT Zhang et al. (2022), GPT-NeoX 20B Black et al. (2022), and GPT-J 6B Wang & Komatsuzaki (2021). We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
 #### 0-shot Evaluation
-| Model   | Count | Training FLOPs | PILE test xent | Hella-Swag | PIQA  | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
 | ------- | ----- | -------------- | -------------- | ---------- | ----- | ----------- | ------- | ----- | ----- | ---------- | ------------------ |
-| Cerebras| 111M  | 2.5E+18        | 2.566          | 0.268      | 0.594 | 0.488       | 0.194   | 0.380 | 0.166 | 0.118      | 0.315              |
-|         | 256M  | 1.1E+19        | 2.299          | 0.274      | 0.613 | 0.511       | 0.293   | 0.410 | 0.170 | 0.158      | 0.347              |
-|         | 590M  | 5.3E+19        | 2.184          | 0.291      | 0.627 | 0.498       | 0.366   | 0.464 | 0.190 | 0.158      | 0.370              |
-|         | 1.3B  | 2.5E+20        | 1.996          | 0.325      | 0.664 | 0.521       | 0.462   | 0.508 | 0.224 | 0.166      | 0.410              |
-|         | 2.7B  | 9.8E+20        | 1.834          | 0.386      | 0.701 | 0.559       | 0.567   | 0.571 | 0.246 | 0.206      | 0.462              |
-|         | 6.7B  | 5.9E+21        | TODO           | TODO       | TODO  | TODO        | TODO    | TODO  | TODO  | TODO       | TODO               |
-|         | 13B   | 2.1E+22        | 1.575          | 0.513      | 0.766 | 0.646       | 0.696   | 0.714 | 0.367 | 0.286      | 0.570              |
 #### 5-shot Evaluation
-| Model    | Count | Hella-Swag | PIQA  | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA |
 | -------- | ----- | ----------| ----- | ----------- | -------| ----- | ----- | ---------- |
-| Cerebras | 111M  | 0.267     | 0.588 | 0.475       | 0.158  | 0.356 | 0.166 | 0.136      |
-|          | 256M  | 0.278     | 0.606 | 0.522       | 0.225  | 0.422 | 0.183 | 0.164      |
-|          | 590M  | 0.291     | 0.634 | 0.479       | 0.281  | 0.475 | 0.206 | 0.152      |
-|          | 1.3B  | 0.326     | 0.668 | 0.536       | 0.395  | 0.529 | 0.241 | 0.174      |
-|          | 2.7B  | 0.382     | 0.697 | 0.543       | 0.487  | 0.590 | 0.267 | 0.224      |
-|          | 6.7B  | TODO      | TODO  | TODO        | TODO   | TODO  | TODO  | TODO       |
-|          | 13B   | 0.514     | 0.768 | 0.674       | 0.655  | 0.743 | 0.398 | 0.318      |
 <br><br>

 We evaluate our models on the PILE validation set comprising 380M tokens. We also evaluate the public checkpoints of Pythia Eleuther (2022), OPT Zhang et al. (2022), GPT-NeoX 20B Black et al. (2022), and GPT-J 6B Wang & Komatsuzaki (2021). We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
 #### 0-shot Evaluation
+| Model   | Params | Training FLOPs | PILE test xent | Hella-Swag | PIQA  | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
 | ------- | ----- | -------------- | -------------- | ---------- | ----- | ----------- | ------- | ----- | ----- | ---------- | ------------------ |
+| Cerebras-GPT | 111M  | 2.5E+18        | 2.566          | 0.268      | 0.594 | 0.488       | 0.194   | 0.380 | 0.166 | 0.118      | 0.315              |
+| Cerebras-GPT | 256M  | 1.1E+19        | 2.299          | 0.274      | 0.613 | 0.511       | 0.293   | 0.410 | 0.170 | 0.158      | 0.347              |
+| Cerebras-GPT | 590M  | 5.3E+19        | 2.184          | 0.291      | 0.627 | 0.498       | 0.366   | 0.464 | 0.190 | 0.158      | 0.370              |
+| Cerebras-GPT | 1.3B  | 2.5E+20        | 1.996          | 0.325      | 0.664 | 0.521       | 0.462   | 0.508 | 0.224 | 0.166      | 0.410              |
+| Cerebras-GPT | 2.7B  | 9.8E+20        | 1.834          | 0.386      | 0.701 | 0.559       | 0.567   | 0.571 | 0.246 | 0.206      | 0.462              |
+| Cerebras-GPT | 6.7B  | 5.9E+21        | TODO           | TODO       | TODO  | TODO        | TODO    | TODO  | TODO  | TODO       | TODO               |
+| Cerebras-GPT | 13B   | 2.1E+22        | 1.575          | 0.513      | 0.766 | 0.646       | 0.696   | 0.714 | 0.367 | 0.286      | 0.570              |
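
The source does not say how the Downstream Average column is computed. The sketch below checks one plausible reading, assuming it is the unweighted mean of the seven 0-shot task accuracies in each row. The names `ZERO_SHOT` and `downstream_average` are illustrative and not part of the model card; the 6.7B row is left out because its values are still TODO.

```python
# Rows copied from the 0-shot table: seven task accuracies in column order
# (Hella-Swag, PIQA, Wino-Grande, Lambada, ARC-e, ARC-c, OpenBookQA),
# followed by the reported Downstream Average.
ZERO_SHOT = {
    "111M": ([0.268, 0.594, 0.488, 0.194, 0.380, 0.166, 0.118], 0.315),
    "256M": ([0.274, 0.613, 0.511, 0.293, 0.410, 0.170, 0.158], 0.347),
    "590M": ([0.291, 0.627, 0.498, 0.366, 0.464, 0.190, 0.158], 0.370),
    "1.3B": ([0.325, 0.664, 0.521, 0.462, 0.508, 0.224, 0.166], 0.410),
    "2.7B": ([0.386, 0.701, 0.559, 0.567, 0.571, 0.246, 0.206], 0.462),
    "13B": ([0.513, 0.766, 0.646, 0.696, 0.714, 0.367, 0.286], 0.570),
}


def downstream_average(scores):
    """Unweighted mean of the per-task accuracies (an assumed definition)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for params, (scores, reported) in ZERO_SHOT.items():
        computed = downstream_average(scores)
        # Inputs are rounded to three decimals, so allow a small tolerance.
        status = "ok" if abs(computed - reported) < 1e-3 else "mismatch"
        print(f"{params:>5}: computed={computed:.4f} reported={reported:.3f} {status}")
```

Under this assumption every listed row agrees with the reported average to within 0.001, which is the most that rounding of the inputs can explain.
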
 #### 5-shot Evaluation
+| Model    | Params | Hella-Swag | PIQA  | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA |
 | -------- | ----- | ----------| ----- | ----------- | -------| ----- | ----- | ---------- |
+| Cerebras-GPT | 111M  | 0.267     | 0.588 | 0.475       | 0.158  | 0.356 | 0.166 | 0.136      |
+| Cerebras-GPT | 256M  | 0.278     | 0.606 | 0.522       | 0.225  | 0.422 | 0.183 | 0.164      |
+| Cerebras-GPT | 590M  | 0.291     | 0.634 | 0.479       | 0.281  | 0.475 | 0.206 | 0.152      |
+| Cerebras-GPT | 1.3B  | 0.326     | 0.668 | 0.536       | 0.395  | 0.529 | 0.241 | 0.174      |
+| Cerebras-GPT | 2.7B  | 0.382     | 0.697 | 0.543       | 0.487  | 0.590 | 0.267 | 0.224      |
+| Cerebras-GPT | 6.7B  | TODO      | TODO  | TODO        | TODO   | TODO  | TODO  | TODO       |
+| Cerebras-GPT | 13B   | 0.514     | 0.768 | 0.674       | 0.655  | 0.743 | 0.398 | 0.318      |
 <br><br>
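
The power-law extrapolation described in the evaluation paragraph can be illustrated with the table data. This is a minimal sketch, not the authors' fitting procedure. It assumes a pure power law `loss = a * FLOPs**b` fitted by least squares in log-log space, and it uses the PILE test xent column as a stand-in for the validation loss. The names `fit_power_law` and `predict` are hypothetical. The five smallest models are used for the fit and the 13B row is held out; the 6.7B row is skipped because its values are still TODO.

```python
import math

# (training FLOPs, PILE test cross-entropy) for the 111M to 2.7B rows.
POINTS = [
    (2.5e18, 2.566),
    (1.1e19, 2.299),
    (5.3e19, 2.184),
    (2.5e20, 1.996),
    (9.8e20, 1.834),
]
HELD_OUT = (2.1e22, 1.575)  # 13B row, used only to check the extrapolation


def fit_power_law(points):
    """Fit loss = a * flops**b by ordinary least squares on (ln flops, ln loss)."""
    xs = [math.log(flops) for flops, _ in points]
    ys = [math.log(loss) for _, loss in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space -> prefactor
    return a, b


def predict(a, b, flops):
    """Evaluate the fitted power law at a given compute budget."""
    return a * flops ** b


if __name__ == "__main__":
    a, b = fit_power_law(POINTS)
    flops, reported = HELD_OUT
    print(f"exponent b = {b:.4f}")
    print(f"predicted loss at {flops:.1e} FLOPs = {predict(a, b, flops):.3f}")
    print(f"reported loss = {reported:.3f}")
```

On these numbers the extrapolated loss for the 13B compute budget lands close to the reported 1.575, which is the kind of sanity check the paragraph describes. A fit with an irreducible-loss offset would need a nonlinear optimizer and is not shown here.
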