Text Generation
Transformers
PyTorch
English
gpt2
causal-lm
text-generation-inference
Inference Endpoints
rskuzma commited on
Commit
c5b12af
·
1 Parent(s): 5d2d6c2

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +16 -16
README.md CHANGED
@@ -128,26 +128,26 @@ Model Params | Sequence Length | Batch Size | Number of Steps | Tokens | Tokens
128
  We evaluate our models on the PILE validation set comprising 380M tokens. We also evaluate the public checkpoints of Pythia Eleuther (2022), OPT Zhang et al. (2022), GPT-NeoX 20B Black et al. (2022), and GPT-J 6B Wang & Komatsuzaki (2021). We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
129
 
130
  #### 0-shot Evaluation
131
- | Model | Count | Training FLOPs | PILE test xent | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
132
  | ------- | ----- | -------------- | -------------- | ---------- | ----- | ----------- | ------- | ----- | ----- | ---------- | ------------------ |
133
- | Cerebras| 111M | 2.5E+18 | 2.566 | 0.268 | 0.594 | 0.488 | 0.194 | 0.380 | 0.166 | 0.118 | 0.315 |
134
- | | 256M | 1.1E+19 | 2.299 | 0.274 | 0.613 | 0.511 | 0.293 | 0.410 | 0.170 | 0.158 | 0.347 |
135
- | | 590M | 5.3E+19 | 2.184 | 0.291 | 0.627 | 0.498 | 0.366 | 0.464 | 0.190 | 0.158 | 0.370 |
136
- | | 1.3B | 2.5E+20 | 1.996 | 0.325 | 0.664 | 0.521 | 0.462 | 0.508 | 0.224 | 0.166 | 0.410 |
137
- | | 2.7B | 9.8E+20 | 1.834 | 0.386 | 0.701 | 0.559 | 0.567 | 0.571 | 0.246 | 0.206 | 0.462 |
138
- | | 6.7B | 5.9E+21 | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
139
- | | 13B | 2.1E+22 | 1.575 | 0.513 | 0.766 | 0.646 | 0.696 | 0.714 | 0.367 | 0.286 | 0.570 |
140
 
141
  #### 5-shot Evaluation
142
- | Model | Count | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA |
143
  | -------- | ----- | ----------| ----- | ----------- | -------| ----- | ----- | ---------- |
144
- | Cerebras | 111M | 0.267 | 0.588 | 0.475 | 0.158 | 0.356 | 0.166 | 0.136 |
145
- | | 256M | 0.278 | 0.606 | 0.522 | 0.225 | 0.422 | 0.183 | 0.164 |
146
- | | 590M | 0.291 | 0.634 | 0.479 | 0.281 | 0.475 | 0.206 | 0.152 |
147
- | | 1.3B | 0.326 | 0.668 | 0.536 | 0.395 | 0.529 | 0.241 | 0.174 |
148
- | | 2.7B | 0.382 | 0.697 | 0.543 | 0.487 | 0.590 | 0.267 | 0.224 |
149
- | | 6.7B | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
150
- | | 13B | 0.514 | 0.768 | 0.674 | 0.655 | 0.743 | 0.398 | 0.318 |
151
 
152
 
153
  <br><br>
 
128
  We evaluate our models on the PILE validation set comprising 380M tokens. We also evaluate the public checkpoints of Pythia Eleuther (2022), OPT Zhang et al. (2022), GPT-NeoX 20B Black et al. (2022), and GPT-J 6B Wang & Komatsuzaki (2021). We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
129
 
130
  #### 0-shot Evaluation
131
+ | Model | Params | Training FLOPs | PILE test xent | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
132
  | ------- | ----- | -------------- | -------------- | ---------- | ----- | ----------- | ------- | ----- | ----- | ---------- | ------------------ |
133
+ | Cerebras-GPT | 111M | 2.5E+18 | 2.566 | 0.268 | 0.594 | 0.488 | 0.194 | 0.380 | 0.166 | 0.118 | 0.315 |
134
+ | Cerebras-GPT | 256M | 1.1E+19 | 2.299 | 0.274 | 0.613 | 0.511 | 0.293 | 0.410 | 0.170 | 0.158 | 0.347 |
135
+ | Cerebras-GPT | 590M | 5.3E+19 | 2.184 | 0.291 | 0.627 | 0.498 | 0.366 | 0.464 | 0.190 | 0.158 | 0.370 |
136
+ | Cerebras-GPT | 1.3B | 2.5E+20 | 1.996 | 0.325 | 0.664 | 0.521 | 0.462 | 0.508 | 0.224 | 0.166 | 0.410 |
137
+ | Cerebras-GPT | 2.7B | 9.8E+20 | 1.834 | 0.386 | 0.701 | 0.559 | 0.567 | 0.571 | 0.246 | 0.206 | 0.462 |
138
+ | Cerebras-GPT | 6.7B | 5.9E+21 | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
139
+ | Cerebras-GPT | 13B | 2.1E+22 | 1.575 | 0.513 | 0.766 | 0.646 | 0.696 | 0.714 | 0.367 | 0.286 | 0.570 |
140
 
141
  #### 5-shot Evaluation
142
+ | Model | Params | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA |
143
  | -------- | ----- | ----------| ----- | ----------- | -------| ----- | ----- | ---------- |
144
+ | Cerebras-GPT | 111M | 0.267 | 0.588 | 0.475 | 0.158 | 0.356 | 0.166 | 0.136 |
145
+ | Cerebras-GPT | 256M | 0.278 | 0.606 | 0.522 | 0.225 | 0.422 | 0.183 | 0.164 |
146
+ | Cerebras-GPT | 590M | 0.291 | 0.634 | 0.479 | 0.281 | 0.475 | 0.206 | 0.152 |
147
+ | Cerebras-GPT | 1.3B | 0.326 | 0.668 | 0.536 | 0.395 | 0.529 | 0.241 | 0.174 |
148
+ | Cerebras-GPT | 2.7B | 0.382 | 0.697 | 0.543 | 0.487 | 0.590 | 0.267 | 0.224 |
149
+ | Cerebras-GPT | 6.7B | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
150
+ | Cerebras-GPT | 13B | 0.514 | 0.768 | 0.674 | 0.655 | 0.743 | 0.398 | 0.318 |
151
 
152
 
153
  <br><br>