michaelfeil committed
Commit 6cf6296 · 1 Parent(s): a8a7103

Upload EleutherAI/pythia-12b ctranslate fp16 weights

Files changed (3):
  1. README.md +321 -0
  2. model.bin +2 -2
  3. special_tokens_map.json +5 -0
README.md ADDED
@@ -0,0 +1,321 @@
+ ---
+ language:
+ - en
+ tags:
+ - ctranslate2
+ - int8
+ - float16
+ - pytorch
+ - causal-lm
+ - pythia
+ license: apache-2.0
+ datasets:
+ - the_pile
+ ---
+ # Fast Inference with CTranslate2
+ Speed up inference and reduce memory usage by 2x-4x using int8 inference in C++ on CPU or GPU.
+
+ Quantized version of [EleutherAI/pythia-12b](https://huggingface.co/EleutherAI/pythia-12b).
+ ```bash
+ pip install "hf-hub-ctranslate2>=2.0.8"
+ ```
+ Converted on 2023-05-22 using
+ ```bash
+ ct2-transformers-converter --model EleutherAI/pythia-12b --output_dir /home/michael/tmp-ct2fast-pythia-12b --force --copy_files tokenizer.json README.md tokenizer_config.json special_tokens_map.json .gitattributes --quantization float16
+ ```
+
+ Checkpoint compatible with [ctranslate2>=3.13.0](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2>=2.0.6](https://github.com/michaelfeil/hf-hub-ctranslate2):
+ - `compute_type=int8_float16` for `device="cuda"`
+ - `compute_type=int8` for `device="cpu"`
+
+ ```python
+ from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
+ from transformers import AutoTokenizer
+
+ model_name = "michaelfeil/ct2fast-pythia-12b"
+ # use either TranslatorCT2fromHfHub or GeneratorCT2fromHfHub here, depending on the model
+ model = GeneratorCT2fromHfHub(
+     # load in int8 on CUDA
+     model_name_or_path=model_name,
+     device="cuda",
+     compute_type="int8_float16",
+     # tokenizer=AutoTokenizer.from_pretrained("EleutherAI/pythia-12b")
+ )
+ outputs = model.generate(
+     text=["def print_hello_world():", "def hello_name(name:"],
+     max_length=64,
+ )
+ print(outputs)
+ ```
+
+ # License and other remarks:
+ This is just a quantized version. License conditions are intended to be identical to those of the original Hugging Face repo.
+
+ # Original description
+
+ The *Pythia Scaling Suite* is a collection of models developed to facilitate
+ interpretability research. It contains two sets of eight models of sizes
+ 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
+ models: one trained on the Pile, and one trained on the Pile after the dataset
+ has been globally deduplicated. All 8 model sizes are trained on the exact
+ same data, in the exact same order. We also provide 154 intermediate
+ checkpoints per model, hosted on Hugging Face as branches.
+
+ The Pythia model suite was deliberately designed to promote scientific
+ research on large language models, especially interpretability research.
+ Despite not centering downstream performance as a design goal, we find the
+ models <a href="#evaluations">match or exceed</a> the performance of
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
+
+ <details>
+ <summary style="font-weight: 600">Past early release and naming convention.</summary>
+
+ Previously, we released an early version of the Pythia suite to the public.
+ However, we decided to retrain the model suite to address a few hyperparameter
+ discrepancies. This model card <a href="#changelog">lists the changes</a>;
+ see appendix B in the Pythia paper for further discussion. We found no
+ difference in benchmark performance between the two Pythia versions.
+ The old models are
+ [still available](https://huggingface.co/models?other=pythia_v0), but we
+ suggest the retrained suite if you are just starting to use Pythia.<br>
+ **This is the current release.**
+
+ Please note that all models in the *Pythia* suite were renamed in January
+ 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
+ comparing the old and new names</a> is provided in this model card, together
+ with exact parameter counts.
+ </details>
+ <br>
+
+ # Pythia-12B
+
+ ## Model Details
+
+ - Developed by: [EleutherAI](http://eleuther.ai)
+ - Model type: Transformer-based Language Model
+ - Language: English
+ - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
+ for training procedure, config files, and details on how to use.
+ - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
+ - License: Apache 2.0
+ - Contact: to ask questions about this model, join the [EleutherAI
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing *Pythia* documentation before asking about it in the
+ EleutherAI Discord. For general correspondence: [contact@eleuther.ai](mailto:contact@eleuther.ai).
+
+ <figure>
+
+ | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
+ | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
+ | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
+ | 160M | 85,056,000 | 12 | 768 | 12 | 4M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
+ | 410M | 302,311,424 | 24 | 1024 | 16 | 4M | 3.0 x 10<sup>-4</sup> | OPT-350M |
+ | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
+ | 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 4M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
+ | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
+ | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
+ | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
+ <figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
+ non-deduped models of a given size have the same hyperparameters. “Equivalent”
+ models have <b>exactly</b> the same architecture, and the same number of
+ non-embedding parameters.</figcaption>
+ </figure>
+
+ ## Uses and Limitations
+
+ ### Intended Use
+
+ The primary intended use of Pythia is research on the behavior, functionality,
+ and limitations of large language models. This suite is intended to provide
+ a controlled setting for performing scientific experiments. We also provide
+ 154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
+ `step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
+ `step143000`. These checkpoints are hosted on Hugging Face as branches. Note
+ that branch `143000` corresponds exactly to the model checkpoint on the `main`
+ branch of each model.
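The 154 checkpoint branches follow a simple naming pattern, so they can be enumerated programmatically. A minimal sketch (the branch names are inferred from the description above):

```python
# Enumerate the 154 checkpoint branch names described above:
# step0, 10 log-spaced steps (1, 2, 4, ..., 512), and 143 evenly
# spaced steps from step1000 to step143000.
log_spaced = [2**i for i in range(10)]            # 1, 2, 4, ..., 512
evenly_spaced = list(range(1000, 144000, 1000))   # 1000, 2000, ..., 143000
steps = [0] + log_spaced + evenly_spaced

branches = [f"step{s}" for s in steps]
print(len(branches))  # 154 checkpoints per model
```

Any of these branch names can be passed as the `revision` argument of `from_pretrained`, as shown in the Quickstart section.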
+
+ You may also further fine-tune and adapt Pythia-12B for deployment,
+ as long as your use is in accordance with the Apache 2.0 license. Pythia
+ models work with the Hugging Face [Transformers
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
+ pre-trained Pythia-12B as a basis for your fine-tuned model, please
+ conduct your own risk and bias assessment.
+
+ ### Out-of-scope use
+
+ The Pythia Suite is **not** intended for deployment. It is not in itself
+ a product and cannot be used for human-facing interactions. For example,
+ the model may generate harmful or offensive text. Please evaluate the risks
+ associated with your particular use case.
+
+ Pythia models are English-language only, and are not suitable for translation
+ or generating text in other languages.
+
+ Pythia-12B has not been fine-tuned for downstream contexts in which
+ language models are commonly deployed, such as writing genre prose
+ or commercial chatbots. This means Pythia-12B will **not**
+ respond to a given prompt the way a product like ChatGPT does. This is because,
+ unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
+ Learning from Human Feedback (RLHF) to better “follow” human instructions.
+
+ ### Limitations and biases
+
+ The core functionality of a large language model is to take a string of text
+ and predict the next token. The token used by the model need not produce the
+ most “accurate” text. Never rely on Pythia-12B to produce factually accurate
+ output.
+
+ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
+ known to contain profanity and texts that are lewd or otherwise offensive.
+ See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
+ discussion of documented biases with regards to gender, religion, and race.
+ Pythia-12B may produce socially unacceptable or undesirable text, *even if*
+ the prompt itself does not include anything explicitly offensive.
+
+ If you plan on using text generated through, for example, the Hosted Inference
+ API, we recommend having a human curate the outputs of this language model
+ before presenting them to other people. Please inform your audience that the
+ text was generated by Pythia-12B.
+
+ ### Quickstart
+
+ Pythia models can be loaded and used via the following code, demonstrated here
+ for the third `pythia-70m-deduped` checkpoint:
+
+ ```python
+ from transformers import GPTNeoXForCausalLM, AutoTokenizer
+
+ model = GPTNeoXForCausalLM.from_pretrained(
+     "EleutherAI/pythia-70m-deduped",
+     revision="step3000",
+     cache_dir="./pythia-70m-deduped/step3000",
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "EleutherAI/pythia-70m-deduped",
+     revision="step3000",
+     cache_dir="./pythia-70m-deduped/step3000",
+ )
+
+ inputs = tokenizer("Hello, I am", return_tensors="pt")
+ tokens = model.generate(**inputs)
+ tokenizer.decode(tokens[0])
+ ```
+
+ Revision/branch `step143000` corresponds exactly to the model checkpoint on
+ the `main` branch of each model.<br>
+ For more information on how to use all Pythia models, see [documentation on
+ GitHub](https://github.com/EleutherAI/pythia).
+
+ ## Training
+
+ ### Training data
+
+ [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
+ English. It was created by EleutherAI specifically for training large language
+ models. It contains texts from 22 diverse sources, roughly broken down into
+ five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
+ prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
+ miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
+ paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
+ methodology, and a discussion of ethical implications. Consult [the
+ datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
+ about the Pile and its component datasets. The Pile can be downloaded from
+ the [official website](https://pile.eleuther.ai/), or from a [community
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
+ The Pile was **not** deduplicated before being used to train Pythia-12B.
+
+ ### Training procedure
+
+ All models were trained on the exact same data, in the exact same order. Each
+ model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
+ model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
+ from `step1000` to `step143000` (which is the same as `main`). In addition, we
+ also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
+ This corresponds to training for just under 1 epoch on the Pile for
+ non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
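The token counts quoted above are internally consistent; a quick sanity check of the arithmetic:

```python
# Sanity-check the training token counts: 143,000 steps at a batch size
# of 2M tokens, with a checkpoint saved every 1,000 steps.
batch_size_tokens = 2_097_152   # "2M" = 2**21 tokens per step
total_steps = 143_000

total_tokens = total_steps * batch_size_tokens
tokens_per_checkpoint_interval = 1_000 * batch_size_tokens

print(total_tokens)                    # 299892736000, as stated above
print(tokens_per_checkpoint_interval)  # 2097152000, as stated above
```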
+
+ All *Pythia* models were trained for 143,000 steps at a batch size
+ of 2M (2,097,152 tokens).<br>
+ See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
+ procedure, including [how to reproduce
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
+ Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
+
+ ## Evaluations
+
+ All 16 *Pythia* models were evaluated using the [LM Evaluation
+ Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
+ the results by model and step at `results/json/*` in the [GitHub
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
+ Expand the sections below to see plots of evaluation results for all
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
+
+ <details>
+ <summary>LAMBADA – OpenAI</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>WinoGrande</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>AI2 Reasoning Challenge—Easy Set</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
+ </details>
+
+ <details>
+ <summary>SciQ</summary>
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
+ </details>
+
+ ## Changelog
+
+ This section compares differences between the previously released
+ [Pythia v0](https://huggingface.co/models?other=pythia_v0) and the current
+ models. See Appendix B of the Pythia paper for further discussion of these
+ changes and the motivation behind them. We found that retraining Pythia had no
+ impact on benchmark performance.
+
+ - All model sizes are now trained with a uniform batch size of 2M tokens.
+ Previously, the models of size 160M, 410M, and 1.4B parameters were trained
+ with batch sizes of 4M tokens.
+ - We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
+ 128,256,512} in addition to every 1000 training steps.
+ - Flash Attention was used in the new retrained suite.
+ - We remedied a minor inconsistency that existed in the original suite: all
+ models of size 2.8B parameters or smaller had a learning rate (LR) schedule
+ which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
+ 12B models all used an LR schedule which decayed to a minimum LR of 0. In
+ the redone training runs, we rectified this inconsistency: all models are now
+ trained with the LR decaying to a minimum of 0.1× their maximum LR.
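The decay-to-10% behavior in the last bullet can be sketched as a schedule that anneals from the peak LR to 0.1× the peak. The exact schedule (including warmup) lives in the Pythia training configs on GitHub; the warmup-free cosine shape below is an illustrative assumption, not the precise implementation:

```python
import math

def lr_at_step(step, total_steps=143_000, max_lr=1.2e-4, min_ratio=0.1):
    """Illustrative cosine decay from max_lr down to min_ratio * max_lr."""
    min_lr = min_ratio * max_lr
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(0))        # starts at max_lr (1.2e-4, the 12B peak LR)
print(lr_at_step(143_000))  # ends at 0.1 * max_lr (1.2e-5)
```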
+
+ ### Naming convention and parameter count
+
+ *Pythia* models were renamed in January 2023. It is possible that the old
+ naming convention still persists in some documentation by accident. The
+ current naming convention (70M, 160M, etc.) is based on total parameter count.
+
+ <figure style="width:32em">
+
+ | current Pythia suffix | old suffix | total params | non-embedding params |
+ | --------------------: | ---------: | -------------: | -------------------: |
+ | 70M | 19M | 70,426,624 | 18,915,328 |
+ | 160M | 125M | 162,322,944 | 85,056,000 |
+ | 410M | 350M | 405,334,016 | 302,311,424 |
+ | 1B | 800M | 1,011,781,632 | 805,736,448 |
+ | 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
+ | 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
+ | 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
+ | 12B | 13B | 11,846,072,320 | 11,327,027,200 |
+ </figure>
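The non-embedding counts in this table can be reproduced from the layer counts and model dimensions in the engineering table earlier in this card. A sketch, assuming the standard GPT-NeoX layer layout (fused QKV plus output projection, a 4× MLP, two LayerNorms per layer, and one final LayerNorm, all with biases):

```python
def non_embedding_params(layers: int, d_model: int) -> int:
    """Non-embedding parameter count for a GPT-NeoX-style transformer.

    Per layer: attention (4*d^2 weights + 4*d biases), 4x MLP
    (8*d^2 weights + 5*d biases), two LayerNorms (4*d), plus one
    final LayerNorm (2*d) outside the layer stack.
    """
    per_layer = 12 * d_model**2 + 13 * d_model
    return layers * per_layer + 2 * d_model

print(non_embedding_params(36, 5120))  # 11327027200, the 12B row
print(non_embedding_params(6, 512))    # 18915328, the 70M row
```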
model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:82758e96ddd9329922cfc81a8fe473226aabe4983aac3e97a860be5626ebee5a
- size 11855209712
+ oid sha256:91d44167f263b80fec8e1344c3a0791cf71bd42aa10d42e040d0a485bd18520a
+ size 23691825848
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }