TheBloke committed
Commit 693c2a0 · 1 Parent(s): 96428cb

Upload README.md

Files changed (1)
  1. README.md +130 -87
README.md CHANGED
@@ -2,7 +2,7 @@
2
  inference: false
3
  language:
4
  - en
5
- license: other
6
  model_creator: Upstage
7
  model_link: https://huggingface.co/upstage/Llama-2-70b-instruct-v2
8
  model_name: Llama 2 70B Instruct v2
@@ -37,122 +37,158 @@ tags:
37
  - Model creator: [Upstage](https://huggingface.co/Upstage)
38
  - Original model: [Llama 2 70B Instruct v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)
39
 
 
40
  ## Description
41
 
42
  This repo contains GPTQ model files for [Upstage's Llama 2 70B Instruct v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2).
43
 
44
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
45
 
 
 
46
  ## Repositories available
47
 
48
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ)
49
- * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GGML)
 
50
  * [Upstage's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)
 
51
 
 
52
  ## Prompt template: Orca-Hashes
53
 
54
  ```
55
  ### System:
56
- This is a system prompt, please behave and help the user.
57
 
58
  ### User:
59
  {prompt}
60
 
61
  ### Assistant:
 
62
  ```
63
 
64
- ## Provided files
 
 
 
65
 
66
  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
67
 
68
  Each separate quant is in a different branch. See below for instructions on fetching from different branches.
69
 
70
- | Branch | Bits | Group Size | Act Order (desc_act) | GPTQ Dataset | Size | ExLlama Compat? | Made With | Desc |
71
- | ------ | ---- | ---------- | -------------------- | ------------ | ---- | --------------- | --------- | ---- |
72
- | [main](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/main) | 4 | None | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 35.33 GB | Yes | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
73
- | [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 40.66 GB | Yes | AutoGPTQ | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
74
- | [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 37.99 GB | Yes | AutoGPTQ | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
75
- | [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 36.65 GB | Yes | AutoGPTQ | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
76
- | [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 26.78 GB | No | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
77
- | [gptq-3bit-128g-actorder_False](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit-128g-actorder_False) | 3 | 128 | No | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 28.03 GB | No | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
78
- | [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 28.03 GB | No | AutoGPTQ | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
79
- | [gptq-3bit-64g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit-64g-actorder_True) | 3 | 64 | Yes | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 29.30 GB | No | AutoGPTQ | 3-bit, with group size 64g and act-order. Highest quality 3-bit option. Poor AutoGPTQ CUDA speed. |
 
 
81
  ## How to download from branches
82
 
83
  - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ:gptq-4bit-32g-actorder_True`
84
  - With Git, you can clone a branch with:
85
  ```
86
- git clone --branch --single-branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ
87
  ```
88
  - In Python Transformers code, the branch is the `revision` parameter; see below.
89
-
 
90
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
91
 
92
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
93
 
94
- It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
95
 
96
  1. Click the **Model tab**.
97
  2. Under **Download custom model or LoRA**, enter `TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ`.
98
  - To download from a specific branch, enter for example `TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ:gptq-4bit-32g-actorder_True`
99
  - see Provided Files above for the list of branches for each option.
100
  3. Click **Download**.
101
- 4. The model will start downloading. Once it's finished it will say "Done"
102
  5. In the top left, click the refresh icon next to **Model**.
103
  6. In the **Model** dropdown, choose the model you just downloaded: `Upstage-Llama-2-70B-instruct-v2-GPTQ`
104
  7. The model will automatically load, and is now ready for use!
105
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
106
- * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
107
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 
108
 
 
109
  ## How to use this GPTQ model from Python code
110
 
111
- First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:
112
 
113
- `GITHUB_ACTIONS=true pip install auto-gptq`
114
 
115
- Then try the following example code:
116
 
117
  ```python
118
- from transformers import AutoTokenizer, pipeline, logging
119
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
120
 
121
  model_name_or_path = "TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ"
122
- model_basename = "model"
123
-
124
- use_triton = False
 
 
 
125
 
126
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
127
 
128
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
129
- model_basename=model_basename,
130
- use_safetensors=True,
131
- trust_remote_code=False,
132
- device="cuda:0",
133
- use_triton=use_triton,
134
- quantize_config=None)
135
-
136
- """
137
- To download from a specific branch, use the revision parameter, as in this example:
138
-
139
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
140
- revision="gptq-4bit-32g-actorder_True",
141
- model_basename=model_basename,
142
- use_safetensors=True,
143
- trust_remote_code=False,
144
- device="cuda:0",
145
- quantize_config=None)
146
- """
147
-
148
  prompt = "Tell me about AI"
149
  prompt_template=f'''### System:
150
- This is a system prompt, please behave and help the user.
151
 
152
  ### User:
153
  {prompt}
154
 
155
  ### Assistant:
 
156
  '''
157
 
158
  print("\n\n*** Generate:")
@@ -163,9 +199,6 @@ print(tokenizer.decode(output[0]))
163
 
164
  # Inference can also be done using transformers' pipeline
165
 
166
- # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
167
- logging.set_verbosity(logging.CRITICAL)
168
-
169
  print("*** Pipeline:")
170
  pipe = pipeline(
171
  "text-generation",
@@ -179,12 +212,17 @@ pipe = pipeline(
179
 
180
  print(pipe(prompt_template)[0]['generated_text'])
181
  ```
 
182
 
 
183
  ## Compatibility
184
 
185
- The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
 
186
 
187
- ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
188
 
189
  <!-- footer start -->
190
  <!-- 200823 -->
@@ -209,7 +247,7 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
209
 
210
  **Special thanks to**: Aemon Algiz.
211
 
212
- **Patreon special mentions**: Sam, theTransient, Jonathan Leane, Steven Wood, webtim, Johann-Peter Hartmann, Geoffrey Montalvo, Gabriel Tamborski, Willem Michiel, John Villwock, Derek Yates, Mesiah Bishop, Eugene Pentland, Pieter, Chadd, Stephen Murray, Daniel P. Andersen, terasurfer, Brandon Frisco, Thomas Belote, Sid, Nathan LeClaire, Magnesian, Alps Aficionado, Stanislav Ovsiannikov, Alex, Joseph William Delisle, Nikolai Manek, Michael Davis, Junyu Yang, K, J, Spencer Kim, Stefan Sabev, Olusegun Samson, transmissions 11, Michael Levine, Cory Kujawski, Rainer Wilmers, zynix, Kalila, Luke @flexchar, Ajan Kanaga, Mandus, vamX, Ai Maven, Mano Prime, Matthew Berman, subjectnull, Vitor Caleffi, Clay Pascal, biorpg, alfie_i, 阿明, Jeffrey Morgan, ya boyyy, Raymond Fosdick, knownsqashed, Olakabola, Leonard Tan, ReadyPlayerEmma, Enrico Ros, Dave, Talal Aujan, Illia Dulskyi, Sean Connelly, senxiiz, Artur Olbinski, Elle, Raven Klaugh, Fen Risland, Deep Realms, Imad Khwaja, Fred von Graf, Will Dee, usrbinkat, SuperWojo, Alexandros Triantafyllidis, Swaroop Kallakuri, Dan Guido, John Detwiler, Pedro Madruga, Iucharbius, Viktor Bowallius, Asp the Wyvern, Edmond Seymore, Trenton Dambrowitz, Space Cruiser, Spiking Neurons AB, Pyrater, LangChain4j, Tony Hughes, Kacper Wikieł, Rishabh Srivastava, David Ziegler, Luke Pendergrass, Andrey, Gabriel Puliatti, Lone Striker, Sebastain Graf, Pierre Kircher, Randy H, NimbleBox.ai, Vadim, danny, Deo Leter
213
 
214
 
215
  Thank you to all my generous patrons and donaters!
@@ -220,7 +258,8 @@ And thank you again to a16z for their generous grant.
220
 
221
  # Original model card: Upstage's Llama 2 70B Instruct v2
222
 
223
- # LLaMa-2-70b-instruct-v2 model card
 
224
 
225
  ## Model Details
226
 
@@ -236,69 +275,76 @@ And thank you again to a16z for their generous grant.
236
 
237
  ### Used Datasets
238
  - Orca-style dataset
239
- - Alpaca-Style Dataset
 
 
240
 
241
 
242
  ### Prompt Template
243
  ```
244
  ### System:
245
  {System}
 
246
  ### User:
247
  {User}
 
248
  ### Assistant:
249
  {Assistant}
250
  ```
251
- ### Usage
252
 
253
- *Tested on A100 80GB*
 
 
 
254
 
255
  ```python
256
  import torch
257
  from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
 
258
  tokenizer = AutoTokenizer.from_pretrained("upstage/Llama-2-70b-instruct-v2")
259
  model = AutoModelForCausalLM.from_pretrained(
260
  "upstage/Llama-2-70b-instruct-v2",
261
- device_map='auto',
262
  torch_dtype=torch.float16,
263
  load_in_8bit=True,
264
- rope_scaling={'type': 'dynamic', 'factor': 2} # longer inputs possible
265
  )
266
- prompt = "### User:\nThomas is very healthy, but he has to go to the hospital every day. What could be the reasons?\n\n### Assistant:\n"
 
267
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
268
- del inputs['token_type_ids']
269
  streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
 
270
  output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=float('inf'))
271
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)
272
  ```
273
 
274
- **Our model can handle >10k input tokens thanks to the `rope_scaling` option.**
275
-
276
  ## Hardware and Software
277
 
278
  * **Hardware**: We utilized an A100x8 * 4 for training our model
279
- * **Training Factors**: We fine-tuned this model using a combination of the [DeepSpeed library](https://github.com/microsoft/DeepSpeed) and the [HuggingFace trainer](https://huggingface.co/docs/transformers/main_classes/trainer) / [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/index)
280
 
281
  ## Evaluation Results
282
 
283
  ### Overview
284
- - We conducted a performance evaluation based on the tasks being evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
285
- We evaluated our model on four benchmark datasets, which include `ARC-Challenge`, `HellaSwag`, `MMLU`, and `TruthfulQA`.
286
  We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness), specifically commit [b281b0921b636bc36ad05c0b0b0763bd6dd43463](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463).
 
287
 
288
  ### Main Results
289
- | Model | H4 Average | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
290
- |-----------------------------------------------|---------|-------|-----------|-------|------------|-------|----------|
291
- | **Llama-2-70b-instruct-v2** (***Ours***, ***Local Reproduction***) | **72.7** | **71.6** | **87.7** | **69.7** | **61.6** | | 7.440625 |
292
- | Llama-2-70b-instruct (Ours, Local Reproduction) | 72.0 | 70.7 | 87.4 | 69.3 | 60.7 | | 7.24375 |
293
- | llama-65b-instruct (Ours, Local Reproduction) | 69.4 | 67.6 | 86.5 | 64.9 | 58.8 | | |
294
- | Llama-2-70b-hf | 67.3 | 67.3 | 87.3 | 69.8 | 44.9 | | |
295
- | llama-30b-instruct-2048 (Ours, Open LLM Leaderboard) | 67.0 | 64.9 | 84.9 | 61.9 | 56.3 | | |
296
- | llama-30b-instruct-2048 (Ours, Local Reproduction) | 67.0 | 64.9 | 85.0 | 61.9 | 56.0 | | 6.88125 |
297
- | llama-30b-instruct (Ours, Open LLM Leaderboard) | 65.2 | 62.5 | 86.2 | 59.4 | 52.8 | | |
298
- | llama-65b | 64.2 | 63.5 | 86.1 | 63.9 | 43.4 | | |
299
- | falcon-40b-instruct | 63.4 | 61.6 | 84.3 | 55.4 | 52.5 | | |
300
-
301
- ### Scripts
302
  - Prepare evaluation environments:
303
  ```
304
  # clone the repository
@@ -309,12 +355,9 @@ git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
309
  cd lm-evaluation-harness
310
  ```
311
 
312
- ## Ethical Issues
313
-
314
- ### Ethical Considerations
315
- - There were no ethical issues involved, as we did not include the benchmark test set or the training set in the model's training process.
316
-
317
  ## Contact Us
318
 
319
- ### Why Upstage LLM?
320
- - [Upstage](https://en.upstage.ai)'s LLM research has yielded remarkable results. Our 30B model **outperforms all models around the world**, positioning itself as the leading performer. Recognizing the immense potential in implementing private LLM to actual businesses, we invite you to easily apply private LLM and fine-tune it with your own data. For a seamless and tailored solution, please do not hesitate to reach out to us. ► [click here to contact](https://www.upstage.ai/private-llm?utm_source=huggingface&utm_medium=link&utm_campaign=privatellm).
 
 
 
2
  inference: false
3
  language:
4
  - en
5
+ license: llama2
6
  model_creator: Upstage
7
  model_link: https://huggingface.co/upstage/Llama-2-70b-instruct-v2
8
  model_name: Llama 2 70B Instruct v2
 
37
  - Model creator: [Upstage](https://huggingface.co/Upstage)
38
  - Original model: [Llama 2 70B Instruct v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)
39
 
40
+ <!-- description start -->
41
  ## Description
42
 
43
  This repo contains GPTQ model files for [Upstage's Llama 2 70B Instruct v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2).
44
 
45
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
46
 
47
+ <!-- description end -->
48
+ <!-- repositories-available start -->
49
  ## Repositories available
50
 
51
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ)
52
+ * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GGUF)
53
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GGML)
54
  * [Upstage's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)
55
+ <!-- repositories-available end -->
56
 
57
+ <!-- prompt-template start -->
58
  ## Prompt template: Orca-Hashes
59
 
60
  ```
61
  ### System:
62
+ {system_message}
63
 
64
  ### User:
65
  {prompt}
66
 
67
  ### Assistant:
68
+
69
  ```
70
 
71
+ <!-- prompt-template end -->
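+ If you are building prompts in your own code, the template above can be filled in with a small helper like the minimal sketch below (the `build_prompt` name and the example system message are illustrative, not part of this repo):
+
+ ```python
+ # Minimal sketch of filling the Orca-Hashes template shown above.
+ def build_prompt(system_message: str, prompt: str) -> str:
+     return f"""### System:
+ {system_message}
+
+ ### User:
+ {prompt}
+
+ ### Assistant:
+ """
+
+ print(build_prompt("You are a helpful assistant.", "Tell me about AI"))
+ ```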
72
+
73
+ <!-- README_GPTQ.md-provided-files start -->
74
+ ## Provided files and GPTQ parameters
75
 
76
  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
77
 
78
  Each separate quant is in a different branch. See below for instructions on fetching from different branches.
79
 
80
+ All recent GPTQ files were made with AutoGPTQ, as were all files in the non-`main` branches. Files in the `main` branch that were uploaded before August 2023 were made with GPTQ-for-LLaMa.
81
+
82
+ <details>
83
+ <summary>Explanation of GPTQ parameters</summary>
84
+
85
+ - Bits: The bit size of the quantised model.
86
+ - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
87
+ - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
88
+ - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
89
+ - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
90
+ - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
91
+ - ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
92
 
93
+ </details>
94
+
95
+ | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
96
+ | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
97
+ | [main](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/main) | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 35.33 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
98
+ | [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 40.66 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
99
+ | [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 37.99 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
100
+ | [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 36.65 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
101
+ | [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 26.78 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
102
+ | [gptq-3bit-128g-actorder_False](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit-128g-actorder_False) | 3 | 128 | No | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 28.03 GB | No | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
103
+ | [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 28.03 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
104
+ | [gptq-3bit-64g-actorder_True](https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ/tree/gptq-3bit-64g-actorder_True) | 3 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 29.30 GB | No | 3-bit, with group size 64g and act-order. Poor AutoGPTQ CUDA speed. |
105
+
106
+ <!-- README_GPTQ.md-provided-files end -->
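+ As an illustration of how the table columns map onto AutoGPTQ's quantisation settings, the minimal sketch below builds a `BaseQuantizeConfig` with the values used for the `gptq-4bit-128g-actorder_True` branch. This is illustrative only: each branch already ships its own `quantize_config.json`, so you do not need to construct this yourself.
+
+ ```python
+ from auto_gptq import BaseQuantizeConfig
+
+ # Illustrative mapping of the table columns above to AutoGPTQ's config
+ # (values correspond to the gptq-4bit-128g-actorder_True branch).
+ quantize_config = BaseQuantizeConfig(
+     bits=4,            # "Bits" column
+     group_size=128,    # "GS" column; "None" in the table means no grouping (group_size=-1)
+     desc_act=True,     # "Act Order" column
+     damp_percent=0.1,  # "Damp %" column
+ )
+ ```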
107
+
108
+ <!-- README_GPTQ.md-download-from-branches start -->
109
  ## How to download from branches
110
 
111
  - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ:gptq-4bit-32g-actorder_True`
112
  - With Git, you can clone a branch with:
113
  ```
114
+ git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ
115
  ```
116
  - In Python Transformers code, the branch is the `revision` parameter; see below.
117
+ <!-- README_GPTQ.md-download-from-branches end -->
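+ You can also fetch a specific branch programmatically with `huggingface_hub`; a minimal sketch (the `local_dir` path is just an example):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the gptq-4bit-32g-actorder_True branch into a local folder.
+ snapshot_download(
+     repo_id="TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ",
+     revision="gptq-4bit-32g-actorder_True",
+     local_dir="Upstage-Llama-2-70B-instruct-v2-GPTQ",  # example path
+ )
+ ```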
118
+ <!-- README_GPTQ.md-text-generation-webui start -->
119
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
120
 
121
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
122
 
123
+ It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
124
 
125
  1. Click the **Model tab**.
126
  2. Under **Download custom model or LoRA**, enter `TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ`.
127
  - To download from a specific branch, enter for example `TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ:gptq-4bit-32g-actorder_True`
128
  - see Provided Files above for the list of branches for each option.
129
  3. Click **Download**.
130
+ 4. The model will start downloading. Once it's finished it will say "Done".
131
  5. In the top left, click the refresh icon next to **Model**.
132
  6. In the **Model** dropdown, choose the model you just downloaded: `Upstage-Llama-2-70B-instruct-v2-GPTQ`
133
  7. The model will automatically load, and is now ready for use!
134
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
135
+ * Note that you no longer need to (and should not) set GPTQ parameters manually. These are set automatically from the file `quantize_config.json`; see the sketch just after this section for what that file contains.
136
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
137
+ <!-- README_GPTQ.md-text-generation-webui end -->
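+ If you want to see which quantisation parameters a branch was made with, you can inspect its `quantize_config.json`; a minimal sketch, assuming the files have been downloaded into a folder named after the repo:
+
+ ```python
+ import json
+
+ # Path is an example; point it at wherever the model files were downloaded.
+ with open("Upstage-Llama-2-70B-instruct-v2-GPTQ/quantize_config.json") as f:
+     config = json.load(f)
+
+ # Typically contains entries such as bits, group_size, desc_act and damp_percent.
+ print(config)
+ ```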
138
 
139
+ <!-- README_GPTQ.md-use-from-python start -->
140
  ## How to use this GPTQ model from Python code
141
 
142
+ ### Install the necessary packages
143
+
144
+ Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
145
+
146
+ ```shell
147
+ pip3 install 'transformers>=4.32.0' 'optimum>=1.12.0'
148
+ pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
149
+ ```
150
+
151
+ If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
152
+
153
+ ```shell
154
+ pip3 uninstall -y auto-gptq
155
+ git clone https://github.com/PanQiWei/AutoGPTQ
156
+ cd AutoGPTQ
157
+ pip3 install .
158
+ ```
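+ To confirm that the installed versions meet the minimums listed above, a quick check using only the standard library:
+
+ ```python
+ from importlib.metadata import version
+
+ # Print the installed versions of the packages required above.
+ for package in ("transformers", "optimum", "auto-gptq"):
+     print(package, version(package))
+ ```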
159
+
160
+ ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
161
 
162
+ If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
163
+ ```shell
164
+ pip3 uninstall -y transformers
165
+ pip3 install git+https://github.com/huggingface/transformers.git
166
+ ```
167
 
168
+ ### You can then use the following code
169
 
170
  ```python
171
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
172
 
173
  model_name_or_path = "TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ"
174
+ # To use a different branch, change revision
175
+ # For example: revision="gptq-4bit-32g-actorder_True"
176
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
177
+ torch_dtype=torch.float16,
178
+ device_map="auto",
179
+ revision="main")
180
 
181
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
182
 
183
  prompt = "Tell me about AI"
184
  prompt_template=f'''### System:
185
+ {system_message}
186
 
187
  ### User:
188
  {prompt}
189
 
190
  ### Assistant:
191
+
192
  '''
193
 
194
  print("\n\n*** Generate:")
 
199
 
200
  # Inference can also be done using transformers' pipeline
201
202
  print("*** Pipeline:")
203
  pipe = pipeline(
204
  "text-generation",
 
212
 
213
  print(pipe(prompt_template)[0]['generated_text'])
214
  ```
215
+ <!-- README_GPTQ.md-use-from-python end -->
216
 
217
+ <!-- README_GPTQ.md-compatibility start -->
218
  ## Compatibility
219
 
220
+ The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
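+ For reference, loading "using AutoGPTQ directly" looks roughly like the minimal sketch below (pass `revision` to load a non-`main` branch; the parameters shown are common defaults, not a definitive recipe):
+
+ ```python
+ from auto_gptq import AutoGPTQForCausalLM
+
+ # Direct AutoGPTQ load, as an alternative to transformers' from_pretrained above.
+ # Add revision="gptq-4bit-32g-actorder_True" (for example) to load another branch.
+ model = AutoGPTQForCausalLM.from_quantized(
+     "TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ",
+     use_safetensors=True,
+     device="cuda:0",
+     quantize_config=None,
+ )
+ ```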
221
+
222
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
223
 
224
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
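+ A minimal sketch of TGI usage, assuming you already have a TGI server serving this model locally (the port and generation parameters below are examples):
+
+ ```python
+ import requests
+
+ # Assumes a TGI server is already running this model on localhost:8080.
+ response = requests.post(
+     "http://localhost:8080/generate",
+     json={
+         "inputs": "### System:\nYou are a helpful assistant.\n\n### User:\nTell me about AI\n\n### Assistant:\n",
+         "parameters": {"max_new_tokens": 256, "temperature": 0.7},
+     },
+ )
+ print(response.json()["generated_text"])
+ ```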
225
+ <!-- README_GPTQ.md-compatibility end -->
226
 
227
  <!-- footer start -->
228
  <!-- 200823 -->
 
247
 
248
  **Special thanks to**: Aemon Algiz.
249
 
250
+ **Patreon special mentions**: Russ Johnson, J, alfie_i, Alex, NimbleBox.ai, Chadd, Mandus, Nikolai Manek, Ken Nordquist, ya boyyy, Illia Dulskyi, Viktor Bowallius, vamX, Iucharbius, zynix, Magnesian, Clay Pascal, Pierre Kircher, Enrico Ros, Tony Hughes, Elle, Andrey, knownsqashed, Deep Realms, Jerry Meng, Lone Striker, Derek Yates, Pyrater, Mesiah Bishop, James Bentley, Femi Adebogun, Brandon Frisco, SuperWojo, Alps Aficionado, Michael Dempsey, Vitor Caleffi, Will Dee, Edmond Seymore, usrbinkat, LangChain4j, Kacper Wikieł, Luke Pendergrass, John Detwiler, theTransient, Nathan LeClaire, Tiffany J. Kim, biorpg, Eugene Pentland, Stanislav Ovsiannikov, Fred von Graf, terasurfer, Kalila, Dan Guido, Nitin Borwankar, 阿明, Ai Maven, John Villwock, Gabriel Puliatti, Stephen Murray, Asp the Wyvern, danny, Chris Smitley, ReadyPlayerEmma, S_X, Daniel P. Andersen, Olakabola, Jeffrey Morgan, Imad Khwaja, Caitlyn Gatomon, webtim, Alicia Loh, Trenton Dambrowitz, Swaroop Kallakuri, Erik Bjäreholt, Leonard Tan, Spiking Neurons AB, Luke @flexchar, Ajan Kanaga, Thomas Belote, Deo Leter, RoA, Willem Michiel, transmissions 11, subjectnull, Matthew Berman, Joseph William Delisle, David Ziegler, Michael Davis, Johann-Peter Hartmann, Talal Aujan, senxiiz, Artur Olbinski, Rainer Wilmers, Spencer Kim, Fen Risland, Cap'n Zoog, Rishabh Srivastava, Michael Levine, Geoffrey Montalvo, Sean Connelly, Alexandros Triantafyllidis, Pieter, Gabriel Tamborski, Sam, Subspace Studios, Junyu Yang, Pedro Madruga, Vadim, Cory Kujawski, K, Raven Klaugh, Randy H, Mano Prime, Sebastain Graf, Space Cruiser
251
 
252
 
253
  Thank you to all my generous patrons and donaters!
 
258
 
259
  # Original model card: Upstage's Llama 2 70B Instruct v2
260
 
261
+ # SOLAR-0-70b-16bit model card
262
+ The model name has been changed from `LLaMa-2-70b-instruct-v2` to `SOLAR-0-70b-16bit`.
263
 
264
  ## Model Details
265
 
 
275
 
276
  ### Used Datasets
277
  - Orca-style dataset
278
+ - Alpaca-style dataset
279
+ - No other dataset was used except for the datasets mentioned above
+ - No benchmark test set or training set was used
281
 
282
 
283
  ### Prompt Template
284
  ```
285
  ### System:
286
  {System}
287
+
288
  ### User:
289
  {User}
290
+
291
  ### Assistant:
292
  {Assistant}
293
  ```
 
294
 
295
+ ## Usage
296
+
297
+ - The following was tested on an A100 80GB GPU
298
+ - Our model can handle up to 10k+ input tokens, thanks to the `rope_scaling` option
299
 
300
  ```python
301
  import torch
302
  from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
303
+
304
  tokenizer = AutoTokenizer.from_pretrained("upstage/Llama-2-70b-instruct-v2")
305
  model = AutoModelForCausalLM.from_pretrained(
306
  "upstage/Llama-2-70b-instruct-v2",
307
+ device_map="auto",
308
  torch_dtype=torch.float16,
309
  load_in_8bit=True,
310
+ rope_scaling={"type": "dynamic", "factor": 2} # allows handling of longer inputs
311
  )
312
+
313
+ prompt = "### User:\nThomas is healthy, but he has to go to the hospital. What could be the reasons?\n\n### Assistant:\n"
314
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
315
+ del inputs["token_type_ids"]
316
  streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
317
+
318
  output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=float('inf'))
319
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)
320
  ```
321
 
 
 
322
  ## Hardware and Software
323
 
324
  * **Hardware**: We utilized an A100x8 * 4 for training our model
325
+ * **Training Factors**: We fine-tuned this model using a combination of the [DeepSpeed library](https://github.com/microsoft/DeepSpeed) and the [HuggingFace Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) / [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/index)
326
 
327
  ## Evaluation Results
328
 
329
  ### Overview
330
+ - We conducted a performance evaluation based on the tasks evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+ We evaluated our model on four benchmark datasets: `ARC-Challenge`, `HellaSwag`, `MMLU`, and `TruthfulQA`.
332
  We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness), specifically commit [b281b0921b636bc36ad05c0b0b0763bd6dd43463](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463).
333
+ - We used [MT-bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), a set of challenging multi-turn open-ended questions, to evaluate the models
334
 
335
  ### Main Results
336
+ | Model | H4(Avg) | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
337
+ |--------------------------------------------------------------------|----------|----------|----------|------|----------|-|-------------|
338
+ | **[Llama-2-70b-instruct-v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)**(***Ours***, ***Open LLM Leaderboard***) | **73** | **71.1** | **87.9** | **70.6** | **62.2** | | **7.44063** |
339
+ | [Llama-2-70b-instruct](https://huggingface.co/upstage/Llama-2-70b-instruct) (Ours, Open LLM Leaderboard) | 72.3 | 70.9 | 87.5 | 69.8 | 61 | | 7.24375 |
340
+ | [llama-65b-instruct](https://huggingface.co/upstage/llama-65b-instruct) (Ours, Open LLM Leaderboard) | 69.4 | 67.6 | 86.5 | 64.9 | 58.8 | | |
341
+ | Llama-2-70b-hf | 67.3 | 67.3 | 87.3 | 69.8 | 44.9 | | |
342
+ | [llama-30b-instruct-2048](https://huggingface.co/upstage/llama-30b-instruct-2048) (Ours, Open LLM Leaderboard) | 67.0 | 64.9 | 84.9 | 61.9 | 56.3 | | |
343
+ | [llama-30b-instruct](https://huggingface.co/upstage/llama-30b-instruct) (Ours, Open LLM Leaderboard) | 65.2 | 62.5 | 86.2 | 59.4 | 52.8 | | |
344
+ | llama-65b | 64.2 | 63.5 | 86.1 | 63.9 | 43.4 | | |
345
+ | falcon-40b-instruct | 63.4 | 61.6 | 84.3 | 55.4 | 52.5 | | |
346
+
347
+ ### Scripts for H4 Score Reproduction
 
348
  - Prepare evaluation environments:
349
  ```
350
  # clone the repository
 
355
  cd lm-evaluation-harness
356
  ```
357
 
358
  ## Contact Us
359
 
360
+ ### About Upstage
+ - [Upstage](https://en.upstage.ai) is a company specializing in Large Language Models (LLMs) and AI. We will help you build private LLMs and related applications.
+ If you have a dataset for building domain-specific LLMs or LLM applications, please contact us ► [click here to contact](https://www.upstage.ai/private-llm?utm_source=huggingface&utm_medium=link&utm_campaign=privatellm)
+ - As of August 1st, our 70B model has reached the top spot in the Open LLM Leaderboard rankings, marking it as the current leading performer globally.