Update README.md
README.md
CHANGED
@@ -35,25 +35,11 @@ In order to use the current quantized model, support is offered for different so
|
|
35 |
|
36 |
### 🤗 transformers
|
37 |
|
38 |
-
In order to run the inference with Llama 3.1 405B Instruct GPTQ in INT4,
|
39 |
|
40 |
```bash
|
41 |
-
pip install
|
42 |
-
pip install
|
43 |
-
```
|
44 |
-
|
45 |
-
Otherwise, running the model may fail, since the AutoGPTQ kernels are built with PyTorch 2.2.1, meaning that those will break with PyTorch 2.3.0.
|
46 |
-
|
47 |
-
Then, the latest version of `transformers` need to be installed including the `accelerate` extra, being 4.43.0 or higher, as:
|
48 |
-
|
49 |
-
```bash
|
50 |
-
pip install "transformers[accelerate]>=4.43.0" --upgrade
|
51 |
-
```
|
52 |
-
|
53 |
-
Finally, in order to use `autogptq`, `optimum` also needs to be installed:
|
54 |
-
|
55 |
-
```bash
|
56 |
-
pip install optimum --upgrade
|
57 |
```
|
58 |
|
59 |
To run the inference on top of Llama 3.1 405B Instruct GPTQ in INT4 precision, the GPTQ model can be instantiated as any other causal language modeling model via `AutoModelForCausalLM` and run the inference normally.
|
@@ -91,30 +77,14 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
|
|
91 |
|
92 |
### AutoGPTQ
|
93 |
|
94 |
-
|
95 |
-
|
96 |
-
In order to run the inference with Llama 3.1 405B Instruct GPTQ in INT4, both `torch` and `autogptq` need to be installed as:
|
97 |
-
|
98 |
-
```bash
|
99 |
-
pip install "torch>=2.2.0,<2.3.0" --upgrade
|
100 |
-
pip install auto-gptq --no-build-isolation
|
101 |
-
```
|
102 |
-
|
103 |
-
Otherwise, running the model may fail, since the AutoGPTQ kernels are built with PyTorch 2.2.1, meaning that those will break with PyTorch 2.3.0.
|
104 |
-
|
105 |
-
Then, the latest version of `transformers` need to be installed including the `accelerate` extra, being 4.43.0 or higher, as:
|
106 |
-
|
107 |
-
```bash
|
108 |
-
pip install "transformers[accelerate]>=4.43.0" --upgrade
|
109 |
-
```
|
110 |
-
|
111 |
-
Finally, in order to use `autogptq`, `optimum` also needs to be installed:
|
112 |
|
113 |
```bash
|
114 |
-
pip install
|
115 |
```
|
116 |
|
117 |
-
|
118 |
|
119 |
```python
|
120 |
import torch
|
@@ -148,7 +118,7 @@ outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
|
|
148 |
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
|
149 |
```
|
150 |
|
151 |
-
The AutoGPTQ script has been adapted from [AutoGPTQ/examples/quantization/basic_usage.py](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/examples/quantization/basic_usage.py).
|
152 |
|
153 |
### 🤗 Text Generation Inference (TGI)
|
154 |
|
@@ -159,28 +129,14 @@ Coming soon!
|
|
159 |
> [!NOTE]
|
160 |
> In order to quantize Llama 3.1 405B Instruct using AutoGPTQ, you will need to use an instance with at least enough CPU RAM to fit the whole model i.e. ~800GiB, and an NVIDIA GPU with 80GiB of VRAM to quantize it.
|
161 |
|
162 |
-
In order to quantize Llama 3.1 405B Instruct,
|
163 |
-
|
164 |
-
```bash
|
165 |
-
pip install "torch>=2.2.0,<2.3.0" --upgrade
|
166 |
-
pip install auto-gptq --no-build-isolation
|
167 |
-
```
|
168 |
-
|
169 |
-
Otherwise the quantization may fail, since the AutoGPTQ kernels are built with PyTorch 2.2.1, meaning that those will break with PyTorch 2.3.0.
|
170 |
-
|
171 |
-
Then install the latest version of `transformers` as follows:
|
172 |
-
|
173 |
-
```bash
|
174 |
-
pip install "transformers>=4.43.0" --upgrade
|
175 |
-
```
|
176 |
-
|
177 |
-
Finally, in order to use `autogptq`, `optimum` also needs to be installed:
|
178 |
|
179 |
```bash
|
180 |
-
pip install
|
181 |
```
|
182 |
|
183 |
-
|
184 |
|
185 |
```python
|
186 |
import random
|
|
|
35 |
|
36 |
### 🤗 transformers
|
37 |
|
38 |
+
In order to run the inference with Llama 3.1 405B Instruct GPTQ in INT4, you need to install the following packages:
|
39 |
|
40 |
```bash
|
41 |
+
pip install -q --upgrade transformers accelerate optimum
|
42 |
+
pip install -q --no-build-isolation auto-gptq
|
43 |
```
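
As a quick sanity check after installation, the sketch below reports which of the packages named in the commands above are present in the current environment. It relies only on the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and is not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above.
REQUIRED = ("transformers", "accelerate", "optimum", "auto-gptq")


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Any entry reported as `not installed` suggests the corresponding `pip install` step did not complete in the active environment.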
|
44 |
|
45 |
To run the inference on top of Llama 3.1 405B Instruct GPTQ in INT4 precision, the GPTQ model can be instantiated as any other causal language modeling model via `AutoModelForCausalLM` and run the inference normally.
|
|
|
77 |
|
78 |
### AutoGPTQ
|
79 |
|
80 |
+
In order to run the inference with Llama 3.1 405B Instruct GPTQ in INT4, you need to install the following packages:
|
81 |
|
82 |
```bash
|
83 |
+
pip install -q --upgrade transformers accelerate optimum
|
84 |
+
pip install -q --no-build-isolation auto-gptq
|
85 |
```
|
86 |
|
87 |
+
Alternatively, you can run the model via `AutoGPTQ` directly, even though it is built on top of 🤗 `transformers`, which remains the recommended approach as described above.
|
88 |
|
89 |
```python
|
90 |
import torch
|
|
|
118 |
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
|
119 |
```
|
120 |
|
121 |
+
The AutoGPTQ script has been adapted from [`AutoGPTQ/examples/quantization/basic_usage.py`](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/examples/quantization/basic_usage.py).
|
122 |
|
123 |
### 🤗 Text Generation Inference (TGI)
|
124 |
|
|
|
129 |
> [!NOTE]
|
130 |
> In order to quantize Llama 3.1 405B Instruct using AutoGPTQ, you will need to use an instance with at least enough CPU RAM to fit the whole model i.e. ~800GiB, and an NVIDIA GPU with 80GiB of VRAM to quantize it.
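
The memory figure in the note can be sanity-checked with a rough estimate, sketched below under stated assumptions: a nominal 405 billion parameters, 2 bytes per parameter for 16-bit weights, and 0.5 bytes per parameter for INT4. Quantization scales, zero points, activations, and any other overhead are ignored, and the helper `weights_gib` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in GiB (no overhead included)."""
    return num_params * bytes_per_param / GIB


# Assumption: nominal parameter count implied by the "405B" in the model name.
NUM_PARAMS = 405e9

print(f"16-bit weights: {weights_gib(NUM_PARAMS, 2):.0f} GiB")
print(f"INT4 weights:   {weights_gib(NUM_PARAMS, 0.5):.0f} GiB")
```

Under these assumptions the 16-bit weights alone come to roughly 754 GiB, so an instance with ~800GiB of CPU RAM leaves only modest headroom, while the INT4 weights come to roughly 189 GiB before overhead.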
|
131 |
|
132 |
+
In order to quantize Llama 3.1 405B Instruct with GPTQ in INT4, you need to install the following packages:
|
133 |
|
134 |
```bash
|
135 |
+
pip install -q --upgrade transformers accelerate optimum
|
136 |
+
pip install -q --no-build-isolation auto-gptq
|
137 |
```
|
138 |
|
139 |
+
Then run the following script, adapted from [`AutoGPTQ/examples/quantization/basic_usage.py`](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/examples/quantization/basic_usage.py).
|
140 |
|
141 |
```python
|
142 |
import random
|