Cannot use anything but what's in the main branch

#3
by HAvietisov - opened

I want to try the 32g version, but can't get it to work. I get "FileNotFoundError: Could not find model in TheBloke/Llama-2-7b-Chat-GPTQ".
The code is below:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "gptq_model-4bit-32g"
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None,
        rope_scaling={"type": "dynamic", "factor": 2})

Yeah sorry it turns out there's a bug in AutoGPTQ with the revision parameter at the moment. I have just pushed a PR to AutoGPTQ to fix it, which you can see here: https://github.com/PanQiWei/AutoGPTQ/pull/205

Also I discovered recently that there's a bug in AutoGPTQ 0.3.0 which breaks inference with group_size + desc_act together. So currently you can't do inference with the model you want, unless you downgrade to 0.2.2. Both the inference bug and the revision bug should be fixed in AutoGPTQ 0.3.1, which I hope will come out in the next 24 hours.

Your options are:

  1. Wait until AutoGPTQ 0.3.1 is released which will fix both bugs, or
  2. Downgrade to AutoGPTQ 0.2.2, then use this code instead, which first downloads the branch locally and then does inference from there:

Change local_base_folder to a suitable base path where you want the model folder to be created. Note that the model files will still mostly be stored in the Hugging Face cache folder, but symlinks will be created in the path specified, which you can then point AutoGPTQ to.

import os
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from huggingface_hub import snapshot_download

model_name = "TheBloke/Llama-2-7b-Chat-GPTQ"
branch = "gptq-4bit-32g-actorder_True"
local_base_folder = "/workspace"
local_folder = os.path.join(local_base_folder, f"{model_name.replace('/', '_')}_{branch}")

snapshot_download(repo_id=model_name, local_dir=local_folder, revision=branch)

model_basename = "gptq_model-4bit-32g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(local_folder, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(local_folder,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

input_ids = tokenizer("Llamas are", return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Thanks, this helped me.

Great. By the way, AutoGPTQ 0.3.2 released yesterday and fixes this issue, so you can now upgrade AutoGPTQ to the latest version again.

Ohkay thanks!😊

Ohhhh, thanks a lot!
P.S. The inference bug is still not fixed in 0.3.1.
For some reason 0.3.2 is not available for me at the moment,
and the inference speed is just terrible: updating from 0.2.2 to 0.3.1 changed the inference speed from 8 t/s to 1.7 t/s.

I am using this model to fine-tune on my dataset, using the method above: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ/discussions/3#64bffa5b2263348d85c1c662.
I am using SFTTrainer and getting this error: HFValidationError: Repo id must use alphanumeric chars..
Any solution?
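
One possible cause, offered as an assumption rather than a diagnosis: when a local folder path does not exist or is mistyped, the loader may treat the string as a Hub repo id, and a path such as /workspace/... then fails repo id validation. A minimal sketch of a check to run before handing the path to a trainer or loader is below. The helper name require_local_model_dir is hypothetical, and the config.json check assumes the folder was created by snapshot_download as in the method above.

```python
import os


def require_local_model_dir(path):
    """Return the absolute path if it is an existing folder that contains config.json.

    Hypothetical helper: it only checks the filesystem and does not load anything.
    """
    absolute = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(absolute):
        raise FileNotFoundError(f"Model folder does not exist: {absolute}")
    if not os.path.isfile(os.path.join(absolute, "config.json")):
        raise FileNotFoundError(f"No config.json found in: {absolute}")
    return absolute
```

If this raises, the path is the problem rather than the model; if it returns, pass the returned absolute path to the loader instead of the original string.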

I'm running auto-gptq version 0.3.1 and I have the same issue, even when trying to use the "main" branch model. How can I get around it?

The method from @TheBloke's post above works, but you'll have to downgrade to 0.2.2, @lucasalvarezlacasa.

@HAvietisov terrible inference performance means you very likely don't have the AutoGPTQ CUDA kernel compiled. This is a common problem at the moment.

All: AutoGPTQ 0.3.2 is working fine in general for me. However, you may need to build it from source, as the PyPI package has multiple problems at the moment that the AutoGPTQ author still has not been able to look at.

Can you all try this:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .

and report back. If you're using any kind of UI, like text-generation-webui, you must do the above in the Python environment of that UI. The text-generation-webui one click installer creates its own Conda environment and you would need to run the above commands with that conda environment activated.
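
To confirm which Python environment you are actually in, and whether it can see AutoGPTQ at all, something like the following may help. It is a rough sketch using only the standard library; the helper names installed_version and describe_environment are made up for illustration, and "auto-gptq" is the package name used in the pip command above.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of package_name, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(package_name="auto-gptq"):
    """Report the running interpreter and whether the package is visible to it."""
    version = installed_version(package_name)
    status = version if version is not None else "not installed"
    return f"{sys.executable}: {package_name} {status}"
```

Running print(describe_environment()) inside the UI's environment should show that environment's interpreter path. If it shows a different interpreter, or "not installed", the install went into another environment.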

I also just realised that there's still a bug with the AutoGPTQ revision parameter, which means that if you request eg the 32g model, it will download it OK, but it downloads the quantize_config.json from the main branch. So you get this error:

  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda_old.py", line 261, in forward
    weight = (scales * (weight - zeros))
RuntimeError: The size of tensor a (32) must match the size of tensor b (128) at non-singleton dimension 0

That is my fault and I will need to fix it in AutoGPTQ. I'll try to do that soon.
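
If you already have a local copy of a branch, a quick way to spot this mismatch before loading is to compare the group_size in quantize_config.json against the branch name. This is only a sketch: the helper names are hypothetical, and it assumes the branch name encodes the group size in the form seen here (for example "-32g-" in gptq-4bit-32g-actorder_True).

```python
import json
import os
import re


def group_size_from_branch(branch):
    """Extract the group size from a branch name like 'gptq-4bit-32g-actorder_True'.

    Returns None when the name has no '-<number>g' part (for example 'main').
    """
    match = re.search(r"-(\d+)g(?:-|$)", branch)
    return int(match.group(1)) if match else None


def check_quantize_config(local_folder, branch):
    """Return (ok, expected, actual) for group_size in the folder's quantize_config.json."""
    with open(os.path.join(local_folder, "quantize_config.json")) as f:
        config = json.load(f)
    expected = group_size_from_branch(branch)
    actual = config.get("group_size")
    # With no group size in the branch name there is nothing to compare against.
    ok = expected is None or expected == actual
    return ok, expected, actual
```

A result of (False, 32, 128) would match the tensor size error above: 32g weights paired with the main branch's 128g config.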

So if you want to use an alternative branch version with AutoGPTQ, please download it first rather than fetching it straight from the hub in the AutoGPTQ call. The following test code does that; I ran it successfully on a random pod with CUDA 11.8 and PyTorch 2.0.1:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from huggingface_hub import snapshot_download

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"

use_triton = False

# We can download the tokenizer from the main branch as they're all the same
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# To download from a specific branch, use the revision parameter, as in this example:

# First download the model, from the desired branch to the specified local_folder - change this location to where you want the model to download to
local_folder="/workspace/llama-2-7b-gptq-32g"
snapshot_download(repo_id=model_name_or_path, local_dir=local_folder, revision="gptq-4bit-32g-actorder_True")

model = AutoGPTQForCausalLM.from_quantized(local_folder,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        quantize_config=None)

prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

And here's the output from me running it:

root@34ea00540a00:~# python3 test.py
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 727/727 [00:00<00:00, 4.04MB/s]
Downloading tokenizer.model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 4.29MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 8.35MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 411/411 [00:00<00:00, 2.69MB/s]
Downloading (…)b06239e96013b/Notice: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 553kB/s]
Downloading (…)9e96013b/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 548/548 [00:00<00:00, 3.15MB/s]
Downloading (…)06239e96013b/LICENSE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.3k/50.3k [00:00<00:00, 3.04MB/s]
Downloading (…)6013b/.gitattributes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 7.07MB/s]
Downloading (…)239e96013b/README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20.1k/20.1k [00:00<00:00, 69.2MB/s]
Downloading (…)quantize_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 183/183 [00:00<00:00, 432kB/s]
Downloading (…)neration_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 651kB/s]
Downloading (…)4bit-32g.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.28G/4.28G [00:43<00:00, 98.7MB/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:44<00:00,  3.74s/it]
The safetensors archive passed at /workspace/llama-2-7b-gptq-32g/gptq_model-4bit-32g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.


*** Generate:
<s> [INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Tell me about AI [/INST]  Hello! I'm here to help you with any questions you may have. AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, and decision-making. AI technology has been rapidly advancing in recent years, and it has the potential to revolutionize many industries, including healthcare, finance, transportation, and education.
There are several types of AI, including:
1. Narrow or weak AI: This type of AI is designed to perform a specific task, such as playing chess or recognizing faces.
2. General or strong AI: This type of AI is designed to perform any intellectual task that a human can, and it is still a topic of ongoing research and development.
3. Superintelligence: This type of AI is significantly more intelligent than the best human minds, and it is still a topic of debate and speculation.
It's important to note that AI is not a single entity, but rather a collection of technologies and techniques that are being developed and improved upon by researchers and developers around the world.
I hope this helps! Is there anything else you would like to know about AI?</s>

root@34ea00540a00:~#

@TheBloke what is the expected inference time when using GPTQ models? I found it to be extremely slow (40 to 50 s on average) compared to just using the raw official models from meta-llama (15 to 20 s on average) for the 7B chat model. Is this expected, or might there be something wrong on my side?

Thanks for your support.
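
Since wall-clock seconds depend on how many tokens were generated, it can be easier to compare setups in tokens per second, as in the t/s figures earlier in this thread. A minimal sketch of such a measurement is below. The helper tokens_per_second is hypothetical, and it assumes the callable you pass in returns the full output token sequence (prompt plus newly generated tokens).

```python
import time


def tokens_per_second(generate_fn, prompt_token_count):
    """Time one call to generate_fn and return (new_tokens, seconds, tokens_per_second).

    generate_fn is assumed to return the whole output sequence, prompt included,
    so the prompt length is subtracted to count only newly generated tokens.
    """
    start = time.perf_counter()
    output_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_tokens) - prompt_token_count
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, elapsed, rate
```

With the test code above, this could be called as tokens_per_second(lambda: model.generate(inputs=input_ids, max_new_tokens=512)[0], input_ids.shape[1]), once for the GPTQ model and once for the unquantized one.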

@lucasalvarezlacasa I have come across this same problem.

@Vithika any solutions?

@lucasalvarezlacasa @Vithika Did you guys get any solutions?

Hi all, I tried installing the latest AutoGPTQ version and also downgrading to AutoGPTQ 0.2.2 (as advised by Bing, using: pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/), but it still shows that AutoGPTQ is not installed. How can I fix this problem?
