Gibberish nonsense in GPTQ
Hi @TheBloke, I tried to run inference using the latest GPTQ-for-LLaMa with a simple instruction, "Write a positive movie review":
python llama_inference.py TheBloke/koala-7B-GPTQ-4bit-128g --wbits 4 --groupsize 128 --load koala-7B-4bit-128g.safetensors --text "Write a positive movie review" --device=0
Yet, it returns nonsense output:
β please write a positive movie review\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
May I ask how to fix this?
Thank you!
As explained in the README, this is expected if you don't use a more recent version of GPTQ-for-LLaMa inside text-generation-webui.
I just updated the README to make this clearer.
You could update the GPTQ-for-LLaMa installation; however, you can't use the very latest code because right now it doesn't work with text-generation-webui. You need to clone GPTQ-for-LLaMa at commit 58c8ab4c7aaccc50f507fd08cce941976affe5e0.
Or the easier method is to not use the safetensors file, and instead use koala-7B-4bit-128g.no-act-order.ooba.pt
So:
- Clone the repo locally
- Move/delete the safetensors file and koala-7B-4bit-128g.pt, such that the only model file remaining is koala-7B-4bit-128g.no-act-order.ooba.pt
- Now run text-generation-webui and it will work.
If you don't want to clone the whole repo, just download all the JSON files, and the single model file koala-7B-4bit-128g.no-act-order.ooba.pt.
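The file selection described above can be sketched as a small filter. This is only an illustration: the helper name `files_to_keep` and the sample listing are hypothetical, and the code only decides which names to keep. It does not download or delete anything.

```python
# The single model file recommended in the post above.
MODEL_FILE = "koala-7B-4bit-128g.no-act-order.ooba.pt"


def files_to_keep(filenames):
    """Return the JSON files plus the one recommended model file."""
    return [
        name for name in filenames
        if name.endswith(".json") or name == MODEL_FILE
    ]


# Hypothetical repo listing, for illustration only.
listing = [
    "README.md",
    "config.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "koala-7B-4bit-128g.safetensors",
    "koala-7B-4bit-128g.pt",
    "koala-7B-4bit-128g.no-act-order.ooba.pt",
]
print(files_to_keep(listing))
```

Anything the filter leaves out (here the safetensors file and koala-7B-4bit-128g.pt) is what the steps above say to move or delete.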
Hi @TheBloke ,
I just cloned the latest version of GPTQ-for-LLaMa today (commit: 1cb5ae890f785a55f20ab07406423d0a05d22073). I didn't use text-generation-webui. Instead, I just used the inference code from the GPTQ-for-LLaMa repo to get the generated results.
https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/llama_inference.py
Is this checkpoint not compatible with the GPTQ inference code? Do I need to do any special preprocessing?
Thank you!
Oh sorry, I did not read your initial message properly. I didn't spot you were calling llama_inference.py.
Hmm, that result is surprising to me. I thought the latest GPTQ-for-LLaMa code would work with the safetensors file.
Maybe he has made some changes that break it.
Try this and let me know if it works:
git clone -n https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-April13
cd gptq-April13
git checkout 58c8ab4c7aaccc50f507fd08cce941976affe5e0
python llama_inference.py TheBloke/koala-7B-GPTQ-4bit-128g --wbits 4 --groupsize 128 --load koala-7B-4bit-128g.safetensors --text "Write a positive movie review" --device=0
Thank you for the help. The output is much more reasonable now:
β Write a positive movie review:
"I recently watched a movie called "The Lighthouse," and I was thoroughly impressed by its unique style and captivating story. The two main characters, Thomas (played by Willem
Do I need to format the input into a special server/client format, like
https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
Or is the plain instruction "Write a positive movie review" enough?
Besides, I see that there are two versions of Koala. This is v2, right?
Great! Glad that's working now.
Here is the format I use for Koala. This seems to work best for how it was trained:
BEGINNING OF CONVERSATION:
USER: write a story about llamas
GPT:
The other format that is commonly used for these models is this one; for example, it is recommended for Vicuna:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:
I think the first prompt is meant to be better on Koala, but you could try both and see how you get on.
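To make it easy to try both, the two prompt formats above can be wrapped in small helpers. This is a minimal sketch: the function names are hypothetical, and the line breaks are an assumption taken from how the prompts are displayed in this thread, so the exact whitespace the models were trained on may differ.

```python
def koala_prompt(instruction):
    """Koala-style prompt, laid out as shown in the post above."""
    return f"BEGINNING OF CONVERSATION:\nUSER: {instruction}\nGPT:"


def instruction_prompt(instruction):
    """The '### Instruction / ### Response' style prompt from the post above."""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n"
        f"### Instruction:\n{instruction}\n### Response:"
    )


print(koala_prompt("Write a positive movie review"))
print(instruction_prompt("Write a positive movie review"))
```

The resulting string would then be passed as the --text argument in place of the bare instruction.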
As for two versions: I haven't heard of a second version of Koala? This is definitely v1, and I didn't know there was a v2.
There was a v1.0 and v1.1 of Vicuna - are you maybe thinking of that? I have done v1.1 uploads for Vicuna if you want to try that. It's very good. I have heard reports that Koala may do better with some prompts, and Vicuna 1.1 may do better with other prompts. So they are both worth trying.
Thank you for answering. I see there are two versions of koala here, v1 and v2:
https://huggingface.co/young-geng/koala/tree/main
Is the Vicuna prompt you sent for v1 or v1.1?
May I ask where I can find more resources on how to prompt Vicuna and Koala properly?
Oh OK yeah I see. Then yes this is Koala v2. They published the v1 and v2 files at the same time, so I guess v1 was an earlier attempt and v2 was tweaked in some way. To my knowledge they never explained why there was v1 and v2. But yes I used the v2.
The Vicuna prompt I sent works equally on v1 and v1.1. There is only a small difference between Vicuna 1.0 and 1.1 - but 1.1 is slightly better.
I don't know of specific resources on prompting these models. But I can recommend the YouTube channel of Sam Witteveen. Each time a new model comes out he tries it in Google Colab and publishes the code. I think it was on his YouTube about Koala that I saw the prompt that I showed you above. His channel is here: https://www.youtube.com/@samwitteveenai
You could also try discussing on Discord, if you use that. Here are two Discord servers I use:
Nomic AI (GPT4ALL): https://discord.gg/ZHaesTnb
Alpaca-Lora: https://discord.gg/2pUCtdeS
They both have good discussions on local models, how to prompt them, etc.
Thank you so much. That's very helpful.
By the way, did you compare the inference speed of GPTQ and fp16?
Not formally. But I have done inference in both, and I seem to remember that GPTQ was always quite a bit quicker, even on an A100 with 40GB VRAM.
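For a more formal comparison, a rough timing helper along these lines could be used. It is only a sketch: `tokens_per_second` and `fake_generate` are hypothetical names, and the stand-in function would need to be replaced by a real GPTQ or fp16 generation call that blocks until generation has finished and returns the number of tokens produced.

```python
import time


def tokens_per_second(generate, n_runs=3):
    """Time a callable that returns the number of tokens it generated."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate():
    # Stand-in for a real model call: sleeps briefly and reports 32 tokens.
    time.sleep(0.01)
    return 32


print(f"{tokens_per_second(fake_generate):.1f} tokens/s")
```

Running the same helper with the same prompt and token budget for both models gives numbers that can be compared directly.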
Thank you. I also found something odd in the tokenization:
>>> from transformers import AutoTokenizer
>>> tokenizer1=AutoTokenizer.from_pretrained('huggyllama/llama-7b',use_fast=False)
>>> tokenizer2=AutoTokenizer.from_pretrained('TheBloke/koala-7B-HF', use_fast=False)
>>> tokenizer1.decode(tokenizer1.encode('hello', add_special_tokens=True))
' β hello'
>>> tokenizer2.decode(tokenizer2.encode('hello', add_special_tokens=True))
'<s>hello'
Have you encountered the same issue?
I think that's normal? These later models define start-of-string (or start-of-generation) as <s>, end-of-string (or end of generation/end of text) as </s> and also have an <unk> token.
I think these are meant to be defined in special_tokens_map.json, like you see here in Vicuna: https://huggingface.co/TheBloke/vicuna-13B-1.1-HF/blob/main/special_tokens_map.json
But that file is empty in Koala. Maybe I should fill it out. To be honest I've never been 100% certain as to what models added/changed those special tokens and whether I was meant to fill in that file manually if it wasn't filled out by the conversion process. But all the recent models seem to use that special_tokens_map.json so I think it's correct.
I think using vicuna or llama's tokenizer config/mapping should be okay. Thanks.
I just looked at the code for Koala EasyLM, the custom inference server they built (similar in principle to Vicuna FastChat I believe). And when they convert the model to HF they specifically set special_tokens_map.json to be empty:
def write_tokenizer(tokenizer_path, input_tokenizer_path):
    print(f"Fetching the tokenizer from {input_tokenizer_path}.")
    os.makedirs(tokenizer_path, exist_ok=True)
    write_json({}, os.path.join(tokenizer_path, "special_tokens_map.json"))
    write_json(
        {
            "bos_token": "",
            "eos_token": "",
            "model_max_length": int(1e30),
            "tokenizer_class": "LlamaTokenizer",
            "unk_token": "",
        },
        os.path.join(tokenizer_path, "tokenizer_config.json"),
    )
    shutil.copyfile(input_tokenizer_path, os.path.join(tokenizer_path, "tokenizer.model"))
So I'm not sure why they do that or what the implications are, but it would seem that we're not meant to define the <s> etc for Koala like we do for Vicuna.
Also I'm confused because I'm not getting the same results you are when I run your test code. Comparing Koala and Vicuna, I do not get a <s> from Koala like you did:
>>> from transformers import AutoTokenizer
>>> tokenizer_koala=AutoTokenizer.from_pretrained('TheBloke/koala-7B-HF', use_fast=False)
>>> tokenizer_vicuna=AutoTokenizer.from_pretrained('TheBloke/vicuna-13B-1.1-HF', use_fast=False)
>>> tokenizer_koala.decode(tokenizer_koala.encode('hello', add_special_tokens=True))
' hello'
>>> tokenizer_vicuna.decode(tokenizer_vicuna.encode('hello', add_special_tokens=True))
' <s>hello'
Ah, but Koala does definitely use those same tokens. Here's EasyLM/models/llama/llama_model.py:
def __init__(
    self,
    vocab_file,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    sp_model_kwargs: Optional[Dict[str, Any]] = None,
    add_bos_token=False,
    add_eos_token=False,
    **kwargs,
):
    self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
    self.vocab_file = vocab_file
    self.add_bos_token = add_bos_token
    self.add_eos_token = add_eos_token
    self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    with tempfile.NamedTemporaryFile() as tfile:
        with open_file(self.vocab_file, 'rb') as fin:
            tfile.write(fin.read())
            tfile.flush()
            tfile.seek(0)
        self.sp_model.Load(tfile.name)
    """ Initialisation"""
    self.add_special_tokens(dict(
        unk_token=unk_token,
        bos_token=bos_token,
        eos_token=eos_token,
    ))
    self.pad_token_id = self.unk_token_id
So they are defining the same EOS, BOS and UNK as we saw in Vicuna (and all the other models I've looked at). But for some reason they don't write that to special_tokens_map.json.
Yet, the Tokenizer still used those symbols? I'm doing another test, one sec..
OK so I made a new local copy of my koala-7B-HF, and I filled out special_tokens_map.json and tokenizer_config.json:
$ cat /Users/tomj/src/huggingface/test-koala-7B-HF/special_tokens_map.json
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
$ cat /Users/tomj/src/huggingface/test-koala-7B-HF/tokenizer_config.json
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
And then I re-ran the code I mentioned a moment ago, referencing the local model:
>>> from transformers import AutoTokenizer
>>> tokenizer_local_koala=AutoTokenizer.from_pretrained('/Users/tomj/src/huggingface/test-koala-7B-HF', use_fast=False)
>>> tokenizer_local_koala.decode(tokenizer_local_koala.encode('hello', add_special_tokens=True))
' <s>hello'
>>>
>>> tokenizer_HF_koala=AutoTokenizer.from_pretrained('TheBloke/koala-7B-HF', use_fast=False)
>>> tokenizer_HF_koala.decode(tokenizer_HF_koala.encode('hello', add_special_tokens=True))
' hello'
>>>
And now I get the expected beginning-of-string token from the updated local copy.
I think this means that those two JSON files should be updated to match how they are in Vicuna and others. I'm just confused as to why the Koala team didn't do that to start with.
I suppose it could be that, because they add BOS, EOS and UNK in their own code, it didn't matter that they hadn't specified them in special_tokens_map.json and tokenizer_config.json, but that it is still a mistake that they didn't.
As you can tell I'm trying to figure out what all this means and how it's meant to work! But I'm pretty sure Koala should have the same values in those two files as the other models, so I'm going to update it on HF.
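The manual edit to special_tokens_map.json described above could also be scripted. The following is a minimal sketch, not the procedure actually used here: `fill_special_tokens` is a hypothetical helper, it assumes the same three token strings and field layout as the Vicuna-style file shown earlier, and it touches only special_tokens_map.json (tokenizer_config.json would need a similar update).

```python
import json
import os
import tempfile

# Token strings as shown in the Vicuna-style special_tokens_map.json above.
SPECIAL_TOKENS = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}


def fill_special_tokens(model_dir):
    """Add missing or empty BOS/EOS/UNK entries; return the keys that were added."""
    path = os.path.join(model_dir, "special_tokens_map.json")
    with open(path) as f:
        mapping = json.load(f)
    added = []
    for key, content in SPECIAL_TOKENS.items():
        if not mapping.get(key):
            mapping[key] = {
                "content": content,
                "lstrip": False,
                "normalized": True,
                "rstrip": False,
                "single_word": False,
            }
            added.append(key)
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2)
    return added


# Demo on a throwaway directory holding an empty map, like the one EasyLM writes.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "special_tokens_map.json"), "w") as f:
        json.dump({}, f)
    print(fill_special_tokens(tmp))  # first pass adds all three keys
    print(fill_special_tokens(tmp))  # second pass finds nothing to add
```

Entries that are already filled in are left untouched, so running it on a model directory that is already correct changes nothing of substance.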
OK I've updated special_tokens_map.json and tokenizer_config.json in TheBloke/Koala-7B-HF and TheBloke/Koala-13B-HF and the two GPTQ repos.
Thank you for the help. It's working normally now with the bos token.