When will there be a GGML version?
Is it possible to get a GGML version?
There is already one from TheBloke ( https://huggingface.co/TheBloke/Llama-2-7B-32K-Instruct-GGML ). Unfortunately, it only outputs gibberish for me.
What prompt are you using? People say this model uses a different prompt than the original Llama chat prompt. @pbkowalski
@CUIGuy I've tried both the specified variant, [INST]...[\INST], and others, but the output is just symbols regardless.
@pbkowalski For which quantization levels did you observe this?
@mauriceweber I've only tried 2_K, 4_0 and 4_1
The output I get from 4_1:
'[INST]\nWrite a poem about cats\n[\INST]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
I tried different prompts and also get only long sequences of "\n". Could it be that something breaks in the tokenization of the input?
Can someone with access to the unquantized model verify the token sequence for the following?
m.tokenize("[INST]\nWrite a poem about cats\n[/INST]\n\n".encode('utf8'))
[1, 29961, 25580, 29962, 13, 6113, 263, 26576, 1048, 274, 1446, 13, 29961, 29914, 25580, 29962, 13, 13]
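One way to run that comparison, as a minimal sketch: `first_mismatch` is a hypothetical helper (not part of llama.cpp or any library), `GGML_IDS` is the sequence posted above, and the second argument stands in for whatever ids the unquantized model's tokenizer returns for the same prompt.

```python
# Token ids reported above for the GGML model and the prompt
# "[INST]\nWrite a poem about cats\n[/INST]\n\n".
GGML_IDS = [1, 29961, 25580, 29962, 13, 6113, 263, 26576, 1048, 274,
            1446, 13, 29961, 29914, 25580, 29962, 13, 13]


def first_mismatch(a, b):
    """Return the index of the first position where the two id lists
    differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One list is a prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Illustration only: a copy with the trailing newline token dropped
# diverges at the last index of the original list.
print(first_mismatch(GGML_IDS, GGML_IDS[:-1]))
```

A result of `None` would suggest that tokenization is not the culprit; any index points at the first token worth inspecting.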
Based on my experience, the Q2...Q4 quantizations are too small for proper outputs - even when generating "useful" texts (rather than just newlines), these models hallucinate far too much. The Q8_0 quantization, however, works pretty well - and, when using llama.cpp, 16 GB of RAM allows for context lengths up to 16k, and 24 GB for lengths up to 32k (tested on a MacBook Air 15" with 24 GB of unified RAM).
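Those RAM figures are roughly what a back-of-the-envelope estimate gives. The sketch below assumes the standard Llama-2-7B shape (32 layers, hidden size 4096, about 6.74B parameters), an f16 KV cache, and Q8_0 at 8.5 bits per weight (blocks of 32 int8 values plus one f16 scale). It ignores compute buffers and whatever the OS needs, so treat the totals as a lower bound rather than a measurement.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_layers=32, n_embd=4096, bytes_per_elem=2):
    """Size of the K and V caches: two tensors per layer, each holding
    n_ctx * n_embd elements (f16 assumed, so 2 bytes per element)."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem


def q8_0_weight_bytes(n_params=6.74e9):
    """Approximate Q8_0 weight size: 34 bytes per block of 32 weights,
    i.e. 8.5 bits per weight. The parameter count is an assumption."""
    return n_params * 8.5 / 8


for n_ctx in (16 * 1024, 32 * 1024):
    kv = kv_cache_bytes(n_ctx)
    total = kv + q8_0_weight_bytes()
    print(f"n_ctx={n_ctx}: KV cache {kv / GIB:.1f} GiB, "
          f"weights + KV cache {total / GIB:.1f} GiB")
```

Under these assumptions the KV cache alone grows by 0.5 MiB per token, which is why the context length, not the model file, dominates memory use at 32k.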