The generated output appears to be gibberish
F:\llamacpp-k>main --mlock --instruct -i --interactive-first --top_k 60 --top_p 1.1 -c 2048 --color --temp 0.8 -n -1 --keep -1 --repeat_penalty 1.1 -t 6 -m Baichuan-13B-Instruction.ggmlv3.q5_1.bin -ngl 22
main: build = 913 (eb542d3)
main: seed = 1690457767
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5
llama.cpp: loading model from Baichuan-13B-Instruction.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 64000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 214
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13696
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5043.99 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 22 repeating layers to GPU
llama_model_load_internal: offloaded 22/43 layers to GPU
llama_model_load_internal: total VRAM used: 5442 MB
llama_new_context_with_model: kv self size = 1600.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 60, tfs_z = 1.000000, top_p = 1.100000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
你好
Console.WriteLine("# 用C#编写输出文本为:"Hello, world!"
> 常见的水果有哪几种?
> 下雨时人为什么要打伞
にはなれが生るりりリリりりりれれ```
当使用C# 当使用c cccc在将:
>
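As a quick sanity check on the load log above, the reported buffer sizes can be reproduced from the printed hyperparameters. This is only a minimal sketch: the scratch-buffer formula is copied from the log line itself, while the KV-cache formula assumes an f16 cache (2 bytes per element) and binary units (1 MB = 1024 x 1024 bytes), neither of which the log states explicitly. The function names are made up for illustration.

```python
# Values copied from the load log above.
N_BATCH = 512    # generate: n_batch
N_CTX = 2048     # n_ctx
N_LAYER = 40     # n_layer
N_EMBD = 5120    # n_embd

KIB = 1024
MIB = 1024 * 1024


def scratch_buffer_mib(n_batch: int, n_ctx: int) -> float:
    """batch_size x (640 kB + n_ctx x 160 B), as printed in the log."""
    return n_batch * (640 * KIB + n_ctx * 160) / MIB


def kv_cache_mib(n_layer: int, n_ctx: int, n_embd: int,
                 bytes_per_elem: int = 2) -> float:
    """One K and one V tensor per layer.

    bytes_per_elem=2 is an assumption (f16 cache), not taken from the log.
    """
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem / MIB


print(scratch_buffer_mib(N_BATCH, N_CTX))    # log reports 480 MB
print(kv_cache_mib(N_LAYER, N_CTX, N_EMBD))  # log reports 1600.00 MB
```

If both numbers match the log, the hyperparameters were at least read consistently from the file. That alone does not rule out quantization or prompt-format problems.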
Hi, did you try a different prompt template? Based on the AlpachinoNLP implementation, instruct input for this model should look like this example:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

# Load the original (unquantized) model; trust_remote_code is needed
# because the repository ships its own modeling and chat code.
tokenizer = AutoTokenizer.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction")

# Messages use a "Human" role rather than a raw instruction string.
messages = []
# Prompt: "Which is the second-highest mountain in the world?"
messages.append({"role": "Human", "content": "世界上第二高的山峰是哪座"})
response = model.chat(tokenizer, messages)
print(response)
```
I did not check the output quality of the original model, so it is still unclear whether the problem comes from the quantization or whether the model itself returns rubbish for instruct prompts. ^^
I built a simple Space here: https://huggingface.co/spaces/s3nh/Baichuan-13B-Instruction-GGML. After a few simple tests, I can confirm that it does not generate useful responses there either.
I tested with the Space, and it seems to produce gibberish output as well. The problem might be with the original model.
Thanks for the quantization