# CosmosLLaMa GGUFs

## Objective
Due to the need for quantized models in real-time applications, we introduce GGUF-formatted versions of our models. GGUF is the model format of the GGML/`llama.cpp` project, which aims to democratize the use of large language models. Depending on the quantization type, there are 20+ model files in this repository.
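
Because there are many variants, it can help to list the repository's GGUF files before picking one. The sketch below is only illustrative and assumes the `huggingface_hub` package is installed; it is not needed for anything else in this README.

```py
# Illustrative sketch: list the available GGUF variants in this repository.
# Assumes `huggingface_hub` is installed (`pip install huggingface_hub`).
from huggingface_hub import HfApi

repo_id = "ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1-GGUF"
gguf_files = [f for f in HfApi().list_repo_files(repo_id) if f.endswith(".gguf")]

# Prints one filename per quantization type, e.g. Q4_K_M, Q5_K_M, Q8_0, ...
for name in sorted(gguf_files):
    print(name)
```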

### Features
* All quantization details are listed by Hugging Face in the sidebar on the right.
* All the models have been tested with the `llama.cpp` tools `llama-cli` and `llama-server` (a server usage sketch follows this list).
* Furthermore, a YouTube video has been made to introduce the basics of using `lmstudio` with these models.
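
Since the models have been verified with `llama-server`, they can also be queried through the server's OpenAI-compatible HTTP API. The sketch below is an illustration under stated assumptions, not part of the original instructions: it assumes you have started `llama-server` yourself with one of the GGUF files (the filename placeholder is not a real file name), that it listens on the default `127.0.0.1:8080`, and that the `requests` package is installed.

```py
# Illustrative sketch: query a locally running llama-server via its
# OpenAI-compatible endpoint. Start the server separately, for example:
#   llama-server -m <one-of-the-GGUF-files>.gguf
# Assumes the default host/port 127.0.0.1:8080 and that `requests` is installed.
import requests

payload = {
    "messages": [
        # System prompt (Turkish): "You are an AI assistant. The user will give
        # you a task. Your goal is to complete the task as faithfully as possible."
        {"role": "system", "content": "Sen bir yapay zeka asistanısın. Kullanıcı sana bir görev verecek. Amacın görevi olabildiğince sadık bir şekilde tamamlamak."},
        # User question (Turkish): "What is the capital of Türkiye?"
        {"role": "user", "content": "Türkiye'nin başkenti neresidir?"},
    ],
    "temperature": 0.8,
    "max_tokens": 256,
}

resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```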

### Code Example
Usage example with `llama-cpp-python`:

```py
from llama_cpp import Llama

# Define the inference parameters (sampling settings plus the pieces of the
# Llama-3 chat template used to build the prompt below)
inference_params = {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.95,
    "temp": 0.8,
    "repeat_penalty": 1.1,
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": [],
    # System prompt (Turkish): "You are an AI assistant. The user will give you
    # a task. Your goal is to complete the task as faithfully as possible."
    "pre_prompt": "Sen bir yapay zeka asistanısın. Kullanıcı sana bir görev verecek. Amacın görevi olabildiğince sadık bir şekilde tamamlamak.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": True,
    "multiline_input": False,
    "penalize_nl": True
}

# Initialize the Llama model with the specified inference parameters
llama = Llama.from_pretrained(
    repo_id="ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1-GGUF",
    filename="*Q4_K.gguf",
    n_threads=inference_params["n_threads"],
    verbose=False
)

# Example input (Turkish): "What is the capital of Türkiye?"
user_input = "Türkiye'nin başkenti neresidir?"

# Construct the Llama-3 prompt: system block, end-of-turn token, user block,
# then the assistant header that the model will complete
prompt = (
    f"{inference_params['pre_prompt_prefix']}{inference_params['pre_prompt']}"
    f"{inference_params['pre_prompt_suffix']}"
    f"{inference_params['input_prefix']}{user_input}{inference_params['input_suffix']}"
)

# Generate the response, passing the sampling settings explicitly
response = llama(
    prompt,
    max_tokens=inference_params["n_predict"],
    temperature=inference_params["temp"],
    top_k=inference_params["top_k"],
    top_p=inference_params["top_p"],
    min_p=inference_params["min_p"],
    repeat_penalty=inference_params["repeat_penalty"],
)

# Output the response
print(response['choices'][0]['text'])
```
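
If you prefer not to assemble the prompt by hand, `llama-cpp-python` can apply a chat template for you through `create_chat_completion`. The sketch below is a minimal alternative, reusing the `llama`, `inference_params`, and `user_input` objects from the example above; it assumes the GGUF metadata carries a Llama-3 chat template, which is typical for `llama.cpp` conversions of Llama-3 models.

```py
# Minimal sketch reusing the objects defined above. If the GGUF metadata does
# not include a chat template, pass chat_format="llama-3" when constructing
# the model instead.
chat_response = llama.create_chat_completion(
    messages=[
        {"role": "system", "content": inference_params["pre_prompt"]},
        {"role": "user", "content": user_input},
    ],
    temperature=inference_params["temp"],
    top_k=inference_params["top_k"],
    top_p=inference_params["top_p"],
)
print(chat_response["choices"][0]["message"]["content"])
```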

The quantization was done with `llama.cpp`; in our experience this method gives the most stable results.

As expected, inference quality is better for the higher-bit models, while inference time tends to be similar across the low-bit models.

Each model's memory footprint can be estimated from the quantization docs of either [Hugging Face](https://huggingface.co/docs/transformers/main/en/quantization/overview) or [llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize); a rough back-of-the-envelope sketch follows.
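
As a rough rule of thumb (our own approximation, not an official figure from either doc), the weight memory of a GGUF model is roughly the parameter count times the effective bits per weight of the chosen quantization, plus overhead for the KV cache and context. The bits-per-weight values in the sketch below are approximate.

```py
# Back-of-the-envelope memory estimate (illustrative only; real file sizes
# vary because different tensors may use different quantization types).
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

n_params = 8e9  # the base model has ~8B parameters

# Approximate effective bits/weight for a few llama.cpp quantization types;
# check the llama.cpp quantize docs for exact numbers.
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{estimate_weight_memory_gib(n_params, bpw):.1f} GiB")
```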

## Contact
*Feel free to contact us whenever you run into any problems :)*

COSMOS AI Research Group, Yildiz Technical University Computer Engineering Department
https://cosmos.yildiz.edu.tr/