Fixed README file.

README.md (CHANGED)
---

## This version

This model was converted to an **8-bit GGUF format (`q8_0`)** from **[`Alibaba-NLP/gte-Qwen2-1.5B-instruct`](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)** using `llama-quantize` built from [`llama.cpp`](https://github.com/ggerganov/llama.cpp).

Custom conversion script settings:

```json
"gte-Qwen2-1.5B-instruct": {
    "model_name": "gte-Qwen2-1.5B-instruct",
    "hq_quant_type": "f32",
    "final_quant_type": "q8_0",
    [...]
    "numexpr_max_thread": 8
}
```
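The conversion script itself is not included in this repo, so the following is a minimal sketch of the two-step flow the settings above describe (a high-quality f32 export, then the final q8_0 pass), assuming a local `llama.cpp` checkout; the paths and output names are illustrative.

```python
# Sketch of the f32 -> q8_0 conversion flow; paths are assumptions.
import subprocess

MODEL_DIR = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"   # local HF snapshot
F32_GGUF = "gte-Qwen2-1.5B-instruct-f32.gguf"
Q8_GGUF = "gte-Qwen2-1.5B-instruct-q8_0.gguf"

# Step 1: export the checkpoint to f32 GGUF ("hq_quant_type": "f32").
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MODEL_DIR,
     "--outtype", "f32", "--outfile", F32_GGUF],
    check=True,
)

# Step 2: quantize to 8-bit ("final_quant_type": "q8_0").
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", F32_GGUF, Q8_GGUF, "q8_0"],
    check=True,
)
```

Exporting to f32 first costs disk space but avoids stacking two rounds of rounding error before the final q8_0 quantization.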
Please refer to the [original model card](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) for more details on the unquantized model, including its metrics, which may differ (typically being slightly worse) for this quantized version.

## gte-Qwen2-1.5B-instruct

**gte-Qwen2-1.5B-instruct** is the latest model in the gte (General Text Embedding) model family. It is built on the [Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) LLM and uses the same training data and training strategies as the [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) model.
The model incorporates several key advancements:

- Integration of bidirectional attention mechanisms, enriching its contextual understanding.
- Instruction tuning, applied solely on the query side for streamlined efficiency (see the usage sketch below).
- Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.
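Since instructions go only on the query side, documents are embedded verbatim while queries get an instruction prefix. Below is a minimal retrieval-style sketch using `llama-cpp-python` (an assumption; any GGUF-capable runtime works) with the `Instruct:`/`Query:` template from the original model card:

```python
# Hedged sketch: query-side instruction formatting for embeddings.
import numpy as np
from llama_cpp import Llama

llm = Llama(model_path="gte-Qwen2-1.5B-instruct-q8_0.gguf", embedding=True)

task = "Given a web search query, retrieve relevant passages that answer the query"
query = f"Instruct: {task}\nQuery: how much protein should a female eat"  # instructed
passage = ("As a general guideline, the CDC's average requirement of protein "
           "for women ages 19 to 70 is 46 grams per day.")                # verbatim

q = np.asarray(llm.embed(query), dtype=np.float32)
d = np.asarray(llm.embed(passage), dtype=np.float32)

# Cosine similarity; higher means the passage better answers the instructed query.
print(float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))))
```

Pooling and normalization defaults differ between runtimes, so absolute scores may not match the Sentence Transformers reference exactly.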
## Model Information

- Model Type: GTE (General Text Embeddings)
- Model Size: 1.5B
- Embedding Dimension: 1536
- Context Window: 131072

### Supported languages

- North America: English

[...]
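The GGUF header can be inspected without loading the weights; here is a small sketch using the `gguf` Python package that ships with `llama.cpp` (the field-decoding details are an assumption and may vary across package versions). The keys match those quoted in the loader log below.

```python
# Read a few header fields from the quantized file (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("gte-Qwen2-1.5B-instruct-q8_0.gguf")

# For scalar fields, the value sits in the last `parts` array.
for key in ("qwen2.block_count", "qwen2.embedding_length", "qwen2.context_length"):
    field = reader.get_field(key)
    print(key, "=", int(field.parts[-1][0]))  # expect 28, 1536, 131072
```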
```
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = gte-Qwen2-1.5B-instruct
llama_model_loader: - kv 3: general.finetune str = instruct
llama_model_loader: - kv 4: general.basename str = gte-Qwen2
llama_model_loader: - kv 5: general.size_label str = 1.5B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,5] = ["mteb", "sentence-transformers", "tr...
llama_model_loader: - kv 8: qwen2.block_count u32 = 28
llama_model_loader: - kv 9: qwen2.context_length u32 = 131072
llama_model_loader: - kv 10: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 11: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 12: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 13: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 14: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: general.file_type u32 = 7
[...]
llama_model_loader: - kv 26: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: split.no u16 = 0
llama_model_loader: - kv 29: split.count u16 = 2
llama_model_loader: - kv 30: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
[...]
llm_load_print_meta: n_vocab = 151646
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8960
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
[...]
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1.5B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 1.78 B
llm_load_print_meta: model size = 1.76 GiB (8.50 BPW)
llm_load_print_meta: general.name = gte-Qwen2-1.5B-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
[...]
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 1008.90 MiB
llm_load_tensors: CPU_Mapped model buffer size = 791.29 MiB
............................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
[...]
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB
llama_new_context_with_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.01 MiB
llama_new_context_with_model: CPU compute buffer size = 3340.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 1
```
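The derived quantities in the log are mutually consistent; a quick check of the GQA shapes, the f16 KV-cache size at the full 131072-token context, and the model size (plain arithmetic, no llama.cpp needed):

```python
# Re-derive the values printed by llm_load_print_meta above.
n_layer, n_head, n_head_kv, n_embd_head = 28, 12, 2, 128
n_ctx = 131072

print(n_head * n_embd_head)   # n_embd = 1536
print(n_head // n_head_kv)    # n_gqa = 6
n_embd_kv = n_head_kv * n_embd_head
print(n_embd_kv)              # n_embd_k_gqa = n_embd_v_gqa = 256

# f16 KV cache: 2 bytes per element; K and V each store n_ctx * n_layer * 256 values.
k_mib = n_ctx * n_layer * n_embd_kv * 2 / 2**20
print(k_mib, 2 * k_mib)       # 1792.0 MiB per side, 3584.0 MiB KV self size

# Model size from parameter count and bits per weight.
print(1.78e9 * 8.50 / 8 / 2**30)  # ~1.76 GiB
```

Note that at full context the 3584 MiB KV cache (plus the 3340 MiB compute buffer) dwarfs the 1.76 GiB of weights, which is worth keeping in mind when embedding very long inputs.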