This version

This model was converted from the original 32-bit safetensors weights of Alibaba-NLP/gte-Qwen2-1.5B-instruct to a 32-bit GGUF file (f32) using llama-quantize built from llama.cpp. Because source and target are both 32-bit, the conversion is lossless in this case.

Custom conversion script settings:

  "gte-Qwen2-1.5B-instruct": {
    "model_name": "gte-Qwen2-1.5B-instruct",
    "hq_quant_type": "f32",
    "final_quant_type": "",
    "produce_final_quant": false,
    "parts_num": 2,
    "max_shard_size_gb": 4,
    "numexpr_max_thread": 8
  }

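Please refer to the original model card for more details on the unquantized model, including its metrics. Quantized versions typically score slightly worse than the original; since this f32 conversion keeps the full 32-bit precision, its metrics should stay close to the original ones.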

gte-Qwen2-1.5B-instruct

gte-Qwen2-1.5B-instruct is the latest model in the gte (General Text Embedding) model family. The model is built on the Qwen2-1.5B LLM and uses the same training data and strategies as the gte-Qwen2-7B-instruct model.

The model incorporates several key advancements:

  • Integration of bidirectional attention mechanisms, enriching its contextual understanding.
  • Instruction tuning, applied solely on the query side for streamlined efficiency.
  • Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.
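
Because instruction tuning applies only to the query side, queries are typically wrapped in a task instruction while documents are embedded as-is. The sketch below follows the "Instruct: ... Query: ..." template as it appears in the upstream gte-Qwen2 model card; treat the exact template as an assumption and verify it against that card. The task description and the sample texts are illustrative only.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Prefix a query with its task instruction (assumed upstream template)."""
    return f"Instruct: {task_description}\nQuery: {query}"


# Illustrative task description and texts, not taken from this model card.
task = "Given a web search query, retrieve relevant passages that answer the query"

# Queries get the instruction prefix.
queries = [get_detailed_instruct(task, "what is a text embedding")]

# Documents are embedded without any prefix.
documents = ["A text embedding maps a piece of text to a fixed-size vector."]

print(queries[0])
```

The resulting query strings and the raw document strings can then be sent to whichever embedding backend is in use.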

Model Information

  • Model Type: GTE (General Text Embeddings)
  • Model Size: 1.5B
  • Embedding Dimension: 1536
  • Context Window: 131072

Supported languages

  • North America: English
  • Western Europe: German, French, Spanish, Portuguese, Italian, Dutch
  • Eastern & Central Europe: Russian, Czech, Polish
  • Middle East: Arabic, Persian, Hebrew, Turkish
  • Eastern Asia: Chinese, Japanese, Korean
  • South-Eastern Asia: Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog
  • Southern Asia: Hindi, Bengali, Urdu

Details

llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gte-Qwen2-1.5B-instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = gte-Qwen2
llama_model_loader: - kv   5:                         general.size_label str              = 1.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,5]       = ["mteb", "sentence-transformers", "tr...
llama_model_loader: - kv   8:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   9:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  10:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  11:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  12:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  13:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  14:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                          general.file_type u32              = 0
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151646]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151646]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                                   split.no u16              = 0
llama_model_loader: - kv  29:                                split.count u16              = 2
llama_model_loader: - kv  30:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  339 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.9308 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151646
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1.5B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 1.78 B
llm_load_print_meta: model size       = 6.62 GiB (32.00 BPW) 
llm_load_print_meta: general.name     = gte-Qwen2-1.5B-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:   CPU_Mapped model buffer size =  3797.36 MiB
llm_load_tensors:   CPU_Mapped model buffer size =  2978.30 MiB
............................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =  3584.00 MiB
llama_new_context_with_model: KV self size  = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.01 MiB
llama_new_context_with_model:        CPU compute buffer size =  3340.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
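
The memory figures in this log can be reproduced from the model dimensions. The sketch below is a back-of-the-envelope calculation, assuming the usual layout in which the KV cache stores one K and one V tensor per layer, each holding n_ctx rows of n_head_kv * head_dim values in f16 (2 bytes). The function names are illustrative and not part of llama.cpp.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_cache_bytes(n_layer, n_ctx, n_embd, n_head, n_head_kv, bytes_per_value=2):
    """Estimate the KV cache size in bytes (K plus V) for a GQA model."""
    head_dim = n_embd // n_head          # 1536 // 12 = 128 (n_embd_head_k)
    n_embd_kv = head_dim * n_head_kv     # 128 * 2 = 256 (n_embd_k_gqa)
    per_tensor = n_layer * n_ctx * n_embd_kv * bytes_per_value
    return 2 * per_tensor                # one K tensor set and one V tensor set


def weights_gib(n_params, bits_per_weight=32):
    """Estimate the weight size in GiB for a given parameter count."""
    return n_params * bits_per_weight / 8 / GIB


# Values taken from the log above.
full_ctx = kv_cache_bytes(n_layer=28, n_ctx=131072, n_embd=1536, n_head=12, n_head_kv=2)
print(full_ctx / MIB)  # 3584.0, matching "KV self size = 3584.00 MiB"

# A smaller context shrinks the cache proportionally.
small_ctx = kv_cache_bytes(n_layer=28, n_ctx=8192, n_embd=1536, n_head=12, n_head_kv=2)
print(small_ctx / MIB)  # 224.0

# About 6.6 GiB; the log's 6.62 GiB uses the unrounded parameter count.
print(round(weights_gib(1.78e9), 2))
```

Since the cache grows linearly with the context length, passing a smaller --ctx-size than 131072 is the simplest way to cut memory use when inputs are short.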

Usage

Sentence Transformers

Transformers

Inference

Using llama.cpp to get embeddings on CPU and/or GPU

First, build or install the llama-server binary from llama.cpp, preferably with GPU support.

CLI

Server

# using remote HF repo address (with model file(s) to be downloaded and cached locally)
$ llama-server --hf-repo mirekphd/gte-Qwen2-1.5B-instruct-F32 --hf-file gte-Qwen2-1.5B-instruct-F32-00001-of-00002.gguf --n-gpu-layers 0 --ctx-size 131072 --embeddings

# using previously downloaded local model file(s)
$ llama-server --model <path-to-hf-models>/mirekphd/gte-Qwen2-1.5B-instruct-F32/gte-Qwen2-1.5B-instruct-F32-00001-of-00002.gguf --n-gpu-layers 0 --ctx-size 131072 --embeddings
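
Once the server is running, embeddings can be requested over HTTP. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on llama-server's default address (http://127.0.0.1:8080) and exposes the OpenAI-style /v1/embeddings endpoint; both are assumptions to check against your llama.cpp version. The log above reports pooling type = 0, so the server may also need an explicit pooling setting (llama-server has a --pooling option) before this endpoint returns one vector per input. The helper names are illustrative.

```python
import json
import math
import urllib.request


def build_payload(texts):
    """Build the JSON request body for an OpenAI-style embeddings call."""
    return json.dumps({"input": list(texts)}).encode("utf-8")


def parse_embeddings(response):
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, base_url="http://127.0.0.1:8080"):
    """Request embeddings from a running server (assumed endpoint path)."""
    request = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_embeddings(json.load(reply))


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Example usage (requires a running server):
# query_vec, doc_vec = embed(["a query", "a candidate document"])
# print(cosine_similarity(query_vec, doc_vec))
```

Sorting by index keeps the returned vectors aligned with the input order even if the server returns them out of order.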

Evaluation

MTEB & C-MTEB

Cloud API Services

Citation

If you find our paper or models helpful, please consider citing:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}