SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-search-synthetic-dataset-v1 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Model Size: ~278M parameters (F32, Safetensors)
  • Training Dataset: allstats-semantic-search-synthetic-dataset-v1

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: sentence-transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
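
The Pooling module above mean-pools the Transformer's token embeddings over non-padding tokens to produce one 768-dimensional vector per input. As a minimal sketch of that computation (for illustration only; see Usage below for the normal API), assuming the transformers and torch packages:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "yahyaabd/allstats-semantic-search-model-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

batch = tokenizer(["Statistik ekspor Indonesia"], padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: zero out padding positions, sum, divide by the real token count.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])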

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1")
# Run inference
sentences = [
    'Statistik ekspor Indonesia Maret 2202',
    'Produk Domestik Bruto Indonesia Triwulanan 2006-2010',
    'Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Januari 2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
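
For semantic search, encode the corpus once and rank documents by cosine similarity to the query embedding. A minimal sketch using the library's util.semantic_search helper; the two-document corpus below is a hypothetical example:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1")

# Hypothetical mini-corpus; in practice, encode the full collection once and cache it.
corpus = [
    "Produk Domestik Bruto Indonesia Triwulanan 2006-2010",
    "Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Januari 2023",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("statistik ekspor indonesia 2023", convert_to_tensor=True)

# Rank the corpus by cosine similarity and keep the best hit.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
print(corpus[hits[0]["corpus_id"]], hits[0]["score"])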

Evaluation

Metrics

Semantic Similarity

| Metric          | allstats-semantic-search-v1-dev | allstat-semantic-search-v1-test |
|-----------------|---------------------------------|---------------------------------|
| pearson_cosine  | 0.9895                          | 0.9895                          |
| spearman_cosine | 0.9072                          | 0.9074                          |
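
The pearson_cosine and spearman_cosine names match the output of the library's EmbeddingSimilarityEvaluator, which correlates cosine similarities of embedded pairs with the gold labels. A sketch of how the dev score could be reproduced; the dataset repo id and split name are assumptions based on the dataset description below:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1")
# Assumed repo id and split for the labeled (query, doc) pairs.
dev = load_dataset("yahyaabd/allstats-semantic-search-synthetic-dataset-v1", split="validation")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=dev["query"],
    sentences2=dev["doc"],
    scores=dev["label"],
    main_similarity=SimilarityFunction.COSINE,
    name="allstats-semantic-search-v1-dev",
)
print(evaluator(model))  # dict including pearson_cosine and spearman_cosine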

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at 06f849a
  • Size: 212,917 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query                                             | doc                                               | label                          |
    |---------|---------------------------------------------------|---------------------------------------------------|--------------------------------|
    | type    | string                                            | string                                            | float                          |
    | details | min: 5 tokens, mean: 11.48 tokens, max: 29 tokens | min: 4 tokens, mean: 14.89 tokens, max: 53 tokens | min: 0.0, mean: 0.52, max: 1.0 |
  • Samples:

    | query | doc | label |
    |-------|-----|-------|
    | ringkasan aktivitas badan pusat statistik tahun 2018 | Statistik Harga Produsen sektor pertanian di indonesia 2008 | 0.1 |
    | indikator kesejahteraan petani rejang lebong 2015 | Diagram Timbang Nilai Tukar Petani Kabupaten Rejang Lebong 2015 | 0.82 |
    | Berapa persen kenaikan kunjungan wisatawan mancanegara pada April 2024? | Indeks Perilaku Anti Korupsi (IPAK) Indonesia 2023 sebesar 3,92, menurun dibandingkan IPAK 2022 | 0.0 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    
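
In other words, each (query, doc) pair is embedded, the cosine similarity of the two embeddings is taken as the predicted score, and that score is regressed onto the gold label with the MSELoss listed above. A standalone sketch of the objective:

import torch
import torch.nn.functional as F

def cosine_similarity_mse(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                          labels: torch.Tensor) -> torch.Tensor:
    # Predicted score = cosine similarity of the pair; loss = MSE against the label.
    predicted = F.cosine_similarity(query_emb, doc_emb, dim=-1)
    return F.mse_loss(predicted, labels)

# Toy check with random 768-dimensional embeddings and labels in [0, 1].
q, d = torch.randn(4, 768), torch.randn(4, 768)
labels = torch.tensor([0.10, 0.82, 0.00, 0.92])
print(cosine_similarity_mse(q, d, labels))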

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at 06f849a
  • Size: 26,614 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query                                             | doc                                               | label                         |
    |---------|---------------------------------------------------|---------------------------------------------------|-------------------------------|
    | type    | string                                            | string                                            | float                         |
    | details | min: 5 tokens, mean: 11.21 tokens, max: 32 tokens | min: 5 tokens, mean: 14.41 tokens, max: 54 tokens | min: 0.0, mean: 0.5, max: 1.0 |
  • Samples:

    | query | doc | label |
    |-------|-----|-------|
    | Laporan bulanan ekonomi Indonesia bulan November 201 | Laporan Bulanan Data Sosial Ekonomi November 2021 | 0.92 |
    | pekerjaan layak di indonesia tahun 2022: data dan analisis | Statistik Penduduk Lanjut Usia Provinsi Papua Barat 2010-Hasil Sensus Penduduk 2010 | 0.09 |
    | Tabel pendapatan rata-rata pekerja lepas berdasarkan provinsi dan pendidikan tahun 2021 | Nilai Impor Kendaraan Bermotor Menurut Negara Asal Utama (Nilai CIF:juta US$), 2018-2023 | 0.1 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • fp16: True
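
Put together, a training run matching these non-default hyperparameters might look like the sketch below. The dataset repo id and split names are assumptions based on the dataset description above; everything else follows the values listed in this section.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# Assumed repo id and splits for the (query, doc, label) pairs.
dataset = load_dataset("yahyaabd/allstats-semantic-search-synthetic-dataset-v1")

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-search-model-v1",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    loss=losses.CosineSimilarityLoss(model),
)
trainer.train()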

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-search-v1-dev_spearman_cosine allstat-semantic-search-v1-test_spearman_cosine
0.0376 250 0.0683 0.0432 0.6955 -
0.0751 500 0.0393 0.0322 0.7230 -
0.1127 750 0.0321 0.0270 0.7476 -
0.1503 1000 0.0255 0.0226 0.7789 -
0.1879 1250 0.024 0.0213 0.7683 -
0.2254 1500 0.022 0.0199 0.7727 -
0.2630 1750 0.0219 0.0195 0.7853 -
0.3006 2000 0.0202 0.0188 0.7795 -
0.3381 2250 0.0191 0.0187 0.7943 -
0.3757 2500 0.0198 0.0178 0.7842 -
0.4133 2750 0.0179 0.0184 0.7974 -
0.4509 3000 0.0179 0.0194 0.7810 -
0.4884 3250 0.0182 0.0168 0.8080 -
0.5260 3500 0.0174 0.0164 0.8131 -
0.5636 3750 0.0174 0.0154 0.8113 -
0.6011 4000 0.0169 0.0157 0.7981 -
0.6387 4250 0.0152 0.0146 0.8099 -
0.6763 4500 0.0148 0.0147 0.8091 -
0.7139 4750 0.0145 0.0145 0.8178 -
0.7514 5000 0.014 0.0139 0.8184 -
0.7890 5250 0.0145 0.0130 0.8166 -
0.8266 5500 0.0134 0.0129 0.8306 -
0.8641 5750 0.013 0.0122 0.8251 -
0.9017 6000 0.0136 0.0130 0.8265 -
0.9393 6250 0.0123 0.0126 0.8224 -
0.9769 6500 0.0113 0.0120 0.8305 -
1.0144 6750 0.0129 0.0117 0.8204 -
1.0520 7000 0.0106 0.0116 0.8284 -
1.0896 7250 0.01 0.0116 0.8303 -
1.1271 7500 0.0096 0.0110 0.8303 -
1.1647 7750 0.01 0.0113 0.8305 -
1.2023 8000 0.0116 0.0108 0.8300 -
1.2399 8250 0.0095 0.0104 0.8432 -
1.2774 8500 0.009 0.0104 0.8370 -
1.3150 8750 0.0101 0.0102 0.8434 -
1.3526 9000 0.01 0.0097 0.8450 -
1.3901 9250 0.0097 0.0103 0.8286 -
1.4277 9500 0.0092 0.0096 0.8393 -
1.4653 9750 0.0093 0.0089 0.8480 -
1.5029 10000 0.0088 0.0090 0.8439 -
1.5404 10250 0.0087 0.0089 0.8569 -
1.5780 10500 0.0082 0.0088 0.8488 -
1.6156 10750 0.009 0.0089 0.8493 -
1.6531 11000 0.0086 0.0086 0.8499 -
1.6907 11250 0.0076 0.0083 0.8600 -
1.7283 11500 0.0076 0.0081 0.8621 -
1.7659 11750 0.0079 0.0081 0.8611 -
1.8034 12000 0.0082 0.0085 0.8540 -
1.8410 12250 0.0074 0.0081 0.8620 -
1.8786 12500 0.007 0.0080 0.8639 -
1.9161 12750 0.0071 0.0083 0.8450 -
1.9537 13000 0.007 0.0076 0.8585 -
1.9913 13250 0.0072 0.0074 0.8640 -
2.0289 13500 0.0055 0.0069 0.8699 -
2.0664 13750 0.0056 0.0068 0.8673 -
2.1040 14000 0.0052 0.0066 0.8723 -
2.1416 14250 0.0059 0.0069 0.8644 -
2.1791 14500 0.0055 0.0068 0.8670 -
2.2167 14750 0.005 0.0065 0.8723 -
2.2543 15000 0.0053 0.0066 0.8766 -
2.2919 15250 0.0057 0.0065 0.8782 -
2.3294 15500 0.0053 0.0064 0.8749 -
2.3670 15750 0.0056 0.0070 0.8708 -
2.4046 16000 0.0058 0.0065 0.8731 -
2.4421 16250 0.0047 0.0064 0.8793 -
2.4797 16500 0.0049 0.0063 0.8801 -
2.5173 16750 0.0051 0.0063 0.8782 -
2.5549 17000 0.0053 0.0060 0.8799 -
2.5924 17250 0.0051 0.0059 0.8825 -
2.6300 17500 0.0048 0.0060 0.8761 -
2.6676 17750 0.0055 0.0055 0.8773 -
2.7051 18000 0.0045 0.0053 0.8833 -
2.7427 18250 0.0041 0.0053 0.8868 -
2.7803 18500 0.0051 0.0054 0.8811 -
2.8179 18750 0.004 0.0052 0.8881 -
2.8554 19000 0.0043 0.0053 0.8764 -
2.8930 19250 0.0047 0.0051 0.8874 -
2.9306 19500 0.0038 0.0051 0.8922 -
2.9681 19750 0.0047 0.0050 0.8821 -
3.0057 20000 0.0037 0.0048 0.8911 -
3.0433 20250 0.0031 0.0048 0.8911 -
3.0809 20500 0.0032 0.0046 0.8934 -
3.1184 20750 0.0034 0.0046 0.8942 -
3.1560 21000 0.0028 0.0045 0.8976 -
3.1936 21250 0.0034 0.0045 0.8932 -
3.2311 21500 0.003 0.0044 0.8959 -
3.2687 21750 0.0033 0.0044 0.8961 -
3.3063 22000 0.0029 0.0043 0.8995 -
3.3439 22250 0.0029 0.0044 0.8978 -
3.3814 22500 0.0027 0.0043 0.8998 -
3.4190 22750 0.003 0.0043 0.9019 -
3.4566 23000 0.0027 0.0042 0.8982 -
3.4941 23250 0.0027 0.0042 0.9014 -
3.5317 23500 0.0034 0.0042 0.9025 -
3.5693 23750 0.003 0.0041 0.9027 -
3.6069 24000 0.0029 0.0041 0.9003 -
3.6444 24250 0.0027 0.0040 0.9023 -
3.6820 24500 0.0027 0.0040 0.9035 -
3.7196 24750 0.0033 0.0040 0.9042 -
3.7571 25000 0.0028 0.0039 0.9053 -
3.7947 25250 0.0027 0.0039 0.9049 -
3.8323 25500 0.0033 0.0039 0.9057 -
3.8699 25750 0.0025 0.0039 0.9075 -
3.9074 26000 0.003 0.0039 0.9068 -
3.9450 26250 0.0026 0.0039 0.9073 -
3.9826 26500 0.0023 0.0038 0.9072 -
4.0 26616 - - - 0.9074

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.2.2+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
