metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:212917
  - loss:CosineSimilarityLoss
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
  - source_sentence: statistik neraca arus dana indonesia
    sentences:
      - Statistik Kelapa Sawit Indonesia 2012
      - Distribusi Perdagangan Komoditas Kedelai Indonesia 2023
      - Data Runtun Statistik Konstruksi 1990-2010
  - source_sentence: >-
      Seberapa besar kenaikan produksi IBS pada Triwulan IV Tahun 2013
      dibandingkan Triwulan IV Tahun Sebelumnya?
    sentences:
      - Pertumbuhan PDB 2013 Mencapai 5,78 Persen
      - >-
        Statistik Komuter Gerbangkertosusila Hasil Survei Komuter
        Gerbangkertosusila 2017
      - >-
        Statistik Penduduk Lanjut Usia Provinsi Jawa Timur 2010-Hasil Sensus
        Penduduk 2010
  - source_sentence: 'Penduduk Papua: migrasi 2015'
    sentences:
      - >-
        Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut
        Pendidikan Tertinggi dan jenis pekerjaan utama, 2019
      - Statistik Pemotongan Ternak 2010 dan 2011
      - >-
        Statistik Harga Produsen Pertanian Sub Sektor Tanaman Pangan,
        Hortikultura dan Perkebunan Rakyat 2010
  - source_sentence: statistik konstruksi 2022
    sentences:
      - Studi Modal Sosial 2006
      - BRS upah buruh agustus 2018
      - Statistik Perdagangan Luar Negeri Indonesia Ekspor 2006 vol 1
  - source_sentence: Statistik ekspor Indonesia Maret 2202
    sentences:
      - Produk Domestik Bruto Indonesia Triwulanan 2006-2010
      - >-
        Indeks Perilaku Anti Korupsi (IPAK) Indonesia 2023 sebesar 3,92, menurun
        dibandingkan IPAK 2022
      - >-
        Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Januari
        2023
datasets:
  - yahyaabd/allstats-semantic-search-synthetic-dataset-v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-mpnet-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic search v1 dev
          type: allstats-semantic-search-v1-dev
        metrics:
          - type: pearson_cosine
            value: 0.9894566758405579
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9072484378842124
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic search v1 test
          type: allstat-semantic-search-v1-test
        metrics:
          - type: pearson_cosine
            value: 0.9895284407960067
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9074137706349162
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-search-synthetic-dataset-v1 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

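Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Training Dataset: allstats-semantic-search-synthetic-dataset-v1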

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
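
The Pooling module above enables only pooling_mode_mean_tokens, so a sentence embedding is the average of its token embeddings, with padded positions left out. The following is a minimal pure-Python sketch of that idea, not the library's code: the function name `mean_pool` and the toy inputs are hypothetical, and the real model works on tensors of width 768 rather than lists of width 2.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three token slots of width 2; the last slot is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padded slot does not affect the result, which is why sentences of different lengths in one batch still produce comparable embeddings.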

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1")
# Run inference
sentences = [
    'Statistik ekspor Indonesia Maret 2202',
    'Produk Domestik Bruto Indonesia Triwulanan 2006-2010',
    'Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Januari 2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
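
For semantic search, a common pattern is to embed one query and several candidate titles, then rank the titles by cosine similarity to the query. The sketch below works on plain lists so it runs without downloading the model; in practice the vectors would come from model.encode. The helper names `cosine` and `rank_by_cosine` and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs, titles):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for 768-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
titles = ["doc A", "doc B"]
ranking = rank_by_cosine(query_vec, doc_vecs, titles)
print(ranking[0][0])  # doc B
```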

Evaluation

Metrics

Semantic Similarity

Metric allstats-semantic-search-v1-dev allstat-semantic-search-v1-test
pearson_cosine 0.9895 0.9895
spearman_cosine 0.9072 0.9074
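
The two metrics compare the model's cosine scores with the gold labels in different ways: pearson_cosine measures linear agreement between the two series, while spearman_cosine measures agreement of their rank order only. The sketch below illustrates that difference in pure Python. It is not the library's evaluator code, and the function names and toy scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Predicted cosine scores and gold labels with identical ordering
# but a non-linear relationship.
predicted = [0.10, 0.20, 0.30, 0.95]
gold = [0.0, 0.1, 0.9, 1.0]
print(round(spearman(predicted, gold), 4))  # 1.0
print(pearson(predicted, gold) < 1.0)  # True
```

In this toy case the ordering is perfect, so Spearman is 1.0 while Pearson is lower; the reported results show the opposite pattern, which suggests the scores track the labels closely in value while some pairs are still ordered differently.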

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at 06f849a
  • Size: 212,917 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |:--------|:------|:----|:------|
    | type    | string | string | float |
    | details | min: 5 tokens, mean: 11.48 tokens, max: 29 tokens | min: 4 tokens, mean: 14.89 tokens, max: 53 tokens | min: 0.0, mean: 0.52, max: 1.0 |

  • Samples:

    | query | doc | label |
    |:------|:----|:------|
    | ringkasan aktivitas badan pusat statistik tahun 2018 | Statistik Harga Produsen sektor pertanian di indonesia 2008 | 0.1 |
    | indikator kesejahteraan petani rejang lebong 2015 | Diagram Timbang Nilai Tukar Petani Kabupaten Rejang Lebong 2015 | 0.82 |
    | Berapa persen kenaikan kunjungan wisatawan mancanegara pada April 2024? | Indeks Perilaku Anti Korupsi (IPAK) Indonesia 2023 sebesar 3,92, menurun dibandingkan IPAK 2022 | 0.0 |

  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
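
With loss_fct set to MSELoss, the objective is the mean squared error between the cosine similarity of each (query, doc) embedding pair and its float label. The sketch below shows that computation in pure Python under those assumptions. The function name `cosine_similarity_mse` and the toy vectors are hypothetical, and this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_mse(query_vecs, doc_vecs, labels):
    """Mean over the batch of (cos(query, doc) - label) ** 2."""
    errors = [
        (cosine(q, d) - label) ** 2
        for q, d, label in zip(query_vecs, doc_vecs, labels)
    ]
    return sum(errors) / len(errors)


# Two toy pairs: the first matches its label exactly, the second misses by 0.5.
queries = [[1.0, 0.0], [1.0, 0.0]]
docs = [[1.0, 0.0], [0.0, 1.0]]
labels = [1.0, 0.5]
print(cosine_similarity_mse(queries, docs, labels))  # 0.125
```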
    

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at 06f849a
  • Size: 26,614 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |:--------|:------|:----|:------|
    | type    | string | string | float |
    | details | min: 5 tokens, mean: 11.21 tokens, max: 32 tokens | min: 5 tokens, mean: 14.41 tokens, max: 54 tokens | min: 0.0, mean: 0.5, max: 1.0 |

  • Samples:

    | query | doc | label |
    |:------|:----|:------|
    | Laporan bulanan ekonomi Indonesia bulan November 201 | Laporan Bulanan Data Sosial Ekonomi November 2021 | 0.92 |
    | pekerjaan layak di indonesia tahun 2022: data dan analisis | Statistik Penduduk Lanjut Usia Provinsi Papua Barat 2010-Hasil Sensus Penduduk 2010 | 0.09 |
    | Tabel pendapatan rata-rata pekerja lepas berdasarkan provinsi dan pendidikan tahun 2021 | Nilai Impor Kendaraan Bermotor Menurut Negara Asal Utama (Nilai CIF:juta US$), 2018-2023 | 0.1 |

  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • fp16: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-search-v1-dev_spearman_cosine allstat-semantic-search-v1-test_spearman_cosine
0.0376 250 0.0683 0.0432 0.6955 -
0.0751 500 0.0393 0.0322 0.7230 -
0.1127 750 0.0321 0.0270 0.7476 -
0.1503 1000 0.0255 0.0226 0.7789 -
0.1879 1250 0.024 0.0213 0.7683 -
0.2254 1500 0.022 0.0199 0.7727 -
0.2630 1750 0.0219 0.0195 0.7853 -
0.3006 2000 0.0202 0.0188 0.7795 -
0.3381 2250 0.0191 0.0187 0.7943 -
0.3757 2500 0.0198 0.0178 0.7842 -
0.4133 2750 0.0179 0.0184 0.7974 -
0.4509 3000 0.0179 0.0194 0.7810 -
0.4884 3250 0.0182 0.0168 0.8080 -
0.5260 3500 0.0174 0.0164 0.8131 -
0.5636 3750 0.0174 0.0154 0.8113 -
0.6011 4000 0.0169 0.0157 0.7981 -
0.6387 4250 0.0152 0.0146 0.8099 -
0.6763 4500 0.0148 0.0147 0.8091 -
0.7139 4750 0.0145 0.0145 0.8178 -
0.7514 5000 0.014 0.0139 0.8184 -
0.7890 5250 0.0145 0.0130 0.8166 -
0.8266 5500 0.0134 0.0129 0.8306 -
0.8641 5750 0.013 0.0122 0.8251 -
0.9017 6000 0.0136 0.0130 0.8265 -
0.9393 6250 0.0123 0.0126 0.8224 -
0.9769 6500 0.0113 0.0120 0.8305 -
1.0144 6750 0.0129 0.0117 0.8204 -
1.0520 7000 0.0106 0.0116 0.8284 -
1.0896 7250 0.01 0.0116 0.8303 -
1.1271 7500 0.0096 0.0110 0.8303 -
1.1647 7750 0.01 0.0113 0.8305 -
1.2023 8000 0.0116 0.0108 0.8300 -
1.2399 8250 0.0095 0.0104 0.8432 -
1.2774 8500 0.009 0.0104 0.8370 -
1.3150 8750 0.0101 0.0102 0.8434 -
1.3526 9000 0.01 0.0097 0.8450 -
1.3901 9250 0.0097 0.0103 0.8286 -
1.4277 9500 0.0092 0.0096 0.8393 -
1.4653 9750 0.0093 0.0089 0.8480 -
1.5029 10000 0.0088 0.0090 0.8439 -
1.5404 10250 0.0087 0.0089 0.8569 -
1.5780 10500 0.0082 0.0088 0.8488 -
1.6156 10750 0.009 0.0089 0.8493 -
1.6531 11000 0.0086 0.0086 0.8499 -
1.6907 11250 0.0076 0.0083 0.8600 -
1.7283 11500 0.0076 0.0081 0.8621 -
1.7659 11750 0.0079 0.0081 0.8611 -
1.8034 12000 0.0082 0.0085 0.8540 -
1.8410 12250 0.0074 0.0081 0.8620 -
1.8786 12500 0.007 0.0080 0.8639 -
1.9161 12750 0.0071 0.0083 0.8450 -
1.9537 13000 0.007 0.0076 0.8585 -
1.9913 13250 0.0072 0.0074 0.8640 -
2.0289 13500 0.0055 0.0069 0.8699 -
2.0664 13750 0.0056 0.0068 0.8673 -
2.1040 14000 0.0052 0.0066 0.8723 -
2.1416 14250 0.0059 0.0069 0.8644 -
2.1791 14500 0.0055 0.0068 0.8670 -
2.2167 14750 0.005 0.0065 0.8723 -
2.2543 15000 0.0053 0.0066 0.8766 -
2.2919 15250 0.0057 0.0065 0.8782 -
2.3294 15500 0.0053 0.0064 0.8749 -
2.3670 15750 0.0056 0.0070 0.8708 -
2.4046 16000 0.0058 0.0065 0.8731 -
2.4421 16250 0.0047 0.0064 0.8793 -
2.4797 16500 0.0049 0.0063 0.8801 -
2.5173 16750 0.0051 0.0063 0.8782 -
2.5549 17000 0.0053 0.0060 0.8799 -
2.5924 17250 0.0051 0.0059 0.8825 -
2.6300 17500 0.0048 0.0060 0.8761 -
2.6676 17750 0.0055 0.0055 0.8773 -
2.7051 18000 0.0045 0.0053 0.8833 -
2.7427 18250 0.0041 0.0053 0.8868 -
2.7803 18500 0.0051 0.0054 0.8811 -
2.8179 18750 0.004 0.0052 0.8881 -
2.8554 19000 0.0043 0.0053 0.8764 -
2.8930 19250 0.0047 0.0051 0.8874 -
2.9306 19500 0.0038 0.0051 0.8922 -
2.9681 19750 0.0047 0.0050 0.8821 -
3.0057 20000 0.0037 0.0048 0.8911 -
3.0433 20250 0.0031 0.0048 0.8911 -
3.0809 20500 0.0032 0.0046 0.8934 -
3.1184 20750 0.0034 0.0046 0.8942 -
3.1560 21000 0.0028 0.0045 0.8976 -
3.1936 21250 0.0034 0.0045 0.8932 -
3.2311 21500 0.003 0.0044 0.8959 -
3.2687 21750 0.0033 0.0044 0.8961 -
3.3063 22000 0.0029 0.0043 0.8995 -
3.3439 22250 0.0029 0.0044 0.8978 -
3.3814 22500 0.0027 0.0043 0.8998 -
3.4190 22750 0.003 0.0043 0.9019 -
3.4566 23000 0.0027 0.0042 0.8982 -
3.4941 23250 0.0027 0.0042 0.9014 -
3.5317 23500 0.0034 0.0042 0.9025 -
3.5693 23750 0.003 0.0041 0.9027 -
3.6069 24000 0.0029 0.0041 0.9003 -
3.6444 24250 0.0027 0.0040 0.9023 -
3.6820 24500 0.0027 0.0040 0.9035 -
3.7196 24750 0.0033 0.0040 0.9042 -
3.7571 25000 0.0028 0.0039 0.9053 -
3.7947 25250 0.0027 0.0039 0.9049 -
3.8323 25500 0.0033 0.0039 0.9057 -
3.8699 25750 0.0025 0.0039 0.9075 -
3.9074 26000 0.003 0.0039 0.9068 -
3.9450 26250 0.0026 0.0039 0.9073 -
3.9826 26500 0.0023 0.0038 0.9072 -
4.0 26616 - - - 0.9074

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.2.2+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}