metadata
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
datasets:
  - yahyaabd/allstats-semantic-dataset-v4
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:88250
  - loss:CosineSimilarityLoss
widget:
  - source_sentence: Laporan ekspor Indonesia Juli 2020
    sentences:
      - Statistik Produksi Kehutanan 2021
      - Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Juli 2020
      - Statistik Politik 2017
  - source_sentence: Bulan apa yang dicatat data kunjungan wisatawan mancanegara?
    sentences:
      - Indeks Tendensi Bisnis dan Indeks Tendensi Konsumen 2005
      - Data NTP bulan Maret 2022.
      - >-
        Kunjungan wisatawan mancanegara pada Oktober 2023 mencapai 978,50 ribu
        kunjungan, naik 33,27 persen (year-on-year)
  - source_sentence: >-
      Seberapa besar kenaikan upah nominal harian buruh tani nasional Januari
      2016?
    sentences:
      - Keadaan Angkatan Kerja di Indonesia Mei 2013
      - Profil Pasar Gorontalo 2020
      - Tingkat pengangguran terbuka (TPT) Agustus 2024 sebesar 5,3%.
  - source_sentence: Ringkasan data statistik Indonesia 1997
    sentences:
      - Statistik Upah 2007
      - Harga konsumen bbrp jenis barang kelompok perumahan 2005
      - Statistik Indonesia 1997
  - source_sentence: Pernikahan usia anak di Indonesia periode 2013-2015
    sentences:
      - Jumlah penduduk Indonesia 2013-2015
      - Indikator Ekonomi Desember 2006
      - Indeks Tendensi Bisnis dan Indeks Tendensi Konsumen 2013
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-mpnet-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic mpnet eval
          type: allstats-semantic-mpnet-eval
        metrics:
          - type: pearson_cosine
            value: 0.9714169395957917
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8933550959155299
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic mpnet test
          type: allstats-semantic-mpnet-test
        metrics:
          - type: pearson_cosine
            value: 0.9723087139367028
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8932385415736595
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-dataset-v4 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Training Dataset: allstats-semantic-dataset-v4

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
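
In the Pooling module above only pooling_mode_mean_tokens is enabled, so each sentence embedding is the average of its token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask. The function name mean_pool and the toy 2-dimensional vectors are illustrative and not part of the library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [total / count for total in totals]


# Toy example: three token slots, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

In the real model the same averaging runs over 768-dimensional token vectors, for at most 128 tokens per input.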

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-mpnet")
# Run inference
sentences = [
    'Pernikahan usia anak di Indonesia periode 2013-2015',
    'Jumlah penduduk Indonesia 2013-2015',
    'Indeks Tendensi Bisnis dan Indeks Tendensi Konsumen 2013',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
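
For semantic search, the same kind of similarity score can be used to rank candidate documents against a query. The sketch below reimplements cosine similarity in plain Python on toy vectors so that it runs without downloading the model; in practice the vectors would come from model.encode. The helper names cosine and rank_documents are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional stand-ins for 768-dimensional embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```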

Evaluation

Metrics

Semantic Similarity

Metric            allstats-semantic-mpnet-eval   allstats-semantic-mpnet-test
pearson_cosine    0.9714                         0.9723
spearman_cosine   0.8934                         0.8932
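
Both metrics compare the cosine similarity predicted for each pair with its gold label: Pearson measures linear correlation of the raw values, while Spearman is the Pearson correlation of their ranks. The sketch below is a plain-Python illustration on made-up numbers, assuming tied values share their average rank. The function names are hypothetical, and this is not the evaluator code used to produce the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up gold labels and predicted cosine scores for five pairs.
gold = [0.0, 0.1, 0.2, 0.9, 1.0]
predicted = [0.05, 0.02, 0.30, 0.85, 0.95]
print(round(spearman(gold, predicted), 4))  # 0.9
```

In this toy data a single swap among the low-scoring pairs lowers Spearman to 0.9 while Pearson stays higher, which is one way a gap between the two metrics can arise.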

Training Details

Training Dataset

allstats-semantic-dataset-v4

  • Dataset: allstats-semantic-dataset-v4 at 06c3cf8
  • Size: 88,250 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 4 tokens, mean 11.38 tokens, max 46 tokens
    • doc (string): min 4 tokens, mean 14.48 tokens, max 67 tokens
    • label (float): min 0.0, mean 0.51, max 1.0
  • Samples (query | doc | label):
    • Industri teh Indonesia tahun 2021 | Statistik Transportasi Laut 2014 | 0.1
    • Tahun berapa data pertumbuhan ekonomi Indonesia tersebut? | Nilai Tukar Petani (NTP) November 2023 sebesar 116,73 atau naik 0,82 persen. Harga Gabah Kering Panen di Tingkat Petani turun 1,94 persen dan Harga Beras Premium di Penggilingan turun 0,91 persen. | 0.0
    • Kemiskinan di Indonesia Maret 2018 | Feb Tenaga Kerja | 0.1
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
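
With loss_fct set to MSELoss, the training objective is the mean squared error between the cosine similarity of each (query, doc) embedding pair and its label. The sketch below shows that computation in plain Python on toy vectors. The function names are hypothetical, and a real training loop operates on batched tensors, not Python lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_mse_loss(query_vecs, doc_vecs, labels):
    """Mean squared error between cosine(query, doc) and the gold label."""
    errors = [
        (cosine(q, d) - label) ** 2
        for q, d, label in zip(query_vecs, doc_vecs, labels)
    ]
    return sum(errors) / len(errors)


# Two toy pairs: parallel vectors labelled 1.0, orthogonal vectors labelled 0.1.
queries = [[1.0, 0.0], [1.0, 0.0]]
docs = [[2.0, 0.0], [0.0, 3.0]]
labels = [1.0, 0.1]
loss = cosine_mse_loss(queries, docs, labels)
print(round(loss, 4))  # 0.005
```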
    

Evaluation Dataset

allstats-semantic-dataset-v4

  • Dataset: allstats-semantic-dataset-v4 at 06c3cf8
  • Size: 18,910 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 5 tokens, mean 11.35 tokens, max 33 tokens
    • doc (string): min 4 tokens, mean 14.25 tokens, max 52 tokens
    • label (float): min 0.0, mean 0.49, max 1.0
  • Samples (query | doc | label):
    • nAalisis keuangam deas tshun 019 | Statistik Migrasi Nusa Tenggara Barat Hasil Survei Penduduk Antar Sensus 2015 | 0.1
    • Data tanaman buah dan sayur Indonesia tahun 2016 | Statistik Penduduk Lanjut Usia 2010 | 0.1
    • Pasar beras di Indonesia tahun 2018 | Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Kelompok Komoditi dan Negara, April 2021 | 0.2
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 8
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.05
  • eval_on_start: True
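
A warmup_ratio of 0.1 combined with the linear scheduler (lr_scheduler_type: linear, learning_rate: 5e-05 in the full list) means the learning rate ramps up over the first tenth of training and then decays. The sketch below illustrates that shape, assuming the warmup step count is rounded up and the rate decays linearly to zero by the final step (22,064 in the training log). The helper name linear_lr is hypothetical.

```python
import math


def linear_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 22064  # final step in the training log
WARMUP_STEPS = math.ceil(TOTAL_STEPS * 0.1)
print(WARMUP_STEPS)  # 2207
```

Under these assumptions the peak rate of 5e-05 is reached at step 2,207 and falls to zero at step 22,064.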

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 8
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.05
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-mpnet-eval_spearman_cosine allstats-semantic-mpnet-test_spearman_cosine
0 0 - 0.0979 0.6119 -
0.0906 250 0.0646 0.0427 0.7249 -
0.1813 500 0.039 0.0324 0.7596 -
0.2719 750 0.032 0.0271 0.7860 -
0.3626 1000 0.0276 0.0255 0.7920 -
0.4532 1250 0.0264 0.0230 0.8072 -
0.5439 1500 0.0249 0.0222 0.8197 -
0.6345 1750 0.0226 0.0210 0.8200 -
0.7252 2000 0.0218 0.0209 0.8202 -
0.8158 2250 0.0208 0.0201 0.8346 -
0.9065 2500 0.0209 0.0211 0.8240 -
0.9971 2750 0.0211 0.0190 0.8170 -
1.0877 3000 0.0161 0.0182 0.8332 -
1.1784 3250 0.0158 0.0179 0.8393 -
1.2690 3500 0.0167 0.0189 0.8341 -
1.3597 3750 0.0152 0.0168 0.8371 -
1.4503 4000 0.0151 0.0165 0.8435 -
1.5410 4250 0.0143 0.0156 0.8365 -
1.6316 4500 0.0147 0.0157 0.8467 -
1.7223 4750 0.0138 0.0155 0.8501 -
1.8129 5000 0.0147 0.0154 0.8457 -
1.9036 5250 0.0137 0.0152 0.8498 -
1.9942 5500 0.0144 0.0143 0.8485 -
2.0848 5750 0.0108 0.0139 0.8439 -
2.1755 6000 0.01 0.0146 0.8563 -
2.2661 6250 0.011 0.0141 0.8558 -
2.3568 6500 0.0107 0.0144 0.8497 -
2.4474 6750 0.01 0.0138 0.8577 -
2.5381 7000 0.0097 0.0136 0.8585 -
2.6287 7250 0.0102 0.0135 0.8521 -
2.7194 7500 0.0106 0.0133 0.8537 -
2.8100 7750 0.0098 0.0133 0.8643 -
2.9007 8000 0.0105 0.0138 0.8543 -
2.9913 8250 0.009 0.0129 0.8555 -
3.0819 8500 0.0071 0.0121 0.8692 -
3.1726 8750 0.006 0.0120 0.8709 -
3.2632 9000 0.0078 0.0120 0.8660 -
3.3539 9250 0.0072 0.0122 0.8656 -
3.4445 9500 0.007 0.0123 0.8696 -
3.5352 9750 0.0075 0.0117 0.8707 -
3.6258 10000 0.0081 0.0115 0.8682 -
3.7165 10250 0.0083 0.0116 0.8617 -
3.8071 10500 0.0075 0.0116 0.8665 -
3.8978 10750 0.0077 0.0119 0.8733 -
3.9884 11000 0.008 0.0113 0.8678 -
4.0790 11250 0.0051 0.0110 0.8760 -
4.1697 11500 0.0052 0.0108 0.8729 -
4.2603 11750 0.0056 0.0108 0.8771 -
4.3510 12000 0.0052 0.0109 0.8793 -
4.4416 12250 0.0049 0.0109 0.8766 -
4.5323 12500 0.0055 0.0114 0.8742 -
4.6229 12750 0.0061 0.0108 0.8749 -
4.7136 13000 0.0058 0.0109 0.8833 -
4.8042 13250 0.0049 0.0108 0.8767 -
4.8949 13500 0.0046 0.0108 0.8839 -
4.9855 13750 0.0052 0.0104 0.8790 -
5.0761 14000 0.0041 0.0102 0.8826 -
5.1668 14250 0.004 0.0103 0.8775 -
5.2574 14500 0.0036 0.0102 0.8855 -
5.3481 14750 0.0037 0.0104 0.8841 -
5.4387 15000 0.0036 0.0101 0.8860 -
5.5294 15250 0.0043 0.0104 0.8852 -
5.6200 15500 0.004 0.0100 0.8856 -
5.7107 15750 0.0043 0.0101 0.8842 -
5.8013 16000 0.0043 0.0099 0.8835 -
5.8920 16250 0.0041 0.0099 0.8852 -
5.9826 16500 0.0036 0.0101 0.8866 -
6.0732 16750 0.0031 0.0100 0.8881 -
6.1639 17000 0.0031 0.0098 0.8880 -
6.2545 17250 0.0027 0.0098 0.8886 -
6.3452 17500 0.0032 0.0097 0.8868 -
6.4358 17750 0.0027 0.0097 0.8876 -
6.5265 18000 0.0031 0.0097 0.8893 -
6.6171 18250 0.0032 0.0096 0.8903 -
6.7078 18500 0.003 0.0096 0.8898 -
6.7984 18750 0.0029 0.0098 0.8907 -
6.8891 19000 0.003 0.0096 0.8896 -
6.9797 19250 0.0026 0.0096 0.8913 -
7.0703 19500 0.0024 0.0096 0.8921 -
7.1610 19750 0.0021 0.0097 0.8920 -
7.2516 20000 0.0023 0.0096 0.8910 -
7.3423 20250 0.002 0.0096 0.8920 -
7.4329 20500 0.0022 0.0096 0.8924 -
7.5236 20750 0.002 0.0097 0.8917 -
7.6142 21000 0.0024 0.0096 0.8923 -
7.7049 21250 0.0025 0.0095 0.8928 -
7.7955 21500 0.0022 0.0095 0.8931 -
7.8861 21750 0.0023 0.0095 0.8932 -
7.9768 22000 0.0022 0.0095 0.8934 -
8.0 22064 - - - 0.8932
  • The bold row denotes the saved checkpoint.
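
The Epoch and Step columns follow from the training set size and batch size. The arithmetic below is a small sketch that assumes a single device, which together with gradient_accumulation_steps: 1 and dataloader_drop_last: False reproduces the step counts in the log.

```python
import math

TRAIN_SAMPLES = 88250   # training set size
BATCH_SIZE = 32         # per_device_train_batch_size
EPOCHS = 8              # num_train_epochs

# Assumes one device and no gradient accumulation; the last partial
# batch of each epoch is kept (dataloader_drop_last is False).
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def epoch_at(step):
    """Fractional epoch reached after `step` optimizer steps."""
    return round(step / steps_per_epoch, 4)


print(steps_per_epoch)  # 2758
print(total_steps)      # 22064
print(epoch_at(250))    # 0.0906
```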

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.48.0
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}