SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-synthetic-dataset-v1 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
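
The Pooling module above has only pooling_mode_mean_tokens enabled, so each sentence vector is the average of its token vectors. The following is a minimal NumPy sketch of that idea, not the library's implementation: the function name mean_pool and the toy 4-dimensional input are assumptions made for illustration (the real model produces 768-dimensional token vectors).

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    # Broadcast the mask over the embedding dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input (hypothetical): 2 sequences, 3 token slots, 4 dimensions.
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # last slot of the first sequence is padding
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

Padding positions are excluded from both the sum and the count, so a short sentence is not pulled toward zero by its padding.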

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1")
# Run inference
sentences = [
    'analisis kinerja ekspor indonesia feb 2014',
    'Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Kelompok Komoditi dan Negara Februari 2014',
    'Laporan Bulanan Data Sosial Ekonomi Januari 2019',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
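
For semantic search, the similarity scores are typically used to rank candidate documents against a query. The sketch below shows that ranking step with plain NumPy and small stand-in vectors, so it runs without downloading the model. The helper rank_by_cosine and the 3-dimensional vectors are illustrative assumptions; in practice the vectors would come from model.encode.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity, plus the scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)
    return order, scores[order]

# Stand-in vectors (hypothetical); real embeddings would be 768-dimensional.
query = np.array([1.0, 0.0, 1.0])
docs = np.array([
    [0.9, 0.1, 1.1],   # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # partially aligned with the query
])
order, scores = rank_by_cosine(query, docs)
print(order)
```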

Evaluation

Metrics

Semantic Similarity

Metric allstats-semantic-base-v1-eval allstat-semantic-base-v1-test
pearson_cosine 0.9866 0.9877
spearman_cosine 0.9033 0.9063
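
The two metrics compare predicted cosine similarities with the gold labels in different ways: Pearson measures linear agreement of the values, while Spearman measures agreement of their rank order. The sketch below is a self-contained NumPy illustration of both. The helper names and the four toy score pairs are assumptions for illustration and are not taken from the evaluation data.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))

# Toy values (hypothetical): gold labels and predicted cosine similarities.
gold = [0.9, 0.96, 0.09, 0.5]
pred = [0.8, 0.85, 0.2, 0.3]
print(pearson(gold, pred), spearman(gold, pred))
```

In this toy case the predictions preserve the gold ordering exactly, so the Spearman value is 1 while the Pearson value is lower because the values are not perfectly linearly related.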

Training Details

Training Dataset

allstats-semantic-synthetic-dataset-v1

  • Dataset: allstats-semantic-synthetic-dataset-v1 at d59a245
  • Size: 123,640 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 6 tokens, mean 10.64 tokens, max 34 tokens
    • doc (string): min 4 tokens, mean 14.06 tokens, max 76 tokens
    • label (float): min 0.0, mean 0.49, max 1.0
  • Samples:
    • query: Gambaran umum karakteristik usaha di Indonesia
      doc: Statistik Karakteristik Usaha 2022/2023
      label: 0.9
    • query: Tabel data jumlah sekolah, guru, dan murid MA di bawah Kementerian Agama per provinsi.
      doc: Jumlah Sekolah, Guru, dan Murid Madrasah Aliyah (MA) di Bawah Kementerian Agama Menurut Provinsi, tahun ajaran 2005/2006-2015/2016
      label: 0.96
    • query: bagaimana kinerja sektor konstruksi indonesia di triwulan ketiga tahun 2008?
      doc: Statistik Restoran/Rumah Makan 2007
      label: 0.09
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
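
CosineSimilarityLoss with an MSELoss loss function trains the model so that the cosine similarity between the query and doc embeddings matches the float label. The sketch below illustrates that objective in NumPy. It is not the library's code: the function name cosine_similarity_mse and the 2-dimensional toy embeddings are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_mse(query_vecs, doc_vecs, labels):
    """Mean squared error between row-wise cosine similarity and target labels."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    cos = np.sum(q * d, axis=1)          # one similarity per (query, doc) pair
    return float(np.mean((cos - labels) ** 2))

# Toy batch (hypothetical): one well-matched pair and one unrelated pair.
query_vecs = np.array([[1.0, 0.0], [1.0, 0.0]])
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0.9, 0.1])
loss = cosine_similarity_mse(query_vecs, doc_vecs, labels)
print(loss)
```

Because the target is a graded score in [0.0, 1.0] rather than a binary label, the loss penalizes both over- and under-estimating the similarity of a pair.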
    

Evaluation Dataset

allstats-semantic-synthetic-dataset-v1

  • Dataset: allstats-semantic-synthetic-dataset-v1 at d59a245
  • Size: 26,494 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 5 tokens, mean 10.48 tokens, max 34 tokens
    • doc (string): min 5 tokens, mean 13.86 tokens, max 58 tokens
    • label (float): min 0.0, mean 0.49, max 1.0
  • Samples:
    • query: Harga barang konsumsi Indonesia 2022: data per kota
      doc: Harga Konsumen Beberapa Barang Kelompok Makanan, Minuman, dan Tembakau 90 Kota di Indonesia 2022
      label: 0.92
    • query: data biaya hidup bali 2018
      doc: Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Kelompok Komoditi dan Negara, Maret 2018
      label: 0.1
    • query: ekspor barang indonesia november 2011: data lengkap
      doc: Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Kelompok Komoditi dan Negara Februari 2013
      label: 0.12
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 10
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
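
These settings, together with the training set size, determine the length of the run and the shape of the learning-rate curve. The arithmetic below is a rough sketch that assumes a single device (an assumption, the card does not state the device count) and uses the listed values of gradient_accumulation_steps (1), dataloader_drop_last (False), learning_rate (5e-05), and the linear scheduler. The function linear_lr is an illustrative approximation of a linear warmup and decay schedule, not the trainer's own code.

```python
import math

TRAIN_SAMPLES = 123_640   # from the Training Dataset section
BATCH_SIZE = 32           # per_device_train_batch_size (single device assumed)
EPOCHS = 10               # num_train_epochs
WARMUP_RATIO = 0.1        # warmup_ratio
PEAK_LR = 5e-5            # learning_rate

# The last partial batch is kept because dataloader_drop_last is False.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)

def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
print(round(500 / steps_per_epoch, 4))  # epoch fraction reached at step 500
```

Under these assumptions the total comes to 38640 steps, which agrees with the final row of the training logs, and the warmup covers roughly the first epoch.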

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-base-v1-eval_spearman_cosine allstat-semantic-base-v1-test_spearman_cosine
0.1294 500 0.0454 0.0267 0.7374 -
0.2588 1000 0.0243 0.0205 0.7527 -
0.3882 1500 0.0199 0.0169 0.7720 -
0.5176 2000 0.0186 0.0164 0.7733 -
0.6470 2500 0.0179 0.0158 0.7806 -
0.7764 3000 0.0158 0.0155 0.7826 -
0.9058 3500 0.0159 0.0155 0.7771 -
1.0352 4000 0.0155 0.0143 0.7847 -
1.1646 4500 0.0133 0.0141 0.7935 -
1.2940 5000 0.0128 0.0132 0.7986 -
1.4234 5500 0.0121 0.0120 0.8148 -
1.5528 6000 0.012 0.0118 0.8030 -
1.6822 6500 0.0118 0.0121 0.8132 -
1.8116 7000 0.0119 0.0109 0.8130 -
1.9410 7500 0.0107 0.0108 0.8132 -
2.0704 8000 0.009 0.0098 0.8181 -
2.1998 8500 0.0082 0.0099 0.8221 -
2.3292 9000 0.008 0.0100 0.8221 -
2.4586 9500 0.008 0.0095 0.8302 -
2.5880 10000 0.0083 0.0090 0.8284 -
2.7174 10500 0.0084 0.0093 0.8261 -
2.8468 11000 0.0084 0.0089 0.8283 -
2.9762 11500 0.0083 0.0093 0.8259 -
3.1056 12000 0.0056 0.0083 0.8362 -
3.2350 12500 0.006 0.0081 0.8357 -
3.3644 13000 0.0057 0.0078 0.8381 -
3.4938 13500 0.006 0.0081 0.8399 -
3.6232 14000 0.0058 0.0078 0.8420 -
3.7526 14500 0.0068 0.0078 0.8303 -
3.8820 15000 0.0056 0.0072 0.8502 -
4.0114 15500 0.0054 0.0073 0.8483 -
4.1408 16000 0.004 0.0068 0.8565 -
4.2702 16500 0.0042 0.0069 0.8493 -
4.3996 17000 0.0043 0.0069 0.8507 -
4.5290 17500 0.0045 0.0069 0.8536 -
4.6584 18000 0.0042 0.0064 0.8602 -
4.7878 18500 0.0043 0.0065 0.8537 -
4.9172 19000 0.0039 0.0062 0.8623 -
5.0466 19500 0.0041 0.0065 0.8601 -
5.1760 20000 0.0032 0.0060 0.8643 -
5.3054 20500 0.0032 0.0064 0.8657 -
5.4348 21000 0.0032 0.0062 0.8669 -
5.5642 21500 0.0031 0.0065 0.8633 -
5.6936 22000 0.003 0.0059 0.8682 -
5.8230 22500 0.0032 0.0057 0.8713 -
5.9524 23000 0.0032 0.0057 0.8688 -
6.0818 23500 0.0026 0.0055 0.8772 -
6.2112 24000 0.0023 0.0056 0.8708 -
6.3406 24500 0.0029 0.0056 0.8734 -
6.4700 25000 0.0027 0.0054 0.8748 -
6.5994 25500 0.0022 0.0054 0.8827 -
6.7288 26000 0.0021 0.0053 0.8823 -
6.8582 26500 0.0021 0.0053 0.8832 -
6.9876 27000 0.0025 0.0052 0.8839 -
7.1170 27500 0.002 0.0051 0.8887 -
7.2464 28000 0.0017 0.0050 0.8869 -
7.3758 28500 0.0019 0.0052 0.8845 -
7.5052 29000 0.0017 0.0051 0.8897 -
7.6346 29500 0.0017 0.0051 0.8920 -
7.7640 30000 0.0018 0.0050 0.8889 -
7.8934 30500 0.0019 0.0050 0.8931 -
8.0228 31000 0.002 0.0049 0.8889 -
8.1522 31500 0.0014 0.0049 0.8912 -
8.2816 32000 0.0013 0.0049 0.8922 -
8.4110 32500 0.0014 0.0049 0.8947 -
8.5404 33000 0.0014 0.0049 0.8960 -
8.6698 33500 0.0014 0.0049 0.8972 -
8.7992 34000 0.0014 0.0048 0.8982 -
8.9286 34500 0.0013 0.0048 0.9003 -
9.0580 35000 0.0014 0.0048 0.9001 -
9.1874 35500 0.0012 0.0048 0.8995 -
9.3168 36000 0.0011 0.0048 0.9008 -
9.4462 36500 0.001 0.0047 0.9015 -
9.5756 37000 0.0011 0.0047 0.9026 -
9.7050 37500 0.0011 0.0047 0.9027 -
9.8344 38000 0.001 0.0047 0.9035 -
9.9638 38500 0.0011 0.0047 0.9033 -
10.0 38640 - - - 0.9063
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}