metadata

tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:212917
  - loss:CosineSimilarityLoss
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
  - source_sentence: statistik neraca arus dana indonesia
    sentences:
      - Statistik Kelapa Sawit Indonesia 2012
      - Distribusi Perdagangan Komoditas Kedelai Indonesia 2023
      - Data Runtun Statistik Konstruksi 1990-2010
  - source_sentence: >-
      Seberapa besar kenaikan produksi IBS pada Triwulan IV Tahun 2013
      dibandingkan Triwulan IV Tahun Sebelumnya?
    sentences:
      - Pertumbuhan PDB 2013 Mencapai 5,78 Persen
      - >-
        Statistik Komuter Gerbangkertosusila Hasil Survei Komuter
        Gerbangkertosusila 2017
      - >-
        Statistik Penduduk Lanjut Usia Provinsi Jawa Timur 2010-Hasil Sensus
        Penduduk 2010
  - source_sentence: 'Penduduk Papua: migrasi 2015'
    sentences:
      - >-
        Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut
        Pendidikan Tertinggi dan jenis pekerjaan utama, 2019
      - Statistik Pemotongan Ternak 2010 dan 2011
      - >-
        Statistik Harga Produsen Pertanian Sub Sektor Tanaman Pangan,
        Hortikultura dan Perkebunan Rakyat 2010
  - source_sentence: statistik konstruksi 2022
    sentences:
      - Studi Modal Sosial 2006
      - BRS upah buruh agustus 2018
      - Statistik Perdagangan Luar Negeri Indonesia Ekspor 2006 vol 1
  - source_sentence: Statistik ekspor Indonesia Maret 2202
    sentences:
      - Produk Domestik Bruto Indonesia Triwulanan 2006-2010
      - >-
        Indeks Perilaku Anti Korupsi (IPAK) Indonesia 2023 sebesar 3,92, menurun
        dibandingkan IPAK 2022
      - >-
        Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Januari
        2023
datasets:
  - yahyaabd/allstats-semantic-search-synthetic-dataset-v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-mpnet-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic search v1 dev
          type: allstats-semantic-search-v1-dev
        metrics:
          - type: pearson_cosine
            value: 0.9894566758405579
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9072484378842124
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic search v1 test
          type: allstat-semantic-search-v1-test
        metrics:
          - type: pearson_cosine
            value: 0.9895284407960067
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9074137706349162
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-search-synthetic-dataset-v1 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Maximum Sequence Length: 128 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- allstats-semantic-search-synthetic-dataset-v1

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
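
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions; the real model works on 768-dimensional tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Toy input (assumed): three tokens with 2-dimensional embeddings; the last is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

Masking matters because padded positions would otherwise pull the average toward arbitrary values, as the third token would here.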

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1")
# Run inference
sentences = [
    'Statistik ekspor Indonesia Maret 2202',
    'Produk Domestik Bruto Indonesia Triwulanan 2006-2010',
    'Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Januari 2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
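
For semantic search, the similarity scores are typically used to rank candidate documents against a single query. The sketch below shows that ranking step in plain Python so it runs without downloading the model. The helper names `cosine` and `rank_documents` and the 3-dimensional toy vectors are assumptions for illustration; in practice the vectors would come from `model.encode`.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins (assumed) for one query embedding and three document embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [1.0, 1.0, 1.0]]
ranking = rank_documents(query_vec, doc_vecs, top_k=2)
print([index for index, _ in ranking])  # [1, 2]
```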

Evaluation

Metrics

Semantic Similarity

Datasets: allstats-semantic-search-v1-dev and allstat-semantic-search-v1-test
Evaluated with EmbeddingSimilarityEvaluator

Metric	allstats-semantic-search-v1-dev	allstat-semantic-search-v1-test
pearson_cosine	0.9895	0.9895
spearman_cosine	0.9072	0.9074
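
Both metrics compare the model's cosine similarity scores with the gold labels: Pearson measures linear agreement between the raw values, while Spearman measures agreement between their rank orders. The sketch below computes both in plain Python. It is an illustration of the two statistics, not the evaluator's code. The `gold` values are taken from sample labels in this card; the `predicted` scores are made up for the example.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


gold = [0.1, 0.82, 0.0, 0.92]        # labels from sample rows in this card
predicted = [0.15, 0.80, 0.05, 0.70]  # hypothetical cosine scores
print(round(pearson(gold, predicted), 4))
print(round(spearman(gold, predicted), 4))  # 0.8
```

In this toy case the two highest-scoring pairs swap order, which lowers Spearman even though the raw values stay close to the labels.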

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v1

Dataset: allstats-semantic-search-synthetic-dataset-v1 at 06f849a
Size: 212,917 training samples
Columns: query, doc, and label
Approximate statistics based on the first 1000 samples:
	query	doc	label
type	string	string	float
details	min: 5 tokens mean: 11.48 tokens max: 29 tokens	min: 4 tokens mean: 14.89 tokens max: 53 tokens	min: 0.0 mean: 0.52 max: 1.0

Samples:

query	doc	label
`ringkasan aktivitas badan pusat statistik tahun 2018`	`Statistik Harga Produsen sektor pertanian di indonesia 2008`	`0.1`
`indikator kesejahteraan petani rejang lebong 2015`	`Diagram Timbang Nilai Tukar Petani Kabupaten Rejang Lebong 2015`	`0.82`
`Berapa persen kenaikan kunjungan wisatawan mancanegara pada April 2024?`	`Indeks Perilaku Anti Korupsi (IPAK) Indonesia 2023 sebesar 3,92, menurun dibandingkan IPAK 2022`	`0.0`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}
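
With `MSELoss` as `loss_fct`, CosineSimilarityLoss penalizes the squared difference between the cosine similarity of a (query, doc) embedding pair and its float label. A minimal sketch of that objective in plain Python follows. The function names and the 2-dimensional toy embeddings are assumptions for illustration; the labels 0.82 and 0.1 are taken from the sample rows above.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs: List[Tuple[Vector, Vector]], labels: List[float]) -> float:
    """Mean squared error between cos(u, v) and the gold label of each pair."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy embeddings (assumed): an identical pair and an orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [0.82, 0.1]
loss = cosine_similarity_loss(pairs, labels)
print(round(loss, 4))  # 0.0212
```

The first pair has cosine 1.0 against a label of 0.82 and the second has cosine 0.0 against 0.1, so the loss is (0.18² + 0.1²) / 2.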

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v1

Dataset: allstats-semantic-search-synthetic-dataset-v1 at 06f849a
Size: 26,614 evaluation samples
Columns: query, doc, and label
Approximate statistics based on the first 1000 samples:
	query	doc	label
type	string	string	float
details	min: 5 tokens mean: 11.21 tokens max: 32 tokens	min: 5 tokens mean: 14.41 tokens max: 54 tokens	min: 0.0 mean: 0.5 max: 1.0

Samples:

query	doc	label
`Laporan bulanan ekonomi Indonesia bulan November 201`	`Laporan Bulanan Data Sosial Ekonomi November 2021`	`0.92`
`pekerjaan layak di indonesia tahun 2022: data dan analisis`	`Statistik Penduduk Lanjut Usia Provinsi Papua Barat 2010-Hasil Sensus Penduduk 2010`	`0.09`
`Tabel pendapatan rata-rata pekerja lepas berdasarkan provinsi dan pendidikan tahun 2021`	`Nilai Impor Kendaraan Bermotor Menurut Negara Asal Utama (Nilai CIF:juta US$), 2018-2023`	`0.1`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
num_train_epochs: 4
warmup_ratio: 0.1
fp16: True
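
These settings determine the length of the run and the shape of the learning-rate curve. The sketch below works through the arithmetic. It assumes a single device, `gradient_accumulation_steps: 1`, `dataloader_drop_last: False`, `learning_rate: 5e-05` and a linear scheduler, all as listed under All Hyperparameters. It also assumes that the warmup step count is rounded up. The `linear_lr` helper is illustrative, not the trainer's own code.

```python
import math

TRAIN_SAMPLES = 212_917  # size of the training split
BATCH_SIZE = 32          # per_device_train_batch_size
EPOCHS = 4               # num_train_epochs
WARMUP_RATIO = 0.1       # warmup_ratio
PEAK_LR = 5e-5           # learning_rate

# The last partial batch still counts as a step because drop_last is False.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # rounding up is an assumption


def linear_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return PEAK_LR * max(0.0, remaining)


print(steps_per_epoch, total_steps, warmup_steps)  # 6654 26616 2662
```

The resulting total of 26,616 steps matches the final step recorded in the training logs.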

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	allstats-semantic-search-v1-dev_spearman_cosine	allstat-semantic-search-v1-test_spearman_cosine
0.0376	250	0.0683	0.0432	0.6955	-
0.0751	500	0.0393	0.0322	0.7230	-
0.1127	750	0.0321	0.0270	0.7476	-
0.1503	1000	0.0255	0.0226	0.7789	-
0.1879	1250	0.024	0.0213	0.7683	-
0.2254	1500	0.022	0.0199	0.7727	-
0.2630	1750	0.0219	0.0195	0.7853	-
0.3006	2000	0.0202	0.0188	0.7795	-
0.3381	2250	0.0191	0.0187	0.7943	-
0.3757	2500	0.0198	0.0178	0.7842	-
0.4133	2750	0.0179	0.0184	0.7974	-
0.4509	3000	0.0179	0.0194	0.7810	-
0.4884	3250	0.0182	0.0168	0.8080	-
0.5260	3500	0.0174	0.0164	0.8131	-
0.5636	3750	0.0174	0.0154	0.8113	-
0.6011	4000	0.0169	0.0157	0.7981	-
0.6387	4250	0.0152	0.0146	0.8099	-
0.6763	4500	0.0148	0.0147	0.8091	-
0.7139	4750	0.0145	0.0145	0.8178	-
0.7514	5000	0.014	0.0139	0.8184	-
0.7890	5250	0.0145	0.0130	0.8166	-
0.8266	5500	0.0134	0.0129	0.8306	-
0.8641	5750	0.013	0.0122	0.8251	-
0.9017	6000	0.0136	0.0130	0.8265	-
0.9393	6250	0.0123	0.0126	0.8224	-
0.9769	6500	0.0113	0.0120	0.8305	-
1.0144	6750	0.0129	0.0117	0.8204	-
1.0520	7000	0.0106	0.0116	0.8284	-
1.0896	7250	0.01	0.0116	0.8303	-
1.1271	7500	0.0096	0.0110	0.8303	-
1.1647	7750	0.01	0.0113	0.8305	-
1.2023	8000	0.0116	0.0108	0.8300	-
1.2399	8250	0.0095	0.0104	0.8432	-
1.2774	8500	0.009	0.0104	0.8370	-
1.3150	8750	0.0101	0.0102	0.8434	-
1.3526	9000	0.01	0.0097	0.8450	-
1.3901	9250	0.0097	0.0103	0.8286	-
1.4277	9500	0.0092	0.0096	0.8393	-
1.4653	9750	0.0093	0.0089	0.8480	-
1.5029	10000	0.0088	0.0090	0.8439	-
1.5404	10250	0.0087	0.0089	0.8569	-
1.5780	10500	0.0082	0.0088	0.8488	-
1.6156	10750	0.009	0.0089	0.8493	-
1.6531	11000	0.0086	0.0086	0.8499	-
1.6907	11250	0.0076	0.0083	0.8600	-
1.7283	11500	0.0076	0.0081	0.8621	-
1.7659	11750	0.0079	0.0081	0.8611	-
1.8034	12000	0.0082	0.0085	0.8540	-
1.8410	12250	0.0074	0.0081	0.8620	-
1.8786	12500	0.007	0.0080	0.8639	-
1.9161	12750	0.0071	0.0083	0.8450	-
1.9537	13000	0.007	0.0076	0.8585	-
1.9913	13250	0.0072	0.0074	0.8640	-
2.0289	13500	0.0055	0.0069	0.8699	-
2.0664	13750	0.0056	0.0068	0.8673	-
2.1040	14000	0.0052	0.0066	0.8723	-
2.1416	14250	0.0059	0.0069	0.8644	-
2.1791	14500	0.0055	0.0068	0.8670	-
2.2167	14750	0.005	0.0065	0.8723	-
2.2543	15000	0.0053	0.0066	0.8766	-
2.2919	15250	0.0057	0.0065	0.8782	-
2.3294	15500	0.0053	0.0064	0.8749	-
2.3670	15750	0.0056	0.0070	0.8708	-
2.4046	16000	0.0058	0.0065	0.8731	-
2.4421	16250	0.0047	0.0064	0.8793	-
2.4797	16500	0.0049	0.0063	0.8801	-
2.5173	16750	0.0051	0.0063	0.8782	-
2.5549	17000	0.0053	0.0060	0.8799	-
2.5924	17250	0.0051	0.0059	0.8825	-
2.6300	17500	0.0048	0.0060	0.8761	-
2.6676	17750	0.0055	0.0055	0.8773	-
2.7051	18000	0.0045	0.0053	0.8833	-
2.7427	18250	0.0041	0.0053	0.8868	-
2.7803	18500	0.0051	0.0054	0.8811	-
2.8179	18750	0.004	0.0052	0.8881	-
2.8554	19000	0.0043	0.0053	0.8764	-
2.8930	19250	0.0047	0.0051	0.8874	-
2.9306	19500	0.0038	0.0051	0.8922	-
2.9681	19750	0.0047	0.0050	0.8821	-
3.0057	20000	0.0037	0.0048	0.8911	-
3.0433	20250	0.0031	0.0048	0.8911	-
3.0809	20500	0.0032	0.0046	0.8934	-
3.1184	20750	0.0034	0.0046	0.8942	-
3.1560	21000	0.0028	0.0045	0.8976	-
3.1936	21250	0.0034	0.0045	0.8932	-
3.2311	21500	0.003	0.0044	0.8959	-
3.2687	21750	0.0033	0.0044	0.8961	-
3.3063	22000	0.0029	0.0043	0.8995	-
3.3439	22250	0.0029	0.0044	0.8978	-
3.3814	22500	0.0027	0.0043	0.8998	-
3.4190	22750	0.003	0.0043	0.9019	-
3.4566	23000	0.0027	0.0042	0.8982	-
3.4941	23250	0.0027	0.0042	0.9014	-
3.5317	23500	0.0034	0.0042	0.9025	-
3.5693	23750	0.003	0.0041	0.9027	-
3.6069	24000	0.0029	0.0041	0.9003	-
3.6444	24250	0.0027	0.0040	0.9023	-
3.6820	24500	0.0027	0.0040	0.9035	-
3.7196	24750	0.0033	0.0040	0.9042	-
3.7571	25000	0.0028	0.0039	0.9053	-
3.7947	25250	0.0027	0.0039	0.9049	-
3.8323	25500	0.0033	0.0039	0.9057	-
3.8699	25750	0.0025	0.0039	0.9075	-
3.9074	26000	0.003	0.0039	0.9068	-
3.9450	26250	0.0026	0.0039	0.9073	-
3.9826	26500	0.0023	0.0038	0.9072	-
4.0	26616	-	-	-	0.9074

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.3.1
Transformers: 4.47.1
PyTorch: 2.2.2+cu121
Accelerate: 1.2.1
Datasets: 3.2.0
Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}