SentenceTransformer based on hiiamsid/sentence_similarity_spanish_es

This is a sentence-transformers model finetuned from hiiamsid/sentence_similarity_spanish_es. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: hiiamsid/sentence_similarity_spanish_es
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 tokens
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("igmochang/CR-biodiversity-preprocessed-sentence-similarity-es")
# Run inference
sentences = [
    '¿qu aspect morfolog plumaj charran embrid permit diferenci especi similar stern fuscat especial epoc cri ?',
    'pes 12 grs empidonax pequeñ caf alas cort redond anill ocul opac adult oliv cafecin encim anill ocul ante angost bien defin barr alar opac caf pal gargant blanc pech present tint parduzc abdom ante amarillent forr alas general coberter infracaudal encend maxil negr mandibul color anaranjadocarn forr boc anaranj pat negruzc especimen juvenil caf tizn opac encim barr alar contrast color ante canel cicl anual especi migratori neartic ver distribu cost ric distribu regional reproduc local nort mexic oest panam inviern part central mexic part central panam fenolog usos 1746 nombr cientif vire philadelphicus nombr comun vire amarillent descripcion mid 115 cm pes 115 grs vire pequeñ marc alar color parec vermivor peregrin asoci frecuenci cabez relat grand redond pic grues list facial anchas coberter infracaudal amarill adult pile gris tint oliv rest region superior verd oliv grisace cej are baj ojo color blanc separ mediant list ocul fusc mejill teñ oliv region inferior var principal amarill bastant brillant gargant abdom pal principal blancuzc amarill bien defin sol pech coberter infracaudal maxil color cuern fusc mandibul color cuern pal pat gris individu inmadur parec adult ocasion present coberter alar mayor cort opac punt pal reten plumaj juvenil cicl anual especi migratori neartic ver distribu cost ric distribu regional reproduc canad extrem nort eua inviern peninsul yucatan guatemal part central panam rar ocasion nort guatemal fenolog usos 1446 nombr cientif onychoprion anaethetus nombr comun charran embrid descripcion mid 36 cm pes 100 grs tamañ median espald oscur col profund ahorquill disting stern fuscat coll nucal clar manch blanc delg frent extiend form list supercili cort epoc cri adult present coronill list loreal negr coll nucal gris clar rest region superior gris parduzc oscur tod region inferior rectric extern blanc tint gris pech cost pic pat negr epoc cri ve rar ocasion cost ric coronill eminent blanc list negr part posterior',
    'nombr comun sinsont tropical descripcion dors cuerp color gris opac brillant part cabez part ventral blancuzc alas caf negruzc barr angost blanc col delg caf negruzc punt blanc list ocul negr cej blanc pat pic negr rand traylor 1961 sanchez 2003 cicl anual distribu regional sur mexic colombi andes sur brasil peterson chalf 1973 fenolog usos 2509 nombr cientif zanthoxylum melanostictum nombr comun lagart color lagartill descripcion arbol arbust 2 15 m fust poc aguijon ram joven rojiz glabr aguijon escas lenticeladashoj imparipinn 3 7 par foliol opuest elipt ovad 2 102142 08 5268 cm glabr lustros coriace apic cort acumin redond ocasion levement emargin bas obtus bord general enter peciol peciolul rojizosinflorescent panicul terminal 13 cmflor blanc 5 petal frut folicul verruc rojiz obovoid 4 8 mm diametr semill negr 3 6 mmse disting color mor rojiz peciol raquis peciolul fust general carec aguijon escas principal bas cicl anual distribu regional mexic centr amer fenolog floracion observ ener abril agost usos 3117 nombr cientif campnosperm panamens nombr comun descripcion arbol 12 30 m altur rmit ferrugineopuberulent hoj simpl altern cortopeciol oblongoobov 1435 5515 cm apic obtus redond cartac pubescent dens tricom estrell pequeñ escam pelt pard rojiz enves nervadur secundari evidenteinflorescent panicul general axil 40 cm larg flor amarillent pequeñ frut drup triangularovoid 115 0712 cmdiagnost caracteriz hoj agrup final ramit cuy aparient recuerd arbol espavel zapot carenci savi lechos habitat suel aneg permit distingu ademas present ramif simpodial tipic hoj viej torn rojoanaranj ramit hoj enves pubescent dor diminut pel escam particul cicl anual distribu regional hondur panam fenolog flor observ agost octubr frut setiembr octubr usos 2396 nombr cientif micrurus mipartitus',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Preprocess function:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Initialize Spanish stemmer and stopwords
nltk.download('punkt')
nltk.download('stopwords')
spanish_stopwords = set(stopwords.words('spanish'))
stemmer = SnowballStemmer('spanish')

# Function for preprocessing text (lowercase, remove punctuation, stopwords, and apply stemming)
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s¿?%]', '', text)
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords and apply stemming
    words = [stemmer.stem(word) for word in words if word not in spanish_stopwords]
    # Rejoin the words
    return ' '.join(words)

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.7784
cosine_accuracy@3 0.8907
cosine_accuracy@5 0.9227
cosine_accuracy@10 0.9534
cosine_precision@1 0.7784
cosine_precision@3 0.2969
cosine_precision@5 0.1845
cosine_precision@10 0.0953
cosine_recall@1 0.7784
cosine_recall@3 0.8907
cosine_recall@5 0.9227
cosine_recall@10 0.9534
cosine_ndcg@10 0.8697
cosine_mrr@10 0.8425
cosine_map@100 0.844
dot_accuracy@1 0.7201
dot_accuracy@3 0.879
dot_accuracy@5 0.9155
dot_accuracy@10 0.9446
dot_precision@1 0.7201
dot_precision@3 0.293
dot_precision@5 0.1831
dot_precision@10 0.0945
dot_recall@1 0.7201
dot_recall@3 0.879
dot_recall@5 0.9155
dot_recall@10 0.9446
dot_ndcg@10 0.8406
dot_mrr@10 0.8064
dot_map@100 0.8086

Training Details

Training Dataset

Unnamed Dataset

  • Size: 2,748 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    sentence_0 sentence_1
    type string string
    details
    • min: 14 tokens
    • mean: 24.86 tokens
    • max: 43 tokens
    • min: 323 tokens
    • mean: 508.41 tokens
    • max: 512 tokens
  • Samples:
    sentence_0 sentence_1
    ¿cual caracterist fisic distint esmerejon inclu diferent mach hembr ? 1407 nombr cientif falc columbarius nombr comun esmerejon descripcion mach mid 265 cm pes 160 grs hembr 33 cm 215 grs pequeñ constitu fuert alas puntiagud adult encim gris pizarr mach caf oscur hembr debaj ante list caf oscur profus cabez fin list cuent cej clar traz barr ocul oscur col negruzc 2 o 3 band clar gris mach hembr especimen inmadur parec hembr iris caf oscur cer pat amarill exhib ruf klisklis falc sparverius difier tamañ form accipit joven cicl anual especi migratori neartic ver distribu cost ric distribu regional reproduc nort alask canad extrem nort eua inviern sur part central alask sur canad nort amer sur antill especi ampli distribu viej mund fenolog usos 411 nombr cientif tillandsi subulifer nombr comun piñuel parasit descripcion epifitashoj 5 20 cm roset bulbos vain canel lamin 04 08 cm linearsubul involut escap 65 125 cm erect bract larg entrenudosinflorescent erect 5 105 cm simpl terminal 5 9 flor bract floral 18 3 cm verd cort sepal imbric ecarin conspicu nerv sepal 21 26 cm petal rojiz proximal amarill distal bord rojiz capsul 5 73 cmse reconoc facil roset tubul hoj espiral bract floral pequeñ sepal lamin foliar menor 1 cm ancho cicl anual distribu regional nicaragu venezuel trinid fenolog floracion abril juni frut octubr usos 802 nombr cientif liomys salvini nombr comun raton mont descripcion longitud cabez cuerp 103140mm longitud col 97143mm longitud pie 2630mm longitud orej 1216mm pes 3065gtamañ pequeñ median color caf parduzc pal part dorsal siempr gris sombr caf parduzc pel riz color caf amarillent crem tenuement interpuest espin oscur line anaranj cost part ventral pat color crem blanc col bicolor casi igual tamañ longitud cabez cuerp lc liger pelud ningun mechon pel 24mm punt plant pat traser vellud talon cojincill basal cicl anual
    describ morfolog tillandsi subulifer inclu detall hoj inflorescent bract 1407 nombr cientif falc columbarius nombr comun esmerejon descripcion mach mid 265 cm pes 160 grs hembr 33 cm 215 grs pequeñ constitu fuert alas puntiagud adult encim gris pizarr mach caf oscur hembr debaj ante list caf oscur profus cabez fin list cuent cej clar traz barr ocul oscur col negruzc 2 o 3 band clar gris mach hembr especimen inmadur parec hembr iris caf oscur cer pat amarill exhib ruf klisklis falc sparverius difier tamañ form accipit joven cicl anual especi migratori neartic ver distribu cost ric distribu regional reproduc nort alask canad extrem nort eua inviern sur part central alask sur canad nort amer sur antill especi ampli distribu viej mund fenolog usos 411 nombr cientif tillandsi subulifer nombr comun piñuel parasit descripcion epifitashoj 5 20 cm roset bulbos vain canel lamin 04 08 cm linearsubul involut escap 65 125 cm erect bract larg entrenudosinflorescent erect 5 105 cm simpl terminal 5 9 flor bract floral 18 3 cm verd cort sepal imbric ecarin conspicu nerv sepal 21 26 cm petal rojiz proximal amarill distal bord rojiz capsul 5 73 cmse reconoc facil roset tubul hoj espiral bract floral pequeñ sepal lamin foliar menor 1 cm ancho cicl anual distribu regional nicaragu venezuel trinid fenolog floracion abril juni frut octubr usos 802 nombr cientif liomys salvini nombr comun raton mont descripcion longitud cabez cuerp 103140mm longitud col 97143mm longitud pie 2630mm longitud orej 1216mm pes 3065gtamañ pequeñ median color caf parduzc pal part dorsal siempr gris sombr caf parduzc pel riz color caf amarillent crem tenuement interpuest espin oscur line anaranj cost part ventral pat color crem blanc col bicolor casi igual tamañ longitud cabez cuerp lc liger pelud ningun mechon pel 24mm punt plant pat traser vellud talon cojincill basal cicl anual
    ¿cual caracterist distint ramif hoj alzate verticillat permit identif ? color caf parduzc pal part dorsal siempr gris sombr caf parduzc pel riz color caf amarillent crem tenuement interpuest espin oscur line anaranj cost part ventral pat color crem blanc col bicolor casi igual tamañ longitud cabez cuerp lc liger pelud ningun mechon pel 24mm punt plant pat traser vellud talon cojincill basal cicl anual distribu regional mexic part central cost ric vertient pacif principal localiz tierr baj 1500msnm fenolog usos 3032 nombr cientif alzate verticillat nombr comun descripcion arbol arbust 4 15 m altur ramit cuadrangular pard rojiz exfoli hoj simpl opuest decus obovadoelipt 915 610 cm apic redond retus sesil subsesil coriac glabrasinflorescent panicul terminal 25 cm larg flor petal lil ros frut tip capsul aplan verd pal 5 8 mm largodiagnost reconoc ramif verticil fust pard exfoli escam hoj coriac semej clusi clusiacea secrecion lechos glabr sesil nervadur secundari evident ramit joven cuadrangular hoj torn anaranj rojiz viej cicl anual distribu regional cost ric suramer fenolog flor observ octubr diciembr frut febrer marz usos siti fald cordiller volcan central utiliz ornamental deb foment pues atract follaj arquitectur 1546 nombr cientif eubucc bourcierii nombr comun barbud cabecirroj descripcion mid 15 cm pes 35 grs robust cabezon dos sex present color llamat pic grues conspicu color amarill mach adult are loreal frent barbill negr rest cabez gargant pech roj profund desvanec form abrupt anaranj pech amarill list verd opac profus region posterior region superior alas col color verd opac separ roj cabez lad cuell mediant barr vertical blanc azul iris roj ladrill pic amarill verdos pat verd oliv hembr muestr gargant verd pal part anterior coronill lad cuell anaranj profund continu faj traves part superior pech rest coronill verd ocrace oscur tint anaranj mejill list cort ojo azul clar part baj pech verd oliv clar
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 2
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss cosine_map@100
0.1818 50 - 0.6806
0.3636 100 - 0.7391
0.5455 150 - 0.7742
0.7273 200 - 0.7927
0.9091 250 - 0.8211
1.0 275 - 0.8162
1.0909 300 - 0.8241
1.2727 350 - 0.8137
1.4545 400 - 0.8318
1.6364 450 - 0.8342
1.8182 500 0.4916 0.8432
2.0 550 - 0.8440

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Downloads last month
3
Safetensors
Model size
110M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for igmochang/CR-biodiversity-preprocessed-sentence-similarity-es

Finetuned
(5)
this model

Evaluation results