base_model: microsoft/mpnet-base
datasets:
  - SwastikN/sxc_med_llm_chemical_gen
language:
  - en
library_name: sentence-transformers
license: apache-2.0
metrics:
  - cosine_accuracy
  - dot_accuracy
  - manhattan_accuracy
  - euclidean_accuracy
  - max_accuracy
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:117502
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      Help me make the molecule
      CC(=O)OC[C@H](OC(C)=O)C(=O)N1CCCC[C@H]1C1CCN(C(=O)c2cc3ccccc3n2C)CC1 with
      the same hydrogen bond donors. The output molecule should be similar to
      the input molecule. Please inform me of the number of hydrogen bond
      donor(s) of the optimized molecule.
    sentences:
      - >-
        Your requirements guided the optimization, resulting in the molecule
        "CC(=O)OC(CCl)C(Cc1cccs1)[C@H](OC(C)=O)C(=O)N1CCCC[C@H]1C1CCN(C(=O)c2cc3ccccc3n2C)CC1"
        with an approximate hydrogen bond donor(s) of 0.
      - >-
        Given a molecule expressed in SMILES string, help me optimize it
        according to my requirements.
      - >-
        Help me adapt a molecular structure denoted in SMILES string based on my
        preferences.
  - source_sentence: >-
      How can we modify the molecule
      CCC(CC)=C(CC)c1ccccc1OC(=O)OC(N=[N+]=[N-])c1ccccc1 to decrease its
      blood-brain barrier penetration (BBBP) value while keeping it similar to
      the input molecule? Please inform me of the BBBP value of the optimized
      molecule.
    sentences:
      - Describe a technology used for measuring people's emotional responses.
      - >-
        I've successfully optimized the molecule according to your needs,
        resulting in "CCOC(=O)c1ccccc1OC(=O)OC(N=[N+]=[N-])c1ccccc1" with an
        approximate BBBP value of 0.71.
      - >-
        Given a molecule expressed in SMILES string, help me optimize it
        according to my requirements.
  - source_sentence: >-
      How can we modify the molecule
      C/C(=C/C(=O)N1CC[C@H](CC(CCCCCC(CO)C(=O)O)NC(=O)OC(C)(C)C)[C@H]1c1cccnc1)C(=O)O
      to increase its blood-brain barrier penetration (BBBP) value while keeping
      it similar to the input molecule?
    sentences:
      - >-
        Given a molecule expressed in SMILES string, help me optimize it
        according to my requirements.
      - >-
        Aid me in refining a molecular structure written in SMILES notation
        based on my criteria.
      - >-
        Taking your requirements into account, I've optimized the molecule to
        "C/C(=C/C(=O)N1CC[C@H](CNC(=O)[C@H](CO)NC(=O)OC(C)(C)C)[C@H]1c1cccnc1)C(=O)O".
  - source_sentence: >-
      Support me in transforming the molecule [SMILES] by incorporating the same
      hydrogen bond acceptors and maintaining its resemblance to the original
      molecule.
    sentences:
      - >-
        Taking your requirements into account, I've optimized the molecule to
        "CCOc1cccc(C2c3c(oc4ccc(C)cc4c3=O)C(=O)N2CCN(CC)CC)c1".
      - >-
        Help me adapt a molecular structure denoted in SMILES string based on my
        preferences.
      - >-
        Help me adapt a molecular structure denoted in SMILES string based on my
        preferences.
  - source_sentence: >-
      With a molecule represented by the SMILES string
      CNNNCC(=O)N[C@H](C)C[C@@H](C)NCc1ccc2c(c1)CCC2, propose adjustments that
      can increase its logP value while keeping the output molecule structurally
      related to the input molecule.
    sentences:
      - >-
        Aid me in refining a molecular structure written in SMILES notation
        based on my criteria.
      - >-
        Given a molecule expressed in SMILES string, help me optimize it
        according to my requirements.
      - >-
        In line with your criteria, I've optimized the molecule and present it
        as "C[C@H](C[C@@H](C)NC(=O)COC(C)(C)C)NCc1ccc2c(c1)CCC2".
model-index:
  - name: MPNet base trained on AllNLI triplets
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: all nli dev
          type: all-nli-dev
        metrics:
          - type: cosine_accuracy
            value: 0.6562222222222223
            name: Cosine Accuracy
          - type: dot_accuracy
            value: 0.5342222222222223
            name: Dot Accuracy
          - type: manhattan_accuracy
            value: 0.7075555555555556
            name: Manhattan Accuracy
          - type: euclidean_accuracy
            value: 0.6584444444444445
            name: Euclidean Accuracy
          - type: max_accuracy
            value: 0.7075555555555556
            name: Max Accuracy
          - type: cosine_accuracy
            value: 0.9804444444444445
            name: Cosine Accuracy
          - type: dot_accuracy
            value: 0.01888888888888889
            name: Dot Accuracy
          - type: manhattan_accuracy
            value: 0.9811111111111112
            name: Manhattan Accuracy
          - type: euclidean_accuracy
            value: 0.9802222222222222
            name: Euclidean Accuracy
          - type: max_accuracy
            value: 0.9811111111111112
            name: Max Accuracy

MPNet base trained on AllNLI triplets

This is a sentence-transformers model fine-tuned from microsoft/mpnet-base on the sxc_med_llm_chemical_gen dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: microsoft/mpnet-base
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: sxc_med_llm_chemical_gen
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
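
The Pooling module above has only pooling_mode_mean_tokens enabled, so a sentence embedding is the average of its token embeddings. A minimal pure-Python sketch of that idea, assuming padded positions are excluded through an attention mask; the helper name mean_pool and the toy 2-dimensional vectors are illustrative and not part of the library:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three real tokens and one padding token, 2 dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

In the real model each token vector has 768 entries, and the padding vector is ignored in the same way.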

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Saideepthi55/sentencetransformer-ft")
# Run inference
sentences = [
    'With a molecule represented by the SMILES string CNNNCC(=O)N[C@H](C)C[C@@H](C)NCc1ccc2c(c1)CCC2, propose adjustments that can increase its logP value while keeping the output molecule structurally related to the input molecule.',
    'Given a molecule expressed in SMILES string, help me optimize it according to my requirements.',
    'In line with your criteria, I\'ve optimized the molecule and present it as "C[C@H](C[C@@H](C)NC(=O)COC(C)(C)C)NCc1ccc2c(c1)CCC2".',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
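
The model card lists cosine similarity as the similarity function, so the 3 x 3 matrix above holds pairwise cosine scores. A minimal pure-Python sketch of that computation, using made-up 3-dimensional vectors in place of real 768-dimensional embeddings; cosine_similarity and similarity_matrix are hypothetical helper names:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarity between every pair of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first two point the same way, the third is orthogonal.
toy_embeddings = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 2) for score in row])
```

Because cosine similarity ignores vector length, the first two toy vectors score 1.0 against each other even though one is twice as long.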

Evaluation

Metrics

Triplet

Metric Value
cosine_accuracy 0.6562
dot_accuracy 0.5342
manhattan_accuracy 0.7076
euclidean_accuracy 0.6584
max_accuracy 0.7076

Triplet

Metric Value
cosine_accuracy 0.9804
dot_accuracy 0.0189
manhattan_accuracy 0.9811
euclidean_accuracy 0.9802
max_accuracy 0.9811
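
The accuracies above are triplet accuracies: the share of (anchor, positive, negative) triplets in which the anchor is closer to the positive than to the negative under a given measure. The sketch below shows the idea on two made-up 2-dimensional triplets, not on the evaluation data. All function names are hypothetical, and this is not the library's evaluator.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def triplet_accuracy(triplets, score, higher_is_closer):
    """Fraction of triplets whose positive is closer to the anchor than the negative."""
    correct = 0
    for anchor, positive, negative in triplets:
        pos, neg = score(anchor, positive), score(anchor, negative)
        correct += (pos > neg) if higher_is_closer else (pos < neg)
    return correct / len(triplets)


# Toy triplets; in the first one the negative has a much larger norm.
triplets = [
    ([1.0, 0.0], [1.0, 0.1], [5.0, 5.0]),
    ([0.0, 1.0], [0.0, 2.0], [1.0, 0.0]),
]
results = {
    "cosine": triplet_accuracy(triplets, cosine, True),
    "dot": triplet_accuracy(triplets, dot, True),
    "manhattan": triplet_accuracy(triplets, manhattan, False),
    "euclidean": triplet_accuracy(triplets, euclidean, False),
}
results["max"] = max(results.values())
print(results)
```

In the first toy triplet the dot product prefers the long negative vector while the other three measures prefer the positive, which shows how dot accuracy can diverge from cosine accuracy when embedding norms vary. Whether that explains the 0.0189 reported above is not something this card establishes.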

Training Details

Training Dataset

sxc_med_llm_chemical_gen

  • Dataset: sxc_med_llm_chemical_gen at ae28874
  • Size: 117,502 training samples
  • Columns: input, instruction, and output
  • Approximate statistics based on the first 1000 samples:
    statistic input instruction output
    type string string string
    min (tokens) 2 6 3
    mean (tokens) 50.03 16.19 67.94
    max (tokens) 156 69 321
  • Samples:
    input instruction output
    List the countries of the Middle East Egypt, Israel, Iran, Iraq, Jordan, Kuwait, Lebanon, Qatar, Saudi Arabia, Syria, Turkey, The United Arab Emirates, Oman, and Yemen.
    Support me in refining the molecule CCC(C)=C(CC)c1ccc(Nc2nc3nonc3nc2N2CCC(Oc3ccc(Cl)cc3)CC2)cc1 to have a more drug-like nature, ensuring it remains close to the input molecule. Please inform me of the QED value of the optimized molecule. Help me adapt a molecular structure denoted in SMILES string based on my preferences. I've optimized the molecule according to your requirements, yielding the structure "COC(=O)c1ccc(Nc2nc3nonc3nc2N2CCC(Oc3ccc(Cl)cc3)CC2)cc1" and an estimated QED value of 0.4.
    Create a social media strategy to attract followers on Twitter. The social media strategy for Twitter should involve Tweeting on a regular basis, with content that is engaging and relevant to the target audience. This can include retweetable quotes, links to videos and articles, sharing photos/images, polls, and so on. It is also beneficial for the information to include hashtags relevant to the content and any relevant topics of the industry.

    It is also important to leverage Twitter influencers and accounts, as well as engaging with potential followers regularly and DM them when necessary. Additionally, running campaigns, contests and giveaways to gain more followers should also be part of the strategy. Finally, hashtag research should be conducted to increase the reach of the tweets.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
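
For each anchor in a batch, MultipleNegativesRankingLoss treats its own paired text as the correct class and the other texts in the batch as negatives, then applies cross-entropy to the scaled similarity scores. A rough pure-Python sketch with scale = 20.0 and cosine similarity, matching the parameters above. The function names and toy vectors are illustrative, only anchor/positive pairs are covered (no extra negative column), and the real implementation works on batched tensors.

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchors[i] should rank positives[i] first."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
well_matched = [[0.9, 0.1], [0.1, 0.9]]  # each anchor is near its own positive
mismatched = [[0.1, 0.9], [0.9, 0.1]]    # each anchor is near the other's positive
print(mnr_loss(anchors, well_matched))   # close to zero
print(mnr_loss(anchors, mismatched))     # large
```

The scale factor sharpens the softmax: with cosine scores confined to [-1, 1], multiplying by 20 lets a well-separated positive drive the loss close to zero.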
    

Evaluation Dataset

sxc_med_llm_chemical_gen

  • Dataset: sxc_med_llm_chemical_gen at ae28874
  • Size: 4,500 evaluation samples
  • Columns: input, instruction, and output
  • Approximate statistics based on the first 1000 samples:
    statistic input instruction output
    type string string string
    min (tokens) 2 6 3
    mean (tokens) 68.49 16.89 69.16
    max (tokens) 165 36 205
  • Samples:
    input instruction output
    With a molecule represented by the SMILES string COc1ccc(CN(C(=O)Nc2ccc(C@Hc3ccccc3)c(Cl)c2)C2CC2)cc1, propose adjustments that can decrease its blood-brain barrier permeability (BBBP) value by at least 0.1 compared to the pre-optimized value while keeping the output molecule structurally related to the input molecule. Please inform me of the BBBP value of the optimized molecule. Help me adapt a molecular structure denoted in SMILES string based on my preferences. I've optimized the molecule according to your requirements, yielding the structure "CC(C)c1cccc(Oc2ccc(CN(C(=O)Nc3ccc(C@Hc4ccccc4)c(Cl)c3)C3CC3)cc2)c1" and an estimated BBBP value of 0.34.
    Which modifications can be applied to the molecule Cc1nn(-c2cc3cc(F)ccc3s2)c2ncc(C(=O)N3CCCC3c3nnc4n3CCCCC4)cc12 to increase its water solubility value while ensuring it remains similar to the original molecule? Aid me in refining a molecular structure written in SMILES notation based on my criteria. I've optimized the molecule according to your requirements, resulting in the following structure: "Cc1nn(C)c2ncc(C(=O)N3CCCC3c3nnc4n3CCCCC4)cc12".
    Describe the benefits of having a mentor. Having a mentor can provide tremendous benefits, including guidance, advice, support and connection to a network of contacts. Mentors can offer valuable feedback, help to create goals and strategies, and provide the accountability necessary to carry out the desired goals. They can also provide a fresh perspective which can help to create new ideas and solutions.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • learning_rate: 2e-05
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss all-nli-dev_max_accuracy
0 0 - - 0.7076
0.0174 64 - - 0.7156
0.0068 100 2.7336 2.6486 0.7524
0.0136 200 2.4965 1.9213 0.8162
0.0204 300 1.9042 1.7761 0.822
0.0272 400 1.6856 1.7172 0.8371
0.0340 500 1.6117 1.6916 0.8507
0.0408 600 1.5673 1.6809 0.8976
0.0477 700 1.5984 1.7052 0.9329
0.0545 800 1.5828 1.6841 0.9391
0.0613 900 1.5375 1.6534 0.9267
0.0681 1000 1.5561 1.6619 0.9509
0.0749 1100 1.4911 1.6538 0.9556
0.0817 1200 1.5075 1.6498 0.966
0.0885 1300 1.4722 1.6468 0.946
0.0953 1400 1.4806 1.6981 0.9631
0.1021 1500 1.4788 1.6335 0.9662
0.1089 1600 1.4668 1.6668 0.9731
0.1157 1700 1.4383 1.6473 0.9711
0.1225 1800 1.4549 1.6462 0.9713
0.1294 1900 1.4394 1.6184 0.9718
0.1362 2000 1.3861 1.6156 0.9676
0.1430 2100 1.4111 1.6045 0.9711
0.1498 2200 1.4286 1.6056 0.9782
0.1566 2300 1.4669 1.6174 0.9764
0.1634 2400 1.3761 1.6182 0.9776
0.1702 2500 1.4119 1.6150 0.9738
0.1770 2600 1.3625 1.5984 0.9776
0.1838 2700 1.3726 1.6092 0.9807
0.1906 2800 1.3265 1.6059 0.9789
0.1974 2900 1.3925 1.6004 0.978
0.2042 3000 1.3524 1.5964 0.9773
0.2111 3100 1.342 1.6213 0.9787
0.2179 3200 1.3478 1.6016 0.9822
0.2247 3300 1.3888 1.6038 0.9793
0.2315 3400 1.3328 1.5977 0.9813
0.2383 3500 1.372 1.6114 0.9824
0.2451 3600 1.3046 1.6082 0.9824
0.2519 3700 1.3857 1.5922 0.9824
0.2587 3800 1.3236 1.6127 0.9809
0.2655 3900 1.2929 1.5935 0.9824
0.2723 4000 1.3889 1.6047 0.9831
0.2791 4100 1.3509 1.6030 0.9844
0.2859 4200 1.3455 1.6099 0.9824
0.2928 4300 1.337 1.5939 0.984
0.2996 4400 1.3302 1.6057 0.9827
0.3064 4500 1.3377 1.6254 0.9833
0.3132 4600 1.3221 1.6020 0.9849
0.3200 4700 1.3209 1.6146 0.9824
0.3268 4800 1.354 1.6022 0.9824
0.3336 4900 1.3213 1.6136 0.9822
0.3404 5000 1.3484 1.5920 0.9807
0.3472 5100 1.3412 1.6106 0.978
0.3540 5200 1.3532 1.6001 0.9784
0.3608 5300 1.2984 1.6192 0.9762
0.3676 5400 1.3621 1.5850 0.98
0.3745 5500 1.2839 1.6158 0.9807
0.3813 5600 1.3664 1.6030 0.9831
0.3881 5700 1.327 1.6168 0.9822
0.3949 5800 1.3123 1.6040 0.982
0.4017 5900 1.3019 1.6092 0.9824
0.4085 6000 1.3908 1.5935 0.9829
0.4153 6100 1.3136 1.5916 0.9791
0.4221 6200 1.32 1.6091 0.9807
0.4289 6300 1.3018 1.6052 0.9827
0.4357 6400 1.3144 1.6083 0.9816
0.4425 6500 1.2865 1.6015 0.9829
0.4493 6600 1.2946 1.5882 0.9818
0.4562 6700 1.3245 1.5949 0.9824
0.4630 6800 1.3278 1.6081 0.9831
0.4698 6900 1.2842 1.6086 0.9836
0.4766 7000 1.3231 1.6170 0.9811
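
The Epoch values logged every 100 steps are consistent with the step count divided by the number of optimizer steps in one epoch. A quick check, assuming a single device and per_device_train_batch_size 8 over 117,502 training samples; epoch_fraction is a hypothetical helper:

```python
import math

steps_per_epoch = math.ceil(117_502 / 8)  # assumes a single device


def epoch_fraction(step):
    """Fraction of one epoch completed after the given optimizer step."""
    return round(step / steps_per_epoch, 4)


for step in (100, 1300, 3100, 7000):
    print(step, epoch_fraction(step))
```

By this arithmetic the last logged step, 7000, corresponds to a little under half an epoch.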

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.0
  • Transformers: 4.44.2
  • PyTorch: 2.4.0+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}