base_model: BAAI/bge-base-en-v1.5
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:143
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: 'JSON APIs: Node.js'
    sentences:
      - 'Prerequisite course required: RESTful APIs: Node.js'
      - >-
        Course Name:JSON APIs: Node.js|Course Description:An introduction to
        JSON API, using Node.js.|Course language: JavaScript|Prerequisite course
        required: RESTful APIs: Node.js|Target Audience:Professionals who would
        like to learn the core concepts of JSON API, using Node.js.
      - An introduction to JSON API, using Node.js.
      - 'Course language: JavaScript'
      - >-
        Professionals who would like to learn the core concepts of JSON API,
        using Node.js.
  - source_sentence: Enzyme
    sentences:
      - >-
        For anyone who has built an application in React and wants to test the
        React components
      - >-
        A course that explores Enzyme, which is a JavaScript utility for React
        applications. The course equips users to simulate runs and test React
        components' outputs.
      - 'Prerequisite course required: React Testing Library'
      - 'Course language: TBD'
      - >-
        Course Name:Enzyme|Course Description:A course that explores Enzyme,
        which is a JavaScript utility for React applications. The course equips
        users to simulate runs and test React components' outputs.|Course
        language: TBD|Prerequisite course required: React Testing Library|Target
        Audience:For anyone who has built an application in React and wants to
        test the React components
  - source_sentence: 'React Ecosystem: State Management & Redux'
    sentences:
      - >-
        Course Name:React Ecosystem: State Management & Redux|Course
        Description:A course that builds on the React Ecosystem. It explains how
        state management works in React and goes over the Redux state management
        library|Course language: JavaScript|Prerequisite course required: React
        Ecosystem: Forms|Target Audience:Professionals who would like to learn
        about state management in React
      - 'Course language: JavaScript'
      - 'Prerequisite course required: React Ecosystem: Forms'
      - >-
        A course that builds on the React Ecosystem. It explains how state
        management works in React and goes over the Redux state management
        library
      - Professionals who would like to learn about state management in React
  - source_sentence: Ensemble Methods in Python
    sentences:
      - 'Course language: Python'
      - 'Prerequisite course required: Decision Trees'
      - >-
        This course covers an overview of ensemble learning methods like random
        forest and boosting. At the end of this course, students will be able to
        implement and compare random forest algorithm and boosting.
      - >-
        Professionals with some experience in building basic algorithms who
        would like to expand their skill set to more advanced Python
        classification techniques.
      - >-
        Course Name:Ensemble Methods in Python|Course Description:This course
        covers an overview of ensemble learning methods like random forest and
        boosting. At the end of this course, students will be able to implement
        and compare random forest algorithm and boosting.|Course language:
        Python|Prerequisite course required: Decision Trees|Target
        Audience:Professionals with some experience in building basic algorithms
        who would like to expand their skill set to more advanced Python
        classification techniques.
  - source_sentence: Visualizing Data with Matplotlib in Python
    sentences:
      - >-
        Professionals with basic Python experience who would like to expand
        their skill set to more Python visualization techniques and tools.
      - 'Prerequisite course required: Intro to Python'
      - 'Course language: Python'
      - >-
        Course Name:Visualizing Data with Matplotlib in Python|Course
        Description:This course covers the basics of data visualization and
        exploratory data analysis. It helps students learn different plots and
        their use cases.|Course language: Python|Prerequisite course required:
        Intro to Python|Target Audience:Professionals with basic Python
        experience who would like to expand their skill set to more Python
        visualization techniques and tools.
      - >-
        This course covers the basics of data visualization and exploratory data
        analysis. It helps students learn different plots and their use cases.

SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
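The architecture above pools with the [CLS] token only (pooling_mode_cls_token: True) and then applies Normalize(). A minimal sketch of those two steps, using plain Python lists and a made-up 3-dimensional toy input instead of real 768-dimensional BERT outputs, might look like this. The function names cls_pool and l2_normalize are illustrative and are not part of the library.

```python
import math


def cls_pool(token_embeddings):
    """Return a copy of the first token's embedding (the [CLS] position)."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit L2 length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy token embeddings (hypothetical values): one row per token, [CLS] first.
token_embeddings = [
    [3.0, 4.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 5.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.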

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("datasocietyco/bge-base-en-v1.5-course-recommender-v2")
# Run inference
sentences = [
    'Visualizing Data with Matplotlib in Python',
    'This course covers the basics of data visualization and exploratory data analysis. It helps students learn different plots and their use cases.',
    'Course language: Python',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
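For a recommender-style use, the similarity scores can be turned into a ranking. The sketch below assumes unit-length embeddings (which the Normalize() layer produces), so a dot product stands in for cosine similarity. It uses hypothetical 2-dimensional toy vectors rather than real model output, and rank_candidates is an illustrative helper, not a library function.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_candidates(query_embedding, candidate_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [dot(query_embedding, c) for c in candidate_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy unit vectors (hypothetical); real ones would come from model.encode(...).
query = [1.0, 0.0]
candidates = [
    [0.0, 1.0],  # orthogonal to the query
    [1.0, 0.0],  # identical to the query
    [0.6, 0.8],  # partially aligned
]

ranking = rank_candidates(query, candidates)
for index, score in ranking:
    print(index, round(score, 4))
```

With real data, the query would be the embedding of a course name or description and the candidates the embeddings of the course catalog.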

Training Details

Training Dataset

Unnamed Dataset

  • Size: 143 training samples
  • Columns: name, description, languages, prerequisites, target_audience, and combined
  • Approximate statistics based on the first 143 samples:
    • name (string): min 3 tokens, mean 7.82 tokens, max 17 tokens
    • description (string): min 13 tokens, mean 39.24 tokens, max 117 tokens
    • languages (string): min 6 tokens, mean 6.57 tokens, max 10 tokens
    • prerequisites (string): min 8 tokens, mean 12.85 tokens, max 22 tokens
    • target_audience (string): min 12 tokens, mean 23.02 tokens, max 54 tokens
    • combined (string): min 58 tokens, mean 94.5 tokens, max 187 tokens
  • Samples:
    • name: Reinforcement Learning
      description: This course covers the specialized branch of machine learning and deep learning called reinforcement learning (RL). By the end of this course students will be able to define RL use cases and real world scenarios where RL models are used, they will be able to create a simple RL model and evaluate its performance.
      languages: Course language: Python
      prerequisites: Prerequisite course required: Working with Complex Pre-trained CNNs in Python
      target_audience: Professionals some Python experience who would like to expand their skillset to more advanced machine learning algorithms for reinforcement learning.
      combined: Course Name:Reinforcement Learning
    • name: Optimizing Ensemble Methods in Python
      description: This course covers advanced topics in optimizing ensemble learning methods – specifically random forest and gradient boosting. Students will learn to implement base models and perform hyperparameter tuning to enhance the performance of models.
      languages: Course language: Python
      prerequisites: Prerequisite course required: Ensemble Methods in Python
      target_audience: Professionals experience in ensemble methods and who want to enhance their skill set in advanced Python classification techniques.
      combined: Course Name:Optimizing Ensemble Methods in Python
    • name: Fundamentals of Accelerated Computing with OpenACC
      description: Find out how to write and configure code parallelization with OpenACC, optimize memory movements between the CPU and GPU accelerator, and apply the techniques to accelerate a CPU-only Laplace Heat Equation to achieve performance gains.
      languages: Course language: Python
      prerequisites: No prerequisite course required
      target_audience: Professionals who want to learn how to write code, configure code parallelization with OpenACC, optimize memory movements between the CPU and GPU accelerator, and implement the workflow learnt for massive performance gains.
      combined: Course Name:Fundamentals of Accelerated Computing with OpenACC
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
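To illustrate what the two parameters above control, here is a minimal pure-Python sketch of a multiple-negatives ranking objective: each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by scale, and cross-entropy is taken with the matching positive as the target. It uses hypothetical toy vectors, covers only the in-batch anchor/positive case, and makes no claim about how this card's six columns were mapped to anchors and positives.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the in-batch candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical): aligned pairs give a loss near zero,
# mismatched pairs give a loss close to the scale value.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

A larger scale sharpens the softmax over candidates, so the same similarity gap produces a more confident ranking.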
    

Evaluation Dataset

Unnamed Dataset

  • Size: 36 evaluation samples
  • Columns: name, description, languages, prerequisites, target_audience, and combined
  • Approximate statistics based on the first 36 samples:
    • name (string): min 3 tokens, mean 7.92 tokens, max 13 tokens
    • description (string): min 13 tokens, mean 46.39 tokens, max 92 tokens
    • languages (string): min 6 tokens, mean 6.75 tokens, max 10 tokens
    • prerequisites (string): min 8 tokens, mean 13.47 tokens, max 20 tokens
    • target_audience (string): min 5 tokens, mean 23.75 tokens, max 54 tokens
    • combined (string): min 61 tokens, mean 103.28 tokens, max 165 tokens
  • Samples:
    • name: Intro to CSS, Part 2
      description: A course that continues to build on the foundational understanding of CSS syntax and allows students to work with responsive design and media queries.
      languages: Course language: CSS, HTML
      prerequisites: Prerequisite course required: Intro to CSS, Part 1
      target_audience: Professionals who would like to continue learning the core concepts of CSS and be able to style simple web pages.
      combined: Course Name:Intro to CSS, Part 2
    • name: Foundations of Statistics in Python
      description: This course is designed for learners who would like to learn about statistics and apply it for decision-making. This course is a comprehensive review of statistical terms ranging from foundational (mean, median, mode, standard deviation, variance, covariance, correlation) to more complex concepts such as normality in data, confidence intervals, and p-values. Additional topics include how to calculate summary statistics and how to carry out hypothesis testing to inform decisions.
      languages: Course language: Python
      prerequisites: Prerequisite course required: Intro to Visualization in Python
      target_audience: Professionals some Python experience who would like to expand their skill set to more advanced Python visualization techniques and tools.
      combined: Course Name:Foundations of Statistics in Python
    • name: Spherical k-Means and Hierarchical Clustering in R
      description: This course covers the unsupervised learning method called clustering which is used to find patterns or groups in data without the need for labelled data. This course includes different methods of clustering on numerical data including density-based and hierarchical-based clustering and how to build, evaluate and interpret these models.
      languages: Course language: R
      prerequisites: Prerequisite course required: Intro to Clustering in R
      target_audience: Professionals with some R experience who would like to expand their skillset to more clustering techniques like hierarchical clustering and DBSCAN.
      combined: Course Name:Spherical k-Means and Hierarchical Clustering in R
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
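The combined column appears to be the other five fields joined with "|" separators, judging from the widget examples in the card metadata. The sketch below reconstructs that layout under this assumption. build_combined is a hypothetical helper, and exact whitespace in the original data may differ.

```python
def build_combined(name, description, languages, prerequisites, target_audience):
    """Join the course fields in the pipe-separated layout seen in the card.

    The languages and prerequisites fields already carry their own labels
    (for example "Course language: Python"), so they are inserted unchanged.
    """
    return "|".join(
        [
            f"Course Name:{name}",
            f"Course Description:{description}",
            languages,
            prerequisites,
            f"Target Audience:{target_audience}",
        ]
    )


# Field values taken from the "JSON APIs: Node.js" widget example.
combined = build_combined(
    name="JSON APIs: Node.js",
    description="An introduction to JSON API, using Node.js.",
    languages="Course language: JavaScript",
    prerequisites="Prerequisite course required: RESTful APIs: Node.js",
    target_audience=(
        "Professionals who would like to learn the core concepts "
        "of JSON API, using Node.js."
    ),
)
print(combined)
```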
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 3e-06
  • max_steps: 64
  • warmup_ratio: 0.1
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-06
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 64
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step  Training Loss  loss
2.2222  20    1.5188         1.1718
4.4444  40    1.0652         0.8327
6.6667  60    0.677          0.7192
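The fractional epoch values in the log are consistent with 9 optimizer steps per epoch, which is what 143 training samples in batches of 16 give when the last partial batch is kept (dataloader_drop_last: False). The short calculation below reproduces them. It is a sanity check under that assumption, since the no_duplicates batch sampler could in principle change the batch count.

```python
import math

train_samples = 143  # size of the training dataset
batch_size = 16      # per_device_train_batch_size
max_steps = 64       # max_steps from the hyperparameters

# The last partial batch is kept, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
print("steps per epoch:", steps_per_epoch)

# Epoch value at each logged step.
logged_epochs = {step: round(step / steps_per_epoch, 4) for step in (20, 40, 60)}
for step, epoch in logged_epochs.items():
    print(step, epoch)

# Approximate number of passes over the data by the final step.
total_epochs = max_steps / steps_per_epoch
print("epochs at max_steps:", round(total_epochs, 2))
```

So the run covers roughly seven passes over the training data, not the three suggested by num_train_epochs.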

Framework Versions

  • Python: 3.9.13
  • Sentence Transformers: 3.1.1
  • Transformers: 4.45.1
  • PyTorch: 2.2.2
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.20.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}