Marsilia-Embeddings-EN-Base 🚀

Introduction 🌟

Marsilia-Embeddings-EN-Base is an English-language embedding model fine-tuned for financial-domain tasks. It serves as a proof of concept showing how much task-specific fine-tuning of the embedding model matters in Retrieval-Augmented Generation (RAG) applications.

By focusing on the financial domain, Marsilia-Embeddings-EN-Base achieves performance on financial tasks that surpasses even closed-source models such as OpenAI's embeddings, while being more cost-effective to run. This shows how targeted fine-tuning can make open-source models competitive with, or even superior to, proprietary alternatives in specialized domains.

Model Details 📊

  • Model Type: Sentence Transformer
  • Language: English 🇬🇧
  • Base Model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768
  • Similarity Function: Cosine Similarity
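
For readers unfamiliar with the similarity function listed above, here is a minimal, dependency-free sketch of cosine similarity. The function name `cosine_similarity` and the 3-dimensional toy vectors are illustrative assumptions; the model itself produces 768-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors stand in for the model's 768-dimensional output.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

print(cosine_similarity(a, b))  # close to 1: direction matters, length does not
print(cosine_similarity(a, c))  # 0.0: orthogonal vectors
```

Because the score depends only on direction, scaling an embedding does not change how it ranks against others.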

Usage 💻

To use this model with the Sentence Transformers library:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sujet-ai/Marsilia-Embeddings-EN-Base")

# Run inference
sentences = [
    'What are the key factors affecting the performance of corporate bonds in the current market?',
    "The corporate bond market has been influenced by several factors in recent months. Interest rates set by central banks have a significant impact, as rising rates tend to decrease bond prices and increase yields. Economic indicators such as GDP growth, inflation rates, and employment figures also play a role in shaping investor sentiment and corporate financial health. Industry-specific trends and individual company performance are crucial, with factors like earnings reports, credit ratings, and debt levels affecting bond valuations. Global events, including geopolitical tensions and trade policies, can create market volatility. Liquidity in the bond market and overall investor risk appetite are additional considerations. It's important for investors to monitor these various factors when assessing corporate bond performance.",
    'CORPORATE BOND HOLDINGS (Continued) Principal Amount (000) Coupon Rate Maturity Date Market Value ($000) Vanguard Short-Term Corporate Bond ETF Bank of America Corp. 2,285 5.015% 1/22/24 2,285 JPMorgan Chase & Co. 2,250 3.875% 2/1/24 2,249 Goldman Sachs Group Inc. 2,200 3.750% 2/25/24 2,197 Morgan Stanley 2,190 3.875% 1/27/24 2,189 Citigroup Inc. 2,145 3.875% 3/26/24 2,141 Wells Fargo & Co. 2,100 3.750% 1/24/24 2,099 Bank of America Corp. 2,050 4.000% 4/1/24 2,047 Truist Bank 2,000 3.800% 10/30/23 2,000 PNC Bank NA 1,950 3.800% 7/25/23 1,950 U.S. Bancorp 1,900 3.375% 2/5/24 1,896 Bank of America Corp. 1,850 4.125% 1/22/24 1,850 Morgan Stanley 1,800 3.737% 4/24/24 1,795 Citigroup Inc. 1,750 3.668% 7/24/24 1,740 Goldman Sachs Group Inc. 1,700 3.625% 1/22/23 1,700 Wells Fargo & Co. 1,650 3.550% 8/14/23 1,650 JPMorgan Chase & Co. 1,600 3.875% 9/10/24 1,593'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
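
In the example above, the first sentence is a query and the other two are candidate passages, so the first row of the similarity matrix holds the query-to-passage scores. The sketch below shows one way to turn that row into a ranking. The helper `rank_passages` is hypothetical and the scores in `toy_similarities` are invented placeholders, not output from this model; a tensor returned by `model.similarity` could be converted to nested lists with `.tolist()` first.

```python
# Placeholder 3x3 similarity matrix. These numbers are made up for
# illustration; real values would come from model.similarity(...).
toy_similarities = [
    [1.00, 0.71, 0.48],  # row 0: the query compared with every sentence
    [0.71, 1.00, 0.39],
    [0.48, 0.39, 1.00],
]


def rank_passages(similarity_matrix, query_index=0):
    """Return (sentence_index, score) pairs for all non-query rows, best first."""
    row = similarity_matrix[query_index]
    scored = [(i, score) for i, score in enumerate(row) if i != query_index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_passages(toy_similarities)
print(ranking)  # [(1, 0.71), (2, 0.48)]
```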

Intended Use 🎯

This model is designed for generating sentence embeddings for English text, particularly in the financial domain. It can be used for various natural language processing tasks such as semantic search, clustering, and information retrieval.

Training Data 📚

The model was fine-tuned on the sujet-ai/Sujet-Financial-RAG-EN-Dataset. This dataset consists of question-context pairs in English, focusing on financial topics.

Training Procedure 🛠️

Training Hyperparameters

  • Loss Function: MultipleNegativesRankingLoss
    • Scale: 20.0
    • Similarity Function: Cosine Similarity
  • Evaluation Strategy: Steps
  • Per Device Train Batch Size: 200
  • Per Device Eval Batch Size: 200
  • Number of Train Epochs: 10
  • Batch Sampler: no_duplicates
  • Multi Dataset Batch Sampler: round_robin
  • Scheduler: Warmup cosine
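
The idea behind the loss function listed above can be sketched in a few lines: for each question in a batch, its paired context is the positive and every other context in the same batch is treated as a negative, with cosine scores multiplied by the scale (20.0) before a softmax cross-entropy. This is a simplified pure-Python illustration under those assumptions, not the Sentence Transformers implementation; the function name `in_batch_ranking_loss` and the 2-dimensional toy vectors are hypothetical.

```python
import math


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positives[j] in the batch acts as a negative."""
    if not anchors or len(anchors) != len(positives):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * _cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the scaled scores.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair (index i).
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch: each question vector points roughly at its own context vector.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # deliberately mismatched pairs

loss_aligned = in_batch_ranking_loss(questions, aligned)
loss_swapped = in_batch_ranking_loss(questions, swapped)
print(loss_aligned < loss_swapped)  # True
```

With in-batch negatives, a larger batch gives each question more negatives to be ranked against, and the no_duplicates batch sampler helps keep identical texts from appearing as false negatives within one batch.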

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.1
  • Transformers: 4.42.3
  • PyTorch: 2.5.0.dev20240704+cu124
  • Accelerate: 0.32.1
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Evaluation 📈

The model was evaluated using the InformationRetrievalEvaluator on the test split of the sujet-ai/Sujet-Financial-RAG-EN-Dataset.
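
Information retrieval evaluation of this kind is typically summarized with rank-based metrics such as recall@k and MRR@k. The following is a minimal sketch of how those two metrics are computed for a single query; the function names and document IDs are hypothetical, the ranked list is assumed to contain no duplicates, and no scores from this model are implied.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval result for one query (IDs are made up).
ranked = ["doc_7", "doc_2", "doc_9", "doc_4"]
relevant = {"doc_2", "doc_4"}

print(recall_at_k(ranked, relevant, k=3))           # 0.5
print(reciprocal_rank_at_k(ranked, relevant, k=3))  # 0.5
```

Averaging the reciprocal rank over all test queries gives MRR@k.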

Limitations ⚠️

The model is specifically trained on English financial texts and may not perform optimally on other domains or languages. Users should be aware of potential biases present in the training data.

Citation 📄

If you use this model in your research or applications, please cite:

@software{Marsilia-Embeddings-EN-Base,
  author = {{Sujet AI} and Allaa Boutaleb and Hamed Rahimi},
  title = {Marsilia-Embeddings-EN-Base: A fine-tuned English embedding model for financial texts},
  year = {2024},
  url = {https://huggingface.co/sujet-ai/Marsilia-Embeddings-EN-Base}
}

Contact Information 📧

For questions, feedback, or collaborations, please reach out to us on LinkedIn or visit our website https://sujet.ai.
