Model Card for KartonBERT-USE-base-v1
This universal sentence encoder converts text into 768-dimensional float vectors. It is intended for tasks involving sentence and document similarity.
Despite its small size (only 104 million parameters), the model stays competitive with considerably larger encoders (see the Evaluation section). It uses a lowercase-optimized tokenizer with a vocabulary of 23,000 tokens. This combination of compactness and quality makes it a reasonable choice for text encoding in applications where both speed and accuracy matter.
Model Description
- Developed by: Bartłomiej Orlik, https://www.linkedin.com/in/bartłomiej-orlik/
- Model type: BERT Universal Sentence Encoder
- Language: Polish
- License: GPL-3.0
- Trained from model: OrlikB/KartonBERT_base_uncased_v1: https://huggingface.co/OrlikB/KartonBERT_base_uncased_v1
How to Get Started with the Model
Use the code below to get started with the model.
Using Sentence-Transformers
You can use the model with sentence-transformers:
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('OrlikB/KartonBERT-USE-base-v1')
text_1 = 'Jestem wielkim fanem opakowań tekturowych'
text_2 = 'Bardzo podobają mi się kartony'
embeddings_1 = model.encode(text_1, normalize_embeddings=True)
embeddings_2 = model.encode(text_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
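Because the embeddings are normalized to unit length, the dot product above is the cosine similarity. The same idea extends to ranking several candidate sentences against a query. The following is a minimal pure-Python sketch, assuming toy 3-dimensional vectors in place of real 768-dimensional model outputs; the helper names `l2_normalize` and `rank_by_similarity` are hypothetical and not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted by descending dot product.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for model embeddings (assumption, not real outputs).
query = l2_normalize([1.0, 2.0, 0.0])
candidates = [
    l2_normalize([0.0, 0.0, 1.0]),  # orthogonal to the query
    l2_normalize([1.0, 2.0, 0.1]),  # nearly parallel to the query
    l2_normalize([2.0, 1.0, 0.0]),  # partially aligned
]

ranking = rank_by_similarity(query, candidates)
for index, score in ranking:
    print(index, round(score, 3))
```

With real embeddings, `query` and `candidates` would be replaced by vectors returned from `model.encode(..., normalize_embeddings=True)`.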
Using HuggingFace Transformers
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
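def encode_text(text):
    # Tokenize the input, truncating to the model's 512-token limit
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
    # L2-normalize so that the dot product equals cosine similarity
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.squeeze().numpy()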
cosine_similarity = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
tokenizer = AutoTokenizer.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model = AutoModel.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model.eval()
text_1 = 'Jestem wielkim fanem opakowań tekturowych'
text_2 = 'Bardzo podobają mi się kartony'
embeddings_1 = encode_text(text_1)
embeddings_2 = encode_text(text_2)
print(cosine_similarity(embeddings_1, embeddings_2))
Note: The encode_text function is intended for demonstration purposes. For better throughput, it is recommended to process texts in batches.
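One way to batch the Transformers path is sketched below. It assumes the same `tokenizer` and `model` objects loaded above are passed in, and that the list of texts is non-empty. The names `iter_batches` and `encode_batch` and the default `batch_size=32` are illustrative choices, not part of the model's API.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_batch(texts, tokenizer, model, batch_size=32):
    """Encode a non-empty list of texts; returns a (len(texts), hidden_size) tensor."""
    # Imported lazily so the batching helper stays usable without torch installed.
    import torch

    chunks = []
    for batch in iter_batches(texts, batch_size):
        # Padding aligns the sequences within this batch to a common length.
        encoded = tokenizer(batch, padding=True, truncation=True,
                            return_tensors='pt', max_length=512)
        with torch.no_grad():
            output = model(**encoded)
        # First-token ([CLS]) hidden state, as in encode_text.
        cls_embeddings = output[0][:, 0]
        chunks.append(torch.nn.functional.normalize(cls_embeddings, p=2, dim=1))
    return torch.cat(chunks, dim=0)
```

For example, `embeddings = encode_batch([text_1, text_2], tokenizer, model)` followed by `embeddings @ embeddings.T` would give the pairwise cosine similarity matrix.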
Evaluation
Rank | Model | Model Size (Million Parameters) | Memory Usage (GB, fp32) | Embedding Dimensions | Max Tokens | Average (26 datasets) | Classification Average (7 datasets) | Clustering Average (1 dataset) | PairClassification Average (4 datasets) | Retrieval Average (11 datasets) | STS Average (3 datasets) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | bge-multilingual-gemma2 | 9242 | 34.43 | 3584 | 8192 | 70 | 77.99 | 50.29 | 89.62 | 59.41 | 70.64 |
2 | gte-Qwen2-7B-instruct | 7613 | 28.36 | 3584 | 131072 | 67.86 | 77.84 | 51.36 | 88.48 | 54.69 | 70.86 |
3 | gte-Qwen2-1.5B-instruct | 1776 | 6.62 | 1536 | 131072 | 64.04 | 72.29 | 44.59 | 84.87 | 51.88 | 68.12 |
4 | jina-embeddings-v3 | 572 | 2.13 | 1024 | 8194 | 63.97 | 70.81 | 43.66 | 83.70 | 51.89 | 72.77 |
5 | jina-embeddings-v3 | 572 | 2.13 | 1024 | 8194 | 63.97 | 70.81 | 43.66 | 83.70 | 51.89 | 72.77 |
6 | mmlw-roberta-large | 435 | 1.62 | 1024 | 514 | 63.23 | 66.39 | 31.16 | 89.13 | 52.71 | 70.59 |
7 | KartonBERT-USE-base-v1 | 104 | 0.39 | 768 | 512 | 61.67 | 67.57 | 29.88 | 87.04 | 49.14 | 70.65 |
8 | mmlw-e5-large | 560 | 2.09 | 1024 | 514 | 61.17 | 61.07 | 30.62 | 85.90 | 52.63 | 69.98 |
9 | mmlw-roberta-base | 124 | 0.46 | 768 | 514 | 61.05 | 62.92 | 33.08 | 88.14 | 49.92 | 70.70 |
10 | multilingual-e5-large | 560 | 2.09 | 1024 | 514 | 60.08 | 63.82 | 33.88 | 85.50 | 48.98 | 66.91 |
11 | mmlw-e5-base | 278 | 1.04 | 768 | 514 | 59.71 | 59.52 | 30.25 | 86.16 | 50.06 | 70.13 |
12 | gte-multilingual-base | 305 | 1.14 | 768 | 8192 | 58.22 | 60.15 | 33.67 | 85.45 | 46.40 | 68.92 |
13 | st-polish-kartonberta-base-alpha-v1 | 124 | 0.46 | 768 | 514 | 56.92 | 60.44 | 32.85 | 87.92 | 42.19 | 69.47 |
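The Memory Usage (GB, fp32) column appears consistent with a simple estimate: the parameter count times 4 bytes per fp32 weight, divided by 2^30. This formula is inferred from the table values rather than stated by the card, and the function name below is hypothetical.

```python
def fp32_memory_gib(params_millions):
    """Estimate weight memory in GiB, assuming 4 bytes per fp32 parameter."""
    total_bytes = params_millions * 1_000_000 * 4
    return round(total_bytes / 2**30, 2)

print(fp32_memory_gib(104))   # 0.39, matches the KartonBERT-USE-base-v1 row
print(fp32_memory_gib(9242))  # 34.43, matches the bge-multilingual-gemma2 row
```

The estimate covers model weights only; activations and framework overhead at inference time are not included.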
More Information
If I have spare computing resources (GPU), I may improve the quality of the model by further training.
Evaluation results
- accuracy on MTEB AllegroReviews (default), test set (self-reported): 54.155
- f1 on MTEB AllegroReviews (default), test set (self-reported): 46.533
- f1_weighted on MTEB AllegroReviews (default), test set (self-reported): 54.684
- main_score on MTEB AllegroReviews (default), test set (self-reported): 54.155
- main_score on MTEB ArguAna-PL (default), test set (self-reported): 56.901
- map_at_1 on MTEB ArguAna-PL (default), test set (self-reported): 31.792
- map_at_10 on MTEB ArguAna-PL (default), test set (self-reported): 48.054
- map_at_100 on MTEB ArguAna-PL (default), test set (self-reported): 48.878
- map_at_1000 on MTEB ArguAna-PL (default), test set (self-reported): 48.882
- map_at_20 on MTEB ArguAna-PL (default), test set (self-reported): 48.737