|
# ember-v1 |
|
|
|
<p align="center"> |
|
<img src="https://console.llmrails.com/assets/img/logo-black.svg" width="150px"> |
|
</p> |
|
|
|
This model is trained on a large-scale corpus of text relevance pairs covering a wide range of domains, including finance, science, medicine, and law. During training we applied techniques from the RetroMAE and SetFit papers.
|
|
|
We also provide it as an API service on our own platform; feel free to sign up: [LLMRails](https://llmrails.com/?ref=ember-v1).
|
|
|
### Plans |
|
- A paper will be published soon
- v2 is on its way, with a 4k maximum sequence length
|
|
|
## Usage |
|
Use with API request: |
|
```bash
curl --location 'https://api.llmrails.com/v1/embeddings' \
--header 'X-API-KEY: {token}' \
--header 'Content-Type: application/json' \
--data '{
    "input": ["This is an example sentence"],
    "model": "embedding-english-v1"
}'
```

The `embedding-english-v1` model ID is equivalent to `ember-v1`.
|
API docs: https://docs.llmrails.com/embedding/embed-text |
|
Langchain plugin: https://python.langchain.com/docs/integrations/text_embedding/llm_rails |
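
For LangChain users, here is a minimal sketch using the `LLMRailsEmbeddings` integration; the import path and constructor arguments follow the `langchain_community` package and may differ in your installed version, so treat them as assumptions and check the docs linked above:

```python
# A minimal sketch, assuming the LLMRailsEmbeddings class from
# langchain_community and a valid LLMRails API key.
from langchain_community.embeddings import LLMRailsEmbeddings

embeddings = LLMRailsEmbeddings(api_key="{token}", model="embedding-english-v1")

query_vector = embeddings.embed_query("This is an example sentence")
doc_vectors = embeddings.embed_documents(["Each sentence is converted"])
print(len(query_vector))  # embedding dimension
```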
|
|
|
Use with transformers: |
|
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then mean-pool over the real tokens.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = [
    "This is an example sentence",
    "Each sentence is converted"
]

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")
model = AutoModel.from_pretrained("llmrails/ember-v1")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# With normalized embeddings, the dot product is cosine similarity;
# the factor of 100 just rescales the scores for readability.
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
|
|
|
Use with sentence-transformers: |
|
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('llmrails/ember-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```
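
If you need unit-length vectors for dot-product retrieval, `model.encode(sentences, normalize_embeddings=True)` returns L2-normalized embeddings directly.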
|
|
|
## Massive Text Embedding Benchmark (MTEB) Evaluation |
|
Our model achieves near state-of-the-art performance on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard):
|
|
|
| Model Name | Dimension | Sequence Length | Average (56 datasets) |
|:---:|:---:|:---:|:---:|
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1024 | 512 | 64.23 |
| [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 768 | 512 | 63.55 |
| [ember-v1](https://huggingface.co/llmrails/ember-v1) | 1024 | 512 | **63.54** |
| [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings/types-of-embedding-models) | 1536 | 8191 | 60.99 |
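
To evaluate on individual MTEB tasks yourself, here is a minimal sketch using the open-source `mteb` package; the task chosen is illustrative, not the full 56-dataset benchmark:

```python
# A minimal sketch of running one MTEB task against ember-v1; the task
# selection here is illustrative, not the full benchmark suite.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llmrails/ember-v1")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/ember-v1")
```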
|
|
|
### Limitations
|
|
|
This model supports English text only, and inputs longer than 512 tokens are truncated.
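
To embed documents longer than 512 tokens, a common workaround (not part of the original card) is to split the text into chunks that fit the window and embed each chunk separately; a minimal sketch:

```python
# A minimal sketch of a common chunking workaround, not an official
# recommendation: split long text into pieces that fit the 512-token window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")

def chunk_text(text: str, max_tokens: int = 510) -> list[str]:
    # 510 leaves room for the special tokens added when chunks are re-encoded.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(ids[i:i + max_tokens])
        for i in range(0, len(ids), max_tokens)
    ]
```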