chore-add-mmteb (#62), opened by bwang0911

README.md CHANGED
@@ -25015,7 +25015,7 @@ model-index:
 <br><br>

 <p align="center">
-  <img src="https://huggingface.co/
+  <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
 </p>

@@ -25029,7 +25029,7 @@ model-index:

 ## Quick Start

-[Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-embeddings-v3
+[Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-embeddings-v3) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings)


 ## Intended Usage & Model Info
@@ -25056,13 +25056,6 @@ While the foundation model supports 100 languages, we've focused our tuning effo
 Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
 Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**

-
-> **⚠️ Important Notice:**
-> We fixed a bug in the `encode` function [#60](https://huggingface.co/jinaai/jina-embeddings-v3/discussions/60) where **Matryoshka embedding truncation** occurred *after normalization*, leading to non-normalized truncated embeddings. This issue has been resolved in the latest code revision.
->
-> If you have encoded data using the previous version and wish to maintain consistency, please use the specific code revision when loading the model: `AutoModel.from_pretrained('jinaai/jina-embeddings-v3', code_revision='da863dd04a4e5dce6814c6625adfba87b83838aa', ...)`
-
-
 ## Usage

 **<details><summary>Apply mean pooling when integrating the model.</summary>**
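The notice deleted in the hunk above concerns the order of Matryoshka truncation and L2 normalization: truncating an already-normalized vector leaves it with a norm below 1, whereas truncating first and normalizing afterwards yields unit vectors at every output dimension. A minimal numpy sketch of the difference (illustrative toy data, not the model's actual `encode` implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 1024))  # toy full-dimension embeddings

# Buggy order: normalize, then truncate. The kept 64 dims carry only part
# of the original norm, so the truncated vectors are no longer unit-length.
normalized = emb / np.linalg.norm(emb, axis=1, keepdims=True)
truncated_after = normalized[:, :64]
print(np.linalg.norm(truncated_after, axis=1))  # well below 1.0

# Fixed order: truncate, then normalize. Unit vectors at the reduced dimension.
truncated = emb[:, :64]
truncated_norm = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(np.linalg.norm(truncated_norm, axis=1))  # ~1.0
```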
@@ -25213,15 +25206,6 @@ import onnxruntime
 import numpy as np
 from transformers import AutoTokenizer, PretrainedConfig

-# Mean pool function
-def mean_pooling(model_output: np.ndarray, attention_mask: np.ndarray):
-    token_embeddings = model_output
-    input_mask_expanded = np.expand_dims(attention_mask, axis=-1)
-    input_mask_expanded = np.broadcast_to(input_mask_expanded, token_embeddings.shape)
-    sum_embeddings = np.sum(token_embeddings * input_mask_expanded, axis=1)
-    sum_mask = np.clip(np.sum(input_mask_expanded, axis=1), a_min=1e-9, a_max=None)
-    return sum_embeddings / sum_mask
-
 # Load tokenizer and model config
 tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
 config = PretrainedConfig.from_pretrained('jinaai/jina-embeddings-v3')
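The `mean_pooling` helper removed in the hunk above averages token embeddings over the positions where the attention mask is 1, so padding tokens do not influence the sentence embedding. A small self-contained check of that behavior on toy shapes (not real model output):

```python
import numpy as np

def mean_pooling(model_output: np.ndarray, attention_mask: np.ndarray):
    # Average token embeddings, counting only unmasked (mask == 1) positions.
    token_embeddings = model_output
    input_mask_expanded = np.expand_dims(attention_mask, axis=-1)
    input_mask_expanded = np.broadcast_to(input_mask_expanded, token_embeddings.shape)
    sum_embeddings = np.sum(token_embeddings * input_mask_expanded, axis=1)
    sum_mask = np.clip(np.sum(input_mask_expanded, axis=1), a_min=1e-9, a_max=None)
    return sum_embeddings / sum_mask

# One sequence of three tokens; the last token is padding (mask == 0).
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pooling(tokens, mask))  # [[2. 3.]]: mean of the two unmasked tokens
```

Note how the padded token's large values are excluded entirely rather than diluting the average.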
@@ -25243,11 +25227,7 @@ inputs = {
 }

 # Run model
 outputs = session.run(None, inputs)
-
-# Apply mean pooling and normalization to the model outputs
-embeddings = mean_pooling(outputs, input_text["attention_mask"])
-embeddings = embeddings / np.linalg.norm(embeddings, ord=2, axis=1, keepdims=True)
 ```

 </p>
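The post-processing removed in this last hunk L2-normalized the pooled embeddings row-wise; once vectors are unit-length, cosine similarity reduces to a plain dot product. A short illustration with toy vectors (hypothetical values, unrelated to actual model outputs):

```python
import numpy as np

embeddings = np.array([[3.0, 4.0], [1.0, 0.0]])

# Row-wise L2 normalization, as in the removed post-processing step.
normed = embeddings / np.linalg.norm(embeddings, ord=2, axis=1, keepdims=True)
print(normed[0])  # [0.6 0.8]

# For unit vectors, cosine similarity is just the dot product.
cos = normed @ normed.T
print(cos[0, 1])  # 0.6
```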