jupyterjazz committed on
Commit 16c35f6 · verified · 1 Parent(s): 3406ca1

readme-adjustments (#21)

- adjust readme (ad320ec2a47056efb688f6da1bacc6be2190c90e)

Files changed (1)
  1. README.md +6 -7
README.md CHANGED
@@ -21524,7 +21524,7 @@ model-index:
 
 
  <p align="center">
- <b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
  </p>
 
  <p align="center">
@@ -21555,7 +21555,7 @@ Additionally, it features 5 LoRA adapters to generate task-specific embeddings e
  - **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
 
  ### Supported Languages:
- While the foundation model supports 89 languages, we've focused our tuning efforts on the following 30 languages:
  **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
  Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
  Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
@@ -21598,9 +21598,11 @@ tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
  model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
 
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
-
  with torch.no_grad():
-     model_output = model(**encoded_input, task='retrieval.query')
 
  embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
  embeddings = F.normalize(embeddings, p=2, dim=1)
@@ -21661,9 +21663,6 @@ embeddings = model.encode(['Sample text'], truncate_dim=256)
  ```
 
 
- Note that the `truncate_dim` could be any integer between 1 and 1024 for the `separation`, `classification`, and `text-matching` tasks. As for the `retrieval.passage` and `retrieval.query` tasks, the value must be larger than the length of the instruction prompt. By default, the value must be larger than 9 for the `retrieval.passage` task and larger than 12 for the `retrieval.query` task.
-
-
  The latest version (3.1.0) of [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) also supports `jina-embeddings-v3`:
 
  ```bash
 
 
 
  <p align="center">
+ <b>The embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
  </p>
 
  <p align="center">
 
  - **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
 
  ### Supported Languages:
+ While the foundation model supports 100 languages, we've focused our tuning efforts on the following 30 languages:
  **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
  Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
  Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
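
Reviewer sketch for the Matryoshka bullet in this hunk: truncating an embedding can be pictured as keeping the leading dimensions and rescaling the result to unit length. The helper name `truncate_embedding` and the re-normalization step are assumptions made for illustration; the README only lists the supported sizes and does not say how the model truncates internally.

```python
import math

# Sizes listed in the README's Matryoshka bullet.
MATRYOSHKA_DIMS = (32, 64, 128, 256, 512, 768, 1024)


def truncate_embedding(vec, dim):
    """Keep the first `dim` values of `vec` and L2-normalize them.

    Hypothetical helper: restricting `dim` to the listed sizes and
    re-normalizing afterwards are illustrative choices, not README behavior.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if dim > len(vec):
        raise ValueError("dim exceeds the embedding length")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]


full = [1.0] * 1024          # stand-in for a real 1024-dimensional embedding
short = truncate_embedding(full, 256)
```

In practice the README's own `model.encode(..., truncate_dim=256)` call handles this; the sketch only shows what a shorter vector looks like downstream.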
 
  model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
 
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+ task = 'retrieval.query'
+ task_id = model._adaptation_map[task]
+ adapter_mask = torch.full((len(sentences),), task_id, dtype=torch.int32)
  with torch.no_grad():
+     model_output = model(**encoded_input, adapter_mask=adapter_mask)
 
  embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
  embeddings = F.normalize(embeddings, p=2, dim=1)
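
Reviewer sketch of the change in this hunk: the `task` keyword is replaced by an explicit `adapter_mask` holding one integer adapter id per input sentence. The snippet below mimics that lookup in plain Python. `ADAPTATION_MAP` and `build_adapter_mask` are hypothetical stand-ins, and the ids are made up; the real mapping has to be read from `model._adaptation_map` on the loaded model.

```python
# Hypothetical stand-in for model._adaptation_map (task name -> adapter id).
# The task names appear in the README; the ids here are illustrative only.
ADAPTATION_MAP = {
    "retrieval.query": 0,
    "retrieval.passage": 1,
    "separation": 2,
    "classification": 3,
    "text-matching": 4,
}


def build_adapter_mask(num_texts, task, adaptation_map):
    """Return a list with the adapter id for `task`, repeated once per text."""
    if task not in adaptation_map:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(adaptation_map)}"
        )
    return [adaptation_map[task]] * num_texts


sentences = ["How is the weather today?", "What is the current weather like?"]
mask = build_adapter_mask(len(sentences), "retrieval.query", ADAPTATION_MAP)
```

Wrapping the list as `torch.tensor(mask, dtype=torch.int32)` yields the same values that `torch.full((len(sentences),), task_id, dtype=torch.int32)` produces in the diff.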
 
  ```
 
 
  The latest version (3.1.0) of [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) also supports `jina-embeddings-v3`:
 
  ```bash