jupyterjazz committed on
Commit
8177662
·
verified ·
1 Parent(s): e59450e

readme: usage (#3)


- readme: usage and performance (af01f51748e574f8bd01490f78cc32abbf9cfd18)
- Update README.md (8ea62b5d6e1c04d42979f1771fed5965ee2bd300)
- Update README.md (a8ff6b60acaa14060e37bf56e6381a79c44c202d)
- Update README.md (23100f3589e4abdb6a8d7918ddde8c3c3ce72133)

Files changed (1)
  1. README.md +88 -18
README.md CHANGED
@@ -66,7 +66,7 @@ language:
66
  - my
67
  - ne
68
  - nl
69
- - 'no'
70
  - om
71
  - or
72
  - pa
@@ -160,37 +160,107 @@ The data and training details are described in the technical report (coming soon
160
 
161
  ## Usage
162
 
163
- 1. The easiest way to starting using jina-clip-v1-en is to use Jina AI's [Embeddings API](https://jina.ai/embeddings/).
164
- 2. Alternatively, you can use Jina CLIP directly via transformers package.
165
 
166
  ```python
167
- !pip install transformers einops flash_attn
168
  from transformers import AutoModel
169
 
170
  # Initialize the model
171
  model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
172
 
173
- # New meaningful sentences
174
- sentences = [
175
- "Organic skincare for sensitive skin with aloe vera and chamomile.",
176
- "New makeup trends focus on bold colors and innovative techniques",
177
- "Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
178
- "Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
179
- "Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
180
- "Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
181
- "针对敏感肌专门设计的天然有机护肤产品",
182
- "新的化妆趋势注重鲜艳的颜色和创新的技巧",
183
- "敏感肌のために特別に設計された天然有機スキンケア製品",
184
- "新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
185
  ]
186
 
187
- # Encode sentences
188
- embeddings = model.encode(sentences, truncate_dim=1024, task_type='index') # TODO UPDATE
 
 
189
 
190
  # Compute similarities
191
  print(embeddings[0] @ embeddings[1].T)
192
  ```
193
 
194
 
195
  ## Performance
196
 
 
66
  - my
67
  - ne
68
  - nl
69
+ - no
70
  - om
71
  - or
72
  - pa
 
160
 
161
  ## Usage
162
 
163
+ **<details><summary>Apply mean pooling when integrating the model.</summary>**
164
+ <p>
165
+
166
+ ### Why Use Mean Pooling?
167
+
168
+ Mean pooling takes all token embeddings from the model's output and averages them at the sentence or paragraph level.
169
+ This approach has been shown to produce high-quality sentence embeddings.
170
+
171
+ We provide an `encode` function that handles this for you automatically.
172
+
173
+ However, if you're working with the model directly, outside of the `encode` function,
174
+ you'll need to apply mean pooling manually. Here's how you can do it:
175
+
176
 
177
  ```python
178
+ import torch
179
+ import torch.nn.functional as F
180
+ from transformers import AutoTokenizer, AutoModel
181
+
182
+ def mean_pooling(model_output, attention_mask):
183
+     token_embeddings = model_output[0]
184
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
185
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
186
+
187
+ sentences = ['How is the weather today?', 'What is the current weather like today?']
188
+
189
+ tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
190
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
191
+
192
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
193
+
194
+ with torch.no_grad():
195
+     model_output = model(**encoded_input)
196
+
197
+ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
198
+ embeddings = F.normalize(embeddings, p=2, dim=1)
199
+ ```
200
+
201
+ </p>
202
+ </details>
203
+
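To make the role of the attention mask concrete, here is a minimal pure-Python sketch of the same masked average on a toy input. The helper name `masked_mean` and the numbers are illustrative assumptions, not part of the model's API; it mirrors the arithmetic of `mean_pooling` for a single sequence.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            totals[i] += vector[i] * keep
    # Guard against an all-zero mask, like torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Two real tokens followed by one padding token (mask value 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is large on purpose: because its mask value is 0, it has no effect on the pooled result.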
204
+ The easiest way to start using `jina-embeddings-v3` is Jina AI's [Embeddings API](https://jina.ai/embeddings/).
205
+
206
+ Alternatively, you can use `jina-embeddings-v3` directly via the Transformers package:
207
+ ```python
208
+ !pip install transformers
209
  from transformers import AutoModel
210
 
211
  # Initialize the model
212
  model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
213
 
214
+ texts = [
215
+     'Follow the white rabbit.', # English
216
+     'Sigue al conejo blanco.', # Spanish
217
+     'Suis le lapin blanc.', # French
218
+     '跟着白兔走。', # Chinese
219
+     'اتبع الأرنب الأبيض.', # Arabic
220
+     'Folge dem weißen Kaninchen.' # German
221
  ]
222
 
223
+ # When calling the `encode` function, you can choose a `task_type` based on the use case:
224
+ # 'retrieval.query', 'retrieval.passage', 'separation', 'classification', 'text-matching'
225
+ # Alternatively, you can choose not to pass a `task_type`, and no specific LoRA adapter will be used.
226
+ embeddings = model.encode(texts, task_type='text-matching')
227
 
228
  # Compute similarities
229
  print(embeddings[0] @ embeddings[1].T)
230
  ```
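The expression `embeddings[0] @ embeddings[1].T` is a dot product, which equals cosine similarity only when both vectors have unit length. Whether `encode` returns normalized vectors is not stated here, so the sketch below normalizes explicitly. `cosine_similarity` is a hypothetical helper written with the standard library only, and the vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embeddings.
u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice the length
w = [-4.0, 3.0]  # orthogonal to u

print(cosine_similarity(u, v))  # 1.0
print(cosine_similarity(u, w))  # 0.0
```

Because the score depends only on direction, scaling a vector (as with `v`) leaves it unchanged.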
231
 
232
+ By default, the model supports a maximum sequence length of 8192 tokens.
233
+ However, if you want to truncate your input texts to a shorter length, you can pass the `max_length` parameter to the `encode` function:
234
+ ```python
235
+ embeddings = model.encode(
236
+     ['Very long ... document'],
237
+     max_length=2048
238
+ )
239
+ ```
240
+
241
+ In case you want to use **Matryoshka embeddings** and switch to a different dimension,
242
+ you can adjust it by passing the `truncate_dim` parameter to the `encode` function:
243
+ ```python
244
+ embeddings = model.encode(
245
+     ['Sample text'],
246
+     truncate_dim=256
247
+ )
248
+ ```
249
+
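Conceptually, Matryoshka truncation keeps the leading components of an embedding and discards the rest. The following is a minimal sketch, assuming a plain Python list as the embedding and assuming the truncated vector is rescaled to unit length afterwards (a common practice; whether `encode` does this internally is not stated here). `truncate_embedding` is a hypothetical helper, not part of the model's API.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5]
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

Smaller dimensions trade some accuracy for lower storage and faster similarity search, so the right value depends on the application.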
250
+ The latest version (#todo: specify version) of SentenceTransformers also supports `jina-embeddings-v3`:
251
+
252
+ ```python
253
+ !pip install -U sentence-transformers
254
+ from sentence_transformers import SentenceTransformer
255
+
256
+ model = SentenceTransformer(
257
+     "jinaai/jina-embeddings-v3", trust_remote_code=True
258
+ )
259
+
260
+ embeddings = model.encode(['How is the weather today?'], task_type='retrieval.query')
261
+ ```
262
+
263
+
264
 
265
  ## Performance
266