cheesyFishes committed (verified)
Commit 719ef6e · 1 Parent(s): d6e8ec9

Update README.md

Files changed (1):
  1. README.md +59 -1
README.md CHANGED
@@ -30,7 +30,10 @@ To know more about the model, read the [announcement blogpost](https://huggingfa
 
 # Usage
 
-**Initialize model and processor**
+<details>
+<summary>
+via HuggingFace Transformers
+</summary>
 
 ```python
 from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
@@ -46,6 +49,7 @@ min_pixels = 1 * 28 * 28
 # Load the embedding model and processor
 model = Qwen2VLForConditionalGeneration.from_pretrained(
     'llamaindex/vdr-2b-multi-v1',
+    # These are the recommended kwargs for the model, but change them as needed
     attn_implementation="flash_attention_2",
     torch_dtype=torch.bfloat16,
     device_map="cuda:0"
@@ -105,6 +109,7 @@ def encode_queries(queries: list[str], dimension: int) -> torch.Tensor:
 ```
 
 **Encode documents**
+
 ```python
 def round_by_factor(number: float, factor: int) -> int:
     return round(number / factor) * factor
@@ -167,6 +172,59 @@ def encode_documents(documents: list[Image.Image], dimension: int):
     return torch.nn.functional.normalize(embeddings[:, :dimension], p=2, dim=-1)
 ```
 
+</details>
+
+<details>
+<summary>
+via LlamaIndex
+</summary>
+
+```bash
+pip install -U llama-index-embeddings-huggingface
+```
+
+```python
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+model = HuggingFaceEmbedding(
+    model_name="llamaindex/vdr-2b-multi-v1",
+    device="mps",
+    trust_remote_code=True,
+)
+
+embeddings = model.get_image_embedding("image.png")
+```
+
+</details>
+
+
+<details>
+<summary>
+via SentenceTransformers
+</summary>
+
+```python
+import torch
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer(
+    model_name_or_path="llamaindex/vdr-2b-multi-v1",
+    device="mps",
+    trust_remote_code=True,
+    # These are the recommended kwargs for the model, but change them as needed
+    model_kwargs={
+        "torch_dtype": torch.bfloat16,
+        "device_map": "cuda:0",
+        "attn_implementation": "flash_attention_2"
+    },
+)
+
+embeddings = model.encode("image.png")
+```
+
+</details>
+
+
+
 # Training
 
 The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
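For intuition, DSE-style training with hard-mined negatives reduces to an InfoNCE contrastive loss: each query embedding is scored against its matching page image, the other in-batch pages, and the mined negatives. Below is a minimal sketch of that objective; the function name, tensor shapes, and temperature value are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def dse_contrastive_loss(
    query_emb: torch.Tensor,    # (B, D) L2-normalized query embeddings
    doc_emb: torch.Tensor,      # (B, D) L2-normalized page-image embeddings; the i-th is the positive for query i
    neg_emb: torch.Tensor,      # (B, D) L2-normalized hard-mined negative page embeddings
    temperature: float = 0.02,  # illustrative value, not taken from the paper
) -> torch.Tensor:
    # Score each query against every in-batch document plus all hard negatives.
    candidates = torch.cat([doc_emb, neg_emb], dim=0)  # (2B, D)
    logits = query_emb @ candidates.T / temperature    # (B, 2B)
    # The correct "class" for query i is document i.
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

The `[:, :dimension]` slice-and-renormalize in `encode_documents` above appears to rely on the Matryoshka (MRL) property inherited from the `dse-qwen2-2b-mrl-v1` base model, which is what lets the full-dimension embeddings be truncated to smaller sizes at query time.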