---
license: cc-by-4.0
---

**Model Details**

The VisMin-Idefics2 model is a fine-tuned version of the Idefics2 model, trained on the VisMin dataset to improve performance on multimodal tasks. It is tuned for strong visual-text alignment and is designed to handle tasks where a model must differentiate between similar images based on textual descriptions. By employing the QLoRA technique and a rule-based selection of image-text pairs, VisMin-Idefics2 is optimized for fine-grained understanding and improved generalization across a range of multimodal benchmarks.
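
The exact training configuration is not included in this card. As a rough illustration only, the sketch below shows what a QLoRA setup for Idefics2 could look like using `peft` and `bitsandbytes`; the rank, alpha, dropout, and target modules are placeholder values, not the recipe used for VisMin-Idefics2.

```python
# Illustrative QLoRA sketch: placeholder hyperparameters, not the VisMin training recipe.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Load the base Idefics2 checkpoint in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

# Attach low-rank adapters to the attention projections and train only those weights.
base_model = prepare_model_for_kbit_training(base_model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder module list
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```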

**Model Summary**

- Model date: July 2024
- Model type: Multi-modal model (image+text)
- Parent models: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) and [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)

**Usage**

This section shows code snippets for running generation with the fine-tuned idefics2-8b model; the snippets differ only in how the inputs are formatted. First, define the common imports and the model-loading setup.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_name_or_path = "path/to/fine-tuned-model"

# Use FlashAttention 2 only on GPUs that support it (e.g. A100, H100).
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else ""
if "A100" in gpu_name or "H100" in gpu_name:
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = None

# 4-bit NF4 quantization, applied only when loading the base idefics2 checkpoints.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="auto",
    torch_dtype=torch.float16,
    _attn_implementation=attn_implementation,  # only A100, H100 GPUs
    quantization_config=quantization_config
    if model_name_or_path in ["HuggingFaceM4/idefics2-8b", "HuggingFaceM4/idefics2-8b-base"]
    else None,
)
```
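
The snippet above only loads the processor and model. Continuing from it, the sketch below shows a minimal generation step, assuming the standard Idefics2 chat-template interface; the image path and question are placeholders.

```python
from PIL import Image

# Placeholder inputs: replace with your own image and query.
image = Image.open("path/to/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which of the two captions matches this image?"},
        ],
    }
]

# Build the prompt with the chat template, then encode text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=64)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```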

**Bibtex**

```
@article{vismin2024,
  title={VisMin: Visual Minimal-Change Understanding},
  author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
  year={2024}
}
```