|
--- |
|
license: cc-by-4.0 |
|
--- |
|
|
|
|
|
**Model Details** |
|
|
|
The VisMin-Idefics2 model is a fine-tuned version of the Idefics2 model, trained on the VisMin dataset for enhanced performance on multimodal tasks. The model excels at visual-text alignment and is designed for tasks where a model must differentiate between similar images based on textual descriptions. By employing the QLoRA technique and a rule-based selection of image-text pairs, VisMin-Idefics2 is optimized for fine-grained understanding and improved generalization across multimodal benchmarks.
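As an illustrative sketch of the QLoRA technique mentioned above (not the exact training configuration used for this model), QLoRA combines 4-bit NF4 quantization of the frozen base weights with trainable low-rank adapters. The `r`, `lora_alpha`, `lora_dropout`, and `target_modules` values below are assumptions for illustration only:

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Low-rank adapters trained on top; rank and target modules are illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)  # only the adapter weights are trainable
```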
|
|
|
**Model Summary** |
|
|
|
- Model Date: July 2024 |
|
- Model type: Multi-modal model (image+text) |
|
- Parent Models: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) and [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
|
|
|
**Usage** |
|
|
|
This section shows code snippets for generation with the fine-tuned idefics2-8b model. The snippets differ only in input formatting. Let's first define some common imports and inputs.
|
|
|
```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

model_name_or_path = "path/to/fine-tuned-model"

# FlashAttention 2 is supported on recent GPUs such as A100 and H100.
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else ""
if "A100" in gpu_name or "H100" in gpu_name:
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = None

# 4-bit NF4 quantization, applied only when loading the original idefics2 checkpoints.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="auto",
    torch_dtype=torch.float16,
    _attn_implementation=attn_implementation,  # flash_attention_2 on A100/H100 GPUs only
    quantization_config=quantization_config
    if model_name_or_path in ["HuggingFaceM4/idefics2-8b", "HuggingFaceM4/idefics2-8b-base"]
    else None,
)
```
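With the processor and model loaded, a prompt for the image-text matching task can be built from a chat-style message list and rendered with `processor.apply_chat_template`. The sketch below builds such a message list and, for illustration only, renders it by hand in the idefics2 chat format (`User:`/`Assistant:` turns with `<image>` placeholders and `<end_of_utterance>` markers). The task wording is an assumption, not the exact prompt used for fine-tuning:

```python
def build_messages(caption: str, num_images: int = 2) -> list:
    """Chat-style messages asking which image matches the caption."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text",
                    "text": f"Which image matches the description: {caption}?"})
    return [{"role": "user", "content": content}]

def render_idefics2_prompt(messages: list) -> str:
    """Hand-rolled stand-in for processor.apply_chat_template (illustration only)."""
    parts = []
    for msg in messages:
        turn = "User:" if msg["role"] == "user" else "Assistant:"
        for item in msg["content"]:
            turn += "<image>" if item["type"] == "image" else item["text"]
        parts.append(turn + "<end_of_utterance>")
    parts.append("Assistant:")
    return "\n".join(parts)

messages = build_messages("a red mug to the left of a laptop")
prompt = render_idefics2_prompt(messages)
# In practice, use the processor instead of the manual renderer above:
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=[image_a, image_b], return_tensors="pt").to(model.device)
# generated_ids = model.generate(**inputs, max_new_tokens=32)
# print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```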
|
|
|
**Bibtex** |
|
``` |
|
@article{vismin2024, |
|
title={VisMin: Visual Minimal-Change Understanding}, |
|
author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya}, |
|
year={2024} |
|
} |
|
``` |
|
|