---
language:
- en
- ko
license: cc-by-nc-4.0
tags:
- multimodal
- conversational
- ncsoft
- varco
base_model:
- Qwen/Qwen2.5-14B-Instruct
- google/siglip-so400m-patch14-384
library_name: transformers
---
# VARCO-VISION-14B
## About the Model
**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference optimization stage. Designed to excel in both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input, generating text as output. It supports grounding (the ability to identify the locations of objects within an image) as well as OCR (Optical Character Recognition) to recognize text within images.
- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** [Coming Soon]()
- **Languages:** Korean, English
- **License:** CC BY-NC 4.0
- **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
- **Base Model:**
- **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
- **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
## Uses
### Direct Use
To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:
```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
```
After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B using the following code:
```python
import torch
from transformers import AutoTokenizer
from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
from llava.mm_utils import tokenizer_image_token, process_images
model_name = "NCSOFT/VARCO-VISION-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlavaQwenForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    device_map="auto",
)
vision_tower = model.get_vision_tower()
image_processor = vision_tower.image_processor
```
Prepare the image and text input by preprocessing the image and tokenizing the text. Pass the processed inputs to the model to generate predictions.
```python
import requests
from PIL import Image
# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]
prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
IMAGE_TOKEN_INDEX = -200
EOS_TOKEN = "<|im_end|>"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw)
image_tensors = process_images([raw_image], image_processor, model.config)
image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
image_sizes = [raw_image.size]
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=1024,
        use_cache=True,
    )
outputs = tokenizer.batch_decode(output_ids)[0]
if outputs.endswith(EOS_TOKEN):
    outputs = outputs[: -len(EOS_TOKEN)]
outputs = outputs.strip()
print(outputs)
```
### Specialized Features
To ask questions that involve bounding boxes or to receive answers containing them (e.g., for grounding, referring, and OCR tasks), include special tokens in the input text.
The following special tokens define the tasks, inputs, and outputs for the model:
- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<char>` and `</char>`: Used to mark a text phrase.
- `<obj>` and `</obj>`: Used to indicate an object.
- `<bbox>` and `</bbox>`: Used to represent a bounding box.
- `<delim>`: Represents multiple location points for a single object or text.
#### Grounding
Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.
```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
```
**Expected Output Example:**
```html
The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
```
<img src="assets/grounding.png" alt="Grounding Example" width="400"/>
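The bounding boxes in a `<gro>` response appear to be normalized `(x1, y1, x2, y2)` corner coordinates, with `<delim>` separating multiple boxes for one object. As a minimal sketch (not part of the released code, and assuming that coordinate convention), a hypothetical helper like `parse_grounding` below converts the tagged response into pixel-space boxes:
```python
import re

def parse_grounding(response: str, image_width: int, image_height: int):
    """Extract (label, [pixel boxes]) pairs from a <gro> response.

    Assumes <obj>label</obj><bbox>x1, y1, x2, y2[<delim>...]</bbox> pairs
    with coordinates normalized to [0, 1].
    """
    pattern = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>")
    results = []
    for label, bbox_str in pattern.findall(response):
        boxes = []
        for box in bbox_str.split("<delim>"):
            x1, y1, x2, y2 = (float(v) for v in box.split(","))
            boxes.append((
                round(x1 * image_width), round(y1 * image_height),
                round(x2 * image_width), round(y2 * image_height),
            ))
        results.append((label.strip(), boxes))
    return results

# e.g., for a 640x480 image, the response above yields
# [("two cats", [(333, 24, 638, 376), (10, 52, 328, 475)]), ...]
```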
#### Referring
VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation by including the object of interest within `<obj>` and `</obj>` tags and specifying its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location.
```python
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
            },
            {"type": "image"},
        ],
    },
]
```
**Expected Output Example:**
```
**이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 버튼을 누르면 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행할 수 있습니다. 리모컨의 버튼에는 일반적으로 숫자, 메뉴, 설정, 재생/일시정지 등의 기능이 포함되어 있으며, 사용자는 이를 통해 손쉽게 기기를 제어할 수 있습니다.
```
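If the region of interest is known in pixel coordinates, the referring text can be assembled by normalizing the box to the image size. The helper below is a hypothetical convenience function, not part of LLaVA-NeXT or the released checkpoint; it assumes the normalized corner convention and the roughly three-decimal formatting seen in the examples above.
```python
def make_referring_text(phrase: str, box_xyxy, image_width: int, image_height: int, question: str) -> str:
    """Wrap a phrase and a pixel-space (x1, y1, x2, y2) box in <obj>/<bbox> tags."""
    x1, y1, x2, y2 = box_xyxy
    # Normalize to [0, 1] and round to three decimals, as in the examples above.
    norm = (x1 / image_width, y1 / image_height, x2 / image_width, y2 / image_height)
    coords = ", ".join(f"{round(v, 3):g}" for v in norm)
    return f"<obj>{phrase}</obj><bbox>{coords}</bbox>{question}"

# e.g. make_referring_text("이 물건", (25, 66, 181, 123), 640, 480, "은 어떻게 쓰는거야?")
```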
#### OCR
To perform Optical Character Recognition (OCR), use the `<ocr>` token.
```python
image_file = "./assets/ocr_1.png"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]
```
**Expected Output Example:**
```
<char>백범로</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
<char>124번길</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
<char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
<char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
<char>만수주공아파트</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
<char>시흥</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
<char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
<char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
<char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
<char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
<char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
<char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
<char>인천대공원</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
<char>모래내시장역</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
<char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
<char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
<img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
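To post-process an `<ocr>` response, the tagged transcription can be split back into (text, box) pairs and, if desired, drawn onto the image for inspection. The snippet below is a minimal sketch under the same assumptions as the grounding parser above (normalized corner coordinates); `parse_ocr` and `draw_ocr_boxes` are hypothetical helpers, not part of the released code.
```python
import re
from PIL import Image, ImageDraw

def parse_ocr(response: str):
    """Return (text, (x1, y1, x2, y2)) pairs from an <ocr> response."""
    pattern = re.compile(r"<char>(.*?)</char><bbox>(.*?)</bbox>")
    results = []
    for text, bbox_str in pattern.findall(response):
        x1, y1, x2, y2 = (float(v) for v in bbox_str.split(","))
        results.append((text, (x1, y1, x2, y2)))
    return results

def draw_ocr_boxes(image: Image.Image, response: str) -> Image.Image:
    """Draw the recognized text regions on a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    w, h = annotated.size
    for _text, (x1, y1, x2, y2) in parse_ocr(response):
        draw.rectangle([x1 * w, y1 * h, x2 * w, y2 * h], outline="red", width=2)
    return annotated
```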
## Citing the Model
If you use VARCO-VISION-14B in your research, please cite the following (*BibTeX will be updated soon*):
```bibtex
```