---
language:
- en
- ko
license: cc-by-nc-4.0
tags:
- multimodal
- conversational
- ncsoft
- varco
base_model:
- Qwen/Qwen2.5-14B-Instruct
- google/siglip-so400m-patch14-384
library_name: transformers
---


# VARCO-VISION-14B

## About the Model

**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference-optimization stage. Designed to excel at both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input and generates text as output. It supports grounding, the ability to identify the locations of objects within an image, as well as OCR (Optical Character Recognition) to recognize text within images.

- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** *Coming Soon*
- **Languages:** Korean, English
- **License:** CC BY-NC 4.0
- **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
- **Base Model:**
  - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
  - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)



## Uses

### Direct Use

To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
```

After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B using the following code:

```python
import torch
from transformers import AutoTokenizer
from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
from llava.mm_utils import tokenizer_image_token, process_images

model_name = "NCSOFT/VARCO-VISION-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlavaQwenForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # needs flash-attn installed; "sdpa" should work as a fallback
    low_cpu_mem_usage=True,
    device_map="auto"
)

# The image processor comes from the model's vision tower (SigLIP).
vision_tower = model.get_vision_tower()
image_processor = vision_tower.image_processor
```

Prepare the inputs by tokenizing the text prompt and preprocessing the image, then pass them to the model to generate a response.

```python
import requests
from PIL import Image

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt.
# Each value in "content" has to be a list of dicts with types ("text", "image").
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]
prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# -200 is LLaVA's placeholder index for image tokens in `input_ids`.
IMAGE_TOKEN_INDEX = -200
EOS_TOKEN = "<|im_end|>"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

# Download and preprocess the image to match the vision encoder's input format.
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw)
image_tensors = process_images([raw_image], image_processor, model.config)
image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
image_sizes = [raw_image.size]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=1024,
        use_cache=True,
    )

# Decode and strip the trailing end-of-turn token.
outputs = tokenizer.batch_decode(output_ids)[0]
if outputs.endswith(EOS_TOKEN):
    outputs = outputs[: -len(EOS_TOKEN)]

outputs = outputs.strip()
print(outputs)
```

### Specialized Features

To ask questions or receive answers that involve bounding boxes (e.g., grounding, referring, and OCR tasks), include special tokens in the input text.

The following special tokens define tasks, inputs, and outputs for the model:

- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<char>` and `</char>`: Used to mark a text phrase.
- `<obj>` and `</obj>`: Used to indicate an object.
- `<bbox>` and `</bbox>`: Used to represent a bounding box.
- `<delim>`: Represents multiple location points for a single object or text.

#### Grounding

Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```html
The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
```

<img src="assets/grounding.png" alt="Grounding Example" width="400"/>
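The box coordinates in a `<gro>` response are normalized to the `[0, 1]` range (`x1, y1, x2, y2` relative to the image width and height), and `<delim>` separates multiple boxes belonging to one object. To work in pixel coordinates, a small parser is enough. Below is a minimal sketch that assumes the exact output format shown above; the helper name `parse_grounding` and its regular expression are illustrative, not part of the released code.

```python
import re

# Matches one <obj>label</obj><bbox>coords</bbox> span in a <gro> response.
GROUNDING_PATTERN = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>")

def parse_grounding(response, image_width, image_height):
    """Return a list of (label, [pixel boxes]) pairs from a <gro> response."""
    results = []
    for label, bbox_span in GROUNDING_PATTERN.findall(response):
        boxes = []
        # <delim> separates multiple boxes for the same object.
        for box in bbox_span.split("<delim>"):
            x1, y1, x2, y2 = (float(v) for v in box.split(","))
            boxes.append((
                round(x1 * image_width), round(y1 * image_height),
                round(x2 * image_width), round(y2 * image_height),
            ))
        results.append((label, boxes))
    return results

# For the sample response above on the 640x480 COCO image:
# parse_grounding(outputs, 640, 480)
# -> [('two cats', [(333, 24, 638, 376), (10, 52, 328, 475)]), ...]
```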

#### Referring

VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation by including the object of interest within `<obj>` and `</obj>` tags and specifying its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location.

```python
# The Korean prompt asks: "How is this object used?"
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
            },
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```
**이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 버튼을 누르면 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행할 수 있습니다. 리모컨의 버튼에는 일반적으로 숫자, 메뉴, 설정, 재생/일시정지 등의 기능이 포함되어 있으며, 사용자는 이를 통해 손쉽게 기기를 제어할 수 있습니다.
```
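In English, the sample answer says the object is a remote control, used mainly to operate a television or other electronic devices: pressing its buttons changes channels, adjusts the volume, toggles power, and so on.

The referring prompt uses the same normalized-coordinate convention as grounding, so a pixel-space box has to be scaled by the image size before it is embedded in the text. Below is a minimal formatting sketch; the helper name `referring_phrase` and the three-decimal rounding are assumptions for illustration.

```python
def referring_phrase(label, pixel_box, image_width, image_height):
    """Format a pixel-space (x1, y1, x2, y2) box as an <obj>/<bbox> span."""
    x1, y1, x2, y2 = pixel_box
    normalized = (x1 / image_width, y1 / image_height,
                  x2 / image_width, y2 / image_height)
    coords = ", ".join(f"{value:.3f}" for value in normalized)
    return f"<obj>{label}</obj><bbox>{coords}</bbox>"

# e.g. referring_phrase("이 물건", (25, 67, 181, 124), 640, 480)
# -> '<obj>이 물건</obj><bbox>0.039, 0.140, 0.283, 0.258</bbox>'
```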

#### OCR

To perform Optical Character Recognition (OCR), use the `<ocr>` token.

```python
image_file = "./assets/ocr_1.png"

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**

```
<char>백범로</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
<char>124번길</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
<char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
<char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
<char>만수주공아파트</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
<char>시흥</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
<char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
<char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
<char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
<char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
<char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
<char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
<char>인천대공원</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
<char>모래내시장역</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
<char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
<char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
```

<img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
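OCR results follow the same normalized-bbox convention, so they can be overlaid on the source image for a quick visual check. Below is a minimal sketch using Pillow; the helper name `draw_ocr_boxes` and the styling are illustrative, and the default bitmap font may not cover Hangul, in which case pass a TrueType font loaded via `ImageFont.truetype`.

```python
import re
from PIL import Image, ImageDraw

# Matches one <char>text</char><bbox>coords</bbox> line in an <ocr> response.
OCR_PATTERN = re.compile(r"<char>(.*?)</char><bbox>(.*?)</bbox>")

def draw_ocr_boxes(image_path, ocr_response, out_path="ocr_vis.png", font=None):
    """Overlay <char>/<bbox> detections from an <ocr> response on the image."""
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    draw = ImageDraw.Draw(image)
    for text, box in OCR_PATTERN.findall(ocr_response):
        x1, y1, x2, y2 = (float(v) for v in box.split(","))
        # Scale the normalized coordinates back to pixel space.
        rect = (x1 * width, y1 * height, x2 * width, y2 * height)
        draw.rectangle(rect, outline="red", width=2)
        # Korean labels need a font with Hangul glyphs (see lead-in note).
        draw.text((rect[0], max(rect[1] - 14, 0)), text, fill="red", font=font)
    image.save(out_path)

# draw_ocr_boxes("./assets/ocr_1.png", outputs)
```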

## Citing the Model

If you use VARCO-VISION-14B in your research, please cite the following (*BibTeX will be updated soon*):

```bibtex
```