qnguyen3/nanoLLaVA


qnguyen3 committed on Apr 8, 2024

Commit df31b96 · verified · 1 Parent(s): 609a34f

Update README.md

Files changed (1)

README.md +21 -3


@@ -1,4 +1,4 @@
-# nanoLLaVA
 <p align="center">
   <img src="https://i.ibb.co/W6qgZNp/pixelllava.webp" alt="Logo" width="350">
@@ -7,7 +7,16 @@
 ## Description
 nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.
 ## Usage
 ```python
 import torch
 import transformers
@@ -21,7 +30,7 @@ transformers.logging.disable_progress_bar()
 warnings.filterwarnings('ignore')
 # set device
-torch.set_default_device('cuda')  # or 'cuda'
 # create model
 model = AutoModelForCausalLM.from_pretrained(
@@ -51,7 +60,7 @@ text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
 input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)
 # image, sample images can be found in images folder
-image = Image.open('/home/qnguyen3/qnguyen3/nanoLLaVA/icon.png')
 image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)
 # generate
@@ -62,4 +71,13 @@ output_ids = model.generate(
     use_cache=True)[0]
 print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
 ```

+# nanoLLaVA - Sub 1B Vision-Language Model
 <p align="center">
   <img src="https://i.ibb.co/W6qgZNp/pixelllava.webp" alt="Logo" width="350">
 ## Description
 nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.
+| Model   | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA**  | **MM-VET** |
+|---------|--------|---------|-----------|------|-------------|-------------|------|--------|
+| Score   | 70.84  | 46.71   | 58.97     | 84.1 | 28.6        | 30.4        | 54.79| 23.9   |
+## Training Data
+Training data will be released later, as I am still writing a paper on this. Expect the final model to be much more powerful than the current one.
 ## Usage
+You can use the model with `transformers` using the following script:
 ```python
 import torch
 import transformers
 warnings.filterwarnings('ignore')
 # set device
+torch.set_default_device('cuda')  # or 'cpu'
 # create model
 model = AutoModelForCausalLM.from_pretrained(
 input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)
 # image, sample images can be found in images folder
+image = Image.open('/path/to/image.png')
 image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)
 # generate
     use_cache=True)[0]
 print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
+```
+## Prompt Format
+The model follows the ChatML standard, but without a `\n` after `<|im_end|>`:
+```
+<|im_start|>system
+Answer the question<|im_end|><|im_start|>user
+<image>
+What is the picture about?<|im_end|><|im_start|>assistant
 ```
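
As a minimal sketch of the prompt format described above, the snippet below assembles the same string by hand and then splices the image placeholder into the token ids the way the usage script does with `-200`. The helper names `build_prompt` and `splice_image_token` and the character-level `toy_tokenize` are hypothetical and exist only for illustration; in practice the model's own tokenizer would be passed in. The README does not say whether a newline follows the final `assistant`, so the sketch mirrors the block exactly as shown.

```python
from typing import Callable, List

# Placeholder id that the usage script inserts where the image goes.
IMAGE_TOKEN_INDEX = -200


def build_prompt(system: str, user: str) -> str:
    """Assemble a ChatML-style prompt with no newline after <|im_end|>.

    Hypothetical helper: it reproduces the layout shown in the README,
    ending at '<|im_start|>assistant' with nothing after it.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>"
        f"<|im_start|>user\n<image>\n{user}<|im_end|>"
        "<|im_start|>assistant"
    )


def splice_image_token(prompt: str, tokenize: Callable[[str], List[int]]) -> List[int]:
    """Tokenize the text on each side of '<image>' and join with the placeholder id.

    Assumes the prompt contains exactly one '<image>' marker, as in the
    usage script.
    """
    before, after = prompt.split("<image>")
    return tokenize(before) + [IMAGE_TOKEN_INDEX] + tokenize(after)


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (one id per character), for illustration only."""
    return [ord(ch) for ch in text]


prompt = build_prompt("Answer the question", "What is the picture about?")
ids = splice_image_token(prompt, toy_tokenize)
print(prompt)
print(ids.count(IMAGE_TOKEN_INDEX))
```

With the real tokenizer, `tokenize` would be something like `lambda s: tokenizer(s).input_ids`, matching the `text_chunks` line in the usage script.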