|
--- |
|
license: cc |
|
datasets: |
|
- liuhaotian/LLaVA-Instruct-150K |
|
- liuhaotian/LLaVA-Pretrain |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for LLaVA-LLaMA-3-8B |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
A reproduced LLaVA LVLM based on Llama-3-8B LLM backbone. Not an official implementation. |
|
Please follow my reproduced implementation [LLaVA-Llama-3](https://github.com/Victorwz/LLaVA-Llama-3/) for more details on fine-tuning LLaVA model with Llama-3 as the foundatiaon LLM. |
|
|
|
## Model Details |
|
Follows LLavA-1.5 pre-train and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3. |
|
|
|
## How to Use |
|
|
|
Please firstly install llava via |
|
``` |
|
pip install git+https://github.com/Victorwz/LLaVA-Llama-3.git |
|
``` |
|
|
|
You can load the model and perform inference as follows: |
|
```python |
|
from llava.conversation import conv_templates, SeparatorStyle |
|
from llava.model.builder import load_pretrained_model |
|
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path |
|
from PIL import Image |
|
import requests |
|
import torch |
|
from io import BytesIO |
|
|
|
# load model and processor |
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
model_name = get_model_name_from_path("weizhiwang/LLaVA-Llama-3-8B") |
|
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Llama-3-8B", None, model_name, False, False, device=device) |
|
|
|
# prepare inputs for the model |
|
text = '<image>' + '\n' + "Describe the image." |
|
conv = conv_templates["llama_3"].copy() |
|
conv.append_message(conv.roles[0], text) |
|
conv.append_message(conv.roles[1], None) |
|
prompt = conv.get_prompt() |
|
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda() |
|
|
|
# prepare image input |
|
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png" |
|
response = requests.get(url) |
|
image = Image.open(BytesIO(response.content)).convert('RGB') |
|
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda() |
|
|
|
# autoregressively generate text |
|
with torch.inference_mode(): |
|
output_ids = model.generate( |
|
input_ids, |
|
images=image_tensor, |
|
do_sample=False, |
|
max_new_tokens=512, |
|
use_cache=True) |
|
|
|
outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True) |
|
print(outputs[0]) |
|
``` |
|
The image caption results look like: |
|
``` |
|
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away. |
|
|
|
In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection. |
|
``` |
|
|
|
# Fine-Tune LLaVA-Llama-3 on Your Visual Instruction Data |
|
Please refer to a forked [LLaVA-Llama-3](https://github.com/Victorwz/LLaVA-Llama-3) git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer. |
|
|
|
## Benchmark Results |
|
|
|
|
|
| Model | MMMU Val | |
|
| :-------------------- | :---------------: | |
|
| LLaVA-v1.5-7B | 35.3 | |
|
| LLaVA-Llama-3-8B | 36.7 | |
|
|
|
Please refer to `eval_outputs/LLaVA-Llama-3-8B_mmmu_val.json` for reproduce the benchmark performance on MMMU validation set. |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{wang2024llavallama3, |
|
title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone}, |
|
author={Wang, Weizhi}, |
|
year={2024} |
|
} |
|
``` |
|
|