--- |
|
license: apache-2.0 |
|
--- |
|
<div align="center"> |
|
<img src="assets/logo.png" alt="HawkLlama" width="200"/>
|
|
|
|
|
<p style="font-size: 30px;"><b>HawkLlama</b></p> |
|
|
|
[🤗 **Hugging Face**](https://huggingface.co/AIM-ZJU/HawkLlama_8b) | [🖥️ **GitHub**](https://github.com/aim-uofa/VLModel) | [📄 **Technical Report**](assets/technical_report.pdf)
|
|
|
Zhejiang University, China |
|
|
|
</div> |
|
|
|
|
|
This is the official implementation of HawkLlama, an open-source multimodal large language model designed for real-world vision and language understanding applications. Its main highlights are as follows.
|
|
|
1. HawkLlama-8B is built from the following components:
|
- [Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), Meta's latest open-source large language model, trained on over 15 trillion tokens.
|
- [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384), an enhancement of CLIP that replaces the softmax-based contrastive loss with a sigmoid loss, achieving superior performance in image recognition.
|
- An efficient vision-language connector that captures high-resolution details without increasing the number of visual tokens, reducing the training overhead associated with high-resolution images (see the illustrative sketch after this list).
|
|
|
2. For training, we use the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) dataset for pretraining and a curated instruction-tuning mixture of multimodal and language-only data for supervised fine-tuning.
|
|
|
3. HawkLlama-8B is developed on the [NeMo](https://github.com/NVIDIA/NeMo.git) framework, which supports 3D parallelism and offers scalability for future extensions.
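
The exact connector design is described in our [technical report](assets/technical_report.pdf). As a purely illustrative sketch of the general idea behind such connectors (merging neighbouring patch features before projection, so a higher input resolution does not inflate the visual token count), the hypothetical module below is an assumption for illustration only, not the actual HawkLlama connector:

```Python
import torch
import torch.nn as nn

class PatchMergeConnector(nn.Module):
    """Hypothetical vision-language connector (illustration only).

    Merges each 2x2 block of neighbouring patch features into a single token
    (space-to-depth) and projects it to the LLM hidden size with a small MLP,
    so increasing the image resolution does not increase the number of visual
    tokens fed to the language model."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, merge: int = 2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), num_patches = h * w
        b, n, c = patch_features.shape
        h = w = int(n ** 0.5)
        x = patch_features.view(b, h, w, c)
        # group each merge x merge neighbourhood into one token of dim merge^2 * c
        x = x.view(b, h // self.merge, self.merge, w // self.merge, self.merge, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, self.merge * self.merge * c)
        return self.proj(x)  # (batch, num_patches / merge^2, llm_dim)

# Illustrative shapes only: 24 x 24 = 576 patches of dim 1024 -> 144 tokens of dim 4096
connector = PatchMergeConnector()
tokens = connector(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 144, 4096])
```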
|
|
|
Our model is open-source and reproducible. Please check our [technical report](assets/technical_report.pdf) for more details.
|
|
|
|
|
|
|
|
## Contents |
|
- [Setup](#setup) |
|
- [Model Weights](#model-weights) |
|
- [Inference](#inference) |
|
- [Evaluation](#evaluation) |
|
- [Demo](#demo) |
|
|
|
|
|
## Setup |
|
|
|
1. Create a conda environment and activate it.
|
```bash
|
conda create -n hawkllama python=3.10 -y |
|
conda activate hawkllama |
|
``` |
|
|
|
2. Clone and install this repo. |
|
```bash
|
git clone https://github.com/aim-uofa/VLModel.git |
|
cd VLModel |
|
pip install -e . |
|
pip install -e third_party/VLMEvalKit |
|
``` |
|
|
|
## Model Weights |
|
|
|
Please refer to our [Hugging Face repository](https://huggingface.co/AIM-ZJU/HawkLlama_8b) to download the pretrained model weights.
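
If you prefer to fetch the weights ahead of time (for example, for offline use), here is a minimal sketch using the `huggingface_hub` library; the local directory name below is arbitrary:

```Python
from huggingface_hub import snapshot_download

# Download every file of the HawkLlama-8B repository into a local directory
snapshot_download(
    repo_id="AIM-ZJU/HawkLlama_8b",
    local_dir="checkpoints/HawkLlama_8b",
)
```

The downloaded directory can then be passed to `from_pretrained` in place of the Hub ID.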
|
|
|
## Inference |
|
|
|
We provide example code for inference below.
|
|
|
```Python |
|
import torch |
|
from PIL import Image |
|
from HawkLlama.model import LlavaNextProcessor, LlavaNextForConditionalGeneration |
|
from HawkLlama.utils.conversation import conv_llava_llama_3, DEFAULT_IMAGE_TOKEN |
|
|
|
# Load the processor (tokenizer and image preprocessor) from the Hugging Face Hub
processor = LlavaNextProcessor.from_pretrained("AIM-ZJU/HawkLlama_8b")
|
|
|
# Load the model in bfloat16 and move it to the first GPU
model = LlavaNextForConditionalGeneration.from_pretrained("AIM-ZJU/HawkLlama_8b", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
|
model.to("cuda:0") |
|
|
|
image_file = "assets/coin.png" |
|
image = Image.open(image_file).convert('RGB') |
|
|
|
prompt = "what coin is that?" |
|
prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt  # prepend the image placeholder token
|
|
|
# Build the prompt with the Llama-3 chat template used by HawkLlama
conversation = conv_llava_llama_3.copy()
|
user_role_ind = 0 |
|
bot_role_ind = 1 |
|
conversation.append_message(conversation.roles[user_role_ind], prompt) |
|
conversation.append_message(conversation.roles[bot_role_ind], "") |
|
prompt = conversation.get_prompt() |
|
# Preprocess the text prompt and the image into model inputs
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
|
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16) |
|
output = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, max_new_tokens=2048, do_sample=False, use_cache=True) |
|
|
|
print(processor.decode(output[0], skip_special_tokens=True)) |
|
``` |
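
Note that `output[0]` typically contains the prompt tokens followed by the generated answer. If you only want the new reply, a common pattern (a sketch, assuming the output sequence aligns with `inputs["input_ids"]` as in standard Hugging Face decoder-only generation) is to slice off the prompt before decoding:

```Python
# Keep only the tokens generated after the prompt, then decode them
prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(answer)
```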
|
|
|
## Evaluation |
|
|
|
Evaluation is based on a modified version of the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) codebase.
|
|
|
```bash
|
# single GPU
|
python third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose |
|
# multiple GPUs
|
torchrun --nproc-per-node=8 third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose |
|
``` |
|
|
|
The results are shown below: |
|
|
|
| Benchmark | HawkLlama-8B | LLaVA-Llama3-v1.1 | LLaVA-Next |
|
|-----------------|----------------|-------------------|------------| |
|
| MMMU val | **37.8** | 36.8 | 36.9 | |
|
| SEEDBench img | **71.0** | 70.1 | 70.0 | |
|
| MMBench-EN dev | **70.6** | 70.4 | 68.0 | |
|
| MMBench-CN dev | **64.4** | 64.2 | 60.6 | |
|
| CCBench | **33.9** | 31.6 | 24.7 | |
|
| AI2D test | 65.6 | **70.0** | 67.1 | |
|
| ScienceQA test | **76.1** | 72.9 | 70.4 | |
|
| HallusionBench | 41.0 | **47.7** | 35.2 | |
|
| MMStar | 43.0 | **45.1** | 38.1 | |
|
|
|
## Demo |
|
|
|
Feel free to try our [demo](http://115.236.57.99:30020/)!
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
We express our appreciation to the following projects for their outstanding contributions to research and open-source development: [LLaVA](https://github.com/haotian-liu/LLaVA), [NeMo](https://github.com/NVIDIA/NeMo), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and [xtuner](https://github.com/InternLM/xtuner).
|
|
|
## License |
|
|
|
HawkLlama is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.