Commit 15f27d4 (verified) by AdaptLLM · 1 parent: 968f824

Update README.md

Files changed (1)
  1. README.md +94 -3
README.md CHANGED
@@ -1,3 +1,94 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - AdaptLLM/medicine-visual-instructions
+ language:
+ - en
+ base_model:
+ - Lin-Chen/open-llava-next-llama3-8b
+ tags:
+ - medical
+ - biology
+ ---
+ # Adapting Multimodal Large Language Models to Domains via Post-Training
+
+ This repo contains the **biomedicine MLLM developed from LLaVA-NeXT-Llama3-8B** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930). The corresponding training dataset is [medicine-visual-instructions](https://huggingface.co/datasets/AdaptLLM/medicine-visual-instructions).
+
+ The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains/edit/main/README.md)
+
+ We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.
+ **(1) Data Synthesis**: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. **Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.**
+ **(2) Training Pipeline**: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
+ **(3) Task Evaluation**: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks.
+
+ <p align='center'>
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/-Jp7pAsCR2Tj4WwfwsbCo.png" width="600">
+ </p>
+
+ ## How to use
+
+ ```python
+ from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
+ import torch
+ from PIL import Image
+ import requests
+
+ # Define your input image and instruction here:
+ ## image
+ url = "https://cdn-uploads.huggingface.co/production/uploads/650801ced5578ef7e20b33d4/bRu85CWwP9129bSCRzos2.png"
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+ instruction = "What's in the image?"
+
+ model_path='AdaptLLM/medicine-LLaVA-NeXT-Llama3-8B'
+
+ # =========================== You do NOT need to modify the following ===============================
+ # Load the processor
+ processor = LlavaNextProcessor.from_pretrained(model_path)
+
+ # Define image token
+ image_token = "<|reserved_special_token_4|>"
+
+ # Format the prompt
+ prompt = (
+     f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
+     f"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
+     f"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
+     f"{image_token}\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+
+ # Load the model
+ model = LlavaNextForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
+
+ # Prepare inputs and generate output
+ inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
+ answer_start = int(inputs["input_ids"].shape[-1])
+ output = model.generate(**inputs, max_new_tokens=512)
+
+ # Decode only the newly generated tokens (everything after the prompt)
+ pred = processor.decode(output[0][answer_start:], skip_special_tokens=True)
+ print(pred)
+ ```
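Editor's sketch (not part of the commit): the prompt-formatting step in the snippet above could be factored into a small helper so that different instructions reuse the same template. The names `build_prompt`, `IMAGE_TOKEN`, and `SYSTEM_PROMPT` are hypothetical and do not come from the repository; the template text and the image token are copied from the README snippet. It needs no model download, so the resulting string can be inspected before running generation.

```python
# Hypothetical helper; template and image token copied from the README snippet.
IMAGE_TOKEN = "<|reserved_special_token_4|>"

SYSTEM_PROMPT = (
    "You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language."
)


def build_prompt(
    instruction: str,
    image_token: str = IMAGE_TOKEN,
    system_prompt: str = SYSTEM_PROMPT,
) -> str:
    """Return a single-turn prompt: system message, then image token plus instruction."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}"
        "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{image_token}\n{instruction}"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )


if __name__ == "__main__":
    # Print the prompt that would be passed to the processor as `text=`.
    print(build_prompt("What's in the image?"))
```

Under these assumptions, `prompt = build_prompt(instruction)` would stand in for the hand-written `prompt = (...)` expression in the snippet; the rest of the pipeline stays unchanged.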
+
+ ## Citation
+ If you find our work helpful, please cite us.
+
+ AdaMLLM
+ ```bibtex
+ @article{adamllm,
+ title={On Domain-Specific Post-Training for Multimodal Large Language Models},
+ author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
+ journal={arXiv preprint arXiv:2411.19930},
+ year={2024}
+ }
+ ```
+
+ [Instruction Pre-Training](https://huggingface.co/papers/2406.14491) (EMNLP 2024)
+ ```bibtex
+ @article{cheng2024instruction,
+ title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
+ author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
+ journal={arXiv preprint arXiv:2406.14491},
+ year={2024}
+ }
+ ```