|
---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---
|
|
|
# Model Card for Llava-Phi2
|
|
|
This is a multimodal implementation of the [Phi2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).
|
|
|
## Model Details |
|
1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)

2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)

4. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)

5. Finetuned Model: [GunaKoppula/Llava-Phi2](https://huggingface.co/GunaKoppula/Llava-Phi2)
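
These pieces follow the standard LLaVA recipe: the CLIP vision tower turns the image into patch embeddings, a learned projector maps them into Phi-2's embedding space, and the projected image tokens are fed to the LLM alongside the text tokens. The sketch below illustrates that wiring with Hugging Face `transformers`; it is a conceptual illustration only (the projector here is randomly initialized, and `build_multimodal_inputs` is an illustrative name), not the repository's actual modeling code.

```python
import torch
import torch.nn as nn
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

# Vision tower: CLIP ViT-L/14 at 336px (577 patch tokens, hidden size 1024).
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# LLM backbone: Phi-2 (hidden size 2560).
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Projector from CLIP hidden size to Phi-2 hidden size; in the real model this
# is learned during pretraining on image-caption pairs (random weights here).
projector = nn.Linear(vision_tower.config.hidden_size, llm.config.hidden_size)

@torch.no_grad()
def build_multimodal_inputs(image, prompt: str) -> torch.Tensor:
    """Concatenate projected image tokens with text embeddings for the LLM."""
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    patch_features = vision_tower(pixel_values).last_hidden_state  # (1, 577, 1024)
    image_tokens = projector(patch_features)                       # (1, 577, 2560)
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_tokens = llm.get_input_embeddings()(text_ids)             # (1, T, 2560)
    # The result can be passed to the LLM as inputs_embeds.
    return torch.cat([image_tokens, text_tokens], dim=1)
```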
|
|
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)

- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)

- **Demo:** [Demo Link](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
1. Clone the LLaVA-Phi repository and navigate to the llava-phi folder
|
```bash
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```
|
2. Install the package
|
```bash
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
|
3. Run the Model |
|
```bash
python llava_phi/eval/run_llava_phi.py --model-path="GunaKoppula/Llava-Phi2" \
    --image-file="https://huggingface.co/GunaKoppula/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
```
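
For programmatic use (batching several questions, for example), one option is to call the same entry point from Python. The sketch below wraps the CLI shown above with `subprocess`; it assumes you run it from the repository root with the environment from step 2 active, and that `run_llava_phi.py` prints the model's answer to stdout, as upstream LLaVA's run script does. The `ask_llava_phi` helper is just an illustrative name.

```python
import subprocess

def ask_llava_phi(image: str, query: str, model_path: str = "GunaKoppula/Llava-Phi2") -> str:
    """Run one visual question through the run_llava_phi.py script and return its output."""
    result = subprocess.run(
        [
            "python", "llava_phi/eval/run_llava_phi.py",
            f"--model-path={model_path}",
            f"--image-file={image}",
            f"--query={query}",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return result.stdout.strip()

if __name__ == "__main__":
    answer = ask_llava_phi(
        image="https://huggingface.co/GunaKoppula/Llava-Phi2/resolve/main/people.jpg?download=true",
        query="How many people are there in the image?",
    )
    print(answer)
```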
|
|
|
### Acknowledgement

This implementation is based on the wonderful work done by: \
[LLaVA-Phi](https://github.com/zhuyiche/llava-phi) \
[LLaVA](https://github.com/haotian-liu/LLaVA) \
[Phi2](https://huggingface.co/microsoft/phi-2)