---
library_name: peft
tags:
- llava
pipeline_tag: image-text-to-text
license: mit
datasets:
- MaoXun/Synergy-General-MultimodalPairs
language:
- en
base_model:
- liuhaotian/llava-pretrain-vicuna-7b-v1.3
- lmsys/vicuna-7b-v1.3
---
# Brief
This is a LoRA model of LLaVA 7B v1.3 trained on [Synergy-General-MultimodalPairs](https://huggingface.co/datasets/MaoXun/Synergy-General-MultimodalPairs).
The dataset is designed to enhance the ability of vision language models (VLMs) to describe images in detail; it is introduced in the next section.
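Below is a minimal sketch of attaching this adapter with PEFT. It assumes the base model is loaded through the official LLaVA code base (which provides `LlavaLlamaForCausalLM`); the adapter path is a placeholder for wherever this repository is downloaded.
```python
# Minimal sketch, not a verified recipe: load the base LLaVA model and
# attach this LoRA adapter with PEFT. `LlavaLlamaForCausalLM` comes from
# the official LLaVA code base (https://github.com/haotian-liu/LLaVA);
# the adapter path below is a placeholder.
import torch
from peft import PeftModel
from llava.model import LlavaLlamaForCausalLM

base = LlavaLlamaForCausalLM.from_pretrained(
    "liuhaotian/llava-pretrain-vicuna-7b-v1.3",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "path/to/this/lora-adapter")
model.eval()
```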
# Dataset
### Link
[Github](https://github.com/mao-code/Synergy-General-MultimodalPairs) | [Paper](https://link.springer.com/chapter/10.1007/978-981-97-6125-8_12)
### Introduction
This is a visual-text pair dataset synergistically generated by a text-to-image model and multimodal large language model.
Each file name encodes (generation number)\_(number of batches)\_(number of initial descriptions per batch)\_(number of refinement cycles per initial description).
For example, 1_20_10_5.zip is the first-generation dataset with 20 batches, 10 initial descriptions per batch, and 5 refinement cycles per initial description.
It therefore contains a total of 20\*10\*5 = 1,000 image-text pairs.
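As a sanity check of the naming scheme, here is a small hypothetical helper (not part of the dataset tooling) that decodes a file name and computes the expected pair count:
```python
# Hypothetical helper (not shipped with the dataset): decode a file name
# like "1_20_10_5.zip" into its four components and the total pair count.
def parse_dataset_name(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    generation, batches, initial, cycles = map(int, stem.split("_"))
    return {
        "generation": generation,
        "batches": batches,
        "initial_descriptions_per_batch": initial,
        "refinement_cycles": cycles,
        "total_pairs": batches * initial * cycles,
    }

print(parse_dataset_name("1_20_10_5.zip")["total_pairs"])  # 1000
```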
Unzipping one of the datasets yields two files: a zip archive of the images, and a CSV file that maps each image path to its description.
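A minimal sketch of reading one unzipped dataset follows; the file names and column headers are assumptions for illustration, since the card does not document them.
```python
# Minimal sketch, assuming the layout described above: an images zip plus a
# CSV pairing each image path with its description.
import csv
import zipfile

# Hypothetical file names; the card does not document the exact paths.
with zipfile.ZipFile("images.zip") as zf:
    zf.extractall("images")

with open("pairs.csv", newline="", encoding="utf-8") as f:
    # "image_path" and "description" are assumed column headers.
    pairs = [(row["image_path"], row["description"]) for row in csv.DictReader(f)]

print(len(pairs))  # expected 1000 for 1_20_10_5
```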
The generation scripts are available in the [GitHub repository](https://github.com/mao-code/Synergy-General-MultimodalPairs).
# Framework versions
- PEFT 0.4.0