---
library_name: peft
tags:
- llava
pipeline_tag: image-text-to-text
license: mit
datasets:
- MaoXun/Synergy-General-MultimodalPairs
language:
- en
base_model:
- liuhaotian/llava-pretrain-vicuna-7b-v1.3
- lmsys/vicuna-7b-v1.3
---
# Brief
This is a LoRA model of LLaVA 7B v1.3 trained on [Synergy-General-MultimodalPairs](https://huggingface.co/datasets/MaoXun/Synergy-General-MultimodalPairs).
The dataset is designed to enhance the ability of vision language models (VLMs) to describe images in detail; it is introduced in the next section.
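Below is a minimal sketch of attaching this adapter with PEFT. It assumes the base model is loaded through the official LLaVA code base (which provides `LlavaLlamaForCausalLM`); the adapter path is a placeholder for wherever this repository is downloaded.
```python
# Minimal sketch, not a verified recipe: load the base LLaVA model and
# attach this LoRA adapter with PEFT. `LlavaLlamaForCausalLM` comes from
# the official LLaVA code base (https://github.com/haotian-liu/LLaVA);
# the adapter path below is a placeholder.
import torch
from peft import PeftModel
from llava.model import LlavaLlamaForCausalLM

base = LlavaLlamaForCausalLM.from_pretrained(
    "liuhaotian/llava-pretrain-vicuna-7b-v1.3",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "path/to/this/lora-adapter")
model.eval()
```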
# Dataset
### Link
[Github](https://github.com/mao-code/Synergy-General-MultimodalPairs) | [Paper](https://link.springer.com/chapter/10.1007/978-981-97-6125-8_12)
### Introduction
This is a visual-text pair dataset synergistically generated by a text-to-image model and multimodal large language model.
Each file name encodes (generation number)\_(number of batches)\_(number of initial descriptions per batch)\_(number of refinement cycles per initial description).
For example, 1_20_10_5.zip is the first-generation dataset with 20 batches, 10 initial descriptions per batch, and 5 refinement cycles per initial description.
It therefore contains a total of 20\*10\*5 = 1,000 image-text pairs.
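As a sanity check of the naming scheme, here is a small hypothetical helper (not part of the dataset tooling) that decodes a file name and computes the expected pair count:
```python
# Hypothetical helper (not shipped with the dataset): decode a file name
# like "1_20_10_5.zip" into its four components and the total pair count.
def parse_dataset_name(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    generation, batches, initial, cycles = map(int, stem.split("_"))
    return {
        "generation": generation,
        "batches": batches,
        "initial_descriptions_per_batch": initial,
        "refinement_cycles": cycles,
        "total_pairs": batches * initial * cycles,
    }

print(parse_dataset_name("1_20_10_5.zip")["total_pairs"])  # 1000
```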
Unzipping one of the datasets yields two files: a zip archive of the images, and a CSV file that maps each image path to its description.
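A minimal sketch of reading one unzipped dataset follows; the file names and column headers are assumptions for illustration, since the card does not document them.
```python
# Minimal sketch, assuming the layout described above: an images zip plus a
# CSV pairing each image path with its description.
import csv
import zipfile

# Hypothetical file names; the card does not document the exact paths.
with zipfile.ZipFile("images.zip") as zf:
    zf.extractall("images")

with open("pairs.csv", newline="", encoding="utf-8") as f:
    # "image_path" and "description" are assumed column headers.
    pairs = [(row["image_path"], row["description"]) for row in csv.DictReader(f)]

print(len(pairs))  # expected 1000 for 1_20_10_5
```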
The generation scripts are available in the [GitHub repository](https://github.com/mao-code/Synergy-General-MultimodalPairs).
# Framework versions
- PEFT 0.4.0