---
library_name: peft
tags:
- llava
pipeline_tag: image-text-to-text
license: mit
datasets:
- MaoXun/Synergy-General-MultimodalPairs
language:
- en
base_model:
- liuhaotian/llava-pretrain-vicuna-7b-v1.3
- lmsys/vicuna-7b-v1.3
---

# Brief

This is a LoRA model for LLaVA 7B v1.3, trained on [Synergy-General-MultimodalPairs](https://huggingface.co/datasets/MaoXun/Synergy-General-MultimodalPairs). The dataset is intended to improve the ability of vision-language models (VLMs) to describe images in detail. An introduction to the dataset follows.

# Dataset

### Link

[Github](https://github.com/mao-code/Synergy-General-MultimodalPairs) | [Paper](https://link.springer.com/chapter/10.1007/978-981-97-6125-8_12)

### Introduction

This is a visual-text pair dataset synergistically generated by a text-to-image model and a multimodal large language model.

Each file name follows the pattern
(n-th generation)\_(number of batches)\_(number of initial descriptions per batch)\_(number of refinement cycles per initial description)

For example, `1_20_10_5.zip` is dataset number one, with 20 batches, 10 initial descriptions per batch, and 5 refinement cycles per initial description. It therefore contains a total of 20\*10\*5=1000 image-text pairs.

Once you unzip one of the datasets, you will see two files. The first is a zip file of the images. The second is a CSV file that contains the path and the description of each image.

The script for the generation process is available on GitHub: https://github.com/mao-code/Synergy-General-MultimodalPairs

# Framework versions

- PEFT 0.4.0
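
# File name example

The file naming convention described in the dataset introduction can be decoded mechanically. The following is a minimal sketch, not part of the dataset's own tooling: the names `DatasetSpec` and `parse_dataset_name` are hypothetical and introduced here only for illustration. It assumes the file name consists of exactly four underscore-separated integers, optionally followed by `.zip`.

```python
import re
from typing import NamedTuple


class DatasetSpec(NamedTuple):
    """Fields encoded in a dataset file name (hypothetical helper type)."""

    generation: int            # n-th generation (dataset number)
    batches: int               # number of batches
    initial_descriptions: int  # initial descriptions per batch
    refine_cycles: int         # refinement cycles per initial description

    @property
    def total_pairs(self) -> int:
        # Total image-text pairs = batches * descriptions * cycles.
        return self.batches * self.initial_descriptions * self.refine_cycles


# Assumed pattern: four integers joined by underscores, optional ".zip".
_NAME_PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)_(\d+)(?:\.zip)?$")


def parse_dataset_name(file_name: str) -> DatasetSpec:
    """Split a name such as '1_20_10_5.zip' into its four numeric fields."""
    match = _NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected dataset file name: {file_name!r}")
    return DatasetSpec(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    spec = parse_dataset_name("1_20_10_5.zip")
    print(spec)
    print(spec.total_pairs)  # 20 * 10 * 5 = 1000
```

For `1_20_10_5.zip` this reproduces the count given in the introduction: 20 batches, 10 initial descriptions per batch, and 5 refinement cycles, for 1000 pairs in total.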