How do you fine tune LLaVA NeXT?
Is there a way to fine tune LLaVA-NeXT?
cc @lewtun the TRL team is going to make it super easy to fine-tune models like these.
For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.
Thanks Niels, this is great!
I assume the same approach works also for LLaVA-NeXT. Is that correct?
Nishant
Yes it should, although Llava-NeXT is a bit more complex compared to Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.
For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb. Have tested this with both models.
Hey @RaushanTurganbay, very cool! I was a little confused because the PR also says that it's fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the following prompt format: "[INST] <image>\nWhat is shown in this image? [/INST]", as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next
Yes that's right, LLaVa-NeXt does not have a chat template yet which means that for now you need to manually make sure that the right format is used. Looks like @RaushanTurganbay might need to update that
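Until a chat template is available, the prompt can be assembled by hand. Here is a minimal sketch, assuming the Mistral-style format from the linked model docs with a single <image> placeholder in front of the question. The helper name build_mistral_llava_next_prompt is made up for illustration and is not part of transformers:

```python
def build_mistral_llava_next_prompt(question: str) -> str:
    """Format a single-turn prompt for llava-v1.6-mistral-7b-hf by hand.

    Assumption: the format follows the LLaVA-NeXT model docs, with one
    <image> placeholder placed in front of the question.
    """
    return f"[INST] <image>\n{question.strip()} [/INST]"


prompt = build_mistral_llava_next_prompt("What is shown in this image?")
print(prompt)
```

The resulting string is what you would pass as the text argument to the processor, together with the image. Other LLaVA-NeXT checkpoints (for example the Vicuna-based ones) use a different format, so check the model card before reusing this helper.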
Okay, thanks for noting. I will change it in the notebook and try to add chat templates to all Llava models.
Hi @nielsr, sorry, it's still not quite clear to me whether LLaVA-Next supports training with batched images. The PR said that only support for training without images was added: https://github.com/huggingface/transformers/pull/29850
I updated the comment in the PR to say "with and w/o images". The model should be tunable with images as well.
@RaushanTurganbay , thanks for sharing the notebook on finetuning LLaVA-NEXT! Is there a similar one for finetuning LLaVA-NEXT-Video? or can I easily adapt this notebook for LLaVA-NEXT-Video as well? @nielsr
Yes here it is: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VideoLLaVa. Should be very similar for LLaVA-Next-Video.
There is actually a notebook for llava-next-video here, I will port it to the Tutorials repo for easier discovery
Hey, thanks so much for the great examples! I'm trying to follow along, but I only have small GPUs and am trying to use DeepSpeed. Do you know if your code would work with DeepSpeed on 4 GPUs?
DeepSpeed is supported when using Trainer, but the example notebook relies on a custom trainer. Take a look at https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/deepspeed#deepspeed-non-trainer-integration for more information on how to use DeepSpeed with custom trainers.
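For the Trainer route mentioned above, a DeepSpeed config is usually just a JSON file. Here is a minimal sketch of a ZeRO stage 2 config. The choice of stage 2 and CPU optimizer offload is an assumption to tune for your hardware, not something taken from the notebook:

```python
import json

# ZeRO stage 2 config. The "auto" values are filled in by the Hugging Face
# Trainer from TrainingArguments. Stage and offload settings are assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

# Serialize the config; save this string as e.g. ds_config.json.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```

Once saved to a file, the path can be passed to TrainingArguments via its deepspeed argument, and the script launched with the deepspeed launcher (for example with --num_gpus=4). If memory is still too tight, ZeRO stage 3 shards the parameters as well, at the cost of more communication.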
Sorry, a slightly different question: how many images and/or videos can LLaVA-Next-Video take? I couldn't find it stated elsewhere. Thanks in advance. @RaushanTurganbay @nielsr
@tjiang217 LLaVA-Next-Video was not trained in a multi-image/multi-video setting afaik, but that doesn't mean we can't try feeding it several visuals. Note, however, that the generation quality might not be as good as with a single image.
You can also take a look at https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19, which were trained for interleaved images/videos. It doesn't state, however, how many images/videos per prompt were used in training; I guess it was 2 images/videos in most examples.
@RaushanTurganbay I tried to run the llava-next-video finetuning notebook you shared, without changing any code, on an EC2 instance with 4 A10 GPUs and ran into the following issue. The inference code works, just not the training part. Do you have any idea why? It has to do with device_map = 'auto', but putting everything on one GPU causes a CUDA out of memory error. Any help would be greatly appreciated.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
@RaushanTurganbay sorry, just wanted to follow up here. I was able to bypass the previous bug by making the batch size smaller and removing device_map = 'auto', but ran into the following bug using the same code from the llava-next-video finetuning notebook. Do you know which transformers version and other package versions you used for this notebook? Thanks in advance!
Error I ran into.
RuntimeError: Input tensor at index 1 has invalid shape [1, 1595, 32064], but expected [1, 1500, 32064]
Further discussion/solutions will be in https://github.com/huggingface/trl/issues/1785#issuecomment-2314793662 for anyone having the same issue
What changes do I need to make in the notebook if my dataset has the fields unique_id, image and conversations? I can't see any notebook that uses conversations for training.
You can find an SFT tuning example for VLMs here: https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py. The general idea is the same: you just have to prepare the inputs in the format you want, which means writing your own data collator. You can also take a look at how LLMs are tuned with dialog datasets to see how the inputs have to be formatted/masked.
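To make the "prepare the inputs and mask them" step concrete, here is a minimal sketch in plain Python. It assumes the conversations field uses LLaVA-style turns ({"from": "human" | "gpt", "value": ...}) with an "<image>" placeholder in the human turn; the record, the function names and the token ids are all hypothetical, and a real collator would also tokenize and batch:

```python
def to_chat_messages(conversations):
    """Convert LLaVA-style turns into chat-template style messages.

    Assumption: each turn looks like {"from": "human" | "gpt", "value": str}.
    """
    role_map = {"human": "user", "gpt": "assistant"}
    messages = []
    for turn in conversations:
        text = turn["value"]
        content = []
        if "<image>" in text:
            # Represent the image as its own content entry.
            content.append({"type": "image"})
            text = text.replace("<image>", "").strip()
        content.append({"type": "text", "text": text})
        messages.append({"role": role_map[turn["from"]], "content": content})
    return messages


def mask_non_assistant(input_ids, assistant_spans, ignore_index=-100):
    """Build labels that keep only the tokens inside assistant spans.

    assistant_spans is a list of (start, end) token indices, end exclusive.
    Every other position is set to ignore_index so it is skipped by the loss.
    """
    labels = [ignore_index] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


example = {  # hypothetical record
    "unique_id": "sample-0",
    "image": "sample-0.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is in the picture?"},
        {"from": "gpt", "value": "A dog on a beach."},
    ],
}
messages = to_chat_messages(example["conversations"])
labels = mask_non_assistant([5, 6, 7, 8, 9], [(3, 5)])
print(messages)
print(labels)
```

Masking the user turns with -100 is the usual way to train only on the assistant replies; whether you want that, or loss over the whole sequence, is a design choice for your use case.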
@RaushanTurganbay I understand the current llava-next-video model processes each frame as 12x12 tokens (the result of stride-2 pooling over 24x24 tokens). I am working with a soccer video dataset that has fine-grained details, such as the soccer ball, so I thought 12x12 tokens may not be able to capture enough detail. The LLaVA-next-video blog talked about testing different variations of pooling strides. Do you know if we could tweak the current model or access the other models so that the number of tokens representing each frame is greater than 12x12?
Thanks in advance, much appreciated!
Unfortunately we don't support different pooling methods and strides. Maybe you can tune your model with the llava-vl repo for that and then convert it to HF format? We are currently trying to make VLMs more modular and will move the image-encoder-related code into a separate method, so you will have more freedom in how to obtain image hidden states by overriding only that method :)
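For reference, the per-frame token arithmetic from the question can be sketched as follows. This assumes non-overlapping pooling (kernel size equal to stride) over a 24x24 grid; the function name is made up for illustration:

```python
def tokens_per_frame(grid_size: int = 24, stride: int = 2) -> int:
    """Number of visual tokens per frame after spatial pooling.

    Assumption: non-overlapping pooling, so each side shrinks by the stride.
    """
    side = grid_size // stride
    return side * side


for stride in (1, 2, 4):
    print(f"stride {stride}: {tokens_per_frame(24, stride)} tokens per frame")
```

With stride 2 this gives the 12x12 = 144 tokens mentioned above, while stride 1 would keep all 576. The cost grows linearly with the number of frames, so 32 frames at stride 1 would already be 18432 visual tokens.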
Hi @lcolonn ! Yes, the PR was merged and LLaVa-NeXT is tunable now. Fine-tuning script is almost the same as LLaVa with a few changes in input arguments, find here my adaptation of Niels' notebook
Hi guys!
I'm trying to fine-tune LLaVA 1.6, but I'm facing a problem: I've tried @RaushanTurganbay's Colab notebook (and many others I found on the Internet), but I get a CUDA out of memory error when I try to run the Lightning Trainer:
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 7.06 MiB is free. Process 130036 has 14.74 GiB memory in use. Of the allocated memory 14.04 GiB is allocated by PyTorch, and 580.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I've tested this code in Colab on an L4 (16GB VRAM) as well as on my local machine with a 3090 (24GB VRAM). I'm not sure whether I need a lot more VRAM or whether there is a memory leak somewhere. Maybe some library or driver has changed since then? Setting PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:512 or expandable_segments:True didn't resolve the problem. I've tested with the most recent versions of the packages as well as with these (tutorial from 2 months ago):
https://github.com/Farzad-R/Finetune-LLAVA-NEXT/blob/main/requirements.txt
Can somebody help me?
Thanks
FYI, I had an 80GB A100 GPU when training the model and simply uploaded the notebook in colab for ease of sharing. You might consider getting more GPU :)
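A rough back-of-the-envelope estimate shows why small GPUs run out of memory here. The sketch below counts the model weights only and assumes roughly 7 billion parameters; gradients, optimizer states and activations come on top and depend on the training setup (LoRA, quantization, batch size), so treat the numbers as a lower bound:

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 7e9  # assumption: roughly 7B parameters

for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:>9}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

In half precision the weights alone take about 13 GiB, which nearly fills the 14.75 GiB GPU from the error message before any training state is allocated. Loading the model in 4-bit and training only LoRA adapters is the usual way to fit on smaller cards.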
hi @RaushanTurganbay hope you are doing well! Just a qq, do we plan on supporting the new LLaVA-Video model too? (previously llava-next-video) thanks
Hey! If you mean support for fine-tuning demo notebook, we have it here -> https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVA-NeXT-Video/Fine_tune_LLaVa_NeXT_Video_with_HFTrainer.ipynb
Also there is a community maintained repo for tuning various VLMs in https://github.com/zjysteven/lmms-finetune
Hi @RaushanTurganbay, sorry I wasn't super clear. I meant that lmms-lab released a new set of LLaVA-Video models here: https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944. Does llava-hf have plans to support them too? I see you also recently added support for LLaVA-OneVision. It has made working with your models much easier, so I would really appreciate it if you could. Thanks in advance!