Query regarding flan-ul2-lora

#1
by cyt79 - opened

Hi there,

I'm also trying to fine-tune flan-ul2 (google/flan-ul2) using LoRA, and I have a few questions if you don't mind:

I'm trying to do this on a p3dn.24xlarge instance (8 GPUs with 32 GB of GPU memory each). I'm following this blog post (https://www.philschmid.de/fine-tune-flan-t5-peft), which is written for fine-tuning flan-t5-xxl with LoRA. When I use more than one GPU, I get this error:

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

So I tried fine-tuning flan-ul2 on only one of the 8 GPUs, but that didn't help either, because this time I got:
RuntimeError: No executable batch size found, reached zero.

which doesn't make sense to me, because I didn't change anything related to the data processing.
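
For context, as far as I can tell this message comes from the automatic batch-size search that auto_find_batch_size=True turns on: the training step is retried with a halved batch size after each out-of-memory failure, and the error is raised once the batch size reaches zero. Below is a minimal pure-Python sketch of that halving logic, not the library's actual code. The names find_batch_size, make_train_step and FakeOutOfMemory are hypothetical and exist only for this illustration.

```python
class FakeOutOfMemory(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def find_batch_size(train_step, starting_batch_size=8):
    """Retry train_step, halving the batch size after every simulated OOM."""
    batch_size = starting_batch_size
    while True:
        if batch_size == 0:
            # Even a batch of one example did not fit in memory.
            raise RuntimeError("No executable batch size found, reached zero.")
        try:
            return train_step(batch_size)
        except FakeOutOfMemory:
            batch_size //= 2


def make_train_step(max_fitting_batch_size):
    """Build a fake training step that 'fits' only up to a given batch size."""
    def train_step(batch_size):
        if batch_size > max_fitting_batch_size:
            raise FakeOutOfMemory(f"batch size {batch_size} does not fit")
        return batch_size
    return train_step


if __name__ == "__main__":
    # A step that fits two examples: the search settles on batch size 2.
    print(find_batch_size(make_train_step(2)))
    # A step that fits nothing: the search ends with the error quoted above.
    try:
        find_batch_size(make_train_step(0))
    except RuntimeError as err:
        print(err)
```

If that reading is right, the error says that not even a single example fit on the GPU, so it points at memory rather than at the data processing.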

So, my questions are:

  1. Were you using both of the A100 GPUs for fine-tuning?
  2. Did you encounter any of these errors while fine-tuning? If so, could you please share how you fixed them?

In case you're wondering what I did to fine-tune flan-ul2 with LoRA: I changed very little from the blog post. All I actually did was change the model name:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq

# The blog post used "google/flan-t5-xxl" for the tokenizer and
# "philschmid/flan-t5-xxl-sharded-fp16" for the model; both are replaced here.
model_id = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in 8-bit; device_map="auto" spreads them over the visible GPUs
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

Then I run the trainer as shown below:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
Conifer Labs LLC org

Hey @cyt79, sorry to hear you're running into issues fine-tuning the flan-ul2 model. I suspect the reason you're able to fine-tune flan-t5-xxl but are having issues with flan-ul2 is that the former has 11B parameters while the latter has 20B, so the model is almost 2x the size. Additionally, the code you've shared doesn't look like it will be able to utilize multiple GPUs.
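
To put rough numbers on the size difference, here is a back-of-the-envelope sketch of the memory needed for the weights alone. It assumes 1 byte per parameter for 8-bit loading and 2 bytes for fp16, and it ignores activations, LoRA adapter weights, optimizer state and any quantization overhead, so real usage will be higher. The helper name weight_memory_gb is made up for this example.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted above: 11B for flan-t5-xxl, 20B for flan-ul2.
MODELS = {"flan-t5-xxl": 11e9, "flan-ul2": 20e9}

if __name__ == "__main__":
    for name, num_params in MODELS.items():
        int8_gb = weight_memory_gb(num_params, 1)  # assumed 1 byte per parameter
        fp16_gb = weight_memory_gb(num_params, 2)  # assumed 2 bytes per parameter
        print(f"{name}: ~{int8_gb:.0f} GB in int8, ~{fp16_gb:.0f} GB in fp16")
```

Under these assumptions the flan-ul2 weights alone take about 20 GB in 8-bit, which leaves much less headroom on a 32 GB card than the roughly 11 GB needed for flan-t5-xxl.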

In my example I used a single A100 with 80GB of GPU memory, and usage sat around 50GB with batch_size = 1, so I didn't need multi-GPU support. However, updating the code to use multiple GPUs should be fairly straightforward with Accelerate. I have a blog post in the works to demo exactly this :) In the meantime, I hope this helps.
https://huggingface.co/docs/transformers/accelerate

Hey @kevin510, I'm wondering if you had a chance to publish your blog post on fine-tuning flan-ul2 with LoRA on multiple GPUs?