Fine-tuned medium on a single GPU, but can't train on two GPUs (RTX 3060)
I fine-tuned the medium model on a single NVIDIA GeForce RTX 3060 on Windows 11 (PyCharm); training took almost 5 days to complete. Dataset: 8000 audio files and 8000 transcripts, each clip only 4 seconds long. Now I have two RTX 3060s, but I get a CUDA OOM error. I switched to Ubuntu 22.04 so that multi-GPU training would be smoother, but I still hit the same CUDA OOM error.
Why is this happening?
How can I fix it?
I originally purchased the GPUs for training large, but now not even medium works.
Please help.
```
pytorch==2.0.1
torchvision==0.15.2
torchaudio==2.0.2
pytorch-cuda=11.7
```
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 11.66 GiB total capacity; 11.16 GiB already allocated; 7.50 MiB free; 11.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
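The message suggests setting `max_split_size_mb` when reserved memory is much larger than allocated memory. A minimal sketch of how I understand that option is applied, assuming it has to be set before the first CUDA allocation (the 128 value is just an assumed starting point to tune, not something the error prescribes):

```python
import os

# Allocator hint from the OOM message above; 128 MiB is an assumed
# starting value to tune, not a recommendation from the error itself.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator picks it up
```

In my case, though, reserved (11.43 GiB) and allocated (11.16 GiB) are close, so fragmentation may not be the main issue.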
These are my training arguments:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
output_dir="dir_medium",
per_device_train_batch_size=1,
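    # note: this batch size is per GPU, so two GPUs double the effective batch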
gradient_accumulation_steps=16, # increase by 2x for every 2x decrease in batch size
learning_rate=1e-5,
warmup_steps=500,
# max_steps=4000,
per_device_eval_batch_size=8,
gradient_checkpointing=True,
fp16=True,
eval_strategy="steps",
predict_with_generate=True,
generation_max_length=225,
save_steps=100,
save_total_limit=2,
eval_steps=100,
logging_steps=25,
report_to=["tensorboard"],
load_best_model_at_end=True,
metric_for_best_model="wer",
greater_is_better=False,
push_to_hub=False,
optim="adamw_bnb_8bit",
)
```
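For reference, a quick check that both GPUs are visible to PyTorch (a minimal sketch; it only prints what PyTorch can see):

```python
import torch

# Print every CUDA device PyTorch can see, with its total memory.
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```

I'm also unsure whether the launch method matters: as far as I understand, running plain `python train.py` (`train.py` is a placeholder for my script) makes the Trainer fall back to DataParallel, which concentrates extra memory on GPU 0, whereas `torchrun --nproc_per_node=2 train.py` would use DistributedDataParallel. Could that explain why GPU 0 is the one that runs out of memory?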