Using official training example, model was neither saved nor pushed to repo
Hello, I am working on training a model based on the official training example which can be located here: https://huggingface.co/nakajimayoshi/ddpm-iris-256/tree/main/
I was able to train the model successfully, and the training logs and samples were uploaded, but the model was neither saved in the runtime (as a .bin or .pth file) nor pushed to my repository. I have made no modifications to the training loop, only to the training config and the dataset loading pipeline. You can see the modified training config below:
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    image_size = 256  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    dataset_name = 'imagefolder'
    save_model_epochs = 30
    mixed_precision = 'fp16'  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = 'ddpm-iris-256'  # the model name locally and on the HF Hub
    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_private_repo = False
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0

config = TrainingConfig()
On my repository, you can see that the logs and samples were uploaded, but none of the model checkpoints were uploaded, nor can I find them in my Google Colab notebook. Any help is appreciated. Thanks.
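To confirm whether any weights were written locally, one option is to scan the output directory for weight files. This is only a minimal sketch: `find_weight_files` and the suffix list are illustrative names chosen here, not part of the training example.

```python
from pathlib import Path

# Assumed set of file extensions that model weights are commonly stored under.
WEIGHT_SUFFIXES = {".bin", ".pth", ".pt", ".safetensors"}


def find_weight_files(output_dir: str) -> list[str]:
    """Return the relative paths of weight-like files under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing was written at all (or the path is wrong).
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    )


# 'ddpm-iris-256' is the output_dir from the config above.
print(find_weight_files("ddpm-iris-256"))
```

An empty list here means no checkpoint exists in the runtime, so there is nothing for a later push to upload.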
I have found a workaround for this issue.
The problem is in the training loop:
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            pipeline.save_pretrained(config.output_dir)  # this never gets called
Because push_to_hub = True in my config, the 'else' branch is never reached, so pipeline.save_pretrained(config.output_dir) never gets called and there are no model files to push. I solved this by simply moving the method call out of the else statement and saving on every epoch:
if accelerator.is_main_process:
    pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
    pipeline.save_pretrained(config.output_dir)  # move to here

    if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
        evaluate(config, epoch, pipeline)

    if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
        if config.push_to_hub:
            repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
        else:
            print('saving..')  # replaced with print to see if it gets called
Note: I could have simply removed the nested if statement and always pushed to the Hub, but to avoid any unexpected behavior I left it as is and only moved the method call.
Saving on every epoch slows down training, but at the very least the model doesn't get lost.
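A cheaper variant would be to save only on the checkpoint epochs, right before the push. The sketch below assumes that repo.push_to_hub uploads whatever is currently in config.output_dir; the helper names `is_checkpoint_epoch` and `save_and_maybe_push` are hypothetical and not part of the training example.

```python
def is_checkpoint_epoch(epoch: int, every: int, num_epochs: int) -> bool:
    """True on every `every`-th epoch (counting from 1) and on the final epoch."""
    return (epoch + 1) % every == 0 or epoch == num_epochs - 1


def save_and_maybe_push(pipeline, repo, config, epoch: int) -> bool:
    """Save the pipeline on checkpoint epochs, then push if configured.

    `pipeline` needs a save_pretrained(dir) method and `repo` a
    push_to_hub(commit_message=..., blocking=...) method, as in the
    training loop above. Returns True if a checkpoint was written.
    """
    if not is_checkpoint_epoch(epoch, config.save_model_epochs, config.num_epochs):
        return False
    # Always write the files first so the push has something to upload.
    pipeline.save_pretrained(config.output_dir)
    if config.push_to_hub:
        repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
    return True
```

Inside the `if accelerator.is_main_process:` block, the nested save/push branch could then be replaced with a single call to `save_and_maybe_push(pipeline, repo, config, epoch)`. With save_model_epochs = 30 and num_epochs = 50, this writes a checkpoint only at epoch indices 29 and 49 instead of on every epoch.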