Using official training example, model was neither saved nor pushed to repo

#12
by nakajimayoshi - opened

Hello, I am working on training a model based on the official training example which can be located here: https://huggingface.co/nakajimayoshi/ddpm-iris-256/tree/main/

I was able to successfully train the model, and the training logs/samples were successfully uploaded, but the model was neither saved in the runtime as a .bin or .pth or pushed to my repository. I have made no modifications to the training loop, only the training config and dataset loading pipeline. You can see the modification of the training config below:

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    image_size = 256  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    dataset_name= 'imagefolder'
    save_model_epochs = 30
    mixed_precision = 'fp16'  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = 'ddpm-iris-256'  # the model namy locally and on the HF Hub

    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_private_repo = False
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0

config = TrainingConfig()

On my repository, you can see the logs and samples were uploaded, but none of the model checkpoints were uploaded nor can I find them in my google colab notebook. Any help is appreciated. Thanks

I have found a work around for this issue:
The issue is in the training loop:

 if accelerator.is_main_process:
            pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

            if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
                evaluate(config, epoch, pipeline)

            if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
                if config.push_to_hub:
                    repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
                else:
                    pipeline.save_pretrained(config.output_dir) # this never gets called

For one reason or another, the 'else' condition is not being reached, therefore pipline.save_pretrained(config.output_dir) never gets called. I solved this by simply moving the method call out of the else statement and saving it on every epoch:

 if accelerator.is_main_process:
            pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
            pipeline.save_pretrained(config.output_dir) # move to here
            if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
                evaluate(config, epoch, pipeline)

            if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
                if config.push_to_hub:
                    repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
                else:
                    print('saving..') # replaced with print to see if it gets called

note I could have easily just removed the entire nested if statement and have it push to hub, but to prevent any unexpected behaviors I left it as is, and only moved the method call.
This slows down the training speed but at the very least the model doesn't get lost.

Sign up or log in to comment