|
# Training |
|
<p align="center"> |
|
<img src="../figures/collage_3.jpg" width="600"> |
|
</p> |
|
|
|
This directory provides the training code for Stable Cascade, as well as guides on downloading the models you need.
|
Specifically, you can find training scripts for the following use-cases: |
|
- Text-to-Image |
|
- ControlNet |
|
- LoRA |
|
- Image Reconstruction |
|
|
|
#### Note: |
|
A quick clarification: Stable Cascade uses Stages A & B to compress images, while Stage C handles the text-conditional
learning. Therefore, it makes sense to train a LoRA or ControlNet **only** for Stage C. After all, you wouldn't train a
LoRA or ControlNet for the Stable Diffusion VAE either, right?
|
|
|
## Basics |
|
In the [training configs](../configs/training) folder we provide config files for all trainings. All config files |
|
follow a similar structure and only contain the most essential parameters you need to set. Let's take a look at the |
|
structure each config follows: |
|
|
|
First, you set the run name, the checkpoint and output folders, and which model version you want to train.
|
```yaml |
|
experiment_id: stage_c_3b_controlnet_base |
|
checkpoint_path: /path/to/checkpoint |
|
output_path: /path/to/output |
|
model_version: 3.6B |
|
``` |
|
|
|
Next, you can set your [Weights & Biases](https://wandb.ai/) information if you want to use it for logging.
|
```yaml |
|
wandb_project: StableCascade |
|
wandb_entity: wandb_username |
|
``` |
|
|
|
Afterwards, you define the training parameters. |
|
```yaml |
|
lr: 1.0e-4 |
|
batch_size: 256 |
|
image_size: 768 |
|
multi_aspect_ratio: [1/1, 1/2, 1/3, 2/3, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 9/16] |
|
grad_accum_steps: 1 |
|
updates: 500000 |
|
backup_every: 50000 |
|
save_every: 2000 |
|
warmup_updates: 1 |
|
use_fsdp: False |
|
``` |
|
|
|
Most of them will probably already be familiar to you. A few clarifications though: `updates` refers to the number of
training steps, `backup_every` creates additional checkpoints so you can revert to earlier ones if needed, and
`save_every` determines how often models are saved and sampling is performed. Furthermore, since distributed training
is essential when training large models from scratch or doing large finetunes, we provide an option to use PyTorch's
[**Fully Sharded Data Parallel (FSDP)**](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/). You
can enable it by setting `use_fsdp: True`. Note that you will need multiple GPUs for FSDP. However, as mentioned
above, this is only needed for large runs. You can still train and finetune our largest models on a powerful single
machine. <br><br>
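
For reference, here is a minimal sketch of what FSDP wrapping looks like in plain PyTorch. This is not the repository's training loop, just an illustration of the mechanism that `use_fsdp: True` enables, assuming a multi-GPU machine and a `torchrun` launch:

```python
# Minimal FSDP sketch (illustration only, not the repository's training loop).
# Run with e.g.: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the real model
model = FSDP(model)                         # parameters are sharded across all GPUs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()                             # gradients are reduced and re-sharded
optimizer.step()
```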
|
Another thing we provide is training with **Multi-Aspect-Ratio**. You can set the aspect ratios you want in the list |
|
for `multi_aspect_ratio`.<br><br> |
|
|
|
For diffusion models, having an EMA (Exponential Moving Average) model can drastically improve the performance of
your model. To include an EMA model in your training, you can set the following parameters; otherwise you can simply
leave them out. A short sketch of what they control follows the config snippet below.
|
```yaml |
|
ema_start_iters: 5000 |
|
ema_iters: 100 |
|
ema_beta: 0.9 |
|
``` |
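
For intuition, the EMA model is just a shadow copy of the weights that is blended towards the live weights at regular intervals, with `ema_beta` controlling how slowly the shadow copy moves. A minimal sketch (not the repository's implementation):

```python
import copy
import torch

def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, beta: float) -> None:
    # Blend the EMA ("shadow") weights towards the current training weights.
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(beta).add_(p, alpha=1.0 - beta)

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)        # typically created once training has warmed up
ema_update(ema_model, model, beta=0.9)  # `ema_beta` controls how slowly the EMA moves
```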
|
|
|
Next, you can define the dataset that you want to use. Note that the code uses
|
[webdataset](https://github.com/webdataset/webdataset) for this. |
|
```yaml |
|
webdataset_path: |
|
- s3://path/to/your/first/dataset/on/s3 |
|
- file:/path/to/your/local/dataset.tar |
|
``` |
|
You can set as many dataset paths as you want, and they can either be on |
|
[Amazon S3 storage](https://aws.amazon.com/s3/) or just local. |
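
If you just want to sanity-check that a tar file can be read by webdataset, a minimal loading sketch looks roughly like this (the `jpg`/`txt` key names correspond to the file extensions inside the tar, as described in the Dataset section below):

```python
import webdataset as wds

# Iterate over (image, caption) pairs from a local tar file.
dataset = (
    wds.WebDataset("file:/path/to/your/local/dataset.tar")
    .decode("pil")                # decode image bytes to PIL images
    .to_tuple("jpg;png", "txt")   # pick the image and the caption of each sample
)

for image, caption in dataset:
    print(image.size, caption[:80])
    break
```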
|
<br><br> |
|
There are a few more specifics to each kind of training and to datasets in general. These will be discussed below. |
|
|
|
## Starting a Training |
|
You can start an actual training very easily by first moving to the root directory of this repository (so [here](..)).
Next, the command looks like the following:
```bash
python3 <training_file> <training_config>
```
|
For example, if you want to train a LoRA model, the command would look like this: |
|
```bash
python3 train/train_c_lora.py configs/training/finetune_c_3b_lora.yaml
```
|
|
|
Moreover, we also provide a [bash script](example_train.sh) for working with Slurm. Note that this assumes you have access to a cluster
that runs Slurm as the cluster manager.
|
|
|
## Dataset |
|
As mentioned above, the code uses [webdataset](https://github.com/webdataset/webdataset) for working with datasets, |
|
because this library supports working with large amounts of data very easily. In case you want to **finetune** a model, |
|
train a **LoRA** or train a **ControlNet**, you might not have your data in the webdataset format yet. Therefore, here is
a simple example of how you can convert your dataset into the appropriate format.
|
1. Put all your images and captions into a folder |
|
2. Rename them to have the same number / id as the name. For example: |
|
`0000.jpg, 0000.txt, 0001.jpg, 0001.txt, 0002.jpg, 0002.txt, 0003.jpg, 0003.txt` |
|
3. Run the following command: ``tar --sort=name -cf dataset.tar dataset/`` or manually create a tar file from the folder (for example with the short Python sketch after this list)
|
4. Set the `webdataset_path: file:/path/to/your/local/dataset.tar` in the config file |
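
If you prefer doing step 3 from Python instead of with `tar`, a small sketch could look like this (assuming your folder from step 2 is called `dataset/`):

```python
import tarfile
from pathlib import Path

dataset_dir = Path("dataset")
with tarfile.open("dataset.tar", "w") as tar:
    # Sorted order keeps the matching .jpg / .txt (and optional .json) files of a
    # sample next to each other, which is what webdataset expects.
    for file in sorted(dataset_dir.iterdir()):
        tar.add(file, arcname=file.name)
```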
|
|
|
Next, there are a few more settings that might be helpful to you, especially when working with large datasets that
contain additional per-image metadata, such as scores you want to filter on. You can apply
dataset filters like the following in the config file:
|
```yaml |
|
dataset_filters: |
|
- ['aesthetic_score', 'lambda s: s > 4.5'] |
|
- ['nsfw_probability', 'lambda s: s < 0.01'] |
|
``` |
|
In this case, you would have `0000.json, 0001.json, 0002.json, 0003.json` in your dataset as well, with keys for |
|
`aesthetic_score` and `nsfw_probability`. |
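
For illustration, such a metadata file could be created like this; the keys just have to match the names you reference in `dataset_filters`:

```python
import json

# Hypothetical metadata for sample 0000; the filter lambdas above are applied to these values.
metadata = {"aesthetic_score": 5.2, "nsfw_probability": 0.001}
with open("dataset/0000.json", "w") as f:
    json.dump(metadata, f)
```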
|
|
|
## Starting from a Pretrained Model |
|
If you want to finetune any model, you need the pretrained models. You can find details on how to download them in the
[models](../models) section. After downloading them, you also need to adjust the checkpoint paths in the config file.
|
See below for example config files. |
|
|
|
## Text-to-Image Training |
|
You can use the following configs for finetuning Stage C on your own datasets. All necessary parameters were already
explained above, so there is nothing new here. Take a look at the config for finetuning the
|
[3.6B Stage C](../configs/training/finetune_c_3b.yaml) and the [1B Stage C](../configs/training/finetune_c_1b.yaml). |
|
|
|
## ControlNet Training |
|
Training a ControlNet requires setting some extra parameters, as well as adding the specific ControlNet filter you want.
By filter, we simply mean a class that performs, for example, Canny Edge Detection, Human Pose Detection, etc.
|
```yaml |
|
controlnet_blocks: [0, 4, 8, 12, 51, 55, 59, 63] |
|
controlnet_filter: CannyFilter |
|
controlnet_filter_params: |
|
resize: 224 |
|
``` |
|
Here we need to give a little more detail on what Stage C's architecture looks like. It is basically just a stack of
residual blocks (convolutional and attention) that all work at the same latent resolution. We **do not** use a UNet,
and this is where `controlnet_blocks` comes in. It determines at which blocks you want to inject the controlling
information. This way, the ControlNet architecture differs from the common one used in Stable Diffusion, where you
create an entire copy of the UNet's encoder. With Stable Cascade it is a bit simpler and comes with the great
benefit of using far fewer parameters. <br>
|
Next, you define the class that filters the images and extracts the information you want to condition Stage C on
(Canny Edge Detection, Human Pose Detection, etc.) with the `controlnet_filter` parameter. In the example, we use the
CannyFilter defined in the [controlnet.py](../modules/controlnet.py) file. This is the place where you can add your own
ControlNet filters (see the sketch after the config links below for a rough idea of what such a class does). Lastly,
`controlnet_filter_params` simply passes additional parameters to your `controlnet_filter` class. That's it. You can
view the example ControlNet configs for
|
[Inpainting / Outpainting](../configs/training/controlnet_c_3b_inpainting.yaml), |
|
[Face Identity](../configs/training/controlnet_c_3b_identity.yaml), |
|
[Canny](../configs/training/controlnet_c_3b_canny.yaml) and |
|
[Super Resolution](../configs/training/controlnet_c_3b_sr.yaml). |
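
As a rough illustration of what such a filter does, here is a hypothetical Canny-style filter class. The exact interface the training code expects is defined in [controlnet.py](../modules/controlnet.py); the class name and method signature below are assumptions for illustration only, not the repository's `CannyFilter`:

```python
import cv2
import numpy as np
import torch
import torchvision.transforms.functional as TF

class MyCannyFilter:
    """Hypothetical sketch: turns a batch of images into edge maps used as conditioning."""

    def __init__(self, resize: int = 224):
        self.resize = resize  # would be set via `controlnet_filter_params`

    def __call__(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) with values in [0, 1]
        images = TF.resize(images, [self.resize, self.resize])
        edges = []
        for img in images:
            gray = (img.mean(dim=0).cpu().numpy() * 255).astype(np.uint8)
            edges.append(torch.from_numpy(cv2.Canny(gray, 100, 200)).float() / 255.0)
        return torch.stack(edges).unsqueeze(1)  # (B, 1, resize, resize) conditioning maps
```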
|
|
|
## LoRA Training |
|
To train a LoRA on Stage C, you have a few more parameters available to set for the training. |
|
```yaml |
|
module_filters: ['.attn'] |
|
rank: 4 |
|
train_tokens: |
|
  # - ['^snail', null] # existing tokens starting with "snail" ("snail" & "snails") are trained and don't need to be reinitialized
  - ['[fernando]', '^dog</w>'] # new custom token [fernando], initialized as the avg of the tokens matching "dog"
|
``` |
|
These include `module_filters`, which simply determines on which modules you want to train LoRA layers. In the
example above, these are the attention layers (`.attn`). Currently, only linear layers can be LoRA'd.
However, adding support for different layers (like convolutions) is possible as well. <br>
You can also set the `rank`, and whether you want to learn a specific token for your training. The latter can be done by
setting `train_tokens`, which expects a list of two things for each element: the token you want to train and a regex for
the token / tokens that you want to use for initializing it. In the example above, a token `[fernando]` is
created and is initialized with the average of all tokens that include the word `dog`. Note that in order to **add** a new
token, **it has to start with `[` and end with `]`**. There is also the option of using existing tokens, which will be
trained. For this, you just enter the token, **without** placing `[ ]` around it, like in the commented example above
for the token `snail`. The second element is `null`, because we don't initialize this token and just finetune the
existing `snail` token. <br>
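
For intuition about what `rank` controls: a LoRA layer keeps the original linear weight frozen and learns a low-rank additive update. A minimal sketch (not the repository's implementation, just an illustration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: frozen base layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the LoRA matrices A and B are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(1280, 1280), rank=4)  # `rank` as set in the config
out = layer(torch.randn(2, 1280))
```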
|
You can find an example config for training a LoRA [here](../configs/training/finetune_c_3b_lora.yaml). |
|
Additionally, you can also download an
[example dataset](https://huggingface.co/dome272/stable-cascade/blob/main/fernando.tar) of a cute little good boy dog.
Simply download it and set the dataset path in the config file to wherever you stored the tar file.
|
|
|
## Image Reconstruction Training |
|
Here we mainly focus on training **Stage B**, because it does most of the heavy lifting for the compression, while
Stage A only applies a very small compression, so its reconstructions are near perfect. Why do we use Stage A at all? The
reason is simply to make the training and inference of Stage B cheaper and faster. With Stage A in place, Stage B works
at a 4x spatially smaller resolution (for example `1 x 4 x 256 x 256` instead of `1 x 3 x 1024 x 1024`). Furthermore, we observed that
Stage B learns faster when using Stage A compared to learning Stage B directly in pixel space. So why would you
even want to train Stage B? Either you want to try to achieve an even higher compression or finetune on something
very specific, but this will probably be a rare occasion. If you do want to, you can take a look at the training config
for the large Stage B [here](../configs/training/finetune_b_3b.yaml) or for the small Stage B
[here](../configs/training/finetune_b_700m.yaml).
|
|
|
## Remarks |
|
The codebase is in early development. You might encounter unexpected errors or not perfectly optimized training and |
|
inference code. We apologize for that in advance. If there is interest, we will continue releasing updates to it, |
|
aiming to bring in the latest improvements and optimizations. Moreover, we would be more than happy to receive |
|
ideas, feedback, or even updates from people who would like to contribute. Cheers.