|
# Training |
|
<p align="center"> |
|
<img src="../figures/collage_3.jpg" width="600"> |
|
</p> |
|
|
|
This directory provides the training code for Stable Cascade, as well as guides on downloading the models you need.
|
Specifically, you can find training scripts for the following use-cases: |
|
- Text-to-Image |
|
- ControlNet |
|
- LoRA |
|
- Image Reconstruction |
|
|
|
#### Note: |
|
A quick clarification: Stable Cascade uses Stages A & B to compress images, while Stage C handles the text-conditional
learning. Therefore, it makes sense to train a LoRA or ControlNet **only** for Stage C. After all, you wouldn't train a
LoRA or ControlNet for the Stable Diffusion VAE either, right?
|
|
|
## Basics |
|
In the [training configs](../configs/training) folder we provide config files for all trainings. All config files |
|
follow a similar structure and only contain the most essential parameters you need to set. Let's take a look at the |
|
structure each config follows: |
|
|
|
First, you set the run name, the checkpoint and output folders, and which model version you want to train.
|
```yaml |
|
experiment_id: stage_c_3b_controlnet_base |
|
checkpoint_path: /path/to/checkpoint |
|
output_path: /path/to/output |
|
model_version: 3.6B |
|
``` |
|
|
|
Next, you can set your [Weights & Biases](https://wandb.ai/) information if you want to use it for logging.
|
```yaml |
|
wandb_project: StableCascade |
|
wandb_entity: wandb_username |
|
``` |
|
|
|
Afterwards, you define the training parameters. |
|
```yaml |
|
lr: 1.0e-4 |
|
batch_size: 256 |
|
image_size: 768 |
|
multi_aspect_ratio: [1/1, 1/2, 1/3, 2/3, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 9/16] |
|
grad_accum_steps: 1 |
|
updates: 500000 |
|
backup_every: 50000 |
|
save_every: 2000 |
|
warmup_updates: 1 |
|
use_fsdp: False |
|
``` |
|
|
|
Most of them will probably already be familiar to you. A few clarifications though: `updates` refers to the number of
training steps, `backup_every` creates additional checkpoints so you can revert to earlier ones if needed, and
`save_every` determines how often models are saved and sampling is performed. Furthermore, since distributed training
is essential when training large models from scratch or doing large finetunes, we provide an option to use PyTorch's
[**Fully Sharded Data Parallel (FSDP)**](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/). You
can enable it by setting `use_fsdp: True`. Note that you will need multiple GPUs for FSDP. However, as mentioned
above, this is only needed for large runs. You can still train and finetune our largest models on a powerful single
machine. <br><br>
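
For reference, here is a minimal sketch of what FSDP wrapping looks like in plain PyTorch. This is not the repository's training loop, just an illustration of the mechanism that `use_fsdp: True` enables, assuming a multi-GPU machine and a `torchrun` launch:

```python
# Minimal FSDP sketch (illustration only, not the repository's training loop).
# Run with e.g.: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for the real model
model = FSDP(model)                         # parameters are sharded across all GPUs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()                             # gradients are reduced and re-sharded
optimizer.step()
```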
|
Another thing we provide is training with **Multi-Aspect-Ratio**. You can set the aspect ratios you want in the list |
|
for `multi_aspect_ratio`.<br><br> |
|
|
|
For diffusion models, having an EMA (Exponential Moving Average) model can drastically improve the performance of
your model. To include an EMA model in your training, you can set the following parameters; otherwise you can simply
leave them out. A short sketch of what they control follows the config snippet below.
|
```yaml |
|
ema_start_iters: 5000 |
|
ema_iters: 100 |
|
ema_beta: 0.9 |
|
``` |
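
For intuition, the EMA model is just a shadow copy of the weights that is blended towards the live weights at regular intervals, with `ema_beta` controlling how slowly the shadow copy moves. A minimal sketch (not the repository's implementation):

```python
import copy
import torch

def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, beta: float) -> None:
    # Blend the EMA ("shadow") weights towards the current training weights.
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(beta).add_(p, alpha=1.0 - beta)

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)        # typically created once training has warmed up
ema_update(ema_model, model, beta=0.9)  # `ema_beta` controls how slowly the EMA moves
```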
|
|
|
Next, you can define the dataset that you want to use. Note that the code uses
|
[webdataset](https://github.com/webdataset/webdataset) for this. |
|
```yaml |
|
webdataset_path: |
|
- s3://path/to/your/first/dataset/on/s3 |
|
- file:/path/to/your/local/dataset.tar |
|
``` |
|
You can set as many dataset paths as you want, and they can either be on |
|
[Amazon S3 storage](https://aws.amazon.com/s3/) or just local. |
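
If you just want to sanity-check that a tar file can be read by webdataset, a minimal loading sketch looks roughly like this (the `jpg`/`txt` key names correspond to the file extensions inside the tar, as described in the Dataset section below):

```python
import webdataset as wds

# Iterate over (image, caption) pairs from a local tar file.
dataset = (
    wds.WebDataset("file:/path/to/your/local/dataset.tar")
    .decode("pil")                # decode image bytes to PIL images
    .to_tuple("jpg;png", "txt")   # pick the image and the caption of each sample
)

for image, caption in dataset:
    print(image.size, caption[:80])
    break
```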
|
<br><br> |
|
There are a few more specifics to each kind of training and to datasets in general. These will be discussed below. |
|
|
|
## Starting a Training |
|
You can start an actual training very easily by first moving to the root directory of this repository (so [here](..)).
Next, the command looks like the following:
```bash
python3 <training_file> <training_config>
```
|
For example, if you want to train a LoRA model, the command would look like this: |
|
```bash
python3 train/train_c_lora.py configs/training/finetune_c_3b_lora.yaml
```
|
|
|
Moreover, we also provide a [bash script](example_train.sh) for working with Slurm. Note that this assumes you have access to a cluster
that runs Slurm as the cluster manager.
|
|
|
## Dataset |
|
As mentioned above, the code uses [webdataset](https://github.com/webdataset/webdataset) for working with datasets, |
|
because this library supports working with large amounts of data very easily. In case you want to **finetune** a model, |
|
train a **LoRA** or train a **ControlNet**, you might not have your data in the webdataset format yet. Therefore, here is
a simple example of how you can convert your dataset into the appropriate format.
|
1. Put all your images and captions into a folder |
|
2. Rename them to have the same number / id as the name. For example: |
|
`0000.jpg, 0000.txt, 0001.jpg, 0001.txt, 0002.jpg, 0002.txt, 0003.jpg, 0003.txt` |
|
3. Run the following command: ``tar --sort=name -cf dataset.tar dataset/`` or manually create a tar file from the folder (for example with the short Python sketch after this list)
|
4. Set the `webdataset_path: file:/path/to/your/local/dataset.tar` in the config file |
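
If you prefer doing step 3 from Python instead of with `tar`, a small sketch could look like this (assuming your folder from step 2 is called `dataset/`):

```python
import tarfile
from pathlib import Path

dataset_dir = Path("dataset")
with tarfile.open("dataset.tar", "w") as tar:
    # Sorted order keeps the matching .jpg / .txt (and optional .json) files of a
    # sample next to each other, which is what webdataset expects.
    for file in sorted(dataset_dir.iterdir()):
        tar.add(file, arcname=file.name)
```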
|
|
|
Next, there are a few more settings that might be helpful to you, especially when working with large datasets that
contain additional per-image metadata, such as scores you want to filter on. You can apply
dataset filters like the following in the config file:
|
```yaml |
|
dataset_filters: |
|
- ['aesthetic_score', 'lambda s: s > 4.5'] |
|
- ['nsfw_probability', 'lambda s: s < 0.01'] |
|
``` |
|
In this case, you would have `0000.json, 0001.json, 0002.json, 0003.json` in your dataset as well, with keys for |
|
`aesthetic_score` and `nsfw_probability`. |
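
For illustration, such a metadata file could be created like this; the keys just have to match the names you reference in `dataset_filters`:

```python
import json

# Hypothetical metadata for sample 0000; the filter lambdas above are applied to these values.
metadata = {"aesthetic_score": 5.2, "nsfw_probability": 0.001}
with open("dataset/0000.json", "w") as f:
    json.dump(metadata, f)
```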
|
|
|
## Starting from a Pretrained Model |
|
If you want to finetune any model, you need the pretrained models. You can find details on how to download them in the
[models](../models) section. After downloading them, you also need to adjust the checkpoint paths in the config file.
|
See below for example config files. |
|
|
|
## Text-to-Image Training |
|
You can use the following configs for finetuning Stage C on your own datasets. All necessary parameters were already
explained above, so there is nothing new here. Take a look at the config for finetuning the
|
[3.6B Stage C](../configs/training/finetune_c_3b.yaml) and the [1B Stage C](../configs/training/finetune_c_1b.yaml). |
|
|
|
## ControlNet Training |
|
Training a ControlNet requires setting some extra parameters, as well as adding the specific ControlNet filter you want.
By filter, we simply mean a class that performs, for example, Canny Edge Detection, Human Pose Detection, etc.
|
```yaml |
|
controlnet_blocks: [0, 4, 8, 12, 51, 55, 59, 63] |
|
controlnet_filter: CannyFilter |
|
controlnet_filter_params: |
|
resize: 224 |
|
``` |
|
Here we need to give a little more detail on what Stage C's architecture looks like. It is basically just a stack of
residual blocks (convolutional and attention) that all work at the same latent resolution. We **do not** use a UNet,
and this is where `controlnet_blocks` comes in. It determines at which blocks you want to inject the controlling
information. This way, the ControlNet architecture differs from the common one used in Stable Diffusion, where you
create an entire copy of the UNet's encoder. With Stable Cascade it is a bit simpler and comes with the great
benefit of using far fewer parameters. <br>
|
Next, you define the class that filters the images and extracts the information you want to condition Stage C on
(Canny Edge Detection, Human Pose Detection, etc.) with the `controlnet_filter` parameter. In the example, we use the
CannyFilter defined in the [controlnet.py](../modules/controlnet.py) file. This is the place where you can add your own
ControlNet filters (see the sketch after the config links below for a rough idea of what such a class does). Lastly,
`controlnet_filter_params` simply passes additional parameters to your `controlnet_filter` class. That's it. You can
view the example ControlNet configs for
|
[Inpainting / Outpainting](../configs/training/controlnet_c_3b_inpainting.yaml), |
|
[Face Identity](../configs/training/controlnet_c_3b_identity.yaml), |
|
[Canny](../configs/training/controlnet_c_3b_canny.yaml) and |
|
[Super Resolution](../configs/training/controlnet_c_3b_sr.yaml). |
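
As a rough illustration of what such a filter does, here is a hypothetical Canny-style filter class. The exact interface the training code expects is defined in [controlnet.py](../modules/controlnet.py); the class name and method signature below are assumptions for illustration only, not the repository's `CannyFilter`:

```python
import cv2
import numpy as np
import torch
import torchvision.transforms.functional as TF

class MyCannyFilter:
    """Hypothetical sketch: turns a batch of images into edge maps used as conditioning."""

    def __init__(self, resize: int = 224):
        self.resize = resize  # would be set via `controlnet_filter_params`

    def __call__(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) with values in [0, 1]
        images = TF.resize(images, [self.resize, self.resize])
        edges = []
        for img in images:
            gray = (img.mean(dim=0).cpu().numpy() * 255).astype(np.uint8)
            edges.append(torch.from_numpy(cv2.Canny(gray, 100, 200)).float() / 255.0)
        return torch.stack(edges).unsqueeze(1)  # (B, 1, resize, resize) conditioning maps
```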
|
|
|
## LoRA Training |
|
To train a LoRA on Stage C, you have a few more parameters available to set for the training. |
|
```yaml |
|
module_filters: ['.attn'] |
|
rank: 4 |
|
train_tokens: |
|
  # - ['^snail', null] # existing tokens starting with "snail" ("snail" & "snails") are trained and don't need to be reinitialized
  - ['[fernando]', '^dog</w>'] # new custom token [fernando], initialized as the avg of the tokens matching "dog"
|
``` |
|
These include `module_filters`, which simply determines on which modules you want to train LoRA layers. In the
example above, these are the attention layers (`.attn`). Currently, only linear layers can be LoRA'd.
However, adding support for different layers (like convolutions) is possible as well. <br>
You can also set the `rank`, and whether you want to learn a specific token for your training. The latter can be done by
setting `train_tokens`, which expects a list of two things for each element: the token you want to train and a regex for
the token / tokens that you want to use for initializing it. In the example above, a token `[fernando]` is
created and is initialized with the average of all tokens that include the word `dog`. Note that in order to **add** a new
token, **it has to start with `[` and end with `]`**. There is also the option of using existing tokens, which will be
trained. For this, you just enter the token, **without** placing `[ ]` around it, like in the commented example above
for the token `snail`. The second element is `null`, because we don't initialize this token and just finetune the
existing `snail` token. <br>
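
For intuition about what `rank` controls: a LoRA layer keeps the original linear weight frozen and learns a low-rank additive update. A minimal sketch (not the repository's implementation, just an illustration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: frozen base layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the LoRA matrices A and B are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(1280, 1280), rank=4)  # `rank` as set in the config
out = layer(torch.randn(2, 1280))
```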
|
You can find an example config for training a LoRA [here](../configs/training/finetune_c_3b_lora.yaml). |
|
Additionally, you can also download an
[example dataset](https://huggingface.co/dome272/stable-cascade/blob/main/fernando.tar) of a cute little good boy dog.
Simply download it and set the dataset path in the config file to wherever you stored the tar file.
|
|
|
## Image Reconstruction Training |
|
Here we mainly focus on training **Stage B**, because it does most of the heavy lifting for the compression, while
Stage A only applies a very small compression, so its reconstructions are near perfect. Why do we use Stage A at all? The
reason is simply to make the training and inference of Stage B cheaper and faster. With Stage A in place, Stage B works
at a 4x spatially smaller resolution (for example `1 x 4 x 256 x 256` instead of `1 x 3 x 1024 x 1024`). Furthermore, we observed that
Stage B learns faster when using Stage A compared to learning Stage B directly in pixel space. So why would you
even want to train Stage B? Either you want to try to achieve an even higher compression or finetune on something
very specific, but this will probably be a rare occasion. If you do want to, you can take a look at the training config
for the large Stage B [here](../configs/training/finetune_b_3b.yaml) or for the small Stage B
[here](../configs/training/finetune_b_700m.yaml).
|
|
|
## Remarks |
|
The codebase is in early development. You might encounter unexpected errors or not perfectly optimized training and |
|
inference code. We apologize for that in advance. If there is interest, we will continue releasing updates to it, |
|
aiming to bring in the latest improvements and optimizations. Moreover, we would be more than happy to receive |
|
ideas, feedback, or even updates from people who would like to contribute. Cheers.