	---
	pipeline_tag: text-to-image
	license: other
	license_name: faipl-1.0-sd
	license_link: LICENSE
	decoder:
	- Disty0/sotediffusion-wuerstchen3-alpha1-decoder
	---


	# SoteDiffusion Wuerstchen3

	Anime finetune of Würstchen V3.
	Currently in an early stage of training.
	No commercial use thanks to StabilityAI.

	# Release Notes

	- Ran LLaVa on the images that have the "english text" tag.
	This adds a `The text says "text"` tag.
	If LLaVa cannot read the text, it describes the image instead.

	<style>
	.image {
	float: left;
	margin-left: 10px;
	}
	</style>

	<table>
	<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/pFev-xGFut0o3qlZQJpwb.png" width="320">
	<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/adhnnnmRBkTULFNl9AfT2.png" width="320">
	</table>

	# UI Guide

	## SD.Next
	URL: https://github.com/vladmandic/automatic/

	Go to Models -> Huggingface, type `Disty0/sotediffusion-wuerstchen3-alpha2-decoder` into the model name field, and press download.
	Load `Disty0/sotediffusion-wuerstchen3-alpha2-decoder` after the download is complete.

	Prompt:
	```
	very aesthetic, best quality, newest,
	```

	Negative Prompt:
	```
	very displeasing, worst quality, oldest, monochrome, sketch, realistic,
	```

	Parameters:
	Sampler: Default

	Steps: 30 or 40
	Refiner Steps: 10

	CFG: 8
	Secondary CFG: 1 or 1.2

	Resolution: 1024x1536, 2048x1152
	Any resolution works as long as both dimensions are multiples of 128.
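The multiple-of-128 rule can be enforced before a resolution is passed to the UI or to the pipeline. This is a minimal sketch: `snap_to_multiple` and `is_valid_resolution` are hypothetical helper names, not part of SD.Next or diffusers, and rounding to the nearest multiple is an assumption (rounding down would satisfy the rule just as well).

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round a positive size to the nearest multiple (ties round up)."""
    if value <= 0:
        raise ValueError("value must be positive")
    snapped = (value + multiple // 2) // multiple * multiple
    # Never return 0 for very small inputs.
    return max(multiple, snapped)


def is_valid_resolution(width: int, height: int, multiple: int = 128) -> bool:
    """Check that both dimensions are positive multiples of `multiple`."""
    return (
        width > 0
        and height > 0
        and width % multiple == 0
        and height % multiple == 0
    )


# A common 1920x1080 request becomes a resolution that satisfies the rule.
width, height = snap_to_multiple(1920), snap_to_multiple(1080)
print(width, height, is_valid_resolution(width, height))
```

Both resolutions recommended above (1024x1536 and 2048x1152) already pass `is_valid_resolution` unchanged.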


	## ComfyUI

	Please refer to CivitAI: https://civitai.com/models/353284


	# Code Example

	```shell
	pip install diffusers
	```

	```python
	import torch
	from diffusers import StableCascadeCombinedPipeline

	device = "cuda"
	dtype = torch.bfloat16
	model = "Disty0/sotediffusion-wuerstchen3-alpha2-decoder"

	pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

	# send everything to the gpu:
	pipe = pipe.to(device, dtype=dtype)
	pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)

	# or enable model offload to save vram:
	# pipe.enable_model_cpu_offload()



	prompt = "1girl, solo, cowboy shot, straight hair, looking at viewer, hoodie, indoors, slight smile, casual, furniture, doorway, very aesthetic, best quality, newest,"
	negative_prompt = "very displeasing, worst quality, oldest, monochrome, sketch, realistic,"

	output = pipe(
	    width=1024,
	    height=1536,
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    decoder_guidance_scale=1.0,
	    prior_guidance_scale=8.0,
	    prior_num_inference_steps=40,
	    output_type="pil",
	    num_inference_steps=10,
	).images[0]

	# do something with the output image
	```


	## Training Status:

	GPU used for training: 1x AMD RX 7900 XTX 24GB
	GPU Hours: 250 (cumulative, starting from alpha1)

	| dataset name | training done | remaining |
	|---|---|---|
	| newest | 010 | 221 |
	| recent | 010 | 162 |
	| mid | 010 | 114 |
	| early | 010 | 060 |
	| oldest | 010 | 010 |
	| pixiv | 010 | 032 |
	| visual novel cg | 010 | 018 |
	| anime wallpaper | 010 | 003 |
	| Total | 88 | 620 |

	Note: chunk numbering starts from 0, and there are 8000 images per chunk.
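The chunk arithmetic behind this table can be reproduced with a few lines of Python. This is a sketch under two assumptions: that the "training done" column holds the index of the last trained chunk (so `010` means 11 chunks, which matches the total of 88 across 8 datasets), and that the last chunk of a dataset may be partially filled. The image counts come from the Dataset section of this README; the function names are hypothetical.

```python
IMAGES_PER_CHUNK = 8000  # stated in the note above


def total_chunks(total_images: int) -> int:
    """Number of chunks needed to hold all images (ceiling division)."""
    return (total_images + IMAGES_PER_CHUNK - 1) // IMAGES_PER_CHUNK


def remaining_chunks(total_images: int, last_trained_chunk: int) -> int:
    """Chunks still to train, given the 0-based index of the last trained chunk."""
    chunks_done = last_trained_chunk + 1  # chunk numbering starts from 0
    return total_chunks(total_images) - chunks_done


# Image counts taken from the Dataset section of this README.
for name, images in [("newest", 1_848_331), ("oldest", 160_397)]:
    print(name, total_chunks(images), remaining_chunks(images, last_trained_chunk=10))
```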


	## Dataset:

	GPU used for captioning: 1x Intel ARC A770 16GB
	GPU Hours: 350

	Model used for captioning: SmilingWolf/wd-swinv2-tagger-v3
	Command:
	```
	python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
	```


	| dataset name | total images | total chunks |
	|---|---|---|
	| newest | 1.848.331 | 232 |
	| recent | 1.380.630 | 173 |
	| mid | 993.227 | 125 |
	| early | 566.152 | 071 |
	| oldest | 160.397 | 021 |
	| pixiv | 343.614 | 043 |
	| visual novel cg | 231.358 | 029 |
	| anime wallpaper | 104.790 | 014 |
	| Total | 5.628.499 | 708 |

	Note:
	- Smallest size is 1280x600 | 768.000 pixels
	- Deduped based on image similarity using czkawka-cli


	## Tags:

	The model is trained with random tag order, but for reference, this is the tag order in the dataset:
	```
	aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
	```
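The ordering above can be sketched as a small prompt builder. `build_prompt` and all of its parameter names are hypothetical and only illustrate the dataset order; because the model was trained with shuffled captions, the optional shuffle is included to mirror that, not because a specific order is required.

```python
import random
from typing import Optional, Sequence


def build_prompt(
    tags: Sequence[str],
    aesthetic: Optional[str] = None,
    quality: Optional[str] = None,
    date: Optional[str] = None,
    custom: Sequence[str] = (),
    rating: Optional[str] = None,
    character: Sequence[str] = (),
    series: Sequence[str] = (),
    shuffle_seed: Optional[int] = None,
) -> str:
    """Join tags in the dataset order: aesthetic, quality, date, custom,
    rating, character, series, then the rest of the tags."""
    ordered = [tag for tag in (aesthetic, quality, date) if tag]
    ordered.extend(custom)
    if rating:
        ordered.append(rating)
    ordered.extend(character)
    ordered.extend(series)
    ordered.extend(tags)
    if shuffle_seed is not None:
        # Optional: randomize the order, similar in spirit to shuffled captions.
        random.Random(shuffle_seed).shuffle(ordered)
    return ", ".join(ordered)


prompt = build_prompt(
    ["1girl", "solo", "looking at viewer"],
    aesthetic="very aesthetic",
    quality="best quality",
    date="newest",
    rating="general",
)
print(prompt)
```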

	### Date:

	| tag | date |
	|---|---|
	| newest | 2022 to 2024 |
	| recent | 2019 to 2021 |
	| mid | 2015 to 2018 |
	| early | 2011 to 2014 |
	| oldest | 2005 to 2010 |
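The date table maps directly to a lookup by year. This is a minimal sketch: `date_tag` is a hypothetical helper, and returning `None` for years outside 2005 to 2024 is an assumption, since the table does not say how such images were handled.

```python
from typing import Optional

# (first year, last year, tag), taken from the table above.
DATE_RANGES = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the year is not covered."""
    for first, last, tag in DATE_RANGES:
        if first <= year <= last:
            return tag
    return None


print(date_tag(2023), date_tag(2016), date_tag(1999))
```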

	### Aesthetic Tags:
	Model used: shadowlilac/aesthetic-shadow-v2

	| score greater than | tag | count |
	|---|---|---|
	| 0.90 | extremely aesthetic | 125.451 |
	| 0.80 | very aesthetic | 887.382 |
	| 0.70 | aesthetic | 1.049.857 |
	| 0.50 | slightly aesthetic | 1.643.091 |
	| 0.40 | not displeasing | 569.543 |
	| 0.30 | not aesthetic | 445.188 |
	| 0.20 | slightly displeasing | 341.424 |
	| 0.10 | displeasing | 237.660 |
	| rest of them | very displeasing | 328.712 |

	### Quality Tags:
	Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

	| score greater than | tag | count |
	|---|---|---|
	| 0.980 | best quality | 1.270.447 |
	| 0.900 | high quality | 498.244 |
	| 0.750 | great quality | 351.006 |
	| 0.500 | medium quality | 366.448 |
	| 0.250 | normal quality | 368.380 |
	| 0.125 | bad quality | 279.050 |
	| 0.025 | low quality | 538.958 |
	| rest of them | worst quality | 1.955.966 |
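Both the aesthetic and the quality tables are threshold lookups, so one function can cover them. The sketch below does not run either scoring model; it only maps an already computed score to a tag. The function names are hypothetical, and treating the boundaries as strictly "greater than" is an assumption based on the table header.

```python
from typing import List, Tuple

# (threshold, tag) pairs in descending order, copied from the tables above.
AESTHETIC_THRESHOLDS: List[Tuple[float, str]] = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_THRESHOLDS: List[Tuple[float, str]] = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]


def score_to_tag(score: float, thresholds: List[Tuple[float, str]], fallback: str) -> str:
    """Return the tag of the first threshold the score exceeds, else the fallback."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback


def aesthetic_tag(score: float) -> str:
    return score_to_tag(score, AESTHETIC_THRESHOLDS, "very displeasing")


def quality_tag(score: float) -> str:
    return score_to_tag(score, QUALITY_THRESHOLDS, "worst quality")


print(aesthetic_tag(0.85), "|", quality_tag(0.99))
```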

	### Rating Tags:

	| tag | count |
	|---|---|
	| general | 1.416.451 |
	| sensitive | 3.447.664 |
	| nsfw | 427.459 |
	| explicit nsfw | 336.925 |

	### Custom Tags:

	| dataset name | custom tag |
	|---|---|
	| image boards | date, |
	| pixiv | art by Display_Name, |
	| visual novel cg | Full_VN_Name (short_3_letter_name), visual novel cg, |
	| anime wallpaper | date, anime wallpaper, |
	## Training Parameters:
	Software used: Kohya SD-Scripts with Stable Cascade branch
	https://github.com/kohya-ss/sd-scripts/tree/stable-cascade

	Base model: Disty0/sote-diffusion-cascade-alpha0
	### Command:
	```shell
	#!/bin/sh

	CURRENT=$1
	CURRENT_SUB=$2

	PAST=$3
	PAST_SUB=$4

	LD_PRELOAD=/usr/lib/libtcmalloc.so.4 accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
	--mixed_precision fp16 \
	--save_precision fp16 \
	--full_fp16 \
	--sdpa \
	--gradient_checkpointing \
	--train_text_encoder \
	--resolution "1024,1024" \
	--train_batch_size 2 \
	--gradient_accumulation_steps 8 \
	--learning_rate 1e-5 \
	--learning_rate_te1 1e-5 \
	--lr_scheduler constant_with_warmup \
	--lr_warmup_steps 100 \
	--optimizer_type adafactor \
	--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
	--max_grad_norm 0 \
	--token_warmup_min 1 \
	--token_warmup_step 0 \
	--shuffle_caption \
	--caption_separator ", " \
	--caption_dropout_rate 0 \
	--caption_tag_dropout_rate 0 \
	--caption_dropout_every_n_epochs 0 \
	--dataset_repeats 1 \
	--save_state \
	--save_every_n_steps 256 \
	--sample_every_n_steps 64 \
	--max_token_length 225 \
	--max_train_epochs 1 \
	--caption_extension ".txt" \
	--max_data_loader_n_workers 2 \
	--persistent_data_loader_workers \
	--enable_bucket \
	--min_bucket_reso 256 \
	--max_bucket_reso 4096 \
	--bucket_reso_steps 64 \
	--bucket_no_upscale \
	--log_with tensorboard \
	--output_name sotediffusion-wr3_3b \
	--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-$CURRENT/$CURRENT_SUB \
	--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-$CURRENT/$CURRENT_SUB.json \
	--output_dir /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$CURRENT/$CURRENT_SUB \
	--logging_dir /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$CURRENT/$CURRENT_SUB/logs \
	--resume /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$PAST/$PAST_SUB/sotediffusion-wr3_3b-state \
	--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$PAST/$PAST_SUB/sotediffusion-wr3_3b.safetensors \
	--text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$PAST/$PAST_SUB/sotediffusion-wr3_3b_text_model.safetensors \
	--effnet_checkpoint_path /mnt/DataSSD/AI/models/wuerstchen3/effnet_encoder.safetensors \
	--previewer_checkpoint_path /mnt/DataSSD/AI/models/wuerstchen3/previewer.safetensors \
	--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/config/sotediffusion-prompt.txt
	```
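The batch-related flags in the command above imply an effective batch size and a rough number of optimizer steps per chunk. This is a back-of-the-envelope sketch, assuming a single training process (one GPU is listed under Training Status), a full chunk of 8000 images, one epoch with no dataset repeats, and that steps are counted per optimizer update. Aspect-ratio bucketing can change the exact batch count, so treat the result as an estimate. The function names are hypothetical.

```python
def effective_batch_size(train_batch_size: int, grad_accum_steps: int, num_processes: int = 1) -> int:
    """Images contributing to one optimizer update."""
    return train_batch_size * grad_accum_steps * num_processes


def optimizer_steps(num_images: int, batch_size: int) -> int:
    """Approximate optimizer updates needed for one pass over the images."""
    return (num_images + batch_size - 1) // batch_size  # ceiling division


# Values from the command: --train_batch_size 2, --gradient_accumulation_steps 8.
batch = effective_batch_size(train_batch_size=2, grad_accum_steps=8)
steps = optimizer_steps(num_images=8000, batch_size=batch)
print(batch, steps)
```

Under these assumptions, one chunk corresponds to roughly 500 optimizer updates at an effective batch size of 16.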


	## Limitations and Bias

	### Bias

	- This model is intended for anime illustrations.
	Realistic capabilities are not tested at all.

	### Limitations

	- Can fall back to a realistic style.
	Add the "realistic" tag to the negative prompt when this happens.
	- Eyes in far shots can be bad.
	- Anatomy and hands can be bad.
	- Still in active training.


	## License

	SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:

	1. Modification Sharing: If you modify SoteDiffusion models, you must share both your changes and the original license.
	2. Source Code Accessibility: If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
	3. Distribution Terms: Any distribution must be under this license or another with similar rules.
	4. Compliance: Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

	Note: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is included as LICENSE_INHERIT. This means commercial use of any kind is still not allowed.