---
pipeline_tag: text-to-image
license: other
license_name: faipl-1.0-sd
license_link: LICENSE
base_model: stabilityai/stable-cascade
tags:
- text-to-image
- anime
library_name: diffusers
language: en
inference: false
decoder: Disty0/sotediffusion-wuerstchen3-decoder
new_version: Disty0/sotediffusion-v2
---

# New version is available: https://huggingface.co/Disty0/sotediffusion-v2

# SoteDiffusion Wuerstchen3

Anime finetune of Würstchen V3.

# Release Notes

- This release is sponsored by fal.ai/grants
- Trained on 6M images for 3 epochs using 8x A100 80G GPUs.

# API Usage

This model can be used via API with Fal.AI.

For more details: https://fal.ai/models/fal-ai/stable-cascade/sote-diffusion

# UI Guide

## SD.Next

URL: https://github.com/vladmandic/automatic/

Go to Models -> Huggingface, type `Disty0/sotediffusion-wuerstchen3-decoder` into the model name field, and press download.

Load `Disty0/sotediffusion-wuerstchen3-decoder` after the download process is complete.

Prompt:
```
newest, extremely aesthetic, best quality,
```

Negative Prompt:
```
very displeasing, worst quality, monochrome, realistic, oldest, loli,
```

Parameters:

- Sampler: Default
- Steps: 30 or 40
- Refiner Steps: 10
- CFG: 7
- Secondary CFG: 2 or 1
- Resolution: 1024x1536, 2048x1152

Any resolution works as long as both dimensions are multiples of 128.

## ComfyUI

Please refer to CivitAI: https://civitai.com/models/353284

# Code Example

```shell
pip install diffusers
```

```python
import torch
from diffusers import StableCascadeCombinedPipeline

device = "cuda"
dtype = torch.bfloat16 # or torch.float16
model = "Disty0/sotediffusion-wuerstchen3-decoder"

pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

# send everything to the gpu:
pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)
# or enable model offload to save vram:
# pipe.enable_model_cpu_offload()

prompt = "newest, extremely aesthetic, best quality, 1girl, solo, cat ears, pink hair, orange eyes, long hair, bare shoulders, looking at viewer, smile, indoors, casual, living room, playing guitar,"
negative_prompt = "very displeasing, worst quality, monochrome, realistic, oldest, loli,"

output = pipe(
    width=1024,
    height=1536,
    prompt=prompt,
    negative_prompt=negative_prompt,
    decoder_guidance_scale=2.0,
    prior_guidance_scale=7.0,
    prior_num_inference_steps=30,
    output_type="pil",
    num_inference_steps=10
).images[0]

# do something with the output image
```

## Training:

**Software used**: Kohya SD-Scripts with Stable Cascade branch.
https://github.com/kohya-ss/sd-scripts/tree/stable-cascade

**GPU used**: 8x Nvidia A100 80GB

**GPU Hours**: 220

### Base

| parameter | value |
|---|---|
| **amp** | bf16 |
| **weights** | fp32 |
| **save weights** | fp16 |
| **resolution** | 1024x1024 |
| **effective batch size** | 128 |
| **unet learning rate** | 1e-5 |
| **te learning rate** | 4e-6 |
| **optimizer** | Adafactor |
| **images** | 6M |
| **epochs** | 3 |

### Final

| parameter | value |
|---|---|
| **amp** | bf16 |
| **weights** | fp32 |
| **save weights** | fp16 |
| **resolution** | 1024x1024 |
| **effective batch size** | 128 |
| **unet learning rate** | 4e-6 |
| **te learning rate** | none |
| **optimizer** | Adafactor |
| **images** | 120K |
| **epochs** | 16 |

## Dataset:

**GPU used for captioning**: 1x Intel ARC A770 16GB

**GPU Hours**: 350

**Model used for captioning**: SmilingWolf/wd-swinv2-tagger-v3

**Model used for text**: llava-hf/llava-1.5-7b-hf

**Command:**
```
python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
```

| dataset name | total images |
|---|---|
| **newest** | 1.848.331 |
| **recent** | 1.380.630 |
| **mid** | 993.227 |
| **early** | 566.152 |
| **oldest** | 160.397 |
| **pixiv** | 343.614 |
| **visual novel cg** | 231.358 |
| **anime wallpaper** | 104.790 |
| **Total** | 5.628.499 |

**Note**:

- Smallest size is 1280x600 | 768.000 pixels
- Deduped based on image similarity using czkawka-cli
- Around 120K very high quality images got intentionally duplicated 5 times, making the total image count 6.2M

## Tags:

Model is trained with random tag order but this is the order in the dataset if you are interested:
```
aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
```

### Date:

| tag | date |
|---|---|
| **newest** | 2022 to 2024 |
| **recent** | 2019 to 2021 |
| **mid** | 2015 to 2018 |
| **early** | 2011 to 2014 |
| **oldest** | 2005 to 2010 |

### Aesthetic Tags:

**Model used**: shadowlilac/aesthetic-shadow-v2

| score greater than | tag | count |
|---|---|---|
| **0.90** | extremely aesthetic | 125.451 |
| **0.80** | very aesthetic | 887.382 |
| **0.70** | aesthetic | 1.049.857 |
| **0.50** | slightly aesthetic | 1.643.091 |
| **0.40** | not displeasing | 569.543 |
| **0.30** | not aesthetic | 445.188 |
| **0.20** | slightly displeasing | 341.424 |
| **0.10** | displeasing | 237.660 |
| **rest of them** | very displeasing | 328.712 |

### Quality Tags:

**Model used**: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

| score greater than | tag | count |
|---|---|---|
| **0.980** | best quality | 1.270.447 |
| **0.900** | high quality | 498.244 |
| **0.750** | great quality | 351.006 |
| **0.500** | medium quality | 366.448 |
| **0.250** | normal quality | 368.380 |
| **0.125** | bad quality | 279.050 |
| **0.025** | low quality | 538.958 |
| **rest of them** | worst quality | 1.955.966 |

## Rating Tags:

| tag | count |
|---|---|
| **general** | 1.416.451 |
| **sensitive** | 3.447.664 |
| **nsfw** | 427.459 |
| **explicit nsfw** | 336.925 |

## Custom Tags:

| dataset name | custom tag |
|---|---|
| **image boards** | date, |
| **text** | The text says "text", |
| **characters** | character, series |
| **pixiv** | art by Display_Name, |
| **visual novel cg** | Full_VN_Name (short_3_letter_name), visual novel cg, |
| **anime wallpaper** | date, anime wallpaper, |

## Limitations and Bias

### Bias

- This model is intended for anime illustrations. Realistic capabilities are not tested at all.

### Limitations

- Can fall back to realistic. Add the "realistic" tag to the negative prompt when this happens.
- Eyes and hands in far shots can be bad.

## License

SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:

1. **Modification Sharing:** If you modify SoteDiffusion models, you must share both your changes and the original license.
2. **Source Code Accessibility:** If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
3. **Distribution Terms:** Any distribution must be under this license or another with similar rules.
4. **Compliance:** Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

**Notes**: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is named LICENSE_INHERIT.
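
## Example: Mapping Scores to Tags

The Date, Aesthetic Tags, and Quality Tags tables above can be read as simple threshold lookups. The following is a minimal sketch of that lookup, not the script used to build the dataset: the function names (`score_to_tag`, `year_to_tag`, `caption_prefix`) are hypothetical, and the strict "greater than" comparison is an assumption taken from the "score greater than" column header. The output follows the dataset tag order listed under Tags (aesthetic, quality, date).

```python
# Thresholds copied from the Aesthetic Tags and Quality Tags tables.
# Each entry means "score greater than <threshold> -> <tag>".
AESTHETIC_TAGS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_TAGS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]
# Inclusive year ranges copied from the Date table.
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def score_to_tag(score, thresholds, fallback):
    """Return the tag of the first threshold the score exceeds (hypothetical helper)."""
    for threshold, tag in thresholds:
        if score > threshold:  # assumption: strict comparison, per the column header
            return tag
    return fallback  # the "rest of them" row


def year_to_tag(year):
    """Return the date tag for a year, or None if the year is outside the table."""
    for first, last, tag in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None


def caption_prefix(aesthetic_score, quality_score, year):
    """Build a tag prefix in the dataset order: aesthetic, quality, date."""
    tags = [
        score_to_tag(aesthetic_score, AESTHETIC_TAGS, "very displeasing"),
        score_to_tag(quality_score, QUALITY_TAGS, "worst quality"),
    ]
    date_tag = year_to_tag(year)
    if date_tag is not None:
        tags.append(date_tag)
    return ", ".join(tags) + ","


if __name__ == "__main__":
    print(caption_prefix(0.95, 0.99, 2023))
```

Because the model is trained with random tag order, the order produced here only mirrors the dataset layout; prompts such as the one in the Code Example section place the date tag first and work the same way.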