---
license: apache-2.0
pipeline_tag: text-to-image
---
# Work / training in progress!
![image](./promo.png)

⚡️Waifu: efficient high-resolution waifu synthesis


waifu is a free text-to-image model that efficiently generates images from prompts in 80 languages. Our goal is to create a small model without compromising on quality.

## Core designs

(1) [**AuraDiffusion/16ch-vae**](https://huggingface.co/AuraDiffusion/16ch-vae): a fully open-source 16-channel VAE, natively trained in fp16 (see the sketch after this list). \
(2) [**Linear DiT**](https://github.com/NVlabs/Sana): we use a 1.6B-parameter DiT transformer with linear attention. \
(3) [**MEXMA-SigLIP**](https://huggingface.co/visheratin/mexma-siglip): MEXMA-SigLIP combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with an image encoder from the [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) model, giving us a high-performance CLIP model covering 80 languages. \
(4) Other: we use the Flow-Euler sampler, the Adafactor-Fused optimizer, and bf16 precision for training, and combine efficient caption labeling (MoonDream, CogVLM, human annotators, GPT models) with Danbooru tags to accelerate convergence.
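
To make the 16-channel latent space concrete, here is a minimal round-trip sketch (not part of the original card). It assumes the `AuraDiffusion/16ch-vae` repo loads as a diffusers `AutoencoderKL` and that a local `waifu.png` exists; both are assumptions, not confirmed by the card.

```py
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

# Assumption: the repo ships weights in diffusers AutoencoderKL format.
vae = AutoencoderKL.from_pretrained(
    "AuraDiffusion/16ch-vae", torch_dtype=torch.float16
).to("cuda")

image = load_image("waifu.png").resize((512, 512))          # hypothetical local file
pixels = to_tensor(image).unsqueeze(0).to("cuda", torch.float16) * 2 - 1  # scale to [-1, 1]

with torch.no_grad():
    # 16 latent channels at 1/8 resolution, e.g. [1, 16, 64, 64] for a 512x512 input
    latents = vae.encode(pixels).latent_dist.sample()
    decoded = vae.decode(latents).sample.clamp(-1, 1)

# Map back to [0, 1] and save the reconstruction for visual comparison.
to_pil_image(((decoded[0] + 1) / 2).float().cpu()).save("roundtrip.png")
```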

## Pros
- Small model that can be trained on a consumer GPU; fast training process.
- Supports multiple languages and demonstrates good prompt adherence.
- Uses a state-of-the-art 16-channel VAE (variational autoencoder).

## Cons
- Trained on only 2 million images (low-budget model, approximately $3,000).
- Training dataset consists primarily of anime and illustrations (only about 1% realistic images).
- Low resolution only for now (512×512).

## Example

```bash
# First, install the latest diffusers from source
pip install git+https://github.com/huggingface/diffusers
```

```py
import torch
from diffusers import DiffusionPipeline
# from pipeline_waifu import WaifuPipeline  # alternative: import the pipeline class directly

pipe_id = "AiArtLab/waifu-2b"

# Pipeline: the custom pipeline code is fetched from the repo, hence trust_remote_code
pipeline = DiffusionPipeline.from_pretrained(
    pipe_id,
    variant="fp16",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")
# print(pipeline)

# Multilingual prompt (Russian / Arabic / French):
# "anime girl, waifu, smiling suggestively, in front of the Eiffel Tower"
prompt = 'аниме девушка, waifu, يبتسم جنسيا , sur le fond de la tour Eiffel'
generator = torch.Generator(device="cuda").manual_seed(42)  # fixed seed for reproducibility

images = pipeline(
    prompt=prompt,
    negative_prompt="",
    generator=generator,
)[0]  # the first output field is the list of generated PIL images

for img in images:
    img.show()
    img.save('waifu.png')
```

![image](./waifu.png)
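
Because the pipeline is loaded with `trust_remote_code`, its exact signature lives in the repo. The sketch below assumes it accepts the standard diffusers generation arguments (`height`, `width`, `num_inference_steps`, `guidance_scale`, `num_images_per_prompt`); all parameter values are illustrative assumptions, not recommendations from the card.

```py
# Hedged sketch: standard diffusers generation knobs, assuming the custom
# pipeline supports them. Tune values per prompt.
images = pipeline(
    prompt="anime girl, waifu, in front of the Eiffel Tower",
    negative_prompt="",
    height=512,              # the card notes the model is 512px-only for now
    width=512,
    num_inference_steps=28,  # assumed; adjust for the Flow-Euler sampler
    guidance_scale=5.0,      # assumed; adjust per prompt
    num_images_per_prompt=2,
    generator=torch.Generator(device="cuda").manual_seed(42),
)[0]

for i, img in enumerate(images):
    img.save(f"waifu_{i}.png")  # indexed names avoid overwriting
```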

## Donations

We are a small, GPU-poor group of enthusiasts (current training budget: ~$3,000).
![image](./low.png)

Please contact us if you can provide GPUs for training.

DOGE: DEw2DR8C7BnF8GgcrfTzUjSnGkuMeJhg83


## Contacts

[recoilme](https://t.me/recoilme)

## How to cite

```bibtex
@misc{Waifu, 
    url    = {https://huggingface.co/AiArtLab/waifu-2b},
    title  = {waifu-2b}, 
    author = {recoilme, muinez, femboysLover}
}
```