Ketengan-Diffusion committed
Commit e8d321f · 1 Parent(s): f928a1e

Update README.md

Files changed (1): README.md (+54, -0)
---
license: creativeml-openrail-m
language:
- en
tags:
- stable-diffusion
- SDXL
- art
- stable-diffusion-XL
- fantasy
- anime
- aiart
- ketengan
- AnySomniumXL
---

# AnySomniumXL v2

`Ketengan-Diffusion/AnySomniumXL v2` is an SDXL model fine-tuned from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
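A minimal inference sketch with the `diffusers` `StableDiffusionXLPipeline` follows; the repository id and prompt are illustrative assumptions, so check this repo's files for the exact id and format:

```python
# Minimal inference sketch (assumes a diffusers-format checkpoint;
# the repo id and prompt below are illustrative, not confirmed).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "Ketengan-Diffusion/AnySomniumXL",  # hypothetical repo id
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="1girl, fantasy forest, detailed, anime style",
    negative_prompt="lowres, watermark, text",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("anysomniumxl_sample.png")
```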

# Our Dataset Curation Process
Our dataset is scored with the pretrained CLIP+MLP aesthetic scoring model from https://github.com/christophschuhmann/improved-aesthetic-predictor, and we adjusted our script to detect text and watermarks with OCR via pytesseract.

The scoring scale runs from -1 to 100. To preserve the 2D style of the dataset, we take a minimum threshold of around 17-20 and a maximum of 65-75; any image containing text is scored -1. Images scoring below 17 or above 65 are deleted.

The dataset curation process ran on an NVIDIA T4 16GB machine and took about 2 days to curate 300,000 images.
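The filtering rule described above can be sketched as a small function; the names are hypothetical, since the real pipeline uses the aesthetic predictor and pytesseract rather than these stand-ins:

```python
# Illustrative sketch of the curation rule described above.
# In the real pipeline, `aesthetic_score` comes from the CLIP+MLP
# aesthetic predictor and `ocr_text` from pytesseract.

def effective_score(aesthetic_score: float, ocr_text: str) -> float:
    """Images with any detected text/watermark are forced to score -1."""
    if ocr_text.strip():
        return -1.0
    return aesthetic_score

def keep_image(score: float, low: float = 17.0, high: float = 65.0) -> bool:
    """Keep only images whose score falls inside [low, high]."""
    return low <= score <= high

# A clean image scoring 42 is kept; a watermarked or over-threshold one is not.
print(keep_image(effective_score(42.0, "")))           # True
print(keep_image(effective_score(42.0, "WATERMARK")))  # False
print(keep_image(effective_score(80.0, "")))           # False
```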

# Captioning process
We use an open-source multimodal LLM for captioning, which produces more complex results than plain BLIP2. Details such as clothing, atmosphere, situation, scene, place, gender, skin, and more are generated by the LLM.

Captioning 33k images took about 6 days on an NVIDIA Tesla A100 80GB PCIe; we are still improving our script to generate captions faster. This captioning process requires a minimum of 24GB of VRAM, so an NVIDIA Tesla T4 16GB is not sufficient.
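As a back-of-the-envelope sanity check, the figures above (33k images in about 6 days) imply the following throughput:

```python
# Throughput implied by the figures above:
# 33,000 images captioned in ~6 days on one A100 80GB.
images = 33_000
days = 6

images_per_hour = images / (days * 24)
seconds_per_image = days * 24 * 3600 / images

print(f"{images_per_hour:.0f} images/hour")    # ~229 images/hour
print(f"{seconds_per_image:.1f} s per image")  # ~15.7 s per image
```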

# Tagging Process
We simply use booru tags retrieved from booru boards. These tags are largely assigned manually by humans, which makes them more accurate.

# Training Process

AnySomniumXL v2 technical specifications:

- Trained for 20 epochs (the AnySomniumXL results shown use epoch 20), batch size 1 without gradient checkpointing
- Learning rate: 4e-7; text encoder trained on natural-language captions from LLaVA 1.5, which are more complex than BLIP2's
- Trained with a bucket size of 1024x1024
- Optimizer: Adafactor
- LR scheduler: constant with warmup
- Shuffle caption: yes
- Clip skip: 2
- Trained on an NVIDIA A100 80GB for an estimated 126 training hours with a batch size of 2
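For reference, the specifications above can be gathered into a plain Python dict; the keys are illustrative and not tied to any particular trainer's argument names:

```python
# AnySomniumXL v2 hyperparameters from the list above,
# collected into one illustrative config dict.
config = {
    "base_model": "stabilityai/stable-diffusion-xl-base-1.0",
    "epochs": 20,
    "learning_rate": 4e-7,
    "bucket_size": (1024, 1024),
    "optimizer": "Adafactor",
    "lr_scheduler": "constant_with_warmup",
    "shuffle_caption": True,
    "clip_skip": 2,
    "gradient_checkpointing": False,
}
print(config["learning_rate"])  # 4e-07
```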

You can support me:
- on [Ko-fi](https://ko-fi.com/ncaix)