Twig v0 alpha

Twig v0 alpha is a multilingual text-to-image model with strong instruction following and long context capabilities.

Model Details

Model Description

This model, Twig v0 alpha, is a multilingual text-to-image (T2I) model based on Efficient-Large-Model/Sana_1600M_1024px_MultiLing and designed for strong instruction following. It accepts both English and Chinese prompts directly. A key feature of Twig v0 alpha is its long context window of up to 1200 tokens, which enables fine-grained control over image composition and details.

Despite its relatively small size of 1.6 billion parameters, Twig v0 alpha demonstrates competitive instruction following, surpassing some larger closed-source models (e.g., 20B-parameter models) in instruction adherence. In the evaluations reported below, it also outperforms FLUX-dev (12B) on FID, CLIP, GenEval, and DPG.

Notably, Twig v0 alpha is optimized for efficiency. It can generate images up to 2048x2048 resolution in approximately 10 seconds on a modern CPU, with no dedicated GPU required. On a single NVIDIA RTX 4090, it generates a 1024x1024 image in around 0.4 seconds.

The alpha version was trained on a dataset of approximately 50,000 carefully curated image-text pairs. Future iterations, including the beta version, will focus on expanding the dataset and exploring new training methodologies. Version 1 (v1) is planned to adopt a novel linear-attention binary visual autoregressive architecture, which is expected to further enhance the model's capabilities.

Because of the model's relatively small parameter count and training dataset, common negative prompts for text-to-image models are recommended to improve generation quality. The model may produce inaccurate human anatomy, a limitation inherited from the base model; auxiliary techniques commonly used with other T2I models, such as ADetailer, can mitigate these issues and enhance details.

License Note: The license of the original repository is not explicitly stated but is assumed to be compatible with the Apache 2.0 license used here. Please refer to the original repository (linked under "Model Sources") for further clarification regarding licensing.

  • Developed by: Swarmeta-AI & Rath-Academy
  • Funded by: National Supercomputing Center
  • Language(s): English, Chinese
  • License: apache-2.0
  • Finetuned from model: Efficient-Large-Model/Sana_1600M_1024px_MultiLing

Model Sources

  • Repository: https://github.com/NVlabs/Sana (base-model code; reference for v0)

Uses

Direct Use

This model is intended for direct use in generating images from text prompts in English and Chinese. Users can leverage its strong instruction following and long context window (up to 1200 tokens) to create images with detailed compositions and specific attributes. Negative prompts are recommended to further refine image quality, and because the model can struggle with human anatomy, techniques such as ADetailer are worth considering when generating images of people.

Out-of-Scope Use

This model may not be suitable for applications requiring highly accurate human anatomy generation without employing additional refinement techniques. It's also important to be mindful of the base model's potential biases and limitations, especially when generating images related to sensitive topics. Users should avoid using this model for malicious purposes or generating harmful content.

Bias, Risks, and Limitations

This model, like many generative models, may reflect biases present in its training data. Because of its smaller size and dataset, it may be less able to understand and render diverse, complex scenes or concepts than larger models, and it may struggle in particular with accurate human anatomy. Users should be aware of these limitations and critically evaluate generated content, especially in applications where accuracy and fairness are paramount.

Recommendations

Users are advised to:

  • Utilize common negative prompts for text-to-image models to improve generation quality (see the example after this list).
  • Employ auxiliary techniques like ADetailer to enhance details and address potential issues with human anatomy, especially when generating images with people.
  • Be aware of potential biases and limitations of the model and critically evaluate the generated content.
  • Consult the original repository's licensing information if there are any concerns about license compatibility.
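
As a concrete illustration of the first two recommendations, a generic negative prompt of the kind commonly used with text-to-image models might look like the sketch below. The wording is an assumption for illustration, not a prompt validated for this model.

```python
# Illustrative only: a generic negative prompt of the kind commonly used
# with text-to-image models. The wording is an assumption, not an official
# recommendation for Twig v0 alpha.
NEGATIVE_PROMPT = (
    "lowres, blurry, jpeg artifacts, watermark, text, "
    "bad anatomy, extra fingers, deformed hands, disfigured face"
)
```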

How to Get Started with the Model

Use the code below to get started with the model.

The model can be used with Gradio and ComfyUI, as described in the original repository's documentation. Please refer to the original repository (https://github.com/NVlabs/Sana; this link applies to v0 and will be replaced if a dedicated repository becomes available) for detailed instructions on loading and running the model in those environments.
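
For a script-based start, the following is a minimal sketch that assumes the checkpoint can be loaded with diffusers' SanaPipeline, as its base model can; the repository id and loading path are assumptions, so fall back to the Gradio/ComfyUI workflows above if they do not apply.

```python
# A minimal sketch, assuming the checkpoint loads with diffusers' SanaPipeline
# like its Sana_1600M_1024px_MultiLing base model. Verify that the repo id
# below actually hosts diffusers-format weights before relying on this.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "Swarmeta-AI/Twig-v0-alpha",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A watercolor painting of a red fox resting under a maple tree",
    negative_prompt="lowres, blurry, watermark, bad anatomy",  # see Recommendations
    height=1024,
    width=1024,
    guidance_scale=5.0,
    num_inference_steps=20,
).images[0]
image.save("fox.png")
```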

Training Details

Training Data

The model was trained on a private dataset consisting of approximately 50,000 carefully curated image-text pairs. This dataset is not publicly available at this time. Future versions will explore expanded datasets and novel training methods.

Speeds, Sizes, Times

  • Model Size: 1.6B parameters
  • Inference Speed:
    • GPU (NVIDIA RTX 4090): Approximately 0.4 seconds per 1024x1024 image.
    • CPU only (modern CPU): Approximately 10 seconds per 2048x2048 image.
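
To sanity-check these numbers on your own hardware, a rough latency measurement could look like the sketch below. It assumes a `pipe` object loaded as in the getting-started section; timings will vary with hardware, dtype, resolution, and step count.

```python
import time

import torch

def time_generation(pipe, prompt, height=1024, width=1024, steps=20, runs=3):
    """Return average wall-clock seconds per image, after one warmup run."""
    pipe(prompt=prompt, height=height, width=width, num_inference_steps=steps)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt=prompt, height=height, width=width, num_inference_steps=steps)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Example: print(f"{time_generation(pipe, 'a lighthouse at dawn'):.2f} s/image")
```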

Evaluation

Results

| Methods (1024x1024) | Throughput (samples/s) | Latency (s) | Params (B) | Speedup | FID πŸ‘‡ | CLIP πŸ‘† | GenEval πŸ‘† | DPG πŸ‘† |
|---------------------|------------------------|-------------|------------|---------|--------|---------|-----------|--------|
| FLUX-dev            | 0.04                   | 23.0        | 12.0       | 1.0Γ—    | 10.15  | 27.47   | 0.67      | 84.0   |
| Sana-1.6B-MultiLing | 1.0                    | 1.2         | 1.6        | 23.3Γ—   | 5.92   | 28.94   | 0.69      | 84.5   |
| Twig-v0-alpha       | 1.0                    | 1.2         | 1.6        | 23.3Γ—   | 5.98   | 32.92   | 0.73      | 87.2   |
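
For context, the CLIP column measures text-image alignment between the prompt and the generated image. The sketch below shows one common way to compute such a score with torchmetrics' CLIPScore; it is illustrative and not necessarily the exact evaluation protocol behind the numbers above.

```python
# Illustrative CLIP-score computation with torchmetrics; not necessarily the
# protocol used for the table above.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# CLIPScore expects uint8 image tensors in (C, H, W) layout. Replace the
# random placeholder with a real generated image, e.g.
# torch.from_numpy(np.array(pil_image)).permute(2, 0, 1).
image = torch.randint(0, 255, (3, 1024, 1024), dtype=torch.uint8)
score = metric(image, "A watercolor painting of a red fox")
print(f"CLIP score: {score.item():.2f}")
```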

Seeking Support: We are actively seeking donations and commercial collaborations/sponsorships to support the development of open-source models. Donations fund further open-source model development; for commercial collaborations and sponsorships, we prioritize providing professional closed-source models, deployment, and support services.

Contact Us

Email: [email protected]
