aredden committed
Commit 9dc5b0b · 1 Parent(s): 58082af

Add all relevant args to argparser & update readme

Files changed (3)
  1. README.md +102 -23
  2. main.py +64 -3
  3. util.py +23 -5
README.md CHANGED
@@ -1,6 +1,6 @@
1
  # Flux FP8 (true) Matmul Implementation with FastAPI
2
 
3
- This repository contains an implementation of the Flux model, along with an API that allows you to generate images based on text prompts. The API can be run via command-line arguments.
4
 
5
  ## Speed Comparison
6
 
@@ -73,13 +73,21 @@ If you get errors installing `torch-cublas-hgemm`, feel free to comment it out i
73
 
74
  ## Usage
75
 
76
  You can run the API server using the following command:
77
 
78
  ```bash
79
  python main.py --config-path <path_to_config> --port <port_number> --host <host_address>
80
  ```
81
 
82
- ### Command-Line Arguments
83
 
84
  - `--config-path`: Path to the configuration file. If not provided, the model will be loaded from the command line arguments.
85
  - `--port`: Port to run the server on (default: 8088).
@@ -91,17 +99,47 @@ python main.py --config-path <path_to_config> --port <port_number> --host <host_
91
  - `--flux-device`: Device to run the flow model on (default: cuda:0).
92
  - `--text-enc-device`: Device to run the text encoder on (default: cuda:0).
93
  - `--autoencoder-device`: Device to run the autoencoder on (default: cuda:0).
94
- - `--num-to-quant`: Number of linear layers in the flow transformer to quantize (default: 20).
95
 
96
  ## Configuration
97
 
98
  The configuration files are located in the `configs` directory. You can specify different configurations for different model versions and devices.
99
 
100
- Example configuration file (`configs/config-dev.json`):
101
 
102
- ```json
103
  {
104
- "version": "flux-dev",
105
  "params": {
106
  "in_channels": 64,
107
  "vec_in_dim": 768,
@@ -114,7 +152,7 @@ Example configuration file (`configs/config-dev.json`):
114
  "axes_dim": [16, 56, 56],
115
  "theta": 10000,
116
  "qkv_bias": true,
117
- "guidance_embed": true
118
  },
119
  "ae_params": {
120
  "resolution": 256,
@@ -127,23 +165,27 @@ Example configuration file (`configs/config-dev.json`):
127
  "scale_factor": 0.3611,
128
  "shift_factor": 0.1159
129
  },
130
- "ckpt_path": "/path/to/your/flux1-dev.sft",
131
- "ae_path": "/path/to/your/ae.sft",
132
- "repo_id": "black-forest-labs/FLUX.1-dev",
133
- "repo_flow": "flux1-dev.sft",
134
- "repo_ae": "ae.sft",
135
- "text_enc_max_length": 512,
136
- "text_enc_path": "path/to/your/t5-v1_1-xxl-encoder-bf16",
137
- "text_enc_device": "cuda:1",
138
- "ae_device": "cuda:1",
139
  "flux_device": "cuda:0",
140
  "flow_dtype": "float16",
141
  "ae_dtype": "bfloat16",
142
  "text_enc_dtype": "bfloat16",
143
- "text_enc_quantization_dtype": "qfloat8",
144
- "compile_extras": true,
145
- "compile_blocks": true,
146
- ...
  }
148
  ```
149
 
@@ -157,6 +199,12 @@ The only things you should need to change in general are the:
157
 
158
  Other things you may want to change are:
159
160
  - `"text_enc_quantization_dtype": "qfloat8"`
161
  quantization dtype for the text encoder; `qfloat8` or `qint2` will use quanto, while `qint4` or `qint8` will use bitsandbytes
162
 
@@ -220,9 +268,11 @@ python main.py --port 8088 --host 0.0.0.0 \
220
  --autoencoder-path /path/to/your/ae.sft \
221
  --model-version flux-dev \
222
  --flux-device cuda:0 \
223
- --text-enc-device cuda:1 \
224
- --autoencoder-device cuda:1 \
225
- --num-to-quant 20
 
 
226
  ```
227
 
228
  ### Generating an Image
@@ -263,3 +313,32 @@ with open(f"output.jpg", "wb") as f:
263
  f.write(io.BytesIO(res.content).read())
264
 
265
  ```
 
1
  # Flux FP8 (true) Matmul Implementation with FastAPI
2
 
3
+ This repository contains an implementation of the Flux model, along with an API that allows you to generate images based on text prompts, and a simple single-line interface for using the generator as a single object, similar to diffusers pipelines.
4
 
5
  ## Speed Comparison
6
 
 
73
 
74
  ## Usage
75
 
76
+ For a single ADA GPU with more than 16GB and less than 24GB of VRAM, you should use the `configs/config-dev-1-4080.json` config file as a base, and then tweak the parameters to fit your needs. It offloads all models to CPU when not in use, compiles the flow model with extra optimizations, and quantizes the text encoder to nf4 and the autoencoder to qfloat8.
77
+
78
+ For a single ADA GPU with more than ~32GB of VRAM, you should use the `configs/config-dev-1-RTX6000ADA.json` config file as a base, and then tweak the parameters to fit your needs. It does not offload any models to CPU, compiles the flow model with extra optimizations, quantizes the text encoder to qfloat8, and leaves the autoencoder as bfloat16.
79
+
80
+ For a single 4090 GPU, you should use the `configs/config-dev-1-4090.json` config file as a base, and then tweak the parameters to fit your needs. It offloads the text encoder and the autoencoder to CPU, compiles the flow model with extra optimizations, and quantizes the text encoder to nf4 and the autoencoder to float8.
81
+
82
+ **NOTE:** For all of these configs, you must change the `ckpt_path`, `ae_path`, and `text_enc_path` parameters to the paths of your own checkpoint, autoencoder, and text encoder.
83
+
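For example, a quick way to make your own config is to copy the one that matches your GPU and then edit the three path entries (a minimal sketch; the destination filename is arbitrary and the paths are placeholders):

```bash
# start from the config that matches your GPU
cp configs/config-dev-1-4090.json configs/config-dev-custom.json

# then edit configs/config-dev-custom.json and point these entries at your files:
#   "ckpt_path":     local path to your flux1-dev.sft checkpoint
#   "ae_path":       local path to your ae.sft autoencoder checkpoint
#   "text_enc_path": local path (or HF repo id) for the bf16 T5 text encoder
```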
84
  You can run the API server using the following command:
85
 
86
  ```bash
87
  python main.py --config-path <path_to_config> --port <port_number> --host <host_address>
88
  ```
89
 
90
+ ### API Command-Line Arguments
91
 
92
  - `--config-path`: Path to the configuration file. If not provided, the model will be loaded from the command line arguments.
93
  - `--port`: Port to run the server on (default: 8088).
 
99
  - `--flux-device`: Device to run the flow model on (default: cuda:0).
100
  - `--text-enc-device`: Device to run the text encoder on (default: cuda:0).
101
  - `--autoencoder-device`: Device to run the autoencoder on (default: cuda:0).
102
+ - `--compile`: Compile the flow model with extra optimizations (default: False).
103
+ - `--quant-text-enc`: Quantize the T5 text encoder to the given dtype (`qint4`, `qfloat8`, `qint2`, `qint8`, `bf16`); if `bf16`, the encoder is not quantized (default: `qfloat8`).
104
+ - `--quant-ae`: Quantize the autoencoder with float8 linear layers, otherwise will use bfloat16 (default: False).
105
+ - `--offload-flow`: Offload the flow model to the CPU when not being used to save memory (default: False).
106
+ - `--no-offload-ae`: Disable offloading the autoencoder to the CPU when it is not in use; keeping it on the GPU increases end-to-end inference speed (offloading is enabled by default).
107
+ - `--no-offload-text-enc`: Disable offloading the text encoder to the CPU when it is not in use; keeping it on the GPU increases end-to-end inference speed (offloading is enabled by default).
108
+ - `--prequantized-flow`: Load the flow model from a prequantized checkpoint, which reduces the size of the checkpoint by about 50% & reduces startup time (default: False).
109
+
110
+ ## Examples
111
+
112
+ ### Running the Server
113
+
114
+ ```bash
115
+ python main.py --config-path configs/config-dev-1-4090.json --port 8088 --host 0.0.0.0
116
+ ```
117
+
118
+ Or, if you need more granular control over all of the settings, you can run the server with something like this:
119
+
120
+ ```bash
121
+ python main.py --port 8088 --host 0.0.0.0 \
122
+ --flow-model-path /path/to/your/flux1-dev.sft \
123
+ --text-enc-path /path/to/your/t5-v1_1-xxl-encoder-bf16 \
124
+ --autoencoder-path /path/to/your/ae.sft \
125
+ --model-version flux-dev \
126
+ --flux-device cuda:0 \
127
+ --text-enc-device cuda:0 \
128
+ --autoencoder-device cuda:0 \
129
+ --compile \
130
+ --quant-text-enc qfloat8 \
131
+ --quant-ae
132
+ ```
133
 
134
  ## Configuration
135
 
136
  The configuration files are located in the `configs` directory. You can specify different configurations for different model versions and devices.
137
 
138
+ Example configuration file for a single 4090 (`configs/config-dev-1-4090.json`):
139
 
140
+ ```js
141
  {
142
+ "version": "flux-dev", // or flux-schnell
143
  "params": {
144
  "in_channels": 64,
145
  "vec_in_dim": 768,
 
152
  "axes_dim": [16, 56, 56],
153
  "theta": 10000,
154
  "qkv_bias": true,
155
+ "guidance_embed": true // if you are using flux-schnell, set this to false
156
  },
157
  "ae_params": {
158
  "resolution": 256,
 
165
  "scale_factor": 0.3611,
166
  "shift_factor": 0.1159
167
  },
168
+ "ckpt_path": "/your/path/to/flux1-dev.sft", // local path to original bf16 BFL flux checkpoint
169
+ "ae_path": "/your/path/to/ae.sft", // local path to original bf16 BFL autoencoder checkpoint
170
+ "repo_id": "black-forest-labs/FLUX.1-dev", // can ignore
171
+ "repo_flow": "flux1-dev.sft", // can ignore
172
+ "repo_ae": "ae.sft", // can ignore
173
+ "text_enc_max_length": 512, // use 256 if you are using flux-schnell
174
+ "text_enc_path": "city96/t5-v1_1-xxl-encoder-bf16", // or custom HF full bf16 T5EncoderModel repo id
175
+ "text_enc_device": "cuda:0",
176
+ "ae_device": "cuda:0",
177
  "flux_device": "cuda:0",
178
  "flow_dtype": "float16",
179
  "ae_dtype": "bfloat16",
180
  "text_enc_dtype": "bfloat16",
181
+ "flow_quantization_dtype": "qfloat8", // will always be qfloat8, so can ignore
182
+ "text_enc_quantization_dtype": "qint4", // choose between qint4, qint8, qfloat8, qint2 or delete entry for no quantization
183
+ "ae_quantization_dtype": "qfloat8", // can either be qfloat8 or delete entry for no quantization
184
+ "compile_extras": true, // compile the layers not included in the single-blocks or double-blocks
185
+ "compile_blocks": true, // compile the single-blocks and double-blocks
186
+ "offload_text_encoder": true, // offload the text encoder to cpu when not in use
187
+ "offload_vae": true, // offload the autoencoder to cpu when not in use
188
+ "offload_flow": false // offload the flow transformer to cpu when not in use
189
  }
190
  ```
191
 
 
199
 
200
  Other things you may want to change are:
201
 
202
+ - `"text_enc_max_length": 512`
203
+ max length for the text encoder; use 256 if you are using flux-schnell
204
+
205
+ - `"ae_quantization_dtype": "qfloat8"`
206
+ quantization dtype for the autoencoder; can be `qfloat8`, or delete the entry for no quantization. Quantization uses the float8 linear layer implementation included in this repo.
207
+
208
  - `"text_enc_quantization_dtype": "qfloat8"`
209
  quantization dtype for the text encoder; `qfloat8` or `qint2` will use quanto, while `qint4` or `qint8` will use bitsandbytes
210
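Taken together, those optional entries might look like this inside a config file (a sketch only; the surrounding entries are elided and the values are illustrative):

```js
{
  // ... model params, ae_params, path and device entries as shown above ...
  "text_enc_max_length": 256,             // 256 for flux-schnell, 512 for flux-dev
  "ae_quantization_dtype": "qfloat8",     // or delete the entry for no quantization
  "text_enc_quantization_dtype": "qint4"  // qfloat8/qint2 use quanto, qint4/qint8 use bitsandbytes
}
```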
 
 
268
  --autoencoder-path /path/to/your/ae.sft \
269
  --model-version flux-dev \
270
  --flux-device cuda:0 \
271
+ --text-enc-device cuda:0 \
272
+ --autoencoder-device cuda:0 \
273
+ --compile \
274
+ --quant-text-enc qfloat8 \
275
+ --quant-ae
276
  ```
277
 
278
  ### Generating an Image
 
313
  f.write(io.BytesIO(res.content).read())
314
 
315
  ```
316
+
317
+ You can also generate an image by importing the FluxPipeline class directly and using it in your own code. This is useful if you have a custom model configuration and want to generate images without running the server.
318
+
319
+ ```py
320
+ import io
321
+ from flux_pipeline import FluxPipeline
322
+
323
+
324
+ pipe = FluxPipeline.load_pipeline_from_config_path(
325
+ "configs/config-dev-1-4090.json" # or whatever your config is
326
+ )
327
+
328
+ output_jpeg_bytes: io.BytesIO = pipe.generate(
329
+ # Required args:
330
+ prompt="A beautiful asian woman in traditional clothing with golden hairpin and blue eyes, wearing a red kimono with dragon patterns",
331
+ # Optional args:
332
+ width=1024,
333
+ height=1024,
334
+ num_inference_steps=20,
335
+ guidance_scale=3.5,
336
+ seed=13456,
337
+ init_image="path/to/your/init_image.jpg",
338
+ strength=0.8,
339
+ )
340
+
341
+ with open("output.jpg", "wb") as f:
342
+ f.write(output_jpeg_bytes.getvalue())
343
+
344
+ ```
main.py CHANGED
@@ -1,8 +1,6 @@
1
  import argparse
2
  import uvicorn
3
  from api import app
4
- from flux_pipeline import FluxPipeline
5
- from util import load_config, ModelVersion
6
 
7
 
8
  def parse_args():
@@ -79,13 +77,68 @@ def parse_args():
79
  default=False,
80
  help="Compile the flow model with extra optimizations",
81
  )
82
-
83
  return parser.parse_args()
84
 
85
 
86
  def main():
87
  args = parse_args()
88
 
89
  if args.config_path:
90
  app.state.model = FluxPipeline.load_pipeline_from_config_path(
91
  args.config_path, flow_model_path=args.flow_model_path
@@ -110,6 +163,14 @@ def main():
110
  num_to_quant=args.num_to_quant,
111
  compile_extras=args.compile,
112
  compile_blocks=args.compile,
113
  )
114
  app.state.model = FluxPipeline.load_pipeline_from_config(config)
115
 
 
1
  import argparse
2
  import uvicorn
3
  from api import app
 
 
4
 
5
 
6
  def parse_args():
 
77
  default=False,
78
  help="Compile the flow model with extra optimizations",
79
  )
80
+ parser.add_argument(
81
+ "-qT",
82
+ "--quant-text-enc",
83
+ type=str,
84
+ default="qfloat8",
85
+ choices=["qint4", "qfloat8", "qint2", "qint8", "bf16"],
86
+ help="Quantize the t5 text encoder to the given dtype, if bf16, will not quantize",
87
+ dest="quant_text_enc",
88
+ )
89
+ parser.add_argument(
90
+ "-qA",
91
+ "--quant-ae",
92
+ action="store_true",
93
+ default=False,
94
+ help="Quantize the autoencoder with float8 linear layers, otherwise will use bfloat16",
95
+ dest="quant_ae",
96
+ )
97
+ parser.add_argument(
98
+ "-OF",
99
+ "--offload-flow",
100
+ action="store_true",
101
+ default=False,
102
+ dest="offload_flow",
103
+ help="Offload the flow model to the CPU when not being used to save memory",
104
+ )
105
+ parser.add_argument(
106
+ "-OA",
107
+ "--no-offload-ae",
108
+ action="store_false",
109
+ default=True,
110
+ dest="offload_ae",
111
+ help="Disable offloading the autoencoder to the CPU when not being used to increase e2e inference speed",
112
+ )
113
+ parser.add_argument(
114
+ "-OT",
115
+ "--no-offload-text-enc",
116
+ action="store_false",
117
+ default=True,
118
+ dest="offload_text_enc",
119
+ help="Disable offloading the text encoder to the CPU when not being used to increase e2e inference speed",
120
+ )
121
+ parser.add_argument(
122
+ "-PF",
123
+ "--prequantized-flow",
124
+ action="store_true",
125
+ default=False,
126
+ dest="prequantized_flow",
127
+ help="Load the flow model from a prequantized checkpoint "
128
+ + "(requires loading the flow model, running a minimum of 24 steps, "
129
+ + "and then saving the state_dict as a safetensors file), "
130
+ + "which reduces the size of the checkpoint by about 50% & reduces startup time",
131
+ )
132
  return parser.parse_args()
133
 
134
 
135
  def main():
136
  args = parse_args()
137
 
138
+ # lazy loading so cli returns fast instead of waiting for torch to load modules
139
+ from flux_pipeline import FluxPipeline
140
+ from util import load_config, ModelVersion
141
+
142
  if args.config_path:
143
  app.state.model = FluxPipeline.load_pipeline_from_config_path(
144
  args.config_path, flow_model_path=args.flow_model_path
 
163
  num_to_quant=args.num_to_quant,
164
  compile_extras=args.compile,
165
  compile_blocks=args.compile,
166
+ quant_text_enc=(
167
+ None if args.quant_text_enc == "bf16" else args.quant_text_enc
168
+ ),
169
+ quant_ae=args.quant_ae,
170
+ offload_flow=args.offload_flow,
171
+ offload_ae=args.offload_ae,
172
+ offload_text_enc=args.offload_text_enc,
173
+ prequantized_flow=args.prequantized_flow,
174
  )
175
  app.state.model = FluxPipeline.load_pipeline_from_config(config)
176
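For illustration, the short flags registered above could be combined with the existing path arguments like this (a sketch; paths are placeholders and all other options keep their defaults):

```bash
# -qT: quantize the T5 text encoder (qint4/qint8 use bitsandbytes, qfloat8/qint2 use quanto)
# -qA: quantize the autoencoder with float8 linear layers
# -OF: offload the flow model to the CPU when it is not in use
python main.py --port 8088 --host 0.0.0.0 \
    --flow-model-path /path/to/your/flux1-dev.sft \
    --text-enc-path /path/to/your/t5-v1_1-xxl-encoder-bf16 \
    --autoencoder-path /path/to/your/ae.sft \
    --model-version flux-dev \
    -qT qint4 -qA -OF
```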
 
util.py CHANGED
@@ -1,6 +1,6 @@
1
  import json
2
  from pathlib import Path
3
- from typing import Optional
4
 
5
  import torch
6
  from modules.autoencoder import AutoEncoder, AutoEncoderParams
@@ -113,7 +113,16 @@ def load_config(
113
  num_to_quant: Optional[int] = 20,
114
  compile_extras: bool = False,
115
  compile_blocks: bool = False,
116
- ):
117
  text_enc_device = str(parse_device(text_enc_device))
118
  ae_device = str(parse_device(ae_device))
119
  flux_device = str(parse_device(flux_device))
@@ -166,6 +175,17 @@ def load_config(
166
  num_to_quant=num_to_quant,
167
  compile_extras=compile_extras,
168
  compile_blocks=compile_blocks,
169
  )
170
 
171
 
@@ -193,12 +213,10 @@ def print_load_warning(missing: list[str], unexpected: list[str]) -> None:
193
  )
194
 
195
 
196
- def load_flow_model(config: ModelSpec) -> Flux:
197
  ckpt_path = config.ckpt_path
198
  FluxClass = Flux
199
  if config.prequantized_flow:
200
- from modules.flux_model_f8 import Flux as FluxF8
201
-
202
  FluxClass = FluxF8
203
 
204
  with torch.device("meta"):
 
1
  import json
2
  from pathlib import Path
3
+ from typing import Literal, Optional
4
 
5
  import torch
6
  from modules.autoencoder import AutoEncoder, AutoEncoderParams
 
113
  num_to_quant: Optional[int] = 20,
114
  compile_extras: bool = False,
115
  compile_blocks: bool = False,
116
+ offload_text_enc: bool = False,
117
+ offload_ae: bool = False,
118
+ offload_flow: bool = False,
119
+ quant_text_enc: Optional[Literal["float8", "qint2", "qint4", "qint8"]] = None,
120
+ quant_ae: bool = False,
121
+ prequantized_flow: bool = False,
122
+ ) -> ModelSpec:
123
+ """
124
+ Load a model configuration using the passed arguments.
125
+ """
126
  text_enc_device = str(parse_device(text_enc_device))
127
  ae_device = str(parse_device(ae_device))
128
  flux_device = str(parse_device(flux_device))
 
175
  num_to_quant=num_to_quant,
176
  compile_extras=compile_extras,
177
  compile_blocks=compile_blocks,
178
+ offload_flow=offload_flow,
179
+ offload_text_encoder=offload_text_enc,
180
+ offload_vae=offload_ae,
181
+ text_enc_quantization_dtype={
182
+ "float8": QuantizationDtype.qfloat8,
183
+ "qint2": QuantizationDtype.qint2,
184
+ "qint4": QuantizationDtype.qint4,
185
+ "qint8": QuantizationDtype.qint8,
186
+ }.get(quant_text_enc, None),
187
+ ae_quantization_dtype=QuantizationDtype.qfloat8 if quant_ae else None,
188
+ prequantized_flow=prequantized_flow,
189
  )
190
 
191
 
 
213
  )
214
 
215
 
216
+ def load_flow_model(config: ModelSpec) -> Flux | FluxF8:
217
  ckpt_path = config.ckpt_path
218
  FluxClass = Flux
219
  if config.prequantized_flow:
 
 
220
  FluxClass = FluxF8
221
 
222
  with torch.device("meta"):
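For reference, a rough sketch (not part of this commit) of how the new keyword arguments line up with the CLI flags added in `main.py`; only names visible in this diff are used, and the remaining required arguments (model version, checkpoint paths, and so on) are omitted for brevity:

```py
from util import load_config

# hypothetical call showing only the arguments touched by this commit
config = load_config(
    flux_device="cuda:0",
    text_enc_device="cuda:0",
    ae_device="cuda:0",
    compile_extras=True,
    compile_blocks=True,
    quant_text_enc="qint4",   # mapping keys are "float8", "qint2", "qint4", "qint8";
                              # note the CLI choice "qfloat8" is not a key, so it maps to None
    quant_ae=True,            # sets ae_quantization_dtype to QuantizationDtype.qfloat8
    offload_text_enc=True,    # stored as offload_text_encoder in the ModelSpec
    offload_ae=True,          # stored as offload_vae in the ModelSpec
    offload_flow=False,
    prequantized_flow=False,
)
```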