Perplexed - code runs online, caches everything (26GB!), but won't run disconnected from internet
I've been chasing this for a few hours - started out feeling pretty sure I would figure it out, but 4 hours later still stuck. Would be super grateful for any help.
When connected to the internet, the code (below) executes without issue. However, when I don't have internet, it fails.
I also have the SD 3.5 Medium model set up and it runs fine both on- and offline (but it doesn't use BitsAndBytesConfig or SD3Transformer2DModel).
I have made several successful runs, when connected, which as I understand it should download the required files to the local cache. I know the first time I ran the code it downloaded something like 26GB of files. :)
I have verified the cache exists on my system. So why won't this run if I don't have an internet connection?!? I have a suspicion it has to do with BitsAndBytesConfig and SD3Transformer2DModel, which are not used in my Medium code... but I still can't figure it out.
***** CODE *****
import sys
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch
# Get the prompt
prompt = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Describe Image: ")
model_id = "stabilityai/stable-diffusion-3.5-large"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Disable the safety checker
pipeline.safety_checker = None
pipeline.enable_model_cpu_offload()
image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.show()
No doubt you know about
import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # assign via os.environ; os.putenv() changes aren't reflected in os.environ
When you try to use it with this model, it fails because the repo has no checkpoint (and no CLIP, and no UNET ... it's pretty half-baked).
I was able to get it to work offline by saving a local checkpoint of the repo. Instantiate the pipe as you normally do, then:
pipe.save_pretrained("/path/to/models/stable-diffusion-3.5-large")
Then load from this path instead of the HF model ID:
pipe = DiffusionPipeline.from_pretrained("/path/to/models/stable-diffusion-3.5-large", subfolder="transformer", local_files_only=True)
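Putting the two steps together, here's a minimal sketch of the round trip I mean (the path is just an example; run the save step once while online, and the load step then works offline):
import torch
from diffusers import StableDiffusion3Pipeline

local_path = "/path/to/models/stable-diffusion-3.5-large"  # example location, adjust to your setup

# One-time, while online: pull the full pipeline and write a complete local copy
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
pipe.save_pretrained(local_path)

# Later, offline: load strictly from disk; local_files_only=True fails fast
# on a missing file instead of trying to reach the Hub
pipe = StableDiffusion3Pipeline.from_pretrained(
    local_path, torch_dtype=torch.bfloat16, local_files_only=True
)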
@mkfs - Thank you! In fact, I did not know about that, but I do now! I really appreciate your insight!
It seems I blindly stumbled into your proposed solution (I've copied it below) by repeated trial and error - and I mean a LOT of trial and error with a lot of verbose logging... :)
Here is where I ended up and it works fine now (but needs to be cleaned up, and perhaps I can still tweak for a little better performance):
import logging
import sys
import os
from huggingface_hub import snapshot_download
# Configure logging to both file and console
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('debug.log'),  # Save to file
        logging.StreamHandler()  # Print to console
    ]
)
# Enable logging for specific libraries
logging.getLogger("diffusers").setLevel(logging.DEBUG)
logging.getLogger("transformers").setLevel(logging.DEBUG)
logging.getLogger("huggingface_hub").setLevel(logging.DEBUG)
# Test that logging is working
logging.info("Logging setup complete")
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch
model_id = "stabilityai/stable-diffusion-3.5-large"
# Get the default cache location
cache_dir = os.path.expanduser("~/.cache/huggingface/hub")  # Resolves to C:\Users\rusty\.cache\huggingface\hub on Windows
logging.info("Downloading/verifying model files...")
local_model_path = snapshot_download(
    repo_id=model_id,
    revision="main",
    resume_download=True,
    cache_dir=cache_dir
)
logging.info(f"Model files located at: {local_model_path}")
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    local_model_path,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)
pipeline = StableDiffusion3Pipeline.from_pretrained(
    local_model_path,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Disable the safety checker
pipeline.safety_checker = None
pipeline.enable_model_cpu_offload()
# Get the prompt
prompt = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Describe Image: ")
image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=3.5,
    max_sequence_length=512,
    height=1024,
    width=512
).images[0]
image.show()
Looks good. I've been having some pretty good success offloading layers of SD until it fits on a consumer-grade video card (8-12GB VRAM). The method I use might be of interest to you.
First, I load the entire thing onto the "cpu" device and infer the memory map of the model:
from diffusers import SD3Transformer2DModel
from accelerate import infer_auto_device_map
model = SD3Transformer2DModel.from_pretrained(model_id, subfolder="transformer", device="cpu")
memspec = {0: "7GiB", "cpu": "64GiB"}  # not too critical
device_map = infer_auto_device_map(model, max_memory=memspec)
print("DEVICE MAP: " + str(device_map))
This will print a dict that looks something like this:
{
    'pos_embed': 'cpu',
    'time_text_embed': 'cpu',
    'context_embedder': 'cpu',
    'transformer_blocks.0': 'cpu',
    'transformer_blocks.1': 'cpu',
    ...
    'transformer_blocks.4.norm1': 'cpu',
    'transformer_blocks.4.norm1_context.silu': 'cpu',
    'transformer_blocks.4.norm1_context.linear': 'cpu',
    'transformer_blocks.4.norm1_context.norm': 'cpu',
    'transformer_blocks.4.attn': 'cpu',
    'transformer_blocks.4.norm2': 'cpu',
    'transformer_blocks.4.ff': 'cpu',
    'transformer_blocks.4.norm2_context': 'cpu',
    'transformer_blocks.4.ff_context': 'cpu',
    ...
    'norm_out': 'cpu',
    'proj_out': 'cpu'
}
Paste the dict into your code as variable "device_map", and set every member of the dict to value "cpu", then start tweaking by setting specific layers to 0 (first video card). You can change your model loading code to something like this to use the map:
model = SD3Transformer2DModel.from_pretrained(model_id, subfolder="transformer", device_map=device_map)
Now for the important bits.
- the memspec tells infer_auto_device_map how much memory it may use on each device, but its placement strategy is terrible, so we only use it to generate the map (we could probably build the map just by inspecting the model, but this is fast)
- layers "pos_embed", "time_text_embed", "context_embedder", and "proj_out" should always be on CPU
- the first and last transformer_blocks layers should be on GPU (i.e. 0)
- the complex layers, such as 4 in the above, should be on GPU
- in general, you want the first couple and the last couple of layers on the GPU, as well as those surrounding the complex layers (see the sketch after this list)
- you want to minimize CPU-to-GPU copying, so try to clump layers together, so if there are multiple complex layers, try to also add all of the layers between them into the GPU
- use nvtop or something similar to watch GPU memory while testing - different parameters sent to the model can change memory usage, so ensure there is 500MB to 1GB free on the video card once the model has loaded
- some layers "like" to be on the CPU, and some pairs of layers "like" to be on the same device - you'll see "missing/no data" errors when you violate that
- all of this applies only when the entire model cannot fit onto the video card; for those lucky souls who can just load the model without worrying about it, disregard this advice
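For what it's worth, here is a rough sketch of how you could apply those rules programmatically instead of hand-editing the pasted dict. It continues from the snippet above, so model, memspec, and model_id are assumed to already be defined; the "complex" block indices are placeholders you'd pick after inspecting your own map and watching memory:
from accelerate import infer_auto_device_map

# Start from the inferred map and put every module on the CPU...
device_map = {name: "cpu" for name in infer_auto_device_map(model, max_memory=memspec)}

# ...then promote selected transformer blocks to GPU 0.
block_ids = sorted(
    {int(k.split(".")[1]) for k in device_map if k.startswith("transformer_blocks.")}
)
gpu_blocks = set(block_ids[:2] + block_ids[-2:])  # first couple and last couple of blocks
gpu_blocks.update({4, 5})                         # placeholder: a "complex" block plus a neighbour

for name in device_map:
    if name.startswith("transformer_blocks.") and int(name.split(".")[1]) in gpu_blocks:
        device_map[name] = 0

model = SD3Transformer2DModel.from_pretrained(
    model_id, subfolder="transformer", device_map=device_map
)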
I've been doing this all using the native (float32?) version of the model. I did some comparisons: setting the datatype to float16 makes it take twice as long to run, even though it shouldn't. Once you nail down a device map, the float32 version really flies.
have fun!
wow! thank you for sharing that @mkfs !!! Can't wait to get some cycles to try this. I've got limited resources, so I'm currently doing all this on a computer with only an AMD Ryzen 3 5300G, 16GB of RAM, and my "big spend", an RTX 4060. Speaking candidly, the fact that SD 3.5 Large can be made to run at all on such a tiny configuration is amazing!
Now it's become my personal challenge to see how much performance I can get from such a modest setup. Your helpful info above will take me a huge leap forward, and I thank you for sharing it.
I'm going to go ahead and close this out just to keep things tidy since the original issue is solved. You have my deep appreciation.