Perplexed - code runs online, caches everything (26GB!), but won't run disconnected from internet
I've been chasing this for a few hours - started out feeling pretty sure I would figure it out, but 4 hours later still stuck. Would be super grateful for any help.
When connected to the internet, the code (below) executes without issue. However, when I don't have internet, it fails.
I also have the SD 3.5 Medium model set up and it runs fine both on- and offline (but it doesn't use BitsAndBytesConfig or SD3Transformer2DModel).
I have made several successful runs, when connected, which as I understand it should download the required files to the local cache. I know the first time I ran the code it downloaded something like 26GB of files. :)
I have verified the cache exists on my system. So why won't this run if I don't have an internet connection?!? I have a suspicion it has to do with BitsAndBytesConfig and SD3Transformer2DModel, which are not used in my Medium code... but I still can't figure it out.
***** CODE *****
import sys
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch
# Get the prompt
prompt = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Describe Image: ")
model_id = "stabilityai/stable-diffusion-3.5-large"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Disable the safety checker
pipeline.safety_checker = None
pipeline.enable_model_cpu_offload()
image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.show()
No doubt you know about
import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # assign via os.environ; os.putenv() changes aren't reflected in os.environ
When you try to use it with this model, it fails because the repo has no checkpoint (and no CLIP, and no UNET ... it's pretty half-baked).
I was able to get it to work offline by saving a local checkpoint of the repo. Instantiate the pipe as you normally do, then:
pipe.save_pretrained("/path/to/models/stable-diffusion-3.5-large")
Then load from this path instead of the HF model ID:
pipe = DiffusionPipeline.from_pretrained("/path/to/models/stable-diffusion-3.5-large", subfolder="transformer", local_files_only=True)
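Putting the two steps together, here's a minimal sketch of the round trip I mean (the path is just an example; run the save step once while online, and the load step then works offline):
import torch
from diffusers import StableDiffusion3Pipeline

local_path = "/path/to/models/stable-diffusion-3.5-large"  # example location, adjust to your setup

# One-time, while online: pull the full pipeline and write a complete local copy
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
pipe.save_pretrained(local_path)

# Later, offline: load strictly from disk; local_files_only=True fails fast
# on a missing file instead of trying to reach the Hub
pipe = StableDiffusion3Pipeline.from_pretrained(
    local_path, torch_dtype=torch.bfloat16, local_files_only=True
)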
@mkfs - Thank you! In fact, I did not know about that, but I do now! I really appreciate your insight!
It seems I blindly stumbled into your proposed solution (I've copied it below) by repeated trial and error - and I mean a LOT of trial and error with a lot of verbose logging... :)
Here is where I ended up and it works fine now (but needs to be cleaned up, and perhaps I can still tweak for a little better performance):
import logging
import sys
import os
from huggingface_hub import snapshot_download
# Configure logging to both file and console
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('debug.log'),  # Save to file
        logging.StreamHandler()  # Print to console
    ]
)
# Enable logging for specific libraries
logging.getLogger("diffusers").setLevel(logging.DEBUG)
logging.getLogger("transformers").setLevel(logging.DEBUG)
logging.getLogger("huggingface_hub").setLevel(logging.DEBUG)
# Test that logging is working
logging.info("Logging setup complete")
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch
model_id = "stabilityai/stable-diffusion-3.5-large"
# Get the default cache location
cache_dir = os.path.expanduser("~/.cache/huggingface/hub")  # Resolves to C:\Users\rusty\.cache\huggingface\hub on Windows
logging.info("Downloading/verifying model files...")
local_model_path = snapshot_download(
    repo_id=model_id,
    revision="main",
    resume_download=True,
    cache_dir=cache_dir
)
logging.info(f"Model files located at: {local_model_path}")
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    local_model_path,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)
pipeline = StableDiffusion3Pipeline.from_pretrained(
    local_model_path,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Disable the safety checker
pipeline.safety_checker = None
pipeline.enable_model_cpu_offload()
# Get the prompt
prompt = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Describe Image: ")
image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=3.5,
    max_sequence_length=512,
    height=1024,
    width=512
).images[0]
image.show()
Looks good. I've been having some pretty good success offloading layers of SD until it fits on a consumer-grade video card (8-12GB VRAM). The method I use might be of interest to you.
First, I load the entire thing onto the "cpu" device and infer the memory map of the model:
from diffusers import SD3Transformer2DModel
from accelerate import infer_auto_device_map
model = SD3Transformer2DModel.from_pretrained(model_id, subfolder="transformer", device="cpu")
memspec = {0: "7GiB", "cpu": "64GiB"}  # not too critical
device_map = infer_auto_device_map(model, max_memory=memspec)
print("DEVICE MAP: " + str(device_map))
This will print a dict that looks something like this:
{
    'pos_embed': 'cpu',
    'time_text_embed': 'cpu',
    'context_embedder': 'cpu',
    'transformer_blocks.0': 'cpu',
    'transformer_blocks.1': 'cpu',
    ...
    'transformer_blocks.4.norm1': 'cpu',
    'transformer_blocks.4.norm1_context.silu': 'cpu',
    'transformer_blocks.4.norm1_context.linear': 'cpu',
    'transformer_blocks.4.norm1_context.norm': 'cpu',
    'transformer_blocks.4.attn': 'cpu',
    'transformer_blocks.4.norm2': 'cpu',
    'transformer_blocks.4.ff': 'cpu',
    'transformer_blocks.4.norm2_context': 'cpu',
    'transformer_blocks.4.ff_context': 'cpu',
    ...
    'norm_out': 'cpu',
    'proj_out': 'cpu'
}
Paste the dict into your code as variable "device_map", and set every member of the dict to value "cpu", then start tweaking by setting specific layers to 0 (first video card). You can change your model loading code to something like this to use the map:
model = SD3Transformer2DModel.from_pretrained(model_id, subfolder="transformer", device_map=device_map)
Now for the important bits.
- the memspec tells infer_auto_device_map how much memory it may use on each device, but its placement strategy is terrible, so we only use it to generate the map (we could probably build the map just by inspecting the model, but this is fast)
- layers "pos_embed", "time_text_embed", "context_embedder", and "proj_out" should always be on CPU
- the first and last transformer_blocks layers should be on GPU (i.e. 0)
- the complex layers, such as 4 in the above, should be on GPU
- in general, you want the first couple and the last couple of layers on the GPU, as well as those surrounding the complex layers (see the sketch after this list)
- you want to minimize CPU-to-GPU copying, so try to clump layers together, so if there are multiple complex layers, try to also add all of the layers between them into the GPU
- use nvtop or something similar to watch GPU memory while testing - different parameters sent to the model can change memory usage, so ensure there is 500MB to 1GB free on the video card once the model has loaded
- some layers "like" to be on the CPU, and some pairs of layers "like" to be on the same device - you'll see "missing/no data" errors when you violate that
- all of this applies only when the entire model cannot fit onto the video card; for those lucky souls who can just load the model without worrying about it, disregard this advice
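For what it's worth, here is a rough sketch of how you could apply those rules programmatically instead of hand-editing the pasted dict. It continues from the snippet above, so model, memspec, and model_id are assumed to already be defined; the "complex" block indices are placeholders you'd pick after inspecting your own map and watching memory:
from accelerate import infer_auto_device_map

# Start from the inferred map and put every module on the CPU...
device_map = {name: "cpu" for name in infer_auto_device_map(model, max_memory=memspec)}

# ...then promote selected transformer blocks to GPU 0.
block_ids = sorted(
    {int(k.split(".")[1]) for k in device_map if k.startswith("transformer_blocks.")}
)
gpu_blocks = set(block_ids[:2] + block_ids[-2:])  # first couple and last couple of blocks
gpu_blocks.update({4, 5})                         # placeholder: a "complex" block plus a neighbour

for name in device_map:
    if name.startswith("transformer_blocks.") and int(name.split(".")[1]) in gpu_blocks:
        device_map[name] = 0

model = SD3Transformer2DModel.from_pretrained(
    model_id, subfolder="transformer", device_map=device_map
)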
I've been doing this all using the native (float32?) version of the model. I did some comparisons: setting the datatype to float16 makes it take twice as long to run, even though it shouldn't. Once you nail down a device map, the float32 version really flies.
have fun!
wow! thank you for sharing that @mkfs !!! Can't wait to get some cycles to try this. I've got limited resources, so I'm currently doing all this on a computer with only an AMD Ryzen 3 5300G, 16GB of RAM, and my "big spend", an RTX 4060. Speaking candidly, the fact that SD 3.5 Large can be made to run at all on such a tiny configuration is amazing!
Now it's become my personal challenge to see how much performance I can get from such a modest setup. Your helpful info above will take me a huge leap forward, and I thank you for sharing it.
I'm going to go ahead and close this out just to keep things tidy since the original issue is solved. You have my deep appreciation.