Any optimizations to speed up inference?
Inference time has jumped from 90s-180s on a single device for CogVideoX-5B to 550s-1000s now. Just wondering if there is any way to reduce inference time, thanks!
This significant increase in time is due to the substantial increase in video frame rate and resolution. With the same computing power it takes much longer, because the amount of computation has grown several times over.
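To get a rough feel for "several times over", here is a back-of-the-envelope sketch. The frame counts and resolutions are assumed defaults for the two models (they are not stated in this thread, so check the model cards), and pixel volume is only a proxy for the work done per denoising step.

```python
def relative_cost(frames: int, width: int, height: int) -> int:
    """Proxy for work per denoising step: total pixels across all frames."""
    return frames * width * height


# Assumed default generation settings (not from this thread).
old = relative_cost(frames=49, width=720, height=480)    # CogVideoX-5B
new = relative_cost(frames=81, width=1360, height=768)   # CogVideoX1.5-5B

ratio = new / old
print(f"about {ratio:.1f}x more pixels per step")  # about 5.0x more pixels per step
```

Attention cost grows faster than linearly with the number of tokens, so the real slowdown can be larger than this ratio suggests.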
Have you switched to pipe.to("cuda") as mentioned in the cli_demo.py in the GitHub repository?
@Ravencwn comment out pipe.enable_sequential_cpu_offload():
CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process.
...but inference is much slower...
https://huggingface.co/docs/diffusers/main/en/optimization/memory#cpu-offloading
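The trade-off between the two suggestions above (pipe.to("cuda") versus CPU offloading) could be wired up roughly like this. It is only a sketch: the helper names place_pipeline and load_cogvideox and the mode strings are made up for illustration, and the model id and dtype are assumptions. The pipeline methods themselves are the ones diffusers provides.

```python
def place_pipeline(pipe, mode: str = "cuda"):
    """Choose where the pipeline weights live (hypothetical helper).

    "cuda"       -> whole pipeline on the GPU: fastest, needs the most VRAM
    "model"      -> enable_model_cpu_offload(): whole components moved on and off the GPU
    "sequential" -> enable_sequential_cpu_offload(): submodule level, lowest VRAM, slowest
    """
    if mode == "cuda":
        pipe.to("cuda")
    elif mode == "model":
        pipe.enable_model_cpu_offload()
    elif mode == "sequential":
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return pipe


def load_cogvideox(mode: str = "cuda"):
    """Load and place the pipeline; needs torch, diffusers and a CUDA GPU."""
    import torch
    from diffusers import CogVideoXPipeline

    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b",        # assumed model id
        torch_dtype=torch.bfloat16,  # assumed dtype
    )
    pipe.vae.enable_tiling()  # reduces memory needed for VAE decoding
    return place_pipeline(pipe, mode)
```

Use one placement mode per pipeline, not several: if the model fits in VRAM, load_cogvideox("cuda") should be the fastest, and "sequential" is the fallback when nothing else fits.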
@Ravencwn Hard to tell, it depends on how many steps, frames per second, and what output size you generate. For example, to be able to generate anything on the free A100 (ZeroGPU Space) I had to lower the steps and the number of generated frames to keep generation time under 120 seconds. You can try it (but this is not the 1.5 version): CogVideoX-5B-24frames_20steps-low_vram
I've also tried this CogVideoX1.5-5B-I2V workflow for ComfyUI locally. I have an RTX 3060 12GB, and 53 frames, 29 steps at 704x448 takes about 1 hour with the GGUF version of the model, with VAE tiling enabled but without CPU offload (because I'm using the quantized version). I'm new to this too ^^
@EDIT: I've installed flash_attention_2 and inference time decreased from 1 hour to 8 minutes at 352x352 + 50 steps + 49 frames + 16fps, so I don't know what to tell you. Pretty crazy.