Any ways to speed up inference?

#7
by mayukitan - opened

Inference time has jumped sharply, from roughly 90s-180s on a single device for cogvideox-5b to 550s-1000s now. Is there any way to reduce it? Thanks!

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

This significant increase in inference time is due to the substantially higher video frame rate and resolution. With the same computing power, generation takes much longer because the amount of computation has grown several times over.
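As a rough back-of-the-envelope check of that explanation, the sketch below compares the raw pixel volume of one clip under the two models' default settings. The resolutions and frame counts are assumptions taken from the public model cards (720x480 with 49 frames for CogVideoX-5B, 1360x768 with 81 frames for CogVideoX1.5-5B), not numbers stated in this thread, and `pixel_volume` is a hypothetical helper.

```python
def pixel_volume(width, height, frames):
    """Total number of pixels in one clip (a crude proxy for workload)."""
    return width * height * frames

# Assumed default settings, see the note above.
old = pixel_volume(720, 480, 49)    # CogVideoX-5B
new = pixel_volume(1360, 768, 81)   # CogVideoX1.5-5B

ratio = new / old
print(f"pixel volume ratio: {ratio:.2f}x")  # prints "pixel volume ratio: 5.00x"
```

Under these assumptions the clip is about 5x larger, which is in the same range as the reported jump from 90s-180s to 550s-1000s. Self-attention cost grows faster than linearly with the number of tokens, so the real slowdown can be larger than this simple ratio suggests.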

On an H800, the estimated inference time is 4 hours. Is this normal?

[Attached image: 65c263c0e60aa66988c0468919f33b4.png]

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Have you switched to pipe.to("cuda") as mentioned in the cli_demo.py in the GitHub repository?

I used it, but it still takes a long time:
[Attached image: image.png]

[Attached image: image.png]

Here is my code.

[Attached image: image.png]

@Ravencwn Comment out pipe.enable_sequential_cpu_offload(). The diffusers documentation says:

CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process.

...but inference is much slower...
https://huggingface.co/docs/diffusers/main/en/optimization/memory#cpu-offloading
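A minimal sketch of what that change could look like with diffusers, assuming the pipeline class is `CogVideoXPipeline` and the checkpoint is `THUDM/CogVideoX1.5-5B` (the model ID is an assumption, since the poster's code was only shared as an image). The `load_pipeline` helper and its `offload` argument are hypothetical names used for illustration.

```python
def load_pipeline(model_id="THUDM/CogVideoX1.5-5B", offload="none"):
    """Load the pipeline with an explicit memory/speed trade-off.

    offload="none":       whole model on the GPU, fastest, most VRAM
    offload="model":      whole models moved to the GPU one at a time
    offload="sequential": submodule-level offload, least VRAM, slowest
    """
    if offload not in ("none", "model", "sequential"):
        raise ValueError(f"unknown offload mode: {offload!r}")

    # Imported here so the function can be defined without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    if offload == "none":
        pipe.to("cuda")
    elif offload == "model":
        pipe.enable_model_cpu_offload()
    else:
        pipe.enable_sequential_cpu_offload()
    return pipe
```

Calling `load_pipeline()` with the default keeps everything on the GPU, which is the setup suggested above. It only works if the card has enough VRAM for the whole model; otherwise `offload="model"` may be a middle ground worth trying before falling back to sequential offload.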

@tsqn Thanks! Could you tell me how long inference takes for you with CogVideoX1.5?

@Ravencwn Hard to tell; it depends on the number of steps, the frames per second, and the output size you generate. For example, to generate anything on a free A100 (ZeroGPU Space), I have to lower the steps and the number of generated frames so that generation takes less than 120 seconds. You can try CogVideoX-5B-24frames_20steps-low_vram (but this is not the 1.5 version).
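Lowering steps and frames as described can be sketched like this, assuming `pipe` is an already loaded diffusers `CogVideoXPipeline`. The `generate_short_clip` helper is hypothetical, and the defaults of 20 steps and 24 frames simply mirror the Space name above; whether 24 is a valid frame count for a particular checkpoint is not verified here.

```python
def generate_short_clip(pipe, prompt, steps=20, frames=24):
    """Run the pipeline with reduced steps and frames to cut generation time.

    Denoising time grows roughly in proportion to the number of steps,
    and fewer frames means fewer tokens to process per step.
    """
    result = pipe(
        prompt=prompt,
        num_inference_steps=steps,
        num_frames=frames,
    )
    # diffusers video pipelines return a batch; take the first video.
    return result.frames[0]
```

The returned frames can then be written to disk with `diffusers.utils.export_to_video`.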

I've also tried this CogVideoX1.5-5B-I2V workflow for ComfyUI locally. I have an RTX 3060 12GB, and 53 frames at 29 steps and 704x448 takes about 1 hour with the GGUF version of the model, with VAE tiling enabled but without CPU offload (because I'm using the quantized version). I'm new to this too ^^

EDIT:
I've installed flash_attention_2, and inference time dropped from 1 hour to 8 minutes at 352x352 + 50 steps + 49 frames + 16fps, so I don't know what to tell you. Pretty crazy.
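A quick way to confirm that FlashAttention is visible to the Python environment in use is to check for the `flash_attn` package, which is the import name of the `flash-attn` distribution. The `flash_attn_available` helper is a hypothetical name. This only shows that the package is installed; whether a given pipeline or ComfyUI node actually uses it is a separate question.

```python
import importlib.util

def flash_attn_available():
    """Return True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

print("flash_attn importable:", flash_attn_available())
```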
