Inference generation extremely slow

#57
by aledane - opened

Hi,
I am using the quantized version of the model with these settings:

                "params" : {
                            "trust_remote_code" : True,
                            "torch_dtype":torch.bfloat16,
                            "return_full_text" : True,
                            "device_map" : "auto",
                            "max_new_tokens" : 16,
                            "do_sample" : True,
                            "temperature" : 0.01,
                            "renormalize_logits" : True
                           },

However, inference is extremely slow (it has been running for 1 hour on a simple question).
I am running the model on a g5.4xlarge SageMaker instance (16 vCPUs, 64 GB RAM, one NVIDIA A10G GPU).
Any ideas on how to speed this up? Thanks

Hi @aledane

I am using the model in the quantization version

Can you elaborate more? Which quantization method are you using?

I was trying to use this: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
However, I realized I was not really using it due to a coding mistake; instead, I was deploying the original version mistralai/Mixtral-8x7B-Instruct-v0.1.

Other than quantization, is there any way to speed up generation with the original model?
Is it just a resource problem (so I need a larger SageMaker instance), or is there another way?

Can you share your code and the versions of the SageMaker SDK and TGI you used? I've been trying to deploy both models on SageMaker but I haven't been able to.

Hi @aledane
I suspect your model is silently being loaded with CPU offloading because you don't have enough GPU memory. You can make sure to use torch.float16 by passing torch_dtype=torch.float16 to from_pretrained, or load the model in 4-bit precision through the bitsandbytes package so that the model fits on your GPU.
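One way to check the offloading hypothesis is to look at the device map that gets assigned when device_map="auto" is used. The sketch below is illustrative rather than anyone's actual code in this thread: the helper names find_offloaded_modules and load_4bit are hypothetical, and it assumes transformers, accelerate and bitsandbytes are installed. Whether the 4-bit weights really fit on a single 24 GB GPU is not guaranteed.

```python
def find_offloaded_modules(device_map):
    """Return module names placed on CPU or disk instead of a GPU.

    `device_map` is a dict like the one exposed as `model.hf_device_map`
    after loading with device_map="auto": module name -> device
    (a GPU index such as 0, or the strings "cpu" / "disk").
    """
    return sorted(
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    )


def load_4bit(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Hypothetical helper: load the model in 4-bit via bitsandbytes."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )


# Made-up device map for illustration (not real output from this model):
example_map = {"model.embed_tokens": 0, "model.layers.0": 0,
               "model.layers.1": "cpu", "lm_head": "disk"}
print(find_offloaded_modules(example_map))
```

After loading, calling find_offloaded_modules(model.hf_device_map) should return an empty list if everything landed on the GPU. Any entries on "cpu" or "disk" would explain very slow generation.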

Hi @ybelkada, thank you for your reply. I am already using a 16-bit dtype (torch.bfloat16) in from_pretrained, as shown in the parameters above.
I can try 4-bit precision though, although honestly I do not think it will change much.

You will need more than one A10 for that. fp16 takes about 90 GB of VRAM, so 2x A100/H100 or 2x A6000 are fine either way.
6 bpw on ExLlamaV2 takes about 38 GB, so you can cram that into a single A6000.
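The figures above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times bits per weight, divided by 8. This sketch assumes about 46.7 billion total parameters for Mixtral 8x7B (a figure not stated in this thread) and counts weights only, so KV cache, activations and framework overhead come on top. The function name weight_memory_gb is hypothetical.

```python
MIXTRAL_PARAMS = 46.7e9  # assumption: approximate total parameter count


def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Compare the precisions discussed in this thread.
for label, bits in [("fp16/bf16", 16), ("6 bpw", 6), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(MIXTRAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone come to roughly 93 GB at 16 bits and roughly 35 GB at 6 bpw, which is in line with the numbers quoted above once overhead is added, and far more than a single 24 GB A10G can hold without offloading.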
