Inference generation extremely slow
Hi,
I am using the quantized version of the model with these settings:
"params": {
    "trust_remote_code": True,
    "torch_dtype": torch.bfloat16,
    "return_full_text": True,
    "device_map": "auto",
    "max_new_tokens": 16,
    "do_sample": True,
    "temperature": 0.01,
    "renormalize_logits": True
},
However, inference is extremely slow: generation has been running for an hour on a simple question.
I am running the model on a g5.4xlarge SageMaker instance (16 vCPUs, 64 GB RAM, one NVIDIA A10G GPU with 24 GB of memory).
Any idea how to speed this up? Thanks
I was trying to use this: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
However, I realized I was not really using it due to a coding mistake; instead, I was deploying the original version mistralai/Mixtral-8x7B-Instruct-v0.1.
Other than quantization, is there any way to speed up generation with the original model?
Is it just a resource problem (so I have to move to a larger SageMaker instance), or is there another way?
Can you share your code and the versions of the SageMaker SDK and TGI you used? I've been trying to deploy both models on SageMaker but haven't been able to.
Hi @aledane
I suspect your model is silently loaded with CPU offloading because you don't have enough GPU RAM. You can make sure to use torch.float16 by passing torch_dtype=torch.float16 in from_pretrained, or load the model in 4-bit precision through the bitsandbytes package so that your model will fit into your GPU.
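For anyone who wants to check the offloading suspicion, here is a minimal sketch, not a tested recipe. It assumes transformers, accelerate and bitsandbytes are installed. `offloaded_modules` and `load_4bit` are hypothetical helper names made up for this example. Models loaded with device_map="auto" expose an hf_device_map attribute, and any entry placed on "cpu" or "disk" means part of the model is not on the GPU. Depending on the library version, a 4-bit load that does not fit may raise an error instead of offloading.

```python
def offloaded_modules(device_map):
    """Return the module names that were placed on CPU or disk rather than a GPU.

    `device_map` is a dict like `model.hf_device_map`, mapping module names
    to a GPU index (int) or to the strings "cpu" / "disk".
    """
    return sorted(
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    )


def load_4bit(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Load a causal LM in 4-bit precision through bitsandbytes (sketch)."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matmuls
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )


# Usage (needs a GPU, bitsandbytes, and the model weights):
# model = load_4bit()
# print(offloaded_modules(model.hf_device_map))  # empty list = fully on GPU
```

The same check works on the original bf16 load: if the list is not empty, slow generation is most likely caused by layers being moved between CPU and GPU on every forward pass.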
You will need more than one A10 for that. fp16 takes about 90 GB of VRAM, so two A100s/H100s or two A6000s would work either way.
At 6 bpw with ExLlamaV2 it takes about 38 GB, so you can cram that into a single A6000.
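The numbers above can be sanity-checked with a rough back-of-the-envelope calculation. This sketch assumes Mixtral-8x7B has about 46.7 billion parameters in total (an assumption, not something stated in this thread) and counts weights only. The KV cache, activations and framework overhead come on top, which is why real usage is somewhat higher than the raw weight size.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed total parameter count for Mixtral-8x7B (all experts included).
MIXTRAL_PARAMS = 46.7e9

for label, bits in [("fp16/bf16", 16), ("6 bpw", 6), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(MIXTRAL_PARAMS, bits):.1f} GB of weights")
```

Under that assumption the fp16/bf16 weights alone are well above the 24 GB of a single A10G, so device_map="auto" has to place most layers on CPU, and even 4-bit weights sit close to the card's limit.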