Optimize inference speed
Can ONNX optimization be applied to improve inference speed?
Yes. There are some open-source ONNX models on Hugging Face, such as: https://huggingface.co/aapot/bge-m3-onnx
@CoolWP I am the maintainer of https://github.com/michaelfeil/infinity. bge-m3 is compatible, and infinity will speed up your inference on GPU by around 2-3x by using async tokenization, fp16, flash-attention, torch nested tensors, and torch.compile.
@michaelfeil Hi, nice project! I have 2 questions:
- Will it accelerate CPU inference?
- On GPU, will it reduce VRAM usage, or are only performance optimizations supported?
I'm running low on VRAM.
It will reduce VRAM usage by roughly half by using fp16 precision, and it can dispatch to e.g. memory-efficient attention. If you use the full sequence length, I would suggest limiting the batch size in infinity to 8.
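To make the fp16 point concrete, here is a rough back-of-the-envelope sketch. It only estimates the memory taken by the model weights (parameter count times bytes per parameter). The parameter count below is an assumed, approximate figure for illustration, and the helper name `weight_memory_gib` is made up for this example. Activation memory, which grows with batch size and sequence length, is not included, which is why limiting the batch size still matters.

```python
# Rough estimate of the memory used by model weights alone.
# Activations, attention buffers and framework overhead are NOT included.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate weight memory in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: an approximate parameter count, used only for illustration.
# Check the real number for the checkpoint you are loading.
NUM_PARAMS = 568_000_000

for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Whatever the exact parameter count is, the fp16 figure comes out at half of the fp32 figure, since each weight takes 2 bytes instead of 4.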
You can also run ONNX inference (there is no ONNX version of this model at this point in time), which will give you best-in-class acceleration on Intel / AMD CPUs.
@CoolWP Hi!
I'm trying infinity with BAAI/bge-m3, but I'm only getting the embedding results, and I suspect the rerank endpoint will not work for getting the scores. Is there any way to get the model scores?
For example:
{
'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
}
It would be very useful, because in my opinion this feature is the most relevant one for this great multilingual model, maybe through the rerank endpoint.
Regards
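Regarding the score dictionary above: the combined entries look like weighted averages of the individual ones. The sketch below is a client-side illustration, not infinity's API. The function name `combine_scores` is made up, and the weights (0.4 dense, 0.2 sparse, 0.4 colbert) are an assumption; they happen to reproduce the 'sparse+dense' and 'colbert+sparse+dense' values posted above to within floating-point rounding. So if the individual scores are available, the combined ones can be computed without a dedicated endpoint.

```python
# Individual scores copied from the example above.
dense = [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625]
sparse = [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625]
colbert = [0.7796499729156494, 0.4621465802192688,
           0.4523794651031494, 0.7898575067520142]


def combine_scores(score_lists, weights):
    """Weighted average of several per-pair score lists.

    score_lists: lists of equal length, one per scoring mode.
    weights: one weight per scoring mode (assumed values, see above).
    """
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, per_pair)) / total
        for per_pair in zip(*score_lists)
    ]


# Assumed weights: 0.4 for dense, 0.2 for sparse, 0.4 for colbert.
sparse_dense = combine_scores([dense, sparse], [0.4, 0.2])
all_modes = combine_scores([dense, sparse, colbert], [0.4, 0.2, 0.4])

print("sparse+dense:", sparse_dense)
print("colbert+sparse+dense:", all_modes)
```

This does not solve getting the sparse and colbert scores out of the server in the first place; it only shows how the combined values relate to the individual ones.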