How to speed up inference?
Apart from int8, is there any plan to speed up inference, for example with FasterTransformer?
I don't think FasterTransformer is an easy route... maybe TorchScript and PyTorch 2.0 would work.
The easiest way to do this may be to use the inference server:
https://github.com/bigcode-project/starcoder#text-generation-inference
You could try: https://huggingface.co/michaelfeil/ct2fast-starcoder/blob/main/README.md
Amazing, how much of a speedup does ct2fast-starcoder give compared with the original StarCoder?
This also seems interesting: https://github.com/bigcode-project/starcoder.cpp
I did not have time to check for StarCoder. For SantaCoder:
Task: "def hello" -> generate 30 tokens
-> transformers pipeline in float16, CUDA: ~1300 ms per inference
-> ctranslate2 in int8, CUDA: ~315 ms per inference
For StarCoder the weights are bigger, so I would expect a smaller gain, maybe a 1.5-2.5x speedup.
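If you want to reproduce this kind of comparison on your own hardware, here is a rough timing sketch. It is not the script used for the numbers above; `time_per_call_ms` and `speedup` are hypothetical helper names, and `generate` stands for whatever zero-argument callable wraps your transformers pipeline or ctranslate2 call. It assumes the callable blocks until the generated text is actually available, otherwise asynchronous CUDA work would make the timings too optimistic.

```python
import statistics
import time
from typing import Callable


def time_per_call_ms(generate: Callable[[], object], warmup: int = 2, runs: int = 10) -> float:
    """Median wall-clock time of generate() in milliseconds.

    `generate` is a placeholder for your own call, e.g. a lambda that asks
    the model for 30 tokens after the prompt "def hello".
    """
    # Warm-up calls absorb one-off costs (CUDA context, kernel caches).
    for _ in range(warmup):
        generate()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # SantaCoder figures quoted above: ~1300 ms (float16) vs ~315 ms (int8).
    print(f"{speedup(1300, 315):.1f}x")
```

With the SantaCoder figures above this works out to roughly a 4x speedup, which is why the 1.5-2.5x guess for StarCoder is the more conservative number.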
This works like a charm, 100 times faster than StarChat and StarCoder. I tried GPUs with 8, 12, and 16 GB of memory and they all failed; a GPU with at least 24 GB works.
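A back-of-the-envelope estimate probably explains the 16 GB failure. This is only a sketch: the 15.5B parameter count comes from the StarCoder model card, not from this thread, `weight_memory_gib` is a hypothetical helper, and it counts weights only (activations, the KV cache, and framework overhead come on top).

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: ~15.5 billion parameters, per the StarCoder model card.
STARCODER_PARAMS = 15.5e9

if __name__ == "__main__":
    for name, nbytes in [("float16", 2), ("int8", 1)]:
        gib = weight_memory_gib(STARCODER_PARAMS, nbytes)
        print(f"{name}: ~{gib:.1f} GiB for weights")
```

The int8 weights alone come to a bit over 14 GiB, so a 16 GB card has almost no headroom left for generation, while a 24 GB card does.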