Efficiently run the model locally using ScaleLLM
https://github.com/vectorch-ai/ScaleLLM
ScaleLLM is a tool for serving language models locally. The project and its documentation are on GitHub at the link above. Here's how to set it up:
1: Start the model inference server
First, start the model inference server with the Docker command below. It runs a container with access to all GPUs (--gpus=all), mounts your local Hugging Face cache at /models, and serves the model named in HF_MODEL_ID (here 01-ai/Yi-6B):
docker run -it --gpus=all --net=host --shm-size=1g \
-v $HOME/.cache/huggingface/hub:/models \
-e HF_MODEL_ID=01-ai/Yi-6B \
-e DEVICE=auto \
docker.io/vectorchai/scalellm:latest --logtostderr
2: Start the REST API server
Next, start the REST API server by running the following Docker command:
docker run -it --net=host \
docker.io/vectorchai/scalellm-gateway:latest --logtostderr
You will then have the following services running:
ScaleLLM gRPC server on port 8888: localhost:8888
ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
ScaleLLM REST API server on port 8080: localhost:8080
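Before sending requests, it can help to confirm that the three ports listed above are actually accepting connections. The following is a minimal sketch using only the Python standard library; the helper names (port_is_open, check_services) and the service labels are made up for this example and are not part of ScaleLLM. It only tests that a TCP connection succeeds, so it does not prove the service behind the port is healthy.

```python
import socket

# Ports as listed above; the keys are just labels for this sketch.
SERVICES = {
    "gRPC server": 8888,
    "HTTP monitoring server": 9999,
    "REST API server": 8080,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service label to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

Calling print(check_services()) after both containers are up should show True for each entry; a False value suggests the corresponding container has not finished starting or is not using host networking.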
You can now send requests to the local REST API server to generate text completions using a command like this:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "01-ai/Yi-6B",
"prompt": "what is vue.js",
"max_tokens": 32,
"temperature": 0.7
}'
This command sends a POST request to the local REST API server, specifying the model, prompt, and other parameters to generate completions.
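If you would rather call the API from a script, the same request can be sketched in Python with only the standard library. The function names (build_request, complete) are hypothetical helpers for this example. The URL, model name, and parameters mirror the curl command above. The response format is not shown in this post, so complete simply returns the parsed JSON without assuming any particular fields.

```python
import json
import urllib.request

# REST API server from step 2 (same URL as the curl command).
API_URL = "http://localhost:8080/v1/completions"


def build_request(prompt, model="01-ai/Yi-6B", max_tokens=32,
                  temperature=0.7, url=API_URL):
    """Build the POST request that the curl command sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the parsed JSON response as a dict."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With both containers running, print(complete("what is vue.js")) prints the raw response so you can see which fields the server returns.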
Make sure Docker is installed and, if you want GPU acceleration, configured for GPU use (on NVIDIA hardware this means installing the NVIDIA Container Toolkit). With both containers running, the model is served entirely from your local machine through ScaleLLM.
I'm closing this one as explained in https://huggingface.co/01-ai/Yi-34B/discussions/16#65558f0c7446daf1ce2f4dbf