Update README.md
README.md CHANGED
@@ -14,20 +14,6 @@ For real-world deployment, please refer to the [vLLM Distributed Inference and S
 
 vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API. By default, it starts the server at `http://localhost:8000`.
 ```shell
-vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
-    --quantization gptq \
-    --trust-remote-code \
-    --dtype float16 \
-    --max-model-len 4096 \
-    --distributed-executor-backend mp \
-    --pipeline-parallel-size 4 \
-    --api-key token-abc123
-```
-Please check [here](https://docs.vllm.ai/en/latest/usage/engine_args.html) if you wanna change `Engine Arguments`.
-
-If you would like to deploy your LoRA adapter, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/lora.html#serving-lora-adapters) for a detailed guide.
-It provides step-by-step instructions on how to serve LoRA adapters effectively in a vLLM environment.
-```shell
 vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
     --quantization gptq \
     --trust-remote-code \
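Since the context line above describes an OpenAI-compatible server listening on `http://localhost:8000`, here is a minimal sketch of querying it, assuming the server was started with the `vllm serve` command and `--api-key token-abc123` shown in the diff; the prompt content is illustrative only.

```shell
# Minimal sketch: call the OpenAI-compatible chat completions endpoint served by vLLM.
# Assumes the server above is running on localhost:8000 with --api-key token-abc123.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{
        "model": "shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```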