Update README.md
README.md CHANGED
@@ -17,7 +17,9 @@ For real-world deployment, please refer to the [vLLM Distributed Inference and S
 > [!NOTE]
 > The vLLM version we are using is `0.6.2`. Please check [this version](https://github.com/vllm-project/vllm/releases/tag/v0.6.2).
 
-vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
+vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
+By default, it starts the server at `http://localhost:8000`.
+Please use vLLM to serve the base model with the LoRA adapter by including the `--enable-lora` flag and specifying `--lora-modules`:
 ```shell
 vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
     --quantization gptq \
@@ -28,7 +30,7 @@ vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
     --pipeline-parallel-size 4 \
     --api-key token-abc123 \
     --enable-lora \
-    --lora-modules adapter=checkpoint-18640
+    --lora-modules adapter=Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ/checkpoint-18640
 ```
 
 Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API.
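As a quick sanity check, the OpenAI-compatible `/v1/models` endpoint lists the served models; the LoRA module registered via `--lora-modules` should appear under the name `adapter`. A minimal sketch, assuming the default `http://localhost:8000` address and the `token-abc123` API key from the command above:

```shell
# List the models exposed by the OpenAI-compatible server.
# The LoRA module registered with --lora-modules shows up as "adapter".
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer token-abc123"
```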
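Likewise, a minimal sketch of a chat completion request against the LoRA adapter, under the same assumptions (default address, `token-abc123` API key, adapter name `adapter`); any OpenAI client can be pointed at `http://localhost:8000/v1` in the same way:

```shell
# Send a chat completion request to the served LoRA adapter,
# exactly as one would against the OpenAI API.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-abc123" \
    -d '{
        "model": "adapter",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```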