---
license: apache-2.0
---

# The Publicly Shared LoRA Adapter for the shuyuej/Llama-3.3-70B-Instruct-GPTQ Model
This is the publicly shared LoRA adapter for the `shuyuej/Llama-3.3-70B-Instruct-GPTQ` model.<br>
Please check out our GPTQ-quantized base model at [https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ).

## Download Our LoRA Adapter
```bash
git lfs install
git clone https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ
```
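
If you prefer not to use `git lfs`, here is a minimal sketch that downloads the same snapshot with the `huggingface_hub` library (an assumption; install it first with `pip install huggingface_hub`):
```python
# A minimal sketch: download the adapter without git-lfs.
# Assumes huggingface_hub is installed (pip install huggingface_hub).
from huggingface_hub import snapshot_download

adapter_path = snapshot_download(
    repo_id="shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ"
)
print(f"Adapter downloaded to: {adapter_path}")
```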

# 🔥 Real-world deployment
For real-world deployment, please refer to the [vLLM Distributed Inference and Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [OpenAI Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html). We provide a deployment script [here](https://github.com/vkola-lab/PodGPT/blob/main/scripts/deployment.py).

> [!NOTE]  
> The vLLM version we are using is `0.6.2`. Please check [this version](https://github.com/vllm-project/vllm/releases/tag/v0.6.2).
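
To match that version, you can pin the installation (assuming a pip-based environment):
```bash
pip install vllm==0.6.2
```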

vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to act as a drop-in replacement for applications built on the OpenAI API.
By default, the server starts at `http://localhost:8000`.
Serve the base model with the LoRA adapter by including the `--enable-lora` flag and specifying `--lora-modules`:
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123 \
    --enable-lora \
    --lora-modules adapter=Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ/checkpoint-18640
```
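
The `--lora-modules adapter=...` flag registers the adapter under the model name `adapter`, so requests can target either the base model or the LoRA-augmented one. Here is a minimal sketch, assuming the server command above and the `openai` package:
```python
# A minimal sketch: query the LoRA adapter by the name registered
# via --lora-modules ("adapter"); assumes the server command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="adapter",  # use the base model name instead to skip the LoRA
    messages=[{"role": "user", "content": "What are the differences between DNA and RNA?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```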

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application built on the OpenAI API.
For example, you can query the server via the `openai` Python package:
```python
#!/usr/bin/env python
# coding=utf-8

import time
import asyncio

from openai import AsyncOpenAI

# Our system prompt
SYSTEM_PROMPT = (
    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
    "specializing in science, technology, engineering, mathematics, and medicine "
    "(STEMM)-related research and education, powered by podcast audio.\n"
    "I provide information based on established scientific knowledge but must not offer "
    "personal medical advice or present myself as a licensed medical professional.\n"
    "I will maintain a consistently professional and informative tone, avoiding humor, "
    "sarcasm, and pop culture references.\n"
    "I will prioritize factual accuracy and clarity while ensuring my responses are "
    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
    "My responses are for informational purposes only and should not be considered a "
    "substitute for professional consultation."
)

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


async def main(message):
    """
    Streaming responses with async usage and "await" with each API call:
    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
    :param message: The user query
    """
    start_time = time.time()
    stream = await client.chat.completions.create(
        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": message,
            }
        ],
        max_tokens=2048,
        temperature=0.2,
        top_p=1,
        stream=True,
        extra_body={
            "ignore_eos": False,
            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/blob/main/config.json#L10-L14
            "stop_token_ids": [128001, 128008, 128009],
        },
    )

    print(f"The user's query is\n {message}\n  ")
    print("The model's response is\n")
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
    print("=" * 100)


if __name__ == "__main__":
    # Some random user queries
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
        "Can you tell me more about Bruce Lee?",
        "What are the differences between DNA and RNA?",
        "What is dementia and Alzheimer's disease?",
        "Tell me the differences between Alzheimer's disease and dementia"
    ]
    
    # Conduct model inference
    for message in prompts:
        asyncio.run(main(message=message))
        print("\n\n")
```
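
Beyond the OpenAI-compatible server, the same adapter can also be applied through vLLM's offline Python API. The following is a hedged sketch, not the PodGPT deployment script: the checkpoint path and the `tensor_parallel_size=4` setting (mirroring the four-way parallelism above) are assumptions you should adapt to your hardware:
```python
# A minimal offline-inference sketch with vLLM's Python API (v0.6.2).
# Paths and parallelism settings are assumptions; adjust for your setup.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
    quantization="gptq",
    dtype="float16",
    max_model_len=4096,
    tensor_parallel_size=4,  # assumed: a 70B model typically spans multiple GPUs
    enable_lora=True,
)

sampling_params = SamplingParams(temperature=0.2, top_p=1, max_tokens=256)

outputs = llm.generate(
    ["What are the differences between DNA and RNA?"],
    sampling_params,
    # Arguments: lora_name, lora_int_id, and the local adapter path
    lora_request=LoRARequest(
        "adapter", 1,
        "Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ/checkpoint-18640",
    ),
)
print(outputs[0].outputs[0].text)
```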


<details>
<summary>Here is a demo of the real-world model inference and deployment</summary>
<p align="center">
    <a href="https://www.medrxiv.org/content/10.1101/2024.07.11.24310304v2"> <img src="https://github.com/vkola-lab/PodGPT/raw/main/figures/inference_demo.gif"></a>
</p>
</details>