---
license: apache-2.0
---

# The Public-shared LoRA Adapter for shuyuej/Llama-3.3-70B-Instruct-GPTQ Model

This is the publicly shared LoRA adapter for the `shuyuej/Llama-3.3-70B-Instruct-GPTQ` model.<br>
Please check out our GPTQ-quantized model at [https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ).

## Download Our LoRA Adapter

```bash
git lfs install
git clone https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ
```
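
If you would rather not pull the full Git history with LFS, the `huggingface_hub` CLI should also work; a minimal sketch (the `--local-dir` path is just an example):

```bash
# Download only the adapter files, without cloning the full Git repository.
# Requires the huggingface_hub CLI: pip install -U "huggingface_hub[cli]"
# The --local-dir path below is just an example.
huggingface-cli download shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ \
    --local-dir Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ
```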

# 🔥 Real-world deployment

For real-world deployment, please refer to the [vLLM Distributed Inference and Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [OpenAI Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) documentation. We provide a deployment script [here](https://github.com/vkola-lab/PodGPT/blob/main/scripts/deployment.py).

> [!NOTE]
> The vLLM version we are using is `0.6.2`. Please check [this version](https://github.com/vllm-project/vllm/releases/tag/v0.6.2).
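
To match this setup, one option (assuming a pip-based environment) is to pin the version at install time:

```bash
# Pin vLLM to the release used in this repository (assumes a pip-based Python environment).
pip install vllm==0.6.2
```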

vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications built on the OpenAI API.
By default, the server starts at `http://localhost:8000`.
To serve the base model together with the LoRA adapter, launch vLLM with the `--enable-lora` flag and register the adapter via `--lora-modules`:

```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123 \
    --enable-lora \
    --lora-modules adapter=Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ/checkpoint-18640
```
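
Once the server is up, a quick sanity check (assuming the default host and port plus the API key above) is to list the served models; both the base model and the LoRA module registered as `adapter` should appear:

```bash
# List the models exposed by the OpenAI-compatible server.
# The output should include the base model and the LoRA adapter named "adapter".
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer token-abc123"
```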

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application built on the OpenAI API.
For example, you can also query the server via the `openai` Python package:

```python
#!/usr/bin/env python
# coding=utf-8

import time
import asyncio

from openai import AsyncOpenAI

# Our system prompt
SYSTEM_PROMPT = (
    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
    "specializing in science, technology, engineering, mathematics, and medicine "
    "(STEMM)-related research and education, powered by podcast audio.\n"
    "I provide information based on established scientific knowledge but must not offer "
    "personal medical advice or present myself as a licensed medical professional.\n"
    "I will maintain a consistently professional and informative tone, avoiding humor, "
    "sarcasm, and pop culture references.\n"
    "I will prioritize factual accuracy and clarity while ensuring my responses are "
    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
    "My responses are for informational purposes only and should not be considered a "
    "substitute for professional consultation."
)

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


async def main(message):
    """
    Streaming responses with async usage and "await" with each API call:
    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
    :param message: The user query
    """
    start_time = time.time()
    stream = await client.chat.completions.create(
        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": message,
            }
        ],
        max_tokens=2048,
        temperature=0.2,
        top_p=1,
        stream=True,
        extra_body={
            "ignore_eos": False,
            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/blob/main/config.json#L10-L14
            "stop_token_ids": [128001, 128008, 128009],
        },
    )

    print(f"The user's query is\n {message}\n ")
    print("The model's response is\n")
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
    print("=" * 100)


if __name__ == "__main__":
    # Some random user queries
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
        "Can you tell me more about Bruce Lee?",
        "What are the differences between DNA and RNA?",
        "What is dementia and Alzheimer's disease?",
        "Tell me the differences between Alzheimer's disease and dementia"
    ]

    # Conduct model inference
    for message in prompts:
        asyncio.run(main(message=message))
        print("\n\n")
```
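
The script above queries the base model. To route a request through the LoRA adapter instead, pass the name registered via `--lora-modules` (here, `adapter`) as the `model` field; a minimal `curl` sketch under the same server settings:

```bash
# Send one chat completion request through the LoRA adapter
# ("adapter" is the name given to --lora-modules in the serve command above).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-abc123" \
    -d '{
          "model": "adapter",
          "messages": [{"role": "user", "content": "What are the differences between DNA and RNA?"}],
          "max_tokens": 512,
          "temperature": 0.2
        }'
```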

Here is a demo of the real-world model inference and deployment:

<p align="center">
    <a href="https://www.medrxiv.org/content/10.1101/2024.07.11.24310304v2"> <img src="https://github.com/vkola-lab/PodGPT/raw/main/figures/inference_demo.gif"></a>
</p>