|
--- |
|
language: |
|
- en |
|
- de |
|
- fr |
|
- it |
|
- pt |
|
- hi |
|
- es |
|
- th |
|
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct |
|
pipeline_tag: text-generation |
|
tags: |
|
- directml |
|
- windows |
|
--- |
|
# Model Card for Meta-Llama-3.1-8B-Instruct ONNX GenAI INT4 (DirectML)
|
|
|
## Model Details |
|
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br> |
|
Output is reformatted so that each sentence starts on a new line to improve readability.
|
<pre>
...
vNewDecoded = tokenizer_stream.decode(new_token)
# Insert a line break when the previous token was ".", ":" or ";"
# (\x2E, \x3A, \x3B) and the new token starts a new word,
# unless the new token begins a markdown bullet (" *").
if re.findall("^[\x2E\x3A\x3B]$", vPreviousDecoded) and vNewDecoded.startswith(" ") and (not vNewDecoded.startswith(" *")):
    vNewDecoded = "\n" + vNewDecoded.replace(" ", "", 1)
print(vNewDecoded, end='', flush=True)
vPreviousDecoded = vNewDecoded
...
</pre>
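
For context, below is a minimal sketch of the streaming loop the snippet above lives in, modeled on the phi3-qa.py example that onnxgenairun.py is derived from (see step 4 under Direct Use). The exact onnxruntime-genai API surface (e.g. compute_logits) differs between versions, so treat this as an outline rather than the shipped script.

<pre>
import onnxruntime_genai as og

model = og.Model(".")                        # folder holding model.onnx and genai_config.json
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Hello") # chat-formatted prompt goes here
generator = og.Generator(model, params)

vPreviousDecoded = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    vNewDecoded = tokenizer_stream.decode(new_token)
    # ... line-break logic from the snippet above ...
    print(vNewDecoded, end='', flush=True)
    vPreviousDecoded = vNewDecoded
</pre>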
|
|
|
### Model Description |
|
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
|
https://onnxruntime.ai/docs/genai/howto/install.html#directml |
|
|
|
Created using ONNX Runtime GenAI's builder.py<br> |
|
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py |
|
|
|
Build options:<br> |
|
INT4 accuracy level: FP32 (float32) |
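
For reference, a hedged example of the build invocation; the flag names follow current builder.py conventions, int4_accuracy_level=1 corresponds to FP32 here, and the output path is a placeholder:

<pre>
rem Example only; check "python builder.py --help" for the exact options.
python builder.py -m meta-llama/Meta-Llama-3.1-8B-Instruct -o .\llama31-8b-int4-dml -p int4 -e dml --extra_options int4_accuracy_level=1
</pre>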
|
|
|
- **Developed by:** Mochamad Aris Zamroni |
|
|
|
### Model Sources
|
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct |
|
|
|
### Direct Use |
|
This is a Microsoft Windows DirectML optimized model.<br>

It may not work with ONNX Runtime execution providers other than DmlExecutionProvider.<br>

The required Python scripts are included in this repository.
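
To confirm that DirectML is actually available on a machine, a quick diagnostic check; this assumes the standalone onnxruntime-directml package is installed (onnxruntime-genai-directml bundles its own runtime, so this is only an aid, not a requirement):

<pre>
import onnxruntime as ort
# 'DmlExecutionProvider' should appear in this list on a DirectML-capable system.
print(ort.get_available_providers())
</pre>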
|
|
|
Setup and usage:<br>
|
1. Install Python 3.10 from Windows Store:<br> |
|
https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US |
|
|
|
2. Open a command prompt (cmd.exe).
|
|
|
3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
|
mkdir c:\temp<br> |
|
cd c:\temp<br> |
|
python -m venv dmlgenai<br> |
|
dmlgenai\Scripts\activate.bat<br> |
|
pip install onnxruntime-genai-directml |
|
|
|
4. Use onnxgenairun.py to get a chat interface.<br>

It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>

The modification moves the text output to a new line after ".", ":" and ";" to make the output easier to read.
|
|
|
rem Change directory to where the model and script files are stored<br>
|
cd this_onnx_model_directory<br> |
|
python onnxgenairun.py --help<br> |
|
python onnxgenairun.py -m . -v -g |
|
|
|
5. (Optional but recommended) Device-specific optimization (see the sketch after these steps).<br>

a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>

b. Run the script: python dml-device-specific-optim.py<br>

c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>

d. Rerun step 4.
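
The following is a minimal sketch of what such a device-specific optimization pass can look like, assuming dml-device-specific-optim.py relies on ONNX Runtime's optimized_model_filepath mechanism; the file paths are placeholders, and the actual script in this repository may differ:

<pre>
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Apply all graph optimizations, including ones specific to the local GPU.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Serialize the optimized graph so it can replace model.onnx (step 5.c).
sess_options.optimized_model_filepath = r"c:\temp\model_optimized.onnx"

ort.InferenceSession(r"c:\temp\model.onnx", sess_options,
                     providers=["DmlExecutionProvider"])
</pre>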
|
|
|
#### Speeds, Sizes, Times
|
15 tokens/s on a Radeon 780M with 8 GB of pre-allocated RAM.<br>

This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>

For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration reaches 13 tokens/s.
|
|
|
#### Hardware |
|
AMD Ryzen 7 7840U (Zen 4) with integrated Radeon 780M GPU<br>

32 GB RAM<br>
|
|
|
#### Software |
|
Microsoft DirectML on Windows 10 |
|
|
|
## Model Card Authors
|
Mochamad Aris Zamroni |
|
|
|
## Model Card Contact |
|
https://www.linkedin.com/in/zamroni/ |