---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- directml
- windows
- onnx
- conversational
---
# Model Card for Meta-Llama-3.1-8B-Instruct ONNX INT4 DirectML
## Model Details
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
The output is reformatted so that each sentence starts on a new line, which improves readability.
<pre>
...
vNewDecoded = tokenizer_stream.decode(new_token)
# Start a new line after ".", ":" or ";" unless the next token begins a list item (" *").
if re.findall(r"^[\x2E\x3A\x3B]$", vPreviousDecoded) and vNewDecoded.startswith(" ") and not vNewDecoded.startswith(" *"):
    vNewDecoded = "\n" + vNewDecoded.replace(" ", "", 1)
print(vNewDecoded, end='', flush=True)
vPreviousDecoded = vNewDecoded
...
</pre>
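The same sentence-splitting rule can be exercised standalone. The sketch below mirrors the snippet above; the function name and sample tokens are illustrative, not taken from the repository's script:

```python
import re

def reformat_tokens(tokens):
    """Join streamed tokens, starting a new line after ".", ":" or ";"
    unless the following token begins a list item (" *")."""
    out = []
    prev = ""
    for tok in tokens:
        # Same condition as the streaming loop: previous token is a lone
        # ".", ":" or ";" and the current token starts a new word.
        if re.findall(r"^[\x2E\x3A\x3B]$", prev) and tok.startswith(" ") and not tok.startswith(" *"):
            tok = "\n" + tok.replace(" ", "", 1)
        out.append(tok)
        prev = tok
    return "".join(out)

print(reformat_tokens(["Hello", ".", " World", ".", " Next"]))
# → Hello.
#   World.
#   Next
```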
<img src="https://zci.sourceforge.io/epub/llama31.png">
### Model Description
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization<br>
https://onnxruntime.ai/docs/genai/howto/install.html#directml
Created using ONNX Runtime GenAI's builder.py<br>
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py
Build options:<br>
INT4 accuracy level: FP32 (float32)
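For reference, a build of this kind is typically produced with a builder.py invocation along these lines (the output directory name is illustrative and the flags are assumptions from the ONNX Runtime GenAI documentation, not the exact command used for this repository):

```shell
python builder.py -m meta-llama/Meta-Llama-3.1-8B-Instruct -o llama31-8b-int4-dml -p int4 -e dml
```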
- **Developed by:** Mochamad Aris Zamroni
### Model Sources
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
### Direct Use
This model is optimized for Microsoft Windows DirectML.<br>
It may not work with ONNX execution providers other than DmlExecutionProvider.<br>
The required Python scripts are included in this repository.
Prerequisites:<br>
1. Install Python 3.11 from the Microsoft Store:<br>
https://apps.microsoft.com/search/publisher?name=Python+Software+Foundation
2. Open a command prompt (cmd.exe).
3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml
4. Use onnxgenairun.py to get a chat interface.<br>
It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>
The modification starts a new output line after ".", ":" and ";" to make the output easier to read.
rem Change directory to where the model and script files are stored<br>
cd this_onnx_model_directory<br>
python onnxgenairun.py --help<br>
python onnxgenairun.py -m . -v -g
5. (Optional but recommended) Device-specific optimization.<br>
a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>
b. Run the script: python dml-device-specific-optim.py<br>
c. Rename the original model.onnx to another name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>
d. Rerun step 4.
#### Speeds, Sizes, Times
15 tokens/s on a Radeon 780M with 8 GB of pre-allocated RAM.<br>
This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>
For comparison, LM Studio running a GGUF INT4 model with VulkanML GPU acceleration reaches 13 tokens/s.
#### Hardware
AMD Ryzen Zen4 7840U with integrated Radeon 780M GPU<br>
RAM 32GB<br>
#### Software
Microsoft DirectML on Windows 10
## Model Card Authors
Mochamad Aris Zamroni
## Model Card Contact
https://www.linkedin.com/in/zamroni/