Commit 0829077 (verified) by zamroni111
Parent(s): 6e510b3

Update README.md

Files changed (1): README.md (+21 −65)
README.md CHANGED
@@ -6,10 +6,6 @@ pipeline_tag: text-generation
---
# Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->
-
- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
-
## Model Details
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization
 
@@ -24,24 +20,14 @@ INT4 accuracy level: FP32 (float32)<br>
8-bit quantization for MoE layers

- **Developed by:** Mochamad Aris Zamroni
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
### Direct Use
- This is Windows DirectML optimized model.

Prerequisites:<br>
1. Install Python 3.10 from Windows Store:<br>
@@ -49,71 +35,41 @@ https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

2. Open command line cmd.exe

- 3. Create python virtual environment and install onnxruntime-genai-directml<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- #### Preprocessing [optional]
-
- [More Information Needed]

#### Speeds, Sizes, Times [optional]
- 15 token/s in Radeon 780M with 8GB dedicated RAM
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- Microsoft Windows DirectML

#### Hardware
-
- AMD Ryzen 7840U with integrated Radeon 780M GPU
- RAM 32GB
- shared VRAM 8GB

#### Software
-
- Microsoft Windows DirectML

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
-
https://www.linkedin.com/in/zamroni/
 
README.md (updated):

---
# Model Card for Model ID

## Model Details
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization

[...]
8-bit quantization for MoE layers

- **Developed by:** Mochamad Aris Zamroni
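
For context, ONNX GenAI INT4 models of this kind are typically produced with the onnxruntime-genai model builder. The invocation below is a hedged illustration, not the exact command used for this repository: the output path is hypothetical and flag names can differ between onnxruntime-genai versions.

```
python -m onnxruntime_genai.models.builder -m meta-llama/Meta-Llama-3.1-8B-Instruct -o c:\temp\llama31-8b-int4-dml -p int4 -e dml
```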
 
 
 
 
### Model Sources [optional]
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

### Direct Use
+ This is a Microsoft Windows DirectML optimized model.<br>
+ It may not work with ONNX Runtime execution providers other than DmlExecutionProvider.<br>
+ The needed Python scripts are included in this repository.

Prerequisites:<br>
1. Install Python 3.10 from Windows Store:<br>
https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

2. Open command line cmd.exe

+ 3. Create a python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

+ 4. Use onnxgenairun.py to get a chat interface.<br>
+ It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>
+ The modification inserts a line break after ".", ":" and ";" so the output is easier to read.

+ python onnxgenairun.py --help<br>
+ python onnxgenairun.py -m . -v -g
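
For reference, the core of onnxgenairun.py presumably looks like the streaming loop below, adapted from the phi3-qa.py example it is based on. This is only a sketch: exact onnxruntime-genai API calls vary between releases, and the punctuation handling is an illustration of the line-break modification described above, not the repository's exact code.

```python
# Sketch of a phi3-qa.py-style streaming loop; the actual onnxgenairun.py
# in this repo may differ. API names follow the onnxruntime-genai releases
# current when that example was written.
import onnxruntime_genai as og

model = og.Model(".")               # folder containing model.onnx + genai_config.json
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()  # incremental detokenizer

text = input("Prompt: ")
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode(text)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    piece = stream.decode(generator.get_next_tokens()[0])
    print(piece, end="", flush=True)
    if piece.endswith((".", ":", ";")):  # readability tweak: break the line
        print()                          # after ".", ":" and ";"
```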
 
 
 
 
 
 
 

+ 5. (Optional) Device-specific optimization:<br>
+ a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>
+ b. Run the script: python dml-device-specific-optim.py<br>
+ c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>
+ d. Rerun step 4.
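
dml-device-specific-optim.py itself is in the repository; as a rough illustration of how such a script can work, ONNX Runtime can re-serialize a model after applying the graph optimizations it selects for a specific execution provider (and thus for the local GPU). A minimal sketch, assuming the DirectML build of onnxruntime is installed and with placeholder file paths:

```python
# Minimal sketch of device-specific optimization with ONNX Runtime; the actual
# dml-device-specific-optim.py may do more. File paths are placeholders -
# edit them as described in step 5.a.
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = r"c:\temp\model_optimized.onnx"  # output (used in step 5.c)

# Building the session runs the optimizer for DmlExecutionProvider and writes
# the optimized graph to the path above.
ort.InferenceSession(r"c:\temp\model.onnx", sess_options=so,
                     providers=["DmlExecutionProvider"])
```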

#### Speeds, Sizes, Times [optional]
+ 15 tokens/s on a Radeon 780M with 8GB dedicated VRAM.<br>
+ This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>
+ For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration reaches 13 tokens/s.

#### Hardware
+ AMD Ryzen 7840U with integrated Radeon 780M GPU<br>
+ RAM 32GB<br>
+ 8GB pre-allocated iGPU VRAM
 

#### Software
+ Microsoft DirectML on Windows 10
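
To confirm that DirectML acceleration is actually available in the environment (again assuming the DirectML build of onnxruntime is present alongside onnxruntime-genai-directml), a quick check:

```python
# DmlExecutionProvider should appear in this list on a machine where
# DirectML acceleration is available.
import onnxruntime as ort
print(ort.get_available_providers())
```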
 

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
https://www.linkedin.com/in/zamroni/