Commit 0829077 (verified) by zamroni111
Parent(s): 6e510b3

Update README.md

Files changed (1): README.md (+21 −65)
README.md CHANGED
@@ -6,10 +6,6 @@ pipeline_tag: text-generation
---
# Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->
-
- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
-
## Model Details
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization
 
@@ -24,24 +20,14 @@ INT4 accuracy level: FP32 (float32)<br>
8-bit quantization for MoE layers

- **Developed by:** Mochamad Aris Zamroni
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
### Direct Use
- This is Windows DirectML optimized model.

Prerequisites:<br>
1. Install Python 3.10 from Windows Store:<br>
@@ -49,71 +35,41 @@ https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

2. Open command line cmd.exe

- 3. Create python virtual environment and install onnxruntime-genai-directml<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- #### Preprocessing [optional]
-
- [More Information Needed]

#### Speeds, Sizes, Times [optional]
- 15 token/s in Radeon 780M with 8GB dedicated RAM
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- Microsoft Windows DirectML

#### Hardware
-
- AMD Ryzen 7840U with integrated Radeon 780M GPU
- RAM 32GB
- shared VRAM 8GB

#### Software
-
- Microsoft Windows DirectML

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
-
https://www.linkedin.com/in/zamroni/
 
README.md (updated):

---
# Model Card for Model ID

## Model Details
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization

[...]
8-bit quantization for MoE layers

- **Developed by:** Mochamad Aris Zamroni
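
For context, ONNX GenAI INT4 models of this kind are typically produced with the onnxruntime-genai model builder. The invocation below is a hedged illustration, not the exact command used for this repository: the output path is hypothetical and flag names can differ between onnxruntime-genai versions.

```
python -m onnxruntime_genai.models.builder -m meta-llama/Meta-Llama-3.1-8B-Instruct -o c:\temp\llama31-8b-int4-dml -p int4 -e dml
```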
 
 
 
 
### Model Sources [optional]
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

### Direct Use
+ This is a Microsoft Windows DirectML optimized model.<br>
+ It may not work with ONNX Runtime execution providers other than DmlExecutionProvider.<br>
+ The needed Python scripts are included in this repository.

Prerequisites:<br>
1. Install Python 3.10 from Windows Store:<br>
https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

2. Open command line cmd.exe

+ 3. Create a python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

+ 4. Use onnxgenairun.py to get a chat interface.<br>
+ It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>
+ The modification inserts a line break after ".", ":" and ";" so the output is easier to read.

+ python onnxgenairun.py --help<br>
+ python onnxgenairun.py -m . -v -g
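
For reference, the core of onnxgenairun.py presumably looks like the streaming loop below, adapted from the phi3-qa.py example it is based on. This is only a sketch: exact onnxruntime-genai API calls vary between releases, and the punctuation handling is an illustration of the line-break modification described above, not the repository's exact code.

```python
# Sketch of a phi3-qa.py-style streaming loop; the actual onnxgenairun.py
# in this repo may differ. API names follow the onnxruntime-genai releases
# current when that example was written.
import onnxruntime_genai as og

model = og.Model(".")               # folder containing model.onnx + genai_config.json
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()  # incremental detokenizer

text = input("Prompt: ")
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode(text)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    piece = stream.decode(generator.get_next_tokens()[0])
    print(piece, end="", flush=True)
    if piece.endswith((".", ":", ";")):  # readability tweak: break the line
        print()                          # after ".", ":" and ";"
```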
 
 
 
 
 
 
 

+ 5. (Optional) Device-specific optimization:<br>
+ a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>
+ b. Run the script: python dml-device-specific-optim.py<br>
+ c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>
+ d. Rerun step 4.
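
dml-device-specific-optim.py itself is in the repository; as a rough illustration of how such a script can work, ONNX Runtime can re-serialize a model after applying the graph optimizations it selects for a specific execution provider (and thus for the local GPU). A minimal sketch, assuming the DirectML build of onnxruntime is installed and with placeholder file paths:

```python
# Minimal sketch of device-specific optimization with ONNX Runtime; the actual
# dml-device-specific-optim.py may do more. File paths are placeholders -
# edit them as described in step 5.a.
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = r"c:\temp\model_optimized.onnx"  # output (used in step 5.c)

# Building the session runs the optimizer for DmlExecutionProvider and writes
# the optimized graph to the path above.
ort.InferenceSession(r"c:\temp\model.onnx", sess_options=so,
                     providers=["DmlExecutionProvider"])
```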

#### Speeds, Sizes, Times [optional]
+ 15 tokens/s on a Radeon 780M with 8GB dedicated VRAM.<br>
+ This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>
+ For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration reaches 13 tokens/s.

#### Hardware
+ AMD Ryzen 7840U with integrated Radeon 780M GPU<br>
+ RAM 32GB<br>
+ 8GB pre-allocated iGPU VRAM
 

#### Software
+ Microsoft DirectML on Windows 10
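
To confirm that DirectML acceleration is actually available in the environment (again assuming the DirectML build of onnxruntime is present alongside onnxruntime-genai-directml), a quick check:

```python
# DmlExecutionProvider should appear in this list on a machine where
# DirectML acceleration is available.
import onnxruntime as ort
print(ort.get_available_providers())
```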
 

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
https://www.linkedin.com/in/zamroni/