|
--- |
|
language: |
|
- en |
|
- de |
|
- fr |
|
- it |
|
- pt |
|
- hi |
|
- es |
|
- th |
|
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct |
|
pipeline_tag: text-generation |
|
tags: |
|
- directml |
|
- windows |
|
--- |
|
# Model Card for Meta-Llama-3.1-8B-Instruct ONNX GenAI INT4 (DirectML)
|
|
|
## Model Details |
|
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br> |
|
Output is reformatted so that each sentence starts on a new line to improve readability.
|
<pre>
...
vNewDecoded = tokenizer_stream.decode(new_token)
# Insert a line break when the previous token was ".", ":" or ";"
# (\x2E, \x3A, \x3B) and the new token starts a new word,
# unless the new token begins a markdown bullet (" *").
if re.findall("^[\x2E\x3A\x3B]$", vPreviousDecoded) and vNewDecoded.startswith(" ") and (not vNewDecoded.startswith(" *")):
    vNewDecoded = "\n" + vNewDecoded.replace(" ", "", 1)
print(vNewDecoded, end='', flush=True)
vPreviousDecoded = vNewDecoded
...
</pre>
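
For context, below is a minimal sketch of the streaming loop the snippet above lives in, modeled on the phi3-qa.py example that onnxgenairun.py is derived from (see step 4 under Direct Use). The exact onnxruntime-genai API surface (e.g. compute_logits) differs between versions, so treat this as an outline rather than the shipped script.

<pre>
import onnxruntime_genai as og

model = og.Model(".")                        # folder holding model.onnx and genai_config.json
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Hello") # chat-formatted prompt goes here
generator = og.Generator(model, params)

vPreviousDecoded = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    vNewDecoded = tokenizer_stream.decode(new_token)
    # ... line-break logic from the snippet above ...
    print(vNewDecoded, end='', flush=True)
    vPreviousDecoded = vNewDecoded
</pre>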
|
|
|
### Model Description |
|
meta-llama/Meta-Llama-3.1-8B-Instruct quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
|
https://onnxruntime.ai/docs/genai/howto/install.html#directml |
|
|
|
Created using ONNX Runtime GenAI's builder.py<br> |
|
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py |
|
|
|
Build options:<br> |
|
INT4 accuracy level: FP32 (float32) |
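
For reference, a hedged example of the build invocation; the flag names follow current builder.py conventions, int4_accuracy_level=1 corresponds to FP32 here, and the output path is a placeholder:

<pre>
rem Example only; check "python builder.py --help" for the exact options.
python builder.py -m meta-llama/Meta-Llama-3.1-8B-Instruct -o .\llama31-8b-int4-dml -p int4 -e dml --extra_options int4_accuracy_level=1
</pre>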
|
|
|
- **Developed by:** Mochamad Aris Zamroni |
|
|
|
### Model Sources
|
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct |
|
|
|
### Direct Use |
|
This is a Microsoft Windows DirectML optimized model.<br>

It may not work with ONNX Runtime execution providers other than DmlExecutionProvider.<br>

The required Python scripts are included in this repository.
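
To confirm that DirectML is actually available on a machine, a quick diagnostic check; this assumes the standalone onnxruntime-directml package is installed (onnxruntime-genai-directml bundles its own runtime, so this is only an aid, not a requirement):

<pre>
import onnxruntime as ort
# 'DmlExecutionProvider' should appear in this list on a DirectML-capable system.
print(ort.get_available_providers())
</pre>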
|
|
|
Setup and usage:<br>
|
1. Install Python 3.10 from Windows Store:<br> |
|
https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US |
|
|
|
2. Open a command prompt (cmd.exe).
|
|
|
3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
|
mkdir c:\temp<br> |
|
cd c:\temp<br> |
|
python -m venv dmlgenai<br> |
|
dmlgenai\Scripts\activate.bat<br> |
|
pip install onnxruntime-genai-directml |
|
|
|
4. Use onnxgenairun.py to get a chat interface.<br>

It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>

The modification moves the text output to a new line after ".", ":" and ";" to make the output easier to read.
|
|
|
rem Change directory to where the model and script files are stored<br>
|
cd this_onnx_model_directory<br> |
|
python onnxgenairun.py --help<br> |
|
python onnxgenairun.py -m . -v -g |
|
|
|
5. (Optional but recommended) Device-specific optimization (see the sketch after these steps).<br>

a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>

b. Run the script: python dml-device-specific-optim.py<br>

c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>

d. Rerun step 4.
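
The following is a minimal sketch of what such a device-specific optimization pass can look like, assuming dml-device-specific-optim.py relies on ONNX Runtime's optimized_model_filepath mechanism; the file paths are placeholders, and the actual script in this repository may differ:

<pre>
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Apply all graph optimizations, including ones specific to the local GPU.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Serialize the optimized graph so it can replace model.onnx (step 5.c).
sess_options.optimized_model_filepath = r"c:\temp\model_optimized.onnx"

ort.InferenceSession(r"c:\temp\model.onnx", sess_options,
                     providers=["DmlExecutionProvider"])
</pre>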
|
|
|
#### Speeds, Sizes, Times
|
15 tokens/s on a Radeon 780M with 8 GB of pre-allocated RAM.<br>

This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>

For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration reaches 13 tokens/s.
|
|
|
#### Hardware |
|
AMD Ryzen 7 7840U (Zen 4) with integrated Radeon 780M GPU<br>

32 GB RAM<br>
|
|
|
#### Software |
|
Microsoft DirectML on Windows 10 |
|
|
|
## Model Card Authors
|
Mochamad Aris Zamroni |
|
|
|
## Model Card Contact |
|
https://www.linkedin.com/in/zamroni/ |