parinitarahi committed on
Commit
b50e3d9
verified
1 Parent(s): e482f4d

Update README.md

Files changed (1)
  1. README.md +68 -3
README.md CHANGED
@@ -1,3 +1,68 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: text-generation
4
+ tags:
5
+ - ONNX
6
+ - DML
7
+ - ONNXRuntime
8
+ - mistral
9
+ - conversational
10
+ - custom_code
11
+ inference: false
12
+ ---
13
+
14
+ # Mistral-7B-Instruct-v0.2 ONNX models
15
+
16
+ <!-- Provide a quick summary of what the model is/does. -->
17
+ This repository hosts optimized versions of [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) that accelerate inference with ONNX Runtime.
18
+
19
+ The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruction-fine-tuned version of Mistral-7B-v0.2.
20
+
21
+ Optimized Mistral models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms and Windows, Linux, and Mac desktops, with the precision best suited to each of these targets.
22
+
23
+ [DirectML](https://aka.ms/directml) support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Mistral across a range of CPU and GPU devices.
24
+
25
+ To get started with Mistral, you can use [Olive](https://github.com/microsoft/Olive), our easy-to-use, hardware-aware model optimization tool. See the [Olive Mistral example](https://github.com/microsoft/Olive/tree/main/examples/mistral) for instructions on how to run it with Mistral.
26
+
27
+ ## ONNX Models
28
+
29
+ Here are some of the optimized configurations we have added:
30
+
31
+ 1. ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using [AWQ](https://arxiv.org/abs/2306.00978).
32
+ 2. ONNX model for fp16 CUDA: ONNX model for NVIDIA GPUs, in fp16 precision.
33
+ 3. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs, quantized to int4 using round-to-nearest (RTN).
34
+ 4. ONNX model for int4 CPU: ONNX model for CPUs, quantized to int4 using round-to-nearest (RTN).
35
+
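The mapping from hardware to the four configurations above can be sketched as a small helper. This is only an illustration: the function name `choose_variant` and the labels it returns (`int4-dml`, `fp16-cuda`, `int4-cuda`, `int4-cpu`) are hypothetical and are not taken from this repository's folder layout. The execution provider names are the ones ONNX Runtime uses.

```python
def choose_variant(available_providers, prefer_fp16=False):
    """Pick a configuration label from a list of ONNX Runtime provider names.

    The returned labels are hypothetical shorthand for the four
    configurations in the list above, not actual repository paths.
    """
    providers = set(available_providers)
    if "CUDAExecutionProvider" in providers:
        # NVIDIA GPU through CUDA: fp16 for full half precision, int4 (RTN) otherwise.
        return "fp16-cuda" if prefer_fp16 else "int4-cuda"
    if "DmlExecutionProvider" in providers:
        # AMD, Intel, or NVIDIA GPU on Windows through DirectML: int4 (AWQ).
        return "int4-dml"
    # No GPU provider available: fall back to the int4 (RTN) CPU model.
    return "int4-cpu"
```

With ONNX Runtime installed, the list returned by `onnxruntime.get_available_providers()` can be passed in directly. Note that it reports the providers compiled into the installed package, which does not by itself guarantee that matching hardware is present.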
36
+ ## Hardware Supported
37
+
38
+ The models are tested on:
39
+ - GPU SKU: RTX 4090 (DirectML)
40
+ - GPU SKU: 1 A100 80GB GPU on a Standard_ND96amsr_A100_v4 VM (CUDA)
41
+ - CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)
42
+
43
+ Minimum Configuration Required:
44
+ - Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
45
+ - CUDA: NVIDIA GPU with compute capability (SM version) >= 7.0 (i.e., V100 or newer)
46
+
47
+ ## Model Description
48
+
49
+ - **Developed by:** Microsoft
50
+ - **Model type:** ONNX
51
+ - **Language(s) (NLP):** Python, C, C++
52
+ - **License:** Apache License Version 2.0
53
+ - **Model Description:** This is a conversion of the Mistral-7B-Instruct-v0.2 model for ONNX Runtime inference.
54
+
55
+ ## Additional Details
56
+ - [**Mistral Model Announcement Link**](https://mistral.ai/news/announcing-mistral-7b/)
57
+ - [**Mistral Model Card**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
58
+ - [**Mistral Technical Report**](https://arxiv.org/abs/2310.06825)
59
+
60
+ ## Appendix
61
+
62
+ ### Activation-Aware Weight Quantization (AWQ)
63
+
64
+ AWQ identifies the roughly 1% of weights that matter most for accuracy, judged by activation magnitudes rather than by the weights themselves, and protects them while the remaining 99% are quantized as usual. This leads to less accuracy loss than many other quantization techniques. For more on AWQ, see the [AWQ paper](https://arxiv.org/abs/2306.00978).
65
+
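To make the idea concrete, here is a toy sketch of round-to-nearest (RTN) int4 quantization on one weight group, followed by the scaling trick AWQ relies on: multiplying a salient weight by a factor before quantization and dividing it back afterwards. This is not the code used to produce the models in this repository. The function names, the symmetric int4 scheme, the four-weight group, and the hand-picked scale of 2 are all assumptions for illustration; real AWQ searches for per-channel scales on calibration data and folds the inverse scale into the preceding operation.

```python
def quantize_rtn_int4(weights):
    """Symmetric round-to-nearest quantization of one weight group to int4."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest weight to code 7
    # Python's round() rounds exact ties to the nearest even integer.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int4 codes back to floating-point weights."""
    return [c * scale for c in codes]


def quantize_with_salient_scale(weights, salient_index, s):
    """Scale one salient weight by s before RTN, then undo the scale afterwards."""
    scaled = list(weights)
    scaled[salient_index] *= s
    codes, scale = quantize_rtn_int4(scaled)
    restored = dequantize(codes, scale)
    restored[salient_index] /= s  # in a real model this is absorbed by the activations
    return restored


group = [0.70, 0.14, -0.32, 0.06]  # made-up weights; index 1 plays the salient one
plain = dequantize(*quantize_rtn_int4(group))
protected = quantize_with_salient_scale(group, salient_index=1, s=2.0)

plain_error = abs(group[1] - plain[1])
protected_error = abs(group[1] - protected[1])
```

In this example the scaled weight lands on a finer effective grid, so `protected_error` comes out smaller than `plain_error`. The group's quantization step is unchanged because the scaled weight is still smaller than the group maximum.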
66
+
67
+ ## Model Card Contact
68
+ sschoenmeyer, sunghcho, kvaishnavi