kevinintel committed • Commit 52e97e8 • Parent(s): a0bc363

Update README.md

README.md CHANGED
The article discusses how to make inference of transformer-based models more efficient on Intel hardware. The authors propose a 1x4 sparse pattern to fit Intel instructions and improve performance. We implement 1x4 block pruning and obtain an 80% sparse model on the SQuAD1.1 dataset. Combined with quantization, it achieves up to a **24.2x speedup with less than 1% accuracy loss**. The article also shows performance gains for other models with this approach.

The model card has been written by Intel.
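To make the 1x4 pattern concrete: pruning removes weights in contiguous groups of four along each row of a weight matrix, so the runtime can apply Intel vector instructions to the surviving blocks. Below is a minimal, illustrative PyTorch sketch of one-shot magnitude-based 1x4 block pruning; the actual model is pruned gradually during fine-tuning by intel-extension-for-transformers, not by a one-shot function like this.

```python
import torch

def prune_1x4_blocks(weight: torch.Tensor, sparsity: float = 0.8) -> torch.Tensor:
    """One-shot magnitude pruning with a 1x4 block pattern (illustrative only)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "row length must be divisible by the block width"
    # Group each row into consecutive 1x4 blocks and score each block by L1 norm.
    blocks = weight.reshape(out_features, in_features // 4, 4)
    scores = blocks.abs().sum(dim=-1)
    # Zero the lowest-scoring blocks until the target sparsity is reached.
    k = int(scores.numel() * sparsity)
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).unsqueeze(-1)
    return (blocks * mask).reshape(out_features, in_features)

w = torch.randn(768, 3072)                    # e.g. a DistilBERT FFN weight matrix
w_sparse = prune_1x4_blocks(w, sparsity=0.8)
print(f"sparsity: {(w_sparse == 0).float().mean():.2%}")   # ~80%
```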
### Model license
Licensed under the MIT license.
| Model Detail | Description |
| --- | --- |
| Language | en |
| Model Authors Company | Intel |
| Date | June 7, 2023 |
| Version | 1 |
| Type | NLP - Tiny language model |
| Architecture | "We propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model, showing minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and throughput results under typical production constraints and environments. Our results outperform the existing state-of-the-art Neural Magic DeepSparse runtime performance by up to 50% and achieve up to a 4.1x speedup over ONNX Runtime." |
| Paper or Other Resources | https://arxiv.org/abs/2211.07715 |
| License | MIT |
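The quantization step named in the Architecture row can be approximated with stock tooling. The following is a rough sketch only, using PyTorch post-training dynamic quantization on the dense base model; the paper's pipeline instead relies on its own inference runtime with fused int8 sparse kernels, which is where most of the reported speedup comes from.

```python
import torch
from transformers import AutoModelForQuestionAnswering

# Illustrative only: stock dynamic quantization, not the custom int8
# sparse kernels of the paper's inference engine.
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
model.eval()

# Replace Linear layers with int8 versions; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```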
### How to use
Please follow the README at https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-classification/deployment/sparse/distilbert_base_uncased
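For a quick functional check outside the optimized runtime, the checkpoint can also be loaded with the standard transformers API. A minimal sketch, assuming this card corresponds to the Intel/distilbert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa checkpoint on the Hub (substitute the actual model ID if it differs); note that plain PyTorch will not exploit the sparsity, so the reported speedup requires the engine from the linked README.

```python
from transformers import pipeline

# Assumed Hub ID for this model card -- replace if the actual ID differs.
qa = pipeline(
    "question-answering",
    model="Intel/distilbert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa",
)
result = qa(
    question="Which sparse pattern does the model use?",
    context="The model was pruned to 80% sparsity with a 1x4 block pattern "
            "to match Intel vector instructions.",
)
print(result["answer"], round(result["score"], 3))
```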