---
language:
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ar
- el
- fa
- pl
- id
- cs
- he
- hi
- nl
- ro
- ru
- tr
- uk
- vi
license: cc-by-nc-4.0
library_name: transformers
tags:
- cohere
- pytorch
- awq
model_name: aya-expanse-8b-awq-4bit
base_model: CohereForAI/aya-expanse-8b
inference: false
model_creator: Cohere For AI
pipeline_tag: text-generation
quantized_by: kevinbazira
---

# aya-expanse-8b-awq-4bit

This repository contains a 4-bit quantized version of the `CohereForAI/aya-expanse-8b` model, produced with the [AWQ](https://huggingface.co/docs/transformers/en/quantization/awq) method.

## Model Summary

- **Quantized Model**: [kevinbazira/aya-expanse-8b-awq-4bit](https://huggingface.co/kevinbazira/aya-expanse-8b-awq-4bit)
- **Quantization Method**: [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978) (a quantization sketch follows this summary)
- **AWQ Version**: [GEMM](https://github.com/casper-hansen/AutoAWQ/tree/f1abb8ef8e261db78eb6c603f691801797fbb293?tab=readme-ov-file#int4-gemm-vs-int4-gemv-vs-fp16)
- **Precision**: 4-bit
- **Original Model**: [CohereForAI/aya-expanse-8b](https://huggingface.co/CohereForAI/aya-expanse-8b)

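
For reference, the snippet below is a minimal sketch of how a 4-bit GEMM AWQ checkpoint like this one can be produced with AutoAWQ. It is illustrative rather than the exact recipe used for this repository: the group size, zero-point setting, and calibration data are assumed defaults, not documented settings of this checkpoint.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "CohereForAI/aya-expanse-8b"
out_dir = "aya-expanse-8b-awq-4bit"

# 4-bit GEMM settings; group size and zero-point below are assumed defaults.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Run activation-aware quantization (uses AutoAWQ's default calibration set
# unless calib_data is provided) and save the quantized weights.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```

The resulting folder can then be loaded through `transformers`, exactly as in the inference example below.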

## How to Use the Quantized Model

### 1. Install the necessary packages

Before using the quantized model, please ensure your environment has the following installed (a quick version check is sketched after this list):
- [AutoAWQ_kernels](https://github.com/casper-hansen/AutoAWQ_kernels)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
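
The following snippet prints the installed versions of the relevant packages. The distribution names `autoawq` and `autoawq-kernels` are assumptions based on the usual PyPI packaging of the two repositories above.

```python
# Print installed versions of the packages the quantized model relies on.
# "autoawq" / "autoawq-kernels" are assumed PyPI distribution names.
from importlib.metadata import PackageNotFoundError, version

for dist in ("torch", "transformers", "autoawq", "autoawq-kernels"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```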

### 2. Run inference

Load and use the quantized model as shown below in Python:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Set up device
device = torch.device('cuda:1') # Remember to use the correct device here

# Load model and tokenizer
model_name = "kevinbazira/aya-expanse-8b-awq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantization_config = AwqConfig(version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": device.index},
    quantization_config=quantization_config
)

# Prepare input
# https://huggingface.co/docs/transformers/en/pad_truncation
input_text = "Add your prompt here."
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=64)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Perform text generation
# https://huggingface.co/docs/transformers/en/main_classes/text_generation
outputs = model.generate(
    **inputs,
    num_return_sequences=1,
    min_new_tokens=64,
    max_new_tokens=64,
    do_sample=False,
    use_cache=True,
    num_beams=1
)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
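
Note that `aya-expanse-8b` is an instruction-tuned chat model, so for conversational prompts you may get better results by formatting the input with the tokenizer's chat template instead of passing raw text. A minimal sketch, reusing the `model`, `tokenizer`, and `device` from the example above:

```python
# Build the prompt with the model's chat template (recommended for chat-tuned models).
messages = [{"role": "user", "content": "Add your prompt here."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt"
).to(device)

outputs = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```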

## Benchmark Results

To evaluate the performance of the quantized model, we ran benchmarks with the Hugging Face [Optimum Benchmark](https://github.com/huggingface/optimum-benchmark/tree/7cec62e016d76fe612308e4c2c074fc7f09289fd) tool on an AMD MI200 GPU with ROCm 6.1. The results are shown below:

### Unquantized Model Results:
<img src="unquantized-model-results.png" alt="Unquantized Model Results" style="width: 100%; object-fit: cover; display: block;">

### AWQ Quantized Model Results:
<img src="awq-quantized-model-results.png" alt="AWQ Quantized Model Results" style="width: 100%; object-fit: cover; display: block;">

These results show that the AWQ quantized model offers significant speed advantages during the critical inference stages (decode and per-token latency), which outweigh the higher latencies observed during the load and prefill phases. For deployment scenarios where inference speed is paramount, you can preload the quantized model so the load-time cost stays out of the request path.
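
If you want a rough, self-contained way to compare load and decode speed without setting up the full Optimum Benchmark harness, a timing loop along these lines can be used. This is only an illustrative sketch, not the benchmark configuration behind the numbers above; the prompt, token counts, and warmup are arbitrary choices.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "kevinbazira/aya-expanse-8b-awq-4bit"
device = torch.device("cuda:0")  # adjust to your GPU

# Measure load latency.
t0 = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map={"": device.index})
load_s = time.perf_counter() - t0

inputs = tokenizer("Add your prompt here.", return_tensors="pt").to(device)
new_tokens = 64

# Warm up once, then time a fixed-length generation and derive a rough per-token figure.
model.generate(**inputs, max_new_tokens=8, do_sample=False)
torch.cuda.synchronize(device)
t0 = time.perf_counter()
model.generate(**inputs, min_new_tokens=new_tokens, max_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize(device)
gen_s = time.perf_counter() - t0

print(f"load: {load_s:.2f}s, generate {new_tokens} tokens: {gen_s:.2f}s, "
      f"per token: {gen_s / new_tokens * 1000:.1f}ms")
```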

## More Information

- **Original Model**: For details about the original model's architecture, training dataset, and performance, please visit the CohereForAI [aya-expanse-8b model card](https://huggingface.co/CohereForAI/aya-expanse-8b).
- **Support or inquiries**: If you run into any issues or have questions about the quantized model, feel free to reach me via email: `[email protected]`. I'll be happy to help!