---
base_model: FreedomIntelligence/AceGPT-13B-chat
inference: false
license: llama2
model_creator: FreedomIntelligence
model_name: AceGPT 13B chat
model_type: llama2
quantized_by: MohamedRashad
datasets:
- FreedomIntelligence/Arabic-Vicuna-80
- FreedomIntelligence/Arabic-AlpacaEval
- FreedomIntelligence/MMLU_Arabic
- FreedomIntelligence/EXAMs
- FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment
language:
- en
- ar
library_name: transformers
---
<center>
<img src="https://i.pinimg.com/564x/b1/6b/fd/b16bfd356bb55de1b1b911a4a04fb9a6.jpg">
</center>

# AceGPT 13B Chat - AWQ
- Model creator: [FreedomIntelligence](https://huggingface.co/FreedomIntelligence)
- Original model: [AceGPT 13B Chat](https://huggingface.co/FreedomIntelligence/AceGPT-13B-chat)

<!-- description start -->
## Description

This repo contains AWQ model files for [FreedomIntelligence's AceGPT 13B Chat](https://huggingface.co/FreedomIntelligence/AceGPT-13B-chat).

To make Arabic LLMs available to users with consumer-grade GPUs, I have quantized two important models:
- [AceGPT 13B Chat AWQ](https://huggingface.co/MohamedRashad/AceGPT-13B-chat-AWQ) **(you are here)**
- [AceGPT 7B Chat AWQ](https://huggingface.co/MohamedRashad/AceGPT-7B-chat-AWQ)

### About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

It is supported by:

- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - Llama and Mistral models only
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [Transformers](https://huggingface.co/docs/transformers) version 4.35.0 and later, from any code or client that supports Transformers
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code

<!-- description end -->
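
As a rough illustration of why 4-bit weights matter on consumer GPUs, the sketch below estimates weight storage for a model with a nominal 13 billion parameters in FP16 versus 4-bit AWQ. It assumes that every weight is quantized in groups of 128, with one FP16 scale and one 4-bit zero point stored per group. In practice the embeddings and some layers stay in FP16, and the KV cache and activations need extra memory, so treat the result as a lower bound and not as a measurement.

```python
# Back-of-the-envelope weight-memory estimate (assumptions, not measurements).

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


def awq_bits_per_weight(w_bit: int = 4, group_size: int = 128,
                        scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Effective bits per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size


N_PARAMS = 13e9   # nominal parameter count, assumed for illustration
GIB = 1024 ** 3

fp16_gib = weight_bytes(N_PARAMS, 16) / GIB
awq_gib = weight_bytes(N_PARAMS, awq_bits_per_weight()) / GIB

print(f"FP16 weights:      ~{fp16_gib:.1f} GiB")
print(f"4-bit AWQ weights: ~{awq_gib:.1f} GiB")
```

The per-group metadata adds only a fraction of a bit per weight, which is why a group size of 128 is a common trade-off between accuracy and size.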

<!-- prompt-template start -->
## Prompt template

```
[INST] <<SYS>>\nأنت مساعد مفيد ومحترم وصادق. أجب دائما بأكبر قدر ممكن من المساعدة بينما تكون آمنا. يجب ألا تتضمن إجاباتك أي محتوى ضار أو غير أخلاقي أو عنصري أو جنسي أو سام أو خطير أو غير قانوني. يرجى التأكد من أن ردودك غير متحيزة اجتماعيا وإيجابية بطبيعتها.\n\nإذا كان السؤال لا معنى له أو لم يكن متماسكا من الناحية الواقعية، اشرح السبب بدلا من الإجابة على شيء غير صحيح. إذا كنت لا تعرف إجابة سؤال ما، فيرجى عدم مشاركة معلومات خاطئة.\n<</SYS>>\n\n
[INST] {prompt} [/INST]
```
<!-- prompt-template end -->
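
The template can be filled programmatically. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, not part of any library, and it reproduces the layout used in this card's Transformers example, including the second `[INST]` tag. The short system prompt in the usage lines is only the first sentence of the full one, used here to keep the example compact.

```python
def build_prompt(user_message: str, system_prompt: str) -> str:
    """Fill the prompt template from this card (hypothetical helper).

    The newline layout mirrors the f-string in the Transformers example:
    system block, blank lines, then the user turn wrapped in [INST] tags.
    """
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n\n"
        f"[INST] {user_message} [/INST]\n"
    )


# Usage: a shortened system prompt and the sample question from this card.
system = "أنت مساعد مفيد ومحترم وصادق."
question = "ما أجمل بيت شعر فى اللغة العربية ؟"
print(build_prompt(question, system))
```

Keeping the template in one function makes it easier to adjust if the original model's documentation specifies a different layout.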

<!-- README_AWQ.md-use-from-python start -->
## Inference from Python code using Transformers

### Install the necessary packages

- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.35.0 or later.
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.6 or later.

```shell
pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"
```

Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and wish to continue using PyTorch 2.0.1, run this command instead (the wheel targets Python 3.10 on Linux x86_64):

```shell
pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
```

If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
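
After installing, it can help to confirm that the environment meets the minimum versions listed above. The following is a small sketch that uses only the standard library; `parse_version` and `check_requirements` are hypothetical helper names, and the version parsing is deliberately simple (it ignores pre-release ordering and local suffixes such as `+cu118`).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated in this card.
MINIMUMS = {"transformers": "4.35.0", "autoawq": "0.1.6"}


def parse_version(v: str) -> tuple:
    """Turn '0.1.6+cu118' into (0, 1, 6); suffixes are ignored."""
    numbers = []
    for piece in v.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad '4.35' to (4, 35, 0)
        numbers.append(0)
    return tuple(numbers)


def check_requirements(minimums=MINIMUMS) -> dict:
    """Report, per package, the installed version and whether it suffices."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```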

### Transformers example code (requires Transformers 4.35.0 or later)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "MohamedRashad/AceGPT-13B-chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    use_flash_attention_2=True,  # remove if you have problems with flash attention 2
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# "What is the most beautiful line of poetry in the Arabic language?"
prompt = "ما أجمل بيت شعر فى اللغة العربية ؟"
prompt_template=f'''[INST] <<SYS>>\nأنت مساعد مفيد ومحترم وصادق. أجب دائما بأكبر قدر ممكن من المساعدة بينما تكون آمنا. يجب ألا تتضمن إجاباتك أي محتوى ضار أو غير أخلاقي أو عنصري أو جنسي أو سام أو خطير أو غير قانوني. يرجى التأكد من أن ردودك غير متحيزة اجتماعيا وإيجابية بطبيعتها.\n\nإذا كان السؤال لا معنى له أو لم يكن متماسكا من الناحية الواقعية، اشرح السبب بدلا من الإجابة على شيء غير صحيح. إذا كنت لا تعرف إجابة سؤال ما، فيرجى عدم مشاركة معلومات خاطئة.\n<</SYS>>\n\n
[INST] {prompt} [/INST]
'''

# Convert prompt to tokens and move them to the model's device
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.to(model.device)

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
```
<!-- README_AWQ.md-use-from-python end -->
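
Because non-streamed generation returns the prompt tokens followed by the new tokens, you may want to drop the prompt before decoding. The sketch below shows the slicing on plain lists so that it runs without a GPU; `completion_ids` is a hypothetical helper, and the commented lines show how it would apply to the `tokens`, `generation_output`, and `tokenizer` names from the example in this card.

```python
def completion_ids(output_ids, prompt_length: int):
    """Return only the ids generated after the prompt.

    Works on any sliceable sequence, including a 1-D tensor of token ids.
    """
    return output_ids[prompt_length:]


# Stand-in data: three prompt ids followed by two generated ids.
prompt_ids = [101, 102, 103]
full_output = prompt_ids + [7, 8]
print(completion_ids(full_output, len(prompt_ids)))

# With the Transformers example, the equivalent would be:
# new_ids = completion_ids(generation_output[0], tokens.shape[1])
# print(tokenizer.decode(new_ids, skip_special_tokens=True))
```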

<!-- README_AWQ.md-provided-files start -->
## How the AWQ quantization was done
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "FreedomIntelligence/AceGPT-13B-chat"
quant_path = "AceGPT-13B-chat-AWQ"
# 4-bit weights, groups of 128 weights sharing a scale and zero point, GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
load_config = {
    "low_cpu_mem_usage": True,
    "device_map": "auto",
    "trust_remote_code": True,
}

# Load the original (unquantized) model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, **load_config)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(quant_path)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Push to hub (requires being logged in to the Hugging Face Hub)
model.push_to_hub(quant_path)
tokenizer.push_to_hub(quant_path)
```
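
To make the `quant_config` values more concrete, here is a simplified, pure-Python sketch of group-wise 4-bit quantization with a zero point, which is the storage format implied by `w_bit=4`, `q_group_size=128`, and `zero_point=True`. It is not AutoAWQ's implementation: AWQ additionally searches for activation-aware scaling factors before rounding, and that step is omitted here. The function names and the random test data are assumptions made for illustration.

```python
import random


def quantize_group(weights, w_bit=4):
    """Map one group of floats to integers in [0, 2**w_bit - 1]."""
    max_int = 2 ** w_bit - 1
    lo, hi = min(weights), max(weights)
    scale = max(hi - lo, 1e-5) / max_int             # step between levels
    zero = min(max(-round(lo / scale), 0), max_int)  # integer that maps to 0.0
    q = [min(max(round(w / scale) + zero, 0), max_int) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Recover approximate floats from the stored integers."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, group_size=128, w_bit=4):
    """Quantize a row of weights in independent groups of `group_size`."""
    return [quantize_group(row[i:i + group_size], w_bit)
            for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # made-up weights
groups = quantize_row(row)
restored = [x for q, s, z in groups for x in dequantize_group(q, s, z)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Each group stores its own scale and zero point, so an outlier in one group does not widen the quantization step for the rest of the row.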

<!-- README_AWQ.md-provided-files end -->