add details to generate the model

Signed-off-by: wenhuach <[email protected]>

- README.md +63 -11
- config.json +1 -1

README.md
CHANGED
@@ -4,31 +4,32 @@ datasets:
base_model:
- deepseek-ai/DeepSeek-V3

---

## Model Details

This model is an int4 model with group_size 128 and symmetric quantization of [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3), generated by the [intel/auto-round](https://github.com/intel/auto-round) algorithm.
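
For intuition only, here is a minimal, self-contained sketch of what symmetric int4 quantization with group_size 128 means for a weight tensor: every 128 consecutive weights share one scale and are rounded to integer levels in [-8, 7]. This is not AutoRound's implementation (AutoRound tunes the rounding via signed gradient descent, see the citation at the bottom); it only illustrates the data layout.

```python
# Illustrative toy only: plain round-to-nearest symmetric int4 with per-group scales.
import torch

def fake_quant_sym_int4(weight: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    out_features, in_features = weight.shape
    grouped = weight.reshape(out_features, in_features // group_size, group_size)
    scale = grouped.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0  # one scale per group
    q = torch.round(grouped / scale).clamp_(-8, 7)                          # int4 levels
    return (q * scale).reshape(out_features, in_features)                   # dequantized view

w = torch.randn(8, 256)
print((w - fake_quant_sym_int4(w)).abs().max())  # rounding error of the toy scheme
```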

**On CUDA devices, this model is prone to overflow because the INT4 kernel uses the FP16 computation dtype. Additionally, loading the model in Transformers can be quite slow. Consider using an alternative serving framework capable of running INT4 models with the BF16 computation dtype.**
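
One possible route, which we have not tested: serving engines such as vLLM expose a dtype option that controls the compute/activation dtype. Whether a given release supports the DeepSeek-V3 architecture and this int4 format must be checked separately; the path and parallelism below are placeholders.

```python
# Hypothetical, untested sketch; model path, tensor parallelism and format support are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/DeepSeek-V3-int4",  # placeholder path to this quantized checkpoint
    dtype="bfloat16",                  # request BF16 activations instead of FP16
    tensor_parallel_size=8,
    trust_remote_code=True,
)
out = llm.generate(["Please give a brief introduction of DeepSeek company."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```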

Due to limited GPU resources, we have only tested a few prompts on a CPU backend with intel-extension-for-transformers. If this model does not meet your performance expectations, you may explore another quantized model in AWQ format, generated via AutoRound with different hyperparameters; this alternative model will be uploaded soon.

Please follow the license of the original model.

## How To Use

### INT4 Inference on CPU with ITREX (Recommended)

**pip3 install auto-round** (this installs both intel-extension-for-pytorch and intel-extension-for-transformers). On an Intel CPU it prioritizes intel-extension-for-pytorch; on other CPUs it prioritizes intel-extension-for-transformers.

**To make sure qbits from intel-extension-for-transformers is used, please uninstall intel-extension-for-pytorch.**

intel-extension-for-transformers: faster repacking, slower inference, higher accuracy

intel-extension-for-pytorch: much slower repacking, faster inference, lower accuracy
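
Spelled out as commands (a sketch of the setup described above; adjust to your environment):

```bash
pip3 install auto-round
# Only if you want the ITREX/qbits path: force it by removing intel-extension-for-pytorch.
pip3 uninstall -y intel-extension-for-pytorch
```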

```python
from auto_round import AutoRoundConfig  ## must import for the auto-round format
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

@@ -160,10 +161,11 @@ prompt = "There is a girl who likes adventure,"
prompt = "Please give a brief introduction of DeepSeek company."
##INT4:
"""DeepSeek Artificial Intelligence Co., Ltd. (referred to as "DeepSeek" or "深度求索"), founded in 2023, is a Chinese company dedicated to making AGI a reality"""
```

### INT4 Inference on CUDA (not tested; may need 8x80GB GPUs)

An INT4 kernel with the BF16 computing dtype is required.

````python
from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -211,9 +213,59 @@ we have no enough resource to evaluate the model

### Generate the model

**5x80GB GPUs are needed (this could be optimized); 1.4 TB of CPU memory is needed.**

We discovered that the inputs and outputs of certain layers in this model are very large and even exceed the FP16 range when tested with a few prompts. It is recommended to exclude these layers from quantization, particularly 'down_proj' in layer 60, and to run them in BF16 precision instead. However, we have not done this in this int4 model, since on CPU the compute dtype for int4 is BF16 or FP32.

```
model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
```
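
If you regenerate the model and want to apply this recommendation, a sketch along the following lines should work. We did not use it for this checkpoint; the `layer_config` argument follows recent auto-round documentation and may differ between versions, and the layer names are taken from the log above.

```python
# Sketch only (not what produced this checkpoint): keep the overflow-prone layer-60
# down_proj weights in 16 bits instead of int4.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "/models/DeepSeek-V3-bf16/"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

layer_config = {
    "model.layers.60.mlp.experts.150.down_proj": {"bits": 16},
    "model.layers.60.mlp.experts.231.down_proj": {"bits": 16},
    "model.layers.60.mlp.shared_experts.down_proj": {"bits": 16},
}

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True,
                      iters=200, nsamples=512, seqlen=512, batch_size=8,
                      low_gpu_mem_usage=True, layer_config=layer_config)
autoround.quantize()
autoround.save_quantized("tmp_autoround_mixed", format="auto_gptq")
```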

**1 add metadata to the bf16 model** https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16

```python
import safetensors
from safetensors.torch import save_file

# Re-save every shard with the 'format' metadata key that Transformers' safetensors
# loader expects; the tensors themselves are left unchanged.
for i in range(1, 164):
    idx_str = "0" * (5 - len(str(i))) + str(i)
    safetensors_path = f"model-{idx_str}-of-000163.safetensors"
    print(safetensors_path)
    tensors = dict()
    with safetensors.safe_open(safetensors_path, framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    save_file(tensors, safetensors_path, metadata={'format': 'pt'})
```
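
A quick optional check that the metadata was written (a small sketch; run it in the same directory as the shards):

```python
import safetensors

# The 'format' key should now be present in the shard header.
with safetensors.safe_open("model-00001-of-000163.safetensors", framework="pt") as f:
    print(f.metadata())  # expect {'format': 'pt'}
```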

**2 replace modeling_deepseek.py with the following file**, which basically aligns devices and removes torch.no_grad, as we need to do some tuning in AutoRound.

https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
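
As a purely hypothetical illustration of the two kinds of edits in that file (none of this code is taken from modeling_deepseek.py):

```python
import torch
from torch import nn

# 1) No @torch.no_grad() decorator: AutoRound back-propagates through the layers it
#    tunes, and no_grad would block those gradients.
# 2) Explicit device alignment: with the model sharded over several GPUs, inputs are
#    moved to the device of the module that consumes them instead of assuming a
#    single device.
def run_expert(hidden_states: torch.Tensor, expert: nn.Linear) -> torch.Tensor:
    hidden_states = hidden_states.to(expert.weight.device)
    return expert(hidden_states)
```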

**3 tuning**

```bash
git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
```

```bash
python3 -m auto_round --model "/models/DeepSeek-V3-bf16/" --group_size 128 --format "auto_gptq" --iters 200 --devices 0,1,2,3,4 --nsamples 512 --batch_size 8 --seqlen 512 --low_gpu_mem_usage --output_dir "tmp_autoround" --disable_eval 2>&1 | tee -a seekv3.txt
```

## Ethical Considerations and Limitations

@@ -237,4 +289,4 @@ The license on this model does not constitute legal advice. We are not responsib

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)

config.json
CHANGED
@@ -79,7 +79,7 @@
 "tie_word_embeddings": false,
 "topk_group": 4,
 "topk_method": "noaux_tc",
-"torch_dtype": "
+"torch_dtype": "bfloat16",
 "transformers_version": "4.47.0",
 "use_cache": true,
 "v_head_dim": 128,
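
For context on why this field matters: `torch_dtype` in config.json is the dtype that `from_pretrained(..., torch_dtype="auto")` resolves to, which matches the BF16 computation dtype recommended above. A small check, with a placeholder local path:

```python
from transformers import AutoConfig

# Placeholder path to a local copy of this repo.
cfg = AutoConfig.from_pretrained("/models/DeepSeek-V3-int4", trust_remote_code=True)
print(cfg.torch_dtype)  # expected: torch.bfloat16 after this change
```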