dolphin-2.9.4-llama3.1-8b ?
Would it be possible to quantize cognitivecomputations/dolphin-2.9.4-llama3.1-8b? I wanted to do that, but I am getting an error, perhaps due to my setup. I created an issue for that.
Sure, I will do it now.
Thank you.
I can't seem to do any AWQ quants anymore.
I also get the same error with the llm-quantkit Python package.
pip install --upgrade llm-quantkit[cuda]
quantkit awq cognitivecomputations/dolphin-2.9.4-llama3.1-8b -out dolphin-2.9.4-llama3.1-8b-AWQ
Even the simplest example is not working.
It seems this model needs transformers>=4.44.0.dev0, and the AutoAWQ library wants 4.35 or something like that.
I will try downgrading the transformers version to see if it can work.
OK, I've changed the autoawq-kernel and am rebuilding the wheels for it, so maybe I can get this working after all.
Basically, both of them (the awq and kernels repos) lock the PyTorch version specifically to 2.3.1, and we need 2.4.0 to work with the new transformers.
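The version conflict described here can be checked before starting a long quantization run. This is a minimal sketch, assuming the minimums mentioned in this thread (transformers 4.44.0 and torch 2.4.0). The helper names `parse_version` and `missing_requirements` are made up for illustration, and only the standard library is used.

```python
import re
from importlib import metadata

# Minimum versions taken from the discussion above; adjust as needed.
REQUIRED = {"transformers": "4.44.0", "torch": "2.4.0"}


def parse_version(text):
    """Turn '2.4.0+cu121' or '4.44.0.dev0' into a comparable tuple like (2, 4, 0).

    Simplification: local tags (+cu121) and dev/rc suffixes are ignored.
    """
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def missing_requirements(required, installed=None):
    """Return {package: (found, wanted)} for packages that are absent or too old.

    `installed` maps package name to version string; when omitted, the
    versions are read from the current environment.
    """
    problems = {}
    for name, wanted in required.items():
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None
        if found is None or parse_version(found) < parse_version(wanted):
            problems[name] = (found, wanted)
    return problems


if __name__ == "__main__":
    for name, (found, wanted) in missing_requirements(REQUIRED).items():
        print(f"{name}: found {found}, need >= {wanted}")
```

Running it inside the quantization virtualenv prints one line per package that is missing or older than the stated minimum, and nothing if the environment already satisfies both.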
Completed: https://huggingface.co/solidrust/dolphin-2.9.4-llama3.1-8b-AWQ
- Thank you for your encouragement; this problem pissed me off and I had given up on it.
You are the best! Thank you.
These Minitron models that distilled Llama 3.1 8B into 4B look to be working. One fine-tune is: Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05
Getting a weird error with that one:
in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 3072]) in "weight" (which has shape torch.Size([768, 3072])), this looks incorrect.
I will need to debug it. Here is an example of how to reproduce the error (feeling too lazy to fix my repo, so using quantkit today):
transformers==4.42.3
llm-quantkit[cuda]
import json
import os

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoConfig
from huggingface_hub import snapshot_download, create_repo, upload_folder

model_path = 'Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05'
quant_path = 'Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
quanter = 'suparious'
quant_org = 'solidrust'

# Download the model to a local directory
local_model_path = snapshot_download(model_path)

# Load the model configuration file from the local directory
config_file = os.path.join(local_model_path, "config.json")
with open(config_file, "r") as f:
    config_dict = json.load(f)

# Modify the rope_scaling dictionary to include only the required fields
if 'rope_scaling' in config_dict:
    rope_scaling = config_dict['rope_scaling']
    if 'type' in rope_scaling and 'factor' in rope_scaling:
        # Ensure the type is one of the valid values
        if rope_scaling['type'] not in ['linear', 'dynamic']:
            rope_scaling['type'] = 'linear'  # Set to a default valid value
        # Ensure the factor is a float greater than 1
        if not isinstance(rope_scaling['factor'], float) or rope_scaling['factor'] <= 1.0:
            rope_scaling['factor'] = 2.0  # Set to a default valid value
        config_dict['rope_scaling'] = {'type': rope_scaling['type'], 'factor': rope_scaling['factor']}
    else:
        # If 'type' or 'factor' is missing, set default values
        config_dict['rope_scaling'] = {'type': 'linear', 'factor': 2.0}

# Save the modified configuration file
with open(config_file, "w") as f:
    json.dump(config_dict, f, indent=2)

# Load the model with the modified configuration
model = AutoAWQForCausalLM.from_pretrained(
    local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')

# Upload the quantized model to the Hugging Face Hub
create_repo(repo_id=f"{quant_org}/{quant_path}")
upload_folder(
    folder_path=quant_path,
    repo_id=f"{quant_org}/{quant_path}",
)
print(f'Quantized model uploaded to "{quant_org}/{quant_path}"')
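To see what the rope_scaling step in the script above does to a Llama 3.1 style config, here is a minimal in-memory sketch. The `llama3_scaling` values are an assumption based on the published Llama 3.1 config files, not something quoted in this thread, and `sanitize` is a hypothetical helper that mirrors the script's branching.

```python
# Assumed shape of the `rope_scaling` block in a Llama 3.1 config.json.
llama3_scaling = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}


def sanitize(rope_scaling):
    """Mirror the script above: keep only `type` and `factor`, forcing valid values."""
    if "type" in rope_scaling and "factor" in rope_scaling:
        scaling_type = rope_scaling["type"]
        if scaling_type not in ("linear", "dynamic"):
            scaling_type = "linear"
        factor = rope_scaling["factor"]
        if not isinstance(factor, float) or factor <= 1.0:
            factor = 2.0
        return {"type": scaling_type, "factor": factor}
    # 'type' or 'factor' is missing: fall back to the script's defaults
    return {"type": "linear", "factor": 2.0}


sanitized = sanitize(llama3_scaling)
dropped = sorted(set(llama3_scaling) - set(sanitized))
print(sanitized)
print("dropped keys:", dropped)
```

Because the assumed config uses `rope_type` rather than `type`, the fallback branch is taken and the original scaling is replaced by linear scaling with a factor of 2.0. If the real config looks like this, the model is calibrated with different position frequencies than it was trained with, which may affect the quality of the quant.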
I had locked the transformers version to work around a Llama 3.1 bug, but maybe the latest library is working / required for this model?
Nope, latest transformers doesn't stop AWQ from choking on these rope scaling methods that people are using to extend LLM context windows.
python quantize.py
Fetching 15 files: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 156503.88it/s]
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/ubuntu/smd/quantize.py", line 42, in <module>
model = AutoAWQForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/awq/models/auto.py", line 71, in from_pretrained
return AWQ_CAUSAL_LM_MODEL_MAP[model_type].from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/awq/models/base.py", line 380, in from_pretrained
model = target_cls.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3960, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4434, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 3072]) in "weight" (which has shape torch.Size([768, 3072])), this looks incorrect.
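One reading that is consistent with the shapes in this error: the checkpoint was saved with an explicit `head_dim`, while the loading code derives it as `hidden_size // num_attention_heads`. Below is a small sketch of that arithmetic, assuming the Minitron 4B config values shown (they are not stated in this thread and should be checked against the model's `config.json`). `kv_proj_rows` is a hypothetical helper.

```python
# Assumed config values for the Minitron 4B model; verify against config.json.
hidden_size = 3072
num_attention_heads = 32
num_key_value_heads = 8
explicit_head_dim = 128  # assumed `head_dim` entry in config.json


def kv_proj_rows(num_kv_heads, head_dim):
    """Number of output rows of a k_proj / v_proj weight matrix."""
    return num_kv_heads * head_dim


# What code without `head_dim` support would assume.
derived_head_dim = hidden_size // num_attention_heads

checkpoint_shape = (kv_proj_rows(num_key_value_heads, explicit_head_dim), hidden_size)
expected_shape = (kv_proj_rows(num_key_value_heads, derived_head_dim), hidden_size)

print("checkpoint:", checkpoint_shape)    # (1024, 3072)
print("model expects:", expected_shape)   # (768, 3072)
```

If that reading is right, rewriting `rope_scaling` would not change these shapes. The mismatch would only go away with a transformers release whose Llama implementation reads `head_dim` from the config.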
I stopped doing AWQ quants as I would need to really learn AutoAWQ library to keep up with all the weird stuff that the community does to the foundational models.
I'll try my best to debug this one.
OK, I managed to handle this in such a shitty and miserable way....
import transformers

def override_rope_embeddings():
    # rotate_half is needed by the replacement function below
    from transformers.models.llama.modeling_llama import rotate_half

    def custom_apply_rotary_pos_emb(q, k, cos, sin):
        # Truncate everything to the smallest last dimension so the shapes line up
        min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
        q = q[..., :min_dim]
        k = k[..., :min_dim]
        cos = cos[..., :min_dim]
        sin = sin[..., :min_dim]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

    # Override the existing function
    transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb
This is so stupid....
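For reference, here is what a rotary embedding does to a single head vector, written in plain Python so the shapes are easy to follow. This is a simplified sketch, not the transformers implementation: `rope_angles`, `rotate_half`, `apply_rope` and `norm` are illustrative names, and the base of 10000 is an assumed default.

```python
import math


def rope_angles(position, head_dim, base=10000.0):
    """Per-dimension rotation angles for one token position (length head_dim)."""
    half = head_dim // 2
    freqs = [position / (base ** (2 * i / head_dim)) for i in range(half)]
    return freqs + freqs  # the same angles are used for both halves


def rotate_half(x):
    """Swap the two halves of x and negate the second one: [a, b] -> [-b, a]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    """Rotate one head vector x by the angles for `position`."""
    angles = rope_angles(position, len(x), base)
    rotated = rotate_half(x)
    return [
        xi * math.cos(a) + ri * math.sin(a)
        for xi, ri, a in zip(x, rotated, angles)
    ]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(t * t for t in v))


q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rope(q, position=5)
print(norm(q), norm(q_rot))  # a pure rotation keeps the length unchanged
```

The cos and sin tables are built for one specific head width. Slicing `q` and `k` down to a narrower width, as the override above does, drops dimensions instead of rotating them, which could be one reason the resulting quant behaves badly.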
The attention layers in this model are transitioning from computing the RoPE embeddings internally through position_ids (2D tensor with the indexes of the tokens), to using externally computed position_embeddings (Tuple of tensors, containing cos and sin). In transformers v4.45 position_ids will be removed and position_embeddings will be mandatory.
It is quantizing now...
OK, @vaclavkosar - Thank-you for the Saturday morning algebra challenge.
solidrust/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ is ready. Not sure if it will work properly or not; I did a lot of messing around this time.
Trying now!
I have to say that solidrust/Meta-Llama-3.1-8B-Instruct-abliterated-AWQ was the best model so far. Maybe because the Llama fine-tuning is exceptional and abliteration just adds the free-range talk back in.
It failed with:
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
360 if value is not None:
361 if old_value.shape != value.shape:
--> 362 raise ValueError(
363 f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this look incorrect.'
364 )
ValueError: Trying to set a tensor of shape torch.Size([3072, 96]) in "qweight" (which has shape torch.Size([3072, 128])), this look incorrect.
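The two widths in this error are consistent with how AWQ's GEMM layout packs weights: 4-bit values are stored eight to a 32-bit integer, so a packed `qweight` has `out_features // 8` columns. Here is a sketch of the arithmetic, assuming a key/value projection with 8 key-value heads (an assumption about this model, not stated in the thread). `packed_columns` is a hypothetical helper.

```python
W_BIT = 4
PACK_FACTOR = 32 // W_BIT  # eight 4-bit values per 32-bit integer


def packed_columns(out_features, pack_factor=PACK_FACTOR):
    """Columns of a packed qweight for a layer with `out_features` outputs."""
    return out_features // pack_factor


num_key_value_heads = 8  # assumed for this model

for head_dim in (96, 128):
    out_features = num_key_value_heads * head_dim
    print(head_dim, out_features, packed_columns(out_features))
```

Under these assumptions, a width of 96 corresponds to a projection quantized with a head dimension of 96, and 128 to a loader that expects a head dimension of 128. That would mean the saved quant and the loading code disagree about the head size, rather than about the quant config itself.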
I think quant config needs to be added like: https://huggingface.co/solidrust/Starling-LM-7B-beta-AWQ/blob/main/quant_config.json
Yeah, these models have this issue.
Here is the script that I used, but it is not great as the quantized model seems shit after...
import json
import os

import torch
import transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoConfig
from huggingface_hub import snapshot_download, create_repo, upload_folder

model_path = 'Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05'
quant_path = 'Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
quant_org = 'solidrust'

def download_model(model_path):
    try:
        return snapshot_download(model_path)
    except Exception as e:
        print(f"Error downloading model: {e}")
        raise

def sanitize_rope_scaling(config_dict):
    if 'rope_scaling' in config_dict:
        rope_scaling = config_dict['rope_scaling']
        if isinstance(rope_scaling, dict):
            valid_keys = ['type', 'factor']
            rope_scaling = {k: v for k, v in rope_scaling.items() if k in valid_keys}
            if rope_scaling.get('type') not in ['linear', 'dynamic']:
                print("Invalid 'type' in rope_scaling. Setting to 'linear'.")
                rope_scaling['type'] = 'linear'
            if not isinstance(rope_scaling.get('factor'), float) or rope_scaling['factor'] <= 1.0:
                print("Invalid 'factor' in rope_scaling. Setting to 2.0.")
                rope_scaling['factor'] = 2.0
            # Write the filtered dict back; without this the changes are lost
            config_dict['rope_scaling'] = rope_scaling
        else:
            print("Unexpected format for 'rope_scaling'. Removing it.")
            del config_dict['rope_scaling']
    return config_dict

def override_rope_embeddings():
    from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, rotate_half

    def custom_apply_rotary_pos_emb(q, k, cos, sin):
        # Truncate the tensors to the smallest last dimension so they match
        min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
        q = q[..., :min_dim]
        k = k[..., :min_dim]
        cos = cos[..., :min_dim]
        sin = sin[..., :min_dim]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

    # Override the function within transformers
    transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb

def load_model(local_model_path):
    try:
        config_file = os.path.join(local_model_path, "config.json")
        with open(config_file, "r") as f:
            config_dict = json.load(f)
        config_dict = sanitize_rope_scaling(config_dict)
        with open(config_file, "w") as f:
            json.dump(config_dict, f, indent=2)
        # Load model with to_empty to avoid copying from meta tensors
        model = AutoAWQForCausalLM.from_pretrained(
            local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}, ignore_mismatched_sizes=True
        )
        # NOTE: to_empty() allocates uninitialized storage on the target device,
        # so weights loaded above are not carried over. This may be part of why
        # the resulting quant looks bad.
        model.to_empty(device=torch.device("cuda"))
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

def quantize_model(model, tokenizer):
    try:
        model.quantize(tokenizer, quant_config=quant_config)
    except ValueError as ve:
        print(f"Quantization Error: {ve}")
        raise

def upload_to_hf(quant_path, quant_org):
    try:
        create_repo(repo_id=f"{quant_org}/{quant_path}", exist_ok=True)
        upload_folder(
            folder_path=quant_path,
            repo_id=f"{quant_org}/{quant_path}",
        )
        print(f'Quantized model uploaded to "{quant_org}/{quant_path}"')
    except Exception as e:
        print(f"Error uploading to Hugging Face: {e}")
        raise

def main():
    local_model_path = download_model(model_path)
    # Override RoPE embedding calculations
    override_rope_embeddings()
    model = load_model(local_model_path)
    tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)
    quantize_model(model, tokenizer)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    print(f'Model is quantized and saved at "{quant_path}"')
    upload_to_hf(quant_path, quant_org)

if __name__ == "__main__":
    main()
I was told by Casper Hansen that the quant_config.json is no longer supported, as he is adding this JSON block into the model's native config.json. I don't like this approach and prefer to use the quant_config.json in order to avoid messing with the native model's config JSON. Also, some tools and apps still look for this file. So my quant process usually adds in this file, but today we used the quantkit project, which seems to have been abandoned, but simplifies what I wanted to do with my srt-model-quantizing repo. I might pick up that project and deprecate my version.
So that is the reason why my AWQ quants typically have the quant_config.json, despite @casperhansen's advice.
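Adding the standalone file is a small step after saving the quantized model. This is a minimal sketch, assuming the same `quant_config` values used in the scripts in this thread. `write_quant_config` is a hypothetical helper, not part of AutoAWQ.

```python
import json
from pathlib import Path

# Same settings as the quantization scripts in this thread.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def write_quant_config(quant_dir, config):
    """Write `config` as quant_config.json inside `quant_dir` and return the path."""
    target = Path(quant_dir) / "quant_config.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=2) + "\n")
    return target
```

Calling `write_quant_config(quant_path, quant_config)` after `model.save_quantized(quant_path)` leaves the native config.json untouched and gives older tools the file they look for.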
Also super pissed off about AutoAWQ requiring torch==2.3.1, which is such a shitty torch version, with most of the issues fixed in 2.4.
I always have to build my own AutoAWQ, and the kernels, to support the latest transformers and torch. Such a useless waste of my time.
I made a fork of AutoAWQ that just always enjoys the latest versions of everything, but sometimes it is unstable, so that is the current state of this issue.
I see. Well, these things probably will get resolved later by the package authors.
This particular fine-tune is probably not that good. But in general, these Llama Minitron model fine-tunes will probably be used, since they are efficient distillations.
Hey, can you share your autoawq repository?
It is currently https://github.com/SolidRusT/srt-model-quantizing but I haven't worked on it in a while.
This PyPI package may be easier to use: https://pypi.org/project/llm-quantkit/
Thanks.
In general, I must tell you, pip dependencies are horrible. I just wanted to run an old, unrelated notebook, and it wouldn't start even when I pinned the original versions of the main packages. I would have to pin down every single one, for example with a Poetry lock file.
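Pinning every package in a working environment can be done from Python itself. A minimal sketch using only the standard library; `freeze_environment` is an illustrative name, and `pip freeze` produces a similar list.

```python
from importlib import metadata


def freeze_environment():
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that lack metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(freeze_environment()))
```

Saving that output next to a notebook while it still runs gives a full set of pins to reinstall from later, instead of only the main packages.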
I will set up a Docker image with all the Python stuff sorted out, to help users with this exact issue. However, I was able to get the https://github.com/SolidRusT/srt-model-quantizing/awq repo to a stable state. I literally worked on this all day today.
I still can't quant on my 12GB GPU, the way I did for over 500 AWQ quants. The reason I stopped doing them is that the memory management in AutoAWQ is nonexistent or incomplete, and I haven't figured out how to solve it. I even connected with Casper Hansen on it.
But I can rent an NVIDIA A10G machine from Amazon, and the quant works fine on that 24GB GPU.
I also got the new Llama 3.1 models to quant there, using my awq repo.
That sounds painful... But is it solvable on the Python side? With the new 3.12? That memory thing.
Unfortunately, this seems to be a compounded issue with AutoAWQ, Llama 3.1 rope_scaling hackery, and then GPU VRAM.
There is no way to use Python 3.12 here. I tried for weeks, and there is just too much work to figure out by myself, and I had to take a break.
The problem with people using the rope scaling methods to increase the shitty context limitation of Llama 3.1 (8192 tokens) is that to quantize for AWQ, you need to ensure your tensors are only on a single device (single CPU or single GPU), and there is no conceivable way to distribute them. I really detest this methodology of increasing context windows of models, for this reason.
So now I have my repo exclusively using a single device for tensors, which solves for shitty Llama 3.1 rope scaling, but this now prevents me from using multi-GPU, partial CPU offload and other memory management techniques. And I can't even quant an 8B model on my 12GB GPU, of which I have over 12, and I intended to automate AWQ quants with this hardware investment.
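A rough back-of-the-envelope check shows why the single-device restriction hurts on a 12GB card. This sketch counts only the weights and assumes a round 8 billion parameters. Activations, calibration data and AutoAWQ's own buffers are ignored, so real usage is higher.

```python
GIB = 1024 ** 3


def weights_gib(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


params_8b = 8.0e9  # rough parameter count for an "8B" model (assumption)
fp16 = weights_gib(params_8b, 2)    # 16-bit weights before quantization
awq4 = weights_gib(params_8b, 0.5)  # 4-bit weights, ignoring scales and zeros

for vram_gib in (12, 24):
    print(f"{vram_gib} GiB card: fp16 weights fit = {fp16 < vram_gib}")
print(f"fp16 ~ {fp16:.1f} GiB, 4-bit ~ {awq4:.1f} GiB")
```

With everything forced onto one device, the 16-bit weights alone do not fit in 12 GiB but do fit in 24 GiB, which matches the experience described above. The finished 4-bit model is small enough for the 12GB card; it is the quantization step that is not.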
But now that it is all memory fucked, I have to just release my pipeline and help people make their own AWQ.
I can't do more than 24GB in AWS. I can do multi-GPU, but this is now disabled in AutoAWQ.
I am sincerely considering making a fork of it, and using Claude Sonnet 3.5 to fix AutoAWQ automagically for us.
Maybe tomorrow, I am exhausted from today's refactoring.
I got 98% code coverage and all unit tests passing now in my repo.
Great work!
But... let's say we delete rope scaling from config.json, quantize it, and put it back after quantization: will that be a problem? But I don't know, their rope scaling type is different than usual, isn't it?
Trying this now, seeing this message, but it might still work... we'll see:
The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be removed and `position_embeddings` will be mandatory.
Genius idea, seems to work.
solidrust/Hermes-3-Llama-3.1-8B-lorablated-AWQ
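The "remove it, quantize, put it back" idea can be sketched as a small context manager. This is an illustration of the approach under stated assumptions, not the script that produced the model above: `without_config_key` and `add_config_key` are hypothetical helpers, and the quantization calls are only indicated in comments.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def without_config_key(config_path, key="rope_scaling"):
    """Temporarily remove `key` from a JSON config file, restoring the file on exit."""
    path = Path(config_path)
    original_text = path.read_text()
    config = json.loads(original_text)
    removed = config.pop(key, None)
    path.write_text(json.dumps(config, indent=2))
    try:
        yield removed
    finally:
        path.write_text(original_text)  # put the untouched original back


def add_config_key(config_path, key, value):
    """Write `key` back into a JSON config file, e.g. the quantized model's config."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config[key] = value
    path.write_text(json.dumps(config, indent=2))


# Intended use (the model calls are placeholders for the scripts in this thread):
#
# with without_config_key(f"{local_model_path}/config.json") as rope_scaling:
#     ...load the model, quantize it, save it to quant_path...
# if rope_scaling is not None:
#     add_config_key(f"{quant_path}/config.json", "rope_scaling", rope_scaling)
```

The source config is restored even if quantization fails, and the original `rope_scaling` block is copied into the quantized model's config unchanged, so inference sees the same scaling the base model was trained with.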
let me play more with my script and get it stable.
That's unexpected