OPEA/OLMo-2-1124-13B-Instruct-int4-sym-inc

Model Card Details

This model is an int4 quantization of allenai/OLMo-2-1124-13B-Instruct with group_size 128 and symmetric quantization, generated by intel/auto-round. Load the model with revision 4b5e415 to use the AutoGPTQ format.
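To make "int4, group_size 128, symmetric" concrete, here is a minimal NumPy sketch of the storage format. It uses plain round-to-nearest, which is an assumption made for illustration only: auto-round tunes the rounding with signed gradient descent, and that tuning is not reproduced here. The function names `quantize_sym_int4` and `dequantize` are hypothetical and are not part of auto-round.

```python
import numpy as np

def quantize_sym_int4(weights: np.ndarray, group_size: int = 128):
    """Round-to-nearest symmetric int4 quantization with one scale per group.

    Illustrative only; assumes weights.size is divisible by group_size.
    """
    groups = weights.reshape(-1, group_size)
    # Symmetric quantization has no zero point: each group only stores a scale
    # that maps its largest magnitude onto the int4 limit.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    # Signed int4 covers [-8, 7]; values are kept in int8 here for simplicity.
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Recover approximate float weights from int4 codes and group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale = quantize_sym_int4(w)
w_hat = dequantize(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(q.shape, scale.shape, max_err)
```

Each group of 128 weights shares one scale, so the per-weight reconstruction error is bounded by half of that group's scale.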

Inference on CPU/HPU/CUDA

pip3 install "transformers>=4.47"

HPU: a docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the Gaudi Guide.

from auto_round import AutoHfQuantizer ##must import for auto-round format
import torch
from transformers import AutoModelForCausalLM,AutoTokenizer
quantized_model_dir = "OPEA/OLMo-2-1124-13B-Instruct-int4-sym-inc"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype='auto',
    device_map="auto",
    ##revision="4b5e415", ##AutoGPTQ format
)

##import habana_frameworks.torch.core as htcore ## uncomment it for HPU
##import habana_frameworks.torch.hpu as hthpu ## uncomment it for HPU
##model = model.to(torch.bfloat16).to("hpu") ## uncomment it for HPU

prompt = "There is a girl who likes adventure,"
messages = [
    {"role": "system", "content": "You are OLMo 2, a helpful and harmless AI Assistant built by the Allen Institute for AI."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=200, 
    do_sample=False  ##change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

##prompt = "There is a girl who likes adventure,"
##INT4
"""
That sounds exciting! Adventure can come in many forms. For someone who enjoys the thrill of exploration, here are a few adventure-filled ideas:

1. **Travel to New Places**: Encourage her to explore new cities, countries, or even different cultures. Traveling can be an adventure in itself, offering new experiences and perspectives.

2. **Outdoor Activities**: Engage in outdoor adventures such as hiking, camping, rock climbing, or even kayaking. These activities can provide a sense of freedom and connection with nature.

3. **Learn a New Skill**: Learning something new, like scuba diving, horseback riding, or even skydiving, can be an adventure in itself. It's not just about the activity but also about the journey of learning and mastering something new.

4. **Volunteer Work**: Consider volunteering for adventure-related activities, such as wildlife conservation, archaeological digs, or even helping with outdoor events. This can be both
"""

##BF16 
"""
That sounds exciting! Adventure can come in many forms. It could be exploring new places, trying new activities, or even diving into books and movies about thrilling quests and journeys. What kind of adventure is she interested in?
"""

##prompt = "Which one is larger, 9.11 or 9.8"
## INT4
"""9.11 is larger than 9.8.
"""

## BF16
"""9.11 is larger than 9.8."""

##prompt = "How many r in strawberry."
##INT4
"""There are two 'r's in the word "strawberry."
"""

##BF16
"""There are two 'r's in the word "strawberry."
"""


##prompt = "Once upon a time,"
##INT4
"""There was a curious user who wanted to continue a story. How should the story unfold?"""

##BF16
"""there was a curious user who wanted to explore the vast world of knowledge and storytelling. And I, OLMo 2, was here to assist and guide them on their journey. What would you like to explore today?
"""

Evaluate the model

pip3 install lm-eval==0.4.5

auto-round --eval --model "OPEA/OLMo-2-1124-13B-Instruct-int4-sym-inc" --eval_bs 16  --tasks leaderboard_mmlu_pro,leaderboard_ifeval,lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,gsm8k

Metric	BF16	INT4
avg	0.6557	0.6525
leaderboard_mmlu_pro 5shot	0.3314	0.3264
leaderboard_ifeval	0.6879=(0.7398+0.6359)/2	0.6832=(0.7362+0.6303)/2
lambada_openai	0.7479	0.7559
hellaswag	0.6853	0.6808
winogrande	0.7758	0.7806
piqa	0.8248	0.8177
truthfulqa_mc1	0.4296	0.4247
openbookqa	0.4260	0.4220
boolq	0.7850	0.7532
arc_easy	0.8304	0.8295
arc_challenge	0.5742	0.5776
gsm8k(5shot) strict match	0.7703	0.7779
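As a quick sanity check, the avg row can be recomputed from the per-task scores. The sketch below copies the values from the table and assumes the average is an unweighted mean over the twelve tasks, with leaderboard_ifeval entered as the combined value shown in its row.

```python
# Per-task scores copied from the table above: (BF16, INT4).
scores = {
    "leaderboard_mmlu_pro": (0.3314, 0.3264),
    "leaderboard_ifeval": (0.6879, 0.6832),
    "lambada_openai": (0.7479, 0.7559),
    "hellaswag": (0.6853, 0.6808),
    "winogrande": (0.7758, 0.7806),
    "piqa": (0.8248, 0.8177),
    "truthfulqa_mc1": (0.4296, 0.4247),
    "openbookqa": (0.4260, 0.4220),
    "boolq": (0.7850, 0.7532),
    "arc_easy": (0.8304, 0.8295),
    "arc_challenge": (0.5742, 0.5776),
    "gsm8k": (0.7703, 0.7779),
}

# Unweighted mean over all tasks (an assumption about how avg was computed).
bf16_avg = sum(bf16 for bf16, _ in scores.values()) / len(scores)
int4_avg = sum(int4 for _, int4 in scores.values()) / len(scores)

print(f"BF16 avg: {bf16_avg:.4f}")  # 0.6557
print(f"INT4 avg: {int4_avg:.4f}")  # 0.6525
print(f"INT4 / BF16: {int4_avg / bf16_avg:.4f}")
```

Both recomputed means match the reported avg row, so the table is internally consistent under this assumption.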

Reproduce the model

Here is the sample command to generate the model.

auto-round  \
--model OLMo-2-1124-13B-Instruct \
--device 0 \
--nsamples 512 \
--model_dtype "fp16" \
--iter 1000 \
--disable_eval \
--format 'auto_gptq,auto_round' \
--output_dir "./tmp_autoround"

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

Intel Neural Compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

