For inference:

Load the model and tokenizer (loading an AWQ checkpoint in transformers requires the autoawq package):

import torch
from transformers import AutoTokenizer, AwqConfig, AutoModelForCausalLM

model_id = "vamshigvk/Llama-3.1-8b-Instruct-awq-int4-gemm"

# Fuse attention/MLP modules for faster AWQ inference. With do_fuse=True,
# the total sequence length (prompt + generated tokens) must fit in fuse_max_seq_len.
quantization_config = AwqConfig(bits=4, fuse_max_seq_len=512, do_fuse=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=quantization_config,
)
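
Fused modules preallocate a cache of fuse_max_seq_len tokens, which caps how long a generation can run. If you need longer outputs, a minimal alternative (an assumption, not part of the original card) is to load the same checkpoint without fusing; the AWQ settings stored in the model's config are picked up automatically:

# Sketch: load without fused modules so generation length is not capped
# by fuse_max_seq_len (somewhat slower per token, but no 512-token limit)
model_unfused = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)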

Run inference:

prompt = "what is 2+2?"

# Wrap the prompt in the chat format expected by the Llama 3.1 Instruct tokenizer
prompt_template = [
    {"role": "system", "content": "You are a smart assistant."},
    {"role": "user", "content": prompt}
]


# Tokenize with the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    prompt_template,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate. With fused modules, prompt + new tokens must stay within
# fuse_max_seq_len (512 here), so keep max_new_tokens well below that cap.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.batch_decode(
    output[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)[0]
print(response)
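
To print tokens as they are generated instead of waiting for the full completion, you can pass transformers' TextStreamer to generate. A minimal sketch reusing inputs from above (the streamer is an optional addition, not part of the original card):

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced; skip_prompt avoids
# re-printing the input prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

_ = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    streamer=streamer,
)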