For inference, first load the tokenizer and the quantized model:
```python
import torch
from transformers import AutoTokenizer, AwqConfig, AutoModelForCausalLM

model_id = "vamshigvk/Llama-3.1-8b-Instruct-awq-int4-gemm"

# Fuse the AWQ modules for faster inference. Note that fused kernels
# cap the total sequence length (prompt + generated tokens) at
# fuse_max_seq_len tokens.
quantization_config = AwqConfig(bits=4, fuse_max_seq_len=512, do_fuse=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=quantization_config,
)
```
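Optionally, you can sanity-check that the 4-bit weights actually loaded. A minimal sketch using transformers' `get_memory_footprint()` helper (the ~5-6 GB figure is an approximation for 4-bit Llama-3.1-8B, versus ~16 GB in fp16):

```python
# Optional sanity check: a 4-bit AWQ Llama-3.1-8B should occupy
# roughly 5-6 GB, not the ~16 GB of the fp16 weights.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```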
Then run inference:
```python
# %%time  # Jupyter cell magic for timing; remove outside a notebook

# Build the chat prompt and convert it to tokens
prompt = "what is 2+2?"
prompt_template = [
    {"role": "system", "content": "you are a smart assistant"},
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    prompt_template,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

# Generate output. With do_fuse=True the total sequence length
# (prompt + new tokens) must stay within fuse_max_seq_len (512 here),
# so cap max_new_tokens accordingly.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
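To watch tokens appear as they are produced instead of waiting for the full completion, you can pass transformers' `TextStreamer` to `generate()`. A minimal sketch, reusing the `tokenizer`, `model`, and `inputs` defined above:

```python
from transformers import TextStreamer

# Stream decoded text to stdout as tokens are generated,
# omitting the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=256, do_sample=True, streamer=streamer)
```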