This model has been quantized using GPTQModel with the following configuration:

  • bits: 4
  • group_size: 128
  • desc_act: true
  • static_groups: false
  • sym: true
  • lm_head: false
  • damp_percent: 0.01
  • true_sequential: true
  • model_name_or_path: ""
  • model_file_base_name: "model"
  • quant_method: "gptq"
  • checkpoint_format: "gptq"
  • meta
    • quantizer: "gptqmodel:0.9.9-dev0"
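
To make the settings above concrete, here is a minimal sketch of what bits: 4, group_size: 128 and sym: true mean for the stored weights. It is an illustration only: it uses plain round-to-nearest, not the error-compensating GPTQ procedure that GPTQModel actually runs, and it ignores desc_act and damp_percent. The function names (quantize_group, dequantize_group, quantize_row) are hypothetical and are not part of the gptqmodel API.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group (illustrative only)."""
    maxq = 2 ** bits - 1        # 15 distinct steps above zero for 4-bit codes
    zero = (maxq + 1) // 2      # fixed midpoint (8), because the range is symmetric
    absmax = max(abs(w) for w in weights)
    scale = (2 * absmax / maxq) if absmax > 0 else 1.0
    # Map each weight to an integer code and clamp it into [0, maxq].
    codes = [min(maxq, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Recover approximate float weights from integer codes."""
    return [scale * (q - zero) for q in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize a row in consecutive groups, each with its own scale."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.uniform(-1, 1) for _ in range(256)]  # toy weight row, not real model data
groups = quantize_row(row)
restored = [
    w
    for codes, scale, zero in groups
    for w in dequantize_group(codes, scale, zero)
]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_err)
```

Each group of 128 weights shares one scale, so a smaller group_size tracks local weight ranges more closely at the cost of storing more scales.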

Here is an example that loads the quantized model and generates text:

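import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# Inputs are moved to the first CUDA device; this assumes the model is loaded there too.
device = torch.device("cuda:0")

model_name = "ModelCloud/Meta-Llama-3.1-8B-gptq-4bit"

prompt = "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"

# Load the tokenizer and the quantized checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTQModel.from_quantized(model_name)

# Tokenize the prompt, then generate between 1 and 512 new tokens with a single beam.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
res = model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=512)

# Decode the first (and only) returned sequence back to text.
print(tokenizer.decode(res[0]))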

Model format: Safetensors. Model size: 1.99B params. Tensor types: I32, BF16, FP16.