Introduction

This model is derived from Xkev/Llama-3.2V-11B-cot. This repository only quantizes that model to the NF4 (4-bit NormalFloat) format using the bitsandbytes library; the weights are otherwise unchanged. All credit goes to the original repository.
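
For readers who want to reproduce a similar checkpoint, the following is a minimal sketch of NF4 quantization through the transformers integration with bitsandbytes. It is not the script used to build this repository: the helper names `nf4_config_kwargs` and `quantize_and_save`, the `float16` compute dtype, and the output directory are assumptions made for illustration.

```python
def nf4_config_kwargs(compute_dtype: str = "float16") -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig that select NF4.

    The compute dtype is an assumption; this model card does not state
    which dtype was used when the checkpoint was created.
    """
    return {
        "load_in_4bit": True,            # store linear-layer weights in 4 bits
        "bnb_4bit_quant_type": "nf4",    # NormalFloat4 rather than plain FP4
        "bnb_4bit_compute_dtype": compute_dtype,
    }


def quantize_and_save(source_id: str, output_dir: str) -> None:
    """Load a checkpoint with on-the-fly NF4 quantization and save the result."""
    # Imported lazily: this step needs transformers, torch, bitsandbytes,
    # and a CUDA-capable GPU.
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        MllamaForConditionalGeneration,
    )

    config = BitsAndBytesConfig(**nf4_config_kwargs())
    model = MllamaForConditionalGeneration.from_pretrained(
        source_id,
        quantization_config=config,
        device_map="cuda:0",
    )
    # Saving a 4-bit model requires reasonably recent transformers and
    # bitsandbytes releases.
    model.save_pretrained(output_dir)
    AutoProcessor.from_pretrained(source_id).save_pretrained(output_dir)
```

Under these assumptions, a call such as `quantize_and_save("Xkev/Llama-3.2V-11B-cot", "Llama-3.2V-11B-cot-nf4")` would write a quantized copy to a local directory. Because the quantization settings are stored with the saved checkpoint, loading it later does not require passing a `BitsAndBytesConfig` again.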

Usage

from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import time

# Load model
model_id = "zhangsongbo365/Llama-3.2V-11B-cot-nf4" 
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map="cuda:0",
    trust_remote_code=True
)

# Load the processor (image preprocessor and tokenizer)
processor = AutoProcessor.from_pretrained(model_id)

# Caption a local image
IMAGE = Image.open("1.png").convert("RGB")  # change this to your actual image path
PROMPT = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Caption this image:
<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

inputs = processor(IMAGE, PROMPT, return_tensors="pt").to(model.device)
prompt_tokens = len(inputs['input_ids'][0])
print(f"Prompt tokens: {prompt_tokens}")

t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=256)
t1 = time.time()
total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
tokens_per_second = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

output = processor.decode(generate_ids[0][prompt_tokens:]).replace('<|eot_id|>', '')
print(output)
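
The prompt template and the throughput arithmetic in the script above can be factored into two small helpers, which makes it easier to try other instructions and to compare runs. This is only a sketch: `build_prompt` and `throughput_stats` are hypothetical names introduced here, and the template simply mirrors the string used in the usage example.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the chat template used in the usage example."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{instruction}\n"
        "<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )


def throughput_stats(prompt_tokens: int, total_tokens: int, elapsed_seconds: float):
    """Return (generated_tokens, tokens_per_second) for one generate() call.

    total_tokens is the length of the sequence returned by generate(),
    which includes the prompt tokens.
    """
    generated = max(total_tokens - prompt_tokens, 0)
    # Guard against a zero or negative timer reading.
    rate = generated / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return generated, rate
```

With these helpers, `PROMPT = build_prompt("Caption this image:")` reproduces the prompt above, and `throughput_stats(prompt_tokens, len(generate_ids[0]), t1 - t0)` yields the two numbers that the script prints.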