WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
I am getting "WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu." while loading Mixtral into a text-generation pipeline.
You don't have enough GPU memory. Consider renting a GPU, or loading the model in a more memory-efficient way (e.g. in 4-bit).
I second what @cekal said: you probably don't have enough GPU RAM to fit the model. Try either loading it in lower precision (e.g. float16 or load_in_4bit), or using the serialized 4-bit checkpoint here: https://huggingface.co/ybelkada/Mixtral-8x7B-Instruct-v0.1-bnb-4bit
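To put rough numbers on "not enough GPU RAM": the weights alone take roughly the parameter count times the bytes per parameter. Below is a minimal sketch, assuming approximate parameter counts (about 46.7B for Mixtral-8x7B, 6.7B for Llama-2-7b, 8.0B for Llama-3.1-8B). The dictionary names and the helper weight_memory_gib are made up for illustration, and real usage is higher once activations, the KV cache, and quantization overhead are added:

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Parameter counts are approximate assumptions, not exact figures.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

APPROX_PARAMS = {
    "Mixtral-8x7B": 46.7e9,
    "Llama-2-7b": 6.7e9,
    "Llama-3.1-8B": 8.0e9,
}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


for name, num_params in APPROX_PARAMS.items():
    row = ", ".join(
        f"{precision}: {weight_memory_gib(num_params, nbytes):.1f} GiB"
        for precision, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"{name} -> {row}")
```

If the figure for your chosen precision is larger than your free GPU memory, device_map="auto" will spill the remainder to CPU or disk, which is what the warning is telling you.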
Hi @ybelkada,
Any idea what the minimum system requirements are to run this model (e.g. GPU)? I am trying to run the Python code below with Streamlit and I get the above error (or rather, warning):
import streamlit as st
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch
token = ""
model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    eos_token_id=tokenizer.eos_token_id,
)
llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})
template = """
You are an intelligent chatbot that gives out useful information to humans.
You return the responses in sentences with arrows at the start of each sentence
{query}
"""
prompt = PromptTemplate(template=template, input_variables=["query"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.invoke('What are the 3 causes of glacier meltdowns?'))
I'm getting an error (or something) that I don't understand. What should I do? Thanks.
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 23.97it/s]
Some parameters are on the meta device device because they were offloaded to the cpu and disk.
Setting pad_token_id to eos_token_id:128009 for open-end generation.
The code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "/MyPath/Meta-Llama-3.1-8B-Instruct"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Set the padding token to be the same as the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Define the messages
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
# Prepare input IDs and attention mask
inputs = tokenizer(
    [msg["content"] for msg in messages],
    return_tensors="pt",
    padding=True,
    truncation=True,
)
# Ensure inputs are moved to the correct device
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)
# Set terminators
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
# Generate text
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,  # Add attention mask
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Decode and print the response
response = outputs[0][input_ids.shape[-1]:]
print("Generated Response:", tokenizer.decode(response, skip_special_tokens=True))
# Additional debugging output
print("Inputs:")
print(inputs)
print("Generated Output IDs:")
print(outputs)
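One way to see what the warning refers to is to look at where the modules ended up. Models loaded with device_map="auto" expose the placement as model.hf_device_map. The sketch below counts modules per device; the helper summarize_device_map and the example map are hypothetical, so swap in your own model.hf_device_map:

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> Counter:
    """Count how many modules are placed on each device."""
    counts = Counter()
    for device in device_map.values():
        # Placements are typically integer GPU indices, or "cpu" / "disk".
        label = f"cuda:{device}" if isinstance(device, int) else str(device)
        counts[label] += 1
    return counts


# Hypothetical map, shaped like model.hf_device_map after device_map="auto".
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "model.layers.2": "disk",
    "lm_head": "disk",
}

summary = summarize_device_map(example_map)
print(dict(summary))

offloaded = summary["cpu"] + summary["disk"]
print(f"{offloaded} of {len(example_map)} modules are offloaded")
```

If anything shows up under cpu or disk, the model did not fully fit on the GPU and generation will likely be slow. Lower precision or a GPU with more memory are the usual ways out, as suggested earlier in the thread.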