ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM
Loading the model with load_in_8bit=True does not seem to work on Google Colab.
model_name = "google/flan-ul2"
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    #torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    device_map="auto",
    #offload_folder="offload",
    #offload_state_dict=True,
)
This code leads to the following error. I'm running it on a standard Google Colab GPU (Tesla T4, 15GB RAM).
(I had successfully loaded the model with torch_dtype=torch.bfloat16
and offloading (accelerate and bitsandbytes are installed), but it doesn't seem to work with 8-bit.)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-22-b20ea4cc84c4> in <module>
3 from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer
4 tokenizer = AutoTokenizer.from_pretrained(model_name)
----> 5 model = T5ForConditionalGeneration.from_pretrained(
6 model_name,
7 #torch_dtype=torch.bfloat16,
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
2423 }
2424 if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
-> 2425 raise ValueError(
2426 """
2427 Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you have set a value for `max_memory` you should increase that. To have
an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map.
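The condition visible in the traceback (line 2424 of modeling_utils.py) can be reproduced outside of transformers to see which entries of a device map would trigger it. This is a minimal sketch, not the library's code: the helper name `offloaded_modules`, the example device map, and the assumption that only `lm_head` is left out of the check are all illustrative.

```python
def offloaded_modules(device_map, skip=("lm_head",)):
    """Return the entries of a device map that are placed on CPU or disk.

    `skip` lists module names left out of the check. Excluding only
    "lm_head" is an assumption suggested by the variable name
    `device_map_without_lm_head` in the traceback.
    """
    return {
        name: device
        for name, device in device_map.items()
        if name not in skip and device in ("cpu", "disk")
    }


# Hypothetical device map of the kind stored in model.hf_device_map.
example_map = {
    "shared": 0,
    "encoder": 0,
    "decoder.block.0": 0,
    "decoder.block.1": "cpu",
    "decoder.final_layer_norm": "disk",
    "lm_head": "cpu",
}

offloaded = offloaded_modules(example_map)
print(offloaded)
```

A non-empty result corresponds to the case in which the ValueError above is raised.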
Same for my deployment on SageMaker using instance_type="ml.g4dn.4xlarge". Waiting for someone to help on this as well.
My code:
from transformers import AutoTokenizer, T5ForConditionalGeneration

def model_fn(model_dir):
    # load model and tokenizer
    model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2",
        load_in_8bit=True, device_map="auto", cache_dir="/tmp/model_cache/")
    tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
    return model, tokenizer
from sagemaker.huggingface.model import HuggingFaceModel
huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version='py38',
)
from sagemaker.utils import name_from_base
endpoint_name = name_from_base(model_name)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    #instance_type="ml.g5.4xlarge",
    instance_type="ml.g4dn.4xlarge",
    endpoint_name=endpoint_name,
)
data = {
    "inputs": prompt,
    "min_length": 20,
    "max_length": 50,
    "do_sample": True,
    "temperature": 0.6,
}
res = predictor.predict(data=data)
print(res)
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit\n the quantized model. If you have set a value for max_memory
you should increase that. To have\n an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map
Hi @MoritzLaurer @miltonc
As stated by the error trace, it seems that you don't have enough memory to fit the model on a 15GB GPU. The model has 20B parameters, so you would need at least roughly 20GB of GPU RAM to run it in int8.
However, you might be interested in dispatching the model between CPU and GPU, fitting ~70% of the model weights on the GPU and the rest on the CPU, using BitsAndBytesConfig. Please have a look at the following section: https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
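A minimal sketch of the CPU/GPU split described above, assuming a recent transformers release with bitsandbytes and accelerate installed. `llm_int8_enable_fp32_cpu_offload` and `max_memory` are existing options, but the helper names and the memory budgets are hypothetical. The linked documentation uses an explicit `device_map` dict rather than "auto", so whether "auto" is accepted here may depend on the installed version. Modules placed on the CPU are kept in fp32, not int8, so they need four bytes per parameter of CPU RAM.

```python
def build_max_memory(gpu_gib, cpu_gib, headroom_gib=2):
    """Build a max_memory dict, leaving GPU headroom for activations.

    The headroom value is an illustrative guess, not a recommendation
    from the library.
    """
    if gpu_gib <= headroom_gib:
        raise ValueError("GPU budget must exceed the reserved headroom")
    return {0: f"{gpu_gib - headroom_gib}GiB", "cpu": f"{cpu_gib}GiB"}


def load_8bit_with_cpu_offload(model_name, max_memory):
    """Load an int8 model, allowing modules that do not fit to stay on CPU."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import BitsAndBytesConfig, T5ForConditionalGeneration

    quantization_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # CPU modules stay in fp32
    )
    return T5ForConditionalGeneration.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",  # assumption: an explicit dict may be required
        max_memory=max_memory,
    )


# Hypothetical budgets: a 15 GiB GPU and 30 GiB of CPU RAM.
max_memory = build_max_memory(gpu_gib=15, cpu_gib=30)
print(max_memory)

# Uncomment on a machine with a GPU and enough CPU RAM:
# model = load_8bit_with_cpu_offload("google/flan-ul2", max_memory)
# print(model.hf_device_map)
```

Printing `model.hf_device_map` after loading shows which modules ended up on the CPU.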
Hi @ybelkada, yeah, just increasing memory makes sense. What confused me is that it worked with bf16, but the same setup did not work with int8. I thought bf16 requires more memory than int8. If that's the case, then I don't really understand why I got the error with int8, especially since a central motivation for using int8 is to decrease memory requirements. But maybe I misunderstand something.
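The sizes quoted in this thread follow from simple arithmetic on the weights alone. A rough sketch, assuming 20B parameters as stated above and ignoring activations, quantization metadata and framework overhead (the function name is illustrative):

```python
def weight_size_gb(num_params, bits_per_param):
    """Approximate size of the model weights in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


flan_ul2_params = 20_000_000_000  # 20B, as stated in the thread

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    size = weight_size_gb(flan_ul2_params, bits)
    print(f"{label}: ~{size:.0f} GB")
```

Under these assumptions the int8 weights alone (about 20 GB) exceed a 15 GB T4, so part of the model has to sit on the CPU or disk in either precision. The bf16 load was done with offloading enabled, whereas the traceback shows the 8-bit path raising an error as soon as a module is dispatched to the CPU or disk.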
Hey @MoritzLaurer, were you ever able to figure this out? I am having a similar problem in Google Colab on a T4 GPU, which has 16 GB of memory. I am loading my fine-tuned Llama 2 7B hf model, which should in theory fit, but I run into the same error. Is it really as simple as needing more memory? I would really like to remain in the free tier of Google Colab if at all possible.
EDIT: To clarify, I am even using the 4 bit quantization using BitsAndBytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
Training the model works fine, but when I load the trained model I encounter the error.
@fredbananas
I ran into the same error despite having an AWS g5.4xlarge instance (using a different model). For me it was because the NVIDIA runtime wasn't up. If you think that might be your issue too, try nvidia-smi and see if it's working.
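The same check can be run from a notebook cell or a deployment script. This is a small sketch using only the standard library; the function name and the timeout are illustrative, and a zero exit code from nvidia-smi is taken as a sign that the driver is reachable.

```python
import shutil
import subprocess


def nvidia_smi_ok(timeout_s=10):
    """Return True if nvidia-smi is on PATH and exits with code 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


driver_ok = nvidia_smi_ok()
print("nvidia-smi OK" if driver_ok else "nvidia-smi missing or failing")
```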
Hi @chenaclee @fredbananas
In order to load the model on a free-tier Google Colab instance, I recommend using a sharded version of the model instead, such as: Trelis/Llama-2-7b-chat-hf-sharded-bf16
You can find more sharded checkpoints in my personal collection: https://huggingface.co/collections/ybelkada/sharded-checkpoints-64fefe78cccea7ce7b268310
@fredbananas were you able to fix the error?
Free up the GPU :)
If you are trying this in Google Colab, you can restart the session.
The GPU memory is then cleared for your use.
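To see whether a restart actually freed anything, the free memory can be queried before loading the model. A minimal sketch, assuming PyTorch is installed; the helper name is illustrative, and it returns None when PyTorch or a CUDA device is unavailable.

```python
def free_gpu_memory_gib(device=0):
    """Return (free, total) GPU memory in GiB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 2**30, total_bytes / 2**30


info = free_gpu_memory_gib()
if info is None:
    print("No CUDA device visible from PyTorch")
else:
    print(f"Free: {info[0]:.1f} GiB of {info[1]:.1f} GiB")
```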
Same issue. Has anyone found a solution?
Same issue. Does anyone have a suggestion? I am using the TinyLlama-1.1B-Chat-v0.1 LLM and have an 8.0 GB GPU.
As discussed in this thread, it is because your GPU memory is not big enough to load the model. You can increase GPU memory, or place part of the modules in CPU memory.
Can you recommend the amount of GPU memory needed? Mine is 48.0 GB, with 2 GPUs, both Nvidia RTX A4000.
Getting this error as well with google/madlad400-7b-mt and 10b. Using bfloat16, everything runs fine without error on my 12 GB VRAM + CPU. However, using this quantization config I get the ValueError above. Can you explain why it would work for bfloat16 and not 4-bit? It seems the quantized model should use CPU RAM as needed, just like bfloat16. Is this a bug?
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)