It can run on two RTX 4090s or a single RTX 6000 Ada.
Change the code slightly, then run it:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"load_in_8bit": True}
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
I'm a 2x3090 user. I can run the LLaMA 30B model. In fact, I can load LLaMA 65B but can't run it because of memory (system memory, not VRAM). That means we should be able to use the Falcon 40B model, I guess.
You can set up a swap file to expand virtual memory. See "swapon".
Hi @znsoft - thanks for this. Were you able to run it yourself on 2x4090? I have this set up and I got an error running this which appears to be linked with running out of VRAM. I ran exactly the same code with the 7b model, and it works. The exact log and error I got was:
Overriding torch_dtype=torch.bfloat16 with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...
The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/generation/utils.py:1255: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[4], line 1
----> 1 sequences = pipeline(
2 "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
3 max_length=200,
4 do_sample=True,
5 top_k=10,
6 num_return_sequences=1,
7 eos_token_id=tokenizer.eos_token_id,
8 )
9 for seq in sequences:
10 print(f"Result: {seq['generated_text']}")
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:201, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
160 def __call__(self, text_inputs, **kwargs):
161 """
162 Complete the prompt(s) given as inputs.
163
(...)
199 ids of the generated text.
200 """
--> 201 return super().__call__(text_inputs, **kwargs)
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/base.py:1119, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1111 return next(
1112 iter(
1113 self.get_iterator(
(...)
1116 )
1117 )
1118 else:
-> 1119 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/base.py:1126, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1124 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1125 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1126 model_outputs = self.forward(model_inputs, **forward_params)
1127 outputs = self.postprocess(model_outputs, **postprocess_params)
1128 return outputs
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/base.py:1025, in Pipeline.forward(self, model_inputs, **forward_params)
1023 with inference_context():
1024 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1025 model_outputs = self._forward(model_inputs, **forward_params)
1026 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1027 else:
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:263, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
260 generate_kwargs["min_length"] += prefix_length
262 # BS x SL
--> 263 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
264 out_b = generated_sequence.shape[0]
265 if self.framework == "pt":
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/generation/utils.py:1568, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
1560 input_ids, model_kwargs = self._expand_inputs_for_generation(
1561 input_ids=input_ids,
1562 expand_size=generation_config.num_return_sequences,
1563 is_encoder_decoder=self.config.is_encoder_decoder,
1564 **model_kwargs,
1565 )
1567 # 13. run sample
-> 1568 return self.sample(
1569 input_ids,
1570 logits_processor=logits_processor,
1571 logits_warper=logits_warper,
1572 stopping_criteria=stopping_criteria,
1573 pad_token_id=generation_config.pad_token_id,
1574 eos_token_id=generation_config.eos_token_id,
1575 output_scores=generation_config.output_scores,
1576 return_dict_in_generate=generation_config.return_dict_in_generate,
1577 synced_gpus=synced_gpus,
1578 streamer=streamer,
1579 **model_kwargs,
1580 )
1582 elif is_beam_gen_mode:
1583 if generation_config.num_return_sequences > generation_config.num_beams:
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/generation/utils.py:2615, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2612 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2614 # forward pass to get next token
-> 2615 outputs = self(
2616 **model_inputs,
2617 return_dict=True,
2618 output_attentions=output_attentions,
2619 output_hidden_states=output_hidden_states,
2620 )
2622 if synced_gpus and this_peer_finished:
2623 continue # don't waste resources running the code we don't need
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:759, in RWForCausalLM.forward(self, input_ids, past_key_values, attention_mask, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, **deprecated_arguments)
755 raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")
757 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 759 transformer_outputs = self.transformer(
760 input_ids,
761 past_key_values=past_key_values,
762 attention_mask=attention_mask,
763 head_mask=head_mask,
764 inputs_embeds=inputs_embeds,
765 use_cache=use_cache,
766 output_attentions=output_attentions,
767 output_hidden_states=output_hidden_states,
768 return_dict=return_dict,
769 )
770 hidden_states = transformer_outputs[0]
772 lm_logits = self.lm_head(hidden_states)
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:654, in RWModel.forward(self, input_ids, past_key_values, attention_mask, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, **deprecated_arguments)
646 outputs = torch.utils.checkpoint.checkpoint(
647 create_custom_forward(block),
648 hidden_states,
(...)
651 head_mask[i],
652 )
653 else:
--> 654 outputs = block(
655 hidden_states,
656 layer_past=layer_past,
657 attention_mask=causal_mask,
658 head_mask=head_mask[i],
659 use_cache=use_cache,
660 output_attentions=output_attentions,
661 alibi=alibi,
662 )
664 hidden_states = outputs[0]
665 if use_cache is True:
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:411, in DecoderLayer.forward(self, hidden_states, alibi, attention_mask, layer_past, head_mask, use_cache, output_attentions)
408 outputs = attn_outputs[1:]
410 # MLP.
--> 411 mlp_output = self.mlp(ln_mlp)
413 output = dropout_add(
414 mlp_output + attention_output, residual, self.config.hidden_dropout, training=self.training
415 )
417 if use_cache:
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:356, in MLP.forward(self, x)
355 def forward(self, x: torch.Tensor) -> torch.Tensor:
--> 356 x = self.act(self.dense_h_to_4h(x))
357 x = self.dense_4h_to_h(x)
358 return x
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:320, in Linear8bitLt.forward(self, x)
317 if self.bias is not None and self.bias.dtype != x.dtype:
318 self.bias.data = self.bias.data.to(x.dtype)
--> 320 out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
322 if not self.state.has_fp16_weights:
323 if self.state.CB is not None and self.state.CxB is not None:
324 # we converted 8-bit row major to turing/ampere format in the first inference pass
325 # we no longer need the row-major weight
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:500, in matmul(A, B, out, state, threshold, bias)
498 if threshold > 0.0:
499 state.threshold = threshold
--> 500 return MatMul8bitLt.apply(A, B, out, bias, state)
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/autograd/function.py:506, in Function.apply(cls, *args, **kwargs)
503 if not torch._C._are_functorch_transforms_active():
504 # See NOTE: [functorch vjp and autograd interaction]
505 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506 return super().apply(*args, **kwargs) # type: ignore[misc]
508 if cls.setup_context == _SingleLevelFunction.setup_context:
509 raise RuntimeError(
510 'In order to use an autograd.Function with functorch transforms '
511 '(vmap, grad, jvp, jacrev, ...), it must override the setup_context '
512 'staticmethod. For more details, please see '
513 'https://pytorch.org/docs/master/notes/extending.func.html')
File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:417, in MatMul8bitLt.forward(ctx, A, B, out, bias, state)
415 # 4. Mixed-precision decomposition matmul
416 if coo_tensorA is not None and subA is not None:
--> 417 output += torch.matmul(subA, state.subB)
419 # 5. Save state
420 ctx.state = state
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I'm very interested in being able to run this model, so any help would be greatly appreciated.
For anyone else landing here with the same problem I had ^: I just recompiled PyTorch with CUDA=12.1 and it worked fine. Thanks!
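If you want to check your own setup before rebuilding anything, here is a rough sketch that compares the CUDA version PyTorch was built against with the CUDA tag in the bitsandbytes binary name from the log (`libbitsandbytes_cuda121.so`). It assumes the error came from a version mismatch between the two, which this thread does not confirm. The helper names (`cuda_tag`, `bnb_binary_tag`, `same_cuda`) are made up for illustration; only `torch.version.cuda` and `torch.cuda.is_available()` are real PyTorch attributes.

```python
import re
from typing import Optional

try:
    import torch  # optional: only needed for the live check at the bottom
except ImportError:
    torch = None


def cuda_tag(version: Optional[str]) -> Optional[str]:
    """Turn a CUDA version string such as '12.1' into the tag '121'."""
    if not version:
        return None
    major, minor = version.split(".")[:2]
    return f"{major}{minor}"


def bnb_binary_tag(path: str) -> Optional[str]:
    """Extract the CUDA tag from a bitsandbytes binary name (hypothetical helper)."""
    match = re.search(r"libbitsandbytes_cuda(\d+)", path)
    return match.group(1) if match else None


def same_cuda(torch_cuda_version: Optional[str], bnb_binary_path: str) -> bool:
    """True when PyTorch's CUDA version matches the tag in the binary name."""
    tag = cuda_tag(torch_cuda_version)
    return tag is not None and tag == bnb_binary_tag(bnb_binary_path)


if torch is not None:
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    print("PyTorch built against CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())
```

To use it, pass `torch.version.cuda` and the binary path printed after "CUDA SETUP: Loading binary" to `same_cuda`. If it returns `False`, a mismatch is one thing worth ruling out.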
You need to quantize the model to 8 bits.
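A quick back-of-the-envelope calculation shows why 8-bit matters here. This is only a sketch: it counts weights alone (activations, the KV cache and framework overhead come on top), and it assumes a nominal 40 billion parameters plus the usual 24 GiB per RTX 3090/4090 and 48 GiB for an RTX 6000 Ada. The function names are illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


def weights_fit(n_params: float, bits_per_param: int, total_vram_gib: float) -> bool:
    """True when the weights alone fit in the given amount of VRAM."""
    return weight_memory_gib(n_params, bits_per_param) < total_vram_gib


NOMINAL_PARAMS = 40e9   # assumption: "40b" taken at face value
TOTAL_VRAM_GIB = 48.0   # assumption: 2 x 24 GiB cards, or one 48 GiB card

for bits in (16, 8):
    needed = weight_memory_gib(NOMINAL_PARAMS, bits)
    verdict = "fits" if weights_fit(NOMINAL_PARAMS, bits, TOTAL_VRAM_GIB) else "does not fit"
    print(f"{bits}-bit weights: {needed:.1f} GiB, {verdict} in {TOTAL_VRAM_GIB:.0f} GiB")
```

At 16 bits the weights need roughly 74.5 GiB, which is more than 48 GiB. At 8 bits they need roughly 37.3 GiB, which leaves some headroom for generation.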