It can run with two 4090s or a single RTX 6000 Ada.

#20
by znsoft - opened

Change the code a little bit, then run it:


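from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # overridden to float16 by bitsandbytes when load_in_8bit is set
    trust_remote_code=True,  # Falcon ships its own modelling code (modelling_RW.py)
    device_map="auto",  # let accelerate place layers across the available GPUs
    model_kwargs={"load_in_8bit": True},  # load the weights in 8-bit via bitsandbytes
)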
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

I'm a 2x3090 user. I can run the LLaMA 30B model. In fact, I can load LLaMA 65B but can't run it because of memory (system memory, not VRAM). So I guess we can use the Falcon 40B model.

You can set up a swap file to expand virtual memory. Refer to "swapon".
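
Before adding swap, it can help to see how much RAM and swap the machine currently has. A minimal sketch, assuming Linux, where /proc/meminfo reports fields such as MemTotal and SwapTotal in kB; the helper name `parse_meminfo` is made up for illustration:

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, float]:
    """Return selected /proc/meminfo fields converted from kB to GiB."""
    wanted = {"MemTotal", "MemAvailable", "SwapTotal", "SwapFree"}
    result = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            kib = int(rest.split()[0])  # values are reported in kB
            result[key] = kib / (1024 ** 2)
    return result


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        for name, gib in parse_meminfo(meminfo.read_text()).items():
            print(f"{name}: {gib:.1f} GiB")
```

If SwapTotal is small compared with the size of the checkpoint, loading may fail in system RAM before VRAM ever becomes the limit.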

Hi @znsoft - thanks for this. Were you able to run it yourself on 2x4090? I have this setup and got an error running it that appears to be linked to running out of VRAM. I ran exactly the same code with the 7b model, and it works. The exact log and error I got were:

Overriding torch_dtype=torch.bfloat16 with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...
The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/generation/utils.py:1255: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 1
----> 1 sequences = pipeline(
      2    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
      3     max_length=200,
      4     do_sample=True,
      5     top_k=10,
      6     num_return_sequences=1,
      7     eos_token_id=tokenizer.eos_token_id,
      8 )
      9 for seq in sequences:
     10     print(f"Result: {seq['generated_text']}")

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:201, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
    160 def __call__(self, text_inputs, **kwargs):
    161     """
    162     Complete the prompt(s) given as inputs.
    163 
   (...)
    199           ids of the generated text.
    200     """
--> 201     return super().__call__(text_inputs, **kwargs)

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/base.py:1119, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1111     return next(
   1112         iter(
   1113             self.get_iterator(
   (...)
   1116         )
   1117     )
   1118 else:
-> 1119     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/base.py:1126, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1124 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1125     model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1126     model_outputs = self.forward(model_inputs, **forward_params)
   1127     outputs = self.postprocess(model_outputs, **postprocess_params)
   1128     return outputs

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/base.py:1025, in Pipeline.forward(self, model_inputs, **forward_params)
   1023     with inference_context():
   1024         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1025         model_outputs = self._forward(model_inputs, **forward_params)
   1026         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1027 else:

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:263, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
    260         generate_kwargs["min_length"] += prefix_length
    262 # BS x SL
--> 263 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
    264 out_b = generated_sequence.shape[0]
    265 if self.framework == "pt":

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/generation/utils.py:1568, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1560     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1561         input_ids=input_ids,
   1562         expand_size=generation_config.num_return_sequences,
   1563         is_encoder_decoder=self.config.is_encoder_decoder,
   1564         **model_kwargs,
   1565     )
   1567     # 13. run sample
-> 1568     return self.sample(
   1569         input_ids,
   1570         logits_processor=logits_processor,
   1571         logits_warper=logits_warper,
   1572         stopping_criteria=stopping_criteria,
   1573         pad_token_id=generation_config.pad_token_id,
   1574         eos_token_id=generation_config.eos_token_id,
   1575         output_scores=generation_config.output_scores,
   1576         return_dict_in_generate=generation_config.return_dict_in_generate,
   1577         synced_gpus=synced_gpus,
   1578         streamer=streamer,
   1579         **model_kwargs,
   1580     )
   1582 elif is_beam_gen_mode:
   1583     if generation_config.num_return_sequences > generation_config.num_beams:

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/transformers/generation/utils.py:2615, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2612 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2614 # forward pass to get next token
-> 2615 outputs = self(
   2616     **model_inputs,
   2617     return_dict=True,
   2618     output_attentions=output_attentions,
   2619     output_hidden_states=output_hidden_states,
   2620 )
   2622 if synced_gpus and this_peer_finished:
   2623     continue  # don't waste resources running the code we don't need

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:759, in RWForCausalLM.forward(self, input_ids, past_key_values, attention_mask, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, **deprecated_arguments)
    755     raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")
    757 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 759 transformer_outputs = self.transformer(
    760     input_ids,
    761     past_key_values=past_key_values,
    762     attention_mask=attention_mask,
    763     head_mask=head_mask,
    764     inputs_embeds=inputs_embeds,
    765     use_cache=use_cache,
    766     output_attentions=output_attentions,
    767     output_hidden_states=output_hidden_states,
    768     return_dict=return_dict,
    769 )
    770 hidden_states = transformer_outputs[0]
    772 lm_logits = self.lm_head(hidden_states)

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:654, in RWModel.forward(self, input_ids, past_key_values, attention_mask, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, **deprecated_arguments)
    646     outputs = torch.utils.checkpoint.checkpoint(
    647         create_custom_forward(block),
    648         hidden_states,
   (...)
    651         head_mask[i],
    652     )
    653 else:
--> 654     outputs = block(
    655         hidden_states,
    656         layer_past=layer_past,
    657         attention_mask=causal_mask,
    658         head_mask=head_mask[i],
    659         use_cache=use_cache,
    660         output_attentions=output_attentions,
    661         alibi=alibi,
    662     )
    664 hidden_states = outputs[0]
    665 if use_cache is True:

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:411, in DecoderLayer.forward(self, hidden_states, alibi, attention_mask, layer_past, head_mask, use_cache, output_attentions)
    408 outputs = attn_outputs[1:]
    410 # MLP.
--> 411 mlp_output = self.mlp(ln_mlp)
    413 output = dropout_add(
    414     mlp_output + attention_output, residual, self.config.hidden_dropout, training=self.training
    415 )
    417 if use_cache:

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py:356, in MLP.forward(self, x)
    355 def forward(self, x: torch.Tensor) -> torch.Tensor:
--> 356     x = self.act(self.dense_h_to_4h(x))
    357     x = self.dense_4h_to_h(x)
    358     return x

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:320, in Linear8bitLt.forward(self, x)
    317 if self.bias is not None and self.bias.dtype != x.dtype:
    318     self.bias.data = self.bias.data.to(x.dtype)
--> 320 out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
    322 if not self.state.has_fp16_weights:
    323     if self.state.CB is not None and self.state.CxB is not None:
    324         # we converted 8-bit row major to turing/ampere format in the first inference pass
    325         # we no longer need the row-major weight

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:500, in matmul(A, B, out, state, threshold, bias)
    498 if threshold > 0.0:
    499     state.threshold = threshold
--> 500 return MatMul8bitLt.apply(A, B, out, bias, state)

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/torch/autograd/function.py:506, in Function.apply(cls, *args, **kwargs)
    503 if not torch._C._are_functorch_transforms_active():
    504     # See NOTE: [functorch vjp and autograd interaction]
    505     args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506     return super().apply(*args, **kwargs)  # type: ignore[misc]
    508 if cls.setup_context == _SingleLevelFunction.setup_context:
    509     raise RuntimeError(
    510         'In order to use an autograd.Function with functorch transforms '
    511         '(vmap, grad, jvp, jacrev, ...), it must override the setup_context '
    512         'staticmethod. For more details, please see '
    513         'https://pytorch.org/docs/master/notes/extending.func.html')

File ~/.local/share/virtualenvs/huggingface-OfWfm_Zx/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:417, in MatMul8bitLt.forward(ctx, A, B, out, bias, state)
    415 # 4. Mixed-precision decomposition matmul
    416 if coo_tensorA is not None and subA is not None:
--> 417     output += torch.matmul(subA, state.subB)
    419 # 5. Save state
    420 ctx.state = state

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I'm very interested in being able to run this model, so any help would be greatly appreciated.

For anyone else landing here with the same problem I had ^, I just recompiled PyTorch with CUDA=12.1 and it worked fine. Thanks!
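
The log above shows bitsandbytes loading its CUDA 12.1 binary (libbitsandbytes_cuda121.so), so one plausible reading is that the installed PyTorch had been built against a different CUDA version. A small sketch for comparing the two, not a definitive diagnosis: `bnb_cuda_tag` and `versions_match` are hypothetical helper names, and the only PyTorch attribute used is `torch.version.cuda`.

```python
import re
from typing import Optional


def bnb_cuda_tag(binary_name: str) -> Optional[str]:
    """Extract the CUDA tag (e.g. '121') from a bitsandbytes binary name."""
    match = re.search(r"cuda(\d+)", binary_name)
    return match.group(1) if match else None


def versions_match(torch_cuda: Optional[str], binary_name: str) -> bool:
    """Compare torch.version.cuda (e.g. '12.1') with the binary's CUDA tag."""
    tag = bnb_cuda_tag(binary_name)
    if torch_cuda is None or tag is None:
        return False
    return torch_cuda.replace(".", "") == tag


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch built with CUDA:", torch.version.cuda)
        print(versions_match(torch.version.cuda, "libbitsandbytes_cuda121.so"))
```

Replace the binary name with whatever the "CUDA SETUP: Loading binary" line prints on your machine.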

You need to quantize the model to 8 bits.
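
A rough back-of-the-envelope check of why 8-bit matters, counting weights only (activations, the KV cache and CUDA overhead come on top). It assumes about 40 billion parameters and 24 GB per card; the function name is illustrative:

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 40e9  # assumption: roughly 40 billion parameters
TOTAL_VRAM_GB = 2 * 24  # assumption: two 24 GB cards

for name, nbytes in [("bf16/fp16", 2), ("int8", 1)]:
    need = weight_gb(N_PARAMS, nbytes)
    verdict = "fits" if need < TOTAL_VRAM_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights, {verdict} in {TOTAL_VRAM_GB} GB")
```

The headroom at 8 bits is small, so long prompts or a large max_length can still push a 2x24 GB setup over the edge.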
