Model not running on CPU due to flash_attn package requirement
I am trying to import the prosparse-llama-2-7b model on an ARM CPU machine (gr3 instance).
The model requires flash_attn, and attempting to install flash_attn on this machine raises an nvcc error.
Other SparseLLM models, such as SparseLLM/ReluLLaMA-7B and https://huggingface.co/SparseLLM/ProSparse-MiniCPM-1B-sft, load and execute on CPUs without problems; the issue is specific to this model and its larger variant, i.e. prosparse-llama-2-13b.
Could you please look into this? I think we need to get rid of the hard flash_attn dependency, because otherwise the model won't be able to execute on CPUs.
The problem seems strange. I've tried to load the model on a CPU machine with model = AutoModelForCausalLM.from_pretrained("SparseLLM/prosparse-llama-2-7b", torch_dtype=torch.bfloat16, trust_remote_code=True) and it succeeded. Generally, if the machine has no GPU or flash-attn is not installed, transformers.utils.is_flash_attn_2_available() should return False, so flash_attn will not be required. You may check line 47 in modeling_sparsellama.py.
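For reference, here is a minimal sketch of the CPU-only loading path described above; the model ID, dtype, and trust_remote_code flag come from the snippet in this reply, while the tokenizer usage, prompt, and generation arguments are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SparseLLM/prosparse-llama-2-7b"

# trust_remote_code is required because the repository ships its own
# modeling code (modeling_sparsellama.py).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Quick CPU smoke test (prompt and generation length are arbitrary).
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```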
Therefore, you may check the return value of is_flash_attn_2_available(). The flash-attn package and GPUs are not necessary to load these models on CPU machines.
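A quick way to verify this is to query the helper directly (a one-line check, assuming a transformers version that exposes it under transformers.utils):

```python
from transformers.utils import is_flash_attn_2_available

# On a CPU-only machine, or one without flash-attn installed, this should
# print False, in which case the custom modeling code never imports flash_attn.
print(is_flash_attn_2_available())
```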
From your screenshots, the problem seems to lie in the import phase of the transformers package. My transformers version is 4.43.3. If changing the version does not solve the problem, I suggest diving into the source code that raises the exception to fix it.
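A simple sketch for checking the locally installed version (the 4.43.3 reference is taken from this reply; the upgrade command in the comment is only a suggestion):

```python
import transformers

# This reply reports success with transformers 4.43.3. If your installed
# version differs, upgrading (e.g. pip install -U "transformers==4.43.3")
# may resolve the import-time exception.
print(transformers.__version__)
```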
Thanks, I updated the transformers version and it seems to work!