Aphrodite/VLLM/SGLang all refuse to load this model

#5
by fullstack - opened

All three fail with the same error:
KeyError: 'model.layers.0.mlp.down_proj.weight'

(sglang) ➜  ~ python -m sglang.launch_server --model-path unsloth/gemma-2-27b-it-bnb-4bit  --port 6002
WARNING 09-10 13:26:55 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
[13:26:56] When using sliding window in gemma-2, turn on flashinfer.
[13:26:56] server_args=ServerArgs(model_path='unsloth/gemma-2-27b-it-bnb-4bit', tokenizer_path='unsloth/gemma-2-27b-it-bnb-4bit', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='unsloth/gemma-2-27b-it-bnb-4bit', chat_template=None, is_embedding=False, host='127.0.0.1', port=6002, additional_ports=[6003, 6004, 6005, 6006], mem_fraction_static=0.88, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=298685233, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[13:26:56 TP0] Init nccl begin.
[13:26:56 TP0] Load weight begin. avail mem=23.40 GB
WARNING 09-10 13:26:57 interfaces.py:132] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
INFO 09-10 13:26:57 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Process Process-1:
Traceback (most recent call last):
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/launch_server.py", line 19, in <module>
    raise e
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/launch_server.py", line 17, in <module>
    launch_server(server_args)
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/server.py", line 365, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller_single.py", line 149, in start_controller_process
    controller = ControllerSingle(
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/controller_single.py", line 83, in __init__
    self.tp_server = ModelTpServer(
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 99, in __init__
    self.model_runner = ModelRunner(
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 110, in __init__
    self.load_model()
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 204, in load_model
    self.model = get_model(
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
    model.load_weights(
  File "/home/shazam/miniforge3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/gemma2.py", line 401, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.0.mlp.down_proj.weight'
, detoken_init_state: init ok
Unsloth AI org

Oh wait, bitsandbytes models might not be supported (yet) in SGLang - I think vLLM maybe supports them.
It's better to use the 16-bit version: https://huggingface.co/unsloth/gemma-2-27b-it
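
One way to confirm that a checkpoint is a pre-quantized bitsandbytes one before pointing a server at it is to look at the `quantization_config` entry in its `config.json`. A minimal sketch, assuming the config follows the usual Transformers layout (`quant_method`, `load_in_4bit`, `load_in_8bit`); the helper name `is_bnb_checkpoint` and the example values are made up for illustration and not copied from the repo:

```python
import json


def is_bnb_checkpoint(config: dict) -> bool:
    """Return True if a Hugging Face config dict describes a bitsandbytes checkpoint."""
    # "quantization_config" may be missing or null for unquantized models.
    qcfg = config.get("quantization_config") or {}
    if qcfg.get("quant_method") == "bitsandbytes":
        return True
    # Assumption: some configs only carry the load_in_* flags.
    return bool(qcfg.get("load_in_4bit") or qcfg.get("load_in_8bit"))


# Illustrative config fragment (hypothetical values).
example = json.loads(
    '{"model_type": "gemma2",'
    ' "quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": true}}'
)

print(is_bnb_checkpoint(example))
print(is_bnb_checkpoint({"model_type": "gemma2"}))
```

If the check comes back true and the serving stack has no bitsandbytes loader, the 16-bit repo is the safer choice.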

Um, it should work with vLLM; I'm using it right now.
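
For anyone trying the vLLM route: the SGLang log above shows `quantization=None` and `load_format='auto'`, so nothing told the loader to expect a bitsandbytes checkpoint. A sketch of the engine arguments, assuming a vLLM build with bitsandbytes support that accepts `quantization="bitsandbytes"` together with `load_format="bitsandbytes"` (check the docs for your installed version). `bnb_engine_kwargs` is a hypothetical helper, and the actual `LLM(...)` call is left commented out because it needs a GPU and the `vllm` package:

```python
def bnb_engine_kwargs(model_id: str) -> dict:
    """Build keyword arguments for loading a pre-quantized bitsandbytes model."""
    return {
        "model": model_id,
        # Assumption: both settings are needed for pre-quantized bnb checkpoints.
        "quantization": "bitsandbytes",
        "load_format": "bitsandbytes",
    }


kwargs = bnb_engine_kwargs("unsloth/gemma-2-27b-it-bnb-4bit")
print(kwargs)

# Uncomment on a machine with vLLM and a suitable GPU:
# from vllm import LLM
# llm = LLM(**kwargs)
```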
