CUDA out of memory error during fp8 to bf16 model conversion + fix

#17
by sszymczyk - opened

When using the fp8_cast_bf16.py script to convert the model to bf16, I got the following exception during conversion:

Traceback (most recent call last):
  File "/mnt/md0/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/4c1f24cc10a2a1894304c7ab52edd9710c047571/inference/fp8_cast_bf16.py", line 81, in <module>
    main(args.input_fp8_hf_path, args.output_bf16_hf_path)
  File "/mnt/md0/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/4c1f24cc10a2a1894304c7ab52edd9710c047571/inference/fp8_cast_bf16.py", line 37, in main
    current_state_dict = load_file(safetensor_file, device="cuda")
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phm/.local/opt/miniconda3/envs/deepseek3/lib/python3.11/site-packages/safetensors/torch.py", line 315, in load_file
    result[k] = f.get_tensor(k)
                ^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 6.81 MiB is free. Including non-PyTorch memory, this process has 23.63 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 35.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

It always happened after converting the first 14 safetensor files. I have a single RTX 4090 GPU.

Anyway, I fixed the problem with the following changes:

--- fp8_cast_bf16.py.bak	2024-12-30 18:42:39.237812762 +0100
+++ fp8_cast_bf16.py	2024-12-30 18:57:43.448315873 +0100
@@ -26,14 +26,14 @@
         file_name = weight_map[tensor_name]
         if file_name not in loaded_files:
             file_path = os.path.join(fp8_path, file_name)
-            loaded_files[file_name] = load_file(file_path, device="cuda")
+            loaded_files[file_name] = load_file(file_path, device="cpu")
         return loaded_files[file_name][tensor_name]
 
     safetensor_files = list(glob(os.path.join(fp8_path, "*.safetensors")))
     safetensor_files.sort()
     for safetensor_file in tqdm(safetensor_files):
         file_name = os.path.basename(safetensor_file)
-        current_state_dict = load_file(safetensor_file, device="cuda")
+        current_state_dict = load_file(safetensor_file, device="cpu")
         loaded_files[file_name] = current_state_dict
         
         new_state_dict = {}
@@ -46,7 +46,7 @@
                     # Get scale_inv from the correct file
                     scale_inv = get_tensor(scale_inv_name)
                     fp8_weight_names.append(weight_name)
-                    new_state_dict[weight_name] = weight_dequant(weight, scale_inv)
+                    new_state_dict[weight_name] = weight_dequant(weight.cuda(), scale_inv.cuda()).cpu()
                 except KeyError:
                     print(f"Warning: Missing scale_inv tensor for {weight_name}, skipping conversion")
                     new_state_dict[weight_name] = weight

With these changes the conversion was successful.
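
For reference, the overall pattern the patch implements boils down to something like the sketch below. It is a minimal sketch rather than the full script: lookup_scale_inv stands in for the script's own helper that fetches a tensor from whichever shard contains it, and the kernel import path for weight_dequant is assumed. Shards stay in system RAM, and only one weight/scale pair visits the GPU at a time:

import torch
from safetensors.torch import load_file, save_file
from kernel import weight_dequant  # same dequant helper the script already uses (assumed import path)

def dequant_shard(fp8_file, bf16_file, lookup_scale_inv):
    # Keep the whole shard in system RAM instead of VRAM.
    state_dict = load_file(fp8_file, device="cpu")
    new_state_dict = {}
    for name, weight in state_dict.items():
        if name.endswith("_scale_inv"):
            continue  # scale tensors are consumed below, not copied over
        if weight.element_size() == 1:  # FP8 tensor: one byte per element
            scale_inv = lookup_scale_inv(f"{name}_scale_inv")
            # Only this weight/scale pair lives on the GPU; the BF16 result
            # comes straight back to the CPU.
            new_state_dict[name] = weight_dequant(weight.cuda(), scale_inv.cuda()).cpu()
        else:
            new_state_dict[name] = weight
    torch.cuda.empty_cache()  # release the temporary GPU buffers
    save_file(new_state_dict, bf16_file)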

I've run into a similar issue trying to dequant on an RTX 4080 with 16 GB.
The FP8 weights plus the BF16 output were just too big to fit in VRAM.
I ended up creating a safetensor splitter, which goes file by file and splits each safetensors file by model layer into smaller chunks.

Here's the tool:
https://github.com/csabakecskemeti/ai_utils/blob/main/safetensor_splitter.py
I guess this is a different solution to the same problem.
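
For illustration, the splitting idea amounts to roughly the following (a rough sketch, not the linked tool itself): bucket tensors by their layer prefix and write each bucket to its own, smaller shard. The model.layers.N. prefix convention is an assumption based on the usual Hugging Face naming:

import os
import re
from collections import defaultdict
from safetensors.torch import load_file, save_file

def split_by_layer(src_file, out_dir):
    # Load the big shard into system RAM and bucket tensors by layer index.
    state_dict = load_file(src_file, device="cpu")
    groups = defaultdict(dict)
    for name, tensor in state_dict.items():
        m = re.match(r"model\.layers\.(\d+)\.", name)
        key = f"layer_{m.group(1)}" if m else "other"
        groups[key][name] = tensor
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_file))[0]
    # Each bucket becomes its own, much smaller safetensors file.
    for key, tensors in groups.items():
        save_file(tensors, os.path.join(out_dir, f"{base}.{key}.safetensors"))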
