CUDA out of memory error during fp8 to bf16 model conversion + fix

#17
by sszymczyk - opened

When using the fp8_cast_bf16.py script to convert the model to bf16, I got the following exception during conversion:

Traceback (most recent call last):
  File "/mnt/md0/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/4c1f24cc10a2a1894304c7ab52edd9710c047571/inference/fp8_cast_bf16.py", line 81, in <module>
    main(args.input_fp8_hf_path, args.output_bf16_hf_path)
  File "/mnt/md0/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/4c1f24cc10a2a1894304c7ab52edd9710c047571/inference/fp8_cast_bf16.py", line 37, in main
    current_state_dict = load_file(safetensor_file, device="cuda")
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phm/.local/opt/miniconda3/envs/deepseek3/lib/python3.11/site-packages/safetensors/torch.py", line 315, in load_file
    result[k] = f.get_tensor(k)
                ^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 6.81 MiB is free. Including non-PyTorch memory, this process has 23.63 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 35.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

It always happened after converting the first 14 safetensor files. I have a single RTX 4090 GPU.

Anyway, I fixed the problem with the following changes:

--- fp8_cast_bf16.py.bak	2024-12-30 18:42:39.237812762 +0100
+++ fp8_cast_bf16.py	2024-12-30 18:57:43.448315873 +0100
@@ -26,14 +26,14 @@
         file_name = weight_map[tensor_name]
         if file_name not in loaded_files:
             file_path = os.path.join(fp8_path, file_name)
-            loaded_files[file_name] = load_file(file_path, device="cuda")
+            loaded_files[file_name] = load_file(file_path, device="cpu")
         return loaded_files[file_name][tensor_name]
 
     safetensor_files = list(glob(os.path.join(fp8_path, "*.safetensors")))
     safetensor_files.sort()
     for safetensor_file in tqdm(safetensor_files):
         file_name = os.path.basename(safetensor_file)
-        current_state_dict = load_file(safetensor_file, device="cuda")
+        current_state_dict = load_file(safetensor_file, device="cpu")
         loaded_files[file_name] = current_state_dict
         
         new_state_dict = {}
@@ -46,7 +46,7 @@
                     # Get scale_inv from the correct file
                     scale_inv = get_tensor(scale_inv_name)
                     fp8_weight_names.append(weight_name)
-                    new_state_dict[weight_name] = weight_dequant(weight, scale_inv)
+                    new_state_dict[weight_name] = weight_dequant(weight.cuda(), scale_inv.cuda()).cpu()
                 except KeyError:
                     print(f"Warning: Missing scale_inv tensor for {weight_name}, skipping conversion")
                     new_state_dict[weight_name] = weight

With these changes the conversion was successful.
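
For reference, the overall pattern the patch implements boils down to something like the sketch below. It is a minimal sketch rather than the full script: lookup_scale_inv stands in for the script's own helper that fetches a tensor from whichever shard contains it, and the kernel import path for weight_dequant is assumed. Shards stay in system RAM, and only one weight/scale pair visits the GPU at a time:

import torch
from safetensors.torch import load_file, save_file
from kernel import weight_dequant  # same dequant helper the script already uses (assumed import path)

def dequant_shard(fp8_file, bf16_file, lookup_scale_inv):
    # Keep the whole shard in system RAM instead of VRAM.
    state_dict = load_file(fp8_file, device="cpu")
    new_state_dict = {}
    for name, weight in state_dict.items():
        if name.endswith("_scale_inv"):
            continue  # scale tensors are consumed below, not copied over
        if weight.element_size() == 1:  # FP8 tensor: one byte per element
            scale_inv = lookup_scale_inv(f"{name}_scale_inv")
            # Only this weight/scale pair lives on the GPU; the BF16 result
            # comes straight back to the CPU.
            new_state_dict[name] = weight_dequant(weight.cuda(), scale_inv.cuda()).cpu()
        else:
            new_state_dict[name] = weight
    torch.cuda.empty_cache()  # release the temporary GPU buffers
    save_file(new_state_dict, bf16_file)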

I've run into a similar issue trying to dequant on an RTX 4080 with 16 GB.
The FP8 weights plus the BF16 output were just too big to fit in VRAM.
I ended up creating a safetensor splitter, which goes file by file and splits each safetensors file by model layer into smaller chunks.

Here's the tool:
https://github.com/csabakecskemeti/ai_utils/blob/main/safetensor_splitter.py
I guess this is a different solution to the same problem.
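
For illustration, the splitting idea amounts to roughly the following (a rough sketch, not the linked tool itself): bucket tensors by their layer prefix and write each bucket to its own, smaller shard. The model.layers.N. prefix convention is an assumption based on the usual Hugging Face naming:

import os
import re
from collections import defaultdict
from safetensors.torch import load_file, save_file

def split_by_layer(src_file, out_dir):
    # Load the big shard into system RAM and bucket tensors by layer index.
    state_dict = load_file(src_file, device="cpu")
    groups = defaultdict(dict)
    for name, tensor in state_dict.items():
        m = re.match(r"model\.layers\.(\d+)\.", name)
        key = f"layer_{m.group(1)}" if m else "other"
        groups[key][name] = tensor
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_file))[0]
    # Each bucket becomes its own, much smaller safetensors file.
    for key, tensors in groups.items():
        save_file(tensors, os.path.join(out_dir, f"{base}.{key}.safetensors"))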
