Error using multi-gpu support

#26
by bobwhiterabbit - opened

The following code (the example from the model card), which is meant to load the model across multiple GPUs (the model doesn't fit on a single one of my RTX 3090s), doesn't work. The embedding model is still loaded into RAM.

import torch.nn.functional as F
from transformers import AutoModel
from torch.nn import DataParallel


# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?',
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained("nvidia_nv-embed-v1", trust_remote_code=True)
for module_key, module in model._modules.items():
    model._modules[module_key] = DataParallel(module)

# get the embeddings
max_length = 4096

# get the embeddings with DataLoader (splitting the dataset into multiple mini-batches)
batch_size = 5
query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length)
passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

I also tried this, but it does not run on multiple GPUs; it still runs on a single GPU. Why is that, and how can I make it run on multiple GPUs?

Thanks for the question. The example below shows the full script for a multi-GPU implementation.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from torch.nn import DataParallel

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/NV-Embed-v1', trust_remote_code=True)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

for module_key, module in model._modules.items():
    model._modules[module_key] = DataParallel(module)

# get the embeddings
max_length = 4096

# get the embeddings with DataLoader (splitting the dataset into multiple mini-batches)
batch_size = 2
query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length, num_workers=32)
passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length, num_workers=32)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

Hi Nada5,

Thanks for the sample code above! I have some follow-up questions. I'm new to NLP and to this particular model, so please correct me if I'm not making sense.

  • The code above uses DataParallel for multi-GPU runs. As I understand it, during a forward pass DataParallel replicates the model onto every assigned GPU and splits the input so that each GPU handles a portion of it.
  • In my case, I have several 40 GB GPUs. I have observed that loading the model alone occupies 31400 MB of GPU memory. Encoding even the simplest sentence (for example, a one-liner) then results in a GPU out-of-memory error. So what I'm looking for is a way to load the model across multiple small GPUs, something like device_map='auto'. However, NVEmbedModel does not support device_map='auto'.
  • Is there a plan to support device_map='auto'? If not, what are the alternative approaches for loading the model across multiple small GPUs?
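
To make the distinction in the bullets above concrete, here is a rough back-of-the-envelope sketch in plain Python (no GPU needed). It assumes that DataParallel keeps a full copy of the weights on every GPU and splits the batch into near-equal chunks, while a sharded load (the device_map='auto' style) splits the weights evenly. The 31400 MB figure is the one reported above; the function names and the even-split assumption are illustrative only, and real placements will be less uniform.

```python
import math


def data_parallel_chunks(batch_size, n_gpus):
    """Approximate how a batch is split across GPUs (ceil-sized chunks, similar to torch.chunk)."""
    if batch_size <= 0 or n_gpus <= 0:
        return []
    chunk = math.ceil(batch_size / n_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def weights_per_gpu_mb(model_mb, n_gpus, sharded):
    """Weight memory per GPU: a full replica for data parallelism, an even slice when sharded."""
    return model_mb / n_gpus if sharded else float(model_mb)


MODEL_MB = 31400     # weight footprint reported in the post above
GPU_MB = 40 * 1024   # a nominal 40 GB card (assumption for illustration)

for n in (1, 2, 4):
    replica = weights_per_gpu_mb(MODEL_MB, n, sharded=False)
    shard = weights_per_gpu_mb(MODEL_MB, n, sharded=True)
    print(f"{n} GPU(s): replica leaves {GPU_MB - replica:.0f} MB free, "
          f"shard leaves {GPU_MB - shard:.0f} MB free per card")

# A batch of 5 over 4 GPUs: the last GPU receives nothing in this approximation.
print(data_parallel_chunks(5, 4))
```

Under these assumptions, the free memory per card does not grow as GPUs are added with data parallelism, which is consistent with the out-of-memory behaviour described above; only sharding shrinks the per-card weight footprint.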

Thank you in advance

Hi Nada5, thanks for your answer, but the code sample you provided doesn't work for multi-GPU use. It places the whole model on the first GPU, and since my cards are 24 GB each, I get a CUDA out-of-memory error. What's needed to use multiple GPUs is mostly what Clonylu said: support for device_map='auto'. Thanks!

Hi, this might be a bit too late for you (@bobwhiterabbit).

However, I encountered the same problem when trying to load nvidia/NV-Embed-v2 across multiple GPUs, and here's a solution that worked well for me.

Posting this here in case it helps others in the future.

I was able to load the model on multiple GPUs with the following code (you can check the memory usage of each card with "nvidia-smi" on Linux environments):

# Load model
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True, device_map='auto')
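
If you would rather check the per-card memory from Python than eyeball the nvidia-smi table, something like the sketch below may help. It shells out to nvidia-smi with its CSV query flags and parses the result; the helper names are my own, and it simply returns an empty list when nvidia-smi is not on the PATH.

```python
import shutil
import subprocess

# Ask nvidia-smi for plain CSV rows: index, used MiB, total MiB.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV lines into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a data row
        try:
            index, used, total = (int(p) for p in parts)
        except ValueError:
            continue  # skip rows with non-numeric fields such as [N/A]
        rows.append((index, used, total))
    return rows


def gpu_memory():
    """Return per-GPU memory usage, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


for index, used, total in gpu_memory():
    print(f"GPU {index}: {used} / {total} MiB used")
```

After a sharded load you would expect every listed GPU to show a share of the weights rather than one card holding everything.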

You can also pass a dict as device_map to control how the model is spread across GPUs, but this caused issues for me with the transformers version specified in the model's requirements.
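
A dict passed as device_map has to name the model's internal modules, which depends on the remote code. A different, coarser knob that transformers accepts in from_pretrained is max_memory, which caps how much each device may receive when device_map='auto' decides the placement. The helper below is a hypothetical convenience for building that dict. Whether NV-Embed's custom code honours it in your transformers version is an assumption to verify, and the 20 GiB cap is only an example value.

```python
def build_max_memory(n_gpus, per_gpu_gib, cpu_gib=None):
    """Build a max_memory mapping: GPU index -> 'NGiB', plus an optional 'cpu' entry."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    limits = {gpu: f"{per_gpu_gib}GiB" for gpu in range(n_gpus)}
    if cpu_gib is not None:
        limits["cpu"] = f"{cpu_gib}GiB"
    return limits


max_memory = build_max_memory(n_gpus=4, per_gpu_gib=20)
print(max_memory)

# Intended use (not executed here, since it downloads the full model):
# model = AutoModel.from_pretrained(
#     "nvidia/NV-Embed-v2",
#     trust_remote_code=True,
#     device_map="auto",
#     max_memory=max_memory,
# )
```

Leaving a few GiB of headroom below each card's capacity gives the activations somewhere to go during encoding.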

Sample function:

import os

# Set these before importing torch so they are in place when CUDA is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # reduce allocator fragmentation
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # Use multiple GPUs

import time
import pandas as pd
import torch
from transformers import AutoModel
from torch.nn import DataParallel
import torch.nn.functional as F

def create_embeddings(texts_dict, instruction, batch_size=8, max_length=512):
    """
    Generate embeddings for a given dictionary of texts using a pre-trained model.

    Args:
        texts_dict (dict): Dictionary with identifiers as keys and texts to embed as values.
        instruction (str): Task-specific prompt if supported by the model's encode method.
        batch_size (int, optional): Number of texts to process per batch (default 8).
        max_length (int, optional): Max token length for each text (default 512).

    Returns:
        tuple: A tuple containing:
            - embeddings (list): List of numpy arrays with embeddings.
            - df_timings (pd.DataFrame): DataFrame with the time taken for each batch.
    """
    embeddings = []
    timings = []

    # Load model
    model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True, device_map='auto', torch_dtype=torch.float16)
    
    try:
        print("Attempting to wrap model with DataParallel")
        model = DataParallel(model)  # Enable multi-GPU support
        print("Model successfully wrapped with DataParallel")
    except RuntimeError as e:
        print(f"Failed to wrap model with DataParallel: {e}")
        print("Falling back to CPU...")
        device = torch.device("cpu")
        model.to(device)
        print("Model moved to CPU.")
    
    # Start timing the total execution
    total_start_time = time.time()
    
    texts = list(texts_dict.values())
    values = list(texts_dict.keys())

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        batch_values = values[i:i + batch_size]
        
        start_time = time.time()
        try:
            with torch.no_grad():
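                # model.module is the underlying model (the DataParallel wrapper is bypassed);
                # fall back to model itself if wrapping failed above.
                encoder = getattr(model, "module", model)
                embeddings_batch = encoder.encode(batch_texts, instruction=instruction, max_length=max_length)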
                
                # Normalize the embeddings
                embeddings_batch = F.normalize(embeddings_batch, p=2, dim=1)
                embeddings.append(embeddings_batch.cpu().numpy())
        except Exception as e:
            print(f'No embeddings created for this batch! Exception: {e}')
        
        end_time = time.time()
        duration = end_time - start_time
        timings.append({'batch_values': batch_values, 'seconds': duration})
        print(f"Time taken for batch {batch_values}: {duration:.2f} seconds")
    
    # End timing the total execution
    total_end_time = time.time()
    total_duration = total_end_time - total_start_time
    print(f"Total time taken for embedding generation: {total_duration:.2f} seconds")
    
    df_timings = pd.DataFrame(timings)
    return embeddings, df_timings
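
One thing to watch in the function above: a failed batch is skipped, so the returned list can end up shorter than the input and no longer line up with the dictionary keys. A minimal sketch of one way to keep identifiers attached to their embeddings follows. It is plain Python with a stand-in encode_fn (hypothetical, used in place of the model's encode method) so that it runs without a GPU.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_with_ids(texts_dict, encode_fn, batch_size=8):
    """Return ({id: embedding}, [failed ids]) so results stay aligned with inputs."""
    results = {}
    failed = []
    for batch in iter_batches(list(texts_dict.items()), batch_size):
        ids = [key for key, _ in batch]
        texts = [text for _, text in batch]
        try:
            vectors = encode_fn(texts)
        except Exception as exc:  # keep going, but remember what was lost
            print(f"Batch {ids} failed: {exc}")
            failed.extend(ids)
            continue
        results.update(zip(ids, vectors))
    return results, failed


def fake_encode(texts):
    """Stand-in encoder: one tiny vector per text; rejects empty strings."""
    if any(not t for t in texts):
        raise ValueError("empty text in batch")
    return [[float(len(t))] for t in texts]


docs = {"a": "judo", "b": "wrestling", "c": "", "d": "radiology"}
embedded, failed = embed_with_ids(docs, fake_encode, batch_size=2)
print(sorted(embedded), failed)
```

Returning the failed ids alongside the results makes it straightforward to retry just those texts, for example with a smaller batch size.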
