Slow inference CodeGen2 + PEFT
Hi,
I am using CodeGen2-1B, CodeGen2-3_7B and CodeGen2-7B, and inference is much slower than with other LLMs of the same size (e.g., CodeLlama-7b).
I used PEFT to fine-tune the models on a custom dataset, and I then run the fine-tuned models on a test set.
Here is the code I use to load the model using a local checkpoint containing the adapter (args.adapter_path):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, trust_remote_code=True, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, args.adapter_path).to(args.device)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
Then I simply loop over my test set and call model.generate. Here are the inference times I get on the same machine, using a single GPU and the same datasets:
CodeLlama-7b: ~11min (540 samples, max_new_tokens=64)
CodeGen2-7B: ~45min (540 samples, max_new_tokens=64)
Using CodeGen2, the inference time dramatically increases as I attempt to generate more tokens.
Am I missing something specific to CodeGen2 when loading the model?
Thanks for your help!
I think the overhead is expected. When you use PEFT models, especially LoRA, there is an inference overhead due to the LoRA layers: each adapted layer computes the frozen base weight path and the low-rank adapter path separately (see the figure below).
Because both paths are computed for every forward pass and their results are summed, the overhead can be quite considerable during generation.
However, you can overcome this by "merging" the adapter weights into the base model, since the LoRA update is just a low-rank factorization of a matrix that is added to the base weight.
You can therefore fold everything into a single weight matrix and retrieve the base model's performance. You can do that as follows:
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, trust_remote_code=True, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, args.adapter_path).to(args.device)
+ model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
More details about it here: https://huggingface.co/docs/peft/conceptual_guides/lora#merge-lora-weights-into-the-base-model
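To make the merging argument concrete, here is a minimal sketch of why the merged and unmerged forms give the same output. The sizes, the `alpha` scaling value, and all variable names are illustrative assumptions; this is not PEFT's internal implementation.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not CodeGen2's real dimensions).
d_out, d_in, r = 8, 6, 2
alpha = 4.0            # hypothetical LoRA alpha
scale = alpha / r      # usual LoRA scaling factor

W = torch.randn(d_out, d_in, dtype=torch.float64)  # frozen base weight
A = torch.randn(r, d_in, dtype=torch.float64)      # LoRA down-projection
B = torch.randn(d_out, r, dtype=torch.float64)     # LoRA up-projection
x = torch.randn(3, d_in, dtype=torch.float64)      # batch of 3 inputs

# Unmerged: base path and adapter path are computed separately, then summed.
y_unmerged = x @ W.T + scale * (x @ A.T) @ B.T

# Merged: fold the low-rank update into the weight once, then do one matmul.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

max_abs_diff = (y_unmerged - y_merged).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The merged weight has the same shape as the base weight, so after merging each adapted layer costs a single matrix multiplication again.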
@ybelkada I am reopening this thread because I still observe very slow inference for CodeGen2 models even after merging the LoRA layers into the base model. I do not observe this latency for other models such as Llama/CodeLlama/CodeT5+ when generating under identical settings.
As I increase the prompt length, inference becomes dramatically slower. For instance, it takes about 7 to 8 hours to complete 628 generations, whereas CodeLlama-7b-hf takes about 30 minutes.
Here's my code (I do not paste everything here as I do not think the problem stems from the data preprocessing or postprocessing):
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, trust_remote_code=True, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, args.adapter_path).to(args.device)
model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
# preprocessing ...
# prediction loop ...
generated_sequences = model.generate(
    input_ids=sample["input_ids"].to(args.device),
    num_beams=10,
    num_return_sequences=10,
    max_new_tokens=args.max_target_length,
    stopping_criteria=StoppingCriteriaList(
        [EndOfFunctionCriteria(sample["input_ids"].shape[1], eof_string, tokenizer)]
    )
)
# postprocessing ...
I do not do batch generation, I use a single GPU, and args.max_target_length = 128. I get about the same latency for all of the CodeGen2-1B, CodeGen2-3_7B and CodeGen2-7B models.
Could it be because of the custom modeling script?
Hi @martiwey
Hmm, yes, that could be it. What is the relative difference between the custom and non-custom models?
What we usually do in this case is profile the generation script with the torch profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html and try to identify which operation creates the overhead.
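A minimal profiling sketch along those lines is shown here. The `profile_call` helper and the small `torch.nn.Linear` stand-in workload are hypothetical; in the real script you would pass `model.generate` and its arguments instead.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_call(fn, *args, **kwargs):
    """Run fn once under the torch profiler and return the profiler object."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        # Also record GPU kernels when a CUDA device is present.
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        with torch.no_grad():
            fn(*args, **kwargs)
    return prof


# Stand-in workload (assumption): replace with model.generate(...) in practice.
layer = torch.nn.Linear(64, 64)
inputs = torch.randn(8, 64)

prof = profile_call(layer, inputs)

# Operators sorted by total CPU time; look for unexpected entries such as
# aten::copy_ or aten::to, which often indicate host/device transfers.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

Profiling a single `generate` call on one representative sample is usually enough to see which operators dominate, and it keeps the trace small.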
One common source of overhead is CPU->GPU device placement. For example, here: https://huggingface.co/Salesforce/codegen2-7B/blob/main/modeling_codegen.py#L62 I can see that a tensor is created on the CPU (the default device) and then potentially moved to the GPU.
Another possible improvement would be to replace these lines: https://huggingface.co/Salesforce/codegen2-7B/blob/main/modeling_codegen.py#L157-L181 with torch.nn.functional.scaled_dot_product_attention from PyTorch: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html. You can create your own fork of this model and implement these changes (and fix the issue described above).
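As a rough illustration of that replacement, the sketch below compares a hand-written causal attention with `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later). The `manual_causal_attention` function is a simplified stand-in written for this example, not a copy of the CodeGen2 code: it assumes no padding mask, no head mask, no dropout, and equal query and key lengths.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes (assumptions): (batch, heads, seq_len, head_dim).
batch, heads, seq_len, head_dim = 2, 4, 5, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)


def manual_causal_attention(q, k, v):
    """Simplified hand-written causal attention (hypothetical stand-in)."""
    scores = (q @ k.transpose(-1, -2)) / math.sqrt(q.size(-1))
    # Lower-triangular mask: position i may only attend to positions <= i.
    causal = torch.tril(
        torch.ones(q.size(-2), k.size(-2), dtype=torch.bool, device=q.device)
    )
    scores = scores.masked_fill(~causal, torch.finfo(scores.dtype).min)
    return torch.softmax(scores, dim=-1) @ v


out_manual = manual_causal_attention(q, k, v)

# Fused equivalent: PyTorch picks an optimized kernel when one is available.
out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(out_manual, out_sdpa, atol=1e-4))
```

In a real fork, the padding mask and the cached-decoding case (where the query is shorter than the keys) need separate handling, for instance by passing an explicit `attn_mask` instead of `is_causal=True`.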
Let me know how that goes!
Hi @ybelkada ,
Thanks for the feedback. I ran the torch profiler on model.generate() and spotted high CPU time spent on CPU->GPU data transfer (aten::copy_).
I believe the following code fixes the issue at #L62, as aten::copy_ disappeared from the profiler report:
def fixed_pos_embedding(x, seq_dim=1, seq_len=None):
    dim = x.shape[-1]
    if seq_len is None:
        seq_len = x.shape[seq_dim]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=x.device) / dim))
    sinusoid_inp = (
        torch.einsum("i , j -> i j", torch.arange(seq_len, dtype=torch.float, device=x.device), inv_freq).float()
    )
    return torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)
However, the overall execution time remains identical. I am going to continue investigating the issue.
OK thanks! Let me know how it goes