Slow generation with code from the description
I ran the code from the description on 2 x T4 on Kaggle, but generation was only about 3 tokens per second. Is that expected?
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

MODEL_NAME = "IlyaGusev/saiga_mistral_7b"
DEFAULT_MESSAGE_TEMPLATE = "<s>{role}\n{content}</s>"
DEFAULT_RESPONSE_TEMPLATE = "<s>bot\n"
DEFAULT_SYSTEM_PROMPT = "Ты — Сайга, русскоязычный автоматический ассистент. Ты разговариваешь с людьми и помогаешь им."


class Conversation:
    def __init__(
        self,
        message_template=DEFAULT_MESSAGE_TEMPLATE,
        system_prompt=DEFAULT_SYSTEM_PROMPT,
        response_template=DEFAULT_RESPONSE_TEMPLATE
    ):
        self.message_template = message_template
        self.response_template = response_template
        self.messages = [{
            "role": "system",
            "content": system_prompt
        }]

    def add_user_message(self, message):
        self.messages.append({
            "role": "user",
            "content": message
        })

    def add_bot_message(self, message):
        self.messages.append({
            "role": "bot",
            "content": message
        })

    def get_prompt(self, tokenizer):
        final_text = ""
        for message in self.messages:
            message_text = self.message_template.format(**message)
            final_text += message_text
        final_text += DEFAULT_RESPONSE_TEMPLATE
        return final_text.strip()


def generate(model, tokenizer, prompt, generation_config):
    data = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    data = {k: v.to(model.device) for k, v in data.items()}
    output_ids = model.generate(
        **data,
        generation_config=generation_config
    )[0]
    output_ids = output_ids[len(data["input_ids"][0]):]
    output = tokenizer.decode(output_ids, skip_special_tokens=True)
    return output.strip()


config = PeftConfig.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(
    model,
    MODEL_NAME,
    torch_dtype=torch.float16
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
print(generation_config)

inputs = ["Почему трава зеленая?", "Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч"]
for inp in inputs:
    conversation = Conversation()
    conversation.add_user_message(inp)
    prompt = conversation.get_prompt(tokenizer)
    output = generate(model, tokenizer, prompt, generation_config)
    print(inp)
    print(output)
    print()
    print("==============================")
    print()
I also ran into slow generation when using 8-bit quantization (load_in_8bit=True).
If I set it to False, that is, use 16-bit weights, the model runs several times faster.
Why is that?
4:30 for 2 questions on a laptop with an RTX 4070 (top_k=2).
Changing load_in_8bit did not help.
The same model, but the Q4_K_M version from TheBloke, takes 5-20 seconds per answer.
I have not tried it on CPU.
Slower generation with load_in_8bit=True is expected. If you quantize even more aggressively (4 bits; for fun you can even try 2 bits), it will be slower still. The bitsandbytes authors have written themselves that it is slower.
I found this out at the end of a Kaggle competition and was pretty shocked.
If I recall correctly, device_map='auto' shards the model by layers across the 2 GPUs (pipeline parallelism). This means that only one GPU can be active at a time, because each layer's input depends on the previous layer's output. Combine that with the quantization overhead (quantization saves memory but actually hurts performance), and the result is not surprising.
You should either implement a better parallelism strategy or choose the P100.
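One way to avoid layer-wise sharding is to place the whole model on a single GPU. The following is only a minimal sketch, assuming the unquantized fp16 weights fit on one card. `single_gpu_device_map` and `load_on_single_gpu` are hypothetical helper names, not part of the original code.

```python
def single_gpu_device_map(gpu_index=0):
    # The empty-string key stands for the root module, so the whole model
    # is assigned to one device and no layers are split across GPUs.
    return {"": gpu_index}


def load_on_single_gpu(base_model_name, adapter_name, gpu_index=0):
    # Heavy imports stay inside the function so that defining the helpers
    # does not require a GPU or trigger any download.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,  # unquantized half precision
        device_map=single_gpu_device_map(gpu_index),
    )
    model = PeftModel.from_pretrained(
        model, adapter_name, torch_dtype=torch.float16
    )
    return model.eval()
```

With the names from the original script, the call would look like `load_on_single_gpu(config.base_model_name_or_path, MODEL_NAME)`.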
Serious necroposting, but I found this: https://huggingface.co/docs/transformers/v4.36.1/en/quantization#bitsandbytes. The chart is from there.
As we can see, in terms of speed BnB is just garbage compared to fp16, especially at large batch sizes. I am almost sure this is made worse by the silly device_map='auto'.
Except that chart shows throughput, not speed. Speed is on the neighboring chart, and it is roughly the same for everything except AWQ.
Otherwise, I answered here: https://huggingface.co/IlyaGusev/saiga2_13b_lora/discussions/3#65bbf64531e7709efb75d47c
Thank you. I tried the P100, but got approximately 18 tokens per second. That is much better, but I need at least 30 tokens per second. What should I do? It looks like I need to load the model some other way, or even take a version from TheBloke, for example.
Can I ask what your goal is? If you just need to run the model on Kaggle, you could try a TPU instead of a GPU (Kaggle provides them for free). I actually have a pretty good notebook for loading the model in JAX on a TPU (JAX is faster than PyTorch). This should give you a speed of about 300 tokens/s.
If you want to use a Kaggle GPU, what is the sequence length of your prompts? You should probably be able to load the model without quantization in bfloat16/float16 format: a 7B model takes about 14 GB in bfloat16, which fits in the P100's 16 GB of memory. You could also try experimenting with flash attention.
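The 14 GB figure follows from simple arithmetic: the number of parameters times the bytes per parameter. A rough sketch, counting weights only (activations, the KV cache and framework overhead are ignored) and using a round 7 billion parameters as an assumption; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Compare half precision with 8-bit and 4-bit quantization.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(7_000_000_000, bits)} GB")
```

Quantization shrinks this number, which is why it helps when the model does not fit in memory, but as discussed above it does not make generation faster.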
My goal is to run a model on a GPU for my small project. Right now I am choosing the model itself, and I like this one. I decided to test it to understand what generation speed I can roughly expect. For this I used a Kaggle GPU, but it turned out that even on the P100 the unquantized model produces only about 18 tokens per second. Kaggle shows 14.7 GB of video memory in use.
Prompt:
Where are burried Charles Darwin and Charles Dickens?
Prompt has 7 tokens.
Answer:
Charles Darwin is buried in Westminster Abbey, London, England. His grave is located in the North Transept of the Abbey, near the graves of other notable scientists such as Isaac Newton and Michael Faraday.
Charles Dickens is also buried in Westminster Abbey, but his grave is located in the Poets' Corner, which is a special area dedicated to famous poets and writers. His grave is close to the graves of other famous authors such as Geoffrey Chaucer, William Shakespeare, and Robert Burns.
The answer has 122 tokens. The P100 spent approximately 7 seconds on the question, which is about 18 tokens per second.
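For reference, the tokens-per-second figure can be computed as below. This is a small sketch with hypothetical helper names (`tokens_per_second`, `timed`). Only newly generated tokens are counted, not the prompt. With exactly 7.0 seconds the result is a little under the rounded figure quoted above.

```python
import time


def tokens_per_second(num_new_tokens, elapsed_seconds):
    """Generation speed: new tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_new_tokens / elapsed_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Numbers from the measurement above: 122 tokens in roughly 7 seconds.
print(round(tokens_per_second(122, 7.0), 1))  # 17.4
```

In the original script, `timed(generate, model, tokenizer, prompt, generation_config)` would return the answer together with the elapsed time. The token count can then be taken by tokenizing the answer.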
Got it. There are still several tricks you can try to improve generation speed. For example, you can use torch.compile(), or flash attention 2.
The generation speed has increased to 20 tokens per second.
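The two tricks mentioned above (torch.compile() and Flash Attention 2) could be wired in roughly as follows. Treat this as a hedged sketch, not a tested recipe:

- It assumes a recent transformers release that accepts the `attn_implementation` argument.
- It assumes PyTorch 2.x for `torch.compile`.
- It assumes the `flash-attn` package is installed when Flash Attention 2 is selected.
- FlashAttention-2 targets Ampere-class GPUs (compute capability 8.x) and newer. The P100 (6.0) and T4 (7.5) would fall back to PyTorch's built-in scaled-dot-product attention here.
- `pick_attn_implementation` and `load_fast` are hypothetical helper names.

```python
def pick_attn_implementation(compute_capability_major):
    # Use FlashAttention-2 only on GPUs it targets (Ampere and newer);
    # otherwise fall back to the built-in SDPA attention.
    return "flash_attention_2" if compute_capability_major >= 8 else "sdpa"


def load_fast(model_name, gpu_index=0, compile_forward=False):
    # Heavy imports stay inside the function so that defining the helpers
    # does not require a GPU or trigger any download.
    import torch
    from transformers import AutoModelForCausalLM

    major, _minor = torch.cuda.get_device_capability(gpu_index)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map={"": gpu_index},  # keep the whole model on one GPU
        attn_implementation=pick_attn_implementation(major),
    )
    if compile_forward:
        # Compile the forward pass that generate() calls internally.
        # Whether this works or helps depends on the GPU and library versions.
        model.forward = torch.compile(model.forward)
    return model.eval()
```

The sketch loads only the base model. Applying the LoRA adapter on top, as in the original script, is left out to keep the example short.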