Add a mapping for the special token '<|im_end|>' to generation_config.json so that generation stops when <|im_end|> is encountered.
When running inference on 'Llama3-ChatQA-1.5-8B' with vLLM, generation continues after the special token '<|im_end|>' is produced, as shown in the figure below. This PR adds a mapping for '<|im_end|>' in the tokenizer.
'<|im_end|>' also needs to be configured in the tokenizer: https://huggingface.co/nvidia/Llama3-ChatQA-1.5-8B/discussions/16
Hi,
What prompt format are you using for our model? If you follow the prompt template we provide in the model card, you should not run into this issue.
The prompt in the above picture is as follows:
prompt: '<|im_start|>user\n24*(1.0824) = ?<|im_end|>\n<|im_start|>assistant\n'
prompt: '<|im_start|>user\n24(1.0824) = ?<|im_end|>\n<|im_start|>assistant\n24(1.0824) = ?<|im_end|>\n<|im_start|>system\n24(1.0824) = ?<|im_end|>\n<|im_start|>assistant (24(1.08*24) = ?<|im_end|>)<|im_end|>\n<|im_start|>user\n24(1.08^24)<|im_end|>\n<|im_start|>assistant\n'
I took these prompts from the vLLM Docker log.
I deployed Llama3-ChatQA-1.5-70B on my local machine with vLLM and called it through vLLM's OpenAI-compatible API. I also ran into the problem of generation not stopping at <|im_end|>, as shown below:
Hi,
This template is not correct for our models. We do not use the <|im_start|> and <|im_end|> tokens in training. Please refer to the prompt format or the sample code we provide in the model card.
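
For anyone hitting the same symptom through the OpenAI-compatible endpoint, a client-side stop string is one possible stopgap. It does not fix the template mismatch described above. The sketch below only builds the request body and sends nothing. `build_completion_request` is a hypothetical helper, the served model name is an assumption that depends on how the server was launched, and the prompt is a placeholder to be replaced with one built from the model card template. `stop` is the standard field of the OpenAI completions API.

```python
import json


def build_completion_request(prompt, model, stop=("<|im_end|>",), max_tokens=256):
    """Build a JSON-serializable body for a /v1/completions request.

    Hypothetical helper for illustration; it performs no network I/O.
    """
    return {
        "model": model,            # assumed served model name
        "prompt": prompt,          # should follow the model card template
        "max_tokens": max_tokens,  # hard cap in case no stop string appears
        "stop": list(stop),        # server cuts the output at these strings
    }


body = build_completion_request(
    prompt="<prompt in the model card format>",  # placeholder, not a real prompt
    model="nvidia/Llama3-ChatQA-1.5-8B",         # assumption: depends on deployment
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be posted as JSON to the server's completions route. Setting `max_tokens` as well gives a second bound on output length if the stop string never appears.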