Add a mapping for the special token '<|im_end|>' to generation_config.json so that generation stops when <|im_end|> is encountered.

#17
by zjyhf - opened

When running inference on 'Llama3-ChatQA-1.5-8B' with vLLM, generation continues after the special token '<|im_end|>' is produced, as shown in the figure below. This PR adds a mapping for '<|im_end|>' to generation_config.json.
'<|im_end|>' also needs to be configured in the tokenizer: https://huggingface.co/nvidia/Llama3-ChatQA-1.5-8B/discussions/16

8e4f01f676a0de25c1412b10172cfa9.png
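
The PR's actual diff is not reproduced in this thread, so the following is only a minimal sketch of the kind of change described: adding one more token id to the `eos_token_id` entry of a generation config. The helper name `add_eos_token_id` is hypothetical, and so is the id used for '<|im_end|>'. The real id depends on how the token was added to the tokenizer and should be looked up there.

```python
import json


def add_eos_token_id(generation_config: dict, extra_id: int) -> dict:
    """Return a copy of the config whose eos_token_id also contains extra_id."""
    current = generation_config.get("eos_token_id")
    # eos_token_id may be missing, a single int, or a list of ints.
    if current is None:
        ids = []
    elif isinstance(current, int):
        ids = [current]
    else:
        ids = list(current)
    if extra_id not in ids:
        ids.append(extra_id)
    return {**generation_config, "eos_token_id": ids}


# Placeholder values for illustration only; IM_END_ID is NOT taken from the PR.
IM_END_ID = 128256
config = {"bos_token_id": 128000, "eos_token_id": 128001}

updated = add_eos_token_id(config, IM_END_ID)
print(json.dumps(updated, indent=2))
```

The function returns a new dict rather than mutating its input, so the original config stays available for comparison.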

NVIDIA org

Hi,
What prompt format are you using for our model? If you follow the prompt template we provide in the model card, this issue should not occur.

The prompt in the above picture is as follows:
prompt: '<|im_start|>user\n24*(1.0824) = ?<|im_end|>\n<|im_start|>assistant\n'
prompt: '<|im_start|>user\n24*(1.08*24) = ?<|im_end|>\n<|im_start|>assistant\n24(1.0824) = ?<|im_end|>\n<|im_start|>system\n24(1.0824) = ?<|im_end|>\n<|im_start|>assistant (24(1.08*24) = ?<|im_end|>)<|im_end|>\n<|im_start|>user\n24(1.08^24)<|im_end|>\n<|im_start|>assistant\n'
I took this prompt from the vLLM Docker log.

I deployed Llama3-ChatQA-1.5-70B on my local machine with vLLM and used its OpenAI-compatible API. I also ran into the problem that generation does not stop at <|im_end|>, as shown below:
1715760520037.png
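
One possible client-side workaround, sketched here as an assumption rather than something proposed in this thread, is to pass '<|im_end|>' as a stop sequence in the request. The OpenAI-style completions API accepts a `stop` field, and vLLM's OpenAI-compatible server honors it. The helper name `build_completion_request`, the served model name, and the endpoint in the comment are illustrative. This only cuts the output off at the marker; it does not address whether the prompt template is correct for the model.

```python
import json


def build_completion_request(model: str, prompt: str, stop: list[str],
                             max_tokens: int = 256) -> dict:
    """Build the JSON body for an OpenAI-style /v1/completions request."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Generation is cut off as soon as one of these strings is produced.
        "stop": stop,
    }


payload = build_completion_request(
    model="nvidia/Llama3-ChatQA-1.5-70B",  # assumed name the server was started with
    prompt="<|im_start|>user\n24*(1.0824) = ?<|im_end|>\n<|im_start|>assistant\n",
    stop=["<|im_end|>"],
)
body = json.dumps(payload)
# The body would be POSTed to the server, e.g. http://localhost:8000/v1/completions
# (host and port are assumptions about the local deployment).
print(body)
```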

NVIDIA org

Hi,
This template is not correct for our models; we do not use the <|im_start|> and <|im_end|> tokens in training. Please refer to the prompt format or the sample code we provide in the model card.
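
For illustration, here is a minimal sketch of building a plain-text prompt without any <|im_start|> or <|im_end|> markers. It assumes the model card's format is of the form "System: ...", optional context, then alternating "User: ..." and "Assistant: ..." turns separated by blank lines and ending with "Assistant:". That layout, the default system message, and the helper name `format_chatqa_prompt` are assumptions here; the model card remains the authoritative source.

```python
def format_chatqa_prompt(messages, context="",
                         system="System: This is a chat between a user and "
                                "an artificial intelligence assistant."):
    """Join chat turns into one plain-text prompt ending with 'Assistant:'."""
    turns = []
    for message in messages:
        role = "User" if message["role"] == "user" else "Assistant"
        turns.append(f"{role}: {message['content']}")
    # The trailing "Assistant:" cues the model to write the next reply.
    conversation = "\n\n".join(turns) + "\n\nAssistant:"
    parts = [system]
    if context:
        parts.append(context)
    parts.append(conversation)
    return "\n\n".join(parts)


prompt = format_chatqa_prompt([{"role": "user", "content": "24*(1.0824) = ?"}])
print(prompt)
```

A prompt built this way can be sent to a raw completions endpoint, so that no chat template is applied on the server side.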

Ready to merge
This branch is ready to get merged automatically.