Chat template

#1
by ehartford - opened

Exciting model!

What is the chat template format?

@ehartford Thank you, it uses ChatML.

Do you plan to add the ChatML special tokens to the list of tokens? Or replace <s> and </s>?

https://huggingface.co/senseable/WestLake-7B-v2/blob/main/tokenizer_config.json

yes - there are no tokens for <|im_start|> and <|im_end|>

to get this working properly, you will need to retrain it with those tokens added, and <|im_end|> designated as the EOS token

if you like I can help you

as theodotus implies - there are two ways.

1: add a new token for <|im_end|> (with a new token id)

2: replace the </s> (token id 2) mapping with <|im_end|>

method 1 is easier
method 2 is more difficult but more compatible with merging and clients that are hardcoded to use token_id 2 as EOS
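
To make the difference between the two options concrete, here is a minimal sketch on a toy vocabulary. It is not the real tokenizer API: `toy_vocab`, `add_new_token` and `remap_token` are hypothetical names for illustration, and only the ids for <s> and </s> mirror the actual tokenizer.

```python
# Toy illustration only: a plain dict stands in for the tokenizer vocabulary.

def add_new_token(vocab, token):
    """Option 1: append the token with a fresh id (vocabulary grows by one)."""
    new_vocab = dict(vocab)
    new_vocab[token] = max(vocab.values()) + 1
    return new_vocab


def remap_token(vocab, old_token, new_token):
    """Option 2: reuse the id of old_token for new_token (size unchanged)."""
    new_vocab = dict(vocab)
    token_id = new_vocab.pop(old_token)
    new_vocab[new_token] = token_id
    return new_vocab


# Placeholder vocabulary; "hello" is an invented entry.
toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3}

option_1 = add_new_token(toy_vocab, "<|im_end|>")
option_2 = remap_token(toy_vocab, "</s>", "<|im_end|>")

print(option_1["<|im_end|>"], len(option_1))
print(option_2["<|im_end|>"], len(option_2))
```

With option 1 the vocabulary grows, so the model needs a matching new embedding row; with option 2 the size stays the same and anything hardcoded to token id 2 keeps working.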

The problem is that right now the model is trained to generate the string <|im_end|> rather than the EOS token, and it does that imperfectly (sometimes it generates <|im_end without the |> for instance)
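
Until that is fixed in training, a client-side workaround is to cut the generated text at the marker string. This is only a sketch under the assumption that you post-process the decoded text yourself; `trim_at_marker` is a hypothetical helper, not part of any library. It matches on the shared prefix <|im_end so it also catches the truncated form.

```python
def trim_at_marker(text, marker="<|im_end"):
    """Return text up to the first occurrence of marker, or unchanged if absent.

    Matching on the prefix "<|im_end" covers both the full "<|im_end|>"
    string and the truncated variant without the closing "|>".
    """
    idx = text.find(marker)
    return text if idx == -1 else text[:idx]


full = trim_at_marker("Hi there<|im_end|>\n<|im_start|>user")
truncated = trim_at_marker("Hi there<|im_end\nmore text")
untouched = trim_at_marker("Hi there")
```

This does not stop generation early, it only cleans up the output afterwards, so the model may still spend tokens past the marker.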

Yeah, I'm working on v3 where that will be addressed. @ehartford Appreciate the LASER work.

Looking forward to it! (and to finetuning it with Samantha!)

This model seems to work great using the config.json and tokenizer_config.json parameters from this one: https://huggingface.co/NurtureAI/OpenHermes-2.5-Mistral-7B-16k/tree/main

@senseable in a closed conversation, you mentioned the chat template was ChatML:

"chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

However, the above is the Zephyr prompt format, with the addition of <|im_end|> from what we have seen in practice. The correct format for ChatML is:

"chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",

Which one did you actually use for training?

I am working on updating the config json files to fix the eos problem, and I would like to make sure I have the correct format.
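
For comparison, here is roughly what the two templates quoted above produce, written as plain Python rather than Jinja. Treat it as a sketch: `format_chatml` and `format_zephyr` are hypothetical helper names, the "</s>" default for `eos_token` is an assumption, and the exact whitespace of the Zephyr-style output depends on how the Jinja template is rendered.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Mirror the ChatML template: <|im_start|>role\\ncontent<|im_end|>\\n per message."""
    prompt = ""
    for m in messages:
        prompt += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


def format_zephyr(messages, eos_token="</s>", add_generation_prompt=True):
    """Approximate the Zephyr-style template: <|role|>\\ncontent + eos_token per message."""
    prompt = ""
    for m in messages:
        # The template only handles these three roles; others are skipped.
        if m["role"] in ("user", "system", "assistant"):
            prompt += "<|" + m["role"] + "|>\n" + m["content"] + eos_token + "\n"
    # In the template the generation prompt is emitted on the last loop
    # iteration, so nothing is added when there are no messages.
    if add_generation_prompt and messages:
        prompt += "<|assistant|>\n"
    return prompt


chat = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]
chatml_prompt = format_chatml(chat)
zephyr_prompt = format_zephyr(chat)
print(chatml_prompt)
print(zephyr_prompt)
```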

@froggeric I trained using <|im_start|> and <|im_end|> but oddly it probably performs better with Alpaca.

I have now run a few tests with different prompt formats (ChatML, Zephyr, Alpaca, Mistral Instruct). I find that using Zephyr instead of ChatML often performs better, and it is not affected by the <|im_end|> problem. Alpaca works OK too, but has a few problems with tokens inserted in the conversation. The best results come from Mistral Instruct, which is not surprising as it is the underlying foundation; however, it suffers the most from token insertion.

Why don't you stick to the Mistral Instruct format for the v3 training? I think the best results should be achieved when using the same format as what was used for the base model.
