metadata
library_name: transformers
tags:
- gemma
- chatml
ChatML Tokenizer for Gemma
This repository includes a fast tokenizer for google/gemma-7b with the ChatML format. The tokenizer was created by replacing the string values of the original tokens with id 106 (<start_of_turn>) and 107 (<end_of_turn>) with the ChatML tokens <|im_start|> and <|im_end|>.
No new tokens were added during that process to ensure that the original model's embedding doesn't need to be modified.
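The renaming idea can be illustrated with a toy example. This is only a minimal sketch, assuming the vocabulary is a plain token-to-id mapping; `toy_vocab`, `RENAMES`, and `rename_tokens` are hypothetical names, and the real tokenizer files have a richer structure than this. Only the ids 106 and 107 and the four token strings come from the description above.

```python
# Toy vocabulary: token string -> id. The ids 106 and 107 are taken from the
# description above; the remaining entries are placeholders for illustration.
toy_vocab = {
    "<pad>": 0,
    "<eos>": 1,
    "<bos>": 2,
    "<start_of_turn>": 106,
    "<end_of_turn>": 107,
}

# Old token string -> new token string (hypothetical helper mapping).
RENAMES = {
    "<start_of_turn>": "<|im_start|>",
    "<end_of_turn>": "<|im_end|>",
}


def rename_tokens(vocab, renames):
    """Return a new vocab with token strings swapped and ids left unchanged."""
    return {renames.get(token, token): token_id for token, token_id in vocab.items()}


chatml_vocab = rename_tokens(toy_vocab, RENAMES)

print(chatml_vocab["<|im_start|>"], chatml_vocab["<|im_end|>"])  # 106 107
print(len(chatml_vocab) == len(toy_vocab))  # True
```

Because only the strings change, the vocabulary size and every id stay the same, which is why the embedding matrix can be reused as-is.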
Note: This tokenizer is not 100% ChatML compliant, since google/gemma-7b appears to always require the original <bos> token to be part of the input. This means the chat template is <bos> + chatml + <eos>.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <bos><|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>\n<eos>
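For readers who want to see the layout without downloading the tokenizer, the <bos> + chatml + <eos> structure can be mimicked in plain Python. This is a rough sketch that reproduces the printed output shown above; `render_chatml` is a hypothetical helper, not the template shipped with the tokenizer, and it does not cover generation prompts.

```python
BOS, EOS = "<bos>", "<eos>"
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def render_chatml(messages):
    """Render messages as <bos> + ChatML turns + <eos> (illustrative only)."""
    text = BOS
    for message in messages:
        # Each turn: start marker, role, newline, content, end marker, newline.
        text += f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
    return text + EOS


messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

rendered = render_chatml(messages)
print(rendered)
```

The printed string starts with <bos>, contains one <|im_start|>...<|im_end|> turn per message, and ends with <eos>, matching the layout described in the note.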
Test
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)
# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary size"
# tokenize messages
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")