---
library_name: transformers
tags: ["gemma","chatml"]
---

# ChatML Tokenizer for Gemma

This repository provides a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format. The tokenizer was created by replacing the string values of the original tokens with ids `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`.

No new tokens were added in the process, so the original model's embeddings do not need to be modified.
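
The effect of renaming tokens in place can be illustrated with a toy vocabulary. This is a minimal sketch, not the procedure used to build this tokenizer: `toy_vocab`, `renames`, `rename_tokens`, and the `hello` entry are hypothetical, and only ids `106` and `107` and their token strings are taken from the description above. It shows that swapping the strings leaves the ids and the vocabulary size unchanged.

```python
# Toy illustration only: a plain dict stands in for the real vocabulary.
toy_vocab = {
    "<start_of_turn>": 106,
    "<end_of_turn>": 107,
    "hello": 108,  # made-up entry, not the real id of any Gemma token
}

# Old token string -> new ChatML token string.
renames = {
    "<start_of_turn>": "<|im_start|>",
    "<end_of_turn>": "<|im_end|>",
}


def rename_tokens(vocab, renames):
    """Return a copy of `vocab` with token strings swapped and ids untouched."""
    return {renames.get(token, token): token_id for token, token_id in vocab.items()}


chatml_vocab = rename_tokens(toy_vocab, renames)
print(chatml_vocab)
```

Because every id keeps its position, each row of the embedding matrix still corresponds to the same id, which is why no resizing is needed.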

_Note: This tokenizer is not 100% ChatML compliant, because [google/gemma-7b](https://huggingface.co/google/gemma-7b) appears to always require the original `<bos>` token to be part of the input. The chat template is therefore `<bos>` + `chatml` + `<eos>`._

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

messages = [
  {"role": "system", "content": "You are Gemma."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <bos><|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>\n<eos>
```
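
To make the `<bos>` + `chatml` + `<eos>` layout explicit, the following plain-Python sketch rebuilds the same string without loading the tokenizer. It is an approximation, not the Jinja template shipped in this repository: `render_chatml` is a hypothetical helper, and it assumes each turn ends with `<|im_end|>` followed by a newline, as the example output suggests.

```python
def render_chatml(messages):
    """Build `<bos>` + ChatML turns + `<eos>` as a plain string (illustrative only)."""
    text = "<bos>"
    for message in messages:
        # Assumption: every turn is closed by `<|im_end|>` plus a newline.
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    return text + "<eos>"


messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

print(render_chatml(messages))
```

Reading the `\n` in the last comment line of the example output as a newline, this prints the same text as `tokenizer.apply_chat_template` does there.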

## Test

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "the tokenizers do not have the same vocabulary size"

# tokenize messages
messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

# render the same conversation with both chat templates and compare them side by side
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)

print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")
```