random Cyrillic?

#15
by fbjr - opened

Anyone else getting random Cyrillic in outputs? Can't tell if this is only on mlx yet, but when I use tokenizer.apply_tool_template, and add in code to the context window/prompt input, i start to get Cyrillic outputs at the end. It seems to only show up when there's markdown or code formatting in the context. Here's a sample output where it starts off well, then goes random Cyrillic at the tail end:

<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Explain this code:\nimport sys
import click
from mlx_lm import load, generate
import mlx.core as mx
from tool_utils import load_tools_from_file

<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the directly-answer tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:

[
    {
        "tool_name": title of the tool in the specification,
        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
    }
]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Action: ```json
[
    {
        "tool_name": "internet_search",
        "parameters": {
            "query": "what's funny about the code snippet import sys\nimport click\nfrom mx_lm import load, generate\nimport mx.core as mx\nfrom tool_utils import load_tools greetings_from_file\ndefault_temp =  случайно = 0.5

This looks like it might be an issue with either transformers or the tokenizer.json itself.

You can reproduce with:

from transformers import AutoTokenizer
 
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")
fbjr changed discussion title from random Cyrillic? to (edit: tokenizer.json inconsistencies) random Cyrillic?
Cohere For AI org
edited Apr 8, 2024

hi @fbjr , the difference between tokenizers.json is unicode encoding in tokenizer.json (command-r-plus). In the config<|END_OF_TURN_TOKEN|> token is also set as special because it is used as eos_token, which is also overwritten in the original tokenizer (command-r-plus) as well. Therefore, tokenizers should work the same. Can you post the text you tokenize to double-check?

fbjr changed discussion status to closed
fbjr changed discussion title from (edit: tokenizer.json inconsistencies) random Cyrillic? to random Cyrillic?

Sign up or log in to comment