Curious

#1
by BigHuggyD

Hey! Just curious, when you used it, did you have stuff like this pop up?

[control_715]

I seem to frequently get that, with different numbers, followed by the character's name and a colon.

Hey! I've never seen that come up but I've been using the model via exllamav2. I haven't tried running inference on the full weights directly.

That looks very similar to the padding token (https://huggingface.co/gghfez/Writer-Large-2411-v2.1/blob/main/special_tokens_map.json).

I've just tested "diffing" the tokenizer against the original from Mistral-Large-2411:

from transformers import AutoTokenizer

writer_tokenizer = AutoTokenizer.from_pretrained("gghfez/Writer-Large-2411-v2.1")
large_tokenizer = AutoTokenizer.from_pretrained("gghfez/Mistral-Large-Instruct-2411")

print("Writer vocab size:", len(writer_tokenizer))
print("Large vocab size:", len(large_tokenizer))
print("Writer special tokens:", writer_tokenizer.special_tokens_map)
print("Large special tokens:", large_tokenizer.special_tokens_map)

test_string = "Hello, world!"
print("Writer tokenization:", writer_tokenizer.encode(test_string))
print("Large tokenization:", large_tokenizer.encode(test_string))

Couldn't see a significant difference in the output:

Writer vocab size: 32768
Large vocab size: 32768
Writer special tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[control_746]'}
Large special tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Writer tokenization: [1, 16998, 29493, 2294, 29576]
Large tokenization: [1, 16998, 29493, 2294, 29576]

And Vocab:

writer_vocab = set(writer_tokenizer.get_vocab().keys())
large_vocab = set(large_tokenizer.get_vocab().keys())

# Find tokens unique to each
writer_unique = writer_vocab - large_vocab
large_unique = large_vocab - writer_vocab

print("Tokens unique to Writer:", sorted(list(writer_unique)))
print("Tokens unique to Large:", sorted(list(large_unique)))

Output:

Tokens unique to Writer: []
Tokens unique to Large: []

Which inference engine and quant did you try? And I don't suppose you could include a raw prompt so I could reproduce it?

I did an FP8 on vLLM
Using the out-of-the-box Mistral V7 prompt in SillyTavern
Was just curious if you saw anything like that in your testing before I started experimenting on my end.

I'm digging it so far. It takes writing style suggestions in the prompt quite well.

I did an FP8 on vLLM

You must have a beast of a rig to run that :D

I rented an H200, ran the model with vLLM, and managed to reproduce it using the vLLM chat-completions API with mikupad. It didn't happen when using text-completions (which I normally use). The first token it produced was something like [control_58], and the other 9 most probable tokens were all [control_] tokens as well.

Switching to vLLM's text-completions API didn't have this problem.

I also diff'd your tokenizer+vocab and found no differences (your FP8 quant seems fine).

For some reason I'm not able to reproduce it using TabbyAPI's chat-completions API with exl2. I'll have to investigate further and read up more on what these control tokens are used for when I have time (frantically deleting, or trying to complete and release, some of my 70%-complete experiments/projects before the Hugging Face billing period). I suspect it's going to be related to left- vs right-padding.
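
If you want to poke at it yourself, something like this (untested sketch; the port, API key and prompt are just placeholders) should show the top candidate tokens coming back from vLLM's OpenAI-compatible chat-completions endpoint:

# Untested sketch: ask vLLM's OpenAI-compatible chat-completions endpoint for
# the top candidates of the first generated token. If the issue reproduces,
# [control_*] tokens should show up in this list.
# The port, API key and prompt are just placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="gghfez/Writer-Large-2411-v2.1",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=10,  # show the 10 most probable first tokens
)

for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(cand.token, cand.logprob)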

I'm digging it so far. It takes writing style suggestions in the prompt quite well.

Glad to hear that; that's where most of my time on it went! Mistral-Large-2411 is actually a great base for writing.

I did an FP8 on vLLM

You must have a beast of a rig to run that :D

I got temporary shared access to some 4xH200 SXMs, so I'm having some fun with it 🀣

I didn't mean for you to spend much time on it. I was just curious if you saw anything yourself. I have done a few FP8s of 2407-based models, and this was the second 2411-based one but the first that had this behavior. So I wanted to point it out.

I got temporary shared access to some 4xH200 SXMs, so I'm having some fun with it 🀣

That's so awesome! Reminds me of a time I happened to get over 95% off on an AWS spot instance and... had to stay up all night to use it while I could lol

I didn't mean for you to spend much time on it. I was just curious if you saw anything yourself. I have done a few FP8s of 2407-based models, and this was the second 2411-based one but the first that had this behavior. So I wanted to point it out.

No, I really appreciate you pointing it out; it's much better if I'm aware and can figure it out before I spend another $400 on cloud GPUs for the next project :D

@BigHuggyD off-topic, but DeepSeek-V3 is out; apparently Sonnet-3.5 quality but > 600B params.
If you still have access to those GPUs, it could be a rare chance for you to try it.

@BigHuggyD off-topic, but DeepSeek-V3 is out; apparently Sonnet-3.5 quality but > 600B params.
If you still have access to those GPUs, it could be a rare chance for you to try it.

sssh πŸ‘€ lol ... Working on it... I have to share... the best I have been able to do is Llama 3.1 405B

Do you notice a difference in response quality at FP8 vs exl2 4.5/5bpw (or whatever you usually run)? Want to know how much I'm missing out on lol

I'm making progress with that control_ issue, by the way. I'll ping you when I have it fixed (going to rent an H200 to test your FP8 quant in vLLM, and possibly quant it to GGUF to test with llama.cpp first).

Luckily (for my wallet) it looks like only a couple of config changes are needed.

I have done only a little side-by-side with the same model, but I would say the difference is undetectable for me personally when judging it based on storytelling and chat. I 'feel' I can tell the difference below 5 bpw. I have been playing with Quant-LLM FP6, and I really do feel that is a great sweet spot between performance and accuracy.

It's great that you found that issue! It's interesting how it only shows up for me when the storytelling is changing characters.

I have been enjoying your model. I'm still trying to dial in my samplers because it seems more 'naturally' creative than the other 123B tunes and merges. I have a test character persona where I have to attempt to build trust, and most models want to buy me a house after 4 messages. This model is 250 messages in and has only recently come around.

It's interesting how it only shows up for me when the storytelling is changing characters.

That makes sense. I think it'll show up when the model tries to predict a token which isn't in its vocab.

I'm still trying to dial in my samplers because it seems more 'naturally' creative than the other 123B tunes and merges.

Yeah, it's right on the border, isn't it? I made several attempts trying to get that balance between creativity and coherence.

This model is 250 messages in and has only recently come around.

Glad to hear the 32k context training is paying off. The control-vectors helped me keep the characters more "stubborn". It always annoys me when you can easily convince a character to drop everything and do whatever you want them to do lol.

lol, exactly. As you have mentioned before, you can subtly steer outputs with your inputs once you know 'the game' … how a word choice affects the future probabilities. When the combination of model and prompt can surprise you, it's refreshing.

In DeepSeek news... the owner of the GPUs is waiting for a new vLLM release (presumably the one that came out a few hours ago), so hopefully I can check it out this weekend.

I've pushed changes to these files, removing a padding token which was used during training:

generation_config.json
special_tokens_map.json
tokenizer_config.json
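
If you want to double-check you've pulled the updated files, a quick check like this should show the pad token is gone (rough sketch, assumes a fresh download):

# Rough check: after the update, the tokenizer and generation config should no
# longer define a pad token.
from transformers import AutoTokenizer, GenerationConfig

tok = AutoTokenizer.from_pretrained("gghfez/Writer-Large-2411-v2.1")
gen = GenerationConfig.from_pretrained("gghfez/Writer-Large-2411-v2.1")

print(tok.special_tokens_map)  # the '[control_746]' pad_token entry should be gone
print(tok.pad_token)           # should be None
print(gen.pad_token_id)        # should be None as well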

This has reduced the likelihood of control_ token responses.

They still come up sometimes though, particularly if no system prompt is given (every prompt had a system prompt during training), or if the samplers are too relaxed.

I've found that with a system prompt, min_p set (something small like 0.2), and top_p < 1 (e.g. 0.9), they don't come up at all now.

Got the min_p idea from jukofyork's 35b model readme.
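
Roughly what I mean, as a sketch with vLLM's offline API (the values are just the ones above, the system prompt is a placeholder, and LLM.chat needs a fairly recent vLLM):

# Sketch of the settings described above (values are a starting point, not tuned).
# Assumes a fairly recent vLLM (for LLM.chat) and enough GPUs for the model.
from vllm import LLM, SamplingParams

llm = LLM(model="gghfez/Writer-Large-2411-v2.1", tensor_parallel_size=2)

params = SamplingParams(
    min_p=0.2,   # something small, as above
    top_p=0.9,   # keep top_p below 1
    max_tokens=512,
)

messages = [
    # Always give it a system prompt - every prompt had one during training.
    {"role": "system", "content": "You are a creative writing partner."},
    {"role": "user", "content": "Write the opening paragraph of a mystery set in Prague."},
]

out = llm.chat(messages, sampling_params=params)
print(out[0].outputs[0].text)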

This was tested at bf16 with the latest vLLM (rented 2xH200s for a couple of hours).

Not sure exactly why they don't come up when I use the text-completion API, but I suspect ST/Open-WebUI have different samplers enabled by default for text vs chat completions.

I also uploaded tokenizer.model.v7 from the original model.

P.S. I noticed the ST template for Mistral-Large-v7 sends [SYSTEM_PROMPT] multiple times. This might be a bug as the Mistral-v7 chat template doesn't support this. (I recall the "Magnum" fine-tuners saying ST had a bug in an earlier version of the Mistral chat template).

Interesting! I didn't realize there was an issue with the canned v7 ST template.

Yeah, I noticed it sends multiple [SYSTEM_PROMPT] [/SYSTEM_PROMPT] blocks in the same request for different parts of the story/character card (seen via the vLLM and Tabby console logs).

This guy has modified the Pixtral-large chat template for his exl2 quants:

nintwentydo/Pixtral-Large-Instruct-2411-exl2-4.5bpw

His chat template: chat_template.json

The main cool feature is that you can upload multiple images throughout the chat and it's aware of all of them, rather than just one image at the start and then a discussion about it.
But the relevant part is that he's also modified it to support multiple system prompts:

System Prompts:
Messages with role "system" will be parsed as [SYSTEM_PROMPT] [/SYSTEM_PROMPT] anywhere they appear in chat history.

This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from conversation.
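
In other words, the rule boils down to something like this (just an illustrative Python sketch, not his actual Jinja template; the exact spacing and </s> placement are approximate):

# Illustrative sketch only - not the actual Jinja chat template. The point is
# that "system" messages get wrapped in [SYSTEM_PROMPT]...[/SYSTEM_PROMPT]
# wherever they appear in the history, not just at the start.
# Exact spacing and </s> placement are approximate.
def build_prompt(messages):
    parts = ["<s>"]
    for m in messages:
        if m["role"] == "system":
            parts.append(f"[SYSTEM_PROMPT] {m['content']}[/SYSTEM_PROMPT]")
        elif m["role"] == "user":
            parts.append(f"[INST] {m['content']}[/INST]")
        else:  # assistant
            parts.append(f" {m['content']}</s>")
    return "".join(parts)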

You can probably edit lm_head to set the control tokens to all zeros to avoid them ever getting chosen: in high dimensional space a random vector dotted with another random vector is almost always zero, and this is what the LLM is using to only match a handful of the vocab for each token. So if you set them to all zeros they should end up always down with the "definitely not" choices every time.

It's a pity that command-r:35b won't let you do this - it has tied input_embedding and lm_head so I think it'll break badly if it ever sees one of the tokens you zero out :/
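
Something like this should do it (rough, untested sketch: it assumes the [control_*] tokens can be looked up by name in the vocab, and that you have enough RAM to load the full bf16 model):

# Rough, untested sketch: zero the lm_head rows for the [control_*] tokens so
# their logits are always exactly 0 (Mistral's lm_head has no bias) and they
# never end up among the top choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gghfez/Writer-Large-2411-v2.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Collect the ids of every token named like "[control_N]".
control_ids = [i for t, i in tok.get_vocab().items() if t.startswith("[control_")]

with torch.no_grad():
    model.lm_head.weight[control_ids] = 0.0

model.save_pretrained("Writer-Large-2411-v2.1-zeroed-control")
tok.save_pretrained("Writer-Large-2411-v2.1-zeroed-control")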

Are you using Unsloth to train this?

If so then you can add this right at the end of _cross_entropy_backward:

    # Mask the control-token columns (vocab ids 19..748 in the original tokenizer),
    # staying within VOCAB_SIZE.
    zero_mask = (col_offsets >= 19) & (col_offsets <= 748) & (col_offsets < VOCAB_SIZE)
    # Zero out the gradient written back for those columns.
    tl.store(logits_ptr + col_offsets, 0.0, mask = zero_mask)

just before the pass statement.

and it will zero out any gradients getting generated for the control tokens (you can see them listed in the original tokenizer):

https://huggingface.co/mistralai/Mistral-Large-Instruct-2411/raw/main/tokenizer.json

You could even try using a small positive value instead of 0.0, and then backprop should drive them lower during training, so they should never appear again...

but I don't really understand why their probability of being generated would be rising if they aren't in your training data? :/

This cracks me up. Whenever I load a model into a RunPod instance, I always do a quick, bland, promptless status check where I ask "Who are you?" Sometimes I seem to get a dud GPU and the model will spew garbage; creating a new instance with different GPUs and then trashing the other one seems to take care of it.
When I do this test, EVERY 123B-based model, regardless of merge or fine-tune, comes back with a variation of the same answer:

"I am a Mistral based AI assistant.

Your model on my query

My name is Eryne. I'm a 20-year-old woman who lives in the city of Atheria, in the world of Aeris. I am a member of the Adventurer's Guild and make my living by taking on quests from the guild. My goal is to become a powerful adventurer and explore the world. I have short black hair and blue eyes, and I wear a leather armor set with a short skirt. I carry a shortsword and a dagger as my weapons. I have a calm and collected personality, but I can be fierce when I need to be.

My background: I was born and raised in the city of Atheria, which is located in the southern part of the continent of Aeris. My parents were both merchants, and they taught me the value of hard work and perseverance.

LOL

Your model on my query

LOL! Yeah, if you don't tell it that it's a helpful assistant in the system prompt it'll make up a character. A lot of ^ is from the "group chat" entries in the dataset. I often use SillyTavern's group chat to have multiple characters interact with each other.

zero_mask = (col_offsets >= 19) & (col_offsets <= 748) & (col_offsets < VOCAB_SIZE)
tl.store(logits_ptr + col_offsets, 0.0, mask = zero_mask)

Thanks, I'll keep that in mind for my next Mistral-Large-2411-based model (probably going to be Pixtral now that it's supported in exllamav2; I think it'd be cool to upload images to set the scene for a story).

I'm not sure why it wants to write them either, or why it only happens with chat-completions and not text-completions. I only really use the latter, so I hadn't tested the former.

You can probably edit lm_head to set the control tokens to all zeros to avoid them ever getting chosen: in high dimensional space a random vector dotted with another random vector is almost always zero, and this is what the LLM is using to only match a handful of the vocab for each token. So if you set them to all zeros they should end up always down with the "definitely not" choices every time.

I'll definitely check this out! I really don't need them, or the document embedding tokens, and since it's a writing model, I doubt anyone else would miss them either.

It's a pity that command-r:35b won't let you do this - it has tied input_embedding and lm_head so I think it'll break badly if it ever sees one of the tokens you zero out :/

command-r:35b really is unique...

Your model on my query

LOL! Yeah, if you don't tell it that it's a helpful assistant in the system prompt it'll make up a character. A lot of ^ is from the "group chat" entries in the dataset. I often use SillyTavern's group chat to have multiple characters interact with each other.

I still want to train a model using all the occult/esoteric books (I need to set up some sort of pipeline with a smaller LLM to fix all the formatting issues first though...):

"Who are you?" may get a scary reply :D

I think the old mixtral:8x7b model might actually be the best for fixing these sorts of problems?

It was very good for RAG applications and scores well on the RULER benchmark...
