Seems to be broken.
Just writes endlessly, stop token doesn't seem to reliably function when using ChatML format.
This is a llama-3 model, make sure your tools have been updated for it. What you see is a typical symptom of using an old inference engine, or the wrong configuration.
I was just using another ChatML Llama 3 finetune with no problems at all so I don't think there are any problems with my configuration.
Using latest Koboldcpp and Sillytavern, no problems with other models. Not sure if the problem is with the quants or finetune itself. Was using Q5_K_M just for reference.
current koboldcpp does not have support for the llama 3 end tokens unless you manually configure them - did you do so? if not, that is the problem. other finetunes don't matter, because they might not use the same end tokens.
in any case, i only provide the quants - any vocabulary problem is up to the original model. but the symtpoms you describe are a clear indication of a configuration problem, especially since kopboldcpp has not yet been updated for llama-3 as of the latest release.
llama-3 support has landed a few hours ago in llama.cpp, and is expected to be in the next koboldcpp version. the equivalent of --override-kv tokenizer.ggml.pre=str:llama3 likely also needs to be specified (it's not clear whether koboldcpp will have such a switch).
Yes I saw... it was my mistake, I thought the tokenizer issues were already fixed. Will give this a try when new version of koboldcpp releases.