Great Model

#1
by spike4379 - opened

Great model thanks for making it. It works great so far at RP and narration.
I hope in the future there's a 24k context length version

Thank you! It's a bit of a relief to see some positive feedback as I wasn't entirely sure how people would react. ;-)

No dude its awesome. I'll do some further testing tonight but so far its pretty smart and converses well. Full steam ahead!

Agreed, excellent idea and execution. Would love a version with even more context too!

I just wanted to post to say how delighted I am to see you're making Llama3 models,Gryphe! I know you're shy about sharing models, and seeing how they fare against others, but honestly friend, a lot of the so-called popular model creators...seem to be making a lot of basic mistakes IMO, and the results are often less impressive than the base models they're supposed to be improving upon (more horny perhaps, but also less intelligent, less coherent, or whatever.)

Or people are just throwing various random datasets into a mix and they all have various conflicting impacts on the quality of the final result... there's no process of carefully curating and selecting material and then assessing how it impacts things before adding more. To me, a lot of it seems like merges-of-merges-of-merges now with no rhyme or reason.

I remember when you were releasing your popular L3 Mytho* series models...the amount of care and attention that went into your efforts which you went into a lot of detail about. Whatever you were doing, it clearly was working, as each model was better than the last. I know that you say MythoMax isn't the best anymore, but the simple truth I've found - again and again - is that people will often experiment with newer models, enjoy them for a particular use case, but fall back to using MythoMax again as it just has a certain robustness, charm and personality that made it a great solid all-rounder model, capable of throwing itself into any role and playing it well. MythoMax remained a benchmark of sorts for a lot of us, to compare newer models against, long after its launch. Not "is this better than MythoMax" so much as "in what ways is this better or worse?" at least in my case.

Whilst MythoMax was clearly superior to many prior models others put out, after MythoMax set that new benchmark it became much harder to say if models were better than what proceeded them.

I really hope you'll apply that same level of skill, attention and finesse to making L3 models with the aim of making the next Mytho line LLM. I'm really excited to see if you can make a version of MythoMax that builds upon what you have learned since, be it about using the right source, format or mix of data or merging methods, to create a kind of 'MythoMax Plus L3'.

I can't speak for everyone, but I know a lot of people are tired now of seeing models from certain 'popular' mixers being celebrated before they've even tried them, because in my experience, they just don't seem to have a special sauce. A lot of people seem to chase mixes with/from certain names because they're well known, but in my experience, the best model makers take their time and release infrequently, but what they put out, is worth the wait. You're in that latter category, and I wouldn't worry too much about the former. Popular doesn't always mean better(although it was certainly true for the Mytho* range!)

Hello! So, while I haven't gotten a chance to thoroughly test quite yet, I'm rather curious what you use for RSS Summarization, if you don't mind sharing!
I'm looking forward to poking and prodding at this model more (though I'll be using GGUF quants, for better or worse), and I was wondering if there was any particular way I could test or engage with the pantheon system you have without direct system prompting? I'd like to see if I could get baseline(s), in case it interferes like it's a character card, (though I know the training goal is to strengthen those personalities). If there's any other best practices you could suggest, I'd love to know.

EDIT: Oh, nevermind. I put the system prompts in and got VERY strong and unique responses right away to not need those baselines.

I've tested a bit, and overall a pretty good model! I had a few repetition issues after a while, but the model seems to be holding on tight for quite a while! Managed to get 100+ messages chats easily without issues. It seems really similar to the latest TheSpice from cgato (maybe collab 👀?), I need to test further to see how it differs precisely, but both seem to follow the no quotes+asterisks format consistently, unlike most models, a big plus in my books!

I'd recommend to benchmark it on Chaiverse, I bet it could have a pretty high score.

Edit1: It has some character, I really like it. I'd prefer if responses were a bit smaller for RP, but that's okay (they start off slow but then can get super long even with a system prompt saying the opposite).

Thank you all for providing so much extensive feedback so far! I admit I didn't have a huge amount of confidence when I originally published this model so consider me very happily surprised. (And motivated to do even better, cause that's just how I roll.)

Regarding potential next steps; I'm hoping Meta will release a 8B model with more context support in the coming weeks onto which I might then train a 1.1 version featuring whatever improvements I will have cooked up in the meantime, which, knowing myself, will be plenty.

@Sovy : In regards to RSS summarization; I use a duct tape Python-coded client that just feeds the raw RSS data to the model, alongside some instructions. (I want a Markdown formatted list, tell me if anything looks dubious, etc, anything goes in these instructions) No official protocols or anything.

@Varkoyote : I did submit the model to Chaiverse, where it seems to land in the middle ground. Admittedly, I have no idea how Chai prompts Pantheon so it could be sub-optimal for all I know.

I will try it out shortly, and see how it is in my ongoing story! I will leave feedback afterwards!

I've been testing out this model more, it's been very, very nice. It seems like the way you trained it builds on the 'Llama3 prefers to be told IT IS the character' vs 'Roleplay as' people have been bringing up. I've checked out the various personas as system prompts, I've even tried to create new original personas using the same format (that went extremely well!). I've ran the persona sys prompts through other models as well, and while they tried to act like what-they-thought of the character, it was still rather stiff. Meanwhile, Pantheon seemed to actually embody the character, which is awesome because I've been trying to see if it was possible to strengthen that in llama3.
When it came to RP with character cards it also did an excellent job, using my custom system prompt via sillytavern and virt-io's simple sampler. All tests have been with ChatML set.

The only real issues I notice is that sometimes it really stays in character (not a bad thing) when you ask it a quick question, while other times it's more responsive to answering questions more generally. You'll still get the answer or instructions or whatever, but it's as-told by the character. It's pretty fun though, and I assume if I just swapped to Aiva that the problem would be minimal. Still, felt like it should be noted as neutral feedback (but personally positive, as I have other models for that).
The other issue I've had is that randomly I'll get backticks or chunks of ruby on rails (impressive lmao) or little emojis here and there. This is mostly an issue with testing on KoboldLite backend for whatever reason, using the built in Godlike setting (likely the culprit). I haven't yet seen this issue with sillytavern. I've still yet to try your recommended inference samplers.

Honestly my experience with this has been surprising and very good, because while it does have some interesting quirks it likes such as certain actions (characters leaning against wall/tree/cave), I've rarely seen any repetition compared to other models I've messed with so far (actions in context getting 'stuck' verbatim). It also doesn't try to end the reply as soon as possible like some other llama 3 models. I actually had to doublecheck end tokens, but it can reply shortly, it just prefers not to. I still feel like I need to poke at it more, but I'm looking forward to what might happen with this finetune line or even merges. I think since I've finally gotten around to messing with it, it'll be my daily driver.

you cooked fr

Thank you all for providing so much extensive feedback so far! I admit I didn't have a huge amount of confidence when I originally published this model so consider me very happily surprised. (And motivated to do even better, cause that's just how I roll.)

Regarding potential next steps; I'm hoping Meta will release a 8B model with more context support in the coming weeks onto which I might then train a 1.1 version featuring whatever improvements I will have cooked up in the meantime, which, knowing myself, will be plenty.

@Sovy : In regards to RSS summarization; I use a duct tape Python-coded client that just feeds the raw RSS data to the model, alongside some instructions. (I want a Markdown formatted list, tell me if anything looks dubious, etc, anything goes in these instructions) No official protocols or anything.

@Varkoyote : I did submit the model to Chaiverse, where it seems to land in the middle ground. Admittedly, I have no idea how Chai prompts Pantheon so it could be sub-optimal for all I know.

Chaiverse prompts weirdly, people were raising their elo point by 50 just with a change to the system prompts.
I've personally had my models elo scores increase 10+ points with a system prompt change.
1207 with the base prompts is impressive as it's the incorrect prompt for your model, it's closer to alpaca prompting
Also outliers like the spice score so high because they are trained on the prompt format chai uses, makes it score way higher.
I'm excited to try it out 😸

Thank you all for providing so much extensive feedback so far! I admit I didn't have a huge amount of confidence when I originally published this model so consider me very happily surprised. (And motivated to do even better, cause that's just how I roll.)

Regarding potential next steps; I'm hoping Meta will release a 8B model with more context support in the coming weeks onto which I might then train a 1.1 version featuring whatever improvements I will have cooked up in the meantime, which, knowing myself, will be plenty.

@Sovy : In regards to RSS summarization; I use a duct tape Python-coded client that just feeds the raw RSS data to the model, alongside some instructions. (I want a Markdown formatted list, tell me if anything looks dubious, etc, anything goes in these instructions) No official protocols or anything.

@Varkoyote : I did submit the model to Chaiverse, where it seems to land in the middle ground. Admittedly, I have no idea how Chai prompts Pantheon so it could be sub-optimal for all I know.

Maybe I can merge gradientai/Llama-3-8B-Instruct-Gradient-1048k with meta-llama/Meta-Llama-3-8B for you.
Maybe it can bring support for long context?
I'm preparing the 24B variant of llama3-8b.

Maybe I can merge gradientai/Llama-3-8B-Instruct-Gradient-1048k with meta-llama/Meta-Llama-3-8B for you.
Maybe it can bring support for long context?
I'm preparing the 24B variant of llama3-8b.

I've been experimenting with Llama.cpp's RoPE scaling feature and I can report that 32k context appears to be perfectly doable with the following parameters;

-c 32768
--rope-scaling linear
--rope-freq-base 2000000

I haven't liked any of the L3 models aside from 4x8B for RP/ERP and like was a stretch. I tried your model out today and damn it has no business being this good at 8B bro. I scaled context up to 32k and I haven't seen a large amount of issues around 17k context but I haven't pushed it all the way yet.

One thing I enjoyed was it refusing my debug requests to stop, and then proceeded to follow all steps of the debug in character. One regen it had a damn existential crisis. The second I called it Aiva, every regen was just coherent obedience.

A few things of note, it handles spatial awareness better than most models I've tried above 8B. This is only over about 300 messages but it's not just simply getting it right some of the time, it's all of the time with context at 8192. Once I scaled the context up it had a few times where it would get confused about the order of who is doing what and where but usually a regen gets me a good response and I move on until it happens again. The personality it brings to the characters feels more appropriate than other models, and does an awesome job at slow burning development. It won't bulldoze through the stories. It's a bit tame on the ERP side of things, using very gentle language, though I didn't do much testing but it's friday evening for me so I'll have more feedback on that later.

Overall I was just amazed, it's easily going to be my daily driver, pushing 2 8x7B's aside. Thanks for releasing the model. Here are my ST samplers for posterity's sake: https://files.catbox.moe/v3s5kg.json
I'm probably doing something wrong but I used your recommended samplers in ST and I was getting a lot of issues. I'm using bartowski_Pantheon-RP-1.0-8b-Llama-3-exl2_8_0.

Gryphe, this model you created, it is one of the best model i ever used. i thank you for this creation you did :D

An amazing model that can adequately understand bots with additional text window settings. Would you be able to make something else out of the 12-20b models based on 3.1 LLAMA?

Also, more of a wish, I would like the model to be able to refuse. Your old MythoMax rejected the user if the character didn't like something, in LLAMA 3 this is obviously worse. LLAMA3 itself is more positive towards the user, which makes all the characters on it look more soft and weak-willed, cards of even the scariest bastards sooner or later turn into calm neutered cats, it's a shame. Even the reinforcements in the instructions don't help. If anything, this is not a criticism of you, please don't think that way.

An amazing model that can adequately understand bots with additional text window settings. Would you be able to make something else out of the 12-20b models based on 3.1 LLAMA?

The next release will be trained on top of Mistral's new 12B model and is currently in progress. I hope to release it Soon™, but as always I only push a release if I'm happy with it!

can't wait to try, the current existing Nemo finetunes are a bit underwhelming, often less performant/more quirky than the base model...

@Varkoyote Yes, taming Nemo has been difficult but I suck at giving up so I hope to deliver something good! At this stage I'm testing various recipes to see which one works best. I'll get there, eventually.

@Diavator Since you edited your response later I didn't answer the newly added part. I agree it's been an issue in general to counter all the positivity bias.

I've regenerated all my Pantheon-related data to include more not-so-nice situations alongside a huge list of other improvements to try and counter exactly that. And no worries about the criticism, there's only so much I can do to counter a model's base training. (It's called FINEtuning for a reason, after all!) It's one of the reasons I'm focusing on Nemo to begin with, besides the noticeably smarter brain power.

@Diavator Since you edited your response later I didn't answer the newly added part. I agree it's been an issue in general to counter all the positivity bias.

I've regenerated all my Pantheon-related data to include more not-so-nice situations alongside a huge list of other improvements to try and counter exactly that. And no worries about the criticism, there's only so much I can do to counter a model's base training. (It's called FINEtuning for a reason, after all!) It's one of the reasons I'm focusing on Nemo to begin with, besides the noticeably smarter brain power.

Most of the new models on llama 3 from other authors, just have more aggression in behaviour and freedom in NSWF, but they are sorely lacking in personality. I enjoyed the NeverSleep and Undi95 derivatives based on your model MM. There was a lot of thought and independence in them. The characters in these models were really ‘alive’, their logic, actions were human, not just dry text like in other models. The character turned out believable, even GPT and Claude with their huge dataset can't create a believable character even with detailed instructions.
LLAMA3 was a huge disappointment for me, I don't understand the point of its existence at all, on its background even WizardLM looks like a masterpiece. I realise I'm judging as a simple user, but seeing the models that come out on it makes me sad and want to go back to llama 2 or Mistral.

Actually can't wait for a new and improved pantheon, it's still one of my fav models!

Sign up or log in to comment