THE THREAD OF DOOM
Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(
Okay, I was wondering if we crossed some sort of line.
@ChuckMcSneed @BigHuggyD @gghfez
Ping.
Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...
Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.
I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.
Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.
It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard limit (I still think Dark-Miqu-70B is worth keeping whatever happens though).
Also, if anybody really misses any I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using `extract_lora.py`, but I strongly suspect nobody will even notice most of the models have gone... Of all that I have created, I've only ever used Dark-Miqu-70B myself!
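(For anyone curious, the basic idea behind that is just a truncated SVD of the weight deltas - a rough sketch for a single weight matrix, not the actual `extract_lora.py` code:)

```python
import torch

def extract_lora_from_delta(w_base, w_finetuned, rank=64):
    """Approximate (w_finetuned - w_base) with a rank-`rank` LoRA pair B @ A."""
    delta = (w_finetuned - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)

    # split the singular values between the two factors
    sqrt_S = torch.diag(S[:rank].sqrt())
    B = U[:, :rank] @ sqrt_S      # shape: (out_features, rank)
    A = sqrt_S @ Vh[:rank, :]     # shape: (rank, in_features)
    return A, B                   # delta ≈ B @ A
```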
:( Damn there was some good info in that thread.
If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.
Unfortunately, I cleaned my browser tabs up about an hour ago.
And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.
I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.
@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.
I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol
I should probably deduplicate my quants. Often I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests, so I just decided I would make a bunch. Need a Huggingverse quant global dedupe...
Hi @jukofyork, first impressions of your Creative Writer Preview have been very positive so far. Thank you, and Happy New Year!
I think the next version will be better, as I'm pretty sure some tokenisation bug affected the last run:
I will be having a break over the next few days so gonna just try this training process on the 32b and 35b models:
- Train using the "random paragraphs" method (~140M tokens).
- Train using the same paragraphs, but put them all back in order from the books they were extracted from, with `<EOS>` tokens separating the books (same ~140M tokens; 1073 books).
- Train a 3rd time on the "in order" books, but use Focal Loss* with `gamma=1.1` like I did for the "bravo" experimental models (see the sketch after this list).
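For anyone following along, this is roughly what I mean by Focal Loss applied per token - a minimal PyTorch sketch (the function name and shapes are just for illustration, not the actual `qlora-pipe` code):

```python
import torch
import torch.nn.functional as F

def focal_lm_loss(logits, labels, gamma=1.1, ignore_index=-100):
    # logits: (batch, seq, vocab), labels: (batch, seq)
    # shift so each position predicts the *next* token
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]

    log_probs = F.log_softmax(logits.float(), dim=-1)
    mask = labels != ignore_index
    safe_labels = labels.masked_fill(~mask, 0)

    # log p_t of the correct token at each position
    target_logp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    p_t = target_logp.exp()

    # the focal term (1 - p_t)^gamma down-weights tokens the model already finds
    # easy, so the loss concentrates on the harder/rarer tokens
    loss = -((1.0 - p_t) ** gamma) * target_logp
    return (loss * mask).sum() / mask.sum()
```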
Stage 2 will use the model with the stage 1 LoRA merged, and then stage 3 will use the model with the stage 2 LoRA merged, and so on.
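(Concretely, the between-stage merging is just the standard peft merge - something like this, with made-up paths:)

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# load the output of the previous stage (or the original base model for stage 1)
base = AutoModelForCausalLM.from_pretrained("creative-writer-stage1-merged", torch_dtype="auto")

# apply the LoRA trained in this stage and bake it into the weights
merged = PeftModel.from_pretrained(base, "creative-writer-stage2-lora").merge_and_unload()

# this then becomes the base model that the next stage trains on top of
merged.save_pretrained("creative-writer-stage2-merged")
```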
My hope is that the first stage will force the model to use more natural language as it can't reduce the loss by looking at recent paragraphs, then the second stage will hopefully fix the mixing of the 1st/3rd person POV (and any other general "weirdness") caused by stage 1's "myopia", and then finally stage 3 will try to ramp up the single-token Entropy.
Assuming all goes well, I'll upload each stage of each model as I go along (each run takes 24-25 hours, so it will take around a week to do them all).
Has anybody else noticed Claude Sonnet has had a lobotomy recently? I thought I was imagining it, but maybe not:
https://www.reddit.com/r/ClaudeAI/comments/1hqi57a/whats_going_on_with_sonnet_35_the_past_few_days/
This seems to be a common practice now: release a good model to the public, then slowly quantise or reroute it over the next few months to save on the VC cash they're burning, and hope nobody notices :/
I've actually found `gpt-4-0125-preview` seems to be the least nerfed (maybe they forgot about it? lol).
Nearly done with the 35b now:
I've worked out what went wrong too:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
If you tokenise multiple paragraphs you get two separate newline tokens (`206`) between them:
This is a paragraph.
This is another paragraph.
[5, 4184, 1801, 1671, 42429, 21, 206, 206, 4184, 1801, 4907, 42429, 21]
but if you tokenise each on their own you get the double newline token (`2126`):
This is a paragraph.
[5, 4184, 1801, 1671, 42429, 21, 2126]
and then when you concatenate these you get wildly out-of-distribution data!
I don't really know enough about tokenisers, but this was not what I expected and seems really odd behaviour to me?!
(It also explains why I mysteriously gained 1M+ tokens for my new run - I was super confused where they had come from! 🤣)
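If anyone wants to reproduce it outside the playground, something like this shows the same thing (I'm assuming the Command-R tokenizer here, since that's what the 35b is based on):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# both paragraphs tokenised together, the way the model would normally see them
together = tok("This is a paragraph.\n\nThis is another paragraph.",
               add_special_tokens=False).input_ids

# each paragraph tokenised on its own and then concatenated, the way my
# dataset pipeline was doing it
separate = (tok("This is a paragraph.\n\n", add_special_tokens=False).input_ids
            + tok("This is another paragraph.", add_special_tokens=False).input_ids)

print(together)  # the paragraph break comes out as two single-newline tokens
print(separate)  # the trailing "\n\n" comes out as the double-newline token,
                 # so naively concatenating pre-tokenised chunks produces
                 # sequences the model never sees from normal tokenisation
```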
Anyway, I'm still gonna run the second stage on each of these as I think the switching to/from 1pp/3pp will still be a problem.
I've also got `qlora-pipe` to output metrics about the hidden state going into `lm_head`, which should mean for stage 3 that I can push the Entropy as high as I possibly can before the model breaks down (ie: where it starts "cheating" the Focal Loss* by shrinking the hidden state).
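Roughly the sort of metrics I mean (a standalone sketch, not the actual `qlora-pipe` changes):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def lm_head_metrics(model, input_ids):
    """Mean single-token entropy and mean norm of the hidden state feeding lm_head."""
    out = model(input_ids, output_hidden_states=True)

    # for most HF causal LMs, hidden_states[-1] is the (post-final-norm) tensor
    # that actually gets multiplied by lm_head
    hidden = out.hidden_states[-1]            # (batch, seq, hidden)
    hidden_norm = hidden.norm(dim=-1).mean()  # if this shrinks, the model is
                                              # "cheating" the focal loss

    log_probs = F.log_softmax(out.logits.float(), dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    return entropy.item(), hidden_norm.item()
```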
I've tried doing some non-standard functionality coding and what surprised me is how bad ALL LLMs are at it. I need to guide them through each little step or they'll fuck up. O1? Decides to randomly change unrelated parts of the code. Gemini? Just dumb. Sonnet? A bit better, but still makes beginner-level mistakes that need to be fixed by hand. At this point I'm feeling like it would have been faster if I'd just coded it myself.
Yeah, I think even `o1-preview` has been quietly nerfed :/
Talk of the different tokenization of single and multiple paragraphs reminds me of something. I don't know how useful this is, but I remember when I was messing around implementing the novelai api in a writing program, they were doing something unusual with their llama finetune - you have to strip the last token off the context when sending text in. Apparently llama has many ways to tokenize the end of a sentence/paragraph and it was causing issues, whereas stripping it let the model continue in a more creative way. Probably not a useful thought, but I thought I'd write it down here since this seemed like the most appropriate place ;p.

I've actually taken this to an extreme in the past, having entire paragraphs stripped from the end of every generation because I found continuations were better if I started a paragraph back rather than at the end of the previous gen. It's possible I was "solving" the same issue by accident, explaining the improved text quality across the story.
Or, maybe I'm just a fool sitting here fooling around with magical talk-boxes imagining things :).
Anyway, hey. Somehow control vectors slipped below my radar. Look forward to jumping in and messing with them. What's a typical workflow look like with these things? Are you constantly loading and unloading vectors to steer the story? I'm digging around looking for examples/more info but haven't found much. Wouldn't mind implementing this in my own writing system (I'm a novelist by trade and always looking for ways to get the AI writing in a more controllable way).
Odd thought:
A while back I was testing out trying to control positivity/negativity and bias in the model by doing external dice rolls (python) and feeding the roll to the LLM in context every turn, asking it to act accordingly based on its dice roll and giving it a prompt to continue the novel-style narrative. I wasn't roleplaying in the classic sense. The idea was that LLMs likely have enough information about roleplay/D&D style interactions that if I took an ongoing novel and gave it a dice roll, then said it needed to continue the story based upon the success (or failure) of that roll, it would allow me to steer a story a bit more directly and achieve negative and positive outcomes where I wanted them.
It worked, albeit a bit noisy. Low rolls led to negative storytelling, higher dice rolls led to positive things happening.
Now I'm imagining a situation where control vectors are made for rolls 1-20 by producing prompts and showing the varied outcomes (the continued novel-text based on the results of that roll).
Once produced, you apply vectors each generation based on how the roll goes in the background (so if they rolled an 18, you'd apply the 'roll18' vector). The text coming out is then being steered based on those outcomes. It should give you good 2-way vectors since the rolls largely pivot around 10 (especially if you're prompting it to treat 10 as neutral, 9 and below as negative, and 11 and up as positive). It would also make implementing a slider in a UI easy, to push the story in positive or negative directions by sliding it up or down... and since the roll outcomes are creative/ambiguous, it should give the AI some space to be creative in how it interprets the scene.
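Something like this is what I'm picturing, just to make it concrete (totally a sketch - the vector file name is made up, and I'm assuming one positive/negative "outcome" vector that gets scaled rather than 20 separate ones):

```python
import random

def roll_to_scale(roll: int) -> float:
    """Map a d20 roll to a signed strength with 10 as neutral."""
    return (roll - 10) / 10.0   # 1 -> -0.9 (bad outcome), 20 -> +1.0 (great outcome)

roll = random.randint(1, 20)
scale = roll_to_scale(roll)

# the roll still goes into the context like I was doing before...
prompt = f"[Dice roll: {roll}] Continue the story, with the outcome reflecting this roll.\n"

# ...but the same number could also drive a control vector, e.g. via llama.cpp's
# --control-vector-scaled flag (file names here are made up):
cmd = ["./llama-cli", "-m", "model.gguf",
       "--control-vector-scaled", "outcome-positivity.gguf", f"{scale:.2f}",
       "-p", prompt]
print(" ".join(cmd))
```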
Anyway, I'm still out of my depth here - I'll have to mess around with control vectors and get a feel for them.