THE THREAD OF DOOM

#12
by jukofyork - opened

Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(

jukofyork pinned discussion

Okay, I was wondering if we crossed some sort of line.

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...


Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.

I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.

Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.


image.png

image.png

It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard-limit (I still think Dark-Miqu-70B is worth keeping whatever though).


Also, if anybody really misses any model I have uploaded, then I can in theory recreate it and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect nobody will even notice most of them have gone... Of all the models I've created, I've only ever used Dark-Miqu-70B myself!

:( Damn there was some good info in that thread.

If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.

Unfortunately, I cleaned my browser tabs up about an hour ago.

And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.

I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.

@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.

I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol

I should probably deduplicate my quants. Often I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests. So I just decided I would make a bunch. Need a Huggingverse quant global dedupe...

There is a snapshot on the wayback machine:

http://web.archive.org/web/20241130014411/https://huggingface.co/jukofyork/creative-writing-control-vectors-BETA-v0.1/discussions/2

but it looks like the "click to expand" stuff stopped it getting backed up properly?

The mistralai/Mistral-Large-Instruct-2407 fine-tune is cooking and should be ready in around 9-10 days.

This is going to be good. Mistral-Large is very tolerant of projects like this.

@jukofyork

Control-Vector question: how much VRAM is needed to train vectors for Wizard2-8x22b? I vaguely recall in the lost thread you were using 3 x ?


Around 5/8ths of 140GB. I could train everything up to 70B-72B using a single A6000, but the larger models needed 2x A6000.

Thanks. Ended up managing on a single 94GB H100NVL in the cloud. Looks like it just misses fitting on an 80GB card by less than 1GB of VRAM.

The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

I'm so confused now... This literally does the exact opposite of everything I thought was the key to making LLMs write better! I wish they had analysed the names like @ChuckMcSneed's experiments!?

This seems quite an interesting metric (used in that paper):

Screenshot_20241207-094538.png

From: https://www.sltinfo.com/wp-content/uploads/2014/01/type-token-ratio.pdf

Also: Type-Token Ratios: What do they tell us?
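For reference, the metric itself is trivial to compute; a minimal sketch (my own, not taken from the paper):

def type_token_ratio(text: str) -> float:
    """Type-token ratio: distinct words ("types") divided by total words ("tokens")."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

# Repetitive text scores lower than varied text:
print(type_token_ratio("the cat sat on the mat"))   # 5 types / 6 tokens ~= 0.83
print(type_token_ratio("the the the the the the"))  # 1 type  / 6 tokens ~= 0.17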

The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

I'm so confused now... This literally does the exact opposite of everything I thought was the key to making LLMs write better! I wish they had analysed the names like @ChuckMcSneed's experiments!?

They tested repetition within text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen; the problem is that when you run multiple gens, you'll get the same fucking slop.


Yeah, I've been thinking about this too and wonder if a really well curated dataset of "openings" (sentences, paragraphs, chapters, etc) of books/stories might help somewhat with this?

Just checked on the mistral-large fine-tune and it's nearly 1/2 way now and still looking good: at 60% of the way it will switch to a cosine schedule, so fingers crossed it stays this way:

Screenshot_20241207-115133.png

I was a little worried when I saw those big jumps in the max-norm, but it's probably just due to the weird / non-standard hyper-parameters I have to use to increase the Entropy (ie: it can't use any momentum-based optimiser or it overshoots badly, so I have to use Adam with beta1 = 0; aka uncentered RMSprop).

From previous experiments, the Entropy should start to drop slightly now and hopefully end up being approximately the same as the log-loss by the end of training...

Considering I've optimised the hyper-parameters on command-r:35b, this looks pretty hopeful that the same will work for all models.
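For anyone wanting to reproduce that optimiser setup, "Adam with beta1 = 0" is just a one-line change; a minimal sketch (the learning rate is a placeholder, not the value used for this run):

import torch

# beta1 = 0 disables the momentum term, leaving only the second-moment (RMSprop-style)
# scaling with bias correction -- i.e. "uncentered RMSprop".
optimizer = torch.optim.Adam(
    model.parameters(),  # `model` stands in for whatever module is being fine-tuned
    lr=5e-5,             # placeholder learning rate
    betas=(0.0, 0.999),  # beta1 = 0; beta2 left at a typical default
)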

They tested repetition within text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen; the problem is that when you run multiple gens, you'll get the same fucking slop.

I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.

Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.

Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.

I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters and objects (and for some reason, it also caused a temporal displacement in its general knowledge), then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elaras like "Vi'tkol" showing up in half the stories lol

Or we all need our own private tunes with Jukeofyork's bespoke technique ^
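The tallying step of an experiment like that is only a few lines; a sketch of how the per-story name lists could be aggregated (the file name and format here are made up):

import json
from collections import Counter

# Assume each generated story has been reduced (e.g. by gemma-2-2b) to a list of
# character names, stored one JSON list per line in names.jsonl.
counts = Counter()
with open("names.jsonl") as f:
    for line in f:
        counts.update(json.loads(line))

# The heavy hitters are that model's "Elara"-equivalents.
for name, n in counts.most_common(20):
    print(f"{name}: {n}")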


I think some of this is likely a failure of the associative memory again:

I've been thinking a lot about QwQ and I'm beginning to think the "power" of the model actually comes from being able to approximate higher-order interaction effects from the words it writes.

The associative memory in the transformer architecture (and the Hopfield networks that came before) only really looks at second-order interactions (directly).

Trying to extend the transformer architecture to cubic interactions (and beyond) is totally out of the question as second-order interactions already cost O(n^2).

You can actually approximate higher order interactions to some degree, eg:

SimpleBayesNet.svg.png

https://en.m.wikipedia.org/wiki/Bayesian_network

But it quickly blows up...

So what I think QwQ might be doing is trawling through all the "linked associations" which in turn let it look "further" away from the input context than repeated transformer blocks allow (which can likely only consider a very constrained set of links; likely following a very restrictive pattern too).


So how is this related to creative writing?

Well at the start, the model only really has what you have given it in the prompt to go off, so will likely only have this along with some kind of low-Entropy / pre-baked "template" story (that shows up again and again and again...).

One solution then would be to try to preload the KV-cache with some sort of jumbled up "superimposition" of story prompts, to try to kick-start it away from the boring "template", but I think this will likely be fraught with the model not following your instructions and other "weird shit" due to the randomised input possibly being nothing to do with what you actually want.

So what's the (an) alternative?

Try to start by "asking around" but be very careful to not give away what you actually want to do, eg:

  • What do you know about Joe Abercrombie?
  • What do you know about Rob J Hayes?
  • What do you know about Grimdark fantasy and how is it different to epic fantasy?
  • Let's think about some original settings and character names that might work in a setting like this.
  • Let's now summarise what we have thought about so far.
  • What are we missing here? Can you list some related stuff to consider that we haven't discussed yet?

and so on..

This is exactly what QwQ is doing, but then it finishes off by writing a heap of the worst qwen-slop imaginable! :D

We need to find a way to "pre-load" this higher-order, possibly useful, possibly useless, context into some of the better models.
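As a rough illustration of that kind of manual "pre-loading", a priming conversation could be assembled something like this (a sketch only; ask_model is a hypothetical placeholder for whatever chat-completion call you use):

# Ask broad, related questions first and only reveal the real task at the end,
# so the earlier answers sit in the context and "prime" the final request.
priming_questions = [
    "What do you know about Joe Abercrombie?",
    "What do you know about Rob J Hayes?",
    "What do you know about Grimdark fantasy and how is it different to epic fantasy?",
    "Let's think about some original settings and character names that might work in a setting like this.",
    "Let's now summarise what we have thought about so far.",
    "What are we missing here? Can you list some related stuff to consider that we haven't discussed yet?",
]

messages = []
for question in priming_questions:
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": ask_model(messages)})  # hypothetical helper

# Only now give the actual task, with all the primed associations already in context.
messages.append({"role": "user", "content": "Now write the opening chapter of a grimdark fantasy story..."})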

This method actually has a name in psychology / educational theory, but I've forgotten what it is called now:

Basically the idea is to "prime" the student with something novel/interesting that gets these sort of associations working and creates "anticipation", before actually giving the task...

IIRC, it has "prime" in the name.

I have done something similar to that before back when GPT3.5 came out.
I wrote a bunch of phrases at the start, then said "Oh sorry, wrong window, what I meant to say was: "

This is exactly what QwQ is doing

I hadn't realized that, but that makes perfect sense.

be very careful to not give away what you actually want to do

Why is that?


It's a bit like the "don't think of an elephant" thing: if I start off telling you that we're ultimately gonna be writing "a Grimdark story in the style of..." then all the distant associations you know about are unlikely to be used effectively as you've "framed" the problem for them.

From a human perspective, I think it also likely triggers the "reward centres" more due to a mix of "anticipation" and the "satisfaction" of problem solving.

I don't know anything about psychology (at all) so may be using the wrong terminology; it's just 20+ years ago I worked as a private maths teacher who had to deal with kids excluded from school and often those who had failed to get anywhere with other private teachers too! Needless to say; I read a lot about educational theory those years and even managed to get some to pass their exams that nobody would have thought possible... :/

https://en.m.wikipedia.org/wiki/Priming_(psychology)

I think it is actually just called "priming" but sadly wokeism seems to have corrupted the Wikipedia article:

Priming is thought to play a large part in the systems of stereotyping.


https://www.teachwithmrst.com/post/priming

this is another example of priming, which is an increased sensitivity to a particular schema due to a recent experience. In other words, priming is when an experience or exposure to a stimulus puts a particular schema at the forefront of our mind. When this in turn influences our judgments and decisions, it's called the priming effect.

I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.

Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.

Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.

I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters, objects (and for some reason, it also caused a temporal displacement in it's general knowledge) then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elara's like "Vi'tkol" showing up in half the stories lol

Or we all need our own private tunes with Jukeofyork's bespoke technique ^

Have you tried it with base models? Take the good old llama1 or falcon-180b and see if makes slop or not. The problem is instruction tuning.


Interestingly, this paper (which sadly got lost when I deleted the old thread :/) shows that base models start off well:

https://openreview.net/forum?id=ZpQ2SqQNXf

but then start to gain way too much entropy as the sequence length increases:

Screenshot_20241207-181943.png

It almost looks like if we could do "late fusion" on the two sets of outputs we would have something close to human generation?!

When my machines finally finish training, then I think I might be able to hack together something that tests this...

I think it will need some heuristics added to let the instruct model decide when to stop, but otherwise it's just a case of blending the probability outputs before deciding which token to accept.

(I've already experimented lots with merging base/instruct models and/or making MoE models with the gating weights all set to zero, and both are "interesting" but sadly never stop and quickly go completely off the rails by talking to themselves, etc).
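A hedged sketch of what that blending step might look like (the model names and mixing weight are placeholders, both models are assumed to share a tokeniser, and a real version would need the stopping heuristic mentioned above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("some-base-model")          # placeholder
instruct = AutoModelForCausalLM.from_pretrained("some-instruct-model")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("some-instruct-model")

def blended_next_token(input_ids, alpha=0.5):
    """Late fusion: mix the next-token distributions of the base and instruct models."""
    with torch.no_grad():
        p_base = torch.softmax(base(input_ids).logits[:, -1, :], dim=-1)
        p_inst = torch.softmax(instruct(input_ids).logits[:, -1, :], dim=-1)
    probs = alpha * p_base + (1.0 - alpha) * p_inst
    return torch.multinomial(probs, num_samples=1)  # sample the accepted token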

Interestingly, this paper (which sadly got lost when I deleted the old thread :/) shows that base models start off well:

You've still got it though right (you linked to it).

I've got a copy which I used to build a tool to replicate the graphs in the paper.

Have you tried it with base models?

Not really, even with few-shot prompting, couldn't get them to reliably produce synthetic data.

Take the good old llama1 or falcon-180b and see if makes slop or not. The problem is instruction tuning.

Okay, that was a beast to get running. It doesn't seem to produce gpt-isms, but I notice it re-uses the same names a lot (not Elara but its own names).

That's what I mean, I think all of these models; once they've been (pre)trained and become stateless weights, will either have their own flavor of slop, or produce noise. Kind of like how we have our own patterns of speech, etc.

P.S. I see they've given us more storage now on HF, and it looks like public repos are free

image.png

So I've been reading up on the "Softmax Bottleneck":

https://arxiv.org/abs/1711.03953

which likely affects all LLMs to some degree (due to having n_vocab >> hidden_dim), but likely affects small LLMs the most:

https://arxiv.org/abs/2404.07647

(possibly one of the reasons Cohere and Mistral-Large with their 12k hidden_dim outperform the standard 8k hidden_dim of the 70B models for writing too?)

The "Mixture of Softmax" solution isn't very appealing as the lm_head tensors are already huge...

Another solution people have experimented with is passing the logits through a non-linear function:

https://arxiv.org/abs/1805.10829
https://arxiv.org/abs/1902.08077

Then it occurred to me that we already have an example of a range of models that do this, which are also quite good at creative writing and appear to "punch above their weight" - gemma2 with their "logit soft capping":

https://arxiv.org/abs/2408.00118

Screenshot_20241211-124332.png

which originally came from this paper:

https://arxiv.org/abs/1611.09940 (Section 5.1, 'RL pretraining')

Interestingly, the "Sigsoftmax" paper above experimented with using the binary sigmoid function:

Screenshot_20241211-124447.png

and found it worked better than their function (which is a sort of "soft leaky RELU") for one of the tests, but concluded capping at 1 was likely problematic...

But the gemma2 models use +/- 30 for their cap:

  "final_logit_softcapping": 30.0,

which when passed through exp(), is well outside the range of floating point values anyway...

So I wonder if the benefit of gemma2's "final logit softcapping" is actually nothing to do with clipping/capping; and simply because it solves the "Softmax Bottleneck" problem to some degree due to the non-linearity it introduces?!
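For reference, the soft-capping itself is just a scaled tanh applied to the final logits before the softmax, which is where the extra non-linearity comes from:

import torch

def soft_cap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # gemma2-style "final logit softcapping": squash the logits smoothly into
    # (-cap, +cap), adding a non-linearity between lm_head and the softmax.
    return cap * torch.tanh(logits / cap)

logits = torch.randn(1, 256000) * 50  # fake final-layer logits
probs = torch.softmax(soft_cap_logits(logits), dim=-1)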

P.S. I see they've given us more storage now on HF, and it looks like public repos are free

image.png

Yeah, I saw that posted on Reddit too. I'm 1 day away from the mistral-large fine tune being ready:

Screenshot_20241211-130018.png

So at least I won't have to delete anything to upload it (I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though).

Pretty excited to see what it is like as 9 days has felt like a long time lol.

I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.

I can't see any compelling reason to run it on the new version of command-r-plus:104b or mistral-large:123b, as for creative writing they both seem like a downgrade...


Enough slop in the new releases to keep the pigs happy...


Yeah, and I think some of the newer models are starting to filter out copyrighted data so they aren't gonna work well even if the slop can be reduced :/

I think qwen-1.5:110b is worth trying, as even though it was in the v1.5 line it came out way after the others, and does seem to not have been "benchmaxxed" as badly as the v2.0 and v2.5 models.

The older v1.5 models also didn't have great long context ability:

Screenshot_20241211-144115.png

https://github.com/NVIDIA/RULER

but I have a feeling qwen-1.5:110b was actually more like qwen-2:110b but just named as v1.5...

Before all the gemma2:9b clones took over, it scored fairly high on EQ-Bench:

http://eqbench.com/creative_writing.html

and did appear to do well in the sample "write in the style of" prompts they used to test it (meaning it's unlikely to have had the copyrighted data filtered out).

It also appears to be quite intelligent and actually scored higher than the commercial models when acting as a judge in this paper:

https://arxiv.org/abs/2406.08598v2

I think it will be interesting to see how it turns out anyway.

This paper makes me think merging might be back on the cards too:

https://arxiv.org/abs/2412.06769

and I noticed all the top places in the open-llm-leaderboard:

https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

appear to be using versions of qwen2:72b and qwen2.5:72b with around 6 layers self-merged (the authors are very cagey about saying exactly what the method is though...).

I wonder if command-r-plus with the middle 16 (or 24) layers duplicated (taking it up to 80 or 88 layers respectively), might be a worthwhile experiment?
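The book-keeping for a self-merge like that is just choosing which layer ranges to repeat; a rough sketch of one way the 80-layer variant could be laid out (the exact ranges are an example, not a tested recipe):

# command-r-plus has 64 decoder layers; repeating 16 "middle" layers gives 80.
n_layers = 64
mid_start, mid_end = 24, 40  # example choice of the 16 layers to duplicate

merged_layer_order = list(range(0, mid_end)) + list(range(mid_start, n_layers))
assert len(merged_layer_order) == 80  # layers 24-39 now appear twice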

I'm pretty sure the "multiplicative-LoRA" method is ideally suited to fixing a lot of the old weirdness caused by merging, and these middle layers are clearly related to concepts as they were the most important for the control vectors...

The discussion in this thread:

https://huggingface.co/MaziyarPanahi/calme-2.4-rys-78b/discussions/10

Is what makes me believe the "secret sauce" is really just a self-merge...

I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...

If we can make command-r-plus just a little smarter, then it would be a big win IMO and only take the size up to around the same as mistral-large:123b and still less than wizard-lm-2:140b.

IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).

@TheDrummer tried making largestral smaller by cutting out "unimportant layers", but it didn't go too well imo. While the vanilla knew all 8 of the styles, the cut down version almost completely forgot one and got worse at writing poems:
image.png

IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).

image.png
Self-merges wrote better on my tests too.

I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...

If we can make command-r-plus just a little smarter, then it would be a big win IMO and only take the size up to around the same as mistral-large:123b and still less than wizard-lm-2:140b.

IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).

I for one would love smarter command r plus. Still one of my favorite writers but its continuity leaves something to be desired

Enough slop in the new releases to keep the pigs happy...

image.png
Even pigs aren't happy with the new one.


Because it's worse for non-creative tasks. Its general knowledge is worse than 2407 (same as command-r-plus-08), even though 2411 appears to have the same knowledge cutoff as 2407.

I'm not sure they're trying to remove copyright though; I suspect it's teething issues, since it's the first time Mistral have tried adding a proper system prompt to their template.

I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...

Was this the one which had random spelling/grammatical errors? I wonder if that could be healed with a very light finetune. I've successfully taught a model I broke how to speak again with a quick r=16,a=32 tune on the mlp modules, using a dataset generated by the original model.
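For anyone wanting to try the same kind of "healing" tune, something like this peft config would match the r=16, a=32 mlp-only description (a sketch: the dropout value is a placeholder and the module names assume a llama-style architecture):

from peft import LoraConfig

# Light "healing" tune on just the MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                                     # placeholder
    target_modules=["gate_proj", "up_proj", "down_proj"],  # llama-style MLP module names
    task_type="CAUSAL_LM",
)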

Is what makes me believe the "secret sauce" is really just a self-merge...

Could Vizdiff help you investigate this? https://huggingface.co/spaces/Steelskull/Vis_Diff

I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though

If you just want to tidy up, sure. But public models don't count towards the quota.

Took a snapshot of https://archive.is/M8Tr2 to avoid link-rot.

@jukofyork P.S. Since llama.cpp server has on-the-fly LoRA swapping and scaling (like the control-vector-scaled option) with the latest version, and Mistral-Large is huge to store locally, I don't suppose you could upload the LoRA adapter of your Mistral-Large as well, like rAIfle did with rAIfle/SorcererLM-8x22b-epoch2-LoRA?

Is what makes me believe the "secret sauce" is really just a self-merge...

Could Vizdiff help you investigate this? https://huggingface.co/spaces/Steelskull/Vis_Diff

Thanks, I'll have a look at this and see if I can spot what they did.

I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though

If you just want to tidy up, sure. But public models don't count towards the quota.

Yeah, I'm just trying to avoid a lot of the confusion and only have "good" models uploaded.

Took a snapshot of https://archive.is/M8Tr2 to avoid link-rot.

@jukofyork P.S. Since llama.cpp server has on-the-fly LoRA swapping and scaling (like the control-vector-scaled option) with the latest version, and Mistral-Large is huge to store locally, I don't suppose you could upload the LoRA adapter of your Mistral-Large as well, like rAIfle did with rAIfle/SorcererLM-8x22b-epoch2-LoRA?

The problem is that it's a Multiplicative-LoRA so the standard Additive-LoRA code won't work, and even a very high rank SVD still can't capture the full Multiplicative-LoRA :/

I could possibly save just the down_proj tensors using the modules_to_save option, but sadly that won't work with most stuff, so I'm probably best just uploading the full model.
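For context, here is my rough reading of the difference (a sketch only, not the actual training code): the multiplicative-LoRA acts on the output of the frozen projection, which is why the standard additive adapter format doesn't apply.

import torch

def multiplicative_lora_down_proj(W, lora_A, lora_B, x):
    # lora_A "detects" directions in the output of the frozen projection and
    # lora_B adds corresponding directions back in, scaled by that detection.
    h = x @ W.T                     # frozen down_proj output
    detected = h @ lora_A.T         # strength of each of the r detector directions
    return h + detected @ lora_B.T  # add the r corresponding output directions

# A standard additive LoRA instead perturbs the weight itself (with lora_A sized
# to the projection's *input*):  h = x @ (W + lora_B @ lora_A).T
# which is the form that extract_lora.py-style SVD deltas assume.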


@TheDrummer Will do.

@ChuckMcSneed Could you check out Endurance v1 & v1.1 to see if finetuning healed it to an extent?

Great, who left the door open again?!?! ;D

d9f8c62743d7ac0ca6d1f2709b58bec0.jpg
FLUX thinks demons gotta be phasing through doors to close them.

The problem is that it's a Multiplicative-LoRA so the standard Additive-LoRA code won't work, and even a very high rank SVD still can't capture the full Multiplicative-LoRA :/

All good, this is a special case then. I've cleared up space by deleting the new Mistral-Large and command-r+, and other models I don't need.

Looking forward to trying it out!

Bad news guys :(

It seems to have corrupted itself and tried to do an extra step (???) at the end:

GPU-SERVER-1: before GAS splitting, batch size: 10, total tokens: 81920
GPU-SERVER-1: [2024-12-12 14:38:52,276] [INFO] [logging.py:129:log_dist] [Rank 0] step=1159, skipped=0, lr=[0.0], mom=[0.0]
GPU-SERVER-1: [2024-12-12 14:38:52.456] [INFO] [qlora-pipe] step:  1159 /  1159 loss: 1.5680 iter time (s): 622.448 samples/sec: 0.048 eta: 46m41s 
GPU-SERVER-1: before GAS splitting, batch size: 10, total tokens: 81920
GPU-SERVER-1: [2024-12-12 14:49:11,957] [INFO] [logging.py:129:log_dist] [Rank 0] step=1160, skipped=0, lr=[1.1460462221279944e-09], mom=[0.0]
GPU-SERVER-1: [2024-12-12 14:49:12.019] [INFO] [qlora-pipe] step:  1160 /  1159 loss: 8.7767 iter time (s): 618.958 samples/sec: 0.048 eta: 36m18s 

and then crashed....

I tried quantizing this and can confirm it's completely broken (as the loss: 8.7767 indicates).

Even worse is I tried to go back to the step: 1100 snapshot and it turns out two of the ranks have been saving 2 copies (???) at the same time:

GPU-SERVER-1: [2024-12-12 04:26:28,592] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1100 is ready now!
GPU-SERVER-2: [2024-12-12 04:26:28,598] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_46-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:28,602] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_02-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:28,841] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_02-model_states.pt.
GPU-SERVER-1: [2024-12-12 04:26:28,854] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_03-model_states.pt...
GPU-SERVER-2: [2024-12-12 04:26:28,869] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_46-model_states.pt.
GPU-SERVER-2: [2024-12-12 04:26:28,881] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_47-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:29,083] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_03-model_states.pt.

so these all seem messed up too :(

I will have to power-cycle all the machines and/or try to investigate what caused this when I get back home, but not much point in redoing it or trying other models until then.

Looks possibly like GPU-SERVER-2 has a broken SSD :(

Meh.

Shit.

Before you reboot:

1160 / 1159 loss: 8.7767
step: 1100

Do you have a much earlier step like 500? If this sync issue is somehow related to the dead SSD it might have been okay earlier on so it's not all lost at least

broken SSD

Again, before you reboot, it's worth asking Claude/o1 if there's a way to get the data. Years ago I nuked an SSD and I forget what I did, but managed to get something back which was still loaded. Depends on the filesystem though (Claude/o1 would know)

I don't suppose you had something like wandb logging your checkpoints?


I think the SDD errors were a red herring and there actually was something wrong with mixing pipeline parallel and batch parallel at the same time.

It seems both rank 0 and rank 1 had been saving over the top of each other the whole run and I never noticed :/

I'm just gonna run on the 30B-ish models which don't use pipeline parallel whilst away and see how they get on... If they are fucked too then something more serious must have gone wrong as I did manage to train endless command-r:35b fine tunes before.

I've also reverted a lot of the fiddling about I did and made a fresh pull of qlora-pipe in case...

If I can't mix pipeline parallel and batch parallel then it's not the end of the world, as I can just run the training 3x and combine all the LoRA using the mean or even SVD (but sadly 9 days --> 27 days).

This might even be the better option, as the ratio of samples to tunable parameters for the large models is gonna be pretty bad anyway and this would help with overfitting.

Oof sorry 😞

So I've been hunting through the qlora-pipe code to see if I could see where the "extra step" came from (which I think actually ended up with a negative learning rate and hence performed gradient ascent and ruined the model at the end). I didn't manage to find the answer, but I have found a way better method to create the training data, eg:

  1. Extract all paragraphs that are between 200 and 2000 characters (which is ~40-400 words or ~50-500 tokens). This gets rid of all the "dross" like tables of contents, page numbers, etc and leaves just nice clean paragraphs.
  2. So now we're left with ~1.1M paragraphs and for each of these, we trim any trailing whitespace and add two new lines (to be consistent with how most LLMs output paragraphs) and then append an <EOS> token.
  3. Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.

For example, for Cohere models:

Not unkindly, Mr. Nell told him, "There's two parts to the system. One part carries solid human waste--shit, if I'd not be offendin yer tender ears. The other part carries gray water--water flushed from toilets or run down the drains from sinks and washin-machines and showers; it's also the water that runs down the gutters into the city drains.

<|END_OF_TURN_TOKEN|>The aluminum sled on which Norah was transporting her testing gear resembled an oversized Flexible Flyer. The craft was prepacked with diagnostic gear and safety accessories she'd been using on the glacier over the past few days. All of her gear--including a battery pack, safety flares, and a powerful front-mounted spotlight--was bound under a secured, plastic tarp. Despite the heavy load, the sled glided effortlessly on long, straight runners. Even on the almost imperceptible incline, the sled moved downhill on its own accord, and Norah applied a gentle restraint, almost as if allowing the sled to lead the way. Sensing the distance growing between the group and the habisphere, Tolland looked over his shoulder. Only fifty yards away, the pale curvature of the dome had all but disappeared in the blustery blackness.

<|END_OF_TURN_TOKEN|>He packed a compartmentalized, hand-tooled Mark Cross briefcase with the blue bag, the Green Acres bag, and the tape recorder that he used for dictation. While he waited for the Keanuphobe to call, he would do some game planning and compose a chapter *of Fear Not for l Am with You.*

<|END_OF_TURN_TOKEN|>Well, the word was out. Cancer. Rhymes with *dancer* and You *just shit your pants, sir.* God knew the word had bobbed up in his own mind more than once since getting on the penny scale in front of the shoe store. It had bobbed up like some evil clown's dirty balloon and he had turned away from it. He had turned away from it the way you turned away from the bag ladies who sat rocking back and forth in their strange, sooty little nooks outside the Grand Central Station or the way you turned away from the capering Gypsy children who had come with the rest of the Gypsy band. The Gypsy children sang in voices that somehow managed to be both monotonous and strangely sweet at the same time. The Gypsy children walked on their hands with tambourines outstretched, held somehow by their bare dirty toes. The Gypsy children juggled. The Gypsy children put the local Frisbee jocks to shame by spinning two, sometimes three of the plastic disks at the same time - on fingers, on thumbs, sometimes on noses. They laughed while they did all those things, and they all seemed to have skin diseases or crossed eyes or harelips. When you suddenly found such a weird combination of agility and ugliness thrust in front of you, what else was there to do but turn away? Bag ladies, Gypsy children, and cancer. Even the skittery run of his thoughts frightened him.

<|END_OF_TURN_TOKEN|>

(sadly all the work of extracting, shuffling, formatting, etc is done using bash scripts as python was so slow it kept timing out the Deepspeed connection...)
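For what it's worth, the recipe in steps 1-3 is only a few lines of Python anyway; a rough sketch (the directory, output file and EOS string are placeholders, and splitting on blank lines is an assumption about the text files; the real pipeline used bash for speed as noted above):

import glob
import random

EOS = "<|END_OF_TURN_TOKEN|>"  # the Cohere EOS string; swap in the target model's own

paragraphs = []
for path in glob.glob("books_txt/*.txt"):  # placeholder location of the plain-text books
    with open(path, encoding="utf-8") as f:
        for para in f.read().split("\n\n"):
            para = para.strip()
            if 200 <= len(para) <= 2000:                # step 1: drop the "dross"
                paragraphs.append(para + "\n\n" + EOS)  # step 2: two newlines + EOS

random.shuffle(paragraphs)                              # step 3: shuffle...
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("".join(paragraphs))                        # ...and concatenate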

Then we now load in the new dataset files and create batches using this modified version of yield_sequences_from_token_batch:

from tqdm import tqdm  # progress bar used below


def yield_sequences_from_token_batch(tokenizer, token_batch, sequence_len):
    """Yields fixed-length sequences from batches of tokens, ensuring proper BOS/EOS token handling.
    
    Takes batches of tokens and yields sequences of fixed length, with each sequence:
    - Starting with BOS token if specified in tokeniser
    - Containing complete chunks terminated by EOS tokens (never splitting between EOS tokens)
    - Right-padded with extra EOS tokens if needed so all reach exactly sequence_len
    """
    sequence_tokens = [] if tokenizer.bos_token_id is None else [tokenizer.bos_token_id]
    for tokens in tqdm(token_batch):
        tokens = tokens.tolist()
        assert len(tokens) > 0, "empty token list"
        assert tokens[-1] == tokenizer.eos_token_id, "token lists must end with EOS"

        idx = 0
        # If present, skip the auto-generated BOS token
        if tokenizer.bos_token_id is not None and tokens[0] == tokenizer.bos_token_id:
            idx += 1

        while idx < len(tokens):          
            next_eos_idx = tokens.index(tokenizer.eos_token_id, idx)
            chunk = tokens[idx:next_eos_idx + 1]
            assert len(chunk) <= sequence_len, "chunk exceeds sequence length"
 
            if len(sequence_tokens) + len(chunk) > sequence_len:
                sequence_tokens.extend([tokenizer.eos_token_id] * (sequence_len - len(sequence_tokens)))
                yield sequence_tokens
                sequence_tokens = [] if tokenizer.bos_token_id is None else [tokenizer.bos_token_id]

            sequence_tokens.extend(chunk)
            idx += len(chunk)

    if len(sequence_tokens) >= sequence_len / 2:
        sequence_tokens.extend([tokenizer.eos_token_id] * (sequence_len - len(sequence_tokens)))
        yield sequence_tokens

Which then gets called like this:

    dataset = dataset.map(lambda x: tokenizer(x['text']), batched=True, batch_size=10, remove_columns=dataset.column_names, desc='tokenizing', num_proc=num_proc)
    dataset = dataset.map(lambda x: {'input_ids': list(yield_sequences_from_token_batch(tokenizer, x['input_ids'], sequence_len))}, batched=True, batch_size=None, remove_columns=dataset.column_names, desc='splitting')
    # Set labels for EOS tokens -100 to exclude them from training gradient calculations
    dataset = dataset.map(
        lambda x: {
            'attention_mask': torch.ones_like(x['input_ids']),
            'labels': torch.where(x['input_ids'] == tokenizer.eos_token_id, torch.full_like(x['input_ids'], -100), x['input_ids'])
        },
        desc='adding attention_mask and labels (with EOS labels set to -100)'
    )

to ensure the <EOS> tokens are attended to, but not used for gradient calculations (which would bias the response lengths of the fine-tuned model).

This also means I can right-pad all the batches up to the desired sequence length using <EOS> tokens.


Transformers only has a 1D attention_mask so I can't do proper sample packing without using this:

https://github.com/MeetKai/functionary/tree/main/functionary/train/packing

BUT: I'm not convinced this is actually beneficial, as during pre-training the LLMs were trained on data that looks just like what I am giving them, eg:

<BOS> sample text 1 <EOS> sample text 2 <EOS>...

and the interference might actually be beneficial and force the fine-tune to concentrate better on each example with the surrounding "noise".


So now we have a dataset format that is sequence-length agnostic (eg: large clever models won't get hugely lower/different losses) and no longer biases the response length (due to masking the <EOS> labels for gradient calculations) to be shorter or longer.

We also have much higher entropy training data due to the randomised paragraphs being looked at in isolation (eg: things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...).

BUT: The most exciting possibility is to add some contextual text before each paragraph (or group of paragraphs if it turns out to be needed), such as: the author's name, book title, genre and so on, which can then be masked in the same way as the <EOS> tokens (in a similar way to instruction tuning "prompt-masking" method). So the model should then be able to learn the association between the contextual meta-data and the style of writing!!!

For the time being I am just going back to using stock cross-entropy loss (ie: no attempt to increase the entropy of the outputs), and just using the 1.1M randomised paragraphs as outlined above to hopefully get something much closer to the "multiplicative control-vectors" that I set out to create right at the start, but the possibilities this new dataset method opens up are huge IMO.

Another benefit of this is that it trains in about 1/2 the time as before, partly due to removing the 40% of the "dross" from the old books files converted to text, but also because I can now increase the batch size right up to the GPU memory limit and not worry that large/smart models with long context can just memorise everything easily; all models should now face the same prediction task, with a similar starting loss regardless of the batch size or their native context length.

I look forward to seeing the result!

So to make sure I understand, you're essentially doing the equivalent of this "train on completions" prompt-masking like unsloth supports, but since there's no instruction prompt, you're only masking the <EOS> tokens:

https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=vITh0KVJ10qX

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

Extract all paragraphs that are between 200 and 2000 characters

I like this idea, that's actually a really simple way to get rid of the junk.

Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.

So this would also teach the model to end every turn with \n\n

things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...

I've read your post a few times, but I'm not understanding why/how this part would work?

So to make sure I understand, you're essentially doing the equivalent of this "train on completions" prompt-masking like unsloth support, but since there's no instruction prompt, you're only masking the :

https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=vITh0KVJ10qX

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

Yeah, setting the label to -100 like this causes it to get set to the same value as is used for "causal masking", which means it gets ignored for the loss calculations but still gets used for the attention mechanism (the attention_mask can be used for padding tokens to both ignore them for the gradient calculation and make them effectively "invisible" to the attention mechanism, but that's not what we want here).
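A tiny concrete example of that distinction (made-up token IDs):

import torch

input_ids      = torch.tensor([5, 1010, 2077, 3055, 6])     # <BOS> tok tok tok <EOS>
attention_mask = torch.tensor([1,    1,    1,    1, 1])     # EOS stays visible to attention
labels         = torch.tensor([5, 1010, 2077, 3055, -100])  # ...but is excluded from the loss

# A padding token, by contrast, would get attention_mask = 0 *and* label = -100,
# making it invisible to the model as well as to the loss.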

Extract all paragraphs that are between 200 and 2000 characters

I like this idea, that's actually a really simple way to get rid of the junk.

Yeah, I found you can go smaller but the 50-100 character paragraphs in isolation give so little leading context that they aren't likely to be very useful, and by choosing ~200 characters you 100% remove all the useless junk like page numbers, tables of content, etc.

The reason for setting an upper limit is that things like markdown quotations using > characters can create long run-on "paragraphs" that are really several paragraphs joined.

Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.

So this would also teach the model to end every turn with \n\n

I'm hoping it will just learn to end every paragraph with \n\n as it's not actually getting any loss calculated for the following <EOS> token and it should just appear similar to training on larger texts that the model just happens to only see the first paragraph of.

things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...

I've read your post a few times, but I'm not understanding why/how this part would work?

Imagine I give you several chapters of a book to read. If you learn the protagonist is called "Tom" in chapter 1 then the point where you learn his name there could be a huge range of possible names (very high entropy), but as soon as you know his name is "Tom" then the range of valid names drops to just a single possibility (very low entropy).

If these several chapters can fit in a context of 16k or 32k tokens then each time you are about to generate the name "Tom" you aren't really going to get any gradient information from it as the model will be near 100% correct.

On the other hand, if you mix these same chapters up with 1000 other books' chapters, and then force the model to look at just a single paragraph (or possibly a handful of paragraphs), then the model will be left guessing much more and have to use the very sparse preceding context to guess the valid range of names based on whatever clues it can glean from it (ie: locale, sex, other nouns, etc).

This is quite an interesting article on prompt masking / prompt weighting:

https://towardsdatascience.com/to-mask-or-not-to-mask-the-effect-of-prompt-tokens-on-instruction-tuning-016f85fd67f4

(just open in an incognito tab if it won't show - it's pretty rare I ever find anything useful on Medium, but this is one rare case)

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.
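A hedged sketch of how that masked metadata could be prepended per paragraph (the metadata format is made up; the point is just that its labels get the same -100 treatment as the <EOS> tokens):

def build_example(tokenizer, paragraph, author, genre):
    # Prepend contextual metadata whose labels are masked, so the model can
    # condition on it without ever being trained to generate it.
    meta = f"Author: {author} | Genre: {genre}\n"
    meta_ids = tokenizer(meta, add_special_tokens=False).input_ids
    para_ids = tokenizer(paragraph + "\n\n", add_special_tokens=False).input_ids

    input_ids = meta_ids + para_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(meta_ids) + para_ids + [-100]  # only the paragraph contributes to the loss
    return {"input_ids": input_ids, "labels": labels}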

I should know by tomorrow if it has any potential, as currently training on top of command-r:32b (new version) which is more prone to sloppy writing...

I just need to be careful of overfitting though, as 40% of my data has been pruned away and I now only have around ~100M tokens, and even a rank-16 LoRA on command-r:32b is ~10M trainable parameters... I don't want to reject this method thinking it's broken, but later find it was because of overfitting! So back to using a more conservative rank, lora_dropout and weight_decay to hopefully mitigate the chance of this.

It is definitely learning something:

Screenshot_20241216-224828.png

but the changes to the output will likely be very conservative if it isn't broken.

I've just noticed some interesting stuff about the Cohere tokeniser:

https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenizer_config.json

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<PAD>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<UNK>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<CLS>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<SEP>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<MASK_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<BOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<EOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<EOP_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255000": {
      "content": "<|START_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255001": {
      "content": "<|END_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255002": {
      "content": "<|YES_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255003": {
      "content": "<|NO_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255004": {
      "content": "<|GOOD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255005": {
      "content": "<|BAD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255006": {
      "content": "<|USER_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255007": {
      "content": "<|CHATBOT_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255008": {
      "content": "<|SYSTEM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    ...
  },
  "bos_token": "<BOS_TOKEN>",
  "eos_token": "<|END_OF_TURN_TOKEN|>",
  ...
}

They used an actual <EOS_TOKEN> (and <EOP_TOKEN>) token during pre-training, but then it got switched to "eos_token": "<|END_OF_TURN_TOKEN|>" during fine-tuning.

Also the use of <CLS>, <SEP> and <MASK_TOKEN> during pre-training likely means it was trained (at least partly) using non-causal data (ie: like BERT where it gets to see the future tokens and has to fill in the masked/middle tokens):

https://huggingface.co/docs/transformers/en/main_classes/tokenizer

https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/

It looks like llama3 might have done something similar with its tokeniser:

<|end_of_text|>

Model will cease to generate more tokens. This token is generated only by the base models.

<|eom_id|>

End of message. A message represents a possible stopping point for execution where the model can inform the executor that a tool call needs to be made. This is used for multi-step interactions between the model and any available tools. This token is emitted by the model when the Environment: ipython instruction is used in the system prompt, or if the model calls for a built-in tool.

<|eot_id|>

End of turn. Represents when the model has determined that it has finished interacting with the user message that initiated its response. This is used in two scenarios:

at the end of a direct interaction between the model and the user
at the end of multiple interactions between the model and any available tools

This token signals to the executor that the model has finished generating a response.


This makes me wonder if we can still use these tokens for fine-tuning if we set the labels to -100?

I'm gonna test using each of these:

  • <SEP>
  • <EOP_TOKEN>
  • \n + <EOS_TOKEN>
  • \n + <|END_OF_TURN_TOKEN|>
  • \n + \n + <EOS_TOKEN>
  • \n + \n + <|END_OF_TURN_TOKEN|>

to delimit the paragraphs (with the label set to -100), and see what it does to the losses for command-r:32b (I'm currently running \n + \n + <|END_OF_TURN_TOKEN|>).

I don't think using <EOS_TOKEN> or <|END_OF_TURN_TOKEN|> without any new lines prepended makes much sense, but from reading the paper (which I re-linked below after my post above vanished) the use of <EOP_TOKEN> and <SEP> are worth trying.

One of my posts just vanished above, but in it I linked these two:

https://arxiv.org/abs/2004.02251

https://stackoverflow.com/questions/71306070/do-you-need-to-put-eos-and-bos-tokens-in-autoencoder-transformers

and said that the Cohere models' order of token ID numbers makes it look like they might have first pre-trained bi-directionally, then pre-trained causally, then finally fine-tuned.

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.
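Roughly, the picture is something like this (a sketch of the idea only - the shapes and the exact hook point are illustrative, not the real code): each rank-1 pair (lora_A[i], lora_B[i]) adds the direction lora_B[i] to the hidden state, scaled by how strongly the hidden state lines up with the "detector" direction lora_A[i].

import torch

def conditional_control_vectors(h: torch.Tensor,
                                lora_A: torch.Tensor,
                                lora_B: torch.Tensor) -> torch.Tensor:
    """
    h      : (..., hidden_dim)   hidden state at the hooked module
    lora_A : (rank, hidden_dim)  "detector" directions
    lora_B : (rank, hidden_dim)  "control vector" directions
    """
    gates = h @ lora_A.T          # (..., rank): how strongly each detector fires
    return h + gates @ lora_B     # add each control vector, scaled by its gate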

Glad to see you're giving this model a go for us 24gb and below users :-)

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.

Glad to see you're giving this model a go for us 24gb and below users :-)

Well, if I can get this working properly then I think it should work with smaller models too:

I think the reason @tdrussell 's "Instruct-Storywriter" method didn't work well on small models is because they got a huge drop in loss compared to larger models, whereas this method of using a bunch of randomised paragraphs gets a similar loss for all models, and big models can't rely so much on already having the stories encoded in their weights.

I'm gonna test using each of these:

  • <SEP>
  • <EOP_TOKEN>
  • \n + <EOS_TOKEN>
  • \n + <|END_OF_TURN_TOKEN|>
  • \n + \n + <EOS_TOKEN>
  • \n + \n + <|END_OF_TURN_TOKEN|>

to delimit the paragraphs (with the label set to -100), and see what it does to the losses for command-r:32b (I'm currently running \n + \n + <|END_OF_TURN_TOKEN|>).

After reading the paper I linked above about the use of <SEP> and <EOP_TOKEN>:

The most important observation is that, without EOP, the beginning of the generation is more relevant to the end of the input prompt, but the more it generates, the poor quality is. While the generator with EOP can generate multiple paragraphs related to the input with a reasonable ending but each paragraph is more independent than human writings.

(see Appendix B too)

Added to the fact that my paragraphs are all seen in isolation and randomised; I think actually the only ones I need to try now are:

  • <EOS_TOKEN>
  • \n + <EOS_TOKEN>
  • \n + \n + <EOS_TOKEN>

and:

  • <|END_OF_TURN_TOKEN|>
  • \n + <|END_OF_TURN_TOKEN|>
  • \n + \n + <|END_OF_TURN_TOKEN|>

It only takes around 20 hours per run so I can easily test all of these, but it will be harder to compare the evaluation losses between the different newline variants as the models can probably "cheat" and learn the pattern from earlier examples...

and this bit from the paper:

This observation indicates that GPT2 tends not to generate the EOS following the NL even after fine-tuning, but it can learn better EOS with the help of a new EOP token.

makes me think that adding the newlines right before the <EOS> token might be a bad idea (though I'm not 100% sure this matters if I'm setting the <EOS> label to -100).

So next I will try <|END_OF_TURN_TOKEN|> and <EOS_TOKEN> (with label set to -100) as these should be easier to compare.

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.

Glad to see you're giving this model a go for us 24gb and below users :-)

Well, if I can get this working properly then I think it should work with smaller models too:

I think the reason @tdrussell 's "Instruct-Storywriter" method didn't work well on small models is because they got a huge drop in loss compared to larger models, whereas this method of using a bunch of randomised paragraphs gets a similar loss for all models, and big models can't rely so much on already having the stories encoded in their weights.

Mate, that's awesome! Can't wait to see it.

All this is getting way too complicated and it's unclear exactly what effect all these different ways of breaking up paragraphs are going to have on an instruction-tuned model...

So... I'm just gonna generate my data as before:

Paragraph 1

<EOS>Paragraph 2

<EOS>Paragraph 3

.
.
.
<EOS>Paragraph N-1

<EOS>Paragraph N

<EOS>

Then tokenise this with the <EOS> tokens ensuring each paragraph with the 2 trailing newlines gets tokenised as a whole.

Then use this to just output huge sequences of random paragraphs to train on:

<BOS>Paragraph 1

Paragraph 2

Paragraph 3

.
.
.
Paragraph N-1

Paragraph N

<EOS>
<EOS>
<EOS>

and completely mask out the <EOS> tokens in the same way as <PAD> would be.

It will likely confuse the model somewhat, but may actually be less confusing than attempting to use all these breaking tokens for an instruction-tuned model, and the distribution of newlines in real stories should be retained.

(If it does cause the model to not be able to output any special tokens, then I can deal with that by using a second dataset that is passed through the chat template but with everything except the special tokens masked out. Even if that second dataset is full of horrible slop-ridden stories, it will still hopefully be able to fix the frequencies of the special tokens if needed....)

It's a bit of a dodgy hack, but I've found a way to avoid screwing up the frequencies of the special tokens:

  "added_tokens_decoder": {
    "0": {
      "content": "<PAD>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<UNK>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<CLS>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<SEP>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<MASK_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<BOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<EOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<EOP_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255000": {
      "content": "<|START_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255001": {
      "content": "<|END_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255002": {
      "content": "<|YES_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255003": {
      "content": "<|NO_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255004": {
      "content": "<|GOOD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255005": {
      "content": "<|BAD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255006": {
      "content": "<|USER_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255007": {
      "content": "<|CHATBOT_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255008": {
      "content": "<|SYSTEM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255009": {
      "content": "<|USER_0_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255010": {
      "content": "<|USER_1_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255011": {
      "content": "<|USER_2_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255012": {
      "content": "<|USER_3_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255013": {
      "content": "<|USER_4_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255014": {
      "content": "<|USER_5_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255015": {
      "content": "<|USER_6_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255016": {
      "content": "<|USER_7_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255017": {
      "content": "<|USER_8_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255018": {
      "content": "<|USER_9_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255019": {
      "content": "<|EXTRA_0_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255020": {
      "content": "<|EXTRA_1_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255021": {
      "content": "<|EXTRA_2_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255022": {
      "content": "<|EXTRA_3_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255023": {
      "content": "<|EXTRA_4_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255024": {
      "content": "<|EXTRA_5_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255025": {
      "content": "<|EXTRA_6_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255026": {
      "content": "<|EXTRA_7_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255027": {
      "content": "<|EXTRA_8_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255028": {
      "content": "<|NEW_FILE|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255029": {
      "content": "<|BEGINNING_OF_PREFIX_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255030": {
      "content": "<|BEGINNING_OF_MIDDLE_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255031": {
      "content": "<|BEGINNING_OF_SUFFIX_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255032": {
      "content": "<|END_OF_MIDDLE_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255033": {
      "content": "<|EXTRA_9_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },

by hacking the Triton kernel:



import triton
import triton.language as tl


@triton.heuristics({
    "DO_LOGIT_SCALING": lambda args: args["DO_LOGIT_SCALING"],
})
@triton.jit
def _cross_entropy_backward(
    logits_ptr, logits_row_stride,
    dloss_ptr,   dloss_row_stride,
    logsumexp_ptr,
    labels_ptr,
    VOCAB_SIZE : tl.constexpr,
    BLOCK_SIZE : tl.constexpr,
    DO_LOGIT_SCALING : tl.constexpr,
    LOGIT_SCALE : tl.constexpr,
):
    """
        CE_i = -y log(P) = y * (log[sum(exp(x))] - x)
        dC/dx = d/dx (y * log[sum(exp(x))] - x * y)

        From https://en.wikipedia.org/wiki/LogSumExp
        d/dx logsumexp = exp(x) / sum(exp(x)) = softmax(x)

        dC/dx = y * exp(x) / sum(exp(x)) - d/dx (x * y)
        dC/dx = y * exp[ log[exp(x) / sum(exp(x))] ] using x = exp(log(x)) trick
        dC/dx = y * exp[x - logsumexp] - d/dx (x * y)

        If y == 0: dC/dx = 0
        If y == 1 and x == label: dC/dlabel = exp[x - logsumexp] - 1
        If y == 1 and x != label: dC/dx     = exp[x - logsumexp]
    """
    row_idx   = tl.program_id(0)
    block_idx = tl.program_id(1)

    logits_ptr += row_idx * logits_row_stride.to(tl.int64)
    dloss_ptr  += row_idx *  dloss_row_stride
    col_offsets = block_idx*BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < VOCAB_SIZE
    label_idx = tl.load(labels_ptr + row_idx).to(tl.int32)

    if label_idx != -100:
        dloss = tl.load(dloss_ptr)
    else:
        dloss = 0.0

    x = tl.load(logits_ptr + col_offsets, mask = mask, other = -float("inf")).to(tl.float32)
    if DO_LOGIT_SCALING:
        # d/dx [s * x] = s
        x = LOGIT_SCALE * x
    pass
    logsumexp = tl.load(logsumexp_ptr + row_idx)
    y = tl.exp(x - logsumexp)
    y = tl.where(
        col_offsets == label_idx,
        y - 1.0, # exp(x - logsumexp) - 1
        y,       # exp(x - logsumexp)
    )

    #######################################################
    # Zero out the gradients for the Cohere special tokens.
    y = tl.where(
        (col_offsets <= 7) | (col_offsets >= 255000),
        0.0,
        y,
    )
    #######################################################

    # If y == 0: dC/dx = 0 ==> we already masked it to be = 0, so dloss = 0.
    if DO_LOGIT_SCALING:
        # d/dx [s * x] = s
        y = LOGIT_SCALE * y
    pass
    tl.store(logits_ptr + col_offsets, dloss * y, mask = mask)
pass

so that no gradient information is back-propagated for these tokens.

This should fix the problems regarding the frequencies of these going slowly to zero due to having none of them in your training data!
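For reference, a conceptually equivalent "pure PyTorch" version would just be a backward hook on the logits that zeroes the same columns (this is the approach mentioned further down that I couldn't get working without ~10GB of extra VRAM, since it materialises the full logits gradient; the function name here is illustrative):

import torch

def zero_special_token_grads(logits: torch.Tensor) -> torch.Tensor:
    """Zero the gradient w.r.t. the Cohere special-token logits (sketch only)."""
    def hook(grad: torch.Tensor) -> torch.Tensor:
        grad = grad.clone()
        grad[..., :8] = 0.0        # <PAD> ... <EOP_TOKEN>
        grad[..., 255000:] = 0.0   # <|START_OF_TURN_TOKEN|> ... <|EXTRA_9_TOKEN|>
        return grad
    if logits.requires_grad:
        logits.register_hook(hook)
    return logits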

Now I just need to see what happens when we train on these massive files of:

<BOS>paragaph1

paragraph2

paragraph3

I've found out using:

https://huggingface.co/spaces/Xenova/the-tokenizer-playground

that the above tokenises to:

[5, 35, 138854, 37, 2385, 1786, 16599, 24, 206, 206, 95337, 25, 206, 206, 95337, 26]

with 206 being the newlines.

I'm hoping that by keeping these newlines we DO actually bias their frequency to be closer to actual authors' writing style, but if this fails I can also zero their gradient if need be.

Fingers crossed this works!

Sorry for the lack of updates, but I have still been progressing slowly with this:

  • I'm still getting the weird "extra step" at the end of every training run, but unless I use a cosine-annealed schedule it doesn't seem to make any difference.
  • I've found a much better way to initialise the LoRAs, which lets me run projected gradient descent on lora_A so it stays on the surface of a unit sphere, and then use weight-decay only on lora_B.

I'll post more details and hopefully the v1.0 of command-r:32b before the new year.

I haven't tested it yet, but the new initialization / optimisation may let me bump the Entropy up even further than I could before, but for now I'm just using stock Cross-Entropy loss and no attempt to increase Entropy until I get the hyper-parameters dialed in properly...

I'm still running on the 1.1M random paragraphs dataset and using the "hack" I posted above to avoid the special tokens getting nerfed:

https://github.com/tdrussell/qlora-pipe/discussions/41

I'll be buggered if I can make this work in pytorch without using 10GB extra VRAM (for no apparent reason - even using "chunking"???), but the Triton kernel modification works...

If anybody has any suggestions I'd be very grateful, as currently this dodgy hack will mean the code needs to be edited for every different model :/

Merry Christmas!

2ab07090374e9f9a78cbdf0e304dc8c8.jpg

Merry Christmas @jukofyork @ChuckMcSneed @gghfez and lurkers!

Merry Christmas!

https://huggingface.co/spaces/Xenova/the-tokenizer-playground

This looks useful. I've got a tokenizer issue to investigate myself. I've been using the standard eg:

from transformers import AutoTokenizer
writer_tokenizer = AutoTokenizer.from_pretrained("gghfez/Writer-Large-2411-v2.1")
print(writer_tokenizer.encode("""<BOS>paragaph1

paragraph2

paragraph3"""))

So it looks like for command-r, 206 is 1 linefeed and 2126 is 2 linefeeds.

If anybody has any suggestions I'd be very grateful, as currently this dodgy hack will mean the code needs to be edited for every different model :/

Sorry, what you're doing is beyond my level right now.

Merry Christmas!

Not related to creative writing, but the new QVQ:72B model is insanely impressive:

  1. I gave it an obscure picture of train line map I took at a museum a few months ago: horrible photo, glare reflecting off the perspex in front of it, etc. Then asked it to estimate the date and it absolutely nailed it by looking at the place names, the dates the lines were created and cut, the style of the fonts, and so on!
  2. I gave it a picture of my brother and his wife sitting in front of a waterfall in New Zealand and it looked at the foliage, lighting, water colour and so on to narrow it down and actually got the exact place!
  3. I gave it a picture of my confusing 3-phase electric meter and asked for the reading, and it managed to ignore all the distractions and read the exact value!

I think GeoGuessr will have to start working on their anti-cheat as it's likely better than 99% of the population!!!

Merry Christmas all! Have a great day!

Just starting to upload the v1.0 creative writer models, but noticed you can only have 100GB private storage now... Due to having such poor upload bandwidth I usually make them private until they are finished, but not sure what will happen now?

Just starting to upload the v1.0 creative writer models, but noticed you can only have 100GB private storage now... Due to having such poor upload bandwidth I usually make them private until they are finished, but not sure what will happen now?

I don't 'think' those are enforced limits yet? I guess we will find out.

Can confirm, it's not enforced yet (thank God). Earlier today I pushed (private):

  • A Llama-3.3-Instruct-70b finetune @ fp16
  • A Llama-3.2-90b-vision with that ^ 70b merged into it @ fp16
  • Several LoRA checkpoints, a couple of tokenizers and some mistral-large hidden_state files.
    Oh and I gguf-my-repo'd a Qwen2.5-32b finetune privately.

All worked fine.

Tried having Flux give me pictures about disk storage space police... it did not understand the assignment... fixated on 'space police' :D

Tried having Flux give me pictures about disk storage space police... it did not understand the assignment... fixated on 'space police' :D

image.png
Cinematic shot from retro 80s cop movie. The room is full of hard drives. Like a lot of hard drives. The hard drives are scattered everywhere. Piles of hard drives can be seen in the background. Two police officers with pistols are busting through the door. Both of the police officers are wearing uniforms with text "HuggingFace" and a big hugging face emoji.

I might try to upload all the cohere-based models before opening the repos then, as the 32b is likely to be not that great compared to the 35b and the 104b based off the old command-r-plus model (I may even try to create an 8b using aya-expanse-8b [or even aya-23-8B]). So:

  • creative-writer-v1.0-32b
  • creative-writer-v1.0-35b
  • creative-writer-v1.0-104b
  • (and possibly) creative-writer-v1.0-8b

I've also found that increasing the Entropy is best done via a second epoch using the same training data (or otherwise the momentum-based optimisers like Adam massively overshoot and/or try to make the norm of the hidden state smaller to "cheat"). I'm going to call these models "creative-writer-plus" and use around the same value as I used for the bravo experimental models, as this seems to give a good balance between increasing the Entropy vs making the model not follow instructions quite as well. So:

  • creative-writer-plus-v1.0-32b
  • creative-writer-plus-v1.0-35b
  • creative-writer-plus-v1.0-104b
  • (and possibly) creative-writer-plus-v1.0-8b

I'm actually really happy with the hyper-parameters and extra code used to train these now, and it will likely take me 1-2 days to write the README.MD file.

So on top of all I wrote about the use of "Multiplicative-LoRAs" (which are explained in the README.MD of the experimental models), here is a rough draft of what I am now doing:

After each step of the optimiser, I then use this custom code:

import torch


def apply_lora_norm_regularization(model, config, current_lr):
    assert config['lora_alpha'] == config['lora_rank'], "Used `nn.init.orthogonal_` so must have: alpha = r"
    weight_decay = config['optimizer'].get('lora_weight_decay', 0.0)
    lora_B_scaler = 1.0 - (current_lr / config['optimizer']['lr']) * weight_decay if weight_decay > 0 else 1.0
    for name, param in model.named_parameters():
        if 'lora_A' in name:
            # Project each row of lora_A back onto the surface of the unit ball
            with torch.no_grad():
                param.div_(param.norm(p=2, dim=1, keepdim=True))  # TODO: Check this works for bfloat16
        elif 'lora_B' in name and lora_B_scaler < 1.0:
            # Shrink each column of lora_B back towards the origin
            with torch.no_grad():
                param.mul_(lora_B_scaler)  # TODO: Check this works for bfloat16

To:

  1. Perform projected gradient descent (sorry no Wiki page for it?) on lora_A to enforce the unit length (but not the semi-orthogonality; although this is possible and might be worth trying in the future - see section 2.1 of Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks). It might also be worth projecting the gradients onto the tangent space as detailed in the top reply to this post, but I can't easily try this without altering lots of scary Deepspeed stuff in qlora-pipe (plus I'm not 100% convinced it is actually the correct thing to do anyway...).
  2. Perform fully decoupled weight decay on lora_B ONLY.

This essentially makes each pair of vectors lora_A[i] x lora_B[i] in the outer-product act like the "conditional control vectors" I envisioned when I set out to do this about 6 months ago...
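For context, the call site for the function above would hypothetically look something like this (the optimizer / lr_scheduler / dataloader names are just illustrative - the real loop lives inside qlora-pipe):

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    lr_scheduler.step()
    # After every optimiser step: project each row of lora_A back onto the unit
    # sphere, and apply the fully decoupled weight decay to lora_B only.
    apply_lora_norm_regularization(model, config, current_lr=lr_scheduler.get_last_lr()[0])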

On top of this, I now have the following changes to the dataset creation process:

  1. I start with ~185M tokens extracted from ~1000 books.
  2. I then extract all the paragraphs, which are then heuristically filtered to leave approximately 1.1M definite/clean paragraphs (~135M tokens in total).
  3. The 1.1M paragraphs are then randomly concatenated into blocks of 8192 tokens, separated by double newlines, and right-padded with a small number of EOS tokens where needed, eg:
<BOS>paragraph1

paragraph2

paragraph3
.
.
.
paragraph n-1

paragraph n

<EOS><EOS><EOS><EOS><EOS>

The EOS tokens used for padding all have their labels set to -100 and their attention_mask flags set to 0 (to avoid training on them).
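A rough sketch of what that packing looks like (assuming paragraph_ids is a list of already-tokenised paragraphs, each ending with its double-newline tokens; illustrative only, not the exact dataset code):

import random

def pack_block(paragraph_ids: list[list[int]], bos_id: int, eos_id: int,
               block_len: int = 8192) -> dict:
    random.shuffle(paragraph_ids)
    input_ids = [bos_id]
    for ids in paragraph_ids:
        if len(input_ids) + len(ids) > block_len:
            break
        input_ids.extend(ids)
    labels = list(input_ids)
    attention_mask = [1] * len(input_ids)
    # Right-pad with <EOS>: label -100 and attention_mask 0 so it is never trained on.
    while len(input_ids) < block_len:
        input_ids.append(eos_id)
        labels.append(-100)
        attention_mask.append(0)
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}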

Also, to avoid the slow decay of the special token probabilities, all special tokens have their gradient set to zero in the backward function of the Triton kernel (which is approximately equivalent to assuming that the output probability of each special token is always exactly equal to its target, and that the 1-hot targets are corrected to account for this so the target vector still sums to unity).

Finally, we use the "Focal Loss Star" loss function with gamma = 1.01 to maintain the entropy during training, eg:

image.png
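For anyone who can't see the image: this is from the focal-loss family (Lin et al., 2017). The standard (non-starred) form reweights the per-token cross-entropy as

FL(p_t) = -(1 - p_t)^gamma * log(p_t)

where p_t is the probability the model assigns to the target token; gamma = 0 recovers plain cross-entropy and larger gamma down-weights tokens the model already predicts confidently. The "starred" variant shown above is (presumably) the alternative form given in that paper's appendix, so with gamma only slightly above 1 the reweighting is deliberately mild.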


The "plus" variants of the models will just use the same config file and dataset but use the "Focal Loss Star" loss function with gamma = 1.1 to adjust the merged LoRA model from the previous stage.

The only way forward now would be to start scaling up the dataset size: even using a rank-16 LoRA for the down_proj matrices only, the number of tokens per tunable parameter is pretty low (eg: around 15 tokens/parameter for the 32b and 35b models, and less than 5 tokens/parameter for the bigger models). There is probably a (very) significant amount of noise getting added to the 8192+ dimension vectors compared to the control-vector training, where I used several orders of magnitude more tokens per tunable parameter...

I think this could be accomplished by leaving the config file completely unchanged (ie: weight_decay, learning_rate, etc) and just increasing the batch size as we get more samples in the future. Increasing the rank probably doesn't make much sense either, and I'm only using rank-16 as this seems to be the smallest value I could get working without too much risk of getting stuck in saddle points... The possible combinations of all the rank-16 vectors in all the n-1 layers is really high anyway; it's the low tokens per tunable parameter that is likely going to be the biggest problem.
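As a rough back-of-the-envelope check of those tokens-per-parameter numbers (the hidden sizes and layer counts below are my own assumptions, as is the shape of the multiplicative LoRA being rank x hidden_size for both lora_A and lora_B on each down_proj):

rank, tokens = 16, 135e6  # ~135M training tokens

# (model, assumed hidden_size, assumed number of layers)
for name, hidden_size, num_layers in [("command-r 32b/35b", 8192, 40),
                                      ("command-r-plus 104b", 12288, 64),
                                      ("mistral-large 123b", 12288, 88)]:
    params = 2 * rank * hidden_size * num_layers  # lora_A + lora_B per down_proj
    print(f"{name}: {params / 1e6:.1f}M trainable params, "
          f"{tokens / params:.1f} tokens/param")

# Gives roughly ~13 tokens/param for the 32b/35b and ~4-5 for the bigger models,
# i.e. the same ballpark as the figures quoted above.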

I have a shit-ton of books that I can create the datasets from, but even with 6x A6000s it's not really feasible to scale up much more and it would need a rented GPU cluster to train for the larger 70B+ parameter models.

One final thing: don't get your hopes up too much for models like qwen and llama-3 - this process can (subjectively) make an already good creative-writing model (slightly) better and/or increase its Entropy back towards real natural language, but it can't make a horrible creative-writing model into a good one... :)

It's also unlikely to massively improve the dreaded "slop attractor phrases" as it's only considering single-token Entropy and never rolling out to the generation-space where some of these live (it will probably help somewhat with this though).

Tried having Flux give me pictures about disk storage space police... it did not understand the assignment... fixated on 'space police' :D

Cinematic shot from retro 80s cop movie. The room is full of hard drives. Like a lot of hard drives. The hard drives are scattered everywhere. Piles of hard drives can be seen in the background. Two police officers with pistols are busting through the door. Both of the police officers are wearing uniforms with text "HuggingFace" and a big hugging face emoji.

Love it!

One final thing: don't get your hopes up too much for models like qwen and llama-3 - this process can (subjectively) make an already good creative-writing model (slightly) better and/or increase its Entropy back towards real natural language, but it can't make a horrible creative-writing model into a good one... :)

It's also unlikely to massively improve the dreaded "slop attractor phrases" as it's only considering single-token Entropy and never rolling out to the generation-space where some of these live (it will probably help somewhat with this though).

Making comparatively good models better works for me. Need to put some of these full precision models on ice just in case. Though, I guess there isn't just one copy of anything anymore. :D

Cinematic shot from retro 80s cop movie. The room is full of hard drives. Like a lot of hard drives. The hard drives are scattered everywhere. Piles of hard drives can be seen in the background. Two police officers with pistols are busting through the door. Both of the police officers are wearing uniforms with text "HuggingFace" and a big hugging face emoji

That's awesome, you're good at prompting diffusion models.

never rolling out to the generation-space where some of these live

What is the "generation space"?

I'm actually really happy with the hyper-parameters and extra code used to train these now, and it will likely take me 1-2 days to write the README.MD file.

That's going to be a goldmine of information!

creative-writer-v1.0-32b

I'm hoping this one is good given we get GQA and it doesn't write random Russian/Chinese letters when quantized :)

never rolling out to the generation-space where some of these live

What is the "generation space"?

If the phrase "shivers down her spine" isn't in the training corpus, or is in the training corpus but appears far less often than the model would use it if left to write its own text, then it will never get the feedback needed to stop it outputting this... The only way it can get some feedback is by seeing lots of other things that are not "shivers down her spine" and hoping that these drown it out.

This is opposed to the case where you allow it to roll out further in its own "generation space" and use something like DPO to give feedback on its own generations.

The creative-writer-v1.0-35b is really interesting! It has perhaps picked up on some weird formatting from the paragraphs training data (eg: starting paragraphs indented by a space or using triple newlines), but it only does this if you just start off with a blank slate, and not always. If you start it off with a bit of the title or edit the opening phrase it seems better, and I think it will be fine if you stop and correct the first couple of paragraphs if needed...

BUT: it seems to be very creative and much more interesting than the creative-writer-v1.0-32b fine-tune; using more internal monologues, etc.


I'm really excited now to see what creative-writer-v1.0-104b turns out like as the 35b model always was a bit flaky with the formatting before (as @gghfez found with his BSG story prompt).

Now you're just teasing...

Now you're just teasing...

It's uploading now (I already have the 32b uploaded but set private currently).


I've also just figured out what probably screwed the mistral-large training run! The default for Deepspeed when you have 3 machines with 2 pipeline stages is to distribute like this:

Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoord(pipe=0, data=1): 1, ProcessCoord(pipe=0, data=2): 2, ProcessCoord(pipe=1, data=0): 3, ProcessCoord(pipe=1, data=1): 4, ProcessCoord(pipe=1, data=2): 5}

Which probably makes sense for GPU clusters doing full fine-tuning:

class PipeDataParallelTopology(ProcessTopology):
    """ A topology specialization for hybrid data and pipeline parallelism.

        Uses data parallelism on the last dimension to encourage gradient
        reductions to use high-bandwidth intra-node links and lower-volume
        pipeline communications to use low-bandwidth inter-node links.
    """

    def __init__(self, num_pp, num_dp):
        super().__init__(axes=['pipe', 'data'], dims=[num_pp, num_dp])

but when you are training LoRAs it is a disaster... Each machine has to use the network to pass on a 12288 sized vector of floats for each token, but then every 10 minutes or so passes around ~10 x 12288 sized vector of floats to do the reduce step for the gradients (using the NVLink or PCI-bus).

This is completely backwards for what we want, and I've fixed it now:

class CustomPipeDataParallelTopology(ProcessTopology):
    """A topology specialization for hybrid data and pipeline parallelism with swapped axes."""

    def __init__(self, num_pp, num_dp):
        # Swap the axes and dims to change the rank mapping
        super().__init__(axes=['data', 'pipe'], dims=[num_dp, num_pp])

Which gives:

Using topology: {ProcessCoord(data=0, pipe=0): 0, ProcessCoord(data=0, pipe=1): 1, ProcessCoord(data=1, pipe=0): 2, ProcessCoord(data=1, pipe=1): 3, ProcessCoord(data=2, pipe=0): 4, ProcessCoord(data=2, pipe=1): 5}

So now I have 3 copies of the model (the data axis above) spread over the 3 machines, and each machine has the model split between their 2 GPUs (pipe axis above).

Using a 10gbit connection the data use probably didn't matter that much (but it still seems dumb to pass several TB of data through the network and a few MB through the NVLink bridge...), but I think this may have caused some problem with saving the checkpoints due to having the LoRA spread weirdly through multiple machines like it was...

Looks to be working and possibly quite a bit faster too:

GPU-SERVER-1: [2024-12-28 17:58:05.312] [INFO] [qlora-pipe] step:     1 /   562 loss: 2.8461 iter time (s): 533.796 samples/sec: 0.056 eta: 83h10m

(IIRC, the broken mistral-large run was 200 hours for 185M tokens through 123B parameters, and this is 83 hours for 135M tokens through 104B parameters)

So should have the results in around 4.5 days from now!
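A quick sanity check of the swapped mapping (this assumes DeepSpeed's ProcessTopology.get_rank API and just reproduces the topology printed above):

from deepspeed.runtime.pipe.topology import ProcessTopology

# axes=['data', 'pipe'] with dims=[3, 2]: both pipeline stages of each
# data-parallel copy get consecutive ranks, so they land on the same machine.
topo = ProcessTopology(axes=['data', 'pipe'], dims=[3, 2])
print(topo.get_rank(data=1, pipe=0))  # -> 2
print(topo.get_rank(data=1, pipe=1))  # -> 3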

Now you're just teasing...

It's uploading now (I already have the 32b uploaded but set private currently).


I've also just figured out what probably screwed the mistral-large training run! The default for Deepspeed when you have 3 machines with 2 pipeline stages is to distribute like this:

Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoord(pipe=0, data=1): 1, ProcessCoord(pipe=0, data=2): 2, ProcessCoord(pipe=1, data=0): 3, ProcessCoord(pipe=1, data=1): 4, ProcessCoord(pipe=1, data=2): 5}

Which probably makes sense for GPU clusters doing full fine-tuning:

class PipeDataParallelTopology(ProcessTopology):
    """ A topology specialization for hybrid data and pipeline parallelism.

        Uses data parallelism on the last dimension to encourage gradient
        reductions to use high-bandwidth intra-node links and lower-volume
        pipeline communications to use low-bandwidth inter-node links.
    """

    def __init__(self, num_pp, num_dp):
        super().__init__(axes=['pipe', 'data'], dims=[num_pp, num_dp])

but when you are training LoRAs it is a disaster... Each machine has to use the network to pass on a 12288 sized vector of floats for each token, but then every 10 minutes or so passes around ~10 x 12288 sized vector of floats to do the reduce step for the gradients (using the NVLink or PCI-bus).

This is completely backwards for what we want, and I've fixed it now:

class CustomPipeDataParallelTopology(ProcessTopology):
    """A topology specialization for hybrid data and pipeline parallelism with swapped axes."""

    def __init__(self, num_pp, num_dp):
        # Swap the axes and dims to change the rank mapping
        super().__init__(axes=['data', 'pipe'], dims=[num_dp, num_pp])

Which gives:

Using topology: {ProcessCoord(data=0, pipe=0): 0, ProcessCoord(data=0, pipe=1): 1, ProcessCoord(data=1, pipe=0): 2, ProcessCoord(data=1, pipe=1): 3, ProcessCoord(data=2, pipe=0): 4, ProcessCoord(data=2, pipe=1): 5}

So now I have 3 copies of the model (the data axis above) spread over the 3 machines, and each machine has the model split between their 2 GPUs (pipe axis above).

Using a 10gbit connection the data use probably didn't matter that much (but it still seems dumb to pass several TB of data through the network and a few MB through the NVLink bridge...), but I think this may have caused some problem with saving the checkpoints due to having the LoRA spread weirdly through multiple machines like it was...

Any idea when you'll be releasing the 32b to the public? I'm curious to know, does your dataset have any Sci-Fi in it?

Yeah, this looks to be around 50% faster! The old topology was probably even worse for LoRAs than it seemed as each of the 3 copies were probably all contending for the network at the same time to try to pass all their outputs between all the different possible combinations of stages :/

I think the next mistral-large training run will only take around 5 to 5.5 days to complete (instead of 9-10 days).

Any idea when you'll be releasing the 32b to the public?

I'll open them both up with a blank readme page tomorrow - but hold off on any judgments until we see what happens for the big models, as from past testing the smaller models are much more susceptible to being damaged by this process!

I'm curious to know, does your dataset have any Sci-Fi in it?

Yeah, there are probably around 10-15% Sci-Fi books in the training data, but I wouldn't expect it to be all that important:

  • This isn't really a "fine-tune" that compares to what other people upload (ie: that can specifically "learn" things from the dataset used to train it - it's just a jumbled up bunch of paragraphs!).
  • Think of it more as a "recalibration" to try to get the writing style back to what a more "normal" pre-LLM authors' style would be.

Hopefully it will be clearer what I mean when we get the first big model finished :)

Any idea when you'll be releasing the 32b to the public?

I'll open them both up with a blank readme page tomorrow - but hold off on any judgments until we see what happens for the big models, as from past testing the smaller models are much more susceptible to being damaged by this process!

I'm curious to know, does your dataset have any Sci-Fi in it?

Yeah, there are probably around 10-15% Sci-Fi books in the training data, but I wouldn't expect it to be all that important:

  • This isn't really a "fine-tune" that compares to what other people upload (ie: that can specifically "learn" things from the dataset used to train it - it's just a jumbled up bunch of paragraphs!).
  • Think of it more as a "recalibration" to try to get the writing style back to what a more "normal" pre-LLM authors' style would be.

Hopefully it will be clearer what I mean when we get the first big model finished :)

Sounds good! It'll be fun for me to have a mess about with it! Thanks man.

I think the bigger the drop in loss, the more the risk of damage:

image.png

So we should (hopefully) see the magenta line drop much less.

When I tried to run this on the really small models; I got huge drops and the models were pretty much lobotomised as a result :/

This paper explains this phenomenon a bit too:

https://arxiv.org/abs/2405.09673


I'm already very heavily regularising the model though:

image.png

far (far, far!) more than 99.9% of fine-tunes here and this can be reduced even more if needed (at the cost of the fine-tuning process having less and less effect...).

It's unlocked now:

https://huggingface.co/jukofyork/creative-writer-32b-preview


Looks like they are enforcing the limits now:

403 Forbidden: Private repository storage limit reached, please upgrade your plan to increase your private storage limit.

So had no choice but to unlock it early to upload the 35b model :/

Nice!

I just got temporary access to DeepSeek v3 FP8 and I was trying to find the prompts that were being used for our test runs? Since the original thread is toast I don't see them.

So had no choice but to unlock it early to upload the 35b model :/

They must be rolling that out in stages??

So I'm guessing you don't want quants uploaded?

I guess I'll be setting public a lot of broken model experiments soon (eg. a WizardLM2-8x22b hack which only responds in Spanish lol)

I just got temporary access to DeepSeek v3 FP8 and I was trying to find the prompts that were being used for our test runs? Since the original thread is toast I don't see them.

They were in the old version's thread of doom. I've got some of them in my history from testing:

https://pastebin.com/pdrqEB1M
Password: NoSlop4U

There's also this:

https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0/discussions/1

They were in the old version's thread of doom. I've got some of them in my history from testing:

Thanks! Were the outputs Temp 0 , Temp 1 or something else?

I've only got that info for one of them:
Temperature=0, no system prompt and looks like using the command-r chat template

So I'm guessing you don't want quants uploaded?

Feel free to do whatever with them - it's because I'm uploading each file 1 at a time that I would rather not open the repo until it's done or else people might miss some of the files :/

It's good (testing unquantized) side-by-side with the official cohere model. 10 prompts so far.

Cohere model has given me Elara in 6 of them. Yours hasn't given me any of the slop I usually notice.
Even when it wants to write some of the slop-like phrases such as "her hand hovering over the hilt of her dagger" (slop) it writes it differently, like starting a new sentence and writing "She clutched her dagger".

It sometimes gives me these humorous disclaimers too (I get that this probably wasn't intentional but it's funny):

"""
Author's Note: All prose, names, and terminology are entirely fictional and any coincidental similarities to real-world counterparts are unintended. Absolutely no elves or bards were harmed in the writing of this mock-up novel chapter, although the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
"""

the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.

LOL!

It does lose narrative cohesion sometimes though, for example, in 3 stories it changed from first -> third person perspective. And it seems to have a "darkness" kind of bias towards it, almost like when I apply a control-vector.

As it is now, this is already a very good model despite that ^. Normally you'd need a difficult-to-run model like Mistral-Large or the non-GQA command-r/r+ to get output like this. Huge improvement over the standard 32b version. I'm going to mlx-quant it so I can run it on my mac while I'm away in January.

Thank you for this!

The 35b is uploaded and open now:

https://huggingface.co/jukofyork/creative-writer-35b-preview

This one seems to have slightly weird/broken formatting if you just try to start stories from a "blank slate", but it seems very creative when it does this (possibly due to using different tokenisation via a leading space?).

And it seems to have a "darkness" kind of bias towards it, almost like when I apply a control-vector.

Yeah, I wasn't sure if I was imagining this but I thought it seemed pretty dark too? I think the 35b might even be more dark from my brief testing!?

changed from first -> third person perspective

This could be bias in the dataset of books being used.

Author's Note: All prose, names, and terminology are entirely fictional and any coincidental similarities to real-world counterparts are unintended. Absolutely no elves or bards were harmed in the writing of this mock-up novel chapter, although the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.

LOL, no idea at all where this could have come from? A rank-16 LoRA couldn't even have encoded that, so it must have been some latent weirdness it's brought to the surface? :O

Thank you for this!

No problem and hopefully the 104b and mistral-large:123b will work even better!

There is probably a lot more I can try with the dataset formatting, but really need to see how it works on the larger models before making any more changes to the method.

I'm pretty sure the method of making the gradients zero for the special tokens works well too, and is likely the reason that other guy found that you can train LoRAs on the base model to be applied to the instruct model perhaps?

"""
Author's Note: All prose, names, and terminology are entirely fictional and any coincidental similarities to real-world counterparts are unintended. Absolutely no elves or bards were harmed in the writing of this mock-up novel chapter, although the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
"""

the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.

LOL that's awesome...

I should reiterate that I wouldn't be too disappointed if these first versions are a little broken - I have full control over the regularisation now and can regularise over the full continuum of possible models if needed, and there are several ways multiple training runs can be combined to smooth out excessive noise if needed too.

and is likely the reason that other guy found that you can train LoRAs on the base model to be applied to the instruct model perhaps?

I asked him why on discord a while back, he said he didn't know and was just experimenting :D

Yeah, I wasn't sure if I was imagining this but I thought it seemed pretty dark too?

Definitely is. It wrote a pretty brutal story when I prompted it to write about 2 giraffes fighting over the last acacia tree during a drought lol.

If not for the cohere license regarding training on outputs, I'd use this in place of what I did for the dataset for my Mistral-Large-2411 writing model (which was to use control-vectors on Apache2 licensed models to generate the synthetic aspects of the dataset)

I wonder what happened to command-r

image.png

This is full precision, with min_p set to 0.2. Same thing happens with the cohere model (full weights, AWQ, exl2 and GGUF) and on openrouter.

I don't remember it having this problem when they released it. I wonder if an update to transformers caused a regression at some point.

lol at these disclaimers :D

These are a weird one - I can't really explain what could have caused two different models to start doing that?!

This is full precision, with min_p set to 0.2. Same thing happens with the cohere model (full weights, AWQ, exl2 and GGUF) and on openrouter.

Yeah, I don't remember it doing that either...

I did notice that the tied input_embedding and lm_head tensor looks like it might have been scaled (along with all the other models' tensors) to use most of the range of float16. This means that if you scale the logits just a tiny bit, some of the losses go to exactly zero, which they don't if you leave it alone?

changed from first -> third person perspective

I've been thinking about this some more and worry this might end up being a problem for all models due to the "random paragraphs" dataset mixing the different perspectives.

There are a couple of possible solutions:

  • Use blocks of 2+ consecutive paragraphs taken either randomly, or in order, from the same source book.
  • Try to use <BOS>paragraph1\n\n<EOS><BOS>paragraph2\n\n<EOS><BOS>paragraph3\n\n<EOS>... type formatting (I'm reluctant to do this though as the double newline before <EOS> is likely very out-of-distribution).
  • Try to use another model (LLM or otherwise) to classify each book's perspective, and then split the training data so that the same perspectives get put in the same file of random paragraphs.

I will wait until we see what comes from training on the much larger command-r-plus:104b and mistral-large-2:123b first as it may not be such a problem for these.

Yeah, I don't remember it doing that either...

I did notice that the tied input_embedding and lm_head tensor looks like it might have been scaled (along with all the other models' tensors) to use most of the range of float16. This means that if you scale the logits just a tiny bit, some of the losses go to exactly zero, which they don't if you leave it alone?

I've actually just realised I can now train these small models without quantizing them!

I tried this before and found that running 6 x 4bit models was around 2x faster than running 3 x 16bit models split between the 2 GPUs on each machine... BUT: that was because instead of having the two halves of the model split on the same machine and passing their hidden state over the NVLink bridge, I was actually passing it all through the network!

With the fix I mentioned above, running the models in their native float16 (and using float32 for the optimiser states as before) now takes exactly the same training time as when quantised to 4 bits...

I'm gonna rerun the 32b and 35b models using this to see if that fixes some of the weirdness... I can already see that the float16 model has a slightly lower starting loss and starting Entropy, and a slightly higher top-1 accuracy, so this could help quite a lot.

Yeah, there is no weird blip on the log-loss histogram now, so I think float16 will likely fix some of the strange stuff it was doing...

I can probably even train command-r-plus like this: by using 3 pipeline stages with 1 of the stages having to use the network (but it might need a InfiniBand connection if 10gbit is too slow).

Nope, it will need >48GB per card for this and I think 6 stages will have too much overhead :/

Turns out I had some sort of corruption in the cached tokenised training data. Now I've deleted my ~/.cache/huggingface/* folder contents, I'm getting a much lower loss before I even start:

Screenshot_20241231-024234.png

and that was probably why the 35b acted so weird (it is the blue line above).

I didn't even know that folder existed and only found it by chance trying to debug something unrelated :/

fyi - not complaining about the disclaimers, it's fun. I haven't used a command-r-35b in general for a while since the garbage characters bug (got mistral-large since then anyway).

Nope, it will need >48GB per card for this and I think 6 stages will have too much overhead :/

48GB A40 instances are $39c / hr on runpod.io if that helps. Could be useful for your low upload bandwidth issue as well.
EDIT: just realised, you said >48GB (and I recall you already have 48GB GPUs)

the float16 model has a slightly lower starting loss and starting Entropy

Interesting. I would have expected the opposite. So this is different from testing token probabilities during inference then?
When I've tested this, I found that the stronger the quantization, the flatter the distribution of tokens.
Abliterated models did this as well, even at bf16, they behaved more like Q4_K versions of the "non-abliterated" originals.

p.s. I noticed something interesting regarding control-vectors (generally): the occupation of random side characters is often "author" or "writer" lol

Hi @jukofyork first impressions of your Creative Writer Preview have been very positive so far. Thank you and Happy new year!

I think the next version will be better, as I'm pretty sure some tokenisation bug affected the last run:

image.png

I will be having a break over the next few days so gonna just try this training process on the 32b and 35b models:

  1. Train using the "random paragraphs" method (~140M tokens).
  2. Train using the same paragraphs, but put them all back in order from the books they were extracted from with <EOS> tokens separating the books (same ~140M tokens; 1073 books).
  3. Train a 3rd time on the "in order" books, but use Focal Loss* with gamma=1.1 like I did for the "bravo" experimental models.

Stage 2 will use the model with the stage 1 LoRA merged, and then stage 3 will use the model with the stage 2 LoRA merged, and so on.

My hope is that the first stage will force the model to use more natural language as it can't reduce the loss by looking at recent paragraphs, then the second stage will hopefully fix the mixing of the 1st/3rd person POV (and any other general "weirdness") caused by stage 1's "myopia", and then finally stage 3 will try to ramp up the single-token Entropy.

Assuming all goes well, I'll upload each stage of each model as I go along (each run takes 24-25 hours, so will take around a week to do this).

Has anybody else noticed Claude Sonnet has had a lobotomy recently? I thought I was imagining it, but maybe not:

https://www.reddit.com/r/ClaudeAI/comments/1hqi57a/whats_going_on_with_sonnet_35_the_past_few_days/

This seems to be a common practice now: release a good model to the public, then slowly quantise or reroute it to stop bleeding VC cash over the next few months and hope nobody notices :/

I've actually found gpt-4-0125-preview seems to be the least nerfed (maybe they forgot it? lol).

Nearly done with 35b now:

Screenshot_20250102-000851.png

I've worked out what went wrong too:

https://huggingface.co/spaces/Xenova/the-tokenizer-playground

If you tokenise multiple paragraphs you get two separate newline tokens (206) between them:

This is a paragraph.

This is another paragraph.
[5, 4184, 1801, 1671, 42429, 21, 206, 206, 4184, 1801, 4907, 42429, 21]

but if you tokenise each on their own you get the double newline token ( 2126):

This is a paragraph.
[5, 4184, 1801, 1671, 42429, 21, 2126]

and then when you concatenate these you get wildly out-of-distribution data!

I don't really know enough about tokenisers, but this was not what I expected and seems really odd behaviour to me?!

(It also explains why I mysteriously gained 1M+ tokens for my new run - I was super confused where they had come from! 🤣)
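To reproduce this without the playground (the model id here is just assumed to be the same tokeniser family):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Tokenised together, the paragraph break comes out as two single-newline tokens (206, 206)...
print(tok.encode("This is a paragraph.\n\nThis is another paragraph."))
# ...but tokenised on its own, the trailing break becomes the double-newline token (2126).
print(tok.encode("This is a paragraph.\n\n"))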


Anyway, I'm still gonna run the second stage on each of these as I think the switching to/from 1st/3rd person POV will still be a problem.

I've also got qlora-pipe to output metrics about the hidden state going into lm_head which should mean for stage 3 that I can push the Entropy as high as I possibly can before the model breaks down (ie: where it starts "cheating" the Focal Loss* loss by shrinking the hidden state).

Has anybody else noticed Claude Sonnet has had a lobotomy recently? I thought I was imagining it, but maybe not:

https://www.reddit.com/r/ClaudeAI/comments/1hqi57a/whats_going_on_with_sonnet_35_the_past_few_days/

This seems to be a common practice now: release a good model to the public, then slowly quantise or reroute it to stop bleeding VC cash over the next few months and hope nobody notices :/

I've actually found gpt-4-0125-preview seems to be the least nerfed (maybe they forgot it? lol).

I've tried doing some non-standard functionality coding and what surprised me is how bad ALL LLMs are at it. I need to guide them through each little step or they'll fuck up. O1? Decides to randomly change unrelated parts of the code. Gemini? Just dumb. Sonnet? A bit better, but still makes beginner-level mistakes that need to be fixed by hand. At this point I'm feeling like it would have been faster if I'd just coded it myself.

Has anybody else noticed Claude Sonnet has had a lobotomy recently? I thought I was imagining it, but maybe not:

https://www.reddit.com/r/ClaudeAI/comments/1hqi57a/whats_going_on_with_sonnet_35_the_past_few_days/

This seems to be a common practice now: release a good model to the public, then slowly quantise or reroute it to save bleeding VC cash over the next few months and hope nobody notices :/

I've actually found gpt-4-0125-preview seems to be the least nerfed (maybe they forgot it? lol).

I've tried doing some non-standard functionality coding and what surprised me is how bad ALL LLMs are at it. I need to guide them through each little step or they'll fuck up. O1? Decides to randomly change unrelated parts of the code. Gemini? Just dumb. Sonnet? A bit better, but still makes beginner-level mistakes that need to be fixed by hand. At this point I'm feeling like it would have been faster if I'd just coded it myself.

Yeah, I think even o1-preview has been quietly nerfed :/

Talk of the different tokenization of single and multiple paragraphs reminds me of something. I don't know how useful this is, but I remember when I was messing around implementing the novelai api in a writing program, they were doing something unusual with their llama finetune - you have to strip the last token off the context when sending text in. Apparently llama has many ways to tokenize the end of the sentence/paragraph and it was causing issues, whereas stripping it let the model continue in a more creative way. Probably not a useful thought but I thought I'd write it down here since this seemed like the most appropriate place ;p. I've actually taken this to an extreme in the past, having entire paragraphs stripped from the end of every generation because I found continuations were better if I started a paragraph back rather than at the end of the previous gen. It's possible I was "solving" the same issue by accident, explaining the improved text quality across the story.

Or, maybe I'm just a fool sitting here fooling around with magical talk-boxes imagining things :).

Anyway, hey. Somehow control vectors slipped below my radar. Look forward to jumping in and messing with them. What's a typical workflow look like with these things? Are you constantly loading and unloading vectors to steer the story? I'm digging around looking for examples/more info but haven't found much. Wouldn't mind implementing this in my own writing system (I'm a novelist by trade and always looking for ways to get the AI writing in a more controllable way).

Odd thought:

A while back I was testing out trying to control positivity/negativity and bias in the model by doing external dice rolls (Python) and feeding the roll to the LLM in context every turn, asking it to act accordingly based on its dice roll and giving it a prompt to continue the novel-style narrative. I wasn't roleplaying in the classic sense. The idea was that LLMs likely have enough information about roleplay/D&D-style interactions that if I took an ongoing novel and gave it a dice roll, then said it needed to continue the story based upon the success (or failure) of that roll, it would allow me to steer a story a bit more directly and achieve negative and positive outcomes where I wanted them.

It worked, albeit a bit noisy. Low rolls led to negative storytelling, higher dice rolls led to positive things happening.
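Roughly, all I was doing was something like the following - the roll is made outside the model and just fed into the context (the wording and thresholds here are purely illustrative):

```python
import random

def build_prompt(story_so_far: str) -> str:
    # Roll a d20 outside the model and feed the result into the context,
    # asking the model to continue the narrative accordingly (10 = neutral,
    # below = things go badly, above = things go well).
    roll = random.randint(1, 20)
    return (
        f"{story_so_far}\n\n"
        f"[Dice roll: {roll}/20 - treat 10 as neutral, 9 and below as failure, "
        f"11 and up as success]\n"
        f"Continue the story in line with this roll."
    )
```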

Now I'm imagining a situation where control vectors are made for rolls 1-20 by producing prompts and showing the varied outcomes (the continued novel-text based on the results of that roll).

Once produced, you apply vectors each generation based on how the roll goes in the background (so if they rolled an 18, you'd apply the 'roll18' vector). The text coming out is then being steered based on those outcomes. It should give you good 2-way vectors since the rolls largely pivot around the 10 (especially if you're prompting it to treat 10 as neutral, 9 and below as negative, and 11 and up as positive). It would also make implementing a slider in a UI easy, to push the story in positive or negative directions by sliding it up or down... and since the roll outcomes are creative/ambiguous, it should give the AI some space to be creative in how it interprets the scene.

Anyway, I'm a bit out of my depth here - I'll have to mess around with control vectors and get a feel for them.

Talk of the different tokenization of single and multiple paragraphs reminds me of something. I don't know how useful this is, but I remember when I was messing around implementing the novelai api in a writing program, they were doing something unusual with their llama finetune - you have to strip the last token off the context when sending text in. Apparently llama has many ways to tokenize the end of the sentence/paragraph and it was causing issues, whereas stripping it let the model continue in a more creative way. Probably not a useful thought but I thought I'd write it down here since this seemed like the most appropriate place ;p. I've actually taken this to an extreme in the past, having entire paragraphs stripped from the end of every generation because I found continuations were better if I started a paragraph back rather than at the end of the previous gen. It's possible I was "solving" the same issue by accident, explaining the improved text quality across the story.

Thanks - that's really interesting! It definitely seemed to make the 32b more creative but also seemed to completely break the 35b who started every paragraph with a word tokenised with the "space before" variant.

I think your thing about stripping the last paragraph to continue from is interesting too - it probably gets the model away from some "finishing response soon" internal representation?

Anyway, hey. Somehow control vectors slipped below my radar. Look forward to jumping in and messing with them. What's a typical workflow look like with these things? Are you constantly loading and unloading vectors to steer the story? I'm digging around looking for examples/more info but haven't found much. Wouldn't mind implementing this in my own writing system (I'm a novelist by trade and always looking for ways to get the AI writing in a more controllable way).

I tend to just set them up and leave them on. If you turn them off after generating part of the story then the model will tend to quickly revert to its "default" style.

It may be worth switching them up for different POV characters, or, as I have found (sometimes hilariously), you can end up writing, for example, a "Grimdark" story where everyone is a bunch of stone-cold sociopaths who slowly get worse and worse over the chapters! :D

Odd thought:

A while back I was testing out trying to control positivity/negativity and bias in the model by doing external dice rolls (Python) and feeding the roll to the LLM in context every turn, asking it to act accordingly based on its dice roll and giving it a prompt to continue the novel-style narrative. I wasn't roleplaying in the classic sense. The idea was that LLMs likely have enough information about roleplay/D&D-style interactions that if I took an ongoing novel and gave it a dice roll, then said it needed to continue the story based upon the success (or failure) of that roll, it would allow me to steer a story a bit more directly and achieve negative and positive outcomes where I wanted them.

It worked, albeit a bit noisy. Low rolls led to negative storytelling, higher dice rolls led to positive things happening.

Now I'm imagining a situation where control vectors are made for rolls 1-20 by producing prompts and showing the varied outcomes (the continued novel-text based on the results of that roll).

Once produced, you apply vectors each generation based on how the roll goes in the background (so if they rolled an 18, you'd apply the 'roll18' vector). The text coming out is then being steered based on those outcomes. It should give you good 2-way vectors since the rolls largely pivot around the 10 (especially if you're prompting it to treat 10 as neutral, 9 and below as negative, and 11 and up as positive). It would also make implementing a slider in a UI easy, to push the story in positive or negative directions by sliding it up or down... and since the roll outcomes are creative/ambiguous, it should give the AI some space to be creative in how it interprets the scene.

Anyway, I'm a bit out of my depth here - I'll have to mess around with control vectors and get a feel for them.

One of the first things I tried was to implement the 2-axis alignment system from AD&D, but quickly found that you can't really have mixed concepts in a control vector - it has to be two clearly defined sides of a single axis (and the law-chaos axis was too mixed to work properly).

So for example your idea would really need to be trained on "18" vs "not 18" and load this each time. If you tried to mix "greater than 10" and "10 or less" to train from, you'd just end up with lots of noise added, sadly, due to the mathematics of the way control vectors are created.

To create control vectors you really have to find a way to clearly demonstrate two sides of a clearly defined (and extremely obvious to the model) axis to start with, train on this, and then decide afterwards the scale factor you want to use to elicit the effect you want.
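Roughly what I mean by "two clearly defined sides of a single axis": the training pairs need to show the same context continued from the two opposite poles, something like the sketch below (completely illustrative prompts, not the actual templates the control-vector code uses):

```python
# Illustrative only: one axis ("success" vs "failure"), same story snippet on
# both sides. Mixing several concepts per side (e.g. the whole law-chaos axis)
# just adds noise to the extracted direction.
POSITIVE = "Continue the story so that the protagonist's plan succeeds spectacularly."
NEGATIVE = "Continue the story so that the protagonist's plan fails disastrously."

def make_pair(story_snippet: str) -> tuple[str, str]:
    return (f"{POSITIVE}\n\n{story_snippet}",
            f"{NEGATIVE}\n\n{story_snippet}")
```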

  1. Train using the "random paragraphs" method (~140M tokens).
  2. Train using the same paragraphs, but put them all back in order from the books they were extracted from with <EOS> tokens separating the books (same ~140M tokens; 1073 books).
  3. Train a 3rd time on the "in order" books, but use Focal Loss* with gamma=1.1 like I did for the "bravo" experimental models.

Stage 2 will use the model with the stage 1 LoRA merged, and then stage 3 will use the model with the stage 2 LoRA merged, and so on.

My hope is that the first stage will force the model to use more natural language as it can't reduce the loss by looking at recent paragraphs, then the second stage will hopefully fix the mixing of the 1st/3rd person POV (and any other general "weirdness") caused by stage 1's "myopia", and then finally stage 3 will try to ramp up the single-token Entropy.

I've found you can't run stage 2 using the same dataset or it just drops all it learned from stage 1 in the first few steps (altering down_proj only must be too near to being convex).

I've also found that running a second stage (still using the "random paragraphs" data from stage 1) using Focal Loss* with gamma=1.1 actually works really well though. It seems starting from the minimum found in the previous stage lets the training really focus on just increasing the Entropy, and it's way less likely to overshoot because of the momentum in Adam:

399959978-71fc0e18-9cf3-47e1-b9a0-74e81fa4f37b.png

I've also added some metrics in this PR to help see what's going on better:

https://github.com/tdrussell/qlora-pipe/pull/48

I have actually thought of a completely new PEFT method over the holidays to be applied to the attention matrices specifically.

It's not "LoRA" (as in "low rank") but for 64 heads it uses exactly the same number of tunable parameters as for a rank-64 LoRA applied to the q_proj (and less for the k_proj / v_proj if using GQA).

It's a bit involved to explain and will need some custom code written to use a block-diagonal matrix in a PEFT wrapper, but I think there is actually a fundamental flaw in using LoRAs with multi-headed attention that this should fix (to do with the cross-talk / lack of enforced sparsity in lora_B, which gets added to all the attention heads when actually there is no reason to believe there is any linkage between the heads!).

The most important thing with this idea is that it might actually be possible to regularise each of the tiny 128x128 matrices (back towards the identity matrix) that act independently on separate attention heads, so as to use knowledge of the relative sample sizes of the high vs low frequencies generated by RoPE and have way less chance of screwing up the long-context ability of models when trained on shorter sequences.
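To give a rough idea of the shape of it (this is just a sketch of where the parameters would live, not a working PEFT wrapper): one small head_dim x head_dim matrix per attention head, initialised to the identity and applied only to that head's slice of the q_proj output, so nothing is shared between heads the way lora_B is.

```python
import torch
import torch.nn as nn

class PerHeadAdapter(nn.Module):
    """Sketch only: a block-diagonal transform with one (head_dim x head_dim)
    matrix per head, initialised to the identity, applied to q_proj's output."""

    def __init__(self, n_heads: int = 64, head_dim: int = 128):
        super().__init__()
        eye = torch.eye(head_dim).expand(n_heads, head_dim, head_dim).clone()
        self.blocks = nn.Parameter(eye)              # (n_heads, head_dim, head_dim)

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (batch, seq, n_heads * head_dim) - the frozen q_proj output
        b, s, _ = q.shape
        n, d, _ = self.blocks.shape
        q = q.view(b, s, n, d)
        q = torch.einsum("bsnd,nde->bsne", q, self.blocks)   # each head sees only its own block
        return q.reshape(b, s, n * d)

    def identity_penalty(self) -> torch.Tensor:
        # Regularise each block back towards the identity - in principle more
        # strongly for the low-frequency RoPE bands with fewer effective samples.
        eye = torch.eye(self.blocks.shape[-1], device=self.blocks.device)
        return ((self.blocks - eye) ** 2).mean()
```

For 64 heads of dim 128 that's 64 x 128 x 128 ≈ 1.05M parameters, which matches a rank-64 LoRA on an 8192-wide q_proj (2 x 8192 x 64).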

Has anybody else noticed Claude Sonnet has had a lobotomy recently?

About 2 weeks ago I noticed it started:

  1. Insisting on breaking up code into different messages. (I got around this by saying "Please write it all in one message, I promise it will go through".) lol. Without the promise, it would still break it up.

  2. Saying “Wow, that’s a really clever idea” and other compliments.

  3. Making mistakes in its code, forgetting what we're trying to do, and repeating its mistakes.

  4. Acting curious and asking me questions all the time.

This is all via OpenRouter / API.

If o1 got nerfed at the same time, well I find that to be too much of a coincidence. Maybe an OpenRouter issue?

What's a typical workflow look like with these things?

I usually use Exui for writing, sometimes tabbyAPI + Mikupad if I want to click the token probabilities and choose a different token / change the direction of the story.

Are you constantly loading and unloading vectors to steer the story?

The character ones (honesty, etc) I toggle frequently. The rest I leave on.

When I was using llama.cpp, I wrote a wrapper UI which looked similar to the “command line generator” in the control-vectors GitHub repo.

I guess I do frequently adjust them.

Has anybody else noticed Claude Sonnet has had a lobotomy recently?

About 2 weeks ago I noticed it started:

  1. Insisting on breaking up code into different messages. (I got around this by saying "Please write it all in one message, I promise it will go through".) lol. Without the promise, it would still break it up.

  2. Saying “Wow, that’s a really clever idea” and other compliments.

  3. Making mistakes in its code, forgetting what we're trying to do, and repeating its mistakes.

  4. Acting curious and asking me questions all the time.

This is all via OpenRouter / API.

If o1 got nerfed at the same time, well I find that to be too much of a coincidence. Maybe an OpenRouter issue?

I've been using claude-sonnet-3.5 on OpenRouter via the API and have tried all 4 variants (ie: old, new, self-moderated and "beta") and all are working like complete shit :/

I've actually been using o1-preview via the OpenAI API and it definitely seems to have got quite a lot dumber, and makes a lot more stupid mistakes than it used to :(

First version with the "double-newline" tokenisation bug fix is uploaded:

https://huggingface.co/jukofyork/creative-writer-32b-preview-01-2025

I'm currently uploading creative-writer-plus-32b-preview-01-2025, which has its Entropy quite significantly boosted.

I have creative-writer-plus-35b-preview-01-2025 training now, so will upload the two 35b models over the next couple of days too.

This looks really interesting:

https://github.com/zenoverflow/omnichain

There are quite a few interesting threads on Reddit about it, but this has the most details on how it might be interesting for writing:

https://old.reddit.com/r/LocalLLaMA/comments/1ej8aua/simple_dumb_trick_to_enhance_roleplay_and_other/

I was actually just looking for something to quickly prototype lots of regex + loopback LLM manipulations to try to get some sort of workflow to tidy up books in text format, but I think it actually might have quite a lot of potential for mixing things up for creative writing too - especially as it can act as an OpenAI API endpoint itself...
