"RPMax's Unconventional Approach"
RPMax, on the other hand, is only trained for one single epoch, uses a very low gradient accumulation, and a higher than normal learning rate. The loss curve during training is actually unstable and jumps up and down a lot, but if you smooth it out, it is actually still steadily decreasing over time.
This is almost certainly a bad thing (called "rattling"):
Except it doesn't completely diverge and sits in a basin, stopping at a random point in this basin when you end the training...
It's actually a form of regularisation:
https://en.m.wikipedia.org/wiki/Regularization_(mathematics)
but a very poor and hard-to-control one: instead of regularising back towards the base model (as would be the case with weight decay [L2 regularisation in the wiki page]), you are actually destroying the base model's parameters by "bouncing them around" so they fall into a nearby, suitably shaped basin.
Or to put it another way: regularisation would "pull" the fine-tuned weights back towards the base weights, whereas using a higher-than-optimal learning rate to cause rattling changes the inductive bias (ie: "the landscape") of the model.
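As a rough illustration of the "pull" described above, here is a minimal Python sketch on a toy quadratic loss. The explicit `pull` term towards `w_base`, the function names and all the numbers are assumptions made purely for illustration; this is not the training code used for RPMax.

```python
def sgd_step(w, grad, w_base, lr, pull):
    """One SGD step on: loss + (pull / 2) * ||w - w_base||^2."""
    return [wi - lr * (gi + pull * (wi - bi)) for wi, gi, bi in zip(w, grad, w_base)]


def finetune(w_base, target, lr, pull, steps=200):
    """Toy fine-tuning loss 0.5 * ||w - target||^2, whose gradient is (w - target)."""
    w = list(w_base)
    for _ in range(steps):
        grad = [wi - ti for wi, ti in zip(w, target)]
        w = sgd_step(w, grad, w_base, lr, pull)
    return w


w_base = [0.0, 0.0]     # hypothetical base-model weights
target = [1.0, -2.0]    # hypothetical optimum of the fine-tuning data alone

free = finetune(w_base, target, lr=0.1, pull=0.0)    # no regularisation
pulled = finetune(w_base, target, lr=0.1, pull=1.0)  # pulled back towards w_base

print("no pull:  ", [round(x, 3) for x in free])
print("with pull:", [round(x, 3) for x in pulled])
```

With `pull=0.0` the weights end up at the fine-tuning target; with `pull=1.0` they settle halfway between the base weights and the target, which is the "balance" being described.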
Hope I don't come across as critical as we don't have enough creative-writing models as it is! But this definitely isn't a good method to use :)
I don't have a reddit account so can't reply there, but a likely explanation of the dreaded "El---" female names is this:
https://en.m.wikipedia.org/wiki/Elaine_(given_name)
It has several variations (all meaning "light") and these have been used in some of the most well known fantasy series (eg: The Wheel of Time has 2 main characters whose names start with "El---").
"El" is also a reasonably common bi-gram in English and, as a result, very likely to be a single token, which is why every model loves this name! :(
I often see male names that start with "Ca---" like "Caleb", and even though it's a reasonably common bi-gram, I don't know why it appears to be chosen so often. :/
Obviously I am not an ML engineer, I actually have a background in electrical engineering. I get that conventionally it isn't a good thing, and that using a higher gradient accumulation and a slow and steady learning rate is better.
It's just that in my mind that only applies if the dataset is all conforming to a certain style or knowledge that you want to finetune the model on. For creative writing and RP datasets, if you have a good dataset with a lot of variations, I think that the loss for each step in the training run SHOULD be high since the model should have never seen that particular example before. If you use a higher gradient accumulation, I think that the model loses a lot of things that it could have learned from each individual example.
In my mind it is like telling a person to read 10 books and then telling them to summarize what they learned in one paragraph, versus if you use a higher than normal learning rate and lower gradient accumulation you are telling the person to read 3 books and then summarize in one paragraph.
I don't know how true this theory is, but I have actually tested various gradient accumulation from 16 all the way to 128. The training doesn't work with 16 as the loss curve is never actually descending over time and just jumps around, and the minimum gradient accumulation that the training works at is actually the 32 that I use.
I also compared 32 against higher gradient accumulation. On an eval set of creative writing and RP examples taken from a subset of the same dataset, a gradient accumulation of 32 always ended up with the lowest eval loss at the end of the training run, with higher gradient accumulation resulting in a higher final loss. So it really does prove to me that lower gradient accumulation and a higher learning rate do work in teaching the model better.
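For readers less familiar with what gradient accumulation changes mechanically, here is a small Python sketch. It assumes a made-up one-dimensional "gradient" with Gaussian per-micro-batch noise (the `true_grad` and `noise` values are hypothetical), and only shows how averaging over more accumulation steps reduces the noise in each update. It says nothing about which setting gives a better model.

```python
import random
import statistics


def noisy_grad(rng, true_grad=1.0, noise=2.0):
    """One micro-batch gradient: the true signal plus per-batch noise."""
    return true_grad + rng.gauss(0.0, noise)


def accumulated_grad(rng, accum_steps):
    """Average several micro-batch gradients before a single optimizer update."""
    return sum(noisy_grad(rng) for _ in range(accum_steps)) / accum_steps


def grad_spread(accum_steps, trials=2000, seed=0):
    """Standard deviation of the accumulated gradient across many updates."""
    rng = random.Random(seed)
    samples = [accumulated_grad(rng, accum_steps) for _ in range(trials)]
    return statistics.pstdev(samples)


# The accumulation values discussed in the thread.
spreads = {n: grad_spread(n) for n in (16, 32, 128)}
for n, s in spreads.items():
    print(f"accumulation {n:>3}: update noise ~ {s:.3f}")
```

For independent noise the spread shrinks roughly with the square root of the accumulation steps, so going from 16 to 128 makes each update about 2.8x less noisy at the same learning rate.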
"regularisation would "pull" the fine-tuned weights back towards to base-weights, whereas using a higher than optimal learning-rate to cause rattling changes the inductive bias (ie: "the landscape") of the model.
That sounds to me like a good thing though, I am actually trying to make the resulting model less like the base model to make it actually output in a different writing style.
I don't know the whole theory, but I am just trying things out and seeing what sticks. If it works, then it works.
I don't have a reddit account so can't reply there, but a likely explanation of the dreaded "El---" female names is this:
https://en.m.wikipedia.org/wiki/Elaine_(given_name)
It has several variations (all meaning "light") and these have been used in some of the most well known fantasy series (eg: The Wheel of Time has 2 main characters whose names start with "El---").
"El" is also a reasonably common bi-gram in English and, as a result, very likely to be a single token, which is why every model loves this name! :(
I often see male names that start with "Ca---" like "Caleb", and even though it's a reasonably common bi-gram, I don't know why it appears to be chosen so often. :/
That is a possible reason, but I think the biggest reason is definitely contamination from ChatGPT-generated training data: the open-source LLMs were originally trained on many ChatGPT examples, as they were far behind ChatGPT in performance back then.
"regularisation would "pull" the fine-tuned weights back towards to base-weights, whereas using a higher than optimal learning-rate to cause rattling changes the inductive bias (ie: "the landscape") of the model.
That sounds to me like a good thing though, I am actually trying to make the resulting model less like the base model to make it actually output in a different writing style.
It definitely can be a good thing to change the inductive bias, but almost certainly not using a microscopic sample compared to the original training data - each of those "jumps" you are seeing is undoubtedly undoing the training of many millions of times more samples than your dataset itself.
I don't want to be negative as improving creative-writing is my main interest in LLMs too, but I also do have a background in ML from many years before the current "deep learning" craze, and it's almost certain this setup will not be repeatable, will be incredibly sensitive to parameters and/or initialisation, and will generally rely on pure luck if it produces anything good! :)
It definitely can be a good thing to change the inductive bias, but almost certainly not using a microscopic sample compared to the original training data - each of those "jumps" you are seeing is undoubtedly undoing the training of many millions of times more samples than your dataset itself.
To me that is certainly my aim lol I want to undo a lot of the training of the model that is based on garbage creative writing datasets. That seemed to work in this case without breaking the model. I do also have some instruct and knowledge datasets that I usually mix in if I find that the creative/RP datasets are making the models too dumb, but that doesn't seem to happen on the latest generation models. Only on older Mistral and Llama 2.
I don't want to be negative as improving creative-writing is my main interest in LLMs too, but I also do have a background in ML from many years before the current "deep learning" craze
It's okay I like discussions like this, it's better if my methods are being scrutinized especially by an actual ML engineer.
it's almost certain this setup will not be repeatable, incredibly sensitive to parameters and/or initialisation,
Can you explain a bit on this part? Do you mean parameters and initialization during inference? Because I did find that the RPMax models are definitely much more sensitive to how you start the conversation compared to other models. If that is the "problem" I think that is completely fine? It just makes the model much more variable in the output depending on how you want it.
With regards to how repeatable this low grad/high learning rate training is, it worked on all the models I trained where the eval loss is always lowest compared to slow and steady high grad/low learning rate.
Because of the variety of examples in the datasets used for both training and eval, I attributed this lower loss to the model being more flexible and able to output more variety.
It's not really even specific to optimisation or ML, but the "rattling" behaviour is related to something called a "chaotic map" in dynamical systems:
- The "chaos" here simply means "incredibly sensitive to initial parameters".
- The "map" here means "the space of parameters that do or do not lead to chaotic behaviour".
A well set up numerical optimisation problem should lie in the parameter space in the "map" where it has predictable behaviour given a wide range of initial starting conditions and be insensitive to slight variations in hyper-parameters.
A badly set up numerical optimisation problem lies in the "chaotic" part of the "map" and very small changes in the initial conditions or slight variations of the hyper-parameters will finish up in completely different areas of the search space.
NOTE: It's not exactly a "chaotic map" though, as these have a more strict definition, but this is the general idea (there are second order optimisation procedures related to Newton's Method which are actual chaotic maps though).
There are also lots of optimisation algorithms that actually start out in "the bad" part of the map (to escape local minima or create a desired inductive bias), but they then generally change the parameters during the run to "the good" part of the map to finish off.
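To make the "good" and "bad" parts of the map a little more concrete, here is a toy Python sketch using plain gradient descent on a one-dimensional quadratic. This is an assumption-laden simplification: a 1-D quadratic is not chaotic (in line with the NOTE above), and the learning rates and `curvature` are hypothetical. It only shows how sensitive the outcome is to the step size near the stability edge, and how a run can start with large steps and then anneal into the stable range.

```python
def run_gd(lr_schedule, curvature=1.0, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2; returns the visited points."""
    x, path = x0, [x0]
    for lr in lr_schedule:
        x = x - lr * curvature * x  # the gradient of f at x is curvature * x
        path.append(x)
    return path


steps = 50
small = run_gd([0.5] * steps)    # well inside the stable range lr < 2 / curvature
rattle = run_gd([1.9] * steps)   # near the edge: overshoots and flips sign every step
diverge = run_gd([2.1] * steps)  # just past the edge: the iterate keeps growing
annealed = run_gd([1.9 * (1 - t / steps) for t in range(steps)])  # large steps first, then decay

for name, path in [("small", small), ("rattle", rattle),
                   ("diverge", diverge), ("annealed", annealed)]:
    print(f"{name:>8}: final |x| = {abs(path[-1]):.3e}")
```

The gap between `rattle` (1.9) and `diverge` (2.1) is the kind of "slight variation of the hyper-parameters" that lands in a completely different place, while `annealed` starts in the rattling regime and still finishes cleanly.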
The theory explanation is definitely one step higher than my understanding, but I get it somewhat so thanks for explaining.
A well set up numerical optimisation problem should lie in the parameter space in the "map" where it has predictable behaviour given a wide range of initial starting conditions and be insensitive to slight variations in hyper-parameters.
So what I am understanding is that this rattling behavior is so unpredictable that it can cause the model to not actually train properly depending on the starting conditions and hyper-parameters, leading to the resulting trained model not being repeatable?
If this were so unpredictable then the models would all turn out super garbage and I'd have to train a few times to get even one to work. But that wasn't the case: all the different model versions I made worked out on the first try with these settings and dataset.
Sure maybe you are right if I re-ran the training it will keep making different resulting models due to this, but I am also trying to understand why this chaotic behavior is bad though? I am not trying to optimize the model on a certain specific task, but instead trying to make it output more random and interesting text without the model going stupid. Which seems to work.
I didn't actually want the model to output exactly as the dataset examples, just to sort of take inspiration and learn from it so it can even output something completely different than the dataset if prompted.
why this chaotic behavior is bad though?
Because of how different the model would be had you run 0.95 of an epoch or 1.01 of an epoch ("The loss curve during training is actually unstable and jumps up and down a lot"), but also because some of those jumps are actually undoing the original training completely at random, subtly damaging the model.
Also, assuming you used an optimisation rule with some kind of momentum term in (which is pretty much a given nowadays), then all these jumps will have even worse effects as these rules are all designed with the idea that "the new gradient is likely to be close to the old gradient", which certainly isn't the case if you're bouncing all over!
I didn't actually want the model to output exactly as the dataset examples, just to sort of take inspiration and learn from it so it can even output something completely different than the dataset if prompted.
This is what regularisation is for - you never want the model to overfit to your dataset examples, and it offers a principled way to "balance" the learning new stuff whilst keeping the old stuff intact as much as possible (stopping after 1 epoch is actually a form of regularisation too!).
Basically all I'm saying is if you ever see the "the loss curve during training is actually unstable and jumps up and down a lot" it's almost universally something bad and needs to be fixed by reducing the learning rate. If you are worried that this might make it overfit then you need to increase the regularisation factor to "pull" back towards the base model.
I see, okay, thanks for all the explanation. I am basing these parameters on extensive comparisons of different gradient accumulation and learning rate settings, though: 32 gradient accumulation and a 0.000001 learning rate on LoRA+ seem to result in the best-performing models for this dataset.
I am not worried about overfitting, in fact I am more worried about the model not learning enough. So in a way I set these settings because I want the model to overfit to each individual example more. I want it to actually be "damaged" and altered by each individual examples more than in a more normal learning rate and grad setting.
I will definitely explore different hyper-parameters more to see if I can figure out a setting that the loss is stable but the final loss is still the lowest possible.
Np :)
You might also find this interesting too:
https://huggingface.co/openbmb/Eurus-70b-nca/discussions/3
I've no idea if it actually works, but if what they said was correct then it appears that taking a context-extended model and then setting the RoPE back for fine-tuning might actually help preserve the long-context ability of the model.
You can find the "fixed" version I made here:
https://huggingface.co/jukofyork/Eurus-70b-nca-fixed
It was very interesting as non-official coding fine-tuned models almost always lost their long-context ability, but this didn't and I tested it extensively to between 16-32k context to be sure!
I've no idea if it actually works, but if what they said was correct then it appears that taking a context-extended model and then setting the RoPE back for fine-tuning might actually help preserve the long-context ability of the model.
Ooh this is new to me thanks for sharing. Will see what I can do about that.
Yeah, it was new to me too - and very surprising!
I can't be 100% sure if they really did set it back to 4k, but they were adamant they did and it appeared they just straight up copy and pasted the codellama-70b-instruct json files to use (which did oddly have a smaller 4k context!).
It would be interesting to see if this method actually works!
Alright, so to make sure I got it right: what you are saying happened there is that setting a lower context/rope setting just for training, and then changing it back when running inference afterwards, actually works better than training with the extended context/rope setting, right?
I know in the Llama 2 days when we figured out we can extend context by rope scaling, the model did become dumber if you extended the context via rope scaling.
Back then I was using oobabooga for inference, and every time I set the context to 2x the default context size with linear rope scaling, setting "rope_scale_base" to 10000 as was recommended, the model would definitely become dumber even if I only sent it short-context messages.
So I think that there is definitely some merit to this. I feel like it is very possible the rope scaling is causing training to be less effective than if you train without rope scaling. Maybe the model starts learning to only give attention to a small portion of the massive extended context when you train with low context? I don't know why I never thought about this, or why no one else seems to have tried it either.
I will try and experiment with this new info because this is very interesting.
Considering that the Llama 3 config file is:
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0.dev0",
"use_cache": true,
"vocab_size": 128256
}
And Llama 3.1 config file is:
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.42.3",
"use_cache": true,
"vocab_size": 128256
}
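The two Llama configs above are long, so it can help to check mechanically which fields actually differ. A minimal Python sketch follows; it uses only a subset of the fields quoted above (the ones related to context length), and the helper name `config_diff` is made up for illustration.

```python
import json

# Subsets of the two config files quoted above (context-related fields only).
LLAMA_3 = json.loads("""
{"max_position_embeddings": 8192, "rope_scaling": null, "rope_theta": 500000.0}
""")
LLAMA_3_1 = json.loads("""
{"max_position_embeddings": 131072,
 "rope_scaling": {"factor": 8.0, "low_freq_factor": 1.0, "high_freq_factor": 4.0,
                  "original_max_position_embeddings": 8192, "rope_type": "llama3"},
 "rope_theta": 500000.0}
""")


def config_diff(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


diff = config_diff(LLAMA_3, LLAMA_3_1)
for key, (before, after) in diff.items():
    print(f"{key}: {before!r} -> {after!r}")
```

On these subsets only `max_position_embeddings` and `rope_scaling` change between the two versions; `rope_theta` is identical in both.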
Basically I just need to set these on the Llama 3.1 config file right?
"max_position_embeddings": 8192,
"rope_scaling": null,
I am not too sure what to do with Mistral Nemo since it doesn't have a config file version with the default context. Just set max_position_embeddings to 10240? Do you have an idea?
{
"architectures": [
"MistralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 1024000,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.43.0.dev0",
"use_cache": true,
"vocab_size": 131072
}
Basically all I'm saying is if you ever see the "the loss curve during training is actually unstable and jumps up and down a lot" it's almost universally something bad and needs to be fixed by reducing the learning rate. If you are worried that this might make it overfit then you need to increase the regularisation factor to "pull" back towards the base model.
By the way, regarding this topic, I can show you that when training with my Formax dataset, which is a more normal instruct dataset based on the Dolphin dataset, the training loss isn't jumpy at all like when I trained my RPMax models.
Doesn't this show that the grad accumulation and learning rate settings and the resulting effective learning rate isn't actually problematic? It's just that the RPMax dataset is much more varied and unfamiliar to the model that the loss is more unstable.
The reason I am so against doing more than 1 epoch and prefer using this method is also because every time I go over 1 epoch there is a sudden huge drop in loss on the 2nd epoch, and the resulting model always turns out dumber and worse.
Nemo-Formax v2.0 training loss:
Nemo-RPMax v1.1 training loss:
It's just that the RPMax dataset is much more varied and unfamiliar to the model that the loss is more unstable.
This is what I'd thought as well, when I noticed that if I fully de-slop a dataset with a regex of 10+ synonyms and wipe out repetitive conversations, my train/loss graph has the "rattling" effect.
But if I leave the slop in and run the same training script to 1.0epoch, it looks more like your Nemo-Formax graph above.
Running a second Epoch didn't help much with the deslop'd dataset, I figured the smallish model was just incapable of learning it.
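For anyone wanting to try the same comparison, a de-slop pass of the kind described above might look roughly like this in Python. The phrase list, the function names and the exact-duplicate check are all hypothetical stand-ins; the actual regex and deduplication used in the thread are not shown.

```python
import re

# Hypothetical slop phrases; a real list would come from inspecting your own dataset.
SLOP_PATTERN = re.compile(
    r"\b(shivers? down (?:her|his|their) spine|barely above a whisper|ministrations)\b",
    re.IGNORECASE,
)


def is_sloppy(text, max_hits=0):
    """True if the text contains more than max_hits slop phrases."""
    return len(SLOP_PATTERN.findall(text)) > max_hits


def deslop(conversations):
    """Drop conversations containing slop phrases, plus exact repeats
    (compared after lower-casing and collapsing whitespace)."""
    seen, kept = set(), []
    for text in conversations:
        key = " ".join(text.lower().split())
        if key in seen or is_sloppy(text):
            continue
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    "Hello there.",
    "hello   there.",
    "It sent shivers down her spine.",
    "Her voice was barely above a whisper.",
    "The rain stopped.",
]
cleaned = deslop(sample)
print(cleaned)
```

Running the same training script on `sample` and on `cleaned` style variants of a real dataset is the comparison being described: same hyper-parameters, only the data changes.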
Yup, that is what I observed as well. If I did not dedupe the dataset of similar characters and situations, the training loss became much more stable, so I attribute that to the model already seeing similar examples and therefore having lower loss on the subsequent similar examples it is presented with. This rattling only happens when my dataset has been deduped and carefully checked for variety.
What happens when you run more than 1 epoch on a varied dataset is just that there is a sudden drop in loss and then the rattling effect is much lower on the second epoch. Much the same as when you use a more repetitive dataset like an instruct dataset.
Glad you chimed in and confirmed that you are seeing similar things as I am.
No, it's just about the rope_theta value.
Actually it looks like llama-3 just straight up used:
"rope_theta": 500000.0
During the whole training run:
Other foundation model trainers tend to start with a much lower rope_theta for the first part of their training and then extend it later. It used to be very obvious by just looking at the base model vs the finetuned model, but now they seem to be often doing this during the pre-training too.
You can still sometimes deduce it from some models, eg:
https://huggingface.co/Qwen/Qwen2-72B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/blob/main/config.json
It's almost certainly not 10000 for the mistralai models due to the findings of this paper:
https://arxiv.org/abs/2310.05209
Even if you can't actually find it then this might help figure it out:
https://github.com/hsiehjackson/RULER
Or failing all that, possibly just try rearranging the formula from the paper above to "unscale" to whatever you can fit in your GPU and at least then the final position embedding of your training data would align with the final position embedding of the model.
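As a starting point for reasoning about how rope_theta relates to sequence length, here is a small Python sketch. To be clear, this is not the formula from the paper linked above; it only uses the standard RoPE definition, in which dimension pair i rotates with frequency rope_theta^(-2i/d), and counts how many pairs complete a full rotation within a given number of tokens. The `seq_len` value and the helper names are assumptions for illustration; the rope_theta and head_dim values are taken from the configs in this thread.

```python
import math


def rope_wavelengths(rope_theta, head_dim):
    """Wavelength in tokens of each rotary dimension pair: 2*pi * theta**(2i/d)."""
    return [2 * math.pi * rope_theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


def pairs_with_full_rotation(rope_theta, head_dim, seq_len):
    """Count dimension pairs that complete at least one full period within seq_len tokens."""
    return sum(1 for w in rope_wavelengths(rope_theta, head_dim) if w <= seq_len)


head_dim = 128   # hidden_size / num_attention_heads for the Llama configs; head_dim for Nemo
seq_len = 8192   # hypothetical training sequence length
for theta in (10000.0, 500000.0, 1000000.0):
    n = pairs_with_full_rotation(theta, head_dim, seq_len)
    print(f"rope_theta={theta:>9.0f}: {n} of {head_dim // 2} pairs see a full rotation in {seq_len} tokens")
```

The larger rope_theta is, the fewer dimension pairs ever see a complete rotation at a fixed training length, which is one way to get a feel for why the base value matters when training on short sequences.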
I should say I have (absolutely) no idea if any of this will work, and it's all just based on the interesting phenomenon with the eurus model, so it's probably worth a quick try if you can work out a way to test if there is any merit to it:
- For coding models I tested by creating a "repository level" file that was just under max_position_embeddings tokens when tokenised and then prepending a question about the code like "can you explain this code?" - coding models with ruined context would be clueless and just output nonsense, gibberish or often a single new line.
- For stories you could probably do something similar by asking it to summarise the story or similar.
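The coding-model test described above could be scripted along these lines. This is only a sketch: `build_context_probe` is a made-up helper, and `count_tokens` here is a whitespace stand-in that you would replace with the real tokenizer of the model under test.

```python
def build_context_probe(files, question, max_tokens, count_tokens):
    """Concatenate files until just under max_tokens, with the question prepended.

    Returns the prompt and the number of tokens used by the file contents.
    """
    budget = max_tokens - count_tokens(question)
    parts, used = [], 0
    for name, body in files:
        chunk = f"# file: {name}\n{body}\n"
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # the next file would push the prompt over the limit
        parts.append(chunk)
        used += cost
    return question + "\n\n" + "".join(parts), used


def count_tokens(text):
    """Whitespace stand-in for a real tokenizer (an assumption for this sketch)."""
    return len(text.split())


files = [("a.py", "x = 1"), ("b.py", "y = 2"), ("c.py", "z = 3")]
question = "can you explain this code?"
prompt, used = build_context_probe(files, question, max_tokens=20, count_tokens=count_tokens)
print(prompt)
print("file tokens used:", used)
```

For the story variant, the same helper works with chapters in place of files and a "summarise this story" question.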
Doesn't this show that the grad accumulation and learning rate settings and the resulting effective learning rate isn't actually problematic? It's just that the RPMax dataset is much more varied and unfamiliar to the model that the loss is more unstable.
No I think it just shows that you are making steps that are too large and over shooting:
Then when you get far enough from the flat basin of the original model and the gradient is again much steeper; you're no longer over shooting and going back down towards the minimum.
There isn't really much point in me trying to emphasise this isn't a good thing, but it is a well known problem with all gradient descent methods (and particularly for second order methods where errors in the Hessian approximation can send things way off) and easily solved by just reducing your learning rate to have it not happen...
It's just that the RPMax dataset is much more varied and unfamiliar to the model that the loss is more unstable.
This is what I'd thought as well, when I noticed that if I fully de-slop a dataset with a regex of 10+ synonyms and wipe out repetitive conversations, my train/loss graph has the "rattling" effect.
But if I leave the slop in and run the same training script to 1.0epoch, it looks more like your Nemo-Formax graph above.
Running a second Epoch didn't help much with the deslop'd dataset, I figured the smallish model was just incapable of learning it.
Yeah, this makes sense if you think about it:
The base model has been trained so as to enter a flat basin and when you finetune on more data that is very similar to the base model's original training data the gradients will be small and point in directions that move around this basin.
When you use a dataset that is very distributionally different to the original training data then the gradient steps will be much larger and likely point in directions that take you further away from the basin.
If you use the same learning rate for both the cases above then obviously the step = lr * gradient is also going to be larger.
BUT: This absolutely doesn't mean you should be leaving the learning rate the same so as to buzz all over the basin; in the process wasting a good proportion of your computation and also needlessly destroying the original weights... You simply reduce your learning rate!
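A tiny Python sketch of the arithmetic behind step = lr * gradient, under made-up numbers: the gradient norms and the base learning rate below are hypothetical, and matching the step length exactly is just one simple way to pick the reduced learning rate, not a recommendation from the thread.

```python
import math


def step_size(lr, grad_norm):
    """Length of a plain gradient descent step: lr * ||gradient||."""
    return lr * grad_norm


def lr_for_same_step(base_lr, base_grad_norm, new_grad_norm):
    """Learning rate that keeps the step length unchanged when gradients get larger."""
    return base_lr * base_grad_norm / new_grad_norm


base_lr = 1e-5        # hypothetical learning rate tuned on the "familiar" dataset
familiar_norm = 0.5   # hypothetical gradient norm on data similar to pre-training
varied_norm = 4.0     # hypothetical gradient norm on a very different dataset

print("same lr, familiar data:", step_size(base_lr, familiar_norm))
print("same lr, varied data:  ", step_size(base_lr, varied_norm))

reduced_lr = lr_for_same_step(base_lr, familiar_norm, varied_norm)
print("reduced lr:", reduced_lr, "-> step", step_size(reduced_lr, varied_norm))
```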
Honestly there isn't really even a question of whether this would be a good idea or not, and there have been many 100s of millions of hours of human thought put into optimisation since WW2 in many different fields and "rattling" is Mathematical Optimisation 101 Lecture 1 material - seriously! :)
No, it's just about the rope_theta value.
Actually it looks like llama-3 just straight up used: "rope_theta": 500000.0
During the whole training run:
Other foundation model trainers tend to start with a much lower rope_theta for the first part of their training and then extend it later. It used to be very obvious by just looking at the base model vs the finetuned model, but now they seem to be often doing this during the pre-training too.
Ooh I see, I should have read into the papers a bit more lol. If Llama 3.1 already used 500000.0 for the whole training, then would it be right to assume there should be no benefit in lowering it for training? I don't see how it should help if the whole thing was done in rope_theta 500000.0 as the model learnt while "seeing" through that rope_theta all the way from the beginning no?
You can still sometimes deduce it from some models, eg:
https://huggingface.co/Qwen/Qwen2-72B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/blob/main/config.json
It's almost certainly not 10000 for the mistralai models due to the findings of this paper: https://arxiv.org/abs/2310.05209
Even if you can't actually find it then this might help figure it out:
Interesting that according to the RULER bench Mistral Nemo is only usable up to 16K context. It also makes sense why Llama 3.1 8B seems to go bonkers above 32K when you see the RULER bench puts it at 32K as well.
Or failing all that, possibly just try rearranging the formula from the paper above to "unscale" to whatever you can fit in your GPU and at least then the final position embedding of your training data would align with the final position embedding of the model.
I'll say I am a bit over my head on this, can you point me to how I would calculate the rope_theta for my training data sequence length relative to a model's given rope_theta? Again this is assuming Llama 3.1 even has any benefit to doing this or that we can know what Nemo's base rope_theta is.
I should say I have (absolutely) no idea if any of this will work and it's all just based on the interesting phenomenon with the eurus model, so it's probably worth a quick try if you can work out a way to test if there is any merit to it:
- For coding models I tested by creating a "repository level" file that was just under max_position_embeddings tokens when tokenised and then prepending a question about the code like "can you explain this code?" - coding models with ruined context would be clueless and just output nonsense, gibberish or often a single new line.
- For stories you could probably do something similar by asking it to summarise the story or similar.
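A rough sketch of how a probe prompt like the one described above could be assembled. Everything here is hypothetical: `build_probe_prompt`, `reserve_for_answer` and the whitespace "tokeniser" are stand-ins, and in practice you would pass in a `count_tokens` function backed by the model's real tokeniser (where per-chunk counts may not add up exactly once joined, so leave some slack):

```python
def build_probe_prompt(files, question, max_position_embeddings,
                       count_tokens, reserve_for_answer=512):
    """Prepend `question` to as many files as fit just under the context limit.

    files: list of (name, text) pairs.
    count_tokens: callable returning the token count of a string (assumed helper).
    Returns the prompt and the number of tokens spent on file content.
    """
    budget = max_position_embeddings - reserve_for_answer - count_tokens(question)
    parts, used = [], 0
    for name, text in files:
        chunk = f"# file: {name}\n{text}\n"
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # the next file would push us over the limit
        parts.append(chunk)
        used += cost
    prompt = question + "\n\n" + "".join(parts)
    return prompt, used


def whitespace_tokens(text):
    """Crude stand-in tokeniser: counts whitespace-separated words."""
    return len(text.split())


# Made-up "repository": 100 small files of 50 words each.
files = [(f"f{i}.py", "word " * 50) for i in range(100)]
question = "can you explain this code?"
prompt, used = build_probe_prompt(
    files, question, max_position_embeddings=1000,
    count_tokens=whitespace_tokens, reserve_for_answer=100,
)
print(used, whitespace_tokens(prompt))
```

You would then send `prompt` to the model and check whether the answer is coherent or just nonsense / a single new line.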
I do actually have a pretty reliable test for this, which I found completely by accident. I made a "pseudo-sentience" Python program that lets an LLM think for itself over and over after being given a task, and it actually works out pretty well for testing a model's actual usable context length: once you get past the limit, the LLM will keep repeating the same actions over and over again.
Honestly there isn't really even a question of whether this would be a good idea or not, and there have been many 100s of millions of hours of human thought put into optimisation since WW2 in many different fields and "rattling" is Mathematical Optimisation 101 Lecture 1 material - seriously! :)
Okok point taken about this topic and I totally get the mathematical reason as to why it's bad.
But I still don't see how a lower learning rate and/or higher gradient accumulation for less rattling would still result in a better model if the resulting final eval loss becomes higher when I do that though, how would you explain this phenomenon?
"regularisation would "pull" the fine-tuned weights back towards to base-weights, whereas using a higher than optimal learning-rate to cause rattling changes the inductive bias (ie: "the landscape") of the model.
That sounds to me like a good thing though, I am actually trying to make the resulting model less like the base model to make it actually output in a different writing style.
It definitely can be a good thing to change the inductive bias, but almost certainly not using a microscopic sample compared to the original training data - each of those "jumps" you are seeing is undoubtedly undoing the training of many millions of times more samples than your dataset itself.
I probably shouldn't have mentioned this, but there is definitely interesting research on the effects of "annealed schedules", "cosine schedules", using vanilla SGD vs newer learning rules, and so on, and their effect on the generalisation ability of the final models, etc.
BUT: This is absolutely irrelevant when you're using training data many many orders of magnitude smaller than the original datasets for a single epoch...
The only sound way to alter the inductive bias in this scenario is via regularisation, and the sad fact is that 99% of all model trainers here have the mindset of "If I'm spending $200 on finetuning credits I want the most 'bang for my buck'" and then completely ignore regularisation and just hope to move the model weights as much as possible without completely destroying the model.
One of the most famous quotes in military history is:
Amateurs study tactics; professionals study logistics.
- Omar N. Bradley
and in this context the optimisation is "the easy part" - it's incredibly well studied and almost mindless plug-and-play with very minimal knowledge of the theory needed to get working... Not screwing up what you started with is the hard part! :)
We'd have 90%+ fewer broken models here on huggingface if people cared more (or at all) about regularisation, and the mindset of "If I'm spending $200 on finetuning credits I want the most 'bang for my buck'" is actually doing exactly the opposite of what could be achieved :(
Sorry for going on about this, but it really is one of my pet peeves... :/
No it's fine I actually want to learn more about this and improve. So what do you suggest I do then regarding my training? For reference again I usually just use LORA+ training with a ratio of 16.
In my testing, both the eval and train loss drop faster when I use the higher learning rate and lower grad accumulation. The final eval loss is also lower this way than if I set a lower learning rate and higher grad accumulation.
As I don't like how the loss drops significantly on the second epoch and the model becoming more repetitive because of this, is the solution to really lower the learning rate and increase gradient accumulation until the second epoch doesn't show these symptoms? But then the training time would be way longer than my usual 1 epoch setting I used.
I don't worry about training credits because I have been doing this all on my own hardware since the original llama. Since I am not an ML engineer, I have just been trying things out and finding what works best, which is what my latest finetunes are the culmination of.
But I still don't see how a lower learning rate and/or higher gradient accumulation for less rattling would still result in a better model if the resulting final eval loss becomes higher when I do that though, how would you explain this phenomenon?
You just run the training a bit longer...
and regularise to avoid overfitting if needed.
Your "eval loss" is just an empirical estimate of the true loss due to it being a sample. The fact that one "eval loss" is lower than another "eval loss" doesn't necessarily mean one set of weights is actually better than another.
If the lower eval loss state came about by bouncing the original weights all over the place, then it is close to certain it's actually worse than one with a similar or slightly higher eval loss that didn't do this. If you replaced the original model's weights with random values then managed somehow to get an equivalent eval loss via training on your training set, would you consider this to be a good model? :)
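As a rough illustration of the "eval loss is just a sample estimate" point: if you keep the per-sample eval losses for two runs on the same eval set, you can at least check whether the gap between them is bigger than the sampling noise. This is only a sketch; the function names and the numbers are made up, and the z = 2 threshold is just a rule of thumb:

```python
import math
import statistics


def paired_gap(losses_a, losses_b):
    """Mean per-sample loss difference (a - b) and its standard error.

    Assumes both lists hold losses for the same eval samples in the same order.
    """
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    mean_gap = statistics.fmean(diffs)
    stderr = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_gap, stderr


def clearly_different(losses_a, losses_b, z=2.0):
    """True only if the gap is larger than roughly z standard errors."""
    mean_gap, stderr = paired_gap(losses_a, losses_b)
    return abs(mean_gap) > z * stderr


# Made-up per-sample eval losses for two hypothetical runs.
run_a = [1.92, 2.10, 1.75, 2.31, 1.88, 2.05]
run_b = [1.95, 2.02, 1.80, 2.25, 1.93, 2.01]
print(paired_gap(run_a, run_b), clearly_different(run_a, run_b))
```

Note this only covers sampling noise on your own eval set. It says nothing about how much of the original model's ability was lost along the way, which an RP-only eval set cannot measure.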
All I can suggest is first stop the rattling by reducing the learning rate, then look into:
- weight_decay (https://en.m.wikipedia.org/wiki/Regularization_(mathematics) - aka "ridge regression", "L2-regularisation").
- Reduce the rank (https://en.m.wikipedia.org/wiki/Regularization_by_spectral_filtering - aka spectral regularisation).
- Early stopping (https://en.m.wikipedia.org/wiki/Early_stopping).
These are the only regularisation methods available here and each will have a subtly different effect on the inductive bias.
You will spend more computational resources to tune these at the start, but what you learn can then likely be applied to future finetunes and the ultimate quality of what you create will be way better.
A good way to get into the regularisation before optimisation mindset is to think to yourself "what if I had infinite computational ability?" and go from there.
Sorry have to go out but by "what if I had infinite computational ability?" I really meant "what if I could fully optimise this problem right to the minimum?"...
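A toy sketch of the "fully optimise right to the minimum" idea, using a single scalar `delta` (fine-tuned weight minus base weight, loosely what an adapter learns). The objective, names and numbers are all made up for illustration and are not any library's API. Once you run to convergence, where you end up is decided by the regulariser rather than by how hard you pushed the optimiser. A hypothetical early stopping helper is included too:

```python
def train_delta(target, lr, weight_decay, steps):
    """Minimise 0.5*(delta - target)**2 + 0.5*weight_decay*delta**2 by gradient descent.

    delta = 0 means "identical to the base model"; the L2 penalty pulls back towards it.
    """
    delta = 0.0
    for _ in range(steps):
        gradient = (delta - target) + weight_decay * delta
        delta -= lr * gradient
    return delta


def early_stop_index(eval_losses, patience=2):
    """Index of the best checkpoint seen before `patience` evaluations in a row fail to improve."""
    best_i, bad = 0, 0
    for i, loss in enumerate(eval_losses[1:], start=1):
        if loss < eval_losses[best_i]:
            best_i, bad = i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_i


# Fully optimised, the answer is target / (1 + weight_decay), whatever the learning rate was.
print(train_delta(target=1.0, lr=0.1, weight_decay=0.0, steps=2000))
print(train_delta(target=1.0, lr=0.1, weight_decay=1.0, steps=2000))
print(early_stop_index([2.0, 1.8, 1.7, 1.75, 1.9, 1.6]))
```

With no weight decay the toy weight moves all the way to what the small dataset wants; with weight decay it settles part-way back towards the base model, and that trade-off is the thing you actually tune.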
You just run the training a bit longer...
and regularise to avoid overfitting if needed.
Since simply increasing dataset size isn’t an option, would this mean just running multiple epochs?
Your "eval loss" is just an empirical estimate of the true loss due to it being a sample. The fact that one "eval loss" is lower than another "eval loss" doesn't necessarily mean one set of weights is actually better than another.
If the lower eval loss state came about by bouncing the original weights all over the place, then it is close to certain it's actually worse than one with a similar or slightly higher eval loss that didn't do this. If you replaced the original model's weights with random values then managed somehow to get an equivalent eval loss via training on your training set, would you consider this to be a good model? :)
Ok your comparison to a random value model does make sense.
However, isn’t a random value model basically going to be the same as a good working model if the resulting eval loss is the same for a significantly large enough dataset? Isn’t it the same thing?
All I can suggest is first stop the rattling by reducing the learning rate, then look into:
As for reducing learning rate this means I’d need either a larger dataset or more epochs to achieve the same amount of learning for the model right? Otherwise the model is literally always worse when I tested a lower learning rate with everything else equal.
- weight_decay (https://en.m.wikipedia.org/wiki/Regularization_(mathematics) - aka "ridge regression", "L2-regularisation").
- Reduce the rank (https://en.m.wikipedia.org/wiki/Regularization_by_spectral_filtering - aka spectral regularisation).
- Early stopping (https://en.m.wikipedia.org/wiki/Early_stopping).
These are the only regularisation methods available here and each will have a subtly different effect on the inductive bias.
You will spend more computational resources to tune these at the start, but what you learn can then likely be applied to future finetunes and the ultimate quality of what you create will be way better.
A good way to get into the regularisation before optimisation mindset is to think to yourself "what if I had infinite computational ability?" and go from there.
I guess the only way to find out is for me to test this out then. I will find compute time to compare my usual methods vs a more stable loss curve method of lower learning rates.
At the moment all I know for sure is:
- Lowering the learning rate/higher gradient accumulation results in worse models, confirmed by the train/eval loss and by testing the model out myself.
- Using a lower learning rate and 2 epochs vs 1 just results in the model overfitting and becoming repetitive, which could possibly be fixed by regularization.
However, isn’t a random value model basically going to be the same as a good working model if the resulting eval loss is the same for a significantly large enough dataset? Isn’t it the same thing?
No, because "significantly large enough dataset" would have to be in the order of the original training set of several trillion tokens.
Your best bet is to read up on this:
- The Elements of Statistical Learning is free and looks at this from the Statistical Learning perspective (3.4: Shrinkage Methods).
- Pattern Recognition and Machine Learning also seems to be free now, and looks at this from a Bayesian perspective (5.1: Regularization in Neural Networks).
- The same idea is called "Penalty Methods" in the Numerical Optimisation literature, but is rarely used now and only really described in very old books.
- The same idea is called "Ill-posed Problems" in Numerical Mathematics, which is where most of the ideas originally came from.
Your other questions relate to "Model Selection" and the two books above also look at this from a Statistical Learning and a Bayesian (Machine Learning) Perspective respectively.
These are just the first books that come into my head that I can find a link to the free PDF of - the same topics are covered ad nauseam in other books, and there are also 100s of YouTube series on this:
https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
https://www.youtube.com/playlist?list=PL05umP7R6ij2XCvrRzLokX6EoHWaGA2cC
https://www.youtube.com/playlist?list=PLoROMvodv4rOzrYsAxzQyHb8n_RWNuS1e
Information Theory, Inference, and Learning Algorithms is also free, and is yet another way to look at the same idea from an Information Theory perspective (28.1 Occam’s razor).
Honestly, if people on huggingface spent a few days reading about these ideas, we would likely have 95% fewer broken models getting uploaded here! :)
However, isn’t a random value model basically going to be the same as a good working model if the resulting eval loss is the same for a significantly large enough dataset? Isn’t it the same thing?
No, because "significantly large enough dataset" would have to be in the order of the original training set of several trillion tokens.
Yes this is just a what if question. If you want the model to be as good as the original dataset can make it then for sure.
Which is relevant here because I am not trying to make the best general models. This is literally for writing RP creatively. So the way I saw it is if the eval loss on the RP dataset is great then it is great for RP. I don’t really care if it became worse in general tasks.
My point was that all these settings that cause rattling that is supposedly very bad only caused the resulting model to be better in my case. RPMax isn't a broken model.
Thanks for the recommendations, will definitely read up on them when I have time!
Hey, just saw your latest Reddit post (where you posted both graphs) and realised the graph you were showing before was the training graph! I have to apologise as I read "very high learning rate" and took one look at the graph and assumed it was the evaluation graph - doh! :/
So yeah, that is a perfectly valid and common thing to do: there is a continuum between stochastic-GD (batch = 1) and batch-GD (batch = n), and using a mini-batch closer to the SGD side can definitely help guide the model into better/flatter locations of the search space! :)
I think it was likely "higher than normal learning rate. The loss curve during training is actually unstable and jumps up and down" that made me think this - when people mention "loss" they tend to mean evaluation loss and somehow I just never noticed the graph you posted clearly said "train/loss" - sorry again! :)
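For anyone following along, here is a small sketch of the SGD to batch-GD continuum mentioned above: the smaller the batch, the noisier each gradient estimate, which is why a train/loss curve with a small effective batch jumps around even when things are going fine. The per-sample "gradients" are synthetic (a true gradient of 1.0 plus Gaussian noise) and the function name is hypothetical:

```python
import random
import statistics


def gradient_spread(batch_size, n_samples=512, trials=200, seed=0):
    """Standard deviation of the mini-batch gradient estimate over many random batches."""
    rng = random.Random(seed)
    # Synthetic per-sample gradients: true gradient 1.0 plus unit Gaussian noise.
    per_sample = [1.0 + rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    estimates = []
    for _ in range(trials):
        batch = rng.sample(per_sample, batch_size)  # sample without replacement
        estimates.append(statistics.fmean(batch))   # mini-batch gradient estimate
    return statistics.stdev(estimates)


for batch_size in (1, 32, 512):
    print(batch_size, gradient_spread(batch_size))
```

With batch = n (512 here) every "batch" is the whole dataset, so the estimate has no spread at all; at batch = 1 the spread is about as large as the per-sample noise itself.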