Arch speculations
This smells a lot like someone did an ABF (theta-scaling) finetune on Llama2 70B, and it somehow works great (and yet somehow they did not have the bandwidth to upload fp16 weights). Granted, no one has tried that before. LongLoRA did linear rope and Nous did YaRN, and Meta only tried it on Codellama 34B.
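For anyone following along, here is a rough sketch of how the two approaches differ, assuming the standard RoPE formulation (inverse frequencies theta^(-2i/d)). The helper names, the head dimension of 128, and the 8x / 1M settings are illustrative assumptions, not anything read out of the Miqu weights. Linear rope squeezes the positions; ABF leaves positions alone and raises the base theta so every dimension pair above the first rotates more slowly.

```python
def rope_inv_freq(head_dim, theta=10000.0):
    """Inverse frequencies of standard RoPE: theta^(-2i/d) for each dimension pair."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position, head_dim, theta=10000.0, linear_scale=1.0):
    """Rotation angle per dimension pair for one token position.

    linear_scale > 1 is linear rope (position interpolation): positions are divided.
    theta > 10000 is ABF / theta scaling: the frequency base is raised instead.
    """
    pos = position / linear_scale
    return [pos * f for f in rope_inv_freq(head_dim, theta)]


head_dim = 128  # assumed head dimension, for illustration only

# Baseline: position 4096 with the stock base of 10,000.
base_4k = rope_angles(4096, head_dim)

# Linear rope, 8x: position 32768 is mapped back onto the angles of position 4096.
linear_32k = rope_angles(32768, head_dim, linear_scale=8.0)

# ABF with theta = 1M: positions untouched, slow dimensions rotate even more slowly.
abf_32k = rope_angles(32768, head_dim, theta=1_000_000.0)
unscaled_32k = rope_angles(32768, head_dim)

print("fastest pair:", base_4k[0], linear_32k[0], abf_32k[0])
print("slowest pair:", unscaled_32k[-1], linear_32k[-1], abf_32k[-1])
```

The point of the comparison: linear rope compresses every frequency equally (including the fast ones that encode local order), while ABF keeps the fastest pair exactly as it was and only stretches the slower ones.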
Were you able to spit out any info about the layers or the architecture? Does it look like a llama inside or some kind of MoE/Mixtral layers?
Will post the info when I get a chance if you haven't.
It's a Llama 2 70b in my opinion. Same size, same conversion, same speed, same feel. But with some Mistral on top of it.
For more tech stuff, there are people more qualified than me!
And indeed, I read an excerpt about ABF a few days ago.
https://arxiv.org/html/2401.07004v1
On this Github, they link to some HF models with scaled Theta : https://github.com/GAIR-NLP/Entropy-ABF
Models are here : https://huggingface.co/Arist12
Worth a quant and a test, I guess.
Would you be willing to do any of the same tests (e.g., Q3_K_S Hellaswag) on LongLoRA (either my modification or the original)? I'm impressed by Miku, and am curious where linear rope stands with respect to the others.
Picking the right base model & method for 32K scaling is maybe more important than I thought. If theta-scaling (ABF) is really good, then I would drop everything and direct my efforts to making an open 70B ABF base model ASAP. The only reason I haven't is the expense and the uncertainty about whether it is really worth it.
Reading the Codellama 70B paper, it looks like they were trained for long context. Even though the config.json says the base theta is 10K, I suspect it is intended to be used with 1M. Which one did you use for your Codellama 70B benchmarks?
Here's what I have (and I corrected a typo thanks to you) :
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,6.4634,512,512,2024-01-30 01:40:00,RBF10000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,655
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,9.7866,512,512,2024-01-30 01:40:00,RBF1000000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,8.5822,512,512,2024-01-30 01:40:00,RBF500000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,7.1098,512,512,2024-01-30 01:40:00,RBF100000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,6.8224,512,512,2024-01-30 01:40:00,RBF50000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,6.5705,512,512,2024-01-30 01:40:00,RBF10000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,5.6064,4096,4096,2024-01-30 01:40:00,,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,
- CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,153.5606,6144,6144,2024-01-30 01:40:00,,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,
When not specified, the base theta from the model quant (10k) is used. Otherwise, I used a manual base rope frequency, and my results indicate a theta of 10,000. Honestly, I hope Meta didn't publish the right weights, because this is grotesque.
The last number, when present, is the number of chunks used in the PPL test; otherwise the llama.cpp default is used.
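If anyone wants to redo the comparison from the rows above, a small sketch follows. The column positions (perplexity in the 4th field, the RBF tag in the 8th) are my reading of the rows, not an official schema, and `ppl_by_rope_base` is just a hypothetical helper name. Only the five 81-chunk runs are included so the numbers are comparable.

```python
import csv
import io

# The five 81-chunk runs from the list above, copied verbatim.
ROWS = """\
CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,9.7866,512,512,2024-01-30 01:40:00,RBF1000000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,8.5822,512,512,2024-01-30 01:40:00,RBF500000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,7.1098,512,512,2024-01-30 01:40:00,RBF100000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,6.8224,512,512,2024-01-30 01:40:00,RBF50000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
CodeLlama-70b-Instruct-hf-Q2_K.gguf,-,wikitext,6.5705,512,512,2024-01-30 01:40:00,RBF10000,70b,CodeLlama,32768,,,GGUF,Meta,Lonestriker,81
"""

PPL_COL = 3   # assumed: 4th field is the perplexity
ROPE_COL = 7  # assumed: 8th field is the rope tag, e.g. "RBF10000"


def ppl_by_rope_base(text):
    """Map rope frequency base -> perplexity for every row carrying an RBF tag."""
    results = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        tag = row[ROPE_COL]
        if tag.startswith("RBF"):
            results[float(tag[3:])] = float(row[PPL_COL])
    return results


ppl = ppl_by_rope_base(ROWS)
best_base = min(ppl, key=ppl.get)

for base in sorted(ppl):
    print(f"rope base {base:>9.0f}: PPL {ppl[base]}")
print("lowest perplexity at rope base", best_base)
```

On these rows the perplexity rises steadily as the base goes up, which is what the conclusion about theta 10,000 rests on. Keep in mind this is a 512-token context test on a Q2_K quant, so it says little about long-context behaviour.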
I tested LongLora on 13b a while ago, and I was impressed :
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,5.2329,512,512,2024-01-14 12:10:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.616,2048,2048,2024-01-14 12:15:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.4841,4096,4096,2024-01-14 12:20:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.4587,6144,6144,2024-01-14 12:25:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.4878,8192,8192,2024-01-14 12:30:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.3038,12288,12288,2024-01-14 12:40:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.3974,16384,16384,2024-01-14 12:50:00,PEC8,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.3256,12488,12288,2024-01-14 13:00,PEC8-RBF18168.7,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
Llama-2-13b-longlora-16k-ft.q6_k.gguf,-,wikitext,4.4041,14848,14848,2024-01-14 13:10:00,PEC8-RBF22277,13b,Llama_2,4096,,,GGUF,Yukang,Undi95,
I'm a little bit CPU busy atm (i7-6700k and my own quants to do..), but if you provide me a Q3_K_S (let's keep it simple indeed) GGUF to test, I'll squeeze it in tomorrow. I'm not training anything, so I have some GPU time to spare to test what you need me to test. I also want Aurelian to be a success for all and a reward for your work and expenses!
I uploaded GGUF Q3_K_S of longlora base here if you want to try. Curious if linear rope is indeed much worse than the other methods.
I looked into the architecture of Miqu, and it seems to be, like you said, a Llama2 fine-tune/variant. The tokenizer looks the same as Llama's. I didn't look too closely or make weight comparisons because all that is annoying to do outside of HF format. Do you know of any llama.cpp tool that lets you look at the model layers and structure more easily? If it is a Mis(x)tral variant like some people are claiming, it's not obvious to my eyes.
> Otherwise, I used a manual base rope frequency, and my results indicate a theta of 10,000. Honestly, I hope Meta didn't publish the right weights, because this is grotesque.
Did you try theta = 1M also? That's what they used for Codellama before, wonder why they'd do something different for 70b.
I did try Theta 1,000,000 on CodeLlama 70b instruct, that's my RBF 1,000,000. The optimal perplexity is at RBF 10,000..
For Miqu, Alpindale made an FP16 dequant of the Q5_K_M quant of Miqu with the help of Ycros: alpindale/miqu-1-70b-fp16
I don't know more.
Then, a Mistral dataset was most probably used to train Miqu, but whether it comes from MistralAI or was recomposed from a Q/A series, I don't know. Some folks say the output is very close to Mistral Medium though.
Your Q3_K_S is downloading, I will test it shortly.