Difference between this and the non-"test" version?
Hi, is this just an updated quant of internlm2-chat-20b-llama or is it different in some way?
ah right i should probably post about this, there were changes to the config of the original model after I had done conversions and quants, so this was a test to see if those changes were needed
In brief testing by @MarinaraSpaghetti they found that this test version ran better, if you happen to try it out and find the same let me know and i'll go and redo all the old ones the proper way (though maybe not the SFT ones)
Ah that's what I was hoping. Will give it a go.
Hi, yes, I played with this version yesterday and found it much, MUCH better at following the instructions than the previous iteration. It even fares better than the other versions of this model between base and standard editions (did a quick test of „pause the roleplay and describe your character” and only on this version the command worked). But it’s also worth noting that I swapped <|im_start|> with [UNUSED_TOKEN_146] and <|im_end|> with [UNUSED_TOKEN_145] in the prompt as per one Redditor’s suggestion which seems to even improve the model further. Once again, thank you Bartowski!
I tested this model and also the one with -old in its name now. With the new config.json and tokenizer_config.json this model does not load with ExLlamav2_HF.
I tested those two models and the outputs seemed quite similar when I loaded this model without setting alpha_value to 3. Furthermore, I copied config.json and tokenizer_config.json from this model to the one with -old in its name and... the old one works a lot better for me than this one, and it also has to be loaded with ExLlamav2 after the config change!
So my conclusion would be that the -old model is OK but needs the new configs, and that those configs somehow do not play well with ExLlamav2_HF, as with these configs I do not get any responses from the LM with this loader.
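For anyone who wants to repeat the config swap described above, here is a minimal Python sketch. The function name copy_configs and the .bak backup naming are made up for illustration, and the directory paths are whatever your local model folders happen to be; only the two file names come from the post.

```python
import shutil
from pathlib import Path

# The two files mentioned in the post above.
CONFIG_FILES = ("config.json", "tokenizer_config.json")


def copy_configs(src_dir, dst_dir, backup=True):
    """Copy the config files from src_dir into dst_dir.

    Hypothetical helper: if backup is True, an existing file in dst_dir
    is first saved next to itself with a ".bak" suffix.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for name in CONFIG_FILES:
        source = src / name
        if not source.is_file():
            raise FileNotFoundError(source)
        target = dst / name
        if backup and target.exists():
            # Keep the old config so the swap can be undone.
            shutil.copy2(target, target.with_name(name + ".bak"))
        shutil.copy2(source, target)
        copied.append(name)
    return copied
```

Calling copy_configs("path/to/new-model", "path/to/old-model") would overwrite the old model's two configs and leave config.json.bak and tokenizer_config.json.bak behind for rolling back.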
@altomek yes i noticed that for some reason the _HF loader seems to be broken with internlm models, I reported the issue here:
https://github.com/oobabooga/text-generation-webui/issues/5375
have you tried with the non HF loader to compare the output?
First I tested the old one (with HF) against this one, and this one seemed, like MarinaraSpaghetti described, a bit more verbose. Then I set alpha_value to 1 for this one and tested again. The outputs seemed similar to me, and the only difference was that the old one was loaded with HF while this one was loaded with EXL2. But then I got the idea to replace config.json and tokenizer_config.json in the old model and try again. And it looks to me like the old one performs very well with the new configs (loaded after this change with the non-HF ExLlamav2 loader). I think it is possible that the only difference is the configs and both models are OK. They differ a bit because of the quantization process. So choose whichever works best for you :)
BTW, I tested with 16K context only and the 6.5bpw quant. Would love to see an 8.0 quant of this model. From my tests on other models (mainly SOLAR tunes), 6.0 and 8.0 make a difference. The difference is often very small, but you can see more naturally worded responses, better vocabulary, etc...
yeah ill add an 8.0 so you can check, but 6.5 is like 99.5% of the way to 8.0
the old one wasn't quantized with the right rope scaling so i'm concerned about dropoff when you really push the context, but i'm purely guessing. I would hope that at worst this one is the same when configs are lined up, but if it's actually worse that would be very odd
6.5 is like 99.5% of the way to 8.0 - I know! You really have to check this yourself. I can see the difference. It all depends on how you use the LLM. When you are doing RP this may not make a big difference, but when you try to write some scenes or ask the LLM to describe something, you will find that 8.0bpw models perform better. Also, when you load the full-weight model in 8-bit in transformers, you will find it a bit better than the 8.0bpw quant. Those are mere percentage points according to the math, but you will see the difference in quality nevertheless.
I added to my previous comment that I only tested with 16K context, so yeah, this may be a different story when you set the context higher. Need to test this later.
Hadn't found any meaningful difference myself but go ahead and investigate and report your findings! 8.0 is uploaded:
https://huggingface.co/bartowski/internlm2-chat-20b-llama-exl2/tree/8_0
Will check this one. Thank you!
I'm going to make one more as well. I realized that this one uses dynamic rope scaling, which I think exllamav2 calls ROPE_ALPHA, where ROPE_SCALE is for linear. If it makes a difference it'll be at lower contexts, but i'm not positive it will
I tested 8_0 a bit and I like it! I tested this one and 6_5 against some texts and asked for summaries and some analysis. It works great and maybe it will replace Nous-Capybara for these tasks. First impressions are solid. 8_0 recalls important facts from texts more often than 6_5. However, this LLM is a bit strange (both versions). Although the output initially consists of several excellent iterations, it eventually begins yielding considerably shorter replies. Quality varies greatly, so it lacks consistency in output quality.
It will fit in 24GB VRAM with ~8K context. With 32GB VRAM I was able to push context up to ~35K with no alpha_value or compress_pos_emb set! When starting a new chat there are sometimes those UNUSED_TOKEN_145 messages, but in a chat with a long history it hasn't happened yet. EDIT: both do still print UNUSED_TOKEN_145 from time to time :)
@altomek
you can add "[UNUSED_TOKEN_145]" as a custom stopping string, it did wonders for me. :) Also, I recommend replacing <|im_start|> with [UNUSED_TOKEN_146] and <|im_end|> with [UNUSED_TOKEN_145] in the prompt.
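If you build prompts in your own script instead of setting this in a UI, the two tips above could be sketched roughly like this in Python. The helper names swap_chatml_tokens and cut_at_stop are hypothetical, and this is plain string handling, not any loader's actual stopping-string implementation.

```python
# Token replacements suggested in the post above.
TOKEN_SWAPS = {
    "<|im_start|>": "[UNUSED_TOKEN_146]",
    "<|im_end|>": "[UNUSED_TOKEN_145]",
}


def swap_chatml_tokens(prompt):
    """Replace the ChatML markers in a prompt with the UNUSED_TOKEN strings."""
    for old, new in TOKEN_SWAPS.items():
        prompt = prompt.replace(old, new)
    return prompt


def cut_at_stop(text, stops=("[UNUSED_TOKEN_145]",)):
    """Truncate generated text at the earliest stop string, if any appears."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]
```

The first helper would be applied to the prompt template before generation, the second to the raw model output afterwards, so a stray [UNUSED_TOKEN_145] never reaches the chat.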
@bartowski
thank you so much for the 8.0 quant! Could you please let me know if I should run it with 3 for alpha_value setting, or is it no longer necessary? Can't wait to test it! Thank you once again!
Thanks for the hint @MarinaraSpaghetti !
I am running without alpha_value using up to 35K contexts for 8_0 quant and up to 45K contexts for 6_5 quants and I haven't encountered any problems so far. However, if you plan on utilizing up to 200K contexts, incorporating alpha_value might become necessary for optimal performance?
Yeah I was incorrect with the rope_alpha value, ExLlamaV2 just doesn't support dynamic rope scaling, so I cancelled that test
I think that's fine, you still get proper scaling by setting the alpha_value to 3.0, you just don't get the benefit of leaving it unscaled when you don't need the extra context
Dynamic alpha attempts to scale the alpha value on the fly when your model needs the extra context, so that at low context you maintain the quality and at higher context you maintain the attention
So if you want to extend past 32k you should set the alpha value to 3.0
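To make the static vs. dynamic distinction a bit more concrete, here is a rough Python sketch. It assumes the commonly used NTK-style formulas (a fixed alpha raising the RoPE base, and the dynamic variant that only starts scaling once the sequence is longer than the trained context); it is not taken from ExLlamaV2 or the model's code. The numbers in the usage note (head size 128, base 1e6, trained context 32768, factor 3.0) are assumptions for illustration only.

```python
def ntk_alpha_base(base, alpha, head_dim):
    """Static NTK-style scaling: one fixed alpha raises the RoPE base.

    Assumed formula: base * alpha ** (head_dim / (head_dim - 2)).
    """
    return base * alpha ** (head_dim / (head_dim - 2))


def dynamic_alpha(seq_len, trained_len, factor):
    """Dynamic NTK-style scaling: alpha depends on the current sequence length.

    Within the trained context nothing is scaled (alpha = 1.0); past it,
    alpha grows with the sequence length.
    """
    if seq_len <= trained_len:
        return 1.0
    return factor * seq_len / trained_len - (factor - 1)


def dynamic_base(base, seq_len, trained_len, factor, head_dim):
    """RoPE base under dynamic scaling at a given sequence length."""
    alpha = dynamic_alpha(seq_len, trained_len, factor)
    return ntk_alpha_base(base, alpha, head_dim)
```

With those assumed values, dynamic_base(1e6, 16384, 32768, 3.0, 128) leaves the base untouched, while ntk_alpha_base(1e6, 3.0, 128) always raises it, which is the "you just don't get the benefit when you don't need it" trade-off described above.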
I think you can update the model card to reflect the 8.0bpw quant. I almost missed the treasure.