Jim Lai

grimjim

AI & ML interests

Experimenting primarily with 7B-12B parameter text completion models. Not all models are intended for direct use, but aim for educational and/or merge purposes.

Recent Activity

new activity about 12 hours ago

FreedomIntelligence/HuatuoGPT-o1-8B:Please submit this model to the Open LLM Leaderboard

posted an update about 12 hours ago

I've arrived at an interesting result on the current Open LLM leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/ After I narrowed down the filter of models to be between 8-9B parameters, my recent merge achieved the highest MATH eval result of any Llama 3.x 8B model currently on the board, hitting 33.99%, placing 973/2795. https://huggingface.co/grimjim/HuatuoSkywork-o1-Llama-3.1-8B Unfortunately, I need more information to evaluate the parent models used in the merge. The Skywork/Skywork-o1-Open-Llama-3.1-8B model scored 0% on the MATH eval, which I suspect was due to output formatting that was baked too hard into the model, and placed 2168/2795; the merge achieved a significant uplift in every benchmark across the board. Unfortunately, FreedomIntelligence/HuatuoGPT-o1-8B was not currently benched as of this post, so I am unable to assess relative benchmarks. Nevertheless, it is intriguing that an ostensibly medical o1 model appears to have resulted in a sizable MATH boost.

updated a model about 13 hours ago

grimjim/Llama3.1-SuperNovaLite-HuatuoSkywork-o1-8B

View all activity

Organizations

grimjim's activity

posted an update about 12 hours ago

Post

457

I've arrived at an interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard
After I narrowed down the filter of models to be between 8-9B parameters, my recent merge achieved the highest MATH eval result of any Llama 3.x 8B model currently on the board, hitting 33.99%, placing 973/2795.
grimjim/HuatuoSkywork-o1-Llama-3.1-8B

Unfortunately, I need more information to evaluate the parent models used in the merge.
The Skywork/Skywork-o1-Open-Llama-3.1-8B model scored 0% on the MATH eval, which I suspect was due to output formatting that was baked too hard into the model, and placed 2168/2795; the merge achieved a significant uplift in every benchmark across the board.
Unfortunately, FreedomIntelligence/HuatuoGPT-o1-8B was not currently benched as of this post, so I am unable to assess relative benchmarks. Nevertheless, it is intriguing that an ostensibly medical o1 model appears to have resulted in a sizable MATH boost.

replied to their post about 13 hours ago

Example of a fixed model.
https://huggingface.co/grimjim/lemon07r_Gemma-2-Ataraxy-v4c-9B_fixed

posted an update 5 days ago

Post

2543

I'm (finally) releasing a Python script that trims excess weights in Gemma2 full-weight models that bloated by ~1B parameters due to an early mergekit bug.
https://github.com/jim-plus/Gemma2-mergekit-remediation

I'd noticed something was off when merges of Gemma2 9B models ended up having ~10B parameters. The current mergekit package is fine, but there are still bloated models on HF that could stand to be fixed.

The script assumes that it will be run from the same directory as the model weights, and will trim the unnecessary lm_head.weight tensor and corresponding index entry.

2 replies

posted an update 21 days ago

Post

1397

A reminder that literal base models are valid choices for base model in task arithmetic mergers. Each Instruct or fine-tuned model then becomes a vector against the base model. Example merge formula used can be found via this model page.
grimjim/Magnolia-v3-12B

posted an update about 1 month ago

Post

1032

Speculative decoding only requires that the tokenizers for the two LLMs used line up; the model architectures do not have to be otherwise compatible. As proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate the target model of a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM.
https://github.com/turboderp/exllamav2
meta-llama/Llama-3.2-1B-Instruct
grimjim/llama-3-Nephilim-v3-8B

The proof-of-concept Python script compared a zero-shot creative task of writing a story limited to 500 tokens. Speculative decoding improved performance by approximately one third (e.g., increasing from 31 tokens/sec to 46 tokens/sec) over conventional decoding, and was consistent over a few runs. While not statistically significant, this implies that smaller models aimed at edge computing can serve effectively as draft models in the general case.

It is straightforward to consult literature to affirm that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on part with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning.

It is intuitively obvious that a distilled model can be used as a draft model for the larger teacher model so long as tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model.
arcee-ai/Llama-3.1-SuperNova-Lite
meta-llama/Llama-3.1-405B-Instruct

posted an update 3 months ago

Post

2146

To demonstrate that it was possible, I performed a "trapezoid" gradient merge of a Llama 3 8B model onto Llama 3.1 8B Instruct, favoring the L3.1 model at the ends in order to preserve coherence and limiting the influence of the L3 model to at most 0.1 weight. Tested to 16k context length.
grimjim/Llama-Nephilim-Metamorphosis-v2-8B

replied to their post 3 months ago

Those look like prefills. Unless you want to train for prefill-specific outputs, it makes sense to remove them.

replied to their post 4 months ago

For DPO, I'd stick with what HF recommends, which in their example does not have prompt repetition.
https://huggingface.co/docs/trl/main/en/dpo_trainer

Offhand, for multi-turn data, I'd go with what the LLM "sees" in practice, so prior turns are probably part of the prompt, and "chosen" and "rejected" guide what text generation occurs.

posted an update 4 months ago

Post

1967

I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora.
https://arxiv.org/abs/2408.14774
In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".

7 replies

posted an update 4 months ago

Post

3241

I found this paper to be thought-provoking: "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" by Bansal, Hosseini, Agarwal, Tran, and Kazemi.
https://arxiv.org/abs/2408.16737
The direct implication is that smaller models could be used to create cost-effective synthetic datasets. And on that note, in the Gemma terms of use, Google explicitly claims no rights on outputs generated from those models, which means one is free to synthgen from the Gemma line. Meta's Llama 3 licence forbids synthetic generation of outputs if used to improve other models. Relevant Mistral, Qwen, and Yi models under the Apache 2.0 license are unrestricted for this purpose.

2 replies

posted an update 4 months ago

Post

1642

This merge, this time grounded in Gemma2 9B Instruct fine-tunes, is another demonstration that models without any fine-tuning to support roleplay can still perform the function, maintaining coherence and attention to context. It should be evident that no overt fine-tuning is required for roleplay in text generation; pretraining should provide models with a requisite basic understanding of the world, so all that should be needed is some corrective fine-tuning to address observed defects in portraying the world along with datasets to promote a suitably entertaining writing style. Good Instruct tuning should promote reasoning, coherence, and attention to context.
grimjim/Kitsunebi-v1-Gemma2-8k-9B
grimjim/Kitsunebi-v1-Gemma2-8k-9B-GGUF

I opted not to incorporate the UCLA SPPO fine-tune for Gemma2 9B after observing context confusion occur with some frequency during complex scenarios.

Thanks to Axcxept co., ltd. for fine-tuning HODACHI/EZO-Common-9B-gemma-2-it, and to Princeton NLP Group for fine-tuning princeton-nlp/gemma-2-9b-it-SimPO.
AXCXEPT/EZO-Common-9B-gemma-2-it
princeton-nlp/gemma-2-9b-it-SimPO

posted an update 5 months ago

Post

2814

I've observed that the layers targeted in various abliteration notebooks (e.g., https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing ) appear to be arbitrary, reflecting probable brute-force exploration. This doesn't need to be the case.

Taking a cue from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" ( https://arxiv.org/abs/2403.17887 ) and PruneMe (https://github.com/arcee-ai/PruneMe), it seems reasonable to target deeper layers identified as more redundant given measured similarity across layers, as the result should be less damaging to models, reducing the need for subsequent fine-tuning. Intuitively, one should expect the resulting intervention layers to be deep but not final. The only uncertainty is if the redundancy successfully encodes refusals, something which is almost certainly model-dependent. This approach only requires the redundancy to be computed once per model, and the result used as a starting point for which layer range to restrict intervention to.

posted an update 5 months ago

Post

4185

I've come across theoretical justification for my prior experimentation with extremely low-weight mergers: they amount to flattening a model so its "massive activation" features remain as significant contributors. Extremely low-weight merge weights also effectively sparsify a contributing model with regard to the base model, but in a way which still preserves relationships within the flattened latent space. In the paper "Massive Activations in Large Language Models", the authors observed "very few activations exhibit significantly larger values than others (e.g., 100,000 times larger)", which in turn implies a lower bound in effective application of extremely low weight merging.
https://arxiv.org/abs/2402.17762

1 reply

replied to their post 6 months ago

Not the same model, but a related model.

I created what amounted to an abliteration LoRA by contrasting original L3 Instruct against failspy's abliterated L3 Instruct, then applied and merged the L3-derived LoRA on top of original L3.1 Instruct to obtain the final model.

The result appears to outperform mlabonne's reapplication of the abliteration technique directly to L3.1 Instruct.

replied to their post 6 months ago

An example use of mergekit to apply the lora is documented at the bottom of the model card as well as in mergekit_config.yaml. Although task_arithmetic was used, a passthrough merge will work as well.

An example of the mergekit LoRA extraction command is at the bottom of the model card for the LoRA:
https://huggingface.co/grimjim/Llama-3-Instruct-abliteration-LoRA-8B

replied to their post 6 months ago

Select GGUF quants.
https://huggingface.co/grimjim/Llama-3.1-8B-Instruct-abliterated_via_adapter-GGUF

posted an update 6 months ago

Post

2299

In principle, it's possible to "abliterate" refusals in any Llama 3.1 8B models via application of a LoRA, using only mergekit.

Proof of concept below:
grimjim/Llama-3.1-8B-Instruct-abliterated_via_adapter

5 replies

posted an update 6 months ago

Post

2757

Intelligence is all you need for roleplay.
Roleplay is overlooked as a special case of chain-of-thought, where context must be attended to and inferred state of the world and embodied minds must be persisted and evolved along credible narrative lines. LLMs are also being tasked to function as gamemasters. It's a challenging task which points to potential future benchmarks. The fact that the largest commercial LLMs are adept in generating text for roleplay intuitively implies that model intelligence is sufficient so long as it can generalize properly and pay attention to context without becoming confused.
This recent merge of mine composed using 3 academic fine-tunes, none of which were intended for roleplay, has survived the gauntlet of a Reddit post and appears to be a particularly strong 8B model when it comes to roleplay coherence.
grimjim/llama-3-Nephilim-v3-8B (bf16 weights)
grimjim/llama-3-Nephilim-v3-8B-GGUF (select quants)

1 reply

replied to their post 6 months ago

Something odd is happening when merging with the OVA model. It will reduce refusals at medium (0.5-0.6) weight, but at full (1.0) weight against Instruct 8B, the result is incoherent. The LoRA should work, though!

It should also be possible to modify abliteration scripts to instead directly produce a LoRA as output.

posted an update 6 months ago

Post

2271

Below we experiment with negative merger weighting (-1.0!) using task arithmetic. Merge formula on the model card and in the repo itself.

This model is steered to behave opposite to what MopeyMule demonstrated.

Based on the implications of the merge technique, we also propose Orthogonalized Vector Adaptation (OVA). We also extract a LoRA of the counter-refusal abliteration steering vector.

The resulting merger is not a perfect model, but it's a behaviorally interesting model. The model name was inspired by a Philip K. Dick story.
grimjim/Llama-3-Perky-Pat-Instruct-8B

Refusal vector weights ready for use:
grimjim/Llama-3-Instruct-abliteration-OVA-8B
grimjim/Llama-3-Instruct-abliteration-LoRA-8B

3 replies