---
base_model:
- elinas/Llama-3-15B-Instruct-zeroed
library_name: transformers
tags:
- mergekit
- merge
- finetune
datasets:
- Chat-Error/Pure-dove-sharegpt
license: llama3
---
# Llama-3-15B-Instruct-zeroed-ft-v2
This is a QLoRA **finetune** of a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).
The model is based on a "zeroed" passthrough merge, [Llama-3-15B-Instruct-zeroed](https://huggingface.co/elinas/Llama-3-15B-Instruct-zeroed).
This was primarily an experiment to see how a passthrough merge responds to further finetuning of all LoRA modules.
The model was finetuned at **8192 context length**, which can possibly be extended up to 32k using RoPE scaling.
**v3 of the model will contain significantly more data, primarily human-focused, and aims to excel at writing while maintaining logic, coherency, and continuity.**
**[GGUF Quants provided by @gelukuMLG](https://huggingface.co/gelukuMLG/Llama-3-15B-Instruct-ft-v2-GGUF)**
## Datasets
* [Chat-Error/Pure-dove-sharegpt](https://huggingface.co/datasets/Chat-Error/Pure-dove-sharegpt)
A small, high-quality, curated dataset was used as a proof of concept to validate that the model stabilizes after the original passthrough merge.
## Finetuning details
This is a QLoRA model, and all of the LoRA modules were targeted this time to ensure sufficient training before moving on to larger datasets.
The first version of this model only targeted **o_proj** and **up_proj**.
```yaml
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
lora_modules_to_save:
- embed_tokens
- lm_head
```
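To give a rough sense of how much larger the adapter becomes when every projection is targeted, the sketch below counts LoRA parameters per decoder layer. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The projection shapes assume the merge keeps the per-layer dimensions of Llama-3-8B (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128), and the rank `RANK = 16` is a hypothetical value, since this card does not state the rank that was used.

```python
# Per-layer LoRA parameter count for the v1 and v2 target lists.
# Each adapted linear layer (in_features -> out_features) gains
# rank * (in_features + out_features) trainable weights.

# ASSUMPTION: per-layer shapes of Llama-3-8B; not stated in this card.
HIDDEN = 4096
MLP = 14336
KV = 8 * 128  # 8 key/value heads of dimension 128

# (in_features, out_features) for each projection in one decoder layer.
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

V1_TARGETS = ["o_proj", "up_proj"]   # first version of the model
V2_TARGETS = list(PROJ_SHAPES)       # this version: all seven projections

RANK = 16  # HYPOTHETICAL rank, for illustration only


def lora_params_per_layer(targets, rank):
    """Sum rank * (in + out) over the targeted projections of one layer."""
    return sum(rank * sum(PROJ_SHAPES[name]) for name in targets)


if __name__ == "__main__":
    v1 = lora_params_per_layer(V1_TARGETS, RANK)
    v2 = lora_params_per_layer(V2_TARGETS, RANK)
    print(f"v1 targets: {v1:,} LoRA params per layer")
    print(f"v2 targets: {v2:,} LoRA params per layer ({v2 / v1:.2f}x)")
```

The count covers only the low-rank adapters. The modules listed under `lora_modules_to_save` (`embed_tokens`, `lm_head`) are trained in full and are not included here.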
The model is coherent even when training the "zeroed" layers plus the additional layers, as recommended by [Charles Goddard](https://huggingface.co/chargoddard) (mergekit developer). Thank you for sharing the merging method, and thanks to Toasty Pigeon for bringing it to my attention!
The following hyperparameters were used during training:
```yaml
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 3
- total_eval_batch_size: 3
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
```
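As an illustration of what these settings imply, the sketch below computes the effective batch size and the learning rate at a given step under a linear warmup followed by a half-cosine decay to zero. This is the commonly used form of the cosine schedule, not code taken from the training run. The total of 1157 steps is taken from `train/global_step` in the W&B summary, and a gradient accumulation of 1 is an assumption inferred from `total_train_batch_size: 3`.

```python
import math

BASE_LR = 1e-5        # learning_rate
WARMUP_STEPS = 25     # lr_scheduler_warmup_steps
TOTAL_STEPS = 1157    # train/global_step reported by W&B


def effective_batch_size(per_device, num_devices, grad_accum=1):
    """Samples consumed per optimizer step (grad_accum=1 is an assumption)."""
    return per_device * num_devices * grad_accum


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(1, 3))
    for step in (0, 10, 25, 591, 1157):
        print(f"step {step:>4}: lr = {lr_at(step):.3e}")
```

Under this formula the learning rate reaches zero at the final step, which agrees with the `train/learning_rate 0.0` entry in the run summary.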
The `paged_adamw_8bit` optimizer and DeepSpeed ZeRO 3 were used at a LR of `1e-5` with the cosine scheduler for 1 epoch on 3x 3090s, taking 4 hours total.
**Unsloth** was used for speed and memory savings.
Sample packing and padding were disabled, which reduced VRAM consumption significantly at the cost of speed.
W&B Run Summary
```
wandb: eval/loss 0.90895
wandb: eval/runtime 463.4688
wandb: eval/samples_per_second 0.833
wandb: eval/steps_per_second 0.278
wandb: total_flos 8270790524928.0
wandb: train/epoch 1.0
wandb: train/global_step 1157
wandb: train/grad_norm 7.3847
wandb: train/learning_rate 0.0
wandb: train/loss 0.8702
wandb: train_loss 0.87814
wandb: train_runtime 16425.2713
wandb: train_samples_per_second 0.211
wandb: train_steps_per_second 0.07
```
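The figures in the summary can be cross-checked against each other with a little arithmetic. The sketch below derives the wall-clock training time, the step rate, the samples per step, and an approximate evaluation set size from the logged values; the variable names are illustrative and the derived numbers are estimates, not additional logged metrics.

```python
# Values copied from the W&B run summary above.
TRAIN_RUNTIME_S = 16425.2713
GLOBAL_STEPS = 1157
TRAIN_SAMPLES_PER_S = 0.211
EVAL_RUNTIME_S = 463.4688
EVAL_SAMPLES_PER_S = 0.833


def summarize():
    """Derive a few sanity-check quantities from the logged metrics."""
    steps_per_s = GLOBAL_STEPS / TRAIN_RUNTIME_S
    return {
        # Wall-clock training time in hours.
        "train_hours": TRAIN_RUNTIME_S / 3600,
        # Should round to the logged train_steps_per_second.
        "steps_per_s": steps_per_s,
        # Should be close to the total train batch size.
        "samples_per_step": TRAIN_SAMPLES_PER_S / steps_per_s,
        # Approximate number of evaluation samples.
        "eval_samples": EVAL_RUNTIME_S * EVAL_SAMPLES_PER_S,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4f}")
```

The derived samples per step comes out close to 3, matching `total_train_batch_size: 3`, and the training time is roughly four and a half hours.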
### Framework versions
- PEFT 0.10.0
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## Model Evaluation
TBD
If you have any questions or comments on the model, feel free to open a discussion in the community tab.
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)