Updated Readme.txt
README.md CHANGED
@@ -52,20 +52,21 @@ By design, this model has a strong vorny bias. It's not intended for use by anyo

The model was fine-tuned using a [rank-stabilized](https://arxiv.org/abs/2312.03732) [QLoRA adapter](https://arxiv.org/abs/2305.14314). Training was performed using the [Unsloth AI](https://github.com/unslothai/unsloth) library on `Ubuntu 22.04.4 LTS` with `CUDA 12.1` and `PyTorch 2.3.0`.

The total training time on an NVIDIA GeForce RTX 4060 Ti is about 26 hours.
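
As a rough illustration of this setup, loading the base model in 4-bit through Unsloth typically looks like the sketch below; the model path and sequence length are placeholders rather than values taken from this card.

```python
from unsloth import FastLanguageModel

# Placeholders: the actual base checkpoint and context length are not shown
# in this excerpt of the card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="path/to/base-model",
    max_seq_length=4096,
    load_in_4bit=True,   # QLoRA: the base weights are loaded 4-bit quantized
)
```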

After training, the adapter weights were merged into the dequantized model as described in [ChrisHayduk's GitHub gist](https://gist.github.com/ChrisHayduk/1a53463331f52dca205e55982baf9930).
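
The linked gist dequantizes the 4-bit base weights and folds the adapter deltas into them layer by layer. As a simplified sketch of the same idea (not the gist's exact procedure), a merge against a half-precision copy of the base model with `peft` looks roughly like this; all paths are placeholders.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Placeholders: base model, adapter, and output paths are not given in this excerpt.
base = AutoModelForCausalLM.from_pretrained("path/to/base-model", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "path/to/qlora-adapter").merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```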

The quantized version of the model was prepared using [llama.cpp](https://github.com/ggerganov/llama.cpp).

### QLoRA adapter configuration

- Rank: 64
- Alpha: 16
- Dropout rate: 0.1
- Target weights: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- `use_rslora=True`

Targeting all of the projection weights with the QLoRA adapter resulted in the smallest loss compared to other combinations, even against adapters with a larger rank. A sketch of this configuration is shown below.
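
A minimal sketch of how these settings map onto Unsloth's `get_peft_model`, assuming `model` is the 4-bit base model loaded earlier; the gradient-checkpointing and seed arguments are placeholders not stated in the card.

```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,                    # 4-bit base model from FastLanguageModel.from_pretrained
    r=64,                     # rank
    lora_alpha=16,            # alpha
    lora_dropout=0.1,         # dropout rate
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,          # rank-stabilized LoRA scaling
    use_gradient_checkpointing="unsloth",  # placeholder, not stated in the card
    random_state=0,                        # placeholder
)
```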
### Domain adaptation
@@ -91,6 +92,7 @@ The raw-text stories in the dataset were edited as follows:

- Batch size: 1
- Gradient accumulation steps: 1

The training takes ~24 hours on an NVIDIA GeForce RTX 4060 Ti.
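
For context, a minimal sketch of how these two values might be passed to a TRL `SFTTrainer`; the learning rate, epoch count, dataset, and output directory are placeholders, since only the batch size and gradient accumulation steps are shown in this excerpt.

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = Dataset.from_dict({"text": ["example training document"]})  # stand-in dataset

args = TrainingArguments(
    per_device_train_batch_size=1,   # Batch size: 1
    gradient_accumulation_steps=1,   # Gradient accumulation steps: 1
    learning_rate=2e-4,              # placeholder
    num_train_epochs=1,              # placeholder
    output_dir="outputs",            # placeholder
)

trainer = SFTTrainer(
    model=model,                     # QLoRA-wrapped model from the earlier sketches
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=args,
)
trainer.train()
```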
#### Plots