Update README.md
README.md
CHANGED
@@ -47,9 +47,11 @@ Introducing **SauerkrautLM-1.5b** – our Sauerkraut version of the powerful [Qw
 - **Contact:** [VAGO solutions](https://vago-solutions.ai)
 
 ## Training Procedure
+
 This model is a demo intended to showcase the potential of resource-efficient training of large language models using Spectrum CPT. Here's a brief on the procedure:
 
 **Continuous Pre-training (CPT) on German Data**:
+
 Utilizing Spectrum by Eric Hartford, Lucas Atkins, Fernando Fernandes Neto, and David Golchinfar, the model targeted 25% of its layers during training. This approach allowed significant resource savings:
 Spectrum with 25% layer targeting consumed 309.78GB at a batch size of 2048.
 Full Fine-tuning targeting 100% of layers used 633.55GB at the same batch size.
@@ -62,8 +64,10 @@ In the German Rag evaluation, it is on par with 8 billion parameter models and,
 Despite the large volume of German CPT data, the model competes well against the Qwen2-1.5B-Instruct model and performs significantly better in German.
 
 **Post-CPT Training**:
+
 The model underwent 3 epochs of Supervised Fine-Tuning (SFT) with 700K samples.
 **Further Steps**:
+
 The model was aligned with Direct Preference Optimization (DPO) using 70K samples.
 
 ## Objective and Results
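As a rough check on the "significant resource savings" wording in the Training Procedure hunk, the two memory figures quoted there (309.78GB for Spectrum with 25% layer targeting, 633.55GB for full fine-tuning, both at batch size 2048) can be compared directly. The sketch below only does that arithmetic; the helper name `relative_saving` is made up for illustration and is not part of Spectrum or the model repository.

```python
def relative_saving(reduced_gb: float, baseline_gb: float) -> float:
    """Return the fraction of the baseline memory saved by the reduced setup."""
    if baseline_gb <= 0:
        raise ValueError("baseline_gb must be positive")
    return 1.0 - reduced_gb / baseline_gb


# Figures as quoted in the README diff (batch size 2048 in both cases).
spectrum_gb = 309.78   # Spectrum, 25% of layers targeted
full_ft_gb = 633.55    # full fine-tuning, 100% of layers targeted

saving = relative_saving(spectrum_gb, full_ft_gb)
absolute_gb = full_ft_gb - spectrum_gb

print(f"Absolute difference: {absolute_gb:.2f} GB")
print(f"Relative saving:     {saving:.1%}")
```

Under these numbers Spectrum needs a little under half the memory of full fine-tuning, which is what the README means by resource savings; the figures themselves are taken as given and not re-measured here.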