DavidGF committed
Commit 902942e · verified · 1 Parent(s): e96460e

Update README.md

Files changed (1)
  1. README.md +4 -0
README.md CHANGED
@@ -47,9 +47,11 @@ Introducing **SauerkrautLM-1.5b** – our Sauerkraut version of the powerful [Qw

- **Contact:** [VAGO solutions](https://vago-solutions.ai)

## Training Procedure

This model is a demo intended to showcase the potential of resource-efficient training of large language models using Spectrum CPT. Here is a brief overview of the procedure:

**Continuous Pre-training (CPT) on German Data**:

Utilizing Spectrum by Eric Hartford, Lucas Atkins, Fernando Fernandes Neto, and David Golchinfar, the model targeted 25% of its layers during training. This approach allowed significant resource savings:

- Spectrum with 25% layer targeting consumed 309.78 GB at a batch size of 2048.
- Full fine-tuning targeting 100% of layers used 633.55 GB at the same batch size.
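
The savings come from only a fraction of the weights receiving gradients and optimizer state. As a rough, hypothetical illustration of the idea (not Spectrum's actual selection logic, which scores layers by signal-to-noise ratio), freezing all but roughly a quarter of the transformer blocks might look like the sketch below; the every-fourth-block choice is only a placeholder:

```python
# Illustrative sketch only: freeze everything, then unfreeze ~25% of the
# transformer blocks. Spectrum itself picks the blocks from an SNR scan;
# the index-based selection here is just a stand-in.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")

for param in model.parameters():          # freeze all weights first
    param.requires_grad = False

for idx, block in enumerate(model.model.layers):
    if idx % 4 == 0:                      # placeholder: every 4th block ~= 25%
        for param in block.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")
```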

@@ -62,8 +64,10 @@ In the German Rag evaluation, it is on par with 8 billion parameter models and,

Despite the large volume of German CPT data, the model competes well against the Qwen2-1.5B-Instruct model and performs significantly better in German.

**Post-CPT Training**:

The model underwent 3 epochs of Supervised Fine-Tuning (SFT) with 700K samples.
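
For the SFT stage, a minimal sketch using TRL's `SFTTrainer` is given below; the checkpoint path, dataset file, and all hyperparameters other than the 3 epochs are placeholders, and exact argument names vary between TRL releases:

```python
# Minimal SFT sketch with TRL (argument names differ across TRL versions).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder data file; the actual 700K-sample SFT set is not public here.
dataset = load_dataset("json", data_files="sft_samples.jsonl", split="train")

trainer = SFTTrainer(
    model="path/to/german-cpt-checkpoint",     # placeholder: the CPT model from the step above
    args=SFTConfig(
        output_dir="sauerkraut-sft",
        num_train_epochs=3,                    # 3 epochs, as described above
        per_device_train_batch_size=4,         # placeholder value
    ),
    train_dataset=dataset,
)
trainer.train()
```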

**Further Steps**:

The model was aligned with Direct Preference Optimization (DPO) using 70K samples.
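
A comparable sketch of the DPO alignment step with TRL's `DPOTrainer` follows; the checkpoint path, preference-data file, and beta value are placeholders, and on newer TRL releases the `tokenizer` argument is named `processing_class`:

```python
# Minimal DPO sketch with TRL; the dataset needs "prompt", "chosen" and
# "rejected" columns. Argument names vary across TRL versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "path/to/sft-checkpoint"            # placeholder: the SFT model from the previous step
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder preference pairs; the actual 70K-sample DPO set is not public here.
pref_data = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,                            # TRL builds the frozen reference model internally
    args=DPOConfig(output_dir="sauerkraut-dpo", beta=0.1),  # beta is a placeholder value
    train_dataset=pref_data,
    tokenizer=tokenizer,                       # named processing_class in newer TRL releases
)
trainer.train()
```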

## Objective and Results
 