Update README.md
README.md
CHANGED
@@ -129,21 +129,16 @@ This example demonstrates how to load the model and tokenizer, prepare input, ge

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

+Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources.
+
+The pre-training data has a cutoff of September 2023.
+
To compose the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly the same number of English examples as German examples:
1. Add all multi-turn examples
2. Add the entire code_alpaca dataset subset
3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each subset contributes an equal number of high-quality examples (see the sketch below)

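The balancing in step 4 can be made concrete with a short sketch. The following Python snippet is purely illustrative and is not taken from the project's code; the function name `compose_honey` and the example fields (`subset`, `language`, `is_multi_turn`, `quality_score`) are assumptions about how the metadata might be organized, and the budgeting heuristic is only one plausible reading of the list above.

```python
# Illustrative sketch of the "Honey" composition steps described above.
# Field names and the exact budgeting heuristic are assumptions, not the
# project's actual schema or code.

def compose_honey(examples):
    german = [ex for ex in examples if ex["language"] == "de"]
    english = [ex for ex in examples if ex["language"] == "en"]

    # Target: roughly as many English examples as there are German ones.
    budget = len(german)
    selected = list(german)

    # 1. Add all (English) multi-turn examples.
    multi_turn = [ex for ex in english if ex["is_multi_turn"]]
    selected += multi_turn

    # 2./3. Add the code_alpaca and lmsys_chat_1m_high_quality_train_en
    # subsets in full.
    full_subsets = {"code_alpaca", "lmsys_chat_1m_high_quality_train_en"}
    fixed = [ex for ex in english
             if not ex["is_multi_turn"] and ex["subset"] in full_subsets]
    selected += fixed

    # 4. Fill the remaining budget with the highest-scoring examples from the
    # other English subsets, each contributing an equal share.
    remaining = ["open_orca", "evol_instruct_143k",
                 "evol_instruct_70k", "bactrianx_EN"]
    per_subset = max(budget - len(multi_turn) - len(fixed), 0) // len(remaining)
    for name in remaining:
        pool = [ex for ex in english
                if not ex["is_multi_turn"] and ex["subset"] == name]
        pool.sort(key=lambda ex: ex["quality_score"], reverse=True)
        selected += pool[:per_subset]

    return selected
```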
-## Dataset Sizes Before Composition
-
-### English
-
-
-
-### German
-
-

### Training Procedure

@@ -199,7 +194,27 @@ The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Tr
| Distributed-optimizers | yes |
| Model Initialization | |

+### Compute Infrastructure
+
+We trained our models on JUWELS Booster, which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs, and the compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.
+
+#### Hardware
+
+The configuration of the JUWELS Booster compute nodes is as follows:
+
+- CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
+- Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
+- GPU: 4 × NVIDIA A100 Tensor Core GPUs with 40 GB each, connected to one another via NVLink3
+- Network: 4 × Mellanox HDR200 InfiniBand ConnectX-6 HCAs (200 Gbit/s each)
+- Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes); the PCIe switches are configured in synthetic mode
+
+#### Software
+
+https://github.com/OpenGPTX/Megatron-LM

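The training code itself is the Megatron-LM fork linked above. Purely as an illustration of how the node layout described under Hardware (four A100s per node, NCCL over NVLink within a node and InfiniBand across nodes) maps onto a distributed PyTorch job, here is a minimal, hypothetical sketch; it is not the project's actual entry point, and it assumes a `torchrun`-style launcher that sets `LOCAL_RANK` for each process.

```python
# Hypothetical sketch: one process per GPU on a 4-GPU node (e.g. a JUWELS
# Booster node), with NCCL handling intra-node NVLink traffic and inter-node
# InfiniBand traffic. Not the project's actual training script.
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # torchrun (or an equivalent launcher) sets LOCAL_RANK for each process.
    local_rank = int(os.environ["LOCAL_RANK"])  # 0..3 on a 4-GPU node

    # Pin this process to one of the node's A100s before creating the group.
    torch.cuda.set_device(local_rank)

    # NCCL backend for GPU collectives.
    dist.init_process_group(backend="nccl")

    print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
          f"bound to local GPU {local_rank}")
    return local_rank


if __name__ == "__main__":
    init_distributed()
    dist.destroy_process_group()
```

On a single node this could be launched, for example, with `torchrun --nproc_per_node=4 script.py`.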
**BibTeX:**