Text Generation
Transformers
Safetensors
llama
text-generation-inference
Inference Endpoints
mfromm committed on
Commit 2bcc49d (verified) · 1 Parent(s): 93d19b5

Update README.md

Files changed (1)
  1. README.md +24 -9
README.md CHANGED
@@ -129,21 +129,16 @@ This example demonstrates how to load the model and tokenizer, prepare input, ge
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
+ Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources.
+
+ The pretraining data has a cutoff of September 2023.
+
 For composing the final instruction-tuning dataset, termed "Honey", we first include all German examples. We aim to include roughly the same number of English examples as German examples (see the Python sketch after this hunk):
 1. Add all multi-turn examples
 2. Add the entire code_alpaca dataset subset
 3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
 4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal number of high-quality examples
 
- ## Dataset Sizes Before Composition
-
- ### English
-
-
-
- ### German
-
-
 
 ### Training Procedure
 
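To make the composition steps in the hunk above concrete, here is a minimal Python sketch of the "Honey" selection logic. The dict-of-lists data layout, the `compose_honey` function name, the `multi_turn` subset name, and the `quality_score` field are illustrative assumptions, not part of the released pipeline; only the selection order follows the README text.

```python
# Minimal sketch of the "Honey" composition described above. The data layout,
# subset names, and "quality_score" field are assumptions for illustration.

def compose_honey(german_subsets, english_subsets, reward_key="quality_score"):
    """Return all German examples plus roughly as many English examples."""
    honey = [ex for subset in german_subsets.values() for ex in subset]
    target_en = len(honey)  # aim for roughly as many English as German examples

    # Steps 1-3: take these English subsets in full.
    always_full = ["multi_turn", "code_alpaca", "lmsys_chat_1m_high_quality_train_en"]
    selected_en = [ex for name in always_full for ex in english_subsets[name]]

    # Step 4: fill the remaining budget with the highest-reward examples,
    # split equally across the remaining English subsets.
    remaining = [name for name in english_subsets if name not in always_full]
    per_subset = max(0, target_en - len(selected_en)) // max(1, len(remaining))
    for name in remaining:
        ranked = sorted(english_subsets[name], key=lambda ex: ex[reward_key], reverse=True)
        selected_en.extend(ranked[:per_subset])

    return honey + selected_en
```

Treating the multi-turn data as a single subset is a simplification here; the point of the sketch is only the equal per-subset budget of highest-reward examples in step 4.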
 
@@ -199,7 +194,27 @@ The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Tr
 | Distributed-optimizers | yes |
 | Model Initialization | |
 
+ ### Compute Infrastructure
+
+ We trained our models on JUWELS Booster, which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs, and the compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.
+
+ #### Hardware
+
+ The configuration of a JUWELS Booster compute node is as follows:
+
+ CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
+
+ Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
+
+ GPU: 4 × NVIDIA A100 Tensor Core GPUs with 40 GB memory each, connected to each other via NVLink3
+
+ Network: 4 × Mellanox HDR200 InfiniBand ConnectX-6 HCAs (200 Gbit/s each)
+
+ Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). The PCIe switches are configured in synthetic mode.
+
+ #### Software
 
+ https://github.com/OpenGPTX/Megatron-LM
 
 **BibTeX:**
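As a quick sanity check of the hardware figures added in the hunk above, the short script below derives a few aggregate numbers (threads per node, total GPU count, per-node GPU memory, and per-node injection bandwidth) purely from the per-node specification quoted there; the aggregates are illustrative arithmetic, not additional claims from the model card.

```python
# Derived figures for JUWELS Booster, computed from the per-node spec quoted above.
NODES = 936
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 40
SOCKETS, CORES_PER_SOCKET, SMT = 2, 24, 2
NICS_PER_NODE, NIC_GBITS = 4, 200

threads_per_node = SOCKETS * CORES_PER_SOCKET * SMT        # 2 x 24 x 2 = 96
total_gpus = NODES * GPUS_PER_NODE                         # 936 x 4 = 3744
gpu_memory_per_node_gb = GPUS_PER_NODE * GPU_MEMORY_GB     # 4 x 40 = 160 GB
injection_bw_per_node = NICS_PER_NODE * NIC_GBITS          # 4 x 200 = 800 Gbit/s

print(f"threads per node:              {threads_per_node}")
print(f"total A100 GPUs:               {total_gpus}")
print(f"GPU memory per node:           {gpu_memory_per_node_gb} GB")
print(f"injection bandwidth per node:  {injection_bw_per_node} Gbit/s")
```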