Update README.md
README.md
CHANGED
@@ -129,21 +129,16 @@ This example demonstrates how to load the model and tokenizer, prepare input, ge

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

+Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources.
+
+The pre-training data has a cutoff of September 2023.
+
To compose the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly the same number of English examples as German examples:
1. Add all multi-turn examples
2. Add the entire code_alpaca dataset subset
3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each subset contributes an equal number of high-quality examples (see the sketch below)

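The balancing in step 4 can be made concrete with a short sketch. The following Python snippet is purely illustrative and is not taken from the project's code; the function name `compose_honey` and the example fields (`subset`, `language`, `is_multi_turn`, `quality_score`) are assumptions about how the metadata might be organized, and the budgeting heuristic is only one plausible reading of the list above.

```python
# Illustrative sketch of the "Honey" composition steps described above.
# Field names and the exact budgeting heuristic are assumptions, not the
# project's actual schema or code.

def compose_honey(examples):
    german = [ex for ex in examples if ex["language"] == "de"]
    english = [ex for ex in examples if ex["language"] == "en"]

    # Target: roughly as many English examples as there are German ones.
    budget = len(german)
    selected = list(german)

    # 1. Add all (English) multi-turn examples.
    multi_turn = [ex for ex in english if ex["is_multi_turn"]]
    selected += multi_turn

    # 2./3. Add the code_alpaca and lmsys_chat_1m_high_quality_train_en
    # subsets in full.
    full_subsets = {"code_alpaca", "lmsys_chat_1m_high_quality_train_en"}
    fixed = [ex for ex in english
             if not ex["is_multi_turn"] and ex["subset"] in full_subsets]
    selected += fixed

    # 4. Fill the remaining budget with the highest-scoring examples from the
    # other English subsets, each contributing an equal share.
    remaining = ["open_orca", "evol_instruct_143k",
                 "evol_instruct_70k", "bactrianx_EN"]
    per_subset = max(budget - len(multi_turn) - len(fixed), 0) // len(remaining)
    for name in remaining:
        pool = [ex for ex in english
                if not ex["is_multi_turn"] and ex["subset"] == name]
        pool.sort(key=lambda ex: ex["quality_score"], reverse=True)
        selected += pool[:per_subset]

    return selected
```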
-## Dataset Sizes Before Composition
-
-### English
-
-
-
-### German
-
-

### Training Procedure

@@ -199,7 +194,27 @@ The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Tr
| Distributed-optimizers | yes |
| Model Initialization | |

+### Compute Infrastructure
+
+We trained our models on JUWELS Booster, which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs, and the compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.
+
+#### Hardware
+
+The configuration of the JUWELS Booster compute nodes is as follows:
+
+- CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
+- Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
+- GPU: 4 × NVIDIA A100 Tensor Core GPUs with 40 GB each, connected to one another via NVLink3
+- Network: 4 × Mellanox HDR200 InfiniBand ConnectX-6 HCAs (200 Gbit/s each)
+- Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes); the PCIe switches are configured in synthetic mode
+
+#### Software
+
+https://github.com/OpenGPTX/Megatron-LM

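The training code itself is the Megatron-LM fork linked above. Purely as an illustration of how the node layout described under Hardware (four A100s per node, NCCL over NVLink within a node and InfiniBand across nodes) maps onto a distributed PyTorch job, here is a minimal, hypothetical sketch; it is not the project's actual entry point, and it assumes a `torchrun`-style launcher that sets `LOCAL_RANK` for each process.

```python
# Hypothetical sketch: one process per GPU on a 4-GPU node (e.g. a JUWELS
# Booster node), with NCCL handling intra-node NVLink traffic and inter-node
# InfiniBand traffic. Not the project's actual training script.
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # torchrun (or an equivalent launcher) sets LOCAL_RANK for each process.
    local_rank = int(os.environ["LOCAL_RANK"])  # 0..3 on a 4-GPU node

    # Pin this process to one of the node's A100s before creating the group.
    torch.cuda.set_device(local_rank)

    # NCCL backend for GPU collectives.
    dist.init_process_group(backend="nccl")

    print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
          f"bound to local GPU {local_rank}")
    return local_rank


if __name__ == "__main__":
    init_distributed()
    dist.destroy_process_group()
```

On a single node this could be launched, for example, with `torchrun --nproc_per_node=4 script.py`.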
**BibTeX:**