---
language:
- de
- bg
- cs
- da
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sl
- sv
- sk
metrics:
- accuracy
- bleu
pipeline_tag: text-generation
library_name: transformers
base_model:
- openGPT-X/Teuken-7B-base-v0.4
---
# Model Card for Teuken-7B-instruct-v0.4
Teuken-7B-base-v0.4 is a 7B-parameter multilingual large language model (LLM) pre-trained on 4T tokens as part of the OpenGPT-X research project.
Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4.
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
- **Funded by:** German Federal Ministry for Economic Affairs and Climate Action (BMWK) in the context of the OpenGPT-X project
- **Model type:** Transformer-based decoder-only model
- **Language(s) (NLP):** bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
- **Shared by:** OpenGPT-X
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Teuken-7B-instruct-v0.4 is intended for commercial and research use in all 24 official languages of the European Union. Because Teuken-7B-instruct-v0.4 focuses on covering all 24 EU languages, it yields more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for multilingual tasks.
## Disclaimer: Toxic Content
This large language model (LLM) may generate content that is inappropriate, offensive, or harmful. Although the dataset has been heavily filtered to minimize such outputs, the model may still produce biased or toxic text because of the large scale and diverse nature of the data.
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not intended for use in math and coding tasks.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4 that is not completely free from biases and hallucinations.
## How to Get Started with the Model
The model requires the transformers, sentencepiece, and torch libraries.
After installation, the two code blocks below show the prompt template and a complete generation example.
The prompt template for the fine-tuned model is defined as follows:
```python
user = "Hi!"
lang_code = "DE"
system_messages = {
    "EN": "A chat between a human and an artificial intelligence assistant."
    " The assistant gives helpful and polite answers to the human's questions.",
    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}

prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:<s>"
```
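When several user messages need the same format, the template above can be wrapped in a small function. The sketch below is only an illustration: `build_prompt` and `SYSTEM_MESSAGES` are hypothetical names that do not come from the model repository, and only the two system messages shown above are included.

```python
# Hypothetical helper (not part of the model repository): wraps the prompt
# template shown above so it can be reused for several user messages.
SYSTEM_MESSAGES = {
    "EN": (
        "A chat between a human and an artificial intelligence assistant."
        " The assistant gives helpful and polite answers to the human's questions."
    ),
    "DE": (
        "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
        " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen."
    ),
}


def build_prompt(user: str, lang_code: str = "DE") -> str:
    """Return the single-turn prompt string for the given language code."""
    try:
        system = SYSTEM_MESSAGES[lang_code.upper()]
    except KeyError:
        raise ValueError(f"No system message defined for language code {lang_code!r}") from None
    return f"System: {system}\nUser: {user}\nAssistant:<s>"


print(build_prompt("Hi!", "DE"))
```

Raising a `ValueError` for an unknown language code makes a missing system message visible immediately, which is a design choice of this sketch rather than documented behaviour of the model.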
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "openGPT-X/Teuken-7B-instruct-v0.4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    # Needs the flash-attn package and a supported GPU; drop this argument otherwise.
    attn_implementation="flash_attention_2",
)
model = model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

# chat_template="DE" selects the German chat template ("Wer bist du?" = "Who are you?").
messages = [{"role": "User", "content": "Wer bist du?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="DE",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

# Sample one continuation; max_length counts the prompt tokens as well.
prediction = model.generate(
    prompt_ids.to(model.device),
    max_length=512,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    num_return_sequences=1,
)
prediction_text = tokenizer.decode(prediction[0])
print(prediction_text)
```
This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
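For decoder-only models, `generate` returns the prompt tokens followed by the continuation, so the printed text repeats the prompt. If only the answer is wanted, the prompt can be sliced off before decoding. The following is a minimal sketch with a hypothetical helper, `continuation_ids`, demonstrated on plain lists with made-up token ids so that it runs without downloading the model.

```python
from typing import Sequence


def continuation_ids(sequence: Sequence[int], prompt_length: int) -> Sequence[int]:
    """Return only the token ids that follow the first prompt_length ids."""
    if not 0 <= prompt_length <= len(sequence):
        raise ValueError("prompt_length must lie between 0 and len(sequence)")
    return sequence[prompt_length:]


# Toy illustration with plain lists instead of tensors (made-up token ids).
prompt = [1, 17, 42]
generated = prompt + [7, 8, 9]
print(continuation_ids(generated, len(prompt)))  # [7, 8, 9]
```

With the variables from the example above, and assuming `prompt_ids` is a tensor of shape `(1, prompt_length)`, the same idea reads `tokenizer.decode(prediction[0][prompt_ids.shape[-1]:], skip_special_tokens=True)`. Since `max_length=512` also counts the prompt tokens, `max_new_tokens` is the `generate` argument to use when only the length of the continuation should be limited.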
## Training Details
### Pre-Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Teuken-7B-base-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources.
The pretraining data has a cutoff of September 2023.
More information is available in our [preprint](http://arxiv.org/abs/2410.08800).
### Instruction-Tuning Data
### English
| Dataset file | Sample Count |
| ----------------------------------------------------- | ------------ |
| en/bactrianx_EN_fastchat.jsonl | 66985 |
| en/code_alpaca_fastchat.jsonl | 19990 |
| en/evol_instruct_143k_fastchat.jsonl | 142968 |
| en/evol_instruct_70k_fastchat.jsonl | 69968 |
| en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651 |
| en/open_orca_fastchat_aa.jsonl | 599968 |
| en/open_orca_fastchat_ab.jsonl | 599968 |
| en/open_orca_fastchat_ac.jsonl | 599968 |
| en/open_orca_fastchat_ad.jsonl | 599968 |
| en/open_orca_fastchat_ag.jsonl | 599968 |
| en/open_orca_fastchat_ah.jsonl | 33891 |
| en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
| en/ultrachat_200k_fastchat.jsonl | 11525 |
| **total** | **3457698** |
### German
| Dataset file | Sample Count |
| ----------------------------------------------------------- | ------------ |
| de/bactrianx_DE_fastchat.jsonl | 67017 |
| de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl | 49969 |
| de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022 |
| de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl | 6101 |
| de/german_poems_fastchat.jsonl | 400 |
| de/german_songs_fastchat.jsonl | 1000 |
| de/ultrachat_de_1k_fastchat.jsonl | 959 |
| **total** | **184468** |
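The totals in the English and German tables can be re-derived from the per-file counts. The short script below copies the counts from the tables and also reports the language split, which shows how strongly the instruction-tuning data is weighted towards English. The variable names are illustrative only.

```python
# Sample counts copied from the English and German tables above.
english = {
    "bactrianx_EN": 66985,
    "code_alpaca": 19990,
    "evol_instruct_143k": 142968,
    "evol_instruct_70k": 69968,
    "lmsys_chat_1m_high_quality_train_en": 18651,
    "open_orca_aa": 599968,
    "open_orca_ab": 599968,
    "open_orca_ac": 599968,
    "open_orca_ad": 599968,
    "open_orca_ag": 599968,
    "open_orca_ah": 33891,
    "sharegpt_v3_unfiltered": 93880,
    "ultrachat_200k": 11525,
}
german = {
    "bactrianx_DE": 67017,
    "freedomintelligence_alpaca_gpt4_deutsch": 49969,
    "freedomintelligence_evol_instruct_deutsch": 59022,
    "freedomintelligence_sharegpt_deutsch": 6101,
    "german_poems": 400,
    "german_songs": 1000,
    "ultrachat_de_1k": 959,
}

en_total = sum(english.values())
de_total = sum(german.values())
combined = en_total + de_total
open_orca = sum(n for name, n in english.items() if name.startswith("open_orca"))

print(en_total, de_total, combined)  # 3457698 184468 3642166
print(f"German share of all samples: {de_total / combined:.1%}")         # 5.1%
print(f"OpenOrca share of English samples: {open_orca / en_total:.1%}")  # 87.7%
```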
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Teuken-7B-instruct-v0.4 is an instruction fine-tuned version of Teuken-7B-base-v0.4.
#### Training Hyperparameters
- **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, , bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can be seen in the [European LLM Leaderboard](https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
## Technical Specifications
### Model Architecture and Objective
| Hyper-Parameter | Value |
|----------------------------|----------|
| Training Objective | CLM |
| Activation Function | SwiGLU |
| Seq Length | 4096 |
| Position Embeddings | Rotary |
| Num Layers | 32 |
| Hidden Size | 4096 |
| FFN Hidden Size | 13440 |
| Num Attention Heads | 32 |
| Head Dim | 128 |
| Grouped Query Attention    | yes      |
| Num Query Groups | 2 |
| Normalization | RMSNorm |
| Learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Disable bias in linear | yes |
| Hidden dropout | 0.0 |
| Attention dropout | 0.0 |
| Optimizer | AdamW |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
| Sequence-parallelism       |          |
| Data-type | bf16 |
| Recompute-activations | yes |
| Distributed-optimizers | yes |
| Model Initialization | |
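As a rough cross-check of the table, the weight count of the Transformer layers and the size of the key/value cache can be estimated from the listed hyper-parameters. This is a back-of-the-envelope sketch under stated assumptions: "Num Query Groups" is read as the number of key/value heads, SwiGLU is assumed to use three weight matrices (gate, up, and down projection), and embeddings and normalization weights are left out because the card does not give the vocabulary size.

```python
# Values copied from the hyper-parameter table above.
HIDDEN = 4096
FFN_HIDDEN = 13440
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
NUM_KV_HEADS = 2      # assumption: "Num Query Groups" = number of key/value heads
SEQ_LEN = 4096
BYTES_PER_VALUE = 2   # bf16


def attention_params(kv_heads: int) -> int:
    """Weights of the query/output and key/value projections (no biases)."""
    q_and_out = 2 * HIDDEN * NUM_HEADS * HEAD_DIM
    k_and_v = 2 * HIDDEN * kv_heads * HEAD_DIM
    return q_and_out + k_and_v


def mlp_params() -> int:
    """Assumed SwiGLU layout: gate, up and down projection matrices."""
    return 3 * HIDDEN * FFN_HIDDEN


def kv_cache_bytes(kv_heads: int, tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens


per_layer = attention_params(NUM_KV_HEADS) + mlp_params()
layer_weights = NUM_LAYERS * per_layer
gqa_cache = kv_cache_bytes(NUM_KV_HEADS, SEQ_LEN)
mha_cache = kv_cache_bytes(NUM_HEADS, SEQ_LEN)

print(f"Transformer-layer weights: {layer_weights / 1e9:.2f}B")           # 6.43B
print(f"KV cache at {SEQ_LEN} tokens: {gqa_cache / 2**20:.0f} MiB")        # 128 MiB
print(f"Same cache with 32 KV heads: {mha_cache / 2**20:.0f} MiB")         # 2048 MiB
```

Under these assumptions, sharing key/value heads across query heads shrinks the cache for a full 4096-token sequence by a factor of 16 compared with one key/value head per attention head.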
### Compute Infrastructure
We trained our models on JUWELS Booster which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.
#### Hardware
The configuration of JUWELS Booster compute nodes is the following:
- **CPU:** AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
- **Memory:** 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
- **GPU:** 4 × NVIDIA A100 Tensor Core GPU with 40 GB; connected via NVLink3 to each other
- **Network:** 4 × Mellanox HDR200 InfiniBand ConnectX 6 (200 Gbit/s each), HCA
- **Periphery:** CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). PCIe switches are configured in synthetic mode.
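The per-node figures above can be scaled up to the whole system with simple arithmetic. The sketch below uses only numbers stated in this section. The results are totals for the full JUWELS Booster machine, not the resources of a particular training run, which this card does not specify.

```python
# Figures copied from the compute infrastructure description above.
NODES = 936
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 40
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 24
THREADS_PER_CORE = 2      # SMT-2
NICS_PER_NODE = 4
NIC_GBIT_PER_S = 200

threads_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET * THREADS_PER_CORE
total_gpus = NODES * GPUS_PER_NODE
total_gpu_memory_gb = total_gpus * GPU_MEMORY_GB
node_network_gbit_per_s = NICS_PER_NODE * NIC_GBIT_PER_S

print(f"Hardware threads per node: {threads_per_node}")               # 96
print(f"GPUs in the system: {total_gpus}")                            # 3744
print(f"Aggregate GPU memory: {total_gpu_memory_gb} GB")              # 149760 GB
print(f"InfiniBand bandwidth per node: {node_network_gbit_per_s} Gbit/s")  # 800 Gbit/s
```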
#### Software
[Megatron-LM](https://github.com/OpenGPTX/Megatron-LM)
**BibTeX:**
If you find our model useful in your research, please consider citing our [preprint](https://arxiv.org/abs/2410.03730):
```
@misc{ali2024teuken7bbaseteuken7binstructeuropean,
title={Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs},
author={Mehdi Ali and Michael Fromm and Klaudia Thellmann and Jan Ebert and Alexander Arno Weber and Richard Rutmann and Charvi Jain and Max Lübbering and Daniel Steinigen and Johannes Leveling and Katrin Klug and Jasper Schulze Buschhoff and Lena Jurkschat and Hammam Abdelwahab and Benny Jörg Stein and Karl-Heinz Sylla and Pavel Denisov and Nicolo' Brandizzi and Qasid Saleem and Anirban Bhowmick and Lennard Helmer and Chelsea John and Pedro Ortiz Suarez and Malte Ostendorff and Alex Jude and Lalith Manjunath and Samuel Weinbach and Carolin Penke and Oleg Filatov and Shima Asaadi and Fabio Barth and Rafet Sifa and Fabian Küch and Andreas Herten and René Jäkel and Georg Rehm and Stefan Kesselheim and Joachim Köhler and Nicolas Flores-Herr},
year={2024},
eprint={2410.03730},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.03730},
}
```
# Team
## Data Team
Anirban Bhowmick (IAIS), Nicolo Brandizzi (IAIS), Lennard Helmer (IAIS), Benny Jörg Stein (IAIS), Karl-Heinz Sylla (IAIS), Pavel Denisov (IAIS), Qasid Saleem (IAIS), Johannes Leveling (IAIS), Hammam Abdelwahab (IAIS), Luzian Hahn (IIS), Farzad Naderi (IIS), Md Saiful Islam (IIS), Alexander Schwirjow (IIS), Pedro Ortiz Suarez (ex. DFKI), Malte Ostendorff (ex. DFKI)
## Model-Training Team
### Core contributors
Mehdi Ali (IAIS), Michael Fromm (IAIS), Jan Ebert (FZJ), Chelsea John (FZJ), Lena Jurkschat (TUD), Alexander Weber (IAIS)
### Contributors
Richard Rutmann (IAIS), Daniel Steinigen (IAIS), Lalith Manjunath (TUD), Carolin Penke (FZJ)
## Evaluation Team
### Core contributors
Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)
### Contributors
Shima Asaadi (IIS), Fabio Barth (DFKI)
## Management
Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)
<div class="hf-card">
<h2>Contact Information</h2>
<p>You can reach out to the following model card contact:</p>
<ul>
<li>
<a href="https://huggingface.co/openGPT-X" target="_blank">OpenGPT-X</a>
- <a href="[email protected]">[email protected]</a>
</li>
</ul>
</div>