---
language:
- de
- bg
- cs
- da
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sl
- sv
- sk
metrics:
- accuracy
- bleu
pipeline_tag: text-generation
library_name: transformers
base_model:
- openGPT-X/Teuken-7B-base-v0.4
license: other
---
# Model Card for Teuken-7B-instruct-research-v0.4


[Teuken-7B-instruct-research-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-instruct-research-v0.4) is an instruction-tuned, 7B-parameter multilingual large language model (LLM) pre-trained on 4T tokens as part of the research project [OpenGPT-X](https://opengpt-x.de).
The base model Teuken-7B-base-v0.4 is available on request 📧 <a href="[email protected]">[email protected]</a>.


### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
- **Funded by:** German Federal Ministry for Economic Affairs and Climate Action (BMWK) in the context of the OpenGPT-X project
- **Model type:** Transformer-based decoder-only model
- **Language(s) (NLP):** bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
- **Shared by:** OpenGPT-X

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
[Teuken-7B-instruct-research-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-instruct-research-v0.4) focuses on covering all 24 EU languages. It therefore yields more stable results across these languages and better reflects European values in its answers than English-centric models, which makes it suited to multilingual tasks.
Since the underlying base model was trained on all 24 EU languages, Teuken-7B-instruct-research-v0.4 is also intended for research use in these 24 languages.

## Disclaimer: Toxic Content
 
This Large Language Model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been heavily filtered to minimize such outputs, the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

The model is not intended for use in math and coding tasks.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[Teuken-7B-instruct-research-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-instruct-research-v0.4) is an instruction-tuned version of Teuken-7B-base-v0.4 (base model is available on request 📧 <a href="[email protected]">[email protected]</a>) that is not completely free from biases and hallucinations.

## How to Get Started with the Model

## Usage
The model requires the `transformers`, `sentencepiece`, and `torch` libraries.

Because this model is instruction-tuned, it must be used with the provided prompt template. Using the model without the prompt template is not intended and not recommended. The prompt template is defined as follows:
```python
user = "Hi!"
lang_code = "DE"
system_messages = {
    "EN": "A chat between a human and an artificial intelligence assistant."
    " The assistant gives helpful and polite answers to the human's questions.",
    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}

prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:"
```
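
The template above can also be wrapped in a small function so that an unsupported language code fails early. This is only a sketch: `build_prompt` and `SYSTEM_MESSAGES` are hypothetical names that are not part of the model repository, and only the two system messages shown above are included.

```python
# Hypothetical helper, not part of the model repository: wraps the documented
# single-turn prompt template in a function.
SYSTEM_MESSAGES = {
    "EN": "A chat between a human and an artificial intelligence assistant."
    " The assistant gives helpful and polite answers to the human's questions.",
    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}


def build_prompt(user: str, lang_code: str = "DE") -> str:
    """Return the single-turn prompt string for one user message."""
    key = lang_code.upper()
    if key not in SYSTEM_MESSAGES:
        # Only the system messages listed in this sketch are known here.
        raise ValueError(f"No system message defined for language code {lang_code!r}")
    return f"System: {SYSTEM_MESSAGES[key]}\nUser: {user}\nAssistant:"


print(build_prompt("Hi!", "DE"))
```

The function deliberately covers a single user turn only, since that is the only form of the template documented here.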

The prompt template is also integrated directly into the tokenizer and can be used as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "openGPT-X/Teuken-7B-instruct-research-v0.4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

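# Role name as used by the prompt template ("User").
messages = [{"role": "User", "content": "Hallo"}]
# chat_template="DE" selects the German variant of the prompt template.
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="DE",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
# Sample one completion; max_length counts prompt and generated tokens together.
prediction = model.generate(
    prompt_ids.to(model.device),
    max_length=512,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    num_return_sequences=1,
)
# Decode the full sequence (prompt plus generated answer) back to text.
prediction_text = tokenizer.decode(prediction[0].tolist())
print(prediction_text)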
```

This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
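
For a decoder-only model, the sequence returned by `generate` starts with the prompt tokens, so the printed text repeats the prompt before the answer. A minimal sketch for keeping only the newly generated tokens follows. `continuation_ids` is a hypothetical helper that works on plain lists of token ids, and the ids in the example are toy values.

```python
from typing import List, Sequence


def continuation_ids(prompt_ids: Sequence[int], output_ids: Sequence[int]) -> List[int]:
    """Return the token ids that follow the prompt in a generated sequence."""
    prompt_len = len(prompt_ids)
    # Guard against sequences that do not begin with the prompt.
    if list(output_ids[:prompt_len]) != list(prompt_ids):
        raise ValueError("output_ids does not start with prompt_ids")
    return list(output_ids[prompt_len:])


# Toy ids for illustration only; real ids come from the tokenizer and the model.
print(continuation_ids([1, 5, 9], [1, 5, 9, 42, 7, 2]))  # prints [42, 7, 2]
```

Assuming `prompt_ids` is the 2-D tensor from the example above, the reply could be obtained with `tokenizer.decode(continuation_ids(prompt_ids[0].tolist(), prediction[0].tolist()), skip_special_tokens=True)`. Passing `max_new_tokens` instead of `max_length` to `generate` limits only the generated part.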

## Training Details

### Pre-Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[Teuken-7B-instruct-research-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-instruct-research-v0.4) was pre-trained on 4 trillion tokens of data from publicly available sources. 
The pretraining data has a cutoff of September 2023.
More information is available in our preprint ["Data Processing for the OpenGPT-X Model Family"](http://arxiv.org/abs/2410.08800).


### Instruction-Tuning Data

For the dataset composition, we used a selection of English and German datasets from which we sampled our final dataset with equal distribution between German and English, as shown in the following tables. 

### English

* We only included a subsample of the OpenOrca dataset.
* For the LMSYS-Chat dataset, we kept only examples that meet the high-quality criteria in [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://arxiv.org/abs/2309.11998), i.e., the model answer stems from "GPT-3.5-turbo", "GPT-4", "Claude-1", "Claude-instant-1", or "Claude-2" and is in English.
* To select instruction-tuning examples based on their quality, we calculated the reward scores of all English examples using [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) (Apache-2.0 license).

For the English data, we applied the following steps for sample selection:
  1. Add all multi-turn examples
  2. Add entire `code_alpaca` dataset subset
  3. Add entire `lmsys_chat_1m_high_quality_train_en` dataset subset
  4. For the remaining dataset subsets (`open_orca`, `evol_instruct_143k`, `evol_instruct_70k`, `sharegpt_v3`, `ultrachat_200k`, `bactrianx_EN`), we add the samples with the highest reward scores so that each dataset subset contributes an equal amount of high-quality examples
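
The four steps above can be sketched as follows. This is an illustration, not the authors' pipeline: the record keys (`subset`, `num_turns`, `reward`) and the `per_subset_quota` parameter are assumptions, and the text does not state how multi-turn examples count toward each subset's share.

```python
from collections import defaultdict
from typing import Dict, List

# Subsets that are kept in full according to steps 2 and 3.
KEEP_ALL = {"code_alpaca", "lmsys_chat_1m_high_quality_train_en"}


def select_samples(examples: List[Dict], per_subset_quota: int) -> List[Dict]:
    """Sketch of the selection steps.

    Each example is a dict with the hypothetical keys
    "subset" (str), "num_turns" (int) and "reward" (float).
    """
    selected: List[Dict] = []
    remaining = defaultdict(list)
    for ex in examples:
        # Steps 1-3: multi-turn examples and the two full subsets are always kept.
        if ex["num_turns"] > 1 or ex["subset"] in KEEP_ALL:
            selected.append(ex)
        else:
            remaining[ex["subset"]].append(ex)
    # Step 4: take the same number of top-reward examples from every other subset.
    for subset_examples in remaining.values():
        ranked = sorted(subset_examples, key=lambda ex: ex["reward"], reverse=True)
        selected.extend(ranked[:per_subset_quota])
    return selected


toy = [
    {"subset": "open_orca", "num_turns": 1, "reward": 0.1},
    {"subset": "open_orca", "num_turns": 1, "reward": 0.9},
    {"subset": "sharegpt_v3", "num_turns": 3, "reward": -1.0},
    {"subset": "code_alpaca", "num_turns": 1, "reward": -2.0},
]
for ex in select_samples(toy, per_subset_quota=1):
    print(ex["subset"], ex["reward"])
```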


| Dataset                                               | Sample Count |
| ----------------------------------------------------- | ------------ |
| anon8231489123/ShareGPT_Vicuna_unfiltered             | 37.6K        |
| MBZUAI/Bactrian-X                                     | 26.9K        |
| Open-Orca/OpenOrca                                    | 26.9K        |
| WizardLM/WizardLM_evol_instruct_70k                   | 26.9K        |
| WizardLM/WizardLM_evol_instruct_V2_196k               | 26.8K        |
| sahil2801/CodeAlpaca-20k                              | 12.1K        |
| lmsys/lmsys-chat-1m                                   | 11.2K        |
| HuggingFaceH4/ultrachat_200k                          | 7.0K         |
| **total**                                             | **175.5K**   |

### German

For the German data, we included the complete datasets listed in the following table:

| Dataset                                                     | Sample Count |
| ----------------------------------------------------------- | ------------ |
| MBZUAI/Bactrian-X DE                                        | 63.7K        |
| FreedomIntelligence/evol-instruct-deutsch                   | 55.9K        |
| FreedomIntelligence/alpaca-gpt4-deutsch                     | 47.5K        |
| FreedomIntelligence/sharegpt-deutsch                        | 5.8K         |
| LeoLM/German_Songs                                          | 943          |
| LeoLM/German_Poems                                          | 378          |
| bjoernp/ultrachat_de                                        | 909          |
| **total**                                                   | **175.13K**  |
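
As a quick consistency check on the stated equal split between English and German, the per-dataset counts from the two tables can be summed. The numbers below are copied from the tables in thousands. The English rows are rounded, so their sum can differ slightly from the stated total.

```python
# Sample counts in thousands, copied from the English and German tables above.
english_k = {
    "anon8231489123/ShareGPT_Vicuna_unfiltered": 37.6,
    "MBZUAI/Bactrian-X": 26.9,
    "Open-Orca/OpenOrca": 26.9,
    "WizardLM/WizardLM_evol_instruct_70k": 26.9,
    "WizardLM/WizardLM_evol_instruct_V2_196k": 26.8,
    "sahil2801/CodeAlpaca-20k": 12.1,
    "lmsys/lmsys-chat-1m": 11.2,
    "HuggingFaceH4/ultrachat_200k": 7.0,
}
german_k = {
    "MBZUAI/Bactrian-X DE": 63.7,
    "FreedomIntelligence/evol-instruct-deutsch": 55.9,
    "FreedomIntelligence/alpaca-gpt4-deutsch": 47.5,
    "FreedomIntelligence/sharegpt-deutsch": 5.8,
    "LeoLM/German_Songs": 0.943,
    "LeoLM/German_Poems": 0.378,
    "bjoernp/ultrachat_de": 0.909,
}

english_total = sum(english_k.values())
german_total = sum(german_k.values())
german_share = german_total / (english_total + german_total)

print(f"English: {english_total:.1f}K, German: {german_total:.2f}K")
print(f"German share of all samples: {german_share:.1%}")
```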


### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Teuken-7B-instruct-research-v0.4 is an instruction fine-tuned version of Teuken-7B-base-v0.4.

More information regarding the pre-training is available in our model preprint ["Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs"](https://arxiv.org/abs/2410.03730).

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, , bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Results on multilingual benchmarks for 21 European languages with instruction-tuned models:

| Model                                | Avg.     | EU21-ARC | EU21-HeSw | EU21-TQA | EU21-MMLU |
|--------------------------------------|----------|----------|-----------|----------|-----------|
| Meta-Llama-3.1-8B-Instruct           | **.563** | .563     | .579      | .532     | **.576**  |
| Mistral-7B-Instruct-v0.3             | .527     | .530     | .538      | **.548** | .491      |
| Salamandra-7B-Instruct               | .543     | **.595** | **.637**  | .482     | .459      |
| Aya-23-8B                            | .485     | .475     | .535      | .476     | .455      |
| Occiglot-7B-eu5-Instruct             | .475     | .484     | .519      | .471     | .428      |
| Pharia-1-LLM-7B-C-A                  | .417     | .396     | .438      | .469     | .366      |
| Bloomz-7B1                           | .358     | .316     | .354      | .461     | .302      |
| **Teuken-7B-instruct-research-v0.4** | .543     | .581     | .624      | .543     | .425      |

More information regarding the quality of our translated benchmarks is available in our evaluation preprint ["Towards Multilingual LLM Evaluation for European Languages"](https://arxiv.org/abs/2410.08928).
More evaluation results for Teuken-7B-instruct-research-v0.4 are available in our model preprint ["Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs"](https://arxiv.org/abs/2410.03730).

The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can also be seen in the [European LLM Leaderboard](https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
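
The Avg. column in the results table above appears to be the unweighted mean of the four benchmark columns. This is an observation from the numbers, not a statement from the authors. The sketch below recomputes the mean from the rounded table values, so small deviations in the last digit are expected.

```python
# Scores copied from the results table: (reported Avg., ARC, HeSw, TQA, MMLU).
scores = {
    "Meta-Llama-3.1-8B-Instruct": (0.563, 0.563, 0.579, 0.532, 0.576),
    "Mistral-7B-Instruct-v0.3": (0.527, 0.530, 0.538, 0.548, 0.491),
    "Salamandra-7B-Instruct": (0.543, 0.595, 0.637, 0.482, 0.459),
    "Aya-23-8B": (0.485, 0.475, 0.535, 0.476, 0.455),
    "Occiglot-7B-eu5-Instruct": (0.475, 0.484, 0.519, 0.471, 0.428),
    "Pharia-1-LLM-7B-C-A": (0.417, 0.396, 0.438, 0.469, 0.366),
    "Bloomz-7B1": (0.358, 0.316, 0.354, 0.461, 0.302),
    "Teuken-7B-instruct-research-v0.4": (0.543, 0.581, 0.624, 0.543, 0.425),
}


def mean_score(benchmarks):
    """Unweighted mean over the benchmark scores of one model."""
    return sum(benchmarks) / len(benchmarks)


for model, (reported, *benchmarks) in scores.items():
    computed = mean_score(benchmarks)
    print(f"{model:34s} reported={reported:.3f} computed={computed:.4f}")
```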

## Technical Specifications

### Model Architecture and Objective

| Hyper-Parameter            | Value    |
|----------------------------|----------|
| Training Objective         | CLM      |
| Activation Function        | SwiGLU   |
| Seq Length                 | 4096     |
| Position Embeddings        | Rotary   |
| Num Layers                 | 32       |
| Hidden Size                | 4096     |
| FFN Hidden Size            | 13440    |
| Num Attention Heads        | 32       |
| Head Dim                   | 128      |
| Group Query Attention      | yes      |
| Num Query Groups           | 2        |
| Normalization              | RMSNorm  |
| Learning rate              | 3e-4     |
| Min learning rate          | 3e-5     |
| Disable bias in linear     | yes      |
| Hidden dropout             | 0.0      |
| Attention dropout          | 0.0      |
| Optimizer                  | AdamW    |
| Beta1                      | 0.9      |
| Beta2                      | 0.95     |
| Data-type                  | bf16     |
| Recompute-activations      | yes      |
| Distributed-optimizers     | yes      |
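
The table values are enough for a rough size estimate. The sketch below counts the weights in the transformer blocks and the key/value cache for one full-length sequence. It rests on several assumptions that the table does not state: "Num Query Groups" is read as the number of key/value heads, SwiGLU is taken to use three weight matrices, and each block is taken to have two RMSNorm weight vectors. Embedding and output matrices are left out because the vocabulary size is not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeukenConfig:
    # Values copied from the hyper-parameter table.
    num_layers: int = 32
    hidden_size: int = 4096
    ffn_hidden_size: int = 13440
    num_attention_heads: int = 32
    head_dim: int = 128
    num_query_groups: int = 2  # assumed to equal the number of key/value heads
    seq_length: int = 4096


def non_embedding_params(cfg: TeukenConfig) -> int:
    """Weights in the transformer blocks, with bias-free linear layers."""
    q_out = cfg.num_attention_heads * cfg.head_dim
    kv_out = cfg.num_query_groups * cfg.head_dim
    # Query and output projections plus the smaller key and value projections.
    attention = 2 * cfg.hidden_size * q_out + 2 * cfg.hidden_size * kv_out
    # SwiGLU assumed to use three matrices: gate, up and down projection.
    mlp = 3 * cfg.hidden_size * cfg.ffn_hidden_size
    # Two RMSNorm weight vectors per block (assumed placement).
    norms = 2 * cfg.hidden_size
    return cfg.num_layers * (attention + mlp + norms)


def kv_cache_bytes(cfg: TeukenConfig, bytes_per_value: int = 2) -> int:
    """Key/value cache for one sequence of full length (bf16 uses 2 bytes)."""
    per_token = 2 * cfg.num_layers * cfg.num_query_groups * cfg.head_dim
    return per_token * cfg.seq_length * bytes_per_value


cfg = TeukenConfig()
print(f"non-embedding parameters: {non_embedding_params(cfg) / 1e9:.2f}B")
print(f"KV cache at {cfg.seq_length} tokens: {kv_cache_bytes(cfg) / 2**20:.0f} MiB")
```

Under these assumptions the blocks hold about 6.4 billion weights, which is consistent with a 7B model once the embedding matrices are added. The small number of query groups keeps the cache for a full 4096-token sequence at 128 MiB in bf16.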

### Compute Infrastructure

We trained our models on JUWELS Booster, which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.

#### Hardware

The configuration of the JUWELS Booster compute nodes is as follows:

- **CPU:** AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
- **Memory:** 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
- **GPU:** 4 × NVIDIA A100 Tensor Core GPU with 40 GB; connected via NVLink3 to each other
- **Network:** 4 × Mellanox HDR200 InfiniBand ConnectX 6 (200 Gbit/s each), HCA
- **Periphery:** CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). PCIe switches are configured in synthetic mode.

#### Software

[Megatron-LM](https://github.com/OpenGPTX/Megatron-LM)

**BibTeX:**

If you find our model useful in your research, please consider citing our [preprint](https://arxiv.org/abs/2410.03730):
```

@misc{ali2024teuken7bbaseteuken7binstructeuropean,
      title={Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs}, 
      author={Mehdi Ali and Michael Fromm and Klaudia Thellmann and Jan Ebert and Alexander Arno Weber and Richard Rutmann and Charvi Jain and Max Lübbering and Daniel Steinigen and Johannes Leveling and Katrin Klug and Jasper Schulze Buschhoff and Lena Jurkschat and Hammam Abdelwahab and Benny Jörg Stein and Karl-Heinz Sylla and Pavel Denisov and Nicolo' Brandizzi and Qasid Saleem and Anirban Bhowmick and Lennard Helmer and Chelsea John and Pedro Ortiz Suarez and Malte Ostendorff and Alex Jude and Lalith Manjunath and Samuel Weinbach and Carolin Penke and Oleg Filatov and Shima Asaadi and Fabio Barth and Rafet Sifa and Fabian Küch and Andreas Herten and René Jäkel and Georg Rehm and Stefan Kesselheim and Joachim Köhler and Nicolas Flores-Herr},
      year={2024},
      eprint={2410.03730},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03730}, 
}
```

# Team
## Data Team
Anirban Bhowmick (IAIS), Nicolo Brandizzi (IAIS), Lennard Helmer (IAIS), Benny Jörg Stein (IAIS), Karl-Heinz Sylla (IAIS), Pavel Denisov (IAIS), Qasid Saleem (IAIS), Johannes Leveling (IAIS), Hammam Abdelwahab (IAIS), Luzian Hahn (IIS), Farzad Naderi (IIS), Md Saiful Islam (IIS), Alexander Schwirjow (IIS), Pedro Ortiz Suarez (ex. DFKI), Malte Ostendorff (ex. DFKI)
## Model-Training Team
### Core contributors
Mehdi Ali (IAIS), Michael Fromm (IAIS), Jan Ebert (FZJ), Chelsea John (FZJ), Lena Jurkschat (TUD), Alexander Weber (IAIS)
### Contributors
Richard Rutmann (IAIS), Daniel Steinigen (IAIS), Lalith Manjunath (TUD), Carolin Penke (FZJ)
## Evaluation Team
### Core contributors
Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)
### Contributors
Shima Asaadi (IIS), Fabio Barth (DFKI)
## Management
Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)

We believe that collaboration is key to overcoming the aforementioned limitations and thereby strengthening the European GenAI landscape. The team therefore invites researchers, developers, and AI enthusiasts to join and engage through various platforms. A Discord server has been created for community collaboration, offering a space for discussions on technical details, ideas, and direct interaction with developers. Additionally, resources such as research publications and the European LLM Leaderboard provide insights into Teuken-7B’s performance and technical aspects. The OpenGPT-X team encourages ongoing engagement and collaboration as the project evolves.

Key links:
- Discord: [OpenGPT-X Discord server](https://discord.com/invite/RvdHpGMvB3)
- Research papers: [OpenGPT-X News](https://opengpt-x.de/en/news-en/)
- LLM leaderboard: [European LLM Leaderboard](https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard)

<div class="hf-card">
    <h2>Contact Information</h2>
    <p>You can reach out to the following model card contact:</p>
    <ul>
        <li>
            <a href="https://huggingface.co/openGPT-X" target="_blank">OpenGPT-X</a> 
            - <a href="[email protected]">[email protected]</a>
        </li>
    </ul>
</div>