language:
- de
- bg
- cs
- da
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sl
- sv
- sk
metrics:
- accuracy
- bleu
pipeline_tag: text-generation
library_name: transformers
base_model:
- openGPT-X/Teuken-7B-base-v0.4
license: apache-2.0
Model Card for Teuken-7B-instruct-v0.4
Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4.
Model Description
- Developed by: Fraunhofer IAIS
- Funded by: German Federal Ministry of Economics and Climate Protection (BMWK) in the context of the OpenGPT-X project
- Model type: Transformer based decoder-only model
- Language(s) (NLP): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
- Shared by: Fraunhofer IAIS
Uses
Teuken-7B-instruct-v0.4 is intended for commercial and research use in all official 24 European languages. Since Teuken-7B-chat-v0.4 focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.
Out-of-Scope Use
The model is not intended for use in math and coding tasks.
Bias, Risks, and Limitations
Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4 that is not completely free from biases and hallucinations.
How to Get Started with the Model
Usage
The model requires transformers, sentencepiece, and the torch library. After installation, here's an example of how to use the model:
The prompt template for the fine-tuned model is defined as follows:
user="Hi!"
lang_code = "DE"
system_messages={
"EN": "A chat between a human and an artificial intelligence assistant."
" The assistant gives helpful and polite answers to the human's questions.",
"DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
" Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}
prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:<s>"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "openGPT-X/Teuken-7B-instruct-v0.4"
model = AutoModelForCausalLM.from_pretrained(
model_name,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
model = model.to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
model_name,
use_fast=False,
trust_remote_code=True,
)
messages = [{"role": "User", "content": "Wer bist du?"}]
prompt_ids = tokenizer.apply_chat_template(messages, chat_template="DE", tokenize=True, add_generation_prompt=True, return_tensors="pt")
prediction = model.generate(
prompt_ids.to(model.device),
max_length=512,
do_sample=True,
top_k=50,
top_p=0.95,
temperature=0.7,
num_return_sequences=1,
)
prediction_text = tokenizer.decode(prediction[0])
print(prediction_text)
This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
Training Details
Training Data
For composing the final instruction-tuning dataset termed "Honey", we first include all German examples. We aim to include roughly the same amount of English examples, as we have German examples:
- Add all multi-turn examples
- Add the entire code_alpaca dataset subset
- Add entire lmsys_chat_1m_high_quality_train_en dataset subset
- For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN") add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal amount of high-quality examples
Dataset Sizes Before Composition
English
German
Training Procedure
Instruction fined tuned version of Teuken-7B-base-v0.4.
Training Hyperparameters
- Training regime: bf16 mixed precision
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can be seen in the European LLM Leaderboard (https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
Technical Specifications
Model Architecture and Objective
Hyper-Parameter | Value |
---|---|
Training Objective | CLM |
Activation Function | SwiGLU |
Seq Length | 4096 |
Position Embeddings | Rotary |
Num Layers | 32 |
Hidden Size | 4096 |
FFN Hidden Size | 13440 |
Num Attention Heads | 32 |
Head Dim | 128 |
Group Query Attention | yes |
Num Query Groups | 2 |
Normalization | RMSNorm |
Learning rate | 3e-4 |
Min learning rate | 3e-5 |
Disable bias in linear | yes |
Hidden dropout | 0.0 |
Attention dropout | 0.0 |
Optimizer | AdamW |
Beta1 | 0.9 |
Beta2 | 0.95 |
Sequence-parallelism | |
Data-type | bf16 |
Recompute-activations | yes |
Distributed-optimizers | yes |
Model Initialization |
BibTeX:
If you find our model useful in your research, please consider citing our preprint:
@misc{ali2024teuken7bbaseteuken7binstructeuropean,
title={Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs},
author={Mehdi Ali and Michael Fromm and Klaudia Thellmann and Jan Ebert and Alexander Arno Weber and Richard Rutmann and Charvi Jain and Max Lübbering and Daniel Steinigen and Johannes Leveling and Katrin Klug and Jasper Schulze Buschhoff and Lena Jurkschat and Hammam Abdelwahab and Benny Jörg Stein and Karl-Heinz Sylla and Pavel Denisov and Nicolo' Brandizzi and Qasid Saleem and Anirban Bhowmick and Lennard Helmer and Chelsea John and Pedro Ortiz Suarez and Malte Ostendorff and Alex Jude and Lalith Manjunath and Samuel Weinbach and Carolin Penke and Oleg Filatov and Shima Asaadi and Fabio Barth and Rafet Sifa and Fabian Küch and Andreas Herten and René Jäkel and Georg Rehm and Stefan Kesselheim and Joachim Köhler and Nicolas Flores-Herr},
year={2024},
eprint={2410.03730},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.03730},
}
Team
Data Team
Anirban Bhowmick (IAIS), Nicolo Brandizzi (IAIS), Lennard Helmer (IAIS), Benny Jörg Stein (IAIS), Karl-Heinz Sylla (IAIS), Pavel Denisov (IAIS), Qasid Saleem (IAIS), Johannes Leveling (IAIS), Hammam Abdelwahab (IAIS), Luzian Hahn (IIS), Farzad Naderi (IIS), Md Saiful Islam (IIS), Alexander Schwirjow (IIS), Pedro Ortiz Suarez (ex. DFKI), Malte Ostendorff (ex. DFKI)
Model-Training Team
Core contributors
Mehdi Ali (IAIS), Michael Fromm (IAIS), Jan Ebert (FZJ), Chelsea John (FZJ), Lena Jurkschat (TUD), Alexander Weber (IAIS)
Contributors:
Richard Rutmann (IAIS), Daniel Steinigen (IAIS), Lalith Manjunath (TUD), Carolin Penke (FZJ)
Evaluation Team
Core contributors
Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)
Contributors:
Shima Assadi (IIS), Fabio Barth (DFKI)
Management
Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)
Model Card Contact
Contact Information
You can reach out to the following model card contact: