---
language:
  - de
  - bg
  - cs
  - da
  - el
  - en
  - es
  - et
  - fi
  - fr
  - ga
  - hr
  - hu
  - it
  - lt
  - lv
  - mt
  - nl
  - pl
  - pt
  - ro
  - sl
  - sv
  - sk
metrics:
  - accuracy
  - bleu
pipeline_tag: text-generation
library_name: transformers
base_model:
  - openGPT-X/Teuken-7B-base-v0.4
license: apache-2.0
---

Model Card for Teuken-7B-instruct-v0.4

Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4.

Model Description

  • Developed by: Fraunhofer IAIS
  • Funded by: German Federal Ministry of Economics and Climate Protection (BMWK) in the context of the OpenGPT-X project
  • Model type: Transformer-based decoder-only model
  • Language(s) (NLP): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
  • Shared by: Fraunhofer IAIS

Uses

Teuken-7B-instruct-v0.4 is intended for commercial and research use in all 24 official European languages. Because it focuses on covering all 24 EU languages, it delivers more stable results across these languages and reflects European values in its answers better than English-centric models. It is therefore specialized for multilingual tasks.

Out-of-Scope Use

The model is not intended for use in math and coding tasks.

Bias, Risks, and Limitations

Teuken-7B-instruct-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4 that is not completely free from biases and hallucinations.

How to Get Started with the Model

Usage

The model requires the transformers, sentencepiece, and torch libraries. After installing them, here is an example of how to use the model:

The prompt template for the fine-tuned model is defined as follows:

user="Hi!"
lang_code = "DE"
system_messages={
            "EN": "A chat between a human and an artificial intelligence assistant."
            " The assistant gives helpful and polite answers to the human's questions.",
            "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
            " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
        }
 
prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:<s>"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "openGPT-X/Teuken-7B-instruct-v0.4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = model.to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

# Apply the model's German chat template and generate a sampled response
messages = [{"role": "User", "content": "Wer bist du?"}]
prompt_ids = tokenizer.apply_chat_template(messages, chat_template="DE", tokenize=True, add_generation_prompt=True, return_tensors="pt")
prediction = model.generate(
    prompt_ids.to(model.device),
    max_length=512,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    num_return_sequences=1,
)
prediction_text = tokenizer.decode(prediction[0])
print(prediction_text)

This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
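
Alternatively, the already loaded model and tokenizer can be wrapped in a transformers text-generation pipeline together with the manually assembled prompt from the template above. This is a minimal sketch under those assumptions, not an official example from the model authors:

from transformers import pipeline

# Reuses `model`, `tokenizer`, and the `prompt` string defined in the snippets above.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = generator(prompt, max_new_tokens=256, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)
print(output[0]["generated_text"])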

Training Details

Training Data

To compose the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly the same number of English examples as German examples, selected as follows (a sketch of this selection follows the list):

  1. Add all multi-turn examples
  2. Add the entire code_alpaca dataset subset
  3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
  4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal number of high-quality examples
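
The following is a hypothetical sketch of this selection procedure; the field names (subset, multi_turn, quality_score) and the helper itself are illustrative, not the actual OpenGPT-X data pipeline:

from collections import defaultdict

def select_english_examples(examples, num_german_examples):
    """Illustrative sketch of the English-side selection for the "Honey" dataset."""
    full_subsets = {"code_alpaca", "lmsys_chat_1m_high_quality_train_en"}
    ranked_subsets = ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN")

    # Steps 1-3: all multi-turn examples plus the two subsets that are used in full.
    selected = [ex for ex in examples if ex["multi_turn"] or ex["subset"] in full_subsets]

    # Step 4: fill up to the German total, drawing equally from the remaining subsets,
    # highest reward ("quality") score first.
    remaining = max(num_german_examples - len(selected), 0)
    per_subset = remaining // len(ranked_subsets)
    by_subset = defaultdict(list)
    for ex in examples:
        if ex["subset"] in ranked_subsets and not ex["multi_turn"]:
            by_subset[ex["subset"]].append(ex)
    for subset in ranked_subsets:
        ranked = sorted(by_subset[subset], key=lambda ex: ex["quality_score"], reverse=True)
        selected.extend(ranked[:per_subset])
    return selected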

Dataset Sizes Before Composition

(tables of English and German dataset subset sizes)

Training Procedure

Instruction-tuned version of Teuken-7B-base-v0.4.

Training Hyperparameters

  • Training regime: bf16 mixed precision

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation, and MMLU. Results can be seen on the European LLM Leaderboard (https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).

Technical Specifications

Model Architecture and Objective

| Hyper-Parameter | Value |
|---|---|
| Training Objective | CLM |
| Activation Function | SwiGLU |
| Seq Length | 4096 |
| Position Embeddings | Rotary |
| Num Layers | 32 |
| Hidden Size | 4096 |
| FFN Hidden Size | 13440 |
| Num Attention Heads | 32 |
| Head Dim | 128 |
| Group Query Attention | yes |
| Num Query Groups | 2 |
| Normalization | RMSNorm |
| Learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Disable bias in linear | yes |
| Hidden dropout | 0.0 |
| Attention dropout | 0.0 |
| Optimizer | AdamW |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
| Sequence-parallelism | |
| Data-type | bf16 |
| Recompute-activations | yes |
| Distributed-optimizers | yes |
| Model Initialization | |
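
To make the grouped-query attention layout concrete, here is a small illustrative check (not code from the model repository) of the shapes implied by the table above:

# Illustrative only: checks the attention shapes implied by the hyper-parameters.
hidden_size = 4096
num_attention_heads = 32
head_dim = 128
num_query_groups = 2  # grouped-query attention with 2 key/value heads

assert num_attention_heads * head_dim == hidden_size            # 32 * 128 = 4096
queries_per_kv_head = num_attention_heads // num_query_groups   # 16 query heads share each KV head
kv_width = num_query_groups * head_dim                          # 256-dim key/value projections per layer
print(queries_per_kv_head, kv_width)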

BibTeX:

TODO

APA:

TODO

Model Card Contact

Contact Information

You can reach out to the following model card contact: