---
language:
  - de
tags:
  - german
  - causal-lm
  - text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

BübleLM SFT (WIP)

A small German LM

BübleLM is a German language model based on Gemma-2-2B, adapted using trans-tokenization with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

Model Details

  • Architecture: Based on the Gemma-2-2B decoder-only architecture
  • Parameters: 2 billion
  • Tokenizer: Custom German SentencePiece tokenizer (20k vocabulary)
    • Fertility rate: 1.78 tokens per word (see the measurement sketch below)
    • Optimized for German morphological structures
    • Trained on the same corpus as the model
  • Context Length: 8192 tokens
  • Training Hardware: Single node with 4× NVIDIA A100-SXM4-80GB GPUs
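
As a rough illustration of the fertility metric, the snippet below tokenizes a sample German sentence and reports tokens per word. The repo id is an assumption, and a single sentence only approximates the corpus-level figure of 1.78.

```python
# Sketch: estimating tokenizer fertility (tokens per word) on a sample sentence.
# The repo id below is an assumption; substitute the actual model id.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("johannhartmann/bueble-lm-2b-sft")

text = "Die Bundesnetzagentur veröffentlicht ihren Bericht zur Versorgungssicherheit."
tokens = tokenizer.tokenize(text)   # subword tokens from the German SentencePiece model
words = text.split()                # whitespace-separated words as a crude denominator
print(f"Fertility: {len(tokens) / len(words):.2f} tokens per word")
```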

Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

  • Contemporary web content (OSCAR 2015-2023)
  • Legislative documents (EurLex, ParlamInt)
  • News data (Tagesschau)
  • Wiki sources

Data sampling weights (see the sketch after this list for how they translate into a sampling distribution):

  • Wikipedia: 4x
  • News/Parliamentary: 2x
  • Other sources: 1x
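
To make the weights concrete, the sketch below shows how per-source oversampling factors turn into an effective sampling distribution. The token counts are hypothetical placeholders, not figures from the actual corpus.

```python
# Sketch: turning oversampling weights into a sampling distribution.
# Token counts per source are hypothetical, for illustration only.
weights = {"wikipedia": 4, "news_parliamentary": 2, "other": 1}
raw_tokens = {"wikipedia": 0.3e9, "news_parliamentary": 0.7e9, "other": 2.5e9}

effective = {src: raw_tokens[src] * weights[src] for src in weights}
total = sum(effective.values())
for src, tok in effective.items():
    print(f"{src}: {tok / total:.1%} of sampled tokens")
```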

Finetuning

Additional supervised finetuning was performed via LoRA using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
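
For orientation, here is a minimal LoRA setup sketch using the peft library. The base-model repo id and all hyperparameters are illustrative assumptions, not the values used for this model.

```python
# Sketch of a LoRA supervised-finetuning setup with peft.
# Repo id and hyperparameters are assumptions, not the authors' values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b")  # assumed base repo id

config = LoraConfig(
    r=16,             # rank (assumed)
    lora_alpha=32,    # scaling factor (assumed)
    lora_dropout=0.05,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice for Gemma-style attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # LoRA trains only a small fraction of the weights
```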

Performance

TBD after DPO training.

Usage
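
A minimal generation example, assuming the standard transformers causal-LM interface inherited from Gemma-2; the repo id is an assumption.

```python
# Minimal usage sketch; the repo id is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "johannhartmann/bueble-lm-2b-sft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 2B parameters fit on a single modern GPU in bf16
    device_map="auto",           # requires the accelerate package
)

prompt = "Erkläre in einem Satz, was ein Sprachmodell ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```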

Source

```bibtex
@article{delobelle2024buble,
    title={BübleLM: A small German LM},
    author={Delobelle, Pieter and Akbik, Alan and others},
    year={2024}
}
```