---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---
# BübleLM (SFT WIP)

*A small German LM*

BübleLM is a German language model based on Gemma-2-2B, adapted using trans-tokenization with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.
## Model Details

- Architecture: Based on the Gemma-2-2B decoder-only architecture
- Parameters: 2 billion
- Tokenizer: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch after this list)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- Context Length: 8192 tokens
- Training Hardware: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
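The fertility figure above can be checked with a short script like the one below. This is only a minimal sketch: the repo id `flair/bueble-lm-2b` is a placeholder assumption, and the exact value depends on the reference corpus used to measure fertility.

```python
from transformers import AutoTokenizer

# Placeholder repo id; replace with the actual BübleLM repository.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

text = "Die Bundesregierung hat heute ein neues Gesetz zur Digitalisierung beschlossen."
words = text.split()
tokens = tokenizer.tokenize(text)

# Fertility = number of subword tokens per whitespace-separated word.
print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f} tokens per word")
```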
## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources
Data sampling weights (a short weighting sketch follows this list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
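As a rough illustration of how these upsampling factors translate into an effective mixture, the snippet below computes per-source sampling proportions. The per-source token counts are made-up placeholders; only the 4x/2x/1x weights come from the list above.

```python
# Hypothetical per-source token counts (placeholders, not the real corpus statistics).
source_tokens = {"wikipedia": 0.2e9, "news_parliamentary": 0.8e9, "web_oscar": 2.5e9}

# Upsampling factors from the list above.
weights = {"wikipedia": 4, "news_parliamentary": 2, "web_oscar": 1}

weighted = {name: tokens * weights[name] for name, tokens in source_tokens.items()}
total = sum(weighted.values())

for name, w in weighted.items():
    print(f"{name}: {w / total:.1%} of sampled tokens")
```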
## Finetuning

Additional supervised finetuning was performed with LoRA on German translations of the alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes and synthia datasets. A minimal LoRA setup is sketched below.
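The card does not specify the exact LoRA hyperparameters, so the following sketch uses the `peft` library with rank, alpha and target modules chosen as illustrative assumptions rather than the values actually used.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder repo id; replace with the actual BübleLM repository.
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b")

# Illustrative LoRA settings; the real finetuning run may have used different values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```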
## Performance

TBD; evaluation results will be added after DPO training.
## Usage
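A minimal text-generation example with `transformers` is sketched below. The repo id `flair/bueble-lm-2b` is assumed as a placeholder and should be replaced with the actual repository name.

```python
from transformers import pipeline

# Placeholder repo id; replace with the actual BübleLM repository.
generator = pipeline("text-generation", model="flair/bueble-lm-2b", device_map="auto")

prompt = "Die Hauptstadt von Deutschland ist"
output = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```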
## Source

@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}