---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM SFT WIP

<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
<h1 style="margin-top: 1rem;">BübleLM</h1>
<p><em>A small German LM</em></p>
</div>

BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

This is an experimental version that has received supervised finetuning on several German datasets. A DPO version will follow soon.

## Model Details

- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the measurement sketch after this list)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
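
Fertility (average tokens per word) can be estimated with a few lines of 🤗 Transformers code. The sketch below assumes the repository ID `flair/bueble-lm-2b`, which is not confirmed by this card; substitute the actual model ID.

```python
# Sketch: estimating tokenizer fertility (tokens per word) on sample German
# text. The repository ID is an assumption; replace it if it differs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

sample = "Die Forschenden untersuchten die Auswirkungen der neuen Rechtsvorschriften."
n_words = len(sample.split())
n_tokens = len(tokenizer.tokenize(sample))
print(f"{n_tokens / n_words:.2f} tokens per word")
```

A real estimate would average over a large corpus rather than a single sentence.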

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlaMint)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (a mixing sketch follows the list):

- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
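
One way to realize such weights is to normalize them into sampling probabilities and interleave the sources, for example with 🤗 Datasets. The toy datasets below are placeholders for the real corpora, and `interleave_datasets` is only one possible mixing mechanism, not necessarily the one used here.

```python
# Sketch: applying 4x / 2x / 1x sampling weights by interleaving sources.
# The toy datasets are placeholders for the actual corpora.
from datasets import Dataset, interleave_datasets

wiki_ds = Dataset.from_dict({"text": [f"wiki {i}" for i in range(100)]})
news_ds = Dataset.from_dict({"text": [f"news {i}" for i in range(100)]})
other_ds = Dataset.from_dict({"text": [f"web {i}" for i in range(100)]})

# Normalize the weights 4 : 2 : 1 into sampling probabilities.
weights = [4, 2, 1]
probs = [w / sum(weights) for w in weights]

mixed = interleave_datasets(
    [wiki_ds, news_ds, other_ds],
    probabilities=probs,
    seed=42,
    stopping_strategy="all_exhausted",
)
```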

## Finetuning

Additional supervised finetuning via LoRA was performed on German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
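
For orientation, a LoRA adapter for a causal LM can be configured with the PEFT library roughly as below; the rank, dropout, and target modules are illustrative assumptions, not the settings used for this model.

```python
# Sketch of a LoRA setup with PEFT; hyperparameters are illustrative
# assumptions, not the values used to train this model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b")  # assumed ID
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```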

## Performance

TBD after DPO training.

## Usage
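
A minimal generation sketch with 🤗 Transformers follows; the repository ID is assumed and the sampling settings are arbitrary defaults.

```python
# Minimal text-generation sketch. The repository ID is an assumption;
# substitute the actual model ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "flair/bueble-lm-2b"  # assumed ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Erkläre kurz, was ein Sprachmodell ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```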

## Citation

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```