---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM SFT WIP

<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
<h1 style="margin-top: 1rem;">BübleLM</h1>
<p><em>A small German LM</em></p>
</div>

BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

## Model Details

- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
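
Fertility here means the average number of tokens the tokenizer produces per word. As a rough illustration (not the original evaluation script; the sample sentence, the whitespace word split, and the model id are assumptions), it can be estimated like this:

```python
# Rough fertility estimate: average tokens per whitespace-separated word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/bueble-lm")  # placeholder model id

text = "Die Bundesregierung hat heute umfangreiche Gesetzesänderungen beschlossen."
words = text.split()
tokens = tokenizer.tokenize(text)
print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f} tokens/word")
```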

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights:
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
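
Expressed as probabilities, such weights could be realized with `datasets.interleave_datasets`. The sketch below only illustrates that mixing scheme; the dataset identifiers are placeholders, not the actual training pipeline:

```python
# Illustrative corpus mixing; dataset ids other than German Wikipedia are placeholders.
from datasets import interleave_datasets, load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.de", split="train", streaming=True)  # weight 4
news = load_dataset("path/to/news-corpus", split="train", streaming=True)                # weight 2
web = load_dataset("path/to/web-corpus", split="train", streaming=True)                  # weight 1

weights = [4, 2, 1]
probabilities = [w / sum(weights) for w in weights]
mixed = interleave_datasets([wiki, news, web], probabilities=probabilities, seed=42)
```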

## Finetuning

Additional supervised finetuning was done via LoRA on German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
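
In `peft` terms, that stage might look roughly like the following; the rank, alpha, and target modules are assumed values, not the configuration actually used:

```python
# Illustrative LoRA setup with peft; hyperparameters and model id are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/bueble-lm-base")  # placeholder id

lora_config = LoraConfig(
    r=16,                                                     # assumed rank
    lora_alpha=32,                                            # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```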

## Performance

To be determined after DPO training.

## Usage
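
A minimal text-generation sketch using the standard `transformers` API; the model id is a placeholder until the checkpoint is published, and the sampling settings are arbitrary:

```python
# Minimal generation example; model id and sampling parameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/bueble-lm-sft"  # placeholder: replace with the published checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Erkläre in einem Satz, was ein Sprachmodell ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```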

## Source

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```