---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM SFT WIP


<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
    <img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
    <h1 style="margin-top: 1rem;">BübleLM</h1>
    <p><em>A small German LM</em></p>
</div>

BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

This is an experimental version that has received supervised fine-tuning on several German datasets.
A DPO version will follow soon.

## Model Details

- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch below)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
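
The fertility rate can be checked approximately with a few lines of Python. This is a minimal sketch: the path is a placeholder for this repository, and the single sentence is only illustrative; the 1.78 figure above was measured on a full evaluation corpus.

```python
from transformers import AutoTokenizer

# Placeholder path; point this at the BübleLM repository/checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/bueble-lm")

# Illustrative fertility check: tokens per whitespace-separated word.
text = "Die Bundesnetzagentur veröffentlichte gestern einen ausführlichen Bericht."
words = text.split()
tokens = tokenizer.tokenize(text)
print(f"fertility ≈ {len(tokens) / len(words):.2f} tokens/word")
```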

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights:
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
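
The exact data pipeline is not included here; the snippet below is only a sketch of how such source weights could be applied with `datasets.interleave_datasets` (the file names are placeholders, not the actual Occiglot-FineWeb splits).

```python
from datasets import load_dataset, interleave_datasets

# Placeholder files standing in for the Wikipedia, news/parliamentary and web subsets.
wiki = load_dataset("json", data_files="wiki_de.jsonl", split="train")
news = load_dataset("json", data_files="news_parl_de.jsonl", split="train")
web = load_dataset("json", data_files="web_de.jsonl", split="train")

# 4x / 2x / 1x sampling weights from the list above, normalised to probabilities.
weights = [4, 2, 1]
probabilities = [w / sum(weights) for w in weights]

mixture = interleave_datasets([wiki, news, web], probabilities=probabilities, seed=42)
```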


## Finetuning
Additional supervised fine-tuning via LoRA was performed using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
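
The exact LoRA hyperparameters are not listed here. As a rough sketch of such a setup with PEFT (all values below are assumptions for illustration, not the settings used for this checkpoint):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder path; load the trans-tokenized base checkpoint.
base_model = AutoModelForCausalLM.from_pretrained("path/to/bueble-lm")

# Hypothetical adapter configuration for illustration only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```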

## Performance
TBD after DPO training.

## Usage
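
A minimal text-generation example with 🤗 Transformers. The repository id below is a placeholder; replace it with this model's actual Hub id, and adjust the prompt format if a chat template is provided with the tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/bueble-lm-sft"  # placeholder; use this repository's Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Schreibe ein kurzes Gedicht über Berlin."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```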

## Source 

```bibtex
@article{delobelle2024buble,
    title={BübleLM: A small German LM},
    author={Delobelle, Pieter and Akbik, Alan and others},
    year={2024}
}
```