johannhartmann committed · verified
Commit 01e69fa · 1 Parent(s): 0c1bc02

Upload README.md

Files changed (1):
  1. README.md +65 -0

README.md ADDED
---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM SFT WIP

<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
<h1 style="margin-top: 1rem;">BübleLM</h1>
<p><em>A small German LM</em></p>
</div>

BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

## Model Details

- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch below)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
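
The fertility figure above can be checked approximately with the tokenizer alone. The snippet below is a minimal sketch, not part of any official evaluation: the model ID is a placeholder for this repository, and the single example sentence only illustrates the tokens-per-word computation.

```python
from transformers import AutoTokenizer

# Placeholder: replace with this repository's Hugging Face Hub ID.
model_id = "<this-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fertility = tokens per whitespace-separated word; the card reports ~1.78 on its corpus.
text = "Die Bundesregierung hat heute ein neues Gesetz zur Digitalisierung beschlossen."
tokens = tokenizer.tokenize(text)
words = text.split()
print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f} tokens per word")
```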

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (illustrated in the sketch below):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
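
A minimal sketch of what these weights could mean in practice: each source is drawn in proportion to its weight when assembling the training mixture. The source names, document lists, and pool-based sampling below are illustrative assumptions, not the actual data pipeline.

```python
import random

# Illustrative sketch only, not the actual training pipeline: a weight of 4
# makes a source four times as likely to be drawn as a weight-1 source.
sampling_weights = {"wikipedia": 4, "news_parliamentary": 2, "other": 1}

# Hypothetical per-source document streams.
sources = {
    "wikipedia": ["wiki_doc_1", "wiki_doc_2"],
    "news_parliamentary": ["news_doc_1", "news_doc_2"],
    "other": ["web_doc_1", "web_doc_2"],
}

# Expand the weights into a pool and sample documents from it.
pool = [name for name, weight in sampling_weights.items() for _ in range(weight)]
batch = [random.choice(sources[random.choice(pool)]) for _ in range(8)]
print(batch)
```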

## Finetuning

Additional supervised finetuning via LoRA was done using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
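
The card does not specify the LoRA configuration, so the snippet below is only a hedged sketch of a typical `peft` LoRA setup for a Gemma-2-style model; the base-model placeholder, rank, alpha, dropout, and target modules are assumptions rather than the values actually used.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder: replace with the base BübleLM checkpoint to fine-tune.
model = AutoModelForCausalLM.from_pretrained("<base-model-id>")

# Assumed hyperparameters for illustration; the actual SFT run may have differed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Only the LoRA adapter weights are trainable.
```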

## Performance

TBD after DPO training.

## Usage
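
A minimal text-generation example with 🤗 Transformers is sketched below. The model ID is a placeholder for this repository, and the sampling settings and German prompt are illustrative only; any prompt template used during SFT is not documented here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this repository's Hugging Face Hub ID.
model_id = "<this-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Erkläre in zwei Sätzen, warum der Himmel blau ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```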

## Citation

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```