|
--- |
|
license: cc-by-sa-3.0 |
|
language: |
|
- de |
|
--- |
|
|
|
# xLSTM Model trained on German Wikipedia |
|
|
|
Research & development of an xLSTM model trained on German Wikipedia. |
|
|
|
The Flair team is currently working on the integration of xLSTM, covering both language model training and fine-tuning of models for downstream tasks.
|
|
|
For pretraining this xLSTM model, we use this [fork](https://github.com/HallerPatrick/helibrunna) (from [Patrick Haller](https://huggingface.co/PatrickHaller)) of the awesome [Helibrunna](https://github.com/AI-Guru/helibrunna) library from [Tristan](https://huggingface.co/TristanBehrens).
|
|
|
Initially, we integrated xLSTM model training into Flair - for more information about this, please refer to the archived [flair-old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch of this repository. |
|
|
|
# Changelog |
|
|
|
- 28.08.2024: Model training is now done with a fork of [Helibrunna](https://github.com/AI-Guru/helibrunna) - find it [here](https://github.com/HallerPatrick/helibrunna).
|
- 10.06.2024: Initial version. The xLSTM model was trained with the Flair library; see the archived [flair-old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
|
|
|
# Training |
|
|
|
The current model was trained with commit `f66cc55` from the [`main` branch](https://github.com/HallerPatrick/helibrunna) of the forked Helibrunna repo. |
|
|
|
The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually - also make sure that Ninja is installed (`pip3 install Ninja`).
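The following is a minimal sketch (not part of the training code) for verifying that both prerequisites are importable before starting a run - it assumes that the packages expose the module names `xlstm` and `ninja`:

```python
import importlib.util

# Check that the manually installed xlstm library and the Ninja build tool
# (pip3 install Ninja) are available in the current environment.
for module in ("xlstm", "ninja"):
    if importlib.util.find_spec(module) is None:
        raise ImportError(f"Missing dependency: '{module}' - please install it first.")
    print(f"Found module: {module}")
```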
|
|
|
The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used. |
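As a quick sanity check, the dataset that is referenced in the training configuration below can be streamed with the `datasets` library - this is only an illustrative sketch, and the exact column layout of the dump is not verified here:

```python
from datasets import load_dataset

# Stream the German Wikipedia dump referenced in the training configuration
# (hugging_face_id: "stefan-it/dewiki-20230701") without downloading it fully.
dataset = load_dataset("stefan-it/dewiki-20230701", split="train", streaming=True)

# Peek at the first example to inspect the available columns.
first_example = next(iter(dataset))
print(first_example)
```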
|
|
|
The following training configuration is used: |
|
|
|
```yaml
description: "Train a wikipedia xLSTM"

training:
  model_name: "german_wikipedia"
  batch_size: 10
  lr: 6e-4
  lr_warmup_steps: 4584
  lr_decay_until_steps: "auto"
  lr_decay_factor: 0.001
  weight_decay: 0.1
  amp_precision: bfloat16
  weight_precision: float32
  enable_mixed_precision: true
  num_epochs: 1
  output_dir: "./output"
  save_every_step: 2000
  log_every_step: 10
  generate_every_step: 5000
  wandb_project: "xlstm"

model:
  num_blocks: 24
  embedding_dim: 768
  mlstm_block:
    mlstm:
      num_heads: 4
  slstm_block: {}
  slstm_at: []
  context_length: 512

dataset:
  output_path: "./output/german-wikipedia-dataset"
  hugging_face_id: ["stefan-it/dewiki-20230701"]
  split: "train" # Also subsetting is possible: "train[:100000]"
  shuffle: False
  seed: 42

tokenizer:
  type: "pretrained"
  pretrained_class: "LlamaTokenizer"
  pretrained_id: "meta-llama/Llama-2-7b-hf"
```
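To get a rough feeling for what this configuration implies, a few numbers can be derived directly from it (assuming no additional gradient accumulation, which is not specified here):

```python
# Back-of-the-envelope numbers derived from the training configuration above.
batch_size = 10
context_length = 512
lr_warmup_steps = 4584

tokens_per_step = batch_size * context_length              # 5,120 tokens per optimizer step
tokens_during_warmup = tokens_per_step * lr_warmup_steps   # ~23.5M tokens during LR warmup

print(f"Tokens per step:      {tokens_per_step:,}")
print(f"Tokens during warmup: {tokens_during_warmup:,}")
```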
|
|
|
# Usage |
|
|
|
It is possible to use the model to generate some text: |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "stefan-it/xlstm-german-wikipedia"

# Load model and tokenizer from the Hugging Face Model Hub
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Encode a German prompt and sample a continuation of up to 100 tokens
input_ids = tokenizer.encode("Heute ist schönes Wetter in", return_tensors="pt")
output = model.generate(input_ids, max_length=100, temperature=0.7, do_sample=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)
```
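Alternatively, generation can be wrapped in a `text-generation` pipeline - assuming that the installed `transformers` version supports the xLSTM architecture of this checkpoint:

```python
from transformers import pipeline

# Build a text-generation pipeline around the uploaded checkpoint.
pipe = pipeline("text-generation", model="stefan-it/xlstm-german-wikipedia")

# Sample a continuation for a German prompt.
result = pipe("Heute ist schönes Wetter in", max_length=100, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```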
|
|
|
# Caveats |
|
|
|
Notice: this model integration is heavily under development and good hyper-parameters are still being searched for.

Downstream experiments are also coming very soon.
|
|
|
Unfortunately, NaNs occur during training (after 7h 33m 14s of training on a single RTX 4090):
|
|
|
![Training Loss](training-loss.png) |
|
|
|
This is very likely due to missing gradient clipping - it will be added soon via `Accelerator.clip_grad_norm_`.
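The following is only a minimal sketch of how such gradient clipping could be wired into an Accelerate-based training loop - `model`, `optimizer`, `dataloader` and the clipping value are placeholders, and this is not the actual Helibrunna training code:

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")

# model, optimizer and dataloader are assumed to be defined elsewhere (placeholders).
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

max_grad_norm = 1.0  # placeholder value, not a tuned hyper-parameter

for batch in dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    # Clip gradients to a maximum norm before the optimizer step to avoid
    # exploding gradients and the resulting NaNs in the loss.
    accelerator.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
```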
|
|
|
The uploaded model checkpoint was taken after 80k training steps.
|
|