---
license: apache-2.0
datasets:
- BUT-FIT/BUT-LCC
- BUT-FIT/adult_content_classifier_dataset
language:
- cs
---
# Introduction
CSTinyLlama-1.2B is a Czech language model continously pretrained on 168b training tokens from English [TinyLLama-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) model. Model was pretrained on ~67b token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) using Czech tokenizer, obtained using our vocabulary swap method.
Training was done on [Karolina](https://www.it4i.cz/en) cluster.
# BUT LM Model Roster
- [BUT-FIT/CSTinyLlama-1.2B](https://huggingface.co/BUT-FIT/CSTinyLlama-1.2B)
- [BUT-FIT/Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)
- [BUT-FIT/csmpt7b](https://huggingface.co/BUT-FIT/csmpt7b)
# Loss
Below we
- (i) demonstrate the convergence speed of released model (`TINYLLAMA1.2B_cztokenizer64k_align1.7k_tllama1.1B_C2048_lr1e-04_150k`, at 160k step).
- (ii) justify the contributions of our vocabulary swap method by comparing the swapped model with model trained from scratch (using same hyperparameters) `scratch_cztokenizer64k_tllama1.1B_C2048_lr1e-04_150k`. We swap 1.7K tokens in this run, similarly as for our other models (see [Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k))
## Train Cross-Entropy
## Test Perplexity
## Distance in Steps For the Same Loss from Fine-Tuning vs Training from Scratch
The distance |x1-x2| with same function value f1(x1)=f2(x2) grows with more steps. On convergence, it starts to rapidly increase (perhaps exponentially).
## Training parameters
Not mentioned parameters are the same as for [TinyLLama-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T).
| **Name** | **Value** | **Note** |
|----------------------------|---------------|----------------------------------------------------------------------------------------------|
| dataset_type | Concat | Sequences at the model's input were concatenated up to `$max_seq_len`, divided by EOS token. |
| tokenizer_size | 64k | |
| max_seq_len | 2048 | |
| batch_size | 512 | |
| learning_rate | 1.0e-4 | |
| optimizer | LionW | |
| optimizer_betas | 0.9/0.95 | |
| optimizer_weight_decay | 0 | |
| gradient_clipping_max_norm | 1.0 | |
| attn_impl | flash2 | |
| fsdp | SHARD_GRAD_OP | (optimized for A100 40GB GPUs) |
| precision | bf16 | |
| scheduler | cosine | |
| scheduler_warmup | 100 steps | |
| scheduler_steps | 200,000 | |
| scheduler_alpha | 0.1 | So LR on last step is 0.1*(vanilla LR) |
## Usage
```python
import torch
import transformers
from transformers import pipeline
name = 'BUT-FIT/CSTinyLlama-1.2B'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
trust_remote_code=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
with torch.autocast('cuda', dtype=torch.bfloat16):
print(
pipe('Nejznámějším českým spisovatelem ',
max_new_tokens=100,
top_p=0.95,
repetition_penalty=1.0,
do_sample=True,
use_cache=True))
```
# Training Data
We release most (95.79%) of our training data corpus as [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
## Getting in Touch
For further questions, email to `martin.fajcik@vut.cz`.
# Disclaimer
This is a probabilistic model, it can output stochastic information. Authors are not responsible for the model outputs. Use at your own risk.
# Acknowledgement
This work was supported by NAKI III program of Ministry of Culture Czech Republic, project semANT ---
"Sémantický průzkumník textového kulturního dědictví" grant no. `DH23P03OVV060` and
by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:`90254`).
# Citation
```bibtex
@article{benczechmark,
author = {Martin Fajčík, Martin Dočekal, Jan Doležal, Karel Beneš, Michal Hradiš},
title = {BenCzechMark: Machine Language Understanding Benchmark for Czech Language},
journal = {arXiv preprint arXiv:insert-arxiv-number-here},
year = {2024},
month = {March},
eprint = {insert-arxiv-number-here},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
}