---
license: apache-2.0
datasets:
- BUT-FIT/BUT-LCC
language:
- cs
---
# Introduction
CSTinyLlama-1.2B is a Czech language model continuously pretrained on 168B training tokens, starting from the English TinyLlama 2.5T model. The model was pretrained on the ~67B-token Large Czech Collection using a Czech tokenizer obtained with our vocabulary swap method (see below). Training was performed on the Karolina cluster.