---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- tahrirchi/dilmash
tags:
- nllb
- karakalpak
language:
- en
- ru
- uz
- kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

# Dilmash: Karakalpak Machine Translation Models

This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".

## Model variations

We provide three variants of our Karakalpak translation model:

| Model | Tokenizer Length | Parameter Count | Unique Features |
|-------|------------------|-----------------|-----------------|
| [**`dilmash-raw`**](https://huggingface.co/tahrirchi/dilmash-raw) | **256,204** | **615M** | **Original NLLB tokenizer** |
| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
| [`dilmash-TIL`](https://huggingface.co/tahrirchi/dilmash-TIL) | 269,399 | 629M | Additional TIL corpus |

**Common attributes:**

- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
- **Languages:** Karakalpak, Uzbek, Russian, English

## Intended uses & limitations

These models are designed for machine translation tasks involving the Karakalpak language. They can be used for translation between Karakalpak, Uzbek, Russian, and English.

### How to use

You can use these models with the Transformers library. Here's a quick example:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub
model_ckpt = "tahrirchi/dilmash-raw"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation
input_text = "Here is dilmash translation model."

# NLLB-style language codes: English source, Karakalpak target
tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "kaa_Latn"

# Tokenize, generate, and decode the first output sequence
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translated_text)  # Dilmash awdarması modeli.
```

## Training data

The models were trained on a parallel corpus of 300,000 sentence pairs, consisting of:

- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)

The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).

## Training procedure

For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).

## Citation

If you use these models in your research, please cite our paper:

```bibtex
@misc{mamasaidov2024openlanguagedatainitiative,
      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
      year={2024},
      eprint={2409.04269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.04269},
}
```

## Gratitude

We are thankful to these awesome organizations and people for helping to make it happen:

- [David Dalé](https://daviddale.ru): for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- [Nurlan Pirjanov](https://www.linkedin.com/in/nurlan-pirjanov/): for expertise and assistance with the Karakalpak language
- [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process

We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
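
## Note on parameter counts

The gap between 615M and 629M parameters in the "Model variations" table can be roughly accounted for by the larger vocabulary alone. The sketch below is a back-of-the-envelope check, not a measurement: it assumes a hidden size of 1024 and a single embedding matrix shared across the encoder, decoder, and output layer, neither of which is stated in this card. The variable names are illustrative only.

```python
# Figures taken from the "Model variations" table.
RAW_VOCAB = 256_204        # dilmash-raw tokenizer length
EXPANDED_VOCAB = 269_399   # dilmash / dilmash-TIL tokenizer length
RAW_PARAMS_M = 615         # reported size of dilmash-raw, in millions (rounded)

# Assumptions (not stated in this card): hidden size of 1024 and one
# embedding matrix shared by the encoder, decoder, and output layer.
HIDDEN_SIZE = 1024

# Each added token contributes one embedding row of HIDDEN_SIZE parameters.
added_tokens = EXPANDED_VOCAB - RAW_VOCAB
added_params = added_tokens * HIDDEN_SIZE
estimated_params_m = RAW_PARAMS_M + added_params / 1e6

print(f"added tokens: {added_tokens:,}")
print(f"added embedding parameters: {added_params:,}")
print(f"estimated expanded model size: {estimated_params_m:.1f}M")
```

Under these assumptions the 13,195 added tokens account for about 13.5M extra parameters, giving an estimate of roughly 628.5M, which is in line with the 629M reported for `dilmash` and `dilmash-TIL`.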

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.

For questions about further development or issues with the dataset, please contact m.mamasaidov@tahrirchi.uz or a.shopolatov@tahrirchi.uz.