---
license: mit
language:
- ky
tags:
- tokenization
- BPE
- kyrgyz
- tokenizer
---

A tokenizer for the Kyrgyz language, built with SentencePiece using Byte Pair Encoding (BPE). It has a vocabulary of 150,000 subwords and is intended for use across Kyrgyz NLP tasks. It was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.

Language: Kyrgyz

Vocabulary Size: 150,000 subwords

Method: SentencePiece (BPE)

Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
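
To illustrate the BPE method named above, here is a minimal sketch of the merge-learning idea on a toy corpus. It is not SentencePiece's implementation and not how this tokenizer was trained; the helper names `count_pairs` and `merge_pair`, the toy words, and their frequencies are all hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical frequencies): each word starts as a tuple of characters.
corpus = {tuple("тил"): 5, tuple("тили"): 3, tuple("тилдер"): 2}

merges = []
for _ in range(2):  # learn two merges
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
print(corpus)
```

Each round merges the most frequent adjacent pair into a new symbol, so frequent character sequences gradually become single subwords. A real vocabulary such as the 150,000-subword one described here is built from many more merges over a much larger corpus.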

Usage Example (Python with transformers):

```python
from transformers import AutoTokenizer

# Replace the placeholder with the tokenizer's repository ID or a local path.
tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")

text = "Кыргыз тили – бай жана кооз тил."

# Calling the tokenizer returns an encoding that includes the input IDs.
tokens = tokenizer(text)
print(tokens)
```

Tip: Applying text normalization (and, where it suits the task, lemmatization) during preprocessing may make tokenization more consistent.
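
As one possible reading of the normalization tip, the sketch below applies Unicode NFC normalization and collapses whitespace before text reaches the tokenizer. The function name `normalize_text` and the choice of NFC are assumptions made for illustration; this card does not say which normalization, if any, the tokenizer already applies internally. Lemmatization is not covered here.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Compose characters with NFC, then collapse runs of whitespace."""
    # NFC merges base letters and combining marks, e.g. "и" + U+0306 -> "й".
    text = unicodedata.normalize("NFC", text)
    # Collapse any whitespace run (including non-breaking spaces) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "Кыргыз   тили\u00a0– бай жана кооз тил.\n"
clean = normalize_text(raw)
print(clean)
```

The cleaned string can then be passed to the tokenizer in place of the raw text.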

License and Attribution

This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.

Feedback and Contributions

We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.