|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for Sindhi-BPE-Tokenizer
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
A Byte-Pair Encoding (BPE) tokenizer works by iteratively merging the most frequent pairs of adjacent symbols, rather than splitting text into whole words or individual characters. Trained on a Sindhi Twitter dataset, this tokenizer captures common letter combinations in Sindhi, preserving the language's phonetic structure.
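The merge-learning loop can be sketched on a toy corpus. Everything below is illustrative only: the helper names are invented, the corpus is English for readability, and the real tokenizer's merges were of course learned from Sindhi text.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: each word as a tuple of characters, with its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("learned merge:", pair)
```

Each iteration records one merge rule; a real training run repeats this until a target vocabulary size is reached.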
|
|
|
The key advantage of BPE is its ability to handle unknown words. Similar to how we can pronounce a new word by recognizing character pairings, the BPE tokenizer breaks down unfamiliar words into smaller, familiar sub-units, making it robust for unseen terms while maintaining consistency with the language's sound patterns. |
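To illustrate how learned merges break an unseen word into familiar sub-units, here is a minimal greedy sketch. The merge table is made up for the example and is not this tokenizer's actual merge list.

```python
def segment(word, merges):
    """Apply merge rules in learned order to split a word into subword units."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols

# Toy merge table (illustrative only).
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(segment("lowering", merges))  # ['low', 'er', 'i', 'n', 'g']
```

The word "lowering" was never seen as a whole, yet it decomposes into known pieces plus single characters, so nothing is mapped to an unknown token.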
|
|
|
- **Developed by:** Fahad Maqsood Qazi |
|
- **Model type:** BPE Tokenizer |
|
- **Language(s) (NLP):** Sindhi (Perso-Arabic Script) |
|
- **License:** [More Information Needed] |
|
|
|
## Usage |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fahadqazi/Sindhi-BPE-Tokenizer")

# With transformers, encode() returns a plain list of token IDs
encoded = tokenizer.encode("ڪهڙا حال آهن")
decoded = tokenizer.decode(encoded)

# Tokens are shown in byte-level BPE form, e.g. ['Úª', 'Ùĩ', 'ÚĻا', 'ĠØŃاÙĦ', 'ĠØ¢ÙĩÙĨ']
print("Encoded tokens: ", tokenizer.convert_ids_to_tokens(encoded))
print("Decoded text: ", decoded)  # output: ڪهڙا حال آهن
```
|
|
|
|