Model Card for Sindhi-BPE-Tokenizer

Model Details

Model Description

This is a Byte-Pair Encoding (BPE) tokenizer for Sindhi. Rather than splitting text into whole words or single characters, BPE starts from small units and repeatedly merges the most frequent adjacent pairs into larger subword tokens. Trained on a Sindhi Twitter dataset, this tokenizer learns the letter combinations that occur most often in Sindhi text, which helps its tokens follow the language's common spelling and sound patterns.

The key advantage of BPE is how it handles unknown words. Much as a reader can pronounce a new word by recognizing familiar letter groupings, the tokenizer breaks an unfamiliar word into smaller, known subword units. Unseen terms are therefore still represented by meaningful pieces instead of a single unknown token, and the pieces stay consistent with patterns seen during training.
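The merge-and-split behaviour described above can be sketched with a toy BPE trainer. This is only an illustration under simplifying assumptions: it works on characters of a tiny English word list, and the helper names (`learn_bpe`, `merge_pair`, `segment`) are hypothetical. It is not the training code used for this tokenizer.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    corpus = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration only).
merges = learn_bpe(["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3)
print(merges)                      # [('l', 'o'), ('lo', 'w'), ('e', 's')]
print(segment("lowness", merges))  # ['low', 'n', 'es', 's']
```

The word "lowness" never appears in the toy corpus, yet it is still split into units learned from other words, with single characters as the fallback.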

  • Developed by: Fahad Maqsood Qazi
  • Model type: BPE Tokenizer
  • Language(s) (NLP): Sindhi (Perso-Arabic Script)
  • License: [More Information Needed]

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fahadqazi/Sindhi-BPE-Tokenizer")

text = "ڪهڙا حال آهن"

# encode() returns a plain list of token ids, so look up the token strings separately
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)
decoded = tokenizer.decode(ids)

print("Encoded tokens: ", tokens) # output: ['Úª', 'Ùĩ', 'ÚĻØ§', 'ĠØŃØ§ÙĦ', 'ĠØ¢ÙĩÙĨ']
print("Decoded text: ", decoded) # output: ڪهڙا حال آهن
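The token strings in the output above look garbled because they are printed in a byte-level alphabet, where each UTF-8 byte is shown as one printable character and a leading space appears as `Ġ`. The sketch below assumes the tokenizer uses the GPT-2 style byte-to-character table, which the `Ġ` prefix suggests but the model card does not state. The helper names are hypothetical.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each of the 256 byte values to one printable character."""
    # Bytes that are already printable map to themselves.
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    # Remaining bytes (controls, space, etc.) are shifted to code points 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def text_to_token(text):
    """Show how a piece of text looks in the byte-level alphabet."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def token_to_text(token):
    """Map a byte-level token string back to readable text.

    Raises UnicodeDecodeError if the token covers only part of a multi-byte character.
    """
    return bytes(CHAR_TO_BYTE[ch] for ch in token).decode("utf-8")


print(text_to_token("ڪ"))   # Úª
print(token_to_text("Úª"))  # ڪ
```

In normal use this mapping never needs to be done by hand, since `tokenizer.decode` already returns readable Sindhi text.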