fahadqazi/Sindhi-BPE-Tokenizer


fahadqazi committed on Nov 14, 2024

Commit 8143634 · verified · 1 Parent(s): 4535c67

Update README.md

Files changed (1)

README.md +4 -2


@@ -15,14 +15,16 @@ tags: []
 <!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+A Byte-Pair Encoding (BPE) tokenizer works by grouping frequently occurring pairs of characters instead of splitting text into words or individual characters. By training on a Sindhi Twitter dataset, this tokenizer captures common letter combinations in Sindhi, preserving the language's phonetic structure.
+The key advantage of BPE is its ability to handle unknown words. Similar to how we can pronounce a new word by recognizing character pairings, the BPE tokenizer breaks down unfamiliar words into smaller, familiar sub-units, making it robust for unseen terms while maintaining consistency with the language's sound patterns.
 - **Developed by:** Fahad Maqsood Qazi
 - **Model type:** BPE Tokenizer
 - **Language(s) (NLP):** Sindhi (Perso-Arabic Script)
 - **License:** [More Information Needed]
-## Uses
+## Usage
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
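
The pair-merging behaviour described in the added README text can be illustrated with a minimal sketch in pure Python. This is not the code used to train this tokenizer, and nothing here comes from the repository: the function names `train_bpe`, `merge_pair`, and `encode`, and the toy corpus, are hypothetical and chosen only for illustration. The sketch learns merge rules by repeatedly fusing the most frequent adjacent symbol pair, then applies those rules to a word that was never seen in training.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters (Unicode code points).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into sub-units by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; a real tokenizer would be trained on far more text.
merges = train_bpe(["abab", "abab", "ab"], num_merges=1)
tokens = encode("abba", merges)  # "abba" never appears in the corpus
```

Because an unseen word is always decomposed into learned sub-units or, at worst, single characters, concatenating the tokens returns the original word. Python strings iterate by Unicode code point, so the same sketch accepts Perso-Arabic Sindhi text without changes.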