Update README.md
README.md
CHANGED
@@ -15,14 +15,16 @@ tags: []
 
 <!-- Provide a longer summary of what this model is. -->
 
-
+A Byte-Pair Encoding (BPE) tokenizer works by repeatedly merging the most frequent adjacent pairs of characters into larger sub-word units, rather than splitting text only into whole words or individual characters. Trained on a Sindhi Twitter dataset, this tokenizer captures common letter combinations in Sindhi, which helps it reflect the language's phonetic structure.
+
+The key advantage of BPE is its ability to handle unknown words. Much as a reader can pronounce a new word by recognizing familiar character pairings, the BPE tokenizer breaks unfamiliar words down into smaller, familiar sub-units. This makes it robust to unseen terms while staying consistent with the language's sound patterns.
 
 - **Developed by:** Fahad Maqsood Qazi
 - **Model type:** BPE Tokenizer
 - **Language(s) (NLP):** Sindhi (Perso-Arabic Script)
 - **License:** [More Information Needed]
 
-##
+## Usage
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
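
The description added in this commit explains BPE in terms of merging frequent character pairs and breaking unseen words into known sub-units. The sketch below illustrates that idea in plain Python. It is not the code used to build this tokenizer: the function names `learn_merges`, `apply_merge`, and `encode`, the toy Latin-script corpus, and the merge count are all assumptions chosen for illustration, and the real tokenizer was trained on Sindhi text.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Start with every word split into single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[apply_merge(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by replaying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy corpus standing in for real training data (an assumption for this sketch).
    corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
    merges = learn_merges(corpus, num_merges=10)
    # "lowers" never appears in the corpus but is still split into learned sub-units.
    print(encode("lowers", merges))
```

Because `encode` only ever merges symbols, joining the tokens always reproduces the input, and a word made entirely of unseen characters simply falls back to one token per character.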