fahadqazi committed on
Commit 8143634 · verified · 1 Parent(s): 4535c67

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -15,14 +15,16 @@ tags: []
 
  <!-- Provide a longer summary of what this model is. -->
 
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+ A Byte-Pair Encoding (BPE) tokenizer works by grouping frequently occurring pairs of characters instead of splitting text into words or individual characters. By training on a Sindhi Twitter dataset, this tokenizer captures common letter combinations in Sindhi, preserving the language's phonetic structure.
+
+ The key advantage of BPE is its ability to handle unknown words. Similar to how we can pronounce a new word by recognizing character pairings, the BPE tokenizer breaks down unfamiliar words into smaller, familiar sub-units, making it robust for unseen terms while maintaining consistency with the language's sound patterns.
 
  - **Developed by:** Fahad Maqsood Qazi
  - **Model type:** BPE Tokenizer
  - **Language(s) (NLP):** Sindhi (Perso-Arabic Script)
  - **License:** [More Information Needed]
 
- ## Uses
+ ## Usage
 
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
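
The added README text describes BPE as grouping frequent character pairs and breaking unfamiliar words into known sub-units. A minimal pure-Python sketch of that idea follows. It is not the tokenizer from this repository: the corpus is a hypothetical English toy list standing in for the Sindhi Twitter data, and `learn_merges`, `apply_merge`, and `tokenize` are illustrative names, not part of any library.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy corpus)."""
    # Start with every word split into single characters, counted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; the real tokenizer was trained on Sindhi text.
corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = learn_merges(corpus, num_merges=10)

# "lowering" never appears in the corpus, yet it still splits into known pieces.
print(tokenize("lowering", merges))
```

Because tokenization falls back to single characters when no merge applies, an unseen word is never mapped to an unknown token here; it is only split into more, smaller pieces. Production tokenizers typically add details this sketch omits, such as pre-tokenization and special tokens.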