---
library_name: transformers
tags: []
---

# Model Card for Sindhi-BPE-Tokenizer

<!-- Provide a quick summary of what the model is/does. -->



## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

A Byte-Pair Encoding (BPE) tokenizer works by iteratively merging the most frequent pairs of adjacent symbols, rather than splitting text into whole words or individual characters. Because it was trained on a Sindhi Twitter dataset, this tokenizer captures common letter combinations in Sindhi, preserving the language's phonetic structure.

The key advantage of BPE is its ability to handle unknown words. Similar to how we can pronounce a new word by recognizing character pairings, the BPE tokenizer breaks down unfamiliar words into smaller, familiar sub-units, making it robust for unseen terms while maintaining consistency with the language's sound patterns.
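The merge procedure described above can be sketched in a few lines of plain Python. This is a toy illustration of the BPE training loop, not the actual trained tokenizer; the corpus, word counts, and function name are invented for readability (an English example is used here so the merges are easy to follow):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a toy word-frequency corpus.

    `words` maps each word to its count; every word starts as a
    sequence of single characters, and each step merges the most
    frequent adjacent pair across the whole corpus.
    """
    vocab = {tuple(word): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

# Hypothetical corpus: frequent suffixes like "est" get merged first.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe_merges(corpus, 4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

After these merges, an unseen word such as "lowest" would decompose into the familiar sub-units "low" + "est", which is exactly the robustness to unknown words described above.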

- **Developed by:** Fahad Maqsood Qazi
- **Model type:** BPE Tokenizer
- **Language(s) (NLP):** Sindhi (Perso-Arabic Script)
- **License:** [More Information Needed]

## Usage

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fahadqazi/Sindhi-BPE-Tokenizer")

text = "ڪهڙا حال آهن"
encoded = tokenizer.encode(text)   # list of token ids
decoded = tokenizer.decode(encoded)

print("Tokens: ", tokenizer.tokenize(text)) # output: ['Úª', 'Ùĩ', 'ÚĻا', 'ĠØŃاÙĦ', 'ĠØ¢ÙĩÙĨ']
print("Decoded text: ", decoded) # output: ڪهڙا حال آهن
```

Note that the printed tokens are the byte-level surface forms used internally by the BPE vocabulary; decoding restores the original Sindhi text.