Saiteja committed on
Commit a35bc8f · verified · 1 Parent(s): e7a51a2

Upload 4 files

Files changed (4):
  1. README.md +42 -13
  2. app.py +4 -0
  3. requirements.txt +3 -0
  4. test.py +92 -0
README.md CHANGED
@@ -1,13 +1,42 @@
- ---
- title: Telugu Bpe
- emoji: 🌍
- colorFrom: red
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.9.1
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Telugu Tokenizer Demo
+
+ This is a demo of a custom Telugu tokenizer trained on a large corpus of Telugu text. The tokenizer is designed to handle Telugu text efficiently while maintaining a high compression ratio.
+
+ ## Features
+
+ - **Vocabulary Size**: 50,000+ tokens
+ - **Compression Ratio**: >3.0
+ - **Special Token Handling**: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
+ - **Telugu-specific**: Optimized for the Telugu character set (Unicode range: \u0C00-\u0C7F)
+
+ ## Usage
+
+ 1. Enter Telugu text in the input box
+ 2. Click "Submit"
+ 3. View the tokenization results:
+    - Tokens
+    - Token IDs
+    - Number of tokens
+    - Text length
+    - Compression ratio
+
+ ## Examples
+
+ The demo includes several example texts showcasing different aspects of Telugu text:
+ - Basic greetings
+ - Simple sentences
+ - Complex sentences
+ - Long paragraphs
+
+ ## Tokenizer Source
+
+ The tokenizer is available at: [https://huggingface.co/Saiteja/telugu-bpe](https://huggingface.co/Saiteja/telugu-bpe)
+
+ ## Technical Details
+
+ - Built using the 🤗 Tokenizers library
+ - Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
+ - Handles Telugu Unicode characters effectively
+ - Maintains a high compression ratio while preserving token interpretability
+
+
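The compression ratio quoted in the README is simply input characters per emitted token. As a rough illustration of the metric only — using a stand-in whitespace splitter, not the actual trained tokenizer — it can be computed like this:

```python
def compression_ratio(text: str, tokens: list) -> float:
    """Characters of input text per emitted token (higher means fewer tokens)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty token list")
    return len(text) / len(tokens)

# Stand-in tokenization for illustration only; the real demo uses the
# trained tokenizer from Saiteja/telugu-bpe.
text = "తెలుగు భాష చాలా అందమైనది"
tokens = text.split()  # 4 whitespace-separated words
print(f"Compression ratio: {compression_ratio(text, tokens):.2f}")
```

A trained subword tokenizer generally emits more tokens than there are whitespace words, so the real ratio sits below this rough whitespace-based upper bound.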
app.py ADDED
@@ -0,0 +1,4 @@
+ from test import iface
+
+ # For HuggingFace Spaces
+ iface.launch()
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio>=3.50.2
+ tokenizers>=0.15.0
+ huggingface_hub>=0.19.4
test.py ADDED
@@ -0,0 +1,84 @@
+ import gradio as gr
+ from tokenizers import Tokenizer
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Download tokenizer files from the HF Hub
+ def get_tokenizer():
+     try:
+         # Download tokenizer.json
+         tokenizer_path = hf_hub_download(
+             repo_id="Saiteja/telugu-bpe",
+             filename="tokenizer.json",
+             repo_type="model"
+         )
+         # Download examples.json
+         examples_path = hf_hub_download(
+             repo_id="Saiteja/telugu-bpe",
+             filename="examples.json",
+             repo_type="model"
+         )
+         return tokenizer_path, examples_path
+     except Exception as e:
+         print(f"Error downloading files: {e}")
+         return None, None
+
+ # Get tokenizer and examples; fail fast if the download did not succeed
+ tokenizer_path, examples_path = get_tokenizer()
+ if tokenizer_path is None or examples_path is None:
+     raise RuntimeError("Could not download tokenizer files from the Hub")
+
+ # Load the tokenizer
+ tokenizer = Tokenizer.from_file(tokenizer_path)
+
+ # Load examples
+ with open(examples_path, "r", encoding="utf-8") as f:
+     examples_data = json.load(f)
+
+ # Extract example texts
+ example_texts = [example["text"] for example in examples_data]
+
+ def tokenize_text(text):
+     """Tokenize the input text and return tokens, IDs and compression ratio."""
+     if not text.strip():
+         return "Please enter some text."
+
+     try:
+         encoding = tokenizer.encode(text)
+         compression_ratio = len(text) / len(encoding.ids)
+
+         result = f"""Tokens: {encoding.tokens}
+ Token IDs: {encoding.ids}
+ Number of tokens: {len(encoding.ids)}
+ Text length: {len(text)}
+ Compression ratio: {compression_ratio:.2f}"""
+
+         return result
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+ # Create the Gradio interface
+ iface = gr.Interface(
+     fn=tokenize_text,
+     inputs=gr.Textbox(
+         lines=5,
+         placeholder="Enter Telugu text here...",
+         label="Input Text"
+     ),
+     outputs=gr.Textbox(
+         label="Tokenization Results",
+         lines=10
+     ),
+     title="Telugu Tokenizer Demo",
+     description="""This demo uses a custom Telugu tokenizer trained on a large corpus of Telugu text.
+ The tokenizer has a vocabulary size of 50,000+ tokens and achieves a compression ratio of >3.0.
+ Try entering some Telugu text to see how it's tokenized!
+
+ Tokenizer: https://huggingface.co/Saiteja/telugu-bpe""",
+     examples=example_texts,
+     theme=gr.themes.Soft()
+ )
+
+ # Launch the app
+ if __name__ == "__main__":
+     iface.launch()
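Because `test.py` downloads its tokenizer at import time, the report-formatting logic inside `tokenize_text` is awkward to exercise offline. A minimal sketch of the same report format against a stub encoding — the `StubEncoding` class here is a hypothetical stand-in, not part of the `tokenizers` API — looks like:

```python
from dataclasses import dataclass

@dataclass
class StubEncoding:
    """Fake stand-in for the encoding object returned by Tokenizer.encode()."""
    tokens: list
    ids: list

def format_result(text: str, encoding: StubEncoding) -> str:
    """Mirrors the report string built in tokenize_text."""
    compression_ratio = len(text) / len(encoding.ids)
    return (
        f"Tokens: {encoding.tokens}\n"
        f"Token IDs: {encoding.ids}\n"
        f"Number of tokens: {len(encoding.ids)}\n"
        f"Text length: {len(text)}\n"
        f"Compression ratio: {compression_ratio:.2f}"
    )

# Offline smoke test with a made-up 2-token encoding of an 8-codepoint input.
report = format_result("నమస్కారం", StubEncoding(tokens=["నమ", "స్కారం"], ids=[101, 202]))
print(report)
```

Splitting the function this way would let the string formatting be unit-tested without network access; the deployed Space keeps the single `tokenize_text` shown above.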