---
language: hi
tags:
- hindi
- tokenizer
- bpe
- subword
- text-processing
pipeline_tag: text2text-generation
inference: true
license: mit
spaces:
- aayushraina/bpe-hindi
---

# Hindi Byte Pair Encoding (BPE) Tokenizer

A Byte Pair Encoding (BPE) tokenizer for Hindi text, trained to keep the vocabulary under 5,000 tokens while reaching a compression ratio of at least 3.2.

## Online Demo

Try the tokenizer in your browser: [Hindi BPE Tokenizer Demo](https://huggingface.co/spaces/aayushraina/bpe-hindi)

## Project Overview

This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:
- Efficient trie-based tokenization
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility
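
To illustrate the trie-based tokenization listed above, here is a minimal sketch of greedy longest-match encoding over a token trie. The class names (`TrieNode`, `TokenTrie`), the toy vocabulary, and the `unk_id` fallback are assumptions made for illustration and are not taken from `byte_pair_encoder.py`. Greedy longest-match is also only one way to use a trie, and it may not reproduce the exact output of applying BPE merges in training order.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node of the token trie (hypothetical helper, not from the project)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.token_id: Optional[int] = None  # set when a vocabulary token ends here


class TokenTrie:
    """Stores a vocabulary in a trie and encodes text by longest match."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.root = TrieNode()
        for token, token_id in vocab.items():
            node = self.root
            for ch in token:
                node = node.children.setdefault(ch, TrieNode())
            node.token_id = token_id

    def encode(self, text: str, unk_id: int = -1) -> List[int]:
        ids: List[int] = []
        i = 0
        while i < len(text):
            node = self.root
            # Default: emit unk_id and skip one character if nothing matches.
            best_id, best_end = unk_id, i + 1
            j = i
            while j < len(text) and text[j] in node.children:
                node = node.children[text[j]]
                j += 1
                if node.token_id is not None:
                    best_id, best_end = node.token_id, j  # longest match so far
            ids.append(best_id)
            i = best_end
        return ids


# Toy vocabulary, assumed for this example only.
vocab = {"न": 0, "म": 1, "स्": 2, "ते": 3, "नम": 4, "नमस्ते": 5, " ": 6}
trie = TokenTrie(vocab)
print(trie.encode("नमस्ते नम"))  # [5, 6, 4]
```

Each lookup walks the trie once per starting position, so the cost of finding a token is bounded by the longest token length rather than by the vocabulary size.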

## Project Structure

    hindi-bpe/
    ├── data/                    # Dataset directory
    │   ├── train/               # Training data
    │   └── valid/               # Validation data
    ├── tokenizer/               # Saved tokenizer files
    │   ├── encoder.json         # Encoder state
    │   └── vocab_stats.json     # Vocabulary statistics
    ├── output/                  # Visualization outputs
    ├── byte_pair_encoder.py     # Core BPE implementation
    ├── hindi_bpe.py             # Hindi-specific wrapper
    ├── test_hindi_bpe.py        # Test suite
    └── requirements.txt         # Dependencies

## Training Stats

At iteration 4500:
- Vocabulary size: 4,477
- Data size: 448,754
- Compression ratio: 3.66
- Max token length: 64

## File Descriptions

1. **byte_pair_encoder.py**
   - Core BPE implementation
   - Trie-based tokenization
   - Training statistics tracking
   - Visualization utilities

2. **hindi_bpe.py**
   - Hindi-specific tokenizer wrapper
   - Text preprocessing
   - Model saving/loading
   - Compression ratio calculation

3. **app.py**
   - Interactive web interface
   - Real-time tokenization
   - Training visualization
   - Model parameter tuning

4. **test_hindi_bpe.py**
   - Test suite for tokenizer
   - Performance benchmarks
   - Example usage
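
As a rough illustration of what a core BPE training loop does, the sketch below repeatedly merges the most frequent adjacent pair of tokens. The function name `train_bpe`, its signature, and the stopping rule (stop when no pair occurs at least twice) are assumptions for this example; the actual code in `byte_pair_encoder.py` may differ.

```python
from collections import Counter
from typing import List, Tuple


def train_bpe(text: str, num_merges: int) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Character-level BPE sketch: returns the final tokens and the learned merges."""
    tokens: List[str] = list(text)  # start from individual characters
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent pair in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # assumed stopping rule: nothing left worth merging
        merges.append((left, right))

        # Rewrite the sequence, replacing each occurrence of the pair.
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return tokens, merges


# Hypothetical sample text; a real run would use the training data.
sample = "नमस्ते नमस्कार नमन"
tokens, merges = train_bpe(sample, num_merges=5)
print(tokens)
print(merges)
```

Concatenating the returned tokens always reproduces the input text, which is the property that encoding/decoding tests typically check.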

## Installation

Clone the repository and install the dependencies:

    git clone https://github.com/yourusername/hindi-bpe.git
    cd hindi-bpe
    pip install -r requirements.txt

## Download and Prepare Dataset

    python download_dataset.py
  
### Web Interface

    streamlit run app.py

### Tests

    python test_hindi_bpe.py

The test suite includes:
- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy

## Performance Metrics

The tokenizer aims to achieve:
- Vocabulary size < 5000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation
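
The two numeric targets above can be checked with a few lines of code. This is a sketch under one loud assumption: the README does not define "compression ratio", so the example takes it to mean input characters divided by output tokens. The function names `compression_ratio` and `meets_targets` are hypothetical and are not part of `hindi_bpe.py`.

```python
from typing import Sequence

# Targets taken from the list above.
MAX_VOCAB_SIZE = 5000
MIN_COMPRESSION_RATIO = 3.2


def compression_ratio(text: str, token_ids: Sequence[int]) -> float:
    """Assumed definition: characters in the input per token in the output."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)


def meets_targets(vocab_size: int, ratio: float) -> bool:
    """True when both numeric targets are satisfied."""
    return vocab_size < MAX_VOCAB_SIZE and ratio >= MIN_COMPRESSION_RATIO


# Toy input: 12 characters encoded as 4 tokens.
print(compression_ratio("a" * 12, [0, 1, 2, 3]))  # 3.0

# Values reported for training iteration 4500.
print(meets_targets(4477, 3.66))  # True
```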

## Contributing

1. Fork the repository
2. Create feature branch
3. Commit changes
4. Push to branch
5. Create Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.