---
language: hi
tags:
- hindi
- tokenizer
- bpe
- subword
- text-processing
pipeline_tag: text2text-generation
inference: true
license: mit
spaces:
- aayushraina/bpe-hindi
---
# Hindi Byte Pair Encoding (BPE) Tokenizer
A Byte Pair Encoding (BPE) tokenizer trained on Hindi text. At training iteration 4500 it reaches a compression ratio of 3.66 with a vocabulary of 4,477 tokens.
## Online Demo
Try the tokenizer in your browser: [Hindi BPE Tokenizer Demo](https://huggingface.co/spaces/aayushraina/bpe-hindi)
## Project Overview
This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:
- Efficient trie-based tokenization
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility
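To make the core idea concrete, here is a minimal sketch of a BPE training loop. It is not the code in `byte_pair_encoder.py`: the function name `train_bpe` is hypothetical, and the sketch assumes training starts from Unicode code points (the README does not say whether the project starts from characters or bytes).

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single code points.

    Returns (merges, tokens): the learned rules in order, and the training
    text segmented with those rules applied.
    """
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair in the current segmentation.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress
        merges.append((left, right))
        # Rewrite the sequence, fusing each occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe("abababab", 2)
print(merges)  # [('a', 'b'), ('ab', 'ab')]
print(tokens)  # ['abab', 'abab']
```

Each merge adds one entry to the vocabulary, so the number of merges bounds the vocabulary growth.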
## Project Structure
hindi-bpe/
├── data/ # Dataset directory
│   ├── train/ # Training data
│   └── valid/ # Validation data
├── tokenizer/ # Saved tokenizer files
│   ├── encoder.json # Encoder state
│   └── vocab_stats.json # Vocabulary statistics
├── output/ # Visualization outputs
├── byte_pair_encoder.py # Core BPE implementation
├── hindi_bpe.py # Hindi-specific wrapper
├── test_hindi_bpe.py # Test suite
└── requirements.txt # Dependencies
## Training stats
- Iteration 4500:
  - Vocabulary size: 4,477
  - Data size: 448,754
  - Compression ratio: 3.66
  - Max token length: 64
## File Descriptions
1. **byte_pair_encoder.py**
- Core BPE implementation
- Trie-based tokenization
- Training statistics tracking
- Visualization utilities
2. **hindi_bpe.py**
- Hindi-specific tokenizer wrapper
- Text preprocessing
- Model saving/loading
- Compression ratio calculation
3. **app.py**
- Interactive web interface
- Real-time tokenization
- Training visualization
- Model parameter tuning
4. **test_hindi_bpe.py**
- Test suite for tokenizer
- Performance benchmarks
- Example usage
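The trie-based tokenization listed for `byte_pair_encoder.py` can be illustrated with a greedy longest-match lookup. This is a sketch under assumptions, not the project's implementation: `TrieNode`, `build_trie`, and `tokenize` are hypothetical names, and falling back to a single character for unknown input is a choice made here for illustration. Greedy longest match can also segment differently from replaying BPE merges in their learned order.

```python
class TrieNode:
    """One node per character; `is_token` marks the end of a vocabulary entry."""

    def __init__(self):
        self.children = {}
        self.is_token = False


def build_trie(vocab):
    root = TrieNode()
    for token in vocab:
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True
    return root


def tokenize(text, root):
    """Split `text` by repeatedly taking the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        node, j, end = root, i, i
        # Walk the trie as far as the text allows, remembering the last full token.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_token:
                end = j
        if end == i:
            end = i + 1  # assumption: emit an unknown character as its own token
        tokens.append(text[i:end])
        i = end
    return tokens


root = build_trie(["a", "ab", "abc", "d"])
print(tokenize("abcdabx", root))  # ['abc', 'd', 'ab', 'x']
```

The trie lets each position be matched in time proportional to the longest token rather than to the vocabulary size.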
## Installation
- Clone the repository: `git clone https://github.com/yourusername/hindi-bpe.git`
- Enter the project directory: `cd hindi-bpe`
- Install the dependencies: `pip install -r requirements.txt`
## Download and prepare dataset
- Run `python download_dataset.py`
### Web Interface
- Run `streamlit run app.py`
### Tests
- Run `python test_hindi_bpe.py`
- The test suite includes:
  - Training pipeline verification
  - Compression ratio validation
  - Token count requirements
  - Encoding/decoding accuracy
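An encoding/decoding accuracy check usually amounts to a round-trip property: decoding the encoded text should return the original string. The sketch below shows that property in isolation. `check_round_trip` and the stand-in `encode`/`decode` functions are hypothetical and are not taken from `test_hindi_bpe.py`; in practice the tokenizer's own methods would be passed in.

```python
def check_round_trip(encode, decode, samples):
    """Return the samples that do not survive encode followed by decode."""
    return [s for s in samples if decode(encode(s)) != s]


# Stand-in tokenizer (an assumption for illustration): one id per code point.
def encode(text):
    return [ord(ch) for ch in text]


def decode(ids):
    return "".join(chr(i) for i in ids)


samples = ["नमस्ते", "भारत एक देश है", ""]
failures = check_round_trip(encode, decode, samples)
print(failures)  # [] when every sample round-trips
```

Returning the failing samples instead of a boolean makes it easier to see which inputs break the tokenizer.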
## Performance Metrics
The tokenizer aims to achieve:
- Vocabulary size < 5000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation
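The two numeric targets above can be checked mechanically. The sketch below assumes the compression ratio means input length divided by token count. The README does not define it, so the `unit` parameter (characters or UTF-8 bytes) and the function names `compression_ratio` and `meets_targets` are assumptions made for illustration.

```python
def compression_ratio(text, token_count, unit="chars"):
    """Input length divided by the number of tokens produced.

    Assumption: `unit` selects Unicode code points ("chars") or UTF-8
    bytes ("bytes"); the project may define the ratio differently.
    """
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    length = len(text) if unit == "chars" else len(text.encode("utf-8"))
    return length / token_count


def meets_targets(vocab_size, ratio, max_vocab=5000, min_ratio=3.2):
    """Check the stated goals: vocabulary size < 5000 and ratio >= 3.2."""
    return vocab_size < max_vocab and ratio >= min_ratio


print(compression_ratio("abcdef", 2))  # 3.0
print(meets_targets(4477, 3.66))       # True for the reported iteration-4500 stats
```

Because Devanagari code points take three bytes each in UTF-8, the byte-based ratio for Hindi text is roughly three times the character-based one, so the unit matters when comparing against the 3.2 target.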
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.