---
language: sw
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/ALFFA_swahili
  - bookbot/fleurs_sw
  - bookbot/common_voice_16_1_sw
---

# Pruned Stateless Zipformer RNN-T Streaming Robust SW

Pruned Stateless Zipformer RNN-T Streaming Robust SW is an automatic speech recognition model trained on the following datasets:

- [ALFFA Swahili](https://huggingface.co/datasets/bookbot/ALFFA_swahili)
- [FLEURS Swahili](https://huggingface.co/datasets/bookbot/fleurs_sw)
- [Common Voice 16.1 Swahili](https://huggingface.co/datasets/bookbot/common_voice_16_1_sw)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with an NVIDIA H100 GPU. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tensorboard) logged via TensorBoard.

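
To make the phoneme vocabulary more concrete, here is a minimal sketch that parses a symbol table written as one `<token> <id>` pair per line, which is the layout icefall uses for `tokens.txt`. The sample entries and their IDs are illustrative assumptions, not the contents of the real file, and `load_tokens` is a hypothetical helper rather than an icefall function.

```python
from typing import Dict, Iterable

# Illustrative sample only: the real tokens.txt ships with the model
# repository, and its entries and IDs may differ from these.
SAMPLE_TOKENS = """\
<blk> 0
w 1
ɑ 2
ʃ 3
i 4
"""


def load_tokens(lines: Iterable[str]) -> Dict[int, str]:
    """Parse `<token> <id>` lines into an id -> token mapping."""
    id2token: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so the ID is always the final field.
        token, idx = line.rsplit(maxsplit=1)
        id2token[int(idx)] = token
    return id2token


id2token = load_tokens(SAMPLE_TOKENS.splitlines())

# Map a hypothetical sequence of predicted IDs back to phoneme symbols.
hyp_ids = [1, 2, 3, 4, 2]
phonemes = [id2token[i] for i in hyp_ids]
print(phonemes)
```

To inspect the actual vocabulary, the same function could be pointed at the downloaded file, e.g. `load_tokens(open(path, encoding="utf-8"))`.
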
## Evaluation Results

### Simulated Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m
done
```

```sh
./zipformer/ctc_decode.py \
  --epoch 40 \
  --avg 7 \
  --causal 1 \
  --chunk-size 32 \
  --left-context-frames 128 \
  --exp-dir zipformer/exp-causal \
  --decoding-method ctc-decoding \
  --use-transducer True --use-ctc True
```

The model achieves the following phoneme error rates (PER, %) on each test set:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.71        |  6.58  |
| Modified Beam Search |       7.53        |  6.4   |
| Fast Beam Search     |       7.73        |  6.61  |
| CTC Greedy Search    |       7.78        |  6.72  |

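
For readers unfamiliar with the metric, the following is a minimal sketch of how a phoneme error rate is commonly computed: the Levenshtein distance (substitutions, deletions, insertions) between reference and hypothesis phoneme sequences, summed over utterances and divided by the total number of reference phonemes. This is an assumption about the usual definition, not a copy of the scoring code used for the table above, and the example sequences are made up.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    return 100.0 * total_errors / total_ref


# Made-up example: one substitution (ʃ -> s) and one deletion (final ɑ).
refs = [["w", "ɑ", "ʃ", "i", "ɑ"]]
hyps = [["w", "ɑ", "s", "i"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER: {per:.2f}%")
```
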
### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m \
    --num-decode-streams 1000
done
```

The model achieves the following phoneme error rates (PER, %) on each test set:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.75        |  6.59  |
| Modified Beam Search |       7.57        |  6.37  |
| Fast Beam Search     |       7.72        |  6.44  |

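
As a rough guide to what `--chunk-size 32` and `--left-context-frames 128` mean in wall-clock terms, the arithmetic can be sketched as below. It rests on two assumptions that this card does not state: that acoustic features use a 10 ms frame shift, and that both flags are counted at a 50 Hz frame rate (one frame per 20 ms), as in the upstream icefall Zipformer recipe. The `chunk_length: 64` line in the inference log under Usage is consistent with the first result.

```python
# Values taken from the decoding commands above.
chunk_size = 32            # --chunk-size
left_context_frames = 128  # --left-context-frames

# Assumptions (not stated in this model card):
FRAME_SHIFT_MS = 10   # feature frame shift in milliseconds
SUBSAMPLING = 2       # flags assumed to be counted at 50 Hz, i.e. 2 feature frames each

# Number of 10 ms feature frames consumed per chunk.
chunk_length = chunk_size * SUBSAMPLING
# Audio covered by one chunk, in milliseconds.
chunk_ms = chunk_length * FRAME_SHIFT_MS
# Left context expressed in chunks and in milliseconds.
left_context_chunks = left_context_frames // chunk_size
left_context_ms = left_context_frames * SUBSAMPLING * FRAME_SHIFT_MS

print(f"feature frames per chunk: {chunk_length}")
print(f"audio per chunk: {chunk_ms} ms")
print(f"left context: {left_context_chunks} chunks ({left_context_ms} ms)")
```

Under these assumptions each chunk spans 640 ms of audio, and the model attends to the four preceding chunks.
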
## Usage

### Download Pre-trained Model

```sh
cd egs/bookbot_sw/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-sw/
```

### Inference

To decode with greedy search, run the following from `egs/bookbot_sw/ASR`:

```sh
./zipformer/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-sw/exp-causal/jit_script_chunk_32_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-sw/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
```

<details>
<summary>Decoding Output</summary>

```
2024-03-07 11:07:41,231 INFO [jit_pretrained_streaming.py:184] device: cuda:0
2024-03-07 11:07:41,865 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:207] Decoding started
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:213] T: 77
2024-03-07 11:07:41,876 INFO [jit_pretrained_streaming.py:229] 0/130368
2024-03-07 11:07:41,877 INFO [jit_pretrained_streaming.py:229] 4000/130368
2024-03-07 11:07:41,878 INFO [jit_pretrained_streaming.py:229] 8000/130368
2024-03-07 11:07:41,879 INFO [jit_pretrained_streaming.py:229] 12000/130368
2024-03-07 11:07:42,103 INFO [jit_pretrained_streaming.py:229] 16000/130368
2024-03-07 11:07:42,104 INFO [jit_pretrained_streaming.py:229] 20000/130368
2024-03-07 11:07:42,126 INFO [jit_pretrained_streaming.py:229] 24000/130368
2024-03-07 11:07:42,127 INFO [jit_pretrained_streaming.py:229] 28000/130368
2024-03-07 11:07:42,128 INFO [jit_pretrained_streaming.py:229] 32000/130368
2024-03-07 11:07:42,151 INFO [jit_pretrained_streaming.py:229] 36000/130368
2024-03-07 11:07:42,152 INFO [jit_pretrained_streaming.py:229] 40000/130368
2024-03-07 11:07:42,175 INFO [jit_pretrained_streaming.py:229] 44000/130368
2024-03-07 11:07:42,176 INFO [jit_pretrained_streaming.py:229] 48000/130368
2024-03-07 11:07:42,177 INFO [jit_pretrained_streaming.py:229] 52000/130368
2024-03-07 11:07:42,200 INFO [jit_pretrained_streaming.py:229] 56000/130368
2024-03-07 11:07:42,201 INFO [jit_pretrained_streaming.py:229] 60000/130368
2024-03-07 11:07:42,224 INFO [jit_pretrained_streaming.py:229] 64000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 68000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 72000/130368
2024-03-07 11:07:42,250 INFO [jit_pretrained_streaming.py:229] 76000/130368
2024-03-07 11:07:42,251 INFO [jit_pretrained_streaming.py:229] 80000/130368
2024-03-07 11:07:42,252 INFO [jit_pretrained_streaming.py:229] 84000/130368
2024-03-07 11:07:42,275 INFO [jit_pretrained_streaming.py:229] 88000/130368
2024-03-07 11:07:42,276 INFO [jit_pretrained_streaming.py:229] 92000/130368
2024-03-07 11:07:42,299 INFO [jit_pretrained_streaming.py:229] 96000/130368
2024-03-07 11:07:42,300 INFO [jit_pretrained_streaming.py:229] 100000/130368
2024-03-07 11:07:42,301 INFO [jit_pretrained_streaming.py:229] 104000/130368
2024-03-07 11:07:42,325 INFO [jit_pretrained_streaming.py:229] 108000/130368
2024-03-07 11:07:42,326 INFO [jit_pretrained_streaming.py:229] 112000/130368
2024-03-07 11:07:42,349 INFO [jit_pretrained_streaming.py:229] 116000/130368
2024-03-07 11:07:42,350 INFO [jit_pretrained_streaming.py:229] 120000/130368
2024-03-07 11:07:42,351 INFO [jit_pretrained_streaming.py:229] 124000/130368
2024-03-07 11:07:42,373 INFO [jit_pretrained_streaming.py:229] 128000/130368
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:260] ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:262] Decoding Done
```

</details>

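
The decoded line in the log above is a flat string of IPA symbols. The sketch below shows one way to post-process it, assuming that `|` marks word boundaries (which the output suggests but this card does not state) and that some phonemes, such as the prenasalized `ᵐɓ`, span more than one character. The toy vocabulary passed to `segment` is a hypothetical subset; in practice it would come from `tokens.txt`.

```python
from typing import List, Set

# Decoded hypothesis copied from the log above.
decoded = "ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ"

# Assumption: "|" separates words.
words = decoded.split("|")
print(len(words), "words")


def segment(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match split of a word into phonemes from `vocab`."""
    max_len = max(len(token) for token in vocab)
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so multi-character phonemes win.
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i : i + n]
            if piece in vocab:
                phonemes.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no vocabulary entry matches {word[i:]!r}")
    return phonemes


# Hypothetical vocabulary subset, just enough for the second word.
toy_vocab = {"ɑ", "ᵐɓ", "ɔ"}
print(segment(words[1], toy_vocab))
```

Greedy longest match is a simple heuristic; it is used here only for illustration and may not match how the training data was tokenized.
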
## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH="$(pwd):$PYTHONPATH"
```

### Prepare Data

```sh
cd egs/bookbot_sw/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --num-epochs 40 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --max-duration 800 \
  --use-transducer True --use-ctc True
```

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)