---
language: sw
license: apache-2.0
tags:
- icefall
- phoneme-recognition
- automatic-speech-recognition
datasets:
- bookbot/ALFFA_swahili
- bookbot/fleurs_sw
- bookbot/common_voice_16_1_sw
---
# Pruned Stateless Zipformer RNN-T Streaming Robust SW
Pruned Stateless Zipformer RNN-T Streaming Robust SW is an automatic speech recognition model trained on the following datasets:
- [ALFFA Swahili](https://huggingface.co/datasets/bookbot/ALFFA_swahili)
- [FLEURS Swahili](https://huggingface.co/datasets/bookbot/fleurs_sw)
- [Common Voice 16.1 Swahili](https://huggingface.co/datasets/bookbot/common_voice_16_1_sw)
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
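As a rough illustration of how such a vocabulary maps model outputs back to phonemes, the sketch below parses a symbol table and converts a list of token IDs into a phoneme sequence. It assumes `tokens.txt` follows the usual `<symbol> <id>` one-entry-per-line layout with the blank symbol at ID 0; the sample lines, IDs, and function names are hypothetical and not copied from the real file.

```python
from typing import Dict, Iterable, List


def parse_symbol_table(lines: Iterable[str]) -> Dict[int, str]:
    """Parse `<symbol> <id>` lines into an id -> symbol mapping."""
    id_to_symbol: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last whitespace so the ID is always the final field.
        symbol, index = line.rsplit(maxsplit=1)
        id_to_symbol[int(index)] = symbol
    return id_to_symbol


def ids_to_phonemes(
    ids: Iterable[int], id_to_symbol: Dict[int, str], blank_id: int = 0
) -> List[str]:
    """Map token IDs to phoneme symbols, dropping the blank token."""
    return [id_to_symbol[i] for i in ids if i != blank_id]


# Hypothetical excerpt: the real tokens.txt has more symbols and other IDs.
sample_lines = ["<blk> 0", "w 1", "ɑ 2", "ʃ 3", "i 4"]
table = parse_symbol_table(sample_lines)
phonemes = ids_to_phonemes([1, 2, 0, 3, 4, 2], table)
print(phonemes)
```

To try this against the real vocabulary, pass the lines of the downloaded `tokens.txt` (opened with `encoding="utf-8"`) to `parse_symbol_table` instead of `sample_lines`.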
This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with an NVIDIA H100 GPU. All scripts needed for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tensorboard) logged via TensorBoard.
## Evaluation Results
### Simulated Streaming
```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method "$m"
done
```
```sh
./zipformer/ctc_decode.py \
  --epoch 40 \
  --avg 7 \
  --causal 1 \
  --chunk-size 32 \
  --left-context-frames 128 \
  --exp-dir zipformer/exp-causal \
  --decoding-method ctc-decoding \
  --use-transducer True --use-ctc True
```
With simulated streaming, the model achieves the following phoneme error rates (PER, in %) on the test sets:
| Decoding | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search | 7.71 | 6.58 |
| Modified Beam Search | 7.53 | 6.4 |
| Fast Beam Search | 7.73 | 6.61 |
| CTC Greedy Search | 7.78 | 6.72 |
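For readers unfamiliar with the metric, a phoneme error rate is commonly defined as the Levenshtein edit distance (substitutions, insertions, and deletions) between the predicted and reference phoneme sequences, divided by the total number of reference phonemes. The sketch below shows that definition in plain Python. It is not the scoring code used by icefall, which may differ in details, and the function names and example sequences are made up for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens so far
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: List[Sequence[str]], hyps: List[Sequence[str]]
) -> float:
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


# Made-up example: one substitution (ʃ -> s) and one deletion (final ɑ).
refs = [["w", "ɑ", "ʃ", "i", "ɑ"]]
hyps = [["w", "ɑ", "s", "i"]]
print(f"PER: {phoneme_error_rate(refs, hyps):.2f}%")
```

Here two edits against five reference phonemes give a PER of 40%. Summing errors over the whole test set before dividing, rather than averaging per-utterance rates, keeps long and short utterances weighted by their length.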
### Chunk-wise Streaming
```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method "$m" \
    --num-decode-streams 1000
done
```
With chunk-wise streaming, the model achieves the following phoneme error rates (PER, in %) on the test sets:
| Decoding | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search | 7.75 | 6.59 |
| Modified Beam Search | 7.57 | 6.37 |
| Fast Beam Search | 7.72 | 6.44 |
## Usage
### Download Pre-trained Model
```sh
cd egs/bookbot_sw/ASR
mkdir -p tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-sw/
```
### Inference
To decode with greedy search, run:
```sh
./zipformer/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-sw/exp-causal/jit_script_chunk_32_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-sw/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
```
<details>
<summary>Decoding Output</summary>
```
2024-03-07 11:07:41,231 INFO [jit_pretrained_streaming.py:184] device: cuda:0
2024-03-07 11:07:41,865 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:207] Decoding started
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:213] T: 77
2024-03-07 11:07:41,876 INFO [jit_pretrained_streaming.py:229] 0/130368
2024-03-07 11:07:41,877 INFO [jit_pretrained_streaming.py:229] 4000/130368
2024-03-07 11:07:41,878 INFO [jit_pretrained_streaming.py:229] 8000/130368
2024-03-07 11:07:41,879 INFO [jit_pretrained_streaming.py:229] 12000/130368
2024-03-07 11:07:42,103 INFO [jit_pretrained_streaming.py:229] 16000/130368
2024-03-07 11:07:42,104 INFO [jit_pretrained_streaming.py:229] 20000/130368
2024-03-07 11:07:42,126 INFO [jit_pretrained_streaming.py:229] 24000/130368
2024-03-07 11:07:42,127 INFO [jit_pretrained_streaming.py:229] 28000/130368
2024-03-07 11:07:42,128 INFO [jit_pretrained_streaming.py:229] 32000/130368
2024-03-07 11:07:42,151 INFO [jit_pretrained_streaming.py:229] 36000/130368
2024-03-07 11:07:42,152 INFO [jit_pretrained_streaming.py:229] 40000/130368
2024-03-07 11:07:42,175 INFO [jit_pretrained_streaming.py:229] 44000/130368
2024-03-07 11:07:42,176 INFO [jit_pretrained_streaming.py:229] 48000/130368
2024-03-07 11:07:42,177 INFO [jit_pretrained_streaming.py:229] 52000/130368
2024-03-07 11:07:42,200 INFO [jit_pretrained_streaming.py:229] 56000/130368
2024-03-07 11:07:42,201 INFO [jit_pretrained_streaming.py:229] 60000/130368
2024-03-07 11:07:42,224 INFO [jit_pretrained_streaming.py:229] 64000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 68000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 72000/130368
2024-03-07 11:07:42,250 INFO [jit_pretrained_streaming.py:229] 76000/130368
2024-03-07 11:07:42,251 INFO [jit_pretrained_streaming.py:229] 80000/130368
2024-03-07 11:07:42,252 INFO [jit_pretrained_streaming.py:229] 84000/130368
2024-03-07 11:07:42,275 INFO [jit_pretrained_streaming.py:229] 88000/130368
2024-03-07 11:07:42,276 INFO [jit_pretrained_streaming.py:229] 92000/130368
2024-03-07 11:07:42,299 INFO [jit_pretrained_streaming.py:229] 96000/130368
2024-03-07 11:07:42,300 INFO [jit_pretrained_streaming.py:229] 100000/130368
2024-03-07 11:07:42,301 INFO [jit_pretrained_streaming.py:229] 104000/130368
2024-03-07 11:07:42,325 INFO [jit_pretrained_streaming.py:229] 108000/130368
2024-03-07 11:07:42,326 INFO [jit_pretrained_streaming.py:229] 112000/130368
2024-03-07 11:07:42,349 INFO [jit_pretrained_streaming.py:229] 116000/130368
2024-03-07 11:07:42,350 INFO [jit_pretrained_streaming.py:229] 120000/130368
2024-03-07 11:07:42,351 INFO [jit_pretrained_streaming.py:229] 124000/130368
2024-03-07 11:07:42,373 INFO [jit_pretrained_streaming.py:229] 128000/130368
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:260] ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:262] Decoding Done
```
</details>
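The numbers in the decoding log can be related to the streaming settings with a little arithmetic. The sketch below is only a sanity check under stated assumptions: 16 kHz audio, a 10 ms feature frame shift, and encoder frames at 50 Hz (20 ms each) for `--chunk-size` and `--left-context-frames`. None of these rates are stated in this card; the sample counts, `chunk_length`, and the 4000-sample progress step are taken from the log.

```python
import math

# Assumptions (not stated in this card):
SAMPLE_RATE = 16_000     # audio sample rate in Hz
FEATURE_FRAME_MS = 10    # fbank frame shift in milliseconds
ENCODER_FRAME_MS = 20    # 50 Hz frame rate for chunk-size / left-context

# Values read from the decoding log and the commands in this card:
num_samples = 125_568      # torch.Size([125568])
padded_samples = 130_368   # denominator in the progress lines
chunk_length_frames = 64   # "chunk_length: 64"
read_step = 4_000          # samples between progress lines
chunk_size = 32            # --chunk-size
left_context_frames = 128  # --left-context-frames

duration_s = num_samples / SAMPLE_RATE
padding_s = (padded_samples - num_samples) / SAMPLE_RATE
chunk_ms = chunk_length_frames * FEATURE_FRAME_MS
encoder_chunk_ms = chunk_size * ENCODER_FRAME_MS
left_context_s = left_context_frames * ENCODER_FRAME_MS / 1000
num_reads = math.ceil(padded_samples / read_step)

print(f"audio duration:  {duration_s:.3f} s")
print(f"tail padding:    {padding_s:.3f} s")
print(f"chunk (features): {chunk_ms} ms")
print(f"chunk (encoder):  {encoder_chunk_ms} ms")
print(f"left context:    {left_context_s:.2f} s")
print(f"progress lines:  {num_reads}")
```

Under these assumptions the sample clip is about 7.85 s long with 0.3 s of padding appended, each decoded chunk covers 640 ms of audio whether counted in feature frames or encoder frames, and the model can attend to roughly 2.56 s of left context. The 33 reads of 4000 samples match the number of progress lines in the log.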
## Training procedure
### Install icefall
```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH="$(pwd):$PYTHONPATH"
```
### Prepare Data
```sh
cd egs/bookbot_sw/ASR
./prepare.sh
```
### Train
```sh
export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --num-epochs 40 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --max-duration 800 \
  --use-transducer True --use-ctc True
```
## Frameworks
- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)