---
language: sw
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/ALFFA_swahili
  - bookbot/fleurs_sw
  - bookbot/common_voice_16_1_sw
---

# Pruned Stateless Zipformer RNN-T Streaming Robust SW

Pruned Stateless Zipformer RNN-T Streaming Robust SW is an automatic speech recognition model trained on the following datasets:

- [ALFFA Swahili](https://huggingface.co/datasets/bookbot/ALFFA_swahili)
- [FLEURS Swahili](https://huggingface.co/datasets/bookbot/fleurs_sw)
- [Common Voice 16.1 Swahili](https://huggingface.co/datasets/bookbot/common_voice_16_1_sw)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. The model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/blob/main/data/lang_phone/tokens.txt) therefore contains the IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
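As a sketch of how such a phoneme vocabulary is consumed, the snippet below maps predicted token IDs back to IPA symbols. It assumes the common icefall `tokens.txt` layout of one `<token> <id>` pair per line; the token values shown are a hypothetical excerpt, so check the linked vocabulary file for the real table.

```python
# Sketch: load a phoneme token table and map model output IDs back to IPA
# symbols. Assumes one "<token> <id>" pair per line, as in typical icefall
# lang_phone/tokens.txt files -- verify against the linked file.

def load_tokens(lines):
    """Parse token/id pairs into an id -> token lookup."""
    id2token = {}
    for line in lines:
        token, idx = line.split()
        id2token[int(idx)] = token
    return id2token

def ids_to_phonemes(ids, id2token):
    """Convert a sequence of predicted token IDs into IPA phonemes."""
    return [id2token[i] for i in ids]

if __name__ == "__main__":
    # Hypothetical excerpt; the real table lives in data/lang_phone/tokens.txt.
    table = load_tokens(["<blk> 0", "w 1", "ɑ 2", "ʃ 3", "i 4"])
    print(ids_to_phonemes([1, 2, 3, 4, 2], table))  # ['w', 'ɑ', 'ʃ', 'i', 'ɑ']
```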

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with an NVIDIA H100 GPU. All necessary training scripts can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tensorboard) logged via TensorBoard.

## Evaluation Results

### Simulated Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m
done
```

```sh
./zipformer/ctc_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --decoding-method ctc-decoding \
    --use-transducer True --use-ctc True
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.71        |  6.58  |
| Modified Beam Search |       7.53        |  6.40  |
| Fast Beam Search     |       7.73        |  6.61  |
| CTC Greedy Search    |       7.78        |  6.72  |
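Phoneme error rate is computed analogously to word error rate, just over phoneme tokens: the Levenshtein edit distance between the hypothesis and reference phoneme sequences, divided by the reference length. A minimal illustration of that metric (a sketch, not the icefall scoring script):

```python
# Minimal phoneme error rate (PER): Levenshtein edit distance between the
# hypothesis and reference phoneme sequences, divided by reference length.
# For illustration only; icefall's own scoring code is more featureful.

def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over token lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (or match)
            prev = cur
    return dp[-1]

def per(ref, hyp):
    """Phoneme error rate: edits per reference phoneme."""
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    ref = ["w", "ɑ", "ʃ", "i", "ɑ"]
    hyp = ["w", "ɑ", "s", "i"]  # one substitution, one deletion
    print(per(ref, hyp))  # 0.4
```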

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m \
    --num-decode-streams 1000
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.75        |  6.59  |
| Modified Beam Search |       7.57        |  6.37  |
| Fast Beam Search     |       7.72        |  6.44  |

## Usage

### Download Pre-trained Model

```sh
cd egs/bookbot_sw/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-sw/
```

### Inference

To decode with greedy search, run:

```sh
./zipformer/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-sw/exp-causal/jit_script_chunk_32_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-sw/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
```

<details>
<summary>Decoding Output</summary>

```
2024-03-07 11:07:41,231 INFO [jit_pretrained_streaming.py:184] device: cuda:0
2024-03-07 11:07:41,865 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:207] Decoding started
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:213] T: 77
2024-03-07 11:07:41,876 INFO [jit_pretrained_streaming.py:229] 0/130368
2024-03-07 11:07:41,877 INFO [jit_pretrained_streaming.py:229] 4000/130368
2024-03-07 11:07:41,878 INFO [jit_pretrained_streaming.py:229] 8000/130368
2024-03-07 11:07:41,879 INFO [jit_pretrained_streaming.py:229] 12000/130368
2024-03-07 11:07:42,103 INFO [jit_pretrained_streaming.py:229] 16000/130368
2024-03-07 11:07:42,104 INFO [jit_pretrained_streaming.py:229] 20000/130368
2024-03-07 11:07:42,126 INFO [jit_pretrained_streaming.py:229] 24000/130368
2024-03-07 11:07:42,127 INFO [jit_pretrained_streaming.py:229] 28000/130368
2024-03-07 11:07:42,128 INFO [jit_pretrained_streaming.py:229] 32000/130368
2024-03-07 11:07:42,151 INFO [jit_pretrained_streaming.py:229] 36000/130368
2024-03-07 11:07:42,152 INFO [jit_pretrained_streaming.py:229] 40000/130368
2024-03-07 11:07:42,175 INFO [jit_pretrained_streaming.py:229] 44000/130368
2024-03-07 11:07:42,176 INFO [jit_pretrained_streaming.py:229] 48000/130368
2024-03-07 11:07:42,177 INFO [jit_pretrained_streaming.py:229] 52000/130368
2024-03-07 11:07:42,200 INFO [jit_pretrained_streaming.py:229] 56000/130368
2024-03-07 11:07:42,201 INFO [jit_pretrained_streaming.py:229] 60000/130368
2024-03-07 11:07:42,224 INFO [jit_pretrained_streaming.py:229] 64000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 68000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 72000/130368
2024-03-07 11:07:42,250 INFO [jit_pretrained_streaming.py:229] 76000/130368
2024-03-07 11:07:42,251 INFO [jit_pretrained_streaming.py:229] 80000/130368
2024-03-07 11:07:42,252 INFO [jit_pretrained_streaming.py:229] 84000/130368
2024-03-07 11:07:42,275 INFO [jit_pretrained_streaming.py:229] 88000/130368
2024-03-07 11:07:42,276 INFO [jit_pretrained_streaming.py:229] 92000/130368
2024-03-07 11:07:42,299 INFO [jit_pretrained_streaming.py:229] 96000/130368
2024-03-07 11:07:42,300 INFO [jit_pretrained_streaming.py:229] 100000/130368
2024-03-07 11:07:42,301 INFO [jit_pretrained_streaming.py:229] 104000/130368
2024-03-07 11:07:42,325 INFO [jit_pretrained_streaming.py:229] 108000/130368
2024-03-07 11:07:42,326 INFO [jit_pretrained_streaming.py:229] 112000/130368
2024-03-07 11:07:42,349 INFO [jit_pretrained_streaming.py:229] 116000/130368
2024-03-07 11:07:42,350 INFO [jit_pretrained_streaming.py:229] 120000/130368
2024-03-07 11:07:42,351 INFO [jit_pretrained_streaming.py:229] 124000/130368
2024-03-07 11:07:42,373 INFO [jit_pretrained_streaming.py:229] 128000/130368
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:260] ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:262] Decoding Done
```

</details>

## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```

### Prepare Data

```sh
cd egs/bookbot_sw/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --num-epochs 40 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --max-duration 800 \
  --use-transducer True --use-ctc True
```

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)