---
language: sw
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/ALFFA_swahili
  - bookbot/fleurs_sw
  - bookbot/common_voice_16_1_sw
---

# Pruned Stateless Zipformer RNN-T Streaming Robust SW

Pruned Stateless Zipformer RNN-T Streaming Robust SW is an automatic speech recognition model trained on the following datasets:

- [ALFFA Swahili](https://huggingface.co/datasets/bookbot/ALFFA_swahili)
- [FLEURS Swahili](https://huggingface.co/datasets/bookbot/fleurs_sw)
- [Common Voice 16.1 Swahili](https://huggingface.co/datasets/bookbot/common_voice_16_1_sw)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. The model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/blob/main/data/lang_phone/tokens.txt) therefore contains the IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
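As a sketch of how such a phoneme vocabulary is consumed, the snippet below maps predicted token IDs back to IPA symbols. It assumes the common icefall `tokens.txt` layout of one `<token> <id>` pair per line; the token values shown are a hypothetical excerpt, so check the linked vocabulary file for the real table.

```python
# Sketch: load a phoneme token table and map model output IDs back to IPA
# symbols. Assumes one "<token> <id>" pair per line, as in typical icefall
# lang_phone/tokens.txt files -- verify against the linked file.

def load_tokens(lines):
    """Parse token/id pairs into an id -> token lookup."""
    id2token = {}
    for line in lines:
        token, idx = line.split()
        id2token[int(idx)] = token
    return id2token

def ids_to_phonemes(ids, id2token):
    """Convert a sequence of predicted token IDs into IPA phonemes."""
    return [id2token[i] for i in ids]

if __name__ == "__main__":
    # Hypothetical excerpt; the real table lives in data/lang_phone/tokens.txt.
    table = load_tokens(["<blk> 0", "w 1", "ɑ 2", "ʃ 3", "i 4"])
    print(ids_to_phonemes([1, 2, 3, 4, 2], table))  # ['w', 'ɑ', 'ʃ', 'i', 'ɑ']
```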

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with an NVIDIA H100 GPU. All necessary training scripts can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tensorboard) logged via TensorBoard.

## Evaluation Results

### Simulated Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m
done
```

```sh
./zipformer/ctc_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --decoding-method ctc-decoding \
    --use-transducer True --use-ctc True
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.71        |  6.58  |
| Modified Beam Search |       7.53        |  6.40  |
| Fast Beam Search     |       7.73        |  6.61  |
| CTC Greedy Search    |       7.78        |  6.72  |
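Phoneme error rate is computed analogously to word error rate, just over phoneme tokens: the Levenshtein edit distance between the hypothesis and reference phoneme sequences, divided by the reference length. A minimal illustration of that metric (a sketch, not the icefall scoring script):

```python
# Minimal phoneme error rate (PER): Levenshtein edit distance between the
# hypothesis and reference phoneme sequences, divided by reference length.
# For illustration only; icefall's own scoring code is more featureful.

def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over token lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (or match)
            prev = cur
    return dp[-1]

def per(ref, hyp):
    """Phoneme error rate: edits per reference phoneme."""
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    ref = ["w", "ɑ", "ʃ", "i", "ɑ"]
    hyp = ["w", "ɑ", "s", "i"]  # one substitution, one deletion
    print(per(ref, hyp))  # 0.4
```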

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m \
    --num-decode-streams 1000
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.75        |  6.59  |
| Modified Beam Search |       7.57        |  6.37  |
| Fast Beam Search     |       7.72        |  6.44  |

## Usage

### Download Pre-trained Model

```sh
cd egs/bookbot_sw/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-sw/
```

### Inference

To decode with greedy search, run:

```sh
./zipformer/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-sw/exp-causal/jit_script_chunk_32_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-sw/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
```

<details>
<summary>Decoding Output</summary>

```
2024-03-07 11:07:41,231 INFO [jit_pretrained_streaming.py:184] device: cuda:0
2024-03-07 11:07:41,865 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:207] Decoding started
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:213] T: 77
2024-03-07 11:07:41,876 INFO [jit_pretrained_streaming.py:229] 0/130368
2024-03-07 11:07:41,877 INFO [jit_pretrained_streaming.py:229] 4000/130368
2024-03-07 11:07:41,878 INFO [jit_pretrained_streaming.py:229] 8000/130368
2024-03-07 11:07:41,879 INFO [jit_pretrained_streaming.py:229] 12000/130368
2024-03-07 11:07:42,103 INFO [jit_pretrained_streaming.py:229] 16000/130368
2024-03-07 11:07:42,104 INFO [jit_pretrained_streaming.py:229] 20000/130368
2024-03-07 11:07:42,126 INFO [jit_pretrained_streaming.py:229] 24000/130368
2024-03-07 11:07:42,127 INFO [jit_pretrained_streaming.py:229] 28000/130368
2024-03-07 11:07:42,128 INFO [jit_pretrained_streaming.py:229] 32000/130368
2024-03-07 11:07:42,151 INFO [jit_pretrained_streaming.py:229] 36000/130368
2024-03-07 11:07:42,152 INFO [jit_pretrained_streaming.py:229] 40000/130368
2024-03-07 11:07:42,175 INFO [jit_pretrained_streaming.py:229] 44000/130368
2024-03-07 11:07:42,176 INFO [jit_pretrained_streaming.py:229] 48000/130368
2024-03-07 11:07:42,177 INFO [jit_pretrained_streaming.py:229] 52000/130368
2024-03-07 11:07:42,200 INFO [jit_pretrained_streaming.py:229] 56000/130368
2024-03-07 11:07:42,201 INFO [jit_pretrained_streaming.py:229] 60000/130368
2024-03-07 11:07:42,224 INFO [jit_pretrained_streaming.py:229] 64000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 68000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 72000/130368
2024-03-07 11:07:42,250 INFO [jit_pretrained_streaming.py:229] 76000/130368
2024-03-07 11:07:42,251 INFO [jit_pretrained_streaming.py:229] 80000/130368
2024-03-07 11:07:42,252 INFO [jit_pretrained_streaming.py:229] 84000/130368
2024-03-07 11:07:42,275 INFO [jit_pretrained_streaming.py:229] 88000/130368
2024-03-07 11:07:42,276 INFO [jit_pretrained_streaming.py:229] 92000/130368
2024-03-07 11:07:42,299 INFO [jit_pretrained_streaming.py:229] 96000/130368
2024-03-07 11:07:42,300 INFO [jit_pretrained_streaming.py:229] 100000/130368
2024-03-07 11:07:42,301 INFO [jit_pretrained_streaming.py:229] 104000/130368
2024-03-07 11:07:42,325 INFO [jit_pretrained_streaming.py:229] 108000/130368
2024-03-07 11:07:42,326 INFO [jit_pretrained_streaming.py:229] 112000/130368
2024-03-07 11:07:42,349 INFO [jit_pretrained_streaming.py:229] 116000/130368
2024-03-07 11:07:42,350 INFO [jit_pretrained_streaming.py:229] 120000/130368
2024-03-07 11:07:42,351 INFO [jit_pretrained_streaming.py:229] 124000/130368
2024-03-07 11:07:42,373 INFO [jit_pretrained_streaming.py:229] 128000/130368
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:260] ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:262] Decoding Done
```

</details>

## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```

### Prepare Data

```sh
cd egs/bookbot_sw/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --num-epochs 40 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --max-duration 800 \
  --use-transducer True --use-ctc True
```

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)