	---
	language: sw
	license: apache-2.0
	tags:
	- icefall
	- phoneme-recognition
	- automatic-speech-recognition
	datasets:
	- bookbot/ALFFA_swahili
	- bookbot/fleurs_sw
	- bookbot/common_voice_16_1_sw
	---

	# Pruned Stateless Zipformer RNN-T Streaming Robust SW

	Pruned Stateless Zipformer RNN-T Streaming Robust SW is an automatic speech recognition model trained on the following datasets:

	- [ALFFA Swahili](https://huggingface.co/datasets/bookbot/ALFFA_swahili)
	- [FLEURS Swahili](https://huggingface.co/datasets/bookbot/fleurs_sw)
	- [Common Voice 16.1 Swahili](https://huggingface.co/datasets/bookbot/common_voice_16_1_sw)

	Instead of predicting sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. The model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/blob/main/data/lang_phone/tokens.txt) therefore contains the IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
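
As a rough illustration of how predicted token IDs map back to phonemes, the sketch below parses a vocabulary in the `symbol id` per-line layout that icefall token tables normally use. The sample lines and IDs are hypothetical and are not copied from this model's `tokens.txt`.

```python
from typing import Dict, List


def load_tokens(lines: List[str]) -> Dict[int, str]:
    """Parse 'symbol id' lines into an id -> symbol mapping."""
    id2sym: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: symbol first, integer ID last.
        sym, idx = line.rsplit(maxsplit=1)
        id2sym[int(idx)] = sym
    return id2sym


# Hypothetical excerpt; the real IDs live in data/lang_phone/tokens.txt.
SAMPLE_LINES = ["<blk> 0", "w 1", "ɑ 2", "ʃ 3", "i 4"]

id2sym = load_tokens(SAMPLE_LINES)
phonemes = [id2sym[i] for i in [1, 2, 3, 4, 2]]
print(phonemes)
```

To use the real vocabulary, pass the lines of the downloaded `tokens.txt` (opened with `encoding="utf-8"`) instead of `SAMPLE_LINES`.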

	This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with an NVIDIA H100 GPU. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tensorboard) logged via TensorBoard.

	## Evaluation Results

	### Simulated Streaming

	```sh
	# Decode with each transducer search method in turn.
	for m in greedy_search fast_beam_search modified_beam_search; do
	  ./zipformer/decode.py \
	    --epoch 40 \
	    --avg 7 \
	    --causal 1 \
	    --chunk-size 32 \
	    --left-context-frames 128 \
	    --exp-dir zipformer/exp-causal \
	    --use-transducer True --use-ctc True \
	    --decoding-method "$m"
	done
	```

	```sh
	./zipformer/ctc_decode.py \
	--epoch 40 \
	--avg 7 \
	--causal 1 \
	--chunk-size 32 \
	--left-context-frames 128 \
	--exp-dir zipformer/exp-causal \
	--decoding-method ctc-decoding \
	--use-transducer True --use-ctc True
	```

	The model achieves the following phoneme error rates on the different test sets:

	| Decoding             | Common Voice 16.1 | FLEURS |
	| -------------------- | :---------------: | :----: |
	| Greedy Search        |       7.71        |  6.58  |
	| Modified Beam Search |       7.53        |  6.4   |
	| Fast Beam Search     |       7.73        |  6.61  |
	| CTC Greedy Search    |       7.78        |  6.72  |
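
For reference, a phoneme error rate is conventionally the edit distance between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. The sketch below implements that textbook definition. It is not icefall's scoring code, so the exact figures in the table may be computed with slightly different bookkeeping.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution (or match)
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER in percent: total errors / total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


# Illustrative pair with one substitution in five phonemes.
refs = [["w", "ɑ", "ʃ", "i", "ɑ"]]
hyps = [["w", "ɑ", "s", "i", "ɑ"]]
print(phoneme_error_rate(refs, hyps))
```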

	### Chunk-wise Streaming

	```sh
	# Chunk-wise streaming decode with each transducer search method in turn.
	for m in greedy_search fast_beam_search modified_beam_search; do
	  ./zipformer/streaming_decode.py \
	    --epoch 40 \
	    --avg 7 \
	    --causal 1 \
	    --chunk-size 32 \
	    --left-context-frames 128 \
	    --exp-dir zipformer/exp-causal \
	    --use-transducer True --use-ctc True \
	    --decoding-method "$m" \
	    --num-decode-streams 1000
	done
	```
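
To get a feel for what `--chunk-size 32` and `--left-context-frames 128` mean in wall-clock terms, the sketch below converts frame counts to seconds. It assumes both values are counted at a 50 Hz frame rate (20 ms per frame), which is how the icefall Zipformer recipe documents them. That rate is an assumption here, not something stated in this README.

```python
FRAME_MS = 20  # assumed: one frame at 50 Hz lasts 20 ms


def frames_to_seconds(frames: int, frame_ms: int = FRAME_MS) -> float:
    """Convert a frame count to seconds under the assumed frame duration."""
    return frames * frame_ms / 1000


chunk_seconds = frames_to_seconds(32)          # --chunk-size 32
left_context_seconds = frames_to_seconds(128)  # --left-context-frames 128

print(f"chunk: {chunk_seconds:.2f} s, left context: {left_context_seconds:.2f} s")
```

Under that assumption, each chunk covers about 0.64 s of audio and the encoder can look back over about 2.56 s of history.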

	The model achieves the following phoneme error rates on the different test sets:

	| Decoding             | Common Voice 16.1 | FLEURS |
	| -------------------- | :---------------: | :----: |
	| Greedy Search        |       7.75        |  6.59  |
	| Modified Beam Search |       7.57        |  6.37  |
	| Fast Beam Search     |       7.72        |  6.44  |

	## Usage

	### Download Pre-trained Model

	```sh
	cd egs/bookbot_sw/ASR
	mkdir tmp
	cd tmp
	git lfs install
	git clone https://huggingface.co/bookbot/zipformer-streaming-robust-sw/
	```

	### Inference

	To decode with greedy search, run the following from `egs/bookbot_sw/ASR` (the parent of the `tmp` directory created above):

	```sh
	./zipformer/jit_pretrained_streaming.py \
	--nn-model-filename ./tmp/zipformer-streaming-robust-sw/exp-causal/jit_script_chunk_32_left_128.pt \
	--tokens ./tmp/zipformer-streaming-robust-sw/data/lang_phone/tokens.txt \
	./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
	```

	<details>
	<summary>Decoding Output</summary>

	```
	2024-03-07 11:07:41,231 INFO [jit_pretrained_streaming.py:184] device: cuda:0
	2024-03-07 11:07:41,865 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
	2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
	2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
	2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:207] Decoding started
	2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
	2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:213] T: 77
	2024-03-07 11:07:41,876 INFO [jit_pretrained_streaming.py:229] 0/130368
	2024-03-07 11:07:41,877 INFO [jit_pretrained_streaming.py:229] 4000/130368
	2024-03-07 11:07:41,878 INFO [jit_pretrained_streaming.py:229] 8000/130368
	2024-03-07 11:07:41,879 INFO [jit_pretrained_streaming.py:229] 12000/130368
	2024-03-07 11:07:42,103 INFO [jit_pretrained_streaming.py:229] 16000/130368
	2024-03-07 11:07:42,104 INFO [jit_pretrained_streaming.py:229] 20000/130368
	2024-03-07 11:07:42,126 INFO [jit_pretrained_streaming.py:229] 24000/130368
	2024-03-07 11:07:42,127 INFO [jit_pretrained_streaming.py:229] 28000/130368
	2024-03-07 11:07:42,128 INFO [jit_pretrained_streaming.py:229] 32000/130368
	2024-03-07 11:07:42,151 INFO [jit_pretrained_streaming.py:229] 36000/130368
	2024-03-07 11:07:42,152 INFO [jit_pretrained_streaming.py:229] 40000/130368
	2024-03-07 11:07:42,175 INFO [jit_pretrained_streaming.py:229] 44000/130368
	2024-03-07 11:07:42,176 INFO [jit_pretrained_streaming.py:229] 48000/130368
	2024-03-07 11:07:42,177 INFO [jit_pretrained_streaming.py:229] 52000/130368
	2024-03-07 11:07:42,200 INFO [jit_pretrained_streaming.py:229] 56000/130368
	2024-03-07 11:07:42,201 INFO [jit_pretrained_streaming.py:229] 60000/130368
	2024-03-07 11:07:42,224 INFO [jit_pretrained_streaming.py:229] 64000/130368
	2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 68000/130368
	2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 72000/130368
	2024-03-07 11:07:42,250 INFO [jit_pretrained_streaming.py:229] 76000/130368
	2024-03-07 11:07:42,251 INFO [jit_pretrained_streaming.py:229] 80000/130368
	2024-03-07 11:07:42,252 INFO [jit_pretrained_streaming.py:229] 84000/130368
	2024-03-07 11:07:42,275 INFO [jit_pretrained_streaming.py:229] 88000/130368
	2024-03-07 11:07:42,276 INFO [jit_pretrained_streaming.py:229] 92000/130368
	2024-03-07 11:07:42,299 INFO [jit_pretrained_streaming.py:229] 96000/130368
	2024-03-07 11:07:42,300 INFO [jit_pretrained_streaming.py:229] 100000/130368
	2024-03-07 11:07:42,301 INFO [jit_pretrained_streaming.py:229] 104000/130368
	2024-03-07 11:07:42,325 INFO [jit_pretrained_streaming.py:229] 108000/130368
	2024-03-07 11:07:42,326 INFO [jit_pretrained_streaming.py:229] 112000/130368
	2024-03-07 11:07:42,349 INFO [jit_pretrained_streaming.py:229] 116000/130368
	2024-03-07 11:07:42,350 INFO [jit_pretrained_streaming.py:229] 120000/130368
	2024-03-07 11:07:42,351 INFO [jit_pretrained_streaming.py:229] 124000/130368
	2024-03-07 11:07:42,373 INFO [jit_pretrained_streaming.py:229] 128000/130368
	2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
	2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:260] ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ
	2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:262] Decoding Done
	```

	</details>
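
The decoded line is a single string of IPA symbols. Assuming `|` marks word boundaries (this is how the output reads, though the README does not state it), a minimal post-processing sketch could split it into per-word phoneme strings:

```python
# Decoded string taken from the log line above.
decoded = (
    "ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|"
    "kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ"
)

# Assumption: "|" separates words; drop empty pieces from stray delimiters.
words = [w for w in decoded.split("|") if w]

for index, word in enumerate(words, start=1):
    print(index, word)
```

Splitting each word further into individual phonemes would need the model's token list, since some tokens (for example prenasalized consonants) may span more than one Unicode character.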

	## Training procedure

	### Install icefall

	```sh
	git clone https://github.com/bookbot-hive/icefall
	cd icefall
	export PYTHONPATH="$(pwd):$PYTHONPATH"
	```

	### Prepare Data

	```sh
	cd egs/bookbot_sw/ASR
	./prepare.sh
	```

	### Train

	```sh
	export CUDA_VISIBLE_DEVICES="0"
	./zipformer/train.py \
	--num-epochs 40 \
	--use-fp16 1 \
	--exp-dir zipformer/exp-causal \
	--causal 1 \
	--max-duration 800 \
	--use-transducer True --use-ctc True
	```

	## Frameworks

	- [k2](https://github.com/k2-fsa/k2)
	- [icefall](https://github.com/bookbot-hive/icefall)
	- [lhotse](https://github.com/bookbot-hive/lhotse)