|
--- |
|
tags: |
|
- speech-recognition |
|
- ASR |
|
- k2 |
|
- sherpa |
|
- PyTorch |
|
license: cc-by-4.0 |
|
library_name: icefall |
|
datasets: |
|
- librispeech |
|
inference: false |
|
--- |
|
|
|
|
|
|
|
First, create and activate your own virtualenv.
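A minimal sketch using Python's built-in `venv` module (the path below is a hypothetical example; use any location you like):

```
python3 -m venv /speech/hasan/software/icefall-venv   # hypothetical path
source /speech/hasan/software/icefall-venv/bin/activate
pip install --upgrade pip
```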
|
|
|
# Install CUDA and cuDNN |
|
|
|
0. Run the following command and note the `CUDA Version` it reports:

```
nvidia-smi | head -n 4
```

Install a CUDA toolkit version that is less than or equal to that `CUDA Version`.
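For reference, the first lines of `nvidia-smi` output look roughly like this (illustrative values, not captured from this machine); the field to check is `CUDA Version`:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02    Driver Version: 530.30.02    CUDA Version: 12.1     |
+-----------------------------------------------------------------------------+
```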
|
|
|
1. Install CUDA (I am installing CUDA 12.1) |
|
``` |
|
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run |
|
``` |
|
``` |
|
chmod +x cuda_12.1.0_530.30.02_linux.run |
|
``` |
|
(change `--installpath` to a directory you own)
|
``` |
|
./cuda_12.1.0_530.30.02_linux.run \ |
|
--silent \ |
|
--toolkit \ |
|
--installpath=/speech/hasan/software/cuda-12.1.0 \ |
|
--no-opengl-libs \ |
|
--no-drm \ |
|
--no-man-page |
|
``` |
|
|
|
## Install cuDNN for CUDA 12.1 |
|
``` |
|
wget https://huggingface.co/csukuangfj/cudnn/resolve/main/cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz |
|
``` |
|
``` |
|
tar xvf cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz --strip-components=1 -C /speech/hasan/software/cuda-12.1.0 |
|
``` |
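To sanity-check the cuDNN install (assuming the same install path used above), confirm that the headers and libraries landed inside the toolkit directory:

```
ls /speech/hasan/software/cuda-12.1.0/lib/libcudnn*
ls /speech/hasan/software/cuda-12.1.0/include/cudnn*.h
```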
|
|
|
Create a file `activate-cuda-12.1.sh`, copy the following code into it, and then run `source activate-cuda-12.1.sh`
|
``` |
|
export CUDA_HOME=/speech/hasan/software/cuda-12.1.0 |
|
export PATH=$CUDA_HOME/bin:$PATH |
|
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH |
|
export LD_LIBRARY_PATH=$CUDA_HOME/lib:$LD_LIBRARY_PATH |
|
export LD_LIBRARY_PATH=$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH |
|
export CUDAToolkit_ROOT_DIR=$CUDA_HOME |
|
export CUDAToolkit_ROOT=$CUDA_HOME |
|
|
|
export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME |
|
export CUDA_TOOLKIT_ROOT=$CUDA_HOME |
|
export CUDA_BIN_PATH=$CUDA_HOME |
|
export CUDA_PATH=$CUDA_HOME |
|
export CUDA_INC_PATH=$CUDA_HOME/targets/x86_64-linux |
|
export CFLAGS=-I$CUDA_HOME/targets/x86_64-linux/include:$CFLAGS |
|
export CUDAToolkit_TARGET_DIR=$CUDA_HOME/targets/x86_64-linux |
|
``` |
|
|
|
Check your installation by running: |
|
``` |
|
which nvcc |
|
``` |
|
Desired output: |
|
``` |
|
/speech/hasan/software/cuda-12.1.0/bin/nvcc |
|
``` |
|
``` |
|
nvcc --version |
|
``` |
|
Desired output: |
|
``` |
|
nvcc: NVIDIA (R) Cuda compiler driver |
|
Copyright (c) 2005-2023 NVIDIA Corporation |
|
Built on Tue_Feb__7_19:32:13_PST_2023 |
|
Cuda compilation tools, release 12.1, V12.1.66 |
|
Build cuda_12.1.r12.1/compiler.32415258_0 |
|
``` |
|
|
|
[Reference](https://k2-fsa.github.io/k2/installation/cuda-cudnn.html) |
|
|
|
# Install Torch and TorchAudio |
|
|
|
torch==2.2.1 and torchaudio==2.2.1 are compatible ([reference](https://pytorch.org/get-started/previous-versions/#linux-and-windows-1)), so I'll install those.
|
|
|
``` |
|
pip install torch==2.2.1+cu121 torchaudio==2.2.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html |
|
``` |
|
|
|
Verify Installation |
|
``` |
|
python3 -c "import torch; print(torch.__version__)" |
|
python3 -c "import torchaudio; print(torchaudio.__version__)" |
|
``` |
|
Desired output: |
|
``` |
|
2.2.1+cu121
2.2.1+cu121
|
``` |
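It is also worth checking that this PyTorch build can see the GPU and was built against CUDA 12.1:

```
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```

On a machine with a working GPU setup this should print `True 12.1`.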
|
|
|
## Install k2 |
|
``` |
|
pip install k2==1.24.4.dev20240425+cuda12.1.torch2.2.1 -f https://k2-fsa.github.io/k2/cuda.html |
|
``` |
|
|
|
Verify Installation |
|
``` |
|
python3 -m k2.version |
|
``` |
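Beyond printing the version, a minimal smoke test (a sketch, not taken verbatim from the official docs) is to build a small ragged tensor and move it to the GPU:

```
python3 -c "import torch, k2; a = k2.RaggedTensor([[1, 2], [3]]); print(a); print(a.to(torch.device('cuda', 0)))"
```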
|
|
|
## Install lhotse |
|
``` |
|
pip install git+https://github.com/lhotse-speech/lhotse |
|
``` |
|
Verify Installation: |
|
``` |
|
python3 -c "import lhotse; print(lhotse.__version__)" |
|
``` |
|
Desired output: |
|
``` |
|
1.24.0.dev+git.4d57d53.clean |
|
``` |
|
|
|
## Install icefall |
|
``` |
|
git clone https://github.com/k2-fsa/icefall |
|
cd icefall/ |
|
pip install -r ./requirements.txt |
|
``` |
|
Export the path where you cloned icefall |
|
``` |
|
export PYTHONPATH=/speech/hasan/icefall_install/icefall:$PYTHONPATH |
|
cd egs/yesno/ASR/ |
|
``` |
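Optionally, append the same export to your shell startup file so the icefall path persists across sessions (a convenience step, not required by icefall itself):

```
echo 'export PYTHONPATH=/speech/hasan/icefall_install/icefall:$PYTHONPATH' >> ~/.bashrc
```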
|
Test your installation:

```
./prepare.sh
```

```
export CUDA_VISIBLE_DEVICES=""
./tdnn/train.py
```

```
./tdnn/decode.py
```
|
|
|
## Congrats! |
|
[Reference](https://icefall.readthedocs.io/en/latest/installation/index.html) |
|
|
|
## Install kaldifeat

```
pip install kaldifeat==1.25.4.dev20240425+cpu.torch2.3.0 -f https://csukuangfj.github.io/kaldifeat/cpu.html
```

(Pick the wheel whose torch tag matches your installed torch version; the wheel above is tagged for torch 2.3.0.)
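To verify the install, importing the package is a quick check (this only confirms the module is importable, nothing more):

```
python3 -c "import kaldifeat; print(kaldifeat.__file__)"
```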
|
## Install sherpa

```
pip install k2_sherpa==1.3.dev20240227+cpu.torch2.2.1 -f https://k2-fsa.github.io/sherpa/cpu.html
```
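To verify, check that the command-line tool used later in this guide is on your PATH:

```
which sherpa-online-websocket-server
```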
|
|
|
## Training

A filled-in example follows the parameter reference below.

```
python3 egs/<dataset_name>/ASR/zipformer/train.py \
    --world-size <number_of_gpus> \
    --num-epochs <number_of_epochs> \
    --start-epoch <starting_epoch> \
    --exp-dir <experiment_directory> \
    --max-duration <max_duration_per_batch> \
    --num-workers <number_of_data_workers> \
    --on-the-fly-feats <True_or_False> \
    --manifest-dir <manifest_directory> \
    --num-buckets <number_of_buckets> \
    --bpe-model <path_to_bpe_model> \
    --train-cuts <path_to_training_cuts> \
    --valid-cuts <path_to_validation_cuts> \
    --causal <1_or_0> \
    --master-port <port_number>
```
|
|
|
Parameter Reference:

- `--world-size`: Number of GPUs or processes to use for distributed training.
- `--num-epochs`: Total number of epochs to train for.
- `--start-epoch`: Epoch to start training from (helpful when resuming).
- `--exp-dir`: Directory where experiment logs and model checkpoints are saved.
- `--max-duration`: Maximum total duration of audio per batch, in seconds.
- `--num-workers`: Number of data-loading workers.
- `--on-the-fly-feats`: Whether to compute features on the fly during training (True or False).
- `--manifest-dir`: Directory containing the manifest files for the training and validation data.
- `--num-buckets`: Number of buckets used for bucketing data by sequence length.
- `--bpe-model`: Path to the Byte-Pair Encoding model used for text tokenization.
- `--train-cuts`: Path to the JSONL file containing the training cuts.
- `--valid-cuts`: Path to the JSONL file containing the validation cuts.
- `--causal`: Set to 1 to train a causal (streaming-capable) model, 0 otherwise.
- `--master-port`: Port number for distributed training communication.
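For illustration, a filled-in invocation on LibriSpeech might look like this (all paths and values below are placeholders, not tested settings):

```
python3 egs/librispeech/ASR/zipformer/train.py \
    --world-size 2 \
    --num-epochs 30 \
    --start-epoch 1 \
    --exp-dir zipformer/exp \
    --max-duration 600 \
    --num-workers 4 \
    --on-the-fly-feats False \
    --manifest-dir data/fbank \
    --num-buckets 30 \
    --bpe-model data/lang_bpe_500/bpe.model \
    --train-cuts data/fbank/cuts_train.jsonl.gz \
    --valid-cuts data/fbank/cuts_dev.jsonl.gz \
    --causal 1 \
    --master-port 12354
```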
|
# Sample decoding script: Streaming ASR with Zipformer

This script performs streaming decoding of Zipformer ASR models in the icefall framework. It supports greedy-search decoding and configurable chunked streaming. A filled-in example follows the parameter list below.

```
./streaming_decode.py --epoch <EPOCH_NUMBER> \
    --avg <AVERAGE_NUMBER> \
    --exp-dir <EXPERIMENT_DIR> \
    --decoding-method <DECODING_METHOD> \
    --manifest-dir <MANIFEST_DIR> \
    --cut-set-name <CUT_SET_NAME> \
    --bpe-model <BPE_MODEL_PATH> \
    --causal <CAUSAL_FLAG> \
    --chunk-size <CHUNK_SIZE> \
    --left-context-frames <LEFT_CONTEXT_FRAMES> \
    --on-the-fly-feats <ON_THE_FLY_FEATS_FLAG> \
    --use-averaged-model <AVERAGED_MODEL_FLAG> \
    --num-workers <NUM_WORKERS> \
    --max-duration <MAX_DURATION> \
    --num-decode-streams <NUM_DECODE_STREAMS> \
    --context-size <CONTEXT_SIZE>
```
|
## Parameters

- `--epoch`: Which training epoch's checkpoint to use for decoding. A higher epoch number means the model has undergone more training.
- `--avg`: Number of checkpoints to average. For example, `--avg 4` averages the last 4 checkpoints for decoding.
- `--exp-dir`: Directory where the model's experiment data, such as checkpoints and logs, are stored.
- `--decoding-method`: Decoding strategy to use. Common methods include `greedy_search`, `beam_search`, etc.
- `--manifest-dir`: Directory containing the manifest files for the datasets to be decoded.
- `--cut-set-name`: Which cut set to decode, typically a subset of the data such as `test_1`, `test_2`, etc.
- `--bpe-model`: Path to the BPE model used for tokenization during decoding.
- `--causal`: Whether causal convolution should be used. Set to 1 for causal (streaming) and 0 for non-causal.
- `--chunk-size`: Size of each chunk processed during streaming.
- `--left-context-frames`: Number of left-context frames to include during chunked decoding.
- `--on-the-fly-feats`: If set to True, features are extracted on the fly instead of being precomputed.
- `--use-averaged-model`: If True, the model uses parameters averaged over multiple checkpoints.
- `--num-workers`: Number of data-loading workers used during decoding.
- `--max-duration`: Maximum total duration (in seconds) of audio to decode in one batch.
- `--num-decode-streams`: Number of parallel decoding streams to process.
- `--context-size`: Context size of the transducer decoder (the number of previous tokens it conditions on); typically 2.
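For illustration, a filled-in call might look like the following (all values are placeholders; adjust them to your experiment):

```
./streaming_decode.py --epoch 30 \
    --avg 4 \
    --exp-dir zipformer/exp \
    --decoding-method greedy_search \
    --manifest-dir data/fbank \
    --cut-set-name test_1 \
    --bpe-model data/lang_bpe_500/bpe.model \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 256 \
    --on-the-fly-feats False \
    --use-averaged-model True \
    --num-workers 4 \
    --max-duration 600 \
    --num-decode-streams 100 \
    --context-size 2
```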
|
|
|
# Sherpa Online WebSocket Server

This script starts a WebSocket server for real-time ASR decoding with the Sherpa framework. It supports GPU- or CPU-based decoding and several decoding methods.

```
sherpa-online-websocket-server --use-gpu=<USE_GPU_FLAG> \
    --tokens=<TOKENS_FILE_PATH> \
    --port=<PORT_NUMBER> \
    --doc-root=<DOCUMENT_ROOT> \
    --nn-model=<MODEL_PATH> \
    --decoding-method=<DECODING_METHOD>
```
|
## Parameters

- `--use-gpu`: Set to True for GPU-based decoding, or False for CPU-based decoding.
- `--tokens`: Path to the file containing the token list (e.g., BPE tokens) required for decoding.
- `--port`: Port number for the WebSocket server. Ensure this port is open and not blocked by a firewall.
- `--doc-root`: Root directory for the server's web resources; files in this directory are served when the server is accessed from a browser.
- `--nn-model`: Path to the neural network model used for decoding, usually a TorchScript (`torch.jit.script`) checkpoint exported for inference.
- `--decoding-method`: Decoding strategy to use. Common methods include `greedy_search`, `beam_search`, etc. Choose based on your model and application needs.
|
## Example

```
sherpa-online-websocket-server --use-gpu=True \
    --tokens=/path/to/tokens.txt \
    --port=8003 \
    --doc-root=/path/to/web/document/root \
    --nn-model=/path/to/jit_script_model.pt \
    --decoding-method=greedy_search
```
|
## Notes

- GPU support: If decoding on GPU, ensure that CUDA is properly set up on the system.
- Token file: The token file must correspond to the language and tokenization scheme used when training the neural network model.
- Neural network model: The model must be compatible with the decoding method specified (e.g., a chunk-based streaming model for streaming decoding).
|
|
|
|
|
|