nickdee96 committed
Commit ca96342 · 1 Parent(s): 346ce7b

Update README.md

Files changed (1):
  1. README.md +22 -4
README.md CHANGED
@@ -3,7 +3,6 @@ license: apache-2.0
 datasets:
 - mozilla-foundation/common_voice_11_0
 language:
-- en
 - sw
 metrics:
 - wer
@@ -16,20 +15,39 @@ pipeline_tag: automatic-speech-recognition
 ## Model details
 The Swahili ASR is an end-to-end automatic speech recognition system that was fine-tuned on the Common Voice Corpus 11.0 Swahili dataset. This repository provides the tools needed to perform ASR with this model, enabling high-quality speech-to-text conversion in Swahili.

 | EVAL_LOSS | EVAL_WER | EVAL_RUNTIME | EVAL_SAMPLES_PER_SECOND | EVAL_STEPS_PER_SECOND | EPOCH |
 |-------------------|--------------------|--------------|-------------------------|-----------------------|-------|
 | 0.345414400100708 | 0.2602372795622284 | 578.4006 | 17.701 | 2.213 | 4.17 |
 
 ## Intended Use
 This model is intended for any application requiring Swahili speech-to-text conversion, including but not limited to transcription services, voice assistants, and accessibility technology. It can be particularly beneficial where demographic metadata (age, sex, accent) is significant, as these features were taken into account during training.

 ## Dataset
 The model was trained on the Swahili subset of the Common Voice Corpus 11.0, a crowd-sourced collection of MP3 recordings with corresponding transcripts (16,413 validated hours across all languages of the corpus). Much of the dataset also includes demographic metadata, such as age, sex, and accent, contributing to a more accurate and contextually aware ASR model.
 [Dataset link](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)

 ## Training Procedure
 ### Pipeline Description
-The ASR system has two interconnected stages: the Tokenizer (unigram) and the Acoustic model (wav2vec2.0 + CTC).
 1. **Tokenizer (unigram):** Transforms words into subword units, using a vocabulary extracted from the training and test datasets. The resulting `Wav2Vec2CTCTokenizer` is then pushed to the Hugging Face model hub.
-2. **Acoustic model (wav2vec2.0 + CTC):** Fine-tunes a pretrained wav2vec 2.0 model (`facebook/wav2vec2-base`) on the dataset. The processed audio is passed through the CTC (Connectionist Temporal Classification) decoder, which converts acoustic representations into a sequence of tokens/characters. The trained model is then also pushed to the Hugging Face model hub.
 ### Technical Specifications
 The ASR system uses the Wav2Vec2ForCTC architecture from the Hugging Face Transformers library. The model combines a pretrained wav2vec 2.0 encoder with a linear layer for CTC (Connectionist Temporal Classification), trained together end-to-end, which makes it well suited to speech recognition. Performance is tracked with the Word Error Rate (WER) during training.
 ### Compute Infrastructure
@@ -72,4 +90,4 @@ The following hyperparameters were used during training:
 - Datasets 2.13.1
 - Tokenizers 0.13.3
 ## About THiNK
-THiNK is a technology initiative driven by a community of innovators and businesses. It provides a collaborative platform with services that assist businesses in all sectors, particularly on their digital transformation journey.
 
 datasets:
 - mozilla-foundation/common_voice_11_0
 language:
 - sw
 metrics:
 - wer

 ## Model details
 The Swahili ASR is an end-to-end automatic speech recognition system that was fine-tuned on the Common Voice Corpus 11.0 Swahili dataset. This repository provides the tools needed to perform ASR with this model, enabling high-quality speech-to-text conversion in Swahili.

+## Example Usage
+
+Here's an example of how you can use this model for speech-to-text conversion:
+
+```python
+from datasets import load_dataset
+from transformers import pipeline
+
+# Replace the following lines to load an audio file of your choice
+commonvoice_sw = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="test")
+audio_file = commonvoice_sw[0]["audio"]
+
+asr = pipeline(
+    "automatic-speech-recognition",
+    model="thinkKenya/wav2vec2-large-xls-r-300m-sw",
+    feature_extractor="thinkKenya/wav2vec2-large-xls-r-300m-sw",
+)
+
+transcription = asr(audio_file)
+```

 | EVAL_LOSS | EVAL_WER | EVAL_RUNTIME | EVAL_SAMPLES_PER_SECOND | EVAL_STEPS_PER_SECOND | EPOCH |
 |-------------------|--------------------|--------------|-------------------------|-----------------------|-------|
 | 0.345414400100708 | 0.2602372795622284 | 578.4006 | 17.701 | 2.213 | 4.17 |
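The EVAL_WER of ~0.26 means roughly one word in four differs from the reference transcript on the evaluation split. WER is the word-level edit distance between reference and hypothesis divided by the number of reference words; the following is a minimal stdlib sketch of that computation (the Swahili strings are illustrative, not taken from the eval set, and the reported number comes from the training run's own evaluation loop):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One substituted word out of four:
print(wer("habari ya asubuhi rafiki", "habari za asubuhi rafiki"))  # 0.25
```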
+
 ## Intended Use
 This model is intended for any application requiring Swahili speech-to-text conversion, including but not limited to transcription services, voice assistants, and accessibility technology. It can be particularly beneficial where demographic metadata (age, sex, accent) is significant, as these features were taken into account during training.
+
 ## Dataset
 The model was trained on the Swahili subset of the Common Voice Corpus 11.0, a crowd-sourced collection of MP3 recordings with corresponding transcripts (16,413 validated hours across all languages of the corpus). Much of the dataset also includes demographic metadata, such as age, sex, and accent, contributing to a more accurate and contextually aware ASR model.
 [Dataset link](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
+
 ## Training Procedure
 ### Pipeline Description
+The ASR system has two interconnected stages: the Tokenizer (unigram) and the Acoustic model (wav2vec2.0 + CTC).
 1. **Tokenizer (unigram):** Transforms words into subword units, using a vocabulary extracted from the training and test datasets. The resulting `Wav2Vec2CTCTokenizer` is then pushed to the Hugging Face model hub.
+2. **Acoustic model (wav2vec2.0 + CTC):** Fine-tunes a pretrained wav2vec 2.0 model (`facebook/wav2vec2-base`) on the dataset. The processed audio is passed through the CTC (Connectionist Temporal Classification) decoder, which converts acoustic representations into a sequence of tokens/characters. The trained model is then also pushed to the Hugging Face model hub.
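The CTC decoding step in stage 2 can be illustrated with a minimal greedy decoder (a sketch of the general technique, not this repository's code): take the highest-scoring token per audio frame, collapse consecutive repeats, then drop the blank symbol.

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse a per-frame argmax id sequence into output token ids."""
    out, prev = [], None
    for i in frame_ids:
        # Keep a frame only if it differs from the previous frame (collapses
        # repeats) and is not the CTC blank symbol.
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Token 1 held for two frames, blank, token 1 again, token 2 held, blank:
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

The blank between the two occurrences of token 1 is what lets CTC emit genuinely repeated characters, such as double letters in a word.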
 ### Technical Specifications
 The ASR system uses the Wav2Vec2ForCTC architecture from the Hugging Face Transformers library. The model combines a pretrained wav2vec 2.0 encoder with a linear layer for CTC (Connectionist Temporal Classification), trained together end-to-end, which makes it well suited to speech recognition. Performance is tracked with the Word Error Rate (WER) during training.
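As a sketch of that architecture, the following instantiates an untrained `Wav2Vec2ForCTC` from a fresh config; the vocabulary size of 40 is an illustrative assumption (the real vocabulary comes from the trained tokenizer), and no weights are downloaded:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Randomly initialised model with the same structure: a wav2vec 2.0 encoder
# followed by a linear CTC head over the tokenizer vocabulary.
config = Wav2Vec2Config(vocab_size=40)  # vocab size is an assumption
model = Wav2Vec2ForCTC(config)

wave = torch.randn(1, 16000)   # one second of 16 kHz audio (random noise)
logits = model(wave).logits    # shape: (batch, frames, vocab)
print(logits.shape[-1])        # 40 -> one score per vocabulary token per frame
```

The encoder's convolutional front end downsamples the waveform, so the frame axis is much shorter than the 16,000 input samples.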
 ### Compute Infrastructure
 
 - Datasets 2.13.1
 - Tokenizers 0.13.3
 ## About THiNK
+THiNK is a technology initiative driven by a community of innovators and businesses. It provides a collaborative platform with services that assist businesses in all sectors, particularly on their digital transformation journey.