---
license: apache-2.0
base_model: openai/whisper-medium
tags:
- generated_from_trainer
metrics:
- bleu
model-index:
- name: whisper-medium-wolof-2-english
  results: []
datasets:
- bilalfaye/english-wolof-french-dataset
language:
- wo
- en
pipeline_tag: automatic-speech-recognition
---

# whisper-medium-wolof-2-english

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset). The model is designed to translate Wolof audio into English text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.

It achieves the following results on the evaluation set:

- **Loss:** 1.7756
- **BLEU:** 25.3308

## Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate Wolof speech into English. It uses the "medium" variant, offering a balance between accuracy and computational efficiency.

## Intended Uses & Limitations

**Intended uses:**

- Automatic transcription and translation of Wolof audio into English text.
- Assisting researchers and language learners working with Wolof audio content.

**Limitations:**

- May struggle with heavy accents or noisy environments.
- Performance may vary depending on speaker pronunciation and recording quality.

## Training and Evaluation Data

The model was fine-tuned on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset), which consists of Wolof audio paired with English translations.

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training (a sketch of how they map onto `Seq2SeqTrainingArguments` appears after the framework versions below):

- **Learning Rate:** 1e-05
- **Train Batch Size:** 32
- **Eval Batch Size:** 16
- **Seed:** 42
- **Optimizer:** Adam (betas=(0.9, 0.999), epsilon=1e-08)
- **LR Scheduler Type:** Linear
- **Warmup Steps:** 500
- **Training Steps:** 20000
- **Mixed Precision Training:** Native AMP

### Training Results

| Training Loss | Epoch  | Step  | Validation Loss | BLEU    |
|:-------------:|:------:|:-----:|:---------------:|:-------:|
| 1.1851        | 0.8941 | 2000  | 1.1864          | 18.7395 |
| 0.8701        | 1.7881 | 4000  | 1.1268          | 22.3615 |
| 0.566         | 2.6822 | 6000  | 1.1656          | 24.4993 |
| 0.3238        | 3.5762 | 8000  | 1.2711          | 25.1466 |
| 0.1725        | 4.4703 | 10000 | 1.3854          | 24.7036 |
| 0.0821        | 5.3643 | 12000 | 1.4924          | 25.2531 |
| 0.0424        | 6.2584 | 14000 | 1.5961          | 24.4800 |
| 0.018         | 7.1524 | 16000 | 1.6757          | 24.8197 |
| 0.0101        | 8.0465 | 18000 | 1.7439          | 25.1500 |
| 0.0089        | 8.9405 | 20000 | 1.7756          | 25.3308 |

### Framework Versions

- **Transformers:** 4.41.2
- **PyTorch:** 2.4.0+cu121
- **Datasets:** 3.2.0
- **Tokenizers:** 0.19.1
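### Example Training Configuration (Sketch)

The training script itself was not published with this card. As a rough guide, the hyperparameters above would map onto Hugging Face's `Seq2SeqTrainingArguments` approximately as sketched below; the `output_dir` and the 2000-step evaluation cadence (inferred from the results table) are assumptions, not confirmed settings.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration from the
# hyperparameters listed above; a sketch, not the author's actual script.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-wolof-2-english",  # assumed output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,                    # native AMP mixed precision
    eval_strategy="steps",        # assumed: evaluation every 2000 steps, per the results table
    eval_steps=2000,
    predict_with_generate=True,   # generate sequences at eval time so BLEU can be computed
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer's default
# optimizer, so no explicit optimizer configuration is needed.
```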
## Inference

### Using Python Code

```python
!pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-wolof-2-english").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-wolof-2-english")

# Stream the dataset and take the third sample as an example
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
for _ in range(3):
    sample = next(iterator)

# Preprocess audio
input_features = processor(
    sample["wo_audio"]["audio"]["array"],
    sampling_rate=sample["wo_audio"]["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features.to(device)

# Generate the English translation
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Correct sentence:", sample["wo"])
print("Transcription:", transcription[0])
```

### Using Gradio Interface

```python
!pip install gradio torchaudio

import torch
import torchaudio
import numpy as np
import gradio as gr
from transformers import pipeline

# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    task="automatic-speech-recognition",
    model="bilalfaye/whisper-medium-wolof-2-english",
    device=device,
)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # File path (upload, or microphone with type="filepath")
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # Microphone case (Gradio returns a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."

    # Downmix stereo to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Resample to the 16 kHz rate Whisper expects
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)
    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result["text"]

# Create the Gradio interface
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="Whisper Medium Wolof Translation",
    description="Record audio in Wolof and translate it to English using a fine-tuned Whisper medium model.",
)

app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"],
)

app.launch(debug=True, share=True)
```

**Author**

- Bilal FAYE