---
license: apache-2.0
base_model: openai/whisper-medium
tags:
- generated_from_trainer
metrics:
- bleu
model-index:
- name: whisper-medium-wolof-2-english
  results: []
datasets:
- bilalfaye/english-wolof-french-dataset
language:
- wo
- en
pipeline_tag: automatic-speech-recognition
---

# whisper-medium-wolof-2-english

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset). The model is designed to translate Wolof audio into English text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.

It achieves the following results on the evaluation set:

- **Loss:** 1.7756
- **BLEU:** 25.3308

## Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate Wolof speech into English. It uses the "medium" variant, offering a balance between accuracy and computational efficiency.

## Intended Uses & Limitations

**Intended uses:**

- Automatic transcription and translation of Wolof audio into English text.
- Assisting researchers and language learners working with Wolof audio content.

**Limitations:**

- May struggle with heavy accents or noisy environments.
- Performance may vary depending on speaker pronunciation and recording quality.

## Training and Evaluation Data

The model was fine-tuned on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset), which consists of Wolof audio paired with English translations.

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training (a sketch of how they map onto `Seq2SeqTrainingArguments` appears after the framework versions below):

- **Learning Rate:** 1e-05
- **Train Batch Size:** 32
- **Eval Batch Size:** 16
- **Seed:** 42
- **Optimizer:** Adam (betas=(0.9, 0.999), epsilon=1e-08)
- **LR Scheduler Type:** Linear
- **Warmup Steps:** 500
- **Training Steps:** 20000
- **Mixed Precision Training:** Native AMP

### Training Results

| Training Loss | Epoch  | Step  | Validation Loss | BLEU    |
|:-------------:|:------:|:-----:|:---------------:|:-------:|
| 1.1851        | 0.8941 | 2000  | 1.1864          | 18.7395 |
| 0.8701        | 1.7881 | 4000  | 1.1268          | 22.3615 |
| 0.566         | 2.6822 | 6000  | 1.1656          | 24.4993 |
| 0.3238        | 3.5762 | 8000  | 1.2711          | 25.1466 |
| 0.1725        | 4.4703 | 10000 | 1.3854          | 24.7036 |
| 0.0821        | 5.3643 | 12000 | 1.4924          | 25.2531 |
| 0.0424        | 6.2584 | 14000 | 1.5961          | 24.4800 |
| 0.018         | 7.1524 | 16000 | 1.6757          | 24.8197 |
| 0.0101        | 8.0465 | 18000 | 1.7439          | 25.1500 |
| 0.0089        | 8.9405 | 20000 | 1.7756          | 25.3308 |

### Framework Versions

- **Transformers:** 4.41.2
- **PyTorch:** 2.4.0+cu121
- **Datasets:** 3.2.0
- **Tokenizers:** 0.19.1
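### Example Training Configuration (Sketch)

The training script itself was not published with this card. As a rough guide, the hyperparameters above would map onto Hugging Face's `Seq2SeqTrainingArguments` approximately as sketched below; the `output_dir` and the 2000-step evaluation cadence (inferred from the results table) are assumptions, not confirmed settings.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration from the
# hyperparameters listed above; a sketch, not the author's actual script.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-wolof-2-english",  # assumed output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,                    # native AMP mixed precision
    eval_strategy="steps",        # assumed: evaluation every 2000 steps, per the results table
    eval_steps=2000,
    predict_with_generate=True,   # generate sequences at eval time so BLEU can be computed
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer's default
# optimizer, so no explicit optimizer configuration is needed.
```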
## Inference

### Using Python Code

```python
!pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-wolof-2-english").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-wolof-2-english")

# Stream the dataset and take the third sample as an example
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
for _ in range(3):
    sample = next(iterator)

# Preprocess audio
input_features = processor(
    sample["wo_audio"]["audio"]["array"],
    sampling_rate=sample["wo_audio"]["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features.to(device)

# Generate the English translation
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Correct sentence:", sample["wo"])
print("Transcription:", transcription[0])
```

### Using Gradio Interface

```python
!pip install gradio torchaudio

import torch
import torchaudio
import numpy as np
import gradio as gr
from transformers import pipeline

# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    task="automatic-speech-recognition",
    model="bilalfaye/whisper-medium-wolof-2-english",
    device=device,
)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # File path (upload, or microphone with type="filepath")
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # Microphone case (Gradio returns a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."

    # Downmix stereo to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Resample to the 16 kHz rate Whisper expects
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)
    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result["text"]

# Create the Gradio interface
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="Whisper Medium Wolof Translation",
    description="Record audio in Wolof and translate it to English using a fine-tuned Whisper medium model.",
)

app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"],
)

app.launch(debug=True, share=True)
```

**Author**

- Bilal FAYE