Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)
Model Overview
Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
Author: Ayoub Laachir
License: apache-2.0
Repository: Ayoub-Laachir/MaghrebVoice
Dataset: Ayoub-Laachir/Darija_Dataset
Description
This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, the Moroccan Arabic dialect, which is influenced by Amazigh (Berber), French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.
Technologies Used
- Whisper Large V3: OpenAI’s state-of-the-art speech recognition model
- PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): An efficient fine-tuning technique
- Google Colab: Cloud environment for training the model
- Hugging Face: Hosting the dataset and final model
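As background on the LoRA technique named above, the standard low-rank update can be written as follows. This is the general formulation from the LoRA method, not a description of this model's specific settings: the rank r and scaling factor α used for this fine-tune are not stated in this card.

```latex
% Frozen pretrained weight W_0, trainable low-rank factors B and A
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r(d + k) instead of dk
```

Only B and A are updated during training, which is what makes the approach practical on a single Colab GPU.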
Dataset Preparation
The dataset preparation involved several steps:
- Cleaning: Correcting bad transcriptions and standardizing word spellings.
- Audio Processing: Converting all samples to a 16 kHz sample rate.
- Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
- Format Conversion: Transforming the dataset into the Parquet file format.
- Uploading: Publishing the prepared dataset to the Hugging Face Hub.
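The split step above can be illustrated with a small standard-library sketch. How the author actually selected the 150 test samples is not stated in this card, so the seeded random shuffle and the `split_indices` helper below are assumptions made purely for illustration; only the sample counts come from the card.

```python
import random

N_TRAIN = 3312  # training samples, from the card
N_TEST = 150    # test samples, from the card


def split_indices(n_total, n_test, seed=0):
    """Return (train, test) index lists from a seeded shuffle (hypothetical helper)."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(N_TRAIN + N_TEST, N_TEST)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the held-out set stable across runs, which matters when error rates are compared between model versions.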
Training Process
The model was fine-tuned using the following parameters:
- Per device train batch size: 8
- Gradient accumulation steps: 1
- Learning rate: 1e-4 (0.0001)
- Warmup steps: 200
- Number of train epochs: 2
- Logging and evaluation: every 50 steps
- Weight decay: 0.01
Training progress showed a steady decrease in both training and validation loss over 8000 steps.
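For reference, the hyperparameters listed above can be collected in a plain dictionary. The key names mirror fields of the Hugging Face `TrainingArguments` class, but whether the author used that class is an assumption. The card also does not name the learning-rate scheduler, so the linear warmup in `warmup_lr` is a hypothetical illustration of what 200 warmup steps would mean under that schedule.

```python
# Values are taken from the list above; key names are an assumption.
training_config = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-4,
    "warmup_steps": 200,
    "num_train_epochs": 2,
    "logging_steps": 50,
    "eval_steps": 50,
    "weight_decay": 0.01,
}

# Samples consumed per optimizer step on each device.
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)


def warmup_lr(step, config=training_config):
    """Learning rate at `step` under an assumed linear warmup to the base rate."""
    fraction = min(1.0, step / config["warmup_steps"])
    return config["learning_rate"] * fraction


print(effective_batch_size)
print(warmup_lr(100), warmup_lr(200))
```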
Testing and Evaluation
The model was evaluated using:
- Word Error Rate (WER): 3.1467%
- Character Error Rate (CER): 2.3893%
These low error rates indicate that the model transcribes Moroccan Darija speech accurately.
Compared with the base model, the fine-tuned model handles Darija-specific words and sentence structure better and is more accurate overall.
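To make the two metrics concrete, here is a minimal standard-library sketch of how WER and CER are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length, over words or characters respectively. The evaluation code actually used for this model is not shown in this card, and the example sentences below are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, counting spaces as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented transliterated example: one word differs by one character.
reference = "ana bghit nmshi l dar"
hypothesis = "ana bghit nmchi l dar"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

A single wrong character costs a whole word in WER but only one character in CER, which is why CER is usually the lower of the two.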
Audio Transcription Script
This script demonstrates how to transcribe audio files using the fine-tuned Whisper Large V3 model for Moroccan Darija. It includes steps for installing necessary libraries, loading the model, and processing audio files.
Required Libraries
Before running the script, install the required libraries. The leading ! runs each command from a notebook cell (for example in Google Colab); omit it when installing from a regular shell:
!pip install --upgrade pip
!pip install --upgrade transformers accelerate librosa soundfile pydub
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Configuration for the model
config = {
    "model_id": "Ayoub-Laachir/MaghrebVoice",  # Model ID from Hugging Face
    "language": "arabic",  # Language for transcription
    "task": "transcribe",  # Task type
    "chunk_length_s": 30,  # Length of each audio chunk in seconds
    "stride_length_s": 5,  # Overlap between chunks in seconds
}
# Load the model and processor
def load_model_and_processor():
    try:
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            config["model_id"],
            torch_dtype=torch_dtype,  # Use appropriate data type
            low_cpu_mem_usage=True,  # Reduce peak CPU memory while loading
            use_safetensors=True,  # Load model with safetensors
            attn_implementation="sdpa",  # Specify attention implementation
        )
        model.to(device)  # Move model to the specified device
        processor = AutoProcessor.from_pretrained(config["model_id"])
        print("Model and processor loaded successfully.")
        return model, processor
    except Exception as e:
        print(f"Error loading model and processor: {e}")
        return None, None
# Load the model and processor
model, processor = load_model_and_processor()
if model is None or processor is None:
    print("Failed to load model or processor")
    raise SystemExit(1)  # Stop here; works in scripts and notebooks alike
# Configure the generation parameters for the pipeline
generate_kwargs = {
    "language": config["language"],  # Language for the pipeline
    "task": config["task"],  # Task for the pipeline
}
# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs=generate_kwargs,
    chunk_length_s=config["chunk_length_s"],  # Length of each audio chunk
    stride_length_s=config["stride_length_s"],  # Overlap between chunks
)
# Convert audio to 16kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)  # Load the audio at its native rate
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
    sf.write(output_path, audio_16k, 16000)  # Save the converted audio
# Format time in HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"
# Transcribe audio file
def transcribe_audio(audio_path):
    try:
        result = pipe(audio_path, return_timestamps=True)  # Transcribe audio and get timestamps
        return result["chunks"]  # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None
# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path
    # Convert audio to 16kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)
    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)
    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start, end = chunk["timestamp"]
            if end is None:  # The end timestamp can be missing, e.g. on a truncated final segment
                end = start
            start_time = format_time(start)  # Format start time
            end_time = format_time(end)  # Format end time
            text = chunk["text"]  # Get the transcribed text
            print(f"{start_time} --> {end_time}")  # Print time range
            print(f"{text}\n")  # Print transcribed text
    else:
        print("Transcription failed.")
if __name__ == "__main__":
    main()
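The cue-printing loop in the script could also be factored into a helper that returns the WebVTT text, so it can be written to a file or tested without running the model. The sketch below is not part of the original script: `format_timestamp`, `chunks_to_webvtt`, and the `audio_duration` parameter are hypothetical names. It assumes chunks shaped like the pipeline's `{"timestamp": (start, end), "text": ...}` output and fills a missing end timestamp from the audio duration when one is supplied.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm using integer milliseconds."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def chunks_to_webvtt(chunks, audio_duration=None):
    """Build WebVTT text from timestamped chunks (hypothetical helper)."""
    lines = ["WEBVTT", ""]
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:
            end = audio_duration  # fall back to the clip length if known
        if start is None or end is None:
            continue  # skip cues whose time range cannot be determined
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(chunk["text"].strip())
        lines.append("")
    return "\n".join(lines)


# Invented sample chunks; the second one has no end timestamp.
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " salam"},
    {"timestamp": (2.5, None), "text": " labas"},
]
vtt_text = chunks_to_webvtt(sample_chunks, audio_duration=4.0)
print(vtt_text)
```

Working in integer milliseconds avoids a rounding edge case in float formatting, where a value just under a whole minute could otherwise print as 60.000 seconds.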
Challenges and Future Improvements
Challenges Encountered
- Diverse spellings of words in Moroccan Darija
- Cleaning and standardizing the dataset
Future Improvements
- Expand the dataset to include more Darija accents and expressions
- Further fine-tune the model for specific Moroccan regional dialects
- Explore integration into practical applications like voice assistants and transcription services
Conclusion
This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.
Model tree for Ayoub-Laachir/MaghrebVoice
Base model
openai/whisper-large-v3