--- license: mit datasets: - facebook/voxpopuli tags: - ASR - Automatic Speech Recognition - Whisper - Medusa - Speech - Speculative Decoding language: - en - es - de - fr --- # Whisper Medusa Whisper is an advanced encoder-decoder model for speech transcription and translation, processing audio through encoding and decoding stages. Given its large size and slow inference speed, various optimization strategies like Faster-Whisper and Speculative Decoding have been proposed to enhance performance. Our Medusa model builds on Whisper by predicting multiple tokens per iteration, which significantly improves speed with small degradation in WER. We train and evaluate our model on various datasets, demonstrating speed improvements. --------- ## Training Details `aiola/whisper-medusa-multilingual` was trained on the Voxpopuli dataset to perform audio translation. The Medusa heads were optimized for English, Spanish, German, and French so for optimal performance and speed improvements, please use these languages only. --------- ## Usage To use `aiola/whisper-medusa-multilingual` install [`whisper-medusa`](https://github.com/aiola-lab/whisper-medusa) repo following the README instructions. Inference can be done using the following code: ```python import torch import torchaudio from whisper_medusa import WhisperMedusaModel from transformers import WhisperProcessor model_name = "aiola/whisper-medusa-multilingual" model = WhisperMedusaModel.from_pretrained(model_name) processor = WhisperProcessor.from_pretrained(model_name) path_to_audio = "path/to/audio.wav" SAMPLING_RATE = 16000 language = "en" device = torch.device("cuda" if torch.cuda.is_available() else "cpu") input_speech, sr = torchaudio.load(path_to_audio) if sr != SAMPLING_RATE: input_speech = torchaudio.transforms.Resample(sr, SAMPLING_RATE)(input_speech) input_features = processor(input_speech.squeeze(), return_tensors="pt", sampling_rate=SAMPLING_RATE).input_features input_features = input_features.to(device) model = model.to(device) model_output = model.generate( input_features, language=language, ) predict_ids = model_output[0] pred = processor.decode(predict_ids, skip_special_tokens=True) print(pred) ```