metadata
language: fr
license: apache-2.0
library_name: transformers
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- CTC
- Wav2vec2
datasets:
- common_voice
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
metrics:
- wer
base_model: LeBenchmark/wav2vec2-FR-7K-large
model-index:
- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: fr
    metrics:
    - type: wer
      value: 11.44
      name: Test WER
    - type: wer
      value: 9.66
      name: Test WER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      args: french
    metrics:
    - type: wer
      value: 5.93
      name: Test WER
    - type: wer
      value: 5.13
      name: Test WER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      args: fr
    metrics:
    - type: wer
      value: 9.33
      name: Test WER
    - type: wer
      value: 8.51
      name: Test WER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      args: fr
    metrics:
    - type: wer
      value: 16.22
      name: Test WER
    - type: wer
      value: 15.39
      name: Test WER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - type: wer
      value: 16.56
      name: Test WER
    - type: wer
      value: 12.96
      name: Test WER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      args: fr_fr
    metrics:
    - type: wer
      value: 10.1
      name: Test WER
    - type: wer
      value: 8.84
      name: Test WER (+LM)
Fine-tuned wav2vec2-FR-7K-large model for ASR in French
This model is a fine-tuned version of LeBenchmark/wav2vec2-FR-7K-large, trained on a composite dataset of over 2,200 hours of French speech, using the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that your speech input is also sampled at 16 kHz.
Usage
- To use on a local audio file with the language model
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate
wav_path = "example.wav" # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (also drops the channel dimension)
# resample to the rate the model expects
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)
# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits
# decode with the language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
- To use on a local audio file without the language model
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2Processor
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate
wav_path = "example.wav" # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (also drops the channel dimension)
# resample to the rate the model expects
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)
# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits
# decode (greedy CTC: most likely token per frame)
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
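For readers new to CTC, the decoding step without the language model (argmax per frame, then batch_decode) can be sketched in plain Python. This is only an illustration: the toy vocabulary, the blank id, and the word delimiter below are assumptions made for the example, not the model's actual vocabulary, and the function name greedy_ctc_decode is hypothetical.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real model ships its own vocabulary file.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "b", 3: "o", 4: "n", 5: "j", 6: "u", 7: "r"}
BLANK_ID = 0          # assumed id of the CTC blank token
WORD_DELIMITER = "|"  # assumed token marking word boundaries


def greedy_ctc_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and map ids to text."""
    # Step 1: merge consecutive duplicates (one token can span many frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks, which only separate genuine repeats.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # Step 3: turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce for one utterance.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 0, 1, 5, 3, 6, 6, 7]
print(greedy_ctc_decode(frame_ids))
```

The blank token is what lets CTC emit a doubled letter: the ids 3, 0, 3 decode to two letters, while 3, 3 collapses to one.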
Evaluation
- To evaluate on mozilla-foundation/common_voice_11_0
python eval.py \
--model_id "bhuang/asr-wav2vec2-french" \
--dataset "mozilla-foundation/common_voice_11_0" \
--config "fr" \
--split "test" \
--log_outputs \
--outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
- To evaluate on speech-recognition-community-v2/dev_data
python eval.py \
--model_id "bhuang/asr-wav2vec2-french" \
--dataset "speech-recognition-community-v2/dev_data" \
--config "fr" \
--split "validation" \
--chunk_length_s 30.0 \
--stride_length_s 5.0 \
--log_outputs \
--outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
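The metric reported for this model is word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python version for intuition only. It is not the implementation used by eval.py, it applies no text normalization, and the function name word_error_rate is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction (multiply by 100 for a percentage)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("le chat dort", "le chien dort"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.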