---
license: apache-2.0
tags:
- automatic-speech-recognition
- fi
- finnish
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-xlarge-fi-150k
model-index:
- name: wav2vec2-xlarge-fi-150k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Lahjoita puhetta (Donate Speech)
      type: lahjoita-puhetta
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 14.98
    - name: Dev CER
      type: cer
      value: 4.13
    - name: Test WER
      type: wer
      value: 16.37
    - name: Test CER
      type: cer
      value: 5.03
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Finnish Parliament
      type: FinParl
      args: fi
    metrics:
    - name: Dev16 WER
      type: wer
      value: 10.91
    - name: Dev16 CER
      type: cer
      value: 4.85
    - name: Test16 WER
      type: wer
      value: 7.81
    - name: Test16 CER
      type: cer
      value: 3.48
    - name: Test20 WER
      type: wer
      value: 6.43
    - name: Test20 CER
      type: cer
      value: 2.09
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 6.65
    - name: Dev CER
      type: cer
      value: 1.15
    - name: Test WER
      type: wer
      value: 5.42
    - name: Test CER
      type: cer
      value: 0.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      args: fi_fi
    metrics:
    - name: Dev WER
      type: wer
      value: 8.67
    - name: Dev CER
      type: cer
      value: 5.18
    - name: Test WER
      type: wer
      value: 9.96
    - name: Test CER
      type: cer
      value: 5.74
---
# Finnish Wav2vec2-XLarge ASR
GetmanY1/wav2vec2-xlarge-fi-150k fine-tuned on 4600 hours of Finnish speech sampled at 16 kHz:
- 1500 hours of Lahjoita puhetta (Donate Speech) (colloquial Finnish)
- 3100 hours of the Finnish Parliament dataset

When using the model, make sure that your speech input is also sampled at 16 kHz.
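If your recordings are not already at 16 kHz, resample them before running the model. Below is a minimal sketch using torchaudio; torchaudio is an assumed extra dependency and `audio.wav` is a placeholder path, not a file shipped with this model.

```python
# Minimal resampling sketch; "audio.wav" is a placeholder path
import torchaudio
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("audio.wav")  # waveform shape: (channels, samples)
if sample_rate != 16_000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
speech = waveform.mean(dim=0)  # downmix to mono; pass this 1-D array to the processor
```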
## Model description
The Finnish Wav2Vec2 X-Large has the same architecture and uses the same training objective as the multilingual model described in this paper.

GetmanY1/wav2vec2-xlarge-fi-150k is a large-scale, 1-billion-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including KAVI radio and television archive materials, Lahjoita puhetta (Donate Speech), the Finnish Parliament, and Finnish VoxPopuli.

You can read more about the pre-trained model in this paper. The training scripts are available on GitHub.
## Intended uses
You can use this model for Finnish ASR (speech-to-text).
## How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-xlarge-fi-150k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-xlarge-fi-150k-finetuned")

# load dummy dataset and read sound files
ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split="test")

# tokenize (the audio must be sampled at 16 kHz)
input_values = processor(
    ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest"
).input_values  # Batch size 1

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
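Alternatively, the checkpoint can be run through the high-level `pipeline` API, which handles audio loading and decoding for you. This is a sketch rather than the authors' official recipe; `chunk_length_s=30` and the `audio.wav` path are illustrative choices.

```python
# Sketch: high-level ASR pipeline; chunk_length_s and the file path are illustrative
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="GetmanY1/wav2vec2-xlarge-fi-150k-finetuned",
    chunk_length_s=30,  # chunked inference for long recordings
)
print(asr("audio.wav")["text"])
```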
## Team Members
- Yaroslav Getman, Hugging Face profile, LinkedIn profile
- Tamas Grosz, Hugging Face profile, LinkedIn profile
Feel free to contact us for more details 🤗