|
---
license: apache-2.0
tags:
- automatic-speech-recognition
- fi
- finnish
library_name: transformers
language: fi
base_model:
- GetmanY1/wav2vec2-xlarge-fi-150k
model-index:
- name: wav2vec2-xlarge-fi-150k-finetuned
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Lahjoita puhetta (Donate Speech)
      type: lahjoita-puhetta
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 14.98
    - name: Dev CER
      type: cer
      value: 4.13
    - name: Test WER
      type: wer
      value: 16.37
    - name: Test CER
      type: cer
      value: 5.03
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Finnish Parliament
      type: FinParl
      args: fi
    metrics:
    - name: Dev16 WER
      type: wer
      value: 10.91
    - name: Dev16 CER
      type: cer
      value: 4.85
    - name: Test16 WER
      type: wer
      value: 7.81
    - name: Test16 CER
      type: cer
      value: 3.48
    - name: Test20 WER
      type: wer
      value: 6.43
    - name: Test20 CER
      type: cer
      value: 2.09
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      args: fi
    metrics:
    - name: Dev WER
      type: wer
      value: 6.65
    - name: Dev CER
      type: cer
      value: 1.15
    - name: Test WER
      type: wer
      value: 5.42
    - name: Test CER
      type: cer
      value: 0.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      args: fi_fi
    metrics:
    - name: Dev WER
      type: wer
      value: 8.67
    - name: Dev CER
      type: cer
      value: 5.18
    - name: Test WER
      type: wer
      value: 9.96
    - name: Test CER
      type: cer
      value: 5.74
---
|
|
|
# Finnish Wav2vec2-XLarge ASR |
|
|
|
[GetmanY1/wav2vec2-xlarge-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-xlarge-fi-150k) fine-tuned on 4,600 hours of 16 kHz sampled Finnish speech audio:
|
* 1500 hours of [Lahjoita puhetta (Donate Speech)](https://link.springer.com/article/10.1007/s10579-022-09606-3) (colloquial Finnish) |
|
* 3100 hours of the [Finnish Parliament dataset](https://link.springer.com/article/10.1007/s10579-023-09650-7) |
|
|
|
When using the model, make sure that your speech input is also sampled at 16 kHz.
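
If your recordings are stored at a different sampling rate, resample them to 16 kHz first. Below is a minimal sketch using `torchaudio` (the file name is a placeholder):

```
import torchaudio
import torchaudio.functional as F

# Load an audio file (placeholder path) and resample it to 16 kHz if needed
waveform, sample_rate = torchaudio.load("my_finnish_audio.wav")
if sample_rate != 16_000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
```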
|
|
|
## Model description |
|
|
|
The Finnish Wav2Vec2 X-Large model has the same architecture and uses the same training objective as the multilingual XLS-R model described in [this paper](https://www.isca-archive.org/interspeech_2022/babu22_interspeech.pdf).
|
|
|
[GetmanY1/wav2vec2-xlarge-fi-150k](https://huggingface.co/GetmanY1/wav2vec2-xlarge-fi-150k) is a large-scale, 1-billion-parameter monolingual model pre-trained on 158k hours of unlabeled Finnish speech, including [KAVI radio and television archive materials](https://kavi.fi/en/radio-ja-televisioarkistointia-vuodesta-2008/), Lahjoita puhetta (Donate Speech), the Finnish Parliament dataset, and the Finnish portion of VoxPopuli.
|
|
|
You can read more about the pre-trained model in [this paper](TODO). The training scripts are available on [GitHub](https://github.com/aalto-speech/large-scale-monolingual-speech-foundation-models).
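
If you want to build on the pre-trained checkpoint directly (for example, for your own fine-tuning or for feature extraction), it can be loaded with the standard `transformers` classes. A minimal sketch, assuming the base repository ships the usual preprocessor configuration:

```
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
import torch

# Load the pre-trained (not yet fine-tuned) base model
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("GetmanY1/wav2vec2-xlarge-fi-150k")
model = Wav2Vec2Model.from_pretrained("GetmanY1/wav2vec2-xlarge-fi-150k")

# One second of silence at 16 kHz as a dummy input
dummy_audio = torch.zeros(16_000)
inputs = feature_extractor(dummy_audio.numpy(), sampling_rate=16_000, return_tensors="pt")

# Extract frame-level speech representations of shape (batch, frames, hidden_size)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
```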
|
|
|
## Intended uses |
|
|
|
You can use this model for Finnish ASR (speech-to-text). |
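
For a quick start, the model should also work with the high-level `pipeline` API, which handles audio loading and decoding for you; a minimal sketch (the audio path is a placeholder):

```
from transformers import pipeline

# Build an ASR pipeline on top of this fine-tuned checkpoint
asr = pipeline("automatic-speech-recognition", model="GetmanY1/wav2vec2-xlarge-fi-150k-finetuned")

# Transcribe a local audio file (placeholder path)
print(asr("my_finnish_audio.wav")["text"])
```

The section below shows the equivalent lower-level usage with the processor and model classes directly.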
|
|
|
### How to use |
|
|
|
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
|
|
|
```
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, Audio
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-xlarge-fi-150k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-xlarge-fi-150k-finetuned")

# load the Common Voice test split and resample the audio to 16 kHz
ds = load_dataset("mozilla-foundation/common_voice_16_1", "fi", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# preprocess the raw waveform
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
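
The WER and CER figures reported above can be computed with the `evaluate` library (the metrics additionally require the `jiwer` package). A minimal sketch that scores the prediction from the snippet above against the Common Voice reference transcript; any text normalization used for the reported numbers is not specified here, so treat this as an illustration:

```
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Reference transcript from Common Voice and the prediction from the snippet above
references = [ds[0]["sentence"]]
predictions = transcription

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```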
|
|
|
## Team Members |
|
|
|
- Yaroslav Getman, [Hugging Face profile](https://huggingface.co/GetmanY1), [LinkedIn profile](https://www.linkedin.com/in/yaroslav-getman/) |
|
- Tamas Grosz, [Hugging Face profile](https://huggingface.co/Grosy), [LinkedIn profile](https://www.linkedin.com/in/tam%C3%A1s-gr%C3%B3sz-950a049a/) |
|
|
|
Feel free to contact us for more details 🤗 |