You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

wav2vec2-jv-base-openslr

This model is a fine-tuned version of facebook/wav2vec2-base on the OpenSLR41 datasets. It achieves the following results on the evaluation set:

  • Loss: 0.2843
  • Wer: 0.1502

Model description

The model is a fine-tuned version of wav2vec2, specifically adapted using the OpenSLR 41 dataset, which is focused on the Javanese language domain. This adaptation enables the model to effectively recognize and process spoken Javanese, leveraging the robust capabilities of the wav2vec2 architecture combined with domain-specific training data.

Intended uses & limitations

This model is intended for transcribing spoken Javanese language from audio recordings. It achieves a Word Error Rate (WER) of 15%, indicating that while the model performs reasonably well, it still produces significant transcription errors. Users should be aware that the accuracy may vary, particularly in cases with challenging audio conditions or less common dialects. Additionally, this model requires input audio at a sample rate of 16kHz, which may limit its applicability for recordings at different sample rates or lower quality audio files.

Training and evaluation data

The model use OpenSLR41 datasets, and split into 2 section (training and testing), then the model is trained using 1xA100 GPU with a training duration of 4-5 hours.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 65
  • mixed_precision_training: Native AMP

Log Data | Training results

Training Loss Epoch Step Validation Loss Wer
0.5361 2.8329 2000 0.4626 0.4238
0.332 5.6657 4000 0.3857 0.3749
0.242 8.4986 6000 0.3456 0.3060
0.1893 11.3314 8000 0.3250 0.2846
0.1566 14.1643 10000 0.3260 0.2640
0.1433 16.9972 12000 0.2891 0.2516
0.124 19.8300 14000 0.3172 0.2433
0.1103 22.6629 16000 0.3099 0.2453
0.1015 25.4958 18000 0.3087 0.2295
0.088 28.3286 20000 0.3250 0.2054
0.0831 31.1615 22000 0.3127 0.2143
0.0748 33.9943 24000 0.2973 0.1923
0.0696 36.8272 26000 0.3103 0.2026
0.0622 39.6601 28000 0.3292 0.2068
0.0564 42.4929 30000 0.2965 0.1916
0.0507 45.3258 32000 0.3061 0.1819
0.0475 48.1586 34000 0.2784 0.1881
0.0448 50.9915 36000 0.2872 0.1764
0.0413 53.8244 38000 0.2854 0.1716
0.0357 56.6572 40000 0.2862 0.1723
0.0328 59.4901 42000 0.2887 0.1654
0.0324 62.3229 44000 0.2843 0.1502

How to run (Gradio Web)

import torch
import torchaudio
import gradio as gr
import numpy as np
from transformers import pipeline, AutoProcessor, AutoModelForCTC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
MODEL_NAME = "<fill this to your model>"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForCTC.from_pretrained(MODEL_NAME)

# Move model to GPU
model.to(device)

# Create the pipeline with the model and processor
transcriber = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=device)

def transcribe(audio):
    sr, y = audio
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))

    return transcriber({"sampling_rate": sr, "raw": y})["text"]

demo = gr.Interface(
    transcribe,
    gr.Audio(sources=["upload"]),
    "text",
)

demo.launch(share=True)

How to run

import torch
import torchaudio
import gradio as gr
import numpy as np
from transformers import pipeline, AutoProcessor, AutoModelForCTC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
MODEL_NAME = "<fill this to actual model>"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForCTC.from_pretrained(MODEL_NAME)

# Move model to GPU
model.to(device)

# Load audio file
AUDIO_PATH = "<replace 'path_to_audio_file.wav' with the actual path to your audio file>"
audio_input, sample_rate = torchaudio.load(AUDIO_PATH)

# Ensure the audio is mono (1 channel)
if audio_input.shape[0] > 1:
    audio_input = torch.mean(audio_input, dim=0, keepdim=True)

# Resample audio if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    audio_input = resampler(audio_input)

# Process the audio input
input_values = processor(audio_input.squeeze(), sampling_rate=16000, return_tensors="pt").input_values

# Move input values to GPU
input_values = input_values.to(device)

# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

# Decode the logits to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print("Transcription:", transcription)

Framework versions

  • Transformers 4.44.0
  • Pytorch 2.2.1+cu118
  • Datasets 2.20.0
  • Tokenizers 0.19.1
Downloads last month
0
Safetensors
Model size
94.4M params
Tensor type
F32
·
Inference Examples
Unable to determine this model's library. Check the docs .

Model tree for johaness14/wav2vec2-jv-base-openslr

Finetuned
(694)
this model

Dataset used to train johaness14/wav2vec2-jv-base-openslr

Collection including johaness14/wav2vec2-jv-base-openslr