whisper-swedish-telephonic

Model Overview

whisper-swedish-telephonic is a fine-tuned version of OpenAI's Whisper-Small model, specifically designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.

Key Features:

  • Language: Swedish (primary), with limited support for minor English segments.
  • Audio Types: Telephonic conversations, customer support recordings, and general low-bandwidth audio.
  • Sample Rate: 8 kHz source audio, resampled to 16 kHz before feature extraction.
  • Special Tokens: Supports conversational markers, disfluencies, and speaker-specific tags.
  • Performance: Reaches a WER of 0.170 on the Swedish telephonic test set, compared with 0.888 for the base Whisper-Small model.

Dataset

The model was fine-tuned using the Swedish Telephonic Dataset, consisting of:

  • Duration: ~97 hours of annotated audio.
  • Domains: Call center recordings, customer service conversations.
  • Annotations:
    • Speaker IDs and timestamps.
    • Conversational tags: (()), ~, <overlap>.
    • Language switching: <lang:English>...</lang:English>.
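
For downstream text processing it can help to remove these markers from a transcript. The sketch below is only an illustration: the model card does not document what `(())` or `~` mean, so the code simply strips every marker listed above and keeps the text inside language-switch tags. The function name `strip_annotations` and the sample sentence are hypothetical.

```python
import re

# Opening and closing language-switch tags, e.g. <lang:English> ... </lang:English>
LANG_TAG = re.compile(r"</?lang:[^>]+>")
# Other markers named in the model card: <overlap>, [cough]/[laugh]-style tokens, (()) and ~
MARKERS = re.compile(r"<overlap>|\[[a-z]+\]|\(\(\)\)|~")


def strip_annotations(text: str) -> str:
    """Remove annotation markers and collapse the whitespace they leave behind."""
    text = LANG_TAG.sub("", text)   # keep the words inside the language tags
    text = MARKERS.sub(" ", text)   # drop the remaining markers
    return " ".join(text.split())


# Hypothetical annotated transcript, not taken from the dataset
sample = "ja [laugh] det är <lang:English>okay</lang:English> <overlap> ~ (())"
print(strip_annotations(sample))
```

Whether markers should be stripped, kept, or mapped to other symbols depends on the application; the regular expressions above are an assumption about the tag syntax based only on the examples in this list.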

Preprocessing:

  • Audio: Resampled to 16kHz.
  • Segmentation: Audio segments aligned with the annotated timestamps.
  • Special Tokens: Includes non-speech sounds like [cough], [laugh].
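
The resampling step can be sketched as follows. This is a minimal example assuming mono 8 kHz input and using SciPy's polyphase resampler; the resampler actually used to prepare the training data is not stated in this card, and the helper name `upsample_to_16k` and the synthetic tone are hypothetical.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

SOURCE_RATE = 8_000   # typical telephony sample rate
TARGET_RATE = 16_000  # rate the Whisper feature extractor expects


def upsample_to_16k(audio: np.ndarray, source_rate: int = SOURCE_RATE) -> np.ndarray:
    """Resample a mono waveform to 16 kHz with a polyphase filter."""
    if source_rate == TARGET_RATE:
        return audio.astype(np.float32)
    g = gcd(source_rate, TARGET_RATE)
    resampled = resample_poly(audio, TARGET_RATE // g, source_rate // g)
    return resampled.astype(np.float32)


# One second of a synthetic 440 Hz tone standing in for an 8 kHz call recording
t = np.arange(SOURCE_RATE) / SOURCE_RATE
tone_8k = np.sin(2 * np.pi * 440 * t)
tone_16k = upsample_to_16k(tone_8k)
print(len(tone_8k), len(tone_16k))
```

Upsampling does not restore frequency content above 4 kHz that the telephone channel removed; it only brings the waveform to the sample rate the model expects.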

Model Performance

Word Error Rate (WER) Evaluation

The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.

| Metric | Fine-Tuned Model | Base Whisper-Small |
|--------|------------------|--------------------|
| WER    | 0.170            | 0.888              |
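
To put the two figures in proportion, the relative WER reduction can be computed directly from the reported values. The snippet is plain arithmetic on the numbers in the table; the variable names are illustrative.

```python
base_wer = 0.888        # base Whisper-Small, from the table above
fine_tuned_wer = 0.170  # fine-tuned model, from the table above

# Share of the base model's errors that the fine-tuned model removes
relative_reduction = (base_wer - fine_tuned_wer) / base_wer
print(f"Relative WER reduction: {relative_reduction:.1%}")
```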

Key Observations:

  • Fine-Tuned Model:
    • Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
    • Handles speaker-specific annotations and conversational markers effectively.
  • Base Model:
    • Struggles with Swedish syntax and domain-specific vocabulary.
    • Outputs nonsensical transcriptions for longer or complex sentences.

Example Transcriptions

| Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
|---------|--------------|------------------|------------|------------------|------------|
| 1 | så nu | så nu | so, no | 0.000 | 1.000 |
| 2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
| 3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
| 5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
| 14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |

Note: Full segment-wise evaluation logs are available in the repository.
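
Per-segment WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below assumes lowercasing and whitespace tokenization with no punctuation handling; the text normalization used for the reported figures is not described here, so results on other segments may differ. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Segment 2 from the table: ground truth versus base model output
print(word_error_rate("nu record du båda va", "nu rekordar du båda"))
# Segment 14 from the table
print(word_error_rate("till frankrike", "thank you"))
```

Under these assumptions, segment 2 has one substitution and one deletion against five reference words, which matches the 0.400 listed in the table.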


Audio Example

This audio file demonstrates the model's transcription abilities:

  • File: trimmed_resampled_audio.wav
  • Content: Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?
  • Audio Type: Telephonic conversation.
  • Sample Rate: 16kHz (resampled).
  • Purpose: Showcasing the model's capabilities in transcribing Swedish telephonic speech.

Intended Use

This model is designed for:

  • Customer Support Automation: Transcription and analysis of call center recordings.
  • Telephony Analytics: Sentiment analysis, compliance monitoring, and business intelligence.
  • Swedish Language Research: Study of conversational patterns and colloquial expressions.

Limitations:

  • Language Support: Primarily Swedish; limited support for English.
  • Audio Quality: Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
  • Preprocessing Requirement: All input audio, including 8 kHz telephonic recordings, must be resampled to 16 kHz before feature extraction.

Try the Model

You can test the model using the Hugging Face Playground or the dedicated endpoint.


How to Use

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load audio as mono and resample to 16 kHz, the rate Whisper's
# feature extractor expects (8 kHz telephony recordings are upsampled here)
audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

# Transcribe
generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)

Model size: 242M params (Safetensors, tensor type F32)