---
license: apache-2.0
datasets:
- WpythonW/real-fake-voices-dataset2
- mozilla-foundation/common_voice_17_0
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
widget:
- text: Upload an audio file to check if it's real or synthetic
inference:
  parameters:
    sampling_rate: 16000
    audio_channel: mono
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: real-fake-voices-dataset2
      type: WpythonW/real-fake-voices-dataset2
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.971
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---
# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of MIT/ast-finetuned-audioset-10-10-0.4593 for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer and fine-tuned for fake audio detection.
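The fine-tuning script itself is not published in this card; the snippet below is only a minimal sketch of how such a head swap is typically set up with `transformers`, assuming the fake/real label order described in the next section:

```python
from transformers import AutoModelForAudioClassification

# Illustrative only: load the AudioSet-pretrained AST and replace its 527-way
# classification head with a fresh 2-way (fake/real) head before fine-tuning.
model = AutoModelForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=2,
    label2id={"fake": 0, "real": 1},  # assumed label order, matching the output description below
    id2label={0: "fake", 1: "real"},
    ignore_mismatched_sizes=True,     # required because the original head has 527 outputs
)
```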
## Model Description

- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- Task: Binary classification (fake/real audio detection)
- Input: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames; see the shape-check sketch below)
- Output: Two-class logits, converted via softmax to probabilities [fake_prob, real_prob]
- Training Hardware: 2x NVIDIA T4 GPUs
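As a quick sanity check of the input format, the AST feature extractor should produce the (1024, 128) Mel-spectrogram described above. The expected shape below assumes the default AST extractor settings, so treat it as an illustrative sketch:

```python
import numpy as np
from transformers import AutoFeatureExtractor

extractor = AutoFeatureExtractor.from_pretrained("WpythonW/ast-fakeaudio-detector")

# One second of silence at 16 kHz, just to inspect the produced feature shape.
dummy_audio = np.zeros(16000, dtype=np.float32)
features = extractor(dummy_audio, sampling_rate=16000, return_tensors="pt")

# Expected: torch.Size([1, 1024, 128]) -> (batch, time frames, mel bins)
print(features["input_values"].shape)
```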
## Usage Guide

### Model Usage
```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and move to available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]
processed_batch = []

for audio_path in audio_files:
    # Load audio file
    audio_data, sr = sf.read(audio_path)

    # Convert stereo to mono if needed
    if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
        audio_data = np.mean(audio_data, axis=1)

    # Resample to 16 kHz if needed
    if sr != 16000:
        waveform = torch.from_numpy(audio_data).float()
        if len(waveform.shape) == 1:
            waveform = waveform.unsqueeze(0)
        resample = torchaudio.transforms.Resample(
            orig_freq=sr,
            new_freq=16000
        )
        waveform = resample(waveform)
        audio_data = waveform.squeeze().numpy()

    processed_batch.append(audio_data)

# Prepare batch input
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Process results
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"

    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```
## Limitations

Important considerations when using this model:

- The model expects 16 kHz mono audio input
- Performance may vary with types of audio manipulation that were not present in the training data
- The model was trained on audio samples between 4 and 10 seconds long (a simple chunking approach for longer recordings is sketched below)
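Because training clips were 4-10 seconds long, one illustrative way to handle longer recordings is to split them into roughly 10-second windows, score each window, and average the probabilities. This is not part of the original training or evaluation setup, just a sketch:

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).eval()

audio, sr = sf.read("long_recording.wav")  # assumed to be 16 kHz mono already
window = 10 * sr                           # ~10 second chunks, matching the training range

chunks = [audio[i:i + window] for i in range(0, len(audio), window)]
chunks = [c for c in chunks if len(c) >= 4 * sr]  # drop fragments shorter than 4 s

inputs = extractor(chunks, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

mean_probs = probs.mean(dim=0)  # average fake/real probabilities over all chunks
print(f"fake: {mean_probs[0]:.2%}, real: {mean_probs[1]:.2%}")
```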