---
license: apache-2.0
datasets:
- WpythonW/real-fake-voices-dataset2
- mozilla-foundation/common_voice_17_0
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
widget:
- text: Upload an audio file to check if it's real or synthetic
inference:
  parameters:
    sampling_rate: 16000
    audio_channel: mono
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: real-fake-voices-dataset2
      type: WpythonW/real-fake-voices-dataset2
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.971
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---

# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.

## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Probabilities `[fake_prob, real_prob]`
- **Training Hardware**: 2x NVIDIA T4 GPUs

# Usage Guide

## Model Usage

```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and move to available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]
processed_batch = []

for audio_path in audio_files:
    # Load audio file
    audio_data, sr = sf.read(audio_path)

    # Convert stereo to mono if needed
    if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
        audio_data = np.mean(audio_data, axis=1)

    # Resample to 16kHz if needed
    if sr != 16000:
        waveform = torch.from_numpy(audio_data).float()
        if len(waveform.shape) == 1:
            waveform = waveform.unsqueeze(0)
        resample = torchaudio.transforms.Resample(
            orig_freq=sr,
            new_freq=16000
        )
        waveform = resample(waveform)
        audio_data = waveform.squeeze().numpy()

    processed_batch.append(audio_data)

# Prepare batch input
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Process results
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"

    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```
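One design note: the results loop above hardcodes index 0 as the fake class, matching the `[fake_prob, real_prob]` output order documented in the model description. If you prefer not to rely on index order, the checkpoint's own index-to-label mapping can be read from the config (the exact label strings depend on how the checkpoint was exported):

```python
# Read the index-to-label mapping from the checkpoint config rather than
# hardcoding index 0 = fake. Reuses `model` and `probabilities` from above.
id2label = model.config.id2label
print(id2label)
for filename, probs in zip(audio_files, probabilities):
    scores = {id2label[i]: float(p) for i, p in enumerate(probs.cpu())}
    print(filename, scores)
```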
## Limitations

Important considerations when using this model:

1. The model expects 16 kHz mono audio; resample and downmix other formats as shown in the usage example above.
2. Performance may vary with types of audio manipulation that were not present in the training data.
3. The model was trained on audio samples ranging from 4 to 10 seconds in duration; for longer recordings, see the chunking sketch below.
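Because training clips were 4-10 seconds long, scoring a much longer recording as a single input may degrade results. A minimal sketch of one possible workaround, splitting the recording into fixed-size chunks and averaging the per-chunk probabilities, is shown below; the 10-second chunk length and mean aggregation are illustrative choices, not part of the model, and the function reuses `torch`, `extractor`, `model`, and `device` from the usage example above.

```python
def score_long_audio(audio_data, sr=16000, chunk_seconds=10):
    """Score a long mono 16 kHz clip by averaging chunk-level probabilities.

    Assumes `audio_data` is a 1-D numpy array at least 1 second long.
    Returns a numpy array [fake_prob, real_prob].
    """
    chunk_len = chunk_seconds * sr
    # Split into consecutive chunks, dropping trailing fragments under 1s
    chunks = [
        audio_data[i:i + chunk_len]
        for i in range(0, len(audio_data), chunk_len)
        if len(audio_data[i:i + chunk_len]) >= sr
    ]
    inputs = extractor(chunks, sampling_rate=sr, padding=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
    # Mean over chunks; other aggregations (e.g., max fake_prob) are possible
    return probs.mean(dim=0).cpu().numpy()
```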