Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Abstract
Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks.
Community
Neat. I would have liked to see metrics on transcription, diarization, or sentiment classification too.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Audiobox: Unified Audio Generation with Natural Language Prompts (2023)
- Distilling Vision-Language Models on Millions of Videos (2024)
- Audio-Visual LLM for Video Understanding (2023)
- Boosting Large Language Model for Speech Synthesis: An Empirical Study (2023)
- CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing (2024)
I wonder if this can be used for speaker diarization.