zzfive's Collections
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound
Generation
Paper
•
2405.18503
•
Published
•
9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music
Generation
Paper
•
2405.20289
•
Published
•
10
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive
Modeling of Audio Discrete Codes
Paper
•
2406.02897
•
Published
•
13
Audio Mamba: Bidirectional State Space Model for Audio Representation
Learning
Paper
•
2406.03344
•
Published
•
18
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and
Complex Reasoning Abilities
Paper
•
2406.11768
•
Published
•
20
Towards Robust Speech Representation Learning for Thousands of Languages
Paper
•
2407.00837
•
Published
•
10
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized
Sounds
Paper
•
2407.01494
•
Published
•
13
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of
Audio Events in Text-to-audio Generation
Paper
•
2407.02869
•
Published
•
18
FunAudioLLM: Voice Understanding and Generation Foundation Models for
Natural Interaction Between Humans and LLMs
Paper
•
2407.04051
•
Published
•
35
Video-to-Audio Generation with Hidden Alignment
Paper
•
2407.07464
•
Published
•
16
Masked Generative Video-to-Audio Transformers with Enhanced
Synchronicity
Paper
•
2407.10387
•
Published
•
6
Qwen2-Audio Technical Report
Paper
•
2407.10759
•
Published
•
55
Audio Conditioning for Music Generation via Discrete Bottleneck Features
Paper
•
2407.12563
•
Published
•
5
Paper
•
2407.14358
•
Published
•
24
Efficient Audio Captioning with Encoder-Level Knowledge Distillation
Paper
•
2407.14329
•
Published
•
4
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music
Generation
Paper
•
2407.15060
•
Published
•
9
Towards Achieving Human Parity on End-to-end Simultaneous Speech
Translation via LLM Agent
Paper
•
2407.21646
•
Published
•
18
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper
•
2407.21721
•
Published
•
8
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language
Models
Paper
•
2408.01337
•
Published
•
11
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual
Segmentation
Paper
•
2408.01708
•
Published
•
3
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio
Source Separation
Paper
•
2408.03588
•
Published
•
6
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency
Paper
•
2408.04708
•
Published
•
6
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform
Generation
Paper
•
2408.07547
•
Published
•
7
Accelerating High-Fidelity Waveform Generation via Adversarial Flow
Matching Optimization
Paper
•
2408.08019
•
Published
•
10
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling
Paper
•
2408.16532
•
Published
•
47
The VoxCeleb Speaker Recognition Challenge: A Retrospective
Paper
•
2408.14886
•
Published
•
9
Paper
•
2409.00587
•
Published
•
31
Density Adaptive Attention-based Speech Network: Enhancing Feature
Understanding for Mental Health Disorders
Paper
•
2409.00391
•
Published
•
4
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with
Adversarial Conditional Diffusion Distillation
Paper
•
2409.02245
•
Published
•
9
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
•
2409.06666
•
Published
•
55
SongCreator: Lyrics-based Universal Song Generation
Paper
•
2409.06029
•
Published
•
21
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper
•
2409.06135
•
Published
•
14
Seed-Music: A Unified Framework for High Quality and Controlled Music
Generation
Paper
•
2409.09214
•
Published
•
49
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion
Transformer
Paper
•
2409.10819
•
Published
•
18
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music
Processing
Paper
•
2409.10831
•
Published
•
4
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper
•
2409.12139
•
Published
•
12
SoloAudio: Target Sound Extraction with Language-oriented Audio
Diffusion Transformer
Paper
•
2409.08425
•
Published
•
9
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
Paper
•
2409.12962
•
Published
•
2
MuCodec: Ultra Low-Bitrate Music Codec
Paper
•
2409.13216
•
Published
•
23
Temporally Aligned Audio for Video with Autoregression
Paper
•
2409.13689
•
Published
•
8
Distilling an End-to-End Voice Assistant Without Instruction Training
Data
Paper
•
2410.02678
•
Published
•
22
Roadmap towards Superhuman Speech Understanding using Large Language
Models
Paper
•
2410.13268
•
Published
•
33
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic
Synchronization
Paper
•
2410.12957
•
Published
•
7
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper
•
2410.15316
•
Published
•
10
Continuous Speech Synthesis using per-token Latent Diffusion
Paper
•
2410.16048
•
Published
•
29
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec
Transformer
Paper
•
2409.00750
•
Published
•
3
Acoustic Volume Rendering for Neural Impulse Response Fields
Paper
•
2411.06307
•
Published
•
5
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for
Long-Term Expressive Symbolic Music Generation
Paper
•
2411.08307
•
Published
•
6
Video-Guided Foley Sound Generation with Multimodal Controls
Paper
•
2411.17698
•
Published
•
7
Multimodal Music Generation with Explicit Bridges and Retrieval
Augmentation
Paper
•
2412.09428
•
Published
•
7
Whisper-GPT: A Hybrid Representation Audio Large Language Model
Paper
•
2412.11449
•
Published
•
4
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
Paper
•
2412.09858
•
Published
•
1
Taming Multimodal Joint Training for High-Quality Video-to-Audio
Synthesis
Paper
•
2412.15322
•
Published
•
16
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
System?
Paper
•
2412.18495
•
Published
•
8
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
Matching and Clap-Ranked Preference Optimization
Paper
•
2412.21037
•
Published
•
20