HKUSTAudio/Llasa-3B · What are the differences between X-Codec2 and X-Codec?

scutrandom

2 days ago

RT

HKUST-Audio

HKUST Audio org 1 day ago

edited 1 day ago

Thanks for raising this important question. Brief answers:

Problems with existing codecs:

Acoustic codecs: e.g., Encodec, which uses RVQ (residual vector quantization) for tokenization. These codecs generally focus on acoustic-level information. When they are used to train audio LLMs, the LLM struggles to predict the "low-level fluctuations" of the audio, which are rather random. As a result, training the audio LLM needs much more data (to predict such fluctuations) and converges slowly.
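To make the RVQ idea concrete, here is a minimal pure-Python sketch of residual quantization for a single frame. The codebooks and the frame are made-up toy values, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical; this is not Encodec's actual implementation, only the general technique of quantizing the residual left by the previous level.

```python
# Toy residual vector quantization (RVQ). All values are illustrative
# assumptions, not taken from any real codec.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Each level quantizes the residual left over by the previous level."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # level 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # level 2: finer correction
]

frame = [1.2, 0.9]
codes = rvq_encode(frame, CODEBOOKS)   # one code index per level
recon = rvq_decode(codes, CODEBOOKS)
print(codes, recon)
```

Note that one frame yields one index per level, so the later levels mostly encode small corrections, which is the kind of fine detail that is hard for an LLM to predict.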

Semantic codecs: e.g., HuBERT, which uses a clustering-based method to extract high-level semantic information from audio. While this helps high-level modelling, it ignores the acoustic information. As such, models such as AudioLM and many other methods use two-stage modelling: a) use an LLM to generate the semantic sequence; b) render the acoustic detail in a NAR/AR way.
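As a rough sketch of the clustering-based tokenization mentioned above: each frame-level feature vector is mapped to the index of its nearest cluster centroid. The centroids and features below are invented toy values (in practice the centroids would come from something like k-means fitted on self-supervised features), and `semantic_tokens` is a hypothetical helper name.

```python
# Toy cluster-assignment tokenizer. Centroids and features are made-up
# assumptions standing in for k-means centroids and frame-level features.

def semantic_tokens(features, centroids):
    """Map each frame feature to the index of its nearest centroid."""
    tokens = []
    for feat in features:
        dists = [
            sum((f - c) ** 2 for f, c in zip(feat, centroid))
            for centroid in centroids
        ]
        tokens.append(dists.index(min(dists)))
    return tokens

CENTROIDS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
FEATURES = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.1], [0.1, 1.2]]

tokens = semantic_tokens(FEATURES, CENTROIDS)
print(tokens)
```

Two slightly different frames can collapse onto the same token here, which illustrates why such tokens keep the coarse content but drop acoustic detail.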

Attempts to combine the two kinds of information: a notable work is SpeechTokenizer, which puts the semantic information in the low-level codes and the acoustic information in the high-level codes. For audio generation, it uses an AR+NAR two-stage framework.

XCodec

In XCodec, we merge the acoustic and semantic information at all levels of the codebook. This a) makes single-stage streaming generation possible, so audio generation becomes as easy as with a traditional LLM; b) largely reduces the difficulty of LLM modelling (it can be viewed as predicting the semantics with the acoustics inside, or predicting the acoustic sequence with the guidance of semantics); c) opens the possibility of single-codebook generation.
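A very loose illustration of the "both kinds of information in one code" idea, assuming the simplest possible fusion (plain concatenation of an acoustic vector and a semantic vector before quantization). This is not the actual XCodec architecture, which uses learned networks; `fuse`, `quantize`, and all numbers are hypothetical.

```python
# Toy fused quantization: the code index depends on both the acoustic
# and the semantic part of the frame. Concatenation is an assumption
# made for illustration only.

def fuse(acoustic, semantic):
    """Concatenate the two per-frame feature vectors."""
    return list(acoustic) + list(semantic)

def quantize(vec, codebook):
    """Return the index of the nearest codebook entry (squared L2)."""
    dists = [sum((v - e) ** 2 for v, e in zip(vec, entry)) for entry in codebook]
    return dists.index(min(dists))

# Each entry covers 2 acoustic dims followed by 2 semantic dims.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
]

acoustic = [0.9, 0.1]
token_a = quantize(fuse(acoustic, [1.0, 0.0]), CODEBOOK)
token_b = quantize(fuse(acoustic, [0.0, 1.0]), CODEBOOK)
print(token_a, token_b)
```

The two frames share the same acoustic part but get different tokens, because the semantic part also drives the choice of code.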

XCodec2
Through several methods, we further improve XCodec substantially by a) reducing the TPS (tokens per second) and b) allowing a single codebook. This makes the tokenizer more practical and usable as a standard building block of audio LLMs. Details will be released soon.
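To see why TPS and the number of codebooks matter for an LLM, here is a small back-of-the-envelope calculation. It assumes all codebook levels are flattened into one token stream, and the frame rate and codebook counts are illustrative numbers only, not the official figures for XCodec or XCodec2.

```python
# Token-rate arithmetic under the assumption that every codebook level
# contributes one token per frame to a single flattened stream.

def tokens_per_second(frame_rate_hz, num_codebooks):
    """Tokens the LLM must predict for each second of audio."""
    return frame_rate_hz * num_codebooks

def sequence_length(duration_s, frame_rate_hz, num_codebooks):
    """Total tokens for a clip of the given duration."""
    return int(duration_s * tokens_per_second(frame_rate_hz, num_codebooks))

# Hypothetical settings for comparison.
rvq_tps = tokens_per_second(frame_rate_hz=50, num_codebooks=8)
single_tps = tokens_per_second(frame_rate_hz=50, num_codebooks=1)

rvq_len = sequence_length(10, 50, 8)
single_len = sequence_length(10, 50, 1)
print(rvq_tps, single_tps, rvq_len, single_len)
```

Under these assumed settings, a 10-second clip shrinks from 4000 tokens to 500, which directly shortens the sequences the LLM has to model.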

ZhenYe234

HKUST Audio org 1 day ago

1. RVQ -> single VQ
2. Multilingual speech semantic support
3. Better reconstruction quality