SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Haohe Liu^📮,1, Xuenan Xu², Yi Yuan¹, Mengyue Wu², Wenwu Wang¹, Mark D. Plumbley¹

¹CVSSP, University of Surrey, Guildford, UK

²Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

^📮Corresponding author

Abstract

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling language modelling techniques to be applied to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech, and they lack the semantic cues required for efficient language modelling. To address these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into at most a hundred tokens per second across diverse audio types, including speech, sound effects, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder built on the self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder that captures the remaining acoustic details. The outputs of the semantic and acoustic encoders are reconstructed into audio by a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of $25$, $50$, and $100$ per second, offering a balance between compression and quality. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec carries substantially richer semantic information than all other evaluated audio codecs, even at much lower bitrates.

Highlights

Ultra-low bit rate: We focus on bit rates between 0.31 kbps and 1.43 kbps, with token rates of 25, 50, or 100 per second.

Strong semantics in the audio tokens: Indicated by classification accuracy.

Support for variable vocabulary sizes: One model supports four different vocabulary sizes.

**Figure 1:** Overview of the SemantiCodec architecture. For an input audio clip, the quantized semantic representation $E_{s}$ is obtained via a codebook pre-computed using k-means clustering on the AudioMAE embeddings. Then $\mathbf{Y}$ and $E_{s}$ are concatenated and fed to a residual encoder that captures the remaining acoustic details; its output is discretized to $E_{a}$ by a vector quantization module. The SemantiCodec embedding $E$ is obtained by concatenating $E_{s}$ and $E_{a}$. A latent diffusion model is trained to generate the original audio clip conditioned on $E$.
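The encoding path described in the Figure 1 caption can be illustrated with a toy, dependency-free sketch. This is not the released implementation: the dimensions, the random codebooks, the single linear map standing in for the residual encoder, and the helper names (`nearest`, `rand_matrix`, `encode`) are all assumptions made for illustration. We also assume that $\mathbf{Y}$ denotes the unquantized AudioMAE embedding, which the caption does not state explicitly. The diffusion decoder is omitted.

```python
import random

random.seed(0)

def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )

def rand_matrix(rows, cols):
    """Random Gaussian matrix as a list of rows (toy stand-in for learned weights)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

DIM = 4        # toy embedding size (assumption)
N_FRAMES = 6   # toy number of frames (assumption)

semantic_codebook = rand_matrix(8, DIM)  # stands in for k-means centroids
acoustic_codebook = rand_matrix(8, DIM)  # stands in for the learned VQ codebook
proj = rand_matrix(DIM, 2 * DIM)         # hypothetical residual encoder: one linear map

Y = rand_matrix(N_FRAMES, DIM)           # assumed: unquantized AudioMAE embeddings

def encode(Y):
    # 1) Semantic tokens: nearest k-means centroid for each frame.
    semantic_ids = [nearest(y, semantic_codebook) for y in Y]
    E_s = [semantic_codebook[i] for i in semantic_ids]
    # 2) Residual encoder applied to the concatenation [Y ; E_s].
    residual = [
        [sum(w * x for w, x in zip(row, y + e)) for row in proj]
        for y, e in zip(Y, E_s)
    ]
    # 3) Acoustic tokens: vector quantization of the residual features.
    acoustic_ids = [nearest(r, acoustic_codebook) for r in residual]
    E_a = [acoustic_codebook[i] for i in acoustic_ids]
    # 4) Final embedding E: concatenation of E_s and E_a, frame by frame.
    E = [es + ea for es, ea in zip(E_s, E_a)]
    return semantic_ids, acoustic_ids, E

semantic_ids, acoustic_ids, E = encode(Y)
print(semantic_ids, acoustic_ids, len(E), len(E[0]))
```

In this sketch each frame yields one semantic index and one acoustic index, and `E` is the sequence a decoder would be conditioned on.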

**Figure 2:** Comparison with state-of-the-art neural audio codecs.

Waveform Reconstruction

✅ We provide the original and reconstructed audio samples for Encodec (3.0 and 1.5 kbps), HiFi-Codec (2.0 kbps), the Descript codec (DAC; 1.41, 0.78, and 0.47 kbps, reproduced using the open-source code), and the proposed SemantiCodec (1.43, 0.71, and 0.35 kbps).

✅ We report evaluation scores for ViSQOL, word error rate (WER), and classification accuracy.
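For readers unfamiliar with WER, the standard definition is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below illustrates that definition only; this page does not state which speech recognizer or text normalization was used for the reported scores, and the function name `word_error_rate` is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {100 * wer:.1f}%")
```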

Samples from MUSDB18 (Music)


| ID | Original | Encodec | HiFi-Codec | Encodec | DAC | SemantiCodec | DAC | SemantiCodec | DAC | SemantiCodec |
|---|---|---|---|---|---|---|---|---|---|---|
| Bit rate | / | 3.0 kbps | 2.0 kbps | 1.5 kbps | 1.41 kbps | 1.43 kbps | 0.78 kbps | 0.71 kbps | 0.47 kbps | 0.35 kbps |
| Token rate | / | 300/sec | 200/sec | 150/sec | 141/sec | 100/sec | 78/sec | 50/sec | 47/sec | 25/sec |
| ViSQOL-Avg ↑ | / | 3.82 | 3.57 | 3.33 | 3.13 | 3.81 | 2.82 | 3.55 | 2.39 | 3.17 |
| WER (%) ↓ | / | 3.7 | 3.6 | 5.0 | 5.0 | 3.4 | 11.6 | 5.1 | 28.2 | 19.6 |
| Accuracy (%) ↑ | / | 37.0 | 40.3 | 35.5 | 43.5 | 52.5 | 43.0 | 50.3 | 41.3 | 46.1 |

Samples 1 to 11: one audio clip per column of the table above.
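The bit rates and token rates in the table above imply an average number of bits per token, which helps explain how SemantiCodec reaches a similar bit rate with fewer tokens. The short calculation below uses only the values listed in the table and assumes 1 kbps = 1000 bits per second; the helper name `bits_per_token` is ours.

```python
# (system, bit rate in kbps, tokens per second), copied from the table above.
rows = [
    ("Encodec", 3.0, 300),
    ("HiFi-Codec", 2.0, 200),
    ("Encodec", 1.5, 150),
    ("DAC", 1.41, 141),
    ("SemantiCodec", 1.43, 100),
    ("DAC", 0.78, 78),
    ("SemantiCodec", 0.71, 50),
    ("DAC", 0.47, 47),
    ("SemantiCodec", 0.35, 25),
]

def bits_per_token(kbps: float, tokens_per_sec: float) -> float:
    """Average bits carried by one token, assuming 1 kbps = 1000 bits/s."""
    return kbps * 1000 / tokens_per_sec

for name, kbps, rate in rows:
    print(f"{name:<13} {kbps:>5.2f} kbps  {rate:>3}/sec  "
          f"{bits_per_token(kbps, rate):.1f} bits/token")
```

By this arithmetic the baseline columns work out to about 10 bits per token, while the SemantiCodec columns work out to roughly 14 bits per token, so each SemantiCodec token is drawn from a larger effective vocabulary.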

Samples from AudioSet (General Audio)


Samples 1 to 13: one audio clip per system column. The systems, bit rates, token rates, and metric scores are identical to those in the table of the MUSDB18 section above.

Samples from LibriTTS (Speech)


Samples 1 to 8: one audio clip per system column. The systems, bit rates, token rates, and metric scores are identical to those in the table of the MUSDB18 section above.

For more audio samples, please refer to the GitHub repository.