SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Haohe Liu 📮,1, Xuenan Xu2, Yi Yuan1, Mengyue Wu2, Wenwu Wang1, Mark D. Plumbley1
1CVSSP, University of Surrey, Guildford, UK
2Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
📮Corresponding author
Abstract
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling language modelling techniques to be applied to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech, and they lack the semantic clues required for efficient language modelling. To address these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, sound effects, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder that leverages the self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder that captures the remaining acoustic details. The outputs of the semantic and acoustic encoders are reconstructed into audio by a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of $25$, $50$, and $100$ per second, offering a balance between compression and quality. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs, even at much lower bitrates.
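To make the k-means discretization step in the abstract concrete, here is a minimal sketch of how continuous feature vectors can be mapped to discrete tokens. It is not the authors' implementation: random vectors stand in for AudioMAE features, the feature dimension and codebook size are toy values chosen for illustration, and the names `fit_kmeans` and `assign_token` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_token(vec, codebook):
    """Return the index of the nearest centroid; that index is the token."""
    return min(range(len(codebook)), key=lambda j: sq_dist(vec, codebook[j]))

def fit_kmeans(features, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids (the codebook)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(features, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for vec in features:
            clusters[assign_token(vec, codebook)].append(vec)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

# Assumption: random vectors stand in for encoder features (toy sizes).
rng = random.Random(0)
dim, k = 8, 16
train = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]
codebook = fit_kmeans(train, k)

# Tokenize a new "clip": one integer token per feature vector.
clip = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
tokens = [assign_token(v, codebook) for v in clip]
print(tokens[:10])
```

Each token is only an index into the codebook, so whatever the nearest centroid fails to represent is lost at this stage; in the architecture described above, capturing the remaining detail is the role of the acoustic encoder.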
Highlights
Waveform Reconstruction
✅ We provide the original and reconstructed audio samples for Encodec (3.0 and 1.5 kbps), HiFi-Codec (2.0 kbps), the Descript codec (DAC; 1.41, 0.78, and 0.47 kbps, reproduced using the open-source code), and the proposed SemantiCodec (1.43, 0.71, and 0.35 kbps).
✅ We report evaluation scores for ViSQOL, word error rate (WER), and classification accuracy.
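The bit rates and token rates listed in the tables on this page are linked by simple arithmetic: bit rate equals token rate times the average number of bits per token. The sketch below derives the implied bits per token for each system. It assumes 1 kbps = 1000 bits per second and uses the rounded table values, so the results are approximate; the names `systems` and `bits_per_token` are illustrative only.

```python
# (bit rate in kbps, token rate in tokens/sec), copied from the tables on this page.
systems = {
    "Encodec 3.0 kbps": (3.0, 300),
    "HiFi-Codec 2.0 kbps": (2.0, 200),
    "Encodec 1.5 kbps": (1.5, 150),
    "DAC 1.41 kbps": (1.41, 141),
    "SemantiCodec 1.43 kbps": (1.43, 100),
    "DAC 0.78 kbps": (0.78, 78),
    "SemantiCodec 0.71 kbps": (0.71, 50),
    "DAC 0.47 kbps": (0.47, 47),
    "SemantiCodec 0.35 kbps": (0.35, 25),
}

def bits_per_token(kbps, tokens_per_sec):
    """Average bits carried by one token (assumes 1 kbps = 1000 bit/s)."""
    return kbps * 1000 / tokens_per_sec

implied = {name: bits_per_token(*rates) for name, rates in systems.items()}
for name, bits in implied.items():
    print(f"{name}: {bits:.1f} bits/token")
```

Under these assumptions, every baseline setting works out to 10 bits per token, which is what a 1024-entry codebook would give (2^10 = 1024), while the three SemantiCodec settings imply roughly 14.0 to 14.3 bits per token. SemantiCodec therefore spends more bits on each token but emits far fewer tokens per second.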
Samples from MUSDB18 (Music)
ID | Original | Encodec | HiFi-Codec | Encodec | DAC | SemantiCodec | DAC | SemantiCodec | DAC | SemantiCodec |
---|---|---|---|---|---|---|---|---|---|---|
Bit rate | / | 3.0 kbps | 2.0 kbps | 1.5 kbps | 1.41 kbps | 1.43 kbps | 0.78 kbps | 0.71 kbps | 0.47 kbps | 0.35 kbps |
Token rate | / | 300/sec | 200/sec | 150/sec | 141/sec | 100/sec | 78/sec | 50/sec | 47/sec | 25/sec |
ViSQOL-Avg ↑ | / | 3.82 | 3.57 | 3.33 | 3.13 | 3.81 | 2.82 | 3.55 | 2.39 | 3.17 |
WER (%)↓ | / | 3.7 | 3.6 | 5.0 | 5.0 | 3.4 | 11.6 | 5.1 | 28.2 | 19.6 |
Accuracy (%)↑ | / | 37.0 | 40.3 | 35.5 | 43.5 | 52.5 | 43.0 | 50.3 | 41.3 | 46.1 |
1-11 | audio | audio | audio | audio | audio | audio | audio | audio | audio | audio |
Samples from AudioSet (General Audio)
ID | Original | Encodec | HiFi-Codec | Encodec | DAC | SemantiCodec | DAC | SemantiCodec | DAC | SemantiCodec |
---|---|---|---|---|---|---|---|---|---|---|
Bit rate | / | 3.0 kbps | 2.0 kbps | 1.5 kbps | 1.41 kbps | 1.43 kbps | 0.78 kbps | 0.71 kbps | 0.47 kbps | 0.35 kbps |
Token rate | / | 300/sec | 200/sec | 150/sec | 141/sec | 100/sec | 78/sec | 50/sec | 47/sec | 25/sec |
ViSQOL-Avg ↑ | / | 3.82 | 3.57 | 3.33 | 3.13 | 3.81 | 2.82 | 3.55 | 2.39 | 3.17 |
WER (%)↓ | / | 3.7 | 3.6 | 5.0 | 5.0 | 3.4 | 11.6 | 5.1 | 28.2 | 19.6 |
Accuracy (%)↑ | / | 37.0 | 40.3 | 35.5 | 43.5 | 52.5 | 43.0 | 50.3 | 41.3 | 46.1 |
1-13 | audio | audio | audio | audio | audio | audio | audio | audio | audio | audio |
Samples from LibriTTS (Speech)
ID | Original | Encodec | HiFi-Codec | Encodec | DAC | SemantiCodec | DAC | SemantiCodec | DAC | SemantiCodec |
---|---|---|---|---|---|---|---|---|---|---|
Bit rate | / | 3.0 kbps | 2.0 kbps | 1.5 kbps | 1.41 kbps | 1.43 kbps | 0.78 kbps | 0.71 kbps | 0.47 kbps | 0.35 kbps |
Token rate | / | 300/sec | 200/sec | 150/sec | 141/sec | 100/sec | 78/sec | 50/sec | 47/sec | 25/sec |
ViSQOL-Avg ↑ | / | 3.82 | 3.57 | 3.33 | 3.13 | 3.81 | 2.82 | 3.55 | 2.39 | 3.17 |
WER (%)↓ | / | 3.7 | 3.6 | 5.0 | 5.0 | 3.4 | 11.6 | 5.1 | 28.2 | 19.6 |
Accuracy (%)↑ | / | 37.0 | 40.3 | 35.5 | 43.5 | 52.5 | 43.0 | 50.3 | 41.3 | 46.1 |
1-8 | audio | audio | audio | audio | audio | audio | audio | audio | audio | audio |
For more audio samples, please refer to the GitHub repository.