Neural Vocoder is All You Need for Speech Super-resolution

in Proceedings of INTERSPEECH 2022, 2022

Authors

Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang

Abstract

Speech super-resolution (SR) is the task of increasing the sampling rate of speech by generating high-frequency components. Existing SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio. These strong constraints can lead to poor generalization in mismatched real-world cases. In this paper, we propose a neural-vocoder-based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios, and works well on real-world recordings. NVSR consists of a mel-bandwidth extension module, a neural vocoder module, and post-processing. Our proposed system achieves state-of-the-art results in multiple input and target sampling rate settings. On the 44.1 kHz target resolution, NVSR outperforms the state-of-the-art models WSRGlow and Nu-wave by 8% and 37%, respectively, on log-spectral distance and achieves significantly better perceptual quality. We demonstrate that prior knowledge in the pre-trained vocoder is crucial by performing mel-bandwidth extension with a simple replication-padding method.
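The replication-padding baseline mentioned in the abstract can be pictured with a small sketch. This page does not spell out the exact procedure, so the code below shows only one plausible reading: every mel bin above the input bandwidth is filled with the value of the highest valid bin (edge replication). The function name `replication_pad_mel` and the parameter `cutoff_bin` are hypothetical and are not taken from the released code.

```python
def replication_pad_mel(mel, cutoff_bin):
    """Fill missing high-frequency mel bins by edge replication.

    mel: list of frames, each a list of (log-)mel values ordered from
         low to high frequency.
    cutoff_bin: number of valid low-frequency bins; bins at index
         >= cutoff_bin are treated as missing (assumption: the caller
         knows the input bandwidth).
    """
    if cutoff_bin < 1:
        raise ValueError("cutoff_bin must be at least 1")
    padded = []
    for frame in mel:
        if cutoff_bin > len(frame):
            raise ValueError("cutoff_bin exceeds the number of mel bins")
        edge_value = frame[cutoff_bin - 1]  # highest valid bin in this frame
        n_missing = len(frame) - cutoff_bin
        # Keep the valid bins and repeat the edge value over the missing band.
        padded.append(frame[:cutoff_bin] + [edge_value] * n_missing)
    return padded


# Toy example: 2 frames, 6 mel bins, only the lowest 3 bins are valid.
low_band_mel = [
    [-1.0, -2.0, -3.0, -11.0, -11.0, -11.0],
    [-0.5, -1.5, -2.5, -11.0, -11.0, -11.0],
]
extended_mel = replication_pad_mel(low_band_mel, cutoff_bin=3)
```

In the pipeline described above, an extended mel spectrogram like this would then be passed to the pre-trained vocoder, which is the component the abstract credits with supplying the prior knowledge.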

[Download the full preprint version]

Evaluation results on the VCTK Multi-Speaker test set:

44.1 kHz target sampling rate	16 kHz target sampling rate
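For readers unfamiliar with log-spectral distance (LSD), the metric quoted in the abstract, the following is a minimal sketch of a commonly used definition: the root-mean-square difference of log power spectra over frequency, averaged over frames. It assumes the two magnitude spectrograms have already been computed with identical STFT settings. The function name and the `eps` stabilizer are illustrative choices, and the evaluation tool in the repository may differ in its details.

```python
import math


def log_spectral_distance(ref_mag, est_mag, eps=1e-8):
    """Average per-frame RMS difference between log10 power spectra.

    ref_mag, est_mag: lists of frames, each a list of non-negative
    STFT magnitudes (same number of frames and frequency bins).
    eps: small constant to avoid log10(0) (an assumption of this sketch).
    """
    if not ref_mag or len(ref_mag) != len(est_mag):
        raise ValueError("spectrograms must be non-empty with equal frame counts")
    total = 0.0
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        if not ref_frame or len(ref_frame) != len(est_frame):
            raise ValueError("frames must be non-empty with equal bin counts")
        squared_sum = 0.0
        for r, e in zip(ref_frame, est_frame):
            diff = math.log10(r * r + eps) - math.log10(e * e + eps)
            squared_sum += diff * diff
        # RMS over frequency bins for this frame.
        total += math.sqrt(squared_sum / len(ref_frame))
    # Mean over frames.
    return total / len(ref_mag)


# Toy check: the estimate is 10x larger in magnitude everywhere,
# i.e. 100x in power, so every log10 power difference is 2.
reference = [[1.0, 1.0], [1.0, 1.0]]
estimate = [[10.0, 10.0], [10.0, 10.0]]
lsd_value = log_spectral_distance(reference, estimate)
```

Lower LSD is better; identical spectrograms give a distance of zero.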

Code

Our code is open-sourced at https://github.com/haoheliu/ssr_eval. This repo includes our pre-trained model and a tool for evaluating speech super-resolution algorithms.

Quick demo

For more demos, please visit this site

Comparison with SOTA methods on p363_010.wav

[Audio and spectrogram table. Columns: Unprocessed, *Nuwave, *WSRGlow, NVSR (Proposed), Target. Rows: 4 kHz, 8 kHz, and 12 kHz inputs, each with audio and a spectrogram.]

The performance of NVSR on different input resolutions on p361_005.wav

[Audio and spectrogram table. Columns: 2 kHz, 4 kHz, 8 kHz, 16 kHz, 24 kHz. Rows: Unprocessed, NVSR, and Target, each with audio and a spectrogram.]

Example of bandwidth mismatch cases on p362_016.wav

If WSRGlow is trained to perform only 24 kHz -> 48 kHz super-resolution, it fails on the mismatched 4 kHz and 8 kHz inputs.

[Audio and spectrogram table. Columns: Unprocessed, *WSRGlow, Target. Rows: Trained_24k_Test_4k and Trained_24k_Test_8k, each with audio and a spectrogram.]

Citation

@misc{liu2022neural,
      title={Neural Vocoder is All You Need for Speech Super-resolution}, 
      author={Haohe Liu and Woosung Choi and Xubo Liu and Qiuqiang Kong and Qiao Tian and DeLiang Wang},
      year={2022},
      eprint={2203.14941},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
