Ultra-Lightweight Neural Differential DSP Vocoder for High Quality Speech Synthesis

Abstract

Recent advances in speech synthesis have produced several vocoders that use purely neural networks to generate raw waveforms. Although such methods yield high-quality audio, even the most efficient of them, such as MB-MelGAN, fail to meet the performance constraints of low-end embedded devices such as smart glasses or watches. A purely digital-signal-processing (DSP) based vocoder can be implemented with fast Fourier transforms and is therefore orders of magnitude faster than any neural vocoder, but it typically yields lower audio quality. Combining the best of both worlds, we propose an ultra-lightweight differential DSP system that jointly optimizes an acoustic model with a DSP vocoder, achieving audio quality comparable to neural vocoders while remaining as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, runs at 15 MFLOPS and achieves a vocoder-only RTF of 0.003 and an overall RTF of 0.044 on CPU, surpassing MB-MelGAN by 350 times.
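The real-time factor (RTF) figures in the abstract compare synthesis time to the duration of the audio produced. A minimal sketch of that definition (the function name and the 10-second example duration are illustrative, not from the paper):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: synthesis time divided by audio duration.

    Values below 1.0 mean the system runs faster than real time;
    the smaller the RTF, the larger the headroom on the device.
    """
    return synthesis_seconds / audio_seconds


# Illustrative example: producing 10 s of audio in 0.03 s of compute
# corresponds to the paper's reported vocoder-only RTF of 0.003.
print(rtf(0.03, 10.0))
```

Under this definition, the reported 350x advantage over MB-MelGAN is a ratio of the two systems' RTFs on the same hardware.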

Supplementary audio samples

Female Speaker

GroundTruth | WaveRNN | HifiGAN | MB-MelGAN | DSP Vocoder | DSP Vocoder Adv | DDSP Vocoder

Male Speaker

GroundTruth | WaveRNN | HifiGAN | MB-MelGAN | DSP Vocoder | DSP Vocoder Adv | DDSP Vocoder