demo code

The aim of multi-speaker emotional speech synthesis is to generate speech for a designated speaker in a desired emotional state. The task is challenging because speech varies in noise, content, and timbre, all of which can obstruct emotion extraction and transfer. This paper proposes a new approach to multi-speaker emotional speech synthesis. The proposed method, which is based on a seq2seq synthesizer, integrates an emotion embedding as a conditioning variable to convey accurate emotional information from the reference audio to the synthesized speech. To strengthen the emotion representation, we use a three-dimensional acoustic feature as input. We also propose an emotion generalization module with adaptive instance normalization (AdaIN) to obtain an emotion embedding with high generalization ability, which in turn improves controllability. The emotion embedding derived from the generalization module can be readily conditioned by affine parameters, allowing control of both the emotion category and the emotion intensity of the synthesized speech. Experimental results across various emotional speech synthesis settings show that the proposed method achieves state-of-the-art performance in multi-speaker emotional speech synthesis, together with high emotion controllability.
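The AdaIN operation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a feature map laid out as channels by frames and per-channel affine parameters `gamma` and `beta`; the abstract does not say how the paper computes these parameters or how it maps them to emotion intensity, so the `scale_affine` helper below is purely hypothetical and only shows one way an intensity knob could act on affine parameters.

```python
import math


def adain(features, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over a (channels x frames) feature map.

    Each channel is normalized to zero mean and unit variance across frames,
    then rescaled by gamma[c] and shifted by beta[c].
    """
    out = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([g * (v - mean) / std + b for v in channel])
    return out


def scale_affine(gamma, beta, intensity):
    """Hypothetical intensity knob (not from the paper).

    Interpolates between identity affine parameters (gamma=1, beta=0)
    at intensity 0 and the given parameters at intensity 1.
    """
    g = [1.0 + intensity * (x - 1.0) for x in gamma]
    b = [intensity * x for x in beta]
    return g, b


# Toy input: 2 channels, 4 frames each (illustrative values only).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
gamma = [2.0, 0.5]   # assumed per-channel scale
beta = [0.3, -1.0]   # assumed per-channel shift

styled = adain(features, gamma, beta)
half_gamma, half_beta = scale_affine(gamma, beta, 0.5)
half_styled = adain(features, half_gamma, half_beta)
```

After `adain`, each channel's mean equals the corresponding `beta` and its standard deviation is approximately the corresponding `gamma`, which is why changing the affine parameters changes the statistics that the downstream synthesizer is conditioned on.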

Figure: model architecture

Introduction for "Controllable Multi-Speaker Emotional Speech Synthesis With Emotion Representation of High Generalization Capability"

References

2023
