Multi-Prosody Guided Intensity Controllable Emotional Speech Synthesis

Abstract

Although recent advances in speech synthesis have produced highly natural and vivid speech, controlling emotional intensity remains challenging. Existing approaches typically rely on intensity labels or on direct manipulation of emotional embeddings, which leads to limited controllability or degraded speech quality. To address these issues, this paper introduces a prosody-guided controllable emotional speech synthesis model. Instead of directly altering emotional embeddings, a strategy that may distort the underlying emotion space, the proposed model incorporates an emotional intensity controller that regulates intensity through two prosodic attributes: pitch and energy. Specifically, the controller predicts pitch and energy from text representations enriched with emotional cues, and applies two scaling factors, $\alpha$ for pitch and $\beta$ for energy, to modulate them. This design preserves the integrity of the original emotional representation and removes the dependence on predefined intensity labels. Experimental results demonstrate that the proposed approach achieves flexible control of emotional intensity while maintaining speech naturalness and expressiveness comparable to real recordings.
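
The scaling step described above can be illustrated with a minimal sketch. It assumes that the two factors act multiplicatively on the predicted contours and that pitch and energy are available as per-step sequences; the abstract does not specify either detail. The function name `scale_prosody` and the numeric values are hypothetical and serve only as illustration, not as the paper's implementation.

```python
from typing import List, Sequence, Tuple


def scale_prosody(
    pitch: Sequence[float],
    energy: Sequence[float],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Scale predicted pitch and energy contours.

    Assumption (not stated in the abstract): alpha and beta are applied
    as plain multiplicative factors, alpha to pitch and beta to energy.
    With alpha = beta = 1.0 the predicted contours are left unchanged.
    """
    scaled_pitch = [alpha * p for p in pitch]
    scaled_energy = [beta * e for e in energy]
    return scaled_pitch, scaled_energy


# Hypothetical predictor outputs for four steps (illustrative values only).
predicted_pitch = [1.0, 2.0, 3.0, 4.0]
predicted_energy = [0.5, 1.0, 1.5, 2.0]

# Factors above 1.0 exaggerate the contours; factors below 1.0 attenuate them.
stronger_pitch, stronger_energy = scale_prosody(
    predicted_pitch, predicted_energy, alpha=1.5, beta=2.0
)
neutral_pitch, neutral_energy = scale_prosody(predicted_pitch, predicted_energy)
```

Under this reading, the emotional embedding itself is never modified: only the prosody values passed downstream change, which is consistent with the claim that the original emotional representation is preserved.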