Multi-Prosody Guided Intensity Controllable Emotional Speech Synthesis

Abstract

While recent advances have achieved highly natural and vivid speech generation, controlling emotional intensity remains challenging. Existing approaches typically rely on intensity labels or on direct manipulation of emotion embeddings, which leads to limited controllability or degraded speech quality. To address these issues, this paper introduces a prosody-guided controllable emotional speech synthesis model. Instead of directly altering emotion embeddings, a strategy that may distort the underlying emotion space, the proposed model incorporates an emotional intensity controller that regulates intensity through two prosodic attributes: pitch and energy. Specifically, the controller predicts pitch and energy from text representations enriched with emotional cues, and applies two scaling factors, $\alpha$ and $\beta$, to modulate them. This design preserves the integrity of the original emotional representation and avoids dependence on predefined intensity labels. Experimental results demonstrate that the proposed approach achieves flexible control of emotional intensity while maintaining speech naturalness and expressiveness comparable to real recordings.
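
The scaling step described in the abstract can be illustrated with a minimal sketch. It is not the model's actual implementation: it assumes that $\alpha$ multiplies the predicted pitch contour and $\beta$ multiplies the predicted energy contour element-wise, which the abstract does not state explicitly. The function name `scale_prosody`, the contour values, and the per-level ($\alpha$, $\beta$) pairs are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def scale_prosody(
    pitch: List[float],
    energy: List[float],
    alpha: float,
    beta: float,
) -> Tuple[List[float], List[float]]:
    """Scale predicted pitch by alpha and predicted energy by beta.

    Assumption: plain element-wise multiplication. The exact form of the
    modulation used by the model is not given in the abstract.
    """
    scaled_pitch = [alpha * p for p in pitch]
    scaled_energy = [beta * e for e in energy]
    return scaled_pitch, scaled_energy


# Hypothetical predicted contours (arbitrary units), standing in for the
# controller's pitch and energy predictions.
predicted_pitch = [1.0, 2.0, 0.0, 4.0]
predicted_energy = [0.5, 1.0, 0.25, 2.0]

# Hypothetical (alpha, beta) settings for three intensity levels.
intensity_levels: Dict[str, Tuple[float, float]] = {
    "low": (0.5, 0.5),
    "medium": (1.0, 1.0),
    "strong": (1.5, 2.0),
}

for name, (alpha, beta) in intensity_levels.items():
    pitch, energy = scale_prosody(predicted_pitch, predicted_energy, alpha, beta)
    print(name, pitch, energy)
```

Under this assumption, setting both factors to 1 leaves the predicted prosody unchanged, and the emotion embedding itself is never modified; only the pitch and energy contours are rescaled.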


1. The Architecture of the Proposed Model

[Figure: architecture of the proposed model]


2. Demo: Style Transfer for Emotional TTS on the ESD Dataset

To facilitate a fair comparison, we synthesize audio for the five emotion categories below using five models.

Emotion | Reference | Target Speaker | CME-TTS | ME-TTS | wav2vec2+VITS | Ours w/o Intensity Controller | Ours
Happy
Angry
Neutral
Sad
Surprise

3. Demo: Style Transfer for Emotional TTS on the DOE Dataset

To facilitate a fair comparison, we synthesize audio in four emotions using five models.

Emotion | Reference | Target Speaker | CME-TTS | ME-TTS | wav2vec2+VITS | Ours w/o Intensity Controller | Ours
Happy
Angry
Sad
Surprise

4. Demo: Emotion Strength Control in Emotional TTS

To facilitate a fair comparison, we use the same text to synthesize speech in four emotions at three intensity levels.
Text: 雨后的空气充斥着青草的味道 ("The air after the rain is filled with the scent of fresh grass.")


Scaling Factor

Emotion | Low Intensity | Medium Intensity | Strong Intensity
Happy
Angry
Sad
Surprise

Relative Attribute

Emotion | Low Intensity | Medium Intensity | Strong Intensity
Happy
Angry
Sad
Surprise

Ours

Emotion | Low Intensity | Medium Intensity | Strong Intensity
Happy
Angry
Sad
Surprise