Demo page for "CETTS"
Multi-Prosody Guided Intensity Controllable Emotional Speech Synthesis
Abstract
While recent advances have achieved highly natural and vivid speech generation, controlling emotional intensity remains challenging. Existing approaches typically rely on intensity labels or direct emotional embedding manipulation, leading to limited controllability or degraded speech quality. To address these issues, this paper introduces a prosody-guided controllable emotional speech synthesis model. Instead of directly altering emotional embeddings—a strategy that may distort the underlying emotion space—the proposed model incorporates an emotional intensity controller that regulates intensity through two prosodic attributes: pitch and energy. Specifically, the controller predicts pitch and energy from text representations enriched with emotional cues, and applies two scaling factors, $\alpha$ and $\beta$, to modulate them. This design preserves the integrity of the original emotional representation while avoiding the dependence on predefined intensity labels. Experimental results demonstrate that the proposed approach achieves flexible control of emotional intensity while maintaining speech naturalness and expressiveness comparable to real recordings.
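As a rough illustration of the intensity control described in the abstract, the sketch below scales predicted pitch and energy contours by $\alpha$ and $\beta$. It is not the authors' implementation: the abstract does not say how the factors are applied, so multiplicative scaling is an assumption, and the `Prosody` container, the `scale_prosody` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prosody:
    """Hypothetical container for per-frame prosody predictions."""
    pitch: List[float]   # assumed per-frame pitch values (e.g. in Hz)
    energy: List[float]  # assumed per-frame energy values


def scale_prosody(pred: Prosody, alpha: float = 1.0, beta: float = 1.0) -> Prosody:
    """Scale pitch by alpha and energy by beta.

    Assumption: the factors act multiplicatively on the predicted contours;
    the paper's controller may apply them differently.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("scaling factors are assumed to be positive")
    return Prosody(
        pitch=[alpha * p for p in pred.pitch],
        energy=[beta * e for e in pred.energy],
    )


# Made-up predictions for three frames, used only to exercise the function.
predicted = Prosody(pitch=[200.0, 220.0, 210.0], energy=[0.5, 0.8, 0.6])
weak = scale_prosody(predicted, alpha=0.5, beta=0.5)
strong = scale_prosody(predicted, alpha=1.5, beta=1.5)
```

Under this assumption, the emotion embedding itself is never modified; only the two prosody contours passed on to the synthesizer change, with factors of 1.0 leaving the prediction untouched.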
1. The Architecture of the Proposed Model

2. Demo: Style Transfer for Emotional TTS with ESD dataset
To facilitate a fair comparison, we synthesize audio samples in five emotions using five models.
| Emotion | Reference | Target Speaker | CME-TTS | ME-TTS | wav2vec2+VITS | Ours w/o Intensity Controller | Ours |
|---|---|---|---|---|---|---|---|
| Happy | |||||||
| Angry | |||||||
| Neutral | |||||||
| Sad | |||||||
| Surprise | |||||||
3. Demo: Style Transfer for Emotional TTS with DOE dataset
To facilitate a fair comparison, we synthesize audio samples in four emotions using five models.
| Emotion | Reference | Target Speaker | CME-TTS | ME-TTS | wav2vec2+VITS | Ours w/o Intensity Controller | Ours |
|---|---|---|---|---|---|---|---|
| Happy | |||||||
| Angry | |||||||
| Sad | |||||||
| Surprise | |||||||
4. Demo: Emotion Strength Control in Emotional TTS
To facilitate a fair comparison, we use the same text to synthesize speech in four emotions at three intensity levels.
Text: 雨后的空气充斥着青草的味道 ("The air after the rain is filled with the smell of grass.")
Scaling Factor
| Emotion | Low Intensity | Medium Intensity | Strong Intensity |
|---|---|---|---|
| Happy | |||
| Angry | |||
| Sad | |||
| Surprise | |||
Relative Attribute
| Emotion | Low Intensity | Medium Intensity | Strong Intensity |
|---|---|---|---|
| Happy | |||
| Angry | |||
| Sad | |||
| Surprise | |||
Ours
| Emotion | Low Intensity | Medium Intensity | Strong Intensity |
|---|---|---|---|
| Happy | |||
| Angry | |||
| Sad | |||
| Surprise | |||