Controllable Emotional Speech Synthesis via Emotion Transfer

Abstract

Synthesizing expressive speech that follows the style of a reference audio clip is a key problem in emotional speech synthesis. While recent models produce natural and clear speech, controlling emotional intensity remains a challenge. To address this, we propose a VITS-based TTS model with controllable emotional intensity. We incorporate a pre-trained Emotion2Vec model and design an emotion intensity controller. Emotional embeddings extracted from the reference audio by Emotion2Vec are fused with phoneme-level text features to enable emotion transfer. We hypothesize, and confirm through experiments, that emotional intensity correlates with pitch and energy. We therefore build the emotion intensity controller around a pitch predictor and an energy predictor to provide global-level control over emotional strength. Experiments show that our model synthesizes speech with quality comparable to ground truth and that emotional intensity can be controlled without degrading audio quality.


1. The Architecture of the Proposed Model

[Figure: architecture of the proposed model]
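
The abstract states that an utterance-level emotional embedding is fused with phoneme-level text features. The exact fusion operator is not given on this page, so the following is only a minimal sketch under an assumed design: the emotion vector is linearly projected to the text feature size and added to every phoneme vector. The names `project`, `fuse_emotion`, and the matrix `W` are hypothetical and used for illustration only.

```python
from typing import List

Vector = List[float]


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Linear projection; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse_emotion(phoneme_feats: List[Vector], emotion_emb: Vector,
                 weight: List[Vector]) -> List[Vector]:
    """Add one projected utterance-level emotion vector to every phoneme vector.

    Assumption: additive fusion after projection. The actual model may use
    concatenation, attention, or another operator.
    """
    projected = project(emotion_emb, weight)
    if any(len(feat) != len(projected) for feat in phoneme_feats):
        raise ValueError("projected emotion size must match phoneme feature size")
    return [[p + e for p, e in zip(feat, projected)] for feat in phoneme_feats]


# Toy example: 3 phonemes with 2-dim features and a 3-dim emotion embedding.
phonemes = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
emotion = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical 2x3 projection matrix
fused = fuse_emotion(phonemes, emotion, W)
```

Because the same projected vector is added at every phoneme position, the emotion acts as a global condition on the text encoding, which matches the utterance-level reference described in the abstract.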


2. Demo: Style Transfer for Emotional TTS with the ESD Dataset

To facilitate a fair comparison, we synthesize audio in five emotions using five models.

Emotion | Reference | Target Speaker | CME-TTS | ME-TTS | wav2vec2+VITS | Ours w/o Intensity Controller | Ours
Happy
Angry
Neutral
Sad
Surprise

3. Demo: Style Transfer for Emotional TTS with the DOE Dataset

To facilitate a fair comparison, we synthesize audio in four emotions using five models.

Emotion | Reference | Target Speaker | CME-TTS | ME-TTS | wav2vec2+VITS | Ours w/o Intensity Controller | Ours
Happy
Angry
Sad
Surprise

4. Demo: Emotion Strength Control in Emotional TTS

To facilitate a fair comparison, we use the same text to synthesize speech in four emotions at three intensity levels.
Text: 雨后的空气充斥着青草的味道 (English: "The air after the rain is filled with the scent of green grass.")


Scaling Factor

Emotion | Low Intensity | Medium Intensity | Strong Intensity
Happy
Angry
Sad
Surprise

Relative Attribute

Emotion | Low Intensity | Medium Intensity | Strong Intensity
Happy
Angry
Sad
Surprise

Ours

Emotion | Low Intensity | Medium Intensity | Strong Intensity
Happy
Angry
Sad
Surprise
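
The abstract describes global-level intensity control built around a pitch predictor and an energy predictor, but this page does not specify how the predictors are steered. As a rough illustration only, the sketch below assumes a single global factor that scales each predicted contour's deviation from its mean: values below 1.0 flatten the contour and values above 1.0 exaggerate it. The function `scale_contour` and all numeric values are hypothetical and not taken from the model.

```python
from typing import List


def scale_contour(values: List[float], intensity: float) -> List[float]:
    """Scale deviations from the contour mean by a global intensity factor.

    intensity == 1.0 leaves the contour unchanged; the mean is preserved
    for any factor. This is an assumed control rule, not the paper's method.
    """
    if intensity < 0:
        raise ValueError("intensity must be non-negative")
    if not values:
        return []
    mean = sum(values) / len(values)
    return [mean + intensity * (v - mean) for v in values]


# Hypothetical per-phoneme predictor outputs.
pitch = [100.0, 120.0, 140.0]   # Hz
energy = [0.25, 0.5, 0.75]      # arbitrary units

weak_pitch = scale_contour(pitch, 0.5)      # flatter pitch contour
strong_pitch = scale_contour(pitch, 1.5)    # wider pitch excursions
strong_energy = scale_contour(energy, 2.0)  # wider energy excursions
```

Under this assumption the low, medium, and strong settings would differ only in one scalar, which is what makes the control global rather than per-phoneme.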