Multi-granularity Prosodic Speech Synthesis with Grammar Information

Personalized speech synthesis involves learning an individual’s unique speaking patterns from reference audio to generate speech that mimics the speaker’s habitual rhythm. However, the one-to-many mapping between text and speech, combined with the dynamic nature of personalized speech style characteristics in the reference audio, makes accurate personalized speech synthesis challenging. In this paper, we propose a Grammar Infused Multi-granularity Prosody Network (GMG ProsodyNet) for personalized speech synthesis. Specifically: 1) We model prosodic features hierarchically at the utterance, context syntax, word, and phoneme levels. Following the hierarchical nature of prosody, the utterance-level prosodic feature guides the prediction of the fine-grained prosodic features. We introduce a context syntax encoder to improve the prediction accuracy of duration, pitch, and energy in the synthesized audio. The proposed word-level prosodic encoder efficiently extracts pitch dynamics and speech continuity features from the word spectrum. To control subtle prosodic nuances, we employ phoneme-level prosodic modeling. 2) We extract syntax-level prosodic features from a syntax graph with words as nodes, encoded by a Graph Neural Network (GNN), exploiting grammatical dependencies between distant words. Experimental results demonstrate that the proposed GMG ProsodyNet can effectively encode delicate and personalized prosodic features, leading to improved speech synthesis quality, fluency, and naturalness.
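The idea of a syntax graph with words as nodes can be illustrated with a minimal sketch. This is not the GMG ProsodyNet implementation: the function names `build_syntax_graph` and `propagate`, the hand-written dependency arcs, and the mean-aggregation update are all assumptions chosen for illustration, standing in for a real dependency parser and a trained GNN layer.

```python
from typing import Dict, List, Tuple


def build_syntax_graph(words: List[str],
                       arcs: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Return an undirected adjacency list with one node per word index."""
    adjacency: Dict[int, List[int]] = {i: [] for i in range(len(words))}
    for head, dependent in arcs:
        adjacency[head].append(dependent)
        adjacency[dependent].append(head)
    return adjacency


def propagate(features: List[List[float]],
              adjacency: Dict[int, List[int]]) -> List[List[float]]:
    """One round of mean aggregation over each node and its neighbours.

    A trained GNN layer would add learned weights and a nonlinearity;
    plain averaging is used here only to show how information flows.
    """
    updated = []
    for node, neighbours in adjacency.items():
        group = [node] + neighbours
        dim = len(features[node])
        updated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dim)]
        )
    return updated


words = ["the", "kitten", "weighs", "twenty", "eight", "pounds"]
# Hypothetical (head, dependent) arcs written by hand, not parser output.
arcs = [(1, 0), (2, 1), (2, 5), (5, 4), (4, 3)]
graph = build_syntax_graph(words, arcs)

# Toy one-hot features so each dimension tracks one source word.
features = [[1.0 if i == j else 0.0 for j in range(len(words))]
            for i in range(len(words))]
hidden = propagate(features, graph)    # each word sees its direct neighbours
hidden2 = propagate(hidden, graph)     # each word sees words two arcs away
```

After one round, "the" carries no information from "weighs"; after two rounds it does, because the two words are connected through "kitten". Stacking propagation steps along dependency arcs is one way a graph encoder can relate words that are far apart in the linear word order.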

1. TTS Samples in the Ablation Study

We provide audio samples generated by models obtained by progressively removing the context syntax encoder and the word-level prosodic predictor from GMG ProsodyNet.

Parallel Prosodic Speech Synthesis

Reference/Target Text: Printing in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition.

Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre

Reference/Target Text: It is of the first importance that the letter used should be fine in form.

Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre

Reference/Target Text: That the forms of printed letters should follow more or less closely those of the written character and they followed them very closely.

Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre

Reference/Target Text: The lower case being in fact invented in the early middle ages.

Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre

Reference/Target Text: They discarded this for a more completely roman and far less beautiful letter.

Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre

Non-Parallel Prosodic Speech Synthesis

Target Text: The kitten weighs twenty eight pounds.

Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre

2. TTS Samples in the Comparison Study

The following samples compare GMG ProsodyNet with other speech synthesis methods: AdaSpeech, FG-transformerTTS, and SyntaSpeech.

Parallel Prosodic Speech Synthesis

Reference/Target Text: Printing in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition.

Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet

Reference/Target Text: It is of the first importance that the letter used should be fine in form.

Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet

Reference/Target Text: That the forms of printed letters should follow more or less closely those of the written character and they followed them very closely.

Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet

Reference/Target Text: The lower case being in fact invented in the early middle ages.

Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet

Reference/Target Text: They discarded this for a more completely roman and far less beautiful letter.

Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet

Non-Parallel Prosodic Speech Synthesis

Target Text: The kitten weighs twenty eight pounds.

Reference Speech | AdaSpeech | FG-transformerTTS | GMG ProsodyNet