Demo for GMG ProsodyNet
Multi-granularity Prosodic Speech Synthesis with Grammar Information
- Personalized speech synthesis learns an individual’s unique speaking patterns from a reference audio clip and generates speech that mimics the speaker’s habitual rhythm. However, the one-to-many mapping between text and speech, combined with the dynamic nature of personalized style characteristics in the reference audio, makes accurate personalized speech synthesis challenging. In this paper, we propose a Grammar Infused Multi-granularity Prosody Network (GMG ProsodyNet) for personalized speech synthesis. Specifically: 1) We model prosodic features hierarchically at the utterance, context syntax, word, and phoneme levels. Following the hierarchical property of prosodic features, the utterance-level prosodic feature guides the prediction of the finer-grained prosodic features. A context syntax encoder improves the prediction accuracy of duration, pitch, and energy in the synthesized audio. The proposed word-level prosodic encoder extracts pitch dynamics and speech continuity features from the word spectrum, and phoneme-level prosodic modeling controls subtle prosodic nuances. 2) We extract syntax-level prosodic features from a syntax graph, built with words as nodes and modeled by a Graph Neural Network (GNN), exploiting grammatical dependencies between distant words. Experimental results demonstrate that the proposed GMG ProsodyNet effectively encodes delicate, personalized prosodic features, leading to improved speech synthesis quality, fluency, and naturalness.
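To make the idea of a syntax graph with words as nodes more concrete, here is a minimal sketch in plain Python. It is not the GMG ProsodyNet implementation: the dependency heads are hand-written for illustration, the one-hot features stand in for learned word embeddings, and `build_syntax_graph` and `mean_aggregate` are hypothetical helper names. The single round of mean aggregation is a generic stand-in for a GNN layer, chosen only to show how information flows along dependency arcs between words that are far apart in the sentence.

```python
from typing import List


def build_syntax_graph(heads: List[int]) -> List[List[int]]:
    """Build an undirected adjacency list from dependency heads.

    heads[i] is the index of the head word of word i, or -1 for the root.
    Each word is a node and each dependency arc becomes an undirected edge.
    A self-loop is added so a node keeps its own features when aggregating.
    """
    neighbors = [{i} for i in range(len(heads))]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].add(head)
            neighbors[head].add(child)
    return [sorted(s) for s in neighbors]


def mean_aggregate(features: List[List[float]],
                   graph: List[List[int]]) -> List[List[float]]:
    """One round of mean-aggregation message passing over the graph."""
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in graph
    ]


# Toy sentence taken from the non-parallel samples on this page.
words = ["The", "kitten", "weighs", "twenty", "eight", "pounds"]
# Hand-written dependency heads (an assumption, not parser output):
# The->kitten, kitten->weighs, weighs=root, twenty->eight,
# eight->pounds, pounds->weighs.
heads = [1, 2, -1, 4, 5, 2]

graph = build_syntax_graph(heads)
# One-hot vectors stand in for learned word features.
features = [[1.0 if i == j else 0.0 for j in range(len(words))]
            for i in range(len(words))]
smoothed = mean_aggregate(features, graph)

for word, nbrs in zip(words, graph):
    print(word, "<->", [words[j] for j in nbrs])
```

In this toy graph, "weighs" and "pounds" are three positions apart in the word sequence but directly connected by a dependency arc, so a single aggregation step already mixes their features.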
1. TTS Samples in the Ablation Study
We provide audio samples generated by ablated models, obtained by progressively removing the context syntax encoder and the word-level prosodic predictor from GMG ProsodyNet.
Parallel Prosodic Speech Synthesis
Reference/Target Text: Printing in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition.
Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre |
---|---|---|---|
Reference/Target Text: It is of the first importance that the letter used should be fine in form.
Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre |
---|---|---|---|
Reference/Target Text: That the forms of printed letters should follow more or less closely those of the written character and they followed them very closely.
Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre |
---|---|---|---|
Reference/Target Text: The lower case being in fact invented in the early middle ages.
Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre |
---|---|---|---|
Reference/Target Text: They discarded this for a more completely roman and far less beautiful letter.
Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre |
---|---|---|---|
Non-Parallel Prosodic Speech Synthesis
Target Text: The kitten weighs twenty eight pounds.
Reference Speech | GMG ProsodyNet | GMG ProsodyNet w/o CSynEnc | GMG ProsodyNet w/o CSynencwpre |
---|---|---|---|
2. TTS Samples in the Comparison Study
The following samples compare speech synthesized by different methods: AdaSpeech, FG-transformerTTS, SyntaSpeech, and GMG ProsodyNet.
Parallel Prosodic Speech Synthesis
Reference/Target Text: Printing in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition.
Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet |
---|---|---|---|---|
Reference/Target Text: It is of the first importance that the letter used should be fine in form.
Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet |
---|---|---|---|---|
Reference/Target Text: That the forms of printed letters should follow more or less closely those of the written character and they followed them very closely.
Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet |
---|---|---|---|---|
Reference/Target Text: The lower case being in fact invented in the early middle ages.
Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet |
---|---|---|---|---|
Reference/Target Text: They discarded this for a more completely roman and far less beautiful letter.
Reference Speech | AdaSpeech | FG-transformerTTS | SyntaSpeech | GMG ProsodyNet |
---|---|---|---|---|
Non-Parallel Prosodic Speech Synthesis
Target Text: The kitten weighs twenty eight pounds.
Reference Speech | AdaSpeech | FG-transformerTTS | GMG ProsodyNet |
---|---|---|---|
References
2024
- Multi-granularity Prosodic Speech Synthesis with Grammar Information (submitted). Neural Networks, 2024.