Emotion Control in FastSpeech2-Based Speech Synthesis: Comparative Analysis of Prosody Scaling, Supervised Training, and Fine-Tuning

Zhang, Hanyu (2025) Emotion Control in FastSpeech2-Based Speech Synthesis: Comparative Analysis of Prosody Scaling, Supervised Training, and Fine-Tuning. Master thesis, Voice Technology (VT).

PDF: MA-S5838975-H-Zhang.pdf (677 kB)

Abstract

Emotion is a critical component of human communication, and enabling synthetic speech to express emotion effectively remains a major challenge in modern text-to-speech (TTS) systems. This study investigates three modeling strategies within the FastSpeech2 framework, namely pitch and duration control, training from scratch, and fine-tuning, to assess their impact on the naturalness and emotional expressiveness of synthesized speech. To evaluate these methods, three emotional categories (Sad, Angry, Happy) were synthesized with each approach and assessed through a subjective listening test. Participants rated the naturalness of each sample on a five-point MOS scale and selected the most emotionally expressive version among the alternatives. The results show that the fine-tuned model significantly outperforms the others, achieving the highest naturalness score (MOS = 4.47) and an emotion recognition accuracy of 72%. In contrast, the pitch-controlled and scratch-trained models scored lower and were not consistently perceived as emotionally expressive. These findings demonstrate that fine-tuning with expressive data is the most effective and resource-efficient approach to building emotionally rich synthetic voices. The full demo and audio samples are publicly available at https://burgundy07.github.io/emotion-demo/.
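For readers curious about the pitch and duration control condition: in typical open-source FastSpeech2 implementations, prosody scaling amounts to multiplying the variance adaptor's predicted pitch and duration values by user-chosen factors at inference time. The sketch below illustrates that general idea; all function and variable names are illustrative assumptions, not code from this thesis.

import torch

def scale_prosody(pitch: torch.Tensor,
                  durations: torch.Tensor,
                  pitch_scale: float = 1.0,
                  duration_scale: float = 1.0):
    """Scale predicted prosody before it conditions the decoder.

    pitch:     per-phoneme pitch predictions from the pitch predictor
    durations: per-phoneme durations (in frames) from the duration predictor
    """
    scaled_pitch = pitch * pitch_scale
    # Durations are frame counts, so keep them positive after scaling.
    scaled_dur = torch.clamp(torch.round(durations * duration_scale), min=1)
    return scaled_pitch, scaled_dur

# Hypothetical example: a rough "Sad" setting might lower pitch and slow speech.
pitch = torch.randn(1, 20)                        # dummy pitch contour
durations = torch.randint(2, 10, (1, 20)).float()  # dummy frame counts
sad_pitch, sad_dur = scale_prosody(pitch, durations,
                                   pitch_scale=0.85, duration_scale=1.2)

The specific scale factors shown (0.85 and 1.2) are placeholders; the thesis evaluates its own settings per emotion category.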

Item Type: Thesis (Master)
Name supervisor: Nayak, S.
Date Deposited: 05 Nov 2025 10:05
Last Modified: 05 Nov 2025 10:05
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/774
