Nicolaij, Fabiènne (2025) From Text to Feeling. Fine-Tuning FastSpeech2 for Emotion Expression. Master thesis, Voice Technology (VT).
PDF: MasterThesisVTs4983416NicolaijF.pdf (379kB)
Abstract
This study investigates whether FastSpeech 2 can produce emotionally expressive speech without explicit use of emotion embeddings. Two training strategies were compared: (1) training FastSpeech 2 from scratch using an English subset of the Emotional Speech Dataset (ESD), and (2) fine-tuning a FastSpeech 2 model pre-trained on the neutral LJSpeech corpus with the same emotional ESD subset. The emotions available in this dataset are “neutral”, “happy”, “angry”, “sad”, and “surprise”. The two models were evaluated using spectral fidelity (Mel-Cepstral Distortion; MCD) and prosodic accuracy (Mean Absolute Error; MAE). Pitch, duration, and intensity were analysed as they are the fundamental prosodic features in FastSpeech 2. The results demonstrate that the scratch-trained model (setup 1) outperforms the fine-tuned model (setup 2) across all metrics. Setup 1 showed lower MCD and lower MAE across all emotional categories, indicating better reproduction of emotional prosody. Setup 2 struggled to reproduce expressive prosody, particularly for high-intensity emotions, suggesting that pre-training on neutral speech introduces a bias that limits the model’s ability to adapt to emotional variability. In addition, prosodic patterns in setup 1 closely reflected those measured in the ground-truth data, whereas setup 2 often produced flattened pitch contours and shortened durations. These findings indicate that FastSpeech 2 can implicitly learn and reproduce emotion-specific prosody when trained directly on expressive data. Training from scratch, even on a small dataset, can outperform fine-tuning from a neutral model. This highlights the importance of matching training data to task objectives in emotional speech synthesis and suggests that emotional speech synthesis benefits from training on expressive datasets rather than large neutral ones.
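For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how MCD and MAE are commonly computed. It is not taken from the thesis: the function names are hypothetical, the MCD uses the widely used formulation that excludes the 0th (energy) coefficient, and both functions assume the reference and synthesised sequences have already been time-aligned to the same length (for example with dynamic time warping).

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Assumption: both inputs are lists of equal-length coefficient lists and
    are already aligned frame by frame. The 0th coefficient is skipped.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D (0th excluded).
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


def mean_absolute_error(ref_values, syn_values):
    """MAE between two aligned prosodic tracks (pitch, duration, or intensity)."""
    if not ref_values or len(ref_values) != len(syn_values):
        raise ValueError("sequences must be non-empty and equally long")
    return sum(abs(r - s) for r, s in zip(ref_values, syn_values)) / len(ref_values)
```

Lower values on both metrics mean the synthesised speech is closer to the ground-truth recording, which is the sense in which setup 1 is reported to outperform setup 2.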
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Verkhodanova, V. and Coler, M.L. |
| Date Deposited: | 15 Oct 2025 12:05 |
| Last Modified: | 15 Oct 2025 12:05 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/773 |
