
Towards Fine-Grained Emotional Modulation in FastSpeech 2 with Hierarchical Emotion Distributions

Huang, Qiyan (2025) Towards Fine-Grained Emotional Modulation in FastSpeech 2 with Hierarchical Emotion Distributions. Master thesis, Voice Technology (VT).

Full text: MAS5858895QHuang.pdf (PDF, 1 MB)

Abstract

Emotional speech synthesis has made substantial progress; however, interpretable and fine-grained prosody control remains a persistent challenge. Existing systems often rely on global emotion labels or latent style embeddings, which limits precise temporal manipulation of emotional expression. This thesis introduces a novel approach to emotional prosody control by integrating phoneme-aligned Hierarchical Emotion Distributions (HED) into the non-autoregressive FastSpeech 2 architecture. The method enables interpretable emotion conditioning by injecting 12-dimensional HED vectors after the variance adaptor, supported by a gradual training strategy for stable convergence. Experiments, conducted on the English subset of the Emotional Speech Dataset (ESD), employed multiple evaluation settings: sentence- and phoneme-level acoustic analysis, inference-time intensity manipulation, and perceptual testing via Best-Worst Scaling (BWS). Models were compared across emotion categories and training stages to assess control effectiveness and robustness. Results demonstrate that HED conditioning yields consistent, emotion-specific prosodic patterns with clearly distinguishable pitch and energy trajectories. Furthermore, inference-time manipulation of HED vectors results in predictable changes in emotional intensity, confirming the proposed system’s controllability. Subjective ratings align with the acoustic findings, showing listener preference for HED-guided outputs. This research contributes a structured and interpretable framework for emotional speech synthesis, advancing the controllability of non-autoregressive TTS, and supports future applications in expressive voice technologies, virtual agents, and human-computer interaction.
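The conditioning scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: the module and parameter names (`HEDConditioning`, `hidden_dim`, `intensity`), the hidden size, and the scaling interface are assumptions; only the 12-dimensional HED vector, its injection point after the variance adaptor, and the idea of inference-time intensity manipulation come from the abstract.

```python
import torch
import torch.nn as nn


class HEDConditioning(nn.Module):
    """Sketch: project a phoneme-aligned 12-dim HED vector and add it to the
    variance-adaptor output of a FastSpeech 2-style model. The projection
    layer and hidden width are illustrative assumptions."""

    def __init__(self, hed_dim: int = 12, hidden_dim: int = 256):
        super().__init__()
        # Linear projection lifts the 12-dim HED vector to the decoder width.
        self.proj = nn.Linear(hed_dim, hidden_dim)

    def forward(self, adaptor_out: torch.Tensor, hed: torch.Tensor,
                intensity: float = 1.0) -> torch.Tensor:
        # adaptor_out: (batch, frames, hidden_dim) -- variance adaptor output
        # hed:         (batch, frames, hed_dim)    -- HED upsampled to frame level
        # `intensity` scales the HED vector at inference time to strengthen or
        # weaken the expressed emotion (the abstract's intensity manipulation).
        return adaptor_out + self.proj(intensity * hed)


# Usage sketch with dummy tensors (shapes are assumptions).
cond = HEDConditioning()
adaptor_out = torch.randn(2, 80, 256)   # e.g. 80 decoder frames
hed = torch.rand(2, 80, 12)             # HED values assumed to lie in [0, 1]
decoder_in = cond(adaptor_out, hed, intensity=1.5)  # amplified emotion
```

A scalar multiplier is only one possible interface for intensity control; the point of the sketch is that, because the HED vector enters additively after the variance adaptor, it can be scaled or edited at inference time without retraining.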

Item Type: Thesis (Master)
Supervisor: Verkhodanova, V.
Date Deposited: 16 Jun 2025 11:32
Last Modified: 16 Jun 2025 11:32
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/666
