Cross-Cultural Perception of Emotional Text-to-Speech: A Pilot Study on Mandarin

He, Zhizhi (2025) Cross-Cultural Perception of Emotional Text-to-Speech: A Pilot Study on Mandarin. Master thesis, Voice Technology (VT).

Abstract

Emotional text-to-speech (TTS) synthesis has expanded rapidly worldwide, with implementations across diverse languages and cultural contexts. Understanding how listeners with different cultural and linguistic backgrounds perceive synthetic emotional speech is therefore crucial for effective cross-cultural deployment of these technologies. This study investigates whether native and non-native Mandarin speakers differ significantly in how they perceive emotions in synthetic speech. Stimuli were generated using the Expressive-FastSpeech2 model (Lee, 2021) trained on the Emotional Speech Dataset (ESD) (Zhou, Sisman, Liu, & Li, 2022) to produce Mandarin emotional speech in five categories: neutral, happy, sad, angry, and surprise. A cross-cultural evaluation was conducted with 38 participants (20 native Mandarin speakers, 18 non-native Mandarin speakers), each of whom evaluated 10 randomized stimuli through both emotion recognition tasks and a naturalness assessment, using a balanced Latin Square design. Results demonstrate substantial cross-cultural differences in emotional speech perception. Native speakers achieved higher emotion recognition accuracy (M = 0.790, SD = 0.387) than non-native speakers (M = 0.533, SD = 0.439), with converging statistical evidence supporting meaningful group differences. While the parametric t-test only approached significance (p = 0.063), the non-parametric Mann-Whitney U test indicated a significant difference (U = 248, p = 0.033), with an effect size of Cohen’s d = 0.623. Naturalness perception showed large and significant differences between groups (t(36) = 3.887, p < 0.001, d = 1.263), with native speakers rating the synthesized speech as substantially more natural (M = 3.51) than non-native speakers did (M = 3.06).
Most importantly, consistent with theoretical expectations based on cross-cultural emotion research (Sauter, Eisner, Ekman, & Scott, 2010), positive emotions demonstrated significantly larger cross-cultural perception gaps than negative emotions. The happy emotion showed the most pronounced cultural difference (a gap of 46.7 percentage points, t = 3.212, p = 0.003, d = 1.043), while the negative emotions (sad, angry) showed smaller, non-significant gaps of approximately 18.9 percentage points each. This pattern supports theoretical frameworks suggesting that positive emotional expressions are more culturally specific, while negative emotions rely more heavily on universal biological signals. These findings reveal significant cultural differences in emotional TTS perception and establish the need for culturally adaptive evaluation frameworks in TTS development. The research provides the first systematic evidence for emotion-specific cultural differences in synthetic speech perception within a tonal language context, with direct implications for improving the cross-cultural usability of emotional speech technologies.

Key words: text-to-speech, expressive emotional speech synthesis, cross-cultural perception, emotion recognition, speech synthesis evaluation
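
The effect sizes reported in the abstract can be related to the reported t-statistics with a short calculation. The sketch below is illustrative and not taken from the thesis: it assumes an independent-samples (equal-variance) t-test and a pooled-standard-deviation Cohen's d, for which d = t * sqrt(1/n1 + 1/n2). The function name `cohens_d_from_t` is hypothetical, and it is also an assumption that the happy-emotion test used all 38 participants, since the abstract does not state its degrees of freedom.

```python
import math


def cohens_d_from_t(t: float, n1: int, n2: int) -> float:
    """Convert an independent-samples t-statistic to Cohen's d.

    Assumes the pooled-SD definition of d and an equal-variance t-test,
    in which case d = t * sqrt(1/n1 + 1/n2).
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Group sizes as stated in the abstract.
N_NATIVE = 20
N_NON_NATIVE = 18

# t-statistics as stated in the abstract.
d_naturalness = cohens_d_from_t(3.887, N_NATIVE, N_NON_NATIVE)
# Assumption: the happy-emotion comparison used the same group sizes.
d_happy = cohens_d_from_t(3.212, N_NATIVE, N_NON_NATIVE)

print(f"naturalness: d = {d_naturalness:.3f}")
print(f"happy:       d = {d_happy:.3f}")
```

Under these assumptions, the conversion agrees with the reported values (d = 1.263 and d = 1.043) to within about 0.001, a difference attributable to rounding of the t-statistics.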

Item Type:	Thesis (Master)
Name supervisor:	Nayak, S.
Date Deposited:	30 Jun 2025 08:01
Last Modified:	30 Jun 2025 08:01
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/681
