Liang, Hao-Wei (2025) Cross-lingual Voice Conversion and Its Prosodic Impact on Perceived Naturalness. Master thesis, Voice Technology (VT).
Abstract
This thesis presents an exploratory investigation into the role of prosodic control in cross-lingual voice conversion between Taiwanese Mandarin and American English. As multilingual communication becomes more common in speech interfaces, language learning, and accessibility technologies, producing speech that sounds natural across language boundaries is a growing area of interest. However, the influence of prosodic features, particularly pitch and energy, on perceived naturalness in cross-lingual synthesis remains relatively underexplored, especially between typologically distinct languages such as tonal and non-tonal systems. To explore this question, a FastSpeech2-based voice conversion model was trained using two open-source corpora: a subset of the Common Voice corpus containing Taiwanese Mandarin and a subset of the LJSpeech corpus containing American English. The two datasets were combined and used to train a single multilingual model. During inference, prosodic features were controlled under four conditions: baseline (no adjustment), pitch-only, energy-only, and combined pitch and energy control. The goal was to assess how these adjustments affect the perceived naturalness of synthesized speech. A subjective listening test with 50 participants was conducted, in which each version was rated on a 5-point Likert scale. The results showed that the baseline condition consistently received the highest naturalness scores, while the prosody-controlled versions, particularly the combined condition, were rated noticeably lower. This suggests that naive prosodic manipulation, without linguistic adaptation, may negatively affect the fluency and perceived coherence of synthesized cross-lingual speech. To confirm that prosodic changes were successfully implemented, average pitch (F0) and RMS energy were extracted and compared across versions.
Additionally, automatic speech recognition (ASR) metrics such as character error rate (CER) and word error rate (WER) were calculated as supplementary indicators of acoustic robustness. These scores are not intended to reflect human intelligibility, but rather to observe how prosody scaling affects system-level recognition. This study offers initial insights into the limitations of uniform prosody control in cross-lingual voice conversion. The findings suggest that context-aware, linguistically informed prosody strategies may be needed to improve naturalness when converting between typologically diverse languages.
Keywords: Voice Conversion, Cross-lingual, FastSpeech2, Prosody Control, Perceived Naturalness, Pitch and Energy.
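As a rough illustration of the CER and WER metrics mentioned in the abstract, the following minimal Python sketch computes both from a reference transcript and an ASR hypothesis using Levenshtein edit distance. It is not the thesis's evaluation code: the function names (`edit_distance`, `wer`, `cer`), the whitespace tokenization for WER, and the choice to ignore spaces for CER are all assumptions made for this example, and the abstract does not state which ASR system or text normalization was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count.
    Assumption: words are separated by whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Assumption: spaces are ignored, which suits unsegmented Mandarin text."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

In a setup like the one described, WER would typically be the natural choice for the English utterances and CER for the Mandarin ones, with each score compared across the four prosody conditions.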
Item Type: | Thesis (Master) |
---|---|
Name supervisor: | Verkhodanova, V. |
Date Deposited: | 16 Jun 2025 11:32 |
Last Modified: | 16 Jun 2025 11:32 |
URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/671 |