Wei, Yilan (2024) An Innovative Method for Multi-Effect Speech Synthesis through Training File Modification. Master thesis, Voice Technology (VT).
PDF: MSc-s5515939-Y-Wei.pdf (702kB)
Abstract
Human speakers naturally and flexibly adjust speech rate, intonation, and voice intensity during communication. However, such dynamic changes are often inadequately modeled in current speech synthesis research. Most existing studies focus on generating audio with specific emotional tones (e.g., happy, sad, angry), but few address the synthesis of audio with varied speech modifications, such as changes in speech rate and pitch within a single sentence. To address this gap, this study proposes an innovative method for multi-effect speech synthesis using the FastSpeech2 model by precisely modifying the training files and corresponding audio data. Experimental results demonstrate that this approach significantly enhances the model's ability to reproduce target speech modifications, yielding excellent performance in Chinese, English, and Spanish. Numerical analyses and manual listening assessments validate the model's sensitivity to speech rate adjustments and its accuracy in reproducing them. Additionally, the study demonstrates the cross-linguistic generalizability and validity of the method, indicating a wide range of potential applications. This method is expected to contribute to more emotionally expressive and diverse audio synthesis, advancing speech synthesis technology.
Item Type: | Thesis (Master) |
---|---|
Name supervisor: | Coler, M.L. and Nayak, S. |
Date Deposited: | 06 Aug 2024 08:02 |
Last Modified: | 06 Aug 2024 08:02 |
URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/544 |