Wang, Yinqiu (2024) Code-switching speech synthesis for Mandarin-English using FastSpeech2: A unified IPA-based approach. Master thesis, Voice Technology (VT).
Abstract
As multilingual societies and cross-cultural interaction become more common, the ability to synthesize natural-sounding code-switching speech has become important for communication and accessibility. However, the scarcity of suitable code-switched datasets and the inherent complexity of handling multiple languages within a single utterance pose significant challenges for text-to-speech (TTS) systems. This study explores Mandarin-English code-switching speech synthesis based on FastSpeech2, with the goal of synthesizing speech that is both intelligible and natural. It investigates two methods: (1) directly modeling Mandarin and English phonemes; (2) unifying the input format for both languages as phonological features (PF) based on the International Phonetic Alphabet (IPA). Additionally, because the currently available open-source Mandarin-English code-switching datasets are designed for Automatic Speech Recognition (ASR) and have lower audio quality, this study recorded 500 high-quality Mandarin-English code-switching audio clips as a fine-tuning dataset to improve the quality of the synthesized speech. The two methods are evaluated with subjective listening tests. According to the Mean Opinion Score (MOS) results, directly modeling phonemes can produce intelligible speech, while modeling PF can produce speech that is both intelligible and natural. Successful development of code-switching TTS systems, as explored here, can facilitate communication across languages, with applications in education, media, and assistive technologies.

Audio samples are available on the demo page: https://wangyinqiu.github.io/Mandarin_and_English_CS_TTS/

Keywords: Code-switching Speech Synthesis, Text-to-Speech Synthesis, Phonological Features, Multilingual Speech Technology
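To make the second method in the abstract more concrete, the following is a minimal sketch of representing IPA symbols as binary phonological feature vectors, so that Mandarin and English input share one format. The feature names, the symbol table, and the function names are assumptions chosen for illustration; they are not the inventory or code used in the thesis.

```python
# Illustrative feature inventory (an assumption, not the thesis's actual set).
FEATURES = ["consonantal", "voiced", "nasal", "labial", "high", "back", "round"]

# Hypothetical lookup table: IPA symbol -> set of active features.
IPA_FEATURES = {
    "m": {"consonantal", "voiced", "nasal", "labial"},
    "p": {"consonantal", "labial"},
    "b": {"consonantal", "voiced", "labial"},
    "i": {"voiced", "high"},
    "u": {"voiced", "high", "back", "round"},
    "y": {"voiced", "high", "round"},  # front rounded vowel, found in Mandarin
}


def to_feature_vector(ipa_symbol):
    """Return a 0/1 vector with one entry per name in FEATURES."""
    active = IPA_FEATURES[ipa_symbol]
    return [1 if name in active else 0 for name in FEATURES]


def encode(ipa_sequence):
    """Encode a sequence of IPA symbols as a list of feature vectors."""
    return [to_feature_vector(symbol) for symbol in ipa_sequence]


if __name__ == "__main__":
    # The same symbols get the same encoding regardless of source language.
    for symbol, vector in zip(["m", "i", "y"], encode(["m", "i", "y"])):
        print(symbol, vector)
```

Under this kind of encoding, a syllable such as /mi/ maps to the same vectors whether it comes from a Mandarin or an English word, while a language-specific sound such as /y/ still differs from /i/ and /u/ in only one or two features.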
Item Type: | Thesis (Master) |
---|---|
Name supervisor: | Zhu, Li |
Date Deposited: | 01 Aug 2024 10:10 |
Last Modified: | 01 Aug 2024 10:10 |
URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/542 |