Exploring the Impact of Prosodic Styles in Datasets on Mandarin Speech Synthesis Using BERT-VITS

Liao, Yanhua (2024) Exploring the Impact of Prosodic Styles in Datasets on Mandarin Speech Synthesis Using BERT-VITS. Master thesis, Voice Technology (VT).

Abstract

Broadcasting has been a major channel for information dissemination for over a century and continues to play an irreplaceable role. Over the past decade, text-to-speech (TTS) technology based on deep learning has emerged and achieved numerous successes for Mandarin. Many radio stations have introduced TTS models into the production of news broadcasts to improve efficiency. However, these technologies are still rarely applied to entertainment programs, which call for a more relaxed and natural delivery. This thesis builds on BERT-VITS, a state-of-the-art text-to-speech synthesis system that combines BERT (Bidirectional Encoder Representations from Transformers) for natural language understanding with VITS (Variational Inference Text-to-Speech) for high-quality speech synthesis. We first train the TTS model on a large open-source dataset whose prosodic style is biased towards news broadcasting. We then fine-tune the model on a smaller dataset with a more relaxed and natural prosodic style. The experiment investigates how base training and subsequent fine-tuning on datasets with different prosodic styles affect the prosody of the model's output audio. By combining the BERT-VITS model with datasets of different prosodic styles, we aim to create more diverse prosodic styles for broadcasting programs and thereby enrich the types of artificial intelligence (AI) broadcasting. Our experimental results demonstrate that the system can accurately and dynamically adjust voice timbre and prosodic style to match the reference speech. Even when trained on limited data, it can synthesize speech that matches the target speaker's prosody and style. The generated speech has good naturalness and inference speed, making it suitable for broadcasting content.
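
The two-stage procedure described in the abstract (base training on a large dataset, then fine-tuning on a smaller one) can be illustrated with a toy sketch. This is not the thesis's code and does not involve BERT-VITS: it assumes a one-parameter model y = w * x, and the names `train`, `news_style`, and `relaxed_style` are hypothetical stand-ins chosen only to show a warm start from pretrained weights followed by a shorter run at a lower learning rate.

```python
def train(w, data, lr, epochs):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical stand-ins for the two corpora (assumption, not thesis data):
# a large set following one pattern and a small set following another.
news_style = [(k / 100, 2.0 * k / 100) for k in range(1, 101)]   # 100 examples, slope 2.0
relaxed_style = [(k / 10, 3.0 * k / 10) for k in range(1, 11)]   # 10 examples, slope 3.0

# Stage 1: base training from scratch on the large dataset.
w_pretrained = train(0.0, news_style, lr=0.5, epochs=200)

# Stage 2: fine-tuning starts from the pretrained weight, with fewer
# epochs and a smaller learning rate, on the small dataset.
w_finetuned = train(w_pretrained, relaxed_style, lr=0.1, epochs=20)

print(f"after base training: w = {w_pretrained:.3f}")
print(f"after fine-tuning:   w = {w_finetuned:.3f}")
```

In this toy setting the fine-tuned parameter moves from the value learned on the large dataset towards the one implied by the small dataset, which is the general behaviour a fine-tuning stage is meant to produce; how far a real TTS model shifts depends on its architecture, data, and training schedule.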

Item Type:	Thesis (Master)
Name supervisor:	Do, T.P.
Date Deposited:	19 Sep 2025 13:28
Last Modified:	19 Sep 2025 13:28
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/546
