Singing Voice Synthesis in Your Language: Cross-Lingual Transfer with Limited Data Using Diffusion Models

Dong, Jiashu (2025) Singing Voice Synthesis in Your Language: Cross-Lingual Transfer with Limited Data Using Diffusion Models. Master thesis, Voice Technology (VT).

Abstract

Singing Voice Synthesis (SVS) has achieved remarkable progress with diffusion-based models such as DiffSinger (J. Liu, Li, Ren, Chen, & Zhao, 2022), enabling expressive and high-fidelity singing generation. However, most existing SVS systems are trained primarily on English and Chinese datasets, limiting access for musicians and music enthusiasts from other linguistic communities. Extending SVS to more languages could democratize music production and contribute to the preservation of global cultural diversity. This work explores cross-lingual transfer learning for SVS, using DiffSinger as the base system and German as the target language. We hypothesize that fine-tuning an English-trained DiffSinger model on a small amount of German data, leveraging a phoneme mapping strategy based on PHOIBLE (Moran & McCloy, 2019), can achieve performance comparable to that of a model trained from scratch on a large-scale monolingual German dataset. Furthermore, we investigate the influence of training data quality in low-resource scenarios. Given the same limited data size, we hypothesize that models fine-tuned on higher-quality data (characterized by native accent, broader vocal range, and clean recording conditions) will outperform those fine-tuned on lower-quality datasets. We attribute this expected improvement to enhanced linguistic clarity and expressive realism. Evaluation is conducted using both objective (F0 Frame Error, Mean Cepstral Distortion, Word Error Rate) and subjective (Comparative Mean Opinion Score, MUSHRA) metrics. Results indicate that models fine-tuned on as little as 15 or 30 minutes of data can achieve performance comparable to, or even better than, that of models trained on large-scale datasets, and that with only 15 minutes of data, overall data quality (including accent, vocal control, and recording conditions) can improve synthesis quality significantly.
This study presents a focused analysis of phoneme-mapped cross-lingual transfer for German SVS and offers practical strategies for adapting SVS systems to underrepresented languages using minimal data. To the best of our knowledge, this is the first study to investigate cross-lingual transfer learning in the SVS field. We believe that the findings and methodology of this work can be extended to support cross-lingual SVS development in other low-resource languages as well. We release both an online demo, available at https://dongjiashu.github.io/DiffSinger/, and the source code repository, available at https://github.com/DongJiashu/DiffSinger, for public access.
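
For readers unfamiliar with the objective metrics named above, Word Error Rate is conventionally the word-level edit distance between a reference text and a recognized transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition only. The function name `word_error_rate`, the lowercase-and-split normalization, and the German example strings are illustrative assumptions; the abstract does not state which ASR system or text normalization the thesis used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized by lowercasing and splitting on
    whitespace; the thesis may normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical lyric line and ASR transcript, for illustration only.
    print(word_error_rate("der mond ist aufgegangen", "der mund ist gegangen"))
```

In this hypothetical example two of the four reference words are substituted, so the score is 0.5. Lower values suggest that the synthesized lyrics are easier for a recognizer to transcribe, which is why WER is often used as a proxy for intelligibility.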

Item Type:	Thesis (Master)
Name supervisor:	Do, T.P.
Date Deposited:	16 Jun 2025 11:37
Last Modified:	16 Jun 2025 11:37
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/668
