Qu, Layla (2024) Data Augmentation and VAE-GAN for Few-Shot Singing Voice Cloning. Master's thesis, Voice Technology (VT).
PDF: MA-5551870-L-Qu.pdf (665 kB)
Abstract
This study explores the feasibility of cloning a singer's original vocal timbre from a limited singing dataset by combining data augmentation techniques with a VAE-GAN model. The NUS-48E singing database, which includes 40 audio samples from ten singers, was augmented using several methods: pitch shifting, time stretching, background noise addition, and spectrogram perturbation. The VAE-GAN model, which combines the strengths of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), was then trained on this augmented dataset to evaluate how well it replicates the original voice timbre. The study asks whether these techniques can clone the original timbre from minimal data, and hypothesizes that, even with data augmentation, the model may struggle to replicate it fully because training data remain scarce. Results, supported by t-SNE visualization and quantitative metrics (e.g., reconstruction loss, signal-to-noise ratio, MSE, diversity score, DTW distance, and Euclidean distance), indicate that while data augmentation increases diversity and improves model performance, it also introduces feature variability that makes full replication difficult. This study highlights the potential and limitations of the VAE-GAN architecture and data augmentation techniques for speech synthesis and cloning in low-resource settings, and offers insights for future research.
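The abstract names its evaluation metrics without defining them. The following is a minimal sketch, assuming common textbook definitions, of three of those metrics (MSE, signal-to-noise ratio, and DTW distance with a Euclidean frame cost) plus one of the augmentations mentioned, background noise addition. The assumptions are worth stating plainly: white Gaussian noise stands in for background noise, inputs are plain Python lists rather than the thesis's actual audio features, and every function name (`add_noise`, `snr_db`, `mse`, `euclidean`, `dtw_distance`) is hypothetical, not taken from the thesis code.

```python
import math
import random
from typing import List, Sequence

# Hypothetical helper names for illustration; not taken from the thesis code.
Vector = Sequence[float]


def _check_same_length(a: Vector, b: Vector) -> None:
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length signals or feature vectors."""
    _check_same_length(a, b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def snr_db(clean: Vector, processed: Vector) -> float:
    """SNR in dB, treating (processed - clean) as the noise component."""
    _check_same_length(clean, processed)
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, processed))
    if noise_power == 0.0:
        return math.inf  # identical signals: no noise at all
    return 10.0 * math.log10(signal_power / noise_power)


def add_noise(clean: Vector, target_snr_db: float, seed: int = 0) -> List[float]:
    """Noise augmentation (assumed white Gaussian): add noise scaled to a target SNR."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    signal_power = sum(x * x for x in clean)
    noise_power = sum(n * n for n in noise)
    # Choose scale so that signal_power / (scale**2 * noise_power) = 10**(snr/10).
    scale = math.sqrt(signal_power / (noise_power * 10.0 ** (target_snr_db / 10.0)))
    return [x + scale * n for x, n in zip(clean, noise)]


def dtw_distance(seq_a: Sequence[Vector], seq_b: Sequence[Vector]) -> float:
    """Classic O(n*m) dynamic time warping with a Euclidean frame-to-frame cost."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # A 0.1 s, 440 Hz sine at 16 kHz stands in for a sung note.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    noisy = add_noise(clean, target_snr_db=20.0, seed=1)
    print(f"SNR: {snr_db(clean, noisy):.2f} dB")
    print(f"MSE: {mse(clean, noisy):.6f}")
    # DTW tolerates timing differences; frames here are 1-D feature vectors.
    print(dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]]))
```

Because `add_noise` scales the noise to an exact power, `snr_db` on its output recovers the requested 20 dB up to floating-point error. DTW is included alongside MSE and Euclidean distance because it can align a generated phrase with a reference sung at a slightly different tempo, whereas frame-by-frame distances penalize any timing offset.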
Item Type: Thesis (Master)
Supervisor: Verkhodanova, V.
Date Deposited: 25 Aug 2024 19:33
Last Modified: 25 Aug 2024 19:33
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/557