Zhu, Qiye (2025) Zero-Shot Voice Cloning with Minimal Data: Impact of Reference Duration on Long-Form Speech Synthesis. Master thesis, Voice Technology (VT).
Abstract
Zero-shot voice cloning has rapidly advanced in recent years, enabling the synthesis of highly realistic speech from minimal reference audio. Such systems offer significant potential for applications ranging from personalized virtual assistants to assistive technologies for individuals with speech impairments. However, while short-form voice cloning has been extensively studied, questions remain about how little reference data is sufficient to enable high-quality, long-form speech synthesis that preserves a new speaker’s identity. This thesis investigates this question by systematically evaluating the performance of XTTS-v2, an open-source zero-shot text-to-speech (TTS) model, on extended speech generation tasks. The study focuses on determining the minimal reference audio duration required for XTTS-v2 to produce 1-minute speech segments that are both natural and stable in speaker identity. Six reference durations were tested (1, 3, 6, 10, 20, and 40 seconds). Performance was evaluated with objective metrics, namely Speaker Embedding Cosine Similarity (SECS) and Mel Cepstral Distortion (MCD), and with subjective ratings, namely Mean Opinion Score for Naturalness (MOS-N) and Speaker Similarity (S-MOS). In addition, a user survey was conducted to assess public perceptions of voice cloning technology and related ethical concerns. The experimental results demonstrate that XTTS-v2 can achieve high-fidelity voice cloning with as little as 6–10 seconds of reference audio. Significant improvements in speaker similarity and naturalness were observed when increasing reference duration from 1 to 10 seconds, with diminishing returns beyond 20 seconds. The system successfully generated coherent and stable 1-minute utterances without noticeable drift in speaker identity or prosody.
The evaluation also revealed minor voice-dependent differences; in this study, the male speaker required slightly less data than the female speaker to achieve optimal results, highlighting that individual voice characteristics can influence cloning performance. Importantly, the survey findings indicate substantial public interest in voice cloning for beneficial applications, such as voice recovery, while also revealing widespread concerns about privacy, security, and misuse. The majority of respondents emphasized the need for ethical safeguards, including consent frameworks and technical measures such as watermarking. Overall, this thesis provides practical guidelines for the efficient use of reference audio in zero-shot voice cloning and highlights key ethical considerations for future system deployment. The results suggest that modern TTS models like XTTS-v2 are capable of generating high-quality long-form speech with minimal data, paving the way for broader adoption while underscoring the need for responsible and transparent practices.
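For readers unfamiliar with the two objective metrics named in the abstract, the sketch below shows how SECS and MCD are commonly defined. It is an illustration under stated assumptions, not the thesis's own evaluation code: the function names are hypothetical, the speaker embeddings are assumed to come from some external speaker encoder, and the cepstral frames are assumed to be already time-aligned (for example by dynamic time warping), with coefficient 0 excluded as is conventional.

```python
import math


def cosine_similarity(emb_a, emb_b):
    """SECS-style score: cosine similarity between two speaker embeddings.

    Both inputs are plain sequences of floats of equal length
    (assumed to be produced by an external speaker encoder).
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Each frame is a sequence of cepstral coefficients. Coefficient 0
    (overall energy) is skipped, following the common convention.
    Frame alignment is assumed to have been done beforehand.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)
```

Higher SECS means the synthesized voice sits closer to the reference speaker in embedding space, while lower MCD means the synthesized spectrum is closer to the reference recording.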
Item Type: | Thesis (Master) |
---|---|
Name supervisor: | Do, T.P. (Phat) |
Date Deposited: | 21 Jul 2025 11:50 |
Last Modified: | 21 Jul 2025 11:50 |
URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/708 |