Speaker Identification in Mandarin Conference Speech via Transfer Learning with wav2vec 2.0

Sixing, Mi (2025) Speaker Identification in Mandarin Conference Speech via Transfer Learning with wav2vec 2.0. Master thesis, Voice Technology (VT).

Abstract

In today's multilingual, digital world, speaker recognition is increasingly important in real-world applications such as virtual conferencing, transcription services, and customer support. Despite significant progress in Mandarin automatic speech recognition (ASR), speaker recognition in real-world Mandarin conference speech remains imperfect because of challenges such as pitch interference, overlapping segments, and environmental noise. To improve Mandarin speaker recognition, this study explores how well wav2vec 2.0 transfers to speaker recognition in multi-speaker conference settings. To evaluate this, I used the AISHELL-4 corpus, which contains Mandarin conference speech with realistic acoustic variation. Specifically, the study addresses two questions: How effectively can a wav2vec 2.0 model pre-trained for Mandarin ASR be adapted for speaker recognition in real-world Mandarin conference speech? What are the effects of task and domain transfer mechanisms on its performance? The wav2vec 2.0 encoder is frozen, a lightweight linear classifier is added on top of it, and two models are compared: a global classification baseline model and a session-level transfer learning model. The baseline model achieved a Top-1 accuracy of 50.3% over the entire speaker label space, while the session-level model performed significantly better, with an average accuracy improvement of more than 20% and a maximum improvement of 36% over the baseline. These findings suggest that even with a small amount of fine-tuning, pre-trained ASR models can capture features useful for speaker recognition and generalize well to noisy domains. The study provides evidence that this transfer learning strategy is effective for speaker perception systems in real-world Mandarin environments. Future directions include adaptive fine-tuning, cross-lingual generalization, and integration with speaker classification for broader applications.
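
As a rough illustration of the evaluation setting described in the abstract, the Python sketch below computes Top-1 accuracy twice on the same toy scores: once over the full speaker label space, and once with the candidates restricted to the speakers of a single session. Everything in it is an assumption made for illustration. The function `top1_accuracy`, its `allowed` parameter, the speaker IDs, and the scores are hypothetical, and restricting candidates at scoring time is only one possible reading of "session-level". It is not the thesis's implementation and does not reproduce its results.

```python
from typing import Dict, Optional, Sequence, Set


def top1_accuracy(
    scores: Sequence[Dict[str, float]],
    labels: Sequence[str],
    allowed: Optional[Set[str]] = None,
) -> float:
    """Fraction of utterances whose highest-scoring speaker is the true one.

    scores:  one dict per utterance, mapping speaker ID to classifier score.
    labels:  the true speaker ID of each utterance.
    allowed: optional candidate set (hypothetical "session" speakers);
             when given, only these speakers compete for the top rank.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = 0
    for utt_scores, true_speaker in zip(scores, labels):
        if allowed is None:
            candidates = utt_scores
        else:
            candidates = {s: v for s, v in utt_scores.items() if s in allowed}
        predicted = max(candidates, key=candidates.get)
        correct += int(predicted == true_speaker)
    return correct / len(labels)


# Toy data (made up): four speakers overall, two of them in the session.
toy_scores = [
    {"A": 0.4, "B": 0.1, "C": 0.5, "D": 0.0},  # true speaker A, confused with C
    {"A": 0.2, "B": 0.6, "C": 0.1, "D": 0.1},  # true speaker B
]
toy_labels = ["A", "B"]
session_speakers = {"A", "B"}

global_acc = top1_accuracy(toy_scores, toy_labels)
session_acc = top1_accuracy(toy_scores, toy_labels, allowed=session_speakers)
print(global_acc, session_acc)  # 0.5 1.0
```

In this toy case the first utterance is misattributed to speaker C under the global label space but is classified correctly once only the session's speakers compete, which shows how a smaller candidate set can raise Top-1 accuracy without any change to the underlying scores.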

Item Type:	Thesis (Master)
Name supervisor:	Verkhodanova, V.
Date Deposited:	31 Jul 2025 10:28
Last Modified:	31 Jul 2025 10:28
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/744
