Fine-tuning Cantonese based on Wav2vec 2.0 XLRS model that pretrained on Mandarin Chinese to improve ASR performance

Li, Qing (2024) Fine-tuning Cantonese based on Wav2vec 2.0 XLRS model that pretrained on Mandarin Chinese to improve ASR performance. Master's thesis, Voice Technology (VT).

Abstract

This study investigates the effectiveness of cross-lingual transfer learning for Cantonese Automatic Speech Recognition (ASR) by comparing a baseline wav2vec 2.0 XLSR model pre-trained on multiple languages with a transfer learning model pre-trained on Mandarin. The baseline model achieved a Character Error Rate (CER) of approximately 0.3, while the transfer learning model reached a substantially lower CER of around 0.2 after 40 epochs of training. The transfer learning approach also showed better training efficiency, faster convergence, and robust generalization, although the baseline model held a slight advantage in validation loss during the later stages of training. These findings support the hypothesis that a model pre-trained on Mandarin and fine-tuned with limited labeled Cantonese data outperforms the baseline model. The study underscores the potential benefits of cross-lingual transfer learning, particularly between linguistically similar languages, and highlights its importance for developing inclusive and diverse ASR systems for under-resourced languages.
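
The CER values quoted above can be read as the character-level edit distance between the recognizer output and the reference transcript, divided by the number of reference characters. The sketch below illustrates that common definition only. The function names and the example sentences are made up for illustration, and the thesis may apply text normalization or use a library implementation that is not described on this page.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: a 10-character Cantonese reference and a
# hypothesis that gets two characters wrong (two substitutions).
reference = "我今日去咗圖書館睇書"
hypothesis = "我今日去左圖書館看書"
print(cer(reference, hypothesis))
```

Under this definition, a CER of 0.2 corresponds to roughly two character errors for every ten reference characters, and a CER of 0.3 to roughly three.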

Item Type:	Thesis (Master)
Name supervisor:	Nayak, S.
Date Deposited:	01 Aug 2024 10:07
Last Modified:	01 Aug 2024 10:07
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/541
