Li, ZiYi (2025) Transfer Learning for Sichuan Dialect Automatic Speech Recognition Based on pretrained Wav2vec 2.0 Model. Master thesis, Voice Technology (VT).
PDF: Thesis.pdf (466 kB)
Abstract
This thesis explores the application of self-supervised pre-trained models to low-resource dialectal speech recognition, using Sichuanese as a case study. We fine-tune the wav2vec2-large-xlsr-53 pre-trained model on a limited amount of manually transcribed Sichuanese speech, aiming to develop a practical automatic speech recognition (ASR) system in a highly resource-constrained setting. Our primary experimental results demonstrate that transfer learning can effectively reduce the character error rate (CER) from over 77% to below 28% using less than 11 hours of diverse training data. We further examine the impact of different training data compositions and propose a multi-source integration strategy that maintains performance while utilizing additional data. In contrast, a naive mixture of heterogeneous datasets significantly degrades model performance. Analysis reveals that data diversity plays a more crucial role than quantity in low-resource ASR, and that dialect-specific phenomena contribute notably to recognition errors. This study highlights the effectiveness of pre-trained models for dialectal ASR and provides practical insights into data selection and fine-tuning strategies. The proposed methodology contributes to the broader goal of enabling speech technologies for underrepresented languages and dialects.
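The abstract describes fine-tuning wav2vec2-large-xlsr-53 with a character-level CTC objective on transcribed Sichuanese speech and evaluating with character error rate (CER). Below is a minimal sketch of what such a setup could look like, assuming the standard HuggingFace Transformers recipe for XLSR fine-tuning; the file path `vocab.json`, the placeholder transcripts, the dummy audio, and the learning rate are illustrative assumptions, not the thesis' actual data or configuration.

```python
# Minimal sketch (assumptions noted above): build a character-level CTC vocabulary,
# attach a fresh CTC head to wav2vec2-large-xlsr-53, run one training step, and score CER.
import json
import torch
import jiwer
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# 1) Character-level vocabulary from (placeholder) Sichuanese transcripts.
transcripts = ["今天天气好", "你吃饭没得"]  # placeholder transcripts, not thesis data
vocab = {ch: i for i, ch in enumerate(sorted(set("".join(transcripts))))}
for special in ("|", "[UNK]", "[PAD]"):
    vocab[special] = len(vocab)
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# 2) Load the pre-trained encoder, add a randomly initialised CTC head sized to the
#    new character vocabulary, and freeze the convolutional feature encoder.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()

# 3) One illustrative training step on dummy 16 kHz audio (replace with real speech).
audio = [torch.randn(16000 * 3).numpy(), torch.randn(16000 * 2).numpy()]
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
labels = processor.tokenizer(transcripts, return_tensors="pt", padding=True).input_ids
labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss = model(input_values=inputs.input_values,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()

# 4) Greedy decoding and character error rate on the same batch.
with torch.no_grad():
    logits = model(input_values=inputs.input_values,
                   attention_mask=inputs.attention_mask).logits
hypotheses = processor.batch_decode(torch.argmax(logits, dim=-1))
print("CER:", jiwer.cer(transcripts, hypotheses))
```

Freezing the feature encoder and training only the transformer layers plus the new CTC head is the common practice for low-resource fine-tuning of XLSR models; whether the thesis used exactly this configuration is not stated in the abstract.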
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Schauble, J.K. |
| Date Deposited: | 06 Aug 2025 13:01 |
| Last Modified: | 06 Aug 2025 13:01 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/760 |