Li, ZiYi (2025) Transfer Learning for Sichuan Dialect Automatic Speech Recognition Based on pretrained Wav2vec 2.0 Model. Master thesis, Voice Technology (VT).
PDF: Thesis.pdf (466 kB)
Abstract
This thesis explores the application of self-supervised pre-trained models to low-resource dialectal speech recognition, using Sichuanese as a case study. We fine-tune the wav2vec2-large-xlsr-53 pre-trained model on a limited amount of manually transcribed Sichuanese speech, aiming to develop a practical automatic speech recognition (ASR) system in a highly resource-constrained setting. Our primary experimental results demonstrate that transfer learning can effectively reduce the character error rate (CER) from over 77% to below 28% using less than 11 hours of diverse training data. We further examine the impact of different training data compositions and propose a multi-source integration strategy that maintains performance while utilizing additional data. In contrast, a naive mixture of heterogeneous datasets significantly degrades model performance. Analysis reveals that data diversity plays a more crucial role than quantity in low-resource ASR, and that dialect-specific phenomena contribute notably to recognition errors. This study highlights the effectiveness of pre-trained models for dialectal ASR and provides practical insights into data selection and fine-tuning strategies. The proposed methodology contributes to the broader goal of enabling speech technologies for underrepresented languages and dialects.
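The abstract describes fine-tuning wav2vec2-large-xlsr-53 with a character-level CTC objective on transcribed Sichuanese speech and evaluating with character error rate (CER). Below is a minimal sketch of what such a setup could look like, assuming the standard HuggingFace Transformers recipe for XLSR fine-tuning; the file path `vocab.json`, the placeholder transcripts, the dummy audio, and the learning rate are illustrative assumptions, not the thesis' actual data or configuration.

```python
# Minimal sketch (assumptions noted above): build a character-level CTC vocabulary,
# attach a fresh CTC head to wav2vec2-large-xlsr-53, run one training step, and score CER.
import json
import torch
import jiwer
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# 1) Character-level vocabulary from (placeholder) Sichuanese transcripts.
transcripts = ["今天天气好", "你吃饭没得"]  # placeholder transcripts, not thesis data
vocab = {ch: i for i, ch in enumerate(sorted(set("".join(transcripts))))}
for special in ("|", "[UNK]", "[PAD]"):
    vocab[special] = len(vocab)
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# 2) Load the pre-trained encoder, add a randomly initialised CTC head sized to the
#    new character vocabulary, and freeze the convolutional feature encoder.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()

# 3) One illustrative training step on dummy 16 kHz audio (replace with real speech).
audio = [torch.randn(16000 * 3).numpy(), torch.randn(16000 * 2).numpy()]
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
labels = processor.tokenizer(transcripts, return_tensors="pt", padding=True).input_ids
labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss = model(input_values=inputs.input_values,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()

# 4) Greedy decoding and character error rate on the same batch.
with torch.no_grad():
    logits = model(input_values=inputs.input_values,
                   attention_mask=inputs.attention_mask).logits
hypotheses = processor.batch_decode(torch.argmax(logits, dim=-1))
print("CER:", jiwer.cer(transcripts, hypotheses))
```

Freezing the feature encoder and training only the transformer layers plus the new CTC head is the common practice for low-resource fine-tuning of XLSR models; whether the thesis used exactly this configuration is not stated in the abstract.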
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Schauble, J.K. |
| Date Deposited: | 06 Aug 2025 13:01 |
| Last Modified: | 06 Aug 2025 13:01 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/760 |