
Improving the State-of-the-Art Frisian ASR by fine-tuning Large-Scale Cross-Lingual Pre-Trained Models

Bălan, Dragoș Alexandru (2023) Improving the State-of-the-Art Frisian ASR by fine-tuning Large-Scale Cross-Lingual Pre-Trained Models. Master thesis, Voice Technology (VT).

PDF: MA S3944867 DA Balan.pdf (Download, 1MB)

Abstract

Frisian is a West Germanic language recognized as an official language in the Netherlands and spoken extensively in the province of Fryslân. Despite its official status, Frisian lacks technological support and resources, especially in the field of automatic speech recognition (ASR). It is therefore considered a low-resource language, and low-resource speech recognition requires alternative approaches. To enhance Frisian ASR performance and address the challenges posed by its low-resource status, this research focuses on fine-tuning the XLS-R model, a large-scale cross-lingual pre-trained model. XLS-R, built upon the wav2vec 2.0 architecture, has shown promising results in multilingual ASR tasks. It is compared with the state-of-the-art XLSR-53 model, which has been widely used for Frisian speech recognition, to assess its potential for achieving improved word error rates and surpassing existing performance benchmarks. Specifically, my research answers the following question: can fine-tuning the XLS-R model on Frisian speech achieve a word error rate (WER) below 20% and outperform the state-of-the-art XLSR-53 model? Comparisons were made against a baseline WER of 15.19%. Training the XLS-R model with the same data as the baseline (5 hours of speech) yielded a WER of 14.13%, improving on the current state of the art by 1.06% absolute and 7% relative. Further fine-tuning with approximately 8 times more data (41 hours) achieved a WER of 4.11%, which sets a new milestone in Frisian speech recognition and even surpasses the performance of XLS-R fine-tuned on high-resource languages. Additional experiments using 10 minutes, 1 hour, and 10 hours of training speech resulted in word error rates of 62.25%, 25.4%, and 8.83% respectively, underscoring the importance of more data when fine-tuning large-scale models. XLS-R models with 0.3B and 2B parameters were also fine-tuned on 41 hours of data; the 1B-parameter model is the most balanced of the three, scoring the best WER while consuming more resources than the 0.3B model but fewer than the 2B model. Future work involves using a newer version of the dataset, experimenting with the FAME! corpus, comparing the results with a different large-scale model, Whisper, as well as applying language model rescoring and additional metrics such as character error rate and phoneme error rate.
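The abstract describes the standard recipe for adapting a self-supervised XLS-R checkpoint to a new language: attach a character-level CTC head over a Frisian vocabulary, fine-tune on transcribed speech, and score the output with WER. The sketch below illustrates that setup with the Hugging Face Transformers library; this is an assumption about tooling, as the thesis does not state its exact training code, and the vocabulary file, checkpoint choice, and example sentences are illustrative only.

```python
# Illustrative sketch (not the thesis's actual training script):
# load an XLS-R checkpoint, attach a CTC head over a Frisian character
# vocabulary, and compute word error rate on a decoded hypothesis.
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)
from jiwer import wer

# Character vocabulary built from the Frisian training transcripts
# (vocab.json is assumed to exist and map characters to integer ids).
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# The 1B-parameter XLS-R checkpoint; the 0.3B and 2B variants compared in
# the abstract can be swapped in via their respective checkpoint names.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-1b",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the convolutional feature encoder fixed during fine-tuning

# WER is the fraction of word-level substitutions, insertions, and deletions:
print(wer("dit is in foarbyld", "dit is foarbyld"))  # one deletion in four words -> 0.25
```

Reported WERs such as the 4.11% result correspond to this word-level error rate averaged over the evaluation set; the 0.3B and 2B comparisons in the abstract differ only in the pre-trained checkpoint that is loaded.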

Item Type: Thesis (Master)
Name supervisor: Coler, M.L. and Nayak, S.
Date Deposited: 12 Sep 2023 11:06
Last Modified: 12 Sep 2023 11:06
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/360
