Does Where Words Come From Matter? Leveraging Self-supervised Models for Multilingual ASR and LID

Shen, Gaofei (2022) Does Where Words Come From Matter? Leveraging Self-supervised Models for Multilingual ASR and LID. Master thesis, Voice Technology (VT).

Abstract

End-to-end ASR systems now achieve strong performance in monolingual speech recognition for many languages, and researchers have explored several approaches to improve them further. One such approach leverages the end-to-end architecture for multilingual code-switching speech recognition by fine-tuning pre-trained models directly on multilingual datasets (Lovenia et al., 2022). Because previous attempts focused on higher-resourced language pairs such as Mandarin and English, this thesis tests whether training end-to-end ASR systems based on self-supervised learning models directly on multilingual data can also improve multilingual ASR performance for lower-resourced language pairs such as Frisian and Dutch. The experiments show that fine-tuning monolingual end-to-end models on code-switching datasets can achieve good results. Researchers have also found that the hidden representations generated by the intermediate layers of the neural network encode certain acoustic features (Pasad et al., 2021). This thesis therefore also proposes using intermediate-layer outputs to train a language identification (LID) system that can measure the language integration of code-switching utterances. Based on previous research on multilingual, code-switching-capable ASR systems (Baevski et al., 2020; Bentum, 2022; Tseng et al., 2022; Yılmaz et al., 2016), an LID system that can indicate the level of language integration of a word should be able to further improve the accuracy of code-switching ASR. However, as the experiment in this thesis revealed, a simple LID model does not produce good results for very similar language pairs such as Frisian and Dutch. An LID module may therefore not be the best approach to building truly multilingual speech recognition software for languages that share many similarities.
The results also point to future research topics in multilingualism, in particular identifying further features that human listeners use to recognize the dialect or language being spoken.

Item Type:	Thesis (Master)
Name supervisor:	Coler, M.L. and Nayak, S.
Date Deposited:	09 Sep 2022 08:26
Last Modified:	09 Sep 2022 08:26
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/227
