Deng, Yaling (2024) Improving the Performance of Code-Switching Recognition Using Whisper. Master thesis, Voice Technology (VT).
Abstract
The intersection of technology and linguistics has given rise to many challenges and opportunities. Among these, Automatic Speech Recognition (ASR) stands out as a critical area of research and development. ASR systems have become an integral part of daily life, facilitating communication and accessibility across various platforms. However, the prevalence of multilingual interactions, particularly code-switching (CS), poses a significant challenge to the accuracy and reliability of these systems. Code-switching, the practice of alternating between two or more languages within a single conversation, is common in many bilingual communities. This research focuses on Mandarin-English intra-sentential CS, a particularly complex scenario because the two languages differ starkly in phonetics, syntax, and semantics. The ubiquity of ASR systems has been propelled by advances in deep learning and the abundance of data. Deep learning models, with their ability to capture intricate patterns and representations, have revolutionized the field of ASR. However, existing ASR systems often falter when faced with the complexities of multilingual interactions such as CS. This study is motivated by the linguistic phenomenon of CS and its implications for ASR accuracy. It aims to enhance the ability of ASR systems to recognize Mandarin-English mixed-language speech, a task that is not only technologically challenging but also socially significant. The Whisper model, developed by OpenAI, serves as the foundation for this research. Whisper is an ASR model that leverages weak supervision and has demonstrated robust performance across various languages and dialects. It is designed to handle the nuances of speech, including different accents and languages, making it a strong candidate for tackling the challenge of Mandarin-English CS.
However, the model's effectiveness in decoding Mandarin-English CS has yet to be fully realized. This research hypothesizes that fine-tuning the Whisper model with a dedicated Mandarin-English CS dataset will significantly improve its performance in recognizing code-switched speech. To test this hypothesis, a Mandarin-English CS dataset was collected and prepared, and then used to fine-tune the Whisper model. The dataset was curated to represent a diverse range of CS scenarios, so that the model would be exposed to a wide variety of linguistic contexts. The fine-tuned models were evaluated with a focus on reducing the Mixture Error Rate (MER), a key metric for assessing the performance of ASR systems on CS speech. The results of this study are promising. A substantial reduction in the MER for Mandarin-English CS speech was observed, supporting the hypothesis and highlighting the potential of tailored models to enhance ASR accuracy. The fine-tuned models, ME-Whisper-small and ME-Whisper-large-v3, exhibited a marked improvement in their MER, demonstrating the efficacy of the proposed approach. This improvement is not only statistically significant but also practically relevant, as it translates to better recognition rates and user experiences for Mandarin-English bilingual speakers. Despite these promising outcomes, the study acknowledges several limitations. One of the primary limitations is the modest size of the Mandarin-English CS dataset. The size and diversity of the dataset are critical factors in the performance of ASR systems, and a larger and more diverse dataset could lead to further improvements in model performance. Additionally, constraints on experimental time limited the extent of model optimization.
More time would allow for more rigorous hyperparameter tuning and additional iterations of model training and evaluation. Future work is suggested to address these limitations and further refine model performance. Expanding the dataset to include more examples of Mandarin-English CS speech is a priority. This could involve collecting more data from bilingual communities and incorporating a wider range of accents and dialects. More rigorous hyperparameter tuning is also recommended, as it could further improve model performance. Furthermore, the exploration of cross-lingual datasets and the development of severity-dependent models are proposed as avenues for future research. Cross-lingual datasets could help the model generalize better across different language pairs, while severity-dependent models could adapt to the varying degrees of CS present in speech. These advancements could foster more equitable and effective ASR systems that are better suited to the complexities of multilingual communication. In conclusion, this research on Mandarin-English CS speech recognition using the Whisper model sets a benchmark for future exploration and innovation. The findings underscore the importance of adapting ASR systems to the complexities of multilingual communication. Doing so can pave the way for more inclusive and responsive technologies that cater to the diverse linguistic landscape of our global community. This research contributes to the field of ASR and has broader implications for the development of technologies that are sensitive to the needs of multilingual speakers.
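The Mixture Error Rate used as the evaluation metric can be illustrated with a minimal sketch. The abstract does not state how MER was computed in the thesis, so the following assumes a convention common for Mandarin-English code-switching: Mandarin is scored per character, English per word, and the error rate is the edit distance over the mixed token sequence divided by the reference length. The function names (`tokenize_mixed`, `edit_distance`, `mixture_error_rate`) and the tokenization rule are hypothetical and not taken from the thesis.

```python
import re

# Assumption: each CJK character is one token; each run of Latin letters,
# digits, or apostrophes is one word token. Punctuation is ignored.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text):
    """Split mixed Mandarin-English text into characters and lowercased words."""
    return [token.lower() for token in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = prev[j] + 1      # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def mixture_error_rate(reference, hypothesis):
    """Edit distance over mixed tokens, normalized by reference length."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, the reference `我今天要去 meeting` has six tokens (five characters and one word), so a hypothesis that gets only the English word wrong scores one error out of six. Scoring Mandarin per character avoids depending on a word segmenter, whose choices would otherwise change the error rate.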
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Do, T.P. |
| Date Deposited: | 22 Jul 2024 07:26 |
| Last Modified: | 22 Jul 2024 07:26 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/536 |