Willis, Leslie (2023) Exploring Automatic Speech Recognition for Podcast Audio: Fine-Tuning HuBERT on the Spotify Podcast Dataset. Master thesis, Voice Technology (VT).
PDF: VoiceTechnology_Thesis_LeslieWillis.pdf (1MB, restricted to repository staff only)
Abstract
Voice technologies are becoming increasingly prominent in daily life. An important field of research behind the steady improvements in automatic speech recognition (ASR) is self-supervised speech representation learning (SSL). Using the pretrained HuBERT-Large model (Hsu et al., 2021), this work investigates fine-tuning an SSL model in a low-resource setting with only 10 minutes of podcast data. HuBERT is a state-of-the-art SSL model that has been widely used for benchmark SSL tasks. Because the Spotify podcast dataset (Clifton, Pappu, et al., 2020) was published only recently, speech processing with podcast audio remains a novel research area. For comparison, the same training configurations are applied to 10 minutes of the TIMIT dataset (Garofolo, 1993). In this low-resource setting, fine-tuning achieves a word error rate of 62.2% at best on the podcast dataset and 60.3% on the TIMIT dataset. The findings suggest an advantage in using the TIMIT dataset. Nevertheless, this work provides novel insights into the use of podcast audio for fine-tuning the HuBERT model, pointing out limitations of the setups used and implications for future research directions.
Item Type: | Thesis (Master) |
---|---|
Name supervisor: | Nayak, S. and Coler, M.L. |
Date Deposited: | 18 Mar 2024 09:35 |
Last Modified: | 18 Mar 2024 09:35 |
URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/383 |