Phone Masking Augmentation for Automatic Recognition of Whispered Speech

Marchenko, Igor (2024) Phone Masking Augmentation for Automatic Recognition of Whispered Speech. Master thesis, Voice Technology (VT).

Abstract

Automatic speech recognition (ASR) models are predominantly trained on vocalized speech and may encounter difficulties in recognizing whispered speech, which is a crucial component of human communication. The challenges arise from a scarcity of whispered speech data necessary for effective model training. To address data insufficiency in the ASR domain, data augmentation through the masking of spectrogram regions has emerged as a promising technique, which both enlarges the dataset and enhances the generalization capabilities of speech recognition models, as exemplified by methods like SpecAugment. However, while SpecAugment was developed for general ASR and applies random masks to spectrogram regions, this study proposes a targeted approach of masking predefined spectrogram regions in the time domain, tailored specifically to the peculiarities of whispered speech. Phonetic studies of whispered speech have revealed that certain groups of sounds in whispered speech differ significantly in articulation compared to their vocalized counterparts. Consequently, this research focuses on masking these most significantly divergent groups of sounds. Experimental results indicate that masking sounds produced at the hard palate (e.g., /j/) improves whispered speech recognition performance, achieving a word error rate of 11.5% on the US-accented part of the wTIMIT dataset, which appears to be the best performance reported for wTIMIT to date.
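The idea of masking predefined time regions, as opposed to SpecAugment's random masks, can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `mask_phones`, the alignment format (phone label with start and end sample), the 160-sample hop, the zero fill value, and the use of the TIMIT-style label "y" for /j/ are all assumptions made for illustration.

```python
import numpy as np


def mask_phones(spec, alignment, target_phones, hop_length=160, fill_value=0.0):
    """Return a copy of `spec` with the frames of selected phones overwritten.

    spec          : array of shape (n_bins, n_frames), e.g. a log-mel spectrogram
    alignment     : iterable of (label, start_sample, end_sample) tuples
                    (hypothetical format; assumes phone-level alignments exist)
    target_phones : set of phone labels whose time spans should be masked
    hop_length    : samples per spectrogram frame (assumed: 10 ms at 16 kHz)
    """
    masked = spec.copy()
    n_frames = spec.shape[1]
    for label, start_sample, end_sample in alignment:
        if label not in target_phones:
            continue
        # Convert sample offsets to frame indices with integer arithmetic:
        # floor for the first frame, ceiling for the (exclusive) last frame.
        first = max(0, start_sample // hop_length)
        last = min(n_frames, -(-end_sample // hop_length))
        # Mask every frequency bin across the phone's time span.
        masked[:, first:last] = fill_value
    return masked


# Toy example: a 4-bin, 10-frame "spectrogram" and a two-phone alignment.
spec = np.ones((4, 10))
alignment = [("y", 320, 640), ("uw", 640, 1280)]
augmented = mask_phones(spec, alignment, target_phones={"y"})
```

In an augmentation setup of this kind, the masked copy would typically be added to the training set alongside the original utterance, which is how masking both enlarges the dataset and encourages the model to rely on context around the masked sounds.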

Item Type: Thesis (Master)
Name supervisor: Nayak, S.
Date Deposited: 16 Jul 2024 06:38
Last Modified: 16 Jul 2024 06:38
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/523
