F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition

Kokowski, Jan (2025) F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition. Master thesis, Voice Technology (VT).

Abstract

With the widespread adoption of voice-enabled devices in homes and the growing promise of speech technology for accessibility, improving whispered speech recognition is becoming increasingly relevant. The automatic recognition of whispered speech remains an ongoing challenge, primarily due to the scarcity of whispered speech data and its distinct acoustic properties, which degrade the performance of models trained on normal speech. A key difference, the absence of fundamental frequency (F0), particularly degrades low-frequency regions. Our study shows that masking spectrogram regions below 300 Hz during voiced phone frames (F0-Mask) leads to a statistically significant relative improvement of 6.5% in word error rate (WER) on whispered speech, compared to the baseline using the state-of-the-art augmentation method, SpecAugment, which masks frequencies indiscriminately. We achieve this result by fine-tuning OpenAI’s Whisper-small model on the US subset of wTIMIT. Across seven fine-tuning experiments, which included novel data augmentation with F1-Mask, LF-Mask, and hybrid approaches combining SpecAugment with our methods, none of the other setups showed a statistically significant improvement or degradation in WER on whispered speech compared to the SpecAugment baseline. This finding suggests that the absence of F0 in whispered speech and the resulting degradation of the voicing band are key acoustic differences that impede recognition, and that removing the F0 band helps the model focus on higher frequencies. Our findings are in line with related studies on whispered speech recognition and suggest that data augmentation approaches tailored specifically to whispered speech properties represent a promising research direction.
Finally, the F0-Mask approach achieved 11.5% WER on the whispered US subset of wTIMIT, matching the current state-of-the-art performance on this dataset, while maintaining strong performance on normal speech, with no degradation from the baseline WER of 5%.
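
The F0-Mask idea described in the abstract (masking spectrogram regions below 300 Hz during voiced phone frames) can be illustrated with a minimal sketch. This is not the thesis implementation. The function name `f0_mask`, the row-per-frequency-bin layout, the per-frame `voiced` flags (assumed to come from a phone-level alignment), and the fill value of 0.0 are all assumptions made for illustration. Plain Python lists are used so the sketch runs without extra dependencies.

```python
from typing import List, Sequence


def f0_mask(
    spec: Sequence[Sequence[float]],
    bin_freqs_hz: Sequence[float],
    voiced: Sequence[bool],
    cutoff_hz: float = 300.0,  # cutoff taken from the abstract
    fill: float = 0.0,         # assumption: the thesis may use another fill value
) -> List[List[float]]:
    """Return a copy of `spec` with bins below `cutoff_hz` masked in voiced frames.

    spec         -- spectrogram as rows of frequency bins, each a list of frames
    bin_freqs_hz -- centre frequency of each row, in Hz
    voiced       -- one flag per frame, True where the aligned phone is voiced
    """
    if len(spec) != len(bin_freqs_hz):
        raise ValueError("one centre frequency is needed per spectrogram row")
    masked = []
    for row, freq in zip(spec, bin_freqs_hz):
        if len(row) != len(voiced):
            raise ValueError("one voicing flag is needed per frame")
        if freq < cutoff_hz:
            # Low-frequency row: overwrite only the voiced frames.
            masked.append([fill if v else x for x, v in zip(row, voiced)])
        else:
            # Rows at or above the cutoff are copied unchanged.
            masked.append(list(row))
    return masked


# Toy input (hypothetical values): 4 frequency bins x 5 frames, all energy 1.0.
toy_spec = [[1.0] * 5 for _ in range(4)]
toy_freqs = [100.0, 250.0, 400.0, 1000.0]
toy_voiced = [False, True, True, False, True]
toy_masked = f0_mask(toy_spec, toy_freqs, toy_voiced)
```

In this toy example only the two rows below 300 Hz are changed, and only in the three frames flagged as voiced. SpecAugment-style frequency masking, by contrast, would pick the masked band without reference to voicing or to the F0 region.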

Item Type:	Thesis (Master)
Name supervisor:	Nayak, S.
Date Deposited:	23 Jun 2025 08:56
Last Modified:	24 Jun 2025 09:41
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674
