Siu, Stella (2025) Speaking Volumes: How Acoustic Features Reveal Speaker Height. Master thesis, Voice Technology (VT).
PDF: MSc4052455Siu.pdf (1MB)
Abstract
With growing interest in biometric technologies, speaker height estimation directly from acoustic signals has emerged as a valuable capability for applications in forensics, authentication, and speech profiling. However, most state-of-the-art systems rely on full speech input, which poses challenges for conversational privacy. This study investigates the feasibility of predicting speaker height from sub-lexical acoustic features using lightweight models. A basic feature (F0), intermediate features (formants), and high-dimensional features (MFCCs) were used as input across three regression models: simple linear, multiple linear, and random forest regression. Results show that MFCCs combined with multiple linear regression yield statistically significant performance using only the isolated diphthong /aw/, achieving a minimum root-mean-square error (RMSE) below 7 cm on the TIMIT dataset. This performance is on par with state-of-the-art models based on full speech input and deep neural networks. MFCCs also showed greater gains when used with multivariate models, suggesting that feature complexity and model structure interact to influence prediction outcomes. Additionally, the diphthong /aw/ emerged as the most reliable input unit, consistently yielding low prediction errors in both multiple linear and random forest regressions, whereas the reduced vowel /ax-h/ consistently underperformed across all feature sets and regression models. Furthermore, an inverse relationship between F1 and F4 was observed in both simple linear regression and random forest feature importance analysis, indicating that as one becomes more predictive, the other contributes less, suggesting a complementary dynamic in height estimation. These findings demonstrate that phone-based input, which is linguistically impoverished, can reduce conversational privacy risks and offer a viable alternative to models based on full speech. They suggest a promising direction for developing interpretable, conversational-privacy-conscious speaker profiling systems using minimal speech input.
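The following is a minimal sketch, not the author's pipeline, of the kind of setup the abstract describes: mean MFCC vectors extracted from a single phone segment are fed to a multiple linear regression model and evaluated with RMSE in centimetres. The synthetic segments, heights, and helper names below are placeholders; a real replication would use TIMIT /aw/ tokens from the phone alignments and the documented speaker heights.

```python
# Sketch only: height estimation from phone-level MFCCs with multiple
# linear regression, evaluated by RMSE. Data here are synthetic stand-ins
# for TIMIT /aw/ segments and speaker heights in cm.
import numpy as np
import librosa
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

SR = 16000  # TIMIT sampling rate

def mfcc_features(segment, sr=SR, n_mfcc=13):
    """Mean MFCC vector over one phone segment (e.g., one /aw/ token)."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over frames -> fixed-length vector

# Placeholder data: random 200 ms "segments" and heights drawn around 172 cm.
rng = np.random.default_rng(0)
segments = [rng.standard_normal(int(0.2 * SR)).astype(np.float32) for _ in range(100)]
heights_cm = rng.normal(172, 9, size=100)

X = np.vstack([mfcc_features(s) for s in segments])
X_tr, X_te, y_tr, y_te = train_test_split(X, heights_cm, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)  # multiple linear regression on MFCC means
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"RMSE: {rmse:.1f} cm")
```

Swapping `LinearRegression` for `sklearn.ensemble.RandomForestRegressor`, or reducing the feature vector to F0 or formant estimates, would cover the other feature-model combinations the abstract compares.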
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Coler, M.L. |
| Date Deposited: | 24 Jul 2025 10:26 |
| Last Modified: | 24 Jul 2025 10:26 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/717 |