Manipulating Acoustic Correlates for Vocal Persona Transition: From Neutral to Friendly

Lin, Chenyi (2024) Manipulating Acoustic Correlates for Vocal Persona Transition: From Neutral to Friendly. Master thesis, Voice Technology (VT).

PDF: MA-5664713-C-Lin.pdf (1MB)

Abstract

The concept of vocal persona, reflecting the identity or character perceived through an individual’s voice, exhibits dynamic variability as it adapts to various social contexts. Understanding the dynamic shifts of vocal persona not only enriches the expressivity and personalization of Text-to-Speech (TTS) systems but also holds potential for enhancing user engagement across applications. This adaptability is crucial for the effectiveness of TTS systems yet remains underexplored, particularly for attitudinal nuances such as synthesizing speech with a friendly attitude. The conventional method of synthesizing friendly speech is to train TTS models on datasets that specifically contain friendly attitudes. However, because available speech datasets predominantly lack diverse attitudinal tones, our study instead applied specific acoustic manipulations (namely alterations in pitch, duration, and energy) to neutral speech data to facilitate the perceptual transition of vocal personas from neutral to friendly in Mandarin Chinese TTS, using the FastSpeech2 framework. We examined the individual and combined effects of these acoustic features on enhancing the friendliness of synthesized speech. In controlled experiments, we quantified these perceptual shifts using identification accuracy and mean opinion scores (MOS). Based on the findings of F. Chen, Li, Wang, Wang, and Fang (2004) and Li, Chen, Wang, and Wang (2004), we anticipated that increasing the mean pitch of a neutral voice alone would significantly influence friendliness perception. Moreover, combining it with shorter phone duration and slightly raised energy was expected to further improve the perception of friendliness. However, our study revealed that neither modulation of pitch alone nor alterations in pitch, duration, and energy together achieved a significant perceptual shift towards friendliness.
This result suggests that acoustic cues alone may have a limited effect on friendliness perception, and it calls for further investigation into the effectiveness of acoustic manipulations in synthesizing friendly speech. Despite the unsuccessful perceptual transition, this exploration deepens our understanding of vocal persona modulation, offering valuable insights for advancing TTS technology. By bringing the acoustic underpinnings of vocal persona transitions to light, our findings aim to contribute to more expressive and engaging TTS applications, with broader implications for voice branding, assistive technology, and human-computer interaction.

Item Type: Thesis (Master)
Name supervisor: Do, T.P.
Date Deposited: 15 Jul 2024 11:36
Last Modified: 15 Jul 2024 11:36
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/522
