Improving the naturalness of an end-to-end text-to-speech system with information structure

Baumann, Judith (2022) Improving the naturalness of an end-to-end text-to-speech system with information structure. Master thesis, Voice Technology (VT).

Abstract

Synthetic voices produced by text-to-speech (TTS) systems have reached a very high level of naturalness. The naturalness of these systems is typically evaluated by presenting isolated sentences to human listeners. However, in some TTS applications (e.g., read-aloud applications or conversational interfaces), synthesized sentences are part of a larger monologue or dialogue, so users hear each sentence in the context of other sentences. The content of previous sentences sometimes requires a change of prosody in the current sentence, for instance to mark a focused element. The naturalness of synthesized sentences may therefore decrease in context if the TTS system does not vary its prosody as required. This study examines whether the naturalness of a synthetic voice that produces sequences of sentences improves when the prosody of its sentences is modified to better match the current context. For this purpose, we replicated an approach proposed by Latif et al. (2021) that uses control tags for prosody modification in end-to-end TTS systems. We used this approach to synthesize isolated sentences and paragraphs in English whose prosody accords with the contextually induced focus types. We also trained a second TTS system that does not allow prosody control. In a subsequent mean opinion score (MOS) study, both isolated sentences and paragraphs achieved higher naturalness ratings when synthesized with the system that marks foci prosodically. This suggests that a more context-appropriate prosody can improve the naturalness of synthetic voices, not only when producing sequences of sentences but also when producing sentences in isolation. In addition, across the two systems, paragraphs received lower ratings than isolated sentences. This strengthens the idea that the naturalness of a TTS system that has been evaluated on isolated sentences may decrease when the same system is used to synthesize longer texts or conversations. This should be taken into account when designing and interpreting the evaluation of a TTS system.

Item Type:	Thesis (Master)
Name supervisor:	Verkhodanova, V. and Do, T.P. and Coler, M.L.
Date Deposited:	09 Sep 2022 08:23
Last Modified:	09 Sep 2022 08:23
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/235
