Matuszewski, Hubert (2026) Towards Automatic Speech Genre Synthesis: Proposing a Framework for Speech Function Recognition and Speech Register Synthesis. Master's thesis, Voice Technology (VT).
PDF: MA-3992756-HZ-Matuszewski.pdf (1MB)
Abstract
Recent progress in speech synthesis has enabled the generation of synthesised speech with a more natural, human-like sound. Expressive speech synthesis has made progress in adding emotion and speaker-specific characteristics to synthesised speech. However, no research has addressed adding prosody to speech synthesis based on the kind of text being synthesised and the speech patterns that correspond to a given category of text. To that end, this paper seeks to synthesise Speech Genres by combining speech synthesis and text classification architectures. Certain text is assumed to have a specific Speech Function, which entails speaking the text with a particular Speech Register; the combination of Speech Function and Speech Register is a Speech Genre. This paper defines four example Speech Genres: Documentary, News Report, TED Talk, and Comedy Stand-Up. An RCNN was used as a Speech Function classifier, with text data gathered for each genre by web scraping relevant sources. The synthesis pipeline was split into text-to-speech synthesis, executed by FastSpeech 2, and Speech Register synthesis, executed by kNN voice conversion. The audio data was likewise gathered by web scraping various sources. The Speech Genre output was assessed through human evaluation, the main experiments being whether the kNN voice conversion output was preferred over the standard FastSpeech 2 output, and whether the Speech Genres were discernible from the prosodic characteristics of the Speech Register. There was a general tendency to prefer the kNN voice conversion output (average MOS = 2.728) over the FastSpeech 2 output (average MOS = 2.487). The Speech Genres showed a higher average discernibility accuracy (35.17%) than random chance (25% with 4 genres), but the results were not conclusive.
However, evaluator MOS scores indicate a slight preference for audio samples with a matching Speech Register over samples with a mismatching one (average MOS = 3.152 vs. 3.015, respectively). Additional findings include that RCNN accuracy improves when the training data is shorter than the testing data (a maximum F1-score of 0.94, compared with a highest score of 0.901 when test and training data are of equal length), and that evaluators tended to prefer samples with a male voice over a female voice (55.4% preference for male samples vs. 28.6% for female samples). The paper additionally proposes a metric of "discernibility" for testing whether human evaluators can distinguish between Speech Registers, offering different approaches to its implementation. This research is limited by the small sample of human evaluators, the use of a self-created text and audio dataset, the scarcity of comparable studies for interpreting the results, the use of a novel theoretical framework that has not seen broader application, and the subjective definition of each genre devised for this paper. The concepts proposed here can also be applied in other research areas, particularly the use of text classification for emotion synthesis, accent synthesis, and potentially sarcasm detection. The demonstrator (architecture) and data used in this research are available at: https://github.com/585hubert/Genre_Synthesis
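The core idea of the kNN voice conversion stage can be illustrated with a minimal sketch: each frame of the source utterance's feature sequence is replaced by the average of its k nearest frames from a target-speaker feature pool. This is a simplified, hypothetical illustration only; the actual system in the thesis operates on learned speech representations, and the random matrices below merely stand in for such features.

```python
import numpy as np

def knn_convert(source_feats: np.ndarray, target_feats: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (n_src, d) feature matrix of the utterance to convert.
    target_feats: (n_tgt, d) feature pool from the target Speech Register/speaker.
    """
    # Pairwise squared Euclidean distances, shape (n_src, n_tgt)
    dists = ((source_feats[:, None, :] - target_feats[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest target frames for each source frame
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    # Average the selected target frames to form the converted feature sequence
    return target_feats[nn_idx].mean(axis=1)

# Toy usage with random stand-in "features" (not real speech representations)
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 8))   # 10 source frames, 8-dim features
tgt = rng.normal(size=(50, 8))   # 50 target-register frames
converted = knn_convert(src, tgt, k=4)
print(converted.shape)  # (10, 8)
```

Because every output frame is a combination of genuine target-speaker frames, the converted sequence inherits the target's voice characteristics while keeping the source's frame-by-frame (prosodic) structure, which is what makes this approach attractive for Speech Register synthesis.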
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Verkhodanova, V. |
| Date Deposited: | 13 Jan 2026 13:54 |
| Last Modified: | 13 Jan 2026 13:54 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/777 |
