Bian, Qianqian (2025) Character Identity and Emotion-Aware TTS for Otome Games. Master thesis, Voice Technology (VT).
PDF: MAS6029388QBian.pdf (1MB)
Abstract
This study presents a character-based speech synthesis system designed for interactive otome games. The system captures players' speech using Whisper for transcription, generates dialogue using the Yi-1.5-6B-Chat language model conditioned on character-specific prompts, and synthesises responses with a CosyVoice2-based text-to-speech (TTS) model adapted to two distinct in-game characters. The TTS component is fine-tuned on the Mandarin male subset of the Emotional Speech Database (ESD), which includes five speakers across five emotion styles. From this dataset, two character voices are constructed to represent distinct romantic non-player characters (NPCs). Their vocal styles are guided by both prompt-based instructions and explicit speaker identity control. System performance is evaluated along two dimensions. First, a subjective Mean Opinion Score (MOS) evaluation of character consistency was conducted with 37 listener responses, of which 32 were retained after five surveys with inconsistent or extreme mismatch ratings were filtered out. The final average MOS was 3.55, suggesting moderate perceived consistency between the synthesised voice and the intended character identity. Second, speaker similarity (SS) was computed as the cosine similarity between embeddings of reference and generated speech, yielding an average score of 0.83. These results demonstrate that combining prompt-driven dialogue generation with instruct-based vocal style control enables expressive, character-consistent speech interactions.
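The SS metric named in the abstract is the standard cosine similarity between fixed-dimensional speaker embeddings of a reference utterance and a synthesised one, averaged over all evaluation pairs. Below is a minimal sketch of that computation; the speaker encoder that produced the embeddings is not named on this page, so the embedding dimension and the `ref`/`gen` vectors here are purely illustrative.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors.

    Returns a value in [-1, 1]; values near 1 indicate the encoder
    judges the two utterances to come from the same speaker.
    """
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def average_ss(pairs: list[tuple[np.ndarray, np.ndarray]]) -> float:
    """Mean cosine similarity over (reference, generated) embedding pairs."""
    return float(np.mean([cosine_similarity(r, g) for r, g in pairs]))


if __name__ == "__main__":
    # Illustrative stand-ins for real speaker embeddings (the thesis
    # reports an average SS of 0.83 over its reference/generated pairs).
    rng = np.random.default_rng(0)
    ref = rng.normal(size=192)              # e.g. a 192-dim embedding
    gen = ref + 0.3 * rng.normal(size=192)  # a slightly perturbed copy
    print(f"SS = {cosine_similarity(ref, gen):.2f}")
```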
| Item Type: | Thesis (Master) |
|---|---|
| Name supervisor: | Nayak, S. |
| Date Deposited: | 18 Jun 2025 08:45 |
| Last Modified: | 18 Jun 2025 08:45 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/673 |