End-to-End Speech Emotion Recognition based on CNN-Transformer

Zheng, Siqi (2024) End-to-End Speech Emotion Recognition based on CNN-Transformer. Master thesis, Voice Technology (VT).

PDF
MA-S5407958-S-Zheng.pdf

Abstract

Speech Emotion Recognition (SER) plays a crucial role in various applications such as human-computer interaction, emotion-driven systems, and sentiment analysis. Traditional SER approaches involve complex feature extraction and analysis processes, which often require domain knowledge and manual intervention. In recent years, end-to-end systems have emerged as a promising approach to address these challenges by eliminating the need for explicit feature engineering.

In this thesis, we propose an architecture called CNN-Transformer (SERCT) for end-to-end Speech Emotion Recognition. The CNN-Transformer architecture combines the strengths of Convolutional Neural Networks (CNNs) and Transformers, enabling a more convenient and efficient framework for building SER applications. CNNs are known for their ability to capture local patterns and relationships in speech signals, while Transformers excel at modeling long-range dependencies and capturing global contextual information.

The proposed CNN-Transformer architecture consists of two main components: a CNN module and a Transformer module. The CNN module performs initial feature extraction and captures local acoustic patterns, while the Transformer module captures high-level contextual information and long-range dependencies. The two modules are integrated sequentially, allowing the network to learn discriminative representations directly from raw speech signals without the need for handcrafted features.
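The sequential CNN-then-Transformer design described in the abstract can be illustrated with a small forward-pass sketch. This is not the thesis implementation: the layer sizes, kernel widths, strides, the four emotion classes, the single attention head, and the function names (`conv1d_relu`, `self_attention`, `init_params`, `forward`) are all assumptions chosen for illustration. Weights are random, so the output probabilities carry no meaning; positional encoding, layer normalization, the feed-forward sublayer, and training are omitted.

```python
import numpy as np

def conv1d_relu(x, w, b, stride):
    """Strided 1-D convolution + ReLU. x: (T, C_in), w: (K, C_in, C_out)."""
    k = w.shape[0]
    n_frames = (x.shape[0] - k) // stride + 1
    # Index matrix selecting each overlapping window of K samples.
    idx = np.arange(n_frames)[:, None] * stride + np.arange(k)[None, :]
    windows = x[idx]  # (n_frames, K, C_in)
    out = np.einsum("tkc,kco->to", windows, w) + b
    return np.maximum(out, 0.0)

def self_attention(h, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual link."""
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return h + weights @ v

def init_params(rng, d_model=32, n_classes=4):
    """Random weights; d_model and n_classes are illustrative assumptions."""
    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "w1": w(10, 1, d_model), "b1": np.zeros(d_model),
        "w2": w(8, d_model, d_model), "b2": np.zeros(d_model),
        "wq": w(d_model, d_model),
        "wk": w(d_model, d_model),
        "wv": w(d_model, d_model),
        "w_out": w(d_model, n_classes), "b_out": np.zeros(n_classes),
    }

def forward(waveform, p):
    """Raw waveform (T,) -> class probabilities (n_classes,)."""
    x = waveform[:, None]                              # (T, 1)
    # CNN module: local acoustic patterns, downsampled in time.
    h = conv1d_relu(x, p["w1"], p["b1"], stride=5)
    h = conv1d_relu(h, p["w2"], p["b2"], stride=4)
    # Transformer-style module: every frame attends to every other frame.
    h = self_attention(h, p["wq"], p["wk"], p["wv"])
    pooled = h.mean(axis=0)                            # utterance-level vector
    logits = pooled @ p["w_out"] + p["b_out"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
params = init_params(rng)
waveform = rng.normal(size=16000)  # stand-in for 1 s of audio at an assumed 16 kHz
probs = forward(waveform, params)
print(probs.shape, probs.sum())
```

The sketch mainly shows the data flow: the strided convolutions shorten the raw sample sequence into a manageable number of frames, and only then does attention relate frames across the whole utterance, which keeps the quadratic attention cost small.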

Item Type:	Thesis (Master)
Name supervisor:	Nayak, S.
Date Deposited:	22 Jul 2024 07:25
Last Modified:	22 Jul 2024 07:25
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/534
