End-to-End Speech Emotion Recognition based on CNN-Transformer

Zheng, Siqi (2024) End-to-End Speech Emotion Recognition based on CNN-Transformer. Master thesis, Voice Technology (VT).

Abstract

Speech Emotion Recognition (SER) plays a crucial role in various applications such as human-computer interaction, emotion-driven systems, and sentiment analysis. Traditional SER approaches involve complex feature extraction and analysis processes, which often require domain knowledge and manual intervention. In recent years, the development of end-to-end systems has emerged as a promising approach to address these challenges by eliminating the need for explicit feature engineering.

In this thesis, we propose an architecture called CNN-Transformer (SERCT) for end-to-end Speech Emotion Recognition. The CNN-Transformer architecture combines the strengths of Convolutional Neural Networks (CNNs) and Transformers, enabling a more convenient and efficient framework for building SER applications. CNNs are known for their ability to capture local patterns and relationships in speech signals, while Transformers excel at modeling long-range dependencies and capturing global contextual information.

The proposed CNN-Transformer architecture consists of two main components: a CNN module and a Transformer module. The CNN module performs initial feature extraction and captures local acoustic patterns, while the Transformer module captures high-level contextual information and long-range dependencies. The two modules are integrated in a sequential manner, allowing the network to learn discriminative representations directly from raw speech signals without the need for handcrafted features.
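
The sequential CNN-then-Transformer data flow described in the abstract can be illustrated with a toy forward pass. This is only a sketch under stated assumptions, not the thesis's implementation: the weights are random and untrained, the attention step is a single head with identity projections, and the layer sizes, the stride, and the four emotion labels are hypothetical placeholders chosen for illustration. It uses only the Python standard library so that the shapes at each stage are easy to follow.

```python
import math
import random

# Hypothetical label set; the abstract does not name the emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def conv1d(x, kernels, stride):
    """Strided 1-D convolution + ReLU: the local feature extractor.
    x is a list of channels; kernels[o][i] holds the taps for output o, input i."""
    k = len(kernels[0][0])
    out_len = (len(x[0]) - k) // stride + 1  # no padding
    out = []
    for kern in kernels:
        row = []
        for t in range(out_len):
            s = sum(kern[i][j] * x[i][t * stride + j]
                    for i in range(len(x)) for j in range(k))
            row.append(max(0.0, s))
        out.append(row)
    return out

def self_attention(frames):
    """Single-head scaled dot-product self-attention with a residual connection.
    Identity Q/K/V projections are a simplification; real layers learn them."""
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in frames]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        context = [sum(w / total * f[c] for w, f in zip(weights, frames))
                   for c in range(d)]
        out.append([a + b for a, b in zip(q, context)])
    return out

def classify(waveform, kernels, head, stride=4):
    """Raw samples -> CNN frames -> global context -> class probabilities."""
    feats = conv1d([waveform], kernels, stride)       # channels x frames
    frames = [list(col) for col in zip(*feats)]       # frames x channels
    ctx = self_attention(frames)                      # every frame attends to all frames
    pooled = [sum(col) / len(ctx) for col in zip(*ctx)]  # mean over time
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in head]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
# 8 output channels, 1 input channel, 16 taps each (arbitrary toy sizes).
kernels = [[[rng.uniform(-0.5, 0.5) for _ in range(16)]] for _ in range(8)]
# Linear classification head: one weight row per emotion class.
head = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in EMOTIONS]
# Stand-in for a raw speech signal: 400 samples of a sine wave.
waveform = [math.sin(2 * math.pi * 5 * n / 400) for n in range(400)]

probs = classify(waveform, kernels, head)
print(dict(zip(EMOTIONS, probs)))
```

With these toy sizes, the 400 input samples become 97 frames of 8 channels after the convolution, and the attention step then lets each frame draw on every other frame before pooling. A practical model would be built in a deep learning framework with several convolutional layers, learned multi-head attention, positional information, and trained weights.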

Item Type:	Thesis (Master)
Name supervisor:	Nayak, S.
Date Deposited:	16 Jul 2024 14:00
Last Modified:	16 Jul 2024 14:00
URI:	https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/515
