Zheng, Siqi (2024) End-to-End Speech Emotion Recognition based on CNN-Transformer. Master thesis, Voice Technology (VT).
PDF: MA-S5407958-S-Zheng.pdf (1MB)
Abstract
Speech Emotion Recognition (SER) plays a crucial role in various applications such as human-computer interaction, emotion-driven systems, and sentiment analysis. Traditional SER approaches involve complex feature extraction and analysis processes, which often require domain knowledge and manual intervention. In recent years, end-to-end systems have emerged as a promising way to address these challenges by eliminating the need for explicit feature engineering.

In this thesis, we propose an architecture called CNN-Transformer (SERCT) for end-to-end Speech Emotion Recognition. The CNN-Transformer architecture combines the strengths of Convolutional Neural Networks (CNNs) and Transformers, enabling a more convenient and efficient framework for building SER applications. CNNs are known for their ability to capture local patterns and relationships in speech signals, while Transformers excel at modeling long-range dependencies and capturing global contextual information.

The proposed CNN-Transformer architecture consists of two main components: a CNN module and a Transformer module. The CNN module performs initial feature extraction and captures local acoustic patterns, while the Transformer module captures high-level contextual information and long-range dependencies. The two modules are integrated sequentially, allowing the network to learn discriminative representations directly from raw speech signals without the need for handcrafted features.
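The sequential data flow described in the abstract can be illustrated with a small shape calculation. This is only a sketch under stated assumptions, not the thesis's implementation: the layer list, kernel sizes, strides, and the 16 kHz sample rate are hypothetical values chosen for illustration, since the abstract does not give them. It shows how a stack of strided 1-D convolutions would shorten a raw waveform into a frame sequence, and how many pairwise interactions a self-attention layer would then compute over those frames.

```python
from dataclasses import dataclass


@dataclass
class Conv1dSpec:
    """Shape parameters of one 1-D convolution (hypothetical values below)."""
    kernel: int
    stride: int
    padding: int = 0


def conv1d_out_len(n: int, spec: Conv1dSpec) -> int:
    """Standard output-length formula for a 1-D convolution without dilation."""
    padded = n + 2 * spec.padding
    if padded < spec.kernel:
        raise ValueError("input is shorter than the convolution kernel")
    return (padded - spec.kernel) // spec.stride + 1


def frontend_frames(num_samples: int, layers: list[Conv1dSpec]) -> int:
    """Length of the frame sequence the CNN stage hands to the Transformer stage."""
    n = num_samples
    for spec in layers:
        n = conv1d_out_len(n, spec)
    return n


# ASSUMPTION: an illustrative three-layer front end, not taken from the thesis.
LAYERS = [Conv1dSpec(kernel=10, stride=5),
          Conv1dSpec(kernel=3, stride=2),
          Conv1dSpec(kernel=3, stride=2)]

# ASSUMPTION: one second of audio sampled at 16 kHz.
frames = frontend_frames(16_000, LAYERS)

# Self-attention compares every frame with every other frame.
attention_pairs = frames * frames

print(frames, attention_pairs)
```

Under these assumed settings, the convolutional stage reduces 16,000 raw samples to a few hundred frames, which keeps the quadratic cost of attention manageable while each frame still summarizes a short local window of the signal.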
Item Type: | Thesis (Master) |
---|---|
Name supervisor: | Nayak, S. |
Date Deposited: | 16 Jul 2024 14:00 |
Last Modified: | 16 Jul 2024 14:00 |
URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/515 |