
A Lightweight Multimodal Framework for Context-Aware Punchline Detection

Wang, Yinzi (2025) A Lightweight Multimodal Framework for Context-Aware Punchline Detection. Master's thesis, Voice Technology (VT).

Full text (PDF): A-Lightweight-Multimodal-Framework-for-Context-Aware-Punchline-Detection.pdf (538 kB)

Abstract

This thesis proposes a lightweight multimodal framework for punchline detection in spoken dialogue, aiming to improve computational efficiency while maintaining classification accuracy. The architecture integrates three types of input features: (1) textual representations from a pretrained ALBERT model, which encode both the punchline and its preceding conversational context; (2) acoustic features derived from COVAREP, including pitch (F0), energy, harmonics-to-noise ratio, and glottal parameters, among other low-level descriptors; and (3) humor-centric features (HCF), a handcrafted set of syntactic, semantic, and affective indicators empirically associated with humorous delivery. The model employs a cross-attention mechanism to align information across modalities, followed by max-pooling and a lightweight multi-layer perceptron (MLP) classifier. Its design prioritizes low computational overhead, making it well suited for deployment in latency-sensitive or resource-constrained environments. Experiments on the UR-FUNNY dataset demonstrate the effectiveness of the proposed model, which achieves an accuracy of 72.33% and an F1-score of 0.7231. To assess the relative contribution of each modality, we conduct ablation studies by removing one modality at a time. When acoustic features are excluded, the F1-score drops to 0.6504, indicating the importance of acoustic information in humor detection. Removing contextual input also results in a notable decline, with the F1-score decreasing to 0.6523. By comparison, excluding the HCF features causes a smaller reduction, with the F1-score falling to 0.6927. These results highlight the complementary nature of semantic, prosodic, and structurally informed cues in spoken humor recognition. Overall, the proposed model offers a practical and interpretable approach to multimodal humor detection, contributing toward the development of more nuanced conversational AI systems.
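To make the fusion design described above concrete, the following is a minimal sketch of such an architecture: text features attend over the acoustic and HCF streams via cross-attention, the result is max-pooled over time, and a small MLP produces the classification logits. This is an illustration based only on the abstract, not the thesis's actual code; the layer sizes, feature dimensions, head count, and all names (e.g. PunchlineFusion) are assumptions.

```python
# Hypothetical sketch of the fusion pipeline from the abstract:
# cross-attention across modalities -> max-pooling -> lightweight MLP.
# All dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PunchlineFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=81, hcf_dim=4, hidden_dim=256):
        super().__init__()
        # Project each modality into a shared hidden space (assumed step).
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.hcf_proj = nn.Linear(hcf_dim, hidden_dim)
        # Cross-attention: text tokens query the acoustic/HCF keys and values.
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads=4, batch_first=True
        )
        # Lightweight MLP head (binary: punchline vs. non-punchline).
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, text_seq, audio_seq, hcf_vec):
        # text_seq:  (B, T_t, text_dim)  ALBERT states for context + punchline
        # audio_seq: (B, T_a, audio_dim) COVAREP frame-level features
        # hcf_vec:   (B, hcf_dim)        handcrafted humor-centric features
        q = self.text_proj(text_seq)
        kv = torch.cat(
            [self.audio_proj(audio_seq), self.hcf_proj(hcf_vec).unsqueeze(1)],
            dim=1,
        )
        fused, _ = self.cross_attn(q, kv, kv)  # align text with other modalities
        pooled = fused.max(dim=1).values       # max-pool over the token axis
        return self.classifier(pooled)         # classification logits
```

Ablating a modality in this sketch would amount to dropping its stream from the concatenated key/value tensor, which mirrors the one-modality-at-a-time ablation protocol the abstract reports.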

Item Type: Thesis (Master)
Supervisor: Gao, X.
Date Deposited: 24 Jul 2025 10:20
Last Modified: 24 Jul 2025 10:20
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/720
